EXPLORING TECHNIQUES FOR OPTIMAL - USP Thesesdigilib.library.usp.ac.fj/.../HASHac42.dir/doc.pdf ·...

EXPLORING TECHNIQUES FOR OPTIMALFEATURE AND CLASSIFIER SELECTION FORPROTEIN MODELING, FUNCTION, AND FOLD

RECOGNITION

by

Harsh Saini

A thesis submitted in fulfillment of the requirements for theDegree of Masters of Science in Computer Science

Copyright © 2015 by Harsh Saini

School of Computing, Information and Mathematical SciencesFaculty of Science, Technology and Environment

The University of the South Pacific

February, 2014

To my family

Acknowledgment

I would like to thank my family for their continued support during course of thisthesis. I would also like to thank my supervisors for their timely advice, guidanceand providing direction in this research.

i

Publications

• Harsh Saini, Gaurav Raicar, Alok Sharma, Sunil Lal, Abdollah Dehzangi, ARajeshkannan, James Lyons, Neela Biswas, and Kuldip K Paliwal. ProteinStructural Class Prediction via k-separated bigrams using Position SpecificScoring Matrix. Journal of Advanced Computational Intelligence and In-telligent Informatics, 8(4), 2014.

• Harsh Saini, Gaurav Raicar, Sunil Lal, Abdollah Dehzangi, James Lyons,Kuldip K Paliwal, Seiya Imoto, Satoru Miyano and Alok Sharma. Ge-netic algorithm for an optimized weighted voting scheme incorporating k-separated bigram transition probabilities to improve protein fold recognition.Asia Pacific World Congress on Computer Science and Engineering, 2014.(accepted)

• Harsh Saini, Gaurav Raicar, Alok Sharma, Sunil Lal, Abdollah Dehzangi,James Lyons, Kuldip K Paliwal, Seiya Imoto and Satoru Miyano. Prob-abilistic expression of spatially varied amino acid dimers for protein foldrecognition. 2014. (under review)

• Harsh Saini, Gaurav Raicar, Abdollah Dehzangi, Sunil Lal, Alok Sharma.Subcellular localization for Gram Positive and Gram Negative Bacterial Pro-teins using Linear Interpolation Smoothing Model. 2014. (under review)

ii

Abstract

Identification of the tertiary structure (three dimensional structure) of a pro-tein is a fundamental problem in biology which helps in identifying its functions.Predicting a protein’s structural class and its fold type is considered to be an in-termediate step for identifying the tertiary structure of a protein. Computationalmethods have been applied for this application by assembling information from itsstructural, physicochemical and/or evolutionary properties. In this study, variousschemes are discussed for improving protein structural class and fold recognition.A feature extraction technique is explored that extracts probabilistic expressionsof amino acid dimers, which have varying degree of spatial separation in theprimary sequences of proteins, from the Position Specific Scoring Matrix. Theexplored techniques have been evaluated using benchmarked datasets.In addition to identifying the tertiary structure for proteins, protein subcellularlocalization is an important topic in proteomics since it is related to a proteinsoverall function, help in the understanding of metabolic pathways, and in drugdesign and discovery. This study also explores the applicability of a basic ap-proximation technique called the linear interpolation smoothing for predictingprotein subcellular localizations. The proposed approach extracts features fromsyntactical information in protein sequences to build probabilistic profiles usingdependency models, which are used in linear interpolation to determine the like-lihood of a sequence to belong to a particular subcellular location.

iii

Preface

The research conducted in this study is based on the areas of protein fold recogni-tion and subcellular localization in the general area of proteomics. This documenthas been structured to allow novice and expert readers to appreciate the basicconcepts that are being researched. The document can be divided into three sec-tions, with the first section introducing the area, the second section explainingthe various tasks performed to meet the objectives of this research and the finalsection concludes the thesis with a brief summary of finding and a few recom-mendations.More specifically, the breakdown of content presented in each chapter of thisdocument has been listed below:

• Chapter 1 is the introductory chapter with details on the areas of research,its disciplines and a brief overview of recent progress in the fields.

• Chapter 2 and 3 describe the various materials used and the recently pub-lished computational methods in this field respectively.

• Chapter 4 discusses a feature extraction technique that significantly im-proves protein fold recognition in the datasets used.

• Chapter 5 investigates the applicability of this feature extraction techniquein protein structural class predication, with is related to protein fold recog-nition.

• Chapter 6 introduces a new prediction model based on the maximum like-lihood approach to predict protein subcellular localizations.

• Chapter 7 provides the concluding remarks and summarizes the researchconducted in this study.

iv

Abbreviations

• ANN - Artificial Neural Networks

• GA - Genetic Algorithm

• NB - Naïve Bayes

• PFR - Protein Fold Recognition

• PseAAC - pseudo-amino acid composition

• PSSM - Position Specific Scoring Matrix

• SCL - Subcellular Localization

• SCOP - Structural Classification of Proteins

• SCP - Structural Class Prediction

• SVM - Support Vector Machine

v

Contents

Acknowledgment i

Publications ii

Abstract iii

Preface iv

Abbreviations v

1 Introduction 11.1 Structural Classes and Folds . . . . . . . . . . . . . . . . . . . . . 21.2 Subcellular Localization . . . . . . . . . . . . . . . . . . . . . . . 31.3 Computational Intelligence in Proteomics . . . . . . . . . . . . . . 4

1.3.1 Categorization of methods . . . . . . . . . . . . . . . . . . 41.3.2 Computational representation of proteins . . . . . . . . . . 61.3.3 Related work in brief . . . . . . . . . . . . . . . . . . . . . 6

2 Materials 82.1 Datasets for SCP and PFR . . . . . . . . . . . . . . . . . . . . . . 8

2.1.1 Ding and Dubchak Dataset . . . . . . . . . . . . . . . . . 82.1.2 Extended Ding and Dubchak Dataset . . . . . . . . . . . . 92.1.3 Taguchi and Gromhia Dataset . . . . . . . . . . . . . . . . 9

2.2 Datasets for SCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.1 Gram Positive Dataset . . . . . . . . . . . . . . . . . . . . 92.2.2 Gram Negative Dataset . . . . . . . . . . . . . . . . . . . 13

vi

3 Prevalent Methods in Literature 143.1 Features for comparison . . . . . . . . . . . . . . . . . . . . . . . 14

3.1.1 Structural based features . . . . . . . . . . . . . . . . . . . 143.1.2 Evolutionary based features . . . . . . . . . . . . . . . . . 153.1.3 Physicochemical based features . . . . . . . . . . . . . . . 18

4 Exploring feature extraction in protein fold recognition 204.1 Conceptual Overview . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2.1 Algorithmic Details . . . . . . . . . . . . . . . . . . . . . . 224.2.2 Demonstration . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.4 Classification Schemes . . . . . . . . . . . . . . . . . . . . . . . . 25

4.4.1 Classification on individual features . . . . . . . . . . . . . 254.4.2 Voting schemes for ensemble methods . . . . . . . . . . . . 254.4.3 Classification as concatenated features . . . . . . . . . . . 304.4.4 Analysis of Explored Schemes . . . . . . . . . . . . . . . . 33

4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Improving protein structural class prediction using feature ex-traction techniques 395.1 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . 405.2 Classification Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 405.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 Linear Interpolation Smoothing Model in Subcellular Localiza-tion 456.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 476.1.2 Optimizing λ . . . . . . . . . . . . . . . . . . . . . . . . . 506.1.3 Overall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596.4 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

vii

7 Summary 61

Bibliography 69

viii

List of Figures

1.1 A depiction of the various protein structures . . . . . . . . . . . . 21.2 A diagram depicting subcellular locations in eukaryotic cells . . . 3

4.1 An illustration of dimer counting strategies . . . . . . . . . . . . . 224.2 A flowchart depicting classification using GA optimized voting

scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.3 A flowchart depicting classification using concatenated dimer features 30

6.1 An illustration of the proposed scheme . . . . . . . . . . . . . . . 52

ix

List of Tables

2.1 Summary of Ding and Dubchak (DD) dataset . . . . . . . . . . . 102.2 Summary of Extended Ding and Dubchak (EDD) dataset . . . . . 112.3 Summary of Taguchi and Gromhia (TG) dataset . . . . . . . . . . 122.4 Summary of Gram Positive bacterial protein dataset . . . . . . . . 132.5 Summary of Gram Negative bacterial protein dataset . . . . . . . 13

4.1 PSSM of a sample protein . . . . . . . . . . . . . . . . . . . . . . 244.2 Dimer probabilities with k=1 . . . . . . . . . . . . . . . . . . . . 244.3 Dimer probabilities with k=2 . . . . . . . . . . . . . . . . . . . . 244.4 Training accuracies for individual dimer features . . . . . . . . . . 264.5 Genetic Algorithm parameters and values . . . . . . . . . . . . . . 294.6 Performance of concatenated dimers during training . . . . . . . . 324.7 Summary of model performance during training for the various

schemes explored using 10 fold cross validation . . . . . . . . . . . 334.8 Performance of features on DD dataset for PFR . . . . . . . . . . 354.9 Performance of features on TG dataset for PFR . . . . . . . . . . 364.10 Performance of features on EDD dataset for PFR . . . . . . . . . 37

5.1 Summary of datasets for Structural Class Prediction . . . . . . . 405.2 Performance of concatenated dimers during training in SCP . . . 415.3 Performance of features on DD dataset for SCP . . . . . . . . . . 425.4 Performance of features on EDD dataset for SCP . . . . . . . . . 435.5 Performance of features on TG dataset for SCP . . . . . . . . . . 44

6.1 A list of parameters for the Genetic Algorithm . . . . . . . . . . . 51

x

6.2 A summary for the performance of the various models for predic-tion studied in this paper using the Gram positive bacterial datasetusing k-fold cross validation for k = 5, 6, . . . , 10 . . . . . . . . . . 54

6.3 A summary for the performance of the various models for pre-diction studied in this paper using the Gram negative bacterialdataset using k-fold cross validation for k = 5, 6, . . . , 10 . . . . . . 55

6.4 Results from the jackknife test performed on Gram positive bac-terial protein dataset . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.5 Results from the jackknife test performed on Gram negative bac-terial protein dataset . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.6 A detailed comparison of the various models studied using 10-foldcross validation on Gram positive bacterial dataset . . . . . . . . 57

6.7 A detailed comparison of the various models for prediction studiedusing 10-fold cross validation using the Gram negative bacterialdataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

xi

Chapter 1

Introduction

The objectives of this research, in brief, are to explore feature extractiontechniques and investigate various classification schemes to improve pre-diction accuracy of computational methods in protein fold recognition andsubcellular localization. This chapter introduces protein structucal classes,folds and subcellular localizations along with their biological aspects andimportance to researchers in bioinfomatics and proteomics. Common com-putational approaches are also disussed along with the different possiblesources of information available to researchers. Additionally, a summa-rized review of recent progress with regards to computational methods inthese fields have also been discussed.

Proteomics is the study of proteins, including their structures and functions, andan important component of functional genomics [1, 2]. It is an interdisciplinarydomain focusing on the exploration of proteomes from the overall level of intra-cellular protein composition, structure, and its own unique activity patterns.Proteins are vital parts of living organisms, as they are the main componentsof the physiological metabolic pathways of cells. They are complex moleculesthat perform several functions in living organisms such as structural support,movement, enzymatic activity and more.Proteins are composed of a sequence of smaller amino acids that are attachedtogether forming long peptide chains. There are 20 naturally occurring aminoacids that can be combined in various ways to form proteins. This sequenceof amino acids is called the protein’s primary structure. A protein’s secondarystructure is the localized coiling or bending of the amino acid chains formingstructures like alpha helix and beta pleated sheets. The tertiary structure ofa protein is it’s physical three dimensional shape, which is produced when theprotein molecule folds back onto itself forming disulphide bridges and hydrogen

1

bonds. The quaternary structure is a complex structure formed by the interactionof two or more proteins. This has been summarized in Figure 1.1.

Figure 1.1: A depiction of the various protein structures. Adopted fromWikipedia (https://en.wikipedia.org/wiki/Protein_structure).

1.1 Structural Classes and Folds

In biological sciences, knowing the biological function of proteins is critical inunderstanding biological processes. The tertiary structure of a protein closelyrelates biological function of the protein, thus, it is of great significance in biology[3, 4]. It also aids in the understanding of the heterogeneity of proteins, protein-protein interactions, protein-peptide interactions, and aids the development ofdrug designs. In drug design, knowing the tertiary structure of the target proteinis crucial since drugs are created to bind with the active sites on the targetprotein [3, 5].Protein fold recognition (PFR) of a protein sequence is considered to be an in-termediary step in identifying the tertiary structure of a protein. PFR is a stephigher than protein structural class prediction (SCP), which aims to determinethe structural classes, the secondary structure, of a protein. Prediction of thetertiary structure of a protein is considered a difficult problem since the pri-mary structure (sequence of amino acids) of different proteins can vary in lengthand similarities, however, they can still belong to the same fold [6]. PFR canbe defined as categorizing unknown protein sequences to its well defined folds. Majority of the folds are accurately represented and defined in the StructuralClassification of Proteins (SCOP) database [3].Furthermore, SCP aims to represent the major secondary structure of a protein.Identification of structural classes assists in identification of the overall folding

2

process and functionality of proteins. Each structural class can be further cate-gorized into a group of more specific fold types. SCOP defines up to 11 groupsof protein structural classes, however, the 4 major structural classes (all-α, all-β,α/β and α + β) cover over 90% of all identified proteins [7].

1.2 Subcellular Localization

As stated previously, most of the functions that are critical to a cell’s survival areperformed by the proteins in the cell. Therefore, a fundamental goal in proteomicsand cell biology is to identify the subcellular locations and functions of theseproteins. Subcellular Localization (SCL) can provide important insight aboutprotein functions, provide indication on the environment where they interactwith each other and other molecules, and assist in the understanding of thepathways that regulate biological processes at a cellular level [8,9]. Some commonsubcellular locations present in eukaryotic cells can be seen in Figure 1.2.

Figure 1.2: A diagram depicting subcellular locations in eukaryotic cells. Adoptedfrom Chou and Shen, 2007 [8].

3

1.3 Computational Intelligence in Proteomics

Traditional methods to decipher a protein’s structure include using X-ray crystal-lography and Nuclear Magnetic Resonance [10,11]. However, these experimentaltechniques require specialist equipment, are extremely time consuming, expensiveand seemingly impractical for protein sequences of extremely large lengths [10,12].As genome and other sequencing projects advance, there has been a monumen-tal increase in the generation of protein sequences, which has created a greaterdemand for moving from data accumulation to data interpretation [3, 13].Analysis of biological data obtained from genome sequencing is necessary to un-derstand cellular functions, drug discovery and determining evolutionary relation-ships [5, 8]. As the number of sequences for analysis increase, it gets increasingdifficult for traditional methods to meet with demands. Moreover, not all proteinsare amenable to experimental structure determination, it is necessary to investi-gate computational models as alternatives for protein sequence analysis [3].Similarly, subcellular localizations can be determined by conducting biochemicalexperiments, however, they are also time-consuming and costly [8]. To cope withthe problem of rapidly increasing protein sequences, computational methods arebeing explored to classify unknown protein sequences into their respective classes.Several approaches employing the use of computational models can be seen inliterature for these tasks. In fold recognition, mainly, this task involves predictionof 4 major structural classes (secondary structure) or the prediction of a subsetof the most common fold types, each of which belong to one of the structuralclasses [3]. Naturally, it is more challenging to predict the fold types (usually inexcess of 20) than the 4 major structural classes.In contrast, SCL has some key differences when compared to PFR and SCP.In SCL, the aim is to determine one or more subcellular localizations of thetarget protein, however, PFR and SCP are usually have one-to-one mappingbetween proteins and folds or structural classes. Additionally, problems in SCLare usually regarded by an organism specific approach, whereby predictors aredeveloped according to specific organisms such as bacteria, plants and humans.Therefore, computational techniques involving SCL should allow for multiplex(multiple sites) prediction of proteins for a range of subcellular locations, usuallyfrom 4 to 22 sites [8, 9, 14].

1.3.1 Categorization of methods

Computational methods employed for proteomics can be categorized into twolarge categories, namely being template-based and taxonomy-based approaches[12].

4

Template Approach

Template approaches involve template libraries, whereby a query protein is alignedwith recognized template proteins to predict the various attributes of the queryprotein. There are three main subcategories of the template approach. Firstly, inHomology Modeling, sequence similarity measure is used to compare the target se-quence with a known sequence and the structure is identified based on the degreeof similarity. It assumes that two proteins have similar attributes if they havehigh sequence homology [15]. Secondly, Threading detects the structural similar-ities against a library of known structures [16]. Thirdly, the De Novo Approachrelies on prediction solely from the sequence alone with physics and chemistrylaws. Suitable energy functions are used to simulate protein folding in atomicdetail using methods like molecular dynamics or Monte Carlo simulations [17].The premise of this approach, therefore, is that a protein can be accurately mod-eled if its homologous templates can be identified. However, a problem largelyassociated with template-based methods is that it is difficult to detect homologiesfor proteins with low sequence identities [18].

Taxonomic Approach

The taxonomy approach follows the assumption that in nature, there exists onlya limited, predetermined number of protein folds [10, 19]. Similar assumptionsare made in SCL where the sites are discretized. Subsequently, machine learningtechniques can be applied to these problems since now it can be viewed as aclassification problem whereby every query protein belongs to a particular label,may it be the structural class, fold or subcellular location. This is a very popularmethod whereby most implementations of taxonomy-based methods, if not all,have adopted the SCOP protein structural classification architecture [12].This approach has two distinct stages, feature extraction and classification, andboth are equally important to obtain good results [20]. Feature extraction meth-ods are required to extract useful information from proteins and machine learningclassifiers are usually required to model this information for implementing theclassification task. Machine learning techniques use algorithms that are designedto learn from a predefined training data to build a model for predication froma protein’s sequence information. Primarily, in taxonomic approaches, data isextracted from existing databases, containing sequences and experimentally de-termined secondary and tertiary structures, which are used to develop models forprediction [3].

5

1.3.2 Computational representation of proteins

Primarily, there are a few generalized categories of methods that are used torepresent proteins computationally for analysis. These methods, also known asfeature extraction techniques, can be categorized based on the type of informationfrom which data has been extracted to build computational models.Structural information (or syntactical information) largely refers to informationextracted directly from the amino acid sequence in the proteins. Some of thesetechniques include counting techniques such as amino acid composition [3], pair-wise frequency [21] and pseudo-amino acid composition [13]. These techniquescan provide good prediction accuracies, however, they may suffer if the sequenceidentity cutoff is low.Another very popular source of information is sequential evolution. This capturesthe evolution in protein sequences, which involves changes of single residues, in-sertions and deletions of several residues, gene doubling, and gene fusion. Overa period of time, such changes accumulate, so that many similarities betweeninitial and resultant amino acid sequences may be eliminated, but the corre-sponding proteins may still share many common attributes, such as belongingto the same structural class, subcellular locations and possessing basically thesame functions [14, 22]. A popular tool for extracting evolutionary informationfrom query proteins in the form of Position Specific Scoring Matrix (PSSM) isPSI-BLAST [23].Moreover, biochemical properties of amino acids can be used to model proteins.These physicochemical properties of amino acids are the expressions of each aminoacid for various biochemical measures such as hydrophobicity, hydrophilicity, po-larity and van der Waals volume. There are currently 544 amino acid indices asper AAindex version 9.1 [24]. Other possible sources of data include annotationsfrom functional domains and gene ontology databases.

1.3.3 Related work in brief

This research follows the taxonomic approach, therefore, mostly taxonomy-basedmethods will be discussed and used for comparison purposes. Additionally, morefocus is given to information extraction from structural data and sequential evolu-tion, and proposed techniques with such features from literature will be discussed.In the machine learning approach, a variety of algorithms have been utilizedin PFR and SCP. Some of the more popular classification algorithms includeLinear Discriminant Analysis [25], K-Nearest Neighbor (KNN) [26], BayesianClassifiers [27], Support Vector Machine (SVM) [5, 18, 28–32], Artificial NeuralNetworks (ANN) [33–35] and Ensemble classifiers [7, 11, 36, 37], which incorpo-rate multiple classification algorithms. Out of the previously mentioned machine

6

learning algorithms, SVM based have classifiers showed promising results in PFRand SCP [20].Most commonly used features include structural and evolutionary features inPFR and SCL, however, there are some researchers also employ physicochem-ical attributes and functional domains for feature extraction [38]. Dubchak etal. [39] suggested syntactical and physicochemical based features in which theyused five attributes of amino acids namely - hydrophobicity, predicted secondarystructure based on normalized frequency of α helix, polarity, polarizability andvan der Waals volume. These features have in turn been widely adopted by otherresearchers in PFR and SCL [3,40]. Furthermore, additional attributes have beenused later on such as solvent accessibility [41], flexibility [42] and bulkiness [43].Taguchi and Gromiha used syntactical features (occurrence and composition) forPFR [44]. Ghanty and Pal have employed pairwise frequencies of amino acids,both for adjacent and separated by one residue [21]. Chou has proposed pseudo-amino acid composition (PseAAC) based features to extract localized as well asglobal information from structural features [13] . Several researchers have usedPSSM for improving PFR and some of these include auto cross-covariance [18],bi-gram [5], tri-gram [45], sequence alignment via dynamic time warping [46].Similarly, numerous computational methods have been applied in SCL, whereboth template based and taxonomic approaches are popular. In the templateapproach, tools such as BLAST are used to make predictions, however, theseapproaches fail to preform when the query proteins do not have significant ho-mology to known proteins [8]. For taxonomic approaches, popular classifica-tion algorithms are SVM [47, 48], KNN and its variants [22, 49], Naïve Bayes(NB) [50] and ensemble classifiers [51] . As with PFR, the representation of agiven protein prior to classification is critical. Feature extraction and representa-tion techniques that obtain discriminatory information are extremely diverse inSCL and most published work includes more than one feature extractor in orderto achieve good results. Some of these techniques include amino acid composi-tion, signaling sequences, n-peptide composition, PseAAC and its various modes,gene ontology, functional domain composition, physicochemical properties andhomology [8, 52–57].

7

Chapter 2

Materials

Different datasets have been used for experimentation in this research. Thischapter provides details about these datasets and the type(s) of data theyrepresent.

Various datasets were used for the purposes of analysis and benchmarking duringthe course of this research. This section describes these datasets, their sourcesand their specific criteria based on homology cutoff. Since SCP and PFR arehierarchical in nature, datasets for these fields were reused, by reverting thesample labels from folds to their structural classes during different stages of theexperiment. However, different datasets had to be used for SCL as the problemdomain is significantly different from SCP and PFR.

2.1 Datasets for SCP and PFR

For SCP and PFR, three different datasets were used. Each of these datasetscontain samples from four major structural classes of proteins, namely - α, β, α/βand α+β. Each of these structural classes can be further categorized into severalfolds (in the excess of twenty). For SCP, the structural class labels were used forconducting the experiments and analysis whereas the labels for the various foldtypes were used in the context of PFR.

2.1.1 Ding and Dubchak Dataset

The Ding and Dubchak (DD) dataset consists of proteins that belong to 27 SCOPfolds covering all four major structural classes. The DD dataset also provides abenchmarked training set for the creation of the model and an independent test

8

set for evaluating query proteins against the model. The training dataset consistsof 311 protein sequences where any given pair of sequences do not have more than35% sequence identity for aligned subsequences longer than 80 residues and thetest set consists of 383 protein sequences where the sequence identity betweenany two given proteins is less than 40% [3]. The details of the DD dataset areprovided in Table 2.1.

2.1.2 Extended Ding and Dubchak Dataset

The Extended Ding and Dubchak (EDD) dataset extends the DD dataset byincorporating more proteins in the dataset. It, similarly, contains 3418 proteinsfrom 27 SCOP folds with less than 40% sequential similarity. The EDD datasetdoes not provide separate training and test sets [18]. The details of the EDDdataset are provided in Table 2.2.

2.1.3 Taguchi and Gromhia Dataset

The Taguchi and Gromhia (TG) dataset covers 30 SCOP folds consisting of 1612protein sequences. The TG dataset has a very low threshold for sequence similar-ity whereby sequence similarity of proteins is no more than 25% [44]. The detailsof the TG dataset are provided in Table 2.3.

2.2 Datasets for SCL

Similarly, two datasets of two different types of bacteria organisms were usedfor SCL. These datasets contain samples of proteins that belong to one or moresubcellular locations, therefore, some samples may have more than one label inthe datasets. Some common subcellular locations include the cell membrane,cell wall, extracellular regions, etc. Detailed descriptions of the datasets and thelocations they cover are provided in this section.

2.2.1 Gram Positive Dataset

This dataset comprises of Gram positive bacterial proteins that contains bothsingleplex and multiplex proteins, which cover four subcellular locations. It con-tains 519 unique proteins where 515 proteins belong only to one location and4 proteins belong to two locations. Similarly, it also has a pairwise sequencesimilarity threshold of 25% [54]. The details of the Gram positive dataset areprovided in Table 2.4.

9

Table 2.1: Summary of Ding and Dubchak (DD) dataset

Folds Train Samples Test Samplesα

Globin-like 13 6Cytochromec 7 9DNA-binding 3-helical-bundle 12 204-Helical up-and-down bundle 7 84-Helical cytokines 9 9Alpha; EF-hand 6 9β

Immunoglobulin-like β-sandwich 30 44Cupredoxins 9 12Viral coat and capsid proteins 16 13ConA-like lectins/glucanases 7 6SH3-like barrel 8 8OB-fold 13 19Trefoil 8 4Trypsin-like serineproteases 9 4Lipocalins 9 7α/β

(TIM)-barrel 29 48FAD (also NAD)-binding motif 11 12Flavodoxin-like 11 13NAD(P)-binding Rossmann-fold 13 27P-loop containing nucleotide 10 12Thioredoxin-like 9 8Ribonuclease H-like motif 10 12Hydrolases 11 7Periplasmic binding protein-like 11 4α + β

β-Grasp 7 8Ferredoxin-like 13 27Small inhibitors, toxins, lectins 13 27

10

Table 2.2: Summary of Extended Ding and Dubchak (EDD) dataset

Folds Number of Samplesα

Globin-like 41Cytochromec 35DNA-binding 3-helical-bundle 3224-Helical up-and-down bundle 694-Helical cytokines 30Alpha; EF-hand 59β

Immunoglobulin-like β-sandwich 391Cupredoxins 47Viral coat and capsid proteins 60ConA-like lectins/glucanases 57SH3-like barrel 129OB-fold 156Trefoil 45Trypsin-like serineproteases 45Lipocalins 37α/β

(TIM)-barrel 336FAD (also NAD)-binding motif 73Flavodoxin-like 130NAD(P)-binding Rossmann-fold 195P-loop containing nucleotide 239Thioredoxin-like 111Ribonuclease H-like motif 128Hydrolases 83Periplasmic binding protein-like 16α + β

β-Grasp 121Ferredoxin-like 339Small inhibitors, toxins, lectins 124

11

Table 2.3: Summary of Taguchi and Gromhia (TG) dataset

Folds Number of Samplesα

Cytochrome C 25DNA/RNA binding 3-helical bundle 103Four helical up and down bundle 26EF hand-like fold 25SAM domain-like 26α - α super helix 47β

Immunoglobulin-like β-sandwich 173Common fold of diphtheria toxin/transcription factors/cytochrome 28Cupredoxin-like 30Galactose-binding domain-like 25Concanavalin A-like lectins/glucanases 26SH3-like barrel 42OB-fold 78Double-stranded α-helix 34Nucleoplasmin-like 42α/β

TIM α/β-barrel 145NAD(P)-binding Rossmann-fold domains 77FAD/NAD(P)-binding domain 31Flavodoxin-like 55Adenine nucleotide a hydrolase-like 34P-loop containing nucleoside triphosphate hydrolases 95Thioredoxin fold 32Ribonuclease H-like motif 49S-adenosyl-L-methionine-dependent methyltransferases 34α/β-Hydrolases 37α + β

β-Grasp, ubiquitin-like 42Cystatin-like 25Ferredoxin-like 118Knottins 80Rubredoxin-like 28

12

Table 2.4: Summary of Gram Positive bacterial protein datasetSubcellular Location Number of SamplesCell membrane 174Cell wall 18Cytoplasm 208Extracellular 123

Table 2.5: Summary of Gram Negative bacterial protein datasetSubcellular Location Number of SamplesCell inner membrane 557Cell outer membrane 124Cytoplasm 410Extracellular 133Fimbrium 32Flagellum 12Nucleoid 8Periplasm 180

2.2.2 Gram Negative Dataset

In this dataset, Gram negative bacterial proteins covering eight subcellular loca-tions are collected. It contains 1392 unique proteins where 1328 proteins belongonly to one location and 64 proteins belong to two locations. Similarly, it alsohas a pairwise sequence similarity cut-off of 25% [58]. The details of the Gramnegative dataset are provided in Table 2.5.

13

Chapter 3

Prevalent Methods in Literature

In Chapter 1, various methods had been mentioned. This chapter providesthe mathematical details of these methods and their sources for featureextraction.

Recently, there have been methods published in this field that provide promisingresults and some of these methods are discussed in this section. Several featureextraction techniques and schemes that have previously been published in lit-erature are discussed. These features will be used for comparison against theschemes explored in this research in the latter chapters.

3.1 Features for comparison

In this section, features that have been adopted from literature for purposes ofcomparison and benchmark are discussed. These features have been categorizedaccording to the types of information they use to represent proteins computa-tionally.

3.1.1 Structural based features

Structural (or syntactical) features extract information directly from the proteinsequence, which is either the raw sequence or the consensus sequence. Thesefeatures may contain local and/or global information to model proteins.

14

Amino Acid Composition (AAC)

Amino Acid Composition (AAC) determines the composition of amino acids inthe protein sequence [3]. It produces a feature vector in which every elementcorresponds to the fraction of occurrence of an amino acid in the protein sequence.The fraction of occurrence for an amino acid i can be calculated as follows:

fi = ni/N

In this equation, fi is the fraction of composition for amino acid i, ni is the numberof occurrences of amino acid i and N is the length of the protein sequence. Thenumber of features produced via AAC is 20.

Pairwise Frequency

In pairwise frequency (PF1), occurrences of all possible combinations of aminoacid pairs are computed [21]. Since there are only 20 naturally occurring aminoacids, the total number of unique amino acid pairs is 400. In PF1, features arecomputed from amino acid pairs that are separated by one amino acid in theprotein sequence.

Alternate Pairwise Frequency

Alternate Pairwise Frequency (PF2) is similar to PF1 with the only major differ-ence being that features are computed from amino acid pairs that are adjacentto each other in the protein sequence [21].

Trigram Occurrence

Features by this technique are computed by determining the number of occur-rences of amino acid triplets [46]. The maximum number of combinations fortriplets of amino acids is 8000. These triplets are computed from amino acidsthat are adjacent to each other in the primary sequence.

3.1.2 Evolutionary based features

Evolutionary features utilize the information provided by tools like PSI-BLASTand SPINE-X. This information, mostly represented in the form of PSSM, is useddirectly or incorporated with other features to improve classifier performance.

15

Consensus

PSSM, as stated previously, provides the probabilities for the occurrence of aminoacids at every position in the protein sequence. When determining the consensus,amino acids with the highest probabilities can be used to substitute the variousamino acids along the protein chain whose probabilities of occurrence are lower.In this research, methods that have a prefix of PSSM+ denote methods whichcompute features from the consensus sequence rather than the raw sequence.

Auto Cross-Covariance (ACC)

Auto Cross-Covariance (ACC) is a technique that extracts features from directlyfrom PSSM by measuring correlation [18]. ACC is a concatenated feature thatbuild upon two distinct variables, auto-correlation (AC) and cross-correlation(CC).AC is variable that computes the correlation among the occurrences of the sameamino acid, i, in PSSM separated by a distance of d in the sequence as shownbelow:

AC(i, d) =L−d∑j=1

(Si,j − S̄i)(Si,j+d − S̄i)/(L − d)

In the equation shown above, L is the length of the sequence, Si,j is the probabilityof amino acid i at position j in PSSM and S̄i is the average probability for aminoacid i in PSSM as calculated by:

S̄i =L∑

j=1Si,j/L

Moreover, the variable d is a range of numbers upto a maximum threshold ofD, i.e., d = 1, 2, 3, . . . , D. Hence, the length of the features computed by AC isdependent on D, which is equal to 20 ∗ D.CC measures the correlation between two different amino acids, i1 and i2, inPSSM separated by a distance of d in the sequence, which can be computed asfollows:

CC(i1, i2, d) =L−d∑j=1

(Si1,j − S̄i1)(Si2,j+d − S̄i2)/(L − d)

In this equation, S̄i1 and S̄i2 represent the average probabilities for amino acids i1and i2 in PSSM. The length of the features computed by CC are equal to 380∗D.

16

Bi-gram Transition Probabilities via PSSM

This method, called bi-gram (for short), aims to model transition probabilities ofamino acid pairs via PSSM rather than the protein sequence [5]. The probabilityof transition from amino acids m to n from a PSSM matrix P with a length Lcan be computed as follows:

Bm,n =L−1∑i=1

Pi,mPi+1,n , where 1 ≤ m ≤ 20 and 1 ≤ n ≤ 20

This equation produces a matrix B that has 400 elements ranging over all possibleamino acid bigram combinations (202). This matrix can be re-ordered into avector as follows:

F = [B1,1, B1,2, . . . , B1,20, B2,1, B2,2, . . . , B20,1, B20,2, . . . , B20,20]

Tri-gram Transition Probabilities via PSSM

Tri-gram models the transition probabilities of three amino acids occurring con-secutively through PSSM [45]. The following equation can be used to computethe probability of transition for amino acids m, n and o in the respective orderfrom a PSSM matrix P with a length L as:

Tm,n,o =L−2∑i=1

Pi,mPi+1,nPi+2,o , where 1 ≤ m ≤ 20, 1 ≤ n ≤ 20, 1 ≤ o ≤ 20

This equation produces a matrix T that has 8000 elements containing the proba-bilities for all possible trigram combinations (203). This matrix can be re-orderedinto a vector in a way similar to bi-gram.

Alignment via Dynamic Time Warping

In Alignment via Dynamic Time Warping (DTW), features are extracted fromPSSM in a three-stage process [46]. Firstly, the dissimilarity cosine distancebetween two target proteins is determined, which can be computed as follows:

d(p1i, p2j

) = 1 − p1ipT

2j√p1i

pT1i

p2jpT

2j

In this equation, P1 and P2 are the PSSM matrices of the proteins which L1 andL2 number of rows respectively. p1i

and p2jrepresent the row vectors of P1 and

17

P2 respectively, where 1 ≤ i ≤ L1 and 1 ≤ j ≤ L2. d is the dissimilarity distancefor rows p1i

and p2j. If this is calculated for all rows of P1 and P2, a dissimilarity

matrix, S, can be constructed with dimensions L1 × L2.Upon calculating the dissimilarity matrix, the minimum cost path is determinedvia dynamic time warping to produce a cumulative dissimilarity matrix D, whichproduces the alignment cost. The cumulative dissimilarity matrix D can be com-puted as:

Di,j = min(Di,j, Di,j−1, Di−1,j−1) + Si,j

Eventually, the kernel distance between P1 and P2 is computed that produces amatrix, K, that contains the distances between all pairs of proteins in the sample.The kernel parameter, γ, is determined during training and the kernel functionis expressed below:

KP1,P2 = exp(D2P1,P2/γ2)

3.1.3 Physicochemical based features

Physicochemical features are techniques that represent proteins using the chemi-cal properties of amino acids. There are currently 544 physicochemical attributeslist as per AAindex version 9.1 [24].

Dubchak’s Features

In addition to representing proteins using AAC, Dubchak et al. (2001) proposedthat feature performance during classification can be improved if physicochemicalproperties of amino acids are also used to represent proteins [3]. Dubchak etal. suggested syntactical and physicochemical based features in which they usedfive attributes of amino acids namely - hydrophobicity (H), predicted secondarystructure based on normalized frequency of α helix (X), polarity (P), polarizability(Z) and van der Waals volume (V) [3]. These features have in turn been widelyadopted by other researchers in PFR and SCP. This technique can be abbreviatedas AAC+HXPZV.These five physicochemical attributes define the chemical expressions of the twentynaturally occurring amino acids, thus, providing a numeric expression value forevery amino acid for that particular attribute. Dubchak et al. proposed to sortthese numeric expressions and cluster them into three discrete groups (g1, g2, g3)based on the differences between expressions values, which has been done arbi-trarily in their research. Once this discretization has been conducted, Dubchak etal. computes the features using three different measures, composition, transition

18

and relative composition per percentile. Firstly, the composition of every discretegroup in the protein is determined by simply substituting the amino acids in thesequence with their corresponding groups and the composition is then computedon the newly formed sequence. Secondly, to compute transitions, simply countthe occurrences of groups in pairs. For instance, count the number of transitionsfrom g1 to g2 in the substituted sequence. Moreover, mirrored transitions aresummed as one, i.e., transitions from g1 to g2 and g2 to g1 are summed together,whereas transitions to like terms are ignored (transitions from gi to gi are notcounted). Lastly, the relative occurrence for each group is computed per per-centiles, which were the 25%, 50% and 75% percentiles. Upon computing andconcatenating these features, the resulting feature vector has the length of 125.

19

Chapter 4

Exploring feature extraction inprotein fold recognition

Identification of the tertiary structure (three dimensional shape) of a pro-tein is a fundamental problem in biology since it helps in identifying theprotein’s functions. Predicting a protein’s fold is considered to be an in-termediate step in identifying the tertiary structure of a protein. In thischapter, a novel feature extraction technique has been explored. It mod-els proteins based on the information present in sequential evolution viaPSSM. The performance of the proposed scheme has been evaluated againstthe benchmarked datasets. The proposed scheme performed well in exper-iments, showing improvements over previously published results in litera-ture.

As stated previously, this research will be adhering to the taxonomic approach andwill focus on extracting features from structural and evolutionary data sources.Recently, there has been considerable research in counting techniques to modelproteins. Some of the more basic approaches include amino acid composition,which basically represents the proportion of the various amino acids in the proteinsequence. This has extended by counting pairs of amino acids instead of singleoccurrences of amino acids as done by Ghanty and Pal (2009) [21]. Amino aciddimers are various possible pairs of amino acid that can appear in the protein se-quence and are also sometimes called bi-grams or amino acid pairs by researchers.These techniques model the protein by capturing global information, i.e., theyconsider the entire sequence as a whole as opposed to other schemes that dividethe sequence in parts, or terminals, to capture localized information.

20

The counting techniques discussed above extract features from structural infor-mation, i.e., directly from the protein sequence. However, Sharma et al. (2013)proposed a feature extraction method (bi-gram) that models probabilities of singleand pair-wise amino acid occurrences using the information present in sequentialevolution [5]. This has been further extended by Paliwal et al. (2014) to repre-sent probabilities of occurrence of tri-grams (occurrences of three amino acids ina tuple) [45]. These methods extract the evolutionary probabilities of amino acidoccurrences from PSSM, which is extracted via PSI-BLAST. This approach ofusing evolutionary-based features has provided significant improvements for PFRand SCP.The feature extractor developed in this research extends the bi-gram technique [5],whereby the amino acids under consideration for a particular bi-gram can onlybe adjacent in the primary sequence. Ghanty and Pal have shown via their workthat amino acid pairs that are separated by one residue in the primary sequencealso contain significant amounts of information that can be used to improve clas-sification [21], although they have computed features directly from the primarysequence. In this work, a feature extractor is developed that computes proba-bilistic expressions of amino acid dimer occurrences that can be spatially variedin the primary sequence. Instead of only considering dimers that are adjacentor separated by one residue, this technique also aims to consider dimers that areseparated by more than one residue.

4.1 Conceptual Overview

The focus in this research was to explore feature extraction techniques thatbuild upon structural data and sequential evolution. Conceptually, this tech-nique was formed to extend upon Sharma et al.’s (2013) and Ghanty and Pal’s(2009) method where they computed dimer occurrences from adjacent aminoacids as well as those amino acids that were separated by one residue in the se-quence [5, 21]. The aim in this research was to build upon these concepts andconsider dimers that may be spatially separated with greater distances. Thisconcept has been illustrated in Figure 4.1.In Figure 4.1, a fictional protein sequence ATRRA is depicted in part (i). In part(ii), adjacent dimers are visualized with the degree of spatial separation as k = 1.Similarly for parts (iii) and (iv), dimers are visualized with the degree of spatialseparation as k = 2 and k = 3 respectively. Dimers can also be computed usinghigher degrees of spatial separation, if required.However, computing occurrence directly from the primary sequence leads to theproblem of having too many zeros for non-existent in the extracted features,thereby, degrading classifier performance. This issue of zero valued feature vectorshas been dealt with by Sharma et al. (2013), where they compute probabilities of

21

Figure 4.1: An illustration of dimer counting strategies that lead to developmentof the proposed feature extraction technique.

dimer occurrences directly from PSSM rather than counting actual occurrences[5], which improves the classifier performance significantly.The feature extraction technique explored computes probabilistic expressions foroccurrences of amino acid dimers that may or may not be adjacent in the aminoacid chain. Amino acid dimers can be computed for various degrees of spatialseparation, which can be arbitrarily or empirically determined. This method aimsto model relationships between amino acid dimers that have not been previouslymodeled or explored in literature.

4.2 Feature Extraction

In the proposed feature extraction technique, amino acid dimers are consideredthat may be separated by other amino acids, spatially, in the primary sequence.k determines the degree of spatial separation between the dimers under consid-eration. This section describes the mathematical details of this technique alongwith a worked example for the purposes of clarity.

4.2.1 Algorithmic Details

This technique has been mathematically summarized in Equation 4.1. If P isthe PSSM matrix representation for a given protein, P will have L rows and20 columns, where L is the length of the protein sequence. The probabilisticexpression for the occurrence of a dimer of the mth amino acid to nth amino acidcan be computed using Equation 4.1 where 1 ≤ m ≤ 20 and 1 ≤ n ≤ 20 sincethere are only 20 naturally occurring amino acids.

λm,n(k) =L−k∑i=1

Pi,mPi+k,n , where 1 ≤ m ≤ 20, 1 ≤ n ≤ 20, 1 ≤ k ≤ φ (4.1)

22

Equation 4.1 extracts probabilities of occurrences of possible amino acid dimersinto matrix λ(k) for all values of m and n. There are 400 elements in λ(k),comprising of all combinations that can arise in amino acid dimers with thedegree of spatial separation as k.As discussed previously, k represents the spatial distance between amino acidpositions that are used to compute the probabilistic expressions. For k=1, theprobabilistic expressions are computed between adjacent amino acids whereas,for k=2, the probabilistic expressions are computed between amino acid dimersthat are separated by one amino acid in the primary sequence. Therefore, fork = φ, the amino acid dimers used to calculate the probabilistic expressions areseparated by φ − 1 amino acids.The extracted matrix λ(k) can be rearranged into a vector for processing as perthe equation below:

λ(k) =[λ1,1(k), λ1,2(k), . . . , λ1,20(k),λ2,1(k), λ2,2(k), . . . , λ2,20(k),. . . ,

λ20,1(k), λ20,2(k), . . . , λ20,20(k)]

(4.2)

4.2.2 Demonstration

In order to illustrate this concept, lets consider a fictional protein sequence CT-TRCC of length L=6 and assume that, for illustration purposes, there are onlythree unique naturally occurring amino acids, C, R and T. Table 4.1 shows thePSSM of this protein sequence. Equation 4.1 can be used to extract the tran-sition probabilities for k=1 and k=2. It should be noted that, in this example,the upper bound for the values of m and n is 3. Transition probabilities for k=1and k=2 are shown in Tables 4.2 and 4.3 respectively. As shown in the tables,this produces two completely different sets of features. These matrices can bereformatted as per Equation 4.2 to obtain feature vectors, which can then beprocessed by classifiers for model creation and evaluation.

4.3 Materials

The DD, EDD and TG datasets were used for experimentation in PFR. Data wassegregated into training and test sets during the training phase of experimentationwhere applicable. The training set from DD dataset was used, however, forthe remaining datasets, the data was randomly divided with 60% of the databelonging to the training set and the remaining 40% to the test set. It wasensured that the distribution of folds in the two sets were equal proportionally.

23

Table 4.1: PSSM of a sample protein - CTTRCC

Amino Acid C R TC 0.10 0.60 0.30T 0.45 0.35 0.20T 0.20 0.56 0.24R 0.31 0.42 0.27C 0.66 0.17 0.17C 0.13 0.71 0.16

Table 4.2: Amino acid dimer probabilities with the degree of spatial separationas k=1

Amino Acid C R TC 0.4874 0.8923 0.3403R 0.8129 0.8333 0.4538T 0.4497 0.4844 0.2459

Table 4.3: Amino acid dimer probabilities with the degree of spatial separationas k=2

Amino Acid C R TC 0.3318 0.4991 0.2291R 0.6527 0.8764 0.4009T 0.3155 0.4845 0.2100

24

4.4 Classification Schemes

The features extracted using Equation 4.1 can be directly used for classificationby extracting features for various degrees of spatial separation. Since the featuresextracted for different degrees of spatial separation k are independent, these fea-tures can be used individually to form prediction models. This has been doneby Sharma et. al. (2013) for k = 1, which can be viewed as a special case ofspatially separated amino acid dimers where the amino acid dimers are adja-cent in the primary sequence [5]. Similarly, features extracted for other valuesof k can be individually used to form predication models. During the course ofthis research, various strategies to form prediction models have been discussed.These strategies included individual selection, ensemble voting and concatenatedfeatures. The first two strategies employed the features independently whereasthey were considered as a single set of features in the last strategy in which theyare concatenated.

4.4.1 Classification on individual features

Initially, features extracted for different degrees of spatial separation were consid-ered individually and classification models were built to evaluate the performanceof these features. The experiment results are shown using the SVM classifier sinceit out-performed other classifiers for these features and they are shown for DD,EDD and TG datasets for some arbitrary values of k from k = 1, 2, . . . , 10 inTable 4.4.The results shown in Table 4.4 were performed on the training set and theyhighlight that, individually, these features provide relatively good classificationaccuracies, indicating that there is discriminatory information present in thesefeatures. Another observation is that the accuracies decrease steadily as thedegree of spatial separation (k) in amino acids dimers increases, indicating that,relatively, there is addition of noise or a loss of discriminatory information withlarger spatial distances within dimers.

4.4.2 Voting schemes for ensemble methods

As per the results of the previous stage of experimentation, it can be observedthat individually, the features are not able to perform close to the benchmarks.However, it is possible to create an ensemble scheme with voting where predictionsfrom multiple features can be aggregated to get a consolidated set of predictions.This ensemble model consists of several SVM instances, all of which provided theirown sets of predictions. Therefore, it was important to formulate a scheme toaggregate predictions from classifiers efficiently to produce one consolidated set of

25

Table 4.4: Individual classification accuracies of amino acid dimers with spatialseparations from k=1,2,. . . ,10 using training set

k DD EDD TG1 65.8 82.2 63.82 64.2 82.5 64.33 65.0 81.2 61.94 63.7 81.2 61.55 63.1 79.0 59.56 62.2 78.6 58.17 60.6 78.8 58.48 62.2 77.3 56.99 61.9 76.8 56.410 60.8 76.6 56.111 60.1 75.0 55.812 59.7 74.7 55.5

predictions. The selection of a method to consolidate predictions was challengingas optimal results would only be achieved when this method was able to identifycontributions by each set of features, λ(k), and suitably aggregate predictions.Considerable experimentation was carried out to determine the optimal scheme,which included simple majority voting, selection of top-n features and optimizedweighted voting using the genetic algorithm (GA).Moreover, in this scheme, it is necessary to determine which features are to beconsidered in the ensemble classification, implying that there has to be a strategyto determine the range of k. The upper bound for the range of k has been denotedby φ in Equation 4.1.In the ensemble model, different instances of SVM were used to build classificationmodels for every feature, λ(k). Therefore, in this scheme there can be φ differentinstances of SVM, each of which is trained with a particular set of features,λ(k), extracted for a particular value of k. The parameters for each of the SVMinstances were tuned such that they yielded the highest training accuracies.

26

Termination Criteria for φ

Determining the optimal value of φ is a key challenge in this scheme. If thechosen value for φ is too small, there is loss of information that may lead topoor performance of the scheme. However, a value for φ that is too large willunavoidably add noise to the data, which will also lead to a significant loss inperformance. Hence, there is a need to have a strategic approach to determinethe optimal value for φ.Essentially, training accuracies of λ(k) were observed and a suitable value for φwas determined by analyzing the training accuracies. All features for differentvalues of k were selected as long as they were approximately within 5% of thetraining accuracy of the feature for k=1. Additionally, for the purposes of sim-plicity, it was ensured that the range of k was continuous, so that the first featurethat does not meet the criteria resulted in the termination of further evaluations.As per the results in Table 4.4, for the DD dataset, the training accuracies for λ(k)have been determined for a range of k. Upon analysis, the base feature (k = 1) hasthe training accuracy of 65.8%, therefore, if the approximately 5% range criteriais applied, the minimum accuracy required for selection is around 60%. Thecriteria was met with k = 11 that has 60.1% training accuracy, therefore, optimalvalue of φ was determined of φ = 11, resulting with the range of k as k=1,. . . ,11.Similarly, for the EDD dataset, the minimum requirement was determined to beapproximately 77% resulting with φ = 8 and, for the TG dataset, φ = 7 wasselected with the minimum requirement being approximately 58%.

Simple voting scheme

Initially, a simple majority voting system was used for determining the finalprediction using all k = 1, . . . , φ SVM models. The classification accuracy of thisscheme had a slight improvement over the highest individual performance of λ(k).This scheme can be seen as a special case of the weighted voting scheme whereall the weights are equal. The results are shown in Table 4.7. For purposes ofclarity, this scheme shall be referred to as Scheme 1A.

Top-n voting scheme

The previous results show that there is only slight improvement in the results. Itcan be easily intuited that not all features from k=1,2,. . . ,φ are required for im-proving classification accuracy. Therefore, a new scheme was nominated wherebya subset of n best performing features were selected based on their training ac-curacies individually. Once these top-n features were selected, simple majorityvoting was used to determine the final set of predictions. This scheme lead to

27

further improvement in the classification accuracy as shown in Table 4.7. Forpurposes of clarity, this scheme shall be referred to as Scheme 1B.

Optimized weighted voting scheme

The approach of selecting a subset of features has its shortcomings. It was difficultto determine the optimal number of features (n) that will be selected to yield thebest results. Additionally, by removing certain features before the final prediction,there is no input of information from the dimer probabilities for those degrees ofspatial separation. Therefore, a more holistic approach was approached wherebyan evolutionary machine learning technique, genetic algorithm (GA), was used toassist in consolidating the predictions. This scheme has been shown in the formof a flowchart in Figure 4.2. For purposes of clarity, this scheme shall be referredto as Scheme 1C.

Figure 4.2: A flowchart depicting classification using GA optimized voting scheme

The evolutionary approach to machine learning is based on computational mod-els which includes natural selection and genetics. This is known as evolutionarycomputation which simulates evolution on a computer and encompasses geneticalgorithms, evolutionary strategies and genetic programming. The latter tech-niques simulate evolution using selection, mutation and reproduction processes.GA, in brief, is an optimization algorithm which iteratively improves the qualityof a solution using a stochastic approach.In the proposed scheme, GA was used to optimize weights for voting assigned toeach feature. These weights ranged are real values between 0 to 1 inclusive, with0 indicating no input for the final prediction and 1 indicating maximum inputtowards the final prediction. The final prediction was simply based on selectingthe output variable with the highest weighted votes. This has been summarizedin Algorithm 1 shown below.

28

Loop through each classSum the weighted votes for that class

End LoopPredicted Class := Class with highest votes

Algorithm 1: Weighted voting system for prediction consolidation

The Optimization Toolbox from the Matrix Laboratory (MATLAB) provided animplementation of the genetic algorithm for minimization problems, which hasbeen used in this research. Since the aim was to maximize the classificationaccuracy, the fitness function had to be modified accordingly to represent theproblem as a minimization function. The fitness function is shown in Equation 4.3and the GA parameters are provided in Table 4.5. GA was allowed to terminateafter the change in the fitness values stagnated.

Fitness = Ns

Nc + 1where Ns = Number of Samples

Nc = Number of correctly classified samples

(4.3)

Table 4.5: Genetic Algorithm parameters and values for the weight voting scheme

Parameter ValueNumber of Generations 1000Stall Limit 100Population Size 100Crossover Rate 0.8Crossover Function Two point crossoverMutation Function Adaptive FeasibleInitial Population Template All 1s

As stated previously, in order to consolidate the individual classifier predictions,we explored simple majority voting, subset selection with majority voting andweighted voting approaches, the results of which are reported in Table 4.7. Asshown by the results, each of these scheme lead to an improvement on the highestindividual performance by λ(k), however, GA optimized weighted voting schemeleads to the highest improvement in performance.

29

It should be noted that for the afore mentioned schemes that describe the variousmethods for the voting scheme, and the termination criteria along with respectivevalues of φ are the same as those mentioned initially in this section.

4.4.3 Classification as concatenated features

The reader may have deciphered some shortcomings of the previous schemes,such as the selection of termination value and its criteria are mostly dependenton researcher’s biases, and these schemes do not cater for inter-feature variabledependencies. Therefore, a scheme was devised where the features were concate-nated before SVM classifier was used to build a model for prediction.In a nutshell, the optimal value of the termination value φ is determined em-pirically and the various feature are concatenated to form an aggregated featureλ, which is processed by a SVM classifier for protein fold recognition. A flowdiagram of this scheme is illustrated in Figure 4.3. For purposes of clarity, thisscheme shall be referred to as Scheme 2.

Figure 4.3: A flowchart depicting classification using concatenated dimer features

30

Algorithm Details

The proposed scheme can be summarized in an algorithm as shown in Algorithm2. The algorithm highlights the feature extraction technique, the various opera-tions used to modify the feature vector extracted and the method for determiningthe optimal value for the termination value φ for the maximum degree of spatialseparation when computing amino acid dimers. The equations and symbols usedin the algorithm are described in the latter sections. It should be noted that thetraining set was used to determine φ. The metric used to evaluate the perfor-mance of various feature vectors was the n-fold cross-validation accuracy with n= 10. SVM was used for classification with the kernel function as radial basisfunction and the C -parameter being 1000.

Step 1 Let k := 1, λ ← { }Step 2 λm,n(k) = ∑L−k

i=1 Pi,mPi+k,n (4.1)Step 3 λ ← {λ, λk}Step 4 Ak := classify(λ)

where classify(λ) returns the cross-validation ac-curacy for feature vectors, λ

Step 5 IF Ak < Ak−1 THENGOTO Step 6

ELSELet k := k + 1

ENDIFStep 6 φ := k

Algorithm 2: Algorithm representation of the training phase of the proposedscheme

The concatenated features, represented by λ, signify a feature vector that com-prises of the probabilistic expressions of amino acid dimers that have spatial sep-arations from k=1,2,. . . , φ. This concatenation has been summarized as shownin Algorithm 2 or, for simplicity, re-written as Equation 4.4.

λ = [λ1 λ2 . . . λφ] (4.4)

This concatenation provides information to explore dependencies between vari-ables of different dimer probabilities with varying degree of spatial separation.Previously, the probabilities for a particular degree of spatial separation wereconsidered individually, which does not allow the classifier to build models based

31

on information from other degrees of spatial separation. In this scheme, due tothe concatenation of individual probabilities into one aggregated feature, allowsthe SVM classifier to explore dependencies in the information from dimers withdifferent degrees of spatial separation.

Termination Criteria φ

During experimentation, the value of φ was empirically determined by succes-sively incrementing φ and observing the classifier performance on the trainingset of the various datasets. This process was continued until a gradual declinein classification accuracies was observed. Upon performing such analysis, it wasdetermined that the optimal values of φ for the DD, EDD and TG datasets areφ = 7, φ = 8 and φ = 6 respectively. The results noted during the analysis areshown in Table 4.6.

Table 4.6: Performance of concatenated amino acid dimers during training withthe terminating value for spatial separation from φ = 1, 2, . . . , 10

φ DD EDD TG1 66.0 82.1 63.82 66.5 85.6 67.53 66.7 85.9 68.24 66.9 86.3 68.45 67.0 86.4 68.66 66.9 86.2 68.77 67.6 86.3 68.28 65.9 86.6 67.89 65.8 85.9 67.510 65.4 85.3 67.4

It can be observed from the results in Table 4.6 that there is a steady increase inclassification accuracies as the value of φ increases from φ = 1 to a certain valuewhere the classification accuracy reaches a peak for the particular dataset. Uponreaching this peak, any further increase in φ leads to a gradual decline in theclassification accuracy. Therefore, the identification of the optimal value of φ wassimplified greatly due to this trend observed during experimentation. It shouldbe noted that all evaluations up till this stage has been performed by using thetraining set of the various datasets and these results were analysed to determine

32

the various parameters. Upon finalizing optimal parameters, the model has beenprimed for evaluation.

4.4.4 Analysis of Explored Schemes

The schemes described in the previous sections were evaluated using the threePFR datasets. The results of the analysis have been tabulated in Table 4.7, and itcan be clearly seen that Scheme 2 outperforms the voting schemes consistently forall the datasets explored in this study. Additionally, the performance of Scheme 2does not fluctuate much relative to voting schemes. The reason for the poor per-formance,relatively, of the voting schemes (Schemes 1*) may be attributed to thefact that they do not consider dependencies between features, which are processedby different instances of SVM. Even though voting schemes improve performancemarginally, they do not perform as well as Scheme 2. Since Scheme 2 employsfeature concatenation, its improved performance is likely related to maintainingdependencies between dimer probabilities extracted for different degrees of spatialseparation.

Table 4.7: Summary of model performance during training for the various schemesexplored using 10 fold cross validation

Dataset Method Accuracy (%)

DD

Scheme 1A - All φ features, equal weights 66.7Scheme 1B - Top-n features, equal weights 67.5Scheme 1C - All φ features, GA optimized weights 68.8Scheme 2 - All φ features, concatenated 67.6

EDD


TG


33

4.5 Results

The experimentation was performed on the benchmarked datasets to evaluatethe performance of the classification scheme described previously. In this study,comparison has been conducted using n-fold cross-validation, which has beenused by many investigators in literature, with SVM as the prediction engine. Forstatistical stability, n-fold cross-validation was repeated 100 times with randomsub-sampling.The proposed scheme was compared against various other schemes that use struc-tural and evolutionary based features for fold recognition. These techniques in-cluded PF1 and PF2 [21], PF [59], Occurrence (O) [44], AAC and AAC+HXPZV[3], which compute feature sets from the original protein sequences. In addition,ACC [18], Bi-gram [5], Tri-gram [45] and Alignment (DTW) [46] are also includedsince they compute features directly from the evolutionary information presentin PSSM. Moreover, features have been computed from the consensus sequencesfor PF1, PF2, O, AAC and AAC+HXPZV to obtain additional feature sets forcomparison. In the tables that highlight the performance of these various tech-niques, a prefix of PSSM+ indicates that the features have been computed on theconsensus sequence. These techniques have been evaluated using the DD, EDDand TG datasets using n-fold cross-validation for n = 5, 6, . . . , 10.It can be seen in Table 4.8 that Scheme 1A and Scheme 1B only show slightimprovements on the DD dataset. Scheme 1C shows good performance on theDD dataset, however, Schemes 1A, 1B and 1C perform poorly on both TG andEDD datasets as shown in Tables 4.9 and 4.10 respectively. This can partiallybe due to the fact that DD dataset is by far the smallest of three datasets thathave been used for classification. TG and EDD datasets are much larger in termsof the number of protein samples. Another possible explanation for such resultsis that Schemes 1A, 1B and 1C are basically ensemble techniques and there isno feedback or information sharing between the various classifier to keep trackof inter-feature dependencies. Since DD dataset was a smaller dataset, GA wasable to optimize the weights for individual classifier votes such that the accuracyimproved significantly although it was unable to do the same for larger datasets.In contrast, Scheme 2, which has concatenated dimer expressions from variousdegrees of spatial separation into one consolidated feature vector, has astonish-ingly good performance on all three datasets. It reports the highest accuracieson DD and TG datasets, showing improvements for all folds tested although itis only marginally behind the Alignment method for EDD dataset. This can beattributed to the fact that EDD is a much larger dataset compared to TG datasetand it does has less stringent on the sequence similarity requirements than TGdataset (40% sequence similarity cutoff for EDD dataset compared to 25% for TGdataset), therefore, Alignment method is able to perform much better on EDDdataset since the disparity between proteins belonging to the same fold is not asgreat as in TG dataset.

34

Table 4.8: Performance (in % accuracy) of various feature sets on the DD datasetusing n-fold cross validation procedure

Feature sets n=5 n=6 n=7 n=8 n=9 n=10PF1 48.6 49.1 49.5 50.1 50.5 50.6PF2 46.3 47.0 47.5 47.7 47.9 48.2PF 51.2 52.2 52.6 52.9 53.4 53.4O 49.7 50.4 50.8 50.8 51.1 51.0AAC 43.6 43.9 44.2 44.8 44.6 45.1AAC+HXPZV 45.1 46.2 46.5 46.8 46.9 47.2ACC 65.7 66.6 66.8 67.5 67.7 68.0PSSM+PF1 62.5 63.2 63.7 64.2 64.5 64.6PSSM+PF2 62.7 63.3 64.1 64.2 64.6 64.7PSSM+PF 65.5 66.2 66.5 66.9 67.1 67.5PSSM+O 62.5 62.1 62.5 62.9 63.4 63.5PSSM+AAC 57.5 58.1 58.4 58.7 59.1 59.2PSSM+AAC+HXPZV 55.9 56.9 57.1 57.7 58.0 58.2Bi-gram 72.6 73.1 73.7 73.7 74.1 74.1Tri-gram 72.1 72.6 73.0 73.2 73.7 73.8Alignment (DTW) 72.6 73.5 73.8 74.2 74.7 74.7Scheme 1A 72.7 73.3 74.1 74.0 74.2 74.6Scheme 1B 65.7 66.7 67.0 67.2 67.5 67.5Scheme 1C 73.5 75.4 74.5 76.2 76.7 75.7Scheme 2 74.5 75.4 75.9 76.4 76.5 76.7

35

Table 4.9: Performance (in % accuracy) of various feature sets on the TG datasetusing n-fold cross validation procedure


36

Table 4.10: Performance (in % accuracy) of various feature sets on the EDDdataset using n-fold cross validation procedure


37

In general, it can be seen that the proposed feature extraction technique describedin this research has significant performance increases in some scenarios. Thesefeatures meet the benchmarks set by works published only just recently in liter-ature [5, 45, 46]. Overall, it is clear that sequential evolution based features areperforming much better compared to features that work only on structural infor-mation. ACC, Bi-gram, Tri-gram and Alignment outperform most of the otherfeatures that have been based on structural information. Additionally, it can beseen that those features that extract information from structural sources performbetter on the consensus sequence rather than the raw amino acid sequence, whichcan be due to the generalization of the resulting sequence as a consequence ofcomputing consensus from PSSM.

38

Chapter 5

Improving protein structuralclass prediction using featureextraction techniques

Protein structural class prediction is as important task in identifying pro-tein tertiary structure and protein functions. In this chapter, the applicabil-ity of probabilistic expressions of spatially separated amino acid dimers inpredicting secondary protein structures is evaluated. This technique modelsrelationships between spatially separated amino acid dimers with informa-tion derived from PSSM. The technique has shown promising results whenevaluated on the benchmarked datasets.

In Chapter 4 an application of spatially varied amino acid dimers was discussedin the field of PRF. In this chapter, that technique has been applied in the fieldof SCP and the results are discussed.SCP is similar to PFR since SCP can be seen as an abstraction of PRF due tothe hierarchical nature of the domains. Every fold can be mapped onto a parentstructural class and structural classes can be further sub-categorized into folds.Due to this relationship between SCP and PFR, many of the proposed methodsin literature usually perform proportionally similar in the domains. In order toverify this, the performance of amino acid dimers with different degrees of spatialseparation as a feature extraction method in SCP has been evaluated.

39

5.1 Materials and Methods

For the purposes of comparison, a subset of the techniques that have been used inPFR, are utilized since they provide a good benchmark for analysis. These tech-niques include AAC, AAC+HXPZV, PF1, PF2, Bi-gram and Tri-gram. A prefixof PSSM+ indicates that feature extraction has been conducted after computingconsensus on the raw sequence.The datasets have been re-labeled so that the samples correspond to their struc-tural classes rather than their folds. A summary of the datasets is provided inTable 5.1. The train and test samples from the DD dataset have been combinedto form a consolidated dataset since all analysis is done via n-fold cross validationparadigm.

Table 5.1: Summary of datasets for Structural Class Prediction

Classes SamplesDD EDD TG

α 115 556 252β 226 967 478

α/β 258 1311 589α + β 95 584 293

5.2 Classification Scheme

It can be clearly seen from the results displayed in Chapter 4 that the proposedScheme 2, which concatenates the probabilistic expressions of amino acid dimersof various degrees of spatial separation, performs significantly better than theother classification schemes explored. Therefore, it was decided to adhere to thisapproach for SCP as well.During the training phases, the datasets were divided into training and test setsusing the same principles as stated previously. During training, the terminationcriteria φ has to be determined for every dataset when using Scheme 2. φ de-termines the number of variables that will be concatenated to form the eventualfeature that will be used in classification.The results observed during the experiments are shown in Table 5.2. It can beobserved from the results in Table 5.2 that there is a steady increase in classifi-cation accuracies as the value of φ increases from φ = 1 to a certain value where

40

Table 5.2: Performance of concatenated amino acid dimers during training withthe terminating value for spatial separation from φ = 1, 2, . . . , 10 in SCP

φ DD EDD TG1 80.5 88.1 80.12 83.4 91.8 85.73 83.9 92.2 86.74 84.4 93.3 87.35 84.9 93.3 87.36 84.9 93.5 87.27 84.8 93.7 87.08 84.7 93.8 87.09 84.7 93.7 87.010 84.6 93.7 86.9

the classification accuracy reaches a peak for all the datasets. The optimal valuesof φ can be determined empirically from these results, which are φ = 6, φ = 8and φ = 5 for the DD, EDD and TG datasets respectively.

5.3 Results

After finalizing the parameters during the training phase, the model is ready tobe evaluated. Again, n-fold cross validation paradigm from n=5, 6,. . . , 10 wasemployed to evaluate the results. For statistical stability, the experiment wasrepeated 50 times and the means are reported.Tables 5.3, 5.4 and 5.5 report the results that were recorded during the experimentfor SCP. It can be seen from the results schemes that extract features solelyfrom the raw amino acid sequences in the proteins do not perform as well as theschemes that incorporate some sort of evolutionary information either in the formof computing consensus or directly extracting features from PSSM. Bi-gram andTri-gram features provide a good benchmark as they report high classificationaccuracies for all the datasets. However, it can be seen that the proposed schemeof using spatially varied amino acid dimers also performs exceedingly well for SCPdisplaying the highest recorded overall accuracies for all datasets in the variousn-folds.

41

Table 5.3: Comparison of classification accuracy of the proposed technique withother reported works on DD dataset in SCP

Method n=5 n=6 n=7 n=8 n=9 n=10AAC 67.8% 68.1% 68.1% 68.2% 68.2% 68.3%AAC+HXPZV 70.0% 70.3% 70.4% 70.6% 70.8% 71.0%PF1 66.7% 67.4% 67.3% 67.4% 67.7% 67.6%PF2 69.0% 69.3% 69.6% 69.6% 69.6% 69.6%PF 70.0% 70.7% 70.8% 71.0% 71.0% 71.0%PSSM+AAC 74.6% 74.7% 74.7% 74.7% 74.8% 74.8%PSSM+AAC+HXPZV 74.9% 75.1% 75.3% 75.7% 75.5% 75.9%PSSM+PF1 75.8% 76.2% 76.3% 76.3% 76.7% 76.7%PSSM+PF2 75.9% 76.6% 76.8% 77.1% 77.2% 77.4%PSSM+PF 80.0% 80.2% 80.6% 80.7% 80.7% 80.8%Bi-gram 83.4% 83.5% 83.5% 83.6% 83.8% 83.7%Tri-gram 87.0% 86.9% 87.1% 87.2% 87.2% 87.2%Scheme 2 86.7% 86.9% 87.1% 87.2% 87.4% 87.4%

42

Table 5.4: Comparison of classification accuracy of the proposed technique withother reported works on EDD dataset in SCP


43

Table 5.5: Comparison of classification accuracy of the proposed technique withother reported works on TG dataset in SCP


44

Chapter 6

Linear Interpolation SmoothingModel in Subcellular Localization

Protein subcellular localization is an important topic in proteomics sinceit is related to a proteins overall function, helps in the understanding ofmetabolic pathways, and in drug design and discovery. In this chapter, abasic approximation technique from natural language processing called thelinear interpolation smoothing model is applied for predicting protein sub-cellular localizations. It extracts features from syntactical information inprotein sequences to build probabilistic profiles using dependency models,which are used to determine how likely is a sequence to belong to a par-ticular subcellular location. This technique builds a statistical model basedon maximum likelihood. This approach has been evaluated by predictingsubcellular localizations of Gram positive and Gram negative bacterial pro-teins.

Subcellular localizations of proteins is very important research topic in molecularcell biology and proteomics since it is closely related to the protein’s functions,metabolic pathways, signal transduction and other biological processes within acell [57,60]. Knowledge of a protein’s subcellular localization also plays an impor-tant role in drug discovery, drug design and biomedical research. Determinationof subcellular localizations experimentally is laborious, time-consuming and, insome cases, experimental means to determine some subcellular localizations ofproteins is difficult using fluorescent microscopy imaging techniques [47].In recent years, there has been significant progress in subcellular localizationpredication using computational means. There are approaches that extract fea-

45

tures directly from the syntactical information present in protein sequences suchas amino acid composition (AAC) [55,61], N-terminus sequences [62] and pseudo-amino acid composition (PseAAC) [63]. Some approaches use the evolutionaryinformation present in Position Specific Scoring Matrices (PSSM) to extract fea-tures [58]. Features can also be generated from protein databases such as theannotations in Gene Ontology (GO), functional domain information, and/or tex-tual information from the keywords in Swiss-Prot [22, 49, 53, 54, 64]. Moreover,some researchers have utilized the information present in the physicochemicalproperties of amino acid residues to enhance prediction accuracy [55, 65]. How-ever, most prevalent techniques are a hybrid collection of various features to helpidentify discriminatory information for the classifiers to obtain an improved pre-diction accuracy [22,49,52–55,63].In proteomics, frequencies or probabilities of occurrence for amino acid subse-quences in proteins have been used to extensively model proteins. Some featuresthat can be considered as variants of such models include Amino Acid Com-position (AAC) [3], Pairwise Frequency (PF1) and Alternate Pairwise Frequency(PF2) [21], bigram [5], k-separated bigrams [28], and trigram [45]. Although suchmodels have been rigorously studied by researchers, they have mostly consideredthe probability distribution as an extracted feature for classification via meansof another classifier such as Bayesian Classifiers, Artificial Neural Networks andSupport Vector Machines [3, 5, 21,28,32,45,62].Such probability models are also prevalent in other fields of study such as naturallanguage processing (NLP), however, they have been deployed in a completelydifferent manner. Instead of considering these models as features for input intoother classifiers such as Support Vector Machines (SVM) or k-Nearest Neighbours(kNN), they are considered probabilistic dependency models that determine thelikelihood of a protein belonging to a subcellular location. In this research, thelinear interpolation smoothing model is proposed which extracts features usingsyntactical information from the protein sequences for predicting protein subcel-lular localizations. This approach is a basic approximation technique in NLP andits concepts have been applied in proteomics for this study. Linear interpolationbuilds probabilistic profiles for proteins based on the frequency information ofamino acid subsequences extracted from proteins to perform subcellular localiza-tion. These probabilistic profiles may be following the independent or dependentmodel based on the probabilities being extracted. In this paper, the applicationof linear interpolation in proteomics is investigated and its ability to predict sub-cellular localizations of Gram positive and Gram negative bacterial proteins isanalyzed.

46

6.1 Method

Linear interpolation is backoff model, indicating that it aggregates informationfrom different sub-models to determine the likelihood of a protein belonging toa particular class. It builds probabilistic profiles for proteins based on the fre-quency information of amino acid subsequences extracted from proteins to per-form subcellular localization. In this sense, linear interpolation is related to Hid-den Markov Models (HMMs) and uses the Markov assumptions to build proba-bilistic profiles of varying dependencies for proteins, which are later used in thistechnique to determine the probability of a protein for belonging to a particularsubcellular location.These probabilistic profiles are similar to amino acid subsequence models that areprevalent in literature, however, their application in linear interpolation is com-pletely different than those previously published. Additionally, there is an absenceof techniques, in literature, that aggregate information from various probabilisticmodels to form a consolidated prediction model. In this scheme, linear interpo-lation, an approach novel to proteomics, is used to consolidate information fromdependent and independent probability distributions to identify the maximumlikelihood of a query protein for belonging to a subcellular location.

6.1.1 Algorithm

Computationally, protein sequences and natural languages share many similari-ties. They both are ambiguous (similar structures can have different meanings),can be very large, are constantly changing, and they are constructed by a com-bination of underlying set of constructs, amino acids for protein sequences andwords for natural languages. Thus, there is a need to explore the applicability ofsome basic techniques that are prevalent in NLP for the field of proteomics.Linear interpolation builds upon probabilistic models of varying dependenciesfrom amino acid subsequences whereby it consolidates the information from thesemodels in an approach known as backoff. For clarity, a model of the probabil-ity distribution that is dependent on n previous amino acids in the sequence iscalled the nth probabilistic model (model n, for short). Additionally, it can benoted that Model n = 0 is the independent model, since it is not dependent onany other amino acid in the sequence. The probabilistic models studied in thisresearch can be defined as a Markov Chain of order n. With respect to Markovchains of order n, the probability of an amino acid ai depends only on the imme-diately n preceding amino acids and not on any other amino acids. In this study,probabilistic models of n = 0, 1, 2 are examined as is common with most linearinterpolation implementations in NLP.Mathematically, probability distributions for models n = 0, 1, 2 have been de-fined in Equations 6.1, 6.2 and 6.3 respectively. In the equations, the function

47

Count() represents a subroutine that computes the frequency of occurrence ofthe selected amino acid(s) and Count(*) represents the count of all amino acidspresent in the sequences. ai represents one of the twenty naturally occurringamino acids in protein samples (this, i = 1, 2, . . . , 20). P denotes the probabilityof occurrence of an amino acid subsequence for a particular location.

P (ai) = Count(ai)Count(∗) (6.1)

P (ai|ai−1) = Count(ai−1, ai)Count(ai−1)

(6.2)

P (ai|ai−2, ai−1) = Count(ai−2, ai−1, ai)Count(ai−2, ai−1)

(6.3)

Once the probability distributions have been defined, it is possible to base allpredictions from these models individually. In order to compute the probabilityfor a sequence of length N belonging to a particular location, P (a1:N), the fac-toring the chain rule and then the appropriate Markov assumptions are applied.This has been highlighted in Equations 6.4, 6.5 and 6.6 and they represent thenth probabilistic models with dependencies of n = 0, 1, 2 respectively.

P (a1:N) =N∏

i=1P (ai) (6.4)

P (a1:N) =N∏

i=2P (ai|ai−1) (6.5)

P (a1:N) =N∏

i=3P (ai|ai−2, ai−1) (6.6)

A major complication of these models is that the probabilities extracted from thetraining data only provide a rough estimate of the true probability distribution ofamino acid subsequences. There may be no representation for uncommon subse-quences in the training data, resulting in the probability of 0. This can negativelyaffect the classification since a zero probability will always yield the resultingoverall probability for that class as zero, which may lead to misclassification.Therefore, the model is adjusted so that subsequences with a frequency count ofzero are assigned a small non-zero probability and this process of adjusting theprobability of low-frequency counts is called smoothing. In this paper, smoothingis done by assigning a probability of 0.0001 to all zero probability subsequencesand the probabilities of all other subsequences were adjusted accordingly.

48

Linear interpolation builds upon these models and aggregates the informationpresent in these models. Subsequence occurrence counts are estimated, but forany sequence that has a low (or zero) count, the model backs off to n-1 de-pendency model. In this study, linear interpolation combines the probabilisticmodels with dependencies of n = 0, 1, 2 and defines the consolidated probabilityestimates as:

P̂ (ai|ai−2, ai−1) = λ1 × P (ai) + λ2 × P (ai|ai−1) + λ3 × P (ai|ai−2, ai−1)where λ1 + λ2 + λ3 = 1 (6.7)

It can be seen from Equation 6.7 that linear interpolation actually combinesprobability estimates by weighting the probabilities. λ can be seen as a tuningparameter which determines the overall performance of this model. Similarly, theprobability for a sequence of length N belonging to a particular location usinglinear interpolation can be defined as:

P (a1:N) =N∏

i=3P̂ (ai|ai−2, ai−1) (6.8)

From Equations 6.7 and 6.8, it can be seen the linear interpolation uses weightedprobabilities from the previously discussed probabilistic models and consolidatesthem to form a unified probabilistic expression. This has been extended usingMarkov’s chain rule to compute the probability of a sequence to belong to aparticular subcellular location. However, since the resultant probabilities andtheir products can be very small, the risk of losing precision in fixed point computeunits is high. Therefore, Equation 6.8 has been modified to compute the sum oflog2 of the interpolated probabilities as shown below. A similar approach can beapplied to the models described previously in Equations 6.4, 6.5 and 6.6 to avoidlosing precision on compute units.

P (a1:N) =N∑

i=3log2 P̂ (ai|ai−2, ai−1) (6.9)

Lastly, it can be seen that if the sequences vary largely in lengths, there arisesa need to somehow normalize the resultant probabilities. Normalization can beachieved by dividing the resultant probabilities of the target proteins by theirlengths as per this equation:

Pnorm(a1:N) = P (a1:N)N

(6.10)

49

6.1.2 Optimizing λ

In this scheme, determining the optimal values for λ is key to improving overallperformance of linear interpolation because the values for λ determine the weightswhich are used to aggregate the probabilities from the probabilistic models withdependencies of n = 0, 1, 2 in linear interpolation and, consequently, alter theprobabilities determined for a protein during prediction.Initially during the early stages of experimentation, equal values for the threeλ scalars were chosen for the models, however, this approach does not yield thebest results possible using linear interpolation. Moving onwards, the values for λcould be determined using researcher intuition and empirical analysis, however,this approach has its shortcomings such as it is quite slow and requires an ex-tensive manual search, which proceeds by incrementally increasing or decreasingλ until good results are observed. Additionally, this approach does not ensurethat optimal values for λ will be discovered. Therefore, a meta-heuristic searchand optimization algorithm, the Genetic Algorithm (GA), was chosen to applyoptimization techniques to determine the optimal values for λ heuristically. Al-though this approach has a higher computational cost, the benefits for a betterprediction model offsets its costs.To optimize λ, the chromosomes C, which are the templates for the solution, areevaluated by GA was encoded using real-value between the range 0 ≤ Ci ≤ 1,where Ci is the value for a particular gene in the chromosome. The chromosomeshad a length len(C) = 3 since λ consists of λi where i = 1, 2, 3. Furthermore,during experimentation it was observed that GA converged fairly quickly duringevolution, thus, small values for the generation limit and the population size wereneeded in order to prevent over training.In order to evaluate the fitness of every chromosome, the fitness function deter-mines the accuracy during prediction by linear interpolation using the values of λbeing processed. The objective of GA was minimization during this experimen-tation, thus, the fitness function has to return a lower value for the chromosomesthat provide better results. There are a number of metrics that can be used todetermine the final output of the fitness function. For instance, it is possible tocalculate the specificity and/or the sensitivity and return its reciprocal as thefitness value. Since sensitivity and specificity are of equal importance in classi-fication, the fitness function in this study returned the reciprocal of the meansof the sensitivity and specificity values. Additionally, the gene values, Ci, werenormalized to determine the values of λi prior to the calculation of these metrics.It should be noted that the metrics are calculated over the training samples onlyusing k-fold cross validation during training. The various parameters for GA usedduring training and other phases of experimentation are listed in Table 6.1.

50

Parameter ValueGA Objective MinimizationNumber of Generations 100Population Size 500Crossover Rate 0.8Crossover Function Two Point CrossoverMutation Function Adaptive FeasibleChromosome Length 3

Table 6.1: A list of parameters for the Genetic Algorithm

6.1.3 Overall

In a nutshell, the scheme proposed in this research has been illustrated in Figure6.1. Initially, protein sequences are processed to extract the probability profilesfor the dependency models of n = 0, 1 ,2. Then, linear interpolation is appliedto form a consolidated prediction model based on these probabilities. In orderto improve performance, the weights used for backoff, λ, are optimized using GAduring the training phase as depicted. Once training is completed, the optimizedλ values with linear interpolation is used for subcellular localization of targetproteins.

6.2 Results

The performance of the proposed technique has been primarily evaluated usingtwo metrics, sensitivity and specificity. Mathematically, sensitivity and specificityhave been described in Equations 6.11 and 6.12 respectively shown below. In theseequations, TP represents the number of samples predicted as positive that belongto the positive class, FP represents the number of samples predicted incorrectly aspositive that belong to the negative class, TN represents the number of samplesas negative that belong to the negative class, and TN represents the number ofsamples predicted incorrectly as negative that belong to the positive class.

Sensitivity = TP/(TP + FP ) (6.11)

Specificity = TN/(TN + FN) (6.12)

The performance of probabilistic models with varying dependencies and linearinterpolation has been compared and discussed. Although the main concern of

51

Figure 6.1: An illustration of the proposed scheme

this paper is linear interpolation, it was also deemed prudent to show the resultsachieved using the various models for purposes of comparison. Additionally, sinceλ significantly affects the performance of linear interpolation, a brief comparisonof linear interpolation with unoptimized and GA optimized λ values was alsonecessary.Authors in previous research have mainly used k-fold cross validation or jackknifetests to report their results. The majority of the experimentation was conductedusing the widely accepted k-fold cross validation paradigm for k = 5, 6,. . . , 10.However, for better comparability, jackknife tests were also conducted on linearinterpolation. In order to gain statistical stability, the k-fold cross validation wasrepeated 100 times using random sub-sampling.The results observed while performing k-fold cross validation have been summa-rized in Tables 6.2 and 6.3 for the Gram positive and Gram negative datasetrespectively. In the tables, linear interpolation has been abbreviated as LIP andmodel n specifies the probabilistic models with dependency n. Since linear in-terpolation and its underlying models build the protein profiles by computingprobabilities of amino acid subsequence occurrences, it is possible to evaluatethe impact of adding evolutionary information to the prediction model by themeans of computing consensus and using the consensus sequence to perform thecomputations rather than the raw sequences. For the tables displaying results, aprefix of PSSM+ indicates that the model builds protein profiles after computing

52

consensus rather than directly using the raw amino acid sequence.Probabilistic models for dependencies n = 0, 1, 2 individually, linear interpola-tion with equal values for λ (λi = 1

3) and liner interpolation with optimized λvalues using GA have been compared. Upon referring to the results for the Grampositive dataset displayed in Table 6.3, it can be noted that linear interpolationwith optimized values for λ outperformed other schemes in cross validation forall values of k.Since sensitivity and specificity are equally important in any classification task,high values for both metrics indicates a well balanced prediction model. The othermodels, especially linear interpolation with equal weights, display high values for ametric, however, they have lower performance for the other metric. Additionally,an observation that can be noted is that the techniques which have evolutionaryinformation added via consensus perform slightly better than those without. Inthe case of linear interpolation, optimizing values of λ instead of using equalvalues for λ leads to a significant difference in the overall performance.The results observed for the Gram negative dataset are also similar and linearinterpolation with optimized λ performs better than other models. The othermodels display results that are much closer to those noted for linear interpolation.Similarly, computing features on consensus sequences has slight improvementsalthough they are not as significant as they were in Gram positive dataset.For a more detailed analysis of the results, the results obtained for each subcellu-lar location in the datasets are reported in Tables 6.6 and 6.7 using 10-fold crossvalidation. These results highlight the previously stated facts that adding evolu-tionary information via consensus improves performance for the models at everysubcellular location in both the datasets. The other models also exhibit goodperformance, however, linear interpolation with optimized λ is clearly dominant.The distribution of sensitivity and specificity values are also quite balanced indi-cating that the results are not skewed in either direction due to a disproportionatedistribution of these metrics.For better comparability, the results obtained by performing jackknife tests usinglinear interpolation with optimized values for λ have been reported in Tables6.4 and 6.5. Jackknife test was not performed on the other models due to itscomputational costs and also since the main focus of this study is to highlightthe applicability of linear interpolation and not its encompassing models. Itshould be noted that these results are computed after computing consensus onthe raw sequences. From the results, it can be seen that the performance of linearinterpolation is quite steady and the results obtained from k-fold cross validationand jackknife test are similar. There is high accuracy shown for both the datasetsand results for all the locations in the datasets are relatively balanced.

53

Scheme k = 5 k = 6 k = 7 k = 8 k = 9 k = 10

Model n = 0Sensitivity 70.8 71.7 71.4 71.8 71.7 71.3Specificity 83.6 83.8 83.7 83.8 83.6 84.0



LIP Sensitivity 73.2 73.8 73.0 73.4 73.7 73.7(Equal λ) Specificity 83.5 83.6 83.6 83.6 83.4 83.7LIP Sensitivity 73.4 73.4 73.9 74.1 74.7 74.9(Optimized λ) Specificity 83.3 83.3 82.7 82.9 82.8 82.7

PSSM+Model n = 0Sensitivity 75.4 75.4 75.6 75.5 75.4 75.5Specificity 83.8 83.9 83.8 84.0 84.0 83.9



PSSM+LIP Sensitivity 79.4 80.1 79.9 80.2 80.0 80.4(Equal λ) Specificity 84.6 84.5 84.3 84.5 84.5 84.3PSSM+LIP Sensitivity 80.2 80.3 80.3 80.4 80.5 80.7(Optimized λ) Specificity 84.9 84.9 84.8 84.8 84.8 84.9

Table 6.2: A summary for the performance of the various models for predictionstudied in this paper using the Gram positive bacterial dataset using k-fold crossvalidation for k = 5, 6, . . . , 10

54

Scheme k = 5 k = 6 k = 7 k = 8 k = 9 k = 10




LIP Sensitivity 84.1 84.5 83.6 84.2 83.6 84.8(Equal λ) Specificity 79.4 79.3 79.1 79.1 79.1 78.9LIP Sensitivity 82.5 80.9 82.1 82.0 81.5 83.0(Optimized λ) Specificity 82.1 82.1 82.7 82.3 82.4 81.3




PSSM+LIP Sensitivity 82.5 82.5 82.7 83.5 83.5 83.5(Equal λ) Specificity 86.2 86.1 86.0 85.9 85.9 85.9PSSM+LIP Sensitivity 84.8 84.8 84.9 85.3 85.5 85.9(Optimized λ) Specificity 86.1 85.9 85.8 85.7 85.7 85.7

Table 6.3: A summary for the performance of the various models for predictionstudied in this paper using the Gram negative bacterial dataset using k-fold crossvalidation for k = 5, 6, . . . , 10

55

Subcellular Location AccuracyCell membrane 114/174 = 65.5%Cell wall 16/18 = 88.9%Cytoplasm 194/208 = 93.3%Extracellular 95/123 = 77.2%Overall 419/523 = 80.1%

Table 6.4: Results from the jackknife test performed on Gram positive bacterialprotein dataset

Subcellular Location AccuracyCell inner membrane 474/557 = 85.1%Cell outer membrane 84/124 = 67.7%Cytoplasm 375/410 = 91.5%Extracellular 95/133 = 71.4%Fimbrium 30/32 = 93.8%Flagellum 12/12 = 100.0%Nucleoid 7/8 = 87.5%Periplasm 137/180 = 76.1%Overall 1214/1456 = 83.4%

Table 6.5: Results from the jackknife test performed on Gram negative bacterialprotein dataset

56

Sche

me

Loca

tion

sC

ellm

embr

ane

Cel

lwal

lC

ytop

lasm

Extr

acel

lula

rO

vera

ll

Mod

eln

=0

Sens

itivi

ty59

.969

.480

.275

.671

.3Sp

ecifi

city

87.4

85.2

75.9

87.5

84.0

Mod

eln

=1

Sens

itivi

ty57

.277

.890

.073

.674

.6Sp

ecifi

city

94.0

74.2

69.5

87.9

81.4

Mod

eln

=2

Sens

itivi

ty68

.875

.083

.970

.574

.6Sp

ecifi

city

88.3

76.1

80.4

84.9

82.4

LIP

Sens

itivi

ty61

.173

.687

.073

.273

.7(E

qual

λ)

Spec

ifici

ty91

.779

.076

.487

.583

.7LI

PSe

nsiti

vity

67.1

77.8

83.5

71.1

74.9

(Opt

imiz

edλ

)Sp

ecifi

city

92.1

76.6

74.9

87.3

82.7

PSSM

+M

odel

n=

0Se

nsiti

vity

68.0

71.4

88.2

74.5

75.5

Spec

ifici

ty82

.287

.776

.888

.983

.9

PSSM

+M

odel

n=

1Se

nsiti

vity

62.8

88.1

92.3

77.8

80.2

Spec

ifici

ty91

.081

.679

.387

.884

.9

PSSM

+M

odel

n=

2Se

nsiti

vity

71.4

84.2

92.0

69.6

79.3

Spec

ifici

ty82

.283

.979

.588

.283

.4PS

SM+

LIP

Sens

itivi

ty67

.987

.292

.174

.380

.4(E

qual

λ)

Spec

ifici

ty85

.883

.879

.288

.684

.3PS

SM+

LIP

Sens

itivi

ty67

.287

.391

.976

.380

.7(O

ptim

ized

λ)

Spec

ifici

ty87

.983

.479

.888

.384

.9

Tabl

e6.

6:A

deta

iled

com

paris

onof

the

vario

usm

odel

sst

udie

dus

ing

10-fo

ldcr

oss

valid

atio

non

Gra

mpo

sitiv

eba

cter

ial

data

set

Sche

me

Loca

tion

sC

elli

nner

Cel

lout

erC

ytop

lasm

Extr

acel

lula

rFi

mbr

ium

Flag

ellu

mN

ucle

oid

Perip

lasm

Ove

rall

mem

bran

em

embr

ane

Mod

eln

=0

Sens

itivi

ty76

.370

.890

.276

.581

.393

.775

.077

.880

.2Sp

ecifi

city

91.3

85.8

74.1

81.9

87.4

93.8

88.8

71.4

84.3

Mod

eln

=1

Sens

itivi

ty73

.487

.391

.887

.087

.510

0.0

75.0

80.3

85.3

Spec

ifici

ty96

.869

.771

.673

.982

.572

.783

.669

.477

.5

Mod

eln

=2

Sens

itivi

ty79

.484

.387

.289

.896

.910

0.0

81.3

70.6

86.2

Spec

ifici

ty91

.371

.676

.269

.263

.254

.168

.580

.671

.8LI

PSe

nsiti

vity

76.3

81.3

90.3

86.8

90.6

100.

078

.175

.084

.8(E

qual

λ)

Spec

ifici

ty94

.576

.074

.973

.579

.677

.679

.575

.978

.9LI

PSe

nsiti

vity

76.5

75.6

90.1

81.6

85.9

100.

078

.176

.583

.0(O

ptim

ized

λ)

Spec

ifici

ty92

.781

.174

.777

.782

.684

.084

.173

.481

.3

PSSM

+M

odel

n=

0Se

nsiti

vity

79.7

64.5

90.4

71.9

81.3

100.

065

.078

.979

.0Sp

ecifi

city

88.5

90.4

74.7

87.3

91.8

97.2

90.5

71.0

86.4

PSSM

+M

odel

n=

1Se

nsiti

vity

77.6

81.2

91.5

80.6

88.8

100.

085

.075

.685

.0Sp

ecifi

city

93.6

79.2

76.9

84.9

89.8

90.0

81.1

77.0

84.1

PSSM

+M

odel

n=

2Se

nsiti

vity

85.0

67.6

91.1

71.6

95.0

100.

081

.375

.383

.3Sp

ecifi

city

85.2

86.9

80.1

86.8

86.9

89.5

80.1

80.1

84.5

PSSM

+LI

PSe

nsiti

vity

81.7

68.7

91.5

74.5

90.6

100.

083

.177

.483

.5(E

qual

λ)

Spec

ifici

ty89

.986

.778

.586

.889

.493

.584

.677

.985

.9PS

SM+

LIP

Sens

itivi

ty81

.881

.091

.377

.390

.510

0.0

83.6

76.9

85.3

(Opt

imiz

edλ

)Sp

ecifi

city

89.4

86.9

78.3

86.7

89.5

93.1

84.7

77.2

85.7

Tabl

e6.

7:A

deta

iled

com

paris

onof

the

vario

usm

odel

sfo

rpr

edic

tion

stud

ied

usin

g10

-fold

cros

sva

lidat

ion

usin

gth

eG

ram

nega

tive

bact

eria

ldat

aset

6.3 Discussion

Our proposed technique builds probabilistic models on primary protein sequences.It utilizes only syntactical information of proteins and therefore we gauge the per-formance of our method with the methods which are mainly based on structuralinformation. This would give a relative measure of performance of the proposedtechnique given the same level of information used.The results obtained in this study perform on par or better than most of the re-cently proposed techniques in literature [9, 57, 66]. Since the proposed techniqueis learning technique that only utilizes syntactical and evolutionary information,we can only compare this strategy with similar work. There are some techniquesthat have been proposed recently in literature, however, these techniques incor-porate functional domains and gene ontology information [14, 67]. It takes timefor newly extracted proteins to annotated and recorded in such databases, there-fore, it may not be possible to use such techniques for predicting the subcellularlocalization of these proteins. The proposed technique builds probabilistic mod-els on the primary structure only, therefore, does not rely on such annotationinformation.Although linear interpolation displays good results, there is scope for furtherimprovements using this technique. This paper is aimed at introducing the pos-sibility of applying a basic natural language processing technique in the field ofproteomics. There are numerous possibilities that can be explored to improve theperformance of linear interpolation, which have been highlighted in the Recom-mendations section.There is a major distinction between the learning technique explored in this studyand several of the other classification techniques published previously in litera-ture. Linear interpolation can be categorized as a maximum likelihood techniqueand it predicts the class of a sample based on the computed probabilities. In itsessence, it determines to class label by simply selecting the class with the highestcomputed probability for the query sample. This allows linear interpolation tobe quite robust and modular, and it relatively easier to extend this techniquewithout severally increasing the computational costs. For instance, in this study,probabilistic models with dependencies of up to n = 2 have been discussed, how-ever, the technique can be easily modified to include higher dependency modelsto profile proteins.Although there is additional computation involved in computing the frequenciesof the amino acid subsequences, the prediction process itself does not experienceany drastic increases in computational costs since its simply the maximum ofthe cumulative sums of the probabilities of the various dependency models. Thisalso effectively deals with the problems associated with high dimensionality usingtraditional classifiers like SVM or kNN where high dimensionality of the featureset may reduce the performance of the classifier and exponentially increase the

59

computational costs during classification. Linear interpolation is able to deal withhigh dimensionality problems since its underlying models determine probabilitiesof sequences by computing the sum of probabilities of the individual amino acidsubsequences, therefore, effectively reducing high dimensional data to a singlescalar representing the probability. For instance, dependency model n = 0 has20 unique probabilities per class, n = 1 has 400 unique probabilities per classand n = 2 has 8000 unique probabilities per class, however, when computing thelikelihood of a query sample belonging to a particular subcellular location, everymodel computes a sum that represents the overall probability of that samplebelonging to that particular location.

6.4 Recommendations

It may be possible to further improve the results if optimization of λ is done usingsome other optimization technique. Currently, GA allows for global optimization,however, it is difficult to fine-tune or find the local optima in the global searchspace using GA. However, if a local optimization algorithm with GA is used, suchas simulated annealing or even artificial neural networks, the performance of theproposed technique may improve.Additionally, since the dimensionality of this approach is independent of the depthof the underlying dependency models, it is possible to explore the effects theincreasing the value of dependencies n. The computational cost of increasingn can be offset by an increase in the classifier performance, if any is noted.Lastly, the discussed technique builds the probabilistic profiles using amino acidoccurrence frequencies. However, these probabilistic profiles can be built usingthe information present in sequential evolution, and this hybrid approach can beexplored to investigate its impact on protein subcellular localization prediction.

60

Chapter 7

Summary

This chapter concludes the research and summarizes the various tasksaccomplished in this study.

In this research, various models have been studied and their relevance in PFR,SCP and SCL investigated. For the purposes of benchmarking, standardizeddatasets, which are popularly used by researchers for benchmarking are used toprovide a baseline to compare the performance of the proposed techniques. Ithas been shown that these models perform on par or better than those proposedrecently in literature.Initially, protein profiles were created using evolutionary information present inPSSM. These probabilistic profiles of proteins were built upon amino acid dimersprobabilities with varying degree of spatial separation. A key factor in obtaininggood results is computing amino acid dimer probabilities directly from PSSMinstead of the primary sequence (raw or with consensus) for the reasons explainedin Chapter 4.Promising results were obtained with this technique and its variants in the fieldsof PFR and SCL. The proposed techniques gave promising results for DD, EDDand TG datasets, which were amongst the highest noted while comparing withother published techniques as shown in Chapter 4 and Chapter 5 for PFR andSCL respectively.Furthermore, a major component of this thesis revolved around SCL. SCL isinherently different from PFR and SCP since any given protein can belong tomore than one subcellular location whereas PFR and SCP have a one-to-onemapping between proteins and folds or structural classes respectively.Most research in SCL has been conducted by utilizing information present in var-ious data sources such as functional domains, gene ontology, textual informationin protein databases, etc. It takes a lot of time for newly discovered proteins to

61

be annotated and added to such databases, therefore, these techniques may notbe viable for such proteins. However, in this study, a basic approximation modelis built by using only syntactical and evolutionary information is proposed. Thisapproximation model has been derived from natural language processing and itis known as linear interpolation modeling.Linear interpolation uses the Markov chain rule to build dependent probabilitymodels for amino acid subsequence occurrence have been used to determine thelikelihood of protein subcellular locations. As explained in Chapter 6,these depen-dency models have been combined in linear interpolation for dependency valuesof n = 0, 1, 2.Experiments have been performed to predict subcellular localizations of Grampositive and Gram negative bacterial proteins using linear interpolation eithervia k-fold cross validation and jackknife tests. These results highlight exceptionalperformance which are on-par or better than the current results in literature.Additionally, there is tremendous scope to extend the usage of linear interpo-lation in SCL. Although this has been explained in detail in Chapter 6, somepossible extensions include using sequential evolution probabilities directly in thedependency models and increasing the number of dependent models.

62

Bibliography

[1] N Leigh Anderson and Norman G Anderson. Proteome and proteomics: newtechnologies, new concepts, and new words. Electrophoresis, 19(11):1853–1861, 1998.

[2] Walter P Blackstock and Malcolm P Weir. Proteomics: quantitative andphysical mapping of cellular proteins. Trends in biotechnology, 17(3):121–127, 1999.

[3] C H Q Ding and Inna Dubchak. Multi-class protein fold recognition usingsupport vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.

[4] WiesŁaw Chmielnicki. A hybrid discriminative/generative approach to pro-tein fold recognition. Neurocomputing, 75(1):194–198, 2012.

[5] Alok Sharma, James Lyons, Abdollah Dehzangi, and Kuldip K Paliwal. Afeature extraction technique using bi-gram probabilities of position specificscoring matrix for protein fold recognition. Journal of theoretical biology,320:41–46, 2013.

[6] Tao Yang, Vojislav Kecman, Longbing Cao, Chengqi Zhang, and JoshuaZhexue Huang. Margin-based ensemble classifier for protein fold recognition.Expert Systems with Applications, 38(10):12348–12355, 2011.

[7] Abdollah Dehzangi, Kuldip Paliwal, Alok Sharma, Omid Dehzangi, and Ab-dul Sattar. A Combination of Feature Extraction Methods with an Ensem-ble of Different Classifiers for Protein Structural Class Prediction Problem.IEEE transactions on computational biology and bioinformatics, 2013.

[8] Kuo-Chen Chou and Hong-Bin Shen. Recent progress in protein subcellularlocation prediction. Analytical biochemistry, 370(1):1–16, 2007.

[9] Kuo-Chen Chou and Hong-Bin Shen. Cell-PLoc: a package of Web serversfor predicting subcellular localization of proteins in various organisms. Na-ture protocols, 3(2):153–162, 2008.

63

[10] Inna Dubchak, Ilya Muchnik, Christopher Mayor, Igor Dralyuk, andSung Hou Kim. Recognition of a protein fold in the context of the SCOPclassification. Proteins: Structure, Function, and Bioinformatics, 35(4):401–407, 1999.

[11] Hong-Bin Shen and Kuo-Chen Chou. Ensemble classifier for protein foldpattern recognition. Bioinformatics, 22(14):1717–1722, 2006.

[12] Robert E Langlois, Alice Diec, Ognjen Perisic, and Yang Dai. Improvedprotein fold assignment using support vector machines. International journalof bioinformatics research and applications, 1(3):319–334, 2005.

[13] Kuo-Chen Chou. Prediction of protein cellular attributes using pseudoamino acid composition. Proteins: Structure, Function, and Bioinformatics,43(3):246–255, 2001.

[14] Kuo-Chen Chou and Hong-Bin Shen. Cell-PLoc 2.0: An improved packageof web-servers for predicting subcellular localization of proteins in variousorganisms. Natural Science, 109:1091, 2010.

[15] Kevin Karplus, Christian Barrett, and Richard Hughey. Hidden Markovmodels for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998.

[16] David T Jones. GenTHREADER: an efficient and reliable protein foldrecognition method for genomic sequences. Journal of molecular biology,287(4):797–815, 1999.

[17] Burkhard Rost and Chris Sander. Prediction of protein secondary structureat better than 70% accuracy. Journal of molecular biology, 232(2):584–599,1993.

[18] Qiwen Dong, Shuigeng Zhou, and Jihong Guan. A new taxonomy-based pro-tein fold recognition approach based on autocross-covariance transformation.Bioinformatics, 25(20):2655–2662, 2009.

[19] Inna Dubchak, Ilya Muchnik, Stephen R Holbrook, and Sung-Hou Kim.Prediction of protein folding class using global description of amino acidsequence. Proceedings of the National Academy of Sciences, 92(19):8700–8704, 1995.

[20] Lukasz Kurgan, Tuo Zhang, Hua Zhang, Shiyi Shen, and Jishou Ruan. Sec-ondary structure-based assignment of the protein structural classes. Aminoacids, 35(3):551–564, 2008.

[21] Pradip Ghanty and Nikhil R Pal. Prediction of protein folds: Extraction ofnew features, dimensionality reduction, and fusion of heterogeneous classi-fiers. NanoBioscience, IEEE Transactions on, 8(1):100–110, 2009.

64

[22] Kuo-Chen Chou and Hong-Bin Shen. Plant-mPLoc: a top-down strategyto augment the power for predicting plant protein subcellular localization.PloS one, 5(6):e11335, 2010.

[23] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, JinghuiZhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLASTand PSI-BLAST: a new generation of protein database search programs.Nucleic acids research, 25(17):3389–3402, 1997.

[24] Shuichi Kawashima, Piotr Pokarowski, Maria Pokarowska, Andrzej Kolinski,Toshiaki Katayama, and Minoru Kanehisa. AAindex: amino acid indexdatabase, progress report 2008. Nucleic acids research, 36(suppl 1):D202–D205, 2008.

[25] Petr Klein. Prediction of protein structural class by discriminant analysis.Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular En-zymology, 874(2):205–215, 1986.

[26] Yong-Sheng Ding and Tong-Liang Zhang. Using ChouâĂŹs pseudo aminoacid composition to predict subcellular localization of apoptosis proteins: anapproach with immune genetic algorithm-based ensemble classifier. PatternRecognition Letters, 29(13):1887–1892, 2008.

[27] Arunkumar Chinnasamy, Wing-Kin Sung, and Ankush Mittal. Protein struc-ture and fold prediction using tree-augmented naive bayesian classifier. Jour-nal of Bioinformatics and Computational Biology, 3(04):803–819, 2005.

[28] Harsh Saini, Gaurav Raicar, Alok Sharma, Sunil Lal, Abdollah De-hzangi, Rajeshkannan Ananthanarayanan, James Lyons, Neela Biswas, andKuldip K Paliwal. Protein Structural Class Prediction via k-separated bi-grams using Position Specific Scoring Matrix. Journal of Advanced Compu-tational Intelligence and Intelligent Informatics, 18(4):474–479, 2014.

[29] Ashish Anand, Ganesan Pugalenthi, and P N Suganthan. Predicting pro-tein structural class by SVM with class-wise optimized features and decisionprobabilities. Journal of theoretical biology, 253(2):375–380, 2008.

[30] Yu-Dong Cai, Xiao-Jun Liu, Xue-biao Xu, and Kuo-Chen Chou. Predic-tion of protein structural classes by support vector machines. Computers &chemistry, 26(3):293–296, 2002.

[31] Alok Sharma, Abdollah Dehzangi, James Lyons, Seiya Imoto, SatoruMiyano, Kenta Nakai, and Ashwini Patil. Evaluation of Sequence Featuresfrom Intrinsically Disordered Regions for the Estimation of Protein Function.PloS one, 9(2):e89890, 2014.

65

[32] Alok Sharma, Kuldip K Paliwal, Abdollah Dehzangi, James Lyons, SeiyaImoto, and Satoru Miyano. A strategy to select suitable physicochemicalattributes of amino acids for protein fold recognition. BMC bioinformatics,14(1):233, 2013.

[33] Yu-Dong Cai and Guo-Ping Zhou. Prediction of protein structural classesby neural network. Biochimie, 82(8):783–785, 2000.

[34] Samad Jahandideh, Parviz Abdolmaleki, Mina Jahandideh, and SayyedHamed Sadat Hayatshahi. Novel hybrid method for the evaluation of pa-rameters contributing in determination of protein structural classes. Journalof theoretical biology, 244(2):275–281, 2007.

[35] Samad Jahandideh, Parviz Abdolmaleki, Mina Jahandideh, andEbrahim Barzegari Asadabadi. Novel two-stage hybrid neural discriminantmodel for predicting proteins structural classes. Biophysical chemistry,128(1):87–93, 2007.

[36] Abdollah Dehzangi and Abdul Sattar. Ensemble of diversely trained supportvector machines for protein fold recognition. In Intelligent Information andDatabase Systems, pages 335–344. Springer, 2013.

[37] Abdollah Dehzangi and Sasan Karamizadeh. Solving protein fold predictionproblem using fusion of heterogeneous classifiers. International InformationInstitute, 2011.

[38] Hong-Bin Shen and Kuo-Chen Chou. Predicting protein fold pattern withfunctional domain and sequential evolution information. Journal of Theo-retical Biology, 256(3):441–446, 2009.

[39] Inna Dubchak, Ilya Muchnik, and Kim. SK. Protein folding class predictorfor SCOP: approach based on global descriptors. Proceedings, 5th Interna-tional Conference on Intelligent Systems for Molecular Biology. Kalkidiki,Greece., pages 104–107, 1997.

[40] Abdollah Dehzangi and Somnuk Phon Amnuaisuk. Fold prediction problem:the application of new physical and physicochemical-based features. ProteinPept Letter, (18):174–185, 2011.

[41] Hua Zhang, Tuo Zhang, Jianzhao Gao, Jishou Ruan, Shiyi Shen, and LukaszKurgan. Determination of protein folding kinetic types using sequenceand predicted secondary structure and solvent accessibility. Amino acids,42(1):271–283, 2012.

[42] Rafael Najmanovich, Josef Kuttner, Vladimir Sobolev, and Marvin Edelman.Side chain flexibility in proteins upon ligand binding. Proteins: Structure,Function, and Bioinformatics, 39(3):261–268, 2000.

66

[43] Ji Tao Huang and Jing Tian. Amino acid sequence predicts folding rate formiddle size two state proteins. Proteins: Structure, Function, and Bioinfor-matics, 63(3):551–554, 2006.

[44] Y H Taguchi and M Michael Gromiha. Application of amino acid occur-rence for discriminating different folding types of globular proteins. BMCbioinformatics, 8(1):404, 2007.

[45] Kuldip K Paliwal, Alok Sharma, James Lyons, and Abdollah Dehzangi. ATri-Gram Based Feature Extraction Technique Using Linear Probabilitiesof Position Specific Scoring Matrix for Protein Fold Recognition. NanoBio-science, IEEE Transactions on, 13(1):44–50, 2014.

[46] James Lyons, Neela Biswas, Alok Sharma, Abdollah Dehzangi, and Kuldip KPaliwal. Protein fold recognition by alignment of amino acid residues usingkernelized dynamic time warping. Journal of theoretical biology, 354:137–145,2014.

[47] Suyu Mei, Wang Fei, and Shuigeng Zhou. Gene ontology based transferlearning for protein subcellular localization. BMC bioinformatics, 12(1):44,2011.

[48] Suyu Mei. Predicting plant protein subcellular multi-localization by Chou’sPseAAC formulation based multi-label homolog knowledge transfer learning.Journal of theoretical biology, 310:80–87, 2012.

[49] Kuo-Chen Chou, Zhi-Cheng Wu, and Xuan Xiao. iLoc-Hum: using theaccumulation-label scale to predict subcellular locations of human proteinswith both single and multiple sites. Molecular Biosystems, 8(2):629–641,2012.

[50] Y Q Zhang, T Li, C Y Yang, D Li, Y Cui, Y Jiang, L Q Zhang, Y P Zhu,and F C He. Prelocabc: a novel predictor of protein sub-cellular localizationusing a bayesian classifier. J Proteomics Bioinform, 4(1), 2011.

[51] Sebastian Briesemeister, Jörg Rahnenführer, and Oliver Kohlbacher. YLoc- an interpretable web server for predicting subcellular localization. Nucleicacids research, 38(suppl 2):W497–W502, 2010.

[52] Sebastian Briesemeister, Jörg Rahnenführer, and Oliver Kohlbacher. Goingfrom where to why - interpretable prediction of protein subcellular localiza-tion. Bioinformatics, 26(9):1232–1238, 2010.

[53] Hong-Bin Shen and Kuo-Chen Chou. Virus-mPLoc: a fusion classifier forviral protein subcellular location prediction by incorporating multiple sites.Journal of Biomolecular Structure and Dynamics, 28(2):175–186, 2010.

67

[54] Hong-Bin Shen and Kuo-Chen Chou. Gneg-mPLoc: a top-down strategy toenhance the quality of predicting subcellular localization of Gram-negativebacterial proteins. Journal of Theoretical Biology, 264(2):326–333, 2010.

[55] E Tantoso and Kuo-Bin Li. AAIndexLoc: predicting subcellular localizationof proteins based on a new representation of sequences using amino acidindices. Amino Acids, 35(2):345–353, 2008.

[56] Yongwook Yoon and Gary Geunbae Lee. Subcellular localization predictionthrough boosting association rules. Computational Biology and Bioinformat-ics, IEEE/ACM Transactions on, 9(2):609–618, 2012.

[57] Chao Huang and Jingqi Yuan. Using radial basis function on the general formof Chou’s pseudo amino acid composition and PSSM to predict subcellularlocations of proteins with both single and multiple sites. Biosystems, 2013.

[58] Xuan Xiao, Zhi-Cheng Wu, and Kuo-Chen Chou. A multi-label classifier forpredicting the subcellular localization of gram-negative bacterial proteinswith both single and multiple sites. PLoS One, 6(6):e20592, 2011.

[59] Jian Yi Yang and Xin Chen. Improving taxonomy based protein fold recog-nition by using global and local features. Proteins: Structure, Function, andBioinformatics, 79(7):2053–2064, 2011.

[60] Shengnan Tang, Tonghua Li, Peisheng Cong, Wenwei Xiong, Zhiheng Wang,and Jiangming Sun. PlantLoc: an accurate web server for predicting plantprotein subcellular localization by substantiality motif. Nucleic acids re-search, 2013.

[61] Tanwir Habib, Chaoyang Zhang, Jack Y Yang, Mary Qu Yang, and YoupingDeng. Supervised learning method for the prediction of subcellular localiza-tion of proteins using amino acid and amino acid pair composition. BMCgenomics, 9(Suppl 1):S16, 2008.

[62] Annette Höglund, Pierre Dönnes, Torsten Blum, Hans-Werner Adolph, andOliver Kohlbacher. MultiLoc: prediction of protein subcellular localizationusing N-terminal targeting sequences, sequence motifs and amino acid com-position. Bioinformatics, 22(10):1158–1165, 2006.

[63] Kuo-Chen Chou. Some remarks on protein attribute prediction and pseudoamino acid composition. Journal of theoretical biology, 273(1):236–247, 2011.

[64] Xiaomei Li, Xindong Wu, and Gongqing Wu. Robust feature generationfor protein subchloroplast location prediction with a weighted GO transfermodel. Journal of theoretical biology, 347:84–94, 2014.

68

[65] Pufeng Du and Yanda Li. Prediction of protein submitochondria locationsby hybridizing pseudo-amino acid composition with various physicochemicalfeatures of segmented sequence. BMC bioinformatics, 7(1):518, 2006.

[66] Eakasit Pacharawongsakda and Thanaruk Theeramunkong. Predict Sub-cellular Locations of Singleplex and Multiplex Proteins by Semi-SupervisedLearning and Dimension-Reducing General Mode of Chou’s PseAAC. IEEETransactions on Nanobioscience, 2013.

[67] Xuan Xiao, Zhi-Cheng Wu, and Kuo-Chen Chou. iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virusproteins with both single and multiple sites. Journal of Theoretical Biology,284(1):42–51, 2011.

69

EXPLORING TECHNIQUES FOR OPTIMAL - USP Thesesdigilib.library.usp.ac.fj/.../HASHac42.dir/doc.pdf ·...

Documents

Transcript of EXPLORING TECHNIQUES FOR OPTIMAL - USP Thesesdigilib.library.usp.ac.fj/.../HASHac42.dir/doc.pdf ·...