Structural Alphabets for Protein Structure Classification: A Comparison Study

Quan Le 1, Gianluca Pollastri 1 and Patrice Koehl 2,*

1 Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
2 Department of Computer Science and Genome Center, University of California, Davis, Davis, CA 95616, USA

Received 21 July 2008; received in revised form 16 December 2008; accepted 17 December 2008. Available online 25 December 2008.

Finding structural similarities between proteins often helps reveal shared functionality, which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures, however, remains a hard problem. An alternative approach is to approximate each three-dimensional protein structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences or structural sequences. In this article, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments having four residues. We show that the global structural sequences approximate well the native structures of proteins, with an average coordinate root mean square of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins.

We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary-structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Without surprise, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary-structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary-structure sequence but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.

© 2009 Elsevier Ltd. All rights reserved.

Edited by M. Levitt

Keywords: protein structure; structural alphabet; structure classification; protein sequence comparison; sequence feature space

*Corresponding author. E-mail addresses: [email protected]; [email protected]; [email protected].

Abbreviations used: 3D, three-dimensional; NS, native sequence; SSES, secondary-structure element sequence; LFS, local fragment sequence; GFS, global fragment sequence; SVM, support vector machine; HMM, hidden Markov model; cRMS, coordinate root mean square; MDS, multidimensional scaling; AIS, average intercluster separation; ROC, receiver operating characteristic; TP, true positive; FP, false positive; CPU, central processing unit.

doi:10.1016/j.jmb.2008.12.044    J. Mol. Biol. (2009) 387, 431–450


Introduction

There is a clear understanding in biology that all cellular functions are deeply connected to the shape of their molecular actors. This is especially true for proteins, whose functions are directly related to their three-dimensional (3D) structures.1–4 In the hope of deciphering the rules that define the relationships between structure and functions, large-scale experimental projects are performed to provide maps of the genetic information of different


organisms, including the human genome5,6 (mostly in the form of genetic sequences of proteins), to derive as much structural information as possible on the products of these genes, and to relate these structures to the function of the corresponding proteins. These are the different "-omics" projects, genomics, structural genomics,7 and functional genomics,8 to name a few. While these studies are generating a wealth of information, stored in databases, the key to their success lies in our ability to organize and analyze this information, that is, in our ability to classify proteins, based on their sequences, structures, and/or functions, and to connect these classifications (for reviews, see Refs. [9,10]). In this article, we focus on the effort of organizing protein structures.

It is currently easier to detect that two proteins

share similar functions based on their structures rather than on their sequences. This was observed as early as 1960, when Perutz et al. showed that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.11 These two proteins are functionally similar, as they are involved in the storage and the transport of oxygen, respectively. Since then, there has been a continued interest in finding structural similarities between proteins, in the hope of revealing shared functionality that could not be detected by sequence information alone. The result of this interest is the development of systems for classification of protein structures that identify and group proteins sharing the same structure so as to reveal evolutionary relationships. Currently, there are three main protein structure classification schemes: SCOP,12 CATH,13 and DALI.14

Central to any classification scheme is the concept of similarity and its quantification. A measure of similarity is required to generate the initial classification of the data, as well as to identify the class to which any new data would belong. Defining a similarity measure for protein structures is a difficult problem, leading to discrepancies between the different classification schemes. Protein structure similarity is most often detected and quantified by a protein structure alignment program. Although an approximate optimal solution of the structural alignment problem exists,15 it is computationally too expensive to be of practical interest. All methods available to date are heuristic and, consequently, at best, suboptimal. Many evaluations of structural alignment methods are available.16–19 These studies usually conclude that an optimal solution to this problem that is fast and reliable and therefore appropriate for classification still needs to be defined. As a consequence, there is a significant interest in developing alternative approaches to protein structure alignment for measuring similarities (for a recent review, see Ref. [10]).

Finding the (sub)optimal alignment between two

protein structures is a hard problem as the rotation and translation of one of the two structures with respect to the other must be found in addition to the alignment itself. Conversely, finding the optimal

alignment between two protein sequences is a much easier problem, as it can always be solved using dynamic programming, so long as a satisfactory substitution model is available. If it were possible to translate faithfully a protein structure into a string of letters, protein structure comparison would therefore become much easier. This idea of representing structures as a string of letters is in fact grounded in the observation that recurrent, regular structural motifs exist at all levels of organization of protein structures. This was first observed by Corey and Pauling20,21 and later refined into the concept of protein secondary structures. Although the latter can be predicted with high accuracy (>80%), the description of a protein in terms of its secondary structures is not sufficient to capture accurately its 3D geometry. To overcome this limitation, several studies have focused on defining libraries of fragment representatives from which complete protein structures can be modeled with adequate accuracy.22–29 In these approaches, protein structures are represented as a series of overlapping fragments, each labeled with a symbol, defining a structural alphabet for proteins.
The fragments are chosen so as to provide either the best local fits to segments of the protein structure or the best global fit to the entire protein structure, resulting in two types of structural sequences, namely, local or global.27 Current applications of these structural alphabets include protein structure modeling and in particular decoy generation,30 local structure prediction,24,31–34 the reconstruction of a full-atom representation of the protein from the knowledge of the positions of its Cα only,35,36 identification of structural motifs,37,38 analysis of protein–protein interactions,39 and protein structure comparison and protein structure database search.29,40–46 We are interested in an extension of the latter: the application of structural alphabets to the problem of protein structure classification.

In this article, we focus on the information content

of sequences of proteins, in the context of structure classification (fold classification). More specifically, we compare the performance of five different classifiers, each tested with four different sequence representations of proteins: the native (amino acid) sequences (NSs), the secondary-structure element sequence (SSES), and two fragment-based structural sequences, namely, a local fragment sequence (LFS) and a global fragment sequence (GFS) derived from a library of 20 fragments having four residues. We show that LFS, GFS, and SSES always outperform the NSs and that GFS and LFS perform statistically significantly better than the SSES when adopted in combination with kernel-based, support vector machine (SVM)-based, and hidden Markov model (HMM)-based classifiers.

Results

With the number of known protein structures in the Protein Data Bank47 growing exponentially, the need for reliable, automatic structure comparison


and structure classification tools has never been greater. Here, we test an alternative approach to standard structure comparison in which we use a 1D representation of a protein, or "sequence", to compare and classify proteins. We use four different types of 1D representation: the NS of amino acids of the protein, the string describing the secondary-structure types of each residue (SSES), and two fragment-based structural sequences, local (LFS) and global (GFS). Figure 1 shows an example of each type of these "sequences" for a small toy protein containing one strand and one helix. We explore two approaches for comparing and classifying sequences. In the first approach, sequences

Fig. 1. Different 1D sequence representations of a toy protein structure. The NS is simply the primary sequence, or list of the amino acids forming the protein, from Nter to Cter. The SSES defines the secondary-structure element for each residue in the protein, according to STRIDE.48 Three types of SSE are considered: helices (H), strands or extended conformations (E), and coils (C). The fragment sequence (FS) is the sequence of fragments found in the best-fit reconstruction model for the protein (shown here as a stick model, with different colors for each type of fragment). The fragments are drawn from a library of 20 fragments with four residues. They are uniquely identified with a letter from A to T. Two types of fragment sequences are considered: the local sequence (LFS), which ensures the best local fit of the fragment, and the global sequence (GFS), which ensures the best global fit of the structural model to the complete protein structure. Note that by construction, the fragment-based sequences are shorter than the two other sequences (the first four residues being defined by the first fragment). Models were drawn using PyMOL (http://www.pymol.org).

are mapped and compared in a feature space, using a distance measure. In the second approach, sequences are used directly to train either HMMs or SVMs that are subsequently used for classification purposes.

The CATH2225 set of proteins is used as the

protein test set for our procedure to generate structural sequences. It includes 2225 proteins grouped into 619 CATH fold classes and was designed such that the sequences of any pair of proteins in the set have statistically no similarity (FASTA49 E-value > 10⁻⁴).19 We use five of the most populated folds in CATH2225 as a test set for our classifier, covering the three main classes of CATH: one fully α fold (arc repressor mutant, 1.10.10), one fully β fold (immunoglobulin fold, 2.60.40), and three α–β folds (TIM fold, 3.20.20, an α–β plait, 3.30.70, and the Rossmann fold, 3.40.50). The five folds include a total of 605 proteins from CATH2225; we refer to this set as CATH605. Figure 2 shows examples of protein structures for each of the five folds.

A 20-letter structural alphabet

In our implementation, we use a library of 20 fragments, each with four residues. This library was constructed using the approach described by Kolodny et al.27 The 20 fragments, labeled [A–T], are shown in Fig. 3. As expected, these fragments cover all types of local structures for proteins: helices (fragments A, B, C, D, and I), strands (F, K, N, R, and S), and turns (J). All the other fragments have mixed conformations and correspond to entrance/exit of helices and strands, as well as loop regions. We have built a tree to represent the similarities between the fragments using the two programs Fitch and Drawtree of the Phylip package.51 Fitch implements the Fitch and Margoliash52 method for constructing trees from a distance matrix under the "additive tree model", according to which the distance between two objects (fragments) is expected to equal the sum of branch lengths between the objects on the tree. The output of Drawtree is shown in Fig. 3; this tree clearly shows the mapping of the fragments to the types of local structures of proteins.

Fragment-based local structural sequences of proteins

The local structural sequence of a protein is the sequence of fragments having four residues that best match overlapping windows of four residues that scan the protein structure, where match refers to a low coordinate root mean square (cRMS) distance. Averaging the cRMS of the best fragment over all windows covering the protein gives the local-fit score for that protein. To test the performance of our fragment library, we computed this score for all 2225 proteins of our CATH2225 domain data set. Results are shown in Fig. 4. The average local fit over CATH2225 is 0.22 Å. Kolodny et al. studied the quality of local-fit approximation using libraries of different sizes with fragments of different lengths,

Fig. 2. Representatives of the five fold classes in our test set. The arc repressor mutant, subunit A fold (CATH 1.10.10) is a common orthogonal helix bundle, found, for example, in the DNA-binding domain of the human telomeric protein HTRF1 (CATH code 1ba500). The immunoglobulin-like fold is a β sandwich, found in many immunoglobulin-like proteins, such as the rat CD4 protein (CATH code 1cid02). The TIM barrel is a very common α–β fold, shown in narbonin, a plant seed protein (CATH code 1nar00). The α–β plait fold is a two-layer sandwich, shown here in MERP, a mercury binding protein (CATH code 2hqi00). The Rossmann fold is a very common three-layer sandwich fold in the mixed α–β class, found, for example, in the glycyl-tRNA synthetase from Thermus thermophilus (CATH code 1atiB2). All images were generated using Molscript.50


using a much smaller test set of proteins (145 proteins).27 They did not include in their study a library equivalent to the one used here. They derived a relationship between the expected cRMS and the complexity of the library. Their predicted cRMS for a library of 20 fragments with four residues is 0.25 Å, in full agreement with the value observed here. We also calculated the average cRMS deviation for each of the 20 fragments of the library. Figure 5 shows that the helical fragments fit better locally than all other fragments. This makes intuitive sense: helices in proteins are more regular than β-strands, which show greater geometric diversities.1
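The local-fit procedure described above can be sketched in code. This is an illustrative sketch, not the implementation used in this work: it assumes Cα coordinates supplied as an (L, 3) NumPy array, a library given as a list of 20 four-residue coordinate arrays, and our own function names (kabsch_crms, local_fragment_sequence); superposition is the standard Kabsch algorithm.

```python
import numpy as np

def kabsch_crms(P, Q):
    """cRMS between two (n, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)  # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt  # proper rotation (no reflection)
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

def local_fragment_sequence(coords, library):
    """LFS: for every 4-residue window, record the letter of the library
    fragment with the lowest cRMS; also return the local-fit score
    (mean best cRMS over all windows)."""
    letters = "ABCDEFGHIJKLMNOPQRST"
    lfs, best = [], []
    for i in range(len(coords) - 3):
        scores = [kabsch_crms(frag, coords[i:i + 4]) for frag in library]
        k = int(np.argmin(scores))
        lfs.append(letters[k])
        best.append(scores[k])
    return "".join(lfs), float(np.mean(best))
```

By construction the LFS is three letters shorter than the protein, and averaging the best cRMS over all windows yields the local-fit score (0.22 Å on average over CATH2225).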

Fragment-based global structural sequences of proteins

To derive the global structural sequence, we implemented a soft greedy algorithm in which fragments are concatenated without degrees of freedom (similar to stacking lego pieces), such that the cRMS deviation of the reconstructed structure compared to the native structure is minimal (see Methods for more details). To test the performance of both our fragment library and our global-fit algorithm, we reconstructed the 2225 proteins of our CATH2225 domain data set (see Methods). Results are shown in Fig. 6. The average cRMS deviation of the global-fit

Fig. 3. Our library of 20 fragments with four residues. Images are generated using PyMOL. The fragments have been organized around a tree based on similarity (cRMS), computed with the program Phylip.51


approximations over the whole data set is 0.691 Å. Kolodny et al. predicted that the average cRMS of global-fit reconstruction for a library of 20 fragments with four residues is 0.70 Å,27 in full agreement with the value observed here. The best approximation (0.19 Å) is obtained for the domain 2bbvF0, a small helical fragment of 13 residues, while the worst approximation (1.07 Å) is obtained for the domain 1qu5A0, a β-barrel domain of 179 residues that contains many long loops as well as a long unstructured region.
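The bookkeeping of the soft greedy algorithm can be summarized by the beam-search skeleton below. This is a sketch under our own naming: the caller-supplied extend(state, letter) function stands in for the geometric step of rigidly appending a fragment and returning the incremental cost (in the setting above, driven by the cRMS of the partial model to the native structure), and n_keep plays the role of the heap size Nkeep=4000.

```python
import heapq

def soft_greedy(n_steps, extend, n_keep=4000):
    """Beam-search skeleton of a 'soft greedy' reconstruction.
    At each step, every kept partial solution is extended by each of the
    20 fragment letters, scored, and only the n_keep best candidates are
    retained.  extend(state, letter) -> (new_state, step_cost)."""
    letters = "ABCDEFGHIJKLMNOPQRST"
    beam = [(0.0, "", None)]  # (cumulative cost, fragment string, caller state)
    for _ in range(n_steps):
        candidates = []
        for cost, seq, state in beam:
            for letter in letters:
                new_state, step_cost = extend(state, letter)
                candidates.append((cost + step_cost, seq + letter, new_state))
        beam = heapq.nsmallest(n_keep, candidates, key=lambda c: c[0])
    return min(beam, key=lambda c: c[0])  # best (cost, GFS string, state)
```

Keeping n_keep > 1 candidates is what makes the greedy search "soft": a locally suboptimal fragment can survive if it leads to a better global fit downstream.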

We observe a strong effect of the secondary-structure content on the accuracy of the global-fit approximation. Figure 7 shows the global-fit approximations of two proteins (2a0b00 and 1hce00) of the same size (118 amino acids) but with different types of secondary-structure elements. 2a0b00 is an all α-protein while 1hce00 is an all β-protein; their best approximations achieve an overall cRMS of 0.38 and 0.99 Å, respectively. Figure 8 shows plots of the accuracies of the global-fit approximations for

Fig. 4. Testing the accuracy of local structural sequences. Histogram of the mean cRMS between all four-consecutive-residue fragments of a protein and their best local-fit fragments in the library, for all proteins in CATH2225. The average cRMS over all fragments is 0.22 Å.

Fig. 6. Testing the accuracy of global structural sequences. Histogram of global-fit cRMS deviations between the model built from the global structural sequence and the native structure for all proteins in CATH2225. Structural sequences were built from a library of 20 fragments having four residues, using a soft greedy approach with a heap size Nkeep=4000 (see the text for details). The average cRMS is 0.69 Å.


all proteins in CATH2225 versus the helix and strand contents of the proteins. The accuracy of the approximation improves (cRMS decreases) as helix content increases, with a correlation of 0.8. Conversely, the accuracy of the approximation declines as strand content increases. Both results are consistent with the finding that fragments that best fit helical regions in proteins provide better matches on average than extended fragments (see Fig. 5).

Feature spaces for protein sequences

Each protein sequence is embedded in a space representing the substrings of length p (p-mer) it contains. For p=1, the size of the space is the size N of the alphabet on which the sequence is drawn (20 for NS, GFS, and LFS, and 3 for SSES). A

protein sequence is represented in this space as an N-dimensional feature vector containing the frequency of each letter of the alphabet in the sequence. A fold in the same space is represented by the mean of the feature vectors of all proteins it contains. Figure 9 shows the feature vectors in the 1-mer space for all five folds included in CATH605 based on the global structural sequences (GFS) of their representatives to illustrate this process. As expected, the fold 1.10.10 (α-proteins) contains predominantly fragments A, B, C, and I that have been identified as helical fragments (see Fig. 3), while the fold 2.60.40 (β-proteins) contains predominantly fragments F, K, N, O, R, and S that

Fig. 5. Best local fits for all 20 structural fragments. Mean quality of the best local fit for each of the 20 fragments with four residues. The fragments have been ordered based on similarity (see Fig. 3), with helical-like fragments on the left and extended fragments on the right.

Fig. 7. Examples of protein structure reconstruction. The global-fit algorithm was used to reconstruct 2a0b00, a fully α-protein, and 1hce00, a fully β-protein, using a library of 20 fragments having four residues. The native structure is shown in orange, and the fragment-based structure is shown in cyan.


correspond to extended regions in proteins. Note that fragment H is highly represented in these two different folds. Fragment H is neither helical nor extended; it is expected to be found in loops and therefore will appear in the global structural sequences of many types of proteins. The three mixed α–β folds contain a mixture of these two sets of fragments. The differential fragment usage based on the secondary-structure content of the protein was already observed by Friedberg et al.44

The 2-mer space has size N²: it contains all possible substrings having two residues that can be formed with the letters of the alphabet for the sequence. The 2-mer space measures the usage of each of these letters, as well as captures their local correlations; as such, it is expected that it provides a better mapping of the sequence space than the 1-mer space, as the conformations of consecutive residues along the backbone of proteins are correlated to one another. In Fig. 10, we show 2D representations of the 1-mer and 2-mer feature spaces for all four types of sequences (NS, SSES, GFS, and LFS). The 2D representation is created by using metric multidimensional scaling (MDS).53 We evaluate the clustering of the five folds in the 2D

MDS representation using the average intercluster separation (AIS), defined such that a good cluster configuration corresponds to a large AIS value (see Methods). There is very little separation between the five folds in the 1-mer space based on the NS (AIS=0.26): this is not surprising as the sequences in CATH605 have very little similarity by design. The corresponding 1-mer spaces based on the structural sequences (SSES, GFS, and LFS) show better separation of the five folds (AIS=0.58, 0.5, and 0.54, respectively). The all-α (black points) and all-β (red points) folds are well separated; for all three alphabets (SSE, GF, and LF), however, the three mixed α–β folds overlap significantly. Mapping the NSs on the 2-mer space improves the representation of the folds; the mixed α–β fold 3.40.50 (green points), however, still overlaps significantly with the α and β folds. In the 2-mer space based on the secondary-structure elements, the two mixed α–β folds 3.20.20 and 3.30.70 (blue points and magenta points, respectively) are nearly indistinguishable. The 2-mer spaces based on the local and global structural sequences show better separation of the five folds with AIS values of 0.70 and 0.69, respectively, compared to 0.63 for the 2-

Fig. 8. Accuracy of structural sequences versus secondary-structure content. The global-fit accuracy of the model reconstructed from the structural sequence of a protein is plotted against the α-helix content (left) and the β-strand content (right) of the protein.

Fig. 9. Distribution of fragment usage in the global structural sequences for five distinct folds.


mer space based on secondary structure and 0.47 for the 2-mer space based on NSs. The same results were observed for higher-dimensional representations of the feature spaces (results not shown).
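The p-mer embedding and the L1-norm comparison used in this section can be sketched as follows (function names are ours, an illustrative sketch): a sequence over an alphabet of size N is mapped to an N^p-dimensional vector of p-mer frequencies, and two sequences (or a sequence and a fold centroid) are compared with the L1 distance.

```python
from itertools import product

def pmer_vector(seq, alphabet, p=2):
    """Frequency vector of all length-p substrings (p-mers) of seq;
    the feature space has len(alphabet)**p dimensions."""
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=p))}
    v = [0.0] * len(index)
    n = len(seq) - p + 1
    for i in range(n):
        v[index[seq[i:i + p]]] += 1.0 / n
    return v

def l1_distance(u, v):
    """L1-norm distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

A fold is represented by the mean of the feature vectors of its member proteins, so the same l1_distance serves for both pairwise and protein-to-fold comparisons.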

Structure similarity versus sequence similarity

Two proteins with highly similar sequences almost always share the same fold. The reverse, however, is not always true: Rost has shown that pairs of proteins with similar structures possess, on average, only 8–10% sequence identity:54 this observation is one of the main reasons that it is difficult to classify protein structures based on sequence only. The global structural sequence (GFS) of a protein is designed to capture the characteristics of its structure: a measure of structural sequence similarity should therefore correlate well with a measure of structural similarity. To test this hypothesis, we selected 10,000 diverse pairs of proteins in CATH2225, from highly similar to divergent in

structures. For each pair, we computed three different similarities between the two proteins: the L1-norm distances based on the 2-mer feature space for the NS and global structural sequence (GFS) and the structural similarities computed using the structural alignment program STRUCTAL.55 Figure 11 shows how these measures compare. NS similarity does not correlate well with structure similarity (correlation coefficient: 0.5); the global structural sequences, however, correlate much better with the structure similarity, with a correlation coefficient of 0.70.
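The correlation coefficients quoted above are standard Pearson correlations between the paired similarity measures; for concreteness, a minimal sketch (our own helper, not the analysis code used here):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```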

ROC analysis of protein homology detection

The 2D representations of the different sequence feature spaces considered above all have in common that similar (i.e., homologous) proteins are mapped to localized regions. To further quantify this observation, we evaluate the performance of similarity measures based on these feature spaces using


receiver operating characteristic (ROC) analysis.56 The five folds in CATH605 serve as the standard: a pair of structures is defined as similar, or "positive", if they belong to the same fold, and "negative" otherwise. All pairs of proteins in CATH605 are then compared using a similarity measure. For varying thresholds of the measure, all pairs below the threshold are assumed positive, and all above it are negative. The pairs that agree with the standard are called true positives (TPs), while those that do not are false positives (FPs). ROC analysis compares the rate of TPs as a function of the rate of FPs; it is scored with the area below the corresponding curve. An ROC score of 1 indicates that all TPs are detected first: this corresponds to the ideal measure. On the other hand, an ROC score of 0.5 corresponds to the first diagonal: TPs and FPs appear at the same rate, and the measure is not discriminative.

We evaluate the performance of the L1-norm

distance on the 1-mer and 2-mer feature spaces of all three sequence representations (native, SSE, and fragment-based) and compare it with the performance of FASTA49 and STRUCTAL.55 Results are shown in Fig. 12 and Table 1.

FASTA implements a fast Smith and Waterman

sequence comparison; the similarity is given either as a raw score or as an E-value; we use the latter as a similarity measure. The ROC curve for the FASTA measure is marginally above the first diagonal, with a score (area) of 0.57: this is expected, as, by construction, all protein pairs in CATH605 have little or no sequence similarity.

STRUCTAL searches for an optimal alignment of

two protein structures using a trial-and-error approach in which an initial alignment is assumed and subsequently refined using dynamic programming. The best alignment is the one with a maximal STRUCTAL score that accounts for the Euclidean distance between the superposed structures and the number of gaps in the alignment.55 The ROC curve shows that STRUCTAL performs well, with a score of 0.91. This is in agreement with a recent study that compares different structural alignment programs.19
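The ROC score (area under the curve) for a distance-based measure can be computed directly from the ranked pairs, without sweeping thresholds explicitly. A minimal sketch (our naming), assuming smaller distances mean greater similarity:

```python
def roc_score(distances, labels):
    """Area under the ROC curve.  distances[i] is the measure for pair i
    (smaller = more similar); labels[i] is True for a 'positive' pair
    (same fold).  A score of 1 means all positives rank before all
    negatives; 0.5 corresponds to a non-discriminative measure."""
    pairs = sorted(zip(distances, labels), key=lambda p: p[0])
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    tp, area = 0, 0.0
    for _, is_pos in pairs:
        if is_pos:
            tp += 1
        else:
            area += tp  # each false positive adds a column of height TP
    return area / (n_pos * n_neg)
```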

The similarity measures based on the L1-normdistance in the feature spaces corresponding to thefour sequence representations perform at levelsintermediate to FASTA and STRUCTAL. The NSsequence space is closer to FASTA, as it is based onthe NS of amino acid. The structural sequence spacesSSES, GFS, and LFS perform similarly, with ROCscores of 0.85, 0.82, and 0.82, respectively. Noneperform aswell as STRUCTAL: this shows that eitherthe structural sequence or its representation infeature space does not capture all structural featuresof the proteins.The L1-norm is only one option for computing the

distance between two sequences in their featurespace: another approach is to use a kernel to esti-mate inner product, from which a distance can bederived (see, e.g., Ref. [57] and Methods). We com-pared the L1-norm ROC scores obtained with akernel-based distance (see Table 1) and found verylittle differences.
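The threshold sweep described above can be sketched in a few lines. This is our own illustration, not code from the study; the scores and labels are invented, and for a distance measure (rather than a similarity) the sign of the score would simply be flipped before ranking.

```python
# Sketch of the ROC analysis described above (toy data): pairs are
# ranked by decreasing similarity, and the ROC score is the area under
# the curve of TP rate versus FP rate.

def roc_score(scores, labels):
    """Area under the ROC curve; labels mark truly similar ('positive') pairs."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp          # TPs already seen when this FP appears
    return area / (n_pos * n_neg)

# A perfect measure ranks all positives first (score 1.0);
# a non-discriminative one hovers around 0.5.
print(roc_score([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```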

Detecting fold membership

Distance-based classification

The ROC analysis detects protein similarity. We extended the analysis of our sequence similarity measures to the problem of detecting fold membership by performing a set of computational fold classification experiments. In each experiment, we randomly divide the sets of proteins for all five folds that form CATH605 into two groups of approximately equal size: the first groups serve as training sets to define the folds, while the second groups serve as test sets. The test proteins are classified into one of the five folds and the results are stored in a confusion matrix [element (i,j) of this matrix shows how many proteins belonging to fold i are classified as belonging to fold j]. The accuracy of the classifier is then defined as the ratio of the trace of the confusion matrix over the sum of all its elements (i.e., the percentage of correctly classified proteins). The procedure is repeated 1000 times to remove possible bias from the initial separation of the protein set into test and training sets. We performed these experiments for the four types of sequences, for two different feature spaces (1-mer and 2-mer), and for two distance measures in these feature spaces (namely, the L1-norm distance and a kernel-based distance). The results are reported in Table 2. As expected from the feature space representations shown above, classifications based on the 2-mer representation of protein sequences outperform classifications based on 1-mer features. Interestingly, the kernel-based distances perform better than the L1-norm for the NSs and the fragment-based structural sequences, but not for the SSE-based structural sequences. This is probably related to the differences in the sizes of the corresponding feature spaces, as NS, GFS, and LFS are based on a 20-letter alphabet, while SSE only includes three categories (H, E, and C). We find that classifications based on structural sequences (SSES, GFS, and LFS) always outperform the classifications based on the NS only. For completeness, we performed the same experiments using STRUCTAL as a classifier; its success rate is 97.7±0.8%, confirming that it outperforms our sequence-based classifiers.
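The accuracy definition above (trace of the confusion matrix over the sum of its elements) can be made concrete with a toy example; the matrix values below are invented for illustration.

```python
# Toy illustration of the accuracy measure: trace over total count.

def classification_accuracy(confusion):
    """confusion[i][j] = number of fold-i proteins classified as fold j."""
    trace = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return trace / total

confusion = [
    [50,  3,  2],   # fold 1: 50 correct, 5 misclassified
    [ 4, 60,  1],   # fold 2
    [ 2,  2, 76],   # fold 3
]
print(classification_accuracy(confusion))  # 0.93
```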

HMM-based classification

Table 2 shows that the fragment-based structural sequences perform better than the SSESs for fold recognition in protein sequence feature spaces. It is not clear, however, if this is related to the quality of their representations as high-dimensional vectors or a consequence of a difference in the information content of the sequences themselves. To further characterize the latter, we repeated the fold classification experiments using HMMs. For each experiment, we build HMMs on each of the training sets for the five folds of CATH605; each protein in the test set is then assigned to the fold whose HMM scores the highest among the five. The protein sequences representing the fold are considered as the observables, and the hidden states are unknown. We generated a separate model for each of these sequences using the Baum–Welch algorithm;58 these models were subsequently combined using a simple unweighted average scheme. We fixed the number N of hidden states to 4 (see Methods for details). The classification accuracies for the HMMs based on the NS, SSES, GFS, and LFS are 42.9%, 63.3%, 71.5%, and 71.0%, respectively. As expected, the NS alone performs poorly. Interestingly, the HMMs based on the two types of fragment-based structural sequences (global and local) outperform the HMMs based on secondary-structure information.

It should be noted that the first-order Markovian assumption used in HMMs is clearly violated by fragment-based global structural sequences, as the buildup procedure positions one element of the sequence based on its three previous elements; as such, the HMM results should be considered as qualitative.
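The assign-to-highest-scoring-fold scheme can be sketched with a deliberately simplified stand-in: where the study trains full HMMs with Baum–Welch and averages per-fold models, the toy below models each fold with a plain first-order Markov chain over structural letters and assigns a test sequence to the fold with the highest log-likelihood. The alphabet and sequences are invented.

```python
import math

# Simplified stand-in for the per-fold generative classification above:
# first-order Markov chains replace the trained HMMs; a test sequence
# goes to the fold whose model gives it the highest log-likelihood.

def train_markov(sequences, alphabet):
    # add-one (Laplace) smoothed transition counts
    counts = {a: {b: 1.0 for b in alphabet} for a in alphabet}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in alphabet}
    return model

def log_likelihood(seq, model):
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))

alphabet = "ABCD"
folds = {
    "fold1": ["ABABABAB", "ABABAB"],   # invented training sequences
    "fold2": ["CDCDCDCD", "CDCDCD"],
}
models = {f: train_markov(seqs, alphabet) for f, seqs in folds.items()}

test = "ABABABA"
best = max(models, key=lambda f: log_likelihood(test, models[f]))
print(best)  # fold1
```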

SVM-based classification

Qiu et al. have shown that SVM kernel classifiers perform well for protein structure classification when combined with structure comparison data.59 Furthermore, they found that classification based on the structural data alone did not perform as well as when they added the SVM kernel classifier. Leslie et al. observed the same behavior when comparing classification based on NS alone and classification using NS and an SVM classifier based on p-mer string kernels.57 Here, we tested if these results generalized to different types of protein sequences. For this purpose, we have adapted the multiclass SVM classifier available in Shogun,60 an open-source machine learning toolbox, to work on our p-mer-based string kernel. We tested the classification power of the resulting SVM as follows. The sequences of each fold in CATH605 are randomly divided into two groups, training and testing. The training sequences of the five folds are used to parametrize the multiclass SVM, which is subsequently used to classify the testing sequences. The procedure is repeated 1000 times for each type of sequence representation. Results are given in Table 3. As expected from the results of Leslie et al., the SVM kernel classifier performs better for fold recognition than the much simpler distance-based classifier, with an average 10% increase in performance.57 The ranking of the different sequence representations, however, remains identical: the NSs perform poorly, while the two fragment-based structural sequences perform best, with small but statistically significant improvements compared to the sequences based on secondary-structure elements.

Table 3 reports the results of experiments using the 1-mer and 2-mer spectrum kernels. We repeated these experiments with higher-order spectrum kernels. The performances of the NS, GFS, and LFS decreased quickly as the order of the spectrum kernel was increased from 3-mer upward, while the performance of the SSESs increased slowly as the order p was increased, peaked for a 7-mer kernel with an accuracy of 77.1±1.3%, and then decreased. These results illustrate the effects of using small data sets and large alphabets.

Fig. 10. 2D representations of the protein sequence feature spaces. Each protein in our data set is assigned four different 1D sequences: its NS, its SSES, and the two structural sequences derived from a structural alphabet of fragments, global (GFS) and local (LFS). These sequences are mapped onto a feature space indexed on their 1-mer or 2-mer content. Each protein is assigned a color based on the fold it belongs to: black for 1.10.10 (all α-proteins), red for 2.60.40 (all β-proteins), and blue, magenta, and green for 3.20.20, 3.30.70, and 3.40.50, respectively (mixed α–β folds). 2D projections of these spaces were generated using metric MDS.53

Discussion

The goal of our study is twofold: find an appropriate 1D representation of a protein that captures its structural characteristics, and derive a method that uses this 1D representation to classify protein structures.

Building 1D structural sequences for proteins

To reach our first goal, we have derived a structural alphabet of 20 fragments of four residues using the method proposed by Kolodny et al.27 We derive two types of 1D structural sequence for a protein using either a local-fit procedure, yielding the local fragment-based sequence (LFS), or a global-fit procedure that builds a chain from these fragments so as to minimize the cRMS between the chain and the actual structure, yielding the global fragment-based sequence (GFS). We have shown that the LFS matches give a good local fit to the corresponding protein structure, with an average cRMS of 0.25 Å over all its fragments of four residues. The LFS, however, does not capture the overall 3D geometry of the protein. We have then shown that using the global-fit procedure, we can rebuild proteins with an average accuracy of 0.69 Å, where the average is taken over 2225 proteins (Fig. 4). This result cannot be compared directly with other related studies, as the testing sets are usually different. To avoid ambiguity, we repeated our reconstruction experiment on the Park and Levitt set,61 which has been considered before. We achieved an average accuracy of 0.73±0.13 Å cRMS for the resulting backbone reconstructions. On the same protein set, Camproux et al. obtained an average accuracy of 0.64 Å using a library of 27 fragments of four residues.28 Kolodny et al. observed that the global-fit accuracy of chains built from structural sequences derived from a library of C fragments of four residues satisfies the empirical equation ⟨cRMS⟩ = 5.75 × C^(−0.7).27 Applying this equation to our library and to the library of Camproux et al. yields predicted accuracies of 0.71 and 0.57 Å, respectively, well in range with the observed values. Recently, Baeten et al. achieved an average reconstruction accuracy of 0.48 Å on the same data set.36 In order to reach this accuracy, they used a library of more than 1000 fragments with six different lengths. Figure 8 gives us information on why our library did not perform as well as this larger library. We observed that the accuracy with which global structural sequences capture the 3D geometry of a protein improves as the helical content of the protein increases; conversely, the accuracy decreases as the strand content of the protein increases. Both observations agree well with the known fact that β-strands have a greater tolerance for deviation from regularity than helices.1

Fig. 11. The relationship between sequence similarity and structure similarity. We use 10,000 pairs of proteins in CATH2225. For each pair of proteins, the L1 distance between the projections of the NS (a) and the global structural sequence GFS (b) in the 2-mer feature space is plotted versus the GSAS score between their structures (GSAS is computed as 100 × cRMS/(N − Ngap), where N is the number of equivalent residues and Ngap is the number of gaps in the alignment).19 No relation is observed for the NS, while a linear relationship is observed for the GFS, with a correlation coefficient R = 0.7.

Fig. 12. ROC analysis comparing similarity measures in protein sequence feature spaces for remote homology detection. We compare the ability of STRUCTAL (a protein structure alignment program), FASTA (a sequence alignment program), and four classifiers based on the NS, the SSES, and the global and local structural sequences of a protein mapped in the 1-mer feature space to detect structural similarities in a set of 605 proteins. "True" relationships are defined by CATH topologies. Curves close to the first diagonal (such as the ROC curve for FASTA) indicate poor performance, while the uppermost curves (such as STRUCTAL's) reveal good performance.
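The empirical relation of Kolodny et al. quoted above, ⟨cRMS⟩ = 5.75 × C^(−0.7) for a library of C four-residue fragments, is easy to check numerically against the two library sizes discussed:

```python
# Numerical check of the empirical accuracy relation for fragment
# libraries of C four-residue fragments: <cRMS> = 5.75 * C**(-0.7).

def predicted_crms(n_fragments):
    return 5.75 * n_fragments ** -0.7

print(round(predicted_crms(20), 2))  # 0.71 A: our 20-fragment library
print(round(predicted_crms(27), 2))  # 0.57 A: the 27-fragment library of Camproux et al.
```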

Strand-like fragments are not underrepresented in our library; in fact, they make up one-third of all fragments (see Fig. 3). This fraction is still not large enough to cover the wide basin of conformations observed for residues in strands. Baeten et al. overcame this problem by including a very large number of fragments.36 The question remains as to whether it is possible to reach such accuracy with a much smaller library.

Structural alphabets and protein structure classification

Protein structure alignment is the most natural method for comparing and/or classifying protein structures. Unfortunately, current algorithms for aligning protein structures are heuristic and, as such, cannot guarantee that they find the optimal solution.19 In addition, the measure of similarity usually reported, that is, the cRMS, is not a metric for comparing proteins of different sizes and, as such, reduces the options in choosing a classifier. An alternative approach to structure alignment is to represent protein structures as vectors in a high-dimensional feature space; protein similarity is then quantified in this feature space. In this study, we converted the structure information of a protein into a 1D sequence and then used representations of these 1D sequences in a feature space built upon their substrings to classify protein structures.

We use four different types of sequences: the baseline is the native amino acid sequence, and we include three structure-based sequences: a sequence based on secondary structure and two structural sequences based on fragments, one obtained using a local fit to the structure, the other providing the best global fit to its overall 3D shape.

Table 1. ROC scores for homology detections based on protein feature spaces

            Feature space: 1-mer            Feature space: 2-mer
Sequence    L1 distance  Kernel distance    L1 distance  Kernel distance
NS          0.61         0.61               0.65         0.60
SSES        0.84         0.85               0.85         0.84
GFS         0.82         0.83               0.80         0.76
LFS         0.83         0.84               0.82         0.79

The ROC score is the area below the ROC curve, which plots the rate of TPs versus the rate of FPs for a given homology detection method. High scores are better.

NSs have been used extensively for predicting the

structural class or the fold of a protein (see, e.g., Refs. [62,63] and references therein). Most of these studies focus on developing new representations of the NS; these include histograms of its amino acid composition (which is equivalent to the 1-mer representation we consider),64,65 pseudo-histograms that maintain some information about the order of the amino acids in the sequence,63 and vectors in a feature space indexed by all the sequence p-mers.57,62,66,67 In the latter studies, Leslie et al. project the native protein sequences in a feature space with k=3, 4, or 5 and compare the corresponding vectors using a kernel-based distance and an SVM classifier. The ROC scores they report for homology detection experiments on a data set similar to ours are of the order of 0.62 for fold recognition;57 our ROC scores vary from 0.60 to 0.65, depending on the feature space we use (1-mer or 2-mer) and the distance we use in this feature space (L1-norm or kernel-based). Note that we only use feature spaces indexed by substrings up to 2-mers, as our training sets are smaller than those used by Leslie et al.; we tested p=3, and the ROC scores dropped to 0.48. The loss of accuracy as we increase the word (p-mer) size was also observed by Lingner and Meinicke.68 The latter authors describe applications of word correlation matrices for remote homology detection, with the words being similar to the p-mer strings we use. They report much better results than those of Leslie et al. or those reported here, with ROC scores in the range 0.8 to 0.92. However, we cannot directly compare the results of Lingner and Meinicke with ours, as their test set is much less stringent than ours in the choice of sequences with very little homology. It should be noted that we did not attempt to improve our results on the NSs, as those were used solely for comparison with the results based on structural sequences.

Table 2. Classifying protein structures based on a sequence-based feature space

            Feature space: 1-mer            Feature space: 2-mer
Sequence    L1 distance  Kernel distance    L1 distance  Kernel distance
NS          46.6±2.4     49.8±2.6           54.1±2.7     62.9±2.4
SSES        65.8±1.8     65.4±1.9           67.1±1.8     66.2±2.0
GFS         64.5±1.9     66.2±1.8           70.2±2.1     72.5±2.5
LFS         69.2±1.9     69.2±2.0           74.3±2.1     70.4±2.5

Protein sequences are mapped in a high-dimensional space based on their 1-mer or 2-mer content. Four types of sequences are considered: the NS, the SSES, and the global and local fragment-based structural sequences (GFS and LFS, respectively). Proteins are classified based on their shortest distance to a known fold, where the distance is either the L1-norm or a kernel-based distance. The accuracy is computed as the ratio of proteins correctly classified over the total number of test proteins.

The local structural sequence (LFS) of a protein is conceptually equivalent to its SSES, with a larger alphabet; both describe well the local structure of a protein but do not capture its 3D shape. We have shown that local structural sequences induce better fold classification than secondary-structure sequences, based on distance-based, HMM-based, and SVM-based classifiers. The improvement, however, though significant, is not commensurate with the difference in the richness of the alphabets (from 3 letters for SSES to 20 letters for LFS). This makes intuitive sense, as helical and extended (strand) conformations dominate among local conformations of proteins, as observed, for example, on the Ramachandran plot describing a protein. In parallel, we have shown that the global structural sequences (GFS) perform with the same success level as the local structural sequences. This result is more surprising. Intrinsically, a GFS contains more information than the SSES or LFS, as it captures the protein's 3D shape: we have shown, for example, that we can reconstruct the full 3D structure using the former

Table 3. SVM-based classification of protein structures using structural sequences

Sequence    Feature space: 1-mer    Feature space: 2-mer
NS          58.7±1.8                61.4±2.2
SSES        74.2±1.2                74.6±1.2
GFS         75.5±1.3                80.6±1.7
LFS         76.8±1.6                81.3±1.7

Table 4. Effect of heap size on global fit

Nkeep    Average cRMS (Å)    CPU time (s)
10       0.82                0.45
100      0.72                3.9
1000     0.70                21.6
2000     0.70                41.7
4000     0.69                174.1

The average quality of structure reconstruction (given as cRMS) over all proteins in CATH2225 is given for various heap sizes Nkeep used by the global-fit algorithm. We also give the CPU times (in seconds on an Intel Core 2 processor at 2.8 GHz) required for the reconstruction of one protein of 759 residues (1cm5B0) for the different values of Nkeep.


within 0.7 Å, that is, with a high accuracy level. Our results indicate that the three types of classifiers we have used, namely, distance-based, HMM-based, and SVM-based, are not able to capture this information. The reasons for this failure are unclear at this stage: the small size of the p-mer (limited to 2, as data become scarce for larger strings), the quality of the kernel (basically linear in our case), and the choice of hidden states for the HMM are all possible causes that require further investigation. For example, the HMMs we have built do not take position into account, as we did not align the set of sequences representing a fold (this is usually done in HMM-based fold recognition techniques). While it is relatively easy to align multiple NSs of proteins using dynamic programming and a mutation matrix, it is more difficult to align structural sequences because of the strong correlations between neighboring fragments. Recent developments in this area29,40–46 give us hope that we will be able to develop better-performing HMMs based on structural sequences.

Structural sequences have already been used for

structure classification. Wang et al. used HMM learning on sequences of local structural alphabets to classify protein structure folds;41 they report fold classification accuracies of 82% on training-set structures sharing less than 40% pairwise sequence identity and 65% for those sharing no more than 25% sequence identity. Our results on the local structural sequences (LFS) fall between these two values, with accuracies of the order of 70%. Wang et al. also tested structural alphabets of different sizes and showed that results improve as the alphabet size increases up to 20 letters but then reach a plateau. This comforts us in our choice to maintain a small alphabet size to derive structural sequences.

How do we compare structural sequences?

While pattern recognition techniques attract a lot of interest for comparing protein sequences, sequence alignment using dynamic programming remains the method of choice when comparing NSs of proteins. There are, however, at least two problems when we try to apply sequence alignment to structural sequences. Firstly, we need a substitution matrix and values for the gap penalties. There have been several attempts to derive such values for structural alphabets;29,44 alignments using these parameters are not yet equivalent to the corresponding direct structural alignments. Secondly, special care is needed to make sure that dynamic programming applies to these sequences; problems occur for global structural sequences, as the letters in the sequence are not independent of each other. In this study, we avoided these issues by projecting the sequences in a feature space and using pattern recognition techniques to compare sequences in this space. We have shown, however, that these techniques do not extract all the information contained in the structural sequences. We expect that success will come from first solving the structural sequence alignment problem.

Conclusion

Structural biology recently experienced a major uplift through the development of high-throughput studies aimed at developing a comprehensive view of the protein structure universe. The key to the success of this approach lies in our ability to organize and analyze the wealth of information it generates and to integrate that information with other efforts in cellular biology. Protein structure comparison and classification are obligatory steps of this process and, as such, are the focus of many research studies (for a review, see Ref. [10]). Protein structure alignment is probably the most natural approach to performing these two steps. However, the difficulties in generating accurate structural alignments19,69 and the ambiguities of the RMSD, the standard measure of structure similarity (which is not a metric when comparing proteins of different sizes), are pushing the field of computational biology to search for alternative approaches that may alleviate these problems. In this study, we explored the option to encode the 3D structure of a protein into a 1D string or sequence and to use standard pattern recognition techniques to compare and classify these sequences. We have shown that both local and global 1D representations based on a structural alphabet of sequential protein fragments map the protein structure accurately and perform significantly better than the 1D representation based on the secondary-structure content of the protein for fold recognition purposes, but not well enough to become a viable replacement for computationally intensive procedures such as structural alignment tools. We plan to extend this study in several directions. Firstly, we will analyze in greater detail the reasons for this apparent failure. In particular, we will test more sophisticated sequence learning tools to capture long-range correlations from sequence data. Then, we want to take into account the fact that the 1D global structural sequence of a protein is not unique; by using profiles of 1D structural sequences, we expect to be able to exploit more sophisticated sequence analysis methods to compare protein structures. Secondly, we plan to extend beyond fold recognition and use our structural sequences for generating protein structural alignments, along the lines of recent work by Friedberg et al.44 Another direction is to study the correlations between the different sequence representations of a protein. The idea that it might be possible to predict the structural sequence of a protein based on its NS is very appealing.

Methods

Protein fragment library

We represent protein structures using a library of 20 fragments of protein backbone. Each fragment is four residues long and is defined by the 3D coordinates of its four Cα atoms. The library was generated from 200 protein domains whose structures were accurately determined by X-ray crystallography. Each of these domains was broken into a series of nonoverlapping fragments of four residues, which were subsequently grouped into 20 clusters according to their cRMS deviation from one another. Each cluster is represented by a single element: the cluster's center. This work has been described in detail elsewhere.27 Note that the fragments do not include any sequence specifics of the proteins.

Structural sequences

The 20 fragments in our library serve as building blocks for constructing 1D representations of protein structures. Two types of 1D sequences are considered, designed to capture local or global structural features of the protein, respectively.

The best local-fit representation of a protein is derived in linear time by finding, for each fragment of four residues of the protein, the most similar fragment in the library, where similarity is measured using cRMS.

We build the global structural sequence of a protein using the method introduced by Kolodny et al.27 Namely, copies of fragments from the library are repeatedly added to extend a chain until reaching the target length, under the constraint that the overall cRMS distance between the model and the native structure of the protein is minimal. Each new fragment is added by superimposing its first three atoms on the last three atoms of the growing chain, extending the chain by 4−3=1 atom (where "atoms" refers to the Cα of each amino acid of the protein). An important property of this method is that each step of adding a fragment is local, while the measure of goodness of fit is global. Once the full chain is constructed, the best approximation forms the model for the structure. Each fragment is then assigned a single-letter label (from A to T for the 20 fragments), and the final model for a protein structure can be stored as a linear string of these labels with negligible loss of information.
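The local-fit step can be sketched as follows: slide a four-residue window along the Cα trace and, for each window, record the library fragment with the smallest cRMS after optimal superposition (here computed with the standard Kabsch algorithm). The two-fragment "library", the toy coordinates, and the overlapping-window convention are our own illustration, not the study's actual library.

```python
import numpy as np

# Sketch of the local-fit assignment: minimum-cRMS library fragment
# per 4-residue Ca window, using Kabsch superposition.

def kabsch_rmsd(P, Q):
    """cRMS between two (n, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)
    S[-1] *= np.sign(np.linalg.det(V @ Wt))   # exclude improper rotations
    e = (P ** 2).sum() + (Q ** 2).sum() - 2 * S.sum()
    return np.sqrt(max(e, 0.0) / len(P))

def local_fit(ca_coords, library):
    """Assign one library letter per overlapping 4-residue window."""
    letters = sorted(library)
    seq = []
    for i in range(len(ca_coords) - 3):
        window = ca_coords[i:i + 4]
        seq.append(min(letters, key=lambda l: kabsch_rmsd(window, library[l])))
    return "".join(seq)

# Toy library: 'A' = straight segment, 'B' = a kinked one (~3.8 A Ca-Ca).
library = {
    "A": np.array([[0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0.]]),
    "B": np.array([[0, 0, 0], [3.8, 0, 0], [6.5, 2.7, 0], [6.5, 6.5, 0.]]),
}
chain = np.array([[0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0], [15.2, 0, 0.]])
print(local_fit(chain, library))  # "AA": both windows of the straight chain match 'A'
```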

The number of possible global-fit approximations for a protein of N residues with our library of 20 fragments of four residues is 20^(N−3); this number is too large to be explored systematically. Our implementation is consequently heuristic and uses a soft greedy algorithm with a heap of size Nkeep. The choice of Nkeep is a balance between the desire to reach a high accuracy and the reality of the computing time and memory usage. To decide on which value of Nkeep to choose, we tested the quality of the global-fit approximations of 2225 proteins (see below) for five values of Nkeep. Results are shown in Table 4. As expected, higher values of Nkeep lead to better approximations. The accuracy, however, improves drastically with increasing Nkeep for small heap sizes and remains roughly constant for larger heap sizes. We consequently used a heap size of 4000 for all our studies. We also find that the central processing unit (CPU) time required to compute one approximation depends roughly linearly on Nkeep (see Table 4).

It is important to notice that the local structural sequence of a protein is unique, while usually many global structural sequences represent the 3D protein structure with similar accuracies. In this study, we define the global structural sequence of a protein as the highest-scoring sequence derived by our algorithm. As the latter is heuristic, there is no guarantee that this sequence is indeed the best sequence that can be constructed. To ensure that this approximation had no impact on the results presented here, we repeated all classification experiments with a new definition of the global structural sequence, chosen at random among the top 50 best-scoring sequences obtained using the global-fit procedure. Results remained the same, both qualitatively and quantitatively.
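The algorithmic skeleton of the soft greedy search (a beam search keeping the Nkeep best partial chains at each extension step) can be sketched with a toy 1D stand-in: the real method extends a 3D chain and scores it by cRMS to the native structure, whereas here a fragment contributes one scalar value and the score is a cumulative absolute error against a 1D target.

```python
import heapq

# Beam-search skeleton of the soft greedy global fit: extend partial
# chains one fragment at a time, keep only the Nkeep best-scoring ones.
# The 1D target and absolute-error score stand in for the 3D geometry.

def beam_fit(target, fragments, n_keep):
    """Return the letter string whose values best approximate 'target'."""
    beam = [(0.0, "", ())]                       # (score, letters, values)
    for pos in range(len(target)):
        candidates = []
        for score, letters, values in beam:
            for letter, value in fragments.items():
                new_score = score + abs(target[pos] - value)
                candidates.append((new_score, letters + letter, values + (value,)))
        beam = heapq.nsmallest(n_keep, candidates)   # the "heap" of size Nkeep
    return beam[0][1]

fragments = {"A": 0.0, "B": 1.0, "C": 2.0}
print(beam_fit([0.1, 1.2, 1.9, 0.0], fragments, n_keep=100))  # "ABCA"
```

With an additive score, a beam of any size finds the per-position optimum; in the real 3D problem the score of an extension depends on the whole chain built so far, which is why a larger Nkeep improves the fit (Table 4).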

Data set of protein structures

The set of structures considered in this study is extracted from the database of 2930 sequence-diverse CATH v2.4 domains used in a previous study.19 As we focus on 3D structures, we consider the first three levels of CATH, Class, Architecture, and Topology, to give a CAT classification. We refer to a set of structures with the same CAT classification as a fold class. Using a set of structures with sufficient sequence diversity ensures that the data are duplicate-free and that the problem of detecting structural similarity is nontrivial for all pairs of proteins considered. The 2930 structures were selected as follows: (i) Sort all 34,287 CATH v2.4 domains by their SPACI score.70 (ii) Start with the domain with the highest SPACI score, and remove from the list all domains that share significant sequence similarity with it (FASTA E-value < 10^−4). (iii) Repeat step (ii) with all domains in the list that have not been removed. This is the same procedure used by Brenner et al. to generate the sequence-diverse set of SCOP structures. The set of 2930 domains resulting from this procedure was further filtered to remove all proteins whose experimental structures have gaps in their backbone. The final set contains 2225 proteins and is referred to as CATH2225.

There are 619 fold classes in CATH2225, many of which contain only a single element (394). To facilitate statistical analysis, we selected five of the most populated folds in CATH2225 as the test set for all computational experiments run in these studies, including at least one fold from each CATH class: CATH fold 1.10.10, a fully α fold (arc repressor, 62 representatives); CATH fold 2.60.40, a fully β fold (immunoglobulin-like, 169 representatives); and three mixed α–β folds: 3.20.20 (TIM-like, 67 representatives), 3.30.70 (two-layer sandwich, 92 representatives), and 3.40.50 (Rossmann fold, 215 representatives). These five folds include a total of 605 proteins of CATH2225 (set CATH605). Figure 2 shows examples of protein structures for each of these five folds.
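The greedy redundancy-removal procedure in steps (i)–(iii) above can be sketched generically. The quality scores and the similarity test below are invented stand-ins; the real procedure sorts by SPACI score and drops domains with FASTA E-value < 10^−4 to the kept domain.

```python
# Sketch of the sequence-diversity filter: sort by quality, keep the
# best remaining domain, drop everything too similar to it, repeat.

def greedy_filter(domains, quality, too_similar):
    """domains: ids; quality: id -> score; too_similar: (kept, other) -> bool."""
    remaining = sorted(domains, key=quality, reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-quality domain left
        kept.append(best)
        remaining = [d for d in remaining if not too_similar(best, d)]
    return kept

# Toy example: domains in the same "family" count as too similar.
family = {"d1": 1, "d2": 1, "d3": 2, "d4": 3}
score = {"d1": 0.9, "d2": 0.5, "d3": 0.8, "d4": 0.7}
kept = greedy_filter(list(family), score.get, lambda a, b: family[a] == family[b])
print(kept)  # ['d1', 'd3', 'd4']: d2 is dropped as redundant with d1
```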

Classifiers

Our goal is to define protein structure classifiers based on sequence information (where the sequence can be the native amino acid sequence of the protein or a structural sequence representing its structure). The number of known protein structures is growing exponentially, and the number of folds representing these structures is still growing: scalability is therefore an essential feature that we take into account when designing our classifiers. As a consequence, we opt for relatively simple classification models.

The dynamics of the sequences are considered in the form of statistics of n-gram subsequences of the different string representations. In the following subsections, we describe the three different types of classifiers we adopt in this work: distance-based (nearest neighbor) discriminative models, generative models in the form of HMMs, and multiclass SVMs.

Distance-based methods

In the distance-based methods, each protein sequence is embedded in a high-dimensional vector space in such a way that the distance between two proteins in this space reflects their similarity. Here, "sequence" refers to either the NS of amino acids of the protein or the structural sequence obtained from the secondary-structure elements (SSES) of the protein or through reconstruction using a fragment library (FS). The pairwise distances are then used to classify the proteins. The different approaches used here reduce to various ways of counting substrings that two strings have in common. This is a meaningful similarity notion in biological applications, since evolutionary proximity is thought to result both in structure similarity (functional similarity) and in sequence similarity. A distance-based classifier is fully characterized by (a) the mapping of a sequence into a feature space, (b) the representation of a fold class in this feature space, and (c) the classification procedure itself.

The embedding map is defined between the space of all finite sequences over an alphabet Σ (amino acids for primary sequences or structural units for structural sequences) and a vector space F. The coordinates of F are indexed by a subset I of strings over Σ, that is, by a subset of the input space. In this article, we will study embeddings where I is the set Σ^p of strings of length p. In practice, we will limit ourselves to the case p=2, but the theory remains valid for all p. We use φ to denote the feature mapping

\phi : s \mapsto ( \phi_u(s) )_{u \in I} \in F.    (1)

The embedding of protein sequences into the p-mer subsequence space is done either explicitly, by computing the p-mer probability distribution of each protein sequence, or implicitly, with the p-spectrum kernel,71,72 the latter leading to a much faster implementation (i.e., with linear complexity). We report here both of these methods.

(1) L1-norm distance of p-gram probability distributions. In this method, we compute explicitly the probability distribution of p-mers for each protein. Each fold can be represented by the mean of all proteins belonging to it, and the distance between two protein sequences in the feature space can be computed as follows:

d(s, t) = \sum_{u \in \Sigma^p} \left| \phi^p_u(s) - \phi^p_u(t) \right|.    (2)
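As a concrete illustration, the explicit embedding and the L1 distance of Eq. (2) can be sketched in a few lines of Python; the function names are ours, not from the authors' implementation:

```python
from collections import Counter

def pmer_distribution(seq, p=2):
    """Probability distribution of contiguous p-mers of a sequence."""
    counts = Counter(seq[i:i + p] for i in range(len(seq) - p + 1))
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}

def l1_distance(s, t, p=2):
    """L1 distance between the p-mer distributions of two sequences (Eq. 2)."""
    ds, dt = pmer_distribution(s, p), pmer_distribution(t, p)
    return sum(abs(ds.get(u, 0.0) - dt.get(u, 0.0)) for u in set(ds) | set(dt))

# Identical sequences are at distance 0; sequences with disjoint 2-mer
# spectra are at the maximal distance of 2.
print(l1_distance("ABAB", "ABAB"))  # 0.0
print(l1_distance("AAAA", "BBBB"))  # 2.0
```

Because the distributions are normalized, the distance is insensitive to sequence length, which is the point of working with probabilities rather than raw counts.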

(2) p-spectrum kernel. Perhaps the most natural way to compare two strings is to count how many (contiguous) substrings of length p they have in common. We define the spectrum of order p (or p-spectrum) of a sequence s to be the histogram of frequencies of all its (contiguous) substrings of length p (p-mers). We define the p-spectrum kernel as the inner product of the p-spectra. Formally, the feature space F associated with the p-spectrum kernel is indexed by I = Σ^p, with the embedding given by

\phi^p_u(s) = \left| \{ (v_1, v_2) : s = v_1 u v_2 \} \right|, \quad u \in \Sigma^p    (3)

and the associated p-spectrum kernel between sequences s and t is defined as

\kappa_p(s, t) = \langle \phi^p(s), \phi^p(t) \rangle = \sum_{u \in \Sigma^p} \phi^p_u(s)\, \phi^p_u(t).    (4)

To turn the p-spectrum kernel into the inner product of the p-mer probability distribution feature space, we need to normalize the feature vectors (in other words, this step removes the effect of the length of each sequence). This can be done by the following operation (the kernel is hereafter understood to be normalized):

\kappa_p(s, t) \leftarrow \frac{\kappa_p(s, t)}{\sqrt{\kappa_p(s, s)\, \kappa_p(t, t)}}.    (5)

The kernel functions represent inner products in a feature space. Given the kernel values, the distance between the feature vectors corresponding to two sequences s and t can be computed as

d(s, t)^2 = \kappa_p(s, s) + \kappa_p(t, t) - 2\,\kappa_p(s, t) = 2 - 2\,\kappa_p(s, t).    (6)

Note that to compute the distance from a protein sequence to a fold, we need the notion of the mean of a fold S in the feature space:

\phi^p_S = \frac{1}{l} \sum_{i=1}^{l} \phi^p(s_i)    (7)

where \{s_i\}_{i=1}^{l} \subset S are the protein sequences in the fold. Even though it is impossible to represent explicitly the mean of a fold in the feature space, we can nevertheless compute its norm

\| \phi^p_S \|_2^2 = \langle \phi^p_S, \phi^p_S \rangle = \frac{1}{l^2} \sum_{i,j=1}^{l} \kappa_p(s_i, s_j)    (8)

and the distance between the mean of a fold and the image of a protein sequence is given by

\| \phi^p(s) - \phi^p_S \|_2^2 = \langle \phi^p(s), \phi^p(s) \rangle + \langle \phi^p_S, \phi^p_S \rangle - 2 \langle \phi^p_S, \phi^p(s) \rangle
                              = \kappa_p(s, s) + \frac{1}{l^2} \sum_{i,j=1}^{l} \kappa_p(s_i, s_j) - \frac{2}{l} \sum_{i=1}^{l} \kappa_p(s, s_i).    (9)
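The kernel-only computations of Eqs. (4)-(9) can be sketched as follows. This is an illustrative Python sketch under our own naming (the toy "fold" is invented); it is not the authors' code:

```python
from collections import Counter
from math import sqrt

def spectrum(seq, p=2):
    # p-spectrum: histogram of contiguous p-mers (Eq. 3)
    return Counter(seq[i:i + p] for i in range(len(seq) - p + 1))

def k_raw(s, t, p=2):
    # raw p-spectrum kernel (Eq. 4)
    fs, ft = spectrum(s, p), spectrum(t, p)
    return sum(fs[u] * ft[u] for u in fs)

def k(s, t, p=2):
    # length-normalized kernel (Eq. 5)
    return k_raw(s, t, p) / sqrt(k_raw(s, s, p) * k_raw(t, t, p))

def dist_to_fold(s, fold, p=2):
    # squared distance from phi(s) to the fold mean, using kernels only (Eq. 9)
    l = len(fold)
    mean_norm2 = sum(k(si, sj, p) for si in fold for sj in fold) / l ** 2
    cross = sum(k(s, si, p) for si in fold) / l
    return k(s, s, p) + mean_norm2 - 2 * cross

# A member of the toy fold is closer to the fold mean than an unrelated string.
fold = ["AABBA", "ABBAB"]
print(dist_to_fold("AABBA", fold) < dist_to_fold("CCCDC", fold))  # True
```

The key point, mirrored in `dist_to_fold`, is that the fold mean itself is never materialized: all three terms of Eq. (9) reduce to kernel evaluations.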

447 Structural Alphabets for Proteins: A Comparison

Hidden Markov models

HMMs are statistical models for sequential data; they have been used successfully in pattern recognition, speech processing, and biosequence modeling. Rabiner wrote a good introduction to HMMs.73 The joint probability of a sequence of observations y_1^T = y_1, y_2, \ldots, y_T can always be factored as

P(y_1^T) = P(y_1) \prod_{t=2}^{T} P(y_t \mid y_1^{t-1}).    (10)

Its calculation appears intractable in general. However, if we assume that the past sequence can be summarized by a state variable q_t, then one can rewrite the previous equation as

P(y_1^T) = \sum_{q_1^T} P(y_1^T, q_1^T)    (11)

where the sum over q_1^T represents the sum over all possible state sequences q_1^T = q_1, q_2, \ldots, q_T of length T. Fortunately, this can now be factored as

P(y_1^T, q_1^T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1}) \prod_{t=1}^{T} P(y_t \mid q_t)    (12)

using the first-order Markovian assumption (each state depends probabilistically on just the preceding state).74 The joint probability is therefore completely specified in terms of initial state probabilities P(q_1), transition probabilities P(q_t | q_{t-1}), and emission probabilities P(y_t | q_t). Since each state variable q_t of the underlying Markov model is not directly observed but is a stochastic function of the previous observations y_1^{t-1}, such a model is called a hidden Markov model. It can be trained to maximize the likelihood of a training set of sequences (maximum likelihood criterion), using well-known algorithms such as Expectation Maximization75 or Viterbi.76

For classification tasks, such as deciding whether a given sequence belongs to a given target class (fold in our case), one usually trains a different HMM for each class (fold) to maximize the likelihood of the training sequences assigned to that class. Then, given a new sequence to classify, one computes the likelihood of the sequence for each HMM and selects the class that maximizes the likelihood criterion, P(y_1^T | class = c).

The HMM of a fold is constructed from the training set of sequences it contains. Individual models are constructed for each training sequence using the Baum–Welch procedure58 and subsequently combined using a simple unweighted average scheme to define the parameters of the HMM for the fold. The number of hidden states N is chosen as the value that optimizes the classification power of the HMMs. We tested values of N between 1 and 20 and retained the value N = 4, as it gave the best results for all four types of sequences. Finally, we compute the probability of a sequence of observations given a model using the forward procedure.73
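The forward procedure referenced above can be illustrated with a minimal pure-Python sketch; the two-state model and all of its probabilities below are invented for illustration only:

```python
def forward_likelihood(obs, init, trans, emit):
    """P(y_1^T) under an HMM, computed by the forward procedure.

    init[q]     = P(q_1 = q)
    trans[q][r] = P(q_t = r | q_{t-1} = q)
    emit[q][y]  = P(y_t = y | q_t = q)
    """
    states = list(init)
    # initialization: alpha_1(q) = P(q_1 = q) P(y_1 | q_1 = q)
    alpha = {q: init[q] * emit[q][obs[0]] for q in states}
    # recursion: marginalize over the previous state at each step (Eq. 12)
    for y in obs[1:]:
        alpha = {r: sum(alpha[q] * trans[q][r] for q in states) * emit[r][y]
                 for r in states}
    # termination: sum over the final state
    return sum(alpha.values())

# Toy two-state model over a binary observation alphabet (illustrative numbers).
init = {"H": 0.6, "L": 0.4}
trans = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"a": 0.9, "b": 0.1}, "L": {"a": 0.2, "b": 0.8}}
print(forward_likelihood(("a", "b"), init, trans, emit))  # ≈ 0.209
```

The recursion runs in O(T N^2) time, which is what makes the otherwise intractable sum over all N^T state paths in Eq. (11) feasible.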

It should be noted that the models based on the two types of fragment-based structural sequences and, to a lesser extent, the models based on secondary-structure elements violate the first-order Markovian assumption because each structural element is statistically related to its neighboring structural elements in the sequence. The lack of data (about 100 proteins are available for each fold) did not allow us to build a more complicated model.

Support vector machines

SVMs belong to the class of supervised learning methods used for classification and regression. They were initially introduced to solve binary classification problems, in which they attempt to separate the data, represented as two sets of vectors in a feature space, by constructing a hyperplane in that space that maximizes the margin between the two data sets.77

Let (x_i, y_i) be a data set with N examples, where x_i is the ith input and label y_i is either 1 or −1, indicating the class to which the point x_i belongs. Let Φ(x_i) be the high-dimensional vector representing the point x_i in its feature space. A hyperplane in this feature space can be written as f(x) = ⟨w, Φ(x)⟩ + b = 0. If this hyperplane divides the points having y_i = 1 from those having y_i = −1, then sign(f(x)) is a classification function for the data. The weight vector w (which is the normal to the hyperplane) and the bias b are determined by solving an optimization problem over the training data. A key feature of SVMs is that in this optimization problem, the feature vectors Φ(x) only appear in dot products; the explicit mapping Φ from the data space to the feature space can then be replaced with an implicit mapping through the use of a kernel function k(x, x′) = ⟨Φ(x), Φ(x′)⟩, which does not need to be linear.

There are different extensions of binary SVMs that solve multiclass learning problems. The three most popular are the following: pairwise classification (one versus one), one versus all, and a direct approach based on a multiclass objective function.78 In our experiments with multiclass SVMs, we used the pairwise classification method (one versus one) and the spectrum kernel defined in Eq. (4) to classify the proteins in CATH605. In the one versus one classification method, one trains a binary SVM for each possible pair of classes using examples from those two classes. For m classes, this results in m(m − 1)/2 binary SVMs. To classify a test input, all binary classifiers are evaluated and the input is assigned to the class that receives the highest number of votes.

We used the multiclass SVM code of the open-source package Shogun60 and adapted our spectrum kernel to it. For comparison, we also implemented the two other popular methods for multiclass SVMs (i.e., one versus all and the one using a multiclass objective function); the results were very similar.
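The one-versus-one voting scheme can be sketched as follows; the toy "classifiers" below stand in for the trained binary SVMs and are purely illustrative:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, binary_classifiers):
    """Combine m(m-1)/2 pairwise classifiers by majority vote.

    binary_classifiers[(a, b)](x) must return a or b, the predicted
    class of the pair for input x.
    """
    votes = Counter(binary_classifiers[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

def make_pair_clf(a, b):
    # Toy stand-in for a trained binary SVM: pick the class whose
    # label occurs more often in the input string.
    return lambda x: a if x.count(a) >= x.count(b) else b

classes = ["1.10.10", "2.40.50", "3.40.50"]
clfs = {(a, b): make_pair_clf(a, b) for (a, b) in combinations(classes, 2)}
print(one_vs_one_predict("3.40.50 3.40.50 1.10.10", classes, clfs))  # "3.40.50"
```

With m = 3 classes, only three binary classifiers are needed; the quadratic growth in m is the usual price of the one-versus-one scheme.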

Visualizing data using MDS

MDS is a technique used to provide a low-dimensional "mapping" (usually two or three dimensions) of high-dimensional data points.53 We apply standard metric MDS. Metric MDS starts with an n × n distance matrix D, whose element d_ij is the Euclidean distance between data point i and data point j. Distance geometry is used to convert this distance matrix into the inner-product matrix B, whose element b_ij is the inner product of the vectors representing data point i and data point j in Euclidean space. The next step computes the first few principal eigenvectors (usually two or three) of the inner-product matrix B and then projects all data points on these principal eigenvectors, thereby generating a low-dimensional mapping.

The five folds in the CATH605 data set are represented as five clusters in the low-dimensional mapping obtained by MDS. If the features defining the mapping capture well the differences between the folds, these five clusters

ought to be well separated. We evaluate this statement using the AIS:

\mathrm{AIS} = \frac{2}{N_c (N_c - 1)} \sum_{i=1}^{N_c - 1} \sum_{j=i+1}^{N_c} d(C_i, C_j)    (13)

where the distance between two clusters C_i and C_j is defined as

d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{i \in C_i} \sum_{j \in C_j} d(i, j).    (14)

d(i, j) is the Euclidean distance between the data points i and j in the projected space, and |C_i| is the size of cluster C_i. If two clusters C_i and C_j are well separated, all interdistances between their members are large and the average distance d(C_i, C_j) is consequently large. Therefore, larger values of AIS mean a better cluster configuration.
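Classical metric MDS and the AIS of Eqs. (13) and (14) can be sketched with NumPy (assuming NumPy is available; the toy points and labels are ours):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical metric MDS: distance matrix -> low-dimensional coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # double-centered inner-product matrix
    w, V = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]        # keep the top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def ais(X, labels):
    """Average inter-cluster separation (Eqs. 13 and 14) on projected points X."""
    clusters = [X[np.asarray(labels) == c] for c in sorted(set(labels))]
    nc = len(clusters)
    total = 0.0
    for i in range(nc - 1):
        for j in range(i + 1, nc):
            # mean pairwise Euclidean distance between clusters i and j
            diff = clusters[i][:, None, :] - clusters[j][None, :, :]
            total += np.linalg.norm(diff, axis=-1).mean()
    return 2.0 * total / (nc * (nc - 1))

# Two tight, well-separated toy clusters in 3-D, projected to 2-D.
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 5, 5], [5.1, 5, 5]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, dim=2)
print(round(ais(X, [0, 0, 1, 1]), 2))  # ≈ 8.66
```

With two clusters, the AIS is simply the mean inter-cluster distance; well-separated clusters give large values, as the text states.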

Acknowledgements

We thank Rachel Kolodny for generating the library of fragments used in this study. Quan Le wishes to thank Nello Cristianini for his supervision and helpful suggestions during Quan's visit to the University of California, Davis. Patrice Koehl acknowledges support from the Alfred P. Sloan Foundation and from the National Institutes of Health. Gianluca Pollastri and Quan Le's work is funded by Science Foundation Ireland grant 05/RFP/CMS0029 and grant RP/2005/219 from the Health Research Board of Ireland. Quan's visit to the University of California, Davis, was partly funded by a 2005 University College Dublin Seed Funding award.

References

1. Richardson, J. (1981). The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339.

2. Chothia, C. (1984). Principles that determine the structure of proteins. Annu. Rev. Biochem. 53, 537–572.

3. Branden, C. & Tooze, J. (1991). Introduction to Protein Structures. Garland Publishing, New York, NY.

4. Creighton, T. (1993). Proteins. W.H. Freeman and Co, New York, NY.

5. International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

6. Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G. et al. (2001). The sequence of the human genome. Science, 291, 1304–1351.

7. Burley, S., Almo, S., Bonanno, J., Capel, M., Chance, M., Gaasterland, T. et al. (1999). Structural genomics: beyond the human genome project. Nat. Genet. 23, 151–157.

8. Hieter, P. & Boguski, M. (1997). Functional genomics: it's all how you read it. Science, 278, 601–602.

9. Taylor, W., May, A., Brown, N. & Aszodi, A. (2001). Protein structure: geometry, topology and classification. Rep. Prog. Phys. 64, 517–590.

10. Koehl, P. (2006). Protein structure classification. In Reviews in Computational Chemistry (Lipkowitz, K., Cundari, T. & Gillet, V., eds), vol. 22, pp. 1–48. John Wiley & Sons, Hoboken, NJ.

11. Perutz, M., Rossman, M., Cullis, A., Muirhead, G., Will, G. & North, A. (1960). Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5 Å resolution, obtained by X-ray analysis. Nature, 185, 416–422.

12. Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

13. Orengo, C., Michie, A., Jones, S., Jones, D., Swindells, M. & Thornton, J. (1997). CATH: a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

14. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.

15. Kolodny, R. & Linial, N. (2004). Approximate protein structural alignment in polynomial time. Proc. Natl Acad. Sci. USA, 101, 12201–12206.

16. Sauder, J., Arthur, J. & Dunbrack, R. (2000). Large scale comparison of protein sequence alignment algorithms with structural alignments. Proteins: Struct. Funct. Genet. 40, 6–22.

17. Sierk, M. & Pearson, W. (2004). Sensitivity and selectivity in protein structure comparison. Protein Sci. 13, 773–785.

18. Novotny, M., Madsen, D. & Kleywegt, G. (2004). Evaluation of protein fold comparison servers. Proteins: Struct. Funct. Genet. 54, 260–270.

19. Kolodny, R., Koehl, P. & Levitt, M. (2005). Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346, 1173–1188.

20. Pauling, L., Corey, R. & Branson, H. (1951). The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl Acad. Sci. USA, 37, 205–234.

21. Pauling, L. & Corey, R. (1951). Configurations of polypeptide chains with favored orientations around single bonds: two new pleated sheets. Proc. Natl Acad. Sci. USA, 37, 729–740.

22. Rooman, M., Rodriguez, J. & Wodak, S. (1990). Automatic definition of recurrent local structure motifs in proteins. J. Mol. Biol. 213, 327–336.

23. Fetrow, J., Palumbo, M. & Berg, G. (1997). Patterns, structures, and amino acid frequency in structural building blocks, a protein secondary structure classification scheme. Proteins: Struct. Funct. Genet. 27, 249–271.

24. de Brevern, A., Etchebest, C. & Hazout, S. (2001). Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins: Struct. Funct. Genet. 41, 271–287.

25. Bystroff, C., Thorsson, V. & Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence–structure correlation in proteins. J. Mol. Biol. 301, 173–190.

26. Yang, A. & Wang, L. (2002). Local structure-based sequence profile database for local and global protein structure predictions. Bioinformatics, 18, 1650–1657.

27. Kolodny, R., Koehl, P., Guibas, L. & Levitt, M. (2002). Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol. 323, 297–307.

28. Camproux, A., Gautier, R. & Tuffery, P. (2004). A hidden Markov model derived structural alphabet for proteins. J. Mol. Biol. 339, 591–605.


29. Tung, C.-H., Huang, J.-W. & Yang, J.-M. (2007). Kappa–alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biol. 8, R31.1–R31.16.

30. Kolodny, R. & Levitt, M. (2003). Protein decoy assembly using short fragments under geometric constraints. Biopolymers, 68, 278–285.

31. Camproux, A., de Brevern, A., Hazout, S. & Tuffery, P. (2001). Exploring the use of a structural alphabet for structural prediction of protein loops. Theor. Chem. Acc. 106, 28–35.

32. Fourrier, L., Benros, C. & de Brevern, A. (2004). Use of a structural alphabet for analysis of short loops connecting repetitive structures. BMC Bioinformatics, 5, 58–71.

33. Etchebest, C., Benros, C., Hazout, S. & de Brevern, A. (2005). A structural alphabet for local protein structures: improved prediction methods. Proteins: Struct. Funct. Bioinf. 59, 810–827.

34. Mooney, C., Vullo, A. & Pollastri, G. (2006). Protein structural motif prediction in multidimensional ϕ–ψ space leads to improved secondary structure prediction. J. Comput. Biol. 13, 1489–1502.

35. Maupetit, J., Gautier, R. & Tuffery, P. (2006). SABBAC: online structural alphabet-based protein backbone reconstruction from alpha-carbon trace. Nucleic Acids Res. 34, W147–W151.

36. Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L. et al. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput. Biol. 4, e1000083.

37. Kurgan, L. & Kedarisetti, K. (2006). Sequence representation and prediction of protein secondary structure for structural motifs in twilight zone proteins. Protein J. 25, 463–474.

38. Dudev, M. & Lim, C. (2007). Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinformatics, 8, 106–117.

39. Martin, J., Regad, L., Lecornet, H. & Camproux, A. C. (2008). Structural deformation upon protein–protein interaction: a structural alphabet approach. BMC Struct. Biol. 8, 12–30.

40. Guyon, F., Camproux, A., Hochez, J. & Tuffery, P. (2004). SA-Search: a web tool for protein structure mining based on a structural alphabet. Nucleic Acids Res. 32, W545–W548.

41. Wang, S., Chen, C. & Hwang, M. (2005). Classification of protein 3D folds by hidden Markov learning on sequences of structural alphabets. Proc. 3rd Asia-Pacific Bioinf. Conf. (APBC) 1, 65–72.

42. Tyagi, M., Gowri, V., Srinivasan, N., de Brevern, A. & Offmann, B. (2006). A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins: Struct. Funct. Bioinf. 65, 32–39.

43. Yang, J. & Tung, C. (2006). Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659.

44. Friedberg, I., Harder, T., Kolodny, R., Sitbon, E., Li, Z. & Godzik, A. (2006). Using an alignment of fragment strings for comparing protein structures. Bioinformatics, 23, e219–e224.

45. Lo, W.-C., Huang, P.-J., Chang, C.-H. & Lyu, P.-C. (2007). Protein structural similarity search by Ramachandran codes. BMC Bioinformatics, 8, 307–320.

46. Zheng, W. M. (2008). The use of a conformational alphabet for fast alignment of protein structures. In ISBRA 2008 (Mandoiu, I., Sunderraman, R. & Zelikovsky, A., eds), pp. 331–342, Springer Verlag, Berlin, Germany.

47. Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H. et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

48. Frishman, D. & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins: Struct. Funct. Genet. 23, 566–579.

49. Pearson, W. & Lipman, D. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

50. Kraulis, P. J. (1991). Molscript: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr. 24, 946–950.

51. Felsenstein, J. (1989). Phylogeny inference package (version 3.2). Cladistics, 5, 164–166.

52. Fitch, W. & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279–284.

53. Kruskal, J. & Wish, M. (1978). Multidimensional Scaling. Sage Publication, Beverly Hills, CA.

54. Rost, B. (1997). Protein structures sustain evolutionary drift. Folding Des. 2, 519–524.

55. Subbiah, S., Laurents, D. & Levitt, M. (1993). Structural similarity of DNA-binding domains of bacteriophage repressors and the globin fold. Curr. Biol. 3, 141–148.

56. Gribskov, M. & Robinson, N. (1996). Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20, 25–33.

57. Leslie, C., Eskin, E., Cohen, A., Weston, J. & Noble, W. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20, 467–476.

58. Baum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164–171.

59. Qiu, J., Hue, M., Ben-Hur, A., Vert, J.-P. & Noble, W. S. (2007). A structural alignment kernel for protein structures. Bioinformatics, 23, 1090–1098.

60. Sonnenburg, S., Raetsch, G., Schaefer, C. & Schoelkopf, B. (2006). Large scale multiple kernel learning. J. Mach. Learn. Res. 7, 1531–1565.

61. Park, B. & Levitt, M. (1995). The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol. 249, 493–507.

62. Melvin, I., Ie, E., Kuang, R., Weston, J., Noble, W. & Leslie, C. (2007). SVM-fold: a tool for discriminative multi-class protein fold and superfamily recognition. BMC Bioinformatics, 8, S2.

63. Xiao, X., Lin, W.-Z. & Chou, K. C. (2008). Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J. Comp. Chem. 29, 2018–2024.

64. Nakashima, H., Nishikawa, K. & Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. Biochem. J. 99, 153–162.

65. Chou, K. (1995). A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21, 319–344.

66. Shamim, M., Anwaruddin, M. & Nagarajaram, H. (2007). Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs. Bioinformatics, 23, 3320–3327.

67. Damoulas, T. & Girolami, M. (2008). Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 24, 1264–1270.


68. Lingner, T. & Meinicke, P. (2008). Word correlation matrices for protein sequence analysis and remote homology detection. BMC Bioinformatics, 9, 259–271.

69. Koehl, P. (2001). Protein structure similarities. Curr. Opin. Struct. Biol. 11, 348–353.

70. Brenner, S., Koehl, P. & Levitt, M. (2000). The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28, 254–256.

71. Taylor, J. S. & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

72. Lodhi, H., Saunders, C., Taylor, J., Cristianini, N. & Watkins, C. (2002). Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444.

73. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–285.

74. Rabiner, L. & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall, Upper Saddle River, NJ.

75. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc., Ser. B, 39, 1–38.

76. Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory, 13, 260–269.

77. Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov. 2, 1–47.

78. Schölkopf, B. & Smola, A. (2001). Learning with Kernels. MIT Press, Cambridge, MA.