
Combining Few Neural Networks for Effective Secondary Structure Prediction

Katia S. Guimarães
Center of Informatics - UFPE
CP 7851, Recife, PE, Brazil
[email protected]

Jeane C. B. Melo
Center of Informatics - UFPE
Physics Math. Dept. - UFRPE
Recife, PE, Brazil
[email protected]

George D. C. Cavalcanti
Center of Informatics - UFPE
CP 7851, Recife, PE, Brazil
[email protected]

Abstract

The prediction of secondary structure is treated with a simple and efficient method. Combining only three neural networks, an average Q3 prediction accuracy per residue of 75.93% is achieved. This value is better than the best results reported on the same test and training database, CB396, using the same validation method. For a second database, RS126, an average Q3 accuracy of 74.13% is attained, which is better than that of each individual method, being surpassed only by CONSENSUS, a rather intricate engine that combines several methods.

The networks are trained with RPROP, an efficient variation of the back-propagation algorithm. Five combination rules are applied independently afterwards. Each one increases the prediction accuracy by at least 1%, owing to the fact that each network converges to a different local minimum. The Product rule yields the best results.

The predictor described here can be accessed at http://biolab.cin.ufpe.br/tools/.

1. Introduction

With the proliferation of sequencing projects, the number of known protein sequences grows at an exponential rate. Research in the protein structure area is of fundamental importance in drug design and in diagnostic methodologies. However, such applications demand knowledge of the functional properties of the proteins or, equivalently, the determination of their three-dimensional structure.

The traditional methods of protein structure determination, such as nuclear magnetic resonance and X-ray crystallography, are expensive, complex, and cannot always be applied. This explains why the number of known protein structures is significantly smaller than the number of known sequences. Nevertheless, a protein's function is closely related to its primary structure (amino acid sequence), so there is great interest in tools that can predict protein structure based solely on the amino acid sequence. Computational techniques appear as one strong option in the attempt to automate that process.

Due to the complexity involved, that task is usually broken into a series of intermediate steps, for example, locating in the sequence the substructures common in the 3D conformation, such as alpha-helices, beta-strands, and coils. That problem is called Protein Secondary Structure Prediction and is the one treated in the present work.

Machine learning techniques, such as artificial neural networks, have been applied to the prediction of protein secondary structure over the last fifteen years [13, 15, 17, 7]. Different architectures, algorithms, and types of input have been explored to attain better prediction accuracy. In many cases, training and test sets are developed for a given application, which makes it difficult to compare results. Nonetheless, some databases are commonly used in articles addressing this problem. We report the results of experiments with our approach when applied to two known databases:

1. RS126, developed by Rost and Sander [21], which was used as a case study by other authors (e.g., [15]), and

2. CB396, developed by Cuff and Barton [4], since that work itself contains a comparison of several methods, and the sequences in that dataset were chosen through a more selective procedure that eliminates similar sequences and any sequence present in RS126.

Our goal was to develop a neural network that was at the same time simple and efficient. We wanted results comparable to or even better than those of previously developed methods, but still a tool that did not require exceptional machinery. The implementation is computationally thrifty, and the results are better than those achieved by almost all of the best predictors cited in the literature that were trained on the same databases.

This article is organized as follows. In Section 2 we present an overview of different approaches to the protein secondary structure prediction problem, including the work of Cuff and Barton. The specific methods applied in the predictor reported here are explained in Section 3. The results and comparisons with other works are shown in Section 4. Section 5 presents the conclusions.



2. Overview

From a computational point of view, the prediction of protein secondary structure is a classification problem. The input data are sequences over an alphabet of twenty letters (the amino acids), and the goal is to classify each residue of the sequence into one of the substructures that compose the 3D conformation of a protein: alpha-helix, beta-strand, or coil.

Machine learning techniques, such as artificial neural networks, have been known to achieve significant results here. The first work to use neural networks (NNs) for the prediction of protein secondary structure and obtain results better than statistical techniques was that of Qian and Sejnowski [13]. The NNs they used were fully connected multi-layer perceptrons with one hidden layer. They reported a Q3 rate of 64.3%.

Motivated by the success of this work, other researchers have used neural networks for the prediction problem. Rost and Sander [19] applied a structure similar to that of Qian and Sejnowski, introducing some changes, such as the addition of evolutionary information through the use of profiles. With this approach they achieved an accuracy of 71.4%. Using the same database and a simpler method, Riis and Krogh [15] reached a Q3 percentage of 66.3% when using seven-fold cross-validation on the sequences, and 71.3% when they combined the output with multiple alignments.

More recent works point to the use of "distance information" or profiles as a way to improve prediction, replacing the traditional approach in which only windows of the sequence, carrying purely local information, are presented to the network [16]. In particular, the use of PSI-BLAST profiles as input data has improved prediction accuracy significantly, in some cases reaching 77.6% [7]. Results even better than those were achieved by Pollastri et al. [12], who, using recurrent neural networks and PSI-BLAST, reached an accuracy of about 78% correct predictions with a training set of 1180 sequences.

Another impressive result was obtained by Petersen et al. [11], who also used PSI-BLAST profiles. The prediction was made on two levels: the three output classes were replaced by nine units, and the residues before and after the central one were also coded (a procedure called output expansion). The number of networks considered in this case was 800, and the combination achieved accuracies from 77.2% to 80.2%, with a training set of size 1032.

One can observe that, in order to achieve better results, the computational resources required have become higher and higher. In an effort to reverse that trend, a simple architecture is proposed in the present work. On the other hand, comparing results is not easy, since it is common for a new database to be developed for each new predictor. To overcome this difficulty, databases already used in previous works were chosen to evaluate the method presented here.

A few years ago, Cuff and Barton [4] compared the performance of four predictors: DSC [10], PHD [22], NNSSP [23], and PREDATOR [6], using two databases: one developed by Rost and Sander [21] and another developed by themselves. The best single method evaluated was PHD. It achieved a Q3 of 73.5% on the Rost and Sander database (RS126), when executed over the multiple alignments generated by Cuff and Barton, and 71.9% on the Cuff and Barton database (CB396). The best results overall were obtained by combining the four methods, a combination called CONSENSUS. With the CONSENSUS predictor, Cuff and Barton achieved 74.8% for RS126 and 72.9% for CB396.

These values were used as references to evaluate the method presented here. Our predictor has a Q3 accuracy per residue of 74.13% on the RS126 database (comparable with the performance of CONSENSUS) and an even better performance, 75.93%, when CB396 is used. The details of how this performance was achieved are explained in the following sections.

3. Methods

In this section we describe the databases, neural network architectures, algorithms, and input handling.

3.1. Database Descriptions

The experiments were performed with two databases, one proposed by Rost and Sander [21], referred to as RS126, and another developed by Cuff and Barton [4], called CB396. In both databases the assignment of secondary structure types was done by the DSSP program [9]. The eight states returned were converted to three classes: helix, strand, and coil.
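As a concrete illustration, the sketch below shows one common eight-to-three reduction in Python. The exact mapping adopted in this work is not stated, so the scheme used here (H, G, I to helix; E, B to strand; everything else to coil) is an assumption, though it is the one most frequently seen in the literature.

```python
# A minimal sketch of the DSSP eight-to-three state reduction.
# The precise scheme used in the paper is not specified; this mapping
# is assumed.

DSSP_TO_THREE = {
    "H": "H",  # alpha-helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi-helix
    "E": "E",  # extended strand
    "B": "E",  # isolated beta-bridge
    "T": "C",  # turn
    "S": "C",  # bend
    "-": "C",  # no assigned structure
}

def reduce_states(dssp_string: str) -> str:
    """Convert a DSSP eight-state string to helix/strand/coil."""
    return "".join(DSSP_TO_THREE.get(s, "C") for s in dssp_string)

print(reduce_states("HHHHTTEEEE-"))  # -> HHHHCCEEEEC
```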

The RS126 database consists of 126 non-homologous globular proteins taken from the HSSP database version 1.0, release 25.0 [24]. The homology threshold is measured by pairwise sequence similarity: any two proteins in the database have a pairwise sequence identity of less than 25% over lengths greater than 80 residues.

The CB396 database, developed by Cuff and Barton, is composed of 396 sequences originally from 3Dee, a database of structural domain definitions. The original set was reduced in the following way: instead of using a percentage identity cutoff, a sensitive sequence comparison algorithm and cluster analysis were applied. Then, multi-segment domains were removed, and all sequences with X-ray crystal structures with resolution below 2.5 Angstroms were filtered. After that, sequences with some similarity to those in the RS126 database were removed.



3.2. Evaluation Method

The evaluation method used was an economical variation of the jack-knife process. In a full jack-knife, a cyclical process is repeated for each protein: one protein is removed for the test set, the network is trained with the remaining N-1 proteins, the prediction is performed on the removed protein, and the accuracy is measured. To save processing time, it is common to split the set of proteins into M subsets and then do M rounds of removing one of those subsets, training the network with the (M-1)N/M remaining proteins, and testing on the N/M removed proteins.

In both experiments seven-fold cross-validation was performed; in other words, the value of M in the procedure just described was seven.

The CB396 database was divided randomly into seven subsets of approximately the same size. For the RS126 database we adopted the same partition used by Riis and Krogh [15]. The results reported here are the averages of the prediction accuracies obtained with the seven different test sets.
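A minimal Python sketch of this protocol follows. The `train` and `evaluate` routines are hypothetical placeholders for network training and Q3 scoring; only the partitioning and averaging logic is meant to reflect the procedure described above.

```python
# A sketch of M-fold cross-validation, assuming `proteins` is a list of
# (sequence, structure) pairs and `train` / `evaluate` are hypothetical
# training and scoring routines.
import random

def cross_validate(proteins, M=7, seed=0):
    shuffled = proteins[:]
    random.Random(seed).shuffle(shuffled)       # random partition (as for CB396)
    folds = [shuffled[i::M] for i in range(M)]  # M subsets of ~equal size
    accuracies = []
    for i in range(M):
        test_set = folds[i]
        train_set = [p for j, f in enumerate(folds) if j != i for p in f]
        model = train(train_set)                # train on the (M-1)N/M remaining
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / M                  # average over the M rounds
```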

3.3. Training Algorithm

Instead of the standard back-propagation algorithm, an efficient variation of it, the RPROP algorithm, was used to train the networks.

RPROP ("Resilient PROPagation") is a learning scheme for neural networks developed by Riedmiller and Braun [14]. In contrast to other gradient-descent algorithms, it does not use the magnitude of the gradient, only its sign.

RPROP performs a direct adaptation of each weight based on local gradient information. The weight update follows a simple rule: if the partial derivative dE/dw is positive (increasing error), the weight is decreased by an update-value; if the derivative is negative (decreasing error), the update-value is added to the weight. The update-value itself is adapted based on the direction (sign) of the gradient: it is increased by a factor when the current gradient has the same direction as the previous one, and decreased by a factor when the gradient direction flips.

Besides the use of the derivative's sign instead of its magnitude, other advantages can be mentioned: the number of epochs and the computational effort are reduced in comparison to the original gradient-descent procedure, and the algorithm is robust with respect to the choice of its initial parameters.
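The update scheme can be summarized in a few lines of Python. This is a sketch of the simplest sign-based variant, without the weight-backtracking step of the full algorithm of Riedmiller and Braun; the increase/decrease factors and step bounds follow the values commonly suggested for RPROP.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5,
                 step_max=50.0, step_min=1e-6):
    """One RPROP weight update (sign-based sketch, without backtracking).
    w, grad, prev_grad, and step are NumPy arrays of the same shape."""
    same = grad * prev_grad
    # Same gradient direction as before: grow the per-weight step;
    # direction changed (a minimum was overstepped): shrink it.
    step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)
    # Move against the sign of the gradient by the adapted step.
    return w - np.sign(grad) * step, step
```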

3.4. Combining Classifiers

It is known that combining neural networks has contributed in the past to increased accuracy in secondary structure prediction [12, 4], sometimes using a huge number of classifiers [11]. Combining classifiers aims to improve classification by taking advantage of each one individually. A necessary condition is that each classifier in the combination converge to a different local minimum, i.e., that they do not all make the same errors. Strategies to satisfy this requirement include: initializing the weights of the neural networks in different regions of the initialization space, modifying the learning algorithm used for each network, or modifying the topology of each network. In the experiments reported here the last strategy was adopted.

The input data were prepared with the following procedure. For each sequence in the two databases, the PSI-BLAST search program was executed with default parameters against the NCBI Non-redundant Protein Sequence Database. The process was stopped after three iterations.

For each sequence of length n_i, i = 1, ..., |database|, (n_i - 12) sets of thirteen adjacent rows of the position-specific scoring matrix generated as part of the PSI-BLAST search were taken. Hence each set contained 13 x 20 = 260 values, which were used as input for the networks.
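A sketch of this windowing step is shown below, assuming the PSI-BLAST position-specific scoring matrix is available as an n_i x 20 NumPy array (for instance, parsed from blastpgp output). Each window of thirteen rows is flattened into one 260-value input vector, presumably for its central residue; the handling of the twelve border residues is not detailed in the text and is left out here.

```python
import numpy as np

def pssm_windows(pssm: np.ndarray, width: int = 13) -> np.ndarray:
    """Slide a window of `width` adjacent PSSM rows over an (n x 20)
    matrix, returning (n - width + 1) flattened vectors of width * 20
    values each -- for width 13, that is (n - 12) vectors of 260 values."""
    n = pssm.shape[0]
    return np.stack([pssm[j:j + width].ravel() for j in range(n - width + 1)])

# Example: a protein of length 100 yields 100 - 12 = 88 input vectors.
windows = pssm_windows(np.random.rand(100, 20))
print(windows.shape)  # (88, 260)
```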

Three fully connected neural networks with one hidden layer, labelled network1, network2, and network3, were used. The number of nodes in the hidden layer was different for each network. Those numbers were chosen after several preliminary tests; the best results were found in the interval between 30 and 40 nodes. Therefore, network1, network2, and network3 were trained with 30, 35, and 40 hidden nodes, respectively. The output layer had three nodes in all networks, one for each class (helix, strand, and coil).
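The ensemble topology can be sketched as follows. The hidden-layer activation function is not specified in the paper, so tanh is assumed here, and training (with RPROP) is omitted; the sketch only fixes the 260-to-3 input/output interface and the three hidden-layer sizes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_mlp(n_hidden, n_in=260, n_out=3, seed=0):
    """One fully connected network with a single hidden layer.
    Weights are initialized randomly; RPROP training is omitted."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(net, x):
    h = np.tanh(x @ net["W1"] + net["b1"])     # hidden layer (tanh assumed)
    return softmax(h @ net["W2"] + net["b2"])  # helix/strand/coil scores

# The three ensemble members differ only in hidden-layer size.
networks = [init_mlp(h, seed=i) for i, h in enumerate((30, 35, 40))]
```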

The results of the three networks were combined independently using five different rules: Voting, Product, Average, Maximum, and Minimum.

In the Voting method, the class chosen is the one with the greatest frequency, i.e., the most voted. For all the other rules the softmax function was used to normalize the outputs of the neural networks.

The Product and Average rules are based on the operations that name them: the product (respectively, average) of the three networks' outputs is computed for each class, and the pattern is attributed to the class that attains the highest product (average).


The Minimum rule works as follows: for a test pattern, the lowest output of the three networks is taken for each class, and the class attributed is the one with the highest value among these minima. The Maximum rule is analogous: for each class the greatest output over the networks is taken, and the class with the highest such value is adopted.
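A compact sketch of the five rules, operating on the softmax-normalized outputs of the three networks, might look like this. The Maximum rule is implemented in its standard form (the class with the highest maximum), matching the description above.

```python
import numpy as np

def combine(outputs, rule="product"):
    """Combine softmax outputs of shape (n_networks, n_classes) and
    return the index of the winning class under the given rule."""
    if rule == "voting":
        votes = np.argmax(outputs, axis=1)  # each network votes for one class
        return int(np.bincount(votes, minlength=outputs.shape[1]).argmax())
    scores = {
        "product": outputs.prod(axis=0),  # per-class product across networks
        "average": outputs.mean(axis=0),
        "maximum": outputs.max(axis=0),
        "minimum": outputs.min(axis=0),   # then take the best of the minima
    }[rule]
    return int(scores.argmax())

outputs = np.array([[0.6, 0.3, 0.1],
                    [0.5, 0.4, 0.1],
                    [0.7, 0.2, 0.1]])
for rule in ("voting", "product", "average", "maximum", "minimum"):
    print(rule, combine(outputs, rule))  # all pick class 0 here
```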

4. Discussion of Results

Seven-fold cross-validation was performed for both databases, RS126 and CB396. The results of each experiment are reported here; the percentages given are the average per-residue Q3 prediction accuracy.
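For reference, the per-residue Q3 measure reported throughout this section is simply the fraction of correctly classified residues, as in the small sketch below.

```python
def q3(predicted: str, observed: str) -> float:
    """Per-residue Q3 accuracy: the fraction of residues whose predicted
    three-state class (H/E/C) matches the DSSP-derived assignment."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q3("HHHECC", "HHEECC"))  # -> 83.33... (5 of 6 residues correct)
```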

Using the RS126 database, network1 got the best individual result, 71.7%, although network2 and network3 obtained similar results. Network1 also got the best result with the CB396 database, 74.5%, but once again the performances of network2 and network3 were approximately the same.

The next step in the procedure combined these three networks using the five combination rules. The performance resulting from each combination rule, as well as each individual network's performance, is shown in Figures 1 and 2.

Notice that on both databases the Product and Average combination rules achieved excellent and similar accuracy rates; the difference between them is less than 0.05%. The best results, however, were found with the Product rule: 74.13% on the RS126 database and 75.93% on the CB396 database. An interesting observation is that most predictors in the literature that use combination apply the Voting rule. Hence, we believe that the combination rule was one of the distinguishing points of the proposed method.

It is important to emphasize that the network combination increased the accuracy rate on both databases. Such a gain is due to the fact that each neural network converges to a different local minimum, a necessary condition for successfully applying this technique. On RS126 the gain was almost 3%, and on the CB396 database it was approximately 2%. Hence, the computational cost of the combination is justified by the improvement in accuracy.

Table 1 shows the best results found in the literature for predictors trained on the databases used in this work [4]. The simple architecture proposed here achieved results comparable to the best one found with the RS126 database. In fact, it performed better than each individual predictor, being surpassed only by CONSENSUS, which is a combination of several methods.

The most significant contribution was the improvement of almost 3% in accuracy on the CB396 database relative to the best result reported by Cuff and Barton on the same database, 72.9%, obtained by CONSENSUS.

Figure 1. Results on RS126 Database

Figure 2. Results on CB396 Database

One intriguing and noticeable fact is that all the predictors discussed had a better performance on RS126 than on CB396, while with our predictor it was the other way around. Actually, it is well accepted that increasing the number of non-homologous proteins in the training set has the advantage that the extra biological information may reduce the risk of overfitting by improving the network's ability to discriminate between different types of secondary structure. In that sense, the behavior of the predictor presented here is the more natural one.

5. Conclusion

An efficient and simple secondary structure prediction classifier was presented. To allow comparison with other methods, the RS126 and CB396 databases were chosen to provide the training and testing sets for the networks.

One optimization was the use of the RPROP algorithm for training the networks. RPROP helps reduce the number of epochs and the computational effort, and it is also robust with respect to the adaptation of its initial parameters.



Table 1. Average Q3 prediction accuracy using the RS126 and CB396 databases.

Method         RS126 (%)   CB396 (%)
PHD            73.5        71.9
DSC            71.1        68.4
PREDATOR       70.3        68.6
NNSSP          72.7        71.4
CONSENSUS      74.8        72.9
PRESENT WORK   74.1        75.9

A considerable improvement came from the combination of three networks, with 30, 35, and 40 nodes in the hidden layer. The results attained were either comparable to or better than the best ones reported for classifiers trained on the same databases.

Using only three networks with one hidden layer and combining their results, the Q3 accuracy per residue obtained by this classifier was 74.13% for the RS126 database and 75.93% for the CB396 database. With a significantly more complex method, resulting from the combination of four different predictors, Cuff and Barton achieved 74.8% accuracy for RS126 and 72.9% for the CB396 database.

Five different combination rules were applied, and all of them improved on the performance of each individual neural network by at least 1%. The best results in the reported experiments were attained with the Product rule.

A preliminary and condensed version of this work was published as a two-page research report poster (#48) at the European Conference on Computational Biology 2002. The predictor described here is implemented as a web server available at http://biolab.cin.ufpe.br/tools/.

6. Acknowledgements

The authors would like to thank the Brazilian sponsoring agency CNPq; the first author would also like to thank FACEPE.

References

[1] Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Research, 25:3389–3402, 1997.

[2] Baldi, P. and Brunak, S., Bioinformatics: The Machine Learning Approach, MIT Press, 2001.

[3] Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G., Exploiting the past and the future in protein secondary structure prediction, Bioinformatics, 15:937–946, 1999.

[4] Cuff, J. A. and Barton, G. J., Evaluation and improvement of multiple sequence methods for protein secondary structure prediction, PROTEINS: Structure, Function and Genetics, 34:508–519, 1999.

[5] Eddy, S. R., Profile hidden Markov models (review), Bioinformatics, 14(9):755–763, 1998.

[6] Frishman, D. and Argos, P., Seventy-five percent accuracy in protein secondary structure prediction, Proteins, 27:329–335, 1997.

[7] Jones, D. T., Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol., 292:195–202, 1999.

[8] Karplus, K., Barrett, C., and Hughey, R., Hidden Markov models for detecting remote protein homologies, Bioinformatics, 14(10):846–856, 1998.

[9] Kabsch, W. and Sander, C., A dictionary of protein secondary structure, Biopolymers, 22:2577–2637, 1983.

[10] King, R. and Sternberg, M., Identification and application of the concepts important for accurate and reliable protein secondary structure prediction, Protein Sci., 5:2298–2310, 1996.

[11] Petersen, T. N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S., Gippert, G. P., and Lund, O., Prediction of protein secondary structure at 80% accuracy, Proteins, 41:17–20, 2000.

[12] Pollastri, G., Przybylski, D., Rost, B., and Baldi, P., Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins, 47:228–235, 2002.

[13] Qian, N. and Sejnowski, T. J., Predicting the secondary structure of globular proteins using neural network models, J. Mol. Biol., 202:865–884, 1988.

[14] Riedmiller, M. and Braun, H., A direct adaptive method for faster backpropagation learning: The RPROP algorithm, Proc. ICNN, 586–591, 1993.

[15] Riis, S. K. and Krogh, A., Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments, J. Comp. Biol., 3:163–183, 1996.


[16] Rost, B., Review: Protein secondary structure prediction continues to rise, J. Struct. Biol., 134:204–218, 2001.

[17] Rost, B., PHD: Predicting one-dimensional protein structure by profile-based neural networks, Methods in Enzymology, 266:525–539, 1996.

[18] Rost, B. and Sander, C., Third generation prediction of secondary structure, Protein Structure Prediction: Methods and Protocols, 71–95, 2000.

[19] Rost, B. and Sander, C., Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol., 232:584–599, 1993.

[20] Rost, B. and Sander, C., Improved prediction of protein secondary structure by use of sequence profiles and neural networks, Proc. Natl. Acad. Sci. USA, 90:7558–7562, 1993.

[21] Rost, B. and Sander, C., Combining evolutionary information and neural networks to predict protein secondary structure, Proteins, 19:55–72, 1994.

[22] Rost, B., Sander, C., and Schneider, R., Redefining the goals of protein secondary structure prediction, J. Mol. Biol., 235:13–26, 1994.

[23] Salamov, A. and Solovyev, V., Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments, J. Mol. Biol., 247:11–15, 1995.

[24] Sander, C. and Schneider, R., Database of homology-derived protein structures and the structural meaning of sequence alignment, Proteins, 9(1):56–68, 1991.
