Quantifying the relationship of protein burying depth and sequence



PROTEINS: STRUCTURE, FUNCTION, BIOINFORMATICS

Zheng Yuan1* and Zhi-Xin Wang2

1 Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, Brisbane, Australia

2 Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China

ABSTRACT

Protein burying depth (BD) is a structural descriptor that is used not only to determine whether a residue is exposed or buried, but also to determine how deeply a residue is buried. The widely used solvent accessible surface area mainly focuses on protein surface residues, while protein BD can provide more detailed information about the arrangement of buried residues, which may be used to study protein deep-level structure and the formation of the protein folding nucleus. In this work, we analyse the relationship between protein BD and sequence, and describe it by nonlinear functions estimated by support vector machines. We examine the functions by crossvalidation tests and find a strong correlation between residue BD and local sequence environment. By further taking into account the size of the molecule in which a residue is located, we find that the correlation coefficient between predicted and observed depths improves from 0.60 to 0.65. Moreover, nearly half of the deepest 10% of residues in a protein sequence can be correctly predicted. Our study suggests that a residue's burying extent can be predicted, to some degree, from the residue itself and its local neighbouring residues. The methods used to estimate the sequence-depth functions are expected to become more useful in the investigation of protein structures and folding mechanisms.

Proteins 2008; 70:509–516. © 2007 Wiley-Liss, Inc.

Key words: bioinformatics; support vector regression; protein structure prediction; protein sequence analysis; burying extent.

The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/0887-3585/suppmat

Grant sponsor: Australian Research Council.

*Correspondence to: Zheng Yuan, Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, Brisbane, Australia. E-mail: [email protected]

Received 11 December 2006; Revised 20 February 2007; Accepted 27 March 2007

Published online 17 August 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21545

INTRODUCTION

A protein adopts a particular structural arrangement to fulfill its function. Some residues are located at the protein surface, whereas others are buried in the protein interior, where they play roles such as forming a hydrophobic core that maintains the folded state. Buried residues have been found to be more conserved during evolution and have therefore proved useful for protein fold and function prediction [1,2]. As a protein structural descriptor, the depth or burying depth (BD) of an atom or residue has been proposed as the distance between it and the nearest water molecule or the protein surface, and it measures the extent to which the atom or residue is buried [3,4]. This parameter has also been applied to a number of problems, such as analyzing amide hydrogen/deuterium exchange rates in nuclear magnetic resonance (NMR) experiments [3], predicting protein active sites [5], and improving protein structural alignments [6].

Because they are involved in interactions with other biological molecules, surface residues have been investigated extensively, especially in the area of protein drug design [7]. Accordingly, accessible surface area (ASA) has been widely adopted to define surface residues [8], and a large number of sequence-based prediction methods built on it have been developed in recent years [9–20]. However, ASA cannot provide information about the structural arrangement of buried residues, whose ASAs are zero or near zero. As a complement, BD can provide additional and more accurate information about the interior of a molecule. In particular, protein BD may reflect long-range contact information and may thus be helpful for protein structure prediction. To date, however, it is still unknown whether BD can be predicted directly from sequence, or whether the local sequence environment of a buried residue can determine its burying extent.

In this study, we use a regression approach to quantify the relationship between protein BD and sequence. Support vector regression [21] is a sophisticated machine learning method that can reveal the relationship between two objects through a complicated function derived from learning samples. Applying this method to a large protein structure dataset, we found a strong correlation between protein BD and the local sequence. The best estimated function predicts residue BDs with a correlation coefficient of 0.65. Further analyses show good identification of the most deeply buried residues. Our work provides a new method to predict protein BD, which can be used for analyzing the buried residues and folding nuclei of novel proteins.

METHODS

Preparation of dataset

To examine the relationship between depth and sequence, we prepare a large dataset of 923 protein chains using PDB-REPRDB [22]. The structures, all solved by X-ray crystallography, are of high quality, with resolution ≤ 2.0 Å and R-factor ≤ 0.2. We exclude protein chains shorter than 60 amino acids. No two chains in the dataset have pairwise sequence identity of more than 25%. All structures of the biological molecules that contain these 923 protein chains are obtained from the RCSB PDB database [23] for analysis. The names of the 923 protein chains are given in Table I of the supplementary materials.

Calculation of solvent accessible surface and protein BD

There are a number of ways to calculate protein BDs. The concept of protein BD was first introduced by Pedersen et al. [3] in their NMR study, in which the BD of an atom was defined as its distance from the nearest water molecule. Atom depth has also been defined by other authors as the distance from the protein solvent surface [4] or from the nearest solvent atom [2]. A recently published method also considers the effect of protein shape [24]. Despite the different definitions, strong correlations among them may exist. For example, a correlation coefficient of 0.93 was observed between the first two definitions in the same protein structure [25].

In this study, we define the depth of an atom as its distance from the solvent accessible surface. We use the program MSMS [26] to calculate the solvent ASA of each atom and the solvent accessible surface of the whole molecule. The program is run with a probe sphere radius of 1.4 Å. Its output for a protein structure contains a list of vertices that represent the protein surface. The BD of an atom can therefore be defined as the distance between the atom and its nearest vertex, as previously used by Hamelryck [27]. The average depth of all atoms in an amino acid, excluding hydrogen atoms, is regarded as its residue depth. It is worth mentioning that all calculations are based on whole molecules, which may consist of a number of protein subunits, although our analyses use only nonredundant protein chains.
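The nearest-vertex definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the (element, coordinates) atom representation are hypothetical, and the surface vertices are assumed to have already been parsed from the MSMS output into coordinate tuples.

```python
import math

def atom_depth(atom_xyz, surface_vertices):
    """Depth of one atom: distance to its nearest surface vertex."""
    return min(math.dist(atom_xyz, v) for v in surface_vertices)

def residue_depth(atoms, surface_vertices):
    """Average depth over the non-hydrogen atoms of one residue.

    atoms: list of (element, (x, y, z)) pairs (hypothetical layout).
    """
    heavy = [xyz for element, xyz in atoms if element.upper() != "H"]
    if not heavy:
        raise ValueError("residue has no non-hydrogen atoms")
    return sum(atom_depth(xyz, surface_vertices) for xyz in heavy) / len(heavy)

# Toy example: two surface vertices and a three-atom "residue".
vertices = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
atoms = [("C", (0.0, 0.0, 3.0)), ("N", (0.0, 4.0, 0.0)), ("H", (0.0, 0.0, 100.0))]
depth = residue_depth(atoms, vertices)  # the hydrogen atom is ignored
```

A brute-force minimum over all vertices is adequate for a sketch; for whole structures a spatial index would be the usual choice.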

Residue coding

Although the distribution of BDs should ultimately be determined by the whole protein sequence, the depth of a residue or atom may be determined to some extent by its local sequence environment, that is, the local segment of amino acids. On the basis of this assumption, we consider the protein segment centered on a target residue and adopt the sliding window technique previously used in the prediction of protein secondary structure [28]. The window size is set to 15 amino acids, so that the seven neighbouring amino acids on the N-terminal side and the seven on the C-terminal side of a target residue are also used. To code each amino acid in the window, we use the normalized values in PSI-BLAST scoring matrices [29] obtained by three rounds of searching against the nonredundant protein sequence database of the National Center for Biotechnology Information [30,31]. Each amino acid is represented by a 21-dimensional vector, and the 15 amino acid window is therefore represented by a 315-dimensional vector.

The choice of seven amino acids on each side of the central residue is based on the observation that a small window may not adequately exploit the sequence information, while a large window may introduce more noise and therefore may not help the prediction. In addition, a window of 15 amino acids may contain short-, medium-, and long-range interactions in protein structures. By one definition [32,33], short-range interactions are contributed by residues within two residues of the central residue, medium-range interactions by residues three or four residues away, and long-range interactions by residues more than four residues away. Accordingly, on each side of the central residue in a 15 amino acid window, two positions fall in the short range, two in the medium range, and three in the long range, so that residues participating in short-, medium-, and long-range interactions have a balanced distribution with a ratio of 2:2:3.
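The sliding-window coding can be illustrated with a short Python sketch. It assumes the per-residue profile is already available as rows of 20 normalized PSI-BLAST scores. The text does not say what the 21st component holds, so the sketch assumes it is a flag marking window positions that fall beyond a chain terminus, which is a common convention; the function name and data layout are hypothetical.

```python
WINDOW = 15          # window size used in the text
HALF = WINDOW // 2   # seven neighbours on each side

def encode_window(profile, center):
    """Build the 315-dimensional vector for the residue at index `center`.

    profile: list of rows, each holding 20 normalized profile scores.
    Each window position contributes 21 values: 20 scores plus one
    terminus flag (assumed meaning of the 21st dimension).
    """
    features = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(profile):
            features.extend(list(profile[pos]) + [0.0])
        else:
            # Position lies outside the chain: zero scores, flag set.
            features.extend([0.0] * 20 + [1.0])
    return features

# Toy profile for a 10-residue chain with constant scores.
profile = [[0.1] * 20 for _ in range(10)]
vector = encode_window(profile, 0)
```

For the first residue of a chain, the seven N-terminal window positions are all padding, so their flags are set while the central position carries real scores.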

Estimation of the relationship between BD and sequence

To derive the function matching the BD of a residue or atom with its coding vector, we use nonlinear support vector regression machines, as used in other applications such as the prediction of solvent accessibility [14,15,17], flexibility [31], and contact numbers [34]. To achieve the nonlinear mapping, the Gaussian kernel

$$K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right)$$

is used. Several parameters control the learning, for example, the regularization parameter C and the error tube ε. We keep the error tube ε = 0.01 throughout the study while trying different values for C and γ. To obtain the final solutions we use the program SVMlight [35].
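The two ingredients named above, the Gaussian kernel and the error tube, can be written out directly. The sketch below is not SVMlight and does not solve the regression problem; it only evaluates the kernel and the ε-insensitive loss that standard support vector regression uses. The function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def epsilon_insensitive_loss(observed, predicted, epsilon=0.01):
    """Errors inside the tube of half-width epsilon cost nothing."""
    return max(0.0, abs(observed - predicted) - epsilon)

same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.1)   # identical vectors
near = gaussian_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1)   # unit distance apart
inside_tube = epsilon_insensitive_loss(1.0, 1.005)
outside_tube = epsilon_insensitive_loss(1.0, 1.5)
```

A larger γ makes the kernel fall off faster with distance, which is why γ is tuned together with C.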

Evaluation of prediction performance

A threefold crossvalidation test is used to avoid biased evaluation: the 923 protein chains are randomly divided into three groups, each with a roughly equal number of protein sequences. The proteins in each group are tested in turn, while the proteins in the other two groups are merged to estimate the sequence-depth function. The correlation coefficient between predicted and observed depths is calculated to show prediction performance.

To further analyse the efficiency of the function estimation, we formulate the predictions as two-class problems by using a number of thresholds to divide residues into two groups: exposed and buried. In these cases, the overall accuracy, specificity, and sensitivity are also calculated.
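The evaluation protocol can be sketched as follows. The split is made at the level of protein chains, as the text describes, so that residues of a test protein never appear in training. The helper names and the fixed random seed are assumptions for illustration, and the correlation coefficient is taken to be the Pearson coefficient.

```python
import math
import random

def three_fold_groups(chain_ids, seed=0):
    """Randomly divide chain identifiers into three roughly equal groups."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::3] for i in range(3)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

groups = three_fold_groups(range(923))
group_sizes = sorted(len(g) for g in groups)
```

Each group serves once as the test set while the other two are merged for training.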

RESULTS AND DISCUSSION

The distribution of BDs

There are about 16,200 residues in our 923 protein chains, and the distribution of residue depths is given in Figure 1. We calculated the depth values, rounded them to two decimal places, and counted the frequency of each value in the whole dataset. As shown in Figure 1, nearly all residues are located in the depth range from 1.67 to 10 Å, which covers 99.7% of the total residues. The distribution is not uniform: the majority of residues (about 70% of the total) have BDs of no more than 2.5 Å, because most residues are on or near protein surfaces. The mean and standard deviation of the distribution are 2.64 and 1.41 Å, respectively. We also calculated the BD of Cα atoms and found that residue depth and atom depth were highly correlated, with a correlation coefficient of 0.96. On the basis of this observation, we focused our analyses mainly on residue depths, because analyses of atom depths might yield similar results.

Figure 1. Distribution of occurring frequency for residue burying depth.

The relationship of protein BD and protein size

To measure the effect of protein size on the calculation of BDs, we calculated the BDs of all atoms (excluding hydrogen atoms) in a protein and averaged them to represent the protein. Protein size can be reflected by the number of amino acids or by the molecular weight, which accounts for the differences among amino acids. Both measures have the same correlation coefficient, 0.73, with the averaged atom depths. Figure 2 shows the distribution of averaged depths with regard to the numbers of amino acids in proteins. BDs are clearly strongly dependent on protein size, and this feature is essential for the estimation of sequence-depth functions. In some other studies, BDs in proteins of different sizes were first normalized to a similar level before comparison [5]. In that approach, a normalizing function has to be chosen a priori without knowing its best form. To avoid this, we use protein size and local sequence information together as input to estimate the sequence-depth function.

Figure 2. The average burying depths of proteins according to their molecular sizes.

Table I. Prediction Performance for Burying Depths with Different Input Information and Different Control Parameters

Dataset | Input | Control parameters | Correlation coefficient | Mean absolute error (Å) | Mean relative error (%)
Total residues | Local window | γ = 0.1, C = 0.5 | 0.62 | 0.63 | 18.8
Total residues | Local window | γ = 0.01, C = 2 | 0.61 | 0.62 | 18.3
Total residues | Local window + protein size | γ = 0.1, C = 0.5 | 0.65 | 0.61 | 18.5
Total residues | Local window + protein size | γ = 0.01, C = 2 | 0.65 | 0.60 | 18.0
Buried residues | Local window + protein size | γ = 0.1, C = 0.5 | 0.52 | 1.07 | 24.5
Buried residues | Local window + protein size | γ = 0.01, C = 2 | 0.52 | 1.06 | 24.4

The distributions of BDs according to different secondary structures

For all residues, we extracted their secondary structures from the Dictionary of Protein Secondary Structure [36] and classified them into three classes: helix, β-strand, and coil. The distributions of residue depths for the three classes are shown in Figure 3. For each distribution we computed the mean and standard deviation. The values for helix, β-strand, and coil are 2.75 ± 1.42, 3.31 ± 1.78, and 2.23 ± 0.99 Å, respectively. Coil residues are more frequently found with small BDs (more exposed), while β-sheet residues tend to have larger BDs (more buried). This result is consistent with the previous observation of other authors [2], although they adopted a different depth definition and a different dataset. To examine whether the distributions were significantly different, we performed Kolmogorov-Smirnov tests and found that the P-values between the distributions of separate classes were all near zero. Thus, our analyses confirm the existence of a correlation between protein BDs and secondary structures.
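For readers unfamiliar with the Kolmogorov-Smirnov comparison mentioned above, the two-sample statistic is simply the largest gap between the two empirical cumulative distributions. The sketch below computes only that statistic; the P-value calculation is omitted, and the function name and toy inputs are illustrative assumptions rather than the procedure used in the study.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        # Empirical CDF value = fraction of the sample that is <= x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2], [3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A statistic near 0 means the depth distributions of two classes are indistinguishable, and a value near 1 means they barely overlap.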

The correlation of protein BD and protein solvent ASA

The absolute solvent ASA (in Å²) of each residue was obtained by running the program MSMS [26], while the relative ASA was obtained by normalizing the absolute value by the ASA of the same residue type in a Gly-X-Gly tripeptide with extended conformation [17,37]. We calculated the correlation of BD with absolute ASA and with relative ASA, and found that relative ASA correlates more negatively with depth (−0.55) than absolute ASA does (−0.51).

Since relative ASA is more strongly correlated with BD, we use it to define surface residues. If the relative ASA of a residue is greater than a given threshold, the residue is defined as exposed; otherwise, it is buried. A similar definition applies to BD: for a given BD threshold, a residue is deemed exposed if its depth is less than the threshold. As expected, there should exist a correspondence between relative ASA and BD thresholds that maximizes the consistency between the two definitions. To reflect the agreement, we use the consistency percentage (CP), defined as the percentage of all residues that are consistently assigned by both methods as exposed or buried. In Figure 4(A), we plot the CP values for different pairs of relative ASA and depth thresholds. For each relative ASA threshold, we can find the depth threshold that gives the maximum CP. The matched threshold values and their CP values determine the ridgeline highlighted in blue in Figure 4(A). Projecting the ridgeline onto the horizontal plane defined by the relative ASA and BD axes gives the best match between relative ASA and depth thresholds. This line is shown in green.

More detailed information is given in Figure 4(B). In particular, a depth threshold of 2.54 Å, corresponding to a relative ASA threshold of 5–6%, achieves more than 90% agreement. This cutoff classifies about 28% of the residues as buried. We use these buried residues to make another dataset, called the "buried residue dataset," to further test the relationship between protein BD and sequence. The mean and standard deviation of residue depths in this dataset are 4.34 and 1.68 Å, respectively.
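The consistency percentage and the search for the matching depth threshold can be sketched as below. The function names, the toy values, and the candidate threshold list are hypothetical; relative ASA is assumed to be expressed in percent.

```python
def consistency_percentage(rel_asa, depths, asa_cut, depth_cut):
    """Percentage of residues given the same label by both criteria.

    Exposed by ASA: relative ASA > asa_cut.
    Exposed by depth: depth < depth_cut.
    """
    agree = sum((a > asa_cut) == (d < depth_cut) for a, d in zip(rel_asa, depths))
    return 100.0 * agree / len(depths)

def best_depth_threshold(rel_asa, depths, asa_cut, candidates):
    """Depth threshold with the highest CP for a fixed relative ASA threshold."""
    return max(
        candidates,
        key=lambda cut: consistency_percentage(rel_asa, depths, asa_cut, cut),
    )

# Toy data: relative ASA in percent and residue depth in angstroms.
rel_asa = [50.0, 30.0, 2.0, 0.0]
depths = [1.8, 2.0, 3.5, 5.0]
cp = consistency_percentage(rel_asa, depths, asa_cut=5.0, depth_cut=2.54)
best = best_depth_threshold(rel_asa, depths, 5.0, [1.9, 2.54, 4.0])
```

Repeating the search for each relative ASA threshold would trace out the kind of ridgeline described for Figure 4(A).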

Figure 3. Residue depth distributions according to three secondary structure classes: helix (solid), β-strand (slashed), and coil (dotted).

Numerical solution of the matching function between protein BD and sequence

To estimate the matching function between residue depths and the feature vectors coding residues, we applied a support vector regression approach. Its fit is examined by crossvalidation tests to avoid overestimation; that is, proteins used for testing are not included in the training set used to build the function. For ease of handling the data in the training procedure, all residue depths are normalized by the mean ($\bar{X}$) and standard deviation ($\sigma$) using the formula

$$X_{\mathrm{norm}} = \frac{X - \bar{X}}{\sigma},$$

where the mean and standard deviation are derived from the whole dataset ($\bar{X}$ = 2.64 Å and $\sigma$ = 1.41 Å). The same formula is used in the testing procedure, where a predicted value ($X_{\mathrm{norm}}$) is transformed back to a value with the original meaning (i.e., $X$).
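The normalization and its inverse are a pair of one-line transformations. The sketch below uses the dataset mean and standard deviation quoted in the text; the function and constant names are illustrative.

```python
MEAN_DEPTH = 2.64  # dataset mean residue depth (angstroms), from the text
STD_DEPTH = 1.41   # dataset standard deviation (angstroms), from the text

def normalize(depth):
    """Map an observed depth to the normalized training target."""
    return (depth - MEAN_DEPTH) / STD_DEPTH

def denormalize(normalized_depth):
    """Map a predicted normalized value back to a depth in angstroms."""
    return normalized_depth * STD_DEPTH + MEAN_DEPTH

target = normalize(5.0)
recovered = denormalize(target)
```

Applying `denormalize` to a model output inverts the training-time scaling, so predictions can be compared with observed depths in ångströms.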

As described in the Methods section, each residue is coded by a 315-dimensional vector that contains the information from a local window of 15 amino acids. In addition, protein (molecular) size, derived directly from the protein sequence, is an important feature that can be added (after normalization) as the 316th dimension. We estimate the function using the total residues in our dataset and give the results in Table I. Using only the local window information, the correlation coefficients between predicted and observed depths are around 0.60. When protein size is included as a feature, the correlation coefficients increase to 0.65, with decreased mean absolute and relative errors. The results demonstrate that protein size is a significant factor that influences the estimation accuracy.

If we use BD to define exposed and buried residues, different depth thresholds can be applied and their overall accuracies (the percentages of correctly predicted residues) computed. To verify whether our methods can be used to find buried residues, particularly deeply buried residues, we calculated the specificity and sensitivity, as well as the occurring probabilities of buried residues in our dataset. Specificity is the percentage of correctly predicted buried residues among all predicted buried residues, while sensitivity is the percentage of correctly predicted buried residues among all observed buried residues. Using the model with γ = 0.01, C = 2, and both local window and protein size information, we provide the results in Figure 5. The lowest overall accuracy is 73.1%, corresponding to a threshold of 1.95 Å. The overall accuracies, from 73 to 90% depending on the threshold, are comparable with previously reported results of 72–90% for the prediction of protein surface residues [10,13–15,18,20]. As the threshold increases, both the abundance of residues defined as buried and the prediction sensitivity decrease, suggesting that more deeply buried residues are difficult to predict accurately. However, our methods are far better than random guessing. For instance, with the depth threshold of 2.54 Å, 28.3% of the total residues are buried. If we randomly assign a residue as buried with this probability, only 28.3% of the buried residues are identified correctly. In contrast, the support vector regression method correctly predicts 66.1% of those residues (sensitivity). Even with larger thresholds, the sensitivities are always more than twice the random prediction accuracies. Unlike sensitivity, specificity remains at a relatively stable level, with its lowest values around 70%.

Figure 4. Consistency between the methods of using relative accessible surface area and burying depth to assign exposed and buried residues. A: Consistency percentage is defined as the percentage of residues consistently defined by both methods as exposed or buried among the total number of residues. Its values are plotted against different thresholds of relative accessible surface area and burying depth. B: The best match between the thresholds of relative accessible surface area and burying depth (blue) with their consistency percentage (red).

Figure 5. Prediction accuracies when formulating the predictions as two-class problems using different depth thresholds. Specificity, sensitivity, occurring probability of buried residues, and overall accuracy are plotted against various depth cut-offs.
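The two-class measures can be made concrete with a short sketch that follows the definitions given in the text: a residue counts as buried when its depth is at least the threshold, and "specificity" here is the fraction of predicted buried residues that are truly buried. The function name, the returned keys, and the toy data are hypothetical.

```python
def two_class_metrics(observed, predicted, threshold):
    """Overall accuracy, specificity, sensitivity, and buried fraction."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        obs_buried = obs >= threshold
        pred_buried = pred >= threshold
        if obs_buried and pred_buried:
            tp += 1
        elif pred_buried:
            fp += 1
        elif obs_buried:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty class) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    total = tp + fp + fn + tn
    return {
        "overall_accuracy": ratio(tp + tn, total),
        "specificity": ratio(tp, tp + fp),   # as defined in the text
        "sensitivity": ratio(tp, tp + fn),
        "buried_fraction": ratio(tp + fn, total),
    }

observed = [1.0, 3.0, 4.0, 2.0, 5.0]
predicted = [1.5, 2.0, 3.0, 2.8, 6.0]
metrics = two_class_metrics(observed, predicted, threshold=2.54)
```

The buried fraction is the baseline used in the text: a random assignment with that probability recovers the same fraction of buried residues, so sensitivity is judged against it.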

It is well known that a protein structure is determined by its whole sequence. Whether a residue is buried, and how deeply it is buried, are outcomes of the cooperative effects of all residues. However, when we selected only buried residues from different protein structures for analysis, we still observed a strong correlation of depth with both residue local information and protein size. We trained on the "buried residue" dataset and tested using threefold crossvalidation. The resulting correlation coefficient (CC) is 0.52, and the mean absolute error and mean relative error are 1.06 Å and 24.4%, respectively (Table I). Formulating the predictions as two-class problems with different thresholds to find more deeply buried residues, we give the specificities, sensitivities, overall accuracies, and occurring probabilities in Figure 6. Our results consistently suggest that the local information of a buried residue can, to some degree, determine its burying extent.

The error distribution according to protein size and residue depth

We calculated the correlation coefficient between predicted and observed residue depths for each individual protein. For the 453 protein subunits in the group of small proteins, each of which has no more than 150 amino acids, the average correlation coefficient is 0.60. For large proteins (>150 amino acids), the average correlation coefficient is 0.65. However, the group of large proteins has greater mean absolute and mean relative errors: the mean absolute and relative errors are 0.35 Å and 13.2% for small proteins, and 0.70 Å and 19.8% for large proteins. We also found that monomers are predicted more accurately than multimers, as shown by the mean absolute errors of 0.43 Å for monomers and 0.65 Å for multimers.

For surface residues (residue depth < 2.54 Å), the mean absolute and relative errors are 0.27 Å and 13.6%, respectively. The buried residues have a mean absolute error of 1.44 Å and a mean relative error of 29.3%. In contrast, if we train and test solely on buried residues, we obtain better accuracy, with mean absolute and relative errors of 1.06 Å and 24.4% (Table I).
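The two error measures quoted in this section can be sketched as follows. The text does not spell out the formula for the mean relative error, so the sketch assumes it is the absolute error divided by the observed depth, averaged over residues and expressed in percent; the function names are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Average of |predicted - observed| in the units of the depths."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def mean_relative_error(observed, predicted):
    """Average of |predicted - observed| / observed, in percent (assumed form)."""
    ratios = [abs(p - o) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

mae = mean_absolute_error([2.0, 4.0], [1.0, 5.0])
mre = mean_relative_error([2.0, 4.0], [1.0, 5.0])
```

Under this assumed definition the same absolute error weighs more for a shallow residue than for a deep one, which is worth keeping in mind when comparing surface and buried residues.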

Figure 7 shows two examples that illustrate the prediction performance. The RAS-binding domain of the protein kinase BYR2 (PDB code: 1i35; subunit A) is predicted with a correlation coefficient of 0.64, and formyltransferase (PDB code: 1ftr; subunit A) achieves a correlation coefficient of 0.65. In both cases, the deeply buried residues have larger errors. However, there is good agreement between the predicted and observed regions containing deeply buried residues, as shown by the overlap of the peaks, although the predicted values are frequently smaller than the real values.

Figure 6. Prediction accuracies when formulating the predictions as two-class problems using different depth thresholds. The sequence-depth function is estimated only on the "buried residue" dataset. Specificity, sensitivity, occurring probability of buried residues, and overall accuracy are plotted against various depth cut-offs.

Figure 7. Burying depth predictions for proteins 1i35 and 1ftr (PDB codes). Blue lines and green lines represent the observed and predicted residue depths, respectively. Their absolute difference is shown as a red line. A: The 127 residues in subunit A of protein 1i35 are predicted with a correlation coefficient of 0.64. B: The 296 residues in subunit A of protein 1ftr are predicted with a correlation coefficient of 0.65. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

To further study the errors, we calculate the mean absolute errors in different ranges of residue BD for each secondary structure class. Results are given in Table II. In this dataset, residues in helix, β-sheet, and coil conformations account for 34.2, 21.6, and 44.2% of all residues, respectively. First, the mean absolute errors increase as residues become more deeply buried. This trend holds for all three secondary structure classes. Second, regular secondary structures (helix and β-sheet) are more frequently observed in the protein core. Of the residues with BDs no less than 3.0 Å, 40.7% form helices and 39.5% form β-sheets, both higher than their overall frequencies of 34.2% and 21.6% in the whole dataset. In contrast, the irregular structures (coil) have a lower frequency of 19.8%, compared with an overall frequency of 44.2%. This can also be observed from the different distributions of secondary structures in Figure 3. In addition, the prediction of deeply buried coil residues is less accurate, as shown by their greater errors. Two reasons may account for the worse prediction of deeply buried residues. On the one hand, the under-representation of deeply buried residues in the dataset makes it harder for the trained models to capture rules specific to them; on the other hand, both the technique for encoding protein sequence and the machine learning method for finding a mapping function may not be accurate enough to reflect the complicated sequence-structure relationship in proteins.
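The breakdown reported in Table II can be sketched as follows. This is a minimal illustration, assuming that residues are binned by their observed depth, that the depth ranges are the four listed in Table II, and that the percentage frequency is the share of each secondary structure class within a depth range. The function names, the one-letter class labels, and the toy records are hypothetical.

```python
from collections import defaultdict

# Depth ranges in Å, each interpreted as [low, high), following Table II.
DEPTH_BINS = [(0.0, 2.0), (2.0, 2.5), (2.5, 3.0), (3.0, float("inf"))]


def depth_bin(depth):
    """Return the index of the depth range that contains `depth`."""
    for index, (low, high) in enumerate(DEPTH_BINS):
        if low <= depth < high:
            return index
    raise ValueError(f"depth out of range: {depth}")


def error_table(residues):
    """Summarize errors per depth range and secondary structure class.

    `residues` is an iterable of (observed_depth, predicted_depth, ss_class).
    Returns {bin_index: {ss_class: (mean_abs_error, percent_frequency)}}.
    """
    abs_error_sum = defaultdict(float)
    count = defaultdict(int)
    bin_total = defaultdict(int)
    for observed, predicted, ss_class in residues:
        b = depth_bin(observed)  # assumption: bin on the observed depth
        abs_error_sum[(b, ss_class)] += abs(observed - predicted)
        count[(b, ss_class)] += 1
        bin_total[b] += 1
    table = defaultdict(dict)
    for (b, ss_class), n in count.items():
        mean_abs_error = abs_error_sum[(b, ss_class)] / n
        percent_frequency = 100.0 * n / bin_total[b]
        table[b][ss_class] = (mean_abs_error, percent_frequency)
    return dict(table)


# Toy records (invented): H = helix, E = strand, C = coil.
toy = [(1.5, 1.0, "H"), (1.0, 1.5, "C"), (1.8, 1.8, "C"), (3.5, 2.5, "E")]
for b, row in sorted(error_table(toy).items()):
    print(DEPTH_BINS[b], row)
```

Binning on the observed rather than the predicted depth is a design choice of this sketch; it answers the question "how well are truly deep residues predicted?", which is the question the paragraph above discusses.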

Rank prediction of residue depths

Since the predicted values for deeply buried residues are less accurate, we can instead interpret them relatively, by ranking them within the whole sequence. We sort the depth profile of a protein sequence from the smallest to the largest value and apply this procedure to both the predicted and the observed profiles. If we compare the top 10% of residues (the most deeply buried) from both rankings, the overlap is 47.8%. This estimate is obtained on the whole dataset. It means that, for a protein sequence of 100 amino acids, 4–5 of the 10 residues predicted to be most buried are correct. Knowing that some residues are deeply buried is very helpful when modeling the structure of a novel protein. Different ranking percentages give different overlap percentages, as shown in Table III. When the top 35% of residues are selected and compared, the overlap increases to nearly 70%.
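The ranking comparison can be sketched for a single sequence as follows. This is a minimal illustration under stated assumptions: the number of selected residues is the sequence length times the chosen fraction, rounded to the nearest integer, and the overlap is the percentage of predicted top-ranked residues that are also top-ranked in the observed profile. How the study rounds, breaks ties, or aggregates over the dataset is not specified here, so those choices, the function names, and the toy values are hypothetical.

```python
def most_buried(depths, fraction):
    """Indices of the top `fraction` of residues, ranked deepest first."""
    if not depths:
        raise ValueError("depth profile is empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: round to the nearest residue count, keeping at least one.
    k = max(1, round(len(depths) * fraction))
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return set(ranked[:k])


def top_overlap(observed, predicted, fraction):
    """Percentage of predicted most-buried residues that are truly most buried."""
    if len(observed) != len(predicted):
        raise ValueError("profiles must have the same length")
    top_observed = most_buried(observed, fraction)
    top_predicted = most_buried(predicted, fraction)
    return 100.0 * len(top_observed & top_predicted) / len(top_predicted)


# Toy 10-residue profiles in Å (invented values, for illustration only).
observed = [1.0, 5.0, 2.0, 4.0, 3.0, 1.5, 1.2, 1.1, 1.3, 1.4]
predicted = [1.0, 4.0, 3.5, 2.0, 3.0, 1.5, 1.2, 1.1, 1.3, 1.4]

for fraction in (0.1, 0.2, 0.3):
    print(fraction, top_overlap(observed, predicted, fraction))
```

Because both top sets have the same size, this overlap equals both precision and recall for the "most buried" class, which is consistent with Table III labeling the quantity as precision.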

Our computational analyses via a machine learning approach demonstrate the strong dependence of protein BD on sequence and local sequence environment. This observation indicates that it is feasible to identify the most deeply buried residues using sequence information. This is particularly useful when we deal with newly sequenced proteomes and know little about the structures and functions of the proteins. If the hypothesis that the most deeply buried residues are formed earlier in the folding pathway is true,38 finding deeply buried residues from protein sequence will aid the design of folding experiments for novel proteins. Protein secondary structure and solvent ASA have been used in protein tertiary structure and fold predictions. Protein BD, unlike these earlier structural descriptors, carries different and additional information about protein structure. To our knowledge, it has had only limited applications until recently. Since protein BD is correlated with other structural characteristics, combining depth prediction with other methods should improve the performance of each and provide more accurate combined prediction methods. In this study, we provide a way to analyze the relation of protein sequence to one of its one-dimensional structural properties (BD). Clearly, this approach can be extended to the prediction of other structural or functional properties, and can therefore provide more accurate predicted information for novel protein sequences.

CONCLUSIONS

With a great number of novel sequences having been generated by genome projects, determining their structures and functions is one of the most challenging problems in the postgenome era. Although many other aspects of a protein are being explored via X-ray or NMR, mutation, and microarray gene expression experiments, the sequence itself that contains the most valuable

Table II. The Mean Absolute Errors and Percentage Frequencies of Three Different Secondary Structure Classes According to Different Ranges of Depths

Mean absolute error (Å) and percentage frequency

Depth range              α-helix         β-strand        Coil
Depth < 2.0 Å            0.23  30.0%     0.36  10.6%     0.21  59.4%
2.0 Å ≤ depth < 2.5 Å    0.35  35.2%     0.39  22.9%     0.28  41.9%
2.5 Å ≤ depth < 3.0 Å    0.50  36.9%     0.52  31.5%     0.49  31.6%
Depth ≥ 3.0 Å            1.72  40.7%     1.86  39.5%     2.11  19.8%

Table III. Prediction Accuracies for the Most Buried Residues From Top 10–35%

The most buried (%)    10     15     20     25     30     35
Precision (%)          47.8   55.1   60.0   63.6   66.6   69.9


information needs further exploration. In this work, we study an important structural feature, protein BD, using only sequence information. Our approach achieves a correlation coefficient of 0.65 between the observed and estimated residue depths. In finding the most deeply buried residues, nearly half of the residues in the top 10% of predicted depths are correctly assigned. The structural information about buried residues, predicted from protein sequence, will enhance the study of the structures and functions of novel proteins.

ACKNOWLEDGMENTS

ZY is supported by an ARC Discovery Project.

REFERENCES

1. Donald JE, Hubner IA, Rotemberg VM, Shakhnovich EI, Mirny LA. CoC: a database of universally conserved residues in protein folds. Bioinformatics 2005;21:2539–2540.
2. Pintar A, Carugo O, Pongor S. Atom depth as a descriptor of the protein interior. Biophys J 2003;84:2553–2561.
3. Pedersen TG, Sigurskjold BW, Andersen KV, Kjaer M, Poulsen FM, Dobson CM, Redfield C. A nuclear-magnetic-resonance study of the hydrogen-exchange behavior of lysozyme in crystals and solution. J Mol Biol 1991;218:413–426.
4. Chakravarty S, Varadarajan R. Residue depth: a novel parameter for the analysis of protein structure and stability. Structure 1999;7:723–732.
5. Gutteridge A, Bartlett GJ, Thornton JM. Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol 2003;330:719–734.
6. Zhou H, Zhou Y. Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 2004;55:1005–1013.
7. Marrone TJ, Briggs JM, McCammon JA. Structure-based drug design: computational advances. Annu Rev Pharmacol Toxicol 1997;37:71–90.
8. Lee B, Richards F. The interpretation of protein structures: estimation of static accessibility. J Mol Biol 1971;55:379–400.
9. Singh YH, Gromiha MM, Sarai A, Ahmad S. Atom-wise statistics and prediction of solvent accessibility in proteins. Biophys Chem 2006;124:145–154.
10. Rost B, Sander C. Conservation and prediction of solvent accessibility in protein families. Proteins 1994;20:216–226.
11. Thompson MJ, Goldstein RA. Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 1996;25:38–47.
12. Carugo O. Predicting residue solvent accessibility from protein sequence by considering the sequence environment. Protein Eng 2000;13:607–609.
13. Gianese G, Bossa F, Pascarella S. Improvement in prediction of solvent accessibility by probability profiles. Protein Eng 2003;16:987–992.
14. Yuan Z, Huang B. Prediction of protein accessible surface areas by support vector regression. Proteins 2004;57:558–564.
15. Nguyen MN, Rajapakse JC. Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Proteins 2006;63:542–550.
16. Xu Z, Zhang C, Liu S, Zhou Y. QBES: predicting real values of solvent accessibility from sequences by efficient, constrained energy optimization. Proteins 2006;63:961–966.
17. Yuan Z, Zhang F, Davis MJ, Boden M, Teasdale RD. Predicting the solvent accessibility of transmembrane residues from protein sequence. J Proteome Res 2006;5:1063–1070.
18. Ahmad S, Gromiha MM, Sarai A. Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003;50:629–635.
19. Nguyen MN, Rajapakse JC. Prediction of protein relative solvent accessibility with a two-stage SVM approach. Proteins 2005;59:30–37.
20. Wang JY, Lee HM, Ahmad S. Prediction and evolutionary information analysis of protein solvent accessibility using multiple linear regression. Proteins 2005;61:481–491.
21. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In: Mozer MC, Jordan MI, Petsche T, editors. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 1997. pp 155–161.
22. Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003. Nucleic Acids Res 2003;31:492–493.
23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28:235–242.
24. Varrazzo D, Bernini A, Spiga O, Ciutti A, Chiellini S, Venditti V, Bracci L, Niccolai N. Three-dimensional computation of atom depth in complex molecular structures. Bioinformatics 2005;21:2856–2860.
25. Pintar A, Carugo O, Pongor S. Atom depth in protein structure and function. Trends Biochem Sci 2003;28:593–597.
26. Sanner MF, Olson AJ, Spehner JC. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 1996;38:305–320.
27. Hamelryck T. An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins 2005;59:38–48.
28. Qian N, Sejnowski TJ. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 1988;202:865–884.
29. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.
30. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292:195–202.
31. Yuan Z, Bailey TL, Teasdale RD. Prediction of protein B-factor profiles. Proteins 2005;58:905–912.
32. Gromiha MM, Selvaraj S. Importance of long-range interactions in protein folding. Biophys Chem 1999;77:49–68.
33. Gromiha MM, Selvaraj S. Inter-residue interactions in protein folding and stability. Prog Biophys Mol Biol 2004;86:235–277.
34. Yuan Z. Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics 2005;6:248.
35. Joachims T. Making large-scale SVM learning practical. In: Scholkopf B, Burges C, Smola A, editors. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press; 1999. pp 169–184.
36. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–2637.
37. Samanta U, Bahadur RP, Chakrabarti P. Quantifying the accessible surface area of protein residues in their local environment. Protein Eng 2002;15:659–667.
38. Pintar A, Pongor S. The “first in-last out” hypothesis on protein folding revisited. Proteins 2005;60:584–590.
