Quantifying the relationship of protein burying depth and sequence
PROTEINS: Structure, Function, and Bioinformatics
Zheng Yuan1* and Zhi-Xin Wang2
1Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, Brisbane, Australia
2Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China
ABSTRACT
Protein burying depth (BD) is a structural descriptor that is exploited not only to find whether a residue is exposed or buried, but also to determine how deep a residue is buried. The widely used solvent accessible surface area mainly focuses on the study of protein surface residues, while protein BD can provide more detailed information about the arrangement of buried residues, which may be used to study protein deep-level structure and the formation of the protein folding nucleus. In this work, we analyse the relationship of protein BD and sequences, and describe it by nonlinear functions estimated by support vector machines. We examine the functions by crossvalidation tests and find strong correlation between residue BD and local sequence environment. By further taking into account the size of the molecule where a residue is located, we find that the correlation coefficient between predicted and observed depths improves from 0.60 to 0.65. Moreover, nearly half of the deepest 10% residues in a protein sequence can be correctly predicted. Our study suggests that a residue's burying extent can be predicted, to some degree, from the residue itself and its local neighbouring residues. The methods used to estimate the sequence-depth functions are expected to become more useful in the investigation of protein structures and folding mechanisms.
Proteins 2008; 70:509–516. © 2007 Wiley-Liss, Inc.
Key words: bioinformatics; support vector regression; protein structure prediction; protein sequence analysis; burying extent.
The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/0887-3585/suppmat
Grant sponsor: Australian Research Council.
*Correspondence to: Zheng Yuan, Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, Brisbane, Australia. E-mail: [email protected]
Received 11 December 2006; Revised 20 February 2007; Accepted 27 March 2007
Published online 17 August 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21545
INTRODUCTION
A protein adopts a particular structural arrangement to fulfill its function. Some residues are located at the protein surface, whereas others are buried in the protein interior, where they play roles such as forming a hydrophobic core that maintains the folded state. Buried residues have been found to be more conserved during evolution and have therefore proved useful for protein fold and function prediction.1,2 As a protein structural descriptor, the depth or burying depth (BD) of an atom or residue has been proposed as the distance between itself and the nearest water molecule or the protein surface, to measure the extent to which it is buried.3,4 This parameter has also been applied to a number of problems, such as analyzing amide hydrogen/deuterium exchange rates in nuclear magnetic resonance (NMR) experiments,3 predicting protein active sites,5 and improving protein structural alignments.6
Because of their involvement in interactions with other biological molecules, surface residues have been under extensive investigation, especially in the area of protein drug design.7 Accordingly, accessible surface area (ASA) has been widely adopted to define surface residues,8 and a large number of sequence-based prediction methods built on it have been developed in recent years.9–20 However, ASA cannot provide information about the structural arrangement of buried residues, whose ASAs are zero or near zero. As a complement, BD can provide additional and more accurate information about the interior of a molecule. In particular, protein BD may reflect long-range contact information and thus be helpful for protein structure prediction. To date, however, it is still unknown whether BD can be predicted directly from sequence or whether the local sequence environment of a buried residue can determine its burying extent.
In this study, we use a regression approach to quantify the relationship of protein BD and sequence. Support vector regression21 is a sophisticated machine learning method that can reveal the relationship between two objects through a complicated function derived from the learning samples. Applying this method and examining it with a large protein structure dataset, we found a strong correlation between protein BD and the local sequence. The best estimated function is able to predict residue BDs with a correlation coefficient of 0.65. Further analyses show a good identification of the most deeply buried residues. Our work provides a new method to predict protein BD, which is useful for analyzing buried residues and the folding nucleus of novel proteins.
METHODS
Preparation of dataset
To examine the relationship of depth and sequence, we prepare a large dataset of 923 protein chains using PDB-REPRDB.22 The structures were all solved by X-ray crystallography and are of high quality, with resolution ≤ 2.0 Å and R-factor ≤ 0.2. We exclude protein chains shorter than 60 amino acids. No two chains share a pairwise sequence identity of more than 25%. All the structures of the biological molecules that contain those 923 protein chains are retrieved from the RCSB-PDB database23 for analysis. The names of the 923 protein chains are given in Table I of the supplementary materials.
Calculation of solvent accessible surface and protein BD
There are a number of ways to calculate protein BDs.
The concept of protein BD was first introduced by Peder-
sen et al.3 in their NMR study. In their work, the BD of
an atom was defined as the distance from the nearest
water molecule. Atom depth was also defined by other
authors as the distance from protein solvent surface4 or
from the nearest solvent atom.2 A recently published
method also considered the effect of protein shape.24
Despite the different definitions, strong correlations among them may exist. For example, a correlation coefficient of 0.93 was observed between the first two definitions in the same protein structure.25
In this study, we consider the depth of an atom as its distance from the solvent accessible surface. We use the program MSMS26 to calculate the solvent ASA of each atom and the solvent accessible surface of the whole molecule. The program is run with a probe sphere radius of 1.4 Å. Its output for a protein structure contains a list of vertices that represent the protein surface. The BD of an atom can therefore be defined as the distance between the atom and its nearest vertex, as previously used by Hamelryck.27 The average depth of all atoms except the hydrogen atoms in an amino acid is regarded as its residue depth. It is worth mentioning that all the calculations are based on whole molecules, which may consist of a number of protein subunits, although our analyses only use nonredundant protein chains.
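The depth definition above can be illustrated with a minimal Python sketch. It assumes that the surface vertices have already been computed (for example by MSMS) and parsed into coordinate tuples; the function names and the toy coordinates are hypothetical and are not taken from the original study.

```python
import math

def atom_depth(atom, vertices):
    """Distance from one atom to its nearest surface vertex."""
    return min(math.dist(atom, v) for v in vertices)

def residue_depth(heavy_atoms, vertices):
    """Mean depth over the non-hydrogen atoms of one residue."""
    depths = [atom_depth(a, vertices) for a in heavy_atoms]
    return sum(depths) / len(depths)

# Toy data (hypothetical): two surface vertices and a two-atom residue.
vertices = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
residue = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
depth = residue_depth(residue, vertices)
```

In practice the vertex list of a whole molecule is large, so a spatial index would likely be preferable to the exhaustive search used here for clarity.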
Residue coding
Although the distribution of BDs should ultimately be determined by the whole protein sequence, the depth of a residue or atom may be partly determined by its local sequence environment, that is, the local segment of amino acids. On the basis of this assumption, we consider the protein segment centered on a target residue, and adopt the sliding window technique previously used in the prediction of protein secondary structure.28 The window size is set to 15 amino acids, meaning that the seven neighbouring amino acids on the N-terminal side and the seven on the C-terminal side of a target residue are also used. To code each amino acid in the
window, we use the normalized values in PSI-BLAST
scoring matrixes29 obtained by three rounds of searching
against the nonredundant protein sequence database of
National Center for Biotechnology Information.30,31
Any amino acid is represented by a 21-dimensional vec-
tor and therefore the 15 amino acids window is repre-
sented by a 315-dimensional vector. The selection of
seven amino acids around the central residue at the
sequence level is based on the observation that the small
window size may not adequately exploit the sequence in-
formation while the large window size may lead to more
noise and therefore may not be helpful for the predic-
tion. In addition, a window size of 15 amino acids may
contain short-, medium-, and long-range interactions in
protein structures. In a definition,32,33 the short-range
interactions are contributed by the residues within two
residues distance from the central residue; the medium-
range interactions are the contributions of residues with
three or four residues distance from the central residue
and the long-range interactions are from the residues
more than four residues away from the central residue.
According to this definition, in a 15-amino-acid window, the residues participating in short-, medium-, and long-range interactions have a balanced distribution with a ratio of 2:2:3 (two, two, and three residues on each side of the central residue).
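The window coding and the 2:2:3 ratio can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say what the 21st value of each vector holds, so the sketch assumes it flags window positions that fall beyond the chain termini, and the profile values are placeholders rather than real PSI-BLAST scores.

```python
WINDOW = 15
HALF = WINDOW // 2   # seven neighbours on each side of the target residue
DIM = 21             # per-position vector length stated in the text

def encode_window(profile, center):
    """Concatenate per-residue vectors for the window centred at `center`.

    `profile` holds one row of 20 normalized scores per residue.
    Assumption: the 21st value marks positions outside the chain.
    """
    features = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(profile):
            features.extend(list(profile[pos]) + [0.0])
        else:
            features.extend([0.0] * 20 + [1.0])  # terminal padding
    return features

def interaction_counts(half=HALF):
    """Neighbours on one side, grouped by sequence separation."""
    short = sum(1 for d in range(1, half + 1) if d <= 2)
    medium = sum(1 for d in range(1, half + 1) if 3 <= d <= 4)
    long_range = sum(1 for d in range(1, half + 1) if d > 4)
    return short, medium, long_range

profile = [[0.5] * 20 for _ in range(30)]  # hypothetical normalized scores
vec = encode_window(profile, 0)            # window around the first residue
```

Each window yields 15 × 21 = 315 values, and `interaction_counts()` reproduces the two short-, two medium-, and three long-range positions on each side of the centre.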
Estimation of the relationship between BD and sequence
To derive the function matching the BD of a residue
or atom with its coding vector, we use nonlinear support
vector regression machines as used in other applications
such as the predictions of solvent accessibility,14,15,17
flexibility,31 and contact numbers.34 To achieve the nonlinear mapping, the Gaussian kernel $K(X_i, X_j) = \exp(-\gamma \lVert X_i - X_j \rVert^2)$ is used. Several parameters control the learning, for example, the regularization parameter C and the error tube ε. We keep the error tube ε = 0.01 throughout the study while trying different values for C and γ. To obtain the final solutions we use the program SVMlight.35
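For readers unfamiliar with these quantities, the following Python sketch evaluates the Gaussian kernel and the standard ε-insensitive loss used in support vector regression. It is not the SVMlight implementation; the function names are hypothetical and the code only illustrates the roles of γ and ε.

```python
import math

def gaussian_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def epsilon_insensitive_loss(observed, predicted, epsilon=0.01):
    """Errors inside the tube of half-width epsilon cost nothing."""
    return max(0.0, abs(observed - predicted) - epsilon)

same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.1)
near = gaussian_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1)
```

A larger γ makes the kernel decay faster with distance, so each training sample influences a smaller neighbourhood of the feature space.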
Evaluation of prediction performance
A threefold crossvalidation test is used to avoid biased evaluation: the 923 protein chains are randomly divided into three groups, each having a roughly equal number of protein sequences. Proteins in one group are tested in turn, while the proteins in the other groups are merged to estimate the sequence-depth function. The correlation coefficient between predicted and observed depths is calculated to show prediction performance.
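The evaluation protocol can be sketched in Python as below. The splitting scheme, the seed, and the function names are assumptions made for illustration; the paper only states that chains are randomly divided into three groups of roughly equal size.

```python
import math
import random

def three_fold_groups(chain_ids, seed=0):
    """Randomly split chains into three groups of roughly equal size."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::3] for i in range(3)]

def folds(groups):
    """Yield (train, test) pairs; each group is the test set once."""
    for k, test in enumerate(groups):
        train = [c for j, g in enumerate(groups) if j != k for c in g]
        yield train, test

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Splitting by chain rather than by residue keeps all residues of a test protein out of the training set.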
To further analyse the efficiency of the function estimation, we formulate the predictions as two-class problems by applying a number of thresholds to divide residues into two groups: exposed and buried. In these cases, overall accuracy, specificity, and sensitivity are also calculated.
RESULTS AND DISCUSSION
The distribution of BDs
There are about 16,200 residues in our 923 protein chains, and we give the residue depth distribution in Figure 1. We calculated the depth values, rounded them to two decimal places, and counted the frequency of each value in the whole dataset. As shown in Figure 1, nearly all residues are located in the depth range from 1.67 to 10 Å, which covers 99.7% of the total residues. The distribution is not uniform: the majority of residues (about 70% of total residues) have BDs of no more than 2.5 Å because most residues are on or near protein surfaces. The mean and standard deviation of the distribution are 2.64 and 1.41 Å, respectively. We also calculated the BD of Cα atoms and found that residue depth and atom depth were highly correlated, with a correlation coefficient of 0.96. On the basis of this observation, we focused our analyses mainly on residue depths because analyses on atom depths might yield similar results.
The relationship of protein BD and protein size
To measure the effect of protein size on the calculation of BDs, we calculated the BDs of all atoms (excluding the hydrogen atoms) in a protein and averaged their values to represent this protein. Protein size can be reflected by the number of amino acids or by the molecular weight, which accounts for the differences among amino acids. Both measures have the same correlation coefficient, 0.73, with the averaged atom depths. Figure 2 shows the distribution of averaged depths with regard to the numbers of amino acids in proteins. It is obvious that BDs are strongly dependent on protein size and that this feature is essential for the estimation of sequence-depth functions. In some other studies, BDs in proteins of different sizes were first normalized to a similar level before comparison.5 In that approach, a normalizing function has to be specified a priori without knowing its best form. To avoid this, we use protein size and local sequence information together as input to estimate the sequence-depth function.
Table I. Prediction Performance for Burying Depths with Different Input Information and Different Control Parameters
Dataset          Input                        Control parameter   Correlation coefficient   Mean absolute error (Å)   Mean relative error (%)
Total residues   Local window                 γ = 0.1, C = 0.5    0.62                      0.63                      18.8
Total residues   Local window                 γ = 0.01, C = 2     0.61                      0.62                      18.3
Total residues   Local window + protein size  γ = 0.1, C = 0.5    0.65                      0.61                      18.5
Total residues   Local window + protein size  γ = 0.01, C = 2     0.65                      0.60                      18.0
Buried residues  Local window + protein size  γ = 0.1, C = 0.5    0.52                      1.07                      24.5
Buried residues  Local window + protein size  γ = 0.01, C = 2     0.52                      1.06                      24.4
Figure 1. Distribution of occurring frequency for residue burying depth.
Figure 2. The average burying depths of proteins according to their molecular sizes.
The distributions of BDs according to different secondary structures
For all residues, we extracted their secondary structures from the Dictionary of Protein Secondary Structure (DSSP)36 and classified them into three classes: helix, β-strand, and coil. The distributions of residue depths for the three classes are shown in Figure 3. For each distribution we computed its mean and standard deviation. The values for helix, β-strand, and coil are 2.75 ± 1.42, 3.31 ± 1.78, and 2.23 ± 0.99 Å, respectively. Coil residues are more frequently found with small BDs (more exposed), while β-sheet residues tend to have larger BDs (more buried). This result is consistent with the previous observation of other authors,2 although they adopted a different depth definition and a different dataset. To further examine whether the distributions were significantly different, we performed Kolmogorov-Smirnov tests and found that the P-values were all near zero between the distributions of separate classes. Thus, our analyses confirm the existence of a correlation between protein BDs and secondary structures.
The correlation of protein BD and protein solvent ASA
The absolute solvent ASA in Å² for each residue was obtained by running the program MSMS,26 while the relative ASA was obtained by normalizing the absolute value by the residue's ASA value in a Gly-X-Gly tripeptide with extended conformation.17,37 We calculated the correlation of BD with absolute ASA and relative ASA and found that relative ASA correlates more negatively with depth (−0.55) than absolute ASA does (−0.51).
Since relative ASA is more strongly correlated with BD, we use it to define surface residues. If the relative ASA of a residue is greater than a given threshold, the residue is defined as exposed; otherwise, it is a buried residue. A similar definition also applies to BD: for a selected BD threshold, a residue is deemed exposed if its depth is less than the threshold. As
expected, there should exist a correlation between relative
ASA and BD thresholds that can maximize the consis-
tency between the definitions. To reflect the agreement,
we use the consistency percentage (CP), which is defined
as the percentage of the residues consistently assigned by
both methods as exposed or buried in the total residues.
In Figure 4(A), we plot the CP values according to differ-
ent coupled relative ASA and depth thresholds. For each
relative ASA threshold, we are able to find a depth
threshold that corresponds to the maximum CP. The
matched threshold values and their CP values determine
the ridgeline that is highlighted in blue in Figure 4(A).
Mapping the ridgeline to the horizontal plane determined
by the axes of relative ASA and BD can give the best
match of thresholds of relative ASA and depth. This line
is shown in green.
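The consistency percentage and the search for the matching depth threshold can be sketched in Python as follows. The function names and the toy values are hypothetical; the sketch only mirrors the definitions given in the text (exposed if relative ASA is above its threshold, or if depth is below its threshold).

```python
def consistency_percentage(rel_asa, depth, asa_cut, depth_cut):
    """Percentage of residues given the same label by both definitions."""
    agree = 0
    for a, d in zip(rel_asa, depth):
        exposed_by_asa = a > asa_cut
        exposed_by_depth = d < depth_cut
        if exposed_by_asa == exposed_by_depth:
            agree += 1
    return 100.0 * agree / len(depth)

def best_depth_cut(rel_asa, depth, asa_cut, depth_cuts):
    """Depth threshold that maximizes CP for one relative-ASA threshold."""
    return max(
        depth_cuts,
        key=lambda c: consistency_percentage(rel_asa, depth, asa_cut, c),
    )

# Toy data (hypothetical): relative ASA in % and residue depth in Å.
rel_asa = [40.0, 25.0, 3.0, 0.0]
depth = [1.8, 2.2, 3.0, 5.0]
best = best_depth_cut(rel_asa, depth, asa_cut=5.0, depth_cuts=[2.0, 2.5, 4.0])
```

Repeating `best_depth_cut` over a grid of relative ASA thresholds traces the ridgeline of matched thresholds described above.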
More detailed information is given in Figure 4(B). In particular, a depth threshold of 2.54 Å, corresponding to a relative ASA threshold of 5–6%, achieves more than 90% agreement. This cutoff classifies about 28% of the residues as buried. We use the buried residues to make another dataset, called the “buried residue dataset,” to further test the relationship of protein BD and sequence. The mean and standard deviation of residue depths in this dataset are 4.34 and 1.68 Å, respectively.
Numerical solution of the matching function between protein BD and sequence
To estimate the matching function between residue depths and the feature vectors coding residues, we applied a support vector regression approach to find its solution. Its fitness is examined by crossvalidation tests to avoid overestimation; that is, proteins used for testing are not included in the training set that is used to build a function. To simplify data handling in the training procedure, all residue depths are normalized by the mean ($\bar{X}$) and standard deviation ($\sigma$) using the formula $X_{\mathrm{norm}} = (X - \bar{X})/\sigma$, where the mean and standard deviation are derived from the whole dataset ($\bar{X}$ = 2.64 Å and $\sigma$ = 1.41 Å). The same formula is also used in the testing procedure when a predicted value ($X_{\mathrm{norm}}$) is transformed back to a value with the original meaning (i.e., X).
Figure 3. Residue depth distributions for the three secondary structure classes: helix (solid), β-strand (slashed), and coil (dotted).
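The normalization and its inverse amount to the following short Python sketch, which uses the dataset mean and standard deviation quoted in the text; the function and constant names are illustrative only.

```python
MEAN_DEPTH = 2.64  # Å, mean residue depth of the whole dataset
STD_DEPTH = 1.41   # Å, standard deviation of the whole dataset

def normalize(depth):
    """Convert a depth in Å to the normalized training target."""
    return (depth - MEAN_DEPTH) / STD_DEPTH

def denormalize(norm_value):
    """Convert a normalized prediction back to a depth in Å."""
    return norm_value * STD_DEPTH + MEAN_DEPTH
```

For example, a normalized prediction of 1.0 corresponds to a depth of 2.64 + 1.41 = 4.05 Å.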
As described in the Methods section, each residue is coded by a 315-dimensional vector that contains the information from a local window of 15 amino acids. In addition, protein (molecular) size, derived directly from the protein sequence, is an important feature that can be added (after normalization) as the 316th dimension. We estimate the function using the total residues in our dataset and give the results in Table I. Using only the local window information, the correlation coefficients between predicted and observed depths are around 0.60. When protein size is included as a feature, the correlation coefficients increase to 0.65, with decreased mean absolute errors and relative errors. The results demonstrate that protein size is a significant factor that influences the estimation accuracy. If we use BD to define exposed and buried residues, different depth thresholds can be applied and their overall accuracies (the percentages of correctly predicted residues) computed. To verify whether our methods can be used to find the buried residues, particularly the deeply buried ones, we calculated the specificity and sensitivity, as well as the occurring probabilities of buried residues in our dataset. Specificity is the percentage of correctly predicted buried residues among the total predicted buried residues, while sensitivity is the percentage of correctly predicted buried residues among the total observed buried residues. Using the model with γ = 0.01, C = 2 and local window as well as protein size information, we provide the results in
Figure 5. The lowest overall accuracy is 73.1%, corresponding to a threshold of 1.95 Å. The overall accuracies, which range from 73 to 90% according to different thresholds, are comparable with previously reported results of 72–90% on prediction of protein surface residues.10,13–15,18,20 With increasing threshold values, the abundance of defined buried residues and the prediction sensitivity decrease, suggesting difficulties in accurately predicting more deeply buried residues. However, our methods are far better than random guessing. For instance, using the depth threshold 2.54 Å, 28.3% of total residues are buried. That means, if we randomly assign a residue as buried with this probability, only 28.3% of the buried residues can be identified correctly. In contrast, the support vector regression method can correctly predict 66.1% of those residues (sensitivity). Even using larger thresholds, the sensitivities are always more than twice the random prediction accuracies. However, unlike sensitivity, specificity is kept at a relatively stable level, with its lowest values around 70%.
Figure 4. Consistency between the methods of using relative accessible surface area and burying depth to assign exposed and buried residues. (A) Consistency percentage is defined as the percentage of residues consistently defined by both methods as exposed or buried in the total number of residues. Its values are plotted against different thresholds of relative accessible surface area and burying depth. (B) The best match between the thresholds of relative accessible surface area and burying depth (blue) with their consistency percentage (red).
Figure 5. Prediction accuracies when formulating as two-class problems using different depth thresholds. Specificity, sensitivity, occurring probability of buried residues, and overall accuracy are plotted according to various depth cut-offs.
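The two-class measures defined above can be computed as in the following Python sketch. Note that "specificity" here follows the paper's definition (correct buried predictions among all predicted buried residues). The function name and the toy depths are hypothetical.

```python
def two_class_metrics(observed, predicted, threshold):
    """Treat depth >= threshold as buried and return percentages."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        obs_buried = obs >= threshold
        pred_buried = pred >= threshold
        if obs_buried and pred_buried:
            tp += 1
        elif pred_buried:
            fp += 1
        elif obs_buried:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "overall": 100.0 * (tp + tn) / total,
        # Paper's definition: correct among predicted buried residues.
        "specificity": 100.0 * tp / (tp + fp) if tp + fp else 0.0,
        # Correct among observed buried residues.
        "sensitivity": 100.0 * tp / (tp + fn) if tp + fn else 0.0,
        "buried_fraction": 100.0 * (tp + fn) / total,
    }

# Toy depths in Å (hypothetical), evaluated at the 2.54 Å threshold.
observed = [1.8, 3.0, 4.0, 2.0]
predicted = [2.0, 2.8, 2.4, 2.6]
metrics = two_class_metrics(observed, predicted, threshold=2.54)
```

The `buried_fraction` entry corresponds to the occurring probability of buried residues, which is also the expected sensitivity of a random assignment made with that probability.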
It is well known that a protein structure is determined
by its whole sequence. Whether a residue is buried and
how deep it is buried in a protein structure are the out-
comes of the cooperative effects of all residues. However,
when we picked only buried residues from different protein structures to perform analyses, we still observed a strong correlation of depth with both residue local information and protein size. We trained on the “buried residue” dataset and tested using threefold crossvalidation. As a result, the correlation coefficient (CC) is 0.52, and the mean absolute error and mean relative error are 1.06 Å and 24.4%, respectively (Table I). Formulating predictions as two-class
problems with different thresholds to find more deeply
buried residues, we give the specificities, sensitivities,
overall accuracies, and occurring probabilities in Figure
6. Our results consistently suggest that the local informa-
tion of a buried residue can, to some degree, determine
its burying extent.
The error distribution according to protein size and residue depth
We calculated the correlation coefficient for each individual protein according to residue depths. For the 453 protein subunits in the group of small proteins, each of which has no more than 150 amino acids, the average correlation coefficient is 0.60. For large proteins (>150 amino acids), we achieve an average correlation coefficient of 0.65. However, the group of large proteins has greater mean absolute and mean relative errors. The mean absolute and relative errors for small proteins are 0.35 Å and 13.2%, while they are 0.70 Å and 19.8% for large proteins. We also found that monomers are more accurately predicted than multimers, as shown by the mean absolute errors of 0.43 Å for monomers and 0.65 Å for multimers.
For surface residues (residue depth < 2.54 Å), the mean absolute and relative errors are 0.27 Å and 13.6%, respectively. The buried residues have a mean absolute error of 1.44 Å and a mean relative error of 29.3%. In contrast, if we train and test on buried residues solely, we obtain better accuracy, as the mean absolute and relative errors are 1.06 Å and 24.4% (Table I).
Figure 7 shows two examples that illustrate the prediction performance. The RAS-binding domain of the protein kinase BYR2 (PDB code: 1i35; subunit A) is predicted with a CC of 0.64, and formyltransferase (PDB code: 1ftr; subunit A) achieves a CC of 0.65. In both cases, the deeply buried residues have larger errors. However, there is good agreement between the predicted and observed regions containing deeply buried residues, as shown by the overlap of the peaks, although the predicted depths are frequently smaller than the observed values.
Figure 6. Prediction accuracies when formulating as two-class problems using different depth thresholds. The sequence-depth function is estimated only on the “buried residue” dataset. Specificity, sensitivity, occurring probability of buried residues, and overall accuracy are plotted according to various depth cut-offs.
Figure 7. Burying depth predictions for proteins 1i35 and 1ftr (PDB codes). Blue lines and green lines represent the observed and predicted residue depths, respectively. Their absolute difference is shown as a red line. (A) The 127 residues in subunit A of protein 1i35 are predicted with a correlation coefficient of 0.64; (B) the 296 residues in subunit A of protein 1ftr are predicted with a correlation coefficient of 0.65. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]
To further study the errors, we calculate the mean
absolute errors in different ranges of residue BDs accord-
ing to different secondary structures. Results are given in
Table II. For this dataset, residues with the conformations of helix, β-sheet, and coil account for 34.2, 21.6, and 44.2%, respectively. First, the mean absolute errors increase when residues are more deeply buried. This phenomenon occurs for all kinds of secondary structures. Second, regular secondary structures (helix and β-sheet) are more frequently observed in the protein core. Of the residues with BDs no less than 3.0 Å, 40.7% form helices and 39.5% form β-sheets, which is much higher than their overall percentage frequencies of 34.2% and 21.6% in the whole dataset. In contrast, the irregular structures (coil) have a lower frequency of 19.8%, compared with their overall percentage frequency of 44.2%. This can also be observed from the
different distributions of secondary structures in Figure
3. In addition, the prediction of deeply buried coil residues is less accurate, as shown by the greater errors. Several reasons may account for the worse prediction of deeply buried residues. On the one hand, the under-representation of deeply buried residues in the dataset limits the ability of the built models to capture rules specific to them; on the other hand, both the technique for encoding protein sequence and the machine learning method for finding a mapping function may not be accurate enough to reflect the complicated sequence-structure relationship in proteins.
Rank prediction of residue depths
Since the predicted values for deeply buried residues are less accurate, we can interpret them relatively by ranking them within the whole sequence. We sort the residues of a protein sequence by depth and apply this procedure to both the predicted and the observed depth profiles. If we compare the 10% most deeply buried residues from both rankings, we observe an overlap of 47.8%. This estimate is obtained on the whole dataset. It means that, for a protein sequence of 100 amino acids, 4–5 residues among the 10 predicted to be most buried are correct. Knowing that some residues are deeply buried is very helpful when modelling the structure of a novel protein. Different ranking percentages generate different overlap percentages, as shown in Table III. Selecting and comparing the top 35% of residues, the overlap increases to nearly 70%.
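The rank comparison can be sketched in Python as follows. The function name, the rounding of the group size, and the toy depth profiles are assumptions made for illustration; the paper does not describe how ties or fractional group sizes were handled.

```python
def top_fraction_overlap(observed, predicted, fraction=0.10):
    """Percentage of the predicted deepest residues that are truly deepest."""
    n = len(observed)
    k = max(1, round(n * fraction))  # assumed rounding of the group size
    top_obs = set(sorted(range(n), key=lambda i: observed[i], reverse=True)[:k])
    top_pred = set(sorted(range(n), key=lambda i: predicted[i], reverse=True)[:k])
    return 100.0 * len(top_obs & top_pred) / k

# Toy depth profiles (hypothetical) for a ten-residue sequence.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0, 8.0, 9.0]
overlap = top_fraction_overlap(observed, predicted, fraction=0.2)
```

Because both groups have the same size k, the overlap percentage is the same whether it is read as precision or as recall.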
Our computational analyses via a machine learning
approach demonstrate the strong dependence of protein
BD on sequence and local sequence environment. This
observation indicates that it is feasible to identify those
most deeply buried residues using sequence information.
This is particularly useful when we deal with newly
sequenced proteomes and know little about structures
and functions of the proteins. If the hypothesis that the
most deeply buried residues are formed earlier in the
folding pathway is true,38 finding deeply buried residues
from protein sequence will aid the design of folding
experiments for novel proteins. Protein secondary structure and solvent ASA have been used in protein tertiary structure and fold predictions. Protein BD, unlike these earlier structural descriptors, carries different and additional information that describes protein structures. To our knowledge, it has had only limited applications to date. Since protein BD is correlated with other structural characteristics, incorporating depth prediction into other methods should enhance their performance and provide more accurate combined prediction methods. In this
study, we provide a way to analyze the relation of protein
sequence with its one-dimensional structural properties
(BD). Clearly, this approach can be extended to predic-
tion of other structural or functional properties, and
therefore provide more accurate predicted information
for novel protein sequences.
CONCLUSIONS
With a great number of novel sequences having been generated by genome projects, determining their structures and functions is one of the most challenging problems in the postgenome era. Although many other aspects of a protein are being explored via X-ray or NMR, mutation, and microarray gene expression experiments, the sequence itself, which contains the most valuable information, needs further exploration. In this work, we study an important structural feature, protein BD, using only sequence information. Our approach achieves a correlation coefficient of 0.65 between the observed and estimated residue depths. In finding the most deeply buried residues, nearly half of the residues in the group with the top 10% largest depth values are correctly assigned. The structural information about buried residues, predicted from protein sequence, will enhance our study of the structures and functions of novel proteins.
Table II. The Mean Absolute Errors and Percentage Frequencies of Three Different Secondary Structure Classes According to Different Ranges of Depths
Each cell gives the mean absolute error (Å) followed by the percentage frequency.
Depth range              α-helix        β-strand       Coil
Depth < 2.0 Å            0.23  30.0%    0.36  10.6%    0.21  59.4%
2.0 Å ≤ depth < 2.5 Å    0.35  35.2%    0.39  22.9%    0.28  41.9%
2.5 Å ≤ depth < 3.0 Å    0.50  36.9%    0.52  31.5%    0.49  31.6%
Depth ≥ 3.0 Å            1.72  40.7%    1.86  39.5%    2.11  19.8%
Table III. Prediction Accuracies for the Most Buried Residues From Top 10–35%
The most buried (%)   10     15     20     25     30     35
Precision (%)         47.8   55.1   60.0   63.6   66.6   69.9
ACKNOWLEDGMENTS
ZY is supported by ARC Discovery Project.
REFERENCES
1. Donald JE, Hubner IA, Rotemberg VM, Shakhnovich EI, Mirny LA.
CoC: a database of universally conserved residues in protein folds.
Bioinformatics 2005;21:2539–2540.
2. Pintar A, Carugo O, Pongor S. Atom depth as a descriptor of the
protein interior. Biophys J 2003;84:2553–2561.
3. Pedersen TG, Sigurskjold BW, Andersen KV, Kjaer M, Poulsen FM,
Dobson CM, Redfield C. A nuclear-magnetic-resonance study of
the hydrogen-exchange behavior of lysozyme in crystals and solu-
tion. J Mol Biol 1991;218:413–426.
4. Chakravarty S, Varadarajan R. Residue depth: a novel parameter for
the analysis of protein structure and stability. Structure 1999;7:723–
732.
5. Gutteridge A, Bartlett GJ, Thornton JM. Using a neural network
and spatial clustering to predict the location of active sites in
enzymes. J Mol Biol 2003;330:719–734.
6. Zhou H, Zhou Y. Single-body residue-level knowledge-based energy
score combined with sequence-profile and secondary structure in-
formation for fold recognition. Proteins 2004;55:1005–1013.
7. Marrone TJ, Briggs JM, McCammon JA. Structure-based drug
design: computational advances. Annu Rev Pharmacol Toxicol
1997;37:71–90.
8. Lee B, Richards F. The interpretation of protein structures: estima-
tion of static accessibility. J Mol Biol 1971;55:379–400.
9. Singh YH, Gromiha MM, Sarai A, Ahmad S. Atom-wise statistics
and prediction of solvent accessibility in proteins. Biophys Chem
2006;124:145–154.
10. Rost B, Sander C. Conservation and prediction of solvent
accessibility in protein families. Proteins 1994;20:216–226.
11. Thompson MJ, Goldstein RA. Predicting solvent accessibility:
higher accuracy using Bayesian statistics and optimized residue
substitution classes. Proteins 1996;25:38–47.
12. Carugo O. Predicting residue solvent accessibility from protein
sequence by considering the sequence environment. Protein Eng
2000;13:607–609.
13. Gianese G, Bossa F, Pascarella S. Improvement in prediction of
solvent accessibility by probability profiles. Protein Eng
2003;16:987–992.
14. Yuan Z, Huang B. Prediction of protein accessible surface areas by
support vector regression. Proteins 2004;57:558–564.
15. Nguyen MN, Rajapakse JC. Two-stage support vector regression
approach for predicting accessible surface areas of amino acids.
Proteins 2006;63:542–550.
16. Xu Z, Zhang C, Liu S, Zhou Y. QBES: predicting real values of
solvent accessibility from sequences by efficient, constrained energy
optimization. Proteins 2006;63:961–966.
17. Yuan Z, Zhang F, Davis MJ, Boden M, Teasdale RD. Predicting the
solvent accessibility of transmembrane residues from protein
sequence. J Proteome Res 2006;5:1063–1070.
18. Ahmad S, Gromiha MM, Sarai A. Real value prediction of solvent
accessibility from amino acid sequence. Proteins 2003;50:629–635.
19. Nguyen MN, Rajapakse JC. Prediction of protein relative solvent
accessibility with a two-stage SVM approach. Proteins 2005;59:30–37.
20. Wang JY, Lee HM, Ahmad S. Prediction and evolutionary
information analysis of protein solvent accessibility using multiple
linear regression. Proteins 2005;61:481–491.
21. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support
vector regression machines. In: Mozer MC, Jordan MI, Petsche T,
editors. Advances in Neural Information Processing Systems.
Cambridge, MA: MIT Press; 1997. pp 155–161.
22. Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative
protein chains from the Protein Data Bank (PDB) in 2003. Nucleic
Acids Res 2003;31:492–493.
23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig
H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic
Acids Res 2000;28:235–242.
24. Varrazzo D, Bernini A, Spiga O, Ciutti A, Chiellini S, Venditti V, Bracci
L, Niccolai N. Three-dimensional computation of atom depth in
complex molecular structures. Bioinformatics 2005;21:2856–2860.
25. Pintar A, Carugo O, Pongor S. Atom depth in protein structure
and function. Trends Biochem Sci 2003;28:593–597.
26. Sanner MF, Olson AJ, Spehner JC. Reduced surface: an efficient way
to compute molecular surfaces. Biopolymers 1996;38:305–320.
27. Hamelryck T. An amino acid has two sides: a new 2D measure
provides a different view of solvent exposure. Proteins 2005;59:38–48.
28. Qian N, Sejnowski TJ. Predicting the secondary structure of
globular proteins using neural network models. J Mol Biol
1988;202:865–884.
29. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res
1997;25:3389–3402.
30. Jones DT. Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol 1999;292:195–202.
31. Yuan Z, Bailey TL, Teasdale RD. Prediction of protein B-factor
profiles. Proteins 2005;58:905–912.
32. Gromiha MM, Selvaraj S. Importance of long-range interactions in
protein folding. Biophys Chem 1999;77:49–68.
33. Gromiha MM, Selvaraj S. Inter-residue interactions in protein
folding and stability. Prog Biophys Mol Biol 2004;86:235–277.
34. Yuan Z. Better prediction of protein contact number using a
support vector regression analysis of amino acid sequence. BMC
Bioinformatics 2005;6:248.
35. Joachims T. Making large-scale SVM learning practical. In:
Scholkopf B, Burges C, Smola A, editors. Advances in Kernel Methods:
Support Vector Learning. Cambridge, MA: MIT Press; 1999.
pp 169–184.
36. Kabsch W, Sander C. Dictionary of protein secondary structure:
pattern recognition of hydrogen-bonded and geometrical features.
Biopolymers 1983;22:2577–2637.
37. Samanta U, Bahadur RP, Chakrabarti P. Quantifying the accessible
surface area of protein residues in their local environment. Protein
Eng 2002;15:659–667.
38. Pintar A, Pongor S. The "first in-last out" hypothesis on protein
folding revisited. Proteins 2005;60:584–590.