Nonlinear deterministic structures and the randomness
of protein sequences
Yanzhao Huang, Yi Xiao *
Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, China
Accepted 20 November 2002
Abstract
To clarify the randomness of protein sequences, we make a detailed analysis of a set of typical protein sequences
representing each structural class by using a nonlinear prediction method. No deterministic structures are found in
these protein sequences, which implies that they behave as random sequences. We also give an explanation for the
controversial results obtained in previous investigations.
© 2003 Elsevier Science Ltd. All rights reserved.
One of the unsolved problems in molecular biophysics is how proteins encode their structural information in their
amino acid sequences. The amino acid sequences of proteins appear very irregular, but the three-dimensional
structures they encode clearly show certain regularity. This riddle has motivated intensive studies of the longitudinal
correlation properties of protein sequences [1–18] to see whether they are random or not. However, these studies gave
opposing results: some showed that protein sequences were indistinguishable from random ones, while others
indicated that protein sequences were nonrandom. For example, White and Jacobs [1] studied the statistical
distribution of hydrophobic residues along the length of protein chains by using a binary hydrophobicity scale, which
assigns hydrophobic residues a value of one and nonhydrophobic residues a value of zero. Using the standard run test, they
found that, for the majority of the 5247 proteins examined, the distribution of hydrophobic residues along a sequence
could not be distinguished from that expected for a random distribution. On the other hand, Pande et al. [8] studied
the statistics of protein sequences by mapping the sequence onto the trajectory of a random walk.
They found pronounced deviations from pure randomness. Both studies use a binary scale of
hydrophobicity and hydrophilicity but different mapping schemes. In the work of White and Jacobs, Phe, Met, Leu, Ile,
Val, Cys, Ala, Pro, Gly, Trp and Tyr were considered hydrophobic and the other residues hydrophilic, while in the
work of Pande et al., Lys, Arg, His, Asp and Glu were considered hydrophilic and the others hydrophobic. Recently,
Weiss and Herzel [12,13] analyzed the correlation functions in large sets of nonhomologous protein sequences. They
found that the hydrophobicity autocorrelation showed period 3 to 4 oscillations. These oscillations decay until they
vanish at a length of 10–15 amino acids, and they can be related to the 3.6 periodicity of α-helices. Rackovsky [14]
demonstrated the existence in protein domain sequences of sets of statistically significant periodic signals, characteristic
of the architectures of those domains. Therefore, despite these efforts, it is still an open question whether protein
sequences are random or not, and further work is warranted to clarify the apparent contradictions in the above
results.
* Corresponding author: Y. Xiao.
Chaos, Solitons and Fractals xxx (2003) xxx–xxx (article in press). PII: S0960-0779(02)00571-4. © 2003 Elsevier Science Ltd. All rights reserved. www.elsevier.com/locate/chaos

The above investigations were based on statistical methods commonly used in physics, namely correlation functions,
random walks, Fourier transforms, etc. As mentioned above, protein sequences are very irregular. It is known that
nonlinear dynamics theory has developed some very good methods to identify determinism or randomness in
irregular systems [19], so it is reasonable to investigate the correlation properties of protein sequences by using the
methods of nonlinear dynamics. In fact, the theory of chaos has already been applied to investigate the behavior of
biomolecules. For example, El Naschie et al. [20–22] studied the possible connections of spatial chaos in mechanical
elastic chains to the conformations of biomolecules, and they showed supercoiling in the elastic band very similar to
that of DNA. They also investigated chaos and order in symbolic sequences and polymers [23,24]. In the present paper
we study the correlation properties of protein sequences by using a nonlinear prediction method which has previously
been used successfully to distinguish between chaos and noise in time series. This method can give specific
information on how different regions are characterized and can detect determinism that is not detected by the
standard methods, such as the Fourier transform and the power spectrum. It can also give reasonable results for short
sequences.
The nonlinear prediction technique works as follows [19,25]. For an arbitrary symbolic series $x_1, x_2, x_3, \ldots, x_N$, one
constructs a set of $d$-dimensional vectors:
\[
\begin{aligned}
X_1 &= (x_1, x_2, \ldots, x_d),\\
X_2 &= (x_2, x_3, \ldots, x_{d+1}),\\
&\;\;\vdots\\
X_{N-d+1} &= (x_{N-d+1}, x_{N-d+2}, \ldots, x_N),
\end{aligned}
\tag{1}
\]
which correspond to all possible segments of $d$ consecutive symbols. Next, for each vector $X_p = (x_p, x_{p+1}, \ldots, x_{p+d-1})$
($1 \le p \le N-d$), one searches for its nearest neighbor $X_{H(p)} = (x_{H(p)}, x_{H(p)+1}, \ldots, x_{H(p)+d-1})$ and then compares how close
the symbols $x_{p+d}$ and $x_{H(p)+d}$ that follow these two vectors are. The closeness of a pair of symbols $x_i$ and $x_j$ can be
measured in a Hamming metric:
\[
h(x_i, x_j) =
\begin{cases}
0, & x_i = x_j,\\
1, & x_i \ne x_j,
\end{cases}
\tag{2}
\]
while the closeness of a pair of vectors $X_i$ and $X_j$ can be measured by
\[
H(X_i, X_j) = \sum_{k=0}^{d-1} h(x_{i+k}, x_{j+k}).
\tag{3}
\]
The nearest neighbors $X_{H(p)}$ of a given vector $X_p$ are those $X_j$ that minimize $H(X_p, X_j)$ for $j \ne p$. Once the
nearest neighbors $X_{H(p)}$ have been determined, we compute the mean local error $e_p = \langle h(x_{p+d}, x_{H(p)+d}) \rangle$, where $\langle \cdot \rangle$ denotes
the average over all the nearest neighbors of $X_p$, since there is usually more than one nearest neighbor. From this,
the overall mean error is
\[
\langle E_d \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p.
\tag{4}
\]
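The procedure of Eqs. (1)–(4) can be sketched in a few lines of Python. This is only an illustrative brute-force version, not the authors' code: the function names `hamming_window` and `mean_prediction_error` are hypothetical, indices are 0-based, and the neighbor search is restricted to windows that still have a following symbol (an assumption made here so that the successor comparison is always defined).

```python
def hamming_window(seq, i, j, d):
    """Eq. (3): number of mismatches between the length-d windows starting at i and j."""
    return sum(seq[i + k] != seq[j + k] for k in range(d))


def mean_prediction_error(seq, d):
    """Eq. (4): average of the local errors e_p over all windows with a successor."""
    n_windows = len(seq) - d  # windows X_p for which the symbol x_{p+d} exists
    if n_windows < 2:
        raise ValueError("sequence too short for this embedding dimension")
    total = 0.0
    for p in range(n_windows):
        # Distance from X_p to every other candidate window (assumption: j != p,
        # and j must also have a successor symbol).
        dists = [(hamming_window(seq, p, j, d), j)
                 for j in range(n_windows) if j != p]
        best = min(dist for dist, _ in dists)
        neighbours = [j for dist, j in dists if dist == best]
        # Local error e_p: fraction of tied nearest neighbours whose
        # successor symbol differs from the successor of X_p.
        e_p = sum(seq[p + d] != seq[j + d] for j in neighbours) / len(neighbours)
        total += e_p
    return total / n_windows


if __name__ == "__main__":
    # A strictly periodic sequence is perfectly predictable.
    print(mean_prediction_error("HP" * 10, d=2))
```

The search costs on the order of $N^2 d$ comparisons per sequence, which is acceptable for sequences of protein-domain length but would need a smarter neighbor search for much longer series.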
For a perfectly deterministic sequence, e.g., a periodic sequence, $\langle E_d \rangle = 0$. For uncorrelated random chains, there is no
relation between any symbol $x_{p+d}$ and the vector $X_p$, and in that case $\langle E_d \rangle$ can be approximated by
$\sum_{\{a\}} p(a)[1 - p(a)]$, where $\{a\}$ is the alphabet taken by $x_i$ and $p(a)$ is the probability of occurrence of the symbol $a$.
Consequently, for such series, the overall mean error $\langle E_d \rangle$ will not depend on the embedding dimension $d$.

For protein sequences, there are different ways to define the alphabet taken by $x_i$, based on the selection of
physicochemical properties of amino acids. In the present paper, we consider three different schemes of representing
amino acids: (i) White's scheme [1]. The 20 amino acids are divided into two types: hydrophobic ($H$) and
hydrophilic ($P$). In this case, each $x_i$ can take one of two symbols $\{H, P\}$, with $H$ representing Phe, Met, Leu, Ile, Val, Cys,
Ala, Pro, Gly, Trp, Tyr and $P$ representing the other amino acids. For a uniform random process,
$p(H) = p(P) = 0.5$ and $\langle E_d \rangle = 0.5$. (ii) Pande's scheme [8]. It is similar to (i), but with $P$ representing Arg, Asp, Glu,
His, Lys and $H$ representing the other amino acids. (iii) In this case, each $x_i$ can take one of 20 symbols $\{A, C, D, \ldots, Y\}$,
which represent the 20 amino acids. The similarity between $x_i$ and $x_j$ is taken as the value $B(x_i, x_j)$ of the blocks
substitution matrix (BLOSUM62), i.e., $h(x_i, x_j) = B(x_i, x_j)$, and the closeness of a pair of vectors $X_i$ and $X_j$ is
\[
H(X_i, X_j) = \sum_{k=0}^{d-1} B(x_{i+k}, x_{j+k}).
\tag{5}
\]
The overall mean error is defined as
\[
\langle E_d^B \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p^B = \frac{1}{N-d} \sum_{p=1}^{N-d} \langle B(x_{p+d}, x_{H(p)+d}) \rangle.
\tag{6}
\]
It must be noted that, in this case, the larger the value of $H(X_i, X_j)$, the closer the vectors $X_i$ and $X_j$. Similarly, the larger
the value of $\langle E_d^B \rangle$, the stronger the nonlinear correlation.
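The two binary schemes (i) and (ii) amount to recoding a one-letter amino acid string into the alphabet {H, P}. The sketch below illustrates this under stated assumptions: the function name `encode` and the scheme labels `"white"` and `"pande"` are hypothetical, input is assumed to be upper-case one-letter codes for the 20 standard residues, and scheme (iii) is left out because it requires the BLOSUM62 table.

```python
# One-letter codes for the residue groups named in the text.
WHITE_HYDROPHOBIC = set("FMLIVCAPGWY")  # Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp, Tyr
PANDE_HYDROPHILIC = set("RDEHK")        # Arg, Asp, Glu, His, Lys
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def encode(sequence, scheme):
    """Map a one-letter amino acid string onto the binary alphabet {H, P}."""
    if scheme not in ("white", "pande"):
        raise ValueError("scheme must be 'white' or 'pande'")
    symbols = []
    for residue in sequence:
        if residue not in STANDARD_RESIDUES:
            raise ValueError(f"unexpected residue: {residue!r}")
        if scheme == "white":
            # Scheme (i): the listed residues are H, all others are P.
            symbols.append("H" if residue in WHITE_HYDROPHOBIC else "P")
        else:
            # Scheme (ii): the listed residues are P, all others are H.
            symbols.append("P" if residue in PANDE_HYDROPHILIC else "H")
    return "".join(symbols)


if __name__ == "__main__":
    print(encode("SGK", "white"))  # Ser -> P, Gly -> H, Lys -> P
    print(encode("SGK", "pande"))  # Ser -> H, Gly -> H, Lys -> P
```

Note that the same residue can change class between the schemes (Ser in this example), which is why the two encodings of one protein can have quite different H/P compositions.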
[Figure 1: averaged $\langle E_d \rangle$ (vertical axis, 0.45 to 0.53) plotted against $d$ (horizontal axis, 5 to 50) for the α, β and αβ classes.]
Fig. 1. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (i).
[Figure 2: averaged $\langle E_d \rangle$ (vertical axis, 0.25 to 0.5) plotted against $d$ (horizontal axis, 5 to 50) for the α, β and αβ classes.]
Fig. 2. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (ii).
Protein sequences corresponding to three different structural classes, α, β and αβ [26], are analyzed separately. The
representative protein sequences of the three structural classes are taken from the PDB-select domain sequences with
less than 25% identity [27]. In the database, there are 108, 136, and 413 sequences in the α, β, and αβ classes, respectively.
For these sequences, the average error $\langle E_d \rangle$ over the ensemble of protein sequences in a structural class is computed as a
function of the embedding dimension $d$. These results are shown in Figs. 1–4.

Fig. 1 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (i), i.e.,
White's scheme. It can be seen that the average values of $\langle E_d \rangle$ for all the structural classes show no significant deviation
from 0.5 for any embedding dimension $d$. This implies that the protein sequences behave as random ones on average.
This is just the conclusion given by White and Jacobs [1].

[Figure 3: averaged $\langle E_d^B \rangle$ (vertical axis, -1 to -0.7) plotted against $d$ (horizontal axis, 5 to 50) for the α, β and αβ classes.]
Fig. 3. The average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated for scheme (iii).

[Figure 4: $\langle R_d \rangle$ (vertical axis, 0 to 1) plotted against $d$ (horizontal axis, 5 to 50) for the α, β and αβ classes.]
Fig. 4. The reduced overall mean error $\langle R_d \rangle$ versus the embedding dimension $d$ calculated for scheme (iii).
Fig. 2 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (ii), i.e.,
Pande's scheme. In this case, the average values of $\langle E_d \rangle$ for all three structural classes show a clear deviation from 0.5
and are around 0.375. This means that the protein sequences represented by scheme (ii) deviate significantly from
uniform random sequences with $\langle E_d \rangle = 0.5$. However, this does not imply that protein sequences in Pande's scheme
show nonlinear deterministic structures. In fact, in Pande's scheme, the probabilities of occurrence of $H$ and $P$ in
protein sequences are not equal to those (0.5) in uniform random sequences. If the probability of each of the 20 amino acids
in protein sequences is $1/20$, then the probabilities of occurrence of $H$ and $P$ are $3/4$ and $1/4$, respectively, in Pande's
scheme, since 15 amino acids are hydrophobic and only 5 are hydrophilic. Therefore, for an uncorrelated
random series with $p(H) = 3/4$ and $p(P) = 1/4$, the mean error is $\langle E_d \rangle = (3/4)[1 - (3/4)] + (1/4)[1 - (1/4)] = 0.375$.
Fig. 2 indeed shows that the average values of $\langle E_d \rangle$ for protein sequences of the three structural classes are around
0.375. Furthermore, the averaged values of $\langle E_d \rangle$ for the three structural classes are separated from each
other, and those for the αβ class lie between those for the α and β classes. It seems that the protein sequences of the β class are more regular
than those of the other two classes.
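The random-sequence baseline $\sum_{\{a\}} p(a)[1 - p(a)]$ used in this argument is easy to check numerically. The following sketch uses exact fractions; the function name `random_baseline` is hypothetical, and the last case (scheme (i) with all 20 amino acids equally likely, giving 11 hydrophobic and 9 hydrophilic residues) is an extra illustration under the same equal-frequency assumption, not a figure quoted in the text.

```python
from fractions import Fraction


def random_baseline(probabilities):
    """Expected mean error for an uncorrelated sequence: sum of p(a) * (1 - p(a))."""
    if sum(probabilities.values()) != 1:
        raise ValueError("symbol probabilities must sum to 1")
    return sum(p * (1 - p) for p in probabilities.values())


if __name__ == "__main__":
    uniform = random_baseline({"H": Fraction(1, 2), "P": Fraction(1, 2)})
    pande = random_baseline({"H": Fraction(3, 4), "P": Fraction(1, 4)})
    # Assumption for illustration: scheme (i) with equal amino acid frequencies.
    white_equal = random_baseline({"H": Fraction(11, 20), "P": Fraction(9, 20)})
    print(uniform, float(uniform))          # 1/2 0.5
    print(pande, float(pande))              # 3/8 0.375
    print(white_equal, float(white_equal))  # 99/200 0.495
```

The baseline depends only on the symbol composition, so a value below 0.5 by itself says nothing about order along the chain.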
Fig. 3 shows the average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated by using scheme (iii). Our
task was to identify the average values of $\langle E_d^B \rangle$ of a given class of sequences with magnitudes larger than one would
expect in randomly selected sequences. However, in this case, it is difficult to give the values of $\langle E_d^B \rangle$ for the random
sequence analytically. We therefore needed to construct a scale on which to measure the size of $\langle E_d^B \rangle$. We did this as
follows. For every sequence in a structural class, we generated an ensemble of 1000 sequences, each with a composition
identical to that of the actual protein sequence but with a randomly permuted ordering of amino acids. For each
such random sequence, we calculated the values of $\langle E_d^B \rangle$ versus the embedding dimension $d$. From these, we generated
an average value $\langle r_d \rangle$ of $\langle E_d^B \rangle$ for each $d$ over the ensemble and a standard deviation $\sigma\{r_d\}$. These values, together with
$\langle E_d^B \rangle$ for the actual protein sequence, made it possible to generate reduced values of $\langle E_d^B \rangle$ as a function of $d$:
\[
R_d = \frac{\langle E_d^B \rangle - \langle r_d \rangle}{\sigma\{r_d\}}.
\tag{7}
\]
The quantity defined in Eq. (7) gives the deviation of the specified $\langle E_d^B \rangle$ of the actual protein sequence from its random
ensemble average, measured in units of the standard deviation. Thus, large positive values indicate nonlinear correlation coefficients
that are significantly larger than those measured for random sequences. These are the signals we sought. The
averaged nonlinear correlation coefficient $\langle R_d \rangle$ is an average of $R_d$ over the ensemble of protein sequences in a
structural class. The results are shown in Fig. 4. Furthermore, we defined a significant nonlinear correlation coefficient
as one for which $\langle R_d \rangle \ge 1.0$. As shown in Fig. 4, none of the $\langle R_d \rangle$ values for the three structural classes shows a significant
deviation from random sequences. This again implies that the protein sequences behave as random sequences on
average and show no clear nonlinear deterministic structures. However, for $5 < d < 20$, the values of $\langle R_d \rangle$ for the protein
sequences of all three structural classes are larger than elsewhere and, in particular, they are close to 1.0 for the
α and β classes. Because $\langle R_d \rangle$ is an average over one structural class, this implies that some of the protein sequences
of the α and β classes may have certain nonlinear deterministic structures for this range of $d$.
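The shuffled-ensemble normalization of Eq. (7) can be sketched generically: compute a statistic on the real sequence, recompute it on composition-preserving permutations, and express the difference in units of the ensemble standard deviation. In the sketch below the names `reduced_score` and `adjacent_matches` are hypothetical, the fixed seed is an assumption made for reproducibility, and `adjacent_matches` is only a simple stand-in statistic; it is not the BLOSUM-based $\langle E_d^B \rangle$ of the text.

```python
import random
import statistics


def adjacent_matches(symbols):
    """Stand-in statistic: number of positions where a symbol equals its successor."""
    return sum(a == b for a, b in zip(symbols, symbols[1:]))


def reduced_score(seq, statistic, n_shuffles=1000, seed=0):
    """Eq. (7)-style score: (observed - ensemble mean) / ensemble standard deviation."""
    rng = random.Random(seed)
    observed = statistic(list(seq))
    symbols = list(seq)
    null_values = []
    for _ in range(n_shuffles):
        rng.shuffle(symbols)  # permutes order, keeps composition identical
        null_values.append(statistic(symbols))
    mean_null = statistics.fmean(null_values)
    sd_null = statistics.pstdev(null_values)
    if sd_null == 0:
        raise ValueError("statistic does not vary over the shuffled ensemble")
    return (observed - mean_null) / sd_null


if __name__ == "__main__":
    # A blocky sequence has far more adjacent matches than its shuffles.
    print(reduced_score("H" * 10 + "P" * 10, adjacent_matches, n_shuffles=200))
```

Because every shuffle keeps the amino acid composition, a score near zero means the ordering adds nothing beyond composition, which is the sense in which the text calls a sequence random.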
In conclusion, we did not find any significant deterministic structures in the protein sequences on average from the
calculations based on all three schemes. Furthermore, the method used here is very simple and enables us
to clarify the controversy in the previous investigations. Our results show that the controversy may be due to the use of
different schemes for representing amino acids. Although the protein sequences do not behave as uniform random
sequences in Pande's scheme, they still behave as random sequences. Using the more sophisticated BLOSUM matrix
as the measure of the distance between amino acids, we again did not find significant evidence of nonlinear correlations in
the protein sequences. These results raise important questions about how a random sequence can fold into a spatial structure
with certain regularity and how a random sequence can encode its structural information. Although the analysis of
nonlinear deterministic structures using the schemes above shows that the protein sequences behave as random sequences
on average, it does not preclude the possibility that some protein sequences have deterministic structures or that
protein sequences encode their structural information in other ways and under different schemes.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant nos. 10175023 and
90103031.
References
[1] White ST, Jacobs RE. Biophys J 1990;57:911.
[2] Li XQ, Luo LF. Acta Sci Nat Univ Intram 1992;23:534.
[3] Johnson MS, Overington JP. J Mol Biol 1993;233:716.
[4] Cohen C, Parry DAD. Science 1994;263:488.
[5] Shakhnovich EI. Phys Rev Lett 1994;72:3907.
[6] Hobohm U, Sander C. J Mol Biol 1995;251:390.
[7] Rahman RS, Rackovsky S. Biophys J 1995;68:1591.
[8] Pande VS et al. Proc Natl Acad Sci USA 1994;91:12972.
[9] Eisenberg D et al. Proc Natl Acad Sci USA 1984;81:140.
[10] Herzel H, Grosse I. Physica A 1995;216:518.
[11] Herzel H, Grosse I. Phys Rev E 1997;55:800.
[12] Weiss O, Herzel H. J Theor Biol 1998;190:341.
[13] Weiss O, Herzel H. Zeitschr Phys Chem 1998;204:183.
[14] Rackovsky S. Proc Natl Acad Sci USA 1995;81:140.
[15] Mandell AJ et al. J Stat Phys 1998;93:673.
[16] Mandell AJ et al. Physica A 1997;244:254.
[17] Chechetin VR, Lobzin VV. J Theor Biol 1999;198:197.
[18] Korotkova MA et al. J Mol Model 1999;5:103.
[19] Kantz H, Schreiber T. Nonlinear time series analysis. Cambridge: Cambridge University Press; 1997.
[20] El Naschie MS, Kapitaniak T. Phys Lett A 1990;147:275.
[21] El Naschie MS. J Phys Soc Jpn 1989;58:4310.
[22] El Naschie MS, Al Athel S. Z Naturforsch 1989;44a:645.
[23] Ebeling W, El Naschie MS, Chaos, Solitons & Fractals, 4, Special Issue, 1994.
[24] El Naschie MS. Chaos, Solitons & Fractals 1998;9:135.
[25] Barral J et al. Phys Rev E 2000;61:1812.
[26] Branden C, Tooze J. Introduction to protein structure. 2nd ed. New York: Garland Publishing; 1999.
[27] The database of domain sequences used here can be obtained by going to: http://www.cmbi.kun.nl/gv/pdbsel/.