Randomness of Protein Structure


8/7/2019 Randomness of Protein Structure

http://slidepdf.com/reader/full/randomness-of-protein-structure 1/6

Nonlinear deterministic structures and the randomness of protein sequences

Yanzhao Huang, Yi Xiao *

Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, China

Accepted 20 November 2002

Abstract

To clarify the randomness of protein sequences, we make a detailed analysis of a set of typical protein sequences representing each structural class by using the nonlinear prediction method. No deterministic structures are found in these protein sequences, which implies that they behave as random sequences. We also give an explanation for the controversial results obtained in previous investigations.

© 2003 Elsevier Science Ltd. All rights reserved.

One of the unsolved problems in molecular biophysics is how proteins encode their structural information in their amino acid sequences. The amino acid sequences of proteins appear very irregular, but the three-dimensional structures they encode clearly show a certain regularity. This riddle has motivated intensive studies of the longitudinal correlation properties of protein sequences [1–18] to see whether they are random or not. However, these studies gave opposing results: some showed that protein sequences were indistinguishable from random ones, while others indicated that protein sequences were nonrandom. For example, White and Jacobs [1] studied the statistical distribution of hydrophobic residues along the length of protein chains by using a binary hydrophobicity scale, which assigns hydrophobic residues a value of one and nonhydrophobic residues a value of zero. Using the standard run test, they found that, for the majority of the 5247 proteins examined, the distribution of hydrophobic residues along a sequence could not be distinguished from that expected for a random distribution. On the other hand, Pande et al. [8] studied the statistics of protein sequences by mapping the sequence onto the trajectory of a random walk. They found pronounced deviations from pure randomness. Note that both studies use a binary scale of hydrophobicity and hydrophilicity but different mapping schemes. In the work of White and Jacobs, Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp and Tyr were considered hydrophobic and the other residues hydrophilic, while in the work of Pande et al., Lys, Arg, His, Asp and Glu were considered hydrophilic and the others hydrophobic. Recently, Weiss and Herzel [12,13] analyzed the correlation functions in large sets of nonhomologous protein sequences. They found that the hydrophobicity autocorrelation showed period 3 to 4 oscillations. These oscillations decay until they vanish at a length of 10–15 amino acids, and they can be related to the 3.6-residue periodicity of α-helices. Rackovsky [14] demonstrated the existence in protein domain sequences of sets of statistically significant periodic signals, characteristic of the architectures of those domains. Therefore, despite the effort spent, it is still an open question whether protein sequences are random or not, and further work is warranted to clarify the apparent contradictions in the above results.

The above investigations were based on statistical methods commonly used in physics, namely correlation functions, random walks, Fourier transforms, etc. As mentioned above, protein sequences are very irregular. It is known that nonlinear dynamics theory has developed some very good methods to identify determinism or randomness in irregular systems [19], so it is reasonable to investigate the correlation properties of protein sequences by using the methods of nonlinear dynamics. In fact, the theory of chaos has already been applied to investigate the behavior of biomolecules. For example, El Naschie et al. [20–22] studied the possible connections of spatial chaos in mechanical elastic chains to the conformations of biomolecules and showed supercoiling in the elastic band very similar to that of DNA. They also investigated chaos and order in symbolic sequences and polymers [23,24]. In the present paper we study the correlation properties of protein sequences by using the nonlinear prediction method, which has previously been used successfully to distinguish between chaos and noise in time series. This method can give specific information about how different regions are characterized and can detect determinism that is not detected by standard methods such as the Fourier transform and the power spectrum. It can also give reasonable results for short sequences.

* Corresponding author.

E-mail address: [email protected] (Y. Xiao).

0960-0779/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved.

PII: S0960-0779(02)00571-4

Chaos, Solitons and Fractals xxx (2003) xxx–xxx

www.elsevier.com/locate/chaos

ARTICLE IN PRESS

The nonlinear prediction technique works as follows [19,25]. For an arbitrary symbolic series $x_1, x_2, x_3, \ldots, x_N$, one constructs a set of $d$-dimensional vectors:

$$
\begin{aligned}
X_1 &= (x_1, x_2, \ldots, x_d),\\
X_2 &= (x_2, x_3, \ldots, x_{d+1}),\\
&\;\;\vdots\\
X_{N-d+1} &= (x_{N-d+1}, x_{N-d+2}, \ldots, x_N),
\end{aligned}
\tag{1}
$$

which correspond to all possible segments of $d$ consecutive symbols. Next, for each vector $X_p = (x_p, x_{p+1}, \ldots, x_{p+d-1})$ ($1 \le p \le N-d$), one searches for its nearest neighbor $X_{H(p)} = (x_{H(p)}, x_{H(p)+1}, \ldots, x_{H(p)+d-1})$ and then compares how close the symbols $x_{p+d}$ and $x_{H(p)+d}$ that follow these two vectors are. The closeness of a pair of symbols $x_i$ and $x_j$ can be measured in a Hamming metric:

$$
h(x_i, x_j) =
\begin{cases}
0, & x_i = x_j,\\
1, & x_i \ne x_j,
\end{cases}
\tag{2}
$$

while the closeness of a pair of vectors $X_i$ and $X_j$ can be measured by

$$
H(X_i, X_j) = \sum_{k=0}^{d-1} h(x_{i+k}, x_{j+k}). \tag{3}
$$

The nearest neighbors $X_{H(p)}$ of a given vector $X_p$ are those $X_j$ that minimize $H(X_p, X_j)$ for $j \ne p$. Once the nearest neighbors $X_{H(p)}$ have been determined, we compute the mean local error $e_p = \langle h(x_{p+d}, x_{H(p)+d}) \rangle$, where $\langle \cdot \rangle$ denotes the average over all the nearest neighbors of $X_p$, since there is usually more than one nearest neighbor. From this, the overall mean error is

$$
\langle E_d \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p. \tag{4}
$$
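The procedure of Eqs. (1)–(4) can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' code: the function names are hypothetical, and it assumes that candidate neighbors are restricted to windows that have a following symbol (indices up to $N-d$), since the comparison needs $x_{H(p)+d}$ to exist.

```python
from typing import Sequence


def hamming(u: Sequence, v: Sequence) -> int:
    """Eq. (3): number of positions at which two equal-length windows differ."""
    return sum(a != b for a, b in zip(u, v))


def mean_prediction_error(x: Sequence, d: int) -> float:
    """Overall mean error <E_d> of Eq. (4) for a symbolic series x."""
    n = len(x)
    m = n - d  # windows X_1 .. X_{N-d}; each has a successor symbol x_{p+d}
    if d < 1 or m < 2:
        raise ValueError("need at least two windows that have a successor")
    windows = [tuple(x[p:p + d]) for p in range(m)]
    total = 0.0
    for p in range(m):
        # Distance from X_p to every other candidate window X_j, j != p.
        dists = [(hamming(windows[p], windows[j]), j) for j in range(m) if j != p]
        best = min(dist for dist, _ in dists)
        # Assumption: all windows tied at the minimum count as nearest neighbors.
        neighbours = [j for dist, j in dists if dist == best]
        # Local error e_p: fraction of neighbors whose successor differs.
        total += sum(x[p + d] != x[j + d] for j in neighbours) / len(neighbours)
    return total / m
```

For a strictly periodic series such as `"HP" * 20` with `d = 3`, every window has an identical neighbor with the same successor, so the function returns 0.0, in line with the deterministic limit. The brute-force search costs on the order of $N^2 d$ comparisons, which is acceptable for sequences of typical protein-domain length.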

For a perfectly deterministic sequence, e.g., a periodic sequence, $\langle E_d \rangle = 0$. For uncorrelated random chains, there is no relation between any symbol $x_{p+d}$ and the vector $X_p$, and in that case $\langle E_d \rangle$ can be approximated by $\sum_{\{a\}} p(a)[1 - p(a)]$, where $\{a\}$ is the alphabet from which the $x_i$ are drawn and $p(a)$ is the probability of occurrence of the symbol $a$. Consequently, for such series, the overall mean error $\langle E_d \rangle$ does not depend on the embedding dimension $d$.

For protein sequences, there are different ways to define the alphabet of the $x_i$, based on the selection of physicochemical properties of amino acids. In the present paper, we consider three different schemes for representing amino acids: (i) White's scheme [1]. The 20 amino acids are divided into two types: hydrophobic ($H$) and hydrophilic ($P$). In this case, each $x_i$ can take one of two symbols $\{H, P\}$, with $H$ representing Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp and Tyr, and $P$ representing the other amino acids. For a uniform random process, $p(H) = p(P) = 0.5$ and $\langle E_d \rangle = 0.5$. (ii) Pande's scheme [8]. It is similar to (i), but with $P$ representing Arg, Asp, Glu, His and Lys, and $H$ representing the other amino acids. (iii) In this case, each $x_i$ can take one of 20 symbols $\{A, C, D, \ldots, Y\}$, which represent the 20 amino acids. The similarity between $x_i$ and $x_j$ is taken as the value $B(x_i, x_j)$ of the blocks substitution matrix (BLOSUM62), i.e., $h(x_i, x_j) = B(x_i, x_j)$, and the closeness of a pair of vectors $X_i$ and $X_j$ is

$$
H(X_i, X_j) = \sum_{k=0}^{d-1} B(x_{i+k}, x_{j+k}). \tag{5}
$$


The overall mean error is defined as

$$
\langle E_d^B \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p^B = \frac{1}{N-d} \sum_{p=1}^{N-d} \langle B(x_{p+d}, x_{H(p)+d}) \rangle. \tag{6}
$$

It must be noted that, in this case, the larger the value of $H(X_i, X_j)$, the closer the vectors $X_i$ and $X_j$. Similarly, the larger the value of $\langle E_d^B \rangle$, the stronger the nonlinear correlation.
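The similarity-based variant of Eqs. (5) and (6) differs from the Hamming version only in that the nearest neighbors maximize the window score rather than minimize a distance. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the scoring function is passed in by the caller, and `toy_score` is a made-up stand-in (+1 for a match, -1 otherwise) rather than real BLOSUM62 values, which would have to be supplied from an actual substitution table.

```python
from typing import Callable, Sequence


def mean_similarity_score(x: Sequence[str], d: int,
                          score: Callable[[str, str], float]) -> float:
    """<E_d^B> of Eq. (6): mean score between actual and predicted successors."""
    n = len(x)
    m = n - d  # only windows that have a following symbol are used
    if d < 1 or m < 2:
        raise ValueError("need at least two windows that have a successor")
    total = 0.0
    for p in range(m):
        # Eq. (5): summed pairwise score between window p and each window j != p.
        sims = [(sum(score(x[p + k], x[j + k]) for k in range(d)), j)
                for j in range(m) if j != p]
        best = max(s for s, _ in sims)  # larger score means closer vectors
        neighbours = [j for s, j in sims if s == best]
        # e_p^B: successor score averaged over all tied nearest neighbors.
        total += sum(score(x[p + d], x[j + d]) for j in neighbours) / len(neighbours)
    return total / m


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for B(a, b); NOT the BLOSUM62 matrix."""
    return 1.0 if a == b else -1.0
```

With `toy_score`, a periodic series such as `"ACD" * 10` at `d = 2` yields 1.0, the largest value this toy scoring allows, which matches the statement that larger $\langle E_d^B \rangle$ indicates stronger correlation.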

Fig. 1. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (i). [Plot omitted: horizontal axis $d$ from 5 to 50, vertical axis average $\langle E_d \rangle$ from 0.45 to 0.53, with curves for the α, β and αβ classes.]

Fig. 2. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (ii). [Plot omitted: horizontal axis $d$ from 5 to 50, vertical axis average $\langle E_d \rangle$ from 0.25 to 0.5, with curves for the α, β and αβ classes.]


Protein sequences corresponding to three different structural classes, α, β and αβ [26], are analyzed separately. The representative protein sequences of the three structural classes are taken from the PDB-select domain sequences with less than 25% identity [27]. In the database, there are 108, 136, and 413 sequences in the α, β, and αβ classes, respectively. For these sequences, the average error $\langle E_d \rangle$ over the ensemble of protein sequences in a structural class is computed as a function of the embedding dimension $d$. The results are shown in Figs. 1–4.

Fig. 1 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (i), i.e., White's scheme. It can be seen that the average values of $\langle E_d \rangle$ for all the structural classes show no significant deviation from 0.5 for any embedding dimension $d$. This implies that the protein sequences behave as random ones on average. This is just the conclusion reached by White and Jacobs [1].

Fig. 3. The average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated for scheme (iii). [Plot omitted: horizontal axis $d$ from 5 to 50, vertical axis average $\langle E_d^B \rangle$ from −1 to −0.7, with curves for the α, β and αβ classes.]

Fig. 4. The reduced overall mean error $\langle R_d \rangle$ versus the embedding dimension $d$ calculated for scheme (iii). [Plot omitted: horizontal axis $d$ from 5 to 50, vertical axis $\langle R_d \rangle$ from 0 to 1, with curves for the α, β and αβ classes.]

Fig. 2 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (ii), i.e., Pande's scheme. In this case, the average values of $\langle E_d \rangle$ of all three structural classes show a clear deviation from 0.5 and are around 0.375. This means that the protein sequences represented by scheme (ii) deviate significantly from uniform random sequences with $\langle E_d \rangle = 0.5$. However, this does not imply that protein sequences in Pande's scheme show nonlinear deterministic structures. In fact, in Pande's scheme, the probabilities of occurrence of $H$ and $P$ in protein sequences are not equal to those (0.5) in uniform random sequences. If the probability of each of the 20 amino acids in protein sequences is $1/20$, then the probabilities of occurrence of $H$ and $P$ are $3/4$ and $1/4$, respectively, in Pande's scheme, since 15 amino acids are hydrophobic and only 5 are hydrophilic. Therefore, for an uncorrelated random series with $p(H) = 3/4$ and $p(P) = 1/4$, the mean error is $\langle E_d \rangle = (3/4)[1 - (3/4)] + (1/4)[1 - (1/4)] = 0.375$. Fig. 2 indeed shows that the average values of $\langle E_d \rangle$ for protein sequences of the three structural classes are around 0.375. Furthermore, the averaged values of $\langle E_d \rangle$ for the three structural classes are separated from each other, with those for the αβ class lying between those for the α and β classes. It seems that the protein sequences of the β class are more regular than those of the other two classes.
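The baseline argument above is easy to check numerically. The following sketch encodes a sequence under the two binary schemes and evaluates the random-chain estimate $\sum_a p(a)[1 - p(a)]$ from the observed symbol frequencies. The function and constant names are hypothetical, and standard one-letter amino acid codes are assumed for the input; note that the output symbol `P` (hydrophilic) is unrelated to the one-letter code for proline.

```python
from collections import Counter

# Scheme (i): Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp, Tyr are hydrophobic.
WHITE_HYDROPHOBIC = set("FMLIVCAPGWY")
# Scheme (ii): Lys, Arg, His, Asp, Glu are hydrophilic; all others hydrophobic.
PANDE_HYDROPHILIC = set("KRHDE")


def encode_white(seq: str) -> str:
    """Map one-letter amino acid codes to H/P under scheme (i)."""
    return "".join("H" if aa in WHITE_HYDROPHOBIC else "P" for aa in seq)


def encode_pande(seq: str) -> str:
    """Map one-letter amino acid codes to H/P under scheme (ii)."""
    return "".join("P" if aa in PANDE_HYDROPHILIC else "H" for aa in seq)


def random_baseline(symbols: str) -> float:
    """Estimate sum_a p(a) * (1 - p(a)) from the symbol frequencies."""
    n = len(symbols)
    return sum((c / n) * (1 - c / n) for c in Counter(symbols).values())


# Each of the 20 amino acids once, i.e. the equal-frequency case in the text.
ALL_TWENTY = "ACDEFGHIKLMNPQRSTVWY"
print(random_baseline(encode_pande(ALL_TWENTY)))  # approximately 0.375
```

Applying the same estimate to scheme (i) with equal amino acid frequencies gives $p(H) = 11/20$ and a baseline of about 0.495, which is close to the value 0.5 quoted for a uniform binary process.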

Fig. 3 shows the average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated by using scheme (iii). Our task was to identify average values of $\langle E_d^B \rangle$ for a given class of sequences with magnitudes larger than one would expect for randomly selected sequences. However, in this case, it is difficult to obtain the values of $\langle E_d^B \rangle$ for a random sequence analytically. We therefore needed to construct a scale on which to measure the size of $\langle E_d^B \rangle$. We did this as follows. For every sequence in a structural class, we generated an ensemble of 1000 sequences, each with a composition identical to that of the actual protein sequence but with a randomly permuted ordering of amino acids. For each such random sequence, we calculated the values of $\langle E_d^B \rangle$ versus the embedding dimension $d$. From these, we generated an average value $\langle r_d \rangle$ of $\langle E_d^B \rangle$ for each $d$ over the ensemble and a standard deviation $\sigma\{r_d\}$. These values, together with $\langle E_d^B \rangle$ for the actual protein sequence, made it possible to generate reduced values of $\langle E_d^B \rangle$ as a function of $d$:

$$
R_d = \frac{\langle E_d^B \rangle - \langle r_d \rangle}{\sigma\{r_d\}}. \tag{7}
$$

The quantity defined in Eq. (7) gives the deviation of $\langle E_d^B \rangle$ of the actual protein sequence from its random ensemble average, measured in units of the standard deviation. Thus, large positive values indicate nonlinear correlation coefficients that are significantly larger than those measured for random sequences. These are the signals we sought. The averaged nonlinear correlation coefficient $\langle R_d \rangle$ is the average of $R_d$ over the ensemble of protein sequences in a structural class. The results are shown in Fig. 4. Furthermore, we defined a significant nonlinear correlation coefficient as one for which $\langle R_d \rangle \ge 1.0$. As shown in Fig. 4, none of the $\langle R_d \rangle$ values for the three structural classes shows a significant deviation from random sequences. This again implies that the protein sequences behave as random sequences on average and show no clear nonlinear deterministic structures. However, for $5 < d < 20$, the values of $\langle R_d \rangle$ for all three structural classes are larger than elsewhere and, in particular, they are close to 1.0 for the α and β classes. Because $\langle R_d \rangle$ is an average over one structural class, this implies that some of the protein sequences of the α and β classes may have certain nonlinear deterministic structures for this range of $d$.
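The reduced value of Eq. (7) is a standard score against a composition-preserving shuffled ensemble, and can be sketched as follows. The function name and signature are hypothetical; `statistic` stands for any function that maps a list of symbols to a number, for example an implementation of $\langle E_d^B \rangle$ at a fixed $d$. The text does not say whether the standard deviation is the population or the sample form, so the population form used here is an assumption.

```python
import random
import statistics
from typing import Callable, List, Sequence


def reduced_score(seq: Sequence[str],
                  statistic: Callable[[List[str]], float],
                  n_shuffles: int = 1000,
                  seed: int = 0) -> float:
    """R of Eq. (7): (observed - shuffled mean) / shuffled standard deviation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    symbols = list(seq)
    observed = statistic(symbols)
    shuffled_values = []
    for _ in range(n_shuffles):
        # In-place permutation keeps the composition and destroys the order.
        rng.shuffle(symbols)
        shuffled_values.append(statistic(symbols))
    mean_r = statistics.mean(shuffled_values)
    sd_r = statistics.pstdev(shuffled_values)  # assumption: population SD
    if sd_r == 0:
        raise ValueError("shuffled ensemble has zero spread")
    return (observed - mean_r) / sd_r
```

Averaging the returned value over all sequences of a structural class at each $d$ would give the quantity plotted as $\langle R_d \rangle$; with 1000 shuffles per sequence and a quadratic-time statistic, this step dominates the running time.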

In conclusion, we did not find any significant deterministic structures in the protein sequences on average from the calculations based on all three schemes. Furthermore, the method used here is very simple and allows us to clarify the controversy in the previous investigations. Our results show that the controversy may be due to the use of different schemes for representing amino acids. Although the protein sequences do not behave as uniform random sequences in Pande's scheme, they still behave as random sequences. Using the more sophisticated BLOSUM matrix as the measure of the distance between amino acids, we again did not find significant evidence of nonlinear correlations in the protein sequences. These results raise important questions about how a random sequence can fold into a spatial structure with a certain regularity and how a random sequence can encode its structural information. Although the analysis of nonlinear deterministic structures using the schemes above shows that the protein sequences behave as random sequences on average, it does not preclude the possibility that some protein sequences have deterministic structures or that protein sequences encode structural information in other ways and under different schemes.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant nos. 10175023 and 90103031.


References

[1] White ST, Jacobs RE. Biophys J 1990;57:911.

[2] Li XQ, Luo LF. Acta Sci Nat Univ Intram 1992;23:534.

[3] Johnson MS, Overington JP. J Mol Biol 1993;233:716.

[4] Cohen C, Parry DAD. Science 1994;263:488.

[5] Shakhnovich EI. Phys Rev Lett 1994;72:3907.

[6] Hobohm U, Sander C. J Mol Biol 1995;251:390.

[7] Rahman RS, Rackovsky S. Biophys J 1995;68:1591.

[8] Pande VS et al. Proc Natl Acad Sci USA 1994;91:12972.

[9] Eisenberg D et al. Proc Natl Acad Sci USA 1984;81:140.

[10] Herzel H, Grosse I. Physica A 1995;216:518.

[11] Herzel H, Grosse I. Phys Rev E 1997;55:800.

[12] Weiss O, Herzel H. J Theor Biol 1998;190:341.

[13] Weiss O, Herzel H. Zeitschr Phys Chem 1998;204:183.

[14] Rackovsky S. Proc Natl Acad Sci USA 1995;81:140.

[15] Mandell AJ et al. J Stat Phys 1998;93:673.

[16] Mandell AJ et al. Physica A 1997;244:254.

[17] Chechetin VR, Lobzin VV. J Theor Biol 1999;198:197.

[18] Korotkova MA et al. J Mol Model 1999;5:103.

[19] Kantz H, Schreiber T. Nonlinear time series analysis. Cambridge: Cambridge University Press; 1997.

[20] El Naschie MS, Kapitaniak T. Phys Lett A 1990;147:275.

[21] El Naschie MS. J Phys Soc Jpn 1989;58:4310.

[22] El Naschie MS, Al Athel S. Z Naturforsch 1989;44a:645.

[23] Ebeling W, El Naschie MS. Chaos, Solitons & Fractals 1994;4 (Special Issue).

[24] El Naschie MS. Chaos, Solitons & Fractals 1998;9:135.

[25] Barral J et al. Phys Rev E 2000;61:1812.

[26] Branden C, Tooze J. Introduction to protein structure. 2nd ed. New York: Garland Publishing; 1999.

[27] The database of domain sequences used here can be obtained by going to: http://www.cmbi.kun.nl/gv/pdbsel/.
