Transcript of Project Presentation
CS573X Machine Learning Project, 5/4/05
Prediction of peptidyl prolyl residues in cis/trans configuration
using machine learning algorithms
Jae-Hyung Lee, Genetics, Development and Cell Biology
Bioinformatics and Computational Biology Program
What is a protein?
Proteins are polypeptide chains
Torsional angles
Although the peptide bond is planar and fixed, rotation can and does occur about the two single bonds on either side of the α carbon: Φ (phi), the bond between N and Cα, and Ψ (psi), the bond between Cα and C.
Two different peptide configurations
cis: ω = -20° ~ 20°
trans: ω = -180° ~ -160° or ω = 160° ~ 180°
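For reference, the ω torsion angle of a peptide bond can be computed from the coordinates of four backbone atoms (Cα(i-1), C(i-1), N(i), Cα(i)) and then binned into the cis/trans ranges above. A minimal sketch (the function names and the "distorted" fallback label are illustrative, not from the project):

```python
import math

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four 3-D points.

    For the omega angle of a peptide bond the points would be
    CA(i-1), C(i-1), N(i), CA(i)."""
    def sub(a, b): return [a[k] - b[k] for k in range(3)]
    def cross(a, b):
        return [a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0]]
    def dot(a, b): return sum(x*y for x, y in zip(a, b))
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    norm = math.sqrt(dot(b1, b1))
    m1 = cross(n1, [x / norm for x in b1])
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

def omega_label(omega):
    """Classify an omega angle using the ranges on this slide."""
    if -20.0 <= omega <= 20.0:
        return "cis"
    if -180.0 <= omega <= -160.0 or 160.0 <= omega <= 180.0:
        return "trans"
    return "distorted"  # outside both windows (hypothetical label)
```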
Importance of isomerization of the prolyl peptide bond
Peptidyl-prolyl cis/trans isomerization has considerable biological significance.
Peptidyl-prolyl cis/trans isomerization is frequently found to be a rate-limiting step in protein folding.
Prolyl residues play an important role in the final structure of proteins.
They can act as a potential regulatory switch involved in cellular functions.
Datasets (1)
Protein Data Bank (PDB) database: protein structure information; 3D coordinates of atoms based on X-ray crystallography or NMR spectroscopy experimental data.
Every omega (ω) angle of a prolyl peptide bond was calculated.
In total, 667,230 proline residue omega angles were calculated.
To reduce the redundancy of the dataset and use more reliable structural information, proteins were removed based on a resolution cutoff (≤ 3.0 Å), an R-factor cutoff (≤ 0.3), and sequence identity (< 30%) using the PISCES web server (60,814 PDB chains → 4,006 PDB chains).
Finally, a total of 3,268 instances (1,571 cis cases and 1,697 trans cases) were used for constructing and testing the classifiers.
Datasets (2)
9 different window sizes (different local sequences around the proline residue). For example, window size 5: i-2, i-1, i, i+1, i+2 (5 amino acids in total, where the i-th residue is the proline).
Instances were generated both with and without secondary structure (ss) information in their attributes.
In total, 9 (window sizes) × 2 (with/without ss information) = 18 datasets were generated.
a. Window size 15, no ss information (amino acid sequence attributes, then the class label):
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, t
b. Window size 15, with ss information (amino acid sequence attributes, then secondary structure attributes, then the class label):
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, E, E, C, C, C, C, C, C, E, E, E, E, E, C, C, t
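A windowed instance like the examples above could be built as follows. This is a sketch, not the project's actual preprocessing code; the function name and the 'X' padding character for positions that fall off the end of the chain are assumptions. The class label (t/c) would be appended separately.

```python
def make_instance(seq, ss, pos, window, use_ss):
    """Build one attribute vector for the proline at index `pos`.

    seq: amino acid sequence (one-letter codes)
    ss:  per-residue secondary structure string (e.g. H/E/C), or None
    pos: index of the proline residue (seq[pos] == 'P')
    window: odd window size; the window spans pos-k .. pos+k, k = window // 2
    Positions outside the chain are padded with 'X' (an assumption).
    """
    k = window // 2
    idx = range(pos - k, pos + k + 1)
    attrs = [seq[i] if 0 <= i < len(seq) else "X" for i in idx]
    if use_ss:
        attrs += [ss[i] if 0 <= i < len(ss) else "X" for i in idx]
    return attrs
```

With seq = "KAIISENPCIKHYHI" (proline at index 7) and window 15, this reproduces the sequence attributes of example a; passing the matching ss string "EECCCCCCEEEEECC" with use_ss=True reproduces example b.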
Naïve Bayes Classifier (1)
Using Bayes' theorem, the probability of a hypothesis h given a set of data D can be calculated from its prior probability, the probability of observing the data given the hypothesis, and the probability of the data: P(h|D) = P(D|h)P(h) / P(D).
Naïve Bayes Classifier (2)
Given an instance X with attribute values a1, ..., an, the Bayesian approach classifies X by assigning it the most probable hypothesis.
Assumption: given the class, the attributes are independent of each other, so the most probable class is v = argmax_v P(v) ∏i P(ai | v).
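The classification rule above can be sketched as a small categorical Naïve Bayes with Laplace smoothing. This is a from-scratch illustration of the technique, not the WEKA implementation the project actually used:

```python
import math
from collections import Counter

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_n = Counter(y)
        # log prior P(v) for each class
        self.prior = {c: math.log(self.class_n[c] / len(y)) for c in self.classes}
        n_attrs = len(X[0])
        # per-class, per-attribute value counts for P(a_i | v)
        self.counts = {c: [Counter() for _ in range(n_attrs)] for c in self.classes}
        self.values = [set() for _ in range(n_attrs)]
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[yi][j][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, x):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = self.prior[c]
            for j, v in enumerate(x):
                num = self.counts[c][j][v] + 1               # Laplace smoothing
                den = self.class_n[c] + len(self.values[j])
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow for long attribute vectors (e.g. large window sizes).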
Support Vector Machine (SVM) (1)
Finds a linear boundary that separates the training data.
Uses a non-linear kernel function to implicitly map the non-separable original n-dimensional pattern space into a higher-dimensional feature space in which the patterns become separable.
A polynomial function was used as the kernel function.
[Figure: non-separable o/x patterns in the original space are mapped to a separable arrangement in the feature space]
Support Vector Machine (SVM) (2)
[Figure: a maximal-margin hyperplane with its support vectors highlighted in the 2-dimensional feature space (x1, x2)]
Overfitting to the training dataset is mitigated by selecting, from among all separating hyperplanes, the hyperplane that maximizes the margin of separation between the two classes.
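The kernel trick described above can be checked numerically: for 2-D inputs, the degree-2 polynomial kernel (x·z + 1)² equals an ordinary dot product in an explicit 6-dimensional feature space, which is why the SVM never has to construct that space. A small sketch (the degree-2 map is chosen for illustration; the project used a third-degree kernel):

```python
import math

def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree

def phi2(x):
    """Explicit degree-2 feature map for 2-D input whose inner product
    reproduces (x . z + 1)^2:
    (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                                  # kernel value
explicit = sum(a * b for a, b in zip(phi2(x), phi2(z)))       # dot product in feature space
```

The two values agree, so the classifier can work entirely with kernel evaluations in the original space while behaving as if it had mapped every point into the higher-dimensional feature space.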
Performance evaluation
Accuracy = (TP + TN) / N
Correlation Coefficient = (TP × TN − FP × FN) / √((TP + FN)(TN + FP)(TN + FN)(TP + FP))
- TP (true positives) = the number of proline residues predicted to be in the trans configuration that actually are trans.
- TN (true negatives) = the number of proline residues predicted to be in the cis configuration that actually are cis.
- FP (false positives) = the number of proline residues predicted to be trans that actually are cis.
- FN (false negatives) = the number of proline residues predicted to be cis that actually are trans.
- N = TP + TN + FP + FN
5-fold cross-validation using the datasets in the WEKA package
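The two evaluation measures follow directly from the four counts (a small sketch; WEKA reports these automatically):

```python
import math

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / N."""
    return (tp + tn) / (tp + tn + fp + fn)

def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient, as defined on this slide."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fn) * (tn + fp) * (tn + fn) * (tp + fp))
    return num / den if den else 0.0  # convention: CC = 0 when a marginal is empty
```

Unlike accuracy, the correlation coefficient stays informative on unbalanced data: a classifier that always predicts the majority class scores high accuracy but a CC of 0.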
Result – Naïve Bayes Classifier (1)
no ss information ss information
window size Accuracy CC Accuracy CC
3 0.596 0.190 0.628 0.276
5 0.604 0.204 0.635 0.296
7 0.599 0.195 0.637 0.302
9 0.606 0.210 0.629 0.282
11 0.602 0.201 0.625 0.270
13 0.598 0.194 0.618 0.254
15 0.606 0.209 0.614 0.245
17 0.602 0.202 0.610 0.235
19 0.601 0.199 0.609 0.229
21 0.602 0.201 0.610 0.230
Result – Naïve Bayes Classifier (2)
[Charts: accuracy and CC vs. window size (3-21) for the Naïve Bayes classifier; left panel: no ss information, right panel: ss information incorporated]
Result – SVM (1)
no ss information ss information
window size Accuracy CC Accuracy CC
3 0.5826 0.1625 0.63 0.2594
5 0.575 0.1499 0.6233 0.246
7 0.5817 0.1613 0.653 0.305
9 0.5921 0.1815 0.6515 0.3021
11 0.6034 0.2036 0.66 0.3192
13 0.6047 0.2059 0.6631 0.3253
15 0.593 0.1821 0.6662 0.3312
17 0.6001 0.1966 0.6704 0.3399
19 0.6028 0.2027 0.6649 0.3289
21 0.6037 0.2053 0.6634 0.3258
Polynomial kernel – third degree
Result – SVM (2)
[Charts: accuracy and CC vs. window size (3-21) for the SVM with a third-degree polynomial kernel; left panel: no ss information, right panel: ss information incorporated]
Result – SVM (3)
[Chart: accuracy and CC vs. polynomial degree (1-9)]
Window size 17 and ss information incorporated
Discussion (1)
Unbalanced data
[Pie chart: trans 95%, cis 5%]
Discussion (2)
Third class: both cis and trans (native proline isomerization)
[Figures: ligand: phosphopeptide; ligand: SH3 domain of ITK]
Summary
Using sequence information alone, performance is not as good as when secondary structure information is included.
With secondary structure information, the best Naïve Bayes classifier achieved an accuracy of 64% with a CC of 0.302, a specificity for trans of 0.483, and a sensitivity for trans of 0.590.
With secondary structure information, the SVM achieved an accuracy of 67% with a CC of 0.340, a specificity for trans of 0.683, and a sensitivity for trans of 0.657.
The Naïve Bayes classifier did not need a long window size to perform well (~7), and larger windows did not improve its performance. In contrast, the SVM needed a window size of at least 7 to perform well.
For the SVM, a third-degree polynomial kernel gave the best performance.
References
1. Andreotti, A. H. 2003. Native state proline isomerization: an intrinsic molecular switch. Biochemistry 42:9515-24.
2. Baldi, P., S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412-24.
3. Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. 2000. The Protein Data Bank. Nucleic Acids Res 28:235-42.
4. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-637.
5. Mitchell, T. 1997. Machine Learning. McGRAW-HILL.
6. Pahlke, D., C. Freund, D. Leitner, and D. Labudde. 2005. Statistically significant dependence of the Xaa-Pro peptide bond conformation on secondary structure and amino acid sequence. BMC Struct Biol 5:8.
7. Vapnik, V. 1998. Statistical learning theory. Springer-Verlag., New York.
8. Wang, G., and R. L. Dunbrack, Jr. 2003. PISCES: a protein sequence culling server. Bioinformatics 19:1589-91.
9. Witten, I. H., and E. Frank. 2000. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann.