Transcript of Project Presentation
CS573X Machine Learning Project, 5/4/05
Prediction of peptidyl prolyl residues in cis/trans configuration
using machine learning algorithms
Jae-Hyung Lee, Genetics, Development and Cell Biology
Bioinformatics and Computational Biology Program
What is a protein?
Proteins are polypeptide chains
Torsional angles
Although the peptide bond is planar and fixed, rotation can and does occur about the two single bonds on either side of the α carbon: Φ (phi), the bond between N and Cα, and Ψ (psi), the bond between Cα and C.
Two different peptide configurations
cis: ω = -20° ~ 20°
trans: ω = -180° ~ -160° or ω = 160° ~ 180°
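For reference, the ω torsion angle of a peptide bond can be computed from the coordinates of four backbone atoms (Cα(i-1), C(i-1), N(i), Cα(i)) and then binned into the cis/trans ranges above. A minimal sketch (the function names and the "distorted" fallback label are illustrative, not from the project):

```python
import math

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four 3-D points.

    For the omega angle of a peptide bond the points would be
    CA(i-1), C(i-1), N(i), CA(i)."""
    def sub(a, b): return [a[k] - b[k] for k in range(3)]
    def cross(a, b):
        return [a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0]]
    def dot(a, b): return sum(x*y for x, y in zip(a, b))
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    norm = math.sqrt(dot(b1, b1))
    m1 = cross(n1, [x / norm for x in b1])
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

def omega_label(omega):
    """Classify an omega angle using the ranges on this slide."""
    if -20.0 <= omega <= 20.0:
        return "cis"
    if -180.0 <= omega <= -160.0 or 160.0 <= omega <= 180.0:
        return "trans"
    return "distorted"  # outside both windows (hypothetical label)
```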
Importance of isomerization of the prolyl peptide bond
Peptidyl-prolyl cis/trans isomerization has considerable biological significance.
Peptidyl-prolyl cis/trans isomerization is frequently found to be a rate-limiting step in protein folding.
Prolyl residues play an important role in the final structure of proteins.
They can act as a potential regulatory switch involved in cellular functions.
Datasets (1)
Protein Data Bank (PDB) database: protein structure information; 3D coordinates of atoms based on X-ray crystallography or NMR spectroscopy experimental data.
Every omega (ω) angle of a prolyl peptide bond was calculated.
In total, 667,230 proline residue omega angles were calculated.
To reduce the redundancy of the dataset and use more reliable structural information, proteins were removed based on a resolution cutoff (≤ 3.0 Å), an R-factor cutoff (≤ 0.3), and sequence identity (< 30%) using the PISCES web server (60,814 PDB chains → 4,006 PDB chains).
Finally, a total of 3,268 instances (1,571 cis cases and 1,697 trans cases) were used for constructing and testing the classifiers.
Datasets (2)
9 different window sizes (different local sequences around the proline residue). For example, window size 5: i-2, i-1, i, i+1, i+2 (5 amino acids in total, where the i-th residue is the proline).
Instances were generated both with and without secondary structure (ss) information in their attributes.
In total, 9 (window sizes) × 2 (with/without ss information) = 18 datasets were generated.
a. Window size 15, no ss information (amino acid sequence attributes, then the class label):
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, t
b. Window size 15, with ss information (amino acid sequence attributes, then secondary structure attributes, then the class label):
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, E, E, C, C, C, C, C, C, E, E, E, E, E, C, C, t
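A windowed instance like the examples above could be built as follows. This is a sketch, not the project's actual preprocessing code; the function name and the 'X' padding character for positions that fall off the end of the chain are assumptions. The class label (t/c) would be appended separately.

```python
def make_instance(seq, ss, pos, window, use_ss):
    """Build one attribute vector for the proline at index `pos`.

    seq: amino acid sequence (one-letter codes)
    ss:  per-residue secondary structure string (e.g. H/E/C), or None
    pos: index of the proline residue (seq[pos] == 'P')
    window: odd window size; the window spans pos-k .. pos+k, k = window // 2
    Positions outside the chain are padded with 'X' (an assumption).
    """
    k = window // 2
    idx = range(pos - k, pos + k + 1)
    attrs = [seq[i] if 0 <= i < len(seq) else "X" for i in idx]
    if use_ss:
        attrs += [ss[i] if 0 <= i < len(ss) else "X" for i in idx]
    return attrs
```

With seq = "KAIISENPCIKHYHI" (proline at index 7) and window 15, this reproduces the sequence attributes of example a; passing the matching ss string "EECCCCCCEEEEECC" with use_ss=True reproduces example b.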
Naïve Bayes Classifier (1)
Using Bayes' theorem, the probability of a hypothesis h given a set of data D can be calculated from its prior probability, the probability of observing the data given the hypothesis, and the probability of the data: P(h|D) = P(D|h)P(h) / P(D).
Naïve Bayes Classifier (2)
Given an instance X with attribute values a1, ..., an, the Bayesian approach classifies X by assigning it the most probable hypothesis.
Assumption: given the class, the attributes are independent of each other, so the most probable class is v = argmax_v P(v) ∏i P(ai | v).
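The classification rule above can be sketched as a small categorical Naïve Bayes with Laplace smoothing. This is a from-scratch illustration of the technique, not the WEKA implementation the project actually used:

```python
import math
from collections import Counter

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_n = Counter(y)
        # log prior P(v) for each class
        self.prior = {c: math.log(self.class_n[c] / len(y)) for c in self.classes}
        n_attrs = len(X[0])
        # per-class, per-attribute value counts for P(a_i | v)
        self.counts = {c: [Counter() for _ in range(n_attrs)] for c in self.classes}
        self.values = [set() for _ in range(n_attrs)]
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[yi][j][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, x):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = self.prior[c]
            for j, v in enumerate(x):
                num = self.counts[c][j][v] + 1               # Laplace smoothing
                den = self.class_n[c] + len(self.values[j])
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow for long attribute vectors (e.g. large window sizes).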
Support Vector Machine (SVM) (1)
Finds a linear boundary that separates the training data.
Uses a non-linear kernel function to implicitly map the non-separable original n-dimensional pattern space into a higher-dimensional feature space in which the patterns become separable.
A polynomial function was used as the kernel function.
[Figure: non-separable o/x patterns in the original space are mapped to a separable arrangement in the feature space]
Support Vector Machine (SVM) (2)
[Figure: a maximal-margin hyperplane with its support vectors highlighted in the 2-dimensional feature space (x1, x2)]
Overfitting to the training dataset is mitigated by selecting, from among all separating hyperplanes, the hyperplane that maximizes the margin of separation between the two classes.
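The kernel trick described above can be checked numerically: for 2-D inputs, the degree-2 polynomial kernel (x·z + 1)² equals an ordinary dot product in an explicit 6-dimensional feature space, which is why the SVM never has to construct that space. A small sketch (the degree-2 map is chosen for illustration; the project used a third-degree kernel):

```python
import math

def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree

def phi2(x):
    """Explicit degree-2 feature map for 2-D input whose inner product
    reproduces (x . z + 1)^2:
    (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                                  # kernel value
explicit = sum(a * b for a, b in zip(phi2(x), phi2(z)))       # dot product in feature space
```

The two values agree, so the classifier can work entirely with kernel evaluations in the original space while behaving as if it had mapped every point into the higher-dimensional feature space.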
Performance evaluation
Accuracy = (TP + TN) / N
Correlation Coefficient = (TP × TN − FP × FN) / √((TP + FN)(TN + FP)(TN + FN)(TP + FP))
- TP (true positives) = the number of proline residues predicted to be in the trans configuration that actually are trans.
- TN (true negatives) = the number of proline residues predicted to be in the cis configuration that actually are cis.
- FP (false positives) = the number of proline residues predicted to be trans that actually are cis.
- FN (false negatives) = the number of proline residues predicted to be cis that actually are trans.
- N = TP + TN + FP + FN
5-fold cross-validation using the datasets in the WEKA package
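The two evaluation measures follow directly from the four counts (a small sketch; WEKA reports these automatically):

```python
import math

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / N."""
    return (tp + tn) / (tp + tn + fp + fn)

def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient, as defined on this slide."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fn) * (tn + fp) * (tn + fn) * (tp + fp))
    return num / den if den else 0.0  # convention: CC = 0 when a marginal is empty
```

Unlike accuracy, the correlation coefficient stays informative on unbalanced data: a classifier that always predicts the majority class scores high accuracy but a CC of 0.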
Result – Naïve Bayes Classifier (1)
no ss information ss information
window size Accuracy CC Accuracy CC
3 0.596 0.190 0.628 0.276
5 0.604 0.204 0.635 0.296
7 0.599 0.195 0.637 0.302
9 0.606 0.210 0.629 0.282
11 0.602 0.201 0.625 0.270
13 0.598 0.194 0.618 0.254
15 0.606 0.209 0.614 0.245
17 0.602 0.202 0.610 0.235
19 0.601 0.199 0.609 0.229
21 0.602 0.201 0.610 0.230
Result – Naïve Bayes Classifier (2)
[Charts: accuracy and CC vs. window size (3-21) for the Naïve Bayes classifier; left panel: no ss information, right panel: ss information incorporated]
Result – SVM (1)
no ss information ss information
window size Accuracy CC Accuracy CC
3 0.5826 0.1625 0.63 0.2594
5 0.575 0.1499 0.6233 0.246
7 0.5817 0.1613 0.653 0.305
9 0.5921 0.1815 0.6515 0.3021
11 0.6034 0.2036 0.66 0.3192
13 0.6047 0.2059 0.6631 0.3253
15 0.593 0.1821 0.6662 0.3312
17 0.6001 0.1966 0.6704 0.3399
19 0.6028 0.2027 0.6649 0.3289
21 0.6037 0.2053 0.6634 0.3258
Polynomial kernel – third degree
Result – SVM (2)
[Charts: accuracy and CC vs. window size (3-21) for the SVM with a third-degree polynomial kernel; left panel: no ss information, right panel: ss information incorporated]
Result – SVM (3)
[Chart: accuracy and CC vs. polynomial degree (1-9)]
Window size 17 and ss information incorporated
Discussion (1)
Unbalanced data
[Pie chart: trans 95%, cis 5%]
Discussion (2)
Third class: both cis and trans (native proline isomerization)
[Figures: ligand: phosphopeptide; ligand: SH3 domain of ITK]
Summary
Using sequence information alone, performance is not as good as when secondary structure information is included.
With secondary structure information, the best Naïve Bayes classifier achieved an accuracy of 64% with a CC of 0.302, a specificity for trans of 0.483, and a sensitivity for trans of 0.590.
With secondary structure information, the SVM achieved an accuracy of 67% with a CC of 0.340, a specificity for trans of 0.683, and a sensitivity for trans of 0.657.
The Naïve Bayes classifier did not need a long window size to perform well (~7), and larger windows did not improve its performance. In contrast, the SVM needed a window size of at least 7 to perform well.
For the SVM, a third-degree polynomial kernel gave the best performance.
References
1. Andreotti, A. H. 2003. Native state proline isomerization: an intrinsic molecular switch. Biochemistry 42:9515-24.
2. Baldi, P., S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412-24.
3. Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. 2000. The Protein Data Bank. Nucleic Acids Res 28:235-42.
4. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-637.
5. Mitchell, T. 1997. Machine Learning. McGRAW-HILL.
6. Pahlke, D., C. Freund, D. Leitner, and D. Labudde. 2005. Statistically significant dependence of the Xaa-Pro peptide bond conformation on secondary structure and amino acid sequence. BMC Struct Biol 5:8.
7. Vapnik, V. 1998. Statistical learning theory. Springer-Verlag., New York.
8. Wang, G., and R. L. Dunbrack, Jr. 2003. PISCES: a protein sequence culling server. Bioinformatics 19:1589-91.
9. Witten, I. H., and E. Frank. 2000. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann.