Protein Classification Using Averaged Perceptron SVM
Eugene Ie
CS6772 Project Presentation, 12/03/2003
Protein Sequence Classification
• Protein = string over an alphabet $\Sigma$, $|\Sigma|$ = 20 amino acids
• Easy to sequence proteins, difficult to obtain structure
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Sequence → 3D Structure
Function: Oxygen transport
Class: Globin family, Globin-like superfamily
?
Sequence Alignment vs. Classification
• Sequence similarity through alignment
SGFIEEDELKLFL
SGFIEEEELKFVL
close homology
distant homology
• Sequence classification for remote homology (Classifier)
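To make "sequence similarity through alignment" concrete, here is a minimal sketch that scores two pre-aligned fragments by the fraction of identical positions. It assumes the example on the slide is two 13-residue fragments (SGFIEEDELKLFL and SGFIEEEELKFVL); the function name `percent_identity` is hypothetical and not part of the presentation.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return matches / len(a)


# Assumption: the slide shows these two fragments one above the other.
s1 = "SGFIEEDELKLFL"
s2 = "SGFIEEEELKFVL"
identity = percent_identity(s1, s2)  # 10 of the 13 positions agree
```

High identity of this kind signals close homology; remote homologs score low on such measures, which is why a trained classifier is used instead.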
Structural Hierarchy of Proteins
• Remote homologs:
– Structure and function conserved
– Sequence similarity: low
SCOP hierarchy: Fold → Superfamily → Family
[Figure labels: Positive Training Set, Positive Test Set, Negative Training Set, Negative Test Set]
Remote Homology Detection
• Discriminative supervised learning approach to protein classification
Approach: Support Vector Machines with String Kernels
C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.
C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.
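As an illustration of a string kernel, the following is a naive sketch of the (k, m)-mismatch kernel: each k-mer in a sequence contributes to every k-mer within m substitutions of it, and the kernel is the inner product of the resulting count vectors. This is a direct, unoptimized version using a hash table (`Counter`), not the trie-based or fast algorithms of the cited papers; the function names and the default parameters k = 3, m = 1 are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """All k-mers that differ from `kmer` in at most m positions."""
    neighbors = {kmer}
    for n_sub in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(alphabet, repeat=n_sub):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                neighbors.add("".join(candidate))
    return neighbors


def mismatch_features(seq, k, m):
    """Sparse feature vector: counts indexed by k-mer."""
    features = Counter()
    for start in range(len(seq) - k + 1):
        for neighbor in mismatch_neighborhood(seq[start:start + k], m):
            features[neighbor] += 1
    return features


def mismatch_kernel(x, y, k=3, m=1):
    """Inner product of the two mismatch feature vectors."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller table
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

With m = 0 this reduces to the plain k-spectrum kernel, which only counts exact shared k-mers.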
QP SVM Training
Sequence Training Data (Total: n sequences + n labels)
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
→ Kernel matrix $[K(x_i, x_j)]_{i,j=1}^{n}$: $O(n^2)$ evaluations of $K(x, y)$
→ QP Solver (slow)
→ Learned Weights and Bias (from KKT):
$w = \sum_i \alpha_i y_i \Phi(x_i)$
$b = \mathrm{avg}(b')$
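The bias step can be sketched as follows, assuming the usual soft-margin reading of the KKT conditions: every support vector with $0 < \alpha_i < C$ lies exactly on the margin, so it yields a candidate $b' = y_i - \sum_j \alpha_j y_j K(x_j, x_i)$, and the candidates are averaged. The function names, the box constraint `C`, and the tolerance `tol` are assumptions for illustration, not details taken from the slides.

```python
def decision_values_without_bias(alpha, y, K):
    """sum_j alpha_j * y_j * K[j][i] for every training point i."""
    n = len(y)
    return [sum(alpha[j] * y[j] * K[j][i] for j in range(n)) for i in range(n)]


def bias_from_kkt(alpha, y, K, C, tol=1e-8):
    """Average the per-support-vector bias candidates b'."""
    f0 = decision_values_without_bias(alpha, y, K)
    candidates = [
        y[i] - f0[i]
        for i in range(len(y))
        if tol < alpha[i] < C - tol  # margin support vectors only
    ]
    if not candidates:
        raise ValueError("no margin support vectors to estimate the bias from")
    return sum(candidates) / len(candidates)


# Toy check: points x = +1 and x = -1 with a linear kernel give b = 0.
b = bias_from_kkt([0.5, 0.5], [1, -1], [[1, -1], [-1, 1]], C=10.0)
```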
Averaged Perceptron SVM Training
Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.
Training Algorithm:
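The algorithm itself did not survive in this transcript. The following is a minimal sketch of voted-perceptron training in kernel form, following the cited Freund and Schapire paper rather than the presenter's actual code. It stores only the mistake indices and, for each intermediate prediction vector, the number of examples it survived. The names `train_voted_perceptron`, `mistakes` and `counts`, and the choice to treat a zero score as a mistake, are assumptions.

```python
def train_voted_perceptron(X, y, kernel, epochs=1):
    """Return (mistakes, counts) for a kernel voted perceptron.

    mistakes: indices m_1..m_k of the examples that triggered an update.
    counts:   counts[j] = number of examples classified correctly by the
              prediction vector built from the first j mistakes.
    """
    mistakes = []
    counts = [0]  # counts[0] belongs to the initial zero vector
    for _ in range(epochs):
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            # <v, x_i> expressed through kernel products with past mistakes
            score = sum(y[m] * kernel(X[m], x_i) for m in mistakes)
            if y_i * score > 0:
                counts[-1] += 1       # current vector survives this example
            else:
                mistakes.append(i)    # update: v <- v + y_i * x_i
                counts.append(1)
    return mistakes, counts


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


X_toy = [(1.0,), (2.0,), (-1.0,), (-2.0,)]
y_toy = [1, 1, -1, -1]
toy_mistakes, toy_counts = train_voted_perceptron(X_toy, y_toy, dot)
```

Each example costs at most k kernel evaluations, where k is the number of mistakes so far, so one pass over n examples needs O(nk) evaluations instead of the full n-by-n kernel matrix.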
Averaged Perceptron SVM Training
Sequence Training Data (Total: n sequences + n labels)
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
→ Run Perceptron Algorithm, Iterate Epochs: $O(nk)$ evaluations of $K(x, y)$
→ Final Weight Vector, Voting Weights: $(w_i)_{i=0}^{s}$, $(v_i)_{i=1}^{k}$
s = no. of dimensions in feature space
k = no. of mistakes made during perceptron run
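Generalized Bound for k
For any $u \in \mathbb{R}^s$ s.t. $\|u\| = 1$, and any $\gamma > 0$:
$R = \max_i \|x_i\|$
$d_i = \max\{0, \gamma - y_i \langle u, x_i \rangle\}$
$D = \sqrt{\sum_{i=1}^{n} d_i^2}$
$k = O\left(\left(\frac{R + D}{\gamma}\right)^2\right)$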
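For intuition, the quantity inside the bound can be evaluated numerically on a small data set with explicit feature vectors. The sketch below computes $((R + D)/\gamma)^2$ as in the mistake bound of the cited Freund and Schapire paper; the function name `mistake_bound` and the toy data are assumptions, and with string kernels the feature vectors would normally stay implicit.

```python
import math


def mistake_bound(X, y, u, gamma):
    """Evaluate ((R + D) / gamma) ** 2 for a reference vector u and margin gamma."""
    norm_u = math.sqrt(sum(c * c for c in u))
    u = [c / norm_u for c in u]  # enforce ||u|| = 1
    R = max(math.sqrt(sum(c * c for c in x)) for x in X)
    # d_i measures how far example i falls short of margin gamma under u
    d = [
        max(0.0, gamma - y_i * sum(a * b for a, b in zip(u, x)))
        for x, y_i in zip(X, y)
    ]
    D = math.sqrt(sum(d_i * d_i for d_i in d))
    return ((R + D) / gamma) ** 2


X_toy = [(1.0,), (2.0,), (-1.0,), (-2.0,)]
y_toy = [1, 1, -1, -1]
bound = mistake_bound(X_toy, y_toy, u=(1.0,), gamma=1.0)  # R = 2, D = 0
```

When the data are separable with margin $\gamma$ under $u$, every $d_i$ is zero and the expression reduces to the classical $(R/\gamma)^2$ perceptron bound.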
SCOP experiments show:
For average n ~ 1000
Average k ~ 50-60
Averaged Perceptron SVM Classification
Testing Algorithm:
Note: Only k kernel products with unknown sequence x need to be computed.
Recurrence relation:
$\langle v_{i+1}, x \rangle = \langle v_i, x \rangle + y_{m_i} \langle x_{m_i}, x \rangle$, $m_i \in M$
M is the set of "mistake indices"
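The recurrence can be turned into a prediction routine that walks through the mistake indices once, updating the running inner product and accumulating a weighted sum. This is a sketch under assumptions: the survival counts `counts` used as averaging weights follow Freund and Schapire's averaged variant, indexing starts from a zero initial vector, and the names `predict_averaged` and `dot` are hypothetical.

```python
def predict_averaged(x, X, y, mistakes, counts, kernel):
    """Sign of sum_j counts[j] * <v_j, x>, using one kernel product per mistake."""
    score_v = 0.0                    # <v_0, x> for the initial zero vector
    total = counts[0] * score_v
    for j, m in enumerate(mistakes, start=1):
        score_v += y[m] * kernel(X[m], x)   # the recurrence relation
        total += counts[j] * score_v
    return 1 if total > 0 else -1


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Toy model: a single mistake on example 0, whose vector survived 4 examples.
X_toy = [(1.0,), (2.0,), (-1.0,), (-2.0,)]
y_toy = [1, 1, -1, -1]
label_pos = predict_averaged((3.0,), X_toy, y_toy, [0], [0, 4], dot)
label_neg = predict_averaged((-0.5,), X_toy, y_toy, [0], [0, 4], dot)
```

Because every intermediate score reuses the previous one, the cost per test sequence is k kernel products regardless of how many training examples were seen.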
Implementation Details
• Built on top of protclass (Protein Classification) platform
– Java Platform
• Classification Task
– Hash table scan instead of Mismatch Trie
– Generate mismatch mappings once using shifts
– Dynamic kernel matrix storage
• Still needs debugging
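One plausible reading of "dynamic kernel matrix storage" is that kernel entries are computed only when the perceptron asks for them and are then cached, instead of filling the whole n-by-n matrix up front. The sketch below illustrates that idea only; it is written in Python for brevity rather than in the Java of the actual platform, and the class `LazyKernelMatrix` and its methods are hypothetical, not part of protclass.

```python
class LazyKernelMatrix:
    """Compute K(i, j) on first request and cache it (symmetric kernel assumed)."""

    def __init__(self, sequences, kernel):
        self.sequences = sequences
        self.kernel = kernel
        self._cache = {}

    def __call__(self, i, j):
        key = (i, j) if i <= j else (j, i)  # store each symmetric pair once
        if key not in self._cache:
            self._cache[key] = self.kernel(self.sequences[i], self.sequences[j])
        return self._cache[key]

    def stored_entries(self):
        return len(self._cache)


def shared_letters(a, b):
    """Stand-in kernel for the demo: number of distinct letters in common."""
    return len(set(a) & set(b))


K = LazyKernelMatrix(["AC", "CD", "EF"], shared_letters)
value = K(0, 1)       # computed and cached
same_value = K(1, 0)  # served from the cache
```

Since training touches only pairs that involve a mistake index, such a cache holds on the order of nk entries rather than n squared.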
Speed/Space Performance
• ~80% reduction in space requirement
• ~50% reduction in training time
• ~50% reduction in testing time
• Mainly from simple online algorithm