Protein Classification Using Averaged Perceptron SVM Eugene Ie CS6772 Project Presentation 12/03/2003

Page 1

Protein Classification Using Averaged Perceptron SVM

Eugene Ie

CS6772 Project Presentation

12/03/2003

Page 2

Protein Sequence Classification

• Protein = sequence over an alphabet Σ, |Σ| = 20 amino acids

• Easy to sequence proteins, difficult to obtain structure

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Sequence

3D Structure

Function: Oxygen transport

Class: Globin family, Globin-like superfamily

?

Page 3

Sequence Alignment vs. Classification

• Sequence similarity through alignment

SGFIEEDELKLFL

SGFIEEEELKFVL

close homology

distant homology

• Sequence classification for remote homology

Classifier

Page 4

Structural Hierarchy of Proteins

• Remote homologs:

– Structure and function conserved

– Sequence similarity: low

SCOP

Fold

Superfamily

Family

Positive Training Set

Positive Test Set

Negative Training Set

Negative Test Set

Page 5

Remote Homology Detection

• Discriminative supervised learning approach to protein classification

Approach: Support Vector Machines with String Kernels

C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.

C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.

Page 6

QP SVM Training

Sequence Training Data

>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Total: n sequences + n labels

Kernel matrix $[K(x_i, x_j)]_{i,j=1}^{n}$: $O(n^2)$ evaluations of $K(x, y)$

QP Solver (slow)

Learned Weights and Bias

$w = \sum_i \alpha_i y_i x_i$

$b = \mathrm{avg}(b')$

From KKT
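A minimal sketch of how the bias could be recovered from a dual solution, assuming $b'$ denotes the per-support-vector value $y_i - \sum_j \alpha_j y_j K(x_j, x_i)$ that the KKT conditions give for points on the margin. The function name `bias_from_kkt`, the tolerance `tol`, and the optional box constraint `C` are illustrative assumptions, not part of the original slides.

```python
from typing import Callable, Optional, Sequence


def bias_from_kkt(
    alphas: Sequence[float],
    xs: Sequence,
    ys: Sequence[int],
    kernel: Callable,
    C: Optional[float] = None,
    tol: float = 1e-8,
) -> float:
    """Average the per-support-vector bias estimates b' (hypothetical helper)."""
    candidates = []
    for i, a_i in enumerate(alphas):
        # KKT: points with 0 < alpha_i (< C) lie exactly on the margin,
        # so y_i * (f(x_i) + b) = 1, i.e. b = y_i - f(x_i) for labels in {-1, +1}.
        on_margin = a_i > tol and (C is None or a_i < C - tol)
        if not on_margin:
            continue
        f_i = sum(
            a_j * y_j * kernel(x_j, xs[i])
            for a_j, y_j, x_j in zip(alphas, ys, xs)
        )
        candidates.append(ys[i] - f_i)
    if not candidates:
        raise ValueError("no margin support vectors found")
    return sum(candidates) / len(candidates)
```

Averaging over all margin support vectors, rather than using a single one, is a common way to reduce numerical noise from the QP solver.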

Page 7

Averaged Perceptron SVM Training

Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.

Training Algorithm:
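A minimal sketch of a kernelized voted-perceptron training loop in the style of the Freund and Schapire paper cited above, not necessarily the exact listing shown on the slide. It assumes labels in {-1, +1} and treats a score of 0 as a prediction of -1; the names `train_voted_perceptron`, `mistakes`, and `counts` are illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def train_voted_perceptron(
    xs: Sequence,
    ys: Sequence[int],
    kernel: Callable,
    epochs: int = 1,
) -> Tuple[List[int], List[int]]:
    """Return (mistakes, counts).

    mistakes[j] is the training index that produced prediction vector v_{j+1};
    counts[j] is the number of examples that v_j classified correctly
    before the next mistake (its voting weight). v_0 is the zero vector.
    """
    mistakes: List[int] = []
    counts: List[int] = [0]
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            # Current prediction vector, expressed only through kernel values.
            score = sum(ys[m] * kernel(xs[m], x) for m in mistakes)
            prediction = 1 if score > 0 else -1
            if prediction == y:
                counts[-1] += 1      # current vector survives one more example
            else:
                mistakes.append(i)   # v_{k+1} = v_k + y_i * x_i (implicitly)
                counts.append(1)
    return mistakes, counts
```

Each example costs one kernel evaluation per stored mistake, so an epoch needs on the order of n times k kernel evaluations and no full kernel matrix.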

Page 8

Averaged Perceptron SVM Training

Sequence Training Data

>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Total: n sequences + n labels

Run Perceptron Algorithm (Iterate Epochs): $O(nk)$ evaluations of $K(x, y)$

Final Weight Vector, Voting Weights: $(w_i)_{i=0}^{s}$, $(v_i)_{i=1}^{k}$

s = no. of dimensions in feature space

k = no. of mistakes made during perceptron run

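Generalized Bound for k

$k = O\left(\left(\frac{R + D}{\gamma}\right)^2\right)$

where $\|x_i\| \le R$, $D = \sqrt{\sum_{i=1}^{n} d_i^2}$, $d_i = \max\{0, \gamma - y_i \langle u, x_i \rangle\}$,

for any $\gamma > 0$ and $u \in \mathbb{R}^s$ s.t. $\|u\| = 1$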

SCOP experiments show: for average n ~ 1000, average k ~ 50-60
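A minimal sketch that evaluates the quantities in the bound for a given reference direction and margin, assuming explicit feature vectors are available (with a string kernel they usually are not, so this is for intuition only). The function name `mistake_bound` and its arguments are illustrative; `u` is normalized inside the function.

```python
import math
from typing import Sequence


def mistake_bound(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    u: Sequence[float],
    gamma: float,
) -> float:
    """Return ((R + D) / gamma)^2 for unit-normalized u and margin gamma > 0."""
    norm_u = math.sqrt(sum(a * a for a in u))
    u = [a / norm_u for a in u]                      # enforce ||u|| = 1
    # R: radius of the smallest origin-centered ball containing the data.
    R = max(math.sqrt(sum(a * a for a in x)) for x in xs)
    # d_i: how far example i falls short of margin gamma along u.
    d = [
        max(0.0, gamma - y * sum(a * b for a, b in zip(u, x)))
        for x, y in zip(xs, ys)
    ]
    D = math.sqrt(sum(di * di for di in d))
    return ((R + D) / gamma) ** 2
```

When the data are separable by `u` with margin `gamma`, every `d_i` is zero and the expression reduces to the classical $(R/\gamma)^2$ perceptron bound.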

Page 9

Averaged Perceptron SVM Classification

Testing Algorithm:

Note: Only k kernel products with unknown sequence x need to be computed.

Recurrence relation:

$\langle v_{i+1}, x \rangle = \langle v_i, x \rangle + y_{m_i} \langle x_{m_i}, x \rangle$, $m_i \in M$

M is the set of “mistake indices”
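A minimal sketch of classification with the recurrence above, assuming the averaged decision rule (sign of the count-weighted sum of scores) and a kernel in place of the inner product. The names `predict_averaged`, `mistakes`, and `counts` are illustrative; `counts[j]` is taken to be the voting weight of prediction vector $v_j$, with $v_0 = 0$.

```python
from typing import Callable, Sequence


def predict_averaged(
    x,
    xs: Sequence,
    ys: Sequence[int],
    mistakes: Sequence[int],
    counts: Sequence[int],
    kernel: Callable,
) -> int:
    """Classify x using one kernel evaluation per mistake index."""
    score = 0.0                      # <v_0, x> = 0
    total = counts[0] * score
    for j, m in enumerate(mistakes):
        # Recurrence: <v_{j+1}, x> = <v_j, x> + y_m * K(x_m, x)
        score += ys[m] * kernel(xs[m], x)
        total += counts[j + 1] * score
    return 1 if total > 0 else -1
```

Because each step reuses the previous score, the loop makes exactly k kernel evaluations for k stored mistakes, which matches the note on this slide.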

Page 10

Implementation Details

• Built on top of protclass (Protein Classification) platform

– Java Platform

• Classification Task

– Hash table scan instead of Mismatch Trie

– Generate mismatch mappings once using shifts

– Dynamic kernel matrix storage

– Still needs debugging

• Speed/Space Performance

– ~80% reduction in space requirement

– ~50% reduction in training time

– ~50% reduction in testing time

– Mainly from simple online algorithm
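A minimal sketch of a hash-table-based (k, m)-mismatch kernel, to illustrate the "hash table scan" and "generate mismatch mappings once" items above. This is not the protclass implementation: it is written in Python rather than Java, it caches each k-mer's mismatch neighborhood with `lru_cache` instead of using shifts, and all names (`neighborhood`, `feature_map`, `mismatch_kernel`) are illustrative. K-mers containing letters outside the 20 standard amino acids are skipped, which is also an assumption.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


@lru_cache(maxsize=None)
def neighborhood(kmer: str, m: int) -> frozenset:
    """All k-mers within Hamming distance <= m of kmer (computed once per k-mer)."""
    result = {kmer}
    for r in range(1, m + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(AMINO_ACIDS, repeat=r):
                chars = list(kmer)
                for p, c in zip(positions, letters):
                    chars[p] = c
                result.add("".join(chars))
    return frozenset(result)


def feature_map(seq: str, k: int, m: int) -> Counter:
    """Hash table from k-mer to its mismatch-tolerant count in seq."""
    counts: Counter = Counter()
    alphabet = set(AMINO_ACIDS)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            for neighbor in neighborhood(kmer, m):
                counts[neighbor] += 1
    return counts


def mismatch_kernel(a: str, b: str, k: int, m: int) -> int:
    """Inner product of the two mismatch feature maps."""
    fa, fb = feature_map(a, k, m), feature_map(b, k, m)
    if len(fa) > len(fb):
        fa, fb = fb, fa              # scan the smaller table
    return sum(count * fb[kmer] for kmer, count in fa.items())
```

For example, `mismatch_kernel("AAA", "AAC", 3, 1)` counts the 3-mers that lie within one substitution of both inputs.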
