Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology...
![Page 1: Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology Thursday, Nov 16, 2006.](https://reader036.fdocuments.us/reader036/viewer/2022081506/56649d3f5503460f94a18ea1/html5/thumbnails/1.jpg)
Machine Learning for Protein
Classification
Ashutosh Saxena
CS 374 – Algorithms in Biology Thursday, Nov 16, 2006
Papers
Mismatch string kernels for discriminative protein classification. Christina Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston and William Stafford Noble. Bioinformatics, 2004.
Semi-supervised protein classification using cluster kernels. Jason Weston, Christina Leslie, Dengyong Zhou, Andre Elisseeff, William Stafford Noble. Bioinformatics, 2005.
Protein Sequence Classification
Protein represented by sequence of amino acids, encoded by a gene
Easy to sequence proteins, difficult to obtain structure
Classification problem: learn how to classify protein sequence data into families and superfamilies defined by structure/function relationships.
Structural Hierarchy of Proteins
Remote homologs: sequences that belong to the same superfamily but not the same family – remote evolutionary relationship
Structure and function conserved, though sequence similarity can be low
Learning Problem
Use discriminative supervised learning approach to classify protein sequences into structurally-related families, superfamilies
Labeled training data: proteins with known structure, positive (+) if example belongs to a family or superfamily, negative (-) otherwise
Focus on remote homology detection
Protein Ranking
Ranking problem: given query protein sequence, return ranked list of similar proteins from sequence database
Limitations of classification framework:
  Small amount of labeled data (proteins with known structure), huge unlabeled databases
  Missing classes: undiscovered structures
Good local similarity scores, based on heuristic alignment: BLAST, PSI-BLAST
Machine Learning Background
Data -> Features, Labels
Supervised:
  Training: (Features, Labels) => learn weights
  Prediction: (Features) => Labels using weights
Unsupervised: only features, no labels
Labels can be classes (classification), ranks (ranking), or continuous values (regression)
Classification:
  Logistic Regression
  SVM
Support Vector Machines
Training examples mapped to (high-dimensional) feature space
F(x) = (F_1(x), …, F_N(x))
Learn linear classifier in feature space
f(x) = <w, F(x)> + b
w = Σ_i y_i α_i F(x_i)
by solving optimization problem: trade-off between maximizing geometric margin and minimizing margin violations
Large margin gives good generalization performance, even in high dimensions
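The trade-off between margin and margin violations is commonly written as the soft-margin SVM primal problem. The form below is the standard textbook formulation rather than anything shown on the slides; the trade-off constant C and the slack variables ξ_i are symbols introduced here as assumptions.

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\qquad \text{subject to} \qquad
y_i \bigl( \langle w, F(x_i) \rangle + b \bigr) \ge 1 - \xi_i,
\quad \xi_i \ge 0 .
```

Minimizing the norm of w widens the geometric margin (proportional to 1/‖w‖), while each ξ_i > 0 measures how far example i violates the margin; C sets the balance between the two terms.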
Support Vector Machines: Kernel
Kernel trick: To train an SVM, can use kernel rather than explicit feature map
Can define kernels for sequences, graphs, other discrete objects:
{ sequences } -> R^N
For sequences x, y, feature map F, kernel value is inner product in feature space
K(x, y) = < F(x), F(y) >
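Substituting w = Σ_i y_i α_i F(x_i) into f(x) = <w, F(x)> + b gives f(x) = Σ_i α_i y_i K(x_i, x) + b, so only kernel values are needed. A minimal Python sketch that checks the two forms agree; the toy feature map, the training points, and the α values are made up for illustration and are not from the slides.

```python
def feature_map(x):
    """Toy explicit feature map F(x) = (x1, x2, x1*x2); purely illustrative."""
    return [x[0], x[1], x[0] * x[1]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, y):
    """K(x, y) = <F(x), F(y)> for the toy feature map above."""
    return dot(feature_map(x), feature_map(y))


def decision_explicit(x, train, labels, alphas, b):
    """f(x) = <w, F(x)> + b with w = sum_i y_i * alpha_i * F(x_i)."""
    dim = len(feature_map(train[0]))
    w = [0.0] * dim
    for xi, yi, ai in zip(train, labels, alphas):
        fxi = feature_map(xi)
        for j in range(dim):
            w[j] += yi * ai * fxi[j]
    return dot(w, feature_map(x)) + b


def decision_kernel(x, train, labels, alphas, b):
    """Same decision value, computed from kernel evaluations only."""
    return sum(ai * yi * kernel(xi, x)
               for xi, yi, ai in zip(train, labels, alphas)) + b


# Hypothetical training data and dual coefficients (not learned here).
train = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0)]
labels = [1, -1, 1]
alphas = [0.5, 0.25, 0.0]
b = 0.1
query = (2.0, -1.0)

print(decision_explicit(query, train, labels, alphas, b))
print(decision_kernel(query, train, labels, alphas, b))
```

Both calls return the same value up to rounding; the kernel form never builds w, which is what makes very high-dimensional feature spaces practical.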
String Kernels for Biosequences
Biological data needs kernels tailored to it.
Fast string kernels for biological sequence data:
  Biologically-inspired underlying feature map
  Kernels scale linearly with sequence length, O(c_K (|x| + |y|)) to compute
  Protein classification performance competitive with best available methods
  Mismatches for inexact sequence matching (other models later)
Spectrum-based Feature Map
Idea: feature map based on spectrum of a sequence
The k-spectrum of a sequence is the set of all k-length contiguous subsequences that it contains
Feature map is indexed by all possible k-length subsequences ("k-mers") from the alphabet of amino acids
Dimension of feature space = |Σ|^k (|Σ| = 20 for amino acids)
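A minimal Python sketch of the spectrum feature map and the resulting kernel. It assumes the common convention that each coordinate holds the number of occurrences of a k-mer (the slide only describes the spectrum as a set), and it stores the vector sparsely, since at most |x| - k + 1 of the |Σ|^k coordinates are non-zero. The function names are illustrative.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Sparse k-spectrum feature vector: occurrence count of each contiguous k-mer."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the two sparse k-spectrum feature vectors."""
    fx = spectrum_features(x, k)
    fy = spectrum_features(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(spectrum_features("ABAB", 2))
print(spectrum_kernel("ABAB", "BABA", 2))
print(20 ** 3)  # dimension of the full feature space for k = 3 over 20 amino acids
```

Sequences shorter than k simply produce an empty feature vector, so the kernel value is 0 in that case.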
K-Spectrum Feature Map
Inexact Matches
Improve performance by explicitly taking care of mismatches
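One way to make this concrete is the (k, m)-mismatch feature map, in which every k-mer in the sequence contributes to all k-mers within m mismatches of it. The brute-force Python sketch below enumerates those neighborhoods directly; it is meant only to show the definition and assumes small k and m. The function names and the two-letter test alphabet are illustrative.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(kmer, m, alphabet):
    """All strings of the same length within Hamming distance <= m of kmer."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))  # the set removes duplicates
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Each k-mer in seq adds 1 to every feature in its mismatch neighborhood."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[neighbor] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet):
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(sorted(mismatch_neighborhood("AA", 1, "AB")))
print(mismatch_kernel("AA", "BB", 2, 1, "AB"))
```

With m = 0 this reduces to exact k-mer matching; with m = 1, "AA" and "BB" share no exact 2-mer yet still get a non-zero kernel value through their common neighbors.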
Mismatch Tree
Mismatches can be efficiently tracked through a tree.
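A hedged Python sketch of the idea: traverse the tree of all k-mers depth first, carry along the k-mer instances from both sequences that are still within m mismatches of the current prefix, and prune a branch as soon as no instance survives. This is a simplified two-sequence illustration written for this transcript, not the authors' implementation; all names are hypothetical.

```python
def mismatch_kernel_tree(x, y, k, m, alphabet):
    """(k, m)-mismatch kernel of x and y via depth-first traversal of the k-mer tree."""
    seqs = (x, y)
    # A live instance is (sequence id, start offset, mismatches used so far).
    start = [(0, i, 0) for i in range(len(x) - k + 1)]
    start += [(1, i, 0) for i in range(len(y) - k + 1)]

    def visit(depth, live):
        if not live:
            return 0  # prune: no k-mer instance is compatible with this prefix
        if depth == k:
            # Leaf = one feature; its contribution is count_in_x * count_in_y.
            n_x = sum(1 for sid, _, _ in live if sid == 0)
            n_y = len(live) - n_x
            return n_x * n_y
        total = 0
        for letter in alphabet:
            survivors = []
            for sid, offset, used in live:
                cost = used + (seqs[sid][offset + depth] != letter)
                if cost <= m:
                    survivors.append((sid, offset, cost))
            total += visit(depth + 1, survivors)
        return total

    return visit(0, start)


print(mismatch_kernel_tree("ABAB", "BABA", 2, 0, "AB"))
print(mismatch_kernel_tree("AA", "BB", 2, 1, "AB"))
```

Only branches that some k-mer instance can still reach are explored, so the full |Σ|^k-dimensional feature vector is never materialized.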
Fast Prediction
Kernel (string mismatch) can be computed efficiently.
SVM is fast (a dot product over support vectors)
Therefore, protein classification is fast.
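A minimal sketch of why prediction can be linear in the length of the test sequence, shown for the exact-match (m = 0) case: the support-vector sum can be folded once into a lookup table of per-k-mer scores, after which classifying a sequence is a single pass over its k-mers. The support sequences, labels, and α values below are made up, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_scores(support_seqs, labels, alphas, k):
    """Precompute w[kmer] = sum_i alpha_i * y_i * count_i(kmer) over support vectors."""
    scores = defaultdict(float)
    for seq, y, a in zip(support_seqs, labels, alphas):
        for kmer, count in kmer_counts(seq, k).items():
            scores[kmer] += a * y * count
    return scores


def predict(seq, scores, b, k):
    """f(x) = sum of the scores of the k-mers in x, plus b: one pass over x."""
    return sum(scores.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1)) + b


# Hypothetical support vectors and coefficients.
support = ["ABAB", "BBBA"]
labels = [1, -1]
alphas = [0.5, 1.0]
scores = build_kmer_scores(support, labels, alphas, k=2)
print(predict("BABA", scores, b=0.0, k=2))
```

The table is built once after training; a mismatch kernel would add each support k-mer's contribution to all k-mers in its mismatch neighborhood instead.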
Experiments
Tested with experiments on SCOP dataset from Jaakkola et al.
Experiments designed to ask: Could the method discover a new family of a known superfamily?
SCOP Experiments
160 experiments for 33 target families from 16 superfamilies
Compared results against:
  SVM-Fisher (HMM-based kernel)
  SAM-T98 (profile HMM)
  PSI-BLAST (heuristic alignment-based method)
ROC scores: area under the graph of true positives as a function of false positives, scaled so that both axes vary between 0 and 1
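A small Python sketch of the ROC score as described: rank examples by classifier score, then accumulate the number of true positives seen so far each time a false positive is encountered, and normalize so both axes run from 0 to 1. Ties in score are broken arbitrarily here, which is a simplifying assumption; the function name is illustrative.

```python
def roc_score(scores, labels):
    """Area under the ROC curve; labels are 1 (positive) or 0 (negative)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            # Each false positive adds a unit-width column of height true_pos.
            area += true_pos
    return area / (n_pos * n_neg)


print(roc_score([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
print(roc_score([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A score of 1 means every positive is ranked above every negative; 0.5 corresponds to a random ranking.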
Result
Result (contd.)
Aside: Similarity with SVM-Fisher
Fisher kernel for Markov chain model similar to k-spectrum kernel
Mismatch-SVM performs as well as SVM-Fisher but avoids computational expense, training difficulties of profile HMM
Advantages of mismatch SVM
Efficient computation: scales linearly with sequence length
Fast prediction: classify test sequences in linear time
Interpretation of learned classifier
General approach for biosequence data, does not rely on alignment or generative model
Other Extensions of Mismatch Kernel
Other models for inexact matching:
  Restricted gaps
  Probabilistic substitutions
  Wildcards
Inexact Matching through Gaps
Result: Gap Kernel
Inexact matching through Probabilistic Substitutions
Use substitution matrices to obtain P(a|b), substitution probabilities for residues a, b
The mutation neighborhood M(k,s)(s) is the set of all k-mers t such that
Results: Substitution
Wildcards
Results: Wildcard
Cluster Kernels: Semi-Supervised
About 30,000 proteins with known structure (labeled proteins), but about 1 million sequenced proteins
BLAST, PSI-BLAST: widely-used heuristic alignment-based sequence similarity scores
  Good local similarity score, less useful for more remote homology
  BLAST/PSI-BLAST E-values give good measure of distance between sequences
Can we use unlabeled data, combined with good local distance measure, for semi-supervised approach to protein classification?
Cluster Kernels
Use unlabeled data to change the (string) kernel representation
Cluster assumption: decision boundary should pass through low density region of input space; clusters in the data are likely to have consistent labels
Profile kernel
Bagged kernel
Profile Kernel
Profile Kernel
Bagged Kernel
A hack to use clustering for a kernel
Use k-means clustering to cluster data (labeled + unlabeled), N bagged runs
Using N runs, define
p(x, y) = (# times x, y are in same cluster) / N
Bagged kernel:
K_Bagged(x, y) = p(x, y) K(x, y)
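A rough Python sketch of the bagged kernel on toy data. It assumes the N runs differ only in their random initialization, uses a plain Lloyd's k-means written out in pure Python, and works on 2-D points with a linear base kernel as a stand-in for sequence feature vectors and a string kernel. All names, data, and parameter values are illustrative.

```python
import random


def kmeans_labels(points, n_clusters, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        for i, p in enumerate(points):
            labels[i] = min(
                range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def bagged_kernel(points, base_kernel, n_clusters, n_runs, seed=0):
    """K_bagged[i][j] = p(x_i, x_j) * K(x_i, x_j), p = co-clustering frequency."""
    rng = random.Random(seed)
    n = len(points)
    same = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = kmeans_labels(points, n_clusters, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    same[i][j] += 1
    return [
        [same[i][j] / n_runs * base_kernel(points[i], points[j]) for j in range(n)]
        for i in range(n)
    ]


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy data: labeled and unlabeled points are clustered together.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
K = bagged_kernel(points, linear_kernel, n_clusters=2, n_runs=10)
for row in K:
    print(row)
```

Because p(x, x) = 1, the diagonal of the base kernel is unchanged, while entries for pairs that rarely share a cluster are shrunk toward zero.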
Experimental Setup
Full dataset: 7329 SCOP protein sequences
Experiments:
  54 target families (remote homology detection)
  Test + training approximately <4000 sequences for each experiment, other data treated as unlabeled
Evaluation: How well do cluster kernel approaches compare to the standard approach, adding positive homologs to dataset?
Results: Cluster Kernels
Conclusions
SVMs with string kernels – like mismatch kernels – that incorporate inexact matching are competitive with best-known methods for protein classification
Efficient kernel computation: linear-time prediction, feature extraction
Gaps, substitutions, and wildcards: other possible kernels
Semi-supervised cluster kernels – using unlabeled data to modify kernel representation – improve on original string kernel
Future Directions
Full multiclass protein classification problem
  1000s of classes of different sizes, hierarchical labels
Use of unlabeled data for improving kernel representation
Domain segmentation problem: predict and classify domains of multidomain proteins
Local structure prediction: predict local conformation state (backbone angles) for short peptide segments, step towards structure prediction
Thank you
CS 374 – Algorithms in Biology Tuesday, October 24, 2006
Protein Ranking
Pairwise sequence comparison: most fundamental bioinformatics application
BLAST, PSI-BLAST: widely-used heuristic alignment-based sequence similarity scores
  Given query sequence, search unlabeled database and return ranked list of similar sequences
  Query sequence does not have to belong to known family or superfamily
  Good local similarity score, less useful for more remote homology
Can we use semi-supervised machine learning to improve on PSI-BLAST?
Ranking Induced By Clustering
Ranking: Experimental Setup
Training set: 4246 SCOP protein sequences (from 554 superfamilies) – known classes
Test set: 3083 SCOP protein sequences (from 516 different superfamilies) – hidden classes
101,403 unlabeled sequences from SWISSPROT
Task: How well can we retrieve database sequences (from train + test sets) in same superfamily as query? Evaluate with ROC-50 scores
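A hedged Python sketch of a ROC-n score such as ROC-50: the area under the ROC curve is accumulated only up to the n-th false positive in the ranked list and then normalized to lie between 0 and 1. Normalizing by the number of false positives actually seen when the list contains fewer than n negatives, and returning 1.0 when there are none, are assumptions made here; the function name is illustrative.

```python
def roc_n_score(scores, labels, n=50):
    """ROC-n: normalized area under the ROC curve up to the n-th false positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    true_pos = 0
    false_pos = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # true positives ranked above this false positive
            if false_pos == n:
                break
    if false_pos == 0:
        return 1.0  # assumption: a list with no negatives counts as perfect
    return area / (n_pos * false_pos)


print(roc_n_score([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n=1))
print(roc_n_score([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n=50))
```

Truncating at a fixed number of false positives focuses the evaluation on the top of the ranked list, which is the part a user of a database search actually inspects.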
For initial experiments, SWISSPROT sequences only used for PSI-BLAST scores, not for clustering
Results: Ranking