A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences...
-
Upload
joshua-reed -
Category
Documents
-
view
219 -
download
0
Transcript of A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences...
A statistical framework for genomic data fusion
William Stafford NobleDepartment of Genome Sciences
Department of Computer Science and Engineering
University of Washington
Outline
• Recognizing correctly identified peptides
• The support vector machine algorithm
• Experimental results
• Yeast protein classification
• SVM learning from heterogeneous data
• Results
Recognizing correctly identified peptides
Database searchProtein sample
Tandem mass spectrometer
Search algorithm
Observed spectra
Predicted peptides
Sequence database
The learning task
• We are given paired observed and theoretical spectra.
• Question: Is the pairing correct?
Observed Theoretical
Properties of the observed spectrum
1. Total peptide mass. Too small yields little information; too large (>25 amino acids) yields uneven fragmentation.
2. Charge (+1, +2 or +3). Provides some evidence about amino acid composition.
3. Total ion current. Proportional to the amount of peptide present.
4. Peak count. Small indicates poor fragmentation; large indicates noise.
Observed vs. theoretical spectra
5. Mass difference.
6. Percent of ions matched. Number of matched ions / total number of ions.
7. Percent of peaks matched. Number of matched peaks / total number of peaks.
8. Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks.
9. Preliminary SEQUEST score (Sp).
10. Preliminary score rank.
11. SEQUEST cross-correlation (XCorr).
Top-ranked vs. second-ranked peptides
12.Change in cross-correlation. Compute the difference in XCorr for the top-ranked and second-ranked peptide.
13.Percent sequence identity. Usually anti-correlated with change in cross-correlation.
Pos
itiv
e ex
ampl
es
Neg
ativ
e ex
ampl
es
The support vector machine algorithm
SVMs in computational biology
• Splice site recognition• Protein sequence
similarity detection• Protein functional
classification• Regulatory module
search
• Protein-protein interaction prediction
• Gene functional classification from microarray data
• Cancer classification from microarray data
Support vector machine
Support vector machine
++
+
+ +
+
+
+ +
+ +
+-- -
-
-
-
-
--
-
--
-+
+
-
--
Locate a plane that separates
positive from negative
examples.
Focus on the examples closest to the boundary.
Kernel matrix
31, YXYXK
2
2
2exp,
YX
YXK
Kernel functions
• Let X be a finite input space.• A kernel is a function K, such that for all x, z
X, K(x, z) = (x) · (y), where is a mapping from X to an (inner product) feature space F.
• Let K(x,z) be a symmetric function on X. Then K(x,z) is a kernel function if and only if the matrix
is positive semi-definite.
njijiK
1,,
xxK
Peptide ID kernel function
• Let p(x,y) be the function that computes a 13-element vector of parameters for a pair of spectra, x and y.
• The kernel function K operates on pairs of observed and theoretical spectra:
21,,
,,,:,:
Bt
Bo
At
Ao
Bt
Bo
At
Ao
Bt
Bo
At
Ao
SSpSSp
SSpSSpKSSSSK
Experimental results
Experimental design
• Data consists of one 13-element vector per predicted peptide.
• Each feature is normalized to sum to 1.0 across all examples.
• The SVM is tested using leave-one-out cross-validation.
• The SVM uses a second-degree polynomial, normalized kernel with a 2-norm asymmetric soft margin.
Three data sets
• Set 1: Ion trap mass spectrometer. Sequest search on the full non-redundant database.
• Set 2: Ion trap mass spectrometer. Sequest search on human NRDB.
• Set 3: Quadrupole time-of-flight mass spectrometer. Sequence search on human NRDB.
Data set sizes
15405231017QTOF HNRDB
1161465696Ion-trap HNRDB
976479497Ion-trap NRDB
TotalNegativePositive
0.940.95
0.99
(18,966) (126,931)
(57,732)
(108,936)(27,936)
Conversion to probabilities
• Hold out a subset of the training examples.
• Use the hold-out set to fit a sigmoid.
• This is equivalent to assuming that the SVM output is proportional to the log-odds of a positive example.
BAfefy
1
11Pr
y = labelf = discriminant
114.015.111
Pr xe
x
Yeast protein classification
Membrane proteins
• Membrane proteins anchor in a cellular membrane (plasma, ER, golgi, mitochondrial).
• Communicate across membrane.
• Pass through the membrane several times.
mRNA expression
dataprotein-protein interaction data
sequence data
Heterogeneous data
Vector representation
• Each matrix entry is an mRNA expression measurement.
• Each column is an experiment.
• Each row corresponds to a gene.
2
2
2exp,
YX
YXK
>ICYA_MANSEGDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVINMKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
Sequence kernels
• We cannot compute a scalar product on a pair of variable-length, discrete strings.
Pairwise comparison kernel
Pairwise comparison kernel
Pairwise kernel variants
• Smith-Waterman all-vs-all
• BLAST all-vs-all• Smith-Waterman
w.r.t. SCOP database• E-values from Pfam
database
Protein-protein interactions
• Pairwise interactions can be represented as a graph or a matrix.1 0 0 1 0 1 0
11 0 1 0 1 1 0 10 0 0 0 1 1 0 00 0 1 0 1 1 0 10 0 1 0 1 0 0 11 0 0 0 0 0 0 10 0 1 0 1 0 0 0
protein
protein
Linear interaction kernel
• The simplest kernel counts the number of interactions between each pair.
1 0 0 1 0 1 0 11 0 1 0 1 1 0 10 0 0 0 1 1 0 00 0 1 0 1 1 0 10 0 1 0 1 0 0 11 0 0 0 0 0 0 10 0 1 0 1 0 0 0
3
Diffusion kernel
• A general method for establishing similarities between nodes of a graph.
• Based upon a random walk.• Efficiently accounts for all paths connecting two
nodes, weighted by path lengths.
Hydrophobicity profile
• Transmembrane regions are typically hydrophobic, and vice versa.
• The hydrophobicity profile of a membrane protein is evolutionarily conserved.
Membrane Membrane proteinprotein
Non-membrane Non-membrane proteinprotein
Hydrophobicity kernel
• Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index.
• Prefilter the profiles.• Compare two profiles by
– Computing fast Fourier transform (FFT), and– Applying Gaussian kernel function.
• This kernel detects periodicities in the hydrophobicity profile.
SVM learning from heterogeneous data
Combining kernels
Identical
A B
K(A) K(B)
K(A)+K(B)
A B
A:B
K(A:B)
Semidefinite programming
• Define a convex cost function to assess the quality of a kernel matrix.
• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.
Learn K from the convex cone Learn K from the convex cone of positive-semidefinite of positive-semidefinite matrices or a convex subset of matrices or a convex subset of it :it :
According to a convex According to a convex quality measure:quality measure:
Integrate constructed Integrate constructed kernelskernels
Learn a linear mix
Large margin classifier Large margin classifier (SVM)(SVM)
Maximize the margin
SDPSDP
Semidefinite programming
i
iiKK
i
iiKK
Integrate constructed Integrate constructed kernelskernels
Learn a linear mix
Large margin classifier Large margin classifier (SVM)(SVM)
Maximize the margin
Experimental results
Seven yeast kernels
Kernel Data Similarity measure
KSW protein sequence Smith-Waterman
KB protein sequence BLAST
KHMM protein sequence Pfam HMM
KFFT hydropathy profile FFT
KLI protein interactions
linear kernel
KD protein interactions
diffusion kernel
KE gene expression radial basis kernel
Membrane proteins
Comparison of performance
Simple rules fromhydrophobicity profile
TMHMM
Cytoplasmic ribosomal proteins
False negative predictions
False negative expression profiles
Markov Random Field
• General Bayesian method, applied by Deng et al. to yeast functional classification.
• Used five different types of data.
• For their model, the input data must be binary.
• Reported improved accuracy compared to using any single data type.
Yeast functional classesCategory SizeMetabolism 1048
Energy 242
Cell cycle & DNA processing 600
Transcription 753
Protein synthesis 335
Protein fate 578
Cellular transport 479
Cell rescue, defense 264
Interaction w/ evironment 193
Cell fate 411
Cellular organization 192
Transport facilitation 306
Other classes 81
Six types of data
• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).
Results
MRF
SDP/SVM(binary)
SDP/SVM(enriched)
Many yeast kernels
• protein sequence• phylogenetic profiles• separate gene
expression kernels• time series
expression kernel• promoter regions
using seven aligned species
• protein localization• ChIP• protein-protein
interactions• yeast knockout
growth data• more ...
Future work
• New kernel functions that incorporate domain knowledge.
• Better understanding of the semantics of kernel weights.
• Further investigation of yeast biology.
• Improved scalability of the algorithm.
• Prediction of protein-protein interactions.
Acknowledgments
• Dave Anderson, University of Oregon
• Wei Wu, Genome Sciences, UW & FHCRC
• Michael Jordan, Statistics & EECS, UC Berkeley
• Laurent El Ghaoui, EECS, UC Berkeley
• Gert Lanckriet, EECS, UC Berkeley
• Nello Cristianini, Statistics, UC Davis
Fisher criterion score
22
21
221
High scoreLow score
Feature ranking
delta Cn 2.861% match total ion current 2.804Cn 2.444% match peaks 2.314Sp 1.158mass 0.704charge 0.488rank Sp 0.313peak count 0.209sequence similarity 0.115% ion match 0.079total ion current 0.026delta mass 0.024
Pairwise feature ranking
% match TIC-delta Cn 4.741% match peaks-delta Cn 4.233% match TIC-Cn 3.819delta Cn-Cn 3.597delta Cn-charge 3.563% match peaks-Cn 3.377delta Cn-mass 3.119% match TIC-% match peaks 2.823% ion match-delta Cn 2.812Sp-delta Cn 2.799% match TIC-Sp 2.579% match peaks-Sp 2.383
% match TIC-mass 2.097% ion match-mass 2.091Cn-charge 1.943Sp-mass 1.922% match TIC-charge 1.898Cn-mass 1.884Sp-Cn 1.881% ion match-Cn 1.827Sp-charge 1.770% match peaks-mass 1.668% match peaks-charge 1.528% match TIC-% ion match 1.473