
A statistical framework for genomic data fusion

William Stafford Noble

Department of Genome Sciences

Department of Computer Science and Engineering

University of Washington

Outline

• Recognizing correctly identified peptides

• The support vector machine algorithm

• Experimental results

• Yeast protein classification

• SVM learning from heterogeneous data

• Results

Recognizing correctly identified peptides

[Figure: database search. A protein sample is analyzed by a tandem mass spectrometer to produce observed spectra; a search algorithm compares these against a sequence database to produce predicted peptides.]

The learning task

• We are given paired observed and theoretical spectra.

• Question: Is the pairing correct?

[Figure: an observed spectrum paired with a theoretical spectrum.]

Properties of the observed spectrum

1. Total peptide mass. Too small yields little information; too large (>25 amino acids) yields uneven fragmentation.

2. Charge (+1, +2 or +3). Provides some evidence about amino acid composition.

3. Total ion current. Proportional to the amount of peptide present.

4. Peak count. Small indicates poor fragmentation; large indicates noise.

Observed vs. theoretical spectra

5. Mass difference.

6. Percent of ions matched. Number of matched ions / total number of ions.

7. Percent of peaks matched. Number of matched peaks / total number of peaks.

8. Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks.

9. Preliminary SEQUEST score (Sp).

10. Preliminary score rank.

11. SEQUEST cross-correlation (XCorr).

Top-ranked vs. second-ranked peptides

12. Change in cross-correlation. Compute the difference in XCorr between the top-ranked and second-ranked peptides.

13. Percent sequence identity. Usually anti-correlated with the change in cross-correlation.

[Figure: positive examples and negative examples.]

The support vector machine algorithm

SVMs in computational biology

• Splice site recognition

• Protein sequence similarity detection

• Protein functional classification

• Regulatory module search

• Protein-protein interaction prediction

• Gene functional classification from microarray data

• Cancer classification from microarray data

Support vector machine

[Figure: positive (+) and negative (-) examples in feature space, separated by a hyperplane.]

Locate a plane that separates positive from negative examples.

Focus on the examples closest to the boundary.

Kernel matrix

K(X, Y) = (X · Y + 1)³

K(X, Y) = exp( −‖X − Y‖² / (2σ²) )

Kernel functions

• Let X be a finite input space.

• A kernel is a function K such that for all x, z ∈ X, K(x, z) = Φ(x) · Φ(z), where Φ is a mapping from X to an (inner product) feature space F.

• Let K(x, z) be a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix

K = ( K(xᵢ, xⱼ) )ᵢ,ⱼ₌₁..ₙ

is positive semi-definite.
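As a quick illustration of this positive semi-definiteness condition, here is a minimal sketch (assuming NumPy; the function names are illustrative, not from the original work) that builds a small Gaussian kernel matrix and checks that its eigenvalues are non-negative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

def is_positive_semidefinite(K, tol=1e-10):
    """A symmetric matrix is a valid kernel matrix iff its eigenvalues are all >= 0."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.RandomState(0).randn(20, 5)   # 20 examples, 5 features
K = gaussian_kernel_matrix(X, sigma=2.0)
print(is_positive_semidefinite(K))          # True: the Gaussian kernel is PSD
```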

Peptide ID kernel function

• Let p(x,y) be the function that computes a 13-element vector of parameters for a pair of spectra, x and y.

• The kernel function K operates on pairs of observed and theoretical spectra:

K′((S_o^A, S_t^A), (S_o^B, S_t^B)) := K(p(S_o^A, S_t^A), p(S_o^B, S_t^B)) = ( p(S_o^A, S_t^A) · p(S_o^B, S_t^B) + 1 )²

where S_o and S_t denote the observed and theoretical spectra of pairs A and B.
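A minimal sketch of evaluating such a kernel, assuming the 13-element feature vectors p(S_o, S_t) have already been computed as NumPy arrays; the function name is illustrative, not from the original implementation.

```python
import numpy as np

def peptide_id_kernel(p_a, p_b, degree=2):
    """Polynomial kernel on the 13-element feature vectors of two spectrum pairs."""
    return (np.dot(p_a, p_b) + 1.0) ** degree

# Hypothetical feature vectors for two (observed, theoretical) spectrum pairs.
p_a = np.random.RandomState(1).rand(13)
p_b = np.random.RandomState(2).rand(13)
print(peptide_id_kernel(p_a, p_b))
```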

Experimental results

Experimental design

• Data consists of one 13-element vector per predicted peptide.

• Each feature is normalized to sum to 1.0 across all examples.

• The SVM is tested using leave-one-out cross-validation.

• The SVM uses a second-degree polynomial, normalized kernel with a 2-norm asymmetric soft margin.
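A rough sketch of this evaluation loop using scikit-learn; the feature matrix and labels below are synthetic placeholders, and the 2-norm asymmetric soft margin is simplified to a standard C-SVM.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(40, 13)                        # one 13-element vector per predicted peptide
y = rng.randint(0, 2, size=40)              # 1 = correct identification, 0 = incorrect

X = X / X.sum(axis=0, keepdims=True)        # normalize each feature to sum to 1.0

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="poly", degree=2, coef0=1.0)   # second-degree polynomial kernel
    clf.fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

print("leave-one-out accuracy:", correct / len(y))
```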

Three data sets

• Set 1: Ion trap mass spectrometer. Sequest search on the full non-redundant database.

• Set 2: Ion trap mass spectrometer. Sequest search on human NRDB.

• Set 3: Quadrupole time-of-flight mass spectrometer. Sequest search on human NRDB.

Data set sizes

Data set         Positive  Negative  Total
QTOF HNRDB           1017       523   1540
Ion-trap HNRDB        696       465   1161
Ion-trap NRDB         497       479    976

[Figure: ROC curves for the three data sets, with areas under the curve of 0.99, 0.95, and 0.94.]

Conversion to probabilities

• Hold out a subset of the training examples.

• Use the hold-out set to fit a sigmoid.

• This is equivalent to assuming that the SVM output is proportional to the log-odds of a positive example.

Pr(y = 1 | f) = 1 / (1 + e^(Af + B))

y = label, f = discriminant

(The slide also shows the sigmoid fitted to this data set, of the same form with specific values of A and B.)
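A minimal sketch of fitting such a sigmoid on held-out discriminant values, using plain logistic regression on the discriminant as a stand-in for the fitting procedure; the data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
f_holdout = rng.randn(100)                                        # SVM discriminants on the hold-out set
y_holdout = (f_holdout + 0.3 * rng.randn(100) > 0).astype(int)    # hold-out labels

# Fit a sigmoid of the form Pr(y=1 | f) = 1 / (1 + exp(Af + B)).
platt = LogisticRegression()
platt.fit(f_holdout.reshape(-1, 1), y_holdout)

f_new = np.array([[-1.0], [0.0], [2.0]])
print(platt.predict_proba(f_new)[:, 1])    # estimated probability of a correct identification
```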

Yeast protein classification

Membrane proteins

• Membrane proteins anchor in a cellular membrane (plasma, ER, Golgi, mitochondrial).

• Communicate across the membrane.

• Pass through the membrane several times.

Heterogeneous data

• mRNA expression data
• protein-protein interaction data
• sequence data

Vector representation

• Each matrix entry is an mRNA expression measurement.

• Each column is an experiment.

• Each row corresponds to a gene.

K(X, Y) = exp( −‖X − Y‖² / (2σ²) )
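As an illustration, the sketch below (assuming scikit-learn; gamma = 1/(2σ²) matches the Gaussian width above) turns a synthetic gene-by-experiment expression matrix into a gene-by-gene radial basis kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
expression = rng.randn(100, 30)    # 100 genes (rows) x 30 expression experiments (columns)

sigma = 2.0
K_expr = rbf_kernel(expression, gamma=1.0 / (2 * sigma ** 2))
print(K_expr.shape)                # (100, 100): one kernel entry per pair of genes
```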

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Sequence kernels

• We cannot compute a scalar product on a pair of variable-length, discrete strings.

Pairwise comparison kernel


Pairwise kernel variants

• Smith-Waterman all-vs-all

• BLAST all-vs-all

• Smith-Waterman w.r.t. the SCOP database

• E-values from the Pfam database
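One common way to realize a pairwise comparison kernel is the empirical kernel map: represent each protein by its vector of similarity scores against every protein in the set, then take inner products of those vectors. The sketch below assumes a precomputed all-vs-all score matrix (e.g. Smith-Waterman or BLAST); it is an illustration, not necessarily the exact construction used here.

```python
import numpy as np

rng = np.random.RandomState(0)
n_proteins = 50

# Hypothetical all-vs-all similarity scores (e.g. Smith-Waterman), made symmetric.
S = rng.rand(n_proteins, n_proteins)
S = (S + S.T) / 2.0

# Empirical kernel map: protein i is represented by its row of scores S[i, :],
# and the kernel is the inner product of those score vectors.
K_pairwise = S @ S.T
print(K_pairwise.shape)    # (50, 50); positive semi-definite by construction
```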

Protein-protein interactions

• Pairwise interactions can be represented as a graph or as a binary matrix whose rows and columns are indexed by proteins:

1 0 0 1 0 1 0 1
1 0 1 0 1 1 0 1
0 0 0 0 1 1 0 0
0 0 1 0 1 1 0 1
0 0 1 0 1 0 0 1
1 0 0 0 0 0 0 1
0 0 1 0 1 0 0 0

Linear interaction kernel

• The simplest kernel counts the number of interaction partners shared by each pair of proteins.

1 0 0 1 0 1 0 1
1 0 1 0 1 1 0 1
0 0 0 0 1 1 0 0
0 0 1 0 1 1 0 1
0 0 1 0 1 0 0 1
1 0 0 0 0 0 0 1
0 0 1 0 1 0 0 0

For example, the first two proteins (rows 1 and 2) share 3 interaction partners, so their kernel value is 3.
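A minimal sketch of this linear interaction kernel with NumPy, using the interaction matrix shown above: each protein's row serves as its feature vector, and the kernel value is the number of interaction partners two proteins share.

```python
import numpy as np

# Interaction matrix from the slide (rows: proteins; columns: interaction partners).
A = np.array([
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0],
])

# Linear kernel: K[i, j] counts the interactions shared by proteins i and j.
K_linear = A @ A.T
print(K_linear[0, 1])    # 3, matching the example value on the slide
```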

Diffusion kernel

• A general method for establishing similarities between nodes of a graph.

• Based upon a random walk.

• Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
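A small sketch of a diffusion kernel on an interaction graph, assuming SciPy is available: K = exp(βH), where H is the negative graph Laplacian and exp is the matrix exponential; the graph below is arbitrary illustration data.

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a small undirected interaction graph (illustrative).
A = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))    # degree matrix
H = A - D                     # negative graph Laplacian (generator of the random walk)

beta = 0.5                    # diffusion parameter: larger beta weights longer paths more
K_diff = expm(beta * H)       # sums over all paths between nodes, weighted by length
print(np.round(K_diff, 3))
```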

Hydrophobicity profile

• Transmembrane regions are typically hydrophobic, and vice versa.

• The hydrophobicity profile of a membrane protein is evolutionarily conserved.

[Figure: hydrophobicity profiles of a membrane protein and a non-membrane protein.]

Hydrophobicity kernel

• Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index.

• Prefilter the profiles.

• Compare two profiles by
  – computing the fast Fourier transform (FFT), and
  – applying a Gaussian kernel function.

• This kernel detects periodicities in the hydrophobicity profile.
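An illustrative sketch of that pipeline (Kyte-Doolittle profile, FFT, Gaussian kernel on the magnitude spectra). The prefiltering step is omitted and profiles are simply padded or truncated to a common length so the FFTs are comparable; this is a simplified assumption, not the exact kernel used here.

```python
import numpy as np

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def hydropathy_profile(seq, length=128):
    """Map a protein sequence to a fixed-length hydropathy profile (pad or truncate)."""
    prof = np.array([KD[aa] for aa in seq if aa in KD])
    out = np.zeros(length)
    out[:min(length, len(prof))] = prof[:length]
    return out

def fft_gaussian_kernel(seq_a, seq_b, sigma=10.0):
    """Compare the FFT magnitude spectra of two hydropathy profiles with a Gaussian kernel."""
    fa = np.abs(np.fft.rfft(hydropathy_profile(seq_a)))
    fb = np.abs(np.fft.rfft(hydropathy_profile(seq_b)))
    return np.exp(-np.sum((fa - fb) ** 2) / (2 * sigma ** 2))

print(fft_gaussian_kernel("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                          "GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKL"))
```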

SVM learning from heterogeneous data

Combining kernels

[Figure: computing kernels K(A) and K(B) from two representations A and B and summing them, K(A) + K(B), is identical to computing a single kernel K(A:B) on the concatenated representation A:B.]
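A quick numerical check of that identity, assuming linear kernels over NumPy arrays: summing the kernel matrices computed from two representations A and B gives the same matrix as a single linear kernel on the concatenation A:B.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(10, 4)    # representation A of 10 examples
B = rng.randn(10, 6)    # representation B of the same 10 examples

K_A = A @ A.T           # linear kernel on A
K_B = B @ B.T           # linear kernel on B
K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T    # linear kernel on A:B

print(np.allclose(K_A + K_B, K_concat))    # True: summing kernels == concatenating features
```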

Semidefinite programming

• Define a convex cost function to assess the quality of a kernel matrix.

• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

Learn K from the convex cone of positive-semidefinite matrices, or a convex subset of it, according to a convex quality measure.

Integrate constructed kernels: learn a linear mix.

Large margin classifier (SVM): maximize the margin.

SDP

Semidefinite programming

K = Σᵢ μᵢ Kᵢ

Integrate constructed kernels: learn a linear mix of the base kernels Kᵢ with weights μᵢ.

Large margin classifier (SVM): maximize the margin.
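The SDP optimization itself is beyond a short example, but the sketch below shows the form of its output: a non-negative weighting μᵢ of precomputed kernel matrices fed to a standard SVM. The weights here are arbitrary placeholders, not learned by SDP, and the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.RandomState(0)
X1 = rng.randn(60, 5)       # data source 1 (e.g. expression vectors)
X2 = rng.randn(60, 8)       # data source 2 (e.g. another representation)
y = rng.randint(0, 2, size=60)

K1 = linear_kernel(X1)      # one base kernel per data source
K2 = rbf_kernel(X2, gamma=0.1)

mu = np.array([0.7, 0.3])             # non-negative kernel weights; SDP would learn these
K = mu[0] * K1 + mu[1] * K2           # combined kernel K = sum_i mu_i * K_i

clf = SVC(kernel="precomputed")
clf.fit(K, y)
print(clf.score(K, y))                # training accuracy of the combined-kernel SVM
```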

Experimental results

Seven yeast kernels

Kernel   Data                  Similarity measure
K_SW     protein sequence      Smith-Waterman
K_B      protein sequence      BLAST
K_HMM    protein sequence      Pfam HMM
K_FFT    hydropathy profile    FFT
K_LI     protein interactions  linear kernel
K_D      protein interactions  diffusion kernel
K_E      gene expression       radial basis kernel

Membrane proteins

Comparison of performance

Simple rules from hydrophobicity profile

TMHMM

Cytoplasmic ribosomal proteins

False negative predictions

False negative expression profiles

Markov Random Field

• General Bayesian method, applied by Deng et al. to yeast functional classification.

• Used five different types of data.

• For their model, the input data must be binary.

• Reported improved accuracy compared to using any single data type.

Yeast functional classes

Category                      Size
Metabolism                    1048

Energy 242

Cell cycle & DNA processing 600

Transcription 753

Protein synthesis 335

Protein fate 578

Cellular transport 479

Cell rescue, defense 264

Interaction w/ environment 193

Cell fate 411

Cellular organization 192

Transport facilitation 306

Other classes 81

Six types of data

• Presence of Pfam domains.

• Genetic interactions from CYGD.

• Physical interactions from CYGD.

• Protein-protein interaction by TAP.

• mRNA expression profiles.

• (Smith-Waterman scores).

Results

[Figure: performance comparison of MRF, SDP/SVM (binary), and SDP/SVM (enriched).]

Many yeast kernels

• protein sequence
• phylogenetic profiles
• separate gene expression kernels
• time series expression kernel
• promoter regions using seven aligned species
• protein localization
• ChIP
• protein-protein interactions
• yeast knockout growth data
• more ...

Future work

• New kernel functions that incorporate domain knowledge.

• Better understanding of the semantics of kernel weights.

• Further investigation of yeast biology.

• Improved scalability of the algorithm.

• Prediction of protein-protein interactions.

Acknowledgments

• Dave Anderson, University of Oregon

• Wei Wu, Genome Sciences, UW & FHCRC

• Michael Jordan, Statistics & EECS, UC Berkeley

• Laurent El Ghaoui, EECS, UC Berkeley

• Gert Lanckriet, EECS, UC Berkeley

• Nello Cristianini, Statistics, UC Davis

Fisher criterion score

F = (μ₁ − μ₂)² / (σ₁² + σ₂²)

[Figure: example feature distributions yielding a high score and a low score.]
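A small sketch of computing this score per feature with NumPy, assuming a binary label vector; features whose class means are well separated relative to their variances receive high scores.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher criterion per feature: (mu1 - mu2)^2 / (sigma1^2 + sigma2^2)."""
    X1, X2 = X[y == 1], X[y == 0]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return num / den

rng = np.random.RandomState(0)
X = rng.randn(100, 13)
y = rng.randint(0, 2, size=100)
X[y == 1, 0] += 2.0                       # make feature 0 discriminative

scores = fisher_scores(X, y)
print(np.argsort(scores)[::-1][:3])       # indices of the top-ranked features
```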

Feature ranking

Feature                      Score
delta Cn                     2.861
% match total ion current    2.804
Cn                           2.444
% match peaks                2.314
Sp                           1.158
mass                         0.704
charge                       0.488
rank Sp                      0.313
peak count                   0.209
sequence similarity          0.115
% ion match                  0.079
total ion current            0.026
delta mass                   0.024

Pairwise feature ranking

Feature pair                   Score
% match TIC - delta Cn         4.741
% match peaks - delta Cn       4.233
% match TIC - Cn               3.819
delta Cn - Cn                  3.597
delta Cn - charge              3.563
% match peaks - Cn             3.377
delta Cn - mass                3.119
% match TIC - % match peaks    2.823
% ion match - delta Cn         2.812
Sp - delta Cn                  2.799
% match TIC - Sp               2.579
% match peaks - Sp             2.383
% match TIC - mass             2.097
% ion match - mass             2.091
Cn - charge                    1.943
Sp - mass                      1.922
% match TIC - charge           1.898
Cn - mass                      1.884
Sp - Cn                        1.881
% ion match - Cn               1.827
Sp - charge                    1.770
% match peaks - mass           1.668
% match peaks - charge         1.528
% match TIC - % ion match      1.473