A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences...

67
A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Transcript of A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences...

Page 1: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

A statistical framework for genomic data fusion

William Stafford NobleDepartment of Genome Sciences

Department of Computer Science and Engineering

University of Washington

Page 2: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Outline

• Recognizing correctly identified peptides

• The support vector machine algorithm

• Experimental results

• Yeast protein classification

• SVM learning from heterogeneous data

• Results

Page 3: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Recognizing correctly identified peptides

Page 4: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Database searchProtein sample

Tandem mass spectrometer

Search algorithm

Observed spectra

Predicted peptides

Sequence database

Page 5: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

The learning task

• We are given paired observed and theoretical spectra.

• Question: Is the pairing correct?

Observed Theoretical

Page 6: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Properties of the observed spectrum

1. Total peptide mass. Too small yields little information; too large (>25 amino acids) yields uneven fragmentation.

2. Charge (+1, +2 or +3). Provides some evidence about amino acid composition.

3. Total ion current. Proportional to the amount of peptide present.

4. Peak count. Small indicates poor fragmentation; large indicates noise.

Page 7: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Observed vs. theoretical spectra

5. Mass difference.

6. Percent of ions matched. Number of matched ions / total number of ions.

7. Percent of peaks matched. Number of matched peaks / total number of peaks.

8. Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks.

9. Preliminary SEQUEST score (Sp).

10. Preliminary score rank.

11. SEQUEST cross-correlation (XCorr).

Page 8: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Top-ranked vs. second-ranked peptides

12.Change in cross-correlation. Compute the difference in XCorr for the top-ranked and second-ranked peptide.

13.Percent sequence identity. Usually anti-correlated with change in cross-correlation.

Page 9: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Pos

itiv

e ex

ampl

es

Neg

ativ

e ex

ampl

es

Page 10: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

The support vector machine algorithm

Page 11: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

SVMs in computational biology

• Splice site recognition• Protein sequence

similarity detection• Protein functional

classification• Regulatory module

search

• Protein-protein interaction prediction

• Gene functional classification from microarray data

• Cancer classification from microarray data

Page 12: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Support vector machine

Page 13: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Support vector machine

++

+

+ +

+

+

+ +

+ +

+-- -

-

-

-

-

--

-

--

-+

+

-

--

Locate a plane that separates

positive from negative

examples.

Focus on the examples closest to the boundary.

Page 14: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Kernel matrix

Page 15: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

31, YXYXK

Page 16: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

2

2

2exp,

YX

YXK

Page 17: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Kernel functions

• Let X be a finite input space.• A kernel is a function K, such that for all x, z

X, K(x, z) = (x) · (y), where is a mapping from X to an (inner product) feature space F.

• Let K(x,z) be a symmetric function on X. Then K(x,z) is a kernel function if and only if the matrix

is positive semi-definite.

njijiK

1,,

xxK

Page 18: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Peptide ID kernel function

• Let p(x,y) be the function that computes a 13-element vector of parameters for a pair of spectra, x and y.

• The kernel function K operates on pairs of observed and theoretical spectra:

21,,

,,,:,:

Bt

Bo

At

Ao

Bt

Bo

At

Ao

Bt

Bo

At

Ao

SSpSSp

SSpSSpKSSSSK

Page 19: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Experimental results

Page 20: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Experimental design

• Data consists of one 13-element vector per predicted peptide.

• Each feature is normalized to sum to 1.0 across all examples.

• The SVM is tested using leave-one-out cross-validation.

• The SVM uses a second-degree polynomial, normalized kernel with a 2-norm asymmetric soft margin.

Page 21: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Three data sets

• Set 1: Ion trap mass spectrometer. Sequest search on the full non-redundant database.

• Set 2: Ion trap mass spectrometer. Sequest search on human NRDB.

• Set 3: Quadrupole time-of-flight mass spectrometer. Sequence search on human NRDB.

Page 22: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Data set sizes

15405231017QTOF HNRDB

1161465696Ion-trap HNRDB

976479497Ion-trap NRDB

TotalNegativePositive

Page 23: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.
Page 24: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.
Page 25: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

0.940.95

0.99

Page 26: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

(18,966) (126,931)

(57,732)

(108,936)(27,936)

Page 27: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.
Page 28: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.
Page 29: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Conversion to probabilities

• Hold out a subset of the training examples.

• Use the hold-out set to fit a sigmoid.

• This is equivalent to assuming that the SVM output is proportional to the log-odds of a positive example.

BAfefy

1

11Pr

y = labelf = discriminant

Page 30: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

114.015.111

Pr xe

x

Page 31: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Yeast protein classification

Page 32: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Membrane proteins

• Membrane proteins anchor in a cellular membrane (plasma, ER, golgi, mitochondrial).

• Communicate across membrane.

• Pass through the membrane several times.

Page 33: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

mRNA expression

dataprotein-protein interaction data

sequence data

Heterogeneous data

Page 34: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Vector representation

• Each matrix entry is an mRNA expression measurement.

• Each column is an experiment.

• Each row corresponds to a gene.

Page 35: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

2

2

2exp,

YX

YXK

Page 36: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

>ICYA_MANSEGDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVINMKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Sequence kernels

• We cannot compute a scalar product on a pair of variable-length, discrete strings.

Page 37: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Pairwise comparison kernel

Page 38: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Pairwise comparison kernel

Page 39: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Pairwise kernel variants

• Smith-Waterman all-vs-all

• BLAST all-vs-all• Smith-Waterman

w.r.t. SCOP database• E-values from Pfam

database

Page 40: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Protein-protein interactions

• Pairwise interactions can be represented as a graph or a matrix.1 0 0 1 0 1 0

11 0 1 0 1 1 0 10 0 0 0 1 1 0 00 0 1 0 1 1 0 10 0 1 0 1 0 0 11 0 0 0 0 0 0 10 0 1 0 1 0 0 0

protein

protein

Page 41: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Linear interaction kernel

• The simplest kernel counts the number of interactions between each pair.

1 0 0 1 0 1 0 11 0 1 0 1 1 0 10 0 0 0 1 1 0 00 0 1 0 1 1 0 10 0 1 0 1 0 0 11 0 0 0 0 0 0 10 0 1 0 1 0 0 0

3

Page 42: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Diffusion kernel

• A general method for establishing similarities between nodes of a graph.

• Based upon a random walk.• Efficiently accounts for all paths connecting two

nodes, weighted by path lengths.

Page 43: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Hydrophobicity profile

• Transmembrane regions are typically hydrophobic, and vice versa.

• The hydrophobicity profile of a membrane protein is evolutionarily conserved.

Membrane Membrane proteinprotein

Non-membrane Non-membrane proteinprotein

Page 44: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Hydrophobicity kernel

• Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index.

• Prefilter the profiles.• Compare two profiles by

– Computing fast Fourier transform (FFT), and– Applying Gaussian kernel function.

• This kernel detects periodicities in the hydrophobicity profile.

Page 45: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

SVM learning from heterogeneous data

Page 46: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Combining kernels

Identical

A B

K(A) K(B)

K(A)+K(B)

A B

A:B

K(A:B)

Page 47: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Semidefinite programming

• Define a convex cost function to assess the quality of a kernel matrix.

• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

Page 48: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Learn K from the convex cone Learn K from the convex cone of positive-semidefinite of positive-semidefinite matrices or a convex subset of matrices or a convex subset of it :it :

According to a convex According to a convex quality measure:quality measure:

Integrate constructed Integrate constructed kernelskernels

Learn a linear mix

Large margin classifier Large margin classifier (SVM)(SVM)

Maximize the margin

SDPSDP

Semidefinite programming

i

iiKK

Page 49: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

i

iiKK

Integrate constructed Integrate constructed kernelskernels

Learn a linear mix

Large margin classifier Large margin classifier (SVM)(SVM)

Maximize the margin

Page 50: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Experimental results

Page 51: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Seven yeast kernels

Kernel Data Similarity measure

KSW protein sequence Smith-Waterman

KB protein sequence BLAST

KHMM protein sequence Pfam HMM

KFFT hydropathy profile FFT

KLI protein interactions

linear kernel

KD protein interactions

diffusion kernel

KE gene expression radial basis kernel

Page 52: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Membrane proteins

Page 53: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Comparison of performance

Simple rules fromhydrophobicity profile

TMHMM

Page 54: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Cytoplasmic ribosomal proteins

Page 55: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

False negative predictions

Page 56: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

False negative expression profiles

Page 57: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.
Page 58: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Markov Random Field

• General Bayesian method, applied by Deng et al. to yeast functional classification.

• Used five different types of data.

• For their model, the input data must be binary.

• Reported improved accuracy compared to using any single data type.

Page 59: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Yeast functional classesCategory SizeMetabolism 1048

Energy 242

Cell cycle & DNA processing 600

Transcription 753

Protein synthesis 335

Protein fate 578

Cellular transport 479

Cell rescue, defense 264

Interaction w/ evironment 193

Cell fate 411

Cellular organization 192

Transport facilitation 306

Other classes 81

Page 60: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Six types of data

• Presence of Pfam domains.

• Genetic interactions from CYGD.

• Physical interactions from CYGD.

• Protein-protein interaction by TAP.

• mRNA expression profiles.

• (Smith-Waterman scores).

Page 61: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Results

MRF

SDP/SVM(binary)

SDP/SVM(enriched)

Page 62: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Many yeast kernels

• protein sequence• phylogenetic profiles• separate gene

expression kernels• time series

expression kernel• promoter regions

using seven aligned species

• protein localization• ChIP• protein-protein

interactions• yeast knockout

growth data• more ...

Page 63: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Future work

• New kernel functions that incorporate domain knowledge.

• Better understanding of the semantics of kernel weights.

• Further investigation of yeast biology.

• Improved scalability of the algorithm.

• Prediction of protein-protein interactions.

Page 64: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Acknowledgments

• Dave Anderson, University of Oregon

• Wei Wu, Genome Sciences, UW & FHCRC

• Michael Jordan, Statistics & EECS, UC Berkeley

• Laurent El Ghaoui, EECS, UC Berkeley

• Gert Lanckriet, EECS, UC Berkeley

• Nello Cristianini, Statistics, UC Davis

Page 65: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Fisher criterion score

22

21

221

High scoreLow score

Page 66: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Feature ranking

delta Cn 2.861% match total ion current 2.804Cn 2.444% match peaks 2.314Sp 1.158mass 0.704charge 0.488rank Sp 0.313peak count 0.209sequence similarity 0.115% ion match 0.079total ion current 0.026delta mass 0.024

Page 67: A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Pairwise feature ranking

% match TIC-delta Cn 4.741% match peaks-delta Cn 4.233% match TIC-Cn 3.819delta Cn-Cn 3.597delta Cn-charge 3.563% match peaks-Cn 3.377delta Cn-mass 3.119% match TIC-% match peaks 2.823% ion match-delta Cn 2.812Sp-delta Cn 2.799% match TIC-Sp 2.579% match peaks-Sp 2.383

% match TIC-mass 2.097% ion match-mass 2.091Cn-charge 1.943Sp-mass 1.922% match TIC-charge 1.898Cn-mass 1.884Sp-Cn 1.881% ion match-Cn 1.827Sp-charge 1.770% match peaks-mass 1.668% match peaks-charge 1.528% match TIC-% ion match 1.473