Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics

Karsten Borgwardt

March 1 to March 12, 2010

Machine Learning & Computational Biology Research Group
MPIs Tübingen

Why compare sequences?

Protein sequences
- Proteins are chains of amino acids.
- 20 different types of amino acids can be found in protein sequences.
- Protein sequences change over time through mutations, deletions, and insertions.
- Different protein sequences may diverge from one common ancestor.
- Their sequences may differ slightly, yet their function is often conserved.

Biological question
- Biologists are interested in the reverse direction:
- Given two protein sequences, is it likely that they originate from the same common ancestor?

Computational challenge
- How to measure similarity between two protein sequences, or equivalently:
- How to measure similarity between two strings

Kernel challenge
- How to measure similarity between two strings via a kernel function

In short: how to define a string kernel

History of sequence comparison

First phase
- Smith-Waterman
- BLAST

Second phase
- Profiles
- Hidden Markov Models

Third phase
- PSI-BLAST
- SAM-T98

Fourth phase
- Kernels

Sequence comparison: Phase 1

Idea
- Measure pairwise similarities between sequences with gaps

Methods
- Smith-Waterman
  - dynamic programming
  - high accuracy
  - slow ($O(n^2)$)
- BLAST
  - faster heuristic alternative with sufficient accuracy
  - searches for common substrings of fixed length
  - extends these in both directions
  - performs gapped alignment
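
The dynamic-programming idea behind Smith-Waterman can be sketched as follows. This is a minimal, score-only sketch: the function name and the scoring parameters (`match`, `mismatch`, and a linear `gap` penalty) are illustrative assumptions, whereas real protein comparisons would normally use a substitution matrix and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The two nested loops over the positions of both sequences are where the quadratic running time mentioned above comes from.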

Sequence comparison: Phase 2

Idea
- Collect aggregate statistics from a family of sequences
- Compare these statistics to a single unlabeled protein

Methods
- Hidden Markov Models (HMMs)
  - Markov process with hidden states and observable outputs
  - The forward algorithm determines the probability that a given sequence is the output of a particular HMM
- Profiles
  - Profiles of sequence families are derived by multiple sequence alignment
  - A given sequence is compared to this profile
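
To make the role of the forward algorithm concrete, here is a minimal sketch that computes the probability of an observed sequence under a small HMM. The two-state model, its state names, and all probabilities are hypothetical toy values chosen for illustration; profile HMMs used for protein families have a richer match/insert/delete structure.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | HMM) via the forward algorithm; obs must be non-empty."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state toy model over the alphabet {"A", "C"}.
STATES = ("s1", "s2")
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"A": 0.9, "C": 0.1}, "s2": {"A": 0.2, "C": 0.8}}

print(forward_probability("AAC", STATES, START, TRANS, EMIT))
```

Summing over states at every step avoids enumerating all hidden state paths, so the cost grows linearly with the sequence length.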

Sequence comparison: Phase 3

Idea
- Create single models from database collections of homologous sequences

Methods
- PSI-BLAST
  - Position-specific iterative BLAST
  - Profile from highest-scoring hits in initial BLAST runs
  - Position weighting according to degree of conservation
  - Iteration of these steps
- SAM-T98, now SAM-T02
  - database search with HMM from multiple sequence alignment

Phase 4: Kernels and SVMs

General idea
- Model differences between classes of sequences
- Use an SVM classifier to distinguish classes
- Use a kernel to measure similarity between strings

Kernels for protein sequences
- SVM-Fisher kernel
- Composition kernel
- Motif kernel
- String kernel

SVM-Fisher method

General idea
- Combine HMMs and SVMs for sequence classification
- Won best-paper award at ISMB 1999

Sequence representation
- fixed-length vector
- components are transition and emission probabilities
- transformation into Fisher score

Algorithm
- Model protein family F as an HMM
- Transform query protein X into a fixed-length vector via the HMM
- Compute the kernel between X and positive and negative examples of the protein family

Advantages
- allows incorporating prior knowledge
- can deal with missing data
- is interpretable
- outperforms competing methods
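
As a hedged reminder of what the transformation into a Fisher score usually means: the Fisher score of a sequence is the gradient of the HMM log-likelihood with respect to the model parameters, and a Fisher kernel compares two sequences through these gradients. The symbols $\theta$ (HMM parameters), $U_X$ (Fisher score) and $I$ (Fisher information matrix) are notation assumed here, not taken from the slides.

```latex
\[
U_X = \nabla_\theta \log P(X \mid \theta),
\qquad
k(X, X') = U_X^{\top} I^{-1} U_{X'}
\]
```

In practice $I$ is often approximated by the identity matrix, or a Gaussian kernel is applied to the score vectors instead; which variant a particular implementation uses should be checked against its source.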

Composition kernels

General idea
- Model sequence by amino acid content
- Bin amino acids w.r.t. physico-chemical properties

Sequence representation
- feature vector of amino acid frequencies
- physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
- useful database: AAindex
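
A minimal sketch of the frequency-based representation described above, assuming a plain linear kernel on the 20-dimensional frequency vectors. The function names are illustrative; binning by physico-chemical properties would replace the per-residue coordinates with per-bin coordinates and is not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_vector(seq):
    """Relative frequency of each amino acid in seq."""
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero for an empty sequence
    return [counts[a] / n for a in AMINO_ACIDS]


def composition_kernel(x, y):
    """Inner product of the two frequency vectors (linear kernel)."""
    return sum(p * q for p, q in zip(composition_vector(x), composition_vector(y)))


print(composition_kernel("ACAC", "AACC"))
```

Because only residue content is used, two sequences with the same composition but a different residue order receive the same feature vector.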

Motif kernels

General idea
- Conserved motifs in amino acid sequences indicate structural and functional relationships
- Model sequence s as a feature vector f representing motifs
- The i-th component of f is 1 if and only if s contains the i-th motif

Motif databases
- PROSITE
- eMOTIFs
- BLOCKS+ (combines several databases)

Generated by
- manual construction
- multiple sequence alignment
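
The binary motif representation can be sketched as below. The three patterns in `TOY_MOTIFS` are hypothetical examples written as Python regular expressions, not entries copied from any of the databases above; real motif databases use their own pattern syntax, which would have to be converted first.

```python
import re

# Hypothetical toy motifs as regular expressions (illustration only).
TOY_MOTIFS = [r"C..C", r"GK[ST]", r"N[^P][ST]"]


def motif_vector(seq, motifs=TOY_MOTIFS):
    """Binary vector f: f[i] is 1 if seq contains the i-th motif, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(x, y, motifs=TOY_MOTIFS):
    """Number of motifs that occur in both sequences."""
    return sum(a * b for a, b in zip(motif_vector(x, motifs), motif_vector(y, motifs)))


print(motif_vector("MGKSNAT"))
```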

Pairwise comparison kernels

General idea
- Employ the empirical kernel map on Smith-Waterman/BLAST scores

Advantage
- Utilizes decades of practical experience with BLAST

Disadvantage
- High computational cost ($O(m^3)$)

Alleviation
- Employ BLAST instead of Smith-Waterman
- Use a vectorization set for the empirical map only
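
The empirical kernel map can be sketched as follows: each sequence is represented by its vector of pairwise scores against a fixed vectorization set, and the kernel is the inner product of two such vectors. To keep the sketch self-contained, `lcs_length` (longest common substring length) is used as a stand-in score; this is an assumption for illustration, and a Smith-Waterman or BLAST score would be plugged in through the `score` argument in a real setting.

```python
def lcs_length(a, b):
    """Length of the longest common substring; a stand-in for an alignment score."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr[j] = prev[j - 1] + 1  # extend the common run ending here
                best = max(best, curr[j])
        prev = curr
    return best


def empirical_map(seq, vectorization_set, score=lcs_length):
    """Vector of pairwise scores of seq against every reference sequence."""
    return [score(seq, ref) for ref in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score=lcs_length):
    """Inner product of the two empirical feature vectors."""
    fx = empirical_map(x, vectorization_set, score)
    fy = empirical_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))


refs = ["ACGT", "TTTT"]
print(pairwise_kernel("ACGA", "GTTT", refs))
```

Restricting `vectorization_set` to a small reference collection keeps the number of pairwise score computations per sequence fixed, which is the alleviation named above.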

Phase 4: String Kernels

General idea
- Count common substrings in two strings
- A substring of length k is a k-mer

Variations
- Assign weights to k-mers
- Allow for mismatches
- Allow for gaps
- Include substitutions
- Include wildcards

Spectrum Kernel

General idea
- For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$.
- The l-spectrum feature map is then

\[ \Phi^{\mathrm{Spectrum}}_{l}(x) = (\phi_\alpha(x))_{\alpha \in \Sigma^l} \]

- Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
- The spectrum kernel is the inner product in the feature space defined by this map:

\[ k^{\mathrm{Spectrum}}_{l}(x, x') = \langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \rangle \]

- The more common substrings two sequences contain, the more similar they are deemed.

Principle
- Spectrum kernel: count exactly matching common k-mers
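
The counting principle can be sketched directly from the feature map: count every l-mer in each sequence, then take the inner product of the two count vectors. This is a naive sketch with illustrative function names, storing only the l-mers that actually occur; the efficient suffix-tree based computation from the literature is not shown.

```python
from collections import Counter


def spectrum_features(x, l):
    """Sparse l-spectrum of x: maps each l-mer to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectra of x and y."""
    fx, fy = spectrum_features(x, l), spectrum_features(y, l)
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(spectrum_kernel("ABAB", "BABA", 2))
```

For "ABAB" and "BABA" with l = 2, the spectra are {AB: 2, BA: 1} and {BA: 2, AB: 1}, so the kernel value is 2·1 + 1·2 = 4.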

Mismatch Kernel

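General idea
- Do not enforce strictly exact matches
- Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:

\[ \phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l} \]

  where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.
- For a sequence $x$ of any length, the map is then extended as

\[ \Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) \]

- The mismatch kernel is the inner product in the feature space defined by this map:

\[ k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \rangle \]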

Principle
- Mismatch kernel: count common k-mers with at most m mismatches
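
A naive sketch of this principle: every l-mer in a sequence adds 1 to the coordinate of each string in its mismatch neighborhood, and the kernel is the inner product of the resulting sparse vectors. The function names and the explicit `alphabet` argument are illustrative assumptions, and the neighborhood is enumerated by brute force, which is only practical for small l, m and alphabets.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(alpha, m, alphabet):
    """All strings over alphabet within Hamming distance m of alpha."""
    neighbors = {alpha}
    for d in range(1, m + 1):
        for positions in combinations(range(len(alpha)), d):
            for letters in product(alphabet, repeat=d):
                beta = list(alpha)
                for pos, letter in zip(positions, letters):
                    beta[pos] = letter
                neighbors.add("".join(beta))
    return neighbors


def mismatch_features(x, l, m, alphabet):
    """Sum of the neighborhood indicator vectors of all l-mers in x."""
    features = Counter()
    for i in range(len(x) - l + 1):
        features.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return features


def mismatch_kernel(x, y, l, m, alphabet):
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())


print(mismatch_kernel("AA", "BB", 2, 1, "AB"))
```

With m = 0 the neighborhood contains only the l-mer itself, so the sketch reduces to plain exact k-mer counting.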

Gappy Kernel

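General idea
- Allow for gaps in common substrings → "subsequences"
- A g-mer $\alpha$ then contributes to all its l-mer subsequences:

\[ \phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l} \]

  where $\phi_\beta(\alpha) = 1$ if $\beta$ occurs in $\alpha$ as a (not necessarily contiguous) subsequence, and $\phi_\beta(\alpha) = 0$ otherwise.
- For a sequence $x$ of any length, the map is then extended as

\[ \Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \phi^{\mathrm{Gap}}_{(g,l)}(\alpha) \]

- The gappy kernel is the inner product in the feature space defined by this map:

\[ k^{\mathrm{Gap}}_{(g,l)}(x, x') = \langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \rangle \]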

Principle
- Gappy kernel: count common l-subsequences of g-mers
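
A naive sketch of the gappy feature map: slide a window of length g over the sequence and let each window contribute once to every distinct length-l subsequence it contains. The function names are illustrative assumptions, and the binary per-window contribution is one reading of the definition; the brute-force enumeration is only meant for small g.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x adds 1 to every distinct l-subsequence it contains."""
    features = Counter()
    for i in range(len(x) - g + 1):
        window = x[i:i + g]
        subsequences = {"".join(window[p] for p in pos)
                        for pos in combinations(range(g), l)}
        features.update(subsequences)
    return features


def gappy_kernel(x, y, g, l):
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())


print(gappy_kernel("ABC", "AXC", 3, 2))
```

"ABC" and "AXC" share no contiguous 2-mer, but both contain "AC" as a gapped subsequence of their 3-mer, so the kernel value is 1.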

Substitution Kernel

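General idea
- Replace the mismatch neighborhood by a substitution neighborhood
- An l-mer then contributes to all l-mers in its substitution neighborhood. For $\alpha = a_1 a_2 \ldots a_l$:

\[ M_{(l,\sigma)}(\alpha) = \{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \} \]

- For a sequence $x$ of any length, the map is then extended as

\[ \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha) \]

  where $\phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$ has a 1 in coordinate $\beta$ if $\beta \in M_{(l,\sigma)}(\alpha)$ and a 0 otherwise.
- The substitution kernel is now:

\[ k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \rangle \]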

Principle
- Substitution kernel: count common l-mers in the substitution neighborhood
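
A brute-force sketch of the substitution neighborhood and the resulting kernel. The two-letter alphabet and the table `TOY_PROB`, where `TOY_PROB[(a, b)]` stands for $P(a \mid b)$, are hypothetical toy values; in practice the probabilities would be derived from an amino acid substitution matrix. Enumerating all of $\Sigma^l$ is only feasible for tiny examples.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical substitution probabilities: TOY_PROB[(a, b)] = P(a | b).
TOY_PROB = {("A", "A"): 0.9, ("B", "A"): 0.1,
            ("A", "B"): 0.2, ("B", "B"): 0.8}


def substitution_neighborhood(alpha, sigma, prob, alphabet):
    """All beta with -sum_i log P(a_i | b_i) < sigma."""
    neighbors = set()
    for beta in product(alphabet, repeat=len(alpha)):
        cost = -sum(math.log(prob[(a, b)]) for a, b in zip(alpha, beta))
        if cost < sigma:
            neighbors.add("".join(beta))
    return neighbors


def substitution_features(x, l, sigma, prob, alphabet):
    """Each l-mer of x adds 1 to every string in its substitution neighborhood."""
    features = Counter()
    for i in range(len(x) - l + 1):
        features.update(substitution_neighborhood(x[i:i + l], sigma, prob, alphabet))
    return features


def substitution_kernel(x, y, l, sigma, prob, alphabet):
    fx = substitution_features(x, l, sigma, prob, alphabet)
    fy = substitution_features(y, l, sigma, prob, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())


print(substitution_kernel("AA", "AB", 2, 2.0, TOY_PROB, "AB"))
```

Raising the threshold `sigma` enlarges every neighborhood, so more l-mer pairs count as shared.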

Wildcard Kernels

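General idea
- Augment the alphabet $\Sigma$ by a wildcard character $*$, giving $\Sigma \cup \{*\}$
- Given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
- The l-mer $\alpha$ contributes to the l-mer $\beta$ if their non-wildcard characters match
- For a sequence $x$ of any length, the map is then given by

\[ \Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in W} \]

  where $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.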

Principle
- Wildcard kernel: count l-mers that match except for wildcards
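
A naive sketch of the wildcard feature map: each l-mer generates all patterns obtained by replacing up to m of its positions with `*`, and a pattern with j wildcards receives weight λ^j. The function names are illustrative, the inner-product kernel is assumed by analogy with the other string kernels, and the input is assumed not to contain the character `*` itself.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Weighted counts of all wildcard patterns matched by the l-mers of x."""
    features = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(m + 1):  # j = number of wildcards in the pattern
            for positions in combinations(range(l), j):
                beta = "".join("*" if p in positions else c
                               for p, c in enumerate(alpha))
                features[beta] += lam ** j
    return features


def wildcard_kernel(x, y, l, m, lam):
    fx, fy = wildcard_features(x, l, m, lam), wildcard_features(y, l, m, lam)
    return sum(v * fy.get(beta, 0.0) for beta, v in fx.items())


print(wildcard_kernel("AB", "AC", 2, 1, 0.5))
```

"AB" and "AC" share only the pattern "A*", which carries weight 0.5 in each sequence, so the kernel value is 0.25.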

References and further reading

References

[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564–575, 2002.

[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS 2002. MIT Press.

[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.

[4] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, Chapters 3 and 4. MIT Press, Cambridge, MA, 2004.