Page 1

Kernel Methods in Bioinformatics

Sudarsan Padhy

IIIT Bhubaneswar

spadhy07@gmail.com

Page 2

Empirical Inference

• Drawing conclusions from observations (includes learning)

• The focus in studying it is not primarily on the conclusions, but on automatic methods

• For high-dimensional noisy data, inference becomes nontrivial

• Empirical inference methods are appropriate whenever:
  – little model knowledge is available
  – many variables are involved
  – mechanistic models are infeasible
  – (relatively) large datasets are available

Page 3

Motivation behind kernel methods

• Linear learning typically has nice properties:
  – Unique optimal solutions
  – Fast learning algorithms
  – Better statistical analysis

• But one big problem: insufficient capacity

Page 4

Problems of high dimensions

• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means the learner is unlikely to generalise well

• Computational costs are involved in dealing with large vectors

Page 5

Historical perspective

• Minsky and Papert (1969) highlighted this weakness in their book Perceptrons

• Multilayer perceptrons with back-propagation learning (1985) overcame the problem by gluing together many linear units with non-linear activation functions:
  – Solved the problem of capacity and led to a very impressive extension of the applicability of learning
  – But ran into training problems of speed and multiple local minima

Page 6

MODEL OF A NEURON

[Figure: model of a neuron j: input signals x1, x2, …, xm enter through synaptic weights wj1, wj2, …, wjm; a bias bj (weight wj0) feeds into the linear combiner output vj, which passes through the activation function φ(·) to give the output yj = φ(vj).]

A neuron is an information processing unit that is fundamental to the operation of a neural network. The three basic elements of the neuronal model are:

Page 7

1. A set of synapses, or connecting links, each characterized by a weight ($w_{ji}$: the weight at neuron $j$ from input $i$).

2. A linear combiner:

$$ u_j = \sum_{i=1}^{m} w_{ji}\, x_i $$

3. An activation function $\varphi$ for limiting the amplitude of the output of the neuron.

It also includes a bias, denoted by $b_j$, for increasing or decreasing the net input of the activation function.

Mathematically, a neuron with inputs $x_i$ is described by:

$$ v_j = \sum_{i=1}^{m} w_{ji}\, x_i + b_j, \qquad y_j = \varphi(v_j) $$

Page 8

Examples of Activation Functions

1. Threshold function: $\varphi(v) = 1$ if $v \ge 0$, and $\varphi(v) = 0$ if $v < 0$.

2. Sigmoid function: $\varphi(v) = \dfrac{1}{1 + e^{-av}}$, where $a$ is the slope parameter.

3. Hyperbolic tangent function: $\varphi(v) = \tanh(v)$. (A short sketch of these functions follows.)
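
To make these concrete, here is a minimal Python sketch of the three activation functions applied to a single neuron's net input (the toy weights, inputs, and the sigmoid slope parameter a are assumptions for illustration):

```python
import numpy as np

def threshold(v):
    """Threshold (Heaviside) function: 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def sigmoid(v, a=1.0):
    """Logistic sigmoid with slope parameter a; output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * v))

def tanh_act(v):
    """Hyperbolic tangent; output in (-1, 1)."""
    return np.tanh(v)

# A single neuron: y = phi(v), v = w . x + b (toy values, assumed).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3
v = w @ x + b
print(threshold(v), sigmoid(v), tanh_act(v))
```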

Page 9

MULTILAYER PERCEPTRON (MLP)

[Figure: the input signal x enters the input layer, flows through the 1st hidden layer and the 2nd hidden layer (each with bias b), and emerges from the output layer as the output signal.]

Page 10

The basic idea of a learning algorithm for the MLP is to minimize the squared error E (the error between the desired output and the actual output), which is a non-linear function of the weights:

$$ E = \tfrac{1}{2} \sum_j (y_j - o_j)^2, \qquad o_j = \varphi\Big(\sum_i w_{ji}\, o_i\Big), $$

where the $o_i$ are the outputs of the previous layer.
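
As a sketch of how this error drives learning, the snippet below performs one gradient-descent update for a single sigmoid output unit; the toy pattern and the learning rate eta are assumptions, and the full back-propagation pass over hidden layers is omitted:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.2, -0.5, 1.0])   # outputs o_i of the previous layer
y = 1.0                          # desired output y_j
w = np.array([0.1, 0.1, 0.1])    # weights w_ji
eta = 0.5                        # learning rate (assumed)

v = w @ x                        # net input v_j = sum_i w_ji o_i
o = sigmoid(v)                   # actual output o_j = phi(v_j)
E = 0.5 * (y - o) ** 2           # squared error for this pattern

# Chain rule: dE/dw_ji = -(y_j - o_j) * o_j * (1 - o_j) * o_i
grad = -(y - o) * o * (1 - o) * x
w -= eta * grad                  # one gradient-descent step
print(E, w)
```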

Page 11

Back-propagation neural network (BPNN) models are used for many applications. However, BPNN suffers from the following weaknesses:

• Need for a large number of controlling parameters.

• Danger of over-fitting (for large problem sizes the network captures not only the useful information contained in the training data but also unwanted characteristics).

• Getting stuck in local minima, since the error function to be minimised is non-linear.

• Slow convergence rate for large problems.

Page 12

Hence, Support Vector Machines (SVMs), developed by Vapnik and his co-workers (1992-95), have been used for supervised learning due to:

• Better generalization performance than other NN models.

• The solution of an SVM is unique, optimal, and free from local minima, since training reduces to a linearly constrained quadratic programming problem.

• Applicability to non-vectorial data (strings and graphs).

• Few parameters are required for tuning the learning machine.

Page 13

• Kernel methods are a family of algorithms from statistical learning which include the SVM for classification and regression, kernel PCA, kernel-based clustering, feature selection, dimensionality reduction, etc.

• They have been popular methods in bioinformatics in the last decade: PubMed (a search engine for biomedical literature) lists 3070 hits for 'SVM' and 2626 hits for 'kernel methods'.

Page 14

SUPPORT VECTOR MACHINE for classification

Which solution will generalize better to unseen examples?

The second solution is better because there is a larger margin between the separating hyperplane and the closest data points.

Objective: find the hyperplane that maximizes the margin of separation.

Support vectors are the data points which lie closest to the decision surface (the optimal hyperplane).

[Fig-1: classes C1 and C2 separated by a hyperplane with a small margin. Fig-2: the same classes C1 and C2 separated by the optimal hyperplane, with the support vectors marked.]

Page 15

Consider a binary classification problem: the input vectors are $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$ are the targets or labels. The index $i = 1, \dots, N$ labels the pattern pairs $(x_i, y_i)$. The $x_i$ define a space of labelled points called the input space.

Page 16

[Figure: a separating hyperplane with its margin between two classes of labelled points.]

Page 17

In an arbitrary m-dimensional space a separating hyperplane can be written as:

$$ w \cdot x + b = 0, $$

where $b$ is the bias and $w = (w_1, \dots, w_m)$ the weights. Thus we will consider a decision function of the form:

$$ D(x) = \operatorname{sign}(w \cdot x + b). $$

Page 18

We note that the argument in $D(x)$ is invariant under a rescaling: $w \to \lambda w$, $b \to \lambda b$. We will implicitly fix a scale with:

$$ w \cdot x_i + b = +1, \qquad w \cdot x_j + b = -1 $$

Page 19

for the support vectors (canonical hyperplanes).

Page 20

Thus:

$$ w \cdot (x_i - x_j) = 2 $$

for two support vectors on each side of the separating hyperplane.

Page 21

The margin will be given by the projection of the vector $(x_i - x_j)$ onto the unit normal vector to the hyperplane, i.e. $w / \lVert w \rVert$, from which we deduce that the margin is given by $\gamma = 2 / \lVert w \rVert$.

Page 22

[Figure: separating hyperplane with margin $2/\lVert w \rVert$; points 1 and 2 are support vectors lying on the canonical hyperplanes $w \cdot x + b = \pm 1$.]

Page 23

Maximisation of the margin is thus equivalent to minimisation of the functional:

$$ \Phi(w) = \tfrac{1}{2}\,(w \cdot w) $$

subject to the constraints:

$$ y_i (w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N. $$

Page 24

SVM FOR CLASSIFICATION

Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the optimal values of the weight vector $w$ and bias $b$ such that $w$ minimizes the function $\Phi(w) = \tfrac{1}{2} w^T w$ subject to the constraints:

$$ d_i (w^T x_i + b) \ge 1 \quad \text{for } i = 1, 2, \dots, N. $$

This problem leads to the dual problem:

Determine $\alpha = (\alpha_1, \dots, \alpha_N)$ that minimizes the objective function

$$ Q(\alpha) = \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j\, x_i^T x_j \;-\; \sum_{i=1}^{N} \alpha_i $$

subject to the constraints:

(1) $\displaystyle\sum_{i=1}^{N} \alpha_i d_i = 0$  (2) $\alpha_i \ge 0$ for $i = 1, \dots, N$.

Page 25

Then the optimum weight vector is

$$ w_0 = \sum_{i=1}^{N} \alpha_{0,i}\, d_i\, x_i $$

and the optimum bias is

$$ b_0 = 1 - w_0^T x^{(s)} \quad \text{for } d^{(s)} = +1, $$

where $x^{(s)}$ is a support vector, i.e. $s$ is an index such that $\alpha_{0,s} > 0$. The decision function is defined by:

$$ D(x) = w_0^T x + b_0. $$

If $D(x) > 0$ then $x$ belongs to the class labeled +1; otherwise it belongs to the class labeled -1.
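
A minimal numerical sketch of this procedure on a toy linearly separable dataset, minimizing the dual objective with SciPy's SLSQP optimizer (the data and the solver choice are assumptions; a dedicated QP solver would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, two points per class (assumed).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)
G = (d[:, None] * d[None, :]) * (X @ X.T)   # G_ij = d_i d_j x_i . x_j

def Q(a):  # dual objective to minimize
    return 0.5 * a @ G @ a - a.sum()

res = minimize(Q, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                     # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ d})
alpha = res.x

w = (alpha * d) @ X                        # w_0 = sum_i alpha_i d_i x_i
s = int(np.argmax(alpha * (d > 0)))        # a support vector with d_s = +1
b = 1.0 - w @ X[s]                         # b_0 = 1 - w_0^T x^(s)
print("alpha:", alpha.round(3))
print("D(x) signs:", np.sign(X @ w + b))   # recovers the labels d
```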

Page 26

Soft Margin Separation of Classes

• Some data points $(x_i, d_i)$ violate $d_i (w^T x_i + b) \ge 1$ (they fall inside the region of separation but on the right side of the decision surface, or fall on the wrong side).

Introduce slack variables $\xi_i \ge 0$ (measuring the deviation from ideal separability):

$$ d_i (w^T x_i + b) \ge 1 - \xi_i, \qquad i = 1, 2, \dots, N. $$

If $\xi_i > 1$ the data point falls on the wrong side (misclassification). Goal: find the separating hyperplane that minimises the misclassification error.

Page 27

Soft Margin …

• Minimise

$$ \tfrac{1}{2}\, w^T w + C \sum_i \xi_i $$

(the second term is an upper bound on the number of training errors; C controls the trade-off between the complexity of the machine and the number of non-separable points)

subject to the above constraints. This leads to the same dual problem except that $0 \le \alpha_i \le C$, $i = 1, \dots, N$, and hence the same algorithm with obvious modifications.
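
A minimal scikit-learn sketch of the effect of C; the toy data (with one point on the wrong side) and the two C values are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 2], [3, 3], [-1, -1], [-2, -2], [0.5, 0.5]])
y = np.array([1, 1, -1, -1, -1])   # last point needs a slack variable

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates violations (wider margin); large C penalises them.
    print(C, "support vectors per class:", clf.n_support_,
          "training accuracy:", clf.score(X, y))
```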

Page 28

Support Vector Machines (SVM)

Page 29

Using kernels

• The SVM algorithm for linearly separable data is modified by replacing $x$ with $\varphi(x)$ in the feature space. The critical observation is that only inner products of the $\varphi(x)$'s are used.

• Suppose that we now have a shortcut method of computing:

$$ K(x, z) = \langle \varphi(x), \varphi(z) \rangle $$

• Then we do not need to explicitly compute the feature vectors, either in training or in testing.

Page 30

The kernel

$$ K(x, z) = (x \cdot z + c)^d $$

corresponds to a feature map whose coordinates are all monomials of degree up to $d$. Then

$$ \dim(\text{feature space}) = \binom{n+d}{d} = O(n^d), $$

while computing $K(x, z)$ takes only $O(n)$ time.
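
The shortcut can be checked numerically. For n = 2 and d = 2 the explicit feature map is $\phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2,\, \sqrt{2c}\,x_1,\, \sqrt{2c}\,x_2,\, c)$, and its inner product reproduces the kernel (a minimal sketch; the test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel (x . z + c)^d, computed in O(n) time."""
    return (x @ z + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for n = 2, d = 2: monomials up to degree 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0, via the O(n) shortcut
print(phi(x) @ phi(z))     # 4.0, via the explicit feature space
```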

Page 31

Dual form of SVM

• The dual form of the SVM can also be derived by taking the dual of the optimisation problem. This gives: maximise

$$ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle $$

subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.

Page 32

KERNEL

A kernel is a function $K : X \times X \to \mathbb{R}$ such that for all $x, x' \in X$ (input space) it satisfies:

(i) $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where $\Phi$ is a mapping from the input space $X$ to an inner product space (feature space) $F$;

Page 33

(ii) $K$ is symmetric and positive definite, that is, $K(x, x') = K(x', x)$ for any two objects $x, x' \in X$, and

$$ \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j\, K(x_i, x_j) \ge 0 $$

for any choice of $n > 0$ objects $x_1, \dots, x_n \in X$ and any choice of real numbers $c_1, \dots, c_n \in \mathbb{R}$.

Thus, using any kernel, the SVM method in the feature space can be stated in the form of the following algorithm.
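
On a finite sample, condition (ii) amounts to the Gram matrix $[K(x_i, x_j)]$ being symmetric with no negative eigenvalues, which is easy to check numerically (a minimal sketch; the RBF kernel and random data are assumptions):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # 20 random objects in R^5
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Symmetric with eigenvalues >= 0  =>  sum_ij c_i c_j K_ij >= 0 for all c.
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to rounding
```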

Page 34

Algorithm for General SVM

Step 1: Find $\alpha = (\alpha_1, \dots, \alpha_n)$ that minimizes

$$ \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \;-\; \sum_{i=1}^{n} \alpha_i $$

under the constraints: (i) $\displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0$ and (ii) $0 \le \alpha_i \le C$ for $i = 1, \dots, n$.

Step 2: Find an index $i$ with $0 < \alpha_i < C$ and set b as:

$$ b = y_i - \sum_{j=1}^{n} \alpha_j y_j\, K(x_j, x_i). $$

Step 3: The classification of a new object $x \in X$ is then based on the sign of the function

$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b. $$
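
Any kernel satisfying (i)-(ii) can be plugged into this algorithm. In scikit-learn this corresponds to passing a precomputed Gram matrix; a minimal sketch with an assumed toy dataset and the polynomial kernel from above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0] + X[:, 1])     # toy labels (assumed)

K = (X @ X.T + 1.0) ** 2           # Gram matrix of K(x, z) = (x.z + 1)^2
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_new = rng.normal(size=(5, 3))
K_new = (X_new @ X.T + 1.0) ** 2   # rows: K(x_new, x_i) against training x_i
print(clf.predict(K_new))          # sign of f(x) for the new objects
```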

Page 35

Positive Definite Kernels (RKHS)

Page 36

For any $g(x)$ for which:

$$ \int g(x)^2\, dx < \infty, $$

it must be the case that:

$$ \iint g(x)\, K(x, x')\, g(x')\, dx\, dx' \ge 0. $$

A simple criterion is that the kernel should be positive semi-definite.

Page 37

The bias is also found as follows: averaging over the set $S$ of support vectors with $0 < \alpha_i < C$,

$$ b = \frac{1}{|S|} \sum_{i \in S} \Big( y_i - \sum_{j=1}^{N} \alpha_j y_j\, K(x_j, x_i) \Big). $$

Page 38

Alternative approach (ν-SVM): solutions for an L1-error norm are the same as those obtained from maximising:

$$ W(\alpha) = -\tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) $$

Page 39

subject to:

$$ \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \sum_{i=1}^{N} \alpha_i \ge \nu, \qquad 0 \le \alpha_i \le \frac{1}{N}, $$

where $\nu$ lies on the range 0 to 1.

Page 40

In this formulation the conceptual meaning of the soft-margin parameter is more transparent: the fraction of training errors is upper bounded by $\nu$, and $\nu$ also provides a lower bound on the fraction of points which are support vectors.
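
scikit-learn's NuSVC implements a ν-parameterised SVM of this kind; the sketch below illustrates both bounds on toy data (the data and ν = 0.2 are assumptions):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = NuSVC(nu=0.2, kernel="rbf").fit(X, y)
frac_misclassified = 1.0 - clf.score(X, y)  # <= fraction of margin errors <= nu
frac_sv = clf.n_support_.sum() / len(y)     # >= nu
print(frac_misclassified, "<= 0.2 <=", frac_sv)
```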

Page 41

RBF Kernel and String Kernels

• RBF kernel: $K(x, z) = \exp\big(-\lVert x - z \rVert^2 / (2\sigma^2)\big)$

• Spectrum kernel (Leslie et al., 2002):

$$ K(s, t) = \sum_{q \in A^n} \#(q \sqsubset s)\, \#(q \sqsubset t), $$

where $\#(q \sqsubset s)$ is the number of occurrences of $q$ as a substring of $s$, and $A^n$ is the set of strings of length $n$ over the alphabet $A$.

Feature map:

$$ \Phi(s) = (\phi_u(s))_{u \in A^n}, \qquad \phi_u(s) = \text{number of occurrences of } u \text{ in } s. $$
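
A minimal Python sketch of the spectrum kernel via k-mer counting (the DNA strings and n = 3 are assumptions for illustration):

```python
from collections import Counter

def kmer_counts(s, n):
    """phi(s): occurrence count of each length-n substring of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n=3):
    """K(s, t) = sum over q in A^n of #(q in s) * #(q in t)."""
    cs, ct = kmer_counts(s, n), kmer_counts(t, n)
    return sum(cs[q] * ct[q] for q in cs.keys() & ct.keys())

# Shared 3-mers ACG and CGT each occur twice in s and once in t -> K = 4.
print(spectrum_kernel("ACGTACGT", "ACGTTT", n=3))
```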

Page 42

Closure properties of kernels

• If $k_1$ and $k_2$ are kernels, then $k_1 + k_2$, $k_1 \cdot k_2$, and $a k_1$ (for $a > 0$) are kernels (a minimal sketch follows below).

• Example: to compare two proteins one can define a kernel on their sequences and one on their 3D structures, and combine them into a sequence-structure kernel for proteins (Lewis et al., 2006).

• For protein function prediction, kernels on genome-wide data sets, gene expression data, and protein-protein interaction data can be combined.
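
A minimal sketch of such a combination: two Gram matrices (random positive semi-definite stand-ins for, say, a sequence kernel and a structure kernel; not real protein data) are summed into a single kernel and used with a precomputed-kernel SVM:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 30
y = np.sign(rng.normal(size=n))   # toy labels (assumed)

def random_gram(n):
    """Random positive semi-definite Gram matrix (kernel stand-in)."""
    A = rng.normal(size=(n, n))
    return A @ A.T

K_seq = random_gram(n)       # stand-in for a sequence kernel
K_struct = random_gram(n)    # stand-in for a 3D-structure kernel
K = K_seq + K_struct         # a sum of kernels is again a kernel

clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```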

Page 43

SVM in Bioinformatics

• Secondary structure prediction from protein sequence using the RBF kernel (Hua & Sun, 2001)

• Detection of remote protein homology using the Fisher kernel (Jaakkola et al., 1999)

• Protein structure prediction (Qiu et al., 2007)

• SVM-based gene finding in nematode genomes (Schweikert et al., 2009)

• Protein interaction prediction (Ben-Hur et al., 2005)

• Feature selection: gene selection from microarray data with multiple kernels (Borgwardt et al., 2005)