
Kernel Methods in Bioinformatics

Sudarsan Padhy

IIIT Bhubaneswar

spadhy07@gmail.com


Empirical Inference

• Drawing conclusions from observations (includes learning).

• The focus in studying it is not primarily on the conclusions, but on automatic methods.

• For high-dimensional noisy data, inference becomes nontrivial.

• Empirical inference methods are appropriate whenever:
– little model knowledge is available
– many variables are involved
– mechanistic models are infeasible
– (relatively) large datasets are available


Motivation behind kernel methods

• Linear learning typically has nice properties:
– Unique optimal solutions
– Fast learning algorithms
– Better statistical analysis

• But one big problem: insufficient capacity.


Problems of high dimensions

• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means the learner is unlikely to generalise well.

• Computational costs involved in dealing with large vectors


Historical perspective

• Minsky and Papert (1969) highlighted the weakness in their book Perceptrons.

• Multilayer perceptrons with back-propagation learning (1985) overcame the problem by gluing together many linear units with non-linear activation functions:
– Solved the problem of capacity and led to a very impressive extension of the applicability of learning.
– But ran into training problems of speed and multiple local minima.

MODEL OF A NEURON

[Figure: input signals x1, x2, …, xm are multiplied by synaptic weights wj1, wj2, …, wjm and summed, together with the bias bj (weight wj0), to give the net input vj; the activation function φ(·) then produces the output yj = φ(vj).]

A neuron is an information processing unit that is fundamental to the operation of a neural network. The three basic elements of the neuronal model are:

1. A set of synapses, each characterized by a weight (wji: the weight at neuron j from input i) connecting input i to neuron j.

2. A linear combiner: vj = ∑i wji xi.

3. An activation function φ for limiting the amplitude of the output of the neuron.

It also includes a bias, denoted by bj, for increasing or decreasing the net input of the activation function.

Mathematically, a neuron with inputs xi is described by: yj = φ(∑i wji xi + bj).

Examples of Activation Functions

1. Threshold function: φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0.

2. Sigmoid function: φ(v) = 1 / (1 + exp(−a v)), where a is a slope parameter.

3. Hyperbolic tangent function: φ(v) = tanh(v).
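As a minimal sketch (not part of the original slides), the neuron model and the three activation functions above can be written directly in Python with NumPy; the function names and numeric values below are illustrative assumptions only:

import numpy as np

def neuron_output(x, w, b, activation="sigmoid"):
    """y_j = φ(v_j) with net input v_j = Σ_i w_ji x_i + b_j."""
    v = np.dot(w, x) + b                      # linear combiner plus bias
    if activation == "threshold":
        return 1.0 if v >= 0 else 0.0         # threshold function
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-v))       # sigmoid with slope a = 1
    if activation == "tanh":
        return np.tanh(v)                     # hyperbolic tangent
    raise ValueError("unknown activation")

x = np.array([0.5, -1.2, 2.0])                # inputs x_1, ..., x_m
w = np.array([0.4, 0.3, -0.1])                # weights w_j1, ..., w_jm
b = 0.2                                       # bias b_j
print(neuron_output(x, w, b, "sigmoid"))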

MULTILAYER PERCEPTRON (MLP)

[Figure: an MLP in which the input signal passes through an input layer, a 1st hidden layer, a 2nd hidden layer, and an output layer to produce the output signal.]

The basic idea of a learning algorithm for the MLP is to minimize the squared error E (the error between the desired output and the actual output), which is a non-linear function of the weights:

E = 0.5 ∑j (yj − oj)², where oj = φ(∑i wji oi).
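A minimal sketch (not from the slides) of minimizing this squared error by gradient descent for a single sigmoid unit; the data, learning rate and iteration count are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # toy training inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0     # toy targets in {0, 1}

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    o = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # actual outputs o_j
    grad = (o - y) * o * (1 - o)                   # dE/dv via the chain rule
    w -= lr * X.T @ grad / len(X)                  # gradient step on the weights
    b -= lr * grad.mean()                          # gradient step on the bias

print(0.5 * np.sum((y - o) ** 2))                  # squared error E after training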

Back-propagation neural network (BPNN) models are being used for many applications. However, the BPNN suffers from the following weaknesses:

• Need for a large number of controlling parameters.

• Danger of over-fitting (for large problem sizes it captures not only the useful information contained in the training data but also unwanted characteristics).

• Getting stuck in local minima, since the error function to be minimised is non-linear.

• Slow convergence rate for large problems.

Hence, Support Vector Machines (SVMs), developed by Vapnik and his co-workers (1992-95), have been used for supervised learning due to:

• Better generalization performance than other NN models.

• The solution of an SVM is unique, optimal and free from local minima, since it is obtained from a linearly constrained quadratic programming problem.

• Applicability to non-vectorial data (strings and graphs).

• Few parameters are required for tuning the learning machine.

• Kernel methods are a set of algorithms from statistical learning which include the SVM for classification and regression, kernel PCA, kernel-based clustering, feature selection, and dimensionality reduction.

• They have been popular methods in bioinformatics over the last decade: PubMed (the search engine for biomedical literature) lists 3070 hits for 'SVM' and 2626 hits for 'kernel methods'.

SUPPORT VECTOR MACHINE for classification

Which solution will generalize better to unseen examples?

The second solution is better because there is a larger margin between the separating hyperplane and the closest data points.

Objective: find the hyperplane that maximizes the margin of separation.

Support vectors are the data points which lie closest to the decision surface (the optimal hyperplane).

[Fig-1: classes C1 and C2 separated by a hyperplane with a small margin. Fig-2: the same classes C1 and C2 separated by the optimal hyperplane, with the support vectors marked.]

Consider a binary classification problem: the input vectors are xi ∈ Rᵐ and yi ∈ {−1, +1} are the targets or labels. The index i = 1, …, N labels the pattern pairs (xi, yi). The xi define a space of labelled points called the input space.

[Figure: a separating hyperplane and its margin in the input space.]

In an arbitrary m-dimensional space a separating hyperplane can be written: w · x + b = 0, where b is the bias and w the weight vector. Thus we will consider a decision function of the form: D(x) = sign(w · x + b).

We note that the argument in D(x) is invariant under a rescaling w → λw, b → λb. We will implicitly fix a scale with: w · x + b = +1 and w · x + b = −1 for the support vectors (canonical hyperplanes). Thus: w · (x1 − x2) = 2 for two support vectors x1 and x2 on each side of the separating hyperplane.

The margin is given by the projection of the vector (x1 − x2) onto the normal vector to the hyperplane, i.e. w/‖w‖, from which we deduce that the margin is given by 2/‖w‖.
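A tiny numeric check (not from the slides) of the margin formula 2/‖w‖, using a made-up weight vector:

import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector of a canonical hyperplane
print(2.0 / np.linalg.norm(w))    # margin 2/‖w‖ = 0.4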

[Figure: the separating hyperplane and its margin, with the two support vectors x1 and x2 on either side.]

Maximisation of the margin is thus equivalent to minimisation of the functional: Φ(w) = ½ ‖w‖² = ½ wᵀw, subject to the constraints: yi (w · xi + b) ≥ 1 for i = 1, …, N.

SVM FOR CLASSIFICATION

Given the training sample {(xi, di)}, i = 1, …, N, find the optimal values of the weight vector w and bias b such that w minimizes the function Φ(w) = ½ wᵀw subject to the constraints:

di (wᵀxi + b) ≥ 1 for i = 1, 2, …, N.

This problem leads to the dual problem:

Determine the Lagrange multipliers α = (α1, …, αN) that minimize the objective function

Q(α) = ½ ∑i ∑j αi αj di dj xiᵀxj − ∑i αi

subject to the constraints:

(1) ∑i αi di = 0   (2) αi ≥ 0 for i = 1, …, N.

Then the optimum weight vector is w0 = ∑i α0,i di xi, and the optimum bias is b0 = ds − w0ᵀx(s), where x(s) is a support vector and s is an index such that α0,s > 0. The decision function is defined by:

D(x) = w0ᵀx + b0.

If D(x) > 0 then x belongs to the class labeled +1; otherwise it belongs to the class labeled −1.
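An illustrative sketch (not part of the lecture) of this hard-margin classifier on toy data, using scikit-learn's SVC with a very large C to approximate the hard margin; the dataset and parameter values are made up:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),   # class labeled +1
               rng.normal(-2.0, 1.0, size=(20, 2))])  # class labeled -1
d = np.array([+1] * 20 + [-1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, d)            # large C approximates the hard margin

w = svm.coef_[0]                  # optimum weight vector w0 = Σ_i α_0,i d_i x_i
b = svm.intercept_[0]             # optimum bias b0
print("number of support vectors:", len(svm.support_vectors_))
print("sign of D(x) for the first point:", np.sign(X[0] @ w + b))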

Soft Margin Separation of Classes

• Some data point (xi, di) violates di (wᵀxi + b) ≥ 1 (it falls inside the region of separation but on the right side of the decision surface, or it falls on the wrong side).

• Introduce slack variables ξi ≥ 0 (which measure the deviation from ideal separability):

di (wᵀxi + b) ≥ 1 − ξi, i = 1, 2, …, N. If ξi > 1 the data point falls on the wrong side (misclassification).

Goal: find the separating hyperplane that minimises the misclassification error.

Soft Margin …

• Minimise ½ wᵀw + C ∑i ξi (the second term is an upper bound on the number of test errors; C controls the trade-off between the complexity of the machine and the number of non-separable points), subject to the above constraints.

This leads to the same dual problem except that 0 ≤ αi ≤ C, i = 1, …, N, and hence the same algorithm with the obvious modification.
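A small illustration (not from the slides) of how the soft-margin parameter C trades margin width against slack, again with made-up overlapping data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.0, 1.5, size=(50, 2)),
               rng.normal(-1.0, 1.5, size=(50, 2))])   # overlapping classes need slack
d = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, d)
    # Small C tolerates more violations (wider margin, more support vectors);
    # large C penalises each slack ξ_i more heavily.
    print(f"C = {C:>6}: {model.n_support_.sum()} support vectors")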

Support Vector Machines (SVM): Using kernels

• The SVM algorithm for linearly separable data is modified by replacing x with φ(x) in a feature space. The critical observation is that only inner products of the φ(x)'s are used.

• Suppose that we now have a shortcut method of computing K(x, z) = φ(x) · φ(z).

• Then we do not need to explicitly compute the feature vectors, either in training or in testing.

The kernel K(x, z) = (x · z + c)ᵈ corresponds to a feature map whose coordinates are all monomials of degree up to d.

Then dim(feature space) = C(n + d, d) = O(nᵈ), yet computing K(x, z) takes only O(n) time.
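A small sketch (not from the slides) contrasting the O(n) kernel evaluation with the size of the implicit feature space; the vectors are random and the helper name is made up:

import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, d=3):
    """K(x, z) = (x · z + c)^d, evaluated in O(n) time."""
    return (np.dot(x, z) + c) ** d

n, d = 1000, 3
rng = np.random.default_rng(3)
x, z = rng.normal(size=n), rng.normal(size=n)

print(poly_kernel(x, z, c=1.0, d=d))
print(comb(n + d, d))   # C(n+d, d) ≈ 1.7e8 monomial coordinates we never compute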


Dual form of SVM

• The dual form of the SVM can also be derived by taking the dual of the optimisation problem. This gives an objective that depends on the data only through inner products, which is exactly what allows the kernel substitution described next.

KERNEL

A kernel is a function K: X × X → R such that for all x, x' ∈ X (the input space) it satisfies:

(i) K(x, x') = ⟨Φ(x), Φ(x')⟩, where Φ is a mapping from the input space X to an inner product space (feature space) F.

(ii) K is symmetric and positive definite, that is, K(x, x') = K(x', x) for any two objects x, x' ∈ X, and ∑i ∑j ci cj K(xi, xj) ≥ 0 for any choice of n > 0 objects x1, …, xn ∈ X and any choice of real numbers c1, …, cn ∈ R.

Thus, using any kernel, the SVM method in the feature space can be stated in the form of the following algorithm.

Algorithm for General SVM

Step 1: Find α = (α1, …, αn) that minimizes

W(α) = ½ ∑i ∑j αi αj yi yj K(xi, xj) − ∑i αi

under the constraints: (i) ∑i αi yi = 0 and (ii) 0 ≤ αi ≤ C for i = 1, …, n.

Step 2: Find an index i with 0 < αi < C, and set b as:

b = yi − ∑j αj yj K(xj, xi).

Step 3: The classification of a new object x ∈ X is then based on the sign of the function

f(x) = ∑i αi yi K(xi, x) + b.
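A compact sketch (not part of the lecture) of Steps 1-3, solving the dual numerically with SciPy's SLSQP solver on a toy linear-kernel problem; the data, tolerances and helper names are all illustrative assumptions:

import numpy as np
from scipy.optimize import minimize

def train_kernel_svm(y, K, C=1.0):
    """Steps 1-3: solve the dual in α, recover the bias b, return f."""
    n = len(y)
    G = K * np.outer(y, y)                        # G_ij = y_i y_j K(x_i, x_j)

    def objective(a):                             # ½ Σ_ij α_i α_j y_i y_j K_ij − Σ_i α_i
        return 0.5 * a @ G @ a - a.sum()

    res = minimize(objective, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x

    # Step 2: pick an index i with 0 < α_i < C and solve for b.
    i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    b = y[i] - np.sum(alpha * y * K[:, i])

    # Step 3: f(x) = Σ_i α_i y_i K(x_i, x) + b, given k_x[i] = K(x_i, x).
    def f(k_x):
        return np.sum(alpha * y * k_x) + b
    return alpha, b, f

# Toy use with a linear kernel K(x, z) = x · z.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(+2, 1, (15, 2)), rng.normal(-2, 1, (15, 2))])
y = np.array([+1.0] * 15 + [-1.0] * 15)
alpha, b, f = train_kernel_svm(y, X @ X.T, C=1.0)
print(np.sign(f(X @ X[0])))                       # should agree with y[0] = +1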

Positive Definite Kernels (RKHS)

For any g(x) for which ∫ g(x)² dx is finite, it must be the case that ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0.

A simple criterion is that the kernel should be positive semi-definite.
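An empirical sketch (not from the slides) of the finite-sample version of this criterion: the Gram matrix of a valid kernel on any set of points is symmetric with non-negative eigenvalues. The RBF kernel and random points below are only for illustration:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))                            # 30 objects x_1, ..., x_n
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                              # symmetry: K(x, x') = K(x', x)
print(np.linalg.eigvalsh(K).min() >= -1e-10)            # Σ_ij c_i c_j K(x_i, x_j) ≥ 0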

The bias b can also be found as follows: b = −½ [ max over i with yi = −1 of ∑j αj yj K(xj, xi) + min over i with yi = +1 of ∑j αj yj K(xj, xi) ].

Alternative approach (ν-SVM): solutions for an L1 error norm are the same as those obtained from maximising:

W(α) = −½ ∑i ∑j αi αj yi yj K(xi, xj)

subject to: ∑i αi yi = 0, 0 ≤ αi ≤ 1/n, ∑i αi ≥ ν,

where ν lies in the range 0 to 1. In this formulation the conceptual meaning of the soft margin parameter is more transparent: the fraction of training errors is upper bounded by ν, and ν also provides a lower bound on the fraction of points which are support vectors.

RBF Kernel and String Kernels

• RBF kernel: K(x, z) = exp(−‖x − z‖² / (2σ²)).

• Spectrum kernel (Leslie et al., 2002):

K(s, t) = ∑q #(q < s) · #(q < t), q ∈ Aⁿ, where #(q < s) is the number of occurrences of q as a length-n substring of s.

Feature map: Φ(s) = (ϕu(s))u∈Aⁿ, with ϕu(s) = the number of occurrences of u in s.
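A short sketch (not from the slides) of the spectrum kernel for DNA strings; the example strings and the choice n = 3 are arbitrary:

from collections import Counter

def spectrum_kernel(s, t, n=3):
    """K(s, t) = Σ_q #(q < s) · #(q < t) over all length-n substrings q."""
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))   # ϕ_u(s) for observed u
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))   # ϕ_u(t) for observed u
    # Only substrings occurring in both strings contribute to the sum.
    return sum(cs[q] * ct[q] for q in cs.keys() & ct.keys())

print(spectrum_kernel("ACGTACGT", "ACGTTTAC", n=3))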

Closure properties of kernels

• If k1 and k2 are kernels then k1 + k2, k1 · k2 and a·k1 (for a ≥ 0) are kernels.

• Example: to compare two proteins one can define a kernel on their sequences and a kernel on their 3D structures and combine them into a sequence-structure kernel for proteins (Lewis et al., 2006).

• For protein function prediction, kernels defined on genome-wide data sets, gene expression data and protein-protein interaction data can likewise be combined.
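A minimal sketch (not from the slides) of combining two kernels by a non-negative weighted sum; the stand-in kernels and weights are purely illustrative (a real sequence-structure kernel would use, e.g., a spectrum kernel and a structural kernel):

import numpy as np

def k_linear(a, b):                                   # stand-in "sequence" kernel
    return float(np.dot(a, b))

def k_rbf(a, b, sigma=1.0):                           # stand-in "structure" kernel
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2)))

def k_combined(a, b, w1=1.0, w2=1.0):
    # Non-negative weighted sums (and products) of kernels are again kernels.
    return w1 * k_linear(a, b) + w2 * k_rbf(a, b)

X = np.random.default_rng(7).normal(size=(10, 5))
K = np.array([[k_combined(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # combined Gram matrix stays PSD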

SVM in Bioinformatics

• Protein secondary structure prediction from sequence using an RBF kernel (Hua & Sun, 2001).

• Detection of remote protein homology using the Fisher kernel (Jaakkola et al., 1999).

• Protein structure prediction (Qiu et al., 2007).

• SVM-based gene finding in nematode genomes (Schweikert et al., 2009).

• Protein interaction prediction (Ben-Hur et al., 2005).

• Feature selection: gene selection from microarray data using multiple kernels (Borgwardt et al., 2005).