
Kernel Methods in Bioinformatics

Sudarsan Padhy

IIIT Bhubaneswar

spadhy07@gmail.com


Empirical Inference

• Drawing conclusions from observations (includes learning).

• The focus in studying it is not primarily on the conclusions, but on automatic methods.

• For high-dimensional noisy data, inference becomes nontrivial.

• Empirical inference methods are appropriate whenever:
– little model knowledge is available
– many variables are involved
– mechanistic models are infeasible
– (relatively) large datasets are available


Motivation behind kernel methods

• Linear learning typically has nice properties:
– Unique optimal solutions
– Fast learning algorithms
– Better statistical analysis

• But one big problem: insufficient capacity.


Problems of high dimensions

• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means the learner is unlikely to generalise well.

• Computational costs involved in dealing with large vectors


Historical perspective

• Minsky and Papert (1969) highlighted the weakness in their book Perceptrons.

• Multilayer perceptrons with back-propagation learning (1985) overcame the problem by gluing together many linear units with non-linear activation functions:
– Solved the problem of capacity and led to a very impressive extension of the applicability of learning.
– But ran into training problems of speed and multiple local minima.

MODEL OF A NEURON

[Figure: input signals x1, x2, …, xm are multiplied by synaptic weights wj1, wj2, …, wjm and summed, together with the bias bj (weight wj0), to give the net input vj; the activation function φ(·) then produces the output yj = φ(vj).]

A neuron is an information processing unit that is fundamental to the operation of a neural network. The three basic elements of the neuronal model are:

1. A set of synapses, each characterized by a weight (wji: the weight at neuron j from input i) connecting input i to neuron j.

2. A linear combiner: vj = ∑i wji xi.

3. An activation function φ for limiting the amplitude of the output of the neuron.

It also includes a bias, denoted by bj, for increasing or decreasing the net input of the activation function.

Mathematically, a neuron with inputs xi is described by: yj = φ(∑i wji xi + bj).

Examples of Activation Functions

1. Threshold function: φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0.

2. Sigmoid function: φ(v) = 1 / (1 + exp(−a v)), where a is a slope parameter.

3. Hyperbolic tangent function: φ(v) = tanh(v).
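As a minimal sketch (not part of the original slides), the neuron model and the three activation functions above can be written directly in Python with NumPy; the function names and numeric values below are illustrative assumptions only:

import numpy as np

def neuron_output(x, w, b, activation="sigmoid"):
    """y_j = φ(v_j) with net input v_j = Σ_i w_ji x_i + b_j."""
    v = np.dot(w, x) + b                      # linear combiner plus bias
    if activation == "threshold":
        return 1.0 if v >= 0 else 0.0         # threshold function
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-v))       # sigmoid with slope a = 1
    if activation == "tanh":
        return np.tanh(v)                     # hyperbolic tangent
    raise ValueError("unknown activation")

x = np.array([0.5, -1.2, 2.0])                # inputs x_1, ..., x_m
w = np.array([0.4, 0.3, -0.1])                # weights w_j1, ..., w_jm
b = 0.2                                       # bias b_j
print(neuron_output(x, w, b, "sigmoid"))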

MULTILAYER PERCEPTRON (MLP)

[Figure: an MLP in which the input signal passes through an input layer, a 1st hidden layer, a 2nd hidden layer, and an output layer to produce the output signal.]

The basic idea of a learning algorithm for the MLP is to minimize the squared error E (the error between the desired output and the actual output), which is a non-linear function of the weights:

E = 0.5 ∑j (yj − oj)², where oj = φ(∑i wji oi).
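A minimal sketch (not from the slides) of minimizing this squared error by gradient descent for a single sigmoid unit; the data, learning rate and iteration count are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # toy training inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0     # toy targets in {0, 1}

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    o = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # actual outputs o_j
    grad = (o - y) * o * (1 - o)                   # dE/dv via the chain rule
    w -= lr * X.T @ grad / len(X)                  # gradient step on the weights
    b -= lr * grad.mean()                          # gradient step on the bias

print(0.5 * np.sum((y - o) ** 2))                  # squared error E after training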

Back-propagation neural network (BPNN) models are being used for many applications. However, the BPNN suffers from the following weaknesses:

• Need for a large number of controlling parameters.

• Danger of over-fitting (for large problem sizes it captures not only the useful information contained in the training data but also unwanted characteristics).

• Getting stuck in local minima, since the error function to be minimised is non-linear.

• Slow convergence rate for large problems.

Hence, Support Vector Machines (SVMs), developed by Vapnik and his co-workers (1992-95), have been used for supervised learning due to:

• Better generalization performance than other NN models.

• The solution of an SVM is unique, optimal and free from local minima, since it is obtained from a linearly constrained quadratic programming problem.

• Applicability to non-vectorial data (strings and graphs).

• Few parameters are required for tuning the learning machine.

• Kernel methods are a set of algorithms from statistical learning which include the SVM for classification and regression, kernel PCA, kernel-based clustering, feature selection, and dimensionality reduction.

• They have been popular methods in bioinformatics over the last decade: PubMed (the search engine for biomedical literature) lists 3070 hits for 'SVM' and 2626 hits for 'kernel methods'.

SUPPORT VECTOR MACHINE for classification

Which solution will generalize better to unseen examples?

The second solution is better because there is a larger margin between the separating hyperplane and the closest data points.

Objective: find the hyperplane that maximizes the margin of separation.

Support vectors are the data points which lie closest to the decision surface (the optimal hyperplane).

[Fig-1: classes C1 and C2 separated by a hyperplane with a small margin. Fig-2: the same classes C1 and C2 separated by the optimal hyperplane, with the support vectors marked.]

Consider a binary classification problem: the input vectors are xi ∈ Rᵐ and yi ∈ {−1, +1} are the targets or labels. The index i = 1, …, N labels the pattern pairs (xi, yi). The xi define a space of labelled points called the input space.

[Figure: a separating hyperplane and its margin in the input space.]

In an arbitrary m-dimensional space a separating hyperplane can be written: w · x + b = 0, where b is the bias and w the weight vector. Thus we will consider a decision function of the form: D(x) = sign(w · x + b).

We note that the argument in D(x) is invariant under a rescaling w → λw, b → λb. We will implicitly fix a scale with: w · x + b = +1 and w · x + b = −1 for the support vectors (canonical hyperplanes). Thus: w · (x1 − x2) = 2 for two support vectors x1 and x2 on each side of the separating hyperplane.

The margin is given by the projection of the vector (x1 − x2) onto the normal vector to the hyperplane, i.e. w/‖w‖, from which we deduce that the margin is given by 2/‖w‖.
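A tiny numeric check (not from the slides) of the margin formula 2/‖w‖, using a made-up weight vector:

import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector of a canonical hyperplane
print(2.0 / np.linalg.norm(w))    # margin 2/‖w‖ = 0.4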

[Figure: the separating hyperplane and its margin, with the two support vectors x1 and x2 on either side.]

Maximisation of the margin is thus equivalent to minimisation of the functional: Φ(w) = ½ ‖w‖² = ½ wᵀw, subject to the constraints: yi (w · xi + b) ≥ 1 for i = 1, …, N.

SVM FOR CLASSIFICATION

Given the training sample {(xi, di)}, i = 1, …, N, find the optimal values of the weight vector w and bias b such that w minimizes the function Φ(w) = ½ wᵀw subject to the constraints:

di (wᵀxi + b) ≥ 1 for i = 1, 2, …, N.

This problem leads to the dual problem:

Determine the Lagrange multipliers α = (α1, …, αN) that minimize the objective function

Q(α) = ½ ∑i ∑j αi αj di dj xiᵀxj − ∑i αi

subject to the constraints:

(1) ∑i αi di = 0   (2) αi ≥ 0 for i = 1, …, N.

Then the optimum weight vector is w0 = ∑i α0,i di xi, and the optimum bias is b0 = ds − w0ᵀx(s), where x(s) is a support vector and s is an index such that α0,s > 0. The decision function is defined by:

D(x) = w0ᵀx + b0.

If D(x) > 0 then x belongs to the class labeled +1; otherwise it belongs to the class labeled −1.
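An illustrative sketch (not part of the lecture) of this hard-margin classifier on toy data, using scikit-learn's SVC with a very large C to approximate the hard margin; the dataset and parameter values are made up:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),   # class labeled +1
               rng.normal(-2.0, 1.0, size=(20, 2))])  # class labeled -1
d = np.array([+1] * 20 + [-1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, d)            # large C approximates the hard margin

w = svm.coef_[0]                  # optimum weight vector w0 = Σ_i α_0,i d_i x_i
b = svm.intercept_[0]             # optimum bias b0
print("number of support vectors:", len(svm.support_vectors_))
print("sign of D(x) for the first point:", np.sign(X[0] @ w + b))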

Soft Margin Separation of Classes

• Some data point (xi, di) violates di (wᵀxi + b) ≥ 1 (it falls inside the region of separation but on the right side of the decision surface, or it falls on the wrong side).

• Introduce slack variables ξi ≥ 0 (which measure the deviation from ideal separability):

di (wᵀxi + b) ≥ 1 − ξi, i = 1, 2, …, N. If ξi > 1 the data point falls on the wrong side (misclassification).

Goal: find the separating hyperplane that minimises the misclassification error.

Soft Margin …

• Minimise ½ wᵀw + C ∑i ξi (the second term is an upper bound on the number of test errors; C controls the trade-off between the complexity of the machine and the number of non-separable points), subject to the above constraints.

This leads to the same dual problem except that 0 ≤ αi ≤ C, i = 1, …, N, and hence the same algorithm with the obvious modification.
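A small illustration (not from the slides) of how the soft-margin parameter C trades margin width against slack, again with made-up overlapping data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.0, 1.5, size=(50, 2)),
               rng.normal(-1.0, 1.5, size=(50, 2))])   # overlapping classes need slack
d = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, d)
    # Small C tolerates more violations (wider margin, more support vectors);
    # large C penalises each slack ξ_i more heavily.
    print(f"C = {C:>6}: {model.n_support_.sum()} support vectors")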

Support Vector Machines (SVM): Using kernels

• The SVM algorithm for linearly separable data is modified by replacing x with φ(x) in a feature space. The critical observation is that only inner products of the φ(x)'s are used.

• Suppose that we now have a shortcut method of computing K(x, z) = φ(x) · φ(z).

• Then we do not need to explicitly compute the feature vectors, either in training or in testing.

The kernel K(x, z) = (x · z + c)ᵈ corresponds to a feature map whose coordinates are all monomials of degree up to d.

Then dim(feature space) = C(n + d, d) = O(nᵈ), yet computing K(x, z) takes only O(n) time.
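A small sketch (not from the slides) contrasting the O(n) kernel evaluation with the size of the implicit feature space; the vectors are random and the helper name is made up:

import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, d=3):
    """K(x, z) = (x · z + c)^d, evaluated in O(n) time."""
    return (np.dot(x, z) + c) ** d

n, d = 1000, 3
rng = np.random.default_rng(3)
x, z = rng.normal(size=n), rng.normal(size=n)

print(poly_kernel(x, z, c=1.0, d=d))
print(comb(n + d, d))   # C(n+d, d) ≈ 1.7e8 monomial coordinates we never compute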


Dual form of SVM

• The dual form of the SVM can also be derived by taking the dual of the optimisation problem. This gives an objective that depends on the data only through inner products, which is exactly what allows the kernel substitution described next.

KERNEL

A kernel is a function K: X × X → R such that for all x, x' ∈ X (the input space) it satisfies:

(i) K(x, x') = ⟨Φ(x), Φ(x')⟩, where Φ is a mapping from the input space X to an inner product space (feature space) F.

(ii) K is symmetric and positive definite, that is, K(x, x') = K(x', x) for any two objects x, x' ∈ X, and ∑i ∑j ci cj K(xi, xj) ≥ 0 for any choice of n > 0 objects x1, …, xn ∈ X and any choice of real numbers c1, …, cn ∈ R.

Thus, using any kernel, the SVM method in the feature space can be stated in the form of the following algorithm.

Algorithm for General SVM

Step 1: Find α = (α1, …, αn) that minimizes

W(α) = ½ ∑i ∑j αi αj yi yj K(xi, xj) − ∑i αi

under the constraints: (i) ∑i αi yi = 0 and (ii) 0 ≤ αi ≤ C for i = 1, …, n.

Step 2: Find an index i with 0 < αi < C, and set b as:

b = yi − ∑j αj yj K(xj, xi).

Step 3: The classification of a new object x ∈ X is then based on the sign of the function

f(x) = ∑i αi yi K(xi, x) + b.
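A compact sketch (not part of the lecture) of Steps 1-3, solving the dual numerically with SciPy's SLSQP solver on a toy linear-kernel problem; the data, tolerances and helper names are all illustrative assumptions:

import numpy as np
from scipy.optimize import minimize

def train_kernel_svm(y, K, C=1.0):
    """Steps 1-3: solve the dual in α, recover the bias b, return f."""
    n = len(y)
    G = K * np.outer(y, y)                        # G_ij = y_i y_j K(x_i, x_j)

    def objective(a):                             # ½ Σ_ij α_i α_j y_i y_j K_ij − Σ_i α_i
        return 0.5 * a @ G @ a - a.sum()

    res = minimize(objective, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x

    # Step 2: pick an index i with 0 < α_i < C and solve for b.
    i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    b = y[i] - np.sum(alpha * y * K[:, i])

    # Step 3: f(x) = Σ_i α_i y_i K(x_i, x) + b, given k_x[i] = K(x_i, x).
    def f(k_x):
        return np.sum(alpha * y * k_x) + b
    return alpha, b, f

# Toy use with a linear kernel K(x, z) = x · z.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(+2, 1, (15, 2)), rng.normal(-2, 1, (15, 2))])
y = np.array([+1.0] * 15 + [-1.0] * 15)
alpha, b, f = train_kernel_svm(y, X @ X.T, C=1.0)
print(np.sign(f(X @ X[0])))                       # should agree with y[0] = +1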

Positive Definite Kernels (RKHS)

For any g(x) for which ∫ g(x)² dx is finite, it must be the case that ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0.

A simple criterion is that the kernel should be positive semi-definite.
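An empirical sketch (not from the slides) of the finite-sample version of this criterion: the Gram matrix of a valid kernel on any set of points is symmetric with non-negative eigenvalues. The RBF kernel and random points below are only for illustration:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))                            # 30 objects x_1, ..., x_n
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                              # symmetry: K(x, x') = K(x', x)
print(np.linalg.eigvalsh(K).min() >= -1e-10)            # Σ_ij c_i c_j K(x_i, x_j) ≥ 0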

The bias b can also be found as follows: b = −½ [ max over i with yi = −1 of ∑j αj yj K(xj, xi) + min over i with yi = +1 of ∑j αj yj K(xj, xi) ].

Alternative approach (ν-SVM): solutions for an L1 error norm are the same as those obtained from maximising:

W(α) = −½ ∑i ∑j αi αj yi yj K(xi, xj)

subject to: ∑i αi yi = 0, 0 ≤ αi ≤ 1/n, ∑i αi ≥ ν,

where ν lies in the range 0 to 1. In this formulation the conceptual meaning of the soft margin parameter is more transparent: the fraction of training errors is upper bounded by ν, and ν also provides a lower bound on the fraction of points which are support vectors.

RBF Kernel and String Kernels

• RBF kernel: K(x, z) = exp(−‖x − z‖² / (2σ²)).

• Spectrum kernel (Leslie et al., 2002):

K(s, t) = ∑q #(q < s) · #(q < t), q ∈ Aⁿ, where #(q < s) is the number of occurrences of q as a length-n substring of s.

Feature map: Φ(s) = (ϕu(s))u∈Aⁿ, with ϕu(s) = the number of occurrences of u in s.
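A short sketch (not from the slides) of the spectrum kernel for DNA strings; the example strings and the choice n = 3 are arbitrary:

from collections import Counter

def spectrum_kernel(s, t, n=3):
    """K(s, t) = Σ_q #(q < s) · #(q < t) over all length-n substrings q."""
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))   # ϕ_u(s) for observed u
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))   # ϕ_u(t) for observed u
    # Only substrings occurring in both strings contribute to the sum.
    return sum(cs[q] * ct[q] for q in cs.keys() & ct.keys())

print(spectrum_kernel("ACGTACGT", "ACGTTTAC", n=3))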

Closure properties of kernels

• If k1 and k2 are kernels then k1 + k2, k1 · k2 and a·k1 (for a ≥ 0) are kernels.

• Example: to compare two proteins one can define a kernel on their sequences and a kernel on their 3D structures and combine them into a sequence-structure kernel for proteins (Lewis et al., 2006).

• For protein function prediction, kernels defined on genome-wide data sets, gene expression data and protein-protein interaction data can likewise be combined.
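A minimal sketch (not from the slides) of combining two kernels by a non-negative weighted sum; the stand-in kernels and weights are purely illustrative (a real sequence-structure kernel would use, e.g., a spectrum kernel and a structural kernel):

import numpy as np

def k_linear(a, b):                                   # stand-in "sequence" kernel
    return float(np.dot(a, b))

def k_rbf(a, b, sigma=1.0):                           # stand-in "structure" kernel
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2)))

def k_combined(a, b, w1=1.0, w2=1.0):
    # Non-negative weighted sums (and products) of kernels are again kernels.
    return w1 * k_linear(a, b) + w2 * k_rbf(a, b)

X = np.random.default_rng(7).normal(size=(10, 5))
K = np.array([[k_combined(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # combined Gram matrix stays PSD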

SVM in Bioinformatics

• Protein secondary structure prediction from sequence using an RBF kernel (Hua & Sun, 2001).

• Detection of remote protein homology using the Fisher kernel (Jaakkola et al., 1999).

• Protein structure prediction (Qiu et al., 2007).

• SVM-based gene finding in nematode genomes (Schweikert et al., 2009).

• Protein interaction prediction (Ben-Hur et al., 2005).

• Feature selection: gene selection from microarray data using multiple kernels (Borgwardt et al., 2005).