
Page 1: Pattern Recognition  and  Machine Learning

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 6: KERNEL METHODS

Page 2: Pattern Recognition  and  Machine Learning

Kernel methods (1)

In Chapters 3 and 4 we dealt with linear parametric models of this form:

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where $\phi_j(\mathbf{x})$ are known as basis functions and $\boldsymbol{\phi}(\mathbf{x})$ is a fixed nonlinear feature space mapping.

These models can be re-cast into an equivalent ‘dual representation’ in which the prediction is based on a linear combination of kernel functions evaluated at the training data points.

Page 3: Pattern Recognition  and  Machine Learning

Dual representation (1)

Let’s consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function:

$$J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) - t_n\right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}, \qquad \lambda \ge 0$$

If we set the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$ equal to zero, the solution for $\mathbf{w}$ takes the following form:

$$\mathbf{w} = \boldsymbol{\Phi}^T\left(\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$

If we substitute it into the model we obtain the following prediction:

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \mathbf{t}^T\left(\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \lambda\mathbf{I}_N\right)^{-1}\boldsymbol{\Phi}\,\boldsymbol{\phi}(\mathbf{x})$$
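A quick numeric check of this solution, not from the original slides: the dual route $\boldsymbol{\Phi}^T(\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$ and the primal route $(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}_M)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$ should give the same $\mathbf{w}$. All names below are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 20, 5, 0.1            # N data points, M basis functions, regularizer
Phi = rng.normal(size=(N, M))     # design matrix with rows phi(x_n)^T
t = rng.normal(size=N)            # target vector

# Primal route: invert an M x M matrix
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual route: invert an N x N matrix
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), t)

assert np.allclose(w_primal, w_dual)  # both routes agree
```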

Page 4: Pattern Recognition  and  Machine Learning

Dual representation (2)

Let’s define the Gram matrix $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$, an $N \times N$ symmetric matrix with elements

$$K_{nm} = \boldsymbol{\phi}(\mathbf{x}_n)^T\boldsymbol{\phi}(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m)$$

In terms of the Gram matrix, the prediction can be written as

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \mathbf{t}^T\left(\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \lambda\mathbf{I}_N\right)^{-1}\boldsymbol{\Phi}\,\boldsymbol{\phi}(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T\left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$

where $\mathbf{k}(\mathbf{x})$ is the vector with elements $k_n(\mathbf{x}) = k(\mathbf{x}_n, \mathbf{x})$.

So now the prediction is expressed entirely in terms of the kernel function $k(\mathbf{x}, \mathbf{x}')$.
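A minimal kernel ridge regression sketch of this prediction, assuming a Gaussian kernel (any valid kernel works); the data and helper names are mine, not the slides’:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), evaluated pairwise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_predict(X, t, X_new, lam=0.1):
    K = gauss_kernel(X, X)                             # Gram matrix, K_nm = k(x_n, x_m)
    a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # (K + lam I_N)^{-1} t
    return gauss_kernel(X_new, X) @ a                  # y(x) = k(x)^T (K + lam I_N)^{-1} t

X = np.linspace(0, 1, 30)[:, None]
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=30)
print(fit_predict(X, t, np.array([[0.5]])))            # should be near sin(pi) = 0
```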

Page 5: Pattern Recognition  and  Machine Learning

Kernel trick

If we have an algorithm formulated in such a way that the input vector x enters only in the form of scalar products, then we can replace the scalar product with some other choice of kernel.

Page 6: Pattern Recognition  and  Machine Learning

Advantages of dual representation

1. In the dual formulation we have to invert an N×N matrix, whereas in the original parameter space formulation we had to invert an M×M matrix in order to determine w. This can be very important if N is significantly smaller than M.

2. As the dual representation is expressed entirely in terms of the kernel function k(x, x'), we can work directly in terms of kernels, which allows us to use feature spaces of high, even infinite, dimensionality.

3. Kernel functions can be defined not only over simple vectors of real numbers but also over objects as diverse as graphs, sets, strings, and text documents.

Page 7: Pattern Recognition  and  Machine Learning

Constructing kernels (1)

A kernel is valid if it corresponds to a scalar product in some (perhaps infinite-dimensional) feature space.

Three approaches to constructing kernels:

1. Choose a feature space mapping $\phi(\mathbf{x})$ and use it to find the corresponding kernel

2. Construct the kernel function directly

For the first approach, the kernel is

$$k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{z}) = \sum_{i=1}^{M}\phi_i(\mathbf{x})\,\phi_i(\mathbf{z})$$

For the second approach, consider for example $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ in a two-dimensional space:

$$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$$
$$= (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)^T = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{z})$$
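A small numeric check, my own illustration, that $(\mathbf{x}^T\mathbf{z})^2$ really equals $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{z})$ for this feature map:

```python
import numpy as np

def phi(v):
    # feature map for the 2-D quadratic kernel: (v1^2, sqrt(2) v1 v2, v2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

rng = np.random.default_rng(2)
x, z = rng.normal(size=2), rng.normal(size=2)
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))  # kernel trick in action
```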

Page 8: Pattern Recognition  and  Machine Learning

Constructing kernels (2)

3. Build new kernels out of simpler valid kernels, using the fact that sums, products, scalar multiples, and certain other combinations of valid kernels are again valid kernels

Page 9: Pattern Recognition  and  Machine Learning

Constructing kernels (3)

A necessary and sufficient condition for a function k(x, x') to be a valid kernel is that the Gram matrix K should be positive semidefinite for all possible choices of the set {x_n}.
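In practice the condition can be probed numerically on a finite sample: build the Gram matrix and inspect its eigenvalues. A sketch under my own choice of kernels, where −‖x − x′‖ serves as a non-valid counterexample:

```python
import numpy as np

def is_psd_gram(kernel, X, tol=-1e-10):
    # Gram matrix K_nm = k(x_n, x_m) over the sample X
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(K).min() >= tol   # all eigenvalues >= 0 (up to tolerance)

X = np.random.default_rng(3).normal(size=(10, 2))
rbf = lambda a, b: np.exp(-((a - b) @ (a - b)) / 2)   # valid kernel: Gram matrix is PSD
bad = lambda a, b: -np.linalg.norm(a - b)             # not a valid kernel
print(is_psd_gram(rbf, X), is_psd_gram(bad, X))       # expect: True False
```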

Page 10: Pattern Recognition  and  Machine Learning

Some kernels worth mentioning (1)

• Linear kernel: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$

• Gaussian kernel: $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2\right)$

• Kernel for sets: $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$
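A direct transcription of these three kernels as a sketch (parameter names are mine):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                  # k(x, x') = x^T x'

def gaussian_kernel(x, z, sigma=1.0):
    d = x - z
    return np.exp(-(d @ d) / (2 * sigma ** 2))    # exp(-||x - x'||^2 / 2 sigma^2)

def subset_kernel(A1, A2):
    return 2.0 ** len(A1 & A2)                    # 2^{|A1 ∩ A2|} over sets

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), gaussian_kernel(x, z), subset_kernel({1, 2, 3}, {2, 3, 4}))
```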

Page 11: Pattern Recognition  and  Machine Learning

Some kernels worth mentioning (2)

• Kernel for probabilistic generative models: $k(\mathbf{x}, \mathbf{x}') = \sum_i p(\mathbf{x} \mid i)\, p(\mathbf{x}' \mid i)\, p(i)$

• Hidden Markov models

• Fisher kernel: $k(\mathbf{x}, \mathbf{x}') = \mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^T \mathbf{F}^{-1} \mathbf{g}(\boldsymbol{\theta}, \mathbf{x}')$, where $\mathbf{g}(\boldsymbol{\theta}, \mathbf{x}) = \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x} \mid \boldsymbol{\theta})$ is the score and $\mathbf{F}$ is the Fisher information matrix
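A toy Fisher kernel, my own example rather than the slides’: for $p(x \mid \theta) = \mathcal{N}(x \mid \theta, 1)$ the score is $g(\theta, x) = x - \theta$ and the Fisher information is $F = 1$, so $k(x, x') = (x - \theta)(x' - \theta)$:

```python
def fisher_kernel_1d_gauss(x, x_prime, theta):
    # score g(theta, x) = d/dtheta ln N(x | theta, 1) = x - theta
    g, g_prime = x - theta, x_prime - theta
    F = 1.0                           # Fisher information of N(theta, 1) w.r.t. theta
    return g * (1.0 / F) * g_prime    # k(x, x') = g^T F^{-1} g'

print(fisher_kernel_1d_gauss(1.5, 2.0, theta=1.0))  # (0.5)(1.0) = 0.5
```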

Page 12: Pattern Recognition  and  Machine Learning

Radial Basis Function Networks

To specify a regression model based on a linear combination of fixed basis functions, we should choose the particular form of those functions.

One possible choice is to use radial basis functions, where each basis function depends only on the radial distance from a certain centre $\boldsymbol{\mu}_j$, so that $\phi_j(\mathbf{x}) = h(\|\mathbf{x} - \boldsymbol{\mu}_j\|)$.
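A minimal sketch of fitting such a model by least squares, assuming Gaussian basis functions h(r) = exp(−r²/2s²) centred on a subset of the inputs; all names are mine:

```python
import numpy as np

def rbf_design(X, centres, s=0.2):
    # design matrix Phi_nj = h(||x_n - mu_j||) with Gaussian h
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=50)

centres = X[:10]                                 # use the first 10 inputs as centres
Phi = rbf_design(X, centres)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares weights
y = rbf_design(np.array([[0.25]]), centres) @ w
print(y)                                         # should be roughly sin(pi/2) = 1
```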

Page 13: Pattern Recognition  and  Machine Learning

Nadaraya-Watson model (1)

We want to find the regression function y(x), using a Parzen density estimator to model the joint distribution p(x, t). It can be shown that

$$y(\mathbf{x}) = \sum_n k(\mathbf{x}, \mathbf{x}_n)\, t_n$$

where $k(\mathbf{x}, \mathbf{x}_n) = \dfrac{g(\mathbf{x} - \mathbf{x}_n)}{\sum_m g(\mathbf{x} - \mathbf{x}_m)}$ and $g(\mathbf{x}) = \displaystyle\int f(\mathbf{x}, t)\, dt$.
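A minimal Nadaraya-Watson sketch with a Gaussian component density g (my own illustration; the bandwidth h is a free parameter):

```python
import numpy as np

def nadaraya_watson(X, t, x_new, h=0.1):
    # weights k(x, x_n) = g(x - x_n) / sum_m g(x - x_m), with Gaussian g
    g = np.exp(-((x_new - X) ** 2) / (2 * h ** 2))
    return (g * t).sum() / g.sum()    # y(x) = sum_n k(x, x_n) t_n

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)
print(nadaraya_watson(X, t, 0.75))    # should be near sin(3 pi / 2) = -1
```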

Page 14: Pattern Recognition  and  Machine Learning

Nadaraya-Watson model (2)

Page 15: Pattern Recognition  and  Machine Learning

Gaussian processes (1)

Let’s apply kernels to probabilistic discriminative models.

Instead of defining a prior on the parameter vector w, we define a prior probability over functions directly.

Page 16: Pattern Recognition  and  Machine Learning

Gaussian processes (2)

y is a linear combination of Gaussian distributed variables and hence is itself Gaussian, with zero mean and covariance given by the Gram matrix $\mathbf{K}$ with elements

$$K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x}_n)^T\boldsymbol{\phi}(\mathbf{x}_m)$$

So the marginal distribution $p(\mathbf{y})$ is defined by the Gram matrix, so that

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})$$

Page 17: Pattern Recognition  and  Machine Learning

Gaussian processes (3)

A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values of y(x) evaluated at an arbitrary set of points {x} jointly has a Gaussian distribution.

This distribution is specified completely by the second-order statistics, the mean and the covariance.

The mean, by symmetry, is taken to be zero. The covariance is given by the kernel function: $\mathbb{E}[y(\mathbf{x}_n)\, y(\mathbf{x}_m)] = k(\mathbf{x}_n, \mathbf{x}_m)$.

Page 18: Pattern Recognition  and  Machine Learning

Gaussian processes (4)

[Figure: sample functions drawn from a Gaussian process prior, using a Gaussian kernel (left) and an exponential kernel (right)]
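A sketch that draws such samples, assuming a Gaussian kernel exp(−(x − x′)²/2σ²) and an exponential kernel exp(−θ|x − x′|); parameter values are mine:

```python
import numpy as np

x = np.linspace(-1, 1, 200)
gauss = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1))   # Gaussian kernel
expo = np.exp(-4.0 * np.abs(x[:, None] - x[None, :]))         # exponential kernel

rng = np.random.default_rng(6)
for K in (gauss, expo):
    # small jitter on the diagonal keeps the Cholesky factorisation stable
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
    sample = L @ rng.normal(size=len(x))   # one function draw, y ~ N(0, K)
    print(sample[:3])
```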

Page 19: Pattern Recognition  and  Machine Learning

Gaussian processes for regression (1)

Let’s consider $t_n$ to be a target value given by

$$t_n = y_n + \epsilon_n, \qquad p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$$

where $\beta$ is a hyperparameter representing the precision of the noise.

Noting that $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})$, we can find the marginal distribution

$$p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})$$

where $C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1}\delta_{nm}$.

Page 20: Pattern Recognition  and  Machine Learning

Gaussian processes for regression (2)

Our goal is to find the conditional distribution $p(t_{N+1} \mid \mathbf{t}_N)$. The joint distribution is given by

$$p(\mathbf{t}_{N+1}) = \mathcal{N}(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}), \qquad \mathbf{C}_{N+1} = \begin{pmatrix} \mathbf{C}_N & \mathbf{k} \\ \mathbf{k}^T & c \end{pmatrix}$$

where $\mathbf{k}$ has elements $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ and $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}$.

Using the results from Chapter 2 we see that $p(t_{N+1} \mid \mathbf{t}_N)$ is a Gaussian distribution with mean and covariance given by

$$m(\mathbf{x}_{N+1}) = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}, \qquad \sigma^2(\mathbf{x}_{N+1}) = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}$$
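A direct transcription of these two formulas as a sketch, assuming a Gaussian kernel; names and parameter values are mine:

```python
import numpy as np

def gp_predict(X, t, x_new, beta=25.0, sigma=0.3):
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))
    C = k(X, X) + np.eye(len(X)) / beta   # C_N = K + beta^{-1} I_N
    kv = k(X, x_new)                      # vector k with elements k(x_n, x_new)
    c = k(x_new, x_new) + 1.0 / beta      # scalar c
    A = np.linalg.solve(C, kv)            # C_N^{-1} k
    mean = A.T @ t                        # m(x_new) = k^T C_N^{-1} t
    var = c - kv.T @ A                    # sigma^2(x_new) = c - k^T C_N^{-1} k
    return mean, var

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=20)
m, v = gp_predict(X, t, np.array([0.5]))
print(m, v)    # mean near sin(pi) = 0, small predictive variance
```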

Page 21: Pattern Recognition  and  Machine Learning

Gaussian processes for regression (3)

Page 22: Pattern Recognition  and  Machine Learning

Gaussian processes for regression (4)

Page 23: Pattern Recognition  and  Machine Learning

Automatic relevance determination

Page 24: Pattern Recognition  and  Machine Learning

Gaussian processes for classification (1)

Our goal is to model the posterior probabilities of the target variable for a new input vector, given a set of training data.

These probabilities must lie in the interval (0, 1), whereas a Gaussian process model makes predictions that lie on the entire real axis.

To adapt Gaussian processes to classification, we should transform the output of the Gaussian process using an appropriate nonlinear activation function.

Page 25: Pattern Recognition  and  Machine Learning

Gaussian processes for classification (2)

Let’s define a Gaussian process over a function a(x) and transform the output with a logistic sigmoid, so that $y = \sigma(a(\mathbf{x}))$.
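A sketch of this construction, my own illustration: draw a(x) from a Gaussian process prior and squash it through the logistic sigmoid to obtain a function with values in (0, 1):

```python
import numpy as np

x = np.linspace(-1, 1, 200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.05))   # GP prior covariance
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
a = L @ np.random.default_rng(8).normal(size=len(x))       # a(x) ~ GP(0, k)
p = 1.0 / (1.0 + np.exp(-a))                               # sigma(a), values in (0, 1)
print(p.min(), p.max())
```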

Page 26: Pattern Recognition  and  Machine Learning

Gaussian processes for classification (3)

We need to determine the predictive distribution $p(t_{N+1} = 1 \mid \mathbf{t}_N)$. So we introduce a Gaussian process prior over the vector $\mathbf{a}_{N+1}$, which in turn defines a non-Gaussian process over $\mathbf{t}$.

For the two-class problem the required predictive distribution is given by

$$p(t_{N+1} = 1 \mid \mathbf{t}_N) = \int p(t_{N+1} = 1 \mid a_{N+1})\, p(a_{N+1} \mid \mathbf{t}_N)\, da_{N+1}$$

where $p(t_{N+1} = 1 \mid a_{N+1}) = \sigma(a_{N+1})$.

Page 27: Pattern Recognition  and  Machine Learning

Gaussian processes for classification (4)

Approaches to Gaussian approximation:

1. Variational inference
2. Expectation propagation
3. Laplace approximation