G. Sanguinetti, EPSRC Winter School 01/08
Semi-supervised learning
Guido Sanguinetti, Department of Computer Science, University of Sheffield
Programme
• Different ways of learning
  – Unsupervised learning: the EM algorithm
  – Supervised learning
• Different ways of going semi-supervised
  – Generative models: reweighting
  – Discriminative models: regularisation (clusters and manifolds)
Disclaimer
• This is going to be a high-level introduction rather than a technical talk.
• We will spend some time talking about supervised and unsupervised learning as well.
• We will be unashamedly probabilistic, if not fully Bayesian.
• Main references: C. Bishop's book "Pattern Recognition and Machine Learning" and M. Seeger's chapter in "Semi-Supervised Learning", Chapelle, Schölkopf and Zien, eds.
Different ways to learn
• Reinforcement learning: learn a strategy to optimise a reward.
• Closely related to decision theory and control theory.
• Supervised learning: learn a map f: x → y.
• Unsupervised learning: estimate a density p(x).
Unsupervised learning
• We are given data x.
• We want to estimate the density that generated the data.
• Generally, assume a latent variable y, as in a graphical model.
• A continuous y leads to dimensionality reduction (cf. previous lecture), a discrete one to clustering.

[Graphical model: parameters π and θ, latent variable y, observed variable x]
Example: mixture of Gaussians
• Data: vector measurements x_i, i = 1, ..., N.
• Latent variable: K-dimensional binary vectors (class membership) y_i; y_ij = 1 means point i belongs to class j.
• The θ parameters are in this case the covariances and the means of each component.
Estimating mixtures
• Two objectives: estimate the parameters π, μ_j and Σ_j, and estimate the posterior probabilities of class membership.
• The γ_ij are the responsibilities.
• Could use gradient descent.
• Better to use EM.
Expectation-Maximization
• Iterative procedure to estimate the maximum likelihood values of parameters in models with latent variables.
• We want to maximise the log-likelihood of the model, log p(x|θ) = log Σ_y p(x, y|θ).
• Notice that the log of a sum is not nice.
• The key mathematical tool is Jensen's inequality.
Jensen’s inequality
• For every concave function f, random variable x and probability distribution q(x), f(E_q[x]) ≥ E_q[f(x)].
• A cartoon of the proof relies on the centre of mass of a convex polygon lying inside the polygon.
Bound on the log-likelihood
• Jensen's inequality leads to the bound log p(x|θ) ≥ E_q[log p(x, y|θ)] + H[q].
• H[q] is the entropy of the distribution q. Notice the absence of the nasty log of a sum.
• If and only if q(y) is the posterior p(y|x, θ), the bound is saturated (it holds with equality).
EM
• For a fixed value of the parameters, the posterior will saturate the bound.
• In the M-step, optimise the bound with respect to the parameters.
• In the E-step, recompute the posterior with the new value of the parameters.
• Exercise: EM for mixture of Gaussians.
[Diagram: EM alternates an E-step (recompute the posterior) and an M-step (optimise θ)]
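The exercise above can be sketched directly in code. Below is a minimal NumPy implementation of EM for a mixture of Gaussians, with the E-step computing the responsibilities γ_ij and the M-step re-estimating π, μ_j and Σ_j. The spread initialisation and the small covariance jitter are implementation choices, not part of the algorithm itself.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians with full covariances.

    A minimal sketch of the E/M alternation described above; variable
    names (pi, mu, Sigma, gamma) follow the slides' notation.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    # Spread initialisation: first mean is a random point, each next mean
    # is the point farthest from the means chosen so far (k-means++-like).
    mu = [X[rng.integers(N)]]
    for _ in range(K - 1):
        dists = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[np.argmax(dists)])
    mu = np.array(mu)
    Sigma = np.stack([np.eye(D)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij = p(y_ij = 1 | x_i, theta).
        log_r = np.empty((N, K))
        for j in range(K):
            diff = X - mu[j]
            inv = np.linalg.inv(Sigma[j])
            _, logdet = np.linalg.slogdet(Sigma[j])
            log_r[:, j] = (np.log(pi[j]) - 0.5 * logdet
                           - 0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: maximise the bound with respect to pi, mu, Sigma.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            # Small jitter keeps the covariances numerically invertible.
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```

Working in log space before normalising the responsibilities avoids underflow when points are far from all components.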
Supervised learning
• The data consists of a set of inputs x and a set of output (target) values y.
• The goal is to learn the functional relation f: x → y.
• Evaluated using reconstruction error (classification accuracy), usually on a separate test set.
• Continuous y leads to regression, discrete to classification.
Classification-Generative
• The generative approach starts with modelling p(x|c) as in unsupervised learning.
• The model parameters are estimated (e.g. using maximum likelihood).
• The assignment is based on the posterior probabilities p(c|x).
• Requires estimation of many parameters.

[Graphical model: parameters π and θ, latent class c, observed variable x]
Example: discriminant analysis
• Assume class-conditional distributions to be Gaussian.
• Estimate means and covariances using maximum likelihood (O(KD²) parameters).
• Classify novel (test) data using the posterior p(c|x).
• Exercise: rewrite posterior in terms of sigmoids.
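A sketch of the exercise, assuming the two classes share a common covariance Σ (the case in which the posterior reduces to a sigmoid of a linear function of x):

```latex
% Two classes with p(x|c_k) = N(x | mu_k, Sigma) and priors pi_k.
\begin{align}
p(c_1|x) &= \frac{\pi_1 \mathcal{N}(x|\mu_1,\Sigma)}
                 {\pi_1 \mathcal{N}(x|\mu_1,\Sigma) + \pi_2 \mathcal{N}(x|\mu_2,\Sigma)}
          = \sigma(w^\top x + w_0), \\
w   &= \Sigma^{-1}(\mu_1 - \mu_2), \\
w_0 &= -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
       + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2
       + \ln\frac{\pi_1}{\pi_2}.
\end{align}
```

The quadratic terms in x cancel because the covariances are equal; with class-specific covariances the argument of the sigmoid becomes quadratic in x.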
Classification-Discriminative
• Also called diagnostic.
• Modelling the class-conditional distributions involves significant overheads.
• Discriminative techniques avoid this by modelling directly the posterior distribution p(c_j|x).
• Closely related to the concept of transductive learning.

[Graphical model: parameters μ and θ, observed variable x, class c]
Example: logistic regression
• Restrict to the two-class case.
• Model the posterior as p(c=1|x) = σ(wᵀx), where we have introduced the logistic sigmoid σ(a) = 1/(1 + e^(−a)).
• Notice similarity with discriminant function in generative models.
• Number of parameters to be estimated is D.
Estimating logistic regression
• Let y_i = σ(wᵀx_i).
• Let c_i be 0 or 1.
• The likelihood for the data (x_i, c_i) is p(c|w) = Π_i y_i^{c_i} (1 − y_i)^{1−c_i}.
• The gradient of the negative log-likelihood is Σ_i (y_i − c_i) x_i.
• Overfitting: may need a prior on w.
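A minimal sketch of maximum-likelihood logistic regression by gradient descent, using the gradient Σ_i (y_i − c_i) x_i from above. The optional weight-decay term `alpha` plays the role of a Gaussian prior on w; the step size and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, c, lr=0.1, n_iter=2000, alpha=0.0):
    """Logistic regression by gradient descent on the negative log-likelihood.

    Implements the gradient from the slide: sum_i (y_i - c_i) x_i, plus an
    optional weight-decay term alpha * w (a Gaussian prior on w). A sketch,
    not a production optimiser: fixed step size, no convergence check.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        y = sigmoid(X @ w)                 # y_i = sigma(w^T x_i)
        grad = X.T @ (y - c) + alpha * w   # gradient of the penalised NLL
        w -= lr * grad / N
    return w
```

A bias term can be handled, as usual, by appending a constant-1 column to X.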
Semi-supervised learning
• In many practical applications, the labelling process is expensive and time-consuming.
• Plenty of examples of inputs x, but few of the corresponding targets c.
• The goal is still to predict p(c|x), and it is evaluated using classification accuracy.
• Semi-supervised learning is, in this sense, a special case of supervised learning where we use the extra unlabelled data to improve the predictive power.
Notation
• We have a labelled or complete data set D_l, comprising N_l pairs (x, c).
• We have an unlabelled or incomplete data set D_u, comprising N_u vectors x.
• The interesting (and common) case is when N_u ≫ N_l.
Baselines
• One attack on semi-supervised learning would be to ignore the labels and estimate an unsupervised clustering model.
• Another attack is to ignore the unlabelled data and just use the complete data.
• No free lunch: it is possible to construct data distributions for which either baseline outperforms a given SSL method.
• The art in SSL lies in designing good models for specific applications, rather than in general theory.
Generative SSL
• Generative SSL methods are the most intuitive.
• The graphical model for generative classification and unsupervised learning is the same.
• The likelihoods are combined: p(D|θ) = ∏_{i∈D_l} p(x_i, c_i|θ) · ∏_{j∈D_u} p(x_j|θ).
• Parameters are estimated by EM.

[Graphical model: parameters π and θ, class c, observed variable x]
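As a sketch of how the combined likelihood is optimised, here is EM for two Gaussian classes in which labelled points keep fixed one-hot responsibilities while unlabelled points receive posterior responsibilities in the E-step. Unit covariance is assumed known for brevity; that simplification is ours, not part of the general recipe.

```python
import numpy as np

def ssl_gaussian_em(Xl, cl, Xu, n_iter=50):
    """EM for two Gaussian classes combining labelled and unlabelled data.

    Labelled points contribute fixed one-hot responsibilities; unlabelled
    points get posterior responsibilities in the E-step. Unit covariance
    is assumed known; only the means and mixing weights are estimated.
    """
    X = np.vstack([Xl, Xu])
    Nl, Nu = len(Xl), len(Xu)
    gamma_l = np.eye(2)[cl]          # fixed responsibilities for labels
    # Initialise the means from the labelled data alone.
    mu = np.array([Xl[cl == 0].mean(0), Xl[cl == 1].mean(0)])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step (unlabelled only): p(c|x) under unit-covariance Gaussians.
        d = np.stack([((Xu - mu[j]) ** 2).sum(1) for j in range(2)], axis=1)
        log_r = np.log(pi) - 0.5 * d
        log_r -= log_r.max(1, keepdims=True)
        gamma_u = np.exp(log_r)
        gamma_u /= gamma_u.sum(1, keepdims=True)
        gamma = np.vstack([gamma_l, gamma_u])
        # M-step on the combined (labelled + unlabelled) data.
        Nk = gamma.sum(0)
        pi = Nk / (Nl + Nu)
        mu = (gamma.T @ X) / Nk[:, None]
    return mu, pi
```

With few labels and many unlabelled points, the unlabelled term dominates the mean estimates, which is exactly the situation the reweighting discussion below is about.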
Discriminant analysis
• McLachlan (1977) considered SSL for a generative model with two Gaussian classes.
• Assume the covariance to be known; the means need to be estimated from the data.
• Assume we have N_u = M_1 + M_2 ≪ N_l.
• Assume also that the labelled data is split evenly between the two classes, so N_l = 2n.
A surprising result
• Consider the expected misclassification error.
• Let R be the expected error obtained using only the labelled data, and R_1 the expected error obtained using all data (estimating parameters with EM).
• If …, the first-order expansion (in N_u/N_l) of R − R_1 is positive only for N_u < M*, with M* finite.
• In other words, unlabelled data helps only up to a point.
A way out
• Perhaps considering labelled and unlabelled data on the same footing is not ideal.
• McLachlan went on to propose a different estimator for the class means, weighting the contribution of the unlabelled data.
• He then shows that, for a suitable weight, the expected error obtained in this way is, to first order, always lower than the one obtained without the unlabelled data.
A hornets’ nest
• In general, it seems sensible to reweight the log-likelihood as λ log p(D_l|θ) + (1 − λ) log p(D_u|θ).
• This is partly because, in the important case when N_l is small, we do not want the label information to be swamped.
• But how do you choose λ? Cross-validation on the labels?
• Unfeasible for small N_l.
Stability
• An elegant solution was proposed by Corduneanu and Jaakkola (2002).
• In the limit when all data is labelled (λ = 1), the likelihood is unimodal.
• In the limit when all data is unlabelled (λ = 0), the likelihood is multimodal (e.g. under permutations of the classes).
• When moving from λ = 1 to λ = 0, we must encounter a critical λ* where the likelihood becomes multimodal.
• That λ* is the "optimal" choice.
Discriminative SSL
• The alternative paradigm for SSL is the discriminative approach.
• As in supervised learning, the discriminative approach models directly p(c|x).
• The total likelihood factorises as p(c, x|θ, μ) = p(c|x, θ) p(x|μ).

[Graphical model: parameters θ for p(c|x) and μ for p(x), observed variable x, class c]
Surprise!
• Using Bayes' theorem, we obtain the posterior over θ as p(θ|D) ∝ p(θ) ∏_{i∈D_l} p(c_i|x_i, θ): the unlabelled data drops out entirely.
• This is seriously worrying! Why? If θ and μ are a priori independent, the unlabelled inputs carry no information about θ, so the unlabelled data cannot change the posterior.
Regularization
• Bayesian methods cannot be employed in a purely discriminative SSL setting.
• There has been some limited success with non-Bayesian methods (e.g. Anderson 1978 for logistic regression over discrete variables).
• The most promising avenue is to modify the graphical model used.
• This is known as regularization.

[Graphical model: parameters μ and θ, observed variable x, class c]
Discriminative vs Generative
• The idea behind regularization is that information about the generative structure of x feeds into the discriminative process.
• Does it still make sense to talk about a discriminative approach?
• Strictly speaking, perhaps not.
• In practice, the hypotheses used for regularization are much weaker than modelling the full class-conditional distribution.
Cluster assumption
• Perhaps the most widely used form of regularization.
• It states that the boundary between classes should cross areas of low data density.
• It is often reasonable, particularly in high dimensions.
• Implemented in a host of Bayesian and non-Bayesian methods.
• Can be understood in terms of smoothness of the discriminant function.
Manifold assumption
• Another convenient assumption is that the data lies on a low-dimensional sub-manifold of a high-dimensional space
• Often the starting point is to map the data to a high dimensional space via a feature map
• The cluster assumption is then applied in the high dimensional space
• The key concept is the graph Laplacian.
• Ideas pioneered by Belkin and Niyogi.
Manifolds cont.
• We want discriminant functions f which vary little in regions of high data density.
• Mathematically, this is equivalent to the smoothness penalty ∫ f Δf (equivalently ∫ ‖∇f‖²) being very small, where Δ is the Laplace-Beltrami operator.
• The finite-sample approximation to Δ is given by the graph Laplacian L = D − W.
• The objective function for SSL is a combination of the error given by f and of its norm under L, i.e. fᵀLf.
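A minimal construction of the graph Laplacian from data, using a Gaussian affinity (the kernel choice and the bandwidth σ are illustrative assumptions). The quadratic form fᵀLf then penalises discriminant functions that vary across strongly connected, i.e. high-density, regions.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Unnormalised graph Laplacian L = D - W from a Gaussian-affinity
    graph on the sample points, as a finite-sample stand-in for the
    Laplace-Beltrami operator.

    The smoothness penalty f^T L f equals 0.5 * sum_ij W_ij (f_i - f_j)^2,
    so it is small when f changes little between strongly connected points.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.exp(-sq / (2 * sigma ** 2))                    # Gaussian affinities
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(1))                                 # degree matrix
    return D - W
```

L is symmetric positive semi-definite, and constant functions lie in its null space, so the penalty never charges a function that is flat over a connected component.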