G. Sanguinetti, EPSRC Winter School 01/08
Semi-supervised learning
Guido Sanguinetti, Department of Computer Science, University of Sheffield
Programme
• Different ways of learning
  – Unsupervised learning: the EM algorithm
  – Supervised learning
• Different ways of going semi-supervised
  – Generative models: reweighting
  – Discriminative models: regularisation (clusters and manifolds)
Disclaimer
• This is going to be a high-level introduction rather than a technical talk.
• We will spend some time talking about supervised and unsupervised learning as well.
• We will be unashamedly probabilistic, if not fully Bayesian.
• Main references: C. Bishop's book "Pattern Recognition and Machine Learning" and M. Seeger's chapter in "Semi-Supervised Learning", Chapelle, Schölkopf and Zien, eds.
Different ways to learn
• Reinforcement learning: learn a strategy to optimise a reward.
• Closely related to decision theory and control theory.
• Supervised learning: learn a map f: x → y.
• Unsupervised learning: estimate a density p(x).
Unsupervised learning
• We are given data x.
• We want to estimate the density that generated the data.
• Generally, assume a latent variable y, as in a graphical model.
• A continuous y leads to dimensionality reduction (cf. previous lecture), a discrete one to clustering.

[Graphical model: parameters π and θ, latent variable y, observed variable x]
Example: mixture of Gaussians
• Data: vector measurements x_i, i = 1, ..., N.
• Latent variable: K-dimensional binary vectors (class membership) y_i; y_ij = 1 means point i belongs to class j.
• The θ parameters are in this case the covariances and the means of each component.
Estimating mixtures
• Two objectives: estimate the parameters π, μ_j and Σ_j, and estimate the posterior probabilities of class membership.
• The γ_ij are the responsibilities.
• Could use gradient descent.
• Better to use EM.
Expectation-Maximization
• Iterative procedure to estimate the maximum likelihood values of parameters in models with latent variables.
• We want to maximise the log-likelihood of the model, log p(x|θ) = log Σ_y p(x, y|θ).
• Notice that the log of a sum is not nice.
• The key mathematical tool is Jensen's inequality.
Jensen’s inequality
• For every concave function f, random variable x and probability distribution q(x), f(E_q[x]) ≥ E_q[f(x)].
• A cartoon of the proof relies on the centre of mass of a convex polygon lying inside the polygon.
Bound on the log-likelihood
• Jensen's inequality leads to the bound log p(x|θ) ≥ E_q[log p(x, y|θ)] + H[q].
• H[q] is the entropy of the distribution q. Notice the absence of the nasty log of a sum.
• If and only if q(y) is the posterior p(y|x, θ), the bound is saturated (it holds with equality).
EM
• For a fixed value of the parameters, the posterior will saturate the bound.
• In the M-step, optimise the bound with respect to the parameters.
• In the E-step, recompute the posterior with the new value of the parameters.
• Exercise: EM for mixture of Gaussians.
[Diagram: EM alternates an E-step (recompute the posterior) and an M-step (optimise θ)]
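The exercise above can be sketched directly in code. Below is a minimal NumPy implementation of EM for a mixture of Gaussians, with the E-step computing the responsibilities γ_ij and the M-step re-estimating π, μ_j and Σ_j. The spread initialisation and the small covariance jitter are implementation choices, not part of the algorithm itself.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians with full covariances.

    A minimal sketch of the E/M alternation described above; variable
    names (pi, mu, Sigma, gamma) follow the slides' notation.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    # Spread initialisation: first mean is a random point, each next mean
    # is the point farthest from the means chosen so far (k-means++-like).
    mu = [X[rng.integers(N)]]
    for _ in range(K - 1):
        dists = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[np.argmax(dists)])
    mu = np.array(mu)
    Sigma = np.stack([np.eye(D)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij = p(y_ij = 1 | x_i, theta).
        log_r = np.empty((N, K))
        for j in range(K):
            diff = X - mu[j]
            inv = np.linalg.inv(Sigma[j])
            _, logdet = np.linalg.slogdet(Sigma[j])
            log_r[:, j] = (np.log(pi[j]) - 0.5 * logdet
                           - 0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: maximise the bound with respect to pi, mu, Sigma.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            # Small jitter keeps the covariances numerically invertible.
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```

Working in log space before normalising the responsibilities avoids underflow when points are far from all components.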
Supervised learning
• The data consists of a set of inputs x and a set of output (target) values y.
• The goal is to learn the functional relation f: x → y.
• Evaluated using reconstruction error (classification accuracy), usually on a separate test set.
• Continuous y leads to regression, discrete to classification.
Classification-Generative
• The generative approach starts with modelling p(x|c) as in unsupervised learning.
• The model parameters are estimated (e.g. using maximum likelihood).
• The assignment is based on the posterior probabilities p(c|x).
• Requires estimation of many parameters.

[Graphical model: parameters π and θ, latent class c, observed variable x]
Example: discriminant analysis
• Assume class-conditional distributions to be Gaussian.
• Estimate means and covariances using maximum likelihood (O(KD²) parameters).
• Classify novel (test) data using the posterior p(c|x).
• Exercise: rewrite posterior in terms of sigmoids.
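A sketch of the exercise, assuming the two classes share a common covariance Σ (the case in which the posterior reduces to a sigmoid of a linear function of x):

```latex
% Two classes with p(x|c_k) = N(x | mu_k, Sigma) and priors pi_k.
\begin{align}
p(c_1|x) &= \frac{\pi_1 \mathcal{N}(x|\mu_1,\Sigma)}
                 {\pi_1 \mathcal{N}(x|\mu_1,\Sigma) + \pi_2 \mathcal{N}(x|\mu_2,\Sigma)}
          = \sigma(w^\top x + w_0), \\
w   &= \Sigma^{-1}(\mu_1 - \mu_2), \\
w_0 &= -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
       + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2
       + \ln\frac{\pi_1}{\pi_2}.
\end{align}
```

The quadratic terms in x cancel because the covariances are equal; with class-specific covariances the argument of the sigmoid becomes quadratic in x.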
Classification-Discriminative
• Also called diagnostic.
• Modelling the class-conditional distributions involves significant overheads.
• Discriminative techniques avoid this by modelling directly the posterior distribution p(c_j|x).
• Closely related to the concept of transductive learning.

[Graphical model: parameters μ and θ, observed variable x, class c]
Example: logistic regression
• Restrict to the two-class case.
• Model the posterior as p(c=1|x) = σ(wᵀx), where we have introduced the logistic sigmoid σ(a) = 1/(1 + e^(−a)).
• Notice similarity with discriminant function in generative models.
• Number of parameters to be estimated is D.
Estimating logistic regression
• Let y_i = σ(wᵀx_i).
• Let c_i be 0 or 1.
• The likelihood for the data (x_i, c_i) is p(c|w) = Π_i y_i^{c_i} (1 − y_i)^{1−c_i}.
• The gradient of the negative log-likelihood is Σ_i (y_i − c_i) x_i.
• Overfitting: may need a prior on w.
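A minimal sketch of maximum-likelihood logistic regression by gradient descent, using the gradient Σ_i (y_i − c_i) x_i from above. The optional weight-decay term `alpha` plays the role of a Gaussian prior on w; the step size and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, c, lr=0.1, n_iter=2000, alpha=0.0):
    """Logistic regression by gradient descent on the negative log-likelihood.

    Implements the gradient from the slide: sum_i (y_i - c_i) x_i, plus an
    optional weight-decay term alpha * w (a Gaussian prior on w). A sketch,
    not a production optimiser: fixed step size, no convergence check.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        y = sigmoid(X @ w)                 # y_i = sigma(w^T x_i)
        grad = X.T @ (y - c) + alpha * w   # gradient of the penalised NLL
        w -= lr * grad / N
    return w
```

A bias term can be handled, as usual, by appending a constant-1 column to X.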
Semi-supervised learning
• In many practical applications, the labelling process is expensive and time-consuming.
• Plenty of examples of inputs x, but few of the corresponding targets c.
• The goal is still to predict p(c|x), and it is evaluated using classification accuracy.
• Semi-supervised learning is, in this sense, a special case of supervised learning where we use the extra unlabelled data to improve the predictive power.
Notation
• We have a labelled or complete data set D_l, comprising N_l pairs (x, c).
• We have an unlabelled or incomplete data set D_u, comprising N_u vectors x.
• The interesting (and common) case is when N_u ≫ N_l.
Baselines
• One attack on semi-supervised learning would be to ignore the labels and estimate an unsupervised clustering model.
• Another attack is to ignore the unlabelled data and just use the complete data.
• No free lunch: it is possible to construct data distributions for which either baseline outperforms a given SSL method.
• The art in SSL lies in designing good models for specific applications, rather than in general theory.
Generative SSL
• Generative SSL methods are the most intuitive.
• The graphical model for generative classification and unsupervised learning is the same.
• The likelihoods are combined: p(D|θ) = ∏_{i∈D_l} p(x_i, c_i|θ) · ∏_{j∈D_u} p(x_j|θ).
• Parameters are estimated by EM.

[Graphical model: parameters π and θ, class c, observed variable x]
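As a sketch of how the combined likelihood is optimised, here is EM for two Gaussian classes in which labelled points keep fixed one-hot responsibilities while unlabelled points receive posterior responsibilities in the E-step. Unit covariance is assumed known for brevity; that simplification is ours, not part of the general recipe.

```python
import numpy as np

def ssl_gaussian_em(Xl, cl, Xu, n_iter=50):
    """EM for two Gaussian classes combining labelled and unlabelled data.

    Labelled points contribute fixed one-hot responsibilities; unlabelled
    points get posterior responsibilities in the E-step. Unit covariance
    is assumed known; only the means and mixing weights are estimated.
    """
    X = np.vstack([Xl, Xu])
    Nl, Nu = len(Xl), len(Xu)
    gamma_l = np.eye(2)[cl]          # fixed responsibilities for labels
    # Initialise the means from the labelled data alone.
    mu = np.array([Xl[cl == 0].mean(0), Xl[cl == 1].mean(0)])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step (unlabelled only): p(c|x) under unit-covariance Gaussians.
        d = np.stack([((Xu - mu[j]) ** 2).sum(1) for j in range(2)], axis=1)
        log_r = np.log(pi) - 0.5 * d
        log_r -= log_r.max(1, keepdims=True)
        gamma_u = np.exp(log_r)
        gamma_u /= gamma_u.sum(1, keepdims=True)
        gamma = np.vstack([gamma_l, gamma_u])
        # M-step on the combined (labelled + unlabelled) data.
        Nk = gamma.sum(0)
        pi = Nk / (Nl + Nu)
        mu = (gamma.T @ X) / Nk[:, None]
    return mu, pi
```

With few labels and many unlabelled points, the unlabelled term dominates the mean estimates, which is exactly the situation the reweighting discussion below is about.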
Discriminant analysis
• McLachlan (1977) considered SSL for a generative model with two Gaussian classes.
• Assume the covariance to be known; the means need to be estimated from the data.
• Assume we have N_u = M_1 + M_2 ≪ N_l.
• Assume also that the labelled data is split evenly between the two classes, so N_l = 2n.
A surprising result
• Consider the expected misclassification error.
• Let R be the expected error obtained using only the labelled data, and R_1 the expected error obtained using all data (estimating parameters with EM).
• If …, the first-order expansion (in N_u/N_l) of R − R_1 is positive only for N_u < M*, with M* finite.
• In other words, unlabelled data helps only up to a point.
A way out
• Perhaps considering labelled and unlabelled data on the same footing is not ideal.
• McLachlan went on to propose a different estimator for the class means, weighting the contribution of the unlabelled data.
• He then shows that, for a suitable weight, the expected error obtained in this way is, to first order, always lower than the one obtained without the unlabelled data.
A hornets’ nest
• In general, it seems sensible to reweight the log-likelihood as λ log p(D_l|θ) + (1 − λ) log p(D_u|θ).
• This is partly because, in the important case when N_l is small, we do not want the label information to be swamped.
• But how do you choose λ? Cross-validation on the labels?
• Unfeasible for small N_l.
Stability
• An elegant solution was proposed by Corduneanu and Jaakkola (2002).
• In the limit when all data is labelled (λ = 1), the likelihood is unimodal.
• In the limit when all data is unlabelled (λ = 0), the likelihood is multimodal (e.g. under permutations of the classes).
• When moving from λ = 1 to λ = 0, we must encounter a critical λ* where the likelihood becomes multimodal.
• That λ* is the "optimal" choice.
Discriminative SSL
• The alternative paradigm for SSL is the discriminative approach.
• As in supervised learning, the discriminative approach models directly p(c|x).
• The total likelihood factorises as p(c, x|θ, μ) = p(c|x, θ) p(x|μ).

[Graphical model: parameters θ for p(c|x) and μ for p(x), observed variable x, class c]
Surprise!
• Using Bayes' theorem, we obtain the posterior over θ as p(θ|D) ∝ p(θ) ∏_{i∈D_l} p(c_i|x_i, θ): the unlabelled data drops out entirely.
• This is seriously worrying! Why? If θ and μ are a priori independent, the unlabelled inputs carry no information about θ, so the unlabelled data cannot change the posterior.
Regularization
• Bayesian methods cannot be employed in a purely discriminative SSL setting.
• There has been some limited success with non-Bayesian methods (e.g. Anderson 1978 for logistic regression over discrete variables).
• The most promising avenue is to modify the graphical model used.
• This is known as regularization.

[Graphical model: parameters μ and θ, observed variable x, class c]
Discriminative vs Generative
• The idea behind regularization is that information about the generative structure of x feeds into the discriminative process.
• Does it still make sense to talk about a discriminative approach?
• Strictly speaking, perhaps not.
• In practice, the hypotheses used for regularization are much weaker than modelling the full class-conditional distribution.
Cluster assumption
• Perhaps the most widely used form of regularization.
• It states that the boundary between classes should cross areas of low data density.
• It is often reasonable, particularly in high dimensions.
• Implemented in a host of Bayesian and non-Bayesian methods.
• Can be understood in terms of smoothness of the discriminant function.
Manifold assumption
• Another convenient assumption is that the data lies on a low-dimensional sub-manifold of a high-dimensional space
• Often the starting point is to map the data to a high dimensional space via a feature map
• The cluster assumption is then applied in the high dimensional space
• The key concept is the graph Laplacian.
• Ideas pioneered by Belkin and Niyogi.
Manifolds cont.
• We want discriminant functions f which vary little in regions of high data density.
• Mathematically, this is equivalent to the smoothness penalty ∫ f Δf (equivalently ∫ ‖∇f‖²) being very small, where Δ is the Laplace-Beltrami operator.
• The finite-sample approximation to Δ is given by the graph Laplacian L = D − W.
• The objective function for SSL is a combination of the error given by f and of its norm under L, i.e. fᵀLf.
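A minimal construction of the graph Laplacian from data, using a Gaussian affinity (the kernel choice and the bandwidth σ are illustrative assumptions). The quadratic form fᵀLf then penalises discriminant functions that vary across strongly connected, i.e. high-density, regions.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Unnormalised graph Laplacian L = D - W from a Gaussian-affinity
    graph on the sample points, as a finite-sample stand-in for the
    Laplace-Beltrami operator.

    The smoothness penalty f^T L f equals 0.5 * sum_ij W_ij (f_i - f_j)^2,
    so it is small when f changes little between strongly connected points.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.exp(-sq / (2 * sigma ** 2))                    # Gaussian affinities
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(1))                                 # degree matrix
    return D - W
```

L is symmetric positive semi-definite, and constant functions lie in its null space, so the penalty never charges a function that is flat over a connected component.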