Transfer Learning & Semi-supervised Learning
Kai Yu
Outline
§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks
Transfer Learning
8/10/12
Transfer Learning (TL)
Also known as multi-task learning [Caruana 1997], or learning to learn [Pratt and Thrun 1998].
Consider a set of related but different learning tasks.
Instead of treating them independently, solve them jointly.
Explore the commonality of the tasks, and generalize it to new tasks.
Why transfer learning?
§ In some tasks, training data are in short supply.
§ In some domains, the calibration effort is very expensive.
§ In some domains, the learning process is time consuming.
• How to extract knowledge learned from related domains to help learning in a target domain with only a few labeled examples?
• How to extract knowledge learned from related domains to speed up learning in a target domain?
Courtesy - Pan & Yang
The usual setting of the problem
Transfer Learning (TL) in the current practice
The scope can be quite generic, for example,
– knowledge of German can be transferred to learn Dutch
– knowledge of cars can be helpful for recognizing bikes
However, current research has mostly focused on a special case [e.g., Caruana 1997; Bakker and Heskes 2003; Evgeniou et al. 2004; Ando and Zhang 2004; Schwaighofer et al. 2004; Yu et al. 2005; Zhang et al. 2005; Argyriou et al. 2006]:
– Tasks share a common input/output space X × Y.
– Each task is to learn a predictive function f : X → Y.
The training data
The training data of transfer learning: observations of multiple functions acting on the same data space.
Despite its simplicity, this setting covers a lot of applications.
Example: Sensor Network Prediction
Application 1: Sensor network prediction
Light and temperature are measured in a hall. At each time, only several sensors (out of 54) are randomly on [Guestrin et al. 2005].
Predict the measurements at any location at a given time.
Each location is treated as a data example, and the signal distribution at each time is a 2-D function.
Challenge: the functions are always non-stationary.
Example: Recommendation
Application 2: Recommendation systems
Each movie can be seen as a data example, and each user's ratings are modeled by a function on movies. (The Netflix competition has a data set with about 480,000 users and 17,000 movies.)
Challenge: the movie features are often deficient or unavailable.
Example: Image classification across domains
[Figure: image classification pipelines across domains, each alternating coding and pooling stages]
Hierarchical Linear Models
Method 1: Hierarchical Linear Models
There has been a long tradition in statistics of applying hierarchical linear models to analyze repeated experiments [Gelman et al., Bayesian Data Analysis, second edition, Chapman & Hall/CRC, 2004].
A typical problem: predict the outcomes of a treatment for patients in various hospitals. Due to a variety of (unknown) conditions across hospitals, the outcomes are not i.i.d. samples, but conditioned on which hospital the patients are from.
Neural Networks
Method 2: Neural Networks
Bakker and Heskes (2003) proposed a neural network model for TL
– The middle layer represents a new representation of the input data.
– Each output node corresponds to one task.
The architecture summarizes many TL models.
Jointly Regularized Linear Models
Method 3: Regularized Linear TL
Traditionally, single-task learning learns a function f : X → Y that is generalizable from training data to test data, e.g.,

    \min_{w} \sum_{i \in O} \ell(y_i, w^\top x_i) + \lambda\, w^\top w    (1)
TL has two directions of generalization: (1) from data to data; and (2) from tasks to tasks. One formulation is to learn a regularization term transferrable to new tasks [Evgeniou et al., 2004; Ando and Zhang, 2004; Argyriou et al., 2006]:

    \min_{w_t, C} \sum_{t=1}^{m} \Big[ \underbrace{\sum_{i \in O_t} \ell(y_{it}, w_t^\top x_i) + \lambda\, w_t^\top C^{-1} w_t}_{\text{the } t\text{-th single learning task}} \Big] + \underbrace{\Omega(C)}_{\text{regularization for } C}    (2)
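To make formulation (2) concrete, here is a minimal numpy sketch under two assumptions not fixed by the slide: a squared loss ℓ, and the trace-norm-style choice of Ω(C) from Argyriou et al. (C positive semidefinite with unit trace, updated in closed form as (W Wᵀ)^{1/2} normalized by its trace). All function and variable names are illustrative, not from the slides.

```python
import numpy as np

def fit_multitask_ridge(Xs, ys, lam=0.1, n_iters=20, eps=1e-6):
    """Alternating minimization of objective (2): given C, each task is an
    independent ridge problem; given the task weights W, C has a closed-form
    update C ∝ (W W^T)^{1/2} (trace-norm style, as in Argyriou et al.)."""
    d = Xs[0].shape[1]
    C = np.eye(d) / d                      # start from an isotropic regularizer
    for _ in range(n_iters):
        C_inv = np.linalg.inv(C + eps * np.eye(d))
        # per-task ridge: w_t = (X_t' X_t + lam * C^{-1})^{-1} X_t' y_t
        W = np.column_stack([
            np.linalg.solve(X.T @ X + lam * C_inv, X.T @ y)
            for X, y in zip(Xs, ys)
        ])
        # C update: matrix square root of W W^T, normalized to unit trace
        vals, vecs = np.linalg.eigh(W @ W.T)
        root = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        C = root / (np.trace(root) + eps)
    return W, C

# toy tasks that all share one predictive direction
rng = np.random.default_rng(0)
u = np.array([1.0, 0.0, 0.0])
Xs, ys = [], []
for t in range(5):
    X = rng.normal(size=(30, 3))
    Xs.append(X)
    ys.append(X @ ((1 + 0.1 * t) * u) + 0.01 * rng.normal(size=30))
W, C = fit_multitask_ridge(Xs, ys)
```

Because the tasks share the direction u, the learned C concentrates its mass on that direction, which is exactly the "transferrable regularizer" the formulation is after.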
They all learn an implicit common feature space
The message: learn a better data representation
Neural Networks: the shared middle layer makes this obvious.
Regularized Linear TL: Given the eigen-decomposition C = U \Sigma^2 U^\top, let v_t = \Sigma^{-1} U^\top w_t and z_i = \Sigma U^\top x_i = P x_i; then the model becomes

    \min_{v_t, P} \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} \ell(y_{ti}, v_t^\top P x_i) + \lambda\, v_t^\top v_t \Big] + \Omega(P^\top P)    (3)

It means learning a linear mapping P : X → Z.
Hierarchical Linear Models: Under a prior w_t \mid \theta \sim N(0, C), the negative log-likelihood -\log p(\{y_{it}\}, \{w_t\} \mid C) is

    \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} -\log p(y_{ti} \mid w_t^\top x_i) + \frac{1}{2}\, w_t^\top C^{-1} w_t \Big] + \frac{m}{2} \log |C| + \text{const}    (4)
This formulation is the same as the regularized linear TL.
Transfer Learning via Gaussian Processes
Let’s look at those functions
One way to model the commonality of tasks is to assume that the functions share the same stochastic characteristics.
What are Gaussian Processes?
Introduction to Gaussian Processes (GPs)
For functions f(x) \sim GP, without specifying their parametric form, we can directly characterize their behavior by
– a mean function: E[f(x)] = g(x)
– a covariance (or kernel) function: Cov(f(x_i), f(x_j)) = K(x_i, x_j)
As a result, any finite collection [f(x_1), \ldots, f(x_n)] follows an n-dimensional Gaussian distribution.
GP is a Bayesian kernel machine for nonlinear learning.
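To illustrate (an aside, not from the slides): evaluating the kernel on any finite grid yields exactly that n-dimensional Gaussian, from which prior functions can be sampled. The squared-exponential kernel below is one common assumed choice; the names are mine.

```python
import numpy as np

def rbf_kernel(xs, ys, length=1.0):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)          # any finite collection of inputs
K = rbf_kernel(x, x)               # induced 50-dimensional covariance
# small jitter keeps the covariance numerically positive definite
samples = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50), size=3)
```

Each row of `samples` is one draw of [f(x_1), ..., f(x_50)] from the GP prior with zero mean function.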
Bayesian inference on function space
Bayesian inference on functions
(from Carl Rasmussen's tutorial)
Given \{(x_i, y_i)\}_{i=1}^{n} and a test point x_{test}, [y_1, \ldots, y_n, y_{test}] follow a joint (n+1)-dimensional Gaussian distribution N(0, K), where the covariance matrix K is computed between [x_1, \ldots, x_n, x_{test}]. The prediction is then a simple computation of the Gaussian posterior:

    p(y_{test} \mid \{y_i\}) = \frac{p(y_1, \ldots, y_n, y_{test})}{p(y_1, \ldots, y_n)}    (5)
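Prediction (5) amounts to Gaussian conditioning. A small numpy sketch, assuming a squared-exponential kernel and a bit of observation noise (both my choices, not fixed by the slide):

```python
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2, length=1.0):
    """Condition the joint (n+1)-dimensional Gaussian N(0, K) on the n
    observed outputs; returns the posterior mean and pointwise variance."""
    K = rbf(x_train, x_train, length) + noise * np.eye(len(x_train))
    K_star = rbf(x_test, x_train, length)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test, length) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# fit a smooth function from noisy-free samples and predict off-grid
x = np.linspace(0, 3, 20)
y = np.sin(x)
mean, var = gp_predict(x, y, np.array([1.5]))
```

With 20 training points on [0, 3], the posterior mean at x = 1.5 lands very close to sin(1.5), and the posterior variance shrinks near the data.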
Hierarchical GP for multiple functions
Hierarchical Gaussian Processes (HGP)
A generalization of hierarchical linear models
Functions are i.i.d. samples from a GP prior p(f \mid \theta).
Challenge: how to set \theta to get a sufficiently flexible GP?
Yu et al. ICML 2005
Learn joint covariance function
Inductive Learning
The algorithm learns only K and \mu at a finite set of data points X = \{x_i \mid i \in O\}. To handle a new x, do we have to add it into X and retrain the model?
The answer is NO!
It turns out that \mu and K have surprisingly simple analytical forms:

    \mu(x) = \sum_{i \in O} \xi_i K_0(x, x_i)    (7)

    K(x_i, x_j) = K_0(x_i, x_j) + \sum_{p,q \in O} K_0(x_i, x_p)\, \beta_{pq}\, K_0(x_q, x_j)    (8)

(K can be non-stationary even if K_0 is stationary.)
Therefore, at convergence of the EM algorithm, we can compute \{\xi_i\} and \{\beta_{pq}\}. Then both \mu and K are generalizable to the entire input space.
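Equations (7) and (8) are cheap to evaluate once \{\xi_i\} and \{\beta_{pq}\} are known. The sketch below plugs in placeholder values for them (the real ones come from EM at convergence); all names are illustrative.

```python
import numpy as np

def k0(a, b, length=1.0):
    """Base (stationary) kernel K0, here a squared-exponential."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def learned_mean(x_new, x_obs, xi):
    """Eq. (7): mu(x) = sum_i xi_i K0(x, x_i)."""
    return k0(x_new, x_obs) @ xi

def learned_kernel(a, b, x_obs, beta):
    """Eq. (8): K = K0 + sum_{p,q} K0(., x_p) beta_pq K0(x_q, .)."""
    return k0(a, b) + k0(a, x_obs) @ beta @ k0(x_obs, b)

x_obs = np.linspace(0, 2, 5)
xi = np.array([0.5, -0.2, 0.1, 0.3, -0.4])   # placeholder EM outputs
beta = 0.1 * np.eye(5)                        # placeholder, symmetric
x_new = np.array([0.7, 1.3])                  # points outside the training set
mu = learned_mean(x_new, x_obs, xi)
K = learned_kernel(x_new, x_new, x_obs, beta)
```

The point of the formulas is visible here: both functions accept arbitrary new inputs, so nothing has to be retrained when x moves off the original grid.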
Illustration of the idea
Experiments on synthetic data
Synthetic Data
Approaches to Transfer Learning
– Instance-transfer: re-weight some labeled data in a source domain for use in the target domain.
– Feature-representation-transfer: find a "good" feature representation that reduces the difference between a source and a target domain or minimizes the error of models.
– Model-transfer: discover shared parameters or priors of models between a source domain and a target domain.
– Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain.
(From "A survey on transfer learning", by Sinno Jialin Pan & Qiang Yang)
Outline
§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks
Slides courtesy: Jerry Zhu
Why semi-supervised learning (SSL)?
§ Promise: better performance for free.
§ Labeled data can be hard to get:
  – labels may require human experts
  – labels may require special devices
§ Unlabeled data is often cheap in large quantity.
What is SSL?
Part I What is SSL?
What is Semi-Supervised Learning?
Learning from both labeled and unlabeled data. Examples:
Semi-supervised classification: training data consist of l labeled instances \{(x_i, y_i)\}_{i=1}^{l} and u unlabeled instances \{x_j\}_{j=l+1}^{l+u}, often u \gg l. Goal: a better classifier f than from labeled data alone.
Constrained clustering: unlabeled instances \{x_i\}_{i=1}^{n}, and "supervised information", e.g., must-links, cannot-links. Goal: better clustering than from unlabeled data alone.
We will mainly discuss semi-supervised classification.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 6 / 99
Example of hard-to-get labels
Example of hard-to-get labels
Task: speech analysis
Switchboard dataset: telephone conversation transcription
400 hours of annotation time for each hour of speech
film → f ih n uh gl n m
be all → bcl b iy iy tr ao tr ao l dl
Another example
Another example of hard-to-get labels
Task: natural language parsing
Penn Chinese Treebank: 2 years for 4,000 sentences
Example sentence: "The National Track and Field Championship has finished."
A mixture model approach
Part I Mixture Models
A simple example of generative models
Labeled data (X_l, Y_l): [scatter plot of labeled instances from two classes]
Assuming each class has a Gaussian distribution, what is the decision boundary?
A simple mixture model formulation
Model parameters: \theta = \{w_1, w_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2\}
The GMM:

    p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta) = w_y\, N(x; \mu_y, \Sigma_y)

Classification: p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}
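The classification rule is just Bayes' rule over the mixture components. A small numpy sketch with made-up parameters \theta (two unit-covariance Gaussians; all names are illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def posterior(x, theta):
    """p(y | x, theta) = p(x, y | theta) / sum_y' p(x, y' | theta)."""
    joint = np.array([w * gauss_pdf(x, mu, cov) for w, mu, cov in theta])
    return joint / joint.sum()

theta = [
    (0.5, np.array([-2.0, 0.0]), np.eye(2)),   # class 1 component
    (0.5, np.array([+2.0, 0.0]), np.eye(2)),   # class 2 component
]
p = posterior(np.array([-1.5, 0.2]), theta)
```

A query point near the first component gets most of the posterior mass for class 1; with equal weights and covariances, the decision boundary is the hyperplane midway between the means.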
Using labeled data only
The most likely model, and its decision boundary: [plot of the fitted class Gaussians and the resulting decision boundary]
Add unlabeled data
Adding unlabeled data: [scatter plot with the unlabeled points added]
Most likely decision boundary
With unlabeled data, the most likely model and its decision boundary: [plot of the refitted model and the shifted decision boundary]
Results with and without unlabeled data
They are different because they maximize different quantities: p(X_l, Y_l \mid \theta) (labeled data only) versus p(X_l, Y_l, X_u \mid \theta) (labeled plus unlabeled data).
[Side-by-side plots of the two fitted models and their decision boundaries]
A generative model for SSL
Generative model for semi-supervised learning
Assumption: knowledge of the model form p(X, Y \mid \theta).
Joint and marginal likelihood:

    p(X_l, Y_l, X_u \mid \theta) = \sum_{Y_u} p(X_l, Y_l, X_u, Y_u \mid \theta)

Find the maximum likelihood estimate (MLE) of \theta, the maximum a posteriori (MAP) estimate, or be Bayesian.
Common mixture models used in semi-supervised learning:
– Mixture of Gaussian distributions (GMM): image classification
– Mixture of multinomial distributions (Naïve Bayes): text categorization
– Hidden Markov Models (HMM): speech recognition
Learning via the Expectation-Maximization (EM) algorithm (Baum-Welch).
Assumption: we know p(x, y \mid \theta). Learning via the EM algorithm.
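The generative SSL recipe above can be sketched with a two-class 1-D GMM: labeled points keep hard responsibilities, unlabeled points get soft responsibilities in the E-step, and the M-step is a weighted MLE. This is an illustrative sketch (function names and the unit-variance initialization are mine), not the tutorial's code.

```python
import numpy as np

def ssl_gmm_em(x_l, y_l, x_u, n_iters=50):
    """EM for a two-class 1-D GMM using labeled AND unlabeled data,
    maximizing p(X_l, Y_l, X_u | theta) = sum_{Y_u} p(X_l, Y_l, X_u, Y_u | theta)."""
    mu = np.array([x_l[y_l == c].mean() for c in (0, 1)])
    var = np.ones(2)                           # unit-variance start for stability
    w = np.array([np.mean(y_l == c) for c in (0, 1)])
    x = np.concatenate([x_l, x_u])
    r_l = np.eye(2)[y_l]                       # labels fix labeled responsibilities
    for _ in range(n_iters):
        # E-step on unlabeled data: r_ic ∝ w_c N(x_i; mu_c, var_c)
        lik = w * np.exp(-0.5 * (x_u[:, None] - mu) ** 2 / var) \
                / np.sqrt(2 * np.pi * var)
        r_u = lik / lik.sum(axis=1, keepdims=True)
        r = np.vstack([r_l, r_u])
        # M-step: responsibility-weighted MLE of w, mu, var
        n_c = r.sum(axis=0)
        w = n_c / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_c
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_c + 1e-6
    return w, mu, var

# two well-separated clusters, with only one labeled point per class
rng = np.random.default_rng(0)
x_u = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
x_l = np.array([-3.0, 3.0])
y_l = np.array([0, 1])
w, mu, var = ssl_gmm_em(x_l, y_l, x_u)
```

With just one labeled example per class, the unlabeled data pull the component means onto the two clusters, which is exactly the effect the decision-boundary pictures illustrate.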
A co-training method
Part I Co-training and Multiview Algorithms
Multiview learning
A special regularizer \Omega(f), defined on the unlabeled data, encourages agreement among multiple learners:

    \arg\min_{f_1, \ldots, f_k} \sum_{v=1}^{k} \Big( \sum_{i=1}^{l} c(x_i, y_i, f_v(x_i)) + \lambda_1 \Omega_{SL}(f_v) \Big) + \lambda_2 \sum_{u,v=1}^{k} \sum_{i=l+1}^{l+u} c(x_i, f_u(x_i), f_v(x_i))
Graph Laplacian
Part I Manifold Regularization and Graph-Based Algorithms
The graph Laplacian
We can also compute f in closed form using the graph Laplacian.
– n × n weight matrix W on X_l \cup X_u: symmetric, non-negative
– Diagonal degree matrix D: D_{ii} = \sum_{j=1}^{n} W_{ij}
– Graph Laplacian matrix: \Delta = D - W
The energy can be rewritten as

    \sum_{i \sim j} w_{ij} (f(x_i) - f(x_j))^2 = f^\top \Delta f
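The degree matrix, the Laplacian, and the energy identity can all be checked numerically on a toy graph (the weight matrix below is made up for illustration):

```python
import numpy as np

# small symmetric, non-negative weight matrix on 4 nodes
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.2],
    [0.5, 0.0, 0.0, 1.0],
    [0.0, 0.2, 1.0, 0.0],
])
D = np.diag(W.sum(axis=1))         # D_ii = sum_j W_ij
L = D - W                          # unnormalized graph Laplacian

f = np.array([1.0, 0.5, -0.3, 2.0])
# energy summed edge by edge: sum_{i~j} w_ij (f_i - f_j)^2
# (the 0.5 compensates for counting each ordered pair twice)
energy = 0.5 * sum(
    W[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4)
)
```

Here `energy` equals `f @ L @ f` exactly, which is the identity the slide states; note also that every row of L sums to zero by construction.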
Harmonic solution with Laplacian
The harmonic solution minimizes the energy subject to the given labels:

    \min_{f} \infty \sum_{i=1}^{l} (f(x_i) - y_i)^2 + f^\top \Delta f

Partition the Laplacian matrix:

    \Delta = \begin{pmatrix} \Delta_{ll} & \Delta_{lu} \\ \Delta_{ul} & \Delta_{uu} \end{pmatrix}

Harmonic solution:

    f_u = -\Delta_{uu}^{-1} \Delta_{ul} Y_l

The normalized Laplacian L = D^{-1/2} \Delta D^{-1/2} = I - D^{-1/2} W D^{-1/2}, or \Delta^p, L^p (p > 0), are often used too.
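A minimal sketch of the closed-form harmonic solution, assuming labeled nodes are ordered first; the chain graph below is a toy example (node order and names are mine):

```python
import numpy as np

def harmonic_solution(W, y_l):
    """f_u = -Laplacian_uu^{-1} Laplacian_ul Y_l, with the first l nodes labeled."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W
    return -np.linalg.solve(L[l:, l:], L[l:, :l] @ y_l)

# chain a-b-c-d with the two endpoints labeled; nodes reordered so the
# labeled ones (a, d) come first: order = [a, d, b, c]
W = np.array([
    [0.0, 0.0, 1.0, 0.0],   # a - b
    [0.0, 0.0, 0.0, 1.0],   # d - c
    [1.0, 0.0, 0.0, 1.0],   # b - a, b - c
    [0.0, 1.0, 1.0, 0.0],   # c - d, c - b
])
y_l = np.array([0.0, 1.0])  # f(a) = 0, f(d) = 1
f_u = harmonic_solution(W, y_l)
```

On a chain the harmonic function interpolates linearly, so the two interior nodes get values 1/3 and 2/3: each unlabeled value is the average of its neighbors, which is the "harmonic" property.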
Reference
References
See the references in: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool, 2009.
Concluding remarks
§ In both transfer learning and semi-supervised learning, the key is the regularization term, which reflects some assumption, or prior knowledge.
§ No assumption, no learning.