Transfer Learning & Semi-supervised Learning
Kai Yu
Outline
§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks
Transfer Learning
8/10/12
Transfer Learning (TL)
Also known as multi-task learning [Caruana 1997], or learning to learn [Pratt and Thrun 1998].
Consider a set of related but different learning tasks.
Instead of treating them independently, solve them jointly.
Explore the commonality of the tasks, and generalize it to new tasks.
Why transfer learning?
§ In some tasks, training data are in short supply.
§ In some domains, the calibration effort is very expensive.
§ In some domains, the learning process is time consuming.
• How to extract knowledge learned from related domains to help learning in a target domain with only a few labeled examples?
• How to extract knowledge learned from related domains to speed up learning in a target domain?
Courtesy - Pan & Yang
The usual setting of the problem
Transfer Learning (TL) in the current practice
The scope can be quite generic, for example,
– knowledge of German can be transferred to learn Dutch
– knowledge of cars can be helpful for recognizing bikes
However, current research has mostly focused on a special case [e.g., Caruana 1997; Bakker and Heskes 2003; Evgeniou et al. 2004; Ando and Zhang 2004; Schwaighofer et al. 2004; Yu et al. 2005; Zhang et al. 2005; Argyriou et al. 2006]:
– Tasks share a common input/output space X × Y.
– Each task is to learn a predictive function f : X → Y.
The training data
The training data of transfer learning: observations of multiple functions acting on the same data space.
Despite its simplicity, this setting covers a lot of applications.
Example: Sensor Network Prediction
Application 1: Sensor network prediction
Light and temperature are measured in a hall. At each time, only several sensors (out of 54) are randomly on [Guestrin et al. 2005].
Predict the measurements at any location at a given time.
Each location is treated as a data example, and the signal distribution at each time is a 2-D function.
Challenge: the functions are always non-stationary.
Example: Recommendation
Application 2: Recommendation systems
Each movie can be seen as a data example, and each user's ratings are modeled by a function on movies. (The Netflix competition has a data set with about 480,000 users and 17,000 movies.)
Challenge: the movie features are often deficient or unavailable.
Example: Image classification across domains
[Figure: image classification pipelines across domains, each alternating coding and pooling stages]
Hierarchical Linear Models
Method 1: Hierarchical Linear Models
There has been a long tradition in statistics of applying hierarchical linear models to analyze repeated experiments [Gelman et al., Bayesian Data Analysis, second edition, Chapman & Hall/CRC, 2004].
A typical problem: predict the outcomes of a treatment for patients in various hospitals. Due to a variety of (unknown) conditions across hospitals, the outcomes are not i.i.d. samples, but conditioned on which hospital the patients are from.
Neural Networks
Method 2: Neural Networks
Bakker and Heskes (2003) proposed a neural network model for TL
– The middle layer represents a new representation of the input data.
– Each output node corresponds to one task.
The architecture summarizes many TL models.
Jointly Regularized Linear Models
Method 3: Regularized Linear TL
Traditionally, single-task learning learns a function f : X → Y that is generalizable from training data to test data, e.g.,

    \min_{w} \sum_{i \in O} \ell(y_i, w^\top x_i) + \lambda\, w^\top w    (1)
TL has two directions of generalization: (1) from data to data; and (2) from tasks to tasks. One formulation is to learn a regularization term transferrable to new tasks [Evgeniou et al., 2004; Ando and Zhang, 2004; Argyriou et al., 2006]:

    \min_{w_t, C} \sum_{t=1}^{m} \Big[ \underbrace{\sum_{i \in O_t} \ell(y_{it}, w_t^\top x_i) + \lambda\, w_t^\top C^{-1} w_t}_{\text{the } t\text{-th single learning task}} \Big] + \underbrace{\Omega(C)}_{\text{regularization for } C}    (2)
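To make formulation (2) concrete, here is a minimal numpy sketch under two assumptions not fixed by the slide: a squared loss ℓ, and the trace-norm-style choice of Ω(C) from Argyriou et al. (C positive semidefinite with unit trace, updated in closed form as (W Wᵀ)^{1/2} normalized by its trace). All function and variable names are illustrative, not from the slides.

```python
import numpy as np

def fit_multitask_ridge(Xs, ys, lam=0.1, n_iters=20, eps=1e-6):
    """Alternating minimization of objective (2): given C, each task is an
    independent ridge problem; given the task weights W, C has a closed-form
    update C ∝ (W W^T)^{1/2} (trace-norm style, as in Argyriou et al.)."""
    d = Xs[0].shape[1]
    C = np.eye(d) / d                      # start from an isotropic regularizer
    for _ in range(n_iters):
        C_inv = np.linalg.inv(C + eps * np.eye(d))
        # per-task ridge: w_t = (X_t' X_t + lam * C^{-1})^{-1} X_t' y_t
        W = np.column_stack([
            np.linalg.solve(X.T @ X + lam * C_inv, X.T @ y)
            for X, y in zip(Xs, ys)
        ])
        # C update: matrix square root of W W^T, normalized to unit trace
        vals, vecs = np.linalg.eigh(W @ W.T)
        root = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        C = root / (np.trace(root) + eps)
    return W, C

# toy tasks that all share one predictive direction
rng = np.random.default_rng(0)
u = np.array([1.0, 0.0, 0.0])
Xs, ys = [], []
for t in range(5):
    X = rng.normal(size=(30, 3))
    Xs.append(X)
    ys.append(X @ ((1 + 0.1 * t) * u) + 0.01 * rng.normal(size=30))
W, C = fit_multitask_ridge(Xs, ys)
```

Because the tasks share the direction u, the learned C concentrates its mass on that direction, which is exactly the "transferrable regularizer" the formulation is after.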
They all learn an implicit common feature space
The message: learn a better data representation
Neural Networks: the shared middle layer makes this obvious.
Regularized Linear TL: Given the eigen-decomposition C = U \Sigma^2 U^\top, let v_t = \Sigma^{-1} U^\top w_t and z_i = \Sigma U^\top x_i = P x_i; then the model becomes

    \min_{v_t, P} \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} \ell(y_{ti}, v_t^\top P x_i) + \lambda\, v_t^\top v_t \Big] + \Omega(P^\top P)    (3)

It means learning a linear mapping P : X → Z.
Hierarchical Linear Models: Under a prior w_t \mid \theta \sim N(0, C), the negative log-likelihood -\log p(\{y_{it}\}, \{w_t\} \mid C) is

    \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} -\log p(y_{ti} \mid w_t^\top x_i) + \frac{1}{2}\, w_t^\top C^{-1} w_t \Big] + \frac{m}{2} \log |C| + \text{const}    (4)
This formulation is the same as the regularized linear TL.
Transfer Learning via Gaussian Processes
Let’s look at those functions
One way to model the commonality of tasks is to assume that the functions share the same stochastic characteristics.
What are Gaussian Processes?
Introduction to Gaussian Processes (GPs)
For functions f(x) \sim GP, without specifying their parametric form, we can directly characterize their behavior by
– a mean function: E[f(x)] = g(x)
– a covariance (or kernel) function: Cov(f(x_i), f(x_j)) = K(x_i, x_j)
As a result, any finite collection [f(x_1), \ldots, f(x_n)] follows an n-dimensional Gaussian distribution.
GP is a Bayesian kernel machine for nonlinear learning.
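To illustrate (an aside, not from the slides): evaluating the kernel on any finite grid yields exactly that n-dimensional Gaussian, from which prior functions can be sampled. The squared-exponential kernel below is one common assumed choice; the names are mine.

```python
import numpy as np

def rbf_kernel(xs, ys, length=1.0):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)          # any finite collection of inputs
K = rbf_kernel(x, x)               # induced 50-dimensional covariance
# small jitter keeps the covariance numerically positive definite
samples = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50), size=3)
```

Each row of `samples` is one draw of [f(x_1), ..., f(x_50)] from the GP prior with zero mean function.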
Bayesian inference on function space
Bayesian inference on functions
(from Carl Rasmussen's tutorial)
Given \{(x_i, y_i)\}_{i=1}^{n} and a test point x_{test}, [y_1, \ldots, y_n, y_{test}] follow a joint (n+1)-dimensional Gaussian distribution N(0, K), where the covariance matrix K is computed between [x_1, \ldots, x_n, x_{test}]. The prediction is then a simple computation of the Gaussian posterior:

    p(y_{test} \mid \{y_i\}) = \frac{p(y_1, \ldots, y_n, y_{test})}{p(y_1, \ldots, y_n)}    (5)
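Prediction (5) amounts to Gaussian conditioning. A small numpy sketch, assuming a squared-exponential kernel and a bit of observation noise (both my choices, not fixed by the slide):

```python
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2, length=1.0):
    """Condition the joint (n+1)-dimensional Gaussian N(0, K) on the n
    observed outputs; returns the posterior mean and pointwise variance."""
    K = rbf(x_train, x_train, length) + noise * np.eye(len(x_train))
    K_star = rbf(x_test, x_train, length)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test, length) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# fit a smooth function from noisy-free samples and predict off-grid
x = np.linspace(0, 3, 20)
y = np.sin(x)
mean, var = gp_predict(x, y, np.array([1.5]))
```

With 20 training points on [0, 3], the posterior mean at x = 1.5 lands very close to sin(1.5), and the posterior variance shrinks near the data.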
Hierarchical GP for multiple functions
Hierarchical Gaussian Processes (HGP)
A generalization of hierarchical linear models
Functions are i.i.d. samples from a GP prior p(f \mid \theta).
Challenge: how to set \theta to get a sufficiently flexible GP?
Yu et al. ICML 2005
Learn joint covariance function
Inductive Learning
The algorithm learns only K and \mu at a finite set of data points X = \{x_i \mid i \in O\}. To handle a new x, do we have to add it into X and retrain the model?
The answer is NO!
It turns out that \mu and K have surprisingly simple analytical forms:

    \mu(x) = \sum_{i \in O} \xi_i K_0(x, x_i)    (7)

    K(x_i, x_j) = K_0(x_i, x_j) + \sum_{p,q \in O} K_0(x_i, x_p)\, \beta_{pq}\, K_0(x_q, x_j)    (8)

(K can be non-stationary even if K_0 is stationary.)
Therefore, at convergence of the EM algorithm, we can compute \{\xi_i\} and \{\beta_{pq}\}. Then both \mu and K are generalizable to the entire input space.
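Equations (7) and (8) are cheap to evaluate once \{\xi_i\} and \{\beta_{pq}\} are known. The sketch below plugs in placeholder values for them (the real ones come from EM at convergence); all names are illustrative.

```python
import numpy as np

def k0(a, b, length=1.0):
    """Base (stationary) kernel K0, here a squared-exponential."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def learned_mean(x_new, x_obs, xi):
    """Eq. (7): mu(x) = sum_i xi_i K0(x, x_i)."""
    return k0(x_new, x_obs) @ xi

def learned_kernel(a, b, x_obs, beta):
    """Eq. (8): K = K0 + sum_{p,q} K0(., x_p) beta_pq K0(x_q, .)."""
    return k0(a, b) + k0(a, x_obs) @ beta @ k0(x_obs, b)

x_obs = np.linspace(0, 2, 5)
xi = np.array([0.5, -0.2, 0.1, 0.3, -0.4])   # placeholder EM outputs
beta = 0.1 * np.eye(5)                        # placeholder, symmetric
x_new = np.array([0.7, 1.3])                  # points outside the training set
mu = learned_mean(x_new, x_obs, xi)
K = learned_kernel(x_new, x_new, x_obs, beta)
```

The point of the formulas is visible here: both functions accept arbitrary new inputs, so nothing has to be retrained when x moves off the original grid.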
Illustration of the idea
Experiments on synthetic data
Synthetic Data
Approaches to Transfer Learning
– Instance-transfer: re-weight some labeled data in a source domain for use in the target domain.
– Feature-representation-transfer: find a "good" feature representation that reduces the difference between a source and a target domain or minimizes the error of models.
– Model-transfer: discover shared parameters or priors of models between a source domain and a target domain.
– Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain.
(From "A survey on transfer learning", by Sinno Jialin Pan & Qiang Yang)
Outline
§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks
Slides courtesy: Jerry Zhu
Why semi-supervised learning (SSL)?
§ Promise: better performance for free.
§ Labeled data can be hard to get:
  – labels may require human experts
  – labels may require special devices
§ Unlabeled data is often cheap in large quantity.
What is SSL?
Part I What is SSL?
What is Semi-Supervised Learning?
Learning from both labeled and unlabeled data. Examples:
Semi-supervised classification: training data consist of l labeled instances \{(x_i, y_i)\}_{i=1}^{l} and u unlabeled instances \{x_j\}_{j=l+1}^{l+u}, often u \gg l. Goal: a better classifier f than from labeled data alone.
Constrained clustering: unlabeled instances \{x_i\}_{i=1}^{n}, and "supervised information", e.g., must-links, cannot-links. Goal: better clustering than from unlabeled data alone.
We will mainly discuss semi-supervised classification.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 6 / 99
Example of hard-to-get labels
Example of hard-to-get labels
Task: speech analysis
Switchboard dataset: telephone conversation transcription
400 hours of annotation time for each hour of speech
film → f ih n uh gl n m
be all → bcl b iy iy tr ao tr ao l dl
Another example
Another example of hard-to-get labels
Task: natural language parsing
Penn Chinese Treebank: 2 years for 4,000 sentences
Example sentence: "The National Track and Field Championship has finished."
A mixture model approach
Part I Mixture Models
A simple example of generative models
Labeled data (X_l, Y_l): [scatter plot of labeled instances from two classes]
Assuming each class has a Gaussian distribution, what is the decision boundary?
A simple mixture model formulation
Model parameters: \theta = \{w_1, w_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2\}
The GMM:

    p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta) = w_y\, N(x; \mu_y, \Sigma_y)

Classification: p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}
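The classification rule is just Bayes' rule over the mixture components. A small numpy sketch with made-up parameters \theta (two unit-covariance Gaussians; all names are illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def posterior(x, theta):
    """p(y | x, theta) = p(x, y | theta) / sum_y' p(x, y' | theta)."""
    joint = np.array([w * gauss_pdf(x, mu, cov) for w, mu, cov in theta])
    return joint / joint.sum()

theta = [
    (0.5, np.array([-2.0, 0.0]), np.eye(2)),   # class 1 component
    (0.5, np.array([+2.0, 0.0]), np.eye(2)),   # class 2 component
]
p = posterior(np.array([-1.5, 0.2]), theta)
```

A query point near the first component gets most of the posterior mass for class 1; with equal weights and covariances, the decision boundary is the hyperplane midway between the means.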
Using labeled data only
The most likely model, and its decision boundary: [plot of the fitted class Gaussians and the resulting decision boundary]
Add unlabeled data
Adding unlabeled data: [scatter plot with the unlabeled points added]
Most likely decision boundary
With unlabeled data, the most likely model and its decision boundary: [plot of the refitted model and the shifted decision boundary]
Results with and without unlabeled data
They are different because they maximize different quantities: p(X_l, Y_l \mid \theta) (labeled data only) versus p(X_l, Y_l, X_u \mid \theta) (labeled plus unlabeled data).
[Side-by-side plots of the two fitted models and their decision boundaries]
A generative model for SSL
Generative model for semi-supervised learning
Assumption: knowledge of the model form p(X, Y \mid \theta).
Joint and marginal likelihood:

    p(X_l, Y_l, X_u \mid \theta) = \sum_{Y_u} p(X_l, Y_l, X_u, Y_u \mid \theta)

Find the maximum likelihood estimate (MLE) of \theta, the maximum a posteriori (MAP) estimate, or be Bayesian.
Common mixture models used in semi-supervised learning:
– Mixture of Gaussian distributions (GMM): image classification
– Mixture of multinomial distributions (Naïve Bayes): text categorization
– Hidden Markov Models (HMM): speech recognition
Learning via the Expectation-Maximization (EM) algorithm (Baum-Welch).
Assumption: we know p(x, y \mid \theta). Learning via the EM algorithm.
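The generative SSL recipe above can be sketched with a two-class 1-D GMM: labeled points keep hard responsibilities, unlabeled points get soft responsibilities in the E-step, and the M-step is a weighted MLE. This is an illustrative sketch (function names and the unit-variance initialization are mine), not the tutorial's code.

```python
import numpy as np

def ssl_gmm_em(x_l, y_l, x_u, n_iters=50):
    """EM for a two-class 1-D GMM using labeled AND unlabeled data,
    maximizing p(X_l, Y_l, X_u | theta) = sum_{Y_u} p(X_l, Y_l, X_u, Y_u | theta)."""
    mu = np.array([x_l[y_l == c].mean() for c in (0, 1)])
    var = np.ones(2)                           # unit-variance start for stability
    w = np.array([np.mean(y_l == c) for c in (0, 1)])
    x = np.concatenate([x_l, x_u])
    r_l = np.eye(2)[y_l]                       # labels fix labeled responsibilities
    for _ in range(n_iters):
        # E-step on unlabeled data: r_ic ∝ w_c N(x_i; mu_c, var_c)
        lik = w * np.exp(-0.5 * (x_u[:, None] - mu) ** 2 / var) \
                / np.sqrt(2 * np.pi * var)
        r_u = lik / lik.sum(axis=1, keepdims=True)
        r = np.vstack([r_l, r_u])
        # M-step: responsibility-weighted MLE of w, mu, var
        n_c = r.sum(axis=0)
        w = n_c / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_c
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_c + 1e-6
    return w, mu, var

# two well-separated clusters, with only one labeled point per class
rng = np.random.default_rng(0)
x_u = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
x_l = np.array([-3.0, 3.0])
y_l = np.array([0, 1])
w, mu, var = ssl_gmm_em(x_l, y_l, x_u)
```

With just one labeled example per class, the unlabeled data pull the component means onto the two clusters, which is exactly the effect the decision-boundary pictures illustrate.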
A co-training method
Part I Co-training and Multiview Algorithms
Multiview learning
A special regularizer \Omega(f), defined on the unlabeled data, encourages agreement among multiple learners:

    \arg\min_{f_1, \ldots, f_k} \sum_{v=1}^{k} \Big( \sum_{i=1}^{l} c(x_i, y_i, f_v(x_i)) + \lambda_1 \Omega_{SL}(f_v) \Big) + \lambda_2 \sum_{u,v=1}^{k} \sum_{i=l+1}^{l+u} c(x_i, f_u(x_i), f_v(x_i))
Graph Laplacian
Part I Manifold Regularization and Graph-Based Algorithms
The graph Laplacian
We can also compute f in closed form using the graph Laplacian.
– n × n weight matrix W on X_l \cup X_u: symmetric, non-negative
– Diagonal degree matrix D: D_{ii} = \sum_{j=1}^{n} W_{ij}
– Graph Laplacian matrix: \Delta = D - W
The energy can be rewritten as

    \sum_{i \sim j} w_{ij} (f(x_i) - f(x_j))^2 = f^\top \Delta f
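The degree matrix, the Laplacian, and the energy identity can all be checked numerically on a toy graph (the weight matrix below is made up for illustration):

```python
import numpy as np

# small symmetric, non-negative weight matrix on 4 nodes
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.2],
    [0.5, 0.0, 0.0, 1.0],
    [0.0, 0.2, 1.0, 0.0],
])
D = np.diag(W.sum(axis=1))         # D_ii = sum_j W_ij
L = D - W                          # unnormalized graph Laplacian

f = np.array([1.0, 0.5, -0.3, 2.0])
# energy summed edge by edge: sum_{i~j} w_ij (f_i - f_j)^2
# (the 0.5 compensates for counting each ordered pair twice)
energy = 0.5 * sum(
    W[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4)
)
```

Here `energy` equals `f @ L @ f` exactly, which is the identity the slide states; note also that every row of L sums to zero by construction.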
Harmonic solution with Laplacian
The harmonic solution minimizes the energy subject to the given labels:

    \min_{f} \infty \sum_{i=1}^{l} (f(x_i) - y_i)^2 + f^\top \Delta f

Partition the Laplacian matrix:

    \Delta = \begin{pmatrix} \Delta_{ll} & \Delta_{lu} \\ \Delta_{ul} & \Delta_{uu} \end{pmatrix}

Harmonic solution:

    f_u = -\Delta_{uu}^{-1} \Delta_{ul} Y_l

The normalized Laplacian L = D^{-1/2} \Delta D^{-1/2} = I - D^{-1/2} W D^{-1/2}, or \Delta^p, L^p (p > 0), are often used too.
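A minimal sketch of the closed-form harmonic solution, assuming labeled nodes are ordered first; the chain graph below is a toy example (node order and names are mine):

```python
import numpy as np

def harmonic_solution(W, y_l):
    """f_u = -Laplacian_uu^{-1} Laplacian_ul Y_l, with the first l nodes labeled."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W
    return -np.linalg.solve(L[l:, l:], L[l:, :l] @ y_l)

# chain a-b-c-d with the two endpoints labeled; nodes reordered so the
# labeled ones (a, d) come first: order = [a, d, b, c]
W = np.array([
    [0.0, 0.0, 1.0, 0.0],   # a - b
    [0.0, 0.0, 0.0, 1.0],   # d - c
    [1.0, 0.0, 0.0, 1.0],   # b - a, b - c
    [0.0, 1.0, 1.0, 0.0],   # c - d, c - b
])
y_l = np.array([0.0, 1.0])  # f(a) = 0, f(d) = 1
f_u = harmonic_solution(W, y_l)
```

On a chain the harmonic function interpolates linearly, so the two interior nodes get values 1/3 and 2/3: each unlabeled value is the average of its neighbors, which is the "harmonic" property.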
Reference
References
See the references in: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool, 2009.
Concluding remarks
§ In both transfer learning and semi-supervised learning, the key is the regularization term, which reflects some assumption, or prior knowledge.
§ No assumption, no learning.