Transfer Learning & Semi-supervised Learning


Transcript of Transfer Learning & Semi-supervised Learning

Page 1: Transfer Learning & Semi-supervised Learning

Transfer Learning & Semi-supervised Learning

Kai Yu

Page 2: Transfer Learning & Semi-supervised Learning


Outline

§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks

Page 3: Transfer Learning & Semi-supervised Learning

Transfer Learning


Transfer Learning (TL)

Also known as multi-task learning [Caruana 1997], or learning to learn [Pratt and Thrun 1998].

Consider a set of related but different learning tasks.

Instead of treating them independently, solve them jointly.

Explore the commonality of tasks, and generalize it to new tasks.

Page 4: Transfer Learning & Semi-supervised Learning

Why transfer learning?

§ In some tasks training data are in short supply.
§ In some domains, the calibration effort is very expensive.
§ In some domains, the learning process is time consuming.

• How to extract knowledge learnt from related domains to help learning in a target domain with a few labeled data?
• How to extract knowledge learnt from related domains to speed up learning in a target domain?

Courtesy - Pan & Yang

Page 5: Transfer Learning & Semi-supervised Learning

The usual setting of the problem


Transfer Learning (TL) in the current practice

The scope can be quite generic, for example,

– knowledge of German can be transferred to learn Dutch
– knowledge of cars can be helpful to recognition of bikes

However, the current research has mostly focused on a special case:

– Tasks share a common input/output space X × Y.

– Each task is to learn a predictive function f : X → Y.

(e.g., Caruana, 1997; Bakker and Heskes, 2003; Evgeniou et al., 2004; Ando and Zhang, 2004; Schwaighofer et al., 2004; Yu et al., 2005; Zhang et al., 2005; Argyriou et al., 2006, ...)

Page 6: Transfer Learning & Semi-supervised Learning

The training data


The training data

The training data of transfer learning: observations of multiple functions acting on the same data space.

Despite its simplicity, this setting covers a lot of applications.

Page 7: Transfer Learning & Semi-supervised Learning

Example: Sensor Network Prediction


Application 1: Sensor network prediction

Light and temperature are measured in a hall. At each time, only several sensors (out of 54) are randomly on [Guestrin et al., 2005].

Predict the measurements at any location at one time.

Each location is treated as a data example, and the signal distribution at each time is a 2-D function.

Challenge: the functions are always non-stationary.

Page 8: Transfer Learning & Semi-supervised Learning

Example: Recommendation


Application 2: Recommendation systems

Each movie can be seen as a data example, and each user's ratings are modeled by a function on movies.

Challenge: the movie features are often deficient or unavailable.

(The Netflix competition data set has about 480,000 users and 17,000 movies.)

Page 9: Transfer Learning & Semi-supervised Learning

Example: Image classification across domains


[Figure: image classification pipelines across domains, each with coding and pooling stages.]

Page 10: Transfer Learning & Semi-supervised Learning

Hierarchical Linear Models


Method 1: Hierarchical Linear Models

There has been a long tradition in statistics of applying hierarchical linear models to analyze repeated experiments (Gelman et al., Bayesian Data Analysis, second edition, Chapman & Hall/CRC, 2004).

A typical problem: predict outcomes of a treatment given to patients in various hospitals. Due to a variety of (unknown) conditions across hospitals, the outcomes are not i.i.d. samples, but are conditioned on which hospital the patients are from.

Page 11: Transfer Learning & Semi-supervised Learning

Neural Networks


Method 2: Neural Networks

Bakker and Heskes (2003) proposed a neural network model for TL

– The middle layer represents a new representation of input data.
– Each output node corresponds to one task.

The architecture summarizes many TL models.
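To make the architecture concrete, here is a minimal sketch of a multi-task network in the spirit described above. It is not Bakker and Heskes' exact model: the PyTorch framing, hidden size, tanh activation, and single-output heads are assumptions for illustration. A shared hidden layer plays the role of the common representation, and each task gets its own output node.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared hidden layer (common representation) + one output head per task."""
    def __init__(self, n_inputs: int, n_hidden: int, n_tasks: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Tanh())
        self.heads = nn.ModuleList(nn.Linear(n_hidden, 1) for _ in range(n_tasks))

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        z = self.shared(x)           # middle layer: representation shared by all tasks
        return self.heads[task](z)   # task-specific output node
```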

Page 12: Transfer Learning & Semi-supervised Learning

Jointly Regularized Linear Models


Method 3: Regularized Linear TL

Traditionally, single-task learning learns a function f : X → Y that is generalizable from training data to test data, e.g.,

$$ \min_{w} \; \sum_{i \in O} \ell\big(y_i,\, w^\top x_i\big) + \lambda\, w^\top w \qquad (1) $$

TL has two directions of generalization: (1) from data to data; and (2) from tasks to tasks. One formulation is to learn a regularization term transferrable to new tasks (Evgeniou et al., 2004; Ando and Zhang, 2004; Argyriou et al., 2006):

$$ \min_{\{w_t\},\, C} \; \sum_{t=1}^{m} \Big[ \underbrace{\sum_{i \in O_t} \ell\big(y_{it},\, w_t^\top x_i\big) + \lambda\, w_t^\top C^{-1} w_t}_{\text{the } t\text{-th single learning task}} \Big] + \underbrace{\Omega(C)}_{\text{regularization for } C} \qquad (2) $$
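As a hedged illustration of objective (2), the sketch below alternates between per-task ridge regressions and a closed-form update of the shared matrix C, assuming squared loss and a trace constraint tr(C) = 1 playing the role of Ω(C), in the style of Argyriou et al.; the function and variable names are mine, not from the slides.

```python
import numpy as np

def multitask_ridge(Xs, ys, lam=0.1, n_iter=20, eps=1e-6):
    """Alternating minimization sketch for objective (2), assuming squared loss
    and a trace constraint tr(C) = 1 as the regularizer Omega(C)."""
    d = Xs[0].shape[1]
    C = np.eye(d) / d                      # start from an isotropic shared metric
    W = np.zeros((d, len(Xs)))
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C + eps * np.eye(d))
        # Step 1: per-task ridge regression with the shared metric C fixed
        for t, (X, y) in enumerate(zip(Xs, ys)):
            W[:, t] = np.linalg.solve(X.T @ X + lam * Cinv, X.T @ y)
        # Step 2: closed-form C update under tr(C) = 1, C positive semidefinite
        S = W @ W.T
        vals, vecs = np.linalg.eigh(S + eps * np.eye(d))
        sqrtS = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        C = sqrtS / np.trace(sqrtS)
    return W, C
```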

Page 13: Transfer Learning & Semi-supervised Learning

They all learn an implicit common feature space


The message: learn a better data representation

Neural Networks: the shared representation is obvious (it is the middle layer).

Regularized Linear TL: Given the eigen-decomposition C = U Σ² Uᵀ, let v_t = Σ⁻¹ Uᵀ w_t and z_i = Σ Uᵀ x_i = P x_i; then the model becomes

$$ \min_{\{v_t\},\, P} \; \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} \ell\big(y_{ti},\, v_t^\top P x_i\big) + \lambda\, v_t^\top v_t \Big] + \Omega(P^\top P) \qquad (3) $$

It means learning a linear mapping P : X → Z.

Hierarchical Linear Models: Under a prior w_t | θ ∼ N(0, C), the negative log-likelihood −log p({y_it}, {w_t} | C) is

$$ \sum_{t=1}^{m} \Big[ \sum_{i \in O_t} -\log p\big(y_{ti} \mid w_t^\top x_i\big) + \tfrac{1}{2}\, w_t^\top C^{-1} w_t \Big] + \tfrac{m}{2} \log |C| + \text{const} \qquad (4) $$

This is the same formulation as the regularized linear TL.
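The change of variables behind (3) can be checked numerically. The short script below is my own illustration, not part of the slides; it confirms that predictions and penalties are unchanged when moving from (w_t, C) to (v_t, P).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)              # a positive-definite task covariance
w = rng.standard_normal(d)
x = rng.standard_normal(d)

# Eigen-decomposition C = U Sigma^2 U^T, with Sigma = diag of sqrt eigenvalues
vals, U = np.linalg.eigh(C)
Sigma = np.diag(np.sqrt(vals))

v = np.linalg.inv(Sigma) @ U.T @ w   # v = Sigma^{-1} U^T w
P = Sigma @ U.T                      # z = Sigma U^T x = P x

# The two parameterizations agree:
assert np.isclose(w @ x, v @ (P @ x))                   # predictions match
assert np.isclose(w @ np.linalg.solve(C, w), v @ v)     # penalties match
assert np.allclose(P.T @ P, C)                          # Omega(P^T P) = Omega(C)
print("change of variables verified")
```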

Page 14: Transfer Learning & Semi-supervised Learning

Transfer Learning via Gaussian Processes


Let’s look at those functions

One way to model the commonality of tasks is to assume that the functions share the same stochastic characteristics.

Page 15: Transfer Learning & Semi-supervised Learning

What is Gaussian Processes


Introduction to Gaussian Processes (GPs)

For functions f(x) ∼ GP(f), without specifying their parametric forms, we can directly characterize their behaviors by

– a mean function: E[f(x)] = g(x)

– a covariance (or kernel) function: Cov(f(x_i), f(x_j)) = K(x_i, x_j)

As a result, any finite collection [f(x_1), . . . , f(x_n)] follows an n-dimensional Gaussian distribution.

GP is a Bayesian kernel machine for nonlinear learning.
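Because any finite collection of function values is jointly Gaussian, sampling from a GP prior reduces to sampling a multivariate Gaussian. A minimal sketch, assuming 1-D inputs and a squared-exponential kernel (both are my choices, not from the slides):

```python
import numpy as np

def rbf_kernel(xs, ys, length_scale=1.0):
    """Squared-exponential covariance K(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Any finite collection [f(x_1), ..., f(x_n)] is jointly Gaussian,
# so sample functions are draws from that Gaussian.
x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # small jitter for numerical stability
samples = np.random.default_rng(0).multivariate_normal(
    mean=np.zeros(len(x)), cov=K, size=3)        # three draws from the GP prior
```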

Page 16: Transfer Learning & Semi-supervised Learning

Bayesian inference on function space


Bayesian inference on functions

Given {(x_i, y_i)}_{i=1}^n and a test point x_test, the vector [y_1, . . . , y_n, y_test] follows a joint (n+1)-dimensional Gaussian distribution N(0, K), where the covariance matrix K is computed between [x_1, . . . , x_n, x_test]. Then the prediction is a simple computation of the Gaussian posterior

$$ p(y_{\text{test}} \mid \{y_i\}) \;=\; \frac{p(y_1, \ldots, y_n, y_{\text{test}})}{p(y_1, \ldots, y_n)} \qquad (5) $$

(From Carl Rasmussen's tutorial.)
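Equation (5) is Gaussian conditioning, so the prediction has a closed form. A minimal sketch, reusing the rbf_kernel helper above; the observation-noise term is an assumption I add for numerical stability:

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, kernel, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at x_test,
    i.e. the Gaussian conditional implied by eq. (5)."""
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = kernel(x_train, x_test)                 # cross-covariances
    k_ss = kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = k_star.T @ alpha
    cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)
    return mean, cov
```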

Page 17: Transfer Learning & Semi-supervised Learning

Hierarchical GP for multiple functions


Hierarchical Gaussian Processes (HGP)

A generalization of hierarchical linear models

Functions are i.i.d. samples from a GP prior p(f | θ).

Challenge: how to set θ to get a sufficiently flexible GP?

Yu et al. ICML 2005

Page 18: Transfer Learning & Semi-supervised Learning

Learn joint covariance function


Inductive Learning

The algorithm learns K and µ only at a finite set of data points X = {x_i | i ∈ O}. To handle a new x, do we have to add it into X and retrain the model?

The answer is NO!

It turns out that µ and K have surprisingly simple analytical forms

$$ \mu(x) \;=\; \sum_{i \in O} \xi_i\, K_0(x, x_i) \qquad (7) $$

$$ K(x_i, x_j) \;=\; K_0(x_i, x_j) + \underbrace{\sum_{p,\, q \in O} K_0(x_i, x_p)\, \Gamma_{pq}\, K_0(x_q, x_j)}_{\text{possibly non-stationary even if } K_0 \text{ is not}} \qquad (8) $$

Therefore, at the convergence of the EM algorithm, we can compute {ξ_i} and {Γ_pq}. Then both µ and K are generalizable to the entire input space.
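Given the coefficients {ξ_i} and {Γ_pq} from the (unshown) EM procedure, evaluating the learned mean (7) and covariance (8) at new inputs is a few matrix products. A sketch with hypothetical argument names; xi, Gamma, and k0 are stand-ins for quantities the slides only name symbolically:

```python
import numpy as np

def predict_mean_kernel(x_new, X_obs, xi, Gamma, k0):
    """Evaluate the learned mean (7) and covariance (8) at new inputs,
    given coefficients {xi_i} and {Gamma_pq} obtained at EM convergence."""
    k_new = k0(x_new, X_obs)                           # shape (n_new, n_obs)
    mu = k_new @ xi                                    # eq. (7)
    K = k0(x_new, x_new) + k_new @ Gamma @ k_new.T     # eq. (8)
    return mu, K
```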


Illustration of the idea

Page 19: Transfer Learning & Semi-supervised Learning

Experiments on synthetic data


Synthetic Data

Page 20: Transfer Learning & Semi-supervised Learning

Approaches to Transfer Learning

Transfer learning approaches and descriptions:

– Instance-transfer: re-weight some labeled data in a source domain for use in the target domain.

– Feature-representation-transfer: find a "good" feature representation that reduces the difference between a source and a target domain or minimizes the error of models.

– Model-transfer: discover shared parameters or priors of models between a source domain and a target domain.

– Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain.

(From "A survey on transfer learning," by Sinno Jialin Pan & Qiang Yang.)

Page 21: Transfer Learning & Semi-supervised Learning


Outline

§ Transfer learning
§ Semi-supervised learning
§ Concluding remarks

Slides courtesy: Jerry Zhu

Page 22: Transfer Learning & Semi-supervised Learning

Why semi-supervised learning (SSL)?

§ Promise: better performance for free... labeled data can be hard to get
– labels may require human experts
– labels may require special devices
– unlabeled data is often cheap in large quantity


Page 23: Transfer Learning & Semi-supervised Learning

What is SSL?


What is Semi-Supervised Learning?

Learning from both labeled and unlabeled data. Examples:

Semi-supervised classification: training data consists of l labeled instances {(x_i, y_i)}_{i=1}^{l} and u unlabeled instances {x_j}_{j=l+1}^{l+u}, often u ≫ l. Goal: a better classifier f than from labeled data alone.

Constrained clustering: unlabeled instances {x_i}_{i=1}^{n}, and "supervised information", e.g., must-links, cannot-links. Goal: better clustering than from unlabeled data alone.

We will mainly discuss semi-supervised classification.


Page 24: Transfer Learning & Semi-supervised Learning

Example of hard-to-get labels


Example of hard-to-get labels

Task: speech analysis

Switchboard dataset

telephone conversation transcription

400 hours annotation time for each hour of speech

film → f ih n uh gl n m

be all → bcl b iy iy tr ao tr ao l dl


Page 25: Transfer Learning & Semi-supervised Learning

Another example


Another example of hard-to-get labels

Task: natural language parsing

Penn Chinese Treebank

2 years for 4000 sentences

“The National Track and Field Championship has finished.”


Page 26: Transfer Learning & Semi-supervised Learning

A mixture model approach


A simple example of generative models

Labeled data (X_l, Y_l):

[Scatter plot of the labeled examples from the two classes.]

Assuming each class has a Gaussian distribution, what is the decision boundary?


Page 27: Transfer Learning & Semi-supervised Learning

A simple mixture model formulation


A simple example of generative models

Model parameters: θ = {w_1, w_2, µ_1, µ_2, Σ_1, Σ_2}

The GMM:

$$ p(x, y \mid \theta) \;=\; p(y \mid \theta)\, p(x \mid y, \theta) \;=\; w_y\, \mathcal{N}(x;\, \mu_y, \Sigma_y) $$

Classification:

$$ p(y \mid x, \theta) \;=\; \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)} $$
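For completeness, a small sketch of the classification rule p(y | x, θ) for this mixture; the helper name is mine and SciPy's Gaussian density is an assumed convenience:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posterior(x, weights, means, covs):
    """Class posterior p(y | x, theta) for a GMM with one Gaussian per class."""
    joint = np.array([w * multivariate_normal.pdf(x, mean=m, cov=S)
                      for w, m, S in zip(weights, means, covs)])  # p(x, y | theta)
    return joint / joint.sum()                                    # normalize over y'
```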


Page 28: Transfer Learning & Semi-supervised Learning

Using labeled data only


A simple example of generative models

The most likely model, and its decision boundary:

[Plot: the maximum-likelihood Gaussians fitted to the labeled data, with the resulting decision boundary.]


Page 29: Transfer Learning & Semi-supervised Learning

Add unlabeled data


A simple example of generative models

Adding unlabeled data:

[Plot: the same labeled examples together with many unlabeled points.]


Page 30: Transfer Learning & Semi-supervised Learning

Most likely decision boundary


A simple example of generative models

With unlabeled data, the most likely model and its decision boundary:

[Plot: the maximum-likelihood model fitted to labeled and unlabeled data, with its decision boundary.]


Page 31: Transfer Learning & Semi-supervised Learning

Results with and without unlabeled data


A simple example of generative models

They are different because they maximize different quantities: p(X_l, Y_l | θ) versus p(X_l, Y_l, X_u | θ).

[Two plots comparing the decision boundary obtained by maximizing p(X_l, Y_l | θ) (labeled data only) with the one obtained by maximizing p(X_l, Y_l, X_u | θ) (labeled plus unlabeled data).]


Page 32: Transfer Learning & Semi-supervised Learning

A generative model for SSL


Generative model for semi-supervised learning

Assumption: knowledge of the model form p(X, Y | θ).

Joint and marginal likelihood:

$$ p(X_l, Y_l, X_u \mid \theta) \;=\; \sum_{Y_u} p(X_l, Y_l, X_u, Y_u \mid \theta) $$

Find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian.

Common mixture models used in semi-supervised learning:
– Mixture of Gaussian distributions (GMM): image classification
– Mixture of multinomial distributions (Naïve Bayes): text categorization
– Hidden Markov Models (HMM): speech recognition

Learning via the Expectation-Maximization (EM) algorithm (Baum-Welch).


Assumption: we know p(x, y | θ). Learning via the EM algorithm.
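A hedged sketch of EM for the semi-supervised GMM case, assuming one Gaussian component per class: labeled points keep hard (one-hot) responsibilities, while the E-step re-estimates soft responsibilities for the unlabeled points only. Function and variable names are mine.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssl_gmm_em(X_l, y_l, X_u, n_classes, n_iter=50):
    """EM sketch for maximizing p(X_l, Y_l, X_u | theta) with a Gaussian
    mixture that has one component per class."""
    d = X_l.shape[1]
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # Responsibilities: fixed one-hot for labeled points, uniform init for unlabeled
    R = np.zeros((len(X), n_classes))
    R[np.arange(n_l), y_l] = 1.0
    R[n_l:] = 1.0 / n_classes
    for _ in range(n_iter):
        # M-step: weighted MLE of class weights, means, covariances
        Nk = R.sum(axis=0)
        w = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]
        cov = []
        for k in range(n_classes):
            Z = X - mu[k]
            cov.append((R[:, k, None] * Z).T @ Z / Nk[k] + 1e-6 * np.eye(d))
        # E-step: update responsibilities of the unlabeled points only
        lik = np.column_stack([w[k] * multivariate_normal.pdf(X_u, mu[k], cov[k])
                               for k in range(n_classes)])
        R[n_l:] = lik / lik.sum(axis=1, keepdims=True)
    return w, mu, cov
```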

Page 33: Transfer Learning & Semi-supervised Learning

A co-training method


Multiview learning

A special regularizer Ω(f) defined on unlabeled data, to encourage agreement among multiple learners:

$$ \operatorname*{argmin}_{f_1, \ldots, f_k} \; \sum_{v=1}^{k} \left( \sum_{i=1}^{l} c\big(x_i, y_i, f_v(x_i)\big) + \lambda_1\, \Omega_{SL}(f_v) \right) \;+\; \lambda_2 \sum_{u,\, v=1}^{k} \sum_{i=l+1}^{l+u} c\big(x_i, f_u(x_i), f_v(x_i)\big) $$
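To make the objective concrete, here is a sketch of its value for k linear learners f_v(x) = w_vᵀx, taking both c(·) terms to be squared losses and Ω_SL to be a ridge penalty; all of these choices, and the helper name, are assumptions for illustration.

```python
import numpy as np

def multiview_objective(ws, X_l, y_l, X_u, lam1=0.1, lam2=1.0):
    """Multiview objective for k linear learners: supervised loss plus ridge
    penalty per view, plus pairwise disagreement on the unlabeled data."""
    preds_l = [X_l @ w for w in ws]
    preds_u = [X_u @ w for w in ws]
    supervised = sum(np.sum((y_l - p) ** 2) + lam1 * w @ w
                     for p, w in zip(preds_l, ws))
    disagreement = sum(np.sum((pu - pv) ** 2)
                       for pu in preds_u for pv in preds_u)
    return supervised + lam2 * disagreement
```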


Page 34: Transfer Learning & Semi-supervised Learning

Graph Laplacian


The graph Laplacian

We can also compute f in closed form using the graph Laplacian.

– n × n weight matrix W on X_l ∪ X_u: symmetric, non-negative

– Diagonal degree matrix D: D_ii = Σ_{j=1}^{n} W_ij

– Graph Laplacian matrix: Δ = D − W

The energy can be rewritten as

$$ \sum_{i \sim j} w_{ij}\,\big(f(x_i) - f(x_j)\big)^2 \;=\; f^\top \Delta f $$
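The identity Σ_{i∼j} w_ij (f(x_i) − f(x_j))² = fᵀΔf is easy to verify numerically; the following is my own small example on a random symmetric weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = np.triu(W, 1); W = W + W.T          # symmetric, non-negative, zero diagonal
D = np.diag(W.sum(axis=1))              # degree matrix
Delta = D - W                           # graph Laplacian

f = rng.standard_normal(n)
# Sum over edges i~j: each unordered pair counted once via the 0.5 factor
energy = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(energy, f @ Delta @ f)
```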


Page 35: Transfer Learning & Semi-supervised Learning

Harmonic solution with Laplacian


Harmonic solution with Laplacian

The harmonic solution minimizes energy subject to the given labels

$$ \min_{f} \; \infty \sum_{i=1}^{l} \big(f(x_i) - y_i\big)^2 \;+\; f^\top \Delta f $$

Partition the Laplacian matrix:

$$ \Delta \;=\; \begin{pmatrix} \Delta_{ll} & \Delta_{lu} \\ \Delta_{ul} & \Delta_{uu} \end{pmatrix} $$

Harmonic solution:

$$ f_u \;=\; -\,\Delta_{uu}^{-1}\, \Delta_{ul}\, Y_l $$

The normalized Laplacian L = D^{-1/2} Δ D^{-1/2} = I − D^{-1/2} W D^{-1/2}, or Δ^p, L^p, are often used too (p > 0).
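A minimal sketch of the harmonic solution f_u = −Δ_uu⁻¹ Δ_ul Y_l as a function of the weight matrix W and the labeled values; the helper name and argument layout are mine.

```python
import numpy as np

def harmonic_solution(W, y_l, labeled_idx):
    """Closed-form harmonic solution f_u = -Delta_uu^{-1} Delta_ul Y_l
    for label propagation on a graph with weight matrix W."""
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    Delta = D - W
    u_idx = np.setdiff1d(np.arange(n), labeled_idx)      # unlabeled node indices
    Delta_uu = Delta[np.ix_(u_idx, u_idx)]
    Delta_ul = Delta[np.ix_(u_idx, labeled_idx)]
    f_u = -np.linalg.solve(Delta_uu, Delta_ul @ y_l)
    return u_idx, f_u
```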


Page 36: Transfer Learning & Semi-supervised Learning

Reference


References

See the references in: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool, 2009.


Page 37: Transfer Learning & Semi-supervised Learning

Concluding remarks

§ In both transfer learning and semi-supervised learning, the key is the regularization term, which reflects some assumption or prior knowledge.

§  No assumption, no learning
