Deep Learning tutorial
Nicolas Le Roux
Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng
DL/UFL workshop
16/12/11
Disclaimer
A lot of the slides of this tutorial have been taken from
presentations from:
Honglak Lee
Yann LeCun
Geoff Hinton
Andrew Ng
Yoshua Bengio
Outline
1 Introduction
    What is deep?
    Training problems
2 Unsupervised pretraining
    Pretraining
    The denoising autoencoder
    Predictive sparse coding
    The deep belief network
3 Conclusion
What does deep mean?
Properties of deep models
Many non-linearities ≠ one non-linearity
More constrained space of transformations
Deep model = prior on the type of transformations
May need fewer computational units for the same function
Features for face recognition
Image courtesy of Honglak Lee
“First-level” features are not task-dependent
“First-level” features are easy to learn
“Higher-level” features are easier to learn given the low-level features
A deep neural network
H1 = g(W1 X)
H2 = g(W2 H1)
Y = W3 H2
g = sigmoid or tanh
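The forward pass above is short enough to write out directly; a minimal NumPy sketch (the layer sizes and random weights are illustrative, not from the slides):

```python
import numpy as np

def g(a):
    # sigmoid non-linearity (tanh would work just as well)
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
# illustrative sizes: 4 inputs, hidden layers of 5 and 3 units, 2 outputs
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
W3 = rng.normal(size=(2, 3))

X = rng.normal(size=(4, 10))   # 10 data points, stored as columns

H1 = g(W1 @ X)                 # first hidden layer
H2 = g(W2 @ H1)                # second hidden layer
Y = W3 @ H2                    # linear output layer
```

Note the output layer is linear here; for classification one would typically put a softmax on top.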
Training issues
W3 is a lot easier to learn than W2 and W1
Compare:
Given a set of features, find the combination
Given the combination, find the set of features
Practical results
Parameters randomly initialized
They do not change much (bad local minima)
A random transformation is not very good
Bengio et al., 2007
Practical results - TIMIT
Mohamed et al., Deep Belief Networks for phone recognition
The gist of pretraining
1 We want (potentially) deep networks
2 Training the lower layers is hard.
Solution: optimize each layer without relying on the layers above.
The gist of pretraining
1 Learn W1 first
2 Given W1, learn W2
3 Given W1 and W2, learn W3
Greedy learning
1 Train each layer in a supervised way
2 Use an alternate cost, hoping that it will be useful.
Alternate cost: modeling the data well.
Denoising autoencoders
Vincent et al., 2008
Corrupt the input (e.g. set 25% of inputs to 0)
Reconstruct the uncorrupted input
Use uncorrupted encoding as input to next level
A representation is good if it works well with noisy data
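A minimal single-layer sketch of the recipe above: corrupt the input, encode the corrupted version, reconstruct the clean one. The tied weights, squared-error loss, sizes, and learning rate are assumptions for illustration, not from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_hid = 8, 4
W = rng.normal(scale=0.5, size=(n_hid, n_in))  # tied weights: decoder uses W.T
b = np.zeros(n_hid)                            # encoder bias
c = np.zeros(n_in)                             # decoder bias

def dae_step(x, lr=0.5, corruption=0.25):
    """One SGD step on the denoising reconstruction error; returns the loss."""
    global W, b, c
    mask = (rng.random(n_in) > corruption).astype(float)
    x_tilde = x * mask                          # corrupt: zero out ~25% of inputs
    h = sigmoid(W @ x_tilde + b)                # encode the *corrupted* input
    x_hat = sigmoid(W.T @ h + c)                # reconstruct the *clean* input
    loss = 0.5 * np.sum((x_hat - x) ** 2)
    # backpropagate through the tied weights
    d_out = (x_hat - x) * x_hat * (1 - x_hat)
    d_hid = (W @ d_out) * h * (1 - h)
    W -= lr * (np.outer(h, d_out) + np.outer(d_hid, x_tilde))
    c -= lr * d_out
    b -= lr * d_hid
    return loss

x = np.array([1., 0., 1., 1., 0., 0., 1., 0.])  # a toy binary input
losses = [dae_step(x) for _ in range(500)]
```

The key point matches the slide: the loss compares the reconstruction against the uncorrupted x, never against x_tilde.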
DAE - A graphic explanation
Projects a “corrupted point” onto its position on the manifold
Two corrupted versions of the same point have the same representation
Stacking DAEs
1 Start with the lowest level and stack upwards
2 Train each layer on the intermediate code (features) from the layer below
3 Top layer can have a different output (e.g., a softmax non-linearity) to provide an output for classification
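The three steps above can be sketched end to end: train a tiny tied-weight DAE on the data, feed its uncorrupted codes to the next DAE, and so on. All sizes and hyperparameters below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_dae(X, n_hid, n_steps=300, lr=0.3, corruption=0.25):
    """Greedily train one tied-weight denoising autoencoder on X (points as rows)."""
    n_in = X.shape[1]
    W = rng.normal(scale=0.3, size=(n_hid, n_in))
    b, c = np.zeros(n_hid), np.zeros(n_in)
    for _ in range(n_steps):
        x = X[rng.integers(len(X))]
        x_tilde = x * (rng.random(n_in) > corruption)   # corrupt the input
        h = sigmoid(W @ x_tilde + b)
        x_hat = sigmoid(W.T @ h + c)                    # reconstruct the clean x
        d_out = (x_hat - x) * x_hat * (1 - x_hat)
        d_hid = (W @ d_out) * h * (1 - h)
        W -= lr * (np.outer(h, d_out) + np.outer(d_hid, x_tilde))
        b -= lr * d_hid
        c -= lr * d_out
    return W, b

X = (rng.random((100, 16)) > 0.5).astype(float)  # toy binary data
W1, b1 = train_dae(X, n_hid=8)
H1 = sigmoid(X @ W1.T + b1)   # *uncorrupted* codes become the next layer's input
W2, b2 = train_dae(H1, n_hid=4)
H2 = sigmoid(H1 @ W2.T + b2)  # features for a softmax classifier stacked on top
```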
Predictive sparse coding
Objective function for sparse coding:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁
Predictive sparse coding adds a prediction term:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F(X, θ)‖₂²
F(X, θ) = G tanh(We X)
Z must:
Be sparse
Reconstruct X well
Be well predicted by a feedforward model
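The objective can be minimised over Z by proximal gradient descent (ISTA): a gradient step on the smooth terms, then a soft-threshold for the L1 penalty. The optimiser choice is ours (the slides do not specify one), G is treated as a diagonal gain, and all shapes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 8, 5
D = rng.normal(size=(n_in, n_code))              # dictionary
We = rng.normal(scale=0.3, size=(n_code, n_in))  # encoder weights (hypothetical values)
Gdiag = np.ones(n_code)                          # G as a diagonal gain (an assumption)
lam, alpha = 0.5, 0.5
x = rng.normal(size=n_in)

F = Gdiag * np.tanh(We @ x)                      # feedforward prediction F(X, θ)

def objective(z):
    return (0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))
            + alpha * np.sum((z - F) ** 2))

# ISTA: step size 1/L, where L bounds the curvature of the smooth part
L = np.linalg.norm(D, 2) ** 2 + 2 * alpha
eta = 1.0 / L
z = np.zeros(n_code)
for _ in range(200):
    grad = D.T @ (D @ z - x) + 2 * alpha * (z - F)   # gradient of the smooth terms
    u = z - eta * grad
    z = np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)  # soft-threshold
```

With step size 1/L, ISTA decreases the objective monotonically, so the final z is at least as good as the all-zeros initialisation.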
Stacking PSDs
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F1(X, θ1)‖₂²
F1(X, θ1) = G tanh(We1 X)
1 Train the first layer using PSD on X
2 Use H1 = |F1(X, θ1)| as features
3 Train the second layer using PSD on H1
4 Use H2 = |F2(H1, θ2)| as features
5 Keep going...
6 Use H37 as input to your final classifier
Stacking PSDs (second layer)
Z(H1) = argmin_Z ½‖H1 − DZ‖₂² + λ‖Z‖₁ + α‖Z − F2(H1, θ2)‖₂²
F2(H1, θ2) = G tanh(We2 H1)
The mothership: the DBN
Unsupervised pretraining works by reconstructing the data
What if we tried to have a full generative model of the data?
Restricted Boltzmann Machine (RBM)
E(x, h) = −∑_{ij} W_{ij} x_i h_j = −x⊤Wh
P(x, h) = exp(x⊤Wh) / ∑_{x0,h0} exp(x0⊤Wh0)
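For a toy RBM the partition function can be enumerated exactly, which makes the definition concrete (bias terms are omitted, as on the slide; the sizes are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2
W = rng.normal(size=(n_vis, n_hid))

def energy(x, h):
    # E(x, h) = -x^T W h  (no bias terms, as on the slide)
    return -x @ W @ h

# enumerate every binary configuration to get the partition function
configs = [(np.array(x), np.array(h))
           for x in product([0, 1], repeat=n_vis)
           for h in product([0, 1], repeat=n_hid)]
Z = sum(np.exp(-energy(x, h)) for x, h in configs)

def P(x, h):
    # joint probability under the Boltzmann distribution
    return np.exp(-energy(x, h)) / Z
```

Enumerating Z costs 2^(n_vis + n_hid) terms, which is exactly why real RBMs need the approximate training methods mentioned later.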
Type of variables in an RBM
P(x, h) = exp(x⊤Wh) / ∑_{x0,h0} exp(x0⊤Wh0)
This formulation is correct when x and h are binary variables
You have no choice over the type of x (this is your data)
You can choose what type h is
There are other formulations when x is not binary
They typically do not work that well (with notable exceptions)
Training an RBM
P(x, h) = exp(x⊤Wh) / ∑_{x0,h0} exp(x0⊤Wh0)
We have a joint probability of data and “features”
Maximizing P(x) will give us meaningful features
There are efficient ways to do that (check my webpage)
Stacking RBMs
Learn the first layer by maximizing P(x) = ∑_{h1} P(x, h1)
Once trained, the features of x are E_P[h1 | x]
Learn the second layer by maximizing P(h1) = ∑_{h2} P(h1, h2)
Keep going...
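The "efficient ways" alluded to on the previous slide usually mean contrastive divergence. Below is a minimal CD-1 sketch of the greedy stacking, with biases omitted and all sizes invented; feeding the real-valued probabilities E[h1 | x] to the next RBM is a common shortcut, not the exact recipe from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_rbm(X, n_hid, n_steps=200, lr=0.1):
    """Train a bias-free binary RBM with one-step contrastive divergence (CD-1)."""
    n_vis = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_vis, n_hid))
    for _ in range(n_steps):
        x0 = X[rng.integers(len(X))]
        p_h0 = sigmoid(x0 @ W)                        # P(h = 1 | x)
        h0 = (rng.random(n_hid) < p_h0).astype(float)
        p_x1 = sigmoid(W @ h0)                        # P(x = 1 | h)
        x1 = (rng.random(n_vis) < p_x1).astype(float) # one Gibbs step
        p_h1 = sigmoid(x1 @ W)
        # CD-1 update: positive phase minus negative phase
        W += lr * (np.outer(x0, p_h0) - np.outer(x1, p_h1))
    return W

X = (rng.random((200, 12)) > 0.5).astype(float)  # toy binary data
W1 = train_rbm(X, n_hid=6)
H1 = sigmoid(X @ W1)           # E[h1 | x], the features passed upward
W2 = train_rbm(H1, n_hid=3)    # second RBM trained on the first layer's features
```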
![Page 49: Nicolas Le Roux - WordPress.com › ...Deep Learning tutorial Nicolas Le Roux Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng DL/UFL workshop 16/12/11 Nicolas Le Roux Adam Coates,](https://reader034.fdocuments.us/reader034/viewer/2022052613/5f14a8e4ed0a62113619e60f/html5/thumbnails/49.jpg)
Results - 1
Results - 2
Take-home messages - Deep learning
Building a hierarchy of features can be beneficial
Training all layers at once is very hard
Pretraining alleviates the problem of bad local minima
Regularizes the model
Take-home messages - Pretraining
Pretraining requires building blocks
ALL pretraining methods share the same idea: model the input while having somewhat invariant features
By no means is this necessarily the best method
And now for some applications...
This is the end...
Questions?