
Transcript of Machine Learning lecture slides: cs.cmu.edu/~epxing/Class/10701/slides/lecture7-nn.pdf

Page 1:

Perceptron and Artificial Neural Networks

Eric Xing

Lecture 7, October 1, 2015

Reading: Chap. 5 CB

Machine Learning

10-701, Fall 2015

Page 2:

Recall Logistic Regression (sigmoid classifier, MaxEnt classifier, …)

The prediction rule:

$$p(y=1 \mid x) \;=\; \frac{1}{1+\exp\left(-\theta_0-\sum_{i=1}^{M}\theta_i x_i\right)} \;=\; \frac{1}{1+e^{-\theta^{T}x}}$$

In this case, learning p(y|x) amounts to learning ...? Algorithm: gradient ascent

What is the limitation?
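A minimal Python sketch of this prediction rule (the function name and the bias-in-x convention are assumptions, not from the slides):

```python
import numpy as np

def predict_prob(theta, x):
    """Sigmoid classifier: p(y=1|x), with x[0] = 1 serving as the bias term."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

# classify as y = 1 when the probability exceeds 1/2
print(predict_prob(np.array([0.5, 2.0]), np.array([1.0, 3.0])))  # ~0.9985
```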

Page 3:

Learning highly non-linear functions

f: X -> Y
- f might be a non-linear function
- X: (vector of) continuous and/or discrete variables
- Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate; speech recognition.

Page 4:

Our brain is very good at this …

Page 5:

How a neuron works

Activation function: a mathematical expression

[Figure: a biological neuron (dendrites, axon, synapses acting as excitatory (+) and inhibitory (-) weights) next to the artificial node it inspires: inputs x1, x2 enter a linear combiner with weights w1, w2 and bias w0, followed by a hard limiter with a threshold, producing output Y.]

The linear combiner computes the net input

$$X=\sum_{i=1}^{M} w_i x_i$$

and the hard limiter thresholds it:

$$Y=\begin{cases}+1 & \text{if } X \ge \theta\\ -1 & \text{if } X < \theta\end{cases}$$

Replacing the hard limiter with a sigmoid gives the logistic unit:

$$p(y=1\mid x, w)=\frac{1}{1+\exp\left(-w_0-\sum_{i=1}^{M} w_i x_i\right)}=\frac{1}{1+e^{-w^{T}x}}$$
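As a sketch (names are mine), the hard-limiter unit is a few lines of Python:

```python
import numpy as np

def perceptron_output(w, x, theta=0.0):
    """Hard-limiter unit: +1 if the weighted net input reaches the threshold, else -1."""
    return 1 if np.dot(w, x) >= theta else -1

print(perceptron_output(np.array([0.4, 0.2]), np.array([1, 1]), theta=0.5))  # 1
```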

Page 6:

Perceptron and Neural Nets

From biological neuron to artificial neuron (perceptron); from biological neuron networks to artificial neural networks.

[Figure, left: the perceptron again: inputs x1, x2; weights w1, w2; linear combiner; hard limiter with threshold; output Y. Right: a biological network (somas, dendrites, axons, synapses) next to an artificial one, in which input signals enter an input layer, pass through a middle layer, and leave an output layer as output signals.]

Page 7:

Jargon pseudo-correspondence:
- Independent variable = input variable
- Dependent variable = output variable
- Coefficients = "weights"
- Estimates = "targets"

Logistic Regression Model (the sigmoid unit)

[Figure: a single sigmoid unit drawn as a network. The independent variables (inputs) x1, x2, x3 take the values Age = 34, Gender = 1, Stage = 4; the coefficients (weights) a, b, c feed the dependent variable (output) p, the prediction: "probability of being alive" = 0.6.]

Page 8:

A perceptron learning algorithm

Recall the nice property of the sigmoid function: its derivative is σ'(a) = σ(a)(1 - σ(a)).

Consider the regression problem f: X -> Y, for scalar Y:

We used to maximize the conditional data likelihood.

Here ...

Page 9:

Gradient Descent

Notation:
- x_d = input
- t_d = target output
- o_d = observed unit output
- w_i = weight i

The error to minimize is the usual sum of squared errors, E[w] = (1/2) Σ_d (t_d - o_d)².

Page 10:

The perceptron learning rules

Notation, as before: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i.

Batch mode: do until convergence:
1. Compute the gradient ∇E_D[w]
2. w <- w - η ∇E_D[w]

Incremental mode: do until convergence:
For each training example d in D:
1. Compute the gradient ∇E_d[w]
2. w <- w - η ∇E_d[w]

where E_D[w] = (1/2) Σ_{d in D} (t_d - o_d)² and E_d[w] = (1/2)(t_d - o_d)².
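A minimal sketch of both modes for a linear unit o_d = w · x_d (the data and learning rate are illustrative choices):

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])   # inputs (first column = bias)
t = np.array([1., 2., 3.])                      # targets
eta = 0.1

# Batch mode: one update from the gradient over all of D.
w = np.zeros(2)
for _ in range(200):
    o = X @ w
    w += eta * X.T @ (t - o)        # w <- w - eta * grad E_D[w]

# Incremental (stochastic) mode: update after every example.
w_sgd = np.zeros(2)
for _ in range(200):
    for x_d, t_d in zip(X, t):
        o_d = w_sgd @ x_d
        w_sgd += eta * (t_d - o_d) * x_d

print(np.round(w, 3), np.round(w_sgd, 3))       # both approach [1, 1]
```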

Page 11:

MLE vs MAP

Maximum conditional likelihood estimate

Maximum a posteriori estimate
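As a hedged sketch of the contrast (assuming squared error and a zero-mean Gaussian prior on the weights of strength lam; function names are mine):

```python
import numpy as np

def mle_gradient(w, X, t, o):
    """Gradient for maximizing the conditional likelihood (squared-error form)."""
    return X.T @ (t - o)

def map_gradient(w, X, t, o, lam=0.01):
    """MAP adds the Gaussian prior on w, i.e. an L2 weight-decay term."""
    return X.T @ (t - o) - lam * w
```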

Page 12:

What decision surface does a perceptron define?

NAND

x  y  Z (color)
0  0  1
0  1  1
1  0  1
1  1  0

[Figure: a single unit y with inputs x1, x2, weights w1, w2, and threshold θ = 0.5; f(a) = 1 for a > θ, 0 otherwise.]

Some possible values for w1 and w2:

w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20

f(x1 w1 + x2 w2) = y:
f(0·w1 + 0·w2) = 1
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Page 13:

What decision surface does a perceptron define?

XOR

x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: the same single unit y with inputs x1, x2, weights w1, w2, threshold θ = 0.5, and f(a) = 1 for a > θ, 0 otherwise.]

Some possible values for w1 and w2:

w1    w2
(no values work: XOR is not linearly separable by a single unit)

f(x1 w1 + x2 w2) = y would require:
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Page 14:

What decision surface does a perceptron define?

XOR

x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: a two-layer network with weights w1 ... w6; θ = 0.5 for all units; f(a) = 1 for a > θ, 0 otherwise.]

A possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, -0.6, -0.7, 0.8, 1, 1)
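A quick Python check that these weights compute XOR (the wiring is an assumption: w1, w2 into hidden unit h1; w3, w4 into h2; w5, w6 into the output):

```python
def f(a, theta=0.5):
    """Hard limiter: 1 if a > theta else 0."""
    return 1 if a > theta else 0

w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1, 1
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = f(w1 * x1 + w2 * x2)
    h2 = f(w3 * x1 + w4 * x2)
    print(x1, x2, "->", f(w5 * h1 + w6 * h2))   # 0, 1, 1, 0: the XOR truth table
```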

Page 15:

Non Linear Separation

[Figure: decision regions over two symptoms (headache vs. no headache, cough vs. no cough). A linear boundary can only split "treatment" from "no treatment", while a non-linear separation carves out regions for no disease, meningitis, flu, and pneumonia, with region labels coded in binary (00, 01, 10, 11 and 000 through 111).]

Page 16:

Neural Network Model

[Figure: a feed-forward network. The independent variables (inputs) Age = 34, Gender = 2, Stage = 4 connect through weights (.6, .5, .8, .2, .1, .3, .7, .2) to a hidden layer, which connects through weights (.4, .2) to the output: the dependent variable (prediction), "probability of being alive" = 0.6.]

Page 17:

"Combined logistic models"

[Figure: the same network, highlighting one hidden unit as a logistic model of its own: the inputs (Age = 34, Gender = 2, Stage = 4) feed it through weights .6, .5, .8, and it reaches the output ("probability of being alive" = 0.6) through weights .1, .7.]
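The phrase "combined logistic models" is concrete in code: each hidden unit is a logistic unit, and the output unit is a logistic model of the hidden activations. A small sketch (weights and inputs are illustrative, loosely echoing the figure):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([34.0, 2.0, 4.0])                 # Age, Gender, Stage
W_hidden = np.array([[.6, .5, .8],             # weights into hidden unit 1
                     [.2, .1, .3]])            # weights into hidden unit 2
w_out = np.array([.7, .2])                     # hidden -> output weights

h = sigmoid(W_hidden @ x)                      # each hidden unit: a logistic model
p_alive = sigmoid(w_out @ h)                   # output: a logistic model of h
print(round(float(p_alive), 2))                # a probability in (0, 1); ~0.71 here
```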

Page 18:

[Figure: the same network, now highlighting a second hidden unit and its weights (.5, .8, .2 inward; .3, .2 onward); output "probability of being alive" = 0.6.]

Page 19:

[Figure: the full network once more (Age = 34, Gender = 1, Stage = 4; weights .6, .5, .8, .2, .1, .3, .7, .2 into the hidden layer; output "probability of being alive" = 0.6).]

Page 20:

Not really, no target for hidden units...

[Figure: the same network (Age = 34, Gender = 2, Stage = 4; hidden-layer weights .6, .5, .8, .2, .1, .3, .7, .2; output weights .4, .2; output 0.6). The hidden units cannot each be fit as a separate logistic regression, because the training data supplies targets only for the output.]

Page 21:

Recall perceptrons

[Figure: a single-layer network. Input units (cough, headache) connect through weights to output units (no disease, pneumonia, flu, meningitis). Error = what we wanted - what we got; the Δ rule changes the weights to decrease the error.]

Page 22:

Hidden Units and Backpropagation

[Figure: a network with input units, hidden units, and output units. Error = what we wanted - what we got; the Δ rule is applied at the output layer and then propagated back to assign blame to the hidden units.]

Page 23:

Backpropagation Algorithm

Initialize all weights to small random numbers.

Until convergence, do (for each training example):

1. Input the training example to the network and compute the network outputs
2. For each output unit k: δ_k <- o_k (1 - o_k)(t_k - o_k)
3. For each hidden unit h: δ_h <- o_h (1 - o_h) Σ_k w_{h,k} δ_k
4. Update each network weight w_{i,j}: w_{i,j} <- w_{i,j} + η δ_j x_{i,j}

where x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i.
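A compact sketch of these steps, training a tiny 2-3-1 sigmoid network on XOR (sizes, learning rate, data, and seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # targets (XOR)

W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)     # input -> hidden
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)     # hidden -> output
eta = 0.5

for _ in range(20000):
    H = sigmoid(X @ W1 + b1)            # 1. forward pass: hidden and output values
    O = sigmoid(H @ W2 + b2)
    dO = O * (1 - O) * (T - O)          # 2. delta at each output unit
    dH = H * (1 - H) * (dO @ W2.T)      # 3. delta at each hidden unit
    W2 += eta * H.T @ dO; b2 += eta * dO.sum(0)   # 4. weight updates
    W1 += eta * X.T @ dH; b1 += eta * dH.sum(0)

print(np.round(O.ravel(), 2))   # typically approaches [0, 1, 1, 0] (a local minimum)
```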

Page 24:

More on Backpropagation

- It performs gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
- In practice it often works well (can run multiple times)
- Often includes weight momentum
- It minimizes error over the training examples; will it generalize well to subsequent testing examples?
- Training can take thousands of iterations and is very slow! Using the network after training is very fast

Page 25:

Learning Hidden Layer Representation

A network:

A target function:

Can this be learned?

Page 26:

Learning Hidden Layer Representation

A network:

Learned hidden layer representation:

Page 27:

Training

Page 28:

Training

Page 29:

The "Driver" Network

Page 30:

Artificial neural networks: what you should know

- Highly expressive non-linear functions
- Highly parallel network of logistic-function units
- Minimizing the sum of squared training errors gives the MLE of the network weights, if we assume zero-mean Gaussian noise on the output values
- Minimizing the sum of squared errors plus the squared weights (regularization) gives MAP estimates, assuming zero-mean Gaussian priors on the weights
- Gradient descent as the training procedure; how to derive your own gradient descent procedure
- Hidden units discover useful representations
- Local minima are the greatest problem
- Overfitting, regularization, early stopping

Page 31:

Modern ANN topics: “Deep” Learning

Page 32:

Non-linear LR vs. ANN

[Figure, left: non-linear logistic regression with hand-enumerated features. The inputs X1, X2, X3 and all their products X1X2, X1X3, X2X3, X1X2X3 feed Y directly: (2^3 - 1) possible combinations, with Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ... Right: an ANN in which hidden units learn feature detectors such as "X1", "X2", "X1X3", "X1X2X3" from the raw inputs X1, X2, X3 before feeding Y.]
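Enumerating the left-hand features explicitly makes the combinatorial cost visible; a sketch (the helper name is mine):

```python
from itertools import combinations
import numpy as np

def interaction_features(x):
    """All 2^n - 1 products over non-empty subsets of the inputs."""
    feats = []
    for r in range(1, len(x) + 1):
        for idx in combinations(range(len(x)), r):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

print(interaction_features(np.array([2.0, 3.0, 5.0])))
# [2, 3, 5, 6, 10, 15, 30]: 2^3 - 1 = 7 features; an ANN learns which ones matter
```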

Page 33:

Computer vision features (courtesy: Lee and Ng)

[Figure: a gallery of hand-engineered descriptors: SIFT, Spin image, HoG, RIFT, Textons, GLOH.]

Drawbacks of feature engineering:
1. Needs expert knowledge
2. Time-consuming hand-tuning

Page 34:

Using ANNs to learn hierarchical representations

Good representations are hierarchical:

- In language: hierarchy in syntax and semantics. Words -> parts of speech -> sentences -> text; objects, actions, attributes... -> phrases -> statements -> stories
- In vision: part-whole hierarchy. Pixels -> edges -> textons -> parts -> objects -> scenes

[Figure: a pipeline of Trainable Feature Extractor -> Trainable Feature Extractor -> Trainable Classifier.]

Page 35:

"Deep" learning: learning hierarchical representations

- Deep learning: learning a hierarchy of internal representations
- From low-level features, to mid-level invariant representations, to object identities
- Representations become increasingly invariant as we go up the layers
- Using multiple stages gets around the specificity/invariance dilemma

[Figure: Trainable Feature Extractor -> Trainable Feature Extractor -> Trainable Classifier, with the learned internal representations between the stages.]

Page 36:

"Deep" models

- Neural networks: feed-forward* (you have seen it)
- Autoencoders (multilayer neural net with target output = input)
- Probabilistic, directed: PCA, sparse coding
- Probabilistic, undirected: MRFs and RBMs*
- Recursive neural networks*
- Convolutional neural nets

Page 37:

Filtering + Non-Linearity + Pooling = 1 stage of a Convolutional Net

[Hubel & Wiesel 1962]:
- simple cells detect local features
- complex cells "pool" the outputs of simple cells within a retinotopic neighborhood

[Figure: multiple convolutions produce retinotopic feature maps ("simple cells"), which then pass through pooling/subsampling ("complex cells").]
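One such stage is small enough to write out; a naive NumPy sketch (the tanh non-linearity and max pooling are my choices for illustration, not dictated by the slide):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D filtering (cross-correlation, as in most conv nets)."""
    H, W = img.shape; kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kH, j:j + kW] * kernel)
    return out

def pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (subsampling)."""
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.random.rand(32, 32)                 # one retinotopic input map
kernel = np.random.randn(5, 5)               # one learnable 5x5 filter
stage = pool2x2(np.tanh(conv2d_valid(img, kernel)))
print(stage.shape)                           # (14, 14): 32 -> 28 (conv) -> 14 (pool)
```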

Page 38:

Convolutional Network: Multi-Stage Trainable Architecture

[Figure: Convolutions/Filtering -> Pooling/Subsampling -> Convolutions/Filtering -> Pooling/Subsampling -> Convolutions/Filtering -> Convolutions/Classification.]

Hierarchical architecture: representations are more global, more invariant, and more abstract as we go up the layers.

Alternated layers of filtering and spatial pooling: filtering detects conjunctions of features; pooling computes local disjunctions of features.

Fully trainable: all the layers are trainable.

Page 39:

Convolutional Net Architecture for Handwriting Recognition

[Figure: input 1@32x32 -> 5x5 convolution -> Layer 1: 6@28x28 -> 2x2 pooling/subsampling -> Layer 2: 6@14x14 -> 5x5 convolution -> Layer 3: 12@10x10 -> 2x2 pooling/subsampling -> Layer 4: 12@5x5 -> 5x5 convolution -> Layer 5: 100@1x1 -> Layer 6: 10 outputs.]

Convolutional net for handwriting recognition (400,000 synapses):
- Convolutional layers (simple cells): all units in a feature plane share the same weights
- Pooling/subsampling layers (complex cells): for invariance to small distortions
- Supervised gradient-descent learning using back-propagation
- The entire network is trained end-to-end; all the layers are trained simultaneously. [LeCun et al., Proc. IEEE, 1998]
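The figure's size arithmetic can be checked in a few lines: a valid 5x5 convolution shrinks each spatial dimension by 4, and 2x2 pooling halves it:

```python
size = 32
for op in ["conv5", "pool2", "conv5", "pool2", "conv5"]:
    size = size - 4 if op == "conv5" else size // 2
    print(op, size)   # 28, 14, 10, 5, 1: matching layers 1-5 in the figure
```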

Page 40:

How to train?


But this is very slow !!!

Page 41:

Some new ideas to speed up

- Stacking from smaller building blocks: layers -> blocks
- Approximate inference: undirected connections for all layers (Markov net) [related work: Salakhutdinov and Hinton, 2009]; block Gibbs sampling or mean-field; hierarchical probabilistic inference
- Layer-wise unsupervised learning

Page 42:

Layer-wise Unsupervised Pre-training

[Figure: the raw input vector, "input ...".]

Page 43:

Layer-wise Unsupervised Pre-training

[Figure: a first layer of features, "features ...", is placed on top of the input.]

Page 44:

Layer-wise Unsupervised Pre-training

[Figure: the feature layer is trained unsupervised, so that a reconstruction of the input decoded from the features matches the input (reconstruction =? input).]

Page 45:

Layer-wise Unsupervised Pre-training

[Figure: the trained feature layer is kept; the reconstruction pathway is discarded.]

Page 46:

Layer-wise Unsupervised Pre-training

[Figure: a second layer of more abstract features is stacked on top of the first features.]

Page 47:

Layer-wise Unsupervised Pre-training

[Figure: the second layer is trained the same way, reconstructing the first-layer features (reconstruction =? features).]

Page 48:

Layer-wise Unsupervised Pre-training

[Figure: the stack now holds input, features, and more abstract features.]

Page 49:

Layer-wise Unsupervised Pre-training

[Figure: a third layer of even more abstract features is added in the same way.]

Page 50:

Layer-wise Unsupervised Pre-training

[Figure: finally, an output layer f(X) is trained against the target Y on top of the pre-trained stack (output f(X) =? target Y).]
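The whole procedure fits in a short sketch: greedily train each layer as an autoencoder of the layer below, then stack. An illustrative tied-weight version (all names, sizes, and hyperparameters are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(X, n_hidden, eta=0.1, epochs=500):
    """Learn W so that sigmoid(H @ W.T) reconstructs X, where H = sigmoid(X @ W)."""
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    for _ in range(epochs):
        H = sigmoid(X @ W)                  # features
        R = sigmoid(H @ W.T)                # reconstruction of the layer's input
        dR = R * (1 - R) * (R - X)          # grad of 1/2 ||R - X||^2 at the decoder
        dH = H * (1 - H) * (dR @ W)         # backprop into the encoder
        W -= eta * (X.T @ dH + dR.T @ H) / len(X)
    return W

X = rng.random((200, 20))                   # unlabeled input data
W1 = pretrain_layer(X, 10)                  # features
H1 = sigmoid(X @ W1)
W2 = pretrain_layer(H1, 5)                  # more abstract features
H2 = sigmoid(H1 @ W2)
# ...then a supervised output layer f(X) is trained on H2 against targets Y.
```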

Page 51:

Application: MNIST Handwritten Digit Dataset

MNIST: 60,000 training samples, 10,000 test samples

Page 52:

Results on MNIST Handwritten Digits

CLASSIFIER | DEFORMATION | PREPROCESSING | ERROR (%) | REFERENCE
linear classifier (1-layer NN) | | none | 12.00 | LeCun et al. 1998
linear classifier (1-layer NN) | | deskewing | 8.40 | LeCun et al. 1998
pairwise linear classifier | | deskewing | 7.60 | LeCun et al. 1998
K-nearest-neighbors (L2) | | none | 3.09 | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2) | | deskewing | 2.40 | LeCun et al. 1998
K-nearest-neighbors (L2) | | deskew, clean, blur | 1.80 | Kenneth Wilder, U. Chicago
K-NN L3, 2-pixel jitter | | deskew, clean, blur | 1.22 | Kenneth Wilder, U. Chicago
K-NN, shape context matching | | shape context feature | 0.63 | Belongie et al., IEEE PAMI 2002
40 PCA + quadratic classifier | | none | 3.30 | LeCun et al. 1998
1000 RBF + linear classifier | | none | 3.60 | LeCun et al. 1998
K-NN, Tangent Distance | | subsamp 16x16 pixels | 1.10 | LeCun et al. 1998
SVM, Gaussian kernel | | none | 1.40 |
SVM, deg-4 polynomial | | deskewing | 1.10 | LeCun et al. 1998
Reduced Set SVM, deg-5 poly | | deskewing | 1.00 | LeCun et al. 1998
Virtual SVM, deg-9 poly | Affine | none | 0.80 | LeCun et al. 1998
V-SVM, 2-pixel jittered | | none | 0.68 | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered | | deskewing | 0.56 | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE | | none | 4.70 | LeCun et al. 1998
2-layer NN, 300 HU, MSE | Affine | none | 3.60 | LeCun et al. 1998
2-layer NN, 300 HU | | deskewing | 1.60 | LeCun et al. 1998
3-layer NN, 500+150 HU | | none | 2.95 | LeCun et al. 1998
3-layer NN, 500+150 HU | Affine | none | 2.45 | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg | | none | 1.53 | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE | | none | 1.60 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Affine | none | 1.10 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE | Elastic | none | 0.90 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Elastic | none | 0.70 | Simard et al., ICDAR 2003
Convolutional net LeNet-1 | | subsamp 16x16 pixels | 1.70 | LeCun et al. 1998
Convolutional net LeNet-4 | | none | 1.10 | LeCun et al. 1998
Convolutional net LeNet-5 | | none | 0.95 | LeCun et al. 1998
Conv. net LeNet-5 | Affine | none | 0.80 | LeCun et al. 1998
Boosted LeNet-4 | Affine | none | 0.70 | LeCun et al. 1998
Conv. net, CE | Affine | none | 0.60 | Simard et al., ICDAR 2003
Conv. net, CE | Elastic | none | 0.40 | Simard et al., ICDAR 2003

Page 53:

Face Detection with a Convolutional Net

Page 54:

Feature learning

Successful learning of intermediate representations [Lee et al., ICML 2009; Lee et al., NIPS 2009]

Page 55:

Weaknesses & Criticisms

- It learns everything. Better to encode prior knowledge about the structure of images.
- Not clear whether an explicit global objective is indeed optimized, making theoretical analysis difficult; many (arbitrary) approximations are introduced.
- Many different loss functions, gate functions, and transformation functions are used, and many different implementations exist.
- Comparison is based on end empirical results on a downstream task, not on the direct task the DNN is designed to compute, which makes verifying and tuning the components of a DNN very hard. Imagine evaluating the performance of the chef in the kitchen by whether the waiter gets a good tip.