Transcript of Lecture 4: Some Advanced Topics in Deep Learning (II)

Page 1:

Lecture 4: Some Advanced Topics in Deep Learning (II)

Instructor: Hao Su

Jan 18, 2018

Page 2:

Agenda

• Review of last lecture

• Theories behind Machine Learning

Page 3:

Agenda

• Review of last lecture

• Theories behind Machine Learning

Page 4:

Deep Networks: Three Theory Questions

• Approximation Theory: When and why are deep networks better than shallow networks?

• Optimization Theory: What is the landscape of the empirical risk?

• Learning Theory: How can deep learning not overfit?

[Tomaso Poggio's lecture in STATS385]

Page 5:

Deep Networks: Three Theory Questions

• Approximation Theory: When and why are deep networks better than shallow networks?

• Optimization Theory: What is the landscape of the empirical risk?

• Learning Theory: How can deep learning not overfit?

[Tomaso Poggio's lecture in STATS385]

Page 6:

Deep and Shallow Networks: Universality

[Tomaso Poggio's lecture in STATS385]

Page 7:

Classical learning theory and Kernel Machines (Regularization in RKHS)

[Tomaso Poggio's lecture in STATS385]
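For reference, the setup behind this slide is classical Tikhonov regularization in a reproducing kernel Hilbert space (RKHS); the formulas below are the standard statement, not transcribed from the slide figure. Given data (x_i, y_i), a loss V, and a kernel K, one solves

```latex
\min_{f \in \mathcal{H}_K}\;
\frac{1}{N} \sum_{i=1}^{N} V\!\big(f(x_i),\, y_i\big) \;+\; \lambda\, \|f\|_{K}^{2}
```

By the representer theorem, the minimizer is a finite kernel expansion over the training points:

```latex
f(x) \;=\; \sum_{i=1}^{N} c_i\, K(x, x_i)
```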

Page 8:

Classical kernel machines are equivalent to shallow networks

[Tomaso Poggio's lecture in STATS385]
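A minimal numpy sketch of this equivalence (toy data, a Gaussian kernel, and all variable names are illustrative): the kernel expansion f(x) = Σᵢ cᵢ K(x, xᵢ) is exactly a one-hidden-layer network whose hidden units compute K(·, xᵢ), with the training points acting as fixed first-layer parameters and the coefficients cᵢ forming the linear output layer.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # Pairwise K(x, z) = exp(-gamma * ||x - z||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))      # centers = hidden units of the shallow net
y_train = np.sin(X_train[:, 0])         # toy regression targets

# Fit output-layer coefficients by ridge-regularized least squares,
# i.e. regularization in the RKHS induced by K.
K = gaussian_kernel(X_train, X_train)
c = np.linalg.solve(K + 1e-3 * np.eye(len(K)), y_train)

# Prediction = forward pass of the equivalent shallow network:
# hidden layer K(x, x_i), then linear output layer c.
X_test = rng.normal(size=(5, 2))
f = gaussian_kernel(X_test, X_train) @ c
print(f)
```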

Page 9:

Curse of dimensionality

[Mhaskar, Poggio, Liao, 2016] [Tomaso Poggio's lecture in STATS385]
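Stated quantitatively (rates as given in Mhaskar, Poggio, Liao, 2016, with d the input dimension, m the smoothness of the target class, and ε the desired accuracy):

```latex
N_{\text{shallow}} \;=\; O\!\left(\epsilon^{-d/m}\right)
\qquad\text{vs.}\qquad
N_{\text{deep}} \;=\; O\!\left((d-1)\,\epsilon^{-2/m}\right)
```

The shallow rate is exponential in d, which is the curse of dimensionality; the deep rate, which applies to hierarchically local compositional functions with bivariate constituents, is only linear in d.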

Page 10:

[Tomaso Poggio's lecture in STATS385]

Page 11:

Hierarchically local compositionality
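To make the idea concrete, here is a hypothetical 8-input function with binary-tree compositional structure (the constituent h and the tree are illustrative, not from the slides): every constituent depends on only two variables, and a deep network can mirror the tree layer by layer, which is what the approximation results exploit.

```python
import numpy as np

# f(x1,...,x8) = h( h( h(x1,x2), h(x3,x4) ), h( h(x5,x6), h(x7,x8) ) )
# Each constituent is a function of only 2 variables, regardless of the
# total input dimension d = 8.

def h(a, b):
    # Illustrative bivariate constituent function.
    return np.tanh(a + b)

def f_compositional(x):
    # Evaluate the binary tree level by level, like a deep network.
    level1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    level2 = [h(level1[0], level1[1]), h(level1[2], level1[3])]
    return h(level2[0], level2[1])

print(f_compositional(np.arange(8.0)))
```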

Page 12:

Proof by induction

Page 13:

Intuition

• The family of compositional functions is easier to approximate (such functions can be approximated with fewer parameters)

• Deep learning leverages this compositional structure

• Why are compositional functions important for perception?


Page 14:

Locality of constituent functions is key: CIFAR

Page 15:

Other perspectives on why deep is good

The main result of [Telgarsky, 2016, COLT] is that there are highly oscillatory functions that deep networks represent with low (roughly linear) complexity, but that shallow networks cannot represent without exponentially many units.

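A minimal sketch of the kind of construction behind this result (the standard "tent map" example; the code and names below are mine, not Telgarsky's notation): composing a 3-ReLU tent map k times yields 2^k monotone pieces using only about 3k units at depth k, whereas matching this with a shallow ReLU network provably requires exponentially many units.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def tent(x):
    # g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1]: exactly 3 ReLU units.
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

x = np.linspace(0.0, 1.0, 10001)
y, k = x, 4
for _ in range(k):          # depth-k composition, ~3k hidden units total
    y = tent(y)

# Count the monotone pieces of the resulting sawtooth.
slope_signs = np.sign(np.diff(y))
pieces = np.count_nonzero(np.diff(slope_signs)) + 1
print(pieces)               # 2^k = 16 oscillating pieces for k = 4
```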

Page 16:

Proof sketch

Page 17:

Deep Networks: Three Theory Questions

• Approximation Theory: When and why are deep networks better than shallow networks?

• Optimization Theory: What is the landscape of the empirical risk?

• Learning Theory: How can deep learning not overfit?

[Tomaso Poggio's lecture in STATS385]


Page 18:

The shape of the objective function to be optimized

Page 19:

Two challenging cases for optimization

Page 20:

Hessian characterizes the shape at the bottom
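For reference, the statement behind the slide title is the second-order Taylor expansion of the loss at a critical point w* (a standard fact, not transcribed from the slide):

```latex
L(w) \;\approx\; L(w^*) \;+\; \tfrac{1}{2}\,(w - w^*)^\top H\,(w - w^*),
\qquad H = \nabla^2 L(w^*)
```

Since ∇L(w*) = 0, the local shape at the bottom is governed entirely by the eigenvalues of H.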

Page 21:

Critical points

‖∇L(w)‖=0


Page 22:

Critical points

• If all eigenvalues of the Hessian are positive, the point is called a local minimum.

• If r of them are negative and the rest are positive, it is called a saddle point with index r.

• At the critical point, the eigenvectors indicate the directions in which the value of the function locally changes.

• Moreover, the changes are proportional to the corresponding signed eigenvalue. (A numerical illustration follows below.)

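A small numerical sketch of this classification (the toy function f and the finite-difference Hessian are illustrative):

```python
import numpy as np

def f(w):
    # Toy function with a critical point at the origin:
    # one positive and one negative curvature direction.
    return w[0] ** 2 - w[1] ** 2

def numerical_hessian(f, w, eps=1e-4):
    # Central finite differences for the Hessian of f at w.
    n = len(w)
    H = np.zeros((n, n))
    I = np.eye(n) * eps
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(w + I[i] + I[j]) - f(w + I[i] - I[j])
                       - f(w - I[i] + I[j]) + f(w - I[i] - I[j])) / (4 * eps ** 2)
    return H

w0 = np.zeros(2)                    # critical point: gradient vanishes here
eigs = np.linalg.eigvalsh(numerical_hessian(f, w0))
r = int((eigs < 0).sum())           # number of negative eigenvalues
print(eigs, "-> saddle point of index", r)   # eigs ~ [-2, +2], r = 1
```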

Page 23:

Critical points

• Classical optimization theory mostly assumes that the Hessian is non-degenerate

• Is this the case for machine learning loss functions?


Page 24:

Spectrum of Hessian for an MLP

• 5K parameters in total

[Sagun et al., Empirical Analysis of the Hessian of Over-Parametrized Neural Networks]


Page 25:

An explanation

[Sagun et al., Empirical Analysis of the Hessian of Over-Parametrized Neural Networks]

Page 26:

An explanation

Rank 1 / Vanish (claimed by the author). If you are interested, check it! (get credit)

There are at least M − N trivial eigenvalues of the Hessian!

[Sagun et al., Empirical Analysis of the Hessian of Over-Parametrized Neural Networks]
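The counting behind this claim, as I read the argument (a sketch under the stated assumptions, with M parameters and N samples): for a loss L(w) = (1/N) Σᵢ ℓ(f_w(xᵢ)), the Hessian splits into a Gauss-Newton part, a sum of N rank-one matrices ℓ″·gᵢgᵢᵀ with gᵢ = ∇_w f_w(xᵢ), plus a term weighted by ℓ′ that vanishes at an interpolating minimum; the surviving part has rank at most N, leaving at least M − N zero eigenvalues.

```python
import numpy as np

# Toy check of the rank argument: a sum of N rank-one outer products in
# an M-dimensional parameter space has rank <= N, so at least M - N
# eigenvalues are (numerically) zero. G's rows stand in for the
# per-sample gradients g_i; the values are random, purely illustrative.
M, N = 50, 10
rng = np.random.default_rng(0)
G = rng.normal(size=(N, M))
H_gn = G.T @ G / N                         # Gauss-Newton part of the Hessian
eigs = np.linalg.eigvalsh(H_gn)
print(int((np.abs(eigs) < 1e-10).sum()))   # >= M - N = 40 zero eigenvalues
```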


Page 27:

Relation between #params and eigenvalues

• Growing the size of the network with fixed data, architecture, and algorithm.

• 1K samples from MNIST

“the large positive eigenvalues depend on data, so that their number should not, in principle, change with the increasing dimensionality of the parameter space.”


Page 28:

Relation between #params and eigenvalues

“increasing the number of dimensions keeps the ratio of small negative eigenvalues fixed.”

[Sagun et al., Empirical Analysis of the Hessian of Over-Parametrized Neural Networks]


Page 29:

Deep Networks: Three Theory Questions

• Approximation Theory: When and why are deep networks better than shallow networks?

• Optimization Theory: What is the landscape of the empirical risk?

• Learning Theory: How can deep learning not overfit?

[Tomaso Poggio's lecture in STATS385]


Page 30:

Existing theories are insufficient

• Regularization
• VC dimension
• Rademacher complexity
• Uniform stability
• …


Page 31:

A Statistical Mechanics Analysis

SGD update rule and assumptions (sketched below):
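A sketch of the standard formulation that this kind of analysis builds on (my transcription of the usual setup, since the slide equations are not reproduced): SGD with learning rate η and batch size S takes steps with minibatch gradients whose noise covariance scales as 1/S,

```latex
\theta_{t+1} = \theta_t - \eta\, g_B(\theta_t),
\qquad
g_B(\theta) \approx \nabla L(\theta) + \tfrac{1}{\sqrt{S}}\,\epsilon,
\quad \epsilon \sim \mathcal{N}\big(0,\, C(\theta)\big)
```

so that for small η the dynamics approximate a stochastic differential equation whose noise term carries a factor of η/S:

```latex
d\theta = -\nabla L(\theta)\, dt + \sqrt{\tfrac{\eta}{S}}\; B(\theta)\, dW_t,
\qquad B B^\top = C
```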

[Jastrzebski et al., Three Factors Influencing Minima in SGD]


Page 32:

Main Results

n = η/S is a measure of the noise in the system, set by the choice of learning rate η and batch size S.

[Jastrzebski et al., Three Factors Influencing Minima in SGD]
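A toy simulation of this scaling (a one-dimensional quadratic loss with synthetic per-sample gradient noise; everything here is illustrative, not from the paper): two configurations with the same ratio η/S settle into stationary distributions of nearly the same width.

```python
import numpy as np

def sgd_stationary_std(eta, S, a=1.0, sigma=1.0, steps=200_000, seed=0):
    # SGD on L(w) = a*w^2/2; averaging a minibatch of S samples scales
    # the gradient-noise standard deviation by 1/sqrt(S).
    rng = np.random.default_rng(seed)
    w, tail = 0.0, []
    for t in range(steps):
        noise = rng.normal(0.0, sigma / np.sqrt(S))
        w -= eta * (a * w + noise)
        if t > steps // 2:              # discard burn-in
            tail.append(w)
    return np.std(tail)

# Same n = eta/S, similar stationary spread (~ sqrt(eta * sigma^2 / (2*a*S))).
print(sgd_stationary_std(eta=0.01, S=1))
print(sgd_stationary_std(eta=0.04, S=4))
```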


Page 33:

[Jastrzebski et al., Three Factors Influencing Minima in SGD]

Page 34:

Deep Networks: Three Theory Questions

• Approximation Theory: When and why are deep networks better than shallow networks?

• Optimization Theory: What is the landscape of the empirical risk?

• Learning Theory: How can deep learning not overfit?

[Tomaso Poggio's lecture in STATS385]
