MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 07: Implicit Regularization
Lorenzo Rosasco
Learning algorithm design so far
- ERM, penalized/constrained:

$$\min_{w\in\mathbb{R}^d}\ \underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell(y_i, w^\top x_i) + \lambda\|w\|^2}_{L_\lambda(w)}$$
- Optimization by gradient descent (GD):

$$w_{t+1} = w_t - \gamma\,\nabla L_\lambda(w_t),$$

plus variants: Newton's method, stochastic gradients.
Nonlinear extensions via features/kernels.
Beyond ERM
- Are there other design principles?
- So far, statistics/regularization have been kept separate from computations.

Today we will see how optimization regularizes implicitly.
Least squares (recap)
We start with least squares.
$$Xw = Y$$

$$\underbrace{\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\|Y - Xw\|^2}_{n > d} \qquad\qquad \underbrace{\min_{w\in\mathbb{R}^d}\ \|w\|^2 \ \text{ subj. to } Xw = Y}_{n < d} \qquad\Rightarrow\qquad w^\dagger = X^\dagger Y$$
Iterative solvers for least squares
Let

$$L(w) = \frac{1}{n}\bigl\|Y - Xw\bigr\|^2.$$

The gradient descent iteration is

$$w_{t+1} = w_t - \gamma\,\frac{2}{n}\,X^\top(Xw_t - Y).$$

For suitable $\gamma$, $L(w_t) \to \min_w L(w)$.
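To make the iteration concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data and variable names are illustrative):

```python
import numpy as np

def gd_least_squares(X, Y, gamma, n_iter):
    """Gradient descent on L(w) = (1/n) ||Y - X w||^2, started at w = 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        # gradient of (1/n)||Y - Xw||^2 is (2/n) X^T (X w - Y)
        w = w - gamma * (2.0 / n) * X.T @ (X @ w - Y)
    return w

# small synthetic check that L(w_t) -> min L(w)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
Y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)
gamma = 0.9 * X.shape[0] / (2 * np.linalg.norm(X, 2) ** 2)   # a safe step size
w_gd = gd_least_squares(X, Y, gamma, n_iter=5000)
print(np.allclose(w_gd, np.linalg.lstsq(X, Y, rcond=None)[0], atol=1e-4))   # True
```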
Implicit bias/regularization
Gradient descent
$$w_{t+1} = w_t - \gamma\,\frac{2}{n}\,X^\top(Xw_t - Y)$$

converges to the minimal norm solution for a suitable initialization $w_0$.
Reminder: the minimal norm solution $w^\dagger$ satisfies

$$w^\dagger = X^\top c, \quad c \in \mathbb{R}^n, \qquad \text{that is, } w^\dagger \perp \mathrm{Null}(X).$$
Implicit bias/regularization
Then, $w_t \to w^\dagger$.
Gradient descent explores solutions with a bias towards small norms.
Regularization is not achieved via explicit constraints/penalties.
In this sense it is implicit.
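A quick numerical check of this claim (an illustrative sketch, not from the slides): in the underdetermined case, GD started from $w_0 = 0$ ends up at the same point as the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # underdetermined: n < d, infinitely many solutions
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

w = np.zeros(d)                      # start at w0 = 0
gamma = 0.9 * n / (2 * np.linalg.norm(X, 2) ** 2)
for _ in range(20000):
    w = w - gamma * (2.0 / n) * X.T @ (X @ w - Y)

w_dagger = np.linalg.pinv(X) @ Y     # minimal norm solution w† = X† Y
print(np.allclose(w, w_dagger, atol=1e-6))   # True: GD picked the minimal norm solution
```

Starting from $w_0 = 0$, the iterates stay in the span of the rows of $X$, which is why the limit is orthogonal to $\mathrm{Null}(X)$.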
Terminology: regularization and pseudosolutions?
- In signal processing, minimal norm solutions are called regularization.
- In classical regularization theory, they are called pseudosolutions.
- Regularization refers to a family of solutions converging to pseudosolutions, e.g. Tikhonov's. See later.
Terminology: implicit or iterative regularization?
- In machine learning, implicit regularization has recently become fashionable.
- It refers to regularization achieved without imposing constraints or adding penalties.
- In classical regularization theory, it is called iterative regularization and it is an old, classic idea.
- We will see that the idea of early stopping is also very much related.
Back for more regularization
According to classical regularization theory: among different regularized solutions, one ensuring stability should be selected.
- For example, in Tikhonov regularization

  $$w_\lambda \to w^\dagger \quad \text{as } \lambda \to 0.$$

- But in practice $\lambda \neq 0$ is chosen when data are noisy/sampled.
Regularization by gradient descent?
Gradient descent converges to the minimal norm solution, but:

- Does it define meaningful regularized solutions?
- Where is the regularization parameter?
An intuition: early stopping

[Figures: fit on the training set at iterations 1, 2, 7, and 5000.]
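The figures above can be reproduced qualitatively with a small sketch (not from the slides; the data, features, and parameter values are arbitrary choices): GD on noisy 1D data with an overparameterized feature model, tracking training and test error per iteration. The training error decreases monotonically, while the test error is typically best at an intermediate iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, centers, width=0.1):
    # Gaussian bumps on a grid: a simple overparameterized model
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

centers = np.linspace(0, 1, 100)
x_tr, x_te = rng.uniform(0, 1, 30), rng.uniform(0, 1, 500)
f = lambda x: np.sin(2 * np.pi * x)
y_tr = f(x_tr) + 0.3 * rng.standard_normal(30)   # noisy training targets
y_te = f(x_te)                                   # clean test targets

X_tr, X_te = features(x_tr, centers), features(x_te, centers)
n = X_tr.shape[0]
gamma = 0.9 * n / (2 * np.linalg.norm(X_tr, 2) ** 2)

w = np.zeros(len(centers))
train_err, test_err = [], []
for t in range(5000):
    w -= gamma * (2.0 / n) * X_tr.T @ (X_tr @ w - y_tr)
    train_err.append(np.mean((X_tr @ w - y_tr) ** 2))
    test_err.append(np.mean((X_te @ w - y_te) ** 2))

print("best test error at iteration", int(np.argmin(test_err)) + 1)
print("training error still decreasing at the end:", train_err[-1] < train_err[100])
```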
Is there a way to formalize this intuition?
Interlude: geometric series
Recall that for $|a| < 1$

$$\sum_{j=0}^{\infty} a^j = (1-a)^{-1}, \qquad \sum_{j=0}^{t-1} a^j = (1 - a^t)(1-a)^{-1}.$$

Equivalently, for $|1-b| < 1$

$$\sum_{j=0}^{\infty} (1-b)^j = b^{-1}, \qquad \sum_{j=0}^{t-1} (1-b)^j = \bigl(1 - (1-b)^t\bigr)\,b^{-1}.$$
Interlude II: Neumann series
Assume $I - A$ is invertible and $\|A\| < 1$. Then

$$\sum_{j=0}^{\infty} A^j = (I-A)^{-1}, \qquad \sum_{j=0}^{t-1} A^j = (I - A^t)(I-A)^{-1},$$

or, equivalently, for $B$ invertible with $\|I - B\| < 1$,

$$\sum_{j=0}^{\infty} (I-B)^j = B^{-1}, \qquad \sum_{j=0}^{t-1} (I-B)^j = \bigl(I - (I-B)^t\bigr)\,B^{-1}.$$
Rewriting GD
By induction (taking $w_0 = 0$),

$$w_{t+1} = w_t - \gamma\,\frac{2}{n}\,X^\top(Xw_t - Y)$$

can be written as

$$w_{t+1} = \gamma\,\frac{2}{n}\sum_{j=0}^{t}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y.$$
Rewriting GD (cont.)
- Write

$$w_{t+1} = w_t - \gamma\,\frac{2}{n}\,X^\top(Xw_t - Y) = \Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)w_t + \gamma\,\frac{2}{n}\,X^\top Y.$$

- Assume

$$w_t = \gamma\,\frac{2}{n}\sum_{j=0}^{t-1}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y.$$

- Then

$$w_{t+1} = \Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)\,\gamma\,\frac{2}{n}\sum_{j=0}^{t-1}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y + \gamma\,\frac{2}{n}\,X^\top Y = \gamma\,\frac{2}{n}\sum_{j=0}^{t}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y.$$
Neumann series and GD
This is pretty cool
$$w_{t+1} = \gamma\,\frac{2}{n}\sum_{j=0}^{t}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y.$$
GD is a truncated power series approximation of the pseudoinverse!
If $\gamma$ is such that[1] $\bigl\|I - \gamma\,\frac{2}{n}\,X^\top X\bigr\| < 1$, then for large $t$

$$\gamma\,\frac{2}{n}\sum_{j=0}^{t}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top \approx X^\dagger,$$

and we recover $w_t \to w^\dagger$.
[1] Compare to classic conditions.
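As a sanity check (an illustrative sketch, not part of the slides), the truncated series can be compared numerically to the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))

gamma = 0.9 * n / (2 * np.linalg.norm(X, 2) ** 2)
B = np.eye(d) - gamma * (2.0 / n) * X.T @ X      # the matrix whose powers appear in the series

# truncated Neumann series: gamma * (2/n) * sum_{j=0}^{t} B^j X^T
t = 5000
S, term = np.zeros((d, d)), np.eye(d)
for _ in range(t + 1):
    S += term
    term = term @ B
approx_pinv = gamma * (2.0 / n) * S @ X.T

print(np.allclose(approx_pinv, np.linalg.pinv(X), atol=1e-6))   # True for large t
```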
Stability properties of GD
For any $t$

$$w_t = \Bigl(I - \bigl(I - \gamma\,\tfrac{2}{n}\,X^\top X\bigr)^{t}\Bigr)(X^\top X)^{-1}X^\top Y$$

(assume invertibility for simplicity). Then

$$\underbrace{w_t \approx (X^\top X)^{-1}X^\top Y}_{\text{large } t}, \qquad \underbrace{w_t \approx \gamma\,\tfrac{2}{n}\,X^\top Y}_{\text{small } t}.$$

Compare to Tikhonov, $w_\lambda = (X^\top X + \lambda n I)^{-1}X^\top Y$:

$$\underbrace{w_\lambda \approx (X^\top X)^{-1}X^\top Y}_{\text{small } \lambda}, \qquad \underbrace{w_\lambda \approx \tfrac{1}{\lambda n}\,X^\top Y}_{\text{large } \lambda}.$$
Spectral view and filtering
Recall that for Tikhonov (writing $X = \sum_{j=1}^{r} s_j u_j v_j^\top$ for the SVD of $X$)

$$w_\lambda = \sum_{j=1}^{r} \frac{s_j}{s_j^2 + \lambda n}\,(u_j^\top Y)\,v_j.$$

For GD

$$w_t = \sum_{j=1}^{r} \frac{1 - \bigl(1 - \gamma\,\tfrac{2}{n}\,s_j^2\bigr)^{t}}{s_j}\,(u_j^\top Y)\,v_j.$$

Both methods can be seen as spectral filtering,

$$w = \sum_{j=1}^{r} F(s_j)\,(u_j^\top Y)\,v_j,$$

for some suitable filter function $F$.
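Both filters are easy to check numerically. Below is an illustrative sketch (not from the slides; data and parameter values are arbitrary) that builds $w_\lambda$ and $w_t$ from the SVD and compares them to the direct computations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) V^T

# Tikhonov filter: F(s) = s / (s^2 + lambda * n)
lam = 0.1
w_tik = Vt.T @ ((s / (s ** 2 + lam * n)) * (U.T @ Y))
w_tik_direct = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)
print(np.allclose(w_tik, w_tik_direct))             # True

# GD filter after t steps: F(s) = (1 - (1 - gamma*(2/n)*s^2)^t) / s
gamma, t = 0.9 * n / (2 * s.max() ** 2), 100
w_gd_filter = Vt.T @ (((1 - (1 - gamma * (2 / n) * s ** 2) ** t) / s) * (U.T @ Y))

w = np.zeros(d)                                     # run t steps of GD to compare
for _ in range(t):
    w -= gamma * (2.0 / n) * X.T @ (X @ w - Y)
print(np.allclose(w_gd_filter, w))                  # True
```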
Implicit regularization and early stopping
The stability of GD decreases with $t$, i.e. higher condition number for

$$\Bigl(I - \bigl(I - \gamma\,\tfrac{2}{n}\,X^\top X\bigr)^{t}\Bigr)(X^\top X)^{-1}X^\top.$$

Early stopping the iteration has an (implicit) regularization effect.
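One way to quantify this loss of stability numerically (an illustrative sketch, not from the slides): the spectral norm of the map from the data $Y$ to the iterate $w_t$, which bounds how much a perturbation of $Y$ can be amplified, grows with the number of iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
XtX = X.T @ X
gamma = 0.9 * n / (2 * np.linalg.norm(X, 2) ** 2)
B = np.eye(d) - gamma * (2.0 / n) * XtX

for t in [1, 10, 100, 1000, 10000]:
    # the map from Y to w_t (the expression on the slide)
    A_t = (np.eye(d) - np.linalg.matrix_power(B, t)) @ np.linalg.inv(XtX) @ X.T
    # its largest singular value: worst-case amplification of a perturbation of Y
    print(t, np.linalg.norm(A_t, 2))   # increases with t, approaching ||pinv(X)||
```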
Summary so far
$$w_{t+1} = w_t - \gamma\,\frac{2}{n}\,X^\top(Xw_t - Y) = \gamma\,\frac{2}{n}\sum_{j=0}^{t}\Bigl(I - \gamma\,\frac{2}{n}\,X^\top X\Bigr)^{j} X^\top Y.$$
- Implicit bias: gradient descent converges to the minimal norm solution.
- Stability: the number of iterations is a regularization parameter.
Name game: gradient descent, Landweber iteration, L2-Boosting.
A bit of history
These ideas are fashionable now but also have a long history.

- The idea that iterations converge to pseudosolutions is from the 50's.
- The observation that iterations control stability dates back at least to the 80's.

The classic name is iterative regularization (there are books about it).
Why is it back in fashion?
- Early stopping is used as a heuristic while training neural nets.
- Convergence to minimal norm solutions could help in understanding generalization in deep learning?
- New perspective on algorithm design, merging stats and optimization.
Statistics meets optimization
GD offers a new perspective on algorithm design.

- Training time = complexity?
- Iterations control statistical accuracy and numerical complexity.
- Recently, this kind of regularization has been called computational or algorithmic.
Beyond least squares
- Other forms of optimization?
- Other loss functions?
- Other norms?
- Other classes of functions?
Other forms of optimization
Largely unexplored, but there are results on:

- Accelerated methods and conjugate gradient.
- Stochastic/incremental gradient methods.

It is clear that other parameters control regularization/stability, e.g. step size, mini-batch size, averaging, etc.
Other loss functions
There are some results.
For $\ell$ convex, let

$$L(w) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i, w^\top x_i).$$

The gradient/subgradient descent iteration is

$$w_{t+1} = w_t - \gamma_t\,\nabla L(w_t).$$
Other loss functions (cont.)
$$w_{t+1} = w_t - \gamma_t\,\nabla L(w_t)$$

An intuition: note that, if $\sup_t \bigl\|\nabla L(w_t)\bigr\| \le B$, then (starting from $w_0 = 0$)

$$\|w_t\| \le \Bigl(\sum_{s=0}^{t-1}\gamma_s\Bigr) B,$$

so the number of iterations/step size controls the norm of the iterates.
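As an illustration (a sketch under assumptions not in the slides: the hinge loss, whose subgradients are bounded by the largest data norm, and a decaying step size), one can check the bound on the iterate norm directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def subgrad_hinge(w):
    # a subgradient of (1/n) sum_i max(0, 1 - y_i w^T x_i)
    active = (y * (X @ w) < 1).astype(float)
    return -(X.T @ (active * y)) / n

B = np.max(np.linalg.norm(X, axis=1))       # ||subgradient|| <= max_i ||x_i||
w, gamma_sum = np.zeros(d), 0.0
for t in range(1, 2001):
    gamma_t = 1.0 / np.sqrt(t)              # decaying step size
    w -= gamma_t * subgrad_hinge(w)
    gamma_sum += gamma_t
    assert np.linalg.norm(w) <= gamma_sum * B + 1e-9   # the bound above

print("final ||w_t|| =", np.linalg.norm(w), " bound =", gamma_sum * B)
```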
Other norms
Largely unexplored.
- Gradient descent needs to be replaced to bias the iterations towards other desired norms.
- Bregman iterations, mirror descent, or proximal gradients can be used.
Other classes of functions
Extensions using kernels/features are straightforward.

Considering neural nets is considerably harder. In this context, the following perspective has been considered:

- given the function class (neural nets),
- given an algorithm (SGD),
- find which norm the iterates converge to.
Summary
A different way to design algorithms.
- Implicit/iterative regularization.
- Iterative regularization for least squares.
- Extensions.