Transcript of "Generalization in Deep Learning" by Peter Bartlett and Alexander Rakhlin

Page 1:

Generalization in Deep Learning

Peter Bartlett Alexander Rakhlin

UC Berkeley

MIT

May 28, 2019

1 / 174

Page 2:

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

2 / 174

Page 3:

What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)

log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)

log(2 + 2) = log(2) + log(2)

log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)
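These identities can be checked numerically; a learner shown only such examples might conjecture the general rule log(a + b + ...) = log(a) + log(b) + ..., which happens to hold here because each tuple's sum equals its product, but fails off this "training set". A minimal Python check (an added illustration, assuming that reading of the slide):

    import math

    # "Training examples" from the slide: tuples whose sum equals their product,
    # so the log of the sum coincides with the sum of the logs.
    examples = [(1, 2, 3), (1, 1.5, 5), (2, 2), (1, 1.25, 9)]
    for xs in examples:
        lhs = math.log(sum(xs))
        rhs = sum(math.log(x) for x in xs)
        print(xs, round(lhs, 6), round(rhs, 6))      # identical on every example

    # The conjectured rule does not generalize:
    xs = (1, 1)
    print(xs, math.log(sum(xs)), sum(math.log(x) for x in xs))   # log 2 vs. 0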

3 / 174


Page 6:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

4 / 174

Page 7:

Statistical Learning Theory

Aim: Predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X.

Use a data set of n pairs

S = (x1, y1), . . . , (xn, yn)

to choose a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.

5 / 174

Page 8:

Formulations of Prediction Problems

To define the notion of a ‘good prediction,’ we can define a loss function

ℓ : Y × Y → R.

ℓ(ŷ, y) is the cost of predicting ŷ when the outcome is y.

Aim: ℓ(f(x), y) small.

6 / 174

Page 9:

Formulations of Prediction Problems

Example. In pattern classification problems, the aim is to classify a pattern x into one of a finite number of classes (that is, the label space Y is finite). If all mistakes are equally bad, we could define

ℓ(ŷ, y) = I{ŷ ≠ y} = 1 if ŷ ≠ y, 0 otherwise.

Example. In a regression problem, with Y = R, we might choose the quadratic loss function, ℓ(ŷ, y) = (ŷ − y)².

7 / 174

Page 10:

Probabilistic Assumptions

Assume:

I There is a probability distribution P on X × Y,

I The pairs (X1, Y1), . . . , (Xn, Yn), (X, Y) are chosen independently according to P.

The aim is to choose f with small risk:

L(f) = E ℓ(f(X), Y).

For instance, in the pattern classification example, this is the misclassification probability:

L(f) = L01(f) = E I{f(X) ≠ Y} = Pr(f(X) ≠ Y).

8 / 174

Page 11:

Why not estimate the underlying distribution P (or P_{Y|X}) first?

This is in general a harder problem than prediction.

9 / 174

Page 12:

Key difficulty: our goals are in terms of unknown quantities related to the unknown P. Have to use empirical data instead. Purview of statistics.

For instance, we can calculate the empirical loss of f : X → Y:

L̂(f) = (1/n) ∑_{i=1}^n ℓ(Yi, f(Xi))
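As a concrete illustration, the empirical loss is just an average over the sample; a minimal sketch in Python, with an assumed predictor f and the indicator loss:

    import numpy as np

    def empirical_loss(f, X, Y, loss):
        """Average of loss(Y_i, f(X_i)) over the sample."""
        return np.mean([loss(y, f(x)) for x, y in zip(X, Y)])

    # Assumed example: a threshold classifier under the 0-1 loss.
    zero_one = lambda y, yhat: float(y != yhat)
    f = lambda x: int(x > 0.5)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=1000)
    Y = (X + 0.1 * rng.normal(size=1000) > 0.5).astype(int)
    print(empirical_loss(f, X, Y, zero_one))         # the empirical loss of f on S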

10 / 174

Page 13:

The function x ↦ fn(x) = fn(x; X1, Y1, . . . , Xn, Yn) is random, since it depends on the random data S = (X1, Y1, . . . , Xn, Yn). Thus, the risk

L(fn) = E[ℓ(fn(X), Y) | S] = E[ℓ(fn(X; X1, Y1, . . . , Xn, Yn), Y) | S]

is a random variable. We might aim for E L(fn) small, or L(fn) small with high probability (over the training data).

11 / 174

Page 14:

Quiz: what is random here?

1. L(f ) for a given fixed f

2. fn

3. L(fn)

4. L̂(fn)

5. L̂(f) for a given fixed f

12 / 174

Page 15:

Theoretical analysis of performance is typically easier if fn has a closed form (in terms of the training data).

E.g. ordinary least squares: fn(x) = x⊤(X⊤X)⁻¹X⊤Y.

Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as

I solutions to an optimization objective (e.g. logistic regression)

I an iterative procedure without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)

13 / 174

Page 16:

The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function

f∗ = argmin_f L(f)

where the minimization is over all (measurable) prediction rules f : X → Y.

The value of the lowest expected loss is called the Bayes error:

L(f∗) = inf_f L(f)

Of course, we cannot calculate any of these quantities since P is unknown.

14 / 174

Page 17:

Bayes Optimal Function

Bayes optimal function f∗ takes on the following forms in these two particular cases:

I Binary classification (Y = {0, 1}) with the indicator loss:

f∗(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]

[Figure: the regression function η(x), taking values between 0 and 1, thresholded at 1/2.]

I Regression (Y = R) with squared loss:

f∗(x) = η(x), where η(x) = E[Y | X = x]
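For a synthetic distribution where η is known, both forms can be evaluated directly. A small Monte Carlo sketch, with an assumed η, using the fact that for the indicator loss the Bayes error equals E[min(η(X), 1 − η(X))]:

    import numpy as np

    rng = np.random.default_rng(0)
    eta = lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))   # assumed regression function

    X = rng.uniform(size=200_000)
    Y = (rng.uniform(size=X.size) < eta(X)).astype(int)      # so that P(Y = 1 | X = x) = eta(x)

    f_star = (eta(X) >= 0.5).astype(int)                     # Bayes classifier for the 0-1 loss
    print("Bayes error (Monte Carlo):", np.mean(f_star != Y))
    print("E[min(eta, 1 - eta)]:     ", np.mean(np.minimum(eta(X), 1.0 - eta(X))))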

15 / 174

Page 18:

The big question: is there a way to construct a learning algorithm with a guarantee that

L(fn)− L(f ∗)

is small for large enough sample size n?

16 / 174

Page 19:

Consistency

An algorithm that ensures

lim_{n→∞} L(fn) = L(f∗) almost surely

is called universally consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.

The good news: consistency is possible to achieve.

I easy if X is a finite or countable set

I not too hard if X is infinite, and the underlying relationship between x and y is "continuous"

17 / 174

Page 20:

The bad news...

In general, we cannot prove anything quantitative about L(fn) − L(f∗), unless we make further assumptions (incorporate prior knowledge).

"No Free Lunch" Theorems: unless we posit assumptions,

I For any algorithm fn, any n and any ε > 0, there exists a distribution P such that L(f∗) = 0 and

E L(fn) ≥ 1/2 − ε

I For any algorithm fn, and any sequence an that converges to 0, there exists a probability distribution P such that L(f∗) = 0 and for all n

E L(fn) ≥ an

18 / 174

Page 21:

is this really “bad news”?

Not really. We always have some domain knowledge.

Two ways of incorporating prior knowledge:

I Direct way: assumptions on the distribution P (e.g. margin, smoothness of the regression function, etc.)

I Indirect way: redefine the goal to perform as well as a reference set F of predictors:

L(fn) − inf_{f∈F} L(f)

F encapsulates our inductive bias.

We often make both of these assumptions.

19 / 174

Page 22:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

20 / 174

Page 23:

Perceptron


21 / 174

Page 24:

Perceptron

(x1, y1), . . . , (xT, yT) ∈ X × {±1} (T may or may not be the same as n)

Maintain a hypothesis wt ∈ R^d (initialize w1 = 0).

On round t,

I Consider (xt, yt)

I Form prediction ŷt = sign(⟨wt, xt⟩)

I If ŷt ≠ yt, update

wt+1 = wt + yt xt

else

wt+1 = wt
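A direct transcription of this update rule into Python (a minimal sketch, not the authors' code; labels are assumed to be ±1):

    import numpy as np

    def perceptron(X, y):
        """One pass of the Perceptron over (x_t, y_t), t = 1, ..., T.

        X: (T, d) array of examples x_t; y: (T,) array of labels in {-1, +1}.
        Returns the final weight vector and the number of mistakes."""
        T, d = X.shape
        w = np.zeros(d)                             # w_1 = 0
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0   # prediction sign(<w_t, x_t>), with sign(0) taken as +1
            if y_hat != y_t:                        # mistake: update
                w = w + y_t * x_t
                mistakes += 1
        return w, mistakes

On a sequence with margin γ and ‖xt‖ ≤ 1, the mistake counter stays below 1/γ², as the next slides show.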

22 / 174

Page 25:

Perceptron

[Figure: Perceptron update, showing the current hyperplane with normal wt and a misclassified example xt.]

23 / 174

Page 26:

For simplicity, suppose all data are in a unit ball, ‖xt‖ ≤ 1.

Margin with respect to (x1, y1), . . . , (xT , yT ):

γ = max_{‖w‖=1} min_{i∈[T]} ( yi ⟨w, xi⟩ )₊ ,

where (a)₊ = max{0, a}.

Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.

Theorem (Novikoff ’62).

24 / 174

Page 27:

Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

‖wt+1‖² = ‖wt + yt xt‖² ≤ ‖wt‖² + 2 yt ⟨wt, xt⟩ + 1 ≤ ‖wt‖² + 1.

Hence, ‖wT‖² ≤ m.

For the optimal hyperplane w∗,

γ ≤ ⟨w∗, yt xt⟩ = ⟨w∗, wt+1 − wt⟩ .

Hence (adding over the mistake rounds and canceling),

m γ ≤ ⟨w∗, wT⟩ ≤ ‖wT‖ ≤ √m.
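A quick numerical sanity check of the bound (an added illustration, with an assumed separator w_star enforcing margin at least gamma on points in the unit ball):

    import numpy as np

    rng = np.random.default_rng(0)
    d, T, gamma = 20, 5000, 0.05
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    # Points drawn uniformly from the unit ball, kept only if their margin
    # with respect to w_star is at least gamma.
    X, y = [], []
    while len(X) < T:
        g = rng.normal(size=d)
        x = (g / np.linalg.norm(g)) * rng.uniform() ** (1.0 / d)
        if abs(w_star @ x) >= gamma:
            X.append(x)
            y.append(np.sign(w_star @ x))
    X, y = np.array(X), np.array(y)

    # Run the Perceptron and count mistakes.
    w, mistakes = np.zeros(d), 0
    for x_t, y_t in zip(X, y):
        if y_t * (w @ x_t) <= 0:                 # mistake round
            w += y_t * x_t
            mistakes += 1
    print(mistakes, "<=", 1.0 / gamma ** 2)      # at most 1/gamma^2 mistakes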

25 / 174

Page 28:

Recap

For any T and (x1, y1), . . . , (xT , yT ),

∑_{t=1}^T I{ yt ⟨wt, xt⟩ ≤ 0 } ≤ D²/γ²

where γ = γ(x1:T, y1:T) is the margin and D = D(x1:T, y1:T) = max_t ‖xt‖.

Let w∗ denote the max margin hyperplane, ‖w∗‖ = 1.

26 / 174

Page 29:

Consequence for i.i.d. data (I)

Do one pass on i.i.d. sequence (X1,Y1), . . . , (Xn,Yn) (i.e. T = n).

Claim: the expected indicator loss of a randomly picked function x ↦ ⟨wτ, x⟩ (τ ∼ unif{1, . . . , n}) is at most

(1/n) × E[ D²/γ² ].

27 / 174

Page 30:

Proof: Take expectation on both sides:

E[ (1/n) ∑_{t=1}^n I{ Yt ⟨wt, Xt⟩ ≤ 0 } ] ≤ E[ D²/(n γ²) ]

The left-hand side can be written as (recall the notation S = (Xi, Yi)_{i=1}^n)

Eτ ES I{ Yτ ⟨wτ, Xτ⟩ ≤ 0 }

Since wτ is a function of X1:τ−1, Y1:τ−1, the above is

ES Eτ E_{X,Y} I{ Y ⟨wτ, X⟩ ≤ 0 } = ES Eτ L01(wτ)

Claim follows.

NB: To ensure E[D²/γ²] is not infinite, we assume a margin γ in P.

28 / 174

Page 31:

Consequence for i.i.d. data (II)

Now, rather than doing one pass, cycle through data

(X1,Y1), . . . , (Xn,Yn)

repeatedly until no more mistakes (i.e. T ≤ n × (D²/γ²)).

Then the final hyperplane of Perceptron separates the data perfectly, i.e. finds the minimum L̂01(wT) = 0 for

L̂01(w) = (1/n) ∑_{i=1}^n I{ Yi ⟨w, Xi⟩ ≤ 0 }.

Empirical Risk Minimization with finite-time convergence. Can we say anything about future performance of wT?

29 / 174

Page 32:

Consequence for i.i.d. data (II)

Claim:

E L01(wT) ≤ (1/(n + 1)) × E[ D²/γ² ]

30 / 174

Page 33:

Proof: Shortcuts: z = (x, y) and ℓ(w, z) = I{ y ⟨w, x⟩ ≤ 0 }. Then

ES EZ ℓ(wT, Z) = E_{S, Z_{n+1}} [ (1/(n+1)) ∑_{t=1}^{n+1} ℓ(w^(−t), Zt) ]

where w^(−t) is Perceptron's final hyperplane after cycling through the data Z1, . . . , Zt−1, Zt+1, . . . , Zn+1.

That is, leave-one-out is an unbiased estimate of the expected loss.

Now consider cycling Perceptron on Z1, . . . , Zn+1 until no more errors. Let i1, . . . , im be the indices on which Perceptron errs in any of the cycles. We know m ≤ D²/γ². However, if index t ∉ {i1, . . . , im}, then whether or not Zt was included does not matter, and Zt is correctly classified by w^(−t). Claim follows.
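The leave-one-out argument can be mimicked numerically: train the cycling Perceptron n + 1 times, each time leaving one point out, and compare the leave-one-out error with a test-set estimate of the expected loss. A sketch, assuming margin-separable data generated from a fixed w_star:

    import numpy as np

    rng = np.random.default_rng(0)
    d, gamma = 10, 0.2
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    def sample(m):
        """Points in the unit ball with margin at least gamma w.r.t. w_star."""
        pts = []
        while len(pts) < m:
            g = rng.normal(size=d)
            x = (g / np.linalg.norm(g)) * rng.uniform() ** (1.0 / d)
            if abs(w_star @ x) >= gamma:
                pts.append(x)
        X = np.array(pts)
        return X, np.sign(X @ w_star)

    def cycle_perceptron(X, y):
        """Cycle through the data until a full pass makes no mistakes."""
        w = np.zeros(d)
        while True:
            errors = 0
            for x_t, y_t in zip(X, y):
                if y_t * (w @ x_t) <= 0:
                    w += y_t * x_t
                    errors += 1
            if errors == 0:
                return w

    X, y = sample(51)                            # n + 1 = 51 points
    loo = np.mean([y[t] * (cycle_perceptron(np.delete(X, t, 0), np.delete(y, t)) @ X[t]) <= 0
                   for t in range(len(y))])

    Xte, yte = sample(20_000)                    # proxy for the expected loss
    w = cycle_perceptron(X[:-1], y[:-1])
    print("leave-one-out estimate:", loo, "  test error:", np.mean(yte * (Xte @ w) <= 0))

Unbiasedness holds in expectation over the sample; a single draw just produces comparable numbers.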

31 / 174

Page 34:

A few remarks

I Bayes error L01(f∗) = 0, since we assume a margin in the distribution P (and, hence, in the data).

I Our assumption on P is not about its parametric or nonparametric form, but rather on what happens at the boundary.

I Curiously, we beat the CLT rate of 1/√n.

32 / 174

Page 35:

Overparametrization and overfitting

The last lecture will discuss overparametrization and overfitting.

I dimension d never appears in the mistake bound of Perceptron. Hence, we can even take d = ∞.

I Perceptron is a 1-layer neural network (no hidden layer) with d neurons.

I For d > n, any n points in general position can be separated (zero empirical error).

Conclusions:

I More parameters than data does not necessarily contradict good prediction performance ("generalization").

I Perfectly fitting the data does not necessarily contradict good generalization (particularly with no noise).

I Complexity is subtle (e.g. number of parameters vs margin)

I ... however, margin is a strong assumption (more than just no noise)

33 / 174

Page 36:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

34 / 174

Page 37:

Relaxing the assumptions

We proved

E L01(wT) − L01(f∗) ≤ O(1/n)

under the assumption that P is linearly separable with margin γ.

The assumption on the probability distribution implies that the Bayes classifier is a linear separator ("realizable" setup).

We may relax the margin assumption, yet still hope to do well with linear separators. That is, we would aim to minimize

E L01(wT) − L01(w∗)

hoping that

L01(w∗) − L01(f∗)

is small, where w∗ is the best hyperplane classifier with respect to L01.

35 / 174

Page 38:

More generally, we will work with some class of functions F that we hope captures well the relationship between X and Y. The choice of F gives rise to a bias-variance decomposition.

Let fF ∈ argmin_{f∈F} L(f) be the function in F with smallest expected loss.

36 / 174

Page 39:

Bias-Variance Tradeoff

L(fn) − L(f∗) = [ L(fn) − inf_{f∈F} L(f) ] (Estimation Error) + [ inf_{f∈F} L(f) − L(f∗) ] (Approximation Error)

[Figure: the class F inside the space of all functions, with fn, f∗, and fF marked.]

Clearly, the two terms are at odds with each other:

I Making F larger means smaller approximation error but (as we will see) larger estimation error

I Taking a larger sample n means smaller estimation error and has no effect on the approximation error.

I Thus, it makes sense to trade off size of F and n (Structural Risk Minimization, or Method of Sieves, or Model Selection).

37 / 174

Page 40:

Bias-Variance Tradeoff

In simple problems (e.g. linearly separable data) we might get away with having no bias-variance decomposition (e.g. as was done in the Perceptron case).

However, for more complex problems, our assumptions on P often imply that the class containing f∗ is too large and the estimation error cannot be controlled. In such cases, we need to produce a "biased" solution in hopes of reducing variance.

Finally, the bias-variance tradeoff need not be in the form we just presented. We will consider a different decomposition in the last lecture.

38 / 174

Page 41:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

39 / 174

Page 42:

Consider the performance of empirical risk minimization:

Choose fn ∈ F to minimize L̂(f), where L̂ is the empirical risk,

L̂(f) = Pn ℓ(f(X), Y) = (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi).

For pattern classification, this is the proportion of training examples misclassified.

How does the excess risk, L(fn) − L(fF), behave? We can write

L(fn) − L(fF) = [ L(fn) − L̂(fn) ] + [ L̂(fn) − L̂(fF) ] + [ L̂(fF) − L(fF) ]

Therefore,

E L(fn) − L(fF) ≤ E[ L(fn) − L̂(fn) ]

because the second term is nonpositive and the third is zero in expectation.
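The reason the first bracket is the interesting one can be seen in a small simulation (an added illustration, using an assumed finite class of threshold classifiers): the empirical risk of the ERM output is optimistic, i.e. on average below its true risk.

    import numpy as np

    rng = np.random.default_rng(0)
    thresholds = np.linspace(0, 1, 200)                  # an assumed finite class F
    eta = lambda x: np.clip(x + 0.3 * np.sin(8 * x), 0, 1)

    def sample(n):
        X = rng.uniform(size=n)
        Y = rng.uniform(size=n) < eta(X)
        return X, Y

    def risks(X, Y):
        """0-1 risk of every rule f_t(x) = 1{x > t} on the sample (X, Y)."""
        preds = X[None, :] > thresholds[:, None]
        return (preds != Y[None, :]).mean(axis=1)

    Xte, Yte = sample(50_000)
    true_risk = risks(Xte, Yte)                          # proxy for L(f)

    n, gaps = 100, []
    for _ in range(200):
        X, Y = sample(n)
        emp = risks(X, Y)                                # empirical risks over F
        j = int(np.argmin(emp))                          # ERM picks f_n
        gaps.append(true_risk[j] - emp[j])               # L(f_n) minus its empirical risk
    print("average optimism:", np.mean(gaps))            # positive on average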

40 / 174

Page 43:

The term L(fn) − L̂(fn) is not necessarily zero in expectation (check!). An easy upper bound is

L(fn) − L̂(fn) ≤ sup_{f∈F} | L(f) − L̂(f) | ,

and this motivates the study of uniform laws of large numbers.

Roadmap: study the sup to

I understand when learning is possible,

I understand implications for sample complexity,

I design new algorithms (regularizer arises from upper bound on sup)

41 / 174

Page 44:

Loss Class

In what follows, we will work with a function class G on Z, and for the learning application, G = ℓ ◦ F:

G = { (x, y) ↦ ℓ(f(x), y) : f ∈ F }

and Z = X × Y.

42 / 174

Page 45:

Empirical Process Viewpoint

[Figure: the class G within the set of all functions; each function g is positioned by its expectation Eg and its empirical average (1/n) ∑_{i=1}^n g(zi), with the functions labelled g0, ĝn, and gG marked.]

43 / 174


Page 53:

Uniform Laws of Large Numbers

For a class G of functions g : Z → [0, 1], suppose that Z1, . . . , Zn, Z are i.i.d. on Z, and consider

U = sup_{g∈G} | E g(Z) − (1/n) ∑_{i=1}^n g(Zi) | = sup_{g∈G} | Pg − Png | =: ‖P − Pn‖_G ,

where P − Pn is the empirical process.

If U converges to 0, this is called a uniform law of large numbers.
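For a small, explicitly parameterized class the uniform deviation U can be simulated directly. A sketch, with an assumed class of monomials on Z = [0, 1] and P uniform, so that Pg is known exactly:

    import numpy as np

    rng = np.random.default_rng(0)
    ks = np.arange(1, 51)                                # G = { z -> z^k : k = 1, ..., 50 }

    for n in [100, 1000, 10000]:
        Z = rng.uniform(size=n)
        Png = np.mean(Z[None, :] ** ks[:, None], axis=1) # empirical averages P_n g
        Pg = 1.0 / (ks + 1)                              # exact expectations under P
        print(n, np.abs(Pg - Png).max())                 # U = sup_g |Pg - P_n g|, shrinking with n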

44 / 174

Page 54:

Glivenko-Cantelli Classes

G is a Glivenko-Cantelli class for P if ‖Pn − P‖_G → 0 in probability.

Definition.

I P is a distribution on Z,

I Z1, . . . ,Zn are drawn i.i.d. from P,

I Pn is the empirical distribution (mass 1/n at each of Z1, . . . ,Zn),

I G is a set of measurable real-valued functions on Z with finite expectation under P,

I Pn − P is an empirical process, that is, a stochastic process indexed by a class of functions G, and

I ‖Pn − P‖_G := sup_{g∈G} |Png − Pg|.

45 / 174

Page 55:

Glivenko-Cantelli Classes

Why 'Glivenko-Cantelli'? An example of a uniform law of large numbers.

‖Fn − F‖∞ → 0 almost surely.

Theorem (Glivenko-Cantelli).

Here, F is a cumulative distribution function, Fn is the empirical cumulative distribution function,

Fn(t) = (1/n) ∑_{i=1}^n I{Zi ≤ t},

where Z1, . . . , Zn are i.i.d. with distribution F, and ‖F − G‖∞ = sup_t |F(t) − G(t)|.

‖Pn − P‖_G → 0 almost surely, for G = { z ↦ I{z ≤ θ} : θ ∈ R }.

Theorem (Glivenko-Cantelli).
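The statement is easy to check numerically. A sketch, with P uniform on [0, 1] so that F(t) = t; for a continuous F, the supremum is attained at the sample points, where the empirical CDF jumps:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in [100, 1000, 10000, 100000]:
        Z = np.sort(rng.uniform(size=n))
        right = np.arange(1, n + 1) / n                  # F_n just after each order statistic
        left = np.arange(0, n) / n                       # F_n just before each order statistic
        sup = max(np.abs(right - Z).max(), np.abs(left - Z).max())
        print(n, sup)                                    # ||F_n - F||_inf -> 0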

46 / 174

Page 56:

Glivenko-Cantelli Classes

Not all G are Glivenko-Cantelli classes. For instance,

G = { I{z ∈ S} : S ⊂ R, |S| < ∞ } .

Then for a continuous distribution P, Pg = 0 for any g ∈ G, but sup_{g∈G} Png = 1 for all n. So although Png → Pg almost surely for every fixed g ∈ G, this convergence is not uniform over G. G is too large.
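The failure is easy to exhibit: after seeing the sample, choose the worst g in G for that sample, namely the indicator of the sample itself. A sketch of this illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    Z = rng.uniform(size=n)                              # the observed sample
    S = set(Z.tolist())                                  # finite set chosen after seeing the data
    g = lambda z: float(z in S)                          # g = 1{z in S}, a member of G

    print("P_n g =", np.mean([g(z) for z in Z]))         # equals 1 by construction
    Znew = rng.uniform(size=100_000)
    print("P g  ~", np.mean([g(z) for z in Znew]))       # ~ 0 under the continuous P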

In general, how do we decide whether G is “too large”?

47 / 174

Page 57:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

48 / 174

Page 58:

Rademacher Averages

Given a set H ⊆ [−1, 1]^n, measure its "size" by the average length of its projection onto a random direction.

Rademacher process indexed by a vector h ∈ H:

Rn(h) = (1/n) ⟨ε, h⟩

where ε = (ε1, . . . , εn) is a vector of i.i.d. Rademacher random variables (symmetric ±1).

Rademacher averages: the expected maximum of the process,

Eε ‖Rn‖_H = E sup_{h∈H} | (1/n) ⟨ε, h⟩ |

49 / 174

Page 59:

Examples

If H = {h0}, then

E‖Rn‖_H = O(n^{−1/2}).

If H = {−1, 1}^n, then

E‖Rn‖_H = 1.

If v + c·{−1, 1}^n ⊆ H, then

E‖Rn‖_H ≥ c − o(1).

How about H = {−1, 1}, where 1 is the vector of all 1's?

50 / 174

Page 60:

Examples

If H is finite,

E‖Rn‖_H ≤ c √( log|H| / n )

for some constant c.

This bound can be loose, as it does not take into account "overlaps"/correlations between vectors.

51 / 174

Page 61:

Examples

Let B^n_p be the unit ball in R^n:

B^n_p = { x ∈ R^n : ‖x‖_p ≤ 1 },   ‖x‖_p = ( ∑_{i=1}^n |xi|^p )^{1/p} .

For H = B^n_2,

(1/n) E max_{‖g‖2 ≤ 1} |⟨ε, g⟩| = (1/n) E‖ε‖2 = 1/√n .

On the other hand, the Rademacher averages for B^n_1 are 1/n.
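Both computations can be verified mechanically, since for ε ∈ {−1, +1}^n the supremum of ⟨ε, g⟩ over the ℓp unit ball is the dual norm of ε. A tiny check:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    eps = rng.choice([-1.0, 1.0], size=n)

    # l2 ball: sup <eps, g> = ||eps||_2 = sqrt(n), so the average is 1/sqrt(n).
    print(np.linalg.norm(eps, 2) / n, 1.0 / np.sqrt(n))
    # l1 ball: sup <eps, g> = ||eps||_inf = 1, so the average is 1/n.
    print(np.linalg.norm(eps, np.inf) / n, 1.0 / n)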

52 / 174

Page 62:

Rademacher Averages

What do these Rademacher averages have to do with our problem of bounding uniform deviations?

Suppose we take

H = G(Z_1^n) = { (g(Z1), . . . , g(Zn)) : g ∈ G } ⊂ R^n

[Figure: functions g1, g2, . . . in G evaluated at the data points z1, z2, . . . , zn, giving the coordinate vectors that make up H ⊂ R^n.]

zn<latexit sha1_base64="4ZFc4Osjg/3AMiZMY0aOID5dIWQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX6dWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8C+0uNsg==</latexit><latexit sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit 
sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit sha1_base64="kRKiSD+HgKged5aOJIkSLIO3sLs=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d5VePdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FwHKMDg==</latexit>

53 / 174

Page 63:

Rademacher Averages

Then, conditionally on Z_1^n,

    Eε‖R̂n‖G = Eε sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) |

is a data-dependent quantity (hence the "hat") that measures the complexity of G projected onto the data.

Also define the Rademacher process Rn (without the "hat"),

    Rn(g) = (1/n) Σ_{i=1}^n εi g(Zi),

and the Rademacher complexity of G as

    E‖Rn‖G = E_{Z_1^n} Eε‖R̂n‖_{G(Z_1^n)}.

54 / 174
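
Aside (not from the slides): the data-dependent quantity Eε‖R̂n‖G is easy to estimate by simulation. The Python sketch below uses an illustrative finite class of threshold functions on a fixed sample (the sample, the grid of thresholds, and all names are assumptions for the example); it averages the exact supremum over random sign vectors and compares the estimate with the finite-class bound √(2 log(2|G|)/n) that appears later in the deck.

import numpy as np

rng = np.random.default_rng(0)

# A fixed sample z_1, ..., z_n and a small finite class G of threshold functions.
n = 50
z = rng.uniform(0.0, 1.0, size=n)
thresholds = np.linspace(0.0, 1.0, 21)                      # assumed illustrative class
vals = (z[None, :] <= thresholds[:, None]).astype(float)    # row j holds g_j(z_1), ..., g_j(z_n)

def empirical_rademacher(vals, n_mc=5000, rng=rng):
    """Monte Carlo estimate of E_eps sup_g |(1/n) sum_i eps_i g(z_i)|."""
    n = vals.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))           # random sign vectors
    sups = np.abs(eps @ vals.T / n).max(axis=1)             # exact max over the finite class
    return sups.mean()

print("hat Rademacher average:", empirical_rademacher(vals))
print("finite-class bound    :", np.sqrt(2 * np.log(2 * len(thresholds)) / n))

Because the class is finite, the supremum is an exact maximum over |G| rows; the only approximation is the Monte Carlo average over the signs ε.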

Page 64:

Uniform Laws and Rademacher Complexity

We'll look at a proof of a uniform law of large numbers that involves two steps:

1. Symmetrization, which bounds E‖P − Pn‖G in terms of the Rademacher complexity of G, E‖Rn‖G.

2. Both ‖P − Pn‖G and ‖Rn‖G are concentrated about their expectation.

55 / 174

Page 65:

Uniform Laws and Rademacher Complexity

For any G,

    E‖P − Pn‖G ≤ 2 E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

    (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2 E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

    E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 a.s.

Theorem.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.

56 / 174

Page 66:

Uniform Laws and Rademacher Complexity

The first step is to symmetrize by replacing Pg by P′n g = (1/n) Σ_{i=1}^n g(Z′i).

In particular, we have

    E‖P − Pn‖G = E sup_{g∈G} | E[ (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) | Z_1^n ] |
               ≤ E E[ sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) |  |  Z_1^n ]
               = E sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) | = E‖P′n − Pn‖G.

57 / 174

Page 69:

Uniform Laws and Rademacher Complexity

Another symmetrization: for any εi ∈ {±1},

    E‖P′n − Pn‖G = E sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) |
                 = E sup_{g∈G} | (1/n) Σ_{i=1}^n εi (g(Z′i) − g(Zi)) |.

This follows from the fact that Zi and Z′i are i.i.d., and so the distribution of the supremum is unchanged when we swap them. And so in particular the expectation of the supremum is unchanged. And since this is true for any εi, we can take the expectation over any random choice of the εi. We'll pick them independently and uniformly.

58 / 174

Page 70:

Uniform Laws and Rademacher Complexity

    E sup_{g∈G} | (1/n) Σ_{i=1}^n εi (g(Z′i) − g(Zi)) |
        ≤ E [ sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Z′i) | + sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) | ]
        = 2 E sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) |        (Rademacher complexity)
        = 2 E‖Rn‖G,

where Rn is the Rademacher process Rn(g) = (1/n) Σ_{i=1}^n εi g(Zi).

59 / 174
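
A quick numeric sanity check of the symmetrization bound E‖P − Pn‖G ≤ 2 E‖Rn‖G (a sketch under assumed conditions, not from the slides: Z is uniform on [0, 1] and G is a finite grid of threshold functions, so Pg is known in closed form; grid size, sample size, and names are arbitrary choices).

import numpy as np

rng = np.random.default_rng(1)
n = 100
thresholds = np.linspace(0.0, 1.0, 51)        # finite class G = { z -> 1{z <= a} }

def sup_emp_process(z):
    # sup_a |P g_a - P_n g_a|, using P g_a = a for Z ~ Uniform[0, 1].
    emp = (z[None, :] <= thresholds[:, None]).mean(axis=1)
    return np.abs(thresholds - emp).max()

def sup_rademacher(z, rng):
    eps = rng.choice([-1.0, 1.0], size=z.shape[0])
    vals = (z[None, :] <= thresholds[:, None]).astype(float)
    return np.abs(vals @ eps / z.shape[0]).max()

emp, rad = [], []
for _ in range(2000):
    z = rng.uniform(0.0, 1.0, size=n)
    emp.append(sup_emp_process(z))
    rad.append(sup_rademacher(z, rng))

print("E||P - P_n||_G ~", np.mean(emp))
print("2 E||R_n||_G   ~", 2 * np.mean(rad))    # the symmetrization upper bound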

Page 71:

Uniform Laws and Rademacher Complexity

The second inequality (desymmetrization) follows from:

    E‖Rn‖G ≤ E‖ (1/n) Σ_{i=1}^n εi (g(Zi) − E g(Zi)) ‖G + E‖ (1/n) Σ_{i=1}^n εi E g(Zi) ‖G
           ≤ E‖ (1/n) Σ_{i=1}^n εi (g(Zi) − g(Z′i)) ‖G + ‖P‖G E| (1/n) Σ_{i=1}^n εi |
           = E‖ (1/n) Σ_{i=1}^n (g(Zi) − E g(Zi) + E g(Z′i) − g(Z′i)) ‖G + ‖P‖G E| (1/n) Σ_{i=1}^n εi |
           ≤ 2 E‖Pn − P‖G + √(2 log 2 / n).

60 / 174

Page 72:

Uniform Laws and Rademacher Complexity

For any G,

    E‖P − Pn‖G ≤ 2 E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

    (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2 E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

    E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 a.s.

Theorem.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.

61 / 174

Page 73:

Back to the prediction problem

Recall: for the prediction problem, we have G = ℓ ∘ F for some loss function ℓ and a set of functions F.

We found that E‖P − Pn‖ℓ∘F can be related to E‖Rn‖ℓ∘F. Question: can we get E‖Rn‖F instead?

For classification, this is easy. Suppose F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = 1{y ≠ y′} = (1 − y y′)/2. Then

    E‖Rn‖ℓ∘F = E sup_{f∈F} | (1/n) Σ_{i=1}^n εi ℓ(yi, f(xi)) |
             = (1/2) E sup_{f∈F} | (1/n) Σ_{i=1}^n εi + (1/n) Σ_{i=1}^n (−εi yi f(xi)) |
             ≤ O(1/√n) + (1/2) E‖Rn‖F.

Other loss functions?

62 / 174

Page 74:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

63 / 174

Page 75:

Rademacher Complexity: Structural Results

Subsets:       F ⊆ G implies ‖Rn‖F ≤ ‖Rn‖G.

Scaling:       ‖Rn‖cF = |c| ‖Rn‖F.

Plus Constant: For |g(X)| ≤ 1, |E‖Rn‖F+g − E‖Rn‖F| ≤ √(2 log 2 / n).

Convex Hull:   ‖Rn‖co F = ‖Rn‖F, where co F is the convex hull of F.

Contraction:   If ℓ : R × R → [−1, 1] has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y, then for ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}, E‖Rn‖ℓ∘F ≤ 2 E‖Rn‖F + c/√n.

Theorem.

64 / 174

Page 76:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

65 / 174

Page 77:

Deep Networks

Deep compositions of nonlinear functions

    h = hm ∘ hm−1 ∘ · · · ∘ h1

e.g.,  hi : x ↦ σ(Wi x)   or   hi : x ↦ r(Wi x),

    σ(v)i = 1 / (1 + exp(−vi)),    r(v)i = max{0, vi}.

66 / 174

Page 82:

Rademacher Complexity of Sigmoid Networks

Consider the following class FB,d of two-layer neural networks on R^d:

    FB,d = { x ↦ Σ_{i=1}^k wi σ(vi⊤ x) : ‖w‖₁ ≤ B, ‖vi‖₁ ≤ B, k ≥ 1 },

where B > 0 and the nonlinear function σ : R → R satisfies the Lipschitz condition |σ(a) − σ(b)| ≤ |a − b| and σ(0) = 0. Suppose that the distribution is such that ‖X‖∞ ≤ 1 a.s. Then

    E‖Rn‖_{FB,d} ≤ 2B² √(2 log(2d) / n),

where d is the dimension of the input space, X = R^d.

Theorem.

67 / 174

Page 83:

Rademacher Complexity of Sigmoid Networks: Proof

Define G := {(x1, . . . , xd) ↦ xj : 1 ≤ j ≤ d},

    VB := { x ↦ v⊤x : ‖v‖₁ = Σ_{i=1}^d |vi| ≤ B } = B co(G ∪ −G),

with

    co(F) := { Σ_{i=1}^k αi fi : k ≥ 1, αi ≥ 0, ‖α‖₁ = 1, fi ∈ F }.

Then

    FB,d = { x ↦ Σ_{i=1}^k wi σ(vi(x)) : k ≥ 1, Σ_{i=1}^k |wi| ≤ B, vi ∈ VB }
         = B co(σ ∘ VB ∪ −σ ∘ VB).

68 / 174

Page 85:

Rademacher Complexity of Sigmoid Networks: Proof

    E‖Rn‖_{FB,d} = E‖Rn‖_{B co(σ∘VB ∪ −σ∘VB)}
                 = B E‖Rn‖_{co(σ∘VB ∪ −σ∘VB)}
                 = B E‖Rn‖_{σ∘VB ∪ −σ∘VB}
                 ≤ B E‖Rn‖_{σ∘VB} + B E‖Rn‖_{−σ∘VB}
                 = 2B E‖Rn‖_{σ∘VB}
                 ≤ 2B E‖Rn‖_{VB}
                 = 2B E‖Rn‖_{B co(G ∪ −G)}
                 = 2B² E‖Rn‖_{G ∪ −G}
                 ≤ 2B² √(2 log(2d) / n).

69 / 174

Page 93:

Rademacher Complexity of Sigmoid Networks: Proof

The final step involved the following result.

For f ∈ F satisfying |f(x)| ≤ 1,

    E‖Rn‖F ≤ E √(2 log(2 |F(X_1^n)|) / n).

Lemma.

70 / 174
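
As an illustration of this lemma (an assumed numeric example, not part of the slides): for the coordinate class G ∪ −G used in the final step of the sigmoid-network proof, a Monte Carlo estimate of the Rademacher average sits below the √(2 log(2d)/n) bound used there. The data distribution, sizes, and names below are assumptions.

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.uniform(-1.0, 1.0, size=(n, d))        # ||X||_inf <= 1, as in the theorem

def rademacher_coordinate_class(X, n_mc=5000, rng=rng):
    """E_eps max_j |(1/n) sum_i eps_i X_ij|; the absolute value already covers -G."""
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.abs(eps @ X / n).max(axis=1).mean()

print("E||R_n||_{G u -G} ~", rademacher_coordinate_class(X))
print("sqrt(2 log(2d)/n)  =", np.sqrt(2 * np.log(2 * d) / n))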

Page 94:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

71 / 174

Page 95:

Rademacher Complexity of ReLU Networks

Using the positive homogeneity property of the ReLU nonlinearity (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)) gives a simple argument to bound the Rademacher complexity.

For a training set (X1, Y1), . . . , (Xn, Yn) ∈ X × {±1} with ‖Xi‖ ≤ 1 a.s.,

    E‖Rn‖_{F_{F,B}} ≤ (2B)^L / √n,

where f ∈ F_{F,B} is an L-layer network of the form

    F_{F,B} := { x ↦ WL σ(WL−1 · · · σ(W1 x) · · · ) },

σ is 1-Lipschitz, positive homogeneous (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)), and applied componentwise, and ‖Wi‖F ≤ B. (WL is a row vector.)

Theorem.

72 / 174

Page 96:

ReLU Networks: Proof

(Write Eε as the conditional expectation given the data.)

    Eε sup_{f∈F, ‖W‖_F ≤ B} (1/n) ‖ Σ_{i=1}^n εi σ(W f(Xi)) ‖₂
        ≤ 2B Eε sup_{f∈F} (1/n) ‖ Σ_{i=1}^n εi f(Xi) ‖₂.

Lemma.

Iterating this and using Jensen’s inequality proves the theorem.

73 / 174

Page 97:

ReLU Networks: Proof

For W⊤ = (w1 · · · wk), we use positive homogeneity:

    ‖ Σ_{i=1}^n εi σ(W f(xi)) ‖² = Σ_{j=1}^k ( Σ_{i=1}^n εi σ(wj⊤ f(xi)) )²
                                 = Σ_{j=1}^k ‖wj‖² ( Σ_{i=1}^n εi σ( (wj⊤/‖wj‖) f(xi) ) )².

74 / 174

Page 98:

ReLU Networks: Proof

    sup_{‖W‖_F ≤ B} Σ_{j=1}^k ‖wj‖² ( Σ_{i=1}^n εi σ( (wj⊤/‖wj‖) f(xi) ) )²
        = sup_{‖wj‖=1, ‖α‖₁ ≤ B²} Σ_{j=1}^k αj ( Σ_{i=1}^n εi σ(wj⊤ f(xi)) )²
        = B² sup_{‖w‖=1} ( Σ_{i=1}^n εi σ(w⊤ f(xi)) )²,

then apply the Contraction and Cauchy–Schwarz inequalities.

75 / 174

Page 99:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

76 / 174

Page 100:

Rademacher Complexity and Covering Numbers

An ε-cover of a subset T of a metric space (S, d) is a set T̂ ⊂ T such that for each t ∈ T there is a t̂ ∈ T̂ such that d(t, t̂) ≤ ε. The ε-covering number of T is

    N(ε, T, d) = min{ |T̂| : T̂ is an ε-cover of T }.

Definition.

Intuition: A k-dimensional set T has N(ε, T, d) = Θ(1/ε^k). Example: ([0, 1]^k, ℓ∞).

77 / 174
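
To make the intuition concrete, here is a small sketch (an illustrative construction, not from the slides; the function name and sizes are assumptions) of an explicit ε-cover of ([0, 1]^k, ℓ∞): a grid with spacing 2ε, whose size grows like (1/(2ε))^k.

import itertools
import numpy as np

def linf_cover_of_cube(k, eps):
    """An explicit eps-cover of [0,1]^k in l_inf: a grid with spacing 2*eps."""
    m = int(np.ceil(1.0 / (2 * eps)))                       # points per coordinate
    grid_1d = np.minimum((2 * np.arange(m) + 1) * eps, 1.0)  # centers eps, 3*eps, ...
    return list(itertools.product(grid_1d, repeat=k))

for eps in [0.25, 0.1, 0.05]:
    cover = linf_cover_of_cube(k=3, eps=eps)
    print(f"eps={eps}: |cover| = {len(cover)}  vs  (1/(2*eps))^k = {(1/(2*eps))**3:.0f}")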

Page 101:

Rademacher Complexity and Covering Numbers

For any F and any X1, . . . , Xn,

    Eε‖R̂n‖_{F(X_1^n)} ≤ c ∫_0^∞ √( ln N(ε, F(X_1^n), ℓ₂) / n ) dε

and

    Eε‖R̂n‖_{F(X_1^n)} ≥ (1 / (c log n)) sup_{α>0} α √( ln N(α, F(X_1^n), ℓ₂) / n ).

Theorem.

78 / 174

Page 102:

Covering numbers and smoothly parameterized functions

Let F be a parameterized class of functions,

    F = { f(θ, ·) : θ ∈ Θ }.

Let ‖·‖Θ be a norm on Θ and let ‖·‖F be a norm on F. Suppose that the mapping θ ↦ f(θ, ·) is L-Lipschitz, that is,

    ‖f(θ, ·) − f(θ′, ·)‖F ≤ L ‖θ − θ′‖Θ.

Then N(ε, F, ‖·‖F) ≤ N(ε/L, Θ, ‖·‖Θ).

Lemma.

79 / 174

Page 103:

Covering numbers and smoothly parameterized functions

A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space.

Example: If F is smoothly parameterized by (a compact set of) k parameters, then N(ε, F) = O(1/ε^k).

80 / 174
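
A small sketch of the lemma in action (the particular parameterized class, sin(θx) on [0, 1] with θ ∈ [0, 2], is an assumed example, not from the slides): θ ↦ f(θ, ·) is 1-Lipschitz into the sup-norm, so an (ε/L)-grid of the parameter interval yields an ε-cover of the function class.

import numpy as np

# Assumed example: F = { x -> sin(theta * x) on [0, 1] : theta in [0, 2] }.
# |sin(theta x) - sin(theta' x)| <= |theta - theta'| |x| <= |theta - theta'|,
# so theta -> f(theta, .) is 1-Lipschitz from (Theta, |.|) into (F, sup-norm).
xs = np.linspace(0.0, 1.0, 200)
eps, L = 0.1, 1.0

centers = np.arange(eps / L, 2.0, 2 * eps / L)   # an (eps/L)-cover of Theta = [0, 2]
print("|cover of Theta| =", len(centers))

# The translated cover of F: check every f(theta, .) is within eps in sup-norm.
worst = 0.0
for theta in np.linspace(0.0, 2.0, 500):
    dists = [np.max(np.abs(np.sin(theta * xs) - np.sin(c * xs))) for c in centers]
    worst = max(worst, min(dists))
print("max distance to the cover:", round(worst, 3), " (<= eps =", eps, ")")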

Page 104:

Recap

[Figure: schematic of the class G within the set of all functions, marking g0, gG, gn, Eg, and the empirical average (1/n) Σ_{i=1}^n g(zi).]

81 / 174

Page 112:

Recap

    E‖P − Pn‖F ≈ E‖Rn‖F

Advantage of dealing with the Rademacher process:

- often easier to analyze
- can work conditionally on the data

Example: F = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1}, with ‖x‖ ≤ 1. Then

    Eε‖R̂n‖F = Eε sup_{‖w‖≤1} | (1/n) Σ_{i=1}^n εi ⟨w, xi⟩ | = Eε sup_{‖w‖≤1} | ⟨w, (1/n) Σ_{i=1}^n εi xi⟩ |,

which is equal to

    Eε ‖ (1/n) Σ_{i=1}^n εi xi ‖ ≤ √( Σ_{i=1}^n ‖xi‖² ) / n = √( tr(XX⊤) ) / n.

82 / 174
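
The closed-form bound on this slide is easy to check numerically. A minimal sketch (assumed data, names, and Monte Carlo size, not from the slides): sample random signs, compute ‖(1/n) Σ εi xi‖ exactly for each draw, and compare the average with √(tr(XX⊤))/n.

import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x_i|| <= 1

def rademacher_linear_class(X, n_mc=5000, rng=rng):
    """E_eps sup_{||w||<=1} |(1/n) sum_i eps_i <w, x_i>| = E_eps ||(1/n) sum_i eps_i x_i||."""
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.linalg.norm(eps @ X / n, axis=1).mean()

print("E_eps ||hat R_n||_F ~", rademacher_linear_class(X))
print("sqrt(tr(X X^T))/n   =", np.sqrt(np.trace(X @ X.T)) / n)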

Page 113:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

83 / 174

Page 114:

Classification and Rademacher Complexity

Recall:

For F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = 1{y ≠ y′} = (1 − y y′)/2,

    E‖Rn‖ℓ∘F ≤ (1/2) E‖Rn‖F + O(1/√n).

How do we get a handle on the Rademacher complexity of a set F of classifiers? Recall that we are looking at the "size" of F projected onto data:

    F(x_1^n) = { (f(x1), . . . , f(xn)) : f ∈ F } ⊆ {−1, 1}^n

84 / 174

Page 115:

Controlling Rademacher Complexity: Growth Function

Example: For the class of step functions, F = {x ↦ 1{x ≤ α} : α ∈ R}, the set of restrictions,

    F(x_1^n) = { (f(x1), . . . , f(xn)) : f ∈ F },

is always small: |F(x_1^n)| ≤ n + 1. So E‖Rn‖F ≤ √( 2 log(2(n + 1)) / n ).

Example: If F is parameterized by k bits, F = { x ↦ f(x, θ) : θ ∈ {0, 1}^k } for some f : X × {0, 1}^k → [0, 1], then

    |F(x_1^n)| ≤ 2^k,    E‖Rn‖F ≤ √( 2(k + 1) log 2 / n ).

85 / 174
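
A direct count for the first example (an illustrative sketch, not from the slides; the sample and names are assumptions): sweeping the threshold across the gaps between sorted sample points produces exactly n + 1 distinct restriction vectors.

import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, size=20))

# Restrictions of F = { x -> 1{x <= a} : a in R } to the sample: one threshold
# per "gap" between consecutive sample points is enough to see every pattern.
sweep = np.concatenate(([-1.0], (x[:-1] + x[1:]) / 2, [2.0]))
patterns = {tuple((x <= a).astype(int)) for a in sweep}

print("n =", len(x), "  |F(x_1^n)| =", len(patterns))   # equals n + 1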

Page 117:

Growth Function

For a class F ⊆ {0, 1}^X, the growth function is

    ΠF(n) = max{ |F(x_1^n)| : x1, . . . , xn ∈ X }.

Definition.

- E‖Rn‖F ≤ √( 2 log(2 ΠF(n)) / n );  E‖P − Pn‖F = O( √( log(ΠF(n)) / n ) ).
- ΠF(n) ≤ |F|,  lim_{n→∞} ΠF(n) = |F|.
- ΠF(n) ≤ 2^n. (But then this gives no useful bound on E‖Rn‖F.)

86 / 174

Page 121:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

87 / 174

Page 122:

Vapnik-Chervonenkis Dimension

A class F ⊆ {0, 1}^X shatters {x1, . . . , xd} ⊆ X means that |F(x_1^d)| = 2^d. The Vapnik-Chervonenkis dimension of F is

    dVC(F) = max{ d : some x1, . . . , xd ∈ X is shattered by F }
           = max{ d : ΠF(d) = 2^d }.

Definition.

88 / 174
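
For simple classes the definition can be checked by brute force. The sketch below (assumed 1-D examples, not from the slides; a grid of parameters suffices here because only the ordering of the parameter relative to the sample matters) verifies that thresholds on the line shatter one point but not two, and intervals shatter two points but not three.

import itertools
import numpy as np

def achievable_labelings(points, classifiers):
    """The set F(points): all labelings of the points realized by the classifiers."""
    return {tuple(int(c(x)) for x in points) for c in classifiers}

def is_shattered(points, classifiers):
    return len(achievable_labelings(points, classifiers)) == 2 ** len(points)

grid = np.linspace(-1.0, 2.0, 301)   # parameter grid; only its order vs the points matters
thresholds = [lambda x, a=a: x >= a for a in grid]                       # 1{x >= a}
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a, b in itertools.product(grid, grid)]                  # 1{a <= x <= b}

print(is_shattered([0.3], thresholds), is_shattered([0.3, 0.7], thresholds))          # True False
print(is_shattered([0.3, 0.7], intervals), is_shattered([0.2, 0.5, 0.8], intervals))  # True False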

Page 123:

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

dVC(F) ≤ d implies

    ΠF(n) ≤ Σ_{i=0}^d (n choose i).

If n ≥ d, the latter sum is no more than (en/d)^d.

Theorem (Vapnik-Chervonenkis).

So the VC-dimension d is a single integer summary of the growth function:

    ΠF(n) = 2^n if n ≤ d,    ΠF(n) ≤ (e/d)^d n^d if n > d.

89 / 174

Page 125:

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

Thus, for dVC(F) ≤ d and n ≥ d,

    sup_P E‖P − Pn‖F = O( √( d log(n/d) / n ) ).

And there is a matching lower bound.

Uniform convergence uniformly over probability distributions is equivalent to finiteness of the VC-dimension.

90 / 174

Page 126:

VC-dimension Bounds for Parameterized Families

Consider a parameterized class of binary-valued functions,

    F = { x ↦ f(x, θ) : θ ∈ R^p },

where f : X × R^p → {±1}.

Suppose that for each x, θ ↦ f(x, θ) can be computed using no more than t operations of the following kinds:

1. arithmetic (+, −, ×, /),

2. comparisons (>, =, <),

3. output ±1.

dVC(F) ≤ 4p(t + 2).

Theorem (Goldberg and Jerrum).

91 / 174

Page 127:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

92 / 174

Page 128:

VC-Dimension of Neural Networks

Consider the class F of {−1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:

1. Piecewise constant (linear threshold units): VCdim(F) = Θ(p).
   (Baum and Haussler, 1989)

2. Piecewise linear (ReLUs): VCdim(F) = Θ(pL).
   (B., Harvey, Liaw, Mehrabian, 2017)

3. Piecewise polynomial: VCdim(F) = O(pL²).
   (B., Maiorov, Meir, 1998)

4. Sigmoid: VCdim(F) = O(p²k²).
   (Karpinsky and MacIntyre, 1994)

Theorem.

93 / 174

Page 133:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

94 / 174

Page 134:

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

95 / 174

Page 135:

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

96 / 174

Page 136:

Some Experimental Observations

(Lawrence, Giles, Tsoi, 1997)

97 / 174

Page 137:

Some Experimental Observations

(Neyshabur, Tomioka, Srebro, 2015)

98 / 174

Page 138:

Large-Margin Classifiers

- Consider a real-valued function f : X → R used for classification.
- The prediction on x ∈ X is sign(f(x)) ∈ {−1, 1}.
- If yf(x) > 0 then f classifies x correctly.
- We call yf(x) the margin of f on x. (c.f. the perceptron)
- We can view a larger margin as a more confident correct classification.
- Minimizing a continuous loss, such as

      Σ_{i=1}^n (f(Xi) − Yi)²,   or   Σ_{i=1}^n exp(−Yi f(Xi)),

  encourages large margins.
- For large-margin classifiers, we should expect the fine-grained details of f to be less important.

99 / 174

Page 139:

Recall

Contraction property:

If ℓ has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y and |ℓ| ≤ 1, then

    E‖Rn‖ℓ∘F ≤ 2 E‖Rn‖F + c/√n.

Unfortunately, classification loss is not Lipschitz. By writing classification loss as (1 − y y′)/2, we did show that

    E‖Rn‖ℓ∘sign(F) ≤ 2 E‖Rn‖sign(F) + c/√n,

but we want E‖Rn‖F.

100 / 174

Page 140:

Rademacher Complexity for Lipschitz Loss

Consider the 1/γ-Lipschitz surrogate loss

    φ(α) = 1          if α ≤ 0,
           1 − α/γ    if 0 < α < γ,
           0          if α ≥ γ.

[Figure: φ(yf(x)) and the 0-1 loss 1{yf(x) ≤ 0} plotted against yf(x).]

101 / 174
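
A compact way to write this surrogate (a sketch; the function name is an assumption) clips the linear ramp to [0, 1], which also makes the sandwich 1{α ≤ 0} ≤ φ(α) ≤ 1{α ≤ γ} easy to check numerically.

import numpy as np

def ramp_loss(margin, gamma):
    """phi: 1 for margin <= 0, 1 - margin/gamma in between, 0 for margin >= gamma."""
    return np.clip(1.0 - margin / gamma, 0.0, 1.0)

m = np.array([-0.5, 0.0, 0.25, 0.5, 1.0])
print(ramp_loss(m, gamma=0.5))
# Sandwich: 1{m <= 0} <= phi(m) <= 1{m <= gamma}
print((m <= 0).astype(float), ramp_loss(m, 0.5), (m <= 0.5).astype(float))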

Page 141:

Rademacher Complexity for Lipschitz Loss

Large margin loss is an upper bound and classification loss is a lower bound:

    1{Yf(X) ≤ 0} ≤ φ(Yf(X)) ≤ 1{Yf(X) ≤ γ}.

So if we can relate the Lipschitz risk Pφ(Yf(X)) to the Lipschitz empirical risk Pnφ(Yf(X)), we have a large margin bound:

    P 1{Yf(X) ≤ 0} ≤ P φ(Yf(X))   c.f.   Pn φ(Yf(X)) ≤ Pn 1{Yf(X) ≤ γ}.

102 / 174

Page 142:

Rademacher Complexity for Lipschitz Loss

With high probability, for all f ∈ F,

    L01(f) = P 1{Yf(X) ≤ 0}
           ≤ P φ(Yf(X))
           ≤ Pn φ(Yf(X)) + (c/γ) E‖Rn‖F + O(1/√n)
           ≤ Pn 1{Yf(X) ≤ γ} + (c/γ) E‖Rn‖F + O(1/√n).

Notice that we've turned a classification problem into a regression problem.

The VC-dimension (which captures arbitrarily fine-grained properties of the function class) is no longer important.

103 / 174

Page 143:

Example: Linear Classifiers

Let F = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1}. Then, since E‖Rn‖F ≲ n^{−1/2},

    L01(f) ≲ (1/n) Σ_{i=1}^n 1{Yi f(Xi) ≤ γ} + (1/γ) · (1/√n).

Compare to Perceptron.

104 / 174
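
Ignoring constants, the right-hand side of this bound can be computed from data. A rough sketch (synthetic data, an arbitrary unit-norm predictor, and all names are assumptions; the confidence term and constants are omitted, matching the ≲ on the slide):

import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 20
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x|| <= 1
y = np.sign(X @ w_star + 0.05 * rng.normal(size=n))

w = (y[:, None] * X).mean(axis=0); w /= np.linalg.norm(w)   # some unit-norm predictor
margins = y * (X @ w)

for gamma in [0.05, 0.1, 0.2]:
    margin_term = np.mean(margins <= gamma)            # (1/n) sum_i 1{ y_i f(x_i) <= gamma }
    complexity_term = 1.0 / (gamma * np.sqrt(n))       # (1/gamma) (1/sqrt(n)), constants dropped
    print(f"gamma={gamma}: {margin_term:.3f} + {complexity_term:.3f} = "
          f"{margin_term + complexity_term:.3f}")

Small γ shrinks the empirical margin term but inflates the complexity term, which is the tradeoff the bound expresses.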

Page 144:

Risk versus surrogate risk

In general, these margin bounds are upper bounds only.

But for any surrogate loss φ, we can define a transform ψ that gives a tight relationship between the excess 0-1 risk (over the Bayes risk) and the excess φ-risk:

1. For any P and f, ψ(L01(f) − L*01) ≤ Lφ(f) − L*φ.

2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with

       L01(f) − L*01 = θ,    ψ(θ) ≤ Lφ(f) − L*φ ≤ ψ(θ) + ε.

Theorem.

105 / 174

Page 145:

Generalization: Margins and Size of Parameters

- A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.
- For regression, the complexity of a neural network is controlled by the size of the parameters, and can be independent of the number of parameters.
- We have a tradeoff between the fit to the training data (margins) and the complexity (size of parameters):

      Pr(sign(f(X)) ≠ Y) ≤ (1/n) Σ_{i=1}^n φ(Yi f(Xi)) + pn(f)

- Even if the training set is classified correctly, it might be worthwhile to increase the complexity, to improve this loss function.

106 / 174

Page 149:

In the next few sections, we show that

    E[ L(fn) − L̂(fn) ]

can be small if the learning algorithm fn has certain properties.

Keep in mind that this is an interesting quantity for two reasons:

- it upper bounds the excess expected loss of ERM, as we showed before.
- it can be written as an a-posteriori bound

      L(fn) ≤ L̂(fn) + . . .

  regardless of whether fn minimized empirical loss.

107 / 174

Page 150:

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

108 / 174

Page 151:

Compression Set

Let us use the shortened notation for data: S = {Z1, . . . , Zn}, and let us make the dependence of the algorithm fn on the training set explicit: fn = fn[S]. As before, denote G = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}, and write gn(·) = ℓ(fn(·), ·). Let us write gn[S](·) to emphasize the dependence.

Suppose there exists a "compression function" Ck which selects from any dataset S of size n a subset of k examples Ck(S) ⊆ S such that

    fn[S] = fk[Ck(S)].

That is, the learning algorithm produces the same function when given S or its subset Ck(S).

One can keep in mind the example of support vectors in SVMs.

109 / 174

Page 152:

Then,

    L(fn) − L̂(fn) = E gn − (1/n) Σ_{i=1}^n gn(Zi)
                  = E gk[Ck(S)](Z) − (1/n) Σ_{i=1}^n gk[Ck(S)](Zi)
                  ≤ max_{I ⊆ {1,...,n}, |I| ≤ k} [ E gk[S_I](Z) − (1/n) Σ_{i=1}^n gk[S_I](Zi) ],

where S_I is the subset indexed by I.

110 / 174

Page 153:

Since gk[S_I] only depends on k out of n points, the empirical average is "mostly out of sample". Adding and subtracting loss functions for an additional set of i.i.d. random variables W = {Z′1, . . . , Z′k} results in an upper bound

    max_{I ⊆ {1,...,n}, |I| ≤ k} [ E gk[S_I](Z) − (1/n) Σ_{Z′∈S′} gk[S_I](Z′) ] + (b − a) k / n,

where [a, b] is the range of functions in G and S′ is obtained from S by replacing S_I with the corresponding subset W_I.

111 / 174

Page 154:

For each fixed I, the random variable

    E gk[S_I](Z) − (1/n) Σ_{Z′∈S′} gk[S_I](Z′)

is zero mean with standard deviation O((b − a)/√n). Hence, the expected maximum over I with respect to S, W is at most

    c (b − a) √( k log(en/k) / n ),

since log (n choose k) ≤ k log(en/k).

112 / 174

Page 155:

Conclusion: the compression-style argument limits the bias

    E[ L(fn) − L̂(fn) ] ≤ O( √( k log n / n ) ),

which is non-vacuous if k = o(n / log n).

Recall that this term was the upper bound (up to log factors) on the expected excess loss of ERM if the class has VC dimension k. However, a possible equivalence between compression and VC dimension is still being investigated.

113 / 174

Example: Classification with Thresholds in 1D

▶ $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$
▶ $\mathcal{F} = \{f_\theta : f_\theta(x) = \mathbb{I}\{x \ge \theta\},\ \theta \in [0, 1]\}$
▶ $\ell(f_\theta(x), y) = \mathbb{I}\{f_\theta(x) \ne y\}$

[Figure: the interval $[0,1]$ with the ERM threshold $f_n$.]

For any set of data $(x_1, y_1), \ldots, (x_n, y_n)$, the ERM solution $f_n$ has the property that the first point $x_l$ to the left of the threshold has label $y_l = 0$, while the first point $x_r$ to the right has label $y_r = 1$.

It is enough to take $k = 2$ and define $f_n[S] = f_2[\{(x_l, 0), (x_r, 1)\}]$. A small numerical sketch of this compression is given below.
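As a minimal sketch (not from the slides), the following fits the 1D threshold class by brute-force ERM and then applies a two-point compression function; the helper names and the realizable synthetic data are illustrative assumptions.

```python
import numpy as np

def erm_threshold(x, y):
    """ERM over f_theta(x) = 1{x >= theta} under 0-1 loss: scan candidate thresholds."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    candidates = np.concatenate(([0.0], (xs[:-1] + xs[1:]) / 2, [1.0]))
    errors = [np.mean((xs >= t).astype(int) != ys) for t in candidates]
    return candidates[int(np.argmin(errors))]

def compress(x, y, theta):
    """Compression C_2: keep the closest 0-labelled point left of theta and the
    closest 1-labelled point right of theta; any threshold between them is again ERM."""
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < theta and yi == 0]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= theta and yi == 1]
    kept = []
    if left:
        kept.append(max(left))    # rightmost 0-labelled point on the left
    if right:
        kept.append(min(right))   # leftmost 1-labelled point on the right
    return kept

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = (x >= 0.4).astype(int)        # realizable labels for illustration
theta = erm_threshold(x, y)
print(theta, compress(x, y, theta))
```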

114 / 174

Further examples/observations:

▶ Compression of size $d$ for hyperplanes (realizable case)

▶ Compression of size $1/\gamma^2$ for the margin case

▶ The Bernstein bound gives a $1/n$ rate rather than a $1/\sqrt{n}$ rate on realizable data (zero empirical error).

115 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

116 / 174

Algorithmic Stability

Recall that compression was a way to upper bound $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$.

Algorithmic stability is another path to the same goal.

Compare:

▶ Compression: $f_n$ depends only on a subset of $k$ datapoints.

▶ Stability: $f_n$ does not depend on any of the datapoints too strongly.

As before, let us write the shorthand $g = \ell \circ f$ and $g_n = \ell \circ f_n$.

117 / 174

Algorithmic Stability

We now write
$$\mathbb{E}_S\, L(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z_1,\ldots,Z_n](Z).$$
Again, the meaning of $g_n[Z_1,\ldots,Z_n](Z)$: train on $Z_1,\ldots,Z_n$ and test on $Z$.

On the other hand,
$$\mathbb{E}_S\, \hat{L}(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n} \frac{1}{n}\sum_{i=1}^n g_n[Z_1,\ldots,Z_n](Z_i) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z_1,\ldots,Z_n}\, g_n[Z_1,\ldots,Z_n](Z_i) = \mathbb{E}_{Z_1,\ldots,Z_n}\, g_n[Z_1,\ldots,Z_n](Z_1),$$
where the last step holds for symmetric algorithms (with respect to permutations of the training data). Of course, instead of $Z_1$ we can take any $Z_i$.

118 / 174

Now comes the renaming trick. It takes a minute to get used to, if you haven't seen it.

Note that $Z_1,\ldots,Z_n,Z$ are i.i.d. Hence,
$$\mathbb{E}_S\, L(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z_1,\ldots,Z_n](Z) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z,Z_2,\ldots,Z_n](Z_1).$$
Therefore,
$$\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big] = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\Big\{ g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1) \Big\}.$$

119 / 174

Of course, we haven't really done much except re-writing the expectation. But the difference
$$g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1)$$
has a "stability" interpretation. If the output of the algorithm "does not change much" when one datapoint is replaced with another, then the gap $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$ is small.

Moreover, since everything we've written is an equality, this stability is equivalent to having a small gap $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$.

120 / 174

Page 163: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

NB: our aim of ensuring small EL(fn)− L(fn)

only makes sense if

L(fn) is small (e.g. on average). That is, the analysis only makes sensefor those methods that explicitly or implicitly minimize empirical loss (ora regularized variant of it).

It’s not enough to be stable. Consider a learning mechanism that ignoresthe data and outputs fn = f0, a constant function. Then

EL(fn)− L(fn)

= 0 and the algorithm is very stable. However, it does

not do anything interesting.

121 / 174

Uniform Stability

Rather than the average notion we just discussed, let's consider a much stronger notion:

We say that an algorithm is $\beta$-uniformly stable if
$$\forall i \in [n],\ \forall z_1,\ldots,z_n, z', z: \quad \big| g_n[S](z) - g_n[S^{i,z'}](z) \big| \le \beta,$$
where $S^{i,z'} = \{z_1,\ldots,z_{i-1}, z', z_{i+1},\ldots,z_n\}$.

122 / 174

Uniform Stability

Clearly, for any realization of $Z_1,\ldots,Z_n,Z$,
$$g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1) \le \beta,$$
and so the expected loss of a $\beta$-uniformly-stable ERM is $\beta$-close to its empirical error (in expectation). See recent work by Feldman and Vondrak for sharp high-probability statements.

Of course, it is unclear at this point whether a $\beta$-uniformly-stable ERM (or near-ERM) exists.

123 / 174

Kernel Ridge Regression

Consider
$$f_n = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2 + \lambda \|f\|_K^2$$
in the RKHS $\mathcal{H}$ corresponding to kernel $K$.

Assume $K(x, x) \le \kappa^2$ for any $x$.

Lemma. Kernel Ridge Regression is $\beta$-uniformly stable with $\beta = O\!\left(\frac{1}{\lambda n}\right)$.
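To make the lemma concrete, here is a minimal numerical sketch (not from the slides): it fits kernel ridge regression in closed form with an RBF kernel (so $\kappa = 1$), replaces one training point, and compares the loss at a fixed test point. The synthetic data, bandwidth, and function names are illustrative assumptions; the observed gap should be bounded, in order, by $1/(\lambda n)$.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); K(x, x) = 1, so kappa = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def krr_fit(X, Y, lam, bandwidth=1.0):
    # Closed-form KRR for (1/n) sum (f(X_i)-Y_i)^2 + lam ||f||_K^2:
    # alpha = (K + lam * n * I)^{-1} Y,  f(x) = sum_i alpha_i K(x, X_i).
    n = len(X)
    K = rbf_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)
    return lambda Xtest: rbf_kernel(Xtest, X, bandwidth) @ alpha

rng = np.random.default_rng(0)
n, lam = 200, 0.1
X = rng.uniform(-1, 1, size=(n, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

z_test, y_test = np.array([[0.3]]), np.sin(3 * 0.3)

f = krr_fit(X, Y, lam)
X2, Y2 = X.copy(), Y.copy()
X2[0], Y2[0] = rng.uniform(-1, 1), rng.standard_normal()   # replace one training point
f2 = krr_fit(X2, Y2, lam)

# Loss difference at the test point; uniform stability bounds it by O(1/(lambda * n)).
gap = abs((f(z_test)[0] - y_test) ** 2 - (f2(z_test)[0] - y_test) ** 2)
print(gap, 1 / (lam * n))
```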

124 / 174

Proof (stability of Kernel Ridge Regression)

To prove this, first recall the definition of a $\sigma$-strongly convex function $\phi$ on a convex domain $\mathcal{W}$:
$$\forall u, v \in \mathcal{W}, \quad \phi(u) \ge \phi(v) + \langle \nabla\phi(v), u - v\rangle + \frac{\sigma}{2}\|u - v\|^2.$$
Suppose $\phi, \phi'$ are both $\sigma$-strongly convex. Suppose $w, w'$ satisfy $\nabla\phi(w) = \nabla\phi'(w') = 0$. Then
$$\phi(w') \ge \phi(w) + \frac{\sigma}{2}\|w - w'\|^2 \quad\text{and}\quad \phi'(w) \ge \phi'(w') + \frac{\sigma}{2}\|w - w'\|^2.$$
As a trivial consequence,
$$\sigma\|w - w'\|^2 \le \big[\phi(w') - \phi'(w')\big] + \big[\phi'(w) - \phi(w)\big].$$

125 / 174

Proof (stability of Kernel Ridge Regression)

Now take
$$\phi(f) = \frac{1}{n}\sum_{i \in S} (f(x_i) - y_i)^2 + \lambda\|f\|_K^2 \quad\text{and}\quad \phi'(f) = \frac{1}{n}\sum_{i \in S'} (f(x_i) - y_i)^2 + \lambda\|f\|_K^2,$$
where $S$ and $S'$ differ in one element: $(x_i, y_i)$ is replaced with $(x'_i, y'_i)$.

Let $f_n, f'_n$ be the minimizers of $\phi, \phi'$, respectively. Then
$$\phi(f'_n) - \phi'(f'_n) \le \frac{1}{n}\Big( (f'_n(x_i) - y_i)^2 - (f'_n(x'_i) - y'_i)^2 \Big)$$
and
$$\phi'(f_n) - \phi(f_n) \le \frac{1}{n}\Big( (f_n(x'_i) - y'_i)^2 - (f_n(x_i) - y_i)^2 \Big).$$
NB: we have been operating with $f$ as vectors. To be precise, one needs to define the notion of strong convexity over $\mathcal{H}$. Let us sweep this under the rug and say that $\phi, \phi'$ are $2\lambda$-strongly convex with respect to $\|\cdot\|_K$.

126 / 174

Proof (stability of Kernel Ridge Regression)

Then $\|f_n - f'_n\|_K^2$ is at most
$$\frac{1}{2\lambda n}\Big( (f'_n(x_i) - y_i)^2 - (f_n(x_i) - y_i)^2 + (f_n(x'_i) - y'_i)^2 - (f'_n(x'_i) - y'_i)^2 \Big),$$
which is at most
$$\frac{1}{2\lambda n}\, C\, \|f_n - f'_n\|_\infty,$$
where $C = 4(1 + c)$ if $|Y_i| \le 1$ and $|f_n(x_i)| \le c$.

On the other hand, for any $x$,
$$f(x) = \langle f, K_x \rangle \le \|f\|_K \|K_x\| = \|f\|_K \sqrt{\langle K_x, K_x\rangle} = \|f\|_K \sqrt{K(x, x)} \le \kappa\|f\|_K,$$
and so $\|f\|_\infty \le \kappa\|f\|_K$.

127 / 174

Proof (stability of Kernel Ridge Regression)

Putting everything together,
$$\|f_n - f'_n\|_K^2 \le \frac{1}{2\lambda n}\, C\, \|f_n - f'_n\|_\infty \le \frac{\kappa C}{2\lambda n}\, \|f_n - f'_n\|_K.$$
Hence,
$$\|f_n - f'_n\|_K \le \frac{\kappa C}{2\lambda n}.$$
To finish the claim,
$$(f_n(x_i) - y_i)^2 - (f'_n(x_i) - y_i)^2 \le C\,\|f_n - f'_n\|_\infty \le \kappa C\, \|f_n - f'_n\|_K \le O\!\left(\frac{1}{\lambda n}\right).$$

128 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

129 / 174

We consider "local" procedures such as k-Nearest-Neighbors or local smoothing. They have a different bias-variance decomposition (we do not fix a class $\mathcal{F}$).

The analysis will rely on local similarity (e.g. Lipschitzness) of the regression function $f^*$.

Idea: to predict $y$ at a given $x$, look up in the dataset those $Y_i$ for which $X_i$ is "close" to $x$.

130 / 174

Bias-Variance

It's time to revisit the bias-variance picture. Recall that our goal was to ensure that
$$\mathbb{E}\, L(f_n) - L(f^*)$$
decreases with the data size $n$, where $f^*$ gives the smallest possible $L$.

For "simple problems" (that is, strong assumptions on $P$), one can ensure this without the bias-variance decomposition. Examples: Perceptron, linear regression in the $d < n$ regime, etc.

However, for more interesting problems, we cannot get this difference to be small without introducing bias, because otherwise the variance (fluctuation of the stochastic part) is too large.

Our approach so far was to split this term into an estimation-approximation error with respect to some class $\mathcal{F}$:
$$\big[\mathbb{E}\, L(f_n) - L(f_\mathcal{F})\big] + \big[L(f_\mathcal{F}) - L(f^*)\big].$$

131 / 174

Bias-Variance

Now we'll study a different bias-variance decomposition, typically used in nonparametric statistics. We will only work with square loss.

Rather than fixing $\mathcal{F}$ that controls the estimation error, we fix an algorithm (procedure/estimator) $f_n$ that has some tunable parameter.

By definition $\mathbb{E}[Y \mid X = x] = f^*(x)$. Then we write
$$\mathbb{E}\, L(f_n) - L(f^*) = \mathbb{E}(f_n(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2 = \mathbb{E}(f_n(X) - f^*(X) + f^*(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2 = \mathbb{E}(f_n(X) - f^*(X))^2,$$
because the cross term vanishes (check!).
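For completeness, here is the one-line check (not spelled out on the slide). Since the training sample is independent of the fresh pair $(X, Y)$ and $\mathbb{E}[Y \mid X] = f^*(X)$, conditioning on $X$ and the sample gives
$$\mathbb{E}\big[(f_n(X) - f^*(X))(f^*(X) - Y)\big] = \mathbb{E}\Big[(f_n(X) - f^*(X))\,\mathbb{E}\big[f^*(X) - Y \,\big|\, X,\, \text{sample}\big]\Big] = 0.$$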

132 / 174

Bias-Variance

Before proceeding, let us discuss the last expression.
$$\mathbb{E}(f_n(X) - f^*(X))^2 = \mathbb{E}_S \int_x (f_n(x) - f^*(x))^2\, P(dx) = \int_x \mathbb{E}_S (f_n(x) - f^*(x))^2\, P(dx).$$
We will often analyze $\mathbb{E}_S(f_n(x) - f^*(x))^2$ for fixed $x$ and then integrate.

The integral is a measure of distance between two functions:
$$\|f - g\|_{L_2(P)}^2 = \int_x (f(x) - g(x))^2\, P(dx).$$

133 / 174

Bias-Variance

Let us drop $L_2(P)$ from the notation for brevity. The bias-variance decomposition can be written as
$$\mathbb{E}_S \|f_n - f^*\|_{L_2(P)}^2 = \mathbb{E}_S \big\|f_n - \mathbb{E}_{Y_{1:n}}[f_n] + \mathbb{E}_{Y_{1:n}}[f_n] - f^*\big\|^2 = \mathbb{E}_S \big\|f_n - \mathbb{E}_{Y_{1:n}}[f_n]\big\|^2 + \mathbb{E}_{X_{1:n}} \big\|\mathbb{E}_{Y_{1:n}}[f_n] - f^*\big\|^2,$$
because the cross term is zero in expectation.

The first term is the variance, the second is the squared bias. One typically increases with the parameter, the other decreases.

The parameter is chosen either (a) theoretically or (b) by cross-validation (this is the usual case in practice).

134 / 174

k-Nearest Neighbors

As before, we are given $(X_1, Y_1), \ldots, (X_n, Y_n)$ i.i.d. from $P$. To make a prediction of $Y$ at a given $x$, we sort the points according to the distance $\|X_i - x\|$. Let
$$(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$$
be the sorted list (remember this depends on $x$).

The k-NN estimate is defined as
$$f_n(x) = \frac{1}{k}\sum_{i=1}^k Y_{(i)}.$$
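A minimal sketch of this estimator (not from the slides), on synthetic data with a smooth regression function; the data-generating choices are illustrative assumptions.

```python
import numpy as np

def knn_estimate(x, X, Y, k):
    """k-NN regression at a single query point x: average the Y's of the k
    training points X_i closest to x in Euclidean distance."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Y[nearest].mean()

rng = np.random.default_rng(0)
n, d, k = 500, 3, 20
X = rng.uniform(0, 1, size=(n, d))
f_star = lambda x: np.sin(2 * np.pi * x[..., 0])   # smooth (Lipschitz) regression function
Y = f_star(X) + 0.1 * rng.standard_normal(n)

x0 = np.full(d, 0.5)
print(knn_estimate(x0, X, Y, k), f_star(x0))
```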

135 / 174

Variance: given $x$,
$$f_n(x) - \mathbb{E}_{Y_{1:n}}[f_n(x)] = \frac{1}{k}\sum_{i=1}^k \big(Y_{(i)} - f^*(X_{(i)})\big),$$
which is on the order of $1/\sqrt{k}$. Then the variance is of the order $\frac{1}{k}$.

Bias: a bit more complicated. For a given $x$,
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x) = \frac{1}{k}\sum_{i=1}^k \big(f^*(X_{(i)}) - f^*(x)\big).$$
Suppose $f^*$ is 1-Lipschitz. Then the square of the above is
$$\left(\frac{1}{k}\sum_{i=1}^k \big(f^*(X_{(i)}) - f^*(x)\big)\right)^2 \le \frac{1}{k}\sum_{i=1}^k \big\|X_{(i)} - x\big\|^2.$$
So, the bias is governed by how close the closest $k$ random points are to $x$.

136 / 174

Page 179: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

Claim: enough to know the upper bound on the closest point to x amongn points.

Argument: for simplicity assume that n/k is an integer. Divide theoriginal (unsorted) dataset into k blocks, n/k size each. Let X i be theclosest point to x in ith block. Then the collection X 1, . . . ,X k , ak-subset which is no closer than the set of k nearest neighbors. That is,

1

k

k∑i=1

∥∥X(i) − x∥∥2 ≤ 1

k

k∑i=1

∥∥X i − x∥∥2

Taking expectation (with respect to dataset), the bias term is at most

E

1

k

k∑i=1

∥∥X i − x∥∥2

= E

∥∥X 1 − x∥∥2

which is expected squared distance from x to the closest point in arandom set of n/k points.

137 / 174

Page 180: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

If support of X is bounded and d ≥ 3, then one can estimate

E∥∥X − X(1)

∥∥2. n−2/d .

That is, we expect the closest neighbor of a random point X to be nofurther than n−1/d away from one of n randomly drawn points.

Thus, the bias (which is the expected squared distance from x to theclosest point in a random set of n/k points) is at most

(n/k)−2/d .

138 / 174

Putting everything together, the bias-variance decomposition yields
$$\frac{1}{k} + \left(\frac{k}{n}\right)^{2/d}.$$
The optimal choice is $k \sim n^{\frac{2}{2+d}}$, and the overall rate of estimation at a given point $x$ is $n^{-\frac{2}{2+d}}$.

Since the result holds for any $x$, the integrated risk is also
$$\mathbb{E}\,\|f_n - f^*\|^2 \lesssim n^{-\frac{2}{2+d}}.$$

139 / 174

Summary

▶ We sketched the proof that k-Nearest-Neighbors has sample complexity guarantees for prediction or estimation problems with square loss if $k$ is chosen appropriately.

▶ The analysis is very different from the "empirical process" approach for ERM.

▶ Truly nonparametric!

▶ No assumptions on the underlying density (in $d \ge 3$) beyond compact support. Additional assumptions are needed for $d \le 3$.

140 / 174

Local Kernel Regression: Nadaraya-Watson

Fix a kernel $K : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$. Assume $K$ is zero outside the unit Euclidean ball at the origin (not true for $e^{-x^2}$, but close enough).

[Figure: examples of kernel shapes, from Gyorfi et al.]

Let $K_h(x) = K(x/h)$, so that $K_h(x - x')$ is zero if $\|x - x'\| \ge h$.

$h$ is the "bandwidth", a tunable parameter.

Assume $K(x) > c\,\mathbb{I}\{\|x\| \le 1\}$ for some $c > 0$. This is important for the "averaging effect" to kick in.

141 / 174

The Nadaraya-Watson estimator:
$$f_n(x) = \sum_{i=1}^n Y_i\, W_i(x), \qquad W_i(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}.$$
(Note: $\sum_i W_i = 1$.)
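A minimal sketch of this local average (not from the slides), using the simplest compactly supported kernel $K(u) = \mathbb{I}\{\|u\| \le 1\}$; the synthetic data and bandwidth are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate at a query point x with the boxcar kernel
    K(u) = 1{||u|| <= 1}: a local average of Y_i over the ball of radius h."""
    u = np.linalg.norm(X - x, axis=1) / h
    w = (u <= 1.0).astype(float)            # K_h(x - X_i)
    if w.sum() == 0.0:
        return Y.mean()                     # fallback when no points fall within h
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(0)
n, d, h = 1000, 2, 0.2
X = rng.uniform(0, 1, size=(n, d))
f_star = lambda x: x[..., 0] + x[..., 1]    # Lipschitz regression function
Y = f_star(X) + 0.1 * rng.standard_normal(n)

x0 = np.array([0.4, 0.6])
print(nadaraya_watson(x0, X, Y, h), f_star(x0))
```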

142 / 174

Unlike the k-NN example, the bias is easier to estimate.

Bias: for a given $x$,
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] = \mathbb{E}_{Y_{1:n}}\left[\sum_{i=1}^n Y_i\, W_i(x)\right] = \sum_{i=1}^n f^*(X_i)\, W_i(x),$$
and so
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x) = \sum_{i=1}^n \big(f^*(X_i) - f^*(x)\big) W_i(x).$$
Suppose $f^*$ is 1-Lipschitz. Since $K_h$ is zero outside the $h$-radius ball,
$$\big|\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x)\big|^2 \le h^2.$$

143 / 174

Variance: we have
$$f_n(x) - \mathbb{E}_{Y_{1:n}}[f_n(x)] = \sum_{i=1}^n \big(Y_i - f^*(X_i)\big) W_i(x).$$
The expectation of the square of this difference is at most
$$\mathbb{E}\left[\sum_{i=1}^n \big(Y_i - f^*(X_i)\big)^2 W_i(x)^2\right],$$
since the cross terms are zero (fix the $X$'s, take expectation with respect to the $Y$'s).

We are left analyzing
$$n\,\mathbb{E}\left[\frac{K_h(x - X_1)^2}{\big(\sum_{i=1}^n K_h(x - X_i)\big)^2}\right].$$
Under some assumptions on the density of $X$, the denominator is at least $(nh^d)^2$ with high probability, whereas $\mathbb{E}\, K_h(x - X_1)^2 = O(h^d)$ assuming $\int K^2 < \infty$. This gives an overall variance of $O(1/(nh^d))$. Many details are skipped here (e.g. problems at the boundary, assumptions, etc.).

144 / 174

Overall, bias and variance with $h \sim n^{-\frac{1}{2+d}}$ yield
$$h^2 + \frac{1}{nh^d} = n^{-\frac{2}{2+d}}.$$

145 / 174

Summary

▶ Analyzed smoothing methods with kernels. As with nearest neighbors, slow (nonparametric) rates in large $d$.

▶ Same bias-variance decomposition approach as k-NN.

146 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

147 / 174

Overfitting in Deep Networks

▶ Deep networks can be trained to zero training error (for regression loss)

▶ ... with near state-of-the-art performance (Ruslan Salakhutdinov, Simons Machine Learning Boot Camp, January 2017)

▶ ... even for noisy problems.

▶ No tradeoff between fit to training data and complexity!

▶ Benign overfitting.

(Zhang, Bengio, Hardt, Recht, Vinyals, 2017)

148 / 174


Overfitting in Deep Networks

Not surprising:

▶ More parameters than sample size. (That's called 'nonparametric.')

▶ Good performance with zero classification loss on training data. (c.f. margins analysis: trade-off between (regression) fit and complexity.)

▶ Good performance with zero regression loss on training data when the Bayes risk is zero. (c.f. Perceptron)

Surprising:

▶ Good performance with zero regression loss on noisy training data.

▶ All the label noise is making its way into the parameters, but it isn't hurting prediction accuracy!

149 / 174


Overfitting

"... interpolating fits ... [are] unlikely to predict future data well at all."
"Elements of Statistical Learning," Hastie, Tibshirani, Friedman, 2001

150 / 174


Overfitting

151 / 174

Benign Overfitting

A new statistical phenomenon: good prediction with zero training error for regression loss.

▶ Statistical wisdom says a prediction rule should not fit too well.

▶ But deep networks are trained to fit noisy data perfectly, and they predict well.

152 / 174

Not an Isolated Phenomenon

[Figure: kernel regression on MNIST. Test error (log scale) vs. ridge parameter λ, for the digit pairs [2,5], [2,9], [3,6], [3,8], [4,7].]

λ = 0: the interpolated solution, perfect fit on training data.

153 / 174


Not an Isolated Phenomenon

154 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

155 / 174

Simplicial interpolation (Belkin-Hsu-Mitra)

Observe: 1-Nearest Neighbor is interpolating. However, one cannot guarantee that $\mathbb{E}\, L(f_n) - L(f^*)$ is small.

Cover-Hart '67:
$$\lim_{n\to\infty} \mathbb{E}\, L(f_n) \le 2\, L(f^*).$$

156 / 174

Simplicial interpolation (Belkin-Hsu-Mitra)

Under regularity conditions, simplicial interpolation $f_n$ satisfies
$$\limsup_{n\to\infty}\; \mathbb{E}\,\|f_n - f^*\|^2 \le \frac{2}{d+2}\,\mathbb{E}\big(f^*(X) - Y\big)^2.$$
Blessing of high dimensionality!

157 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

158 / 174

Nadaraya-Watson estimator (Belkin-R.-Tsybakov)

Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value $\tau$ at $0$, e.g.
$$K(x) = \min\{1/\|x\|^\alpha,\ \tau\}.$$
Note that large $\tau$ means $f_n(X_i) \approx Y_i$, since the weight $W_i(X_i)$ is large.

If $\tau = \infty$, we get interpolation $f_n(X_i) = Y_i$ of all training data. Yet the earlier proof still goes through. Hence, "memorizing the data" (governed by the parameter $\tau$) is completely decoupled from the bias-variance trade-off (as given by the parameter $h$).

Minimax rates are possible (with a suitable choice of $h$).
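A minimal sketch of the idea (not from the slides): the Nadaraya-Watson weights below use the truncated singular kernel $K(u) = \min\{1/\|u\|^\alpha, \tau\}$ restricted to the unit ball, so a large $\tau$ makes the estimate at a training point (nearly) reproduce its label while $h$ still controls the local averaging; the data, $\alpha$, $\tau$, and $h$ are illustrative assumptions.

```python
import numpy as np

def singular_kernel_nw(x, X, Y, h, alpha=1.0, tau=1e12):
    """Nadaraya-Watson with K(u) = min{1/||u||^alpha, tau}, truncated to ||u|| <= 1."""
    u = np.linalg.norm(X - x, axis=1) / h
    with np.errstate(divide="ignore"):
        w = np.minimum(u ** (-alpha), tau)   # huge (capped at tau) when x hits a data point
    w = np.where(u <= 1.0, w, 0.0)           # kernel is zero outside the bandwidth ball
    if w.sum() == 0.0:
        return Y.mean()                      # fallback when no points fall within h
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(0)
n, d, h = 500, 2, 0.3
X = rng.uniform(0, 1, size=(n, d))
Y = X[:, 0] + 0.1 * rng.standard_normal(n)

# At a training point the estimator (nearly) reproduces its label ...
print(singular_kernel_nw(X[0], X, Y, h), Y[0])
# ... while at a generic point it still averages locally.
print(singular_kernel_nw(np.array([0.5, 0.5]), X, Y, h))
```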

159 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

160 / 174

Min-norm interpolation in RKHS

Min-norm interpolation:
$$f_n := \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}, \quad \text{s.t. } f(x_i) = y_i \ \ \forall i \le n.$$
Closed-form solution:
$$f_n(x) = K(x, X)\, K(X, X)^{-1} Y.$$
Note: GD/SGD started from $0$ converges to this minimum-norm solution.

Difficulty in analyzing the bias/variance of $f_n$: it is not clear how the random matrix $K^{-1}$ "aggregates" the $Y$'s between data points.
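A minimal sketch of the closed-form interpolant (not from the slides), with a Laplace kernel and synthetic data as illustrative assumptions; it only checks that the resulting function fits the training labels exactly.

```python
import numpy as np

def laplace_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def min_norm_interpolant(X, Y, kernel):
    """Minimum-RKHS-norm interpolant: f(x) = K(x, X) K(X, X)^{-1} Y.
    Assumes K(X, X) is invertible (strictly positive definite kernel, distinct points)."""
    alpha = np.linalg.solve(kernel(X, X), Y)      # K(X, X)^{-1} Y
    return lambda Xtest: kernel(Xtest, X) @ alpha

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
Y = np.sign(X[:, 0]) + 0.2 * rng.standard_normal(n)   # noisy labels

f = min_norm_interpolant(X, Y, laplace_kernel)
print(np.max(np.abs(f(X) - Y)))                 # ~0: the training data are interpolated
```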

161 / 174

Min-norm interpolation in $\mathbb{R}^d$ with $d \asymp n$
(Hastie, Montanari, Rosset, Tibshirani, 2019)

Linear regression with $p, n \to \infty$, $p/n \to \gamma$, and the empirical spectral distribution of $\Sigma$ (the discrete measure on its set of eigenvalues) converging to a fixed measure.

Apply random matrix theory to explore the asymptotics of the excess prediction error as $\gamma$, the noise variance, and the eigenvalue distribution vary.

Also study the asymptotics of a model involving random nonlinear features.

162 / 174

Implicit regularization in the regime $d \asymp n$
(Liang and R., 2018)

Informal Theorem: geometric properties of the data design $X$, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

The bias is bounded in terms of the effective dimension
$$\mathrm{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\Big[\gamma + \lambda_j\!\left(\frac{XX^\top}{d}\right)\Big]^2},$$
where $\gamma > 0$ is due to the implicit regularization coming from the curvature of the kernel and high dimensionality.

Compare with regularized least squares as a low-pass filter:
$$\mathrm{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\lambda + \lambda_j\!\left(\frac{XX^\top}{d}\right)}.$$

163 / 174

Implicit regularization in the regime $d \asymp n$

[Figure: test error (y-axis) as a function of how quickly the spectrum decays. Best performance if the spectrum decays, but not too quickly.]

$$\lambda_j(\Sigma) = \left(1 - \left(\frac{j}{d}\right)^{\kappa}\right)^{1/\kappa}$$

164 / 174

Lower Bounds
(R. and Zhai, 2018)

The min-norm solution for the Laplace kernel
$$K_\sigma(x, x') = \sigma^{-d} \exp\big\{-\|x - x'\|/\sigma\big\}$$
is not consistent if $d$ is low:

Theorem. For odd dimension $d$, with probability $1 - O(n^{-1/2})$, for any choice of $\sigma$,
$$\mathbb{E}\,\|f_n - f^*\|^2 \ge \Omega_d(1).$$

Conclusion: one needs high dimensionality of the data.

165 / 174

Connection to neural networks

For wide enough randomly initialized neural networks, GD dynamics quickly converge to (approximately) the min-norm interpolating solution with respect to a certain kernel.
$$f(x) = \frac{1}{\sqrt{m}}\sum_{i=1}^m a_i\,\sigma(\langle w_i, x\rangle)$$
For square loss and ReLU $\sigma$,
$$\frac{\partial L}{\partial w_i} = \frac{1}{\sqrt{m}}\cdot\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\, a_i\,\sigma'(\langle w_i, x_j\rangle)\, x_j$$
GD in continuous time:
$$\frac{d}{dt}\, w_i(t) = -\frac{\partial L}{\partial w_i}$$
Thus
$$\frac{d}{dt}\, f(x) = \sum_{i=1}^m \left(\frac{\partial f(x)}{\partial w_i}\right)^{\!\top}\!\left(\frac{d w_i}{dt}\right) = \sum_{i=1}^m \left(\frac{1}{\sqrt{m}}\, a_i\,\sigma'(\langle w_i, x\rangle)\, x\right)^{\!\top}\!\left(-\frac{1}{\sqrt{m}}\cdot\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\, a_i\,\sigma'(\langle w_i, x_j\rangle)\, x_j\right)$$
$$= -\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\left[\frac{1}{m}\sum_{i=1}^m a_i^2\,\sigma'(\langle w_i, x\rangle)\,\sigma'(\langle w_i, x_j\rangle)\,\langle x, x_j\rangle\right] = -\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\,\underbrace{K_m(x, x_j)}_{\to\, K_\infty(x, x_j)}$$
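A minimal numerical sketch (not from the slides) of the kernel appearing in the bracket: it evaluates the empirical $K_m(x, x') = \frac{1}{m}\sum_i a_i^2\,\sigma'(\langle w_i, x\rangle)\sigma'(\langle w_i, x'\rangle)\langle x, x'\rangle$ at random initialization and compares it with the closed-form limit $K_\infty$ from the next slide, using standard Gaussian first-layer weights and $\pm 1$ output weights as illustrative assumptions.

```python
import numpy as np

def empirical_ntk_two_layer(x, xp, W, a, m):
    """K_m(x, x') = (1/m) sum_i a_i^2 sigma'(<w_i, x>) sigma'(<w_i, x'>) <x, x'>,
    for a two-layer ReLU net where only the first-layer weights are trained."""
    act_x = (W @ x > 0).astype(float)        # sigma'(<w_i, x>) for ReLU
    act_xp = (W @ xp > 0).astype(float)
    return (a ** 2 * act_x * act_xp).sum() / m * (x @ xp)

rng = np.random.default_rng(0)
d, m = 10, 50_000
W = rng.standard_normal((m, d))              # random first-layer weights w_i
a = rng.choice([-1.0, 1.0], size=m)          # random second-layer signs a_i

x = rng.standard_normal(d)
xp = rng.standard_normal(d)

# Limit kernel K_infinity(x, x') = E[<x, x'> 1{<w,x> >= 0, <w,x'> >= 0}]
# = <x, x'> (pi - angle(x, x')) / (2 pi) for standard Gaussian w.
angle = np.arccos(np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1, 1))
k_inf = (x @ xp) * (np.pi - angle) / (2 * np.pi)

print(empirical_ntk_two_layer(x, xp, W, a, m), k_inf)   # close for large m
```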

166 / 174

Connection to neural networks

(Xie, Liang, Song, '16), (Jacot, Gabriel, Hongler '18), (Du, Zhai, Poczos, Singh '18), etc.
$$K_\infty(x, x_j) = \mathbb{E}\big[\langle x, x_j\rangle\, \mathbb{I}\{\langle w, x\rangle \ge 0,\ \langle w, x_j\rangle \ge 0\}\big]$$
See Jason's talk.

167 / 174

Linear Regression
(B., Long, Lugosi, Tsigler, 2019)

▶ $(x, y)$ Gaussian, mean zero.

▶ Define:
$$\Sigma := \mathbb{E}\, xx^\top = \sum_i \lambda_i v_i v_i^\top \quad (\text{assume } \lambda_1 \ge \lambda_2 \ge \cdots),$$
$$\theta^* := \arg\min_\theta\, \mathbb{E}\big(y - x^\top\theta\big)^2, \qquad \sigma^2 := \mathbb{E}\big(y - x^\top\theta^*\big)^2.$$

▶ Estimator $\hat\theta = \big(X^\top X\big)^\dagger X^\top y$, which solves
$$\min_\theta\, \|\theta\|^2 \quad \text{s.t. } \|X\theta - y\|^2 = \min_\beta \|X\beta - y\|^2 = 0.$$

▶ When can the label noise be hidden in $\hat\theta$ without hurting predictive accuracy?
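As a minimal numerical sketch (not from the slides), the following simulates the minimum-norm least-squares estimator in an overparameterized Gaussian design whose covariance has one large eigenvalue and a long, flat tail; the specific spectrum, signal, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                              # overparameterized: p >> n
theta_star = np.zeros(p)
theta_star[0] = 1.0                           # signal in the top direction

# Spectrum with many small, slowly decaying tail eigenvalues (a favorable case).
lam = np.concatenate(([1.0], np.full(p - 1, 5.0 / p)))
X = rng.standard_normal((n, p)) * np.sqrt(lam)          # rows ~ N(0, diag(lam))
y = X @ theta_star + 0.5 * rng.standard_normal(n)       # noisy labels

theta_hat = np.linalg.pinv(X) @ y             # min-norm interpolator (X^T X)^+ X^T y
print(np.max(np.abs(X @ theta_hat - y)))      # ~0: training labels fit exactly

# Excess prediction error (theta_hat - theta_star)^T Sigma (theta_hat - theta_star).
diff = theta_hat - theta_star
print(diff @ (lam * diff))                    # small despite interpolating the noise
```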

168 / 174


Linear Regression

Excess prediction error:
$$L(\hat\theta) - L^* := \mathbb{E}_{(x,y)}\Big[\big(y - x^\top\hat\theta\big)^2 - \big(y - x^\top\theta^*\big)^2\Big] = \big(\hat\theta - \theta^*\big)^\top \Sigma\, \big(\hat\theta - \theta^*\big).$$
$\Sigma$ determines how estimation error in different parameter directions affects prediction accuracy.

169 / 174

Linear Regression

Theorem. For universal constants $b, c$, and any $\theta^*, \sigma^2, \Sigma$, if $\lambda_n > 0$ then
$$\frac{\sigma^2}{c}\min\left\{\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)},\ 1\right\} \;\le\; \mathbb{E}\, L(\hat\theta) - L^* \;\le\; c\left(\|\theta^*\|^2\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}} + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right)\right),$$
where $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\}$.

170 / 174

Notions of Effective Rank

Definition. Recall that $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $\Sigma$. Define the effective ranks
$$r_k(\Sigma) = \frac{\sum_{i>k}\lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\big(\sum_{i>k}\lambda_i\big)^2}{\sum_{i>k}\lambda_i^2}.$$

Example. If $\mathrm{rank}(\Sigma) = p$, we can write
$$r_0(\Sigma) = \mathrm{rank}(\Sigma)\, s(\Sigma), \qquad R_0(\Sigma) = \mathrm{rank}(\Sigma)\, S(\Sigma),$$
with
$$s(\Sigma) = \frac{\frac{1}{p}\sum_{i=1}^p \lambda_i}{\lambda_1}, \qquad S(\Sigma) = \frac{\big(\frac{1}{p}\sum_{i=1}^p \lambda_i\big)^2}{\frac{1}{p}\sum_{i=1}^p \lambda_i^2}.$$
Both $s$ and $S$ lie between $1/p$ ($\lambda_2 \approx 0$) and $1$ ($\lambda_i$ all equal).
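A minimal sketch (not from the slides) translating these definitions into code and locating $k^*$ from the theorem; the example spectrum and the choice $b = 1$ are illustrative assumptions.

```python
import numpy as np

def effective_ranks(lam, k):
    """r_k and R_k for a sorted eigenvalue vector lam (descending order)."""
    tail = lam[k:]                          # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

def k_star(lam, n, b=1.0):
    """Smallest k with r_k(Sigma) >= b * n (b is the universal constant of the theorem;
    b = 1 here is just an illustrative choice)."""
    for k in range(len(lam)):
        if effective_ranks(lam, k)[0] >= b * n:
            return k
    return None

# Example spectrum: one dominant direction plus a long flat tail.
p, n = 2000, 100
lam = np.concatenate(([1.0], np.full(p - 1, 5.0 / p)))
k = k_star(lam, n)
print(k, effective_ranks(lam, k))
```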

171 / 174


Benign Overfitting in Linear Regression

Intuition

▶ To avoid harming prediction accuracy, the noise energy must be distributed across many unimportant directions:
  ▶ There should be many non-zero eigenvalues.
  ▶ They should have a small sum compared to $n$.
  ▶ There should be many eigenvalues no larger than $\lambda_{k^*}$ (the number depends on how asymmetric they are).

172 / 174

Overfitting?

▶ Fitting the data "too" well versus

▶ Bias too low, variance too high.

Key takeaway: we should not conflate these two.

173 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

174 / 174