Transcript of "Generalization in Deep Learning" by Peter Bartlett and Alexander Rakhlin

Page 1:

Generalization in Deep Learning

Peter Bartlett Alexander Rakhlin

UC Berkeley

MIT

May 28, 2019

1 / 174

Page 2:

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

2 / 174

Page 3:

What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)

log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)

log(2 + 2) = log(2) + log(2)

log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)
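These identities can be checked numerically; a learner shown only such examples might conjecture the general rule log(a + b + ...) = log(a) + log(b) + ..., which happens to hold here because each tuple's sum equals its product, but fails off this "training set". A minimal Python check (an added illustration, assuming that reading of the slide):

    import math

    # "Training examples" from the slide: tuples whose sum equals their product,
    # so the log of the sum coincides with the sum of the logs.
    examples = [(1, 2, 3), (1, 1.5, 5), (2, 2), (1, 1.25, 9)]
    for xs in examples:
        lhs = math.log(sum(xs))
        rhs = sum(math.log(x) for x in xs)
        print(xs, round(lhs, 6), round(rhs, 6))      # identical on every example

    # The conjectured rule does not generalize:
    xs = (1, 1)
    print(xs, math.log(sum(xs)), sum(math.log(x) for x in xs))   # log 2 vs. 0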

3 / 174


Page 6:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

4 / 174

Page 7:

Statistical Learning Theory

Aim: Predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X.

Use a data set of n pairs

S = (x1, y1), . . . , (xn, yn)

to choose a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.

5 / 174

Page 8:

Formulations of Prediction Problems

To define the notion of a ‘good prediction,’ we can define a loss function

ℓ : Y × Y → R.

ℓ(ŷ, y) is the cost of predicting ŷ when the outcome is y.

Aim: ℓ(f(x), y) small.

6 / 174

Page 9:

Formulations of Prediction Problems

Example. In pattern classification problems, the aim is to classify a pattern x into one of a finite number of classes (that is, the label space Y is finite). If all mistakes are equally bad, we could define

ℓ(ŷ, y) = I{ŷ ≠ y} = 1 if ŷ ≠ y, 0 otherwise.

Example. In a regression problem, with Y = R, we might choose the quadratic loss function, ℓ(ŷ, y) = (ŷ − y)².

7 / 174

Page 10:

Probabilistic Assumptions

Assume:

I There is a probability distribution P on X × Y,

I The pairs (X1, Y1), . . . , (Xn, Yn), (X, Y) are chosen independently according to P.

The aim is to choose f with small risk:

L(f) = E ℓ(f(X), Y).

For instance, in the pattern classification example, this is the misclassification probability:

L(f) = L01(f) = E I{f(X) ≠ Y} = Pr(f(X) ≠ Y).

8 / 174

Page 11:

Why not estimate the underlying distribution P (or P_{Y|X}) first?

This is in general a harder problem than prediction.

9 / 174

Page 12:

Key difficulty: our goals are in terms of unknown quantities related to the unknown P. Have to use empirical data instead. Purview of statistics.

For instance, we can calculate the empirical loss of f : X → Y:

L̂(f) = (1/n) ∑_{i=1}^n ℓ(Yi, f(Xi))
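As a concrete illustration, the empirical loss is just an average over the sample; a minimal sketch in Python, with an assumed predictor f and the indicator loss:

    import numpy as np

    def empirical_loss(f, X, Y, loss):
        """Average of loss(Y_i, f(X_i)) over the sample."""
        return np.mean([loss(y, f(x)) for x, y in zip(X, Y)])

    # Assumed example: a threshold classifier under the 0-1 loss.
    zero_one = lambda y, yhat: float(y != yhat)
    f = lambda x: int(x > 0.5)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=1000)
    Y = (X + 0.1 * rng.normal(size=1000) > 0.5).astype(int)
    print(empirical_loss(f, X, Y, zero_one))         # the empirical loss of f on S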

10 / 174

Page 13:

The function x ↦ fn(x) = fn(x; X1, Y1, . . . , Xn, Yn) is random, since it depends on the random data S = (X1, Y1, . . . , Xn, Yn). Thus, the risk

L(fn) = E[ℓ(fn(X), Y) | S] = E[ℓ(fn(X; X1, Y1, . . . , Xn, Yn), Y) | S]

is a random variable. We might aim for E L(fn) small, or L(fn) small with high probability (over the training data).

11 / 174

Page 14:

Quiz: what is random here?

1. L(f ) for a given fixed f

2. fn

3. L(fn)

4. L̂(fn)

5. L̂(f) for a given fixed f

12 / 174

Page 15:

Theoretical analysis of performance is typically easier if fn has a closed form (in terms of the training data).

E.g. ordinary least squares: fn(x) = x⊤(X⊤X)⁻¹X⊤Y.

Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as

I solutions to an optimization objective (e.g. logistic regression)

I an iterative procedure without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)

13 / 174

Page 16:

The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function

f∗ = argmin_f L(f)

where the minimization is over all (measurable) prediction rules f : X → Y.

The value of the lowest expected loss is called the Bayes error:

L(f∗) = inf_f L(f)

Of course, we cannot calculate any of these quantities since P is unknown.

14 / 174

Page 17:

Bayes Optimal Function

Bayes optimal function f∗ takes on the following forms in these two particular cases:

I Binary classification (Y = {0, 1}) with the indicator loss:

f∗(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]

[Figure: the regression function η(x), taking values between 0 and 1, thresholded at 1/2.]

I Regression (Y = R) with squared loss:

f∗(x) = η(x), where η(x) = E[Y | X = x]
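For a synthetic distribution where η is known, both forms can be evaluated directly. A small Monte Carlo sketch, with an assumed η, using the fact that for the indicator loss the Bayes error equals E[min(η(X), 1 − η(X))]:

    import numpy as np

    rng = np.random.default_rng(0)
    eta = lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))   # assumed regression function

    X = rng.uniform(size=200_000)
    Y = (rng.uniform(size=X.size) < eta(X)).astype(int)      # so that P(Y = 1 | X = x) = eta(x)

    f_star = (eta(X) >= 0.5).astype(int)                     # Bayes classifier for the 0-1 loss
    print("Bayes error (Monte Carlo):", np.mean(f_star != Y))
    print("E[min(eta, 1 - eta)]:     ", np.mean(np.minimum(eta(X), 1.0 - eta(X))))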

15 / 174

Page 18:

The big question: is there a way to construct a learning algorithm with a guarantee that

L(fn)− L(f ∗)

is small for large enough sample size n?

16 / 174

Page 19:

Consistency

An algorithm that ensures

lim_{n→∞} L(fn) = L(f∗) almost surely

is called universally consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.

The good news: consistency is possible to achieve.

I easy if X is a finite or countable set

I not too hard if X is infinite, and the underlying relationship between x and y is "continuous"

17 / 174

Page 20:

The bad news...

In general, we cannot prove anything quantitative about L(fn) − L(f∗), unless we make further assumptions (incorporate prior knowledge).

"No Free Lunch" Theorems: unless we posit assumptions,

I For any algorithm fn, any n and any ε > 0, there exists a distribution P such that L(f∗) = 0 and

E L(fn) ≥ 1/2 − ε

I For any algorithm fn, and any sequence an that converges to 0, there exists a probability distribution P such that L(f∗) = 0 and for all n

E L(fn) ≥ an

18 / 174

Page 21:

is this really “bad news”?

Not really. We always have some domain knowledge.

Two ways of incorporating prior knowledge:

I Direct way: assumptions on the distribution P (e.g. margin, smoothness of the regression function, etc.)

I Indirect way: redefine the goal to perform as well as a reference set F of predictors:

L(fn) − inf_{f∈F} L(f)

F encapsulates our inductive bias.

We often make both of these assumptions.

19 / 174

Page 22:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

20 / 174

Page 23:

Perceptron


21 / 174

Page 24:

Perceptron

(x1, y1), . . . , (xT, yT) ∈ X × {±1} (T may or may not be the same as n)

Maintain a hypothesis wt ∈ R^d (initialize w1 = 0).

On round t,

I Consider (xt, yt)

I Form prediction ŷt = sign(⟨wt, xt⟩)

I If ŷt ≠ yt, update

wt+1 = wt + yt xt

else

wt+1 = wt
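A direct transcription of this update rule into Python (a minimal sketch, not the authors' code; labels are assumed to be ±1):

    import numpy as np

    def perceptron(X, y):
        """One pass of the Perceptron over (x_t, y_t), t = 1, ..., T.

        X: (T, d) array of examples x_t; y: (T,) array of labels in {-1, +1}.
        Returns the final weight vector and the number of mistakes."""
        T, d = X.shape
        w = np.zeros(d)                             # w_1 = 0
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0   # prediction sign(<w_t, x_t>), with sign(0) taken as +1
            if y_hat != y_t:                        # mistake: update
                w = w + y_t * x_t
                mistakes += 1
        return w, mistakes

On a sequence with margin γ and ‖xt‖ ≤ 1, the mistake counter stays below 1/γ², as the next slides show.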

22 / 174

Page 25:

Perceptron

[Figure: Perceptron update, showing the current hyperplane with normal wt and a misclassified example xt.]

23 / 174

Page 26:

For simplicity, suppose all data are in a unit ball, ‖xt‖ ≤ 1.

Margin with respect to (x1, y1), . . . , (xT , yT ):

γ = max_{‖w‖=1} min_{i∈[T]} ( yi ⟨w, xi⟩ )₊ ,

where (a)₊ = max{0, a}.

Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.

Theorem (Novikoff ’62).

24 / 174

Page 27:

Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

‖wt+1‖² = ‖wt + yt xt‖² ≤ ‖wt‖² + 2 yt ⟨wt, xt⟩ + 1 ≤ ‖wt‖² + 1.

Hence, ‖wT‖² ≤ m.

For the optimal hyperplane w∗,

γ ≤ ⟨w∗, yt xt⟩ = ⟨w∗, wt+1 − wt⟩ .

Hence (adding over the mistake rounds and canceling),

m γ ≤ ⟨w∗, wT⟩ ≤ ‖wT‖ ≤ √m.
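A quick numerical sanity check of the bound (an added illustration, with an assumed separator w_star enforcing margin at least gamma on points in the unit ball):

    import numpy as np

    rng = np.random.default_rng(0)
    d, T, gamma = 20, 5000, 0.05
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    # Points drawn uniformly from the unit ball, kept only if their margin
    # with respect to w_star is at least gamma.
    X, y = [], []
    while len(X) < T:
        g = rng.normal(size=d)
        x = (g / np.linalg.norm(g)) * rng.uniform() ** (1.0 / d)
        if abs(w_star @ x) >= gamma:
            X.append(x)
            y.append(np.sign(w_star @ x))
    X, y = np.array(X), np.array(y)

    # Run the Perceptron and count mistakes.
    w, mistakes = np.zeros(d), 0
    for x_t, y_t in zip(X, y):
        if y_t * (w @ x_t) <= 0:                 # mistake round
            w += y_t * x_t
            mistakes += 1
    print(mistakes, "<=", 1.0 / gamma ** 2)      # at most 1/gamma^2 mistakes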

25 / 174

Page 28:

Recap

For any T and (x1, y1), . . . , (xT , yT ),

∑_{t=1}^T I{ yt ⟨wt, xt⟩ ≤ 0 } ≤ D²/γ²

where γ = γ(x1:T, y1:T) is the margin and D = D(x1:T, y1:T) = max_t ‖xt‖.

Let w∗ denote the max margin hyperplane, ‖w∗‖ = 1.

26 / 174

Page 29:

Consequence for i.i.d. data (I)

Do one pass on i.i.d. sequence (X1,Y1), . . . , (Xn,Yn) (i.e. T = n).

Claim: the expected indicator loss of a randomly picked function x ↦ ⟨wτ, x⟩ (τ ∼ unif{1, . . . , n}) is at most

(1/n) × E[ D²/γ² ].

27 / 174

Page 30:

Proof: Take expectation on both sides:

E[ (1/n) ∑_{t=1}^n I{ Yt ⟨wt, Xt⟩ ≤ 0 } ] ≤ E[ D²/(n γ²) ]

The left-hand side can be written as (recall the notation S = (Xi, Yi)_{i=1}^n)

Eτ ES I{ Yτ ⟨wτ, Xτ⟩ ≤ 0 }

Since wτ is a function of X1:τ−1, Y1:τ−1, the above is

ES Eτ E_{X,Y} I{ Y ⟨wτ, X⟩ ≤ 0 } = ES Eτ L01(wτ)

Claim follows.

NB: To ensure E[D²/γ²] is not infinite, we assume a margin γ in P.

28 / 174

Page 31:

Consequence for i.i.d. data (II)

Now, rather than doing one pass, cycle through data

(X1,Y1), . . . , (Xn,Yn)

repeatedly until no more mistakes (i.e. T ≤ n × (D²/γ²)).

Then the final hyperplane of Perceptron separates the data perfectly, i.e. finds the minimum L̂01(wT) = 0 for

L̂01(w) = (1/n) ∑_{i=1}^n I{ Yi ⟨w, Xi⟩ ≤ 0 }.

Empirical Risk Minimization with finite-time convergence. Can we say anything about future performance of wT?

29 / 174

Page 32:

Consequence for i.i.d. data (II)

Claim:

E L01(wT) ≤ (1/(n + 1)) × E[ D²/γ² ]

30 / 174

Page 33:

Proof: Shortcuts: z = (x, y) and ℓ(w, z) = I{ y ⟨w, x⟩ ≤ 0 }. Then

ES EZ ℓ(wT, Z) = E_{S, Z_{n+1}} [ (1/(n+1)) ∑_{t=1}^{n+1} ℓ(w^(−t), Zt) ]

where w^(−t) is Perceptron's final hyperplane after cycling through the data Z1, . . . , Zt−1, Zt+1, . . . , Zn+1.

That is, leave-one-out is an unbiased estimate of the expected loss.

Now consider cycling Perceptron on Z1, . . . , Zn+1 until no more errors. Let i1, . . . , im be the indices on which Perceptron errs in any of the cycles. We know m ≤ D²/γ². However, if index t ∉ {i1, . . . , im}, then whether or not Zt was included does not matter, and Zt is correctly classified by w^(−t). Claim follows.
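The leave-one-out argument can be mimicked numerically: train the cycling Perceptron n + 1 times, each time leaving one point out, and compare the leave-one-out error with a test-set estimate of the expected loss. A sketch, assuming margin-separable data generated from a fixed w_star:

    import numpy as np

    rng = np.random.default_rng(0)
    d, gamma = 10, 0.2
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    def sample(m):
        """Points in the unit ball with margin at least gamma w.r.t. w_star."""
        pts = []
        while len(pts) < m:
            g = rng.normal(size=d)
            x = (g / np.linalg.norm(g)) * rng.uniform() ** (1.0 / d)
            if abs(w_star @ x) >= gamma:
                pts.append(x)
        X = np.array(pts)
        return X, np.sign(X @ w_star)

    def cycle_perceptron(X, y):
        """Cycle through the data until a full pass makes no mistakes."""
        w = np.zeros(d)
        while True:
            errors = 0
            for x_t, y_t in zip(X, y):
                if y_t * (w @ x_t) <= 0:
                    w += y_t * x_t
                    errors += 1
            if errors == 0:
                return w

    X, y = sample(51)                            # n + 1 = 51 points
    loo = np.mean([y[t] * (cycle_perceptron(np.delete(X, t, 0), np.delete(y, t)) @ X[t]) <= 0
                   for t in range(len(y))])

    Xte, yte = sample(20_000)                    # proxy for the expected loss
    w = cycle_perceptron(X[:-1], y[:-1])
    print("leave-one-out estimate:", loo, "  test error:", np.mean(yte * (Xte @ w) <= 0))

Unbiasedness holds in expectation over the sample; a single draw just produces comparable numbers.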

31 / 174

Page 34:

A few remarks

I Bayes error L01(f∗) = 0, since we assume a margin in the distribution P (and, hence, in the data).

I Our assumption on P is not about its parametric or nonparametric form, but rather on what happens at the boundary.

I Curiously, we beat the CLT rate of 1/√n.

32 / 174

Page 35:

Overparametrization and overfitting

The last lecture will discuss overparametrization and overfitting.

I dimension d never appears in the mistake bound of Perceptron. Hence, we can even take d = ∞.

I Perceptron is a 1-layer neural network (no hidden layer) with d neurons.

I For d > n, any n points in general position can be separated (zero empirical error).

Conclusions:

I More parameters than data does not necessarily contradict good prediction performance ("generalization").

I Perfectly fitting the data does not necessarily contradict good generalization (particularly with no noise).

I Complexity is subtle (e.g. number of parameters vs margin)

I ... however, margin is a strong assumption (more than just no noise)

33 / 174

Page 36:

Outline

Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

34 / 174

Page 37:

Relaxing the assumptions

We proved

E L01(wT) − L01(f∗) ≤ O(1/n)

under the assumption that P is linearly separable with margin γ.

The assumption on the probability distribution implies that the Bayes classifier is a linear separator ("realizable" setup).

We may relax the margin assumption, yet still hope to do well with linear separators. That is, we would aim to minimize

E L01(wT) − L01(w∗)

hoping that

L01(w∗) − L01(f∗)

is small, where w∗ is the best hyperplane classifier with respect to L01.

35 / 174

Page 38:

More generally, we will work with some class of functions F that we hope captures well the relationship between X and Y. The choice of F gives rise to a bias-variance decomposition.

Let fF ∈ argmin_{f∈F} L(f) be the function in F with smallest expected loss.

36 / 174

Page 39:

Bias-Variance Tradeoff

L(fn) − L(f∗) = [ L(fn) − inf_{f∈F} L(f) ] (Estimation Error) + [ inf_{f∈F} L(f) − L(f∗) ] (Approximation Error)

[Figure: the class F inside the space of all functions, with fn, f∗, and fF marked.]

Clearly, the two terms are at odds with each other:

I Making F larger means smaller approximation error but (as we will see) larger estimation error

I Taking a larger sample n means smaller estimation error and has no effect on the approximation error.

I Thus, it makes sense to trade off size of F and n (Structural Risk Minimization, or Method of Sieves, or Model Selection).

37 / 174

Page 40:

Bias-Variance Tradeoff

In simple problems (e.g. linearly separable data) we might get away with having no bias-variance decomposition (e.g. as was done in the Perceptron case).

However, for more complex problems, our assumptions on P often imply that the class containing f∗ is too large and the estimation error cannot be controlled. In such cases, we need to produce a "biased" solution in hopes of reducing variance.

Finally, the bias-variance tradeoff need not be in the form we just presented. We will consider a different decomposition in the last lecture.

38 / 174

Page 41:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

39 / 174

Page 42:

Consider the performance of empirical risk minimization:

Choose fn ∈ F to minimize L̂(f), where L̂ is the empirical risk,

L̂(f) = Pn ℓ(f(X), Y) = (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi).

For pattern classification, this is the proportion of training examples misclassified.

How does the excess risk, L(fn) − L(fF), behave? We can write

L(fn) − L(fF) = [ L(fn) − L̂(fn) ] + [ L̂(fn) − L̂(fF) ] + [ L̂(fF) − L(fF) ]

Therefore,

E L(fn) − L(fF) ≤ E[ L(fn) − L̂(fn) ]

because the second term is nonpositive and the third is zero in expectation.
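The reason the first bracket is the interesting one can be seen in a small simulation (an added illustration, using an assumed finite class of threshold classifiers): the empirical risk of the ERM output is optimistic, i.e. on average below its true risk.

    import numpy as np

    rng = np.random.default_rng(0)
    thresholds = np.linspace(0, 1, 200)                  # an assumed finite class F
    eta = lambda x: np.clip(x + 0.3 * np.sin(8 * x), 0, 1)

    def sample(n):
        X = rng.uniform(size=n)
        Y = rng.uniform(size=n) < eta(X)
        return X, Y

    def risks(X, Y):
        """0-1 risk of every rule f_t(x) = 1{x > t} on the sample (X, Y)."""
        preds = X[None, :] > thresholds[:, None]
        return (preds != Y[None, :]).mean(axis=1)

    Xte, Yte = sample(50_000)
    true_risk = risks(Xte, Yte)                          # proxy for L(f)

    n, gaps = 100, []
    for _ in range(200):
        X, Y = sample(n)
        emp = risks(X, Y)                                # empirical risks over F
        j = int(np.argmin(emp))                          # ERM picks f_n
        gaps.append(true_risk[j] - emp[j])               # L(f_n) minus its empirical risk
    print("average optimism:", np.mean(gaps))            # positive on average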

40 / 174

Page 43:

The term L(fn) − L̂(fn) is not necessarily zero in expectation (check!). An easy upper bound is

L(fn) − L̂(fn) ≤ sup_{f∈F} | L(f) − L̂(f) | ,

and this motivates the study of uniform laws of large numbers.

Roadmap: study the sup to

I understand when learning is possible,

I understand implications for sample complexity,

I design new algorithms (regularizer arises from upper bound on sup)

41 / 174

Page 44:

Loss Class

In what follows, we will work with a function class G on Z, and for the learning application, G = ℓ ◦ F:

G = { (x, y) ↦ ℓ(f(x), y) : f ∈ F }

and Z = X × Y.

42 / 174

Page 45:

Empirical Process Viewpoint

[Figure: the class G within the set of all functions; each function g is positioned by its expectation Eg and its empirical average (1/n) ∑_{i=1}^n g(zi), with the functions labelled g0, ĝn, and gG marked.]

43 / 174


Page 53:

Uniform Laws of Large Numbers

For a class G of functions g : Z → [0, 1], suppose that Z1, . . . , Zn, Z are i.i.d. on Z, and consider

U = sup_{g∈G} | E g(Z) − (1/n) ∑_{i=1}^n g(Zi) | = sup_{g∈G} | Pg − Png | =: ‖P − Pn‖_G ,

where P − Pn is the empirical process.

If U converges to 0, this is called a uniform law of large numbers.
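For a small, explicitly parameterized class the uniform deviation U can be simulated directly. A sketch, with an assumed class of monomials on Z = [0, 1] and P uniform, so that Pg is known exactly:

    import numpy as np

    rng = np.random.default_rng(0)
    ks = np.arange(1, 51)                                # G = { z -> z^k : k = 1, ..., 50 }

    for n in [100, 1000, 10000]:
        Z = rng.uniform(size=n)
        Png = np.mean(Z[None, :] ** ks[:, None], axis=1) # empirical averages P_n g
        Pg = 1.0 / (ks + 1)                              # exact expectations under P
        print(n, np.abs(Pg - Png).max())                 # U = sup_g |Pg - P_n g|, shrinking with n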

44 / 174

Page 54:

Glivenko-Cantelli Classes

G is a Glivenko-Cantelli class for P if ‖Pn − P‖_G → 0 in probability.

Definition.

I P is a distribution on Z,

I Z1, . . . ,Zn are drawn i.i.d. from P,

I Pn is the empirical distribution (mass 1/n at each of Z1, . . . ,Zn),

I G is a set of measurable real-valued functions on Z with finite expectation under P,

I Pn − P is an empirical process, that is, a stochastic process indexed by a class of functions G, and

I ‖Pn − P‖_G := sup_{g∈G} |Png − Pg|.

45 / 174

Page 55:

Glivenko-Cantelli Classes

Why 'Glivenko-Cantelli'? An example of a uniform law of large numbers.

‖Fn − F‖∞ → 0 almost surely.

Theorem (Glivenko-Cantelli).

Here, F is a cumulative distribution function, Fn is the empirical cumulative distribution function,

Fn(t) = (1/n) ∑_{i=1}^n I{Zi ≤ t},

where Z1, . . . , Zn are i.i.d. with distribution F, and ‖F − G‖∞ = sup_t |F(t) − G(t)|.

‖Pn − P‖_G → 0 almost surely, for G = { z ↦ I{z ≤ θ} : θ ∈ R }.

Theorem (Glivenko-Cantelli).
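The statement is easy to check numerically. A sketch, with P uniform on [0, 1] so that F(t) = t; for a continuous F, the supremum is attained at the sample points, where the empirical CDF jumps:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in [100, 1000, 10000, 100000]:
        Z = np.sort(rng.uniform(size=n))
        right = np.arange(1, n + 1) / n                  # F_n just after each order statistic
        left = np.arange(0, n) / n                       # F_n just before each order statistic
        sup = max(np.abs(right - Z).max(), np.abs(left - Z).max())
        print(n, sup)                                    # ||F_n - F||_inf -> 0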

46 / 174

Page 56:

Glivenko-Cantelli Classes

Not all G are Glivenko-Cantelli classes. For instance,

G = { I{z ∈ S} : S ⊂ R, |S| < ∞ } .

Then for a continuous distribution P, Pg = 0 for any g ∈ G, but sup_{g∈G} Png = 1 for all n. So although Png → Pg almost surely for every fixed g ∈ G, this convergence is not uniform over G. G is too large.
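The failure is easy to exhibit: after seeing the sample, choose the worst g in G for that sample, namely the indicator of the sample itself. A sketch of this illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    Z = rng.uniform(size=n)                              # the observed sample
    S = set(Z.tolist())                                  # finite set chosen after seeing the data
    g = lambda z: float(z in S)                          # g = 1{z in S}, a member of G

    print("P_n g =", np.mean([g(z) for z in Z]))         # equals 1 by construction
    Znew = rng.uniform(size=100_000)
    print("P g  ~", np.mean([g(z) for z in Znew]))       # ~ 0 under the continuous P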

In general, how do we decide whether G is “too large”?

47 / 174

Page 57:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

48 / 174

Page 58:

Rademacher Averages

Given a set H ⊆ [−1, 1]^n, measure its "size" by the average length of its projection onto a random direction.

Rademacher process indexed by a vector h ∈ H:

Rn(h) = (1/n) ⟨ε, h⟩

where ε = (ε1, . . . , εn) is a vector of i.i.d. Rademacher random variables (symmetric ±1).

Rademacher averages: the expected maximum of the process,

Eε ‖Rn‖_H = E sup_{h∈H} | (1/n) ⟨ε, h⟩ |

49 / 174

Page 59:

Examples

If H = {h0}, then

E‖Rn‖_H = O(n^{−1/2}).

If H = {−1, 1}^n, then

E‖Rn‖_H = 1.

If v + c·{−1, 1}^n ⊆ H, then

E‖Rn‖_H ≥ c − o(1).

How about H = {−1, 1}, where 1 is the vector of all 1's?

50 / 174

Page 60:

Examples

If H is finite,

E‖Rn‖_H ≤ c √( log|H| / n )

for some constant c.

This bound can be loose, as it does not take into account "overlaps"/correlations between vectors.

51 / 174

Page 61:

Examples

Let B^n_p be the unit ball in R^n:

B^n_p = { x ∈ R^n : ‖x‖_p ≤ 1 },   ‖x‖_p = ( ∑_{i=1}^n |xi|^p )^{1/p} .

For H = B^n_2,

(1/n) E max_{‖g‖2 ≤ 1} |⟨ε, g⟩| = (1/n) E‖ε‖2 = 1/√n .

On the other hand, the Rademacher averages for B^n_1 are 1/n.
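Both computations can be verified mechanically, since for ε ∈ {−1, +1}^n the supremum of ⟨ε, g⟩ over the ℓp unit ball is the dual norm of ε. A tiny check:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    eps = rng.choice([-1.0, 1.0], size=n)

    # l2 ball: sup <eps, g> = ||eps||_2 = sqrt(n), so the average is 1/sqrt(n).
    print(np.linalg.norm(eps, 2) / n, 1.0 / np.sqrt(n))
    # l1 ball: sup <eps, g> = ||eps||_inf = 1, so the average is 1/n.
    print(np.linalg.norm(eps, np.inf) / n, 1.0 / n)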

52 / 174

Page 62:

Rademacher Averages

What do these Rademacher averages have to do with our problem of bounding uniform deviations?

Suppose we take

H = G(Z_1^n) = { (g(Z1), . . . , g(Zn)) : g ∈ G } ⊂ R^n

[Figure: functions g1, g2, . . . in G evaluated at the data points z1, z2, . . . , zn, giving the coordinate vectors that make up H ⊂ R^n.]

zn<latexit sha1_base64="4ZFc4Osjg/3AMiZMY0aOID5dIWQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX6dWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8C+0uNsg==</latexit><latexit sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit 
sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit sha1_base64="kRKiSD+HgKged5aOJIkSLIO3sLs=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d5VePdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FwHKMDg==</latexit>

53 / 174

Page 63:

Rademacher Averages

Then, conditionally on Z_1^n,

    Eε‖R̂n‖G = Eε sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) |

is a data-dependent quantity (hence the "hat") that measures the complexity of G projected onto the data.

Also define the Rademacher process Rn (without the "hat"),

    Rn(g) = (1/n) Σ_{i=1}^n εi g(Zi),

and the Rademacher complexity of G as

    E‖Rn‖G = E_{Z_1^n} Eε‖R̂n‖_{G(Z_1^n)}.

54 / 174
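
Aside (not from the slides): the data-dependent quantity Eε‖R̂n‖G is easy to estimate by simulation. The Python sketch below uses an illustrative finite class of threshold functions on a fixed sample (the sample, the grid of thresholds, and all names are assumptions for the example); it averages the exact supremum over random sign vectors and compares the estimate with the finite-class bound √(2 log(2|G|)/n) that appears later in the deck.

import numpy as np

rng = np.random.default_rng(0)

# A fixed sample z_1, ..., z_n and a small finite class G of threshold functions.
n = 50
z = rng.uniform(0.0, 1.0, size=n)
thresholds = np.linspace(0.0, 1.0, 21)                      # assumed illustrative class
vals = (z[None, :] <= thresholds[:, None]).astype(float)    # row j holds g_j(z_1), ..., g_j(z_n)

def empirical_rademacher(vals, n_mc=5000, rng=rng):
    """Monte Carlo estimate of E_eps sup_g |(1/n) sum_i eps_i g(z_i)|."""
    n = vals.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))           # random sign vectors
    sups = np.abs(eps @ vals.T / n).max(axis=1)             # exact max over the finite class
    return sups.mean()

print("hat Rademacher average:", empirical_rademacher(vals))
print("finite-class bound    :", np.sqrt(2 * np.log(2 * len(thresholds)) / n))

Because the class is finite, the supremum is an exact maximum over |G| rows; the only approximation is the Monte Carlo average over the signs ε.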

Page 64:

Uniform Laws and Rademacher Complexity

We'll look at a proof of a uniform law of large numbers that involves two steps:

1. Symmetrization, which bounds E‖P − Pn‖G in terms of the Rademacher complexity of G, E‖Rn‖G.

2. Both ‖P − Pn‖G and ‖Rn‖G are concentrated about their expectation.

55 / 174

Page 65:

Uniform Laws and Rademacher Complexity

For any G,

    E‖P − Pn‖G ≤ 2 E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

    (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2 E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

    E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 a.s.

Theorem.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.

56 / 174

Page 66:

Uniform Laws and Rademacher Complexity

The first step is to symmetrize by replacing Pg by P′n g = (1/n) Σ_{i=1}^n g(Z′i).

In particular, we have

    E‖P − Pn‖G = E sup_{g∈G} | E[ (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) | Z_1^n ] |
               ≤ E E[ sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) |  |  Z_1^n ]
               = E sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) | = E‖P′n − Pn‖G.

57 / 174

Page 69:

Uniform Laws and Rademacher Complexity

Another symmetrization: for any εi ∈ {±1},

    E‖P′n − Pn‖G = E sup_{g∈G} | (1/n) Σ_{i=1}^n (g(Z′i) − g(Zi)) |
                 = E sup_{g∈G} | (1/n) Σ_{i=1}^n εi (g(Z′i) − g(Zi)) |.

This follows from the fact that Zi and Z′i are i.i.d., and so the distribution of the supremum is unchanged when we swap them. And so in particular the expectation of the supremum is unchanged. And since this is true for any εi, we can take the expectation over any random choice of the εi. We'll pick them independently and uniformly.

58 / 174

Page 70:

Uniform Laws and Rademacher Complexity

    E sup_{g∈G} | (1/n) Σ_{i=1}^n εi (g(Z′i) − g(Zi)) |
        ≤ E [ sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Z′i) | + sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) | ]
        = 2 E sup_{g∈G} | (1/n) Σ_{i=1}^n εi g(Zi) |        (Rademacher complexity)
        = 2 E‖Rn‖G,

where Rn is the Rademacher process Rn(g) = (1/n) Σ_{i=1}^n εi g(Zi).

59 / 174
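
A quick numeric sanity check of the symmetrization bound E‖P − Pn‖G ≤ 2 E‖Rn‖G (a sketch under assumed conditions, not from the slides: Z is uniform on [0, 1] and G is a finite grid of threshold functions, so Pg is known in closed form; grid size, sample size, and names are arbitrary choices).

import numpy as np

rng = np.random.default_rng(1)
n = 100
thresholds = np.linspace(0.0, 1.0, 51)        # finite class G = { z -> 1{z <= a} }

def sup_emp_process(z):
    # sup_a |P g_a - P_n g_a|, using P g_a = a for Z ~ Uniform[0, 1].
    emp = (z[None, :] <= thresholds[:, None]).mean(axis=1)
    return np.abs(thresholds - emp).max()

def sup_rademacher(z, rng):
    eps = rng.choice([-1.0, 1.0], size=z.shape[0])
    vals = (z[None, :] <= thresholds[:, None]).astype(float)
    return np.abs(vals @ eps / z.shape[0]).max()

emp, rad = [], []
for _ in range(2000):
    z = rng.uniform(0.0, 1.0, size=n)
    emp.append(sup_emp_process(z))
    rad.append(sup_rademacher(z, rng))

print("E||P - P_n||_G ~", np.mean(emp))
print("2 E||R_n||_G   ~", 2 * np.mean(rad))    # the symmetrization upper bound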

Page 71:

Uniform Laws and Rademacher Complexity

The second inequality (desymmetrization) follows from:

    E‖Rn‖G ≤ E‖ (1/n) Σ_{i=1}^n εi (g(Zi) − E g(Zi)) ‖G + E‖ (1/n) Σ_{i=1}^n εi E g(Zi) ‖G
           ≤ E‖ (1/n) Σ_{i=1}^n εi (g(Zi) − g(Z′i)) ‖G + ‖P‖G E| (1/n) Σ_{i=1}^n εi |
           = E‖ (1/n) Σ_{i=1}^n (g(Zi) − E g(Zi) + E g(Z′i) − g(Z′i)) ‖G + ‖P‖G E| (1/n) Σ_{i=1}^n εi |
           ≤ 2 E‖Pn − P‖G + √(2 log 2 / n).

60 / 174

Page 72:

Uniform Laws and Rademacher Complexity

For any G,

    E‖P − Pn‖G ≤ 2 E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

    (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2 E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

    E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 a.s.

Theorem.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.

61 / 174

Page 73:

Back to the prediction problem

Recall: for the prediction problem, we have G = ℓ ∘ F for some loss function ℓ and a set of functions F.

We found that E‖P − Pn‖ℓ∘F can be related to E‖Rn‖ℓ∘F. Question: can we get E‖Rn‖F instead?

For classification, this is easy. Suppose F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = 1{y ≠ y′} = (1 − y y′)/2. Then

    E‖Rn‖ℓ∘F = E sup_{f∈F} | (1/n) Σ_{i=1}^n εi ℓ(yi, f(xi)) |
             = (1/2) E sup_{f∈F} | (1/n) Σ_{i=1}^n εi + (1/n) Σ_{i=1}^n (−εi yi f(xi)) |
             ≤ O(1/√n) + (1/2) E‖Rn‖F.

Other loss functions?

62 / 174

Page 74:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

63 / 174

Page 75:

Rademacher Complexity: Structural Results

Subsets:       F ⊆ G implies ‖Rn‖F ≤ ‖Rn‖G.

Scaling:       ‖Rn‖cF = |c| ‖Rn‖F.

Plus Constant: For |g(X)| ≤ 1, |E‖Rn‖F+g − E‖Rn‖F| ≤ √(2 log 2 / n).

Convex Hull:   ‖Rn‖co F = ‖Rn‖F, where co F is the convex hull of F.

Contraction:   If ℓ : R × R → [−1, 1] has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y, then for ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}, E‖Rn‖ℓ∘F ≤ 2 E‖Rn‖F + c/√n.

Theorem.

64 / 174

Page 76:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

65 / 174

Page 77:

Deep Networks

Deep compositions of nonlinear functions

    h = hm ∘ hm−1 ∘ · · · ∘ h1

e.g.,  hi : x ↦ σ(Wi x)   or   hi : x ↦ r(Wi x),

    σ(v)i = 1 / (1 + exp(−vi)),    r(v)i = max{0, vi}.

66 / 174

Page 82:

Rademacher Complexity of Sigmoid Networks

Consider the following class FB,d of two-layer neural networks on R^d:

    FB,d = { x ↦ Σ_{i=1}^k wi σ(vi⊤ x) : ‖w‖₁ ≤ B, ‖vi‖₁ ≤ B, k ≥ 1 },

where B > 0 and the nonlinear function σ : R → R satisfies the Lipschitz condition |σ(a) − σ(b)| ≤ |a − b| and σ(0) = 0. Suppose that the distribution is such that ‖X‖∞ ≤ 1 a.s. Then

    E‖Rn‖_{FB,d} ≤ 2B² √(2 log(2d) / n),

where d is the dimension of the input space, X = R^d.

Theorem.

67 / 174

Page 83:

Rademacher Complexity of Sigmoid Networks: Proof

Define G := {(x1, . . . , xd) ↦ xj : 1 ≤ j ≤ d},

    VB := { x ↦ v⊤x : ‖v‖₁ = Σ_{i=1}^d |vi| ≤ B } = B co(G ∪ −G),

with

    co(F) := { Σ_{i=1}^k αi fi : k ≥ 1, αi ≥ 0, ‖α‖₁ = 1, fi ∈ F }.

Then

    FB,d = { x ↦ Σ_{i=1}^k wi σ(vi(x)) : k ≥ 1, Σ_{i=1}^k |wi| ≤ B, vi ∈ VB }
         = B co(σ ∘ VB ∪ −σ ∘ VB).

68 / 174

Page 85:

Rademacher Complexity of Sigmoid Networks: Proof

    E‖Rn‖_{FB,d} = E‖Rn‖_{B co(σ∘VB ∪ −σ∘VB)}
                 = B E‖Rn‖_{co(σ∘VB ∪ −σ∘VB)}
                 = B E‖Rn‖_{σ∘VB ∪ −σ∘VB}
                 ≤ B E‖Rn‖_{σ∘VB} + B E‖Rn‖_{−σ∘VB}
                 = 2B E‖Rn‖_{σ∘VB}
                 ≤ 2B E‖Rn‖_{VB}
                 = 2B E‖Rn‖_{B co(G ∪ −G)}
                 = 2B² E‖Rn‖_{G ∪ −G}
                 ≤ 2B² √(2 log(2d) / n).

69 / 174

Page 93:

Rademacher Complexity of Sigmoid Networks: Proof

The final step involved the following result.

For f ∈ F satisfying |f(x)| ≤ 1,

    E‖Rn‖F ≤ E √(2 log(2 |F(X_1^n)|) / n).

Lemma.

70 / 174
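
As an illustration of this lemma (an assumed numeric example, not part of the slides): for the coordinate class G ∪ −G used in the final step of the sigmoid-network proof, a Monte Carlo estimate of the Rademacher average sits below the √(2 log(2d)/n) bound used there. The data distribution, sizes, and names below are assumptions.

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.uniform(-1.0, 1.0, size=(n, d))        # ||X||_inf <= 1, as in the theorem

def rademacher_coordinate_class(X, n_mc=5000, rng=rng):
    """E_eps max_j |(1/n) sum_i eps_i X_ij|; the absolute value already covers -G."""
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.abs(eps @ X / n).max(axis=1).mean()

print("E||R_n||_{G u -G} ~", rademacher_coordinate_class(X))
print("sqrt(2 log(2d)/n)  =", np.sqrt(2 * np.log(2 * d) / n))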

Page 94:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

71 / 174

Page 95:

Rademacher Complexity of ReLU Networks

Using the positive homogeneity property of the ReLU nonlinearity (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)) gives a simple argument to bound the Rademacher complexity.

For a training set (X1, Y1), . . . , (Xn, Yn) ∈ X × {±1} with ‖Xi‖ ≤ 1 a.s.,

    E‖Rn‖_{F_{F,B}} ≤ (2B)^L / √n,

where f ∈ F_{F,B} is an L-layer network of the form

    F_{F,B} := { x ↦ WL σ(WL−1 · · · σ(W1 x) · · · ) },

σ is 1-Lipschitz, positive homogeneous (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)), and applied componentwise, and ‖Wi‖F ≤ B. (WL is a row vector.)

Theorem.

72 / 174

Page 96:

ReLU Networks: Proof

(Write Eε as the conditional expectation given the data.)

    Eε sup_{f∈F, ‖W‖_F ≤ B} (1/n) ‖ Σ_{i=1}^n εi σ(W f(Xi)) ‖₂
        ≤ 2B Eε sup_{f∈F} (1/n) ‖ Σ_{i=1}^n εi f(Xi) ‖₂.

Lemma.

Iterating this and using Jensen’s inequality proves the theorem.

73 / 174

Page 97:

ReLU Networks: Proof

For W⊤ = (w1 · · · wk), we use positive homogeneity:

    ‖ Σ_{i=1}^n εi σ(W f(xi)) ‖² = Σ_{j=1}^k ( Σ_{i=1}^n εi σ(wj⊤ f(xi)) )²
                                 = Σ_{j=1}^k ‖wj‖² ( Σ_{i=1}^n εi σ( (wj⊤/‖wj‖) f(xi) ) )².

74 / 174

Page 98:

ReLU Networks: Proof

    sup_{‖W‖_F ≤ B} Σ_{j=1}^k ‖wj‖² ( Σ_{i=1}^n εi σ( (wj⊤/‖wj‖) f(xi) ) )²
        = sup_{‖wj‖=1, ‖α‖₁ ≤ B²} Σ_{j=1}^k αj ( Σ_{i=1}^n εi σ(wj⊤ f(xi)) )²
        = B² sup_{‖w‖=1} ( Σ_{i=1}^n εi σ(w⊤ f(xi)) )²,

then apply the Contraction and Cauchy–Schwarz inequalities.

75 / 174

Page 99:

Outline

Introduction

Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers

Classification

Beyond ERM

Interpolation

76 / 174

Page 100:

Rademacher Complexity and Covering Numbers

An ε-cover of a subset T of a metric space (S, d) is a set T̂ ⊂ T such that for each t ∈ T there is a t̂ ∈ T̂ such that d(t, t̂) ≤ ε. The ε-covering number of T is

    N(ε, T, d) = min{ |T̂| : T̂ is an ε-cover of T }.

Definition.

Intuition: A k-dimensional set T has N(ε, T, d) = Θ(1/ε^k). Example: ([0, 1]^k, ℓ∞).

77 / 174
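
To make the intuition concrete, here is a small sketch (an illustrative construction, not from the slides; the function name and sizes are assumptions) of an explicit ε-cover of ([0, 1]^k, ℓ∞): a grid with spacing 2ε, whose size grows like (1/(2ε))^k.

import itertools
import numpy as np

def linf_cover_of_cube(k, eps):
    """An explicit eps-cover of [0,1]^k in l_inf: a grid with spacing 2*eps."""
    m = int(np.ceil(1.0 / (2 * eps)))                       # points per coordinate
    grid_1d = np.minimum((2 * np.arange(m) + 1) * eps, 1.0)  # centers eps, 3*eps, ...
    return list(itertools.product(grid_1d, repeat=k))

for eps in [0.25, 0.1, 0.05]:
    cover = linf_cover_of_cube(k=3, eps=eps)
    print(f"eps={eps}: |cover| = {len(cover)}  vs  (1/(2*eps))^k = {(1/(2*eps))**3:.0f}")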

Page 101:

Rademacher Complexity and Covering Numbers

For any F and any X1, . . . , Xn,

    Eε‖R̂n‖_{F(X_1^n)} ≤ c ∫_0^∞ √( ln N(ε, F(X_1^n), ℓ₂) / n ) dε

and

    Eε‖R̂n‖_{F(X_1^n)} ≥ (1 / (c log n)) sup_{α>0} α √( ln N(α, F(X_1^n), ℓ₂) / n ).

Theorem.

78 / 174

Page 102:

Covering numbers and smoothly parameterized functions

Let F be a parameterized class of functions,

    F = { f(θ, ·) : θ ∈ Θ }.

Let ‖·‖Θ be a norm on Θ and let ‖·‖F be a norm on F. Suppose that the mapping θ ↦ f(θ, ·) is L-Lipschitz, that is,

    ‖f(θ, ·) − f(θ′, ·)‖F ≤ L ‖θ − θ′‖Θ.

Then N(ε, F, ‖·‖F) ≤ N(ε/L, Θ, ‖·‖Θ).

Lemma.

79 / 174

Page 103:

Covering numbers and smoothly parameterized functions

A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space.

Example: If F is smoothly parameterized by (a compact set of) k parameters, then N(ε, F) = O(1/ε^k).

80 / 174
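
A small sketch of the lemma in action (the particular parameterized class, sin(θx) on [0, 1] with θ ∈ [0, 2], is an assumed example, not from the slides): θ ↦ f(θ, ·) is 1-Lipschitz into the sup-norm, so an (ε/L)-grid of the parameter interval yields an ε-cover of the function class.

import numpy as np

# Assumed example: F = { x -> sin(theta * x) on [0, 1] : theta in [0, 2] }.
# |sin(theta x) - sin(theta' x)| <= |theta - theta'| |x| <= |theta - theta'|,
# so theta -> f(theta, .) is 1-Lipschitz from (Theta, |.|) into (F, sup-norm).
xs = np.linspace(0.0, 1.0, 200)
eps, L = 0.1, 1.0

centers = np.arange(eps / L, 2.0, 2 * eps / L)   # an (eps/L)-cover of Theta = [0, 2]
print("|cover of Theta| =", len(centers))

# The translated cover of F: check every f(theta, .) is within eps in sup-norm.
worst = 0.0
for theta in np.linspace(0.0, 2.0, 500):
    dists = [np.max(np.abs(np.sin(theta * xs) - np.sin(c * xs))) for c in centers]
    worst = max(worst, min(dists))
print("max distance to the cover:", round(worst, 3), " (<= eps =", eps, ")")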

Page 104:

Recap

[Figure: schematic of the class G within the set of all functions, marking g0, gG, gn, Eg, and the empirical average (1/n) Σ_{i=1}^n g(zi).]

81 / 174

Page 112:

Recap

    E‖P − Pn‖F ≈ E‖Rn‖F

Advantage of dealing with the Rademacher process:

- often easier to analyze
- can work conditionally on the data

Example: F = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1}, with ‖x‖ ≤ 1. Then

    Eε‖R̂n‖F = Eε sup_{‖w‖≤1} | (1/n) Σ_{i=1}^n εi ⟨w, xi⟩ | = Eε sup_{‖w‖≤1} | ⟨w, (1/n) Σ_{i=1}^n εi xi⟩ |,

which is equal to

    Eε ‖ (1/n) Σ_{i=1}^n εi xi ‖ ≤ √( Σ_{i=1}^n ‖xi‖² ) / n = √( tr(XX⊤) ) / n.

82 / 174
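
The closed-form bound on this slide is easy to check numerically. A minimal sketch (assumed data, names, and Monte Carlo size, not from the slides): sample random signs, compute ‖(1/n) Σ εi xi‖ exactly for each draw, and compare the average with √(tr(XX⊤))/n.

import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x_i|| <= 1

def rademacher_linear_class(X, n_mc=5000, rng=rng):
    """E_eps sup_{||w||<=1} |(1/n) sum_i eps_i <w, x_i>| = E_eps ||(1/n) sum_i eps_i x_i||."""
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.linalg.norm(eps @ X / n, axis=1).mean()

print("E_eps ||hat R_n||_F ~", rademacher_linear_class(X))
print("sqrt(tr(X X^T))/n   =", np.sqrt(np.trace(X @ X.T)) / n)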

Page 113:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

83 / 174

Page 114:

Classification and Rademacher Complexity

Recall:

For F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = 1{y ≠ y′} = (1 − y y′)/2,

    E‖Rn‖ℓ∘F ≤ (1/2) E‖Rn‖F + O(1/√n).

How do we get a handle on the Rademacher complexity of a set F of classifiers? Recall that we are looking at the "size" of F projected onto data:

    F(x_1^n) = { (f(x1), . . . , f(xn)) : f ∈ F } ⊆ {−1, 1}^n

84 / 174

Page 115:

Controlling Rademacher Complexity: Growth Function

Example: For the class of step functions, F = {x ↦ 1{x ≤ α} : α ∈ R}, the set of restrictions,

    F(x_1^n) = { (f(x1), . . . , f(xn)) : f ∈ F },

is always small: |F(x_1^n)| ≤ n + 1. So E‖Rn‖F ≤ √( 2 log(2(n + 1)) / n ).

Example: If F is parameterized by k bits, F = { x ↦ f(x, θ) : θ ∈ {0, 1}^k } for some f : X × {0, 1}^k → [0, 1], then

    |F(x_1^n)| ≤ 2^k,    E‖Rn‖F ≤ √( 2(k + 1) log 2 / n ).

85 / 174
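
A direct count for the first example (an illustrative sketch, not from the slides; the sample and names are assumptions): sweeping the threshold across the gaps between sorted sample points produces exactly n + 1 distinct restriction vectors.

import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, size=20))

# Restrictions of F = { x -> 1{x <= a} : a in R } to the sample: one threshold
# per "gap" between consecutive sample points is enough to see every pattern.
sweep = np.concatenate(([-1.0], (x[:-1] + x[1:]) / 2, [2.0]))
patterns = {tuple((x <= a).astype(int)) for a in sweep}

print("n =", len(x), "  |F(x_1^n)| =", len(patterns))   # equals n + 1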

Page 117:

Growth Function

For a class F ⊆ {0, 1}^X, the growth function is

    ΠF(n) = max{ |F(x_1^n)| : x1, . . . , xn ∈ X }.

Definition.

- E‖Rn‖F ≤ √( 2 log(2 ΠF(n)) / n );  E‖P − Pn‖F = O( √( log(ΠF(n)) / n ) ).
- ΠF(n) ≤ |F|,  lim_{n→∞} ΠF(n) = |F|.
- ΠF(n) ≤ 2^n. (But then this gives no useful bound on E‖Rn‖F.)

86 / 174

Page 121:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

87 / 174

Page 122:

Vapnik-Chervonenkis Dimension

A class F ⊆ {0, 1}^X shatters {x1, . . . , xd} ⊆ X means that |F(x_1^d)| = 2^d. The Vapnik-Chervonenkis dimension of F is

    dVC(F) = max{ d : some x1, . . . , xd ∈ X is shattered by F }
           = max{ d : ΠF(d) = 2^d }.

Definition.

88 / 174
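
For simple classes the definition can be checked by brute force. The sketch below (assumed 1-D examples, not from the slides; a grid of parameters suffices here because only the ordering of the parameter relative to the sample matters) verifies that thresholds on the line shatter one point but not two, and intervals shatter two points but not three.

import itertools
import numpy as np

def achievable_labelings(points, classifiers):
    """The set F(points): all labelings of the points realized by the classifiers."""
    return {tuple(int(c(x)) for x in points) for c in classifiers}

def is_shattered(points, classifiers):
    return len(achievable_labelings(points, classifiers)) == 2 ** len(points)

grid = np.linspace(-1.0, 2.0, 301)   # parameter grid; only its order vs the points matters
thresholds = [lambda x, a=a: x >= a for a in grid]                       # 1{x >= a}
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a, b in itertools.product(grid, grid)]                  # 1{a <= x <= b}

print(is_shattered([0.3], thresholds), is_shattered([0.3, 0.7], thresholds))          # True False
print(is_shattered([0.3, 0.7], intervals), is_shattered([0.2, 0.5, 0.8], intervals))  # True False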

Page 123:

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

dVC(F) ≤ d implies

    ΠF(n) ≤ Σ_{i=0}^d (n choose i).

If n ≥ d, the latter sum is no more than (en/d)^d.

Theorem (Vapnik-Chervonenkis).

So the VC-dimension d is a single integer summary of the growth function:

    ΠF(n) = 2^n if n ≤ d,    ΠF(n) ≤ (e/d)^d n^d if n > d.

89 / 174

Page 125:

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

Thus, for dVC(F) ≤ d and n ≥ d,

    sup_P E‖P − Pn‖F = O( √( d log(n/d) / n ) ).

And there is a matching lower bound.

Uniform convergence uniformly over probability distributions is equivalent to finiteness of the VC-dimension.

90 / 174

Page 126:

VC-dimension Bounds for Parameterized Families

Consider a parameterized class of binary-valued functions,

    F = { x ↦ f(x, θ) : θ ∈ R^p },

where f : X × R^p → {±1}.

Suppose that for each x, θ ↦ f(x, θ) can be computed using no more than t operations of the following kinds:

1. arithmetic (+, −, ×, /),

2. comparisons (>, =, <),

3. output ±1.

dVC(F) ≤ 4p(t + 2).

Theorem (Goldberg and Jerrum).

91 / 174

Page 127:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

92 / 174

Page 128:

VC-Dimension of Neural Networks

Consider the class F of {−1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:

1. Piecewise constant (linear threshold units): VCdim(F) = Θ(p).
   (Baum and Haussler, 1989)

2. Piecewise linear (ReLUs): VCdim(F) = Θ(pL).
   (B., Harvey, Liaw, Mehrabian, 2017)

3. Piecewise polynomial: VCdim(F) = O(pL²).
   (B., Maiorov, Meir, 1998)

4. Sigmoid: VCdim(F) = O(p²k²).
   (Karpinsky and MacIntyre, 1994)

Theorem.

93 / 174

Page 133:

Outline

Introduction

Uniform Laws of Large Numbers

Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers

Beyond ERM

Interpolation

94 / 174

Page 134:

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

95 / 174

Page 135:

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

96 / 174

Page 136:

Some Experimental Observations

(Lawrence, Giles, Tsoi, 1997)

97 / 174

Page 137:

Some Experimental Observations

(Neyshabur, Tomioka, Srebro, 2015)

98 / 174

Page 138:

Large-Margin Classifiers

- Consider a real-valued function f : X → R used for classification.
- The prediction on x ∈ X is sign(f(x)) ∈ {−1, 1}.
- If yf(x) > 0 then f classifies x correctly.
- We call yf(x) the margin of f on x. (c.f. the perceptron)
- We can view a larger margin as a more confident correct classification.
- Minimizing a continuous loss, such as

      Σ_{i=1}^n (f(Xi) − Yi)²,   or   Σ_{i=1}^n exp(−Yi f(Xi)),

  encourages large margins.
- For large-margin classifiers, we should expect the fine-grained details of f to be less important.

99 / 174

Page 139:

Recall

Contraction property:

If ℓ has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y and |ℓ| ≤ 1, then

    E‖Rn‖ℓ∘F ≤ 2 E‖Rn‖F + c/√n.

Unfortunately, classification loss is not Lipschitz. By writing classification loss as (1 − y y′)/2, we did show that

    E‖Rn‖ℓ∘sign(F) ≤ 2 E‖Rn‖sign(F) + c/√n,

but we want E‖Rn‖F.

100 / 174

Page 140:

Rademacher Complexity for Lipschitz Loss

Consider the 1/γ-Lipschitz surrogate loss

    φ(α) = 1          if α ≤ 0,
           1 − α/γ    if 0 < α < γ,
           0          if α ≥ γ.

[Figure: φ(yf(x)) and the 0-1 loss 1{yf(x) ≤ 0} plotted against yf(x).]

101 / 174
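
A compact way to write this surrogate (a sketch; the function name is an assumption) clips the linear ramp to [0, 1], which also makes the sandwich 1{α ≤ 0} ≤ φ(α) ≤ 1{α ≤ γ} easy to check numerically.

import numpy as np

def ramp_loss(margin, gamma):
    """phi: 1 for margin <= 0, 1 - margin/gamma in between, 0 for margin >= gamma."""
    return np.clip(1.0 - margin / gamma, 0.0, 1.0)

m = np.array([-0.5, 0.0, 0.25, 0.5, 1.0])
print(ramp_loss(m, gamma=0.5))
# Sandwich: 1{m <= 0} <= phi(m) <= 1{m <= gamma}
print((m <= 0).astype(float), ramp_loss(m, 0.5), (m <= 0.5).astype(float))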

Page 141:

Rademacher Complexity for Lipschitz Loss

Large margin loss is an upper bound and classification loss is a lower bound:

    1{Yf(X) ≤ 0} ≤ φ(Yf(X)) ≤ 1{Yf(X) ≤ γ}.

So if we can relate the Lipschitz risk Pφ(Yf(X)) to the Lipschitz empirical risk Pnφ(Yf(X)), we have a large margin bound:

    P 1{Yf(X) ≤ 0} ≤ P φ(Yf(X))   c.f.   Pn φ(Yf(X)) ≤ Pn 1{Yf(X) ≤ γ}.

102 / 174

Page 142:

Rademacher Complexity for Lipschitz Loss

With high probability, for all f ∈ F,

    L01(f) = P 1{Yf(X) ≤ 0}
           ≤ P φ(Yf(X))
           ≤ Pn φ(Yf(X)) + (c/γ) E‖Rn‖F + O(1/√n)
           ≤ Pn 1{Yf(X) ≤ γ} + (c/γ) E‖Rn‖F + O(1/√n).

Notice that we've turned a classification problem into a regression problem.

The VC-dimension (which captures arbitrarily fine-grained properties of the function class) is no longer important.

103 / 174

Page 143:

Example: Linear Classifiers

Let F = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1}. Then, since E‖Rn‖F ≲ n^{−1/2},

    L01(f) ≲ (1/n) Σ_{i=1}^n 1{Yi f(Xi) ≤ γ} + (1/γ) · (1/√n).

Compare to Perceptron.

104 / 174
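
Ignoring constants, the right-hand side of this bound can be computed from data. A rough sketch (synthetic data, an arbitrary unit-norm predictor, and all names are assumptions; the confidence term and constants are omitted, matching the ≲ on the slide):

import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 20
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x|| <= 1
y = np.sign(X @ w_star + 0.05 * rng.normal(size=n))

w = (y[:, None] * X).mean(axis=0); w /= np.linalg.norm(w)   # some unit-norm predictor
margins = y * (X @ w)

for gamma in [0.05, 0.1, 0.2]:
    margin_term = np.mean(margins <= gamma)            # (1/n) sum_i 1{ y_i f(x_i) <= gamma }
    complexity_term = 1.0 / (gamma * np.sqrt(n))       # (1/gamma) (1/sqrt(n)), constants dropped
    print(f"gamma={gamma}: {margin_term:.3f} + {complexity_term:.3f} = "
          f"{margin_term + complexity_term:.3f}")

Small γ shrinks the empirical margin term but inflates the complexity term, which is the tradeoff the bound expresses.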

Page 144:

Risk versus surrogate risk

In general, these margin bounds are upper bounds only.

But for any surrogate loss φ, we can define a transform ψ that gives a tight relationship between the excess 0-1 risk (over the Bayes risk) and the excess φ-risk:

1. For any P and f, ψ(L01(f) − L*01) ≤ Lφ(f) − L*φ.

2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with

       L01(f) − L*01 = θ,    ψ(θ) ≤ Lφ(f) − L*φ ≤ ψ(θ) + ε.

Theorem.

105 / 174

Page 145:

Generalization: Margins and Size of Parameters

- A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.
- For regression, the complexity of a neural network is controlled by the size of the parameters, and can be independent of the number of parameters.
- We have a tradeoff between the fit to the training data (margins) and the complexity (size of parameters):

      Pr(sign(f(X)) ≠ Y) ≤ (1/n) Σ_{i=1}^n φ(Yi f(Xi)) + pn(f)

- Even if the training set is classified correctly, it might be worthwhile to increase the complexity, to improve this loss function.

106 / 174

Page 149:

In the next few sections, we show that

    E[ L(fn) − L̂(fn) ]

can be small if the learning algorithm fn has certain properties.

Keep in mind that this is an interesting quantity for two reasons:

- it upper bounds the excess expected loss of ERM, as we showed before.
- it can be written as an a-posteriori bound

      L(fn) ≤ L̂(fn) + . . .

  regardless of whether fn minimized empirical loss.

107 / 174

Page 150:

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

108 / 174

Page 151:

Compression Set

Let us use the shortened notation for data: S = {Z1, . . . , Zn}, and let us make the dependence of the algorithm fn on the training set explicit: fn = fn[S]. As before, denote G = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}, and write gn(·) = ℓ(fn(·), ·). Let us write gn[S](·) to emphasize the dependence.

Suppose there exists a "compression function" Ck which selects from any dataset S of size n a subset of k examples Ck(S) ⊆ S such that

    fn[S] = fk[Ck(S)].

That is, the learning algorithm produces the same function when given S or its subset Ck(S).

One can keep in mind the example of support vectors in SVMs.

109 / 174

Page 152:

Then,

    L(fn) − L̂(fn) = E gn − (1/n) Σ_{i=1}^n gn(Zi)
                  = E gk[Ck(S)](Z) − (1/n) Σ_{i=1}^n gk[Ck(S)](Zi)
                  ≤ max_{I ⊆ {1,...,n}, |I| ≤ k} [ E gk[S_I](Z) − (1/n) Σ_{i=1}^n gk[S_I](Zi) ],

where S_I is the subset indexed by I.

110 / 174

Page 153:

Since gk[S_I] only depends on k out of n points, the empirical average is "mostly out of sample". Adding and subtracting loss functions for an additional set of i.i.d. random variables W = {Z′1, . . . , Z′k} results in an upper bound

    max_{I ⊆ {1,...,n}, |I| ≤ k} [ E gk[S_I](Z) − (1/n) Σ_{Z′∈S′} gk[S_I](Z′) ] + (b − a) k / n,

where [a, b] is the range of functions in G and S′ is obtained from S by replacing S_I with the corresponding subset W_I.

111 / 174

Page 154:

For each fixed I, the random variable

    E gk[S_I](Z) − (1/n) Σ_{Z′∈S′} gk[S_I](Z′)

is zero mean with standard deviation O((b − a)/√n). Hence, the expected maximum over I with respect to S, W is at most

    c (b − a) √( k log(en/k) / n ),

since log (n choose k) ≤ k log(en/k).

112 / 174

Page 155:

Conclusion: the compression-style argument limits the bias

    E[ L(fn) − L̂(fn) ] ≤ O( √( k log n / n ) ),

which is non-vacuous if k = o(n / log n).

Recall that this term was the upper bound (up to log factors) on the expected excess loss of ERM if the class has VC dimension k. However, a possible equivalence between compression and VC dimension is still being investigated.

113 / 174

Example: Classification with Thresholds in 1D

▶ $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$
▶ $\mathcal{F} = \{f_\theta : f_\theta(x) = \mathbb{I}\{x \ge \theta\},\ \theta \in [0, 1]\}$
▶ $\ell(f_\theta(x), y) = \mathbb{I}\{f_\theta(x) \ne y\}$

[Figure: the interval $[0,1]$ with the ERM threshold $f_n$.]

For any set of data $(x_1, y_1), \ldots, (x_n, y_n)$, the ERM solution $f_n$ has the property that the first point $x_l$ to the left of the threshold has label $y_l = 0$, while the first point $x_r$ to the right has label $y_r = 1$.

It is enough to take $k = 2$ and define $f_n[S] = f_2[\{(x_l, 0), (x_r, 1)\}]$. A small numerical sketch of this compression is given below.
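As a minimal sketch (not from the slides), the following fits the 1D threshold class by brute-force ERM and then applies a two-point compression function; the helper names and the realizable synthetic data are illustrative assumptions.

```python
import numpy as np

def erm_threshold(x, y):
    """ERM over f_theta(x) = 1{x >= theta} under 0-1 loss: scan candidate thresholds."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    candidates = np.concatenate(([0.0], (xs[:-1] + xs[1:]) / 2, [1.0]))
    errors = [np.mean((xs >= t).astype(int) != ys) for t in candidates]
    return candidates[int(np.argmin(errors))]

def compress(x, y, theta):
    """Compression C_2: keep the closest 0-labelled point left of theta and the
    closest 1-labelled point right of theta; any threshold between them is again ERM."""
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < theta and yi == 0]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= theta and yi == 1]
    kept = []
    if left:
        kept.append(max(left))    # rightmost 0-labelled point on the left
    if right:
        kept.append(min(right))   # leftmost 1-labelled point on the right
    return kept

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = (x >= 0.4).astype(int)        # realizable labels for illustration
theta = erm_threshold(x, y)
print(theta, compress(x, y, theta))
```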

114 / 174

Further examples/observations:

▶ Compression of size $d$ for hyperplanes (realizable case)

▶ Compression of size $1/\gamma^2$ for the margin case

▶ The Bernstein bound gives a $1/n$ rate rather than a $1/\sqrt{n}$ rate on realizable data (zero empirical error).

115 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

116 / 174

Algorithmic Stability

Recall that compression was a way to upper bound $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$.

Algorithmic stability is another path to the same goal.

Compare:

▶ Compression: $f_n$ depends only on a subset of $k$ datapoints.

▶ Stability: $f_n$ does not depend on any of the datapoints too strongly.

As before, let us write the shorthand $g = \ell \circ f$ and $g_n = \ell \circ f_n$.

117 / 174

Algorithmic Stability

We now write
$$\mathbb{E}_S\, L(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z_1,\ldots,Z_n](Z).$$
Again, the meaning of $g_n[Z_1,\ldots,Z_n](Z)$: train on $Z_1,\ldots,Z_n$ and test on $Z$.

On the other hand,
$$\mathbb{E}_S\, \hat{L}(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n} \frac{1}{n}\sum_{i=1}^n g_n[Z_1,\ldots,Z_n](Z_i) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z_1,\ldots,Z_n}\, g_n[Z_1,\ldots,Z_n](Z_i) = \mathbb{E}_{Z_1,\ldots,Z_n}\, g_n[Z_1,\ldots,Z_n](Z_1),$$
where the last step holds for symmetric algorithms (with respect to permutations of the training data). Of course, instead of $Z_1$ we can take any $Z_i$.

118 / 174

Now comes the renaming trick. It takes a minute to get used to, if you haven't seen it.

Note that $Z_1,\ldots,Z_n,Z$ are i.i.d. Hence,
$$\mathbb{E}_S\, L(f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z_1,\ldots,Z_n](Z) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\; g_n[Z,Z_2,\ldots,Z_n](Z_1).$$
Therefore,
$$\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big] = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\Big\{ g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1) \Big\}.$$

119 / 174

Of course, we haven't really done much except re-writing the expectation. But the difference
$$g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1)$$
has a "stability" interpretation. If the output of the algorithm "does not change much" when one datapoint is replaced with another, then the gap $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$ is small.

Moreover, since everything we've written is an equality, this stability is equivalent to having a small gap $\mathbb{E}\big[L(f_n) - \hat{L}(f_n)\big]$.

120 / 174

Page 163: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

NB: our aim of ensuring small EL(fn)− L(fn)

only makes sense if

L(fn) is small (e.g. on average). That is, the analysis only makes sensefor those methods that explicitly or implicitly minimize empirical loss (ora regularized variant of it).

It’s not enough to be stable. Consider a learning mechanism that ignoresthe data and outputs fn = f0, a constant function. Then

EL(fn)− L(fn)

= 0 and the algorithm is very stable. However, it does

not do anything interesting.

121 / 174

Uniform Stability

Rather than the average notion we just discussed, let's consider a much stronger notion:

We say that an algorithm is $\beta$-uniformly stable if
$$\forall i \in [n],\ \forall z_1,\ldots,z_n, z', z: \quad \big| g_n[S](z) - g_n[S^{i,z'}](z) \big| \le \beta,$$
where $S^{i,z'} = \{z_1,\ldots,z_{i-1}, z', z_{i+1},\ldots,z_n\}$.

122 / 174

Uniform Stability

Clearly, for any realization of $Z_1,\ldots,Z_n,Z$,
$$g_n[Z,Z_2,\ldots,Z_n](Z_1) - g_n[Z_1,\ldots,Z_n](Z_1) \le \beta,$$
and so the expected loss of a $\beta$-uniformly-stable ERM is $\beta$-close to its empirical error (in expectation). See recent work by Feldman and Vondrak for sharp high-probability statements.

Of course, it is unclear at this point whether a $\beta$-uniformly-stable ERM (or near-ERM) exists.

123 / 174

Kernel Ridge Regression

Consider
$$f_n = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2 + \lambda \|f\|_K^2$$
in the RKHS $\mathcal{H}$ corresponding to kernel $K$.

Assume $K(x, x) \le \kappa^2$ for any $x$.

Lemma. Kernel Ridge Regression is $\beta$-uniformly stable with $\beta = O\!\left(\frac{1}{\lambda n}\right)$.
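To make the lemma concrete, here is a minimal numerical sketch (not from the slides): it fits kernel ridge regression in closed form with an RBF kernel (so $\kappa = 1$), replaces one training point, and compares the loss at a fixed test point. The synthetic data, bandwidth, and function names are illustrative assumptions; the observed gap should be bounded, in order, by $1/(\lambda n)$.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); K(x, x) = 1, so kappa = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def krr_fit(X, Y, lam, bandwidth=1.0):
    # Closed-form KRR for (1/n) sum (f(X_i)-Y_i)^2 + lam ||f||_K^2:
    # alpha = (K + lam * n * I)^{-1} Y,  f(x) = sum_i alpha_i K(x, X_i).
    n = len(X)
    K = rbf_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)
    return lambda Xtest: rbf_kernel(Xtest, X, bandwidth) @ alpha

rng = np.random.default_rng(0)
n, lam = 200, 0.1
X = rng.uniform(-1, 1, size=(n, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

z_test, y_test = np.array([[0.3]]), np.sin(3 * 0.3)

f = krr_fit(X, Y, lam)
X2, Y2 = X.copy(), Y.copy()
X2[0], Y2[0] = rng.uniform(-1, 1), rng.standard_normal()   # replace one training point
f2 = krr_fit(X2, Y2, lam)

# Loss difference at the test point; uniform stability bounds it by O(1/(lambda * n)).
gap = abs((f(z_test)[0] - y_test) ** 2 - (f2(z_test)[0] - y_test) ** 2)
print(gap, 1 / (lam * n))
```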

124 / 174

Proof (stability of Kernel Ridge Regression)

To prove this, first recall the definition of a $\sigma$-strongly convex function $\phi$ on a convex domain $\mathcal{W}$:
$$\forall u, v \in \mathcal{W}, \quad \phi(u) \ge \phi(v) + \langle \nabla\phi(v), u - v\rangle + \frac{\sigma}{2}\|u - v\|^2.$$
Suppose $\phi, \phi'$ are both $\sigma$-strongly convex. Suppose $w, w'$ satisfy $\nabla\phi(w) = \nabla\phi'(w') = 0$. Then
$$\phi(w') \ge \phi(w) + \frac{\sigma}{2}\|w - w'\|^2 \quad\text{and}\quad \phi'(w) \ge \phi'(w') + \frac{\sigma}{2}\|w - w'\|^2.$$
As a trivial consequence,
$$\sigma\|w - w'\|^2 \le \big[\phi(w') - \phi'(w')\big] + \big[\phi'(w) - \phi(w)\big].$$

125 / 174

Proof (stability of Kernel Ridge Regression)

Now take
$$\phi(f) = \frac{1}{n}\sum_{i \in S} (f(x_i) - y_i)^2 + \lambda\|f\|_K^2 \quad\text{and}\quad \phi'(f) = \frac{1}{n}\sum_{i \in S'} (f(x_i) - y_i)^2 + \lambda\|f\|_K^2,$$
where $S$ and $S'$ differ in one element: $(x_i, y_i)$ is replaced with $(x'_i, y'_i)$.

Let $f_n, f'_n$ be the minimizers of $\phi, \phi'$, respectively. Then
$$\phi(f'_n) - \phi'(f'_n) \le \frac{1}{n}\Big( (f'_n(x_i) - y_i)^2 - (f'_n(x'_i) - y'_i)^2 \Big)$$
and
$$\phi'(f_n) - \phi(f_n) \le \frac{1}{n}\Big( (f_n(x'_i) - y'_i)^2 - (f_n(x_i) - y_i)^2 \Big).$$
NB: we have been operating with $f$ as vectors. To be precise, one needs to define the notion of strong convexity over $\mathcal{H}$. Let us sweep this under the rug and say that $\phi, \phi'$ are $2\lambda$-strongly convex with respect to $\|\cdot\|_K$.

126 / 174

Proof (stability of Kernel Ridge Regression)

Then $\|f_n - f'_n\|_K^2$ is at most
$$\frac{1}{2\lambda n}\Big( (f'_n(x_i) - y_i)^2 - (f_n(x_i) - y_i)^2 + (f_n(x'_i) - y'_i)^2 - (f'_n(x'_i) - y'_i)^2 \Big),$$
which is at most
$$\frac{1}{2\lambda n}\, C\, \|f_n - f'_n\|_\infty,$$
where $C = 4(1 + c)$ if $|Y_i| \le 1$ and $|f_n(x_i)| \le c$.

On the other hand, for any $x$,
$$f(x) = \langle f, K_x \rangle \le \|f\|_K \|K_x\| = \|f\|_K \sqrt{\langle K_x, K_x\rangle} = \|f\|_K \sqrt{K(x, x)} \le \kappa\|f\|_K,$$
and so $\|f\|_\infty \le \kappa\|f\|_K$.

127 / 174

Proof (stability of Kernel Ridge Regression)

Putting everything together,
$$\|f_n - f'_n\|_K^2 \le \frac{1}{2\lambda n}\, C\, \|f_n - f'_n\|_\infty \le \frac{\kappa C}{2\lambda n}\, \|f_n - f'_n\|_K.$$
Hence,
$$\|f_n - f'_n\|_K \le \frac{\kappa C}{2\lambda n}.$$
To finish the claim,
$$(f_n(x_i) - y_i)^2 - (f'_n(x_i) - y_i)^2 \le C\,\|f_n - f'_n\|_\infty \le \kappa C\, \|f_n - f'_n\|_K \le O\!\left(\frac{1}{\lambda n}\right).$$

128 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods

Interpolation

129 / 174

We consider "local" procedures such as k-Nearest-Neighbors or local smoothing. They have a different bias-variance decomposition (we do not fix a class $\mathcal{F}$).

The analysis will rely on local similarity (e.g. Lipschitzness) of the regression function $f^*$.

Idea: to predict $y$ at a given $x$, look up in the dataset those $Y_i$ for which $X_i$ is "close" to $x$.

130 / 174

Bias-Variance

It's time to revisit the bias-variance picture. Recall that our goal was to ensure that
$$\mathbb{E}\, L(f_n) - L(f^*)$$
decreases with the data size $n$, where $f^*$ gives the smallest possible $L$.

For "simple problems" (that is, strong assumptions on $P$), one can ensure this without the bias-variance decomposition. Examples: Perceptron, linear regression in the $d < n$ regime, etc.

However, for more interesting problems, we cannot get this difference to be small without introducing bias, because otherwise the variance (fluctuation of the stochastic part) is too large.

Our approach so far was to split this term into an estimation-approximation error with respect to some class $\mathcal{F}$:
$$\big[\mathbb{E}\, L(f_n) - L(f_\mathcal{F})\big] + \big[L(f_\mathcal{F}) - L(f^*)\big].$$

131 / 174

Bias-Variance

Now we'll study a different bias-variance decomposition, typically used in nonparametric statistics. We will only work with square loss.

Rather than fixing $\mathcal{F}$ that controls the estimation error, we fix an algorithm (procedure/estimator) $f_n$ that has some tunable parameter.

By definition $\mathbb{E}[Y \mid X = x] = f^*(x)$. Then we write
$$\mathbb{E}\, L(f_n) - L(f^*) = \mathbb{E}(f_n(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2 = \mathbb{E}(f_n(X) - f^*(X) + f^*(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2 = \mathbb{E}(f_n(X) - f^*(X))^2,$$
because the cross term vanishes (check!).
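For completeness, here is the one-line check (not spelled out on the slide). Since the training sample is independent of the fresh pair $(X, Y)$ and $\mathbb{E}[Y \mid X] = f^*(X)$, conditioning on $X$ and the sample gives
$$\mathbb{E}\big[(f_n(X) - f^*(X))(f^*(X) - Y)\big] = \mathbb{E}\Big[(f_n(X) - f^*(X))\,\mathbb{E}\big[f^*(X) - Y \,\big|\, X,\, \text{sample}\big]\Big] = 0.$$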

132 / 174

Bias-Variance

Before proceeding, let us discuss the last expression.
$$\mathbb{E}(f_n(X) - f^*(X))^2 = \mathbb{E}_S \int_x (f_n(x) - f^*(x))^2\, P(dx) = \int_x \mathbb{E}_S (f_n(x) - f^*(x))^2\, P(dx).$$
We will often analyze $\mathbb{E}_S(f_n(x) - f^*(x))^2$ for fixed $x$ and then integrate.

The integral is a measure of distance between two functions:
$$\|f - g\|_{L_2(P)}^2 = \int_x (f(x) - g(x))^2\, P(dx).$$

133 / 174

Bias-Variance

Let us drop $L_2(P)$ from the notation for brevity. The bias-variance decomposition can be written as
$$\mathbb{E}_S \|f_n - f^*\|_{L_2(P)}^2 = \mathbb{E}_S \big\|f_n - \mathbb{E}_{Y_{1:n}}[f_n] + \mathbb{E}_{Y_{1:n}}[f_n] - f^*\big\|^2 = \mathbb{E}_S \big\|f_n - \mathbb{E}_{Y_{1:n}}[f_n]\big\|^2 + \mathbb{E}_{X_{1:n}} \big\|\mathbb{E}_{Y_{1:n}}[f_n] - f^*\big\|^2,$$
because the cross term is zero in expectation.

The first term is the variance, the second is the squared bias. One typically increases with the parameter, the other decreases.

The parameter is chosen either (a) theoretically or (b) by cross-validation (this is the usual case in practice).

134 / 174

k-Nearest Neighbors

As before, we are given $(X_1, Y_1), \ldots, (X_n, Y_n)$ i.i.d. from $P$. To make a prediction of $Y$ at a given $x$, we sort the points according to the distance $\|X_i - x\|$. Let
$$(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$$
be the sorted list (remember this depends on $x$).

The k-NN estimate is defined as
$$f_n(x) = \frac{1}{k}\sum_{i=1}^k Y_{(i)}.$$
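A minimal sketch of this estimator (not from the slides), on synthetic data with a smooth regression function; the data-generating choices are illustrative assumptions.

```python
import numpy as np

def knn_estimate(x, X, Y, k):
    """k-NN regression at a single query point x: average the Y's of the k
    training points X_i closest to x in Euclidean distance."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Y[nearest].mean()

rng = np.random.default_rng(0)
n, d, k = 500, 3, 20
X = rng.uniform(0, 1, size=(n, d))
f_star = lambda x: np.sin(2 * np.pi * x[..., 0])   # smooth (Lipschitz) regression function
Y = f_star(X) + 0.1 * rng.standard_normal(n)

x0 = np.full(d, 0.5)
print(knn_estimate(x0, X, Y, k), f_star(x0))
```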

135 / 174

Variance: given $x$,
$$f_n(x) - \mathbb{E}_{Y_{1:n}}[f_n(x)] = \frac{1}{k}\sum_{i=1}^k \big(Y_{(i)} - f^*(X_{(i)})\big),$$
which is on the order of $1/\sqrt{k}$. Then the variance is of the order $\frac{1}{k}$.

Bias: a bit more complicated. For a given $x$,
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x) = \frac{1}{k}\sum_{i=1}^k \big(f^*(X_{(i)}) - f^*(x)\big).$$
Suppose $f^*$ is 1-Lipschitz. Then the square of the above is
$$\left(\frac{1}{k}\sum_{i=1}^k \big(f^*(X_{(i)}) - f^*(x)\big)\right)^2 \le \frac{1}{k}\sum_{i=1}^k \big\|X_{(i)} - x\big\|^2.$$
So, the bias is governed by how close the closest $k$ random points are to $x$.

136 / 174

Page 179: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

Claim: enough to know the upper bound on the closest point to x amongn points.

Argument: for simplicity assume that n/k is an integer. Divide theoriginal (unsorted) dataset into k blocks, n/k size each. Let X i be theclosest point to x in ith block. Then the collection X 1, . . . ,X k , ak-subset which is no closer than the set of k nearest neighbors. That is,

1

k

k∑i=1

∥∥X(i) − x∥∥2 ≤ 1

k

k∑i=1

∥∥X i − x∥∥2

Taking expectation (with respect to dataset), the bias term is at most

E

1

k

k∑i=1

∥∥X i − x∥∥2

= E

∥∥X 1 − x∥∥2

which is expected squared distance from x to the closest point in arandom set of n/k points.

137 / 174

Page 180: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w

If support of X is bounded and d ≥ 3, then one can estimate

E∥∥X − X(1)

∥∥2. n−2/d .

That is, we expect the closest neighbor of a random point X to be nofurther than n−1/d away from one of n randomly drawn points.

Thus, the bias (which is the expected squared distance from x to theclosest point in a random set of n/k points) is at most

(n/k)−2/d .

138 / 174

Putting everything together, the bias-variance decomposition yields
$$\frac{1}{k} + \left(\frac{k}{n}\right)^{2/d}.$$
The optimal choice is $k \sim n^{\frac{2}{2+d}}$, and the overall rate of estimation at a given point $x$ is $n^{-\frac{2}{2+d}}$.

Since the result holds for any $x$, the integrated risk is also
$$\mathbb{E}\,\|f_n - f^*\|^2 \lesssim n^{-\frac{2}{2+d}}.$$

139 / 174

Summary

▶ We sketched the proof that k-Nearest-Neighbors has sample complexity guarantees for prediction or estimation problems with square loss if $k$ is chosen appropriately.

▶ The analysis is very different from the "empirical process" approach for ERM.

▶ Truly nonparametric!

▶ No assumptions on the underlying density (in $d \ge 3$) beyond compact support. Additional assumptions are needed for $d \le 3$.

140 / 174

Local Kernel Regression: Nadaraya-Watson

Fix a kernel $K : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$. Assume $K$ is zero outside the unit Euclidean ball at the origin (not true for $e^{-x^2}$, but close enough).

[Figure: examples of kernel shapes, from Gyorfi et al.]

Let $K_h(x) = K(x/h)$, so that $K_h(x - x')$ is zero if $\|x - x'\| \ge h$.

$h$ is the "bandwidth", a tunable parameter.

Assume $K(x) > c\,\mathbb{I}\{\|x\| \le 1\}$ for some $c > 0$. This is important for the "averaging effect" to kick in.

141 / 174

The Nadaraya-Watson estimator:
$$f_n(x) = \sum_{i=1}^n Y_i\, W_i(x), \qquad W_i(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}.$$
(Note: $\sum_i W_i = 1$.)
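A minimal sketch of this local average (not from the slides), using the simplest compactly supported kernel $K(u) = \mathbb{I}\{\|u\| \le 1\}$; the synthetic data and bandwidth are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate at a query point x with the boxcar kernel
    K(u) = 1{||u|| <= 1}: a local average of Y_i over the ball of radius h."""
    u = np.linalg.norm(X - x, axis=1) / h
    w = (u <= 1.0).astype(float)            # K_h(x - X_i)
    if w.sum() == 0.0:
        return Y.mean()                     # fallback when no points fall within h
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(0)
n, d, h = 1000, 2, 0.2
X = rng.uniform(0, 1, size=(n, d))
f_star = lambda x: x[..., 0] + x[..., 1]    # Lipschitz regression function
Y = f_star(X) + 0.1 * rng.standard_normal(n)

x0 = np.array([0.4, 0.6])
print(nadaraya_watson(x0, X, Y, h), f_star(x0))
```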

142 / 174

Unlike the k-NN example, the bias is easier to estimate.

Bias: for a given $x$,
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] = \mathbb{E}_{Y_{1:n}}\left[\sum_{i=1}^n Y_i\, W_i(x)\right] = \sum_{i=1}^n f^*(X_i)\, W_i(x),$$
and so
$$\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x) = \sum_{i=1}^n \big(f^*(X_i) - f^*(x)\big) W_i(x).$$
Suppose $f^*$ is 1-Lipschitz. Since $K_h$ is zero outside the $h$-radius ball,
$$\big|\mathbb{E}_{Y_{1:n}}[f_n(x)] - f^*(x)\big|^2 \le h^2.$$

143 / 174

Variance: we have
$$f_n(x) - \mathbb{E}_{Y_{1:n}}[f_n(x)] = \sum_{i=1}^n \big(Y_i - f^*(X_i)\big) W_i(x).$$
The expectation of the square of this difference is at most
$$\mathbb{E}\left[\sum_{i=1}^n \big(Y_i - f^*(X_i)\big)^2 W_i(x)^2\right],$$
since the cross terms are zero (fix the $X$'s, take expectation with respect to the $Y$'s).

We are left analyzing
$$n\,\mathbb{E}\left[\frac{K_h(x - X_1)^2}{\big(\sum_{i=1}^n K_h(x - X_i)\big)^2}\right].$$
Under some assumptions on the density of $X$, the denominator is at least $(nh^d)^2$ with high probability, whereas $\mathbb{E}\, K_h(x - X_1)^2 = O(h^d)$ assuming $\int K^2 < \infty$. This gives an overall variance of $O(1/(nh^d))$. Many details are skipped here (e.g. problems at the boundary, assumptions, etc.).

144 / 174

Overall, bias and variance with $h \sim n^{-\frac{1}{2+d}}$ yield
$$h^2 + \frac{1}{nh^d} = n^{-\frac{2}{2+d}}.$$

145 / 174

Summary

▶ Analyzed smoothing methods with kernels. As with nearest neighbors, slow (nonparametric) rates in large $d$.

▶ Same bias-variance decomposition approach as k-NN.

146 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

147 / 174

Overfitting in Deep Networks

▶ Deep networks can be trained to zero training error (for regression loss)

▶ ... with near state-of-the-art performance (Ruslan Salakhutdinov, Simons Machine Learning Boot Camp, January 2017)

▶ ... even for noisy problems.

▶ No tradeoff between fit to training data and complexity!

▶ Benign overfitting.

(Zhang, Bengio, Hardt, Recht, Vinyals, 2017)

148 / 174


Overfitting in Deep Networks

Not surprising:

▶ More parameters than sample size. (That's called 'nonparametric.')

▶ Good performance with zero classification loss on training data. (c.f. margins analysis: trade-off between (regression) fit and complexity.)

▶ Good performance with zero regression loss on training data when the Bayes risk is zero. (c.f. Perceptron)

Surprising:

▶ Good performance with zero regression loss on noisy training data.

▶ All the label noise is making its way into the parameters, but it isn't hurting prediction accuracy!

149 / 174


Overfitting

"... interpolating fits ... [are] unlikely to predict future data well at all."
"Elements of Statistical Learning," Hastie, Tibshirani, Friedman, 2001

150 / 174


Overfitting

151 / 174

Benign Overfitting

A new statistical phenomenon: good prediction with zero training error for regression loss.

▶ Statistical wisdom says a prediction rule should not fit too well.

▶ But deep networks are trained to fit noisy data perfectly, and they predict well.

152 / 174

Not an Isolated Phenomenon

[Figure: kernel regression on MNIST. Test error (log scale) vs. ridge parameter λ, for the digit pairs [2,5], [2,9], [3,6], [3,8], [4,7].]

λ = 0: the interpolated solution, perfect fit on training data.

153 / 174


Not an Isolated Phenomenon

154 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

155 / 174

Simplicial interpolation (Belkin-Hsu-Mitra)

Observe: 1-Nearest Neighbor is interpolating. However, one cannot guarantee that $\mathbb{E}\, L(f_n) - L(f^*)$ is small.

Cover-Hart '67:
$$\lim_{n\to\infty} \mathbb{E}\, L(f_n) \le 2\, L(f^*).$$

156 / 174

Simplicial interpolation (Belkin-Hsu-Mitra)

Under regularity conditions, simplicial interpolation $f_n$ satisfies
$$\limsup_{n\to\infty}\; \mathbb{E}\,\|f_n - f^*\|^2 \le \frac{2}{d+2}\,\mathbb{E}\big(f^*(X) - Y\big)^2.$$
Blessing of high dimensionality!

157 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

158 / 174

Nadaraya-Watson estimator (Belkin-R.-Tsybakov)

Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value $\tau$ at $0$, e.g.
$$K(x) = \min\{1/\|x\|^\alpha,\ \tau\}.$$
Note that large $\tau$ means $f_n(X_i) \approx Y_i$, since the weight $W_i(X_i)$ is large.

If $\tau = \infty$, we get interpolation $f_n(X_i) = Y_i$ of all training data. Yet the earlier proof still goes through. Hence, "memorizing the data" (governed by the parameter $\tau$) is completely decoupled from the bias-variance trade-off (as given by the parameter $h$).

Minimax rates are possible (with a suitable choice of $h$).
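A minimal sketch of the idea (not from the slides): the Nadaraya-Watson weights below use the truncated singular kernel $K(u) = \min\{1/\|u\|^\alpha, \tau\}$ restricted to the unit ball, so a large $\tau$ makes the estimate at a training point (nearly) reproduce its label while $h$ still controls the local averaging; the data, $\alpha$, $\tau$, and $h$ are illustrative assumptions.

```python
import numpy as np

def singular_kernel_nw(x, X, Y, h, alpha=1.0, tau=1e12):
    """Nadaraya-Watson with K(u) = min{1/||u||^alpha, tau}, truncated to ||u|| <= 1."""
    u = np.linalg.norm(X - x, axis=1) / h
    with np.errstate(divide="ignore"):
        w = np.minimum(u ** (-alpha), tau)   # huge (capped at tau) when x hits a data point
    w = np.where(u <= 1.0, w, 0.0)           # kernel is zero outside the bandwidth ball
    if w.sum() == 0.0:
        return Y.mean()                      # fallback when no points fall within h
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(0)
n, d, h = 500, 2, 0.3
X = rng.uniform(0, 1, size=(n, d))
Y = X[:, 0] + 0.1 * rng.standard_normal(n)

# At a training point the estimator (nearly) reproduces its label ...
print(singular_kernel_nw(X[0], X, Y, h), Y[0])
# ... while at a generic point it still averages locally.
print(singular_kernel_nw(np.array([0.5, 0.5]), X, Y, h))
```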

159 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

160 / 174

Min-norm interpolation in RKHS

Min-norm interpolation:
$$f_n := \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}, \quad \text{s.t. } f(x_i) = y_i \ \ \forall i \le n.$$
Closed-form solution:
$$f_n(x) = K(x, X)\, K(X, X)^{-1} Y.$$
Note: GD/SGD started from $0$ converges to this minimum-norm solution.

Difficulty in analyzing the bias/variance of $f_n$: it is not clear how the random matrix $K^{-1}$ "aggregates" the $Y$'s between data points.
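A minimal sketch of the closed-form interpolant (not from the slides), with a Laplace kernel and synthetic data as illustrative assumptions; it only checks that the resulting function fits the training labels exactly.

```python
import numpy as np

def laplace_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def min_norm_interpolant(X, Y, kernel):
    """Minimum-RKHS-norm interpolant: f(x) = K(x, X) K(X, X)^{-1} Y.
    Assumes K(X, X) is invertible (strictly positive definite kernel, distinct points)."""
    alpha = np.linalg.solve(kernel(X, X), Y)      # K(X, X)^{-1} Y
    return lambda Xtest: kernel(Xtest, X) @ alpha

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
Y = np.sign(X[:, 0]) + 0.2 * rng.standard_normal(n)   # noisy labels

f = min_norm_interpolant(X, Y, laplace_kernel)
print(np.max(np.abs(f(X) - Y)))                 # ~0: the training data are interpolated
```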

161 / 174

Min-norm interpolation in $\mathbb{R}^d$ with $d \asymp n$
(Hastie, Montanari, Rosset, Tibshirani, 2019)

Linear regression with $p, n \to \infty$, $p/n \to \gamma$, and the empirical spectral distribution of $\Sigma$ (the discrete measure on its set of eigenvalues) converging to a fixed measure.

Apply random matrix theory to explore the asymptotics of the excess prediction error as $\gamma$, the noise variance, and the eigenvalue distribution vary.

Also study the asymptotics of a model involving random nonlinear features.

162 / 174

Implicit regularization in the regime $d \asymp n$
(Liang and R., 2018)

Informal Theorem: geometric properties of the data design $X$, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

The bias is bounded in terms of the effective dimension
$$\mathrm{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\Big[\gamma + \lambda_j\!\left(\frac{XX^\top}{d}\right)\Big]^2},$$
where $\gamma > 0$ is due to the implicit regularization coming from the curvature of the kernel and high dimensionality.

Compare with regularized least squares as a low-pass filter:
$$\mathrm{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\lambda + \lambda_j\!\left(\frac{XX^\top}{d}\right)}.$$

163 / 174

Implicit regularization in the regime $d \asymp n$

[Figure: test error (y-axis) as a function of how quickly the spectrum decays. Best performance if the spectrum decays, but not too quickly.]

$$\lambda_j(\Sigma) = \left(1 - \left(\frac{j}{d}\right)^{\kappa}\right)^{1/\kappa}$$

164 / 174

Lower Bounds
(R. and Zhai, 2018)

The min-norm solution for the Laplace kernel
$$K_\sigma(x, x') = \sigma^{-d} \exp\big\{-\|x - x'\|/\sigma\big\}$$
is not consistent if $d$ is low:

Theorem. For odd dimension $d$, with probability $1 - O(n^{-1/2})$, for any choice of $\sigma$,
$$\mathbb{E}\,\|f_n - f^*\|^2 \ge \Omega_d(1).$$

Conclusion: one needs high dimensionality of the data.

165 / 174

Connection to neural networks

For wide enough randomly initialized neural networks, GD dynamics quickly converge to (approximately) the min-norm interpolating solution with respect to a certain kernel.
$$f(x) = \frac{1}{\sqrt{m}}\sum_{i=1}^m a_i\,\sigma(\langle w_i, x\rangle)$$
For square loss and ReLU $\sigma$,
$$\frac{\partial L}{\partial w_i} = \frac{1}{\sqrt{m}}\cdot\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\, a_i\,\sigma'(\langle w_i, x_j\rangle)\, x_j$$
GD in continuous time:
$$\frac{d}{dt}\, w_i(t) = -\frac{\partial L}{\partial w_i}$$
Thus
$$\frac{d}{dt}\, f(x) = \sum_{i=1}^m \left(\frac{\partial f(x)}{\partial w_i}\right)^{\!\top}\!\left(\frac{d w_i}{dt}\right) = \sum_{i=1}^m \left(\frac{1}{\sqrt{m}}\, a_i\,\sigma'(\langle w_i, x\rangle)\, x\right)^{\!\top}\!\left(-\frac{1}{\sqrt{m}}\cdot\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\, a_i\,\sigma'(\langle w_i, x_j\rangle)\, x_j\right)$$
$$= -\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\left[\frac{1}{m}\sum_{i=1}^m a_i^2\,\sigma'(\langle w_i, x\rangle)\,\sigma'(\langle w_i, x_j\rangle)\,\langle x, x_j\rangle\right] = -\frac{1}{n}\sum_{j=1}^n \big(f(x_j) - y_j\big)\,\underbrace{K_m(x, x_j)}_{\to\, K_\infty(x, x_j)}$$
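A minimal numerical sketch (not from the slides) of the kernel appearing in the bracket: it evaluates the empirical $K_m(x, x') = \frac{1}{m}\sum_i a_i^2\,\sigma'(\langle w_i, x\rangle)\sigma'(\langle w_i, x'\rangle)\langle x, x'\rangle$ at random initialization and compares it with the closed-form limit $K_\infty$ from the next slide, using standard Gaussian first-layer weights and $\pm 1$ output weights as illustrative assumptions.

```python
import numpy as np

def empirical_ntk_two_layer(x, xp, W, a, m):
    """K_m(x, x') = (1/m) sum_i a_i^2 sigma'(<w_i, x>) sigma'(<w_i, x'>) <x, x'>,
    for a two-layer ReLU net where only the first-layer weights are trained."""
    act_x = (W @ x > 0).astype(float)        # sigma'(<w_i, x>) for ReLU
    act_xp = (W @ xp > 0).astype(float)
    return (a ** 2 * act_x * act_xp).sum() / m * (x @ xp)

rng = np.random.default_rng(0)
d, m = 10, 50_000
W = rng.standard_normal((m, d))              # random first-layer weights w_i
a = rng.choice([-1.0, 1.0], size=m)          # random second-layer signs a_i

x = rng.standard_normal(d)
xp = rng.standard_normal(d)

# Limit kernel K_infinity(x, x') = E[<x, x'> 1{<w,x> >= 0, <w,x'> >= 0}]
# = <x, x'> (pi - angle(x, x')) / (2 pi) for standard Gaussian w.
angle = np.arccos(np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1, 1))
k_inf = (x @ xp) * (np.pi - angle) / (2 * np.pi)

print(empirical_ntk_two_layer(x, xp, W, a, m), k_inf)   # close for large m
```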

166 / 174

Connection to neural networks

(Xie, Liang, Song, '16), (Jacot, Gabriel, Hongler '18), (Du, Zhai, Poczos, Singh '18), etc.
$$K_\infty(x, x_j) = \mathbb{E}\big[\langle x, x_j\rangle\, \mathbb{I}\{\langle w, x\rangle \ge 0,\ \langle w, x_j\rangle \ge 0\}\big]$$
See Jason's talk.

167 / 174

Linear Regression
(B., Long, Lugosi, Tsigler, 2019)

▶ $(x, y)$ Gaussian, mean zero.

▶ Define:
$$\Sigma := \mathbb{E}\, xx^\top = \sum_i \lambda_i v_i v_i^\top \quad (\text{assume } \lambda_1 \ge \lambda_2 \ge \cdots),$$
$$\theta^* := \arg\min_\theta\, \mathbb{E}\big(y - x^\top\theta\big)^2, \qquad \sigma^2 := \mathbb{E}\big(y - x^\top\theta^*\big)^2.$$

▶ Estimator $\hat\theta = \big(X^\top X\big)^\dagger X^\top y$, which solves
$$\min_\theta\, \|\theta\|^2 \quad \text{s.t. } \|X\theta - y\|^2 = \min_\beta \|X\beta - y\|^2 = 0.$$

▶ When can the label noise be hidden in $\hat\theta$ without hurting predictive accuracy?
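As a minimal numerical sketch (not from the slides), the following simulates the minimum-norm least-squares estimator in an overparameterized Gaussian design whose covariance has one large eigenvalue and a long, flat tail; the specific spectrum, signal, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                              # overparameterized: p >> n
theta_star = np.zeros(p)
theta_star[0] = 1.0                           # signal in the top direction

# Spectrum with many small, slowly decaying tail eigenvalues (a favorable case).
lam = np.concatenate(([1.0], np.full(p - 1, 5.0 / p)))
X = rng.standard_normal((n, p)) * np.sqrt(lam)          # rows ~ N(0, diag(lam))
y = X @ theta_star + 0.5 * rng.standard_normal(n)       # noisy labels

theta_hat = np.linalg.pinv(X) @ y             # min-norm interpolator (X^T X)^+ X^T y
print(np.max(np.abs(X @ theta_hat - y)))      # ~0: training labels fit exactly

# Excess prediction error (theta_hat - theta_star)^T Sigma (theta_hat - theta_star).
diff = theta_hat - theta_star
print(diff @ (lam * diff))                    # small despite interpolating the noise
```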

168 / 174


Linear Regression

Excess prediction error:
$$L(\hat\theta) - L^* := \mathbb{E}_{(x,y)}\Big[\big(y - x^\top\hat\theta\big)^2 - \big(y - x^\top\theta^*\big)^2\Big] = \big(\hat\theta - \theta^*\big)^\top \Sigma\, \big(\hat\theta - \theta^*\big).$$
$\Sigma$ determines how estimation error in different parameter directions affects prediction accuracy.

169 / 174

Linear Regression

Theorem. For universal constants $b, c$, and any $\theta^*, \sigma^2, \Sigma$, if $\lambda_n > 0$ then
$$\frac{\sigma^2}{c}\min\left\{\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)},\ 1\right\} \;\le\; \mathbb{E}\, L(\hat\theta) - L^* \;\le\; c\left(\|\theta^*\|^2\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}} + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right)\right),$$
where $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\}$.

170 / 174

Notions of Effective Rank

Definition. Recall that $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $\Sigma$. Define the effective ranks
$$r_k(\Sigma) = \frac{\sum_{i>k}\lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\big(\sum_{i>k}\lambda_i\big)^2}{\sum_{i>k}\lambda_i^2}.$$

Example. If $\mathrm{rank}(\Sigma) = p$, we can write
$$r_0(\Sigma) = \mathrm{rank}(\Sigma)\, s(\Sigma), \qquad R_0(\Sigma) = \mathrm{rank}(\Sigma)\, S(\Sigma),$$
with
$$s(\Sigma) = \frac{\frac{1}{p}\sum_{i=1}^p \lambda_i}{\lambda_1}, \qquad S(\Sigma) = \frac{\big(\frac{1}{p}\sum_{i=1}^p \lambda_i\big)^2}{\frac{1}{p}\sum_{i=1}^p \lambda_i^2}.$$
Both $s$ and $S$ lie between $1/p$ ($\lambda_2 \approx 0$) and $1$ ($\lambda_i$ all equal).
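A minimal sketch (not from the slides) translating these definitions into code and locating $k^*$ from the theorem; the example spectrum and the choice $b = 1$ are illustrative assumptions.

```python
import numpy as np

def effective_ranks(lam, k):
    """r_k and R_k for a sorted eigenvalue vector lam (descending order)."""
    tail = lam[k:]                          # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

def k_star(lam, n, b=1.0):
    """Smallest k with r_k(Sigma) >= b * n (b is the universal constant of the theorem;
    b = 1 here is just an illustrative choice)."""
    for k in range(len(lam)):
        if effective_ranks(lam, k)[0] >= b * n:
            return k
    return None

# Example spectrum: one dominant direction plus a long flat tail.
p, n = 2000, 100
lam = np.concatenate(([1.0], np.full(p - 1, 5.0 / p)))
k = k_star(lam, n)
print(k, effective_ranks(lam, k))
```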

171 / 174


Benign Overfitting in Linear Regression

Intuition

▶ To avoid harming prediction accuracy, the noise energy must be distributed across many unimportant directions:
  ▶ There should be many non-zero eigenvalues.
  ▶ They should have a small sum compared to $n$.
  ▶ There should be many eigenvalues no larger than $\lambda_{k^*}$ (the number depends on how asymmetric they are).

172 / 174

Overfitting?

▶ Fitting the data "too" well versus

▶ Bias too low, variance too high.

Key takeaway: we should not conflate these two.

173 / 174

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

174 / 174