Generalization in Deep Learning
Peter Bartlett (UC Berkeley) and Alexander Rakhlin (MIT)
May 28, 2019
1 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)
log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)
log(2 + 2) = log(2) + log(2)
log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)

(Each identity above is true, because in every example the sum of the arguments happens to equal their product. Would a learner shown only these examples "generalize" to the false rule log(a + b + c) = log(a) + log(b) + log(c)?)
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Statistical Learning Theory
Aim: Predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X.
Use a data set of n pairs
S = (x_1, y_1), . . . , (x_n, y_n)
to choose a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.
Formulations of Prediction Problems
To define the notion of a 'good prediction,' we can define a loss function
ℓ : Y × Y → R.
ℓ(ŷ, y) is the cost of predicting ŷ when the outcome is y.
Aim: ℓ(f(x), y) small.
Formulations of Prediction Problems
ExampleIn pattern classification problems, the aim is to classify a pattern x intoone of a finite number of classes (that is, the label space Y is finite). Ifall mistakes are equally bad, we could define
`(y , y) = Iy 6= y =
1 if y 6= y ,
0 otherwise.
ExampleIn a regression problem, with Y = R, we might choose the quadratic lossfunction, `(y , y) = (y − y)2.
7 / 174
Probabilistic Assumptions
Assume:
- There is a probability distribution P on X × Y,
- The pairs (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are chosen independently according to P.
The aim is to choose f with small risk:
L(f) = E ℓ(f(X), Y).
For instance, in the pattern classification example, this is the misclassification probability:
L(f) = L01(f) = E I{f(X) ≠ Y} = Pr(f(X) ≠ Y).
Why not estimate the underlying distribution P (or the conditional P(Y | X)) first?
This is in general a harder problem than prediction.
Key difficulty: our goals are stated in terms of unknown quantities related to the unknown P. We have to use empirical data instead; this is the purview of statistics.
For instance, we can calculate the empirical loss of f : X → Y:
L̂(f) = (1/n) ∑_{i=1}^n ℓ(Y_i, f(X_i)).
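The empirical loss above is straightforward to compute. A minimal sketch (the toy sample and all names are ours, not the slides'):

```python
# Computing the empirical risk  L_hat(f) = (1/n) sum_i loss(Y_i, f(X_i))
# for a fixed predictor f on a sample (X, Y).
import numpy as np

def empirical_risk(f, X, Y, loss):
    """Average loss of predictor f over the sample (X, Y)."""
    return np.mean([loss(y, f(x)) for x, y in zip(X, Y)])

zero_one = lambda y, yhat: float(y != yhat)   # indicator (classification) loss
squared = lambda y, yhat: (y - yhat) ** 2     # quadratic (regression) loss

# a tiny toy sample and a fixed threshold rule
X = np.array([0.1, 0.4, 0.6, 0.9])
Y = np.array([0, 0, 1, 1])
f = lambda x: int(x >= 0.5)
risk = empirical_risk(f, X, Y, zero_one)      # 0.0: f fits this sample
```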
The function x ↦ f_n(x) = f_n(x; X_1, Y_1, . . . , X_n, Y_n) is random, since it depends on the random data S = (X_1, Y_1, . . . , X_n, Y_n). Thus, the risk
L(f_n) = E[ℓ(f_n(X), Y) | S] = E[ℓ(f_n(X; X_1, Y_1, . . . , X_n, Y_n), Y) | S]
is a random variable. We might aim for E L(f_n) small, or L(f_n) small with high probability (over the training data).
Quiz: what is random here?
1. L(f) for a given fixed f
2. f_n
3. L(f_n)
4. L̂(f_n)
5. L̂(f) for a given fixed f
Theoretical analysis of performance is typically easier if f_n has a closed form (in terms of the training data).
E.g. ordinary least squares: f_n(x) = xᵀ(XᵀX)⁻¹XᵀY.
Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as
- solutions to an optimization objective (e.g. logistic regression), or
- an iterative procedure without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)
The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function
f* = argmin_f L(f),
where the minimization is over all (measurable) prediction rules f : X → Y.
The value of the lowest expected loss is called the Bayes error:
L(f*) = inf_f L(f).
Of course, we cannot calculate any of these quantities, since P is unknown.
Bayes Optimal Function

The Bayes optimal function f* takes the following forms in these two particular cases:
- Binary classification (Y = {0, 1}) with the indicator loss:
  f*(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]
- Regression (Y = R) with squared loss:
  f*(x) = η(x), where η(x) = E[Y | X = x]
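When η is known, the Bayes rule for the indicator loss can be checked by simulation. A toy sketch (the particular η and all names are ours): the threshold rule I{η(x) ≥ 1/2} attains risk E[min(η(X), 1 − η(X))], which here is 0.5·0.2 + 0.5·0.1 = 0.15.

```python
# Simulating a distribution with known eta(x) = P(Y = 1 | X = x) and
# measuring the error of the Bayes rule f*(x) = I{eta(x) >= 1/2}.
import numpy as np

eta = lambda x: 0.8 if x >= 0.5 else 0.1   # a toy regression function
bayes = lambda x: int(eta(x) >= 0.5)       # threshold at 1/2

rng = np.random.default_rng(0)
X = rng.uniform(size=100_000)              # X ~ Uniform[0, 1]
Y = (rng.uniform(size=X.size) < np.vectorize(eta)(X)).astype(int)

err = np.mean(np.vectorize(bayes)(X) != Y)
# err should be close to the Bayes risk 0.15 computed above
```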
The big question: is there a way to construct a learning algorithm with a guarantee that
L(f_n) − L(f*)
is small for large enough sample size n?
Consistency

An algorithm that ensures
lim_{n→∞} L(f_n) = L(f*) almost surely
is called universally consistent. Consistency ensures that our algorithm approaches the best possible prediction performance as the sample size increases.
The good news: consistency is possible to achieve.
- easy if X is a finite or countable set
- not too hard if X is infinite, and the underlying relationship between x and y is "continuous"
The bad news...

In general, we cannot prove anything quantitative about L(f_n) − L(f*), unless we make further assumptions (incorporate prior knowledge).
"No Free Lunch" Theorems: unless we posit assumptions,
- For any algorithm f_n, any n and any ε > 0, there exists a distribution P such that L(f*) = 0 and
  E L(f_n) ≥ 1/2 − ε
- For any algorithm f_n, and any sequence a_n that converges to 0, there exists a probability distribution P such that L(f*) = 0 and for all n
  E L(f_n) ≥ a_n
Is this really "bad news"?
Not really. We always have some domain knowledge.
Two ways of incorporating prior knowledge:
- Direct way: assumptions on the distribution P (e.g. margin, smoothness of regression function, etc.)
- Indirect way: redefine the goal to perform as well as a reference set F of predictors:
  L(f_n) − inf_{f∈F} L(f)
F encapsulates our inductive bias.
We often make both of these assumptions.
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Perceptron
(x_1, y_1), . . . , (x_T, y_T) ∈ X × {±1} (T may or may not be the same as n)
Maintain a hypothesis w_t ∈ R^d (initialize w_1 = 0).
On round t:
- Consider (x_t, y_t)
- Form prediction ŷ_t = sign(⟨w_t, x_t⟩)
- If ŷ_t ≠ y_t, update
  w_{t+1} = w_t + y_t x_t
  else
  w_{t+1} = w_t
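The update rule above translates directly into code. A minimal sketch (our own, assuming numpy arrays and labels in {−1, +1}); the loop cycles through the sample until a full error-free pass, anticipating the ERM variant discussed a few slides later:

```python
# Perceptron: on a mistake (y <w, x> <= 0), update w <- w + y x.
import numpy as np

def perceptron(X, Y, max_epochs=1000):
    """X: (n, d) features, Y: (n,) labels in {-1, +1}. Returns (w, mistakes)."""
    n, d = X.shape
    w = np.zeros(d)
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:   # mistake (sign(0) counts as an error)
                w = w + y * x           # the update w_{t+1} = w_t + y_t x_t
                mistakes += 1
                clean_pass = False
        if clean_pass:                  # a full pass with no mistakes: done
            break
    return w, mistakes

# toy separable sample: labels follow the sign of the first coordinate
X = np.array([[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.3]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w, mistakes = perceptron(X, Y)
```

On linearly separable data the mistake counter never exceeds D²/γ², matching the bound proved on the following slides.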
Perceptron

[Figure: a misclassified example x_t and the hyperplane normal w_t before the update]
For simplicity, suppose all data are in a unit ball, ‖x_t‖ ≤ 1.
Margin with respect to (x_1, y_1), . . . , (x_T, y_T):
γ = max_{‖w‖=1} min_{i∈[T]} (y_i ⟨w, x_i⟩)_+ ,
where (a)_+ = max{0, a}.

Theorem (Novikoff '62). Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.
Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t, then y_t ⟨w_t, x_t⟩ ≤ 0, so
‖w_{t+1}‖² = ‖w_t + y_t x_t‖² ≤ ‖w_t‖² + 2 y_t ⟨w_t, x_t⟩ + 1 ≤ ‖w_t‖² + 1.
Hence, ‖w_T‖² ≤ m.
For the optimal hyperplane w*,
γ ≤ ⟨w*, y_t x_t⟩ = ⟨w*, w_{t+1} − w_t⟩.
Hence (adding and canceling),
mγ ≤ ⟨w*, w_T⟩ ≤ ‖w_T‖ ≤ √m,
so m ≤ 1/γ².
Recap

For any T and (x_1, y_1), . . . , (x_T, y_T),
∑_{t=1}^T I{y_t ⟨w_t, x_t⟩ ≤ 0} ≤ D²/γ²,
where γ = γ(x_{1:T}, y_{1:T}) is the margin and D = D(x_{1:T}, y_{1:T}) = max_t ‖x_t‖.
Let w* denote the max-margin hyperplane, ‖w*‖ = 1.
Consequence for i.i.d. data (I)

Do one pass on an i.i.d. sequence (X_1, Y_1), . . . , (X_n, Y_n) (i.e. T = n).
Claim: the expected indicator loss of a randomly picked function x ↦ ⟨w_τ, x⟩ (τ ∼ unif{1, . . . , n}) is at most
(1/n) × E[D²/γ²].
Proof: Take expectations on both sides:
E[(1/n) ∑_{t=1}^n I{Y_t ⟨w_t, X_t⟩ ≤ 0}] ≤ E[D²/(nγ²)].
The left-hand side can be written as (recall the notation S = (X_i, Y_i)_{i=1}^n)
E_τ E_S I{Y_τ ⟨w_τ, X_τ⟩ ≤ 0}.
Since w_τ is a function of X_{1:τ−1}, Y_{1:τ−1}, the above is
E_S E_τ E_{X,Y} I{Y ⟨w_τ, X⟩ ≤ 0} = E_S E_τ L01(w_τ).
The claim follows.
NB: To ensure E[D²/γ²] is not infinite, we assume a margin γ in P.
Consequence for i.i.d. data (II)

Now, rather than doing one pass, cycle through the data
(X_1, Y_1), . . . , (X_n, Y_n)
repeatedly until no more mistakes (i.e. T ≤ n × (D²/γ²)).
Then the final hyperplane of Perceptron separates the data perfectly, i.e. finds the minimum L̂01(w_T) = 0 for
L̂01(w) = (1/n) ∑_{i=1}^n I{Y_i ⟨w, X_i⟩ ≤ 0}.
Empirical Risk Minimization with finite-time convergence. Can we say anything about the future performance of w_T?
Consequence for i.i.d. data (II)

Claim:
E L01(w_T) ≤ (1/(n+1)) × E[D²/γ²]
Proof: Shortcuts: z = (x, y) and ℓ(w, z) = I{y ⟨w, x⟩ ≤ 0}. Then
E_S E_Z ℓ(w_T, Z) = E_{S, Z_{n+1}} [ (1/(n+1)) ∑_{t=1}^{n+1} ℓ(w^{(−t)}, Z_t) ],
where w^{(−t)} is Perceptron's final hyperplane after cycling through the data Z_1, . . . , Z_{t−1}, Z_{t+1}, . . . , Z_{n+1}.
That is, leave-one-out is an unbiased estimate of the expected loss.
Now consider cycling Perceptron on Z_1, . . . , Z_{n+1} until no more errors. Let i_1, . . . , i_m be the indices on which Perceptron errs in any of the cycles. We know m ≤ D²/γ². However, if index t ∉ {i_1, . . . , i_m}, then whether or not Z_t was included does not matter, and Z_t is correctly classified by w^{(−t)}. The claim follows.
A few remarks

- Bayes error L01(f*) = 0, since we assume a margin in the distribution P (and, hence, in the data).
- Our assumption on P is not about its parametric or nonparametric form, but rather about what happens at the boundary.
- Curiously, we beat the CLT rate of 1/√n.
Overparametrization and overfitting

The last lecture will discuss overparametrization and overfitting.
- The dimension d never appears in the mistake bound of Perceptron. Hence, we can even take d = ∞.
- Perceptron is a 1-layer neural network (no hidden layer) with d neurons.
- For d > n, any n points in general position can be separated (zero empirical error).
Conclusions:
- More parameters than data does not necessarily contradict good prediction performance ("generalization").
- Perfectly fitting the data does not necessarily contradict good generalization (particularly with no noise).
- Complexity is subtle (e.g. number of parameters vs. margin).
- ... however, margin is a strong assumption (more than just no noise).
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Relaxing the assumptions

We proved
E L01(w_T) − L01(f*) ≤ O(1/n)
under the assumption that P is linearly separable with margin γ.
The assumption on the probability distribution implies that the Bayes classifier is a linear separator (the "realizable" setup).
We may relax the margin assumption, yet still hope to do well with linear separators. That is, we would aim to minimize
E L01(w_T) − L01(w*),
hoping that
L01(w*) − L01(f*)
is small, where w* is the best hyperplane classifier with respect to L01.
More generally, we will work with some class of functions F that we hope captures well the relationship between X and Y. The choice of F gives rise to a bias-variance decomposition.
Let f_F ∈ argmin_{f∈F} L(f) be the function in F with the smallest expected loss.
Bias-Variance Tradeoff

L(f_n) − L(f*) = [L(f_n) − inf_{f∈F} L(f)]  (Estimation Error)  +  [inf_{f∈F} L(f) − L(f*)]  (Approximation Error)

[Figure: the class F containing f_F, with f_n nearby and f* outside F]

Clearly, the two terms are at odds with each other:
- Making F larger means smaller approximation error but (as we will see) larger estimation error.
- Taking a larger sample n means smaller estimation error and has no effect on the approximation error.
- Thus, it makes sense to trade off the size of F and n (Structural Risk Minimization, or Method of Sieves, or Model Selection).
Bias-Variance Tradeoff

In simple problems (e.g. linearly separable data) we might get away with having no bias-variance decomposition (e.g. as was done in the Perceptron case).
However, for more complex problems, our assumptions on P often imply that the class containing f* is too large and the estimation error cannot be controlled. In such cases, we need to produce a "biased" solution in hopes of reducing variance.
Finally, the bias-variance tradeoff need not be in the form we just presented. We will consider a different decomposition in the last lecture.
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
Consider the performance of empirical risk minimization:
Choose f_n ∈ F to minimize L̂(f), where L̂ is the empirical risk,
L̂(f) = P_n ℓ(f(X), Y) = (1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i).
For pattern classification, this is the proportion of training examples misclassified.
How does the excess risk L(f_n) − L(f_F) behave? We can write
L(f_n) − L(f_F) = [L(f_n) − L̂(f_n)] + [L̂(f_n) − L̂(f_F)] + [L̂(f_F) − L(f_F)].
Therefore,
E L(f_n) − L(f_F) ≤ E[L(f_n) − L̂(f_n)],
because the second term is nonpositive and the third is zero in expectation.
The term L(f_n) − L̂(f_n) is not necessarily zero in expectation (check!). An easy upper bound is
L(f_n) − L̂(f_n) ≤ sup_{f∈F} |L̂(f) − L(f)|,
and this motivates the study of uniform laws of large numbers.
Roadmap: study the sup to
- understand when learning is possible,
- understand implications for sample complexity,
- design new algorithms (a regularizer arises from an upper bound on the sup).
Loss Class

In what follows, we will work with a function class G on Z; for the learning application, G = ℓ ∘ F:
G = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}
and Z = X × Y.
Empirical Process Viewpoint

[Figure (animation over several slides): the space of all functions, with each g placed according to its expectation Eg versus its empirical average (1/n) ∑_{i=1}^n g(z_i); the class G, the empirical minimizer ĝ_n, the best-in-class g_G, and a reference function g_0 are marked]
Uniform Laws of Large Numbers

For a class G of functions g : Z → [0, 1], suppose that Z_1, . . . , Z_n, Z are i.i.d. on Z, and consider
U = sup_{g∈G} |E g(Z) − (1/n) ∑_{i=1}^n g(Z_i)| = sup_{g∈G} |Pg − P_n g| =: ‖P − P_n‖_G,
where P − P_n is the empirical process.
If U converges to 0, this is called a uniform law of large numbers.
Glivenko-Cantelli Classes

Definition. G is a Glivenko-Cantelli class for P if ‖P_n − P‖_G → 0 in probability.

Here:
- P is a distribution on Z,
- Z_1, . . . , Z_n are drawn i.i.d. from P,
- P_n is the empirical distribution (mass 1/n at each of Z_1, . . . , Z_n),
- G is a set of measurable real-valued functions on Z with finite expectation under P,
- P_n − P is an empirical process, that is, a stochastic process indexed by a class of functions G, and
- ‖P_n − P‖_G := sup_{g∈G} |P_n g − Pg|.
Glivenko-Cantelli Classes

Why 'Glivenko-Cantelli'? An example of a uniform law of large numbers.

Theorem (Glivenko-Cantelli). ‖F_n − F‖_∞ → 0 almost surely.

Here, F is a cumulative distribution function, F_n is the empirical cumulative distribution function,
F_n(t) = (1/n) ∑_{i=1}^n I{Z_i ≤ t},
where Z_1, . . . , Z_n are i.i.d. with distribution F, and ‖F − G‖_∞ = sup_t |F(t) − G(t)|.

Theorem (Glivenko-Cantelli, restated). ‖P_n − P‖_G → 0 almost surely, for G = {z ↦ I{z ≤ θ} : θ ∈ R}.
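The theorem is easy to visualize numerically. A simulation sketch (our code, assuming P = Uniform[0, 1], so F(t) = t): the Kolmogorov statistic sup_t |F_n(t) − F(t)| shrinks as n grows.

```python
# Glivenko-Cantelli in action: sup-norm distance between the empirical
# CDF and the true CDF of Uniform[0, 1], evaluated at the jump points.
import numpy as np

def ks_distance(Z):
    """sup_t |F_n(t) - F(t)| for F = Uniform[0,1] CDF."""
    Z = np.sort(Z)
    n = Z.size
    upper = np.arange(1, n + 1) / n - Z   # deviation just after each jump
    lower = Z - np.arange(0, n) / n       # deviation just before each jump
    return max(upper.max(), lower.max())

rng = np.random.default_rng(1)
small_n = ks_distance(rng.uniform(size=50))
large_n = ks_distance(rng.uniform(size=50_000))
# typically small_n ~ 0.1 while large_n ~ 0.005: uniform convergence
```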
Glivenko-Cantelli Classes

Not all G are Glivenko-Cantelli classes. For instance,
G = {I{z ∈ S} : S ⊂ R, |S| < ∞}.
Then for a continuous distribution P, Pg = 0 for any g ∈ G, but sup_{g∈G} P_n g = 1 for all n. So although P_n g → Pg almost surely for every g ∈ G, this convergence is not uniform over G: G is too large.
In general, how do we decide whether G is "too large"?
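The failure can be exhibited directly: take S to be the observed sample itself. A toy demonstration (our code, not from the slides):

```python
# For continuous P, the indicator of the sample S = {Z_1, ..., Z_n} lies
# in G, has Pg = P(Z in S) = 0, yet its empirical mean P_n g = 1 for every
# n -- so sup_{g in G} |P_n g - Pg| = 1 never shrinks.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
Z = rng.uniform(size=n)            # continuous P on [0, 1]
S = set(Z)                         # a finite set, chosen after seeing data
g = lambda z: float(z in S)        # g = I{z in S}, an element of G

Pn_g = np.mean([g(z) for z in Z])  # empirical mean: exactly 1
# while Pg = 0, since a fresh draw from P avoids the finite set S a.s.
```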
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
Rademacher Averages

Given a set H ⊆ [−1, 1]^n, measure its "size" by the average length of its projection onto a random direction.
Rademacher process indexed by vectors h ∈ H:
R_n(h) = (1/n) ⟨ε, h⟩,
where ε = (ε_1, . . . , ε_n) is a vector of i.i.d. Rademacher random variables (symmetric ±1).
Rademacher averages: the expected maximum of the process,
E_ε ‖R_n‖_H = E sup_{h∈H} |(1/n) ⟨ε, h⟩|.
Examples

If H = {h_0}, then E‖R_n‖_H = O(n^{−1/2}).
If H = {−1, 1}^n, then E‖R_n‖_H = 1.
If v + c{−1, 1}^n ⊆ H, then E‖R_n‖_H ≥ c − o(1).
How about H = {−1, 1}, where 1 is the vector of all 1's?
Examples

If H is finite,
E‖R_n‖_H ≤ c √(log |H| / n)
for some constant c.
This bound can be loose, as it does not take into account the "overlaps"/correlations between vectors.
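The finite-class bound can be checked by Monte Carlo. A sketch (our code; the constant √(2 log(2|H|)) comes from the standard sub-Gaussian maximal inequality, not from the slides):

```python
# Monte Carlo estimate of E sup_{h in H} |<eps, h>/n| for a finite class
# H of sign vectors, compared against sqrt(2 log(2|H|) / n).
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 32
H = rng.choice([-1.0, 1.0], size=(M, n))     # a finite class of M vectors

def rademacher_average(H, trials=2000):
    n = H.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher draws
    return np.mean(np.abs(eps @ H.T / n).max(axis=1))

est = rademacher_average(H)
bound = np.sqrt(2 * np.log(2 * M) / n)       # sub-Gaussian maximal bound
# est lands below bound; both scale like sqrt(log M / n)
```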
Examples

Let B^n_p be the unit ball in R^n:
B^n_p = {x ∈ R^n : ‖x‖_p ≤ 1},  ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p}.
For H = B^n_2,
(1/n) E max_{‖g‖_2 ≤ 1} |⟨ε, g⟩| = (1/n) E ‖ε‖_2 = 1/√n.
On the other hand, the Rademacher averages for B^n_1 are 1/n.
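Both ball computations can be verified numerically (our toy code): for a Rademacher vector ε, ‖ε‖₂ = √n deterministically, and the supremum of ⟨ε, g⟩ over the ℓ₁ ball is attained at a coordinate vector, giving ‖ε‖_∞ = 1.

```python
# Verifying the two ball examples: sup over B^n_2 of <eps, g>/n equals
# ||eps||_2 / n = 1/sqrt(n); sup over B^n_1 equals ||eps||_inf / n = 1/n.
import numpy as np

n = 400
rng = np.random.default_rng(0)
eps = rng.choice([-1.0, 1.0], size=n)

b2 = np.linalg.norm(eps, 2) / n    # Cauchy-Schwarz is tight at g = eps/||eps||
b1 = np.max(np.abs(eps)) / n       # l1-ball sup sits at a coordinate vector
```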
Rademacher Averages

What do these Rademacher averages have to do with our problem of bounding uniform deviations?
Suppose we take
H = G(Z_1^n) = {(g(Z_1), . . . , g(Z_n)) : g ∈ G} ⊂ R^n.

[Figure: coordinate projections (g(Z_1), . . . , g(Z_n)) of two functions g_1, g_2]
z1<latexit sha1_base64="V7g08aoOjvxavyziU7UXAQe9o4U=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX/tWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8CuwONdQ==</latexit><latexit 
sha1_base64="xAosoZfITzQxTFF+VKA4fyGVA3w=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ2yv/6uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8C0OCOcA==</latexit><latexit 
sha1_base64="xAosoZfITzQxTFF+VKA4fyGVA3w=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ2yv/6uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8C0OCOcA==</latexit><latexit 
sha1_base64="yCmzR+UYSiB70+ceKvNWi19m/ko=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d+VfPdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FgCqL0Q==</latexit>
z2<latexit sha1_base64="TZvAKWPUDTKESOwa2HzSiORibKQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY3pLYDwOCNt0WoEuyOmm62kZAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFKSMKt3p/HPv/oP33v/gw61W+6OPH37y6aPHT14rkUlMLrBgQr4JkCKMJuRCU83Im1QSxANGLoPpc2O/vCZSUZGc63lKxhzFCY0oRhpUg3dX3atH253dTqfj+75rBP9gvwNCv9/r+j3XNyZ4tg8f/p25juOcXT3e+msUCpxxkmjMkFJDv5PqcY6kppiRoj3KFEkRnqKYDEFMECdqnJe5Fu4OaEI3EhJ+iXZL7bpHjrhScx4AkiM9UbbNKJtsw0xHvXFOkzTTJMHVRlHGXC1cU7gbUkmwZnMQEJYUcnXxBEmENdBT22UCcClJVK8Ec01TDzE2zm/mNVOu6fRdXQP9YZZGg0CSGPpUGYwTo4FEcg7USTFTnpqglKiivWPxq4ikkVeWXEqpFBFRpqeIGRa9CUpCkWmvakQA/ScSwrir/ZX7NBIs/HIj9n9HXQu4zse5P84NzJBds0CnEyyJrk9CfsNTnjFN61qSMSKvuRUBsVhAiybcu5UorrcjEZIjRvg4hwi8HjSN673Lg4D/7+w15aYHtYJQggkr3JrSECdVZEFnEqURjS3sD8kARlrUR2IYSaCVEz0R4TchiRCwM855WKrDomnwvcXh8BjS5AYkowBy6uAYcphQfFPfTQJPoXdtLopwvGK4S+7k3r2TfBQoweCceQLuFobm4xzS0akAMiAYAeXcpJuPyuOdBywjRbG0hFSl4GMZwdF9fvry9JV79OLb45Pj8+PTk0F7BLxAwhXyCMnpd5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/pPFE/0QYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAz0yZS3B/r0qhXPuNeMvBL6HVut/o8Apm5rbMipfFasOfMWjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I774uSoumAGbfjkWX7XuHcLr7u7Psg/+tuHPad6tpzPnS+cp47vHDiHzvfOmXPhYCd2fnP+cP5svW393Pql9WsFvX9v4fOZU3tav/8LxHWNdg==</latexit><latexit 
sha1_base64="81/ItuAdF3EDwYZiH30cbjCQeeo=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEALLWxL7YUDQpt0CdElWJ01X2wgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoSClFGlO51/7t1/59333v9gq9X+8KMHH3/y8NGnr5TIJCYXWDAhXwdIEUYTcqGpZuR1KgniASOXwfSpsV9eE6moSM71PCVjjuKERhQjDarB26vu1cPtzm6n0/F93zWCf7DfAaHf73X9nusbEzzbhw/+znaet/88u3q09dcoFDjjJNGYIaWGfifV4xxJTTEjRXuUKZIiPEUxGYKYIE7UOC9zLdwd0IRuJCT8Eu2W2nWPHHGl5jwAJEd6omybUTbZhpmOeuOcJmmmSYKrjaKMuVq4pnA3pJJgzeYgICwp5OriCZIIa6CntssE4FKSqF4J5pqmHmJsnN/Ma6Zc0+nbugb6wyyNBoEkMfSpMhgnRgOJ5Byok2KmPDVBKVFFe8fiVxFJI68suZRSKSKiTE8RMyx6E5SEItNe1YgA+k8khHFX+yv3cSRY+OVG7P+OuhZwnY9zf5wbmCG7ZoFOJ1gSXZ+E/IanPGOa1rUkY0RecysCYrGAFk24dytRXG9HIiRHjPBxDhF4PWga13uXBwH/39lryk0PagWhBBNWuDWlIU6qyILOJEojGlvY75MBjLSoj8QwkkArJ3oiwm9CEiFgZ5zzsFSHRdPge4vD4TGkyQ1IRgHk1MEx5DCh+Ka+mwSeQu/aXBTheMVwl9zJvXsn+ShQgsE58wTcLQzNxzmko1MBZEAwAsq5STcflcc7D1hGimJpCalKwccygqP79PTF6Uv36Nnz45Pj8+PTk0F7BLxAwhXyCMnpt5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/oPFE/0gYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAT0yZS3B/r0qhXPuNeMvBL6HVut/o8BJm5rbMipfFasOfMGjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I777OSoumAGbfjkWX7XuHcLr7q7Psg/+NuHPad6tpzPnS+cx47vHDiHznfOmXPhYCd2fnV+d/5ovWn91Pq59UsFvX9v4fOZU3tav/0L2lKOcQ==</latexit><latexit 
sha1_base64="81/ItuAdF3EDwYZiH30cbjCQeeo=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEALLWxL7YUDQpt0CdElWJ01X2wgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoSClFGlO51/7t1/59333v9gq9X+8KMHH3/y8NGnr5TIJCYXWDAhXwdIEUYTcqGpZuR1KgniASOXwfSpsV9eE6moSM71PCVjjuKERhQjDarB26vu1cPtzm6n0/F93zWCf7DfAaHf73X9nusbEzzbhw/+znaet/88u3q09dcoFDjjJNGYIaWGfifV4xxJTTEjRXuUKZIiPEUxGYKYIE7UOC9zLdwd0IRuJCT8Eu2W2nWPHHGl5jwAJEd6omybUTbZhpmOeuOcJmmmSYKrjaKMuVq4pnA3pJJgzeYgICwp5OriCZIIa6CntssE4FKSqF4J5pqmHmJsnN/Ma6Zc0+nbugb6wyyNBoEkMfSpMhgnRgOJ5Byok2KmPDVBKVFFe8fiVxFJI68suZRSKSKiTE8RMyx6E5SEItNe1YgA+k8khHFX+yv3cSRY+OVG7P+OuhZwnY9zf5wbmCG7ZoFOJ1gSXZ+E/IanPGOa1rUkY0RecysCYrGAFk24dytRXG9HIiRHjPBxDhF4PWga13uXBwH/39lryk0PagWhBBNWuDWlIU6qyILOJEojGlvY75MBjLSoj8QwkkArJ3oiwm9CEiFgZ5zzsFSHRdPge4vD4TGkyQ1IRgHk1MEx5DCh+Ka+mwSeQu/aXBTheMVwl9zJvXsn+ShQgsE58wTcLQzNxzmko1MBZEAwAsq5STcflcc7D1hGimJpCalKwccygqP79PTF6Uv36Nnz45Pj8+PTk0F7BLxAwhXyCMnpt5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/oPFE/0gYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAT0yZS3B/r0qhXPuNeMvBL6HVut/o8BJm5rbMipfFasOfMGjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I777OSoumAGbfjkWX7XuHcLr7q7Psg/+NuHPad6tpzPnS+cx47vHDiHznfOmXPhYCd2fnV+d/5ovWn91Pq59UsFvX9v4fOZU3tav/0L2lKOcQ==</latexit><latexit 
sha1_base64="g8QBduw5zRf6C5jqumitT8HWfDk=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEALLWxL7YUDRptsCdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQXX27qp/9XC7t9vr9Xzfd43gH+z3QBgOB31/4PrGBM+2s3hOrx5t/T2OBM45STVmSKmR38t0UCCpKWak7I5zRTKEpyghIxBTxIkKiirX0t0BTeTGQsIv1W6lXfcoEFdqzkNAcqQnyrYZZZttlOt4EBQ0zXJNUlxvFOfM1cI1hbsRlQRrNgcBYUkhVxdPkERYAz2NXSYAl5LEzUow1zTzEGNBcTNvmApNp++aGugPszQaBJIm0KfaYJwYDSWSc6BOipny1ARlRJXdHYtfRSSNvarkSsqkiIkyPUXMsOhNUBqJXHt1I0LoP5EQxl3tr9wnsWDRVxux/zvqWsB1Ps79oDAwQ3bDAp1OsSS6OQnFDc94zjRtaknOiLzmVgTEEgEtmnDvVqK42Y5USI4Y4UEBEXgzaJY0e1eEIf/f2WvKTQ8aBaEUE1a6DaUhTqrYgs4kymKaWNgf0zMYadEciVEsgVZO9ERE30YkRsBOUPCoUkdl2+B7i8PhMaTJDUhGAeQ0wQnkMKH4prmbBJ4i79pcFFGwYrhP7uTevZN8FCrB4Jx5Au4WhuZBAenoTAAZEIyAcm7SLcbV8S5ClpOyXFoiqjLwsYzg6D4/eXnyyj188d3R8dH50cnxWXcMvEDCNfIQyen3kpC0LGQSlkVv19/z4KraN4u/Zzhfh7+kyUT/TBgTs1uHgYFWy7AdD/Hnt+gDA6wXG7zIZYWtMqmX1sDPTJlL8HCvTqFah614y8GvoPW63+rwCmbmtsyal8Vqw58xaOMSCxB4bcgZySgqC8znU4OBiF97Pvwd9EorlqR4Wm29xLq7gyEsw29g6Q8s+OWE6mVVUIx5LcRpLjNG1hoGrfUMKCUzLDiHO6kYz6owIzjDYzN4tWetLLb90kJXE2aBK10LVkIpFtSoWpBCojTZiLvQtuCTanIt+NpIt+Vt+mR5LJq3iWZmCFrSXw3Hpo+q2mw5LHrfko/pdcsOqxlorXremlF9eDYdImCkzWl14DZ9snpmLI/lJFX4HffF8WF9wZx14ZNn+V3j3i287u/6IP/kbz8dLD5+tpwvnC+dJ47vHDhPnR+cU+fCwU7i/O786fzVedv5pfNr57caev/ewudzp/F0/vgXiZyL0g==</latexit>
zn<latexit sha1_base64="4ZFc4Osjg/3AMiZMY0aOID5dIWQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX6dWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8C+0uNsg==</latexit><latexit 
sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit 
sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit 
sha1_base64="kRKiSD+HgKged5aOJIkSLIO3sLs=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d5VePdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FwHKMDg==</latexit>
53 / 174
Rademacher Averages
Then, conditionally on Z_1^n,

  Eε‖R̂n‖G = Eε sup_{g∈G} | (1/n) ∑_{i=1}^n εi g(Zi) |

is a data-dependent quantity (hence the "hat") that measures the complexity of G projected onto the data.

Also define the Rademacher process Rn (without the "hat"),

  Rn(g) = (1/n) ∑_{i=1}^n εi g(Zi),

and the Rademacher complexity of G as

  E‖Rn‖G = E_{Z_1^n} Eε‖R̂n‖_{G(Z_1^n)}.
54 / 174
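The hatted quantity Eε‖R̂n‖G can be estimated by Monte Carlo for a small class. A minimal sketch (the threshold class G and the data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite class G: threshold indicators z -> 1{z > t}.
thresholds = np.linspace(-1.0, 1.0, 21)

def empirical_rademacher(z, n_mc=2000):
    """Monte Carlo estimate of E_eps sup_{g in G} |(1/n) sum_i eps_i g(z_i)|."""
    n = len(z)
    vals = (z[:, None] > thresholds[None, :]).astype(float)  # shape (n, |G|)
    sups = np.empty(n_mc)
    for m in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        sups[m] = np.abs(eps @ vals / n).max()  # sup over the class
    return float(sups.mean())

z = rng.uniform(-1.0, 1.0, size=200)
print(empirical_rademacher(z))  # shrinks roughly like 1/sqrt(n) as n grows
```

Because the signs εi average to zero, the supremum stays small unless some g in the class can correlate with an arbitrary sign pattern on the data.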
Uniform Laws and Rademacher Complexity
We'll look at a proof of a uniform law of large numbers that involves two steps:

1. Symmetrization, which bounds E‖P − Pn‖G in terms of the Rademacher complexity of G, E‖Rn‖G.

2. Concentration: both ‖P − Pn‖G and ‖Rn‖G are concentrated about their expectations.
55 / 174
Uniform Laws and Rademacher Complexity
Theorem.

For any G,   E‖P − Pn‖G ≤ 2E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

  (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

  E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 almost surely.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.
56 / 174
Uniform Laws and Rademacher Complexity
The first step is to symmetrize by replacing Pg by P'n g = (1/n) ∑_{i=1}^n g(Z'i). In particular, we have

  E‖P − Pn‖G = E sup_{g∈G} | E[ (1/n) ∑_{i=1}^n (g(Z'i) − g(Zi)) | Z_1^n ] |

  ≤ E E[ sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z'i) − g(Zi)) | | Z_1^n ]

  = E sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z'i) − g(Zi)) | = E‖P'n − Pn‖G.
57 / 174
Uniform Laws and Rademacher Complexity
Another symmetrization: for any εi ∈ {±1},

  E‖P'n − Pn‖G = E sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z'i) − g(Zi)) |

  = E sup_{g∈G} | (1/n) ∑_{i=1}^n εi (g(Z'i) − g(Zi)) |.

This follows from the fact that Zi and Z'i are i.i.d., so the distribution of the supremum is unchanged when we swap them; in particular, the expectation of the supremum is unchanged. Since this holds for any choice of the εi, we can take the expectation over any random choice of the εi. We'll pick them independently and uniformly.
58 / 174
Uniform Laws and Rademacher Complexity
  E sup_{g∈G} | (1/n) ∑_{i=1}^n εi (g(Z'i) − g(Zi)) |

  ≤ E [ sup_{g∈G} | (1/n) ∑_{i=1}^n εi g(Z'i) | + sup_{g∈G} | (1/n) ∑_{i=1}^n εi g(Zi) | ]

  = 2 E sup_{g∈G} | (1/n) ∑_{i=1}^n εi g(Zi) |    (Rademacher complexity)

  = 2E‖Rn‖G,

where Rn is the Rademacher process Rn(g) = (1/n) ∑_{i=1}^n εi g(Zi).
59 / 174
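The chain E‖P − Pn‖G ≤ 2E‖Rn‖G can be sanity-checked numerically for a toy class. A sketch (assuming G = {z ↦ 1{z ≤ t}} and P = Uniform[0, 1]; all choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0.1, 0.9, 9)

def sup_dev(z):
    """sup_g |Pg - P_n g| for g = 1{z <= t}; Pg = t under Uniform[0,1]."""
    pn = (z[:, None] <= thresholds[None, :]).mean(axis=0)
    return np.abs(thresholds - pn).max()

def sup_rad(z):
    """sup_g |(1/n) sum_i eps_i g(z_i)| for one draw of the signs."""
    eps = rng.choice([-1.0, 1.0], size=len(z))
    vals = (z[:, None] <= thresholds[None, :]).astype(float)
    return np.abs(eps @ vals / len(z)).max()

n, trials = 100, 500
lhs = np.mean([sup_dev(rng.uniform(size=n)) for _ in range(trials)])        # E||P - P_n||_G
rhs = 2.0 * np.mean([sup_rad(rng.uniform(size=n)) for _ in range(trials)])  # 2 E||R_n||_G
print(lhs, rhs)  # the symmetrization bound predicts lhs <= rhs
```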
Uniform Laws and Rademacher Complexity
The second inequality (desymmetrization) follows from:

  E‖Rn‖G ≤ E‖ (1/n) ∑_{i=1}^n εi (g(Zi) − Eg(Zi)) ‖G + E‖ (1/n) ∑_{i=1}^n εi Eg(Zi) ‖G

  ≤ E‖ (1/n) ∑_{i=1}^n εi (g(Zi) − g(Z'i)) ‖G + ‖P‖G E| (1/n) ∑_{i=1}^n εi |

  = E‖ (1/n) ∑_{i=1}^n (g(Zi) − Eg(Zi) + Eg(Z'i) − g(Z'i)) ‖G + ‖P‖G E| (1/n) ∑_{i=1}^n εi |

  ≤ 2E‖Pn − P‖G + √(2 log 2 / n).
60 / 174
Uniform Laws and Rademacher Complexity
Theorem.

For any G,   E‖P − Pn‖G ≤ 2E‖Rn‖G.

If G ⊂ [0, 1]^Z, then

  (1/2) E‖Rn‖G − √(log 2 / (2n)) ≤ E‖P − Pn‖G ≤ 2E‖Rn‖G,

and, with probability at least 1 − 2 exp(−2ε²n),

  E‖P − Pn‖G − ε ≤ ‖P − Pn‖G ≤ E‖P − Pn‖G + ε.

Thus, E‖Rn‖G → 0 iff ‖P − Pn‖G → 0 almost surely.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.
61 / 174
Back to the prediction problem
Recall: for the prediction problem, we have G = ℓ ∘ F for some loss function ℓ and a set of functions F.

We found that E‖P − Pn‖_{ℓ∘F} can be related to E‖Rn‖_{ℓ∘F}. Question: can we get E‖Rn‖F instead?

For classification, this is easy. Suppose F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y') = I{y ≠ y'} = (1 − yy')/2. Then

  E‖Rn‖_{ℓ∘F} = E sup_{f∈F} | (1/n) ∑_{i=1}^n εi ℓ(yi, f(xi)) |

  = (1/2) E sup_{f∈F} | (1/n) ∑_{i=1}^n εi − (1/n) ∑_{i=1}^n εi yi f(xi) |

  ≤ O(1/√n) + (1/2) E‖Rn‖F,

since −εi yi has the same distribution as εi.

Other loss functions?
62 / 174
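The classification reduction can be checked numerically: with ℓ(y, y') = (1 − yy')/2, the loss-class Rademacher average stays within O(1/√n) of half the function-class one. A sketch with a hypothetical class of sign functions (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
y = np.where(x + 0.1 * rng.standard_normal(n) > 0, 1.0, -1.0)
F = [lambda x, t=t: np.where(x > t, 1.0, -1.0) for t in np.linspace(-0.5, 0.5, 11)]

def sup_loss_rad(eps):
    # sup_f |(1/n) sum_i eps_i * l(y_i, f(x_i))|, with l(y, y') = (1 - y y') / 2
    return max(abs(np.mean(eps * (1.0 - y * f(x)) / 2.0)) for f in F)

def sup_f_rad(eps):
    # sup_f |(1/n) sum_i eps_i f(x_i)|
    return max(abs(np.mean(eps * f(x))) for f in F)

trials = 1000
loss_rad = np.mean([sup_loss_rad(rng.choice([-1.0, 1.0], n)) for _ in range(trials)])
f_rad = np.mean([sup_f_rad(rng.choice([-1.0, 1.0], n)) for _ in range(trials)])
print(loss_rad, 0.5 / np.sqrt(n) + f_rad / 2.0)  # first value should not exceed second
```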
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
63 / 174
Rademacher Complexity: Structural Results
Theorem.

Subsets: F ⊆ G implies ‖Rn‖F ≤ ‖Rn‖G.

Scaling: ‖Rn‖_{cF} = |c| ‖Rn‖F.

Plus Constant: for |g(X)| ≤ 1, |E‖Rn‖_{F+g} − E‖Rn‖F| ≤ √(2 log 2 / n).

Convex Hull: ‖Rn‖_{co F} = ‖Rn‖F, where co F is the convex hull of F.

Contraction: if ℓ : R × R → [−1, 1] has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y, then for ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y)}, E‖Rn‖_{ℓ∘F} ≤ 2E‖Rn‖F + c/√n.
64 / 174
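The Subsets and Scaling properties hold pointwise in the signs εi, so they can be verified directly by simulation. A sketch with an illustrative toy class (choices hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 80, 1000
z = rng.uniform(-1.0, 1.0, size=n)
F = [lambda z, t=t: np.clip(z - t, -1.0, 1.0) for t in (-0.3, 0.0, 0.3)]
G = F + [np.tanh]                                   # F is a subset of G

sF = sG = s3F = 0.0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    sF += max(abs(np.mean(eps * f(z))) for f in F)
    sG += max(abs(np.mean(eps * g(z))) for g in G)
    s3F += max(abs(np.mean(eps * 3.0 * f(z))) for f in F)  # the class 3*F
rF, rG, r3F = sF / trials, sG / trials, s3F / trials
print(rF, rG, r3F)  # Subsets: rF <= rG.  Scaling: r3F = 3 * rF.
```

Because the same signs are used for F and its superset G on each trial, rF ≤ rG holds draw by draw, not just on average.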
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
65 / 174
Deep Networks
Deep compositions of nonlinear functions

  h = hm ∘ hm−1 ∘ · · · ∘ h1

e.g., hi : x ↦ σ(Wi x)   or   hi : x ↦ r(Wi x), with

  σ(v)i = 1 / (1 + exp(−vi)),   r(v)i = max{0, vi}.
66 / 174
Rademacher Complexity of Sigmoid Networks
Theorem.

Consider the following class F_{B,d} of two-layer neural networks on R^d:

  F_{B,d} = { x ↦ ∑_{i=1}^k wi σ(vi^T x) : ‖w‖1 ≤ B, ‖vi‖1 ≤ B, k ≥ 1 },

where B > 0 and the nonlinear function σ : R → R satisfies the Lipschitz condition |σ(a) − σ(b)| ≤ |a − b| and σ(0) = 0. Suppose that the distribution is such that ‖X‖∞ ≤ 1 a.s. Then

  E‖Rn‖_{F_{B,d}} ≤ 2B² √(2 log(2d) / n),

where d is the dimension of the input space, X = R^d.
67 / 174
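Note the mild dependence on the input dimension d, which enters only through √log(2d). A quick evaluation of the bound (numbers purely illustrative):

```python
import math

def sigmoid_net_bound(B, d, n):
    """The theorem's upper bound 2 B^2 sqrt(2 log(2d) / n)."""
    return 2.0 * B**2 * math.sqrt(2.0 * math.log(2 * d) / n)

for d in (10, 1000, 100000):
    print(d, round(sigmoid_net_bound(B=1.0, d=d, n=10000), 4))
# Increasing d by four orders of magnitude only roughly doubles the bound.
```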
Rademacher Complexity of Sigmoid Networks: Proof
Define G := { (x1, . . . , xd) ↦ xj : 1 ≤ j ≤ d },

  V_B := { x ↦ v'x : ‖v‖1 = ∑_{i=1}^d |vi| ≤ B } = B co(G ∪ −G),

with

  co(F) := { ∑_{i=1}^k αi fi : k ≥ 1, αi ≥ 0, ‖α‖1 = 1, fi ∈ F }.

Then

  F_{B,d} = { x ↦ ∑_{i=1}^k wi σ(vi(x)) : k ≥ 1, ∑_{i=1}^k wi ≤ B, vi ∈ V_B } = B co(σ ∘ V_B ∪ −σ ∘ V_B).
68 / 174
Rademacher Complexity of Sigmoid Networks: Proof
  E‖Rn‖_{F_{B,d}} = E‖Rn‖_{B co(σ∘V_B ∪ −σ∘V_B)}

  = B E‖Rn‖_{co(σ∘V_B ∪ −σ∘V_B)}

  = B E‖Rn‖_{σ∘V_B ∪ −σ∘V_B}

  ≤ B E‖Rn‖_{σ∘V_B} + B E‖Rn‖_{−σ∘V_B}

  = 2B E‖Rn‖_{σ∘V_B}

  ≤ 2B E‖Rn‖_{V_B}

  = 2B E‖Rn‖_{B co(G ∪ −G)}

  = 2B² E‖Rn‖_{G ∪ −G}

  ≤ 2B² √(2 log(2d) / n).
69 / 174
Rademacher Complexity of Sigmoid Networks: Proof
The final step involved the following result.
Lemma.

For f ∈ F satisfying |f(x)| ≤ 1,

  E‖Rn‖F ≤ E √( 2 log(2 |F(X_1^n)|) / n ).
70 / 174
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
71 / 174
Rademacher Complexity of ReLU Networks
Using the positive homogeneity property of the ReLU nonlinearity (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)) gives a simple argument to bound the Rademacher complexity.

Theorem.

For a training set (X1, Y1), . . . , (Xn, Yn) ∈ X × {±1} with ‖Xi‖ ≤ 1 a.s.,

  E‖Rn‖_{F_{L,B}} ≤ (2B)^L / √n,

where each f ∈ F_{L,B} is an L-layer network of the form

  F_{L,B} := { x ↦ W_L σ(W_{L−1} · · · σ(W_1 x) · · · ) },

σ is 1-Lipschitz, positive homogeneous, and applied componentwise, and ‖Wi‖_F ≤ B. (W_L is a row vector.)
72 / 174
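The depth dependence here is exponential: the Frobenius-norm bound B on each layer compounds as (2B)^L. A quick evaluation (illustrative numbers):

```python
import math

def relu_net_bound(B, L, n):
    """The theorem's upper bound (2B)^L / sqrt(n)."""
    return (2.0 * B) ** L / math.sqrt(n)

for L in (1, 3, 5, 10):
    print(L, relu_net_bound(B=1.0, L=L, n=10**6))
# Even with B = 1 the bound doubles with each layer; for B > 1 it grows faster still.
```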
ReLU Networks: Proof
(Write Eε for the conditional expectation given the data.)

Lemma.

  Eε sup_{f∈F, ‖W‖F ≤ B} (1/n) ‖ ∑_{i=1}^n εi σ(W f(Xi)) ‖2 ≤ 2B Eε sup_{f∈F} (1/n) ‖ ∑_{i=1}^n εi f(Xi) ‖2.

Iterating this and using Jensen's inequality proves the theorem.
73 / 174
ReLU Networks: Proof
For W^T = (w1 · · · wk), we use positive homogeneity:

  ‖ ∑_{i=1}^n εi σ(W f(xi)) ‖² = ∑_{j=1}^k ( ∑_{i=1}^n εi σ(wj^T f(xi)) )²

  = ∑_{j=1}^k ‖wj‖² ( ∑_{i=1}^n εi σ( (wj^T/‖wj‖) f(xi) ) )².
74 / 174
ReLU Networks: Proof
  sup_{‖W‖F ≤ B} ∑_{j=1}^k ‖wj‖² ( ∑_{i=1}^n εi σ( (wj^T/‖wj‖) f(xi) ) )²

  = sup_{‖wj‖=1; ‖α‖1 ≤ B²} ∑_{j=1}^k αj ( ∑_{i=1}^n εi σ( wj^T f(xi) ) )²

  = B² sup_{‖w‖=1} ( ∑_{i=1}^n εi σ( w^T f(xi) ) )²,

then apply the Contraction and Cauchy-Schwarz inequalities.
75 / 174
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
76 / 174
Rademacher Complexity and Covering Numbers
Definition.

An ε-cover of a subset T of a metric space (S, d) is a set T̂ ⊂ T such that for each t ∈ T there is a t̂ ∈ T̂ with d(t, t̂) ≤ ε. The ε-covering number of T is

  N(ε, T, d) = min{ |T̂| : T̂ is an ε-cover of T }.

Intuition: a k-dimensional set T has N(ε, T, d) = Θ(1/ε^k). Example: ([0, 1]^k, ℓ∞).
77 / 174
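The Θ(1/ε^k) intuition can be made concrete for the example ([0, 1]^k, ℓ∞): a uniform grid with spacing 2ε is an ε-cover. A sketch:

```python
import math

def cube_cover_size(eps, k):
    """Size of a grid eps-cover of [0,1]^k under l_inf: ceil(1/(2 eps))
    points per coordinate, each within eps of every point in its cell."""
    return math.ceil(1.0 / (2.0 * eps)) ** k

for eps in (0.25, 0.1, 0.05):
    print(eps, cube_cover_size(eps, k=3))  # grows like (1/eps)^3
```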
Rademacher Complexity and Covering Numbers
Theorem.

For any F and any X1, . . . , Xn,

  Eε‖Rn‖_{F(X_1^n)} ≤ c ∫_0^∞ √( ln N(ε, F(X_1^n), ℓ2) / n ) dε

and

  Eε‖Rn‖_{F(X_1^n)} ≥ (1 / (c log n)) sup_{α>0} α √( ln N(α, F(X_1^n), ℓ2) / n ).
78 / 174
Covering numbers and smoothly parameterized functions
Lemma.

Let F be a parameterized class of functions, F = { f(θ, ·) : θ ∈ Θ }. Let ‖·‖Θ be a norm on Θ and let ‖·‖F be a norm on F. Suppose that the mapping θ ↦ f(θ, ·) is L-Lipschitz, that is,

  ‖f(θ, ·) − f(θ', ·)‖F ≤ L ‖θ − θ'‖Θ.

Then N(ε, F, ‖·‖F) ≤ N(ε/L, Θ, ‖·‖Θ).
79 / 174
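Combining the lemma with a grid cover of the parameter set gives explicit bounds. A sketch assuming Θ = [0, 1]^k under ℓ∞ (a hypothetical instantiation, not from the slides):

```python
import math

def lipschitz_class_cover_bound(eps, k, L):
    """N(eps, F) <= N(eps / L, Theta) for an L-Lipschitz parameterization;
    with Theta = [0,1]^k under l_inf, a grid gives N(delta, Theta) <= ceil(1/(2 delta))^k."""
    delta = eps / L
    return math.ceil(1.0 / (2.0 * delta)) ** k

print(lipschitz_class_cover_bound(eps=0.1, k=4, L=2.0))  # 10^4 = 10000
```

Doubling the Lipschitz constant L has the same effect as halving ε, so the bound scales like (L/ε)^k.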
Covering numbers and smoothly parameterized functions
A Lipschitz parameterization allows us to translate a cover of theparameter space into a cover of the function space.
Example: If F is smoothly parameterized by (a compact set of) k parameters, then N(ε, F) = O(1/ε^k).
80 / 174
Recap

[Figure: the class G inside the space of all functions, marking the target g0, the expectations Eg, the empirical averages (1/n) ∑_{i=1}^n g(zi), the empirical minimizer ĝn, and the best-in-class element g_G.]

81 / 174
Recap
g0
Eg
all functions
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn
g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn
g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
g0
Eg
all functionsg0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gn g0
1
n
nX
i=1
g(zi)
gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
gG gn
G
g0
Eg
all functions
1
n
nX
i=1
g(zi)
81 / 174
Recap
E ‖P − Pn‖_F vs. E ‖Rn‖_F

Advantage of dealing with the Rademacher process:

- often easier to analyze
- can work conditionally on the data

Example: F = { x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1 }, ‖x‖ ≤ 1. Then

E_ε ‖Rn‖_F = E_ε sup_{‖w‖≤1} | (1/n) Σ_{i=1}^n ε_i ⟨w, x_i⟩ | = E_ε sup_{‖w‖≤1} | ⟨ w, (1/n) Σ_{i=1}^n ε_i x_i ⟩ |,

which is equal to

E_ε ‖ (1/n) Σ_{i=1}^n ε_i x_i ‖ ≤ √( Σ_{i=1}^n ‖x_i‖² ) / n = √( tr(XXᵀ) ) / n.
82 / 174
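The final bound above can be checked numerically. The following is a quick Monte Carlo sketch (not from the slides): it estimates E_ε ‖(1/n) Σ ε_i x_i‖ for random sign vectors and compares it against √(tr(XXᵀ))/n; the data X is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_i|| <= 1

# Monte Carlo estimate of E_eps || (1/n) sum_i eps_i x_i ||
trials = 2000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
emp = np.linalg.norm(eps @ X / n, axis=1).mean()

bound = np.sqrt(np.trace(X @ X.T)) / n  # = sqrt(sum_i ||x_i||^2) / n
assert emp <= bound
```

The gap between the two reflects Jensen's inequality: the bound controls the second moment, while the Monte Carlo average is a first moment.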
Outline
Introduction
Uniform Laws of Large Numbers
ClassificationGrowth FunctionVC-DimensionVC-Dimension of Neural NetworksLarge Margin Classifiers
Beyond ERM
Interpolation
83 / 174
Classification and Rademacher Complexity
Recall: for F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = I{y ≠ y′} = (1 − yy′)/2,

E ‖Rn‖_{ℓ∘F} ≤ (1/2) E ‖Rn‖_F + O(1/√n).

How do we get a handle on the Rademacher complexity of a set F of classifiers? Recall that we are looking at the "size" of F projected onto the data:

F(x₁ⁿ) = { (f(x₁), . . . , f(xₙ)) : f ∈ F } ⊆ {−1, 1}ⁿ
84 / 174
Controlling Rademacher Complexity: Growth Function
Example: for the class of step functions, F = { x ↦ I{x ≤ α} : α ∈ R }, the set of restrictions

F(x₁ⁿ) = { (f(x₁), . . . , f(xₙ)) : f ∈ F }

is always small: |F(x₁ⁿ)| ≤ n + 1. So E‖Rn‖_F ≤ √( 2 log(2(n+1)) / n ).

Example: if F is parameterized by k bits, F = { x ↦ f(x, θ) : θ ∈ {0, 1}^k } for some f : X × {0, 1}^k → [0, 1], then

|F(x₁ⁿ)| ≤ 2^k,  E‖Rn‖_F ≤ √( 2(k + 1) log 2 / n ).
85 / 174
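The |F(x₁ⁿ)| ≤ n + 1 claim for step functions is easy to verify by brute force. A minimal sketch (not from the slides): on n distinct points, thresholds placed below, between, and above the sorted points realize every distinct restriction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
xs = np.sort(rng.uniform(size=n))

# Thresholds around and between the sample points realize every distinct pattern
alphas = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
patterns = {tuple((xs <= a).astype(int)) for a in alphas}

assert len(patterns) == n + 1  # |F(x_1^n)| = n + 1 for step functions
```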
Growth Function
Definition.
For a class F ⊆ {0, 1}^X, the growth function is

Π_F(n) = max{ |F(x₁ⁿ)| : x₁, . . . , xₙ ∈ X }.

- E‖Rn‖_F ≤ √( 2 log(2 Π_F(n)) / n );  E‖P − Pn‖_F = O( √( log Π_F(n) / n ) ).
- Π_F(n) ≤ |F|, and lim_{n→∞} Π_F(n) = |F|.
- Π_F(n) ≤ 2ⁿ. (But then this gives no useful bound on E‖Rn‖_F.)
86 / 174
Outline
Introduction
Uniform Laws of Large Numbers
ClassificationGrowth FunctionVC-DimensionVC-Dimension of Neural NetworksLarge Margin Classifiers
Beyond ERM
Interpolation
87 / 174
Vapnik-Chervonenkis Dimension
Definition.
A class F ⊆ {0, 1}^X shatters {x₁, . . . , x_d} ⊆ X if |F(x₁^d)| = 2^d. The Vapnik–Chervonenkis dimension of F is

d_VC(F) = max{ d : some x₁, . . . , x_d ∈ X is shattered by F } = max{ d : Π_F(d) = 2^d }.
88 / 174
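Shattering can also be checked by brute force for small classes. A sketch (not from the slides) for the step-function class F = {x ↦ I{x ≤ α}}: any single point is shattered, but no pair is, so d_VC(F) = 1.

```python
import numpy as np

def patterns(xs):
    """All dichotomies realized on xs by F = {x -> I{x <= alpha}}."""
    xs = np.asarray(xs, dtype=float)
    cuts = np.concatenate(([xs.min() - 1], np.unique(xs), [xs.max() + 1]))
    return {tuple((xs <= a).astype(int)) for a in cuts}

def shattered(xs):
    return len(patterns(xs)) == 2 ** len(xs)

assert shattered([0.3])           # any single point is shattered
assert not shattered([0.2, 0.7])  # labeling (0, 1) is unrealizable, so d_VC = 1
```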
Vapnik-Chervonenkis Dimension: “Sauer’s Lemma”
Theorem (Vapnik–Chervonenkis).
d_VC(F) ≤ d implies

Π_F(n) ≤ Σ_{i=0}^d (n choose i).

If n ≥ d, the latter sum is no more than (en/d)^d.

So the VC-dimension d is a single-integer summary of the growth function:

Π_F(n) = 2ⁿ if n ≤ d;  Π_F(n) ≤ (en/d)^d if n > d.
89 / 174
Vapnik-Chervonenkis Dimension: “Sauer’s Lemma”
Thus, for d_VC(F) ≤ d and n ≥ d,

sup_P E‖P − Pn‖_F = O( √( d log(n/d) / n ) ).

And there is a matching lower bound.

Uniform convergence, uniformly over probability distributions, is equivalent to finiteness of the VC-dimension.
90 / 174
VC-dimension Bounds for Parameterized Families
Consider a parameterized class of binary-valued functions,

F = { x ↦ f(x, θ) : θ ∈ R^p },

where f : X × R^p → {±1}.

Suppose that for each x, θ ↦ f(x, θ) can be computed using no more than t operations of the following kinds:

1. arithmetic (+, −, ×, /),
2. comparisons (>, =, <),
3. output ±1.

Theorem (Goldberg and Jerrum).
d_VC(F) ≤ 4p(t + 2).
91 / 174
Outline
Introduction
Uniform Laws of Large Numbers
ClassificationGrowth FunctionVC-DimensionVC-Dimension of Neural NetworksLarge Margin Classifiers
Beyond ERM
Interpolation
92 / 174
VC-Dimension of Neural Networks
Theorem.
Consider the class F of {−1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:

1. Piecewise constant (linear threshold units): VCdim(F) = Θ(p). (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): VCdim(F) = Θ(pL). (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: VCdim(F) = O(pL²). (B., Maiorov, Meir, 1998)
4. Sigmoid: VCdim(F) = O(p²k²). (Karpinski and Macintyre, 1994)
93 / 174
Outline
Introduction
Uniform Laws of Large Numbers
ClassificationGrowth FunctionVC-DimensionVC-Dimension of Neural NetworksLarge Margin Classifiers
Beyond ERM
Interpolation
94 / 174
Some Experimental Observations
(Schapire, Freund, B, Lee, 1997)
95 / 174
Some Experimental Observations
(Schapire, Freund, B, Lee, 1997)
96 / 174
Some Experimental Observations
(Lawrence, Giles, Tsoi, 1997)
97 / 174
Some Experimental Observations
(Neyshabur, Tomioka, Srebro, 2015)
98 / 174
Large-Margin Classifiers
- Consider a real-valued function f : X → R used for classification.
- The prediction on x ∈ X is sign(f(x)) ∈ {−1, 1}.
- If yf(x) > 0, then f classifies x correctly.
- We call yf(x) the margin of f on x (c.f. the perceptron).
- We can view a larger margin as a more confident correct classification.
- Minimizing a continuous loss, such as

  Σ_{i=1}^n (f(X_i) − Y_i)²  or  Σ_{i=1}^n exp(−Y_i f(X_i)),

  encourages large margins.
- For large-margin classifiers, we should expect the fine-grained details of f to be less important.
99 / 174
Recall
Contraction property: if y′ ↦ ℓ(y, y′) is 1-Lipschitz for all y and |ℓ| ≤ 1, then

E‖Rn‖_{ℓ∘F} ≤ 2 E‖Rn‖_F + c/√n.

Unfortunately, classification loss is not Lipschitz. By writing classification loss as (1 − yy′)/2, we did show that

E‖Rn‖_{ℓ∘sign(F)} ≤ 2 E‖Rn‖_{sign(F)} + c/√n,

but we want E‖Rn‖_F.
100 / 174
Rademacher Complexity for Lipschitz Loss
Consider the 1/γ-Lipschitz surrogate loss

φ(α) = 1 if α ≤ 0;  1 − α/γ if 0 < α < γ;  0 if α ≥ γ.

[figure: φ(yf(x)) upper-bounds the indicator I{yf(x) ≤ 0} as a function of the margin yf(x)]
101 / 174
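The ramp loss above is a one-liner. A minimal sketch (not from the slides) that implements φ and checks the sandwich I{α ≤ 0} ≤ φ(α) ≤ I{α ≤ γ} used on the next slide:

```python
import numpy as np

def ramp(alpha, gamma):
    """1/gamma-Lipschitz surrogate: 1 for alpha <= 0, linear on (0, gamma), 0 for alpha >= gamma."""
    return np.clip(1.0 - np.asarray(alpha) / gamma, 0.0, 1.0)

gamma = 0.5
alpha = np.linspace(-2, 2, 401)
phi = ramp(alpha, gamma)

assert np.all((alpha <= 0) <= phi + 1e-12)      # I{alpha <= 0} <= phi(alpha)
assert np.all(phi <= (alpha <= gamma) + 1e-12)  # phi(alpha) <= I{alpha <= gamma}
```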
Rademacher Complexity for Lipschitz Loss
The large-margin loss is an upper bound and the classification loss is a lower bound:

I{Yf(X) ≤ 0} ≤ φ(Yf(X)) ≤ I{Yf(X) ≤ γ}.

So if we can relate the Lipschitz risk Pφ(Yf(X)) to the Lipschitz empirical risk P_n φ(Yf(X)), we have a large-margin bound:

P I{Yf(X) ≤ 0} ≤ Pφ(Yf(X))  c.f.  P_n φ(Yf(X)) ≤ P_n I{Yf(X) ≤ γ}.
102 / 174
Rademacher Complexity for Lipschitz Loss
With high probability, for all f ∈ F,

L01(f) = P I{Yf(X) ≤ 0}
  ≤ Pφ(Yf(X))
  ≤ P_n φ(Yf(X)) + (c/γ) E‖Rn‖_F + O(1/√n)
  ≤ P_n I{Yf(X) ≤ γ} + (c/γ) E‖Rn‖_F + O(1/√n).

Notice that we've turned a classification problem into a regression problem.

The VC-dimension (which captures arbitrarily fine-grained properties of the function class) is no longer important.
103 / 174
Example: Linear Classifiers
Let F = { x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1 }. Then, since E‖Rn‖_F ≲ n^{−1/2},

L01(f) ≲ (1/n) Σ_{i=1}^n I{Y_i f(X_i) ≤ γ} + (1/γ)(1/√n).

Compare to the Perceptron.
104 / 174
Risk versus surrogate risk
In general, these margin bounds are upper bounds only.

But for any surrogate loss φ, we can define a transform ψ that gives a tight relationship between the excess 0–1 risk (over the Bayes risk) and the excess φ-risk:

Theorem.
1. For any P and f, ψ( L01(f) − L*01 ) ≤ Lφ(f) − L*φ.
2. For |X| ≥ 2, ε > 0, and θ ∈ [0, 1], there is a P and an f with

L01(f) − L*01 = θ  and  ψ(θ) ≤ Lφ(f) − L*φ ≤ ψ(θ) + ε.
105 / 174
Generalization: Margins and Size of Parameters
- A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.
- For regression, the complexity of a neural network is controlled by the size of the parameters, and can be independent of the number of parameters.
- We have a tradeoff between the fit to the training data (margins) and the complexity (size of parameters):

  Pr(sign(f(X)) ≠ Y) ≤ (1/n) Σ_{i=1}^n φ(Y_i f(X_i)) + p_n(f)

- Even if the training set is classified correctly, it might be worthwhile to increase the complexity, to improve this loss function.
106 / 174
In the next few sections, we show that

E[ L(f̂n) − L̂(f̂n) ]

can be small if the learning algorithm f̂n has certain properties.

Keep in mind that this is an interesting quantity for two reasons:

- it upper bounds the excess expected loss of ERM, as we showed before;
- it can be written as an a-posteriori bound

  L(f̂n) ≤ L̂(f̂n) + . . .

  regardless of whether f̂n minimized empirical loss.
107 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERMCompression BoundsAlgorithmic StabilityLocal Methods
Interpolation
108 / 174
Compression Set
Let us use the shortened notation S = {Z₁, . . . , Zₙ} for the data, and let us make the dependence of the algorithm f̂n on the training set explicit: f̂n = f̂n[S]. As before, denote G = { (x, y) ↦ ℓ(f(x), y) : f ∈ F }, and write ĝn(·) = ℓ(f̂n(·), ·). Let us write ĝn[S](·) to emphasize the dependence.

Suppose there exists a "compression function" C_k which selects from any dataset S of size n a subset of k examples C_k(S) ⊆ S such that

f̂n[S] = f̂k[C_k(S)].

That is, the learning algorithm produces the same function when given S or its subset C_k(S).
One can keep in mind the example of support vectors in SVMs.
109 / 174
Then,

L(f̂n) − L̂(f̂n) = E ĝn − (1/n) Σ_{i=1}^n ĝn(Z_i)
  = E ĝk[C_k(S)](Z) − (1/n) Σ_{i=1}^n ĝk[C_k(S)](Z_i)
  ≤ max_{I ⊆ {1,...,n}, |I| ≤ k} { E ĝk[S_I](Z) − (1/n) Σ_{i=1}^n ĝk[S_I](Z_i) },

where S_I is the subset indexed by I.
110 / 174
Since ĝk[S_I] only depends on k out of n points, the empirical average is "mostly out of sample". Adding and subtracting loss functions for an additional set of i.i.d. random variables W = {Z′₁, . . . , Z′_k} results in an upper bound

max_{I ⊆ {1,...,n}, |I| ≤ k} { E ĝk[S_I](Z) − (1/n) Σ_{Z′∈S′} ĝk[S_I](Z′) } + (b − a)k / n,

where [a, b] is the range of functions in G and S′ is obtained from S by replacing S_I with the corresponding subset W_I.
111 / 174
For each fixed I, the random variable

E ĝk[S_I](Z) − (1/n) Σ_{Z′∈S′} ĝk[S_I](Z′)

is zero-mean with standard deviation O((b − a)/√n). Hence, the expected maximum over I with respect to S, W is at most

c (b − a) √( k log(en/k) / n ),

since log (n choose k) ≤ k log(en/k).
112 / 174
Conclusion: a compression-style argument limits the bias:

E[ L(f̂n) − L̂(f̂n) ] ≤ O( √( k log n / n ) ),

which is non-vacuous if k = o(n/ log n).

Recall that this term was the upper bound (up to log factors) on the expected excess loss of ERM if the class has VC dimension k. However, a possible equivalence between compression and VC dimension is still being investigated.
113 / 174
Example: Classification with Thresholds in 1D
- X = [0, 1], Y = {0, 1}
- F = { f_θ : f_θ(x) = I{x ≥ θ}, θ ∈ [0, 1] }
- ℓ(f_θ(x), y) = I{f_θ(x) ≠ y}

[figure: labeled points on the interval [0, 1] with the ERM threshold f̂n between them]

For any set of data (x₁, y₁), . . . , (xₙ, yₙ), the ERM solution f̂n has the property that the first occurrence x_l to the left of the threshold has label y_l = 0, while the first occurrence x_r to the right has label y_r = 1.

It is enough to take k = 2 and define f̂n[S] = f̂₂[{(x_l, 0), (x_r, 1)}].
114 / 174
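This k = 2 compression can be sketched in a few lines. The following is an illustration (not from the slides), using realizable labels generated by a true threshold: recomputing ERM from just (x_l, 0) and (x_r, 1) reproduces the full-data ERM predictions.

```python
import numpy as np

def erm_threshold(x, y):
    """ERM for f_theta(x) = I{x >= theta} over a grid of candidate thresholds."""
    cand = np.concatenate(([x.min() - 1], np.sort(x), [x.max() + 1]))
    errs = [np.mean((x >= t).astype(int) != y) for t in cand]
    return cand[int(np.argmin(errs))]

rng = np.random.default_rng(2)
x = rng.uniform(size=50)
y = (x >= 0.4).astype(int)  # realizable labels from a true threshold

theta = erm_threshold(x, y)
# Compression set: rightmost 0-labeled point and leftmost 1-labeled point
xl, xr = x[y == 0].max(), x[y == 1].min()
theta2 = erm_threshold(np.array([xl, xr]), np.array([0, 1]))

assert np.array_equal((x >= theta).astype(int), (x >= theta2).astype(int))
```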
Further examples/observations:
- Compression of size d for hyperplanes (realizable case)
- Compression of size 1/γ² for the margin case
- A Bernstein bound gives a 1/n rate rather than a 1/√n rate on realizable data (zero empirical error).
115 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERMCompression BoundsAlgorithmic StabilityLocal Methods
Interpolation
116 / 174
Algorithmic Stability
Recall that compression was a way to upper bound E[ L(f̂n) − L̂(f̂n) ]. Algorithmic stability is another path to the same goal.

Compare:

- Compression: f̂n depends only on a subset of k datapoints.
- Stability: f̂n does not depend on any of the datapoints too strongly.

As before, let's write the shorthand g = ℓ ∘ f and ĝn = ℓ ∘ f̂n.
117 / 174
Algorithmic Stability

We now write

E_S L(f̂n) = E_{Z₁,...,Zₙ,Z} ĝn[Z₁, . . . , Zₙ](Z).

Again, the meaning of ĝn[Z₁, . . . , Zₙ](Z): train on Z₁, . . . , Zₙ and test on Z.

On the other hand,

E_S L̂(f̂n) = E_{Z₁,...,Zₙ} (1/n) Σ_{i=1}^n ĝn[Z₁, . . . , Zₙ](Z_i)
  = (1/n) Σ_{i=1}^n E_{Z₁,...,Zₙ} ĝn[Z₁, . . . , Zₙ](Z_i)
  = E_{Z₁,...,Zₙ} ĝn[Z₁, . . . , Zₙ](Z₁),

where the last step holds for symmetric algorithms (w.r.t. permutation of the training data). Of course, instead of Z₁ we can take any Z_i.
118 / 174
Now comes the renaming trick. It takes a minute to get used to, if you haven't seen it.

Note that Z₁, . . . , Zₙ, Z are i.i.d. Hence,

E_S L(f̂n) = E_{Z₁,...,Zₙ,Z} ĝn[Z₁, . . . , Zₙ](Z) = E_{Z₁,...,Zₙ,Z} ĝn[Z, Z₂, . . . , Zₙ](Z₁).

Therefore,

E[ L(f̂n) − L̂(f̂n) ] = E_{Z₁,...,Zₙ,Z} { ĝn[Z, Z₂, . . . , Zₙ](Z₁) − ĝn[Z₁, . . . , Zₙ](Z₁) }.
119 / 174
Of course, we haven't really done much except re-writing the expectation. But the difference

ĝn[Z, Z₂, . . . , Zₙ](Z₁) − ĝn[Z₁, . . . , Zₙ](Z₁)

has a "stability" interpretation. If the output of the algorithm "does not change much" when one datapoint is replaced with another, then the gap E[ L(f̂n) − L̂(f̂n) ] is small.

Moreover, since everything we've written is an equality, this stability is equivalent to having a small gap E[ L(f̂n) − L̂(f̂n) ].
120 / 174
NB: our aim of ensuring a small E[ L(f̂n) − L̂(f̂n) ] only makes sense if L̂(f̂n) is small (e.g. on average). That is, the analysis only makes sense for those methods that explicitly or implicitly minimize empirical loss (or a regularized variant of it).

It's not enough to be stable. Consider a learning mechanism that ignores the data and outputs f̂n = f₀, a constant function. Then E[ L(f̂n) − L̂(f̂n) ] = 0 and the algorithm is very stable. However, it does not do anything interesting.
121 / 174
Uniform Stability
Rather than the average notion we just discussed, let’s consider a muchstronger notion:
We say that an algorithm is β-uniformly stable if for all i ∈ [n] and all z₁, . . . , zₙ, z′, z,

| ĝn[S](z) − ĝn[S^{i,z′}](z) | ≤ β,

where S^{i,z′} = {z₁, . . . , z_{i−1}, z′, z_{i+1}, . . . , zₙ}.
122 / 174
Uniform Stability
Clearly, for any realization of Z₁, . . . , Zₙ, Z,

ĝn[Z, Z₂, . . . , Zₙ](Z₁) − ĝn[Z₁, . . . , Zₙ](Z₁) ≤ β,

and so the expected loss of a β-uniformly-stable ERM is β-close to its empirical error (in expectation). See recent work by Feldman and Vondrak for sharp high-probability statements.

Of course, it is unclear at this point whether a β-uniformly-stable ERM (or near-ERM) exists.
123 / 174
Kernel Ridge Regression
Consider

f̂n = argmin_{f∈H} (1/n) Σ_{i=1}^n (f(X_i) − Y_i)² + λ ‖f‖²_K

in the RKHS H corresponding to kernel K.

Assume K(x, x) ≤ κ² for any x.

Lemma.
Kernel Ridge Regression is β-uniformly stable with β = O(1/(λn)).
124 / 174
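The β = O(1/(λn)) claim can be probed numerically. A sketch (not from the slides), using the standard closed form for the minimizer, f̂n(·) = Σ_i α_i K(·, X_i) with α = (K + λnI)⁻¹ y for this objective: replace one training point and watch the sup-norm change shrink as λ grows.

```python
import numpy as np

def krr_fit(X, y, lam, kernel):
    # Closed form for argmin (1/n) sum (f(X_i)-Y_i)^2 + lam ||f||_K^2
    K = kernel(X, X)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda Z: kernel(Z, X) @ alpha

rbf = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2))  # 1-d RBF, kappa = 1

rng = np.random.default_rng(3)
n = 100
X = rng.uniform(-1, 1, size=n)
y = np.sin(3 * X) + 0.1 * rng.normal(size=n)

grid = np.linspace(-1, 1, 200)
gaps = []
for lam in (0.01, 0.1, 1.0):
    f = krr_fit(X, y, lam, rbf)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = 0.5, -1.0  # replace one training point with an outlier
    fp = krr_fit(Xp, yp, lam, rbf)
    gaps.append(np.max(np.abs(f(grid) - fp(grid))))

# sup-norm change from one replaced point shrinks as lambda grows
assert gaps[0] > gaps[1] > gaps[2]
```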
Proof (stability of Kernel Ridge Regression)
To prove this, first recall the definition of a σ-strongly convex function φ on a convex domain W:

∀u, v ∈ W,  φ(u) ≥ φ(v) + ⟨∇φ(v), u − v⟩ + (σ/2) ‖u − v‖².

Suppose φ, φ′ are both σ-strongly convex, and w, w′ satisfy ∇φ(w) = ∇φ′(w′) = 0. Then

φ(w′) ≥ φ(w) + (σ/2) ‖w − w′‖²  and  φ′(w) ≥ φ′(w′) + (σ/2) ‖w − w′‖².

As a trivial consequence,

σ ‖w − w′‖² ≤ [φ(w′) − φ′(w′)] + [φ′(w) − φ(w)].
125 / 174
Proof (stability of Kernel Ridge Regression)

Now take

φ(f) = (1/n) Σ_{i∈S} (f(x_i) − y_i)² + λ ‖f‖²_K  and  φ′(f) = (1/n) Σ_{i∈S′} (f(x_i) − y_i)² + λ ‖f‖²_K,

where S and S′ differ in one element: (x_i, y_i) is replaced with (x′_i, y′_i).

Let f̂n, f̂′n be the minimizers of φ, φ′, respectively. Then

φ(f̂′n) − φ′(f̂′n) ≤ (1/n) { (f̂′n(x_i) − y_i)² − (f̂′n(x′_i) − y′_i)² }

and

φ′(f̂n) − φ(f̂n) ≤ (1/n) { (f̂n(x′_i) − y′_i)² − (f̂n(x_i) − y_i)² }.

NB: we have been operating with f as vectors. To be precise, one needs to define the notion of strong convexity over H. Let us sweep it under the rug and say that φ, φ′ are 2λ-strongly convex with respect to ‖·‖_K.
126 / 174
Proof (stability of Kernel Ridge Regression)
Then ‖f̂n − f̂′n‖²_K is at most

(1/(2λn)) { (f̂′n(x_i) − y_i)² − (f̂n(x_i) − y_i)² + (f̂n(x′_i) − y′_i)² − (f̂′n(x′_i) − y′_i)² },

which is at most

(1/(2λn)) C ‖f̂n − f̂′n‖_∞,

where C = 4(1 + c) if |Y_i| ≤ 1 and |f̂n(x_i)| ≤ c.

On the other hand, for any x,

f(x) = ⟨f, K_x⟩ ≤ ‖f‖_K ‖K_x‖ = ‖f‖_K √(⟨K_x, K_x⟩) = ‖f‖_K √(K(x, x)) ≤ κ ‖f‖_K,

and so ‖f‖_∞ ≤ κ ‖f‖_K.
127 / 174
Proof (stability of Kernel Ridge Regression)
Putting everything together,

‖f̂n − f̂′n‖²_K ≤ (1/(2λn)) C ‖f̂n − f̂′n‖_∞ ≤ (κC/(2λn)) ‖f̂n − f̂′n‖_K.

Hence,

‖f̂n − f̂′n‖_K ≤ κC/(2λn).

To finish the claim,

(f̂n(x_i) − y_i)² − (f̂′n(x_i) − y_i)² ≤ C ‖f̂n − f̂′n‖_∞ ≤ κC ‖f̂n − f̂′n‖_K ≤ O(1/(λn)).
128 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERMCompression BoundsAlgorithmic StabilityLocal Methods
Interpolation
129 / 174
We consider "local" procedures such as k-Nearest-Neighbors or local smoothing. They have a different bias-variance decomposition (we do not fix a class F).

The analysis will rely on local similarity (e.g. Lipschitzness) of the regression function f*.

Idea: to predict y at a given x, look up in the dataset those Y_i for which X_i is "close" to x.
130 / 174
Bias-Variance

It's time to revisit the bias-variance picture. Recall that our goal was to ensure that

E L(f̂n) − L(f*)

decreases with the data size n, where f* gives the smallest possible L.

For "simple problems" (that is, strong assumptions on P), one can ensure this without the bias-variance decomposition. Examples: Perceptron, linear regression in the d < n regime, etc.

However, for more interesting problems, we cannot get this difference to be small without introducing bias, because otherwise the variance (the fluctuation of the stochastic part) is too large.

Our approach so far was to split this term into an estimation-approximation error with respect to some class F:

{ E L(f̂n) − L(f_F) } + { L(f_F) − L(f*) }
131 / 174
Bias-Variance
Now we'll study a different bias-variance decomposition, typically used in nonparametric statistics. We will only work with square loss.

Rather than fixing an F that controls the estimation error, we fix an algorithm (procedure/estimator) f̂n that has some tunable parameter.

By definition, E[Y | X = x] = f*(x). Then we write

E L(f̂n) − L(f*) = E(f̂n(X) − Y)² − E(f*(X) − Y)²
  = E(f̂n(X) − f*(X) + f*(X) − Y)² − E(f*(X) − Y)²
  = E(f̂n(X) − f*(X))²,

because the cross term vanishes (check!).
132 / 174
Bias-Variance
Before proceeding, let us discuss the last expression.

E(f̂n(X) − f*(X))² = E_S ∫_x (f̂n(x) − f*(x))² P(dx) = ∫_x E_S (f̂n(x) − f*(x))² P(dx)

We will often analyze E_S (f̂n(x) − f*(x))² for fixed x and then integrate.

The integral is a measure of distance between two functions:

‖f − g‖²_{L₂(P)} = ∫_x (f(x) − g(x))² P(dx).
133 / 174
Bias-Variance
Let us drop L₂(P) from the notation for brevity. The bias-variance decomposition can be written as

E_S ‖f̂n − f*‖²_{L₂(P)} = E_S ‖f̂n − E_{Y1:n}[f̂n] + E_{Y1:n}[f̂n] − f*‖²
  = E_S ‖f̂n − E_{Y1:n}[f̂n]‖² + E_{X1:n} ‖E_{Y1:n}[f̂n] − f*‖²,

because the cross term is zero in expectation.

The first term is the variance, the second is the squared bias. One typically increases with the parameter, the other decreases.

The parameter is chosen either (a) theoretically or (b) by cross-validation (this is the usual case in practice).
134 / 174
k-Nearest Neighbors
As before, we are given (X₁, Y₁), . . . , (Xₙ, Yₙ) i.i.d. from P. To make a prediction of Y at a given x, we sort the points according to the distance ‖X_i − x‖. Let

(X_(1), Y_(1)), . . . , (X_(n), Y_(n))

be the sorted list (remember, this depends on x).

The k-NN estimate is defined as

f̂n(x) = (1/k) Σ_{i=1}^k Y_(i).
135 / 174
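The estimator is a few lines of code. A 1-d sketch (not from the slides) on synthetic data, showing the tradeoff the next slides analyze: k = 1 is high-variance, k = n is high-bias, and an intermediate k does well.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """k-NN estimate at x: average the Y's of the k nearest X's (1-d; use a norm in general)."""
    order = np.argsort(np.abs(X - x))
    return Y[order[:k]].mean()

rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(size=n)
f_star = lambda x: np.sin(2 * np.pi * x)
Y = f_star(X) + 0.3 * rng.normal(size=n)

x0 = 0.37
errs = {k: abs(knn_regress(x0, X, Y, k) - f_star(x0)) for k in (1, 45, n)}
# k = n averages everything (pure bias); an intermediate k balances bias and variance
```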
Variance: given x,

f̂n(x) − E_{Y1:n}[f̂n(x)] = (1/k) Σ_{i=1}^k (Y_(i) − f*(X_(i))),

which is on the order of 1/√k. Then the variance is of the order 1/k.

Bias: a bit more complicated. For a given x,

E_{Y1:n}[f̂n(x)] − f*(x) = (1/k) Σ_{i=1}^k (f*(X_(i)) − f*(x)).

Suppose f* is 1-Lipschitz. Then the square of the above is

( (1/k) Σ_{i=1}^k (f*(X_(i)) − f*(x)) )² ≤ (1/k) Σ_{i=1}^k ‖X_(i) − x‖².

So, the bias is governed by how close the closest k random points are to x.
136 / 174
Claim: it is enough to know an upper bound on the distance from x to the closest of n points.

Argument: for simplicity, assume that n/k is an integer. Divide the original (unsorted) dataset into k blocks of size n/k each. Let X̄_i be the closest point to x in the i-th block. Then the collection X̄₁, . . . , X̄_k is a k-subset which is no closer than the set of k nearest neighbors. That is,

(1/k) Σ_{i=1}^k ‖X_(i) − x‖² ≤ (1/k) Σ_{i=1}^k ‖X̄_i − x‖².

Taking expectation (with respect to the dataset), the bias term is at most

E (1/k) Σ_{i=1}^k ‖X̄_i − x‖² = E ‖X̄₁ − x‖²,

which is the expected squared distance from x to the closest point in a random set of n/k points.
137 / 174
If the support of X is bounded and d ≥ 3, then one can estimate

E ‖X − X_(1)‖² ≲ n^{−2/d}.

That is, we expect the closest neighbor of a random point X to be no further than n^{−1/d} away from one of n randomly drawn points.

Thus, the bias (which is the expected squared distance from x to the closest point in a random set of n/k points) is at most (n/k)^{−2/d}.
138 / 174
Putting everything together, the bias-variance decomposition yields

1/k + (k/n)^{2/d}.

The optimal choice is k ∼ n^{2/(2+d)}, and the overall rate of estimation at a given point x is n^{−2/(2+d)}.

Since the result holds for any x, the integrated risk is also

E ‖f̂n − f*‖² ≲ n^{−2/(2+d)}.
139 / 174
Summary
- We sketched the proof that k-Nearest-Neighbors has sample complexity guarantees for prediction or estimation problems with square loss, if k is chosen appropriately.
- The analysis is very different from the "empirical process" approach for ERM.
- Truly nonparametric!
- No assumptions on the underlying density (in d ≥ 3) beyond compact support. Additional assumptions are needed for d ≤ 3.
140 / 174
Local Kernel Regression: Nadaraya-Watson
Fix a kernel K : R^d → R≥0. Assume K is zero outside the unit Euclidean ball at the origin (not true for e^{−x²}, but close enough).

(figure from Gyorfi et al)

Let K_h(x) = K(x/h), so that K_h(x − x′) is zero if ‖x − x′‖ ≥ h.

Here h is the "bandwidth", a tunable parameter.

Assume K(x) ≥ c I{‖x‖ ≤ 1} for some c > 0. This is important for the "averaging effect" to kick in.
141 / 174
Nadaraya-Watson estimator:
f̂n(x) = Σ_{i=1}^n Y_i W_i(x),  with  W_i(x) = K_h(x − X_i) / Σ_{j=1}^n K_h(x − X_j).

(Note: Σ_i W_i(x) = 1.)
142 / 174
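A 1-d sketch of the estimator (not from the slides), using the box kernel K = I{|u| ≤ 1}, for which the weighted average is just the average of the Y_i within bandwidth h:

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Nadaraya-Watson with box kernel K = I{|u| <= 1}: average Y_i with |X_i - x| <= h."""
    w = (np.abs(X - x) <= h).astype(float)
    if w.sum() == 0:
        return 0.0  # no points within bandwidth; a convention only
    return (w @ Y) / w.sum()

rng = np.random.default_rng(5)
n = 5000
X = rng.uniform(size=n)
f_star = lambda x: np.cos(2 * np.pi * x)
Y = f_star(X) + 0.2 * rng.normal(size=n)

h = n ** (-1 / 3)  # the n^{-1/(2+d)} bandwidth with d = 1 (see the rate below)
est = nw_estimate(0.6, X, Y, h)
```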
Unlike the k-NN example, bias is easier to estimate.
Bias: for a given x ,
EY1:n [fn(x)] = EY1:n
[n∑
i=1
YiWi (x)
]=
n∑i=1
f ∗(Xi )Wi (x)
and so
EY1:n [fn(x)]− f ∗(x) =n∑
i=1
(f ∗(Xi )− f ∗(x))Wi (x)
Suppose f ∗ is 1-Lipschitz. Since Kh is zero outside the h-radius ball,
|EY1:n [fn(x)]− f ∗(x)|2 ≤ h2.
143 / 174
Variance: we have

f̂n(x) − E_{Y1:n}[f̂n(x)] = Σ_{i=1}^n (Y_i − f*(X_i)) W_i(x).

The expectation of the square of this difference is at most

E[ Σ_{i=1}^n (Y_i − f*(X_i))² W_i(x)² ],

since the cross terms are zero (fix the X's, take expectation with respect to the Y's).

We are left analyzing

n E[ K_h(x − X₁)² / ( Σ_{i=1}^n K_h(x − X_i) )² ].

Under some assumptions on the density of X, the denominator is at least (nh^d)² with high probability, whereas E K_h(x − X₁)² = O(h^d), assuming ∫K² < ∞. This gives an overall variance of O(1/(nh^d)). Many details are skipped here (e.g. problems at the boundary, assumptions, etc.).

144 / 174
Overall, bias and variance with h ∼ n^{−1/(2+d)} yield

h² + 1/(nh^d) ≍ n^{−2/(2+d)}.
145 / 174
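The bandwidth tradeoff can be checked numerically. A sketch (not from the slides): minimize the proxy h² + 1/(nh^d) over a grid of bandwidths and confirm the minimized value tracks n^{−2/(2+d)}.

```python
import numpy as np

def best_tradeoff(n, d):
    """Minimize the bias^2 + variance proxy h^2 + 1/(n h^d) over a grid of bandwidths."""
    hs = np.logspace(-4, 0, 4000)
    return (hs ** 2 + 1.0 / (n * hs ** d)).min()

d = 3
ratios = []
for n in (10**4, 10**5, 10**6):
    val = best_tradeoff(n, d)
    ratios.append(val / n ** (-2 / (2 + d)))

# the minimized value scales as n^{-2/(2+d)}: the ratios are nearly constant
assert max(ratios) / min(ratios) < 1.1
```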
Summary
- We analyzed smoothing methods with kernels. As with nearest neighbors, slow (nonparametric) rates in large d.
- Same bias-variance decomposition approach as k-NN.
146 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
InterpolationOverfitting in Deep NetworksNearest neighborLocal Kernel SmoothingKernel and Linear Regression
147 / 174
Overfitting in Deep Networks
- Deep networks can be trained to zero training error (for regression loss)
- ... with near state-of-the-art performance (Ruslan Salakhutdinov, Simons Machine Learning Boot Camp, January 2017)
- ... even for noisy problems.
- No tradeoff between fit to training data and complexity!
- Benign overfitting.

(Zhang, Bengio, Hardt, Recht, Vinyals, 2017)
148 / 174
Overfitting in Deep Networks
Not surprising:

- More parameters than sample size. (That's called 'nonparametric.')
- Good performance with zero classification loss on training data. (c.f. margins analysis: trade-off between (regression) fit and complexity.)
- Good performance with zero regression loss on training data when the Bayes risk is zero. (c.f. Perceptron)

Surprising:

- Good performance with zero regression loss on noisy training data.
- All the label noise is making its way into the parameters, but it isn't hurting prediction accuracy!
149 / 174
Overfitting
“... interpolating fits... [are] unlikely to predict future data well at all.”
“Elements of Statistical Learning,” Hastie, Tibshirani, Friedman, 2001
150 / 174
Overfitting
151 / 174
Benign Overfitting
A new statistical phenomenon: good prediction with zero training error for regression loss.
I Statistical wisdom says a prediction rule should not fit too well.
I But deep networks are trained to fit noisy data perfectly, and they predict well.
152 / 174
Not an Isolated Phenomenon
[Figure: kernel regression on MNIST digit pairs [2,5], [2,9], [3,6], [3,8], [4,7]; log(error) on the y-axis against the regularization parameter λ on the x-axis.]
λ = 0: the interpolated solution, perfect fit on training data.
153 / 174
Not an Isolated Phenomenon
154 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
    Overfitting in Deep Networks
    Nearest neighbor
    Local Kernel Smoothing
    Kernel and Linear Regression
155 / 174
Simplicial interpolation (Belkin-Hsu-Mitra)
Observe: 1-Nearest Neighbor is interpolating. However, one cannot guarantee that EL(f_n) − L(f*) is small.
Cover-Hart '67:

lim_{n→∞} EL(f_n) ≤ 2L(f*)
156 / 174
Simplicial interpolation (Belkin-Hsu-Mitra)
Under regularity conditions, the simplicial interpolant f_n satisfies

lim sup_{n→∞} E‖f_n − f*‖² ≤ (2/(d+2)) · E(f*(X) − Y)²

Blessing of high dimensionality!
157 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
    Overfitting in Deep Networks
    Nearest neighbor
    Local Kernel Smoothing
    Kernel and Linear Regression
158 / 174
Nadaraya-Watson estimator (Belkin-R.-Tsybakov)
Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value τ at 0, e.g.

K(x) = min{1/‖x‖^α, τ}

Note that large τ means f_n(X_i) ≈ Y_i, since the weight W_i(X_i) is large.
If τ = ∞, we get interpolation f_n(X_i) = Y_i of all training data. Yet the earlier proof still goes through. Hence, "memorizing the data" (governed by the parameter τ) is completely decoupled from the bias-variance trade-off (governed by the bandwidth h).
Minimax rates are possible (with a suitable choice of h).
159 / 174
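The decoupling above is easy to see numerically. Below is a minimal sketch (helper names are illustrative, numpy only): with the truncated kernel K(x) = min{1/‖x‖^α, τ} and a large τ, the Nadaraya-Watson estimator essentially reproduces every training label, regardless of the bandwidth h.

```python
import numpy as np

def truncated_singular_kernel(u, alpha=1.0, tau=1e6):
    # K(x) = min{1/||x||^alpha, tau}; tau -> infinity gives exact interpolation
    norms = np.linalg.norm(u, axis=-1)
    with np.errstate(divide="ignore"):
        vals = np.where(norms > 0, 1.0 / norms**alpha, np.inf)
    return np.minimum(vals, tau)

def nadaraya_watson(x_query, X, Y, h=0.5, alpha=1.0, tau=1e6):
    # f_n(x) = sum_i W_i(x) Y_i, with W_i(x) proportional to K((x - X_i)/h)
    W = truncated_singular_kernel((x_query[:, None, :] - X[None, :, :]) / h,
                                  alpha=alpha, tau=tau)
    W = W / W.sum(axis=1, keepdims=True)
    return W @ Y

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)

# With tau very large, the estimator (nearly) interpolates the noisy labels,
# while the bandwidth h still controls behavior away from the data.
fit = nadaraya_watson(X, X, Y, h=0.5, tau=1e12)
print(np.max(np.abs(fit - Y)))
```

The choice h = 0.5 is arbitrary: changing it moves the bias-variance trade-off but leaves the near-interpolation at the training points intact, which is the point of the slide.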
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
    Overfitting in Deep Networks
    Nearest neighbor
    Local Kernel Smoothing
    Kernel and Linear Regression
160 / 174
Min-norm interpolation in RKHS
Min-norm interpolation:

f_n := argmin_{f∈H} ‖f‖_H  s.t.  f(x_i) = y_i, ∀i ≤ n.

Closed-form solution:

f_n(x) = K(x, X) K(X, X)^{−1} Y

Note: GD/SGD started from 0 converges to this minimum-norm solution.
Difficulty in analyzing the bias/variance of f_n: it is not clear how the random matrix K(X, X)^{−1} "aggregates" the Y's between data points.
161 / 174
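The closed-form solution above can be sketched directly. This is an illustrative numpy snippet with a hypothetical rbf_kernel helper standing in for any positive-definite kernel K:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), a standard positive-definite choice
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

# Min-norm interpolant: f_n(x) = K(x, X) K(X, X)^{-1} y
alpha = np.linalg.solve(rbf_kernel(X, X), y)

def f_n(x_query):
    return rbf_kernel(x_query, X) @ alpha

# f_n fits the training data exactly (up to numerical error).
print(np.max(np.abs(f_n(X) - y)))
```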
Min-norm interpolation in R^d with d ≫ n
(Hastie, Montanari, Rosset, Tibshirani, 2019)
Linear regression with p, n → ∞ and p/n → γ; the empirical spectral distribution of Σ (the discrete measure on its set of eigenvalues) converges to a fixed measure.
Apply random matrix theory to explore the asymptotics of the excess prediction error as γ, the noise variance, and the eigenvalue distribution vary.
Also study the asymptotics of a model involving random nonlinear features.
162 / 174
Implicit regularization in regime d ≫ n (Liang and R., 2018)
Informal Theorem: geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.
Bias bounded in terms of the effective dimension

dim(XX^⊤) = ∑_j λ_j(XX^⊤/d) / [γ + λ_j(XX^⊤/d)]²

where γ > 0 is an implicit regularization arising from the curvature of the kernel and high dimensionality.
Compare with regularized least squares as a low-pass filter:

dim(XX^⊤) = ∑_j λ_j(XX^⊤/d) / [λ + λ_j(XX^⊤/d)]

163 / 174
Implicit regularization in regime d ≫ n
Test error (y-axis) as a function of how quickly the spectrum decays. Best performance when the spectrum decays, but not too quickly:

λ_j(Σ) = (1 − (j/d)^κ)^{1/κ}
164 / 174
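The interplay between the bias bound and the decay of the spectrum can be sketched numerically. The helper name is hypothetical, and γ is taken as a fixed constant rather than derived from a kernel:

```python
import numpy as np

def effective_dimension(eigs, gamma):
    # dim = sum_j lambda_j / (gamma + lambda_j)^2, the bias bound's quantity
    return np.sum(eigs / (gamma + eigs) ** 2)

# Spectrum lambda_j = (1 - (j/d)^kappa)^{1/kappa}: kappa controls decay speed.
d = 1000
for kappa in [0.5, 1.0, 2.0]:
    j = np.arange(1, d + 1)
    eigs = (1 - (j / d) ** kappa) ** (1 / kappa)
    print(kappa, effective_dimension(eigs, gamma=0.1))
```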
Lower Bounds
(R. and Zhai, 2018)
The min-norm solution for the Laplace kernel

K_σ(x, x′) = σ^{−d} exp{ −‖x − x′‖/σ }

is not consistent if d is low:

Theorem. For odd dimension d, with probability 1 − O(n^{−1/2}), for any choice of σ,

E‖f_n − f*‖² ≥ Ω_d(1).

Conclusion: high dimensionality of the data is needed.
165 / 174
Connection to neural networks
For wide enough randomly initialized neural networks, GD dynamics quickly converge to an (approximately) min-norm interpolating solution with respect to a certain kernel.

f(x) = (1/√m) ∑_{i=1}^m a_i σ(⟨w_i, x⟩)

For square loss and ReLU σ,

∂L/∂w_i = (1/√m)(1/n) ∑_{j=1}^n (f(x_j) − y_j) a_i σ′(⟨w_i, x_j⟩) x_j

GD in continuous time:

(d/dt) w_i(t) = −∂L/∂w_i

Thus

(d/dt) f(x) = ∑_{i=1}^m (∂f(x)/∂w_i)^⊤ (dw_i/dt)
            = −(1/n) ∑_{j=1}^n (f(x_j) − y_j) [ (1/m) ∑_{i=1}^m a_i² σ′(⟨w_i, x⟩) σ′(⟨w_i, x_j⟩) ⟨x, x_j⟩ ]
            = −(1/n) ∑_{j=1}^n (f(x_j) − y_j) K_m(x, x_j),    where K_m(x, x_j) → K_∞(x, x_j).

166 / 174
Connection to neural networks
(Xie, Liang, Song, ’16), (Jacot, Gabriel, Hongler ’18), (Du, Zhai, Poczos, Singh ’18),
etc.
K_∞(x, x_j) = E[⟨x, x_j⟩ I{⟨w, x⟩ ≥ 0, ⟨w, x_j⟩ ≥ 0}]
See Jason’s talk.
167 / 174
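The limiting kernel above can be checked by Monte Carlo for Gaussian w, using the known closed form K_∞(x, x′) = ⟨x, x′⟩(π − θ)/(2π), where θ is the angle between x and x′ (a standard arc-cosine-kernel identity; the vectors and sample size below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200_000

# Two fixed unit vectors with a known angle (theta = 1 rad) between them.
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
xp = np.array([np.cos(1.0), np.sin(1.0), 0.0, 0.0, 0.0])

# Monte Carlo estimate of K_inf(x, x') = E[<x, x'> 1{<w,x> >= 0, <w,x'> >= 0}]
W = rng.standard_normal((m, d))
indicator = (W @ x >= 0) & (W @ xp >= 0)
k_mc = (x @ xp) * indicator.mean()

# Closed form for Gaussian w: <x, x'> (pi - theta) / (2 pi)
theta = np.arccos(np.clip(x @ xp, -1.0, 1.0))
k_exact = (x @ xp) * (np.pi - theta) / (2 * np.pi)
print(k_mc, k_exact)
```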
Linear Regression (B., Long, Lugosi, Tsigler, 2019)
I (x, y) Gaussian, mean zero.
I Define:

Σ := E xx^⊤ = ∑_i λ_i v_i v_i^⊤,  (assume λ_1 ≥ λ_2 ≥ · · ·)
θ* := argmin_θ E(y − x^⊤θ)²,
σ² := E(y − x^⊤θ*)².

I Estimator θ̂ = (X^⊤X)† X^⊤ y, which solves

min_θ ‖θ‖²  s.t.  ‖Xθ − y‖² = min_β ‖Xβ − y‖² = 0.

I When can the label noise be hidden in θ̂ without hurting predictive accuracy?
168 / 174
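A minimal numpy sketch of the estimator above (dimensions and noise level are illustrative): with d ≫ n, the min-norm solution interpolates the noisy labels exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 200  # overparameterized: d >> n

X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[0] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)  # noisy labels

# Min-norm interpolant: theta_hat = (X^T X)^dagger X^T y = X^dagger y
theta_hat = np.linalg.pinv(X) @ y

# Interpolation: the training residual is (numerically) zero despite the noise,
# so all the label noise has been absorbed into theta_hat.
print(np.max(np.abs(X @ theta_hat - y)))
```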
Linear Regression
Excess prediction error:

L(θ̂) − L* := E_{(x,y)}[(y − x^⊤θ̂)² − (y − x^⊤θ*)²] = (θ̂ − θ*)^⊤ Σ (θ̂ − θ*).

Σ determines how estimation error in different parameter directions affects prediction accuracy.
169 / 174
Linear Regression
Theorem. For universal constants b, c, and any θ*, σ², Σ, if λ_n > 0 then

(σ²/c) · min{ k*/n + n/R_{k*}(Σ), 1 } ≤ E L(θ̂) − L* ≤ c ( ‖θ*‖² √(tr(Σ)/n) + σ² ( k*/n + n/R_{k*}(Σ) ) ),

where k* = min{ k ≥ 0 : r_k(Σ) ≥ bn }.
170 / 174
Notions of Effective Rank
Recall that λ_1 ≥ λ_2 ≥ · · · are the eigenvalues of Σ.

Definition. Define the effective ranks

r_k(Σ) = (∑_{i>k} λ_i) / λ_{k+1},    R_k(Σ) = (∑_{i>k} λ_i)² / ∑_{i>k} λ_i².

Example. If rank(Σ) = p, we can write

r_0(Σ) = rank(Σ) s(Σ),    R_0(Σ) = rank(Σ) S(Σ),

with s(Σ) = (p⁻¹ ∑_{i=1}^p λ_i) / λ_1,    S(Σ) = (p⁻¹ ∑_{i=1}^p λ_i)² / (p⁻¹ ∑_{i=1}^p λ_i²).

Both s and S lie between 1/p (λ_2 ≈ 0) and 1 (λ_i all equal).
171 / 174
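The two effective ranks can be computed directly (the helper name is illustrative): they equal p for a flat spectrum and stay O(1) for a fast-decaying one.

```python
import numpy as np

def effective_ranks(eigs, k=0):
    # r_k = (sum_{i>k} lambda_i) / lambda_{k+1}
    # R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    tail = np.sort(eigs)[::-1][k:]
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

# Flat (isotropic) spectrum: both effective ranks equal the dimension p.
p = 100
r0, R0 = effective_ranks(np.ones(p))
print(r0, R0)  # 100.0 100.0

# Geometrically decaying spectrum: both effective ranks stay O(1).
r0_dec, R0_dec = effective_ranks(2.0 ** -np.arange(p))
print(r0_dec, R0_dec)
```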
Benign Overfitting in Linear Regression
Intuition
I To avoid harming prediction accuracy, the noise energy must be distributed across many unimportant directions:
I There should be many non-zero eigenvalues.
I They should have a small sum compared to n.
I There should be many eigenvalues no larger than λ_{k*} (the number depends on how asymmetric they are).
172 / 174
Overfitting?
I Fitting data “too” well versus
I Bias too low, variance too high.
Key takeaway: we should not conflate these two.
173 / 174
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
    Overfitting in Deep Networks
    Nearest neighbor
    Local Kernel Smoothing
    Kernel and Linear Regression
174 / 174