Class 1: Introduction to Statistical Learning Theory
Carlo Ciliberto, Department of Computer Science, UCL
October 5, 2018
Administrative Info
- Class times: Fridays 14:00 - 15:30 (sometimes Wednesday though! See the online syllabus.)
- Location: Ground Floor Lecture Theater, Wilkins Building (it will vary over the term! See online.)
- Office hours: (time TBA), 3rd Floor Hub room, CS Building, 66 Gower Street
- TA: Giulia Luise
- Website: cciliber.github.io/intro-stl
- email(s): [email protected], [email protected]
- Workload: 2 assignments (50%) and a final exam (50%). The final exam requires choosing 3 problems out of 6; at least one problem from each "side" of this course (RKHS or SLT) *must* be chosen.
Course Material
Main resources for the course:
- Classes
- Slides

Books and other resources:
- S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms (online book). Cambridge University Press, 2014.
- O. Bousquet, S. Boucheron and G. Lugosi. Introduction to Statistical Learning Theory (tutorial).
- T. Poggio and L. Rosasco. Course slides and videos from MIT 9.520: Statistical Learning Theory and Applications.
- P. Liang. Course notes from Stanford CS229T: Statistical Learning Theory.
Prerequisites
- Linear Algebra: familiarity with vector spaces, matrix operations (e.g. inversion, singular value decomposition (SVD)), inner products and norms, etc.
- Calculus: limits, derivatives, measures, integrals, etc.
- Probability Theory: probability distributions, conditional and marginal distributions, expectation, variance, etc.
Statistical Learning Theory (SLT)
SLT addresses questions related to:
- What it means for an algorithm to learn.
- What we can/cannot expect from a learning algorithm.
- How to design computationally and statistically efficient algorithms.
- What to do when a learning algorithm does not work...

SLT studies theoretical quantities that we do not have direct access to: it tries to bridge the gap between the unknown functional relations governing a process and our (finite) empirical observations of it.
Motivations and Examples: Regression
Image credits: coursera
Motivations and Examples: Binary Classification
Spam detection: Automatically discriminate spam vs non-spam e-mails.
Image Classification
Motivations and Examples: Multi-class Classification
Identify the category of the object depicted in an image.
Example: Caltech 101
Image Credits: Anna Bosch and Andrew Zisserman
Motivations and Examples: Multi-class Classification
Scaling things up: detect the correct object among thousands of categories. ImageNet Large Scale Visual Recognition Challenge.
http://www.image-net.org/ - Image Credits to Fengjun Lv
Motivations and Examples: Structured Prediction
Formulating The Learning Problem
Formulating the Learning Problem
Main ingredients:
- X input and Y output spaces.
- ρ an unknown distribution on X × Y.
- ℓ : Y × Y → R a loss function measuring the discrepancy ℓ(y, y′) between any two points y, y′ ∈ Y.
We would like to minimize the expected risk

$$\operatorname*{minimize}_{f : X \to Y} \ \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{X \times Y} \ell(f(x), y) \, d\rho(x, y)$$

the expected prediction error incurred by a predictor f : X → Y (only measurable predictors are considered).
Input Space
Linear Spaces
- Vectors
- Matrices
- Functions

"Structured" Spaces
- Strings
- Graphs
- Probabilities
- Points on a manifold
- ...
Output Space
Linear Spaces, e.g.
- Y = R: regression
- Y = {1, ..., T}: classification
- Y = R^T: multi-task

"Structured" Spaces, e.g.
- Strings
- Graphs
- Probabilities
- Orders (i.e. ranking)
- ...
Probability Distribution
Informally: the distribution ρ on X × Y encodes the probability of getting a pair (x, y) ∈ X × Y when observing (sampling from) the unknown process.

Throughout the course we will assume the factorization ρ(x, y) = ρ(y|x) ρ_X(x), where:
- ρ_X(x) is the marginal distribution on X.
- ρ(y|x) is the conditional distribution on Y given x ∈ X.
Conditional Distribution
ρ(y|x) characterizes the relation between a given input x and the possible outcomes y that could be observed. In noisy settings it represents the uncertainty in our observations.

Example: y = f*(x) + ε, with f* : X → R the "true" function and ε ∼ N(0, σ) Gaussian-distributed noise. Then:

$$\rho(y|x) = \mathcal{N}(f^*(x), \sigma)$$
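This noise model can be made concrete with a short simulation (a sketch with arbitrary choices of f*(x) = sin(x), σ = 0.1 and a uniform marginal ρ_X, none of which are specified in the slides):

```python
import math
import random

def f_star(x):
    # Hypothetical "true" function (an arbitrary choice for illustration).
    return math.sin(x)

def sample_pair(sigma=0.1, rng=random):
    # Draw x from the marginal rho_X (here: uniform on [0, 1])...
    x = rng.uniform(0.0, 1.0)
    # ...then y from the conditional rho(y|x) = N(f*(x), sigma).
    y = f_star(x) + rng.gauss(0.0, sigma)
    return x, y

if __name__ == "__main__":
    random.seed(0)
    for x, y in (sample_pair() for _ in range(5)):
        print(f"x = {x:.3f}, y = {y:.3f}, f*(x) = {f_star(x):.3f}")
```

Each observed pair (x, y) is thus one sample from ρ; the learner never sees f* or σ directly.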
Loss Functions
The loss function

$$\ell : Y \times Y \to [0, +\infty)$$

represents the cost ℓ(f(x), y) incurred when predicting f(x) instead of y.

It is part of the problem formulation:

$$\mathcal{E}(f) = \int \ell(f(x), y) \, d\rho(x, y)$$

The minimizer of the risk (if it exists) is "chosen" by the loss.
Loss Functions for Regression
Losses of the form L(y, y′) = L(y − y′):
- Square loss: L(y, y′) = (y − y′)²
- Absolute loss: L(y, y′) = |y − y′|
- ε-insensitive loss: L(y, y′) = max(|y − y′| − ε, 0)

[Figure: the square, absolute and ε-insensitive losses plotted as functions of y − y′. Image credits: Lorenzo Rosasco.]
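The three regression losses above can be written in a few lines (a minimal sketch; the function names and the default `eps` are my choices):

```python
def square_loss(y, y_pred):
    # L(y, y') = (y - y')^2
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    # L(y, y') = |y - y'|
    return abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.5):
    # L(y, y') = max(|y - y'| - eps, 0); deviations below eps cost nothing.
    return max(abs(y - y_pred) - eps, 0.0)
```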
Loss Functions for Classification
Losses of the form L(y, y′) = L(−yy′):
- 0-1 loss: L(y, y′) = 1_{−yy′ > 0}
- Square loss: L(y, y′) = (1 − yy′)²
- Hinge loss: L(y, y′) = max(1 − yy′, 0)
- Logistic loss: L(y, y′) = log(1 + exp(−yy′))

[Figure: the 0-1, square, hinge and logistic losses plotted as functions of the margin yy′. Image credits: Lorenzo Rosasco.]
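As a minimal sketch (function names are my own), the 0-1 loss and its convex surrogates for a label y ∈ {−1, +1} and a real-valued prediction y′:

```python
import math

def zero_one_loss(y, y_pred):
    # 1 when -y*y_pred > 0 (i.e. the signs disagree), 0 otherwise.
    return 1.0 if -y * y_pred > 0 else 0.0

def square_loss_margin(y, y_pred):
    # (1 - y*y_pred)^2
    return (1.0 - y * y_pred) ** 2

def hinge_loss(y, y_pred):
    # max(1 - y*y_pred, 0)
    return max(1.0 - y * y_pred, 0.0)

def logistic_loss(y, y_pred):
    # log(1 + exp(-y*y_pred))
    return math.log(1.0 + math.exp(-y * y_pred))
```

The surrogates are convex in the margin yy′, which is what makes them amenable to optimization, unlike the 0-1 loss.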
Formulating the Learning Problem
The relation between X and Y encoded by the distribution ρ is unknown in reality. The only way we have to access a phenomenon is through finite observations.

The goal of a learning algorithm is therefore to find a good approximation f_n : X → Y of the minimizer of the expected risk

$$\inf_{f : X \to Y} \mathcal{E}(f)$$

from a finite set of examples (x_i, y_i)_{i=1}^n sampled independently from ρ.
Defining Learning Algorithms
Let S = ⋃_{n∈N} (X × Y)^n be the set of all finite datasets on X × Y. Denote by F the set of all measurable functions f : X → Y. A learning algorithm is a map

$$A : S \to F, \qquad S \mapsto A(S) : X \to Y$$

To highlight our interest in studying the relation between the size of a training set S = (x_i, y_i)_{i=1}^n and the corresponding predictor produced by an algorithm A, we will often denote (with some abuse of notation)

$$f_n = A\big((x_i, y_i)_{i=1}^n\big)$$
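To make the abstraction concrete, here is a toy deterministic algorithm A (my illustration, not part of the slides): one-dimensional least squares, viewed as a map from a finite dataset S to a predictor f_n = A(S).

```python
def least_squares_1d(dataset):
    """A learning algorithm A: dataset (x_i, y_i)_{i=1}^n -> predictor f_n."""
    n = len(dataset)
    mean_x = sum(x for x, _ in dataset) / n
    mean_y = sum(y for _, y in dataset) / n
    # Closed-form slope and intercept of the least-squares line.
    var_x = sum((x - mean_x) ** 2 for x, _ in dataset)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in dataset)
    slope = cov_xy / var_x if var_x > 0 else 0.0
    intercept = mean_y - slope * mean_x

    def f_n(x):
        return slope * x + intercept

    return f_n

# Usage: f_n = A((x_i, y_i)_{i=1}^n), then f_n can be evaluated at new inputs.
f_n = least_squares_1d([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

Note that the algorithm returns a *function*: the output of A lives in F, not in X or Y.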
Non-deterministic Learning Algorithms
We can also consider stochastic algorithms, where the estimator f_n is not deterministically determined by the training set.

In these cases, given a dataset S ∈ S, the algorithm A(S) can be seen as a distribution on F, and its output is one sample from A(S).

Under this interpretation, a deterministic algorithm corresponds to A(S) being a Dirac delta.
Formulating the Learning Problem
Given a training set, we would like a learning algorithm to find a "good" predictor f_n.

What does "good" mean? That it has a small error (or excess risk) with respect to the best solution of the learning problem:

$$\text{Excess Risk:} \quad \mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f)$$
The Elements of Learning Theory
Consistency
Ideally we would like the learning algorithm to be consistent:

$$\lim_{n \to +\infty} \ \mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f) = 0$$

Namely, that (asymptotically) our algorithm "solves" the problem.

However, f_n = A(S) is a random variable: the points in the training set S = (x_i, y_i)_{i=1}^n are randomly sampled from ρ.

So what do we mean by E(f_n) → inf E(f)?
Convergence of Random Variables
Convergence in expectation:

$$\lim_{n \to +\infty} \mathbb{E}\left[\mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f)\right] = 0$$

Convergence in probability:

$$\lim_{n \to +\infty} \mathbb{P}\left(\mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f) > \varepsilon\right) = 0 \quad \forall \varepsilon > 0$$

Many other notions of convergence of random variables exist!
Consistency vs Convergence of the Estimator
Note that we are only interested in guaranteeing that the risk of our estimator converges to the best possible value,

$$\mathcal{E}(f_n) \to \inf_{f \in F} \mathcal{E}(f),$$

but we are not directly interested in determining whether f_n → f* (in some norm), where f* : X → Y is a minimizer of the expected risk:

$$\mathcal{E}(f^*) = \inf_{f : X \to Y} \mathcal{E}(f)$$

In fact, the risk might not even admit a minimizer f* (although typically it will).

This is a main difference from settings such as compressive sensing and inverse problems.
Existence of a Minimizer for the Risk
However, the existence of f* can be useful in several situations.

Least squares: ℓ(f(x), y) = (f(x) − y)². Then

$$\mathcal{E}(f) - \mathcal{E}(f^*) = \|f - f^*\|^2_{L^2(X, \rho_X)}$$

Lipschitz loss: |ℓ(z, y) − ℓ(z′, y)| ≤ L|z − z′|. Then

$$\mathcal{E}(f) - \mathcal{E}(f^*) \leq L \, \|f - f^*\|_{L^1(X, \rho_X)}$$

Convergence f_n → f* (in the L² or L¹ norm, respectively) automatically guarantees consistency!
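For the least-squares case, a short (standard) derivation can be sketched using the fact that, for the square loss, f*(x) = ∫_Y y dρ(y|x) is the conditional mean:

```latex
\mathcal{E}(f)
  = \int \big(f(x) - f^*(x) + f^*(x) - y\big)^2 \, d\rho(x, y)
  = \|f - f^*\|^2_{L^2(X,\rho_X)}
  + 2 \int \big(f(x) - f^*(x)\big)\big(f^*(x) - y\big) \, d\rho(x, y)
  + \mathcal{E}(f^*)
```

The cross term vanishes after integrating y against ρ(y|x), since f*(x) = ∫_Y y dρ(y|x), which leaves exactly the least-squares identity above.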
Measuring the “Quality” of a Learning Algorithm
Is consistency enough? No: it does not provide a quantitative measure of how "good" a learning algorithm is.

In other words: how do we compare two learning algorithms?

Answer: via their learning rates, namely the "speed" at which the excess risk goes to zero as n increases.

Example (rate in expectation):

$$\mathbb{E}\left[\mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f)\right] = O(n^{-\alpha}) \quad \text{for some } \alpha > 0.$$

We can compare two algorithms by determining which one has the faster learning rate (i.e. the larger exponent α).
Sample Complexity, Error Bounds and Tail Bounds
Sample Complexity: the minimum number n(ε, δ) of training points the algorithm needs to achieve an excess risk lower than ε with probability at least 1 − δ:

$$\mathbb{P}\left(\mathcal{E}(f_{n(\varepsilon,\delta)}) - \inf_{f \in F} \mathcal{E}(f) \leq \varepsilon\right) \geq 1 - \delta$$

Error Bounds: an upper bound ε(δ, n) > 0 on the excess risk of f_n which holds with probability at least 1 − δ:

$$\mathbb{P}\left(\mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f) \leq \varepsilon(\delta, n)\right) \geq 1 - \delta$$

Tail Bounds: an upper bound δ(ε, n) ∈ (0, 1) on the probability that f_n will have excess risk larger than ε:

$$\mathbb{P}\left(\mathcal{E}(f_n) - \inf_{f \in F} \mathcal{E}(f) \leq \varepsilon\right) \geq 1 - \delta(\varepsilon, n)$$
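For a loss bounded in [0, 1] and a *fixed* predictor f, Hoeffding's inequality gives P(|E_n(f) − E(f)| > ε) ≤ 2 exp(−2nε²), so a sufficient sample size is n ≥ log(2/δ)/(2ε²). A small helper illustrating this flavor of sample-complexity computation (a sketch; it controls the deviation for a fixed f, not the excess risk of a learned f_n):

```python
import math

def hoeffding_sample_size(eps, delta):
    # Smallest n such that 2 * exp(-2 * n * eps**2) <= delta, i.e. a
    # sufficient sample size for P(|E_n(f) - E(f)| > eps) <= delta when
    # the loss is bounded in [0, 1] and f is fixed (Hoeffding's inequality).
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. deviation at most 0.05 with confidence 99%:
n = hoeffding_sample_size(0.05, 0.01)  # 1060 points suffice
```

Note the logarithmic dependence on 1/δ: higher confidence is cheap, while higher accuracy (smaller ε) is quadratically expensive.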
Empirical Risk as a Proxy
If ρ is unknown... how can we say anything about E(f_n) − inf_{f∈F} E(f)?

We have "glimpses" of ρ only via the samples (x_i, y_i)_{i=1}^n. Can we use them to gather some information about ρ (or better, about E(f))?

Consider a function f : X → Y and its empirical risk

$$\mathcal{E}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$$

A simple calculation shows that

$$\mathbb{E}_{S \sim \rho^n}\big[\mathcal{E}_n(f)\big] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \rho}\big[\ell(f(x_i), y_i)\big] = \frac{1}{n} \sum_{i=1}^n \mathcal{E}(f) = \mathcal{E}(f)$$

The expectation of E_n(f) is the expected risk E(f)!
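This unbiasedness can be checked numerically (a simulation sketch with a toy distribution of my choosing, not from the slides): take x ∼ Uniform[0, 1], y = x, the square loss and the fixed predictor f ≡ 0, so that E(f) = ∫₀¹ x² dx = 1/3 exactly.

```python
import random

def empirical_risk(f, dataset, loss):
    # E_n(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return sum(loss(f(x), y) for x, y in dataset) / len(dataset)

square = lambda y_pred, y: (y_pred - y) ** 2
zero_predictor = lambda x: 0.0

def sample_dataset(n):
    # x ~ Uniform[0, 1], y = x: a noiseless toy distribution rho.
    return [(x, x) for x in (random.random() for _ in range(n))]

if __name__ == "__main__":
    random.seed(0)
    # Average E_n(f) over many independent training sets of size 20:
    risks = [empirical_risk(zero_predictor, sample_dataset(20), square)
             for _ in range(2000)]
    print(sum(risks) / len(risks))  # close to the expected risk E(f) = 1/3
```

Averaging over many independent training sets recovers E(f), even though each individual E_n(f) fluctuates around it.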
Empirical Vs Expected
How close is E_n(f) to E(f), with respect to the number n of training points?

Consider i.i.d. random variables X and (X_i)_{i=1}^n, and let X̄_n = (1/n) ∑_{i=1}^n X_i. Then

$$\mathbb{E}\big[(\bar{X}_n - \mathbb{E}(X))^2\big] = \mathrm{Var}(\bar{X}_n) = \frac{\mathrm{Var}(X)}{n}$$

Therefore the expected (squared) distance between the empirical mean of the X_i and their expectation E(X) goes to zero as O(1/n) (assuming X has finite variance).

If X_i = ℓ(f(x_i), y_i), we have X̄_n = E_n(f) and therefore

$$\mathbb{E}\big[(\mathcal{E}_n(f) - \mathcal{E}(f))^2\big] = \frac{\mathrm{Var}(\ell(f(x), y))}{n}$$
Empirical Vs Expected Risk
If X_i = ℓ(f(x_i), y_i), we have X̄_n = E_n(f) and therefore

$$\mathbb{E}\big[(\mathcal{E}_n(f) - \mathcal{E}(f))^2\big] = \frac{\mathrm{Var}(\ell(f(x), y))}{n}$$

In particular, by Jensen's inequality,

$$\mathbb{E}\big[|\mathcal{E}_n(f) - \mathcal{E}(f)|\big] \leq \sqrt{\frac{\mathrm{Var}(\ell(f(x), y))}{n}}$$
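The O(1/n) decay of the squared deviation can also be observed empirically (a simulation sketch with a toy model of my choosing: x ∼ Uniform[0, 1], y = x, f ≡ 0 and the square loss, so each loss value is x², with E(f) = 1/3 and Var(ℓ) = E[x⁴] − (1/3)² = 1/5 − 1/9 = 4/45):

```python
import random

def empirical_risk_of_zero_predictor(n):
    # Toy model (an arbitrary choice): x ~ Uniform[0, 1], y = x,
    # predictor f = 0 and square loss, so each loss value is x^2.
    # Hence E(f) = 1/3 and Var(loss) = 4/45.
    return sum(random.random() ** 2 for _ in range(n)) / n

if __name__ == "__main__":
    random.seed(0)
    expected_risk = 1.0 / 3.0
    true_var = 4.0 / 45.0
    for n in (10, 100):
        sq_devs = [(empirical_risk_of_zero_predictor(n) - expected_risk) ** 2
                   for _ in range(2000)]
        mse = sum(sq_devs) / len(sq_devs)
        # The simulated mse should track Var(loss)/n, i.e. decay as O(1/n).
        print(n, round(mse, 6), round(true_var / n, 6))
```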
Empirical Vs Expected
Assume for simplicity that there exists a minimizer f* : X → Y of the expected risk:

$$\mathcal{E}(f^*) = \inf_{f \in F} \mathcal{E}(f)$$

For any function f : X → Y we can decompose the excess risk as

$$\mathcal{E}(f) - \mathcal{E}(f^*) = \big(\mathcal{E}(f) - \mathcal{E}_n(f)\big) + \big(\mathcal{E}_n(f) - \mathcal{E}_n(f^*)\big) + \big(\mathcal{E}_n(f^*) - \mathcal{E}(f^*)\big),$$

recalling the definition E_n(f) := (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i) of the empirical risk.

Note that this in particular also holds for f_n, which we will use below. We can therefore leverage the statistical relation between E_n and E to study the expected risk in terms of the empirical risk.

This perspective leads to one of the most well-established strategies in SLT: Empirical Risk Minimization.
Empirical Risk Minimization
Let f_n be the minimizer of the empirical risk:

$$f_n = \operatorname*{argmin}_{f \in F} \mathcal{E}_n(f)$$

Then we automatically have E_n(f_n) − E_n(f*) ≤ 0 (for any choice of training set). Hence

$$\mathbb{E}\big[\mathcal{E}(f_n) - \mathcal{E}(f^*)\big] \leq \mathbb{E}\big[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)\big] \quad \text{(why?)}$$

We can therefore focus on studying only the generalization error

$$\mathbb{E}\big[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)\big]$$
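The "(why?)" can be unpacked by applying the excess-risk decomposition to f = f_n (sketched here for completeness):

```latex
\mathcal{E}(f_n) - \mathcal{E}(f^*)
  = \underbrace{\mathcal{E}(f_n) - \mathcal{E}_n(f_n)}_{\text{generalization error}}
  + \underbrace{\mathcal{E}_n(f_n) - \mathcal{E}_n(f^*)}_{\leq\, 0 \text{ by definition of } f_n}
  + \underbrace{\mathcal{E}_n(f^*) - \mathcal{E}(f^*)}_{\text{zero in expectation}}
```

Taking expectations, the second term is nonpositive and the third vanishes, since f* is fixed and E[E_n(f*)] = E(f*); the generalization error is all that remains.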
Generalization Error
How can we control the generalization error

$$\mathcal{E}_n(f_n) - \mathcal{E}(f_n)$$

with respect to the number n of examples?

This question is far from trivial (and it is one of the main subjects of SLT).

Indeed, E_n and f_n both depend on the sampled training data. Therefore, we cannot invoke the bound

$$\mathbb{E}\big[\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)|\,\big] \leq O(1/\sqrt{n}),$$

which indeed does not hold in general... (next class).
A Taxonomy of Supervised Learning Problems
A Taxonomy of Supervised Learning Problems
In practice we can have many different problems and scenarios:
- Parametric vs non-parametric learning
- Fixed design vs random design
- Transductive vs inductive learning
- Offline/batch vs online/adversarial learning

Different goals and assumptions, but similar tools to study/solve them!
Parametric Vs Non-parametric
How much do we know about the model?
- Parametric: assume the predictor is modeled by a finite number of unknown parameters. Goal: find the parametrization that best fits the observed data. In several scenarios the goal is not (only) to make good predictions, but rather to use the recovered model for other purposes (e.g. identification).
- Non-parametric: allow the parametrization of the model to grow in complexity as more examples are observed. Goal: find an estimator with optimal generalization performance (i.e. lowest expected risk E).
Fixed Design Vs Random Design
From experiment design...
- Fixed design: given training examples (x_i, y_i)_{i=1}^n, the goal is to achieve good estimates of ρ(y|x_i) at the prescribed training inputs. No distribution ρ_X on the input data is assumed/considered; the relevant risk is

$$\frac{1}{n} \sum_{i=1}^n \int_Y \ell(f(x_i), y) \, d\rho(y|x_i)$$

- Random design: agnostic about where the learned model will be tested. The goal is to make good predictions with respect to the distribution ρ(x, y).
Inductive Vs Transductive Learning
Do we have access to the test set in advance?
- Transductive: the goal is to achieve good prediction performance on a prescribed set of test points (x̃_j)_{j=1}^{n_test} provided in advance. Transductive learning ignores the effect of ρ_X on the risk and focuses only on

$$\frac{1}{n_{\mathrm{test}}} \sum_{j=1}^{n_{\mathrm{test}}} \int_Y \ell(f(\tilde{x}_j), y) \, d\rho(y|\tilde{x}_j)$$

- Inductive: agnostic about where the learned model will be tested. The goal is to make good predictions with respect to the distribution ρ(x, y).
Offline/Batch Vs Online/Adversarial Learning
How do we observe samples from ρ?
- Offline/batch: a finite sample of input-output examples, independently and identically distributed. Goal: minimize prediction errors on new examples.
- Online/adversarial: we observe one input, propose a prediction, and then observe the output. Goal: minimize the regret (i.e. compete with the estimator that would have made the fewest mistakes).

Note: the distribution could be adversarial; ρ(y|x, f(x)) instead of ρ(y|x) can make things "hard" for us.
Wrapping up
This class:
- Motivations and examples
- Formulating the learning problem
- Brief introduction to learning theory
- A taxonomy of supervised learning problems

Next class: overfitting and the need for regularization...