Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward...
Transcript of Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward...
![Page 1: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/1.jpg)
Introduction to Machine Learning and Deep
Learning
CNRS Formation Entreprise
Stephane Gaıffas
![Page 2: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/2.jpg)
Who are we ?
Stephane Gaıffas
Professor
LPSM, Universite de Paris
DMA, Ecole normale superieure
http://stephanegaiffas.github.io
![Page 3: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/3.jpg)
Who are we ?
![Page 4: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/4.jpg)
Webpage of the formation
https://stephanegaiffas.github.io/teaching/formation_cnrs
Practical sessions using Python 3 / jupyter notebook withscikit-learn and tensorflow with keras for deep learning
Tools
![Page 5: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/5.jpg)
How to get all the tools ? Simplest way is to use docker
Follow the steps described on the webpage:https://stephanegaiffas.github.io/teaching/formation_cnrs
![Page 6: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/6.jpg)
Agenda: Day 1
Morning Course
Introduction
Binary classification, Naive Bayes
Logistic regression
Standard metrics and recipes: overfitting, cross-validation
Regularization (Ridge, Lasso)
Afternoon practical session
Text classification
Scoring
![Page 7: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/7.jpg)
Agenda: Day 2
Morning Course
Decision trees, CART
Random Forests
Boosting, Gradient Boosting
Afternoon practical session
Prediction of load curves
Continuous labels / time series
![Page 8: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/8.jpg)
Agenda: Day 3
Morning Course
Introduction to Deep Learning
Feed-forward neural networks
Convolutional neural networks
Stochastic gradient descent, back-propagation
Regularization: dropout, penalization, early stopping
Afternoon practical session
Image classification: MNIST and NotMNIST
Comparison of different architectures
Text classification again
![Page 9: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/9.jpg)
Roundtable
![Page 10: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/10.jpg)
Teasers: data science or statistics?
![Page 11: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/11.jpg)
Teasers: an application in marketing (Real Time Bidding)
A customer visits a webpage with his browser: a complexprocess of content selection and delivery begins.
An advertiser might want to display an ad on the webpagewhere the user is going. The webpage belongs to a publisher.
The publisher sells ad space to advertisers who want to reachcustomers
In some cases, an auction starts: RTB (Real Time Bidding)
![Page 12: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/12.jpg)
Teasers: an application in marketing (Real Time Bidding)
Advertisers have 10ms (!) to give a price: they need toassess quickly how willing they are to display the ad to thiscustomer
Machine learning is used here to predict the probability ofclick on the ad. Time constraint: few model parameters toanswer quickly
Feature selection / dimension reduction is crucial here
Full process takes < 100ms
![Page 13: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/13.jpg)
Teasers: an application in marketing (Real Time Bidding)
Some figures:
10 million prediction of click probability per second
answers within 10ms
stores 20Terabytes of data daily
Aim
Based on past data, you want to find users that will click onsome ads
This problem can be formulated as a binary classificationproblem
![Page 14: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/14.jpg)
Classification = supervised learning with a binary label
Setting
You have past/historical data, containing data aboutindividuals i = 1, . . . , n
You have a features vector xi ∈ Rd for each individual i
For each i , you know if he/she clicked (yi = 1) or not(yi = −1)
We call yi ∈ {−1, 1} the label of i
(xi , yi ) are i.i.d realizations of (X ,Y )
Aim
Given a features vector x (with no corresponding label),predict a label y ∈ {−1, 1}Use data D = {(x1, y1), . . . , (xn, yn)} to construct a classifier
![Page 15: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/15.jpg)
Many ways to separate points!
![Page 16: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/16.jpg)
Today: model-based classification
Naive Bayes
Logistic regression
Penalization
Cross-validation
![Page 17: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/17.jpg)
Probabilistic / statistical approach
Model the distribution P(y |x) of y knowing or conditionallyon x (denoted y |x)
Construct estimators P(1|x) and P(−1|x) of
P(1|x) and P(−1|x) = 1− P(1|x)
Given x , classify using
y =
{1 if P(1|x) ≥ t
−1 otherwise
for some threshold t ∈ (0, 1)
![Page 18: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/18.jpg)
Bayes formula. We know that
P(y |x) =P(x |y)P(y)
P(x)=
P(x |y)P(y)∑y ′=−1,1 P(x |y ′)P(y ′)
If we know the distribution of x |y and the distribution of y , weknow the distribution of y |x
Bayes classifier. Classify using Bayes formula, given that:
We model P(x |y)
We are able to estimate P(x |y) based on data
![Page 19: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/19.jpg)
Maximum a posteriori. Classify using the discriminant functions
δ(y |x) = logP(x |y) + logP(y)
for y = 1,−1 and decide (largest, beyond a threshold, etc.)
Remark.
Different models on the distribution of x |y leads to differentclassifiers
The simplest one is the Naive Bayes
![Page 20: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/20.jpg)
Naive Bayes. A crude modeling for P(x |y): assume that features(coordinates) xj of x are independent conditionally on y :
P(x |y) =d∏
j=1
P(x j |y)
and model the univariate distribution xj |y , for instance
P(xj |y) = Normal(µj ,y , σ2j ,y ),
with parameters µj ,y and σ2j ,y easily estimated by averages
If a feature is discrete, use a Bernoulli or multinomialdistribution
Leads to a classifier which is very easy to compute
Requires only the computation of some averages
![Page 21: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/21.jpg)
Important notation. If a, b ∈ Rd two vectors (same dimension d)then
a>b = 〈a, b〉 =d∑
j=1
ajbj
is the inner product between a and b
Remark. The ”simplest” (non-constant) function Rd → R is alinear function
x 7→ x>w + b
with w ∈ Rd and b ∈ R
For d = 1 it’s simply a straight line
![Page 22: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/22.jpg)
Logistic regression
By far the most widely used classification algorithm
We want to explain the label y based on x , we want to“regress” y on x
Models the conditional distribution y |xFor y ∈ {−1, 1}, we consider the model
P(1|x) = σ(x>w + b)
where w ∈ Rd is a vector of model weights and b ∈ R is theintercept, and where σ is the sigmoid function
σ(z) =1
1 + e−z
![Page 23: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/23.jpg)
The sigmoid choice really is a choice. It is a modellingchoice.
It’s a way to map R→ [0, 1] (we want to model a probability)
We could also consider
P(1|x) = F (x>w + b)
for any distribution function F . Another popular choice is theGaussian distribution
F (z) = P(N(0, 1) ≤ z),
which leads to another loss called probit
![Page 24: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/24.jpg)
However, the sigmoid choice has the following niceinterpretation: an easy computation leads to
log( P(1|x)
P(−1|x)
)= x>w + b
This quantity is called the log-odd ratio
Note thatP(1|x) ≥ P(−1|x)
iffx>w + b ≥ 0.
This is a linear classification rule
Linear with respect to the considered features x
But, you choose the features: features engineering (moreon that later)
![Page 25: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/25.jpg)
Estimation of w and b
We have a model for y |xData (xi , yi ) is assumed i.i.d with the same distribution as(x , y)
Compute estimators w and b by maximum likelihoodestimation
Or equivalently, minimize the minus log-likelihood
More generally, when a model is used
Goodness-of-fit = −log likelihood
log is used mainly since averages are easier to study (and tocompute) than products
![Page 26: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/26.jpg)
For logistic regression, a computation lead to the formula
−loglikelihood(w , b) =n∑
i=1
log(1 + e−yi (x>i w+b))
So, in order to fit the logistic regression, we need to find
(w , b) ∈ argminw∈Rd ,b∈R
1
n
n∑i=1
log(1 + e−yi (x>i w+b))
(just added 1/n so that it’s an average)
![Page 27: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/27.jpg)
Compute w and b as follows:
(w , b) ∈ argminw∈Rd ,b∈R
1
n
n∑i=1
log(1 + e−yi (x>i w+b))
If we introduce the logistic loss function
`(y , y ′) = log(1 + e−yy′)
then
(w , b) ∈ argminw∈Rd ,b∈R
1
n
n∑i=1
`(yi , x>i w + b)
![Page 28: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/28.jpg)
A goodness-of-fit
(w , b) ∈ argminw∈Rd ,b∈R
1
n
n∑i=1
`(yi , x>i w + b)
is natural: it is an average of losses, one for each sample point
Note that
`(y , y ′) = log(1 + e−yy′) for logistic regression
`(y , y ′) = 12(y − y ′)2 for least-squares linear regression
These problems are convex and smooth
In most cases, training a model (such the logistic regression)means finding w and b that minimizes some function
Finding a minimizer requires to use an optimization algorithm (alsocalled solver)
![Page 29: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/29.jpg)
Training = Optimization
Training the model means minimizing some function to findmodel weights w and b
Usually done in machine learning using first order informationi.e. gradient (derivative) of the function ∇f (w)
Simplest algorithm is gradient descent, which performsiteratively
w t+1 ← w t − ηt∇f (w t)
where t is the iteration counter of the algorithm and ηt is asequence of step-sizes, also called learning rates
![Page 30: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/30.jpg)
Training = OptimizationMany solvers out there !
![Page 31: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/31.jpg)
Other classical loss functions for binary classication
Hinge loss (SVM), `(y , y ′) = (1− yy ′)+
Quadratic hinge loss (SVM), `(y , y ′) = 12(1− yy ′)2+
Huber loss `(y , y ′) = −4yy ′1yy ′<−1 + (1− yy ′)2+1yy ′≥−1
These losses can be understood as a convex approximation ofthe 0/1 loss `(y , y ′) = 1yy ′≤0
![Page 32: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/32.jpg)
Classifiers on toy datasets
Input data
.88
Naive Bayes
.88
Logistic regression
.70 .40
.95 .95
![Page 33: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/33.jpg)
Standard error metrics in classification
Precision, Recall, F-Score, AUC
For each sample i we have
an actual label yi
a predicted label yi
We can construct the confusion matrix
with
TP =∑n
i=1 1yi=1,yi=1
TN =∑n
i=1 1yi=−1,yi=−1FN =
∑ni=1 1yi=1,yi=−1
FP =∑n
i=1 1yi=−1,yi=1
with yes = 1 and no = −1
![Page 34: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/34.jpg)
Standard error metrics in classification
Precision =TP
#(predicted P)=
TP
TP + FP
Recall =TP
#(real P)=
TP
TP + FN
Accuracy =TP + TN
TP + TN + FP + FN
F-Score = 2Precision× Recall
Precision + Recall
Some vocabulary
Recall = Sensitivity
False-Discovery Rate FDR = 1− Precision
![Page 35: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/35.jpg)
ROC Curve (Receiver Operating Characteristic)
Based on the estimated probabilities pi ,1 = P(1|xi )Each point At of the curve has coordinates (FPRt ,TPRt),where FPRt and TPRt are FPR and TPR of the confusionmatrix obtained by the classification rule
yi =
{1 if pi ,1 ≥ t
−1 otherwise
for a threshold t varying in [0, 1]
AUC score is the Area Under the ROC Curve
![Page 36: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/36.jpg)
![Page 37: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/37.jpg)
Penalization to avoid overfittingComputing
w , b ∈ argminw∈Rd ,b∈R
1
n
n∑i=1
`(yi , 〈xi ,w〉+ b)
generally leads to a bad classifier. Minimize instead
w , b ∈ argminw∈Rd ,b∈R
{1
n
n∑i=1
`(yi , 〈xi ,w〉+ b) +1
Cpen(w)
}where
pen is a penalization function, it forbids w to be “toocomplex”
C > 0 is a tuning or smoothing parameter, that balancesgoodness-of-fit and penalization
![Page 38: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/38.jpg)
Penalization to avoid overfittingIn the problem
w , b ∈ argminw∈Rd ,b∈R
{1
n
n∑i=1
`(yi , 〈xi ,w〉+ b) +1
Cpen(w)
},
a well-chosen C > 0, allows to avoid overfitting
![Page 39: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/39.jpg)
Overfitting is what youwant to avoid
![Page 40: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/40.jpg)
Which penalization? The ridge penalization considers
pen(w) =1
2‖w‖22 =
1
2
d∑j=1
w2j
It penalizes the “magnitude” of w
This is the most widely used penalization
It’s nice and easy
It helps with correlated features
It actually helps training! (makes the problem even moreconvex...)
![Page 41: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/41.jpg)
There is another desirable property on w
If wj = 0, then feature j has no impact on the prediction:
y = sign(〈x , w〉+ b)
If we have many features (d is large), it would be nice if wcontained zeros, and many of them
Means that only few features are statistically relevant.
Means that only few features are useful to predict the label
Leads to a simpler model, with a “reduced” dimension
How to do it ?
![Page 42: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/42.jpg)
Tempting to use
w , b ∈ argminw∈Rd ,b∈ R
{1
n
n∑i=1
`(yi , 〈xi ,w〉+ b) +1
C‖w‖0
},
where‖w‖0 = #{j ∈ {1, . . . , d} : wj 6= 0}.
To solve this, explore all possible supports of w . Too long!(NP-hard)
![Page 43: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/43.jpg)
A convex proxy of ‖ · ‖0: the `1-norm
‖w‖1 =d∑
j=1
|wj |
A computation gives that
argminz ′∈R
{1
2(z ′ − z)2 + λ|z ′|
}= sign(z)(|z | − λ)+.
for λ > 0 and z ∈ R. Example with λ = 1
![Page 44: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/44.jpg)
Particular instances of problem
w , b ∈ argminw∈Rd ,b∈ R
{1
n
n∑i=1
`(yi , 〈xi ,w〉+ b) +1
Cpen(w)
},
For `(y , y ′) = 12(y − y ′)2 and pen(w) = 1
2‖w‖22, the problem is
called ridge regression
For `(y , y ′) = 12(y − y ′)2 and pen(w) = ‖w‖1, the problem is
called Lasso (Least absolute shrinkage and selection operator)
For `(y , y ′) = log(1 + e−yu′) and pen(w) = ‖w‖1, the problem is
called `1-penalized logistic regression
Many combinations possible...
![Page 45: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/45.jpg)
The combinations
(linear regression or logistic) + (ridge or `1)
are the most wildely used
Another penalization is
pen(w) =1
2‖w‖22 + α‖w‖1
called elastic-net, benefits from both the advantages of ridge and`1 penalization (where α ≥ 0 balances the two)
![Page 46: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/46.jpg)
Cross-validation
Generalization is the goal of supervised learning
A trained classifier has to be generalizable. It must be ableto work on other data than the training dataset
Generalizable means “works without overfitting”
This can be achieved using cross-validation
There is no machine learning without cross-validation atsome point!
In the case of penalization, we need to choose a penalizationparameter C that leads to a model that ”generalizes” well
![Page 47: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/47.jpg)
V-Fold cross-validation
Most standard cross-validation technique
Take V = 5 or V = 10. Pick a random partition I1, . . . , IV of{1, . . . , n}, where |Iv | ≈ n
V for any v = 1, . . . ,V
![Page 48: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/48.jpg)
Consider a setC = {C1, . . .CK}
of possible values for C . For each v = 1, . . . ,V
Put Iv ,train = ∪v ′ 6=v Iv ′ and Iv ,test = Iv
For each C ∈ C, find
wv ,C ∈ argminw
{ 1
|Iv ,train|∑
i∈Iv,train
`(yi , 〈xi ,w〉) +1
Cpen(w)
}Take
C ∈ argminC∈C
V∑v=1
∑i∈Iv,test
`(yi , 〈xi , wv ,C 〉)
Remark: depending on the problem, we might use a different loss(or score) for choosing C
![Page 49: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/49.jpg)
Training error:
C 7→V∑
v=1
∑i∈Iv,train
`(yi , 〈xi , wv ,C 〉)
Testing, validation or cross-validation error:
C 7→V∑
v=1
∑i∈Iv,test
`(yi , 〈xi , wv ,C 〉)
![Page 50: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/50.jpg)
This afternoon: practical session
Text classification
Scoring
![Page 51: Introduction to Machine Learning and Deep Learning...Introduction to Deep Learning Feed-forward neural networks Convolutional neural networks Stochastic gradient descent, back-propagation](https://reader036.fdocuments.us/reader036/viewer/2022081400/609dc11759236a34654f38ad/html5/thumbnails/51.jpg)
Thank you!