Transcript of LSML19: Introduction to Large Scale Machine Learning
LSML19: Introduction to Large Scale Machine Learning
MINES ParisTech — March 2019
Chloé-Agathe Azencott, Centre for Computational Biology, MINES ParisTech
Practical information
● Course website: http://cazencott.info > Teaching > LSML 19
● Schedule:
– Morning sessions: 09:30 – 12:30
– Afternoon sessions: 14:00 – 17:00
● Lectures are here in L.118
● Practicals are in L.117, L.119, L.120
● Grade:
– 60% exam (Friday, 14:00)
– 40% RAMP
Today's goals
● Review what machine learning is
● Understand scalability issues
● Discover how to accelerate gradient descent with stochastic gradient descent
Acknowledgements
Slides inspired by Ala Al-Fuqaha, Ethem Alpaydın, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert.
Why machine learning?
Business Insider: 2017 is the year of machine learning
● Improving doctors
● Assisting lawyers
● Making cars drive themselves
Perception
Communication
Chatbot example from www.supportbots.ai. Text courtesy of an Eddie Izzard show from 1999 called Dressed to Kill.
Reasoning
Diagnosis
Scientific discovery
LHC image: Anna Pantelia/CERN
A common thread: ML
● Using algorithms to build models from example (training) data.
● Statistics + optimization + computer science
Empirical risk minimization
● Ingredients:
– Data
– Hypothesis class: the shape of the decision function f
– Loss function: the cost/error of f on one data point
● Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (the empirical risk). A minimal sketch follows below.
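As a toy illustration of this recipe (my own sketch, not from the slides): take the hypothesis class of constant predictors f(x) = c with the quadratic loss; the empirical risk minimizer is then the sample mean.

```python
import numpy as np

# Toy ERM: hypothesis class = constant predictors f(x) = c, quadratic loss.
# Scanning candidate constants recovers the sample mean as the minimizer.
y = np.array([1.0, 2.0, 4.0, 7.0])
candidates = np.linspace(0.0, 10.0, 1001)
risks = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
print(candidates[risks.argmin()], y.mean())  # both 3.5
```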
(Un)supervised learning setting
● Data matrix (design matrix) X of size n × p: n observations (samples, data points) described by p features (variables, descriptors, attributes).
● Supervision: an outcome (target, label) matrix y of size n × t.
● Binary classification: y ∈ {0, 1}ⁿ. Multi-class classification: y ∈ {1, …, C}ⁿ. Regression: y ∈ Rⁿ.
● Typical sizes:
– Iris dataset: n = 150, p = 4, t = 1.
– Cancer drug sensitivity: n = 10³, p = 10⁶, t = 100.
– ImageNet: n = 14·10⁶, p = 6·10³, t = 22·10³.
– Shopping, e-marketing...: n = O(10⁶), p = O(10⁹), t = O(10⁸).
– Astronomy, GAFAs, web...: n, p, t = O(10⁹).
Scaling ML algorithms
● What is large scale?
– Data does not fit in RAM;
– Data streams;
– Algorithms do not run in a reasonable time on a single machine.
● Important considerations:
– Performance increases with the number of samples.
– Likelihood to overfit increases with the number of features.
A brief zoo of ML problems
Unsupervised learning
Learn a new representation of the data.
[Figure: a data matrix X (n × p) of images, text, measurements, omics data... is fed to an ML algorithm, which outputs transformed data.]
Dimensionality reduction
Find a lower-dimensional representation.
[Figure: a data matrix X (n × p) of images, text, measurements, omics data... is fed to an ML algorithm, which outputs a reduced matrix of size n × m, with m < p.]
Clustering
Group similar data points together.
[Figure: data is fed to an ML algorithm, which outputs groups of similar points.]
Unsupervised learning
● Dimensionality reduction, e.g. PCA
● Clustering, e.g. k-means
● Density estimation
● Feature learning
Supervised learning
Make predictions.
[Figure: a data matrix X (n × p) and labels y (length n) are fed to an ML algorithm, which outputs a predictor (decision function).]
Classification
Make discrete predictions: binary classification, multi-class classification.
[Figure: data and labels are fed to an ML algorithm, which outputs a predictor.]
Regression
Make continuous predictions.
[Figure: data and labels are fed to an ML algorithm, which outputs a predictor.]
Supervised learning
● Regression: ordinary least squares, ridge regression
● Classification: logistic regression, SVM
● Structured output prediction
Main ML paradigms
● Unsupervised learning:
– Dimensionality reduction;
– Clustering;
– Density estimation;
– Feature learning.
● Supervised learning:
– Regression;
– Classification;
– Structured output prediction.
● Semi-supervised learning.
● Reinforcement learning.
Dimensionality reduction: Principal Components Analysis
Principal Components Analysis
● Objective:
– Reduce the dimension without losing the variability in the data;
– Find a low-dimensional space such that the variance of the data projected onto that space is maximized.
● The k-th principal component w_k:
– is orthogonal to all previous components;
– captures the largest amount of remaining variance.
– Solution: w_k is the k-th eigenvector of the covariance matrix X^T X.
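A minimal numpy sketch of this eigendecomposition view of PCA (my own illustration; function and variable names are not from the course):

```python
import numpy as np

def pca(X, K):
    """Top-K principal components via eigendecomposition of the covariance."""
    Xc = X - X.mean(axis=0)               # center each feature
    C = Xc.T @ Xc / Xc.shape[0]           # p x p covariance matrix, O(np^2)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :K]           # top-K eigenvectors = components
    return W, Xc @ W                      # projected data: n x K

# Iris-sized example: n = 150 samples, p = 4 features, K = 2 components.
X = np.random.randn(150, 4)
W, Z = pca(X, K=2)
print(W.shape, Z.shape)  # (4, 2) (150, 2)
```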
PCA example: Population genetics
Novembre et al., 2008
Genetic data of 1387 Europeans
Algorithmic complexity of PCA
● Memory: store the data X (n × p) and the covariance matrix X^T X (p × p): O(np + p²).
● Runtime:
– Computing X^T X: O(np²).
– Computing K eigenvectors by Lanczos iterations: O(Kp²).
– Computing the covariance matrix is more expensive than computing the K first principal components!
● Example, n = 10⁹, p = 10⁸:
– Computing the covariance matrix: 10²⁵ FLOPs. Fastest computer in the world (Nov 2018): 75 petaFLOPS → 2+ years.
– Storing the covariance matrix: 10¹⁶ B = (10¹⁶ / 2⁵⁰) PB ≈ 8 PB.
Clustering: k-means
K-means clustering
● Goal:
– Find a cluster assignment C: {1, …, n} → {1, …, K}
– that minimizes the intra-cluster variance:
Σ_{k=1,…,K} Σ_{i: C(i)=k} ‖x_i − μ_k‖²
– centroids: μ_k = mean of the points assigned to cluster k.
● The resulting clusters form a Voronoi tessellation of the input space.
● NP-hard → iterative algorithm (see the sketch below):
– Assignment step: fix the centroids, update the assignments;
– Update step: update the centroids.
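A toy numpy implementation of these two alternating steps (an illustrative sketch, assuming no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate assignment and update steps until the centroids stabilize."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: n x K distances in p dimensions -> O(nKp).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster -> O(np).
        new_centroids = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids

X = np.random.randn(300, 2)
assign, centroids = kmeans(X, K=3)
```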
K-means clustering, illustrated:
1. Pick 3 centroids at random.
2. Assign each observation to the nearest centroid.
3. Recompute the centroids.
4. Re-assign each observation to the nearest centroid.
5. Recompute the centroids, and iterate the process until convergence.
k-means complexity
– Assignment step: fix the centroids, update the assignments:
compute n × K distances in p dimensions → O(nKp).
– Update step: update the centroids:
sum n values in p dimensions → O(np).
T iterations → O(TKnp) overall.
– Storage:
store n cluster assignments + K centroids → O(n + Kp);
store X → O(np).
Ridge regression
Linear regression
Model: f(x) = β^T x.
Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):
β̂ = argmin_β Σ_{i=1,…,n} (y_i − β^T x_i)² = (X^T X)⁻¹ X^T y.
Solution uniquely defined when X^T X is invertible.
Ridge regression
● Hoerl & Kennard, 1970.
● Goodness-of-fit + ridge regularization:
β̂ = argmin_β Σ_{i=1,…,n} (y_i − β^T x_i)² [empirical risk] + λ‖β‖₂² [ridge regularization].
● Solution unique and always exists (sketched below):
β̂ = (X^T X + λI)⁻¹ X^T y.
● Limit cases:
– λ → 0: OLS (non-regularized) solution (low bias, high variance);
– λ → ∞: β = 0 (high bias, low variance).
● Correlated features get similar weights.
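A minimal numpy sketch of the closed-form solution above (illustrative; `ridge_fit` is my own name, not from the course):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam*I) beta = X^T y."""
    p = X.shape[1]
    # Solving the linear system is cheaper and more stable than inverting.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X = np.random.randn(100, 5)
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * np.random.randn(100)
print(ridge_fit(X, y, lam=1.0))  # close to beta_true for small lam
```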
Complexity of ridge regression
● Computing X^T X + λI: O(np²).
● Inverting it: O(p³).
● When n >> p, computing X^T X is more expensive than inverting it!
Regularization path
[Figure: regression weights and error (MSE) as a function of the regularization strength.]
Ridge regression: hyperparameter setting
Setting λ
[Figure: prediction error vs. model complexity. The error on the training data decreases as complexity grows; the error on new data first decreases, then increases again. Models that are too simple underfit; models that are too complex overfit.]
Setting λ
● Data splitting strategy: cross-validation:
– Cut the training set into K equally-sized chunks.
– K folds: one chunk for validation, the (K−1) others for training.
– Cross-validation score: performance averaged over the K folds.
– Repeat for a grid of M values λ1, λ2, …, λM.
– Choose the λ with the best cross-validation score (a sketch follows below).
● Multiplies the time complexity by KM.
[Figure: the training set split into K chunks; each chunk serves once as the validation set while the others are used for training.]
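A hedged numpy sketch of this grid search, reusing `ridge_fit` and the toy data from the previous sketch (names and grid are my own choices):

```python
import numpy as np

def cv_score(X, y, lam, K=5, seed=0):
    """Average validation MSE of ridge regression over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)  # closed form, defined above
        scores.append(np.mean((X[val] @ beta - y[val]) ** 2))
    return np.mean(scores)

lambdas = np.logspace(-3, 3, 7)  # grid of M = 7 candidate values
best_lam = min(lambdas, key=lambda lam: cv_score(X, y, lam))
```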
Ridge regression: ℓ2-regularized learning
ℓ2-regularized learning
● Generalization of ridge regression to any loss L:
min_β (1/n) Σ_{i=1,…,n} L(f(x_i), y_i) [empirical risk] + λ‖β‖₂² [ℓ2 regularization].
● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● Examples of losses (sketched below):
– Absolute loss: L(f(x), y) = |y − f(x)|;
– Quadratic loss: L(f(x), y) = (y − f(x))²;
– ε-insensitive loss: L(f(x), y) = max(0, |y − f(x)| − ε);
– Huber loss: a mix of quadratic (small residuals) and linear (large residuals).
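A small sketch of these four losses as functions of the residual (the Huber constants below follow the common textbook definition, an assumption on my part):

```python
import numpy as np

def absolute_loss(y, f):
    return np.abs(y - f)

def quadratic_loss(y, f):
    return (y - f) ** 2

def eps_insensitive(y, f, eps=0.5):
    return np.maximum(0.0, np.abs(y - f) - eps)

def huber(y, f, delta=1.0):
    # Quadratic for small residuals, linear beyond delta.
    r = np.abs(y - f)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

residuals = np.linspace(-3.0, 3.0, 7)
for loss in (absolute_loss, quadratic_loss, eps_insensitive, huber):
    print(loss.__name__, np.round(loss(residuals, 0.0), 2))
```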
Gradient descent
Gradient descent
● Suppose the function J to minimize is differentiable.
● First-order Taylor expansion of J at u:
J(v) ≈ J(u) + ∇J(u)^T (v − u).
[Figure: the graph of J, with the points (u, J(u)) and (v, J(v)), and the tangent approximation (v, J(u) + J′(u)(v − u)).]
Gradient descent
● Minimize a differentiable, strictly convex function J by finding where its gradient is 0 (a sketch follows below):
– Set a₀ randomly;
– Update: a₁ = a₀ − α∇J(a₀);
– Repeat;
– Stop when |∇J(a_k)| < ε.
[Figure: one-dimensional example: J′(a₀) < 0, so the update moves from a₀ to a₁ downhill.]
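A minimal fixed-step gradient descent loop (my own illustration of the recipe above):

```python
import numpy as np

def gradient_descent(grad, a0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Repeat a <- a - alpha * grad(a) until the gradient is small."""
    a = a0
    for _ in range(max_iter):
        g = grad(a)
        if np.linalg.norm(g) < eps:
            break
        a = a - alpha * g
    return a

# Minimize J(a) = ||a - 3||^2, whose gradient is 2(a - 3); minimum at (3, 3).
print(gradient_descent(lambda a: 2 * (a - 3.0), a0=np.zeros(2)))
```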
Classification
Logistic regression
● Model y as a linear function of x? Not directly: y is a discrete label (e.g. CAT vs. NON-CAT).
● Instead, model the log-odds ratio as a linear function of x:
log [ p(y = 1 | x) / (1 − p(y = 1 | x)) ] = β^T x,
i.e. p(y = 1 | x) = 1 / (1 + e^{−β^T x}) (the logistic function).
Ridge logistic regression
● Le Cessie and van Houwelingen, 1992.
● Goodness-of-fit + ridge regularization:
min_β (1/n) Σ_{i=1,…,n} log(1 + exp(−y_i β^T x_i)) [empirical risk: logistic loss] + λ‖β‖₂² [ridge regularization].
● Logistic loss: negative conditional likelihood.
● No explicit solution.
● Smooth convex optimization problem that can be solved numerically.
Newton-Raphson iterations
● Minimize J convex, differentiable.
● Gradient descent: a_{k+1} = a_k − α∇J(a_k).
● Suppose J is twice differentiable:
– Second-order Taylor expansion:
J(v) ≈ J(u) + ∇J(u)^T (v − u) + ½ (v − u)^T ∇²J(u) (v − u);
– Minimize the expansion in v instead of minimizing J in u:
v = u − [∇²J(u)]⁻¹ ∇J(u);
– Hence use the update a_{k+1} = a_k − [∇²J(a_k)]⁻¹ ∇J(a_k).
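A small sketch of this update (my own toy example; for a quadratic J, one Newton step reaches the minimum):

```python
import numpy as np

def newton(grad, hess, a0, eps=1e-8, max_iter=100):
    """Newton-Raphson: rescale the gradient by the inverse Hessian."""
    a = a0
    for _ in range(max_iter):
        g = grad(a)
        if np.linalg.norm(g) < eps:
            break
        a = a - np.linalg.solve(hess(a), g)  # solve H d = g rather than invert H
    return a

# Quadratic example: J(a) = 0.5 a^T A a - b^T a; gradient A a - b, Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(newton(lambda a: A @ a - b, lambda a: A, a0=np.zeros(2)))  # solves A a = b
```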
Solving the logistic regression
● Can be solved with Newton-Raphson iterations.
● Each step is equivalent to solving a weighted ridge regression.
● → IRLS: Iteratively Reweighted Least Squares.
● Complexity: O(np² + p³) per step, multiplied by the number of iterations.
Large margin classifiers
● Margin: yf(x)
– classify x as positive if f(x) > 0 and negative otherwise;
– ideally, yf(x) > 0.
● Large margin classifier: maximize yf(x), i.e. minimize
(1/n) Σ_{i=1,…,n} φ(y_i f(x_i)) + λ‖β‖₂²
for a convex, non-increasing function φ.
● Logistic regression: φ(u) = log(1 + e^{−u}).
Large margin classifiers
● Linear Support Vector Machine (SVM): hinge loss
φ(u) = max(0, 1 − u).
[Figure: the hinge loss is 0 for u ≥ 1 and decreases linearly with slope −1 for u < 1.]
Linear SVM
● Non-smooth, convex optimization problem; a Quadratic Program.
● Equivalent to a dual problem over α ∈ Rⁿ that involves the samples only through the n × n matrix of inner products x_i^T x_j; the decision function is f(x) = Σ_i α_i y_i x_i^T x.
● Complexity (training):
– Storing the n × n matrix of inner products: O(n²);
– Optimization: O(n³).
● Complexity (prediction):
– Primal: O(p);
– Dual: O(np).
Kernel methods
Motivation
Non-linear mapping to a feature space
[Figure: one-dimensional data in R, not linearly separable, mapped to R² by a non-linear map φ, where it becomes linearly separable.]
SVM in the feature space
● Train: solve the dual SVM problem with the inner products x_i^T x_j replaced by φ(x_i)^T φ(x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i φ(x_i)^T φ(x).
Kernel SVM
● Train: solve the dual SVM problem with the inner products replaced by kernel evaluations k(x_i, x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i k(x_i, x).
Kernel trick
● k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
● For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x′) = φ(x)^T φ(x′).
● Hence you can define mappings implicitly.
● Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels, in such a way that they can be applied in the initial space without ever computing the mapping.
Polynomial kernels
More generally, for x, x′ ∈ Rᵖ and any degree d,
k(x, x′) = (x^T x′ + 1)^d
is an inner product in a feature space of all monomials of degree up to d.
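A quick numeric check of this identity for d = 2 and p = 2 (the explicit feature map `phi_d2` below lists the monomials with their standard √2 weights; a sketch, not course code):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x^T z + 1)^d."""
    return (x @ z + 1.0) ** d

def phi_d2(x):
    """Explicit feature map for d = 2, p = 2: all monomials of degree <= 2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(poly_kernel(x, z), phi_d2(x) @ phi_d2(z))  # same value: 42.25
```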
Gaussian kernel
k(x, x′) = exp(−‖x − x′‖² / (2σ²)).
The feature space has infinite dimension.
Kernel ridge regression
● Ridge regression in input space: invert X^T X + λI, a p × p matrix.
● In a feature space of dimension d: invert a d × d matrix, which is intractable when d is very large or infinite.
● Ridge regression in sample space: invert K + λI, an n × n matrix, where K is the kernel (Gram) matrix; the solution is f(x) = Σ_i α_i k(x_i, x) with α = (K + λI)⁻¹ y.
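A minimal sketch of kernel ridge regression with the Gaussian kernel above (function names are mine, not from the course):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam=1.0, sigma=1.0):
    """Solve (K + lam*I) alpha = y: O(n^2 p) to build K, O(n^3) to solve."""
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x): O(np) per test point."""
    return gaussian_gram(X_new, X_train, sigma) @ alpha

X = np.random.randn(200, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(200)
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, X[:5]))  # close to y[:5]
```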
Complexity of KRR
● Computing K: O(n²p).
● Storing K: O(n²).
● Inverting K + λI: O(n³).
● Computing a prediction for one sample:
– computing κ = (k(x, x₁), …, k(x, x_n)): O(np);
– computing the products: O(n).
Algorithmic complexity recap
Summary

| Method | Memory | Train time | Predict time |
|---|---|---|---|
| PCA | O(p²) | O(np²) | O(p) |
| k-means | O(np) | O(Knp) | O(Kp) |
| Ridge regression | O(p²) | O(np²) | O(p) |
| Logistic regression | O(np) | O(np²) | O(p) |
| SVM, kernel methods | O(np) | O(n³) | O(np) |

● Training can take place offline...
● ... unless data is streaming.
● Prediction should be fast!
Techniques for large-scale ML
● Use the deep learning tricks.
→ Deep learning (March 26, F. Moutarde).
→ Natural Language Processing (March 27, E. Grave).
● Distribute data & computation on modern architectures.
→ Systems for large-scale ML (March 28, C.-A. Azencott).
● Trade optimization accuracy for speed.
Gradient descent: convergence rates
Flavours of convexity
● J differentiable is convex iff for all u, v:
J(v) ≥ J(u) + ∇J(u)^T (v − u).
● J is L-smooth, or has an L-Lipschitz gradient, iff it is twice differentiable and for all u:
∇²J(u) ⪯ L·I.
J doesn't vary "sharply".
● J is m-strongly convex iff for all u, v:
J(v) ≥ J(u) + ∇J(u)^T (v − u) + (m/2)‖v − u‖².
J is "not too flat": lower-bounded by a quadratic.
Gradient descent convergence rates
● Gradient descent with fixed step size:
– J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
– J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)), where κ = L/m is the condition number.
● Newton-Raphson iterations:
– J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np² + p³) log log(1/ε)).
● Caveat: the constants L and m (hence κ and the right step size) are usually unknown...
Stochastic gradient descent
● Gradient descent, with J(a) = (1/n) Σ_{i=1,…,n} J_i(a):
a_{k+1} = a_k − α ∇J(a_k).
● Stochastic gradient descent (sketched below):
a_{k+1} = a_k − α_k ∇J_l(a_k),
sampling with replacement: choose l uniformly at random in {1, 2, …, n}.
– J convex, L-smooth: converges in O(κ/ε²) iterations → O(κp/ε²).
– J m-strongly convex, L-smooth: converges in O(κ/ε) iterations → O(κp/ε).
● We got rid of n!
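A toy SGD sketch on the ridge regression objective (my own example; the decreasing step-size schedule is one arbitrary choice among many):

```python
import numpy as np

def sgd(X, y, lam=0.01, n_steps=50_000, alpha0=0.1, seed=0):
    """Each update uses the gradient of a single sample's regularized loss."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for k in range(n_steps):
        l = rng.integers(n)                           # sample with replacement
        residual = X[l] @ beta - y[l]
        grad = 2 * residual * X[l] + 2 * lam * beta   # gradient of one term, O(p)
        beta -= alpha0 / (1 + 1e-3 * k) * grad        # decreasing step size
    return beta

X = np.random.randn(1000, 5)
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * np.random.randn(1000)
print(sgd(X, y))  # roughly beta_true
```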
Convex optimization
● A number of ML algorithms can be formulated as convex optimization problems.
● They can be solved numerically thanks to variants of gradient descent.
● Trading optimization accuracy for speed:
– necessary when reaching accuracy is too resource-consuming;
– does not necessarily have a large impact on performance: test error is more important than training error.