
LSML19: Introduction to Large Scale Machine Learning

MINES ParisTech — March 2019

Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

Practical information

● Course website: http://cazencott.info > Teaching > LSML 19
● Schedule:
  – Morning sessions: 09:30 – 12:30
  – Afternoon sessions: 14:00 – 17:00
● Lectures are here in L.118
● Practicals are in L.117, L.119, L.120
● Grade:
  – 60% Exam (Friday, 14:00)
  – 40% RAMP

Today's goals

● Review what machine learning is
● Understand scalability issues
● Discover how to accelerate gradient descent with stochastic gradient descent

Acknowledgements

Slides inspired by Ala Al-Fuqaha, Ethem Alpaydi, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert

5

Why machine learning?

6

Business Insider: 2017 is the year of machine learning

● Improving doctors
● Assisting lawyers
● Making cars drive themselves

7

Perception

8

Communication

Chatbot example from www.supportbots.ai. Text courtesy of an Eddie Izzard show from 1999 called Dressed to Kill.

9

Reasoning

10

Diagnosis

11

Scientific discovery

LHC image: Anna Pantelia/CERN

12

A common thread: ML

● Using algorithms to build models from example (training) data.

● Statistics + optimization + computer science

13

Empirical risk minimization

● Ingredients:
  – Data
  – Hypothesis class: the shape of the decision function f
  – Loss function: the cost/error of f on one data point
● Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (the empirical risk).

14

(Un)supervised learning setting

[Figure: data matrix (design matrix) X of size n x p, with n observations (samples, data points) in rows and p features (variables, descriptors, attributes) in columns; under supervision, an outcome (target, label) matrix y of size n x t, with t outcomes per sample.]

● Binary classification: each label takes one of two values.
● Multi-class classification: each label takes one of several classes.
● Regression: real-valued labels.

Typical sizes:
● Iris dataset: n = 150, p = 4, t = 1.
● Cancer drug sensitivity: n = 10³, p = 10⁶, t = 100.
● ImageNet: n = 14·10⁶, p = 6·10³, t = 22·10³.
● Shopping, e-marketing…: n = O(10⁶), p = O(10⁹), t = O(10⁸).
● Astronomy, GAFAs, web…: n, p, t = O(10⁹).

15


Scaling ML algorithms

● What is large scale?
  – Data does not fit in RAM;
  – Data streams;
  – Algorithms do not run in a reasonable time on a single machine.
● Important considerations:
  – Performance increases with the number of samples.
  – Likelihood to overfit increases with the number of features.

16

A brief zoo of ML problems

17

Unsupervised learning

[Figure: data X (n x p): images, text, measurements, omics data… → ML algo → new data.]

Learn a new representation of the data.

18

Dimensionality reduction

[Figure: data X (n x p): images, text, measurements, omics data… → ML algo → lower-dimensional data X (n x m).]

Find a lower-dimensional representation.

19

Clustering

[Figure: data → ML algo → clusters.]

Group similar data points together.

20

Unsupervised learning

● Dimensionality reduction (e.g. PCA)
● Clustering (e.g. k-means)
● Density estimation
● Feature learning

21

Supervised learning

[Figure: data X (n x p) and labels y (length n) → ML algo → predictor (decision function) → make predictions.]

22

Classification: make discrete predictions

● Binary classification
● Multi-class classification

[Figure: data + labels → ML algo → predictor.]

23

Regression: make continuous predictions

[Figure: data + labels → ML algo → predictor.]

24

Supervised learning

● Regression: ordinary least squares, ridge regression
● Classification: logistic regression, SVM
● Structured output prediction

25

Main ML paradigms

● Unsupervised learning:
  – Dimensionality reduction;
  – Clustering;
  – Density estimation;
  – Feature learning.
● Supervised learning:
  – Regression;
  – Classification;
  – Structured output prediction.
● Semi-supervised learning.
● Reinforcement learning.

26

Dimensionality reduction: Principal Components Analysis

27

Principal Components Analysis

● Objective:
  – Reduce the dimension without losing the variability in the data;
  – Find a low-dimensional space so as to maximize the variance of the data projected onto that space.
● The k-th principal component w_k:
  – is orthogonal to all previous components;
  – captures the largest amount of remaining variance;
  – solution: w_k is the k-th eigenvector (by decreasing eigenvalue) of the covariance matrix XᵀX (for centered data).

(A minimal numerical sketch follows below.)
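Below is a minimal numpy sketch of this eigendecomposition view of PCA (an illustration, not the course's reference code; the function name and toy data are made up here):

```python
# Minimal PCA sketch: top-K principal components via the eigendecomposition of
# the covariance matrix of the centered data.
import numpy as np

def pca(X, K):
    """Return the top-K principal directions (p x K) and the projected data (n x K)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]        # keep the K largest
    W = eigvecs[:, order]                        # principal directions
    return W, Xc @ W                             # directions and projections

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))                    # Iris-sized toy data: n=150, p=4
W, Z = pca(X, K=2)
print(W.shape, Z.shape)                          # (4, 2) (150, 2)
```

For large p one would avoid forming the covariance matrix explicitly and use Lanczos or randomized solvers, which is exactly the point of the complexity analysis below.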

28

PCA example: Population genetics

Novembre et al, 2008

Genetic data of 1387 Europeans

29

Algorithmic complexity of PCA

● Memory: store the data X (n x p) and the covariance matrix XᵀX (p x p): O(np + p²).
● Runtime:
  – Computing XᵀX: O(np²).
  – Computing K eigenvectors by Lanczos iterations: O(Kp²).
  – Computing the covariance matrix is more expensive than computing the K first principal components!
● Example, n = 10⁹, p = 10⁸:
  – Computing the covariance matrix: 10²⁵ FLOPs. Fastest computer in the world (Nov 2018): 75 petaFLOPS → 2+ years.
  – Storing the covariance matrix: 10¹⁶ B = (10¹⁶ / 2⁵⁰) PB ≈ 8 PB.

30

Clustering: k-means

31

K-means clustering

● Goal:
  – Find a cluster assignment c: {1, …, n} → {1, …, K}
  – that minimizes the intra-cluster variance: Σ_k Σ_{i: c(i)=k} ‖x_i − μ_k‖²,
  – where the centroids are μ_k = (1/|C_k|) Σ_{i ∈ C_k} x_i, with C_k = {i: c(i) = k}.

K-means clustering

● Goal: as above, find a cluster assignment that minimizes the intra-cluster variance.
● The centroids define a Voronoi tessellation: each point belongs to the cell of the centroid it is closest to.

K-means clustering

● Goal: as above, find a cluster assignment that minimizes the intra-cluster variance.
● NP-hard → iterative algorithm (see the sketch below):
  – Assignment step: fix the centroids, update the assignments: c(i) ← argmin_k ‖x_i − μ_k‖².
  – Update step: update the centroids: μ_k ← (1/|C_k|) Σ_{i ∈ C_k} x_i.
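Below is a minimal numpy sketch of the iterative algorithm above (an illustration only; initialization by picking K data points at random and the stopping test are assumptions, not prescriptions from the lecture):

```python
# Minimal k-means sketch: alternate the assignment and update steps until the
# centroids stop moving.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # K random data points
    for _ in range(n_iter):
        # Assignment step: n x K distances in p dimensions, O(nKp)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: sum n values in p dimensions, O(np)
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids = kmeans(X, K=3)
```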

34

K-means clustering

35

Pick 3 centroids at random.

K-means clustering

36

Assign each observation to the nearest centroid.

K-means clustering

37

Recompute centroids.

K-means clustering

38

Re-assign each observation to the nearest centroid.

K-means clustering

39

Recompute the centroids, and iterate the process until convergence.

K-means clustering

40

k-means complexity

– Assignment step (fix the centroids, update the assignments): compute n x K distances in p dimensions → O(nKp).
– Update step (update the centroids): sum n values in p dimensions → O(np).
– T iterations → O(TnKp).
– Storage: store n cluster assignments + K centroids → O(n + Kp); store X → O(np).

41

Ridge regression

42

Linear regression

Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):

  argmin_β Σ_{i=1}^n (y_i − βᵀx_i)²,  with closed-form solution β̂ = (XᵀX)⁻¹ Xᵀy.

Solution uniquely defined when XᵀX is invertible.

43

Ridge regression

● Hoerl & Kennard, 1970.
● Goodness-of-fit + ridge regularization:

  argmin_β ‖y − Xβ‖² + λ‖β‖²   (empirical risk + ridge regularization)

● Solution unique and always exists: β̂ = (XᵀX + λI)⁻¹ Xᵀy.
● Limit cases:
  – λ → 0: OLS (non-regularized) solution (low bias, high variance);
  – λ → ∞: β = 0 (high bias, low variance).
● Correlated features get similar weights.

(A minimal sketch of the closed-form solution follows below.)
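Below is a minimal sketch of the closed-form ridge solution (an illustration; the toy data and the value of λ are arbitrary):

```python
# Minimal ridge regression sketch: beta = (X^T X + lam * I)^{-1} X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    # Solving the linear system is cheaper and more stable than explicit inversion.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=1.0))   # close to beta_true for small lambda
```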

44

Complexity of ridge regression

● Computing XᵀX + λI: O(np²).
● Inverting it (a p x p matrix): O(p³).
● When n >> p, computing XᵀX is more expensive than inverting it!

45

Regularization path

[Figure: regression weights and error (MSE) as a function of the regularization strength.]

46

Ridge regression: hyperparameter setting

47

Setting λ

[Figure: prediction error vs. model complexity, on training data and on new data; underfitting regime on the left, overfitting regime on the right.]

48

Setting λ

● Data splitting strategy: cross-validation.
  – Cut the training set into K equally-sized chunks.
  – K folds: one chunk for validation, the (K−1) others for training.

[Figure: K folds; each chunk is used once for validation and (K−1) times for training.]

  – Cross-validation score: performance averaged over the K folds.
  – Repeat for a grid of values λ₁, λ₂, …, λ_M.

49

Setting λ (continued)

● Same cross-validation procedure as above (see the sketch below).
  – Choose the λ with the best cross-validation score.
● Multiplies the time complexity by K·M (K folds x M values of λ).
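Below is a minimal sketch of this grid search with K-fold cross-validation, using scikit-learn for illustration (the grid of λ values and the use of Ridge/cross_val_score are tooling assumptions, not the course's code; scikit-learn's alpha parameter plays the role of λ):

```python
# Choose lambda for ridge regression by 5-fold cross-validation over a grid.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]            # grid of M values
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = [cross_val_score(Ridge(alpha=lam), X, y,
                          scoring="neg_mean_squared_error", cv=cv).mean()
          for lam in lambdas]                       # K x M model fits in total
best_lam = lambdas[int(np.argmax(scores))]
print("best lambda:", best_lam)
```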

50

Ridge regression: ℓ2-regularized learning

51

ℓ2-regularized learning

● Generalization of ridge regression to any loss L:

  argmin_β (1/n) Σ_{i=1}^n L(y_i, βᵀx_i) + λ‖β‖²   (empirical risk + ℓ2 regularization)

● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.

52

ℓ2-regularized learning

● Same formulation as above; common choices of loss L(y, f(x)) include (see the sketch below):
  – Quadratic loss: (y − f(x))².
  – Absolute loss: |y − f(x)|.
  – ε-insensitive loss: max(0, |y − f(x)| − ε).
  – Huber loss: a mix of quadratic (small residuals) and linear (large residuals).
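Below is a small sketch of these losses written as functions of the residual r = y − f(x) (illustration only; the Huber threshold δ is a free parameter):

```python
# The losses listed above, as functions of the residual r = y - f(x).
import numpy as np

def quadratic(r):            return r ** 2
def absolute(r):             return np.abs(r)
def eps_insensitive(r, eps): return np.maximum(0.0, np.abs(r) - eps)
def huber(r, delta):
    # quadratic for |r| <= delta, linear beyond
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-3, 3, 7)
print(quadratic(r), absolute(r), eps_insensitive(r, 0.5), huber(r, 1.0), sep="\n")
```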

53

Gradient descent

54

Gradient descent

● Suppose the function J to minimize is differentiable. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● First-order Taylor expansion of J at u:  J(v) ≈ J(u) + ∇J(u)ᵀ(v − u).

[Figure: the graph of J, the points (u, J(u)) and (v, J(v)), and the point (v, J(u) + ∇J(u)ᵀ(v − u)) on the tangent at u.]

55

Gradient descent

● Minimize a differentiable, strictly convex function J by finding where its gradient is 0 (a minimal sketch follows below):
  – Set a₀ randomly.
  – Update: a_{k+1} = a_k − α ∇J(a_k).
  – Repeat.
  – Stop when ‖∇J(a_k)‖ < ε.

[Figure: one gradient descent step from a₀ to a₁ on the graph of J, where J′(a₀) < 0.]
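Below is a minimal gradient descent sketch on the ridge regression objective ‖y − Xβ‖² + λ‖β‖² (an illustration; the fixed step size α and the stopping tolerance are assumptions, not values from the lecture):

```python
# Gradient descent for ridge regression: J(beta) = ||y - X beta||^2 + lam ||beta||^2.
import numpy as np

def ridge_gd(X, y, lam, alpha=1e-3, eps=1e-6, max_iter=10_000):
    beta = np.zeros(X.shape[1])                    # a0 chosen arbitrarily
    for _ in range(max_iter):
        grad = 2 * X.T @ (X @ beta - y) + 2 * lam * beta
        if np.linalg.norm(grad) < eps:             # stop when the gradient is small
            break
        beta = beta - alpha * grad                 # a_{k+1} = a_k - alpha * grad
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
print(ridge_gd(X, y, lam=1.0))
```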

56

Classification

57

Logistic regression

● Model y as a linear function of x?

[Figure: cat vs. non-cat images: a binary classification task.]

58

Logistic regression

● Model y as a linear function of x? Rather, model the log-odds ratio as a linear function of x:

  log( p(y=1|x) / (1 − p(y=1|x)) ) = βᵀx,  i.e.  p(y=1|x) = 1 / (1 + exp(−βᵀx)).

59

Logistic regression

● Model the log-odds ratio as a linear function of x (as above); p(y=1|x) then follows the logistic (sigmoid) curve.

60

Ridge logistic regression

● Le Cessie and van Houwelingen, 1992.
● Goodness-of-fit + ridge regularization, where the logistic loss is the negative conditional likelihood:

  argmin_β − Σ_{i=1}^n [ y_i log p(y_i=1|x_i) + (1 − y_i) log(1 − p(y_i=1|x_i)) ] + λ‖β‖²   (empirical risk: logistic loss + ridge regularization)

● No explicit solution.
● Smooth convex optimization problem that can be solved numerically (see the sketch below).
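Below is a short sketch fitting an ℓ2-penalized logistic regression with scikit-learn (an illustration of the model above, not the course's code; scikit-learn parameterizes the penalty with C, which roughly plays the role of 1/λ):

```python
# L2-penalized logistic regression on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) + 0.5 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))          # training accuracy
```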

Newton-Raphson iterations

● Minimize J convex, differentiable.
● Gradient descent: a_{k+1} = a_k − α ∇J(a_k).
● Suppose J is twice differentiable.
  – Second-order Taylor expansion: J(v) ≈ J(u) + ∇J(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²J(u) (v − u).
  – Minimize the right-hand side in v rather than moving along the gradient at u.
  – Hence use the update a_{k+1} = a_k − [∇²J(a_k)]⁻¹ ∇J(a_k).

62

Solving the logistic regression

● Can be solved with Newton-Raphson iterations.
● Each step is equivalent to solving a weighted ridge regression.
● → IRLS: Iteratively Reweighted Least Squares.
● Complexity: each iteration is dominated by solving a p x p weighted ridge system, i.e. O(np² + p³), multiplied by the number of iterations.

(A minimal Newton/IRLS-style sketch follows below.)
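Below is a minimal Newton-Raphson (IRLS-style) sketch for the ridge logistic regression objective above, with y in {0, 1} (an illustration; the fixed number of iterations is an assumption):

```python
# Newton-Raphson for ridge logistic regression: each step solves a weighted
# ridge system H delta = grad.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_logreg_newton(X, y, lam, n_iter=20):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)                              # predicted probabilities
        grad = X.T @ (mu - y) + 2 * lam * beta              # gradient of the objective
        W = mu * (1 - mu)                                   # per-sample weights
        H = X.T @ (W[:, None] * X) + 2 * lam * np.eye(p)    # Hessian
        beta = beta - np.linalg.solve(H, grad)              # Newton step
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) > 0).astype(float)
print(ridge_logreg_newton(X, y, lam=1.0))
```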

63

Large margin classifiers

● Margin: yf(x)
  – classify x as positive if f(x) > 0 and negative otherwise;
  – ideally, yf(x) > 0.
● Large margin classifier: maximize yf(x):

  argmin_β (1/n) Σ_{i=1}^n φ(y_i f(x_i)) + λ‖β‖²

  for a convex, non-increasing function φ.
● Logistic regression: φ(u) = log(1 + e^(−u)).

Large margin classifiers

● Linear Support Vector Machine (SVM): use the hinge loss φ(u) = max(0, 1 − u).

[Figure: the hinge loss as a function of the margin u = yf(x): zero for u ≥ 1, slope −1 for u < 1.]

65

Linear SVM

● Non-smooth, convex optimization problem; a Quadratic Program.
● Equivalent to a dual problem in n variables α_1, …, α_n that involves the samples only through their pairwise inner products ⟨x_i, x_j⟩; the decision function is f(x) = Σ_i α_i y_i ⟨x_i, x⟩.
● Complexity (training):
  – Storing the data: O(np).
  – Optimization: O(n³).
● Complexity (prediction):
  – Primal: O(p).
  – Dual: O(np).

(A short sketch using an off-the-shelf solver follows below.)
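Below is a short sketch training a linear SVM with scikit-learn's LinearSVC (an illustration; the parameter C roughly plays the role of 1/λ, and the toy data are made up):

```python
# Linear SVM with the hinge loss on toy data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = np.sign(X @ rng.normal(size=10) + 0.3 * rng.normal(size=400))

clf = LinearSVC(C=1.0, loss="hinge", dual=True, max_iter=10_000)
clf.fit(X, y)
print(clf.score(X, y))     # training accuracy
```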

66

Kernel methods

67

Motivation

68

Non-linear mapping to a feature space

[Figure: a non-linear mapping φ: ℝ → ℝ² of the data into a feature space where it becomes linearly separable.]

69

SVM in the feature space

● Train: solve the dual SVM problem with the inner products ⟨x_i, x_j⟩ replaced by ⟨φ(x_i), φ(x_j)⟩.
● Predict with the decision function f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩.

70

Kernel SVM

● Train: solve the dual SVM problem with ⟨φ(x_i), φ(x_j)⟩ replaced by the kernel value k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
● Predict with the decision function f(x) = Σ_i α_i y_i k(x_i, x).

71

Kernel trick

● k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
● For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩.
● Hence you can define mappings implicitly.
● Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels, in such a way that they can be applied in the initial space without ever computing the mapping (see the small numerical check below).
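Below is a small numerical check of the kernel trick for the homogeneous polynomial kernel of degree 2 on ℝ², k(x, z) = ⟨x, z⟩², whose explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂) is known (illustration only):

```python
# Check that k(x, z) = <x, z>^2 equals the inner product of the explicit
# feature maps phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(k(x, z), phi(x) @ phi(z))   # the two values coincide
```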

72

Non-linear mapping to a feature space

[Figure: the non-linear mapping φ: ℝ → ℝ², revisited: the feature-space inner product can be computed directly from the original inputs.]

73

Polynomial kernels

More generally, for any degree d,

  k(x, x′) = (⟨x, x′⟩ + 1)^d

is an inner product in a feature space of all monomials of degree up to d.

74

Gaussian kernel

  k(x, x′) = exp(−‖x − x′‖² / (2σ²))

The feature space has infinite dimension.

75

Kernel ridge regression

● Ridge regression in the input space: solve a p x p system, β̂ = (XᵀX + λI)⁻¹ Xᵀy.
● In a feature space of dimension d: solve a d x d system (possibly huge, or infinite-dimensional).
● Ridge regression in the sample space: solve an n x n system,

  α = (K + λI)⁻¹ y,  with K_ij = k(x_i, x_j)  and  f(x) = Σ_i α_i k(x_i, x).

(A minimal sketch follows below.)
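Below is a minimal kernel ridge regression sketch with a Gaussian kernel (an illustration; the bandwidth σ and λ are arbitrary):

```python
# Kernel ridge regression: alpha = (K + lam I)^{-1} y, f(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)                        # n x n Gram matrix, O(n^2 p)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)     # O(n^3)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    kappa = gaussian_kernel(X_new, X_train, sigma)          # O(np) per new sample
    return kappa @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
alpha = krr_fit(X, y, lam=0.1)
print(krr_predict(X, alpha, X[:5]), y[:5])                  # fitted values vs. targets
```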

76

Complexity of KRR

● Computing K: O(n²p).
● Storing K: O(n²).
● Inverting K + λI: O(n³).
● Computing a prediction for one sample:
  – computing κ = (k(x_1, x), …, k(x_n, x)): O(np);
  – computing the product κᵀα: O(n).

Algorithmic complexity recap

78

Summary

Method                 Memory   Train time   Predict time
PCA                    O(p²)    O(np²)       O(p)
k-means                O(np)    O(Knp)       O(Kp)
Ridge regression       O(p²)    O(np²)       O(p)
Logistic regression    O(np)    O(np²)       O(p)
SVM, kernel methods    O(np)    O(n³)        O(np)

● Training can take place offline…
● … unless data is streaming.
● Prediction should be fast!

79

Techniques for large-scale ML

● Use deep learning tricks.
  → Deep learning (March 26, F. Moutarde).
  → Natural Language Processing (March 27, E. Grave).
● Distribute data & computation on modern architectures.
  → Systems for large-scale ML (March 28, C.-A. Azencott).
● Trade optimization accuracy for speed.

80

Gradient descents: convergence rates

81

Flavours of convexity

● J differentiable is convex iff, for all u, v:  J(v) ≥ J(u) + ∇J(u)ᵀ(v − u).
● J is L-smooth, or L-Lipschitz gradient, iff it is twice differentiable and, for all u, the eigenvalues of ∇²J(u) are at most L.
  J doesn't vary "sharply".
● J is m-strongly convex iff, for all u, v:  J(v) ≥ J(u) + ∇J(u)ᵀ(v − u) + (m/2)‖v − u‖².
  J is "not too flat": it is lower-bounded by a quadratic.

82

Gradient descent convergence rates

● Gradient descent with fixed step size:
  – J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
  – J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)), where κ = L/m is the condition number.
● Newton-Raphson iterations:
  – J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np² + p³) log log(1/ε)).

83

(Same rates as above.) Note that the constants L and m, hence the step size and the condition number κ, are usually unknown in practice.

Stochastic gradient descent

● Write the empirical risk as J(a) = (1/n) Σ_{i=1}^n J_i(a), where J_i is the loss on sample i.
● Gradient descent: a_{k+1} = a_k − α ∇J(a_k) = a_k − α (1/n) Σ_{i=1}^n ∇J_i(a_k).
● Stochastic gradient descent: a_{k+1} = a_k − α_k ∇J_{i_k}(a_k).
  Sampling with replacement: choose i_k uniformly at random in {1, 2, …, n}.
● Convergence:
  – J convex, L-smooth: converges in O(κ/ε²) iterations → O(κp/ε²).
  – J m-strongly convex, L-smooth: converges in O(κ/ε) iterations → O(κp/ε).
● We got rid of n! (A minimal sketch follows below.)
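Below is a minimal SGD sketch on the ridge regression objective, using one random sample per step (an illustration; the decreasing step-size schedule is an assumption, not a recommendation from the lecture):

```python
# SGD for ridge regression: each step uses the gradient of the loss on a single
# randomly chosen sample, so the per-iteration cost is O(p) rather than O(np).
import numpy as np

def ridge_sgd(X, y, lam, n_steps=50_000, alpha0=0.01):
    n, p = X.shape
    beta = np.zeros(p)
    rng = np.random.default_rng(0)
    for k in range(n_steps):
        i = rng.integers(n)                                   # sample with replacement
        xi, yi = X[i], y[i]
        grad_i = 2 * (xi @ beta - yi) * xi + 2 * lam * beta   # stochastic gradient
        beta -= alpha0 / (1 + 0.001 * k) * grad_i             # decreasing step size
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
print(ridge_sgd(X, y, lam=0.01))
```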

Convex optimization

● A number of ML algorithms can be formulated as convex optimization problems.
● They can be solved numerically thanks to variants of gradient descent.
● Trading optimization accuracy for speed:
  – is necessary when reaching high optimization accuracy is too resource-consuming;
  – does not necessarily have a large impact on performance: test error matters more than training error.