Regression at big scale It's (also) a machine learning...

Regression at big scale

Dario Malchiodi malchiodi@di.unimi.it

It's (also) a machine learning problem

Data: a set of associations between objects and labels

Goal: find a mapping from objects to labels

describing observed data within a reasonable approximation level

generalizing to unseen observations

Technically, a supervised learning problem

Notation

Pyro: That's a dorky looking helmet. What's it for?

Magneto: This «dorky looking helmet» is the only thing that's going to protect me from the REAL bad guys.

A good notation is like Magneto's helmet

$x^{(i)}$ denotes the $i$-th object/vector in a series ($y^{(i)}$ the $i$-th label)

(the parentheses avoid confusion with exponentiation)

$x_j$ denotes the $j$-th component of vector $x$

(mix and match: $x^{(i)}_j$ denotes the $j$-th component of the $i$-th vector)

Will try to be consistent in using $j$ as component index and $i$ as object/vector index

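Linear regression

generic object: $x \in \mathbb{R}^d$ for fixed $d$

generic label: $y \in \mathbb{R}$

approximation of label: $\hat{y}$

Assume a linear mapping between objects and labels: $\hat{y} = \sum_{j=1}^d w_j x_j = w^\top x$

$\hat{y}$ is an approximation (or a prediction) for $y$

Our problem lies in finding $w$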

Affine mapping integration

Adding a threshold/offset in the mapping may help a lot

Technically, this brings us to an affine mapping

Practically, just pretend you have an additional dimension and set its component to 1

Nothing changed in our problem (still in search of $w$)
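The "additional dimension" trick can be sketched in a few lines of NumPy. This is only an illustration: the helper name `add_bias_column` and the toy numbers are assumptions, not part of the slides.

```python
import numpy as np

def add_bias_column(X):
    """Append a constant-1 component to every object (row) of X."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
X_aug = add_bias_column(X)

# The last entry of w now plays the role of the offset:
# X_aug @ w equals X @ w[:2] + w[2].
w = np.array([1.0, -1.0, 0.5])
pred = X_aug @ w
```

The linear machinery is untouched: the offset is simply learnt as one more component of the weight vector.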

Uh, rather simple?

A feature!

An option not to be underestimated

Complexity injectable through feature extraction:

augment object vectors adding products of components

for instance, pairs of components capture covariance

extendible to higher order moments

For instance:

Let's be clever: , because

Besides,

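How do I find $w$?

Pretend we have a candidate $w$

Let's measure how good $w$ is at prediction:

$y^{(i)}$ is my label

$\hat{y}^{(i)} = w^\top x^{(i)}$ is the prediction

need a loss function: squared error $(y^{(i)} - \hat{y}^{(i)})^2$

Let's cumulate errors on all observations: $\sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2$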

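Solution

Find $w$ minimizing cumulated loss: $\min_w \sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2$

gather objects in the matrix $X$ (one object per row), gather labels in vector $y$: $\min_w \lVert y - Xw \rVert_2^2$

It's a convex problem, just nullify first derivatives: $-2 X^\top (y - Xw) = 0$

This brings to $w = (X^\top X)^{-1} X^\top y$

(remember: $X$ is $n \times d$, $y$ is $n \times 1$)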
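A minimal NumPy sketch of the closed-form solution, on synthetic data invented for illustration (the sizes, the "true" weights, and the noise level are all assumptions). It solves the linear system $(X^\top X) w = X^\top y$ instead of forming the inverse explicitly, which is usually the numerically safer route.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Synthetic data: labels are a noisy linear function of the objects
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations: (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise, `w_hat` should land very close to `w_true`.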

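Remember generalization?

Real-world data is dirty

Aiming at the smallest loss could lead to overfitting

Occam's razor: find the right balance between model complexity and error

For instance: Ridge regression, $\min_w \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2$

Closed form solution: $w = (X^\top X + \lambda I_d)^{-1} X^\top y$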
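As a rough sketch, ridge regression differs from plain least squares by a single term added to the diagonal. The function name `ridge_fit` and the toy data are assumptions made for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)      # no regularization: plain least squares
w_ridge = ridge_fit(X, y, lam=100.0)  # heavy regularization shrinks the weights
```

Comparing the norms of `w_ols` and `w_ridge` shows the shrinking effect of the regularization term.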

Wait: $\lambda$?

Hyper-parameter to be tuned

How can it be selected?

And what about assessing the learnt model capabilities?

Use data against overfitting

Split observations in three sets:

Training set, used to train models (in our case: finding out $w$)

Validation set, used for model selection (in our case: tuning $\lambda$)

Test set, used to assess the machine learning output

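Assessment

Fix an error measure, typically MSE: $\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2$ over the $m$ observations being assessed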

Machine learning pipeline

Fix a discretization of the parameter space

For each discretized value $\lambda$:

run learning algorithm using $\lambda$ and training set

assess learnt model computing (R)MSE on validation set

Run learning algorithm using $\lambda^*$ (with $\lambda^*$ corresponding to the lowest (R)MSE) and training + validation set

Assess overall learning process computing (R)MSE on test set
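The pipeline above can be sketched end to end with ridge regression as the learning algorithm. Everything specific here is an assumption chosen for illustration: the synthetic data, the 60/20/20 split, the grid of $\lambda$ values, and the helper names `ridge_fit` and `rmse`.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(X, y, w):
    """Root mean squared error of the linear model w on (X, y)."""
    return float(np.sqrt(np.mean((y - X @ w) ** 2)))

rng = np.random.default_rng(42)
n, d = 300, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Split observations in training, validation and test sets
idx = rng.permutation(n)
train, val, test = idx[:180], idx[180:240], idx[240:]

# Discretized hyper-parameter space
lambdas = [10.0 ** e for e in range(-4, 3)]

# Model selection: train on the training set, score on the validation set
val_errors = {
    lam: rmse(X[val], y[val], ridge_fit(X[train], y[train], lam))
    for lam in lambdas
}
best_lam = min(val_errors, key=val_errors.get)

# Retrain on training + validation, then assess once on the test set
train_val = np.concatenate([train, val])
w_final = ridge_fit(X[train_val], y[train_val], best_lam)
test_rmse = rmse(X[test], y[test], w_final)
```

The test set is touched only once, at the very end, so `test_rmse` is not biased by the selection of $\lambda$.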

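Computational complexity for linear regression

Remember that $w = (X^\top X)^{-1} X^\top y$

Time complexity: $O(nd^2 + d^3)$ basic operations

$O(d^3)$ for matrix inversion ($X$ is $n \times d$, $X^\top X$ is $d \times d$)

$O(nd^2)$ for matrix multiplication

Space complexity: $O(nd + d^2)$ floats

$O(d^2)$ for $X^\top X$ and its inverse

$O(nd)$ for $X$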

Big-scale regression

Two scenarios:

big $n$, small $d$

big $n$, big $d$

(what about small $n$ and big $d$?)

Big $n$, small $d$

Time complexity

$O(d^3)$ for matrix inversion is acceptable

Space complexity

$O(d^2)$ for $X^\top X$ storage is acceptable

Instead, computation of $X^\top X$ and storage of $X$ are bottlenecks

Solution

Distribute storage of $X$ across cluster nodes

Express $X^\top X$ as a sum of outer products

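Matrix product through outer products

Let $A$ be an $m \times p$ matrix and $B$ be a $p \times q$ matrix:

$C = AB$, where $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$, which means that $C = \sum_{k=1}^{p} a_{\cdot k} \, b_{k \cdot}$ (outer product of the $k$-th column of $A$ and the $k$-th row of $B$)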

An example
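A small numeric check of this identity, with two matrices made up for the purpose:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # 3 x 2
B = np.array([[7, 8, 9],
              [10, 11, 12]])    # 2 x 3

# Sum over k of (k-th column of A) outer (k-th row of B)
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# Ordinary matrix product, for comparison
C_direct = A @ B
```

Both computations give the same $3 \times 3$ matrix. The outer-product form matters because each term depends on only one column of $A$ and one row of $B$.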

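Distributed computation of $X^\top X$

Compute $X^\top X$ as $\sum_{i=1}^n x^{(i)} {x^{(i)}}^\top$

each outer product requires local storage of $O(d^2)$, local computation of $O(d^2)$

Compute $(X^\top X)^{-1}$ summing local results and inverting

requires local storage of $O(d^2)$, local computation of $O(d^3)$

train.map(compute_outer_product).reduce(sum_and_invert)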
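The map/reduce scheme can be imitated on a single machine with Python's built-in `map` and `functools.reduce`. This is only a sketch of the data flow under that assumption: it is not a cluster implementation, and the synthetic data is invented for the example.

```python
from functools import reduce

import numpy as np

def compute_outer_product(x):
    """Map step: the d x d outer product of one observation with itself."""
    return np.outer(x, x)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)

# Map: one outer product per observation; reduce: sum them into X^T X
XtX = reduce(np.add, map(compute_outer_product, X))

# X^T y is obtained the same way, as a sum of per-observation vectors
Xty = reduce(np.add, (x_i * y_i for x_i, y_i in zip(X, y)))

# Only this final d x d solve needs to happen on a single node
w = np.linalg.solve(XtX, Xty)
```

Each mapped item only needs one row of $X$, so rows can live on different nodes, and the reduce step only moves $d \times d$ matrices around.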

Big $n$, big $d$

Time complexity

also $O(d^3)$ matrix inversion is a bottleneck

Space complexity

also $O(d^2)$ storage of $X^\top X$ is a bottleneck

And of course previous bottlenecks are still there

Big $n$, big $d$

Solution

A different approach to linear regression

Rule of thumb: computation and storage should be linear in $n$ and $d$

Exploit sparsity

explicit

implicit

Use a different algorithm

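Gradient descent

Assume $f: \mathbb{R} \to \mathbb{R}$

Choose a starting point $w_0$, set a step size $\alpha > 0$

Repeat until convergence: $w_{k+1} = w_k - \alpha f'(w_k)$

This algorithm converges to a local minimum for $f$ (provided the step size is suitably small).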

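Gradient descent

If $f: \mathbb{R}^d \to \mathbb{R}$

Choose a starting point $w_0$, set a step size $\alpha > 0$

Repeat until convergence: $w_{k+1} = w_k - \alpha \nabla f(w_k)$

where $\nabla f(w) = \left( \frac{\partial f}{\partial w_1}(w), \dots, \frac{\partial f}{\partial w_d}(w) \right)$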
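A minimal sketch of the iteration for a generic differentiable function. The stopping rule (stop when the update becomes tiny), the default values, and the quadratic test function are assumptions made for this example.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Repeat w <- w - alpha * grad(w) until the update is negligible."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad(w)
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# f(w) = (w_1 - 1)^2 + (w_2 + 2)^2 has its minimum at (1, -2)
target = np.array([1.0, -2.0])

def grad_f(w):
    return 2.0 * (w - target)

w_min = gradient_descent(grad_f, w0=[0.0, 0.0])
```

On this convex function every starting point leads to the same minimum; with a step size that is too large the iterates would diverge instead.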

Gradient descent

Critical issues

Choosing initial point

Setting step size (learning rate)

Gradient descent for linear regression

Local minimization through gradient descent

Convexity

Linear regression is a convex problem, so gradient descent is OK

Dynamic learning rate

Big steps at the beginning of iteration

Small steps as we reach convergence

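Parallelization of gradient descent

$w_{k+1} = w_k - \alpha_k \sum_{i=1}^n (w_k^\top x^{(i)} - y^{(i)}) \, x^{(i)}$

Send $w_k$ to all workers

Compute summands in parallel

Now each worker stores $x^{(i)}$ and $w_k$ (space complexity is $O(d)$ per observation)

Computation is $O(d)$ per summand, too.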
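The scheme can be simulated on one machine by splitting the data into partitions that play the role of workers. The decaying step size $\alpha / (n \sqrt{k})$ is an assumption: it is one common way to take big steps early and small steps later, not necessarily the schedule intended here. The function names and the data are also invented for the example.

```python
import numpy as np

def partial_gradient(w, X_part, y_part):
    """Sum of the summands (w^T x - y) x for the observations of one worker."""
    return X_part.T @ (X_part @ w - y_part)

def distributed_gd(partitions, d, alpha=0.5, n_iter=500):
    n = sum(len(y_part) for _, y_part in partitions)
    w = np.zeros(d)
    for k in range(1, n_iter + 1):
        # "Send w to all workers", then add up their partial sums
        grad = sum(partial_gradient(w, X_p, y_p) for X_p, y_p in partitions)
        # Dynamic learning rate: big steps first, smaller ones later
        w = w - alpha / (n * np.sqrt(k)) * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=400)

# Four partitions standing in for four workers
partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
w_gd = distributed_gd(partitions, d=3)
```

No $d \times d$ matrix is ever built: each partition only returns a vector of $d$ floats per iteration.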

Classification problem

Not that different from the ML framing of regression

Data: a set of associations between objects and labels

Goal: find a mapping from objects to labels

describing observed data within a reasonable approximation level

generalizing to unseen observations

Technically, a supervised learning problem

Now, labels belong to a discrete set

positive/negative (binary classification, we'll stick to this)

multi-class

Linear classification

Use something similar to regression in order to find two half-spaces for objects

Classify according to the half-space where objects belong

Note that $y \in \{-1, 1\}$ in order for $\hat{y} = \mathrm{sign}(w^\top x)$ to be reasonable

Evaluating predictions

In regression: squared loss

In binary classification: 0-1 loss

null penalty in case of correct classification

unitary penalty in case of misclassification

Learning the classifier

In regression: find $w$ minimizing the cumulated squared loss

In our case: find $w$ minimizing the cumulated 0-1 loss

A problem with convexity

The 0-1 loss is not convex!

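In [22]:

%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt

def zero_one_loss(z):
    # no penalty when z >= 0 (correct classification), unit penalty otherwise
    return 0 if z >= 0 else 1

z = np.arange(-5, 10, .1)
# in Python 3 map returns an iterator, so turn it into a list before plotting
g = plt.plot(z, list(map(zero_one_loss, z)))
plt.ylim([-1, 2])

Out[22]:

(-1, 2)

Approximate loss

Need for convexity

Among various possibilities, log loss: $\ell_{\log}(z) = \log(1 + e^{-z})$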

In [17]:

import numpy as np

def log_loss(z): return np.log(1 + np.exp(-z))

z = np.arange(-3, 10, .5)
# log_loss is built from NumPy functions, so it applies to the whole array
g = plt.plot(z, log_loss(z))

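Result: logistic regression

Thus we are optimizing $\min_w \sum_{i=1}^n \log\left(1 + e^{-y^{(i)} w^\top x^{(i)}}\right)$

Optimization through gradient descent

Computing derivatives

As $\frac{d}{dz} \log(1 + e^{-z}) = -\frac{1}{1 + e^{z}}$

We have $\frac{\partial}{\partial w_j} \sum_{i=1}^n \log\left(1 + e^{-y^{(i)} w^\top x^{(i)}}\right) = -\sum_{i=1}^n \frac{y^{(i)} x^{(i)}_j}{1 + e^{y^{(i)} w^\top x^{(i)}}}$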

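Regularized logistic regression

As in ridge regression, add a regularization term: $\min_w \sum_{i=1}^n \log\left(1 + e^{-y^{(i)} w^\top x^{(i)}}\right) + \lambda \lVert w \rVert_2^2$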
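A rough sketch of regularized logistic regression trained by gradient descent, with labels in $\{-1, 1\}$. The function names, the fixed step size, the value of $\lambda$, and the two-blob dataset are all assumptions chosen for illustration.

```python
import numpy as np

def logistic_gradient(w, X, y, lam):
    """Gradient of sum_i log(1 + exp(-y_i w.x_i)) + lam * ||w||^2."""
    z = y * (X @ w)
    # derivative of log(1 + exp(-z)) with respect to z is -1 / (1 + exp(z))
    coeff = -y / (1.0 + np.exp(z))
    return X.T @ coeff + 2.0 * lam * w

def fit_logistic(X, y, lam=1.0, alpha=0.1, n_iter=2000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        # dividing by n keeps the step size independent of the sample size
        w = w - (alpha / n) * logistic_gradient(w, X, y, lam)
    return w

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], size=(100, 2))    # positive class
X_neg = rng.normal(loc=[-2.0, -2.0], size=(100, 2))  # negative class
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])

w = fit_logistic(X, y)
accuracy = float(np.mean(np.sign(X @ w) == y))
```

The regularization term keeps the weights bounded even when the two classes are linearly separable, a case in which the unregularized loss has no finite minimizer.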

Probabilistic interpretation

A step ahead: instead of predicting the class of an object...

...estimate the probability of a class given the object

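Probabilistic interpretation

Can't use linear regression: $w^\top x \in \mathbb{R}$

because probabilities belong to $[0, 1]$

Can't use sign: $\mathrm{sign}(w^\top x) \in \{-1, 1\}$

for the same reason as before

Can use logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Logistic function

a kind of smooth approximation of a step function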

In [18]:

def logistic(z): return 1. / (1 + np.exp(-z))

z = np.arange(-7, 7, .5)
# logistic is built from NumPy functions, so it applies to the whole array
g = plt.plot(z, logistic(z))

Predicting probabilities

Use logistic regression to learn $w$

Predict probabilities as $P(y = 1 \mid x) = \sigma(w^\top x)$

Classifying using probabilities

Threshold probability: $x$ is positive if $\sigma(w^\top x)$ exceeds a fixed threshold
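A short sketch of both steps, assuming a weight vector has already been learnt. The weights, the objects, and the default threshold of 0.5 are invented for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """Estimated probability of the positive class for each row of X."""
    return logistic(X @ w)

def classify(w, X, threshold=0.5):
    """Label +1 where the probability exceeds the threshold, -1 elsewhere."""
    return np.where(predict_proba(w, X) > threshold, 1, -1)

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])

probs = predict_proba(w, X)
labels = classify(w, X)
strict_labels = classify(w, X, threshold=0.9)
```

Raising the threshold makes the classifier more reluctant to answer "positive": the first object, whose probability is below 0.9, is labelled negative in `strict_labels`.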

Choosing threshold: ROC curves

Two kinds of errors:

false positives (FP): objects classified as positive when they are negative

false negatives (FN): objects classified as negative when they are positive

threshold = 0: everything is positive, FN=TN=0

threshold = 1: everything is negative, FP=TP=0
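Each threshold yields one point of the ROC curve: the false positive rate against the true positive rate. The sketch below computes these points for a handful of made-up probabilities and labels; the function name `roc_points` and the data are assumptions.

```python
import numpy as np

def roc_points(probs, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    probs = np.asarray(probs)
    positives = np.asarray(labels) == 1
    points = []
    for t in thresholds:
        predicted_pos = probs > t
        tp = np.sum(predicted_pos & positives)
        fp = np.sum(predicted_pos & ~positives)
        fn = np.sum(~predicted_pos & positives)
        tn = np.sum(~predicted_pos & ~positives)
        points.append((float(fp / (fp + tn)), float(tp / (tp + fn))))
    return points

probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, -1, -1, -1]

points = roc_points(probs, labels, thresholds=[0.0, 0.5, 1.0])
```

The two extreme thresholds give the corners of the curve: at 0 everything is positive, so both rates are 1; at 1 everything is negative, so both rates are 0.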
