Post on 07-Sep-2019
Regression at big scaleDario Malchiodi malchiodi@di.unimi.it
It's (also) a machine learning problemData: a set of associations between objects andlabelsGoal: find a mapping from objects to labels
describing observed data within a reasonable approximation levelgeneralizing to unseen observations
Technically, a supervised learning problem
NotationPyro: That's a dorky looking helmet. What's it for?
Magneto: This «dorky looking helmet» is the only thing that's going to
protect me from the REAL bad guys.
A good notation is like Magneto's helmet
denotes -th object/vector in a series ( the -th label)
(to avoid confusion with exponentiation)
denotes -th component of vector
(mix and match: denotes -th component of -th vector)
Will try to be consistent in using as component index and as object/vector index
Linear regressiongeneric object: for fixed
generic label:
approximation of label:
Assume a linear mapping between objects and labels
is an approximation (or a prediction) for
Our problem lies in finding
Affine mapping integrationAdding a threshold/offset in the mapping may help a lotTechnically, this brings us to an affine mappingPractically, just pretend you have an additional dimension and set its componentto 1
Nothing changed in our problem (still in search for )
Uh, rather simple?A feature!An option not to be underestimatedComplexity injectable through feature extraction:
augment object vectors adding product of componentsfor instance, pairs of components capture covarianceextendible to higher order moments
For instance:
Let's be clever: , because
Besides,
How do I find ?
Pretend we have a candidate Let's measure how is good at prediction:
is my label is the prediction
need a loss function: squared error Let's cumulate errors on all observations:
Solution
Find minimizing cumulated loss:
Solution
gather objects in the matrix gather labels in vector
SolutionIt's a convex problem, just nullify first derivatives:
This brings to
(remember: is , is )
Remember generalization?Real-world data is dirtyAiming at the smallest loss could lead to overfittingOccam's razor: find the right balance between model complexity and errorFor instance: Ridge regression
Closed form solution
Wait: ?Hyper-parameter to be tunedHow can it be selected?And what about assessing the learnt model capabilities?
Use data against overfittingSplit observations in three sets:
Training set, used to train models (in our case: finding out )Validation set, used for model selection (in our case: tuning )Test set, used to assess the machine learning output
AssessmentFix an error measure, typically MSE
or
Machine learning pipelineFix a discretization of the parameter spaceFor each discretized value :
run learning algorithm using and training setassess learnt model computing (R)MSE on validation set
Run learning algorithm using (with corresponding to the lowest(R)MSE) and training validation setAssess overall learning process computing (R)MSE on test set
Computational complexity for linearregressionRemember that
Time complexity: basic operations for matrix inversion ( is , is )
for matrix multiplicationSpace complexity: floats
for and its inverse for
Big-scale regressionTwo scenarios:
big , small ,big , big (what about small and big ?)
Big , small Time complexity
for matrix inversion is acceptableSpace complexity
for storage is acceptableInstead, computation of and storage of are bottlenecks
Solution
Distribute storage of across cluster nodesExpress as a sum of outer products
Matrix product through outerproductsLet be a matrix and be a matrix:
Thus
where , which means that (outer product of -th column of and -th row of
An example
Distributed computation of Compute as
requires local storage of , local computation of Compute summing local results and inverting
requires local storage of , local computation of
train.map(computer_outer_product) .reduce(sum_and_invert)
Big , big Time complexity
also matrix inversion is a bottleneckSpace complexity
also storage of is a bottleneckAnd of course previous bottlenecks are still there
Big , big Solution
A different approach to linear regressionRule of thumb: computation and storage should be linear in and
Exploit sparsityexplicitimplicit
Use a different algorithm
Gradient descentAssume
Choose , set Repeat until convergence
This algorithm converges to a local minimum for .
Gradient descentIf
Choose , set Repeat until convergence
Where with
Gradient descentCritical issues
Choosing initial pointSetting step size (learning rate)
Gradient descent for linearregression
Thus
Local minimization through gradient descent
ConvexityLinear regression is a convex problem, so gradient descent is OK
Dynamic learning rate
Big steps at the beginning of iteration
Small steps as we reach convergence
Parallelization of gradient descent
Send to all workersCompute summands in parallelNow each worker stores and (space complexity is )Computation is , too.
Classification problemNot that different from the ML framing of regression
Data: a set of associations between objects andlabelsGoal: find a mapping from objects to labels
describing observed data within a reasonable approximation levelgeneralizing to unseen observations
Technically, a supervised learning problem
Now, labels belong to a discrete set
positive/negative (binary classification, we'll stick on this)multi-class
Linear classificationUse something similar to regression in order to find two half-spaces for objectsClassify according to the half-space where objects belong
Note that in order for be reasonable
Evaluating predictionsIn regression: loss In binary classification: loss
null penalty in case of correct classificationunitary penalty in case of misclassification
Let
Learning the classifierIn regression In our case
A problem with convexity loss is not convex!
In [22]:
%matplotlib inline
from matplotlib import pyplot as plt
def zero_one_loss(z): return 0 if z >= 0 else 1
n = 20
z = np.arange(-5, 10, .1)
g = plt.plot(z, map(zero_one_loss, z))
plt.ylim([-1, 2])
Approximate lossNeed for convexityBetween various possibilities, log loss
Out[22]:
(-1, 2)
In [17]:
import numpy as np
def log_loss(z): return np.log(1 + np.exp(-z))
z = np.arange(-3, 10, .5) g = plt.plot(z, map(log_loss, z))
Result: logistic regression
Thus we are optimizing
Optimization through gradient descent
Computing derivativesAs
We have
Thus
Regularized logistic regressionAs in ridge regression, add a regularization term
Probabilistic interpretationA step ahead, instead of predicting class for an object ......estimate the probability of a class given the object
Probabilistic interpretationCan't use linear regression:
because probabilities belong to Can't use sign:
for same resaon of beforeCan use logistic function:
Logistic function
kinda of smooth approximation of a step function
In [18]:
def logistic(z): return 1. / (1 + np.exp(-z))
z = np.arange(-7, 7, .5) g = plt.plot(z, map(logistic, z))
Predicting probabilities
Use logistic regression to learn Predict probabilities as
Classifying using probabilities
Threshold probability: is positive if
Choosing threshold: ROC curves
Two kind of errors:
false positives (FP): objects classified as positive when they are negativefalse negatives (FN): objects classified as negative when they are positive