
Short Course: Robust Optimization and Machine Learning
3. Optimization in Supervised Learning

Laurent El Ghaoui
EECS and IEOR Departments, UC Berkeley

Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012
January 17, 2012


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References



What is supervised learning?

In supervised learning, we are given:
- a matrix of data points X = [x_1, ..., x_m], with x_i ∈ R^n;
- a vector of "responses" y ∈ R^m.

The goal is to use the data and the information in y (the "supervised" part) to form a model. In turn, the model can be used to predict a value y for a (yet unseen) new data point x ∈ R^n.


Generic form of problem

Many classification and regression problems can be written as

  \min_w \; L(X^T w, y) + \lambda\, p(w)

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the response vector (or vector of labels);
- w contains the classifier/regression coefficients;
- L is a "loss" function that depends on the problem considered;
- p is a penalty function that is usually data-independent (more on this later);
- λ ≥ 0 is a regularization parameter.

Prediction/classification rule: it depends only on w^T x, where x ∈ R^n is a new data point.


Popular loss functions:

- Squared loss (for linear least-squares regression):

  L(z, y) = \|z - y\|_2^2.

- Hinge loss (for SVMs):

  L(z, y) = \sum_{i=1}^m \max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = \sum_{i=1}^m \log(1 + e^{-y_i z_i}).

(The logistic loss is the negative log-likelihood of the logistic model introduced later.)
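As a quick illustration, here is a small plain-MATLAB sketch (not from the slides) that evaluates the three losses; it assumes a score vector z = X'*w and a label/response vector y are in the workspace, with y_i in {-1,+1} for the hinge and logistic losses:

% z: m-vector of scores X'*w; y: m-vector of responses or labels
sq_loss    = norm(z - y, 2)^2;            % squared loss
hinge_loss = sum(max(0, 1 - y.*z));       % hinge loss
log_loss   = sum(log(1 + exp(-y.*z)));    % logistic loss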


Linear least-squares

Basic model:

  \min_w \; \|X^T w - y\|_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the "response" vector;
- w contains the regression coefficients.

Prediction rule: ŷ = w^T x, where x ∈ R^n is a new data point.
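A minimal plain-MATLAB sketch of the basic model, assuming an n-by-m matrix X and an m-vector y are in the workspace (x_new is a hypothetical new data point):

w     = X' \ y;        % backslash solves the least-squares problem min_w ||X'*w - y||_2
y_hat = w' * x_new;    % prediction for a new point x_new in R^n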


Example: a linear regression problem

Linear auto-regressive model for time series: y_t is a linear function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

  x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares: we minimize the sum of squared errors

  \min_w \; \|X^T w - y\|_2^2.

Prediction rule: once we have "learnt" w, we can make a prediction for time T+1:

  \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
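A minimal sketch of this fit, assuming a column vector ys of length N ≥ 3 holding the observed time series is in the workspace (all variable names here are hypothetical):

N = numel(ys);
X = [ones(1, N-2); ys(2:N-1)'; ys(1:N-2)'];  % columns are x_t = (1, y_{t-1}, y_{t-2})
y = ys(3:N);                                 % responses y_t, t = 3, ..., N
w = X' \ y;                                  % least-squares fit
y_next = w' * [1; ys(N); ys(N-1)];           % prediction for time N+1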


Regularized least-squares

Often a penalty term is added, as in (Tikhonov) regularization:

  \min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_2^2,

where λ is usually chosen via cross-validation.

Motivation: assume that X is additively perturbed by a matrix U with independent Gaussian entries with mean 0 and variance λ/m. The above corresponds to minimizing the expected value of the objective, since

  E_U \|(X + U)^T w - y\|_2^2 = \|X^T w - y\|_2^2 + \lambda \|w\|_2^2.
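A minimal sketch: the regularized problem has a closed-form solution via the normal equations (X X^T + λ I) w = X y. Assuming X (n-by-m), y (m-by-1), n and lambda are in the workspace:

w = (X*X' + lambda*eye(n)) \ (X*y);   % ridge / Tikhonov-regularized least squares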


Robust Least-Squares

Another motivation comes from the robust least-squares problem

  \min_w \max_{U : \|U\| \le \lambda} \|(X + U)^T w - y\|_2,

where \|\cdot\| stands for the largest singular value of its matrix argument.

The above can be expressed as

  \min_w \; \|X^T w - y\|_2 + \lambda \|w\|_2.

As λ scans R_+, the path of solutions is the same as with regularized LS.
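A CVX sketch of the equivalent formulation above (not in the slides); it assumes X (n-by-m), y (m-by-1), n and lambda exist in the workspace:

cvx_begin
    variable w(n,1);
    minimize( norm(X'*w - y, 2) + lambda*norm(w, 2) )
cvx_end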


LASSO

The LASSO model is

  \min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_1,

where \|\cdot\|_1 is the l1-norm (sum of absolute values) of its vector argument.

- Observed to produce very sparse solutions (many zeros in w).
- The above has been extensively studied under various names (e.g., compressed sensing).
- Can be motivated by a component-wise uncertainty model on X.
- Can be solved by convex optimization methods.


CVX syntax

Here is a MATLAB snippet that solves a LASSO problem via CVX, given that an n × m matrix X, an m-vector y and a non-negative scalar lambda exist in the workspace:

cvx_begin
    variable w(n,1);
    variable r(m,1);
    minimize( r'*r + lambda*norm(w,1) )
    subject to
        r == X'*w - y;
cvx_end

- One can also use the syntax (X'*w-y)'*(X'*w-y) or square_pos(norm(X'*w-y,2)).
- However, the objective norm(X'*w-y,2)^2 will be rejected by CVX, as it does not know that what is squared is non-negative (the square of a non-negative convex function is convex).


Binary classification via SVM: data

We are given a training data set:
- Feature vectors: data points x_i ∈ R^p, i = 1, ..., n.
- Labels: y_i ∈ {−1, 1}, i = 1, ..., n.

Examples:

  Feature vectors            | Labels
  ---------------------------|---------------------------------
  Companies' corporate info  | default / no default
  Stock price data           | price up / down
  News data                  | price up / down
  News data                  | sentiment (positive / negative)
  Emails                     | presence of a keyword
  Genetic measures           | presence of disease


Linear classification

Using the training data set {x_i, y_i}_{i=1}^n, our goal is to find a classification rule y = f(x) that allows us to predict the label y of a new data point x.

Linear classification rule: assume f is the composition of the sign function with a linear (in fact, affine) function:

  \hat{y} = \mathrm{sign}(w^T x + b),

where w ∈ R^p and b ∈ R.

The goal of a linear classification algorithm is to find w and b using the training data.
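A small plain-MATLAB sketch of applying a linear classification rule, assuming X (p-by-n, columns x_i), y (n-vector of ±1 labels), w and b are in the workspace:

y_hat     = sign(X'*w + b);     % predicted labels for the training points
train_err = mean(y_hat ~= y);   % fraction of training points misclassified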


Separable data

The data is linearly separable if there exists a linear classification rule that makes no error on the training set.

This is a set of linear inequality constraints on (w, b):

  y_i (w^T x_i + b) \ge 0,   i = 1, ..., n.

Strict separability corresponds to the same conditions, but with strict inequalities.
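Strict separability can be checked numerically as a feasibility problem; here is a hedged CVX sketch (assuming X (p-by-n), y (n-vector of ±1 labels) and p are in the workspace):

cvx_begin
    variable w(p,1);
    variable b(1);
    y .* (X'*w + b) >= 1;   % strict separability, after rescaling (w, b)
cvx_end
% cvx_status is 'Solved' if a strictly separating hyperplane exists, 'Infeasible' otherwise.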


Geometry

Geometrically: the hyperplane

  {x : w^T x + b = 0}

perfectly separates the positive and negative data points.


Linear algebra flashback: hyperplanes

Geometrically, a hyperplane H = {x : w^T x = c} is a translation of the set of vectors orthogonal to w. The direction of the translation is determined by w, and the amount by c/\|w\|_2. Indeed, the projection of 0 onto H is x_0 = c\,w/(w^T w).


Geometry (cont’d)

Assuming strict separability, we can always rescale (w, b) and work with

  y_i (w^T x_i + b) \ge 1,   i = 1, ..., n.

This amounts to making sure that the negative (resp. positive) class is contained in the half-space w^T x + b \le -1 (resp. w^T x + b \ge 1).

The distance between the two "±1" boundaries turns out to be equal to 2/\|w\|_2.

Thus 2/\|w\|_2, the "margin", measures how well the hyperplane separates the data: the smaller \|w\|_2, the larger the margin.


Non-separable data

Separability constraints are homogeneous, so WLOG we can work with

  y_i (w^T x_i + b) \ge 1,   i = 1, ..., n.

If the above is infeasible, we try to minimize the "slacks":

  \min_{w,b,s} \; \sum_{i=1}^n s_i  :  s \ge 0, \; y_i (w^T x_i + b) \ge 1 - s_i, \; i = 1, ..., n.

The above can be solved as a "linear programming" (LP) problem, in the variables w, b, s.
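A CVX sketch of the slack-minimization LP above (assuming X (p-by-n), y (n-vector of ±1 labels), p and n are in the workspace):

cvx_begin
    variables w(p,1) b(1) s(n,1);
    minimize( sum(s) )
    subject to
        s >= 0;
        y .* (X'*w + b) >= 1 - s;
cvx_end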


Hinge loss function

The previous LP can be interpreted as minimizing the hinge loss function

  L(w, b) := \sum_{i=1}^n \max(1 - y_i (w^T x_i + b), 0).

This serves as an approximation of the number of errors made on the training set; in fact it is a convex upper bound on it.


Regularization

The solution might not be unique, so we add a regularization term \|w\|_2^2:

  \min_{w,b} \; \frac{1}{2}\|w\|_2^2 + C \cdot L(w, b),

where C > 0 allows a trade-off between accuracy on the training set and prediction error (more on why later). This makes the solution unique.

The above model is called the Support Vector Machine (SVM). It is a quadratic program (QP). It can be reliably solved using special fast algorithms that exploit its structure.

If C is large and the data is separable, the problem reduces to the maximal-margin problem

  \min_{w,b} \; \frac{1}{2}\|w\|_2^2  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.


CVX syntax

CVX allows one to bypass the need to put the problem in standard form. Here is a MATLAB snippet that solves an SVM problem, given that an n × m matrix X, an m-vector y and a non-negative scalar C exist in the workspace:

cvx_begin
    variable w(n,1);
    variable b(1);
    minimize( C*sum(max(0, 1 - y.*(X'*w + b))) + .5*w'*w )
cvx_end


Robustness interpretation

Return to separable data. The set of constraints

  y_i (w^T x_i + b) \ge 0,   i = 1, ..., n,

has many possible solutions (w, b).

We will select a solution based on the idea of robustness (to changes in the data points).


Maximally robust separating hyperplane

Spherical uncertainty model: assume that the data points are not known exactly, but only up to a sphere around the given nominal points:

  x_i \in S_i := \{ x_i + u_i : \|u_i\|_2 \le \rho \},

where the x_i are known, ρ > 0 is a given measure of uncertainty, and u_i is unknown.

Robust counterpart: we now ask that the hyperplane separates the spheres (and not just the points):

  \forall u_i, \|u_i\|_2 \le \rho :  y_i (w^T (x_i + u_i) + b) \ge 0,   i = 1, ..., n.

For separable data we can try to separate spheres around the given points. We grow the spheres' radius until sphere separation becomes impossible.


Robust classification

We obtain the equivalent condition

  y_i (w^T x_i + b) \ge \rho \|w\|_2,   i = 1, ..., n.

Now we seek (w, b) which maximize ρ subject to the above.

By homogeneity we can always set \rho \|w\|_2 = 1, so that the problem reduces to

  \min_{w,b} \; \|w\|_2  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.

This is exactly the same problem as the SVM in the separable case.


Dual problem

Denote by C_+ (resp. C_−) the set of points x_i with y_i = +1 (resp. y_i = −1).

It can be shown that the dual of the SVM problem can be expressed as

  \min_{x_+, x_-} \; \|x_+ - x_-\|_2  :  x_+ \in \mathrm{Co}\, C_+, \; x_- \in \mathrm{Co}\, C_-,

where Co C denotes the convex hull of the set C, that is,

  \mathrm{Co}\, C = \left\{ \sum_{i=1}^q \lambda_i x_i : x_i \in C, \; \lambda \ge 0, \; \sum_{i=1}^q \lambda_i = 1 \right\}.
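A CVX sketch of this dual, assuming the positive-class points are the columns of Xp (p-by-np) and the negative-class points the columns of Xm (p-by-nm); all names are hypothetical:

cvx_begin
    variables lp(np,1) lm(nm,1);
    minimize( norm(Xp*lp - Xm*lm, 2) )   % distance between points of the two convex hulls
    subject to
        lp >= 0; sum(lp) == 1;
        lm >= 0; sum(lm) == 1;
cvx_end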


Geometry

The dual problem amounts to finding the smallest distance between the two classes, each represented by the convex hull of its points. The optimal hyperplane sits at the middle of the line segment joining the two closest points.


Separating boxes instead of spheres

We can use a box uncertainty model:

  x_i \in B_i := \{ x_i + u_i : \|u_i\|_\infty \le \rho \}.

This leads to

  \min_{w,b} \; \|w\|_1  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.

Classifiers found this way tend to be sparse. In 2D, the boundary line tends to be vertical or horizontal.


CVX syntax

Here is a MATLAB snippet that solves an SVM problem with an l1 penalty, given that an n × m matrix X, an m-vector y and a non-negative scalar C exist in the workspace:

cvx_begin
    variable w(n,1);
    variable b(1);
    minimize( C*sum(max(0, 1 - y.*(X'*w + b))) + norm(w,1) )
cvx_end


Logistic model

We model the probability of a label Y being equal to y ∈ {−1, 1}, given a data point x ∈ R^n, as

  P(Y = y \mid x) = \frac{1}{1 + \exp(-y (w^T x + b))}.

This amounts to modeling the log-odds ratio as a linear function of x:

  \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = w^T x + b.

- The decision boundary P(Y = 1 | x) = P(Y = −1 | x) is the hyperplane with equation w^T x + b = 0.
- The region P(Y = 1 | x) ≥ P(Y = −1 | x) (i.e., w^T x + b ≥ 0) corresponds to points with predicted label ŷ = +1.
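A small plain-MATLAB sketch of the model's predictions, assuming X (n-by-m, columns x_i), w and b are in the workspace:

p_plus = 1 ./ (1 + exp(-(X'*w + b)));   % P(Y = +1 | x_i) for each data point
y_hat  = sign(X'*w + b);                % predicted labels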


Maximum-likelihood

The likelihood function is

  l(w, b) = \prod_{i=1}^m \frac{1}{1 + e^{-y_i (w^T x_i + b)}}.

Now maximize the log-likelihood:

  \max_{w,b} \; L(w, b) := -\sum_{i=1}^m \log\left(1 + e^{-y_i (w^T x_i + b)}\right).

In practice, we may consider adding a regularization term, i.e., solving

  \max_{w,b} \; L(w, b) - \lambda\, r(w),

with r(w) = \|w\|_2^2 or r(w) = \|w\|_1.
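The (unregularized) log-likelihood is smooth and concave, so it can be maximized by simple first-order methods. A minimal plain-MATLAB gradient-ascent sketch, assuming X (n-by-m, columns x_i) and y (m-vector of ±1 labels) are in the workspace; the step size and iteration count are arbitrary illustrative choices:

[n, m] = size(X);
w = zeros(n,1); b = 0; step = 1e-3;
for k = 1:1000
    s = y .* (X'*w + b);       % margins y_i (w'*x_i + b)
    g = y ./ (1 + exp(s));     % per-sample derivative of the log-likelihood w.r.t. the margin
    w = w + step * (X*g);      % ascent step on w
    b = b + step * sum(g);     % ascent step on b
end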


Learning problems seen so far

Least-squares linear regression, SVM, and logistic regression problems can all be expressed as minimization problems

  \min_{w,b} \; f(w, b)

with
- f(w, b) = \|X^T w - y\|_2^2 (squared loss);
- f(w, b) = \sum_{i=1}^m \max(0, 1 - y_i (w^T x_i + b)) (hinge loss);
- f(w, b) = \sum_{i=1}^m \log(1 + \exp(-y_i (w^T x_i + b))) (logistic loss, the negative log-likelihood).

The regularized version involves an additional penalty term.


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References


A linear regression problem

Linear auto-regressive model for time series: y_t is a linear function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

  x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares:

  \min_w \; \|X^T w - y\|_2^2.

Prediction rule: \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.


Nonlinear regression

Nonlinear auto-regressive model for time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T \phi(x_t), with \phi(x_t) the augmented feature vectors

  \phi(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by \phi(x).
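As a sketch, the augmented feature map can be coded directly; with it, the fitting proceeds exactly as in the linear case (the helper name phi is hypothetical):

phi = @(y1, y2) [1; y1; y2; y1^2; y1*y2; y2^2];   % y1 = y_{t-1}, y2 = y_{t-2}
% Stack phi(y_{t-1}, y_{t-2}) as columns of the data matrix and fit w by least squares as before.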


Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

  w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T \phi(x) + b = 0, with \phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).


Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA.)

How do we do it in a computationally efficient manner?


Generic form of learning problem

Many classification and regression problems can be written as

  \min_w \; L(X^T w, y) + \lambda \|w\|_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the response vector (or vector of labels);
- w contains the classifier coefficients;
- L is a "loss" function that depends on the problem considered;
- λ ≥ 0 is a regularization parameter.

Prediction/classification rule: it depends only on w^T x, where x ∈ R^n is a new data point.

Note: from now on, we consider l2 penalties only!


Loss functions:

- Squared loss (for linear least-squares regression):

  L(z, y) = \|z - y\|_2^2.

- Hinge loss (for SVMs):

  L(z, y) = \sum_{i=1}^m \max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = \sum_{i=1}^m \log(1 + e^{-y_i z_i}).


Key result

For the generic problem

  \min_w \; L(X^T w) + \lambda \|w\|_2^2,

the optimal w lies in the span of the data points x_1, ..., x_m:

  w = X v

for some vector v ∈ R^m.


Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

  w = X v + r,

where X^T r = 0 (that is, r lies in the nullspace N(X^T)). Such a decomposition always exists, since R^n = R(X) ⊕ N(X^T) by the fundamental theorem of linear algebra: the range of X and the nullspace of X^T are orthogonal complements.

Then X^T w = X^T X v, so the loss term does not depend on r, while

  \|w\|_2^2 = \|X v\|_2^2 + \|r\|_2^2 \ge \|X v\|_2^2.

Hence setting r = 0 can only decrease the objective, and an optimal w can always be taken of the form w = X v.

[Figure: illustration of the fundamental theorem of linear algebra in R^3, for the case X = A = (a_1, a_2): R^3 decomposes as the direct sum of the range R(A) and the nullspace N(A^T).]


Consequence of key result

For the generic problem

  \min_w \; L(X^T w) + \lambda \|w\|_2^2,

the optimal w can be written as w = X v for some vector v ∈ R^m.

Hence the training problem depends only on K := X^T X:

  \min_v \; L(K v) + \lambda\, v^T K v.
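For the squared loss, this kernelized problem can be solved in closed form: setting the gradient to zero shows that any v with (K + λI) v = y is optimal. A minimal sketch, assuming X (n-by-m), y (m-by-1), m and lambda are in the workspace:

K = X'*X;                      % kernel (Gram) matrix of the data points
v = (K + lambda*eye(m)) \ y;   % training in the span coefficients v
w = X*v;                       % the corresponding weight vector, if needed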


Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

  K_{ij} = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule depends on the scalar products between the new point x and the data points x_1, ..., x_m:

  w^T x = v^T X^T x = v^T k,   where  k := X^T x = (x^T x_1, ..., x^T x_m).
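A minimal sketch of the kernelized prediction rule, assuming X (n-by-m), the coefficients v, and a new point x_new are in the workspace:

k     = X' * x_new;   % scalar products of the new point with the data points
y_hat = v' * k;       % equals w'*x_new with w = X*v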


Computational advantages

Once K is formed (each entry takes O(n) operations), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.


How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors \phi(x_i), with \phi a nonlinear mapping.

Example: in classification with a quadratic decision boundary, we use

  \phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

  K_{ij} = \phi(x_i)^T \phi(x_j),   1 ≤ i, j ≤ m.


The kernel function

The kernel function associated with the mapping \phi is

  k(x, z) = \phi(x)^T \phi(z).

It provides information about the metric in the feature space, e.g.

  \|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2 k(x, z) + k(z, z).

The computational effort involved in
- solving the training problem, and
- making a prediction
depends only on our ability to quickly evaluate such scalar products.

We cannot choose k arbitrarily; it has to satisfy the above for some \phi.


Quadratic kernels

Classification with quadratic boundaries involves feature vectors of the form

  \phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)

(the \sqrt{2} scalings are harmless; they are chosen so that the identity below is exact).

Fact: given two vectors x, z ∈ R^2, we have

  \phi(x)^T \phi(z) = (1 + x^T z)^2.
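A quick numerical sanity check of this identity in MATLAB (with the \sqrt{2}-scaled feature map written out explicitly):

x = randn(2,1); z = randn(2,1);
phi = @(u) [1; sqrt(2)*u(1); sqrt(2)*u(2); u(1)^2; sqrt(2)*u(1)*u(2); u(2)^2];
abs(phi(x)'*phi(z) - (1 + x'*z)^2)   % numerically zero (up to rounding)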


Polynomial kernels

More generally, when \phi(x) is the vector formed with all the (suitably scaled) products between the components of x ∈ R^n, up to degree d, then for any two vectors x, z ∈ R^n,

  \phi(x)^T \phi(z) = (1 + x^T z)^d.

The computational effort grows linearly in n.

This is a dramatic reduction in effort compared with the "brute force" approach:
- form \phi(x), \phi(z);
- evaluate \phi(x)^T \phi(z);
whose computational effort grows as n^d.


Other kernels

Gaussian kernel function:

  k(x, z) = \exp\left( -\frac{\|x - z\|_2^2}{2\sigma^2} \right),

where σ > 0 is a scale parameter. It allows us to ignore points that are too far apart, and corresponds to a nonlinear mapping \phi to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
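A minimal sketch of building the Gaussian kernel matrix for the columns of a data matrix X (n-by-m), with scale parameter sigma; it uses MATLAB's implicit expansion (R2016b or later):

sq = sum(X.^2, 1);             % 1-by-m row of squared norms
D  = sq' + sq - 2*(X'*X);      % m-by-m matrix of squared pairwise distances
K  = exp(-D / (2*sigma^2));    % Gaussian kernel matrix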


In practice:
- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian or polynomial kernels are popular.
- Over-fitting is controlled via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
- Kernel methods are not well adapted to l1-norm regularization.


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References


References

S. Boyd and M. Grant. The CVX optimization package, 2010.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines.

S. Sra, S. J. Wright, and S. Nowozin. Optimization for Machine Learning. MIT Press, 2011.