
Short Course: Robust Optimization and Machine Learning
3. Optimization in Supervised Learning

Laurent El Ghaoui
EECS and IEOR Departments, UC Berkeley

Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012
January 17, 2012


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References



What is supervised learning?

In supervised learning, we are given:
- a matrix of data points X = [x_1, ..., x_m], with x_i ∈ R^n;
- a vector of "responses" y ∈ R^m.

The goal is to use the data and the information in y (the "supervised" part) to form a model. In turn, the model can be used to predict a value y for a (yet unseen) new data point x ∈ R^n.


Generic form of problem

Many classification and regression problems can be written as

  \min_w \; L(X^T w, y) + \lambda\, p(w)

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the response vector (or vector of labels);
- w contains the classifier/regression coefficients;
- L is a "loss" function that depends on the problem considered;
- p is a penalty function that is usually data-independent (more on this later);
- λ ≥ 0 is a regularization parameter.

Prediction/classification rule: it depends only on w^T x, where x ∈ R^n is a new data point.


Popular loss functions:

- Squared loss (for linear least-squares regression):

  L(z, y) = \|z - y\|_2^2.

- Hinge loss (for SVMs):

  L(z, y) = \sum_{i=1}^m \max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = \sum_{i=1}^m \log(1 + e^{-y_i z_i}).

(The logistic loss is the negative log-likelihood of the logistic model introduced later.)
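As a quick illustration, here is a small plain-MATLAB sketch (not from the slides) that evaluates the three losses; it assumes a score vector z = X'*w and a label/response vector y are in the workspace, with y_i in {-1,+1} for the hinge and logistic losses:

% z: m-vector of scores X'*w; y: m-vector of responses or labels
sq_loss    = norm(z - y, 2)^2;            % squared loss
hinge_loss = sum(max(0, 1 - y.*z));       % hinge loss
log_loss   = sum(log(1 + exp(-y.*z)));    % logistic loss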


Linear least-squares

Basic model:

  \min_w \; \|X^T w - y\|_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the "response" vector;
- w contains the regression coefficients.

Prediction rule: ŷ = w^T x, where x ∈ R^n is a new data point.
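A minimal plain-MATLAB sketch of the basic model, assuming an n-by-m matrix X and an m-vector y are in the workspace (x_new is a hypothetical new data point):

w     = X' \ y;        % backslash solves the least-squares problem min_w ||X'*w - y||_2
y_hat = w' * x_new;    % prediction for a new point x_new in R^n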


Example: a linear regression problem

Linear auto-regressive model for time series: y_t is a linear function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

  x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares: we minimize the sum of squared errors

  \min_w \; \|X^T w - y\|_2^2.

Prediction rule: once we have "learnt" w, we can make a prediction for time T+1:

  \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
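A minimal sketch of this fit, assuming a column vector ys of length N ≥ 3 holding the observed time series is in the workspace (all variable names here are hypothetical):

N = numel(ys);
X = [ones(1, N-2); ys(2:N-1)'; ys(1:N-2)'];  % columns are x_t = (1, y_{t-1}, y_{t-2})
y = ys(3:N);                                 % responses y_t, t = 3, ..., N
w = X' \ y;                                  % least-squares fit
y_next = w' * [1; ys(N); ys(N-1)];           % prediction for time N+1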


Regularized least-squares

Often a penalty term is added, as in (Tikhonov) regularization:

  \min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_2^2,

where λ is usually chosen via cross-validation.

Motivation: assume that X is additively perturbed by a matrix U with independent Gaussian entries with mean 0 and variance λ/m. The above corresponds to minimizing the expected value of the objective, since

  E_U \|(X + U)^T w - y\|_2^2 = \|X^T w - y\|_2^2 + \lambda \|w\|_2^2.
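A minimal sketch: the regularized problem has a closed-form solution via the normal equations (X X^T + λ I) w = X y. Assuming X (n-by-m), y (m-by-1), n and lambda are in the workspace:

w = (X*X' + lambda*eye(n)) \ (X*y);   % ridge / Tikhonov-regularized least squares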


Robust Least-Squares

Another motivation comes from the robust least-squares problem

  \min_w \max_{U : \|U\| \le \lambda} \|(X + U)^T w - y\|_2,

where \|\cdot\| stands for the largest singular value of its matrix argument.

The above can be expressed as

  \min_w \; \|X^T w - y\|_2 + \lambda \|w\|_2.

As λ scans R_+, the path of solutions is the same as with regularized LS.
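A CVX sketch of the equivalent formulation above (not in the slides); it assumes X (n-by-m), y (m-by-1), n and lambda exist in the workspace:

cvx_begin
    variable w(n,1);
    minimize( norm(X'*w - y, 2) + lambda*norm(w, 2) )
cvx_end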


LASSO

The LASSO model is

  \min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_1,

where \|\cdot\|_1 is the l1-norm (sum of absolute values) of its vector argument.

- Observed to produce very sparse solutions (many zeros in w).
- The above has been extensively studied under various names (e.g., compressed sensing).
- Can be motivated by a component-wise uncertainty model on X.
- Can be solved by convex optimization methods.


CVX syntax

Here is a MATLAB snippet that solves a LASSO problem via CVX, given that an n × m matrix X, an m-vector y and a non-negative scalar lambda exist in the workspace:

cvx_begin
    variable w(n,1);
    variable r(m,1);
    minimize( r'*r + lambda*norm(w,1) )
    subject to
        r == X'*w - y;
cvx_end

- One can also use the syntax (X'*w-y)'*(X'*w-y) or square_pos(norm(X'*w-y,2)).
- However, the objective norm(X'*w-y,2)^2 will be rejected by CVX, as it does not know that what is squared is non-negative (the square of a non-negative convex function is convex).


Binary classification via SVM: data

We are given a training data set:
- Feature vectors: data points x_i ∈ R^p, i = 1, ..., n.
- Labels: y_i ∈ {−1, 1}, i = 1, ..., n.

Examples:

  Feature vectors            | Labels
  ---------------------------|---------------------------------
  Companies' corporate info  | default / no default
  Stock price data           | price up / down
  News data                  | price up / down
  News data                  | sentiment (positive / negative)
  Emails                     | presence of a keyword
  Genetic measures           | presence of disease


Linear classification

Using the training data set {x_i, y_i}_{i=1}^n, our goal is to find a classification rule y = f(x) that allows us to predict the label y of a new data point x.

Linear classification rule: assume f is the composition of the sign function with a linear (in fact, affine) function:

  \hat{y} = \mathrm{sign}(w^T x + b),

where w ∈ R^p and b ∈ R.

The goal of a linear classification algorithm is to find w and b using the training data.
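A small plain-MATLAB sketch of applying a linear classification rule, assuming X (p-by-n, columns x_i), y (n-vector of ±1 labels), w and b are in the workspace:

y_hat     = sign(X'*w + b);     % predicted labels for the training points
train_err = mean(y_hat ~= y);   % fraction of training points misclassified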


Separable data

The data is linearly separable if there exists a linear classification rule that makes no error on the training set.

This is a set of linear inequality constraints on (w, b):

  y_i (w^T x_i + b) \ge 0,   i = 1, ..., n.

Strict separability corresponds to the same conditions, but with strict inequalities.
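Strict separability can be checked numerically as a feasibility problem; here is a hedged CVX sketch (assuming X (p-by-n), y (n-vector of ±1 labels) and p are in the workspace):

cvx_begin
    variable w(p,1);
    variable b(1);
    y .* (X'*w + b) >= 1;   % strict separability, after rescaling (w, b)
cvx_end
% cvx_status is 'Solved' if a strictly separating hyperplane exists, 'Infeasible' otherwise.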


Geometry

Geometrically: the hyperplane

  {x : w^T x + b = 0}

perfectly separates the positive and negative data points.


Linear algebra flashback: hyperplanes

Geometrically, a hyperplane H = {x : w^T x = c} is a translation of the set of vectors orthogonal to w. The direction of the translation is determined by w, and the amount by c/\|w\|_2. Indeed, the projection of 0 onto H is x_0 = c\,w/(w^T w).


Geometry (cont’d)

Assuming strict separability, we can always rescale (w, b) and work with

  y_i (w^T x_i + b) \ge 1,   i = 1, ..., n.

This amounts to making sure that the negative (resp. positive) class is contained in the half-space w^T x + b \le -1 (resp. w^T x + b \ge 1).

The distance between the two "±1" boundaries turns out to be equal to 2/\|w\|_2.

Thus 2/\|w\|_2, the "margin", measures how well the hyperplane separates the data: the smaller \|w\|_2, the larger the margin.


Non-separable data

Separability constraints are homogeneous, so WLOG we can work with

  y_i (w^T x_i + b) \ge 1,   i = 1, ..., n.

If the above is infeasible, we try to minimize the "slacks":

  \min_{w,b,s} \; \sum_{i=1}^n s_i  :  s \ge 0, \; y_i (w^T x_i + b) \ge 1 - s_i, \; i = 1, ..., n.

The above can be solved as a "linear programming" (LP) problem, in the variables w, b, s.
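A CVX sketch of the slack-minimization LP above (assuming X (p-by-n), y (n-vector of ±1 labels), p and n are in the workspace):

cvx_begin
    variables w(p,1) b(1) s(n,1);
    minimize( sum(s) )
    subject to
        s >= 0;
        y .* (X'*w + b) >= 1 - s;
cvx_end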


Hinge loss function

The previous LP can be interpreted as minimizing the hinge loss function

  L(w, b) := \sum_{i=1}^n \max(1 - y_i (w^T x_i + b), 0).

This serves as an approximation of the number of errors made on the training set; in fact it is a convex upper bound on it.


Regularization

The solution might not be unique, so we add a regularization term \|w\|_2^2:

  \min_{w,b} \; \frac{1}{2}\|w\|_2^2 + C \cdot L(w, b),

where C > 0 allows a trade-off between accuracy on the training set and prediction error (more on why later). This makes the solution unique.

The above model is called the Support Vector Machine (SVM). It is a quadratic program (QP). It can be reliably solved using special fast algorithms that exploit its structure.

If C is large and the data is separable, the problem reduces to the maximal-margin problem

  \min_{w,b} \; \frac{1}{2}\|w\|_2^2  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.


CVX syntax

CVX allows one to bypass the need to put the problem in standard form. Here is a MATLAB snippet that solves an SVM problem, given that an n × m matrix X, an m-vector y and a non-negative scalar C exist in the workspace:

cvx_begin
    variable w(n,1);
    variable b(1);
    minimize( C*sum(max(0, 1 - y.*(X'*w + b))) + .5*w'*w )
cvx_end


Robustness interpretation

Return to separable data. The set of constraints

  y_i (w^T x_i + b) \ge 0,   i = 1, ..., n,

has many possible solutions (w, b).

We will select a solution based on the idea of robustness (to changes in the data points).


Maximally robust separating hyperplane

Spherical uncertainty model: assume that the data points are not known exactly, but only up to a sphere around the given nominal points:

  x_i \in S_i := \{ x_i + u_i : \|u_i\|_2 \le \rho \},

where the x_i are known, ρ > 0 is a given measure of uncertainty, and u_i is unknown.

Robust counterpart: we now ask that the hyperplane separates the spheres (and not just the points):

  \forall u_i, \|u_i\|_2 \le \rho :  y_i (w^T (x_i + u_i) + b) \ge 0,   i = 1, ..., n.

For separable data we can try to separate spheres around the given points. We grow the spheres' radius until sphere separation becomes impossible.


Robust classification

We obtain the equivalent condition

  y_i (w^T x_i + b) \ge \rho \|w\|_2,   i = 1, ..., n.

Now we seek (w, b) which maximize ρ subject to the above.

By homogeneity we can always set \rho \|w\|_2 = 1, so that the problem reduces to

  \min_{w,b} \; \|w\|_2  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.

This is exactly the same problem as the SVM in the separable case.


Dual problem

Denote by C_+ (resp. C_−) the set of points x_i with y_i = +1 (resp. y_i = −1).

It can be shown that the dual of the SVM problem can be expressed as

  \min_{x_+, x_-} \; \|x_+ - x_-\|_2  :  x_+ \in \mathrm{Co}\, C_+, \; x_- \in \mathrm{Co}\, C_-,

where Co C denotes the convex hull of the set C, that is,

  \mathrm{Co}\, C = \left\{ \sum_{i=1}^q \lambda_i x_i : x_i \in C, \; \lambda \ge 0, \; \sum_{i=1}^q \lambda_i = 1 \right\}.
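A CVX sketch of this dual, assuming the positive-class points are the columns of Xp (p-by-np) and the negative-class points the columns of Xm (p-by-nm); all names are hypothetical:

cvx_begin
    variables lp(np,1) lm(nm,1);
    minimize( norm(Xp*lp - Xm*lm, 2) )   % distance between points of the two convex hulls
    subject to
        lp >= 0; sum(lp) == 1;
        lm >= 0; sum(lm) == 1;
cvx_end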


Geometry

The dual problem amounts to finding the smallest distance between the two classes, each represented by the convex hull of its points. The optimal hyperplane sits at the middle of the line segment joining the two closest points.


Separating boxes instead of spheres

We can use a box uncertainty model:

  x_i \in B_i := \{ x_i + u_i : \|u_i\|_\infty \le \rho \}.

This leads to

  \min_{w,b} \; \|w\|_1  :  y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.

Classifiers found this way tend to be sparse. In 2D, the boundary line tends to be vertical or horizontal.


CVX syntax

Here is a MATLAB snippet that solves an SVM problem with an l1 penalty, given that an n × m matrix X, an m-vector y and a non-negative scalar C exist in the workspace:

cvx_begin
    variable w(n,1);
    variable b(1);
    minimize( C*sum(max(0, 1 - y.*(X'*w + b))) + norm(w,1) )
cvx_end


Logistic model

We model the probability of a label Y being equal to y ∈ {−1, 1}, given a data point x ∈ R^n, as

  P(Y = y \mid x) = \frac{1}{1 + \exp(-y (w^T x + b))}.

This amounts to modeling the log-odds ratio as a linear function of x:

  \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = w^T x + b.

- The decision boundary P(Y = 1 | x) = P(Y = −1 | x) is the hyperplane with equation w^T x + b = 0.
- The region P(Y = 1 | x) ≥ P(Y = −1 | x) (i.e., w^T x + b ≥ 0) corresponds to points with predicted label ŷ = +1.
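A small plain-MATLAB sketch of the model's predictions, assuming X (n-by-m, columns x_i), w and b are in the workspace:

p_plus = 1 ./ (1 + exp(-(X'*w + b)));   % P(Y = +1 | x_i) for each data point
y_hat  = sign(X'*w + b);                % predicted labels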


Maximum-likelihood

The likelihood function is

  l(w, b) = \prod_{i=1}^m \frac{1}{1 + e^{-y_i (w^T x_i + b)}}.

Now maximize the log-likelihood:

  \max_{w,b} \; L(w, b) := -\sum_{i=1}^m \log\left(1 + e^{-y_i (w^T x_i + b)}\right).

In practice, we may consider adding a regularization term, i.e., solving

  \max_{w,b} \; L(w, b) - \lambda\, r(w),

with r(w) = \|w\|_2^2 or r(w) = \|w\|_1.
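The (unregularized) log-likelihood is smooth and concave, so it can be maximized by simple first-order methods. A minimal plain-MATLAB gradient-ascent sketch, assuming X (n-by-m, columns x_i) and y (m-vector of ±1 labels) are in the workspace; the step size and iteration count are arbitrary illustrative choices:

[n, m] = size(X);
w = zeros(n,1); b = 0; step = 1e-3;
for k = 1:1000
    s = y .* (X'*w + b);       % margins y_i (w'*x_i + b)
    g = y ./ (1 + exp(s));     % per-sample derivative of the log-likelihood w.r.t. the margin
    w = w + step * (X*g);      % ascent step on w
    b = b + step * sum(g);     % ascent step on b
end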


Learning problems seen so far

Least-squares linear regression, SVM, and logistic regression problems can all be expressed as minimization problems

  \min_{w,b} \; f(w, b)

with
- f(w, b) = \|X^T w - y\|_2^2 (squared loss);
- f(w, b) = \sum_{i=1}^m \max(0, 1 - y_i (w^T x_i + b)) (hinge loss);
- f(w, b) = \sum_{i=1}^m \log(1 + \exp(-y_i (w^T x_i + b))) (logistic loss, the negative log-likelihood).

The regularized version involves an additional penalty term.


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References


A linear regression problem

Linear auto-regressive model for time series: y_t is a linear function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

  x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares:

  \min_w \; \|X^T w - y\|_2^2.

Prediction rule: \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.


Nonlinear regression

Nonlinear auto-regressive model for time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

  y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T \phi(x_t), with \phi(x_t) the augmented feature vectors

  \phi(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by \phi(x).
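As a sketch, the augmented feature map can be coded directly; with it, the fitting proceeds exactly as in the linear case (the helper name phi is hypothetical):

phi = @(y1, y2) [1; y1; y2; y1^2; y1*y2; y2^2];   % y1 = y_{t-1}, y2 = y_{t-2}
% Stack phi(y_{t-1}, y_{t-2}) as columns of the data matrix and fit w by least squares as before.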


Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

  w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T \phi(x) + b = 0, with \phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).


Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA.)

How do we do it in a computationally efficient manner?


Generic form of learning problem

Many classification and regression problems can be written as

  \min_w \; L(X^T w, y) + \lambda \|w\|_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points;
- y ∈ R^m is the response vector (or vector of labels);
- w contains the classifier coefficients;
- L is a "loss" function that depends on the problem considered;
- λ ≥ 0 is a regularization parameter.

Prediction/classification rule: it depends only on w^T x, where x ∈ R^n is a new data point.

Note: from now on, we consider l2 penalties only!


Loss functions:

- Squared loss (for linear least-squares regression):

  L(z, y) = \|z - y\|_2^2.

- Hinge loss (for SVMs):

  L(z, y) = \sum_{i=1}^m \max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = \sum_{i=1}^m \log(1 + e^{-y_i z_i}).


Key result

For the generic problem

  \min_w \; L(X^T w) + \lambda \|w\|_2^2,

the optimal w lies in the span of the data points x_1, ..., x_m:

  w = X v

for some vector v ∈ R^m.


Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

  w = X v + r,

where X^T r = 0 (that is, r lies in the nullspace N(X^T)). Such a decomposition always exists, since R^n = R(X) ⊕ N(X^T) by the fundamental theorem of linear algebra: the range of X and the nullspace of X^T are orthogonal complements.

Then X^T w = X^T X v, so the loss term does not depend on r, while

  \|w\|_2^2 = \|X v\|_2^2 + \|r\|_2^2 \ge \|X v\|_2^2.

Hence setting r = 0 can only decrease the objective, and an optimal w can always be taken of the form w = X v.

[Figure: illustration of the fundamental theorem of linear algebra in R^3, for the case X = A = (a_1, a_2): R^3 decomposes as the direct sum of the range R(A) and the nullspace N(A^T).]


Consequence of key result

For the generic problem

  \min_w \; L(X^T w) + \lambda \|w\|_2^2,

the optimal w can be written as w = X v for some vector v ∈ R^m.

Hence the training problem depends only on K := X^T X:

  \min_v \; L(K v) + \lambda\, v^T K v.
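For the squared loss, this kernelized problem can be solved in closed form: setting the gradient to zero shows that any v with (K + λI) v = y is optimal. A minimal sketch, assuming X (n-by-m), y (m-by-1), m and lambda are in the workspace:

K = X'*X;                      % kernel (Gram) matrix of the data points
v = (K + lambda*eye(m)) \ y;   % training in the span coefficients v
w = X*v;                       % the corresponding weight vector, if needed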


Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

  K_{ij} = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule depends on the scalar products between the new point x and the data points x_1, ..., x_m:

  w^T x = v^T X^T x = v^T k,   where  k := X^T x = (x^T x_1, ..., x^T x_m).
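A minimal sketch of the kernelized prediction rule, assuming X (n-by-m), the coefficients v, and a new point x_new are in the workspace:

k     = X' * x_new;   % scalar products of the new point with the data points
y_hat = v' * k;       % equals w'*x_new with w = X*v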


Computational advantages

Once K is formed (each entry takes O(n) operations), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.


How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors \phi(x_i), with \phi a nonlinear mapping.

Example: in classification with a quadratic decision boundary, we use

  \phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

  K_{ij} = \phi(x_i)^T \phi(x_j),   1 ≤ i, j ≤ m.


The kernel function

The kernel function associated with the mapping \phi is

  k(x, z) = \phi(x)^T \phi(z).

It provides information about the metric in the feature space, e.g.

  \|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2 k(x, z) + k(z, z).

The computational effort involved in
- solving the training problem, and
- making a prediction
depends only on our ability to quickly evaluate such scalar products.

We cannot choose k arbitrarily; it has to satisfy the above for some \phi.


Quadratic kernels

Classification with quadratic boundaries involves feature vectors of the form

  \phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)

(the \sqrt{2} scalings are harmless; they are chosen so that the identity below is exact).

Fact: given two vectors x, z ∈ R^2, we have

  \phi(x)^T \phi(z) = (1 + x^T z)^2.
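A quick numerical sanity check of this identity in MATLAB (with the \sqrt{2}-scaled feature map written out explicitly):

x = randn(2,1); z = randn(2,1);
phi = @(u) [1; sqrt(2)*u(1); sqrt(2)*u(2); u(1)^2; sqrt(2)*u(1)*u(2); u(2)^2];
abs(phi(x)'*phi(z) - (1 + x'*z)^2)   % numerically zero (up to rounding)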


Polynomial kernels

More generally, when \phi(x) is the vector formed with all the (suitably scaled) products between the components of x ∈ R^n, up to degree d, then for any two vectors x, z ∈ R^n,

  \phi(x)^T \phi(z) = (1 + x^T z)^d.

The computational effort grows linearly in n.

This is a dramatic reduction in effort compared with the "brute force" approach:
- form \phi(x), \phi(z);
- evaluate \phi(x)^T \phi(z);
whose computational effort grows as n^d.


Other kernels

Gaussian kernel function:

  k(x, z) = \exp\left( -\frac{\|x - z\|_2^2}{2\sigma^2} \right),

where σ > 0 is a scale parameter. It allows us to ignore points that are too far apart, and corresponds to a nonlinear mapping \phi to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
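A minimal sketch of building the Gaussian kernel matrix for the columns of a data matrix X (n-by-m), with scale parameter sigma; it uses MATLAB's implicit expansion (R2016b or later):

sq = sum(X.^2, 1);             % 1-by-m row of squared norms
D  = sq' + sq - 2*(X'*X);      % m-by-m matrix of squared pairwise distances
K  = exp(-D / (2*sigma^2));    % Gaussian kernel matrix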


In practice:
- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian or polynomial kernels are popular.
- Over-fitting is controlled via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
- Kernel methods are not well adapted to l1-norm regularization.


Outline

Overview of Supervised Learning
  Supervised learning models
  Least-squares and variants
  Support vector machine
  Logistic regression

Kernels
  Motivations
  Kernel trick
  Examples

References


References

S. Boyd and M. Grant. The CVX optimization package, 2010.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines.

S. Sra, S. J. Wright, and S. Nowozin. Optimization for Machine Learning. MIT Press, 2011.