Indirect Rule Learning: Support Vector Machines
Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect learning: loss optimization
- Indirect learning does not estimate the prediction rule $f(x)$ directly, since most loss functions do not have explicit optimizers.
- Instead, indirect learning aims to directly minimize an empirical approximation of the expected loss function.
  – Most often, it minimizes the empirical risk $\sum_{i=1}^n L(Y_i, f(X_i))$ (empirical risk minimization).
- For example, least squares estimation minimizes $\sum_{i=1}^n (Y_i - f(X_i))^2$, and the classification problem minimizes $\sum_{i=1}^n I(Y_i \neq f(X_i))$.
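As a minimal sketch (the data and the fitted rules are hypothetical, for illustration only), the two empirical risks above can be computed directly:

```python
import numpy as np

# hypothetical data and a hypothetical fitted rule f(x) = 2x
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.1, 2.2, 3.9, 6.1])
f = lambda x: 2.0 * x

# least squares empirical risk: sum_i (Y_i - f(X_i))^2
squared_risk = np.sum((Y - f(X)) ** 2)

# classification empirical risk: sum_i I(Y_i != f(X_i)) with labels +/-1
Yc = np.array([-1, 1, 1, -1])
fc = lambda x: np.sign(x - 1.5)        # hypothetical sign rule
zero_one_risk = np.sum(Yc != fc(X))    # number of misclassifications
```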
Potential challenges
- What is a good approximation of the expected loss function? Empirical risk is most commonly used, but there are alternatives.
- What is the class of candidate functions $f$ for the optimization?
- How do we avoid overfitting?
- Will the computation be feasible?
  – finding a global minimizer
  – computational complexity
Least squares estimation
- The empirical risk is $\sum_{i=1}^n (Y_i - f(X_i))^2$.
- $f(x)$ can be chosen from
  – a class of linear functions;
  – a sieve space of basis functions (splines, wavelets, radial basis);
  – or a fully nonparametric class (kernel estimation).
- Overfitting can be addressed using regularization:
  – variable selection for linear models;
  – penalized splines or shrinkage for sieve approximations;
  – cross-validation for tuning parameter selection.
- Computation:
  – convex optimization;
  – coordinate descent for large $p$.
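For the linear case, the ridge-penalized least squares estimator (one instance of the shrinkage remedies above) even has a closed form, $\hat\beta = (X^TX + \lambda I)^{-1}X^TY$. A small numpy sketch with simulated data (the data-generating values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical truth
Y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
# ridge estimator: minimizes sum_i (Y_i - X_i^T beta)^2 + lam * ||beta||^2
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```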
Support Vector Machines
- Consider the binary classification problem and use the labels $\{-1, 1\}$ for the two classes.
- We start from a simple classification rule that is a linear function of the feature vector $X$.
- The idea of SVM is to identify a hyperplane in the feature space that separates the classes as much as possible.
SVM illustration
Mathematical formulation of SVM
- The goal is to find a hyperplane $\beta_0 + X^T\beta$ such that
$$Y_i(\beta_0 + X_i^T\beta) > 0 \quad \text{for all } i = 1, \ldots, n.$$
- Furthermore, we wish to maximize the margin, denoted $M$.
- That is, we solve
$$\max_{\|\beta\| = 1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M, \quad i = 1, \ldots, n.$$
Equivalent optimization
- This is equivalent to
$$\min \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1, \quad i = 1, \ldots, n.$$
- There are two difficulties in practice:
  – the classes may not be separable, so no solution exists;
  – the classes may be separable, but the separation is nonlinear.
Extension to imperfectly separated data

- For imperfect separation, $Y_i(\beta_0 + X_i^T\beta)$ may not be positive, i.e., the prediction is wrong.
- We should allow such misclassification but impose a penalty for wrong predictions.
- This can be done by introducing a slack variable $\xi_i \ge 0$ for each subject, $i = 1, \ldots, n$.
- $\xi_i$ describes how far the observation falls on the wrong side of the margin.
- However, we should restrict the total penalty from being too large.
SVM optimization
- The optimization is
$$\max_{\|\beta\|=1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M(1 - \xi_i), \quad i = 1, \ldots, n,$$
where $\xi_i \ge 0$ and $\sum_{i=1}^n \xi_i \le$ a pre-specified constant.
- Equivalently,
$$\min \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \ldots, n,$$
where $C$ is a given constant (called the cost parameter).
- This is a convex minimization problem with linear constraints.
Solving the SVM problem using duality

- The Lagrange (primal) function is
$$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\left[Y_i(\beta_0 + X_i^T\beta) - (1 - \xi_i)\right] - \sum_{i=1}^n \mu_i\xi_i,$$
where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
- Differentiating with respect to $\beta_0$, $\beta$, and $\xi_i$ and setting the derivatives to zero yields
$$\beta = \sum_{i=1}^n \alpha_i Y_i X_i, \qquad 0 = \sum_{i=1}^n \alpha_i Y_i, \qquad \alpha_i = C - \mu_i, \quad i = 1, \ldots, n.$$
Dual problem
- Plugging $\beta$ back into the primal function and using these equations, the dual objective function is
$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j X_i^T X_j.$$
- The dual problem becomes: maximize $L_D$ subject to
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^n \alpha_i Y_i = 0.$$
- Furthermore, the KKT conditions give
$$\alpha_i\left[Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i)\right] = 0, \qquad \mu_i\xi_i = 0, \qquad Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0.$$
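A minimal numerical sketch of this dual (the toy data and the choice of scipy's SLSQP solver are assumptions, not part of the lecture): maximize $L_D$ under the box and equality constraints, then recover $\beta$ and $\beta_0$ from the KKT conditions.

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(Y), 10.0
K = X @ X.T                                   # Gram matrix of X_i^T X_j

def neg_dual(alpha):                          # -L_D, since we maximize L_D
    v = alpha * Y
    return -(alpha.sum() - 0.5 * v @ K @ v)

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ Y})
alpha = res.x
beta = (alpha * Y) @ X                        # beta = sum_i alpha_i Y_i X_i
# a support vector with 0 < alpha_i < C has xi_i = 0, so it satisfies
# Y_i (X_i^T beta + beta_0) = 1 exactly and pins down beta_0
sv = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
beta0 = Y[sv] - X[sv] @ beta
```

Here `np.sign(X @ beta + beta0)` reproduces the training labels, and the positive $\alpha_i$ mark the support vectors.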
KKT conditions
On SVM optimization
- Solving the dual problem is a simple convex quadratic programming problem (solvers are available in most packages).
- Since $\beta = \sum_{i=1}^n \alpha_i Y_i X_i$, the hyperplane is determined by the observations with $\alpha_i \neq 0$, called the support vectors.
- Among the support vectors, some lie on the margin edges ($\xi_i = 0$); the remainder have $\alpha_i = C$.
- Any support vector with $\xi_i = 0$ can be used to solve for $\beta_0$ (often taken to be the average if there are several).
- Alternatively, $\beta_0$ can sometimes be obtained by directly minimizing the primal function.
Illustrative example
Going beyond linear SVM

- The most common way to obtain a nonlinear prediction rule is to restrict $f$ to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ (the so-called kernel trick).
- Recall that an RKHS is generated by a kernel function $K(x, y)$, which has the eigen-expansion
$$K(x, y) = \sum_{k=1}^\infty \gamma_k\phi_k(x)\phi_k(y),$$
where $\phi_k/\sqrt{\gamma_k}$ are the normalized basis functions for $\{\mathcal{H}_K, \langle\cdot, \cdot\rangle_{\mathcal{H}_K}\}$.
- We can represent $f(x)$ using these basis functions:
$$f(x) = \beta_0 + \sum_{k=1}^\infty \beta_k\phi_k(x)/\sqrt{\gamma_k}.$$
Dual problem with kernel trick
- Following the same derivation as for linear SVM (replacing $X_i$ by the vector $(\phi_1(X_i)/\sqrt{\gamma_1}, \phi_2(X_i)/\sqrt{\gamma_2}, \ldots)^T$), the dual objective function becomes
$$\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j\Big(\sum_{k=1}^\infty \phi_k(X_i)\phi_k(X_j)/\gamma_k\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j K(X_i, X_j).$$
- The prediction function becomes
$$f(x) = \beta_0 + \sum_{k=1}^\infty\sum_{i=1}^n \alpha_i Y_i\phi_k(X_i)\phi_k(x)/\gamma_k = \beta_0 + \sum_{i=1}^n \alpha_i Y_i K(X_i, x).$$
Advantages of the kernel trick

- Our conclusions are:
  – (a) restricting $f$ to $\mathcal{H}_K$ leads to a nonlinear prediction function that depends on the kernel function;
  – (b) solving the dual problem for the prediction function only requires knowing the kernel function $K(x, y)$, not the basis functions;
  – (c) the optimization in the dual problem depends on the number of observations $n$ but not on the dimensionality of the $X_i$'s.
Choice of kernel function

- Polynomial kernel:
$$K(x, x') = (1 + x^Tx')^d$$
- Radial basis (Gaussian) kernel:
$$K(x, x') = \exp\{-\gamma\|x - x'\|^2\}$$
- Neural network kernel:
$$K(x, x') = \tanh(k_1 x^Tx' + k_2)$$
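These three kernels are one-liners in numpy; a sketch with illustrative parameter defaults ($d$, $\gamma$, $k_1$, $k_2$ chosen arbitrarily):

```python
import numpy as np

def poly_kernel(x, xp, d=2):
    # K(x, x') = (1 + x^T x')^d
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def nn_kernel(x, xp, k1=1.0, k2=0.0):
    # K(x, x') = tanh(k1 * x^T x' + k2)
    return np.tanh(k1 * (x @ xp) + k2)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
```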
Revisit SVM example
Loss formulation for SVM
- Revisit the linear SVM formulation: we minimize $\|\beta\|$ subject to the separation constraints
$$Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$
with $\sum_{i=1}^n \xi_i$ controlled by a constant.
- We need to understand exactly which empirical loss this optimization minimizes, because by doing so,
  – we can characterize how SVM can minimize the classification loss (Fisher consistency);
  – we can study the stochastic variability of the SVM classifier (convergence rates and risk bounds).
Loss formulation (continued)

- Equivalently, for a given constant $C$, we minimize
$$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to} \quad \xi_i \ge \left[1 - Y_i(\beta_0 + X_i^T\beta)\right]_+,$$
where $(1 - z)_+ = \max(1 - z, 0)$.
- Hence, SVM is equivalent to minimizing the loss
$$\sum_{i=1}^n \left[1 - Y_i(\beta_0 + X_i^T\beta)\right]_+ + \frac{\lambda}{2}\|\beta\|^2.$$
- For nonlinear SVM, the loss is
$$\sum_{i=1}^n \left[1 - Y_i f(X_i)\right]_+ + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_K}.$$
- We call $L(y, f) = [1 - yf]_+$ the hinge loss function.
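Since this hinge-loss objective is convex (though not differentiable everywhere), it can also be minimized directly by subgradient descent; a sketch on simulated data (the step size, iteration count, and data-generating rule are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
Y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n))

lam, eta = 0.1, 0.05
beta, beta0 = np.zeros(p), 0.0

def objective(beta, beta0):
    # sum_i [1 - Y_i (beta0 + X_i^T beta)]_+ + (lam/2) ||beta||^2
    margins = Y * (beta0 + X @ beta)
    return np.maximum(1 - margins, 0).sum() + 0.5 * lam * beta @ beta

start = objective(beta, beta0)
for _ in range(200):
    margins = Y * (beta0 + X @ beta)
    active = margins < 1                       # points violating the margin
    # subgradient of the hinge term plus gradient of the ridge penalty
    g_beta = -(Y[active, None] * X[active]).sum(axis=0) + lam * beta
    g_beta0 = -Y[active].sum()
    beta = beta - eta * g_beta / n
    beta0 = beta0 - eta * g_beta0 / n
end = objective(beta, beta0)
```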
Plot of the hinge loss
Fisher consistency of SVM
- Fisher consistency: suppose $f^*$ minimizes $E[(1 - Yf(X))_+]$. Then $\mathrm{sign}(f^*(x))$ is the Bayes rule for the classification problem.
- Proof: note that
$$E\left[(1 - Yf(X))_+ \,\middle|\, X = x\right] = (1 - f(x))_+ P(Y = 1 \mid X = x) + (1 + f(x))_+ P(Y = -1 \mid X = x),$$
as a function of $f(x)$, is piecewise linear with three pieces: decreasing on $(-\infty, -1]$, linear on $(-1, 1]$ with slope $P(Y = -1 \mid X = x) - P(Y = 1 \mid X = x)$, and increasing on $[1, \infty)$.
- Hence the minimum is attained at $f(x) = 1$ if $P(Y = 1 \mid X = x) > P(Y = -1 \mid X = x)$ and at $f(x) = -1$ otherwise.
- This establishes Fisher consistency.
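The piecewise-linear argument can be checked numerically: grid-minimize the conditional expected hinge loss for a few values of $p = P(Y = 1 \mid X = x)$ (a sketch, with an arbitrary grid):

```python
import numpy as np

def cond_hinge_risk(f, p):
    # E[(1 - Y f)_+ | X = x] = (1 - f)_+ p + (1 + f)_+ (1 - p)
    return np.maximum(1 - f, 0) * p + np.maximum(1 + f, 0) * (1 - p)

grid = np.linspace(-3.0, 3.0, 6001)
mins = {p: grid[np.argmin(cond_hinge_risk(grid, p))] for p in (0.2, 0.7, 0.9)}
# the minimizer sits at +1 when p > 1/2 and at -1 when p < 1/2,
# so its sign agrees with the Bayes rule
```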
Extension of the hinge loss
- The hinge loss is a special case of the so-called large-margin losses, which take the form $\phi(yf)$ for some convex function $\phi$.
- Additional examples include
  – binomial deviance: $\log(1 + e^{-yf})$;
  – squared loss: $(1 - yf)^2$;
  – squared hinge loss: $(1 - yf)^2_+$;
  – AdaBoost (exponential) loss: $\exp\{-yf\}$.
- A sufficient condition for Fisher consistency is that $\phi(x)$ is differentiable at $0$ with $\phi'(0) < 0$.
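A quick numerical check of this condition (central differences; the step size is an arbitrary choice): each of the losses above has negative slope at $0$.

```python
import numpy as np

losses = {
    "hinge": lambda z: np.maximum(1 - z, 0.0),
    "binomial deviance": lambda z: np.log1p(np.exp(-z)),
    "squared": lambda z: (1.0 - z) ** 2,
    "squared hinge": lambda z: np.maximum(1 - z, 0.0) ** 2,
    "adaboost": lambda z: np.exp(-z),
}
h = 1e-6
# central-difference estimate of phi'(0) for each large-margin loss phi
slopes = {name: (phi(h) - phi(-h)) / (2 * h) for name, phi in losses.items()}
```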
SVM for regression
- The extension of SVM to continuous $Y$ is based on a modification of the SVM loss.
- Consider the prediction $f(X)$ for a subject with feature vector $X$ and true outcome $Y$.
- The inaccuracy of the prediction can be characterized by the so-called $\varepsilon$-insensitive loss:
$$L(Y, f(X)) = \max(|Y - f(X)| - \varepsilon, 0).$$
- The loss is zero if the prediction error is within $\varepsilon$.
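As a one-line sketch (the tolerance $\varepsilon = 0.5$ is an arbitrary illustration):

```python
import numpy as np

def eps_insensitive(y, fx, eps=0.5):
    # L(Y, f(X)) = max(|Y - f(X)| - eps, 0): zero inside the eps-tube
    return np.maximum(np.abs(y - fx) - eps, 0.0)
```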
ε-insensitive loss
Optimization problem in SVM for regression
- The objective function for a linear prediction rule is
$$\min \ \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n(\xi_i + \xi_i') \quad \text{subject to}$$
$$-\xi_i' - \varepsilon \le Y_i - (\beta_0 + X_i^T\beta) \le \varepsilon + \xi_i, \quad \xi_i \ge 0, \ \xi_i' \ge 0, \ i = 1, \ldots, n.$$
- The dual problem is
$$\min \ \varepsilon\sum_{i=1}^n(\alpha_i + \alpha_i') - \sum_{i=1}^n Y_i(\alpha_i - \alpha_i') + \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n(\alpha_i - \alpha_i')(\alpha_j - \alpha_j')X_i^TX_j$$
subject to
$$0 \le \alpha_i, \alpha_i' \le C, \qquad \sum_{i=1}^n(\alpha_i - \alpha_i') = 0, \qquad \alpha_i\alpha_i' = 0.$$
- The prediction function is $\beta_0 + X^T\beta$ with
$$\beta = \sum_{i=1}^n(\alpha_i - \alpha_i')X_i.$$
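A numerical sketch of this regression dual, analogous to the classification case (the 1-D toy data and scipy's SLSQP solver are assumptions; the complementarity $\alpha_i\alpha_i' = 0$ emerges at the optimum rather than being imposed):

```python
import numpy as np
from scipy.optimize import minimize

# toy 1-D data (hypothetical): Y exactly linear in X
X = np.linspace(0.0, 1.0, 6)
Y = 2.0 * X
n, C, eps = len(Y), 10.0, 0.1
K = np.outer(X, X)                           # Gram matrix X_i X_j

def dual_obj(z):                             # z stacks (alpha, alpha')
    a, ap = z[:n], z[n:]
    d = a - ap
    return eps * (a + ap).sum() - Y @ d + 0.5 * d @ K @ d

res = minimize(dual_obj, np.zeros(2 * n), method="SLSQP",
               bounds=[(0.0, C)] * (2 * n),
               constraints={"type": "eq",
                            "fun": lambda z: z[:n].sum() - z[n:].sum()})
a, ap = res.x[:n], res.x[n:]
beta = (a - ap) @ X                          # beta = sum_i (alpha_i - alpha'_i) X_i
i = int(np.argmax(a))                        # alpha_i > 0 implies Y_i - f(X_i) = eps
beta0 = Y[i] - beta * X[i] - eps
pred = beta0 + beta * X
```

All fitted residuals end up within the $\varepsilon$-tube, and only the observations on the tube boundary carry nonzero multipliers.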