Post on 22-Feb-2016
Stochastic Subgradient Approach for Solving Linear Support Vector Machines
Jan Rupnik, Jožef Stefan Institute
Outline
- Introduction
- Support Vector Machines
- Stochastic Subgradient Descent SVM: Pegasos
- Experiments
Introduction
- Support Vector Machines (SVMs) have become one of the most popular classification tools in the last decade.
- Straightforward implementations could not handle large sets of training examples.
- Recently, methods for solving SVMs with linear computational complexity have emerged.
- Pegasos: a primal estimated subgradient approach for solving SVMs.
Hard Margin SVM
- Many possible hyperplanes perfectly separate the two classes (e.g. the red line).
- Only one hyperplane achieves the maximum margin (blue line); the examples lying on the margin are the support vectors.
Soft Margin SVM
- Allow a small number of examples to be misclassified in order to find a large-margin classifier.
- The trade-off between classification accuracy on the training set and the size of the margin is a parameter of the SVM.
Problem setting
- Let S = {(x1, y1), ..., (xm, ym)} be the set of input-output pairs, where xi ∈ R^N and yi ∈ {-1, 1}.
- Find the hyperplane with normal vector w ∈ R^N and offset b ∈ R that has good classification accuracy on the training set S and a large margin.
- Classify a new example x as sign(w·x − b).
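The classification rule above can be sketched in a few lines. This is a minimal illustration; the function names are mine, not from the talk.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """Predict the label of x as sign(w.x - b)."""
    return 1 if dot(w, x) - b >= 0 else -1

# Toy 2-D example.
print(classify([1.0, -0.5], 0.0, [2.0, 1.0]))   # -> 1   (w.x - b = 1.5)
print(classify([1.0, -0.5], 0.0, [-2.0, 1.0]))  # -> -1  (w.x - b = -2.5)
```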
Optimization problem
Regularized hinge loss:

    min_w  λ/2 w·w + 1/m Σ_i (1 − y_i(w·x_i − b))_+

- The first term controls the size of the margin; the second is the expected hinge loss on the training set; λ trades the two off.
- y_i(w·x_i − b) is positive for correctly classified examples and negative otherwise.
- (1 − z)_+ := max{0, 1 − z} (hinge loss).
- The first summand is a quadratic function and the sum is a piecewise linear function, so the whole objective is piecewise quadratic.
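The objective above is straightforward to evaluate directly. A minimal sketch follows; the function names and toy data are illustrative, not from the talk.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(z):
    # (1 - z)_+ := max{0, 1 - z}
    return max(0.0, 1.0 - z)

def svm_objective(w, b, data, lam):
    """Regularized hinge loss: lam/2 * w.w + mean hinge loss over (x, y) pairs."""
    m = len(data)
    loss = sum(hinge(y * (dot(w, x) - b)) for x, y in data) / m
    return lam / 2.0 * dot(w, w) + loss

# Toy 2-D example: two well separated points, so the loss term vanishes
# and only the regularization term lam/2 * w.w remains.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
print(svm_objective([1.0, 0.0], 0.0, data, lam=0.1))  # -> 0.05
```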
Perceptron
- We ignore the offset parameter b from now on (b = 0).
- Regularized hinge loss (SVM):

    min_w  λ/2 w·w + 1/m Σ_i (1 − y_i w·x_i)_+

- Perceptron:

    min_w  1/m Σ_i (−y_i w·x_i)_+
Loss functions
- Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount.
- Perceptron loss: penalizes an incorrectly classified example x proportionally to the size of |w·x|.
- Hinge loss: penalizes incorrectly classified examples as well as correctly classified examples that lie within the margin.
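The three losses can be compared as functions of the signed margin z = y(w·x). A small sketch, with my own function names:

```python
def zero_one_loss(z):
    return 1.0 if z <= 0 else 0.0   # same penalty for every mistake

def perceptron_loss(z):
    return max(0.0, -z)             # penalty grows with |w.x| on mistakes

def hinge_loss(z):
    return max(0.0, 1.0 - z)        # also penalizes correct points inside the margin

# z = 0.5 is correctly classified but inside the margin: only the hinge
# loss penalizes it.
for z in (-1.0, 0.5, 2.0):
    print(z, zero_one_loss(z), perceptron_loss(z), hinge_loss(z))
```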
Stochastic Subgradient Descent
- Gradient descent optimization in the perceptron (smooth objective): the gradient is unique.
- Subgradient descent in Pegasos (non-differentiable objective): at a kink there are many subgradients, all equally valid.
Stochastic Subgradient
- Subgradient in the perceptron: −1/m Σ y_i x_i, summed over all misclassified examples.
- Subgradient in the SVM: λw − 1/m Σ y_i x_i, summed over all examples that violate the margin (y_i w·x_i < 1).
- For every point w the subgradient is a function of the training sample S. We can estimate it from a smaller random subset A ⊂ S of size k (the stochastic part) and speed up computations.
- Stochastic subgradient in the SVM: λw − 1/k Σ_{(x,y) ∈ A, y w·x < 1} y x.
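The stochastic subgradient estimate above can be sketched as follows: draw a subset A of size k, then average −y·x over the examples in A that violate the margin and add the λw term. The names are illustrative, not from the talk.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stochastic_subgradient(w, data, lam, k, rng=random):
    """Estimate a subgradient of the SVM objective at w from a subsample of size k."""
    A = rng.sample(data, k)                    # random subset of the training set
    g = [lam * wi for wi in w]                 # gradient of lam/2 * w.w
    for x, y in A:
        if y * dot(w, x) < 1.0:                # margin violation
            for j in range(len(w)):
                g[j] -= y * x[j] / k           # -1/k * sum of y*x over violators
    return g

# With k = m the estimate is the exact subgradient; at w = 0 every example
# violates the margin.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
print(stochastic_subgradient([0.0, 0.0], data, lam=0.1, k=2))  # -> [-1.0, 0.0]
```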
Pegasos: the algorithm
- Choose the learning rate.
- Draw a random subsample of size k (the subgradient is zero on the other training points).
- Take a subgradient step.
- Project the result into a ball (rescaling).
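The steps above can be sketched as a compact update loop, assuming the learning rate η_t = 1/(λt) and the projection onto the ball of radius 1/√λ discussed on the next slide. This is an illustrative sketch with my own variable names, not the authors' implementation.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pegasos(data, lam, T, k, seed=0):
    """Pegasos-style training loop: T iterations, subsample size k."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [0.0] * n
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                       # aggressive learning rate
        A = rng.sample(data, k)                     # random subsample of size k
        violators = [(x, y) for x, y in A if y * dot(w, x) < 1.0]
        # Subgradient step: w <- (1 - eta*lam) * w + eta/k * sum of y*x over violators.
        w = [(1.0 - eta * lam) * wi for wi in w]
        for x, y in violators:
            for j in range(n):
                w[j] += eta * y * x[j] / k
        # Projection: rescale w back into the ball of radius 1/sqrt(lam).
        norm = math.sqrt(dot(w, w))
        radius = 1.0 / math.sqrt(lam)
        if norm > radius:
            w = [wi * radius / norm for wi in w]
    return w

# Toy usage: a linearly separable 2-D set; the learned w separates it.
data = [([1.0, 1.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([-0.5, -2.0], -1)]
w = pegasos(data, lam=0.1, T=200, k=2)
print(all(y * dot(w, x) > 0 for x, y in data))  # -> True
```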
What's new in Pegasos?
- The subgradient descent technique is 50 years old; the soft margin SVM is 14 years old.
- Typically, gradient descent methods suffer from slow convergence.
- The authors of Pegasos proved that an aggressive decrease in the learning rate η_t still leads to convergence. Previous works: η_t = 1/(λ√t); Pegasos: η_t = 1/(λt).
- They also proved that the solution always lies in a ball of radius 1/√λ.
SVM Light
- A popular SVM solver with superlinear computational complexity.
- Solves a large quadratic program.
- The solution can be expressed in terms of a small subset of the training vectors, called support vectors.
- Uses an active set method to find the support vectors, solving a series of smaller quadratic problems.
- Produces highly accurate solutions.
- Algorithm and implementation by Thorsten Joachims.
T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
Experiments
Data:
- Reuters RCV1, roughly 800,000 news articles.
- Already preprocessed into bag-of-words vectors, publicly available.
- Roughly 50,000 features; sparse vectors.
- Binary task: category CCAT, which consists of 381,327 news articles.
Quick convergence to suboptimal solutions
- 200 iterations took 9.2 CPU seconds; the objective value was 0.3% higher than the optimal one.
- 560 iterations suffice to get a 0.1% accurate solution.
- SVM Light takes roughly 4 hours of CPU time.
Test set error
- Optimizing the objective value to high precision is often not necessary.
- The lowest error on the test set is achieved much earlier.
Parameters k and T
- The product kT determines how close to the optimal value we get.
- If kT is fixed, k itself does not play a significant role.
Conclusions and final notes
- Pegasos is one of the most efficient suboptimal SVM solvers.
- Suboptimal solutions often generalize well to new examples.
- It can take advantage of sparsity.
- It is a linear solver; nonlinear extensions have been proposed, but they suffer from slower convergence.
Thank you!