Optimization Tutorial


Pritam Sukumar & Daphne Tsatsoulis
CS 546: Machine Learning for Natural Language Processing


What is Optimization?

Find the minimum or maximum of an objective function given a set of constraints:
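In standard form (a generic statement; the original slide's exact notation may differ), a constrained optimization problem is:

```latex
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\
                        & h_j(x) = 0, \quad j = 1, \dots, p.
\end{aligned}
```

A maximization problem is the same thing with $-f$ as the objective.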


Why Do We Care?

Linear Classification

Maximum Likelihood

K-Means
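Each of these tasks is an optimization problem; as a rough illustration (standard formulations, not necessarily the slide's notation):

```latex
\begin{aligned}
\text{Linear classification (e.g.\ SVM):}\quad & \min_{w}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\max\bigl(0,\ 1 - y_i\, w^\top x_i\bigr)\\
\text{Maximum likelihood:}\quad & \max_{\theta}\ \sum_{i=1}^{n}\log p(x_i \mid \theta)\\
\text{K-means:}\quad & \min_{\mu_1,\dots,\mu_K}\ \sum_{i=1}^{n}\min_{k}\ \|x_i - \mu_k\|^2
\end{aligned}
```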


Prefer Convex Problems

Non-convex problems can have local (non-global) minima and maxima where gradient-based methods get stuck; in a convex problem, any local minimum is a global minimum.


Convex Functions and Sets
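The standard definitions:

```latex
\begin{aligned}
&\text{Convex set: } C \text{ is convex if } \theta x + (1-\theta)y \in C \ \text{ for all } x, y \in C,\ \theta \in [0,1].\\
&\text{Convex function: } f \text{ is convex if } f\bigl(\theta x + (1-\theta)y\bigr) \le \theta f(x) + (1-\theta) f(y)\ \text{ for all } x, y,\ \theta \in [0,1].
\end{aligned}
```

A function is convex exactly when the region above its graph (its epigraph) is a convex set.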


Important Convex Functions
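Commonly cited examples (a standard list, which may differ from the slide's own) include:

```latex
\begin{aligned}
&\text{Affine: } f(x) = a^\top x + b\\
&\text{Quadratic: } f(x) = \tfrac{1}{2}\,x^\top Q x + b^\top x, \quad Q \succeq 0\\
&\text{Norms: } \|x\|_1,\ \|x\|_2,\ \|x\|_\infty\\
&\text{Negative logarithm: } -\log x \ \ (x > 0)\\
&\text{Log-sum-exp (soft max): } \log \textstyle\sum_i e^{x_i}
\end{aligned}
```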


Convex Optimization Problem
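In standard form (notation as above, not necessarily the slide's), a convex optimization problem has a convex objective, convex inequality constraints, and affine equality constraints:

```latex
\min_{x}\ f(x) \quad \text{subject to} \quad g_i(x) \le 0,\ i = 1,\dots,m, \qquad a_j^\top x = b_j,\ j = 1,\dots,p,
```

with $f$ and every $g_i$ convex.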


Lagrangian Dual
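For the problem above, the standard Lagrangian and dual function are:

```latex
\begin{aligned}
L(x, \lambda, \nu) &= f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{p} \nu_j\, h_j(x), \qquad \lambda_i \ge 0,\\
q(\lambda, \nu) &= \inf_{x}\ L(x, \lambda, \nu).
\end{aligned}
```

The dual problem is $\max_{\lambda \ge 0,\,\nu} q(\lambda,\nu)$; weak duality gives $q(\lambda,\nu) \le f(x^\star)$ always, and strong duality holds for convex problems under a constraint qualification such as Slater's condition.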


• First Order Methods: Gradient Descent
• Newton's Method – Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, tutorial, 2009
• Subgradient Descent – Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, tutorial, 2009
• Stochastic Gradient Descent – Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, tutorial presented at ICML 2010
• Trust Regions – Trust Region Newton Method for Large-Scale Logistic Regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008
• Dual Coordinate Descent – Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, Machine Learning Journal, 2011
• Linear Classification – Recent Advances of Large-Scale Linear Classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Proceedings of the IEEE, vol. 100, 2012


Gradient Descent
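The standard update rule, with step size (learning rate) $\eta$, is:

```latex
x_{k+1} = x_k - \eta\, \nabla f(x_k), \qquad k = 0, 1, 2, \dots
```

Each step moves in the direction of steepest descent; for a suitable step size and a convex, smooth objective, the iterates approach a global minimizer.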


Single Step Illustration


Full Gradient Descent Illustration


Newton’s Method

Update: $x_{k+1} = x_k - \bigl[\nabla^2 f(x_k)\bigr]^{-1} \nabla f(x_k)$ (inverse Hessian times gradient).


Newton’s Method Picture


Subgradient Descent Motivation
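The underlying idea (standard background): many objectives in machine learning, such as the hinge loss or the L1 norm, are convex but not differentiable everywhere, so the gradient is replaced by a subgradient $g$, defined by:

```latex
g \in \partial f(x) \quad\Longleftrightarrow\quad f(y) \ge f(x) + g^\top (y - x) \ \ \text{for all } y.
```

Where $f$ is differentiable, the subdifferential reduces to the single gradient, $\partial f(x) = \{\nabla f(x)\}$.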


Subgradient Descent – Algorithm
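A standard statement of the algorithm (the step-size schedule here is a generic choice, not necessarily the slide's):

```latex
x_{k+1} = x_k - \eta_k\, g_k, \qquad g_k \in \partial f(x_k),
```

with a diminishing step size such as $\eta_k \propto 1/\sqrt{k}$, and the best iterate $\min_{j \le k} f(x_j)$ reported, since the objective is not guaranteed to decrease at every step.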


Online learning and optimization

• Goal of machine learning – minimize the expected loss, given samples from the data distribution

• This is stochastic optimization
– Assume the loss function is convex
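Written out in one common notation (not necessarily the slide's):

```latex
\min_{w}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[\ell(w; x, y)\bigr] \;\approx\; \min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i, y_i),
```

where the expectation over the unknown data distribution $\mathcal{D}$ is approximated by the average over the $n$ training samples.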


Batch (sub)gradient descent for ML

• Process all examples together in each step

• Entire training set examined at each step
• Very slow when n is very large
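One batch (sub)gradient step, with learning rate $\eta$, averages over all $n$ examples:

```latex
w \leftarrow w - \eta \cdot \frac{1}{n}\sum_{i=1}^{n} \nabla_w\, \ell(w; x_i, y_i),
```

so every update requires a full pass over the training set.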


Stochastic (sub)gradient descent

• "Optimize" one example at a time
• Choose examples randomly (or reorder and choose in order)
– Learning representative of the example distribution
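A minimal SGD sketch, assuming L2-regularized logistic regression with labels in {-1, +1}; the loss, regularization strength, and learning rate are illustrative choices, not taken from the slides:

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, lam=1e-4, epochs=5):
    """Plain SGD for L2-regularized logistic regression; y[i] in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):            # visit examples in random order
            margin = y[i] * X[i].dot(w)
            # gradient of log(1 + exp(-margin)) + (lam/2)*||w||^2 with respect to w
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w -= lr * grad                             # update on a single example
    return w
```

Each update touches only one example, so the cost per step is independent of n.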


Stochastic (sub)gradient descent

• Equivalent to online learning (the weight vector w changes with every example)

• Convergence guaranteed for convex functions (to a local minimum, which for a convex function is also the global minimum)


Hybrid!

• Stochastic – 1 example per iteration
• Batch – all the examples!
• Sample Average Approximation (SAA):
– Sample m examples at each step and perform SGD on them
– Allows for parallelization, but the choice of m is based on heuristics
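A minimal sketch of one such hybrid (SAA-style) step; `grad_fn` is a hypothetical per-example gradient helper, not something defined in the slides:

```python
import numpy as np

def minibatch_step(w, X, y, grad_fn, lr=0.1, m=32):
    """One SAA/mini-batch step: average the gradient over m sampled examples."""
    idx = np.random.choice(len(y), size=m, replace=False)        # draw a mini-batch
    g = np.mean([grad_fn(w, X[i], y[i]) for i in idx], axis=0)   # the m gradients can be computed in parallel
    return w - lr * g
```

The batch size m trades off the variance of the gradient estimate against the cost of each step.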


SGD - Issues

• Convergence is very sensitive to the learning rate (oscillations near the solution due to the probabilistic nature of sampling)
– The learning rate might need to decrease with time to ensure the algorithm converges eventually
• Basically – SGD is good for machine learning with large data sets!
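Common decay schedules (standard choices, not necessarily the one used in the slides) include:

```latex
\eta_t = \frac{\eta_0}{\sqrt{t}} \qquad \text{or} \qquad \eta_t = \frac{\eta_0}{1 + t/T},
```

which damp the oscillations near the solution while keeping large steps early on.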


Problem Formulation


New Points
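If this step follows the trust-region scheme referenced in the outline (standard background, offered as an assumption rather than the slide's exact content), each iteration solves a quadratic model within a radius $\Delta_k$ and accepts the new point based on the ratio of actual to predicted reduction:

```latex
\begin{aligned}
s_k &= \arg\min_{\|s\| \le \Delta_k}\ \nabla f(x_k)^\top s + \tfrac{1}{2}\, s^\top \nabla^2 f(x_k)\, s,\\
\rho_k &= \frac{f(x_k + s_k) - f(x_k)}{\nabla f(x_k)^\top s_k + \tfrac{1}{2}\, s_k^\top \nabla^2 f(x_k)\, s_k},
\end{aligned}
```

with $x_{k+1} = x_k + s_k$ accepted only if $\rho_k$ is large enough, and $\Delta_k$ enlarged or shrunk accordingly.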


Limited Memory Quasi-Newton Methods


Limited Memory BFGS
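As background (a standard sketch of the algorithm, not the slides' own code): L-BFGS stores only the m most recent pairs s_k = x_{k+1} - x_k and y_k = grad f(x_{k+1}) - grad f(x_k), and reconstructs an approximate Newton direction with the two-loop recursion:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate H^{-1} @ grad from stored (s, y) pairs (oldest first)."""
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]       # assumes the curvature condition s.y > 0
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):   # newest pair first
        a = rho * s.dot(q)
        q -= a * y
        alphas.append(a)
    alphas.reverse()
    if s_list:                                                    # scale by an initial Hessian guess gamma * I
        gamma = s_list[-1].dot(y_list[-1]) / y_list[-1].dot(y_list[-1])
        q *= gamma
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), alphas): # oldest pair first
        b = rho * y.dot(q)
        q += (a - b) * s
    return q   # the search direction is -q
```

Memory use is O(m d) instead of the O(d^2) needed to store a full (inverse) Hessian approximation.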


Coordinate descent

• Minimize along each coordinate direction in turn; repeat until the minimum is found
– One complete cycle over all coordinates is comparable to one gradient descent step
• In some cases, analytical expressions are available for each one-dimensional minimization
– Example: dual form of SVM!

• Otherwise, numerical methods needed for each iteration
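A minimal sketch of cyclic coordinate descent, using coordinate-wise gradient steps rather than exact one-dimensional minimization (an illustrative implementation, not the slides' algorithm):

```python
import numpy as np

def coordinate_descent(grad, x0, lr=0.1, cycles=100):
    """Cyclic coordinate descent: update one coordinate at a time using its partial derivative."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(cycles):                  # one full cycle touches every coordinate once
        for j in range(x.size):
            x[j] -= lr * grad(x)[j]          # 1-D step along coordinate j (could be an exact 1-D minimization)
        if np.linalg.norm(grad(x)) < 1e-8:   # crude stopping criterion
            break
    return x

# Example: minimize (x0 - 1)^2 + (x1 + 2)^2
x_min = coordinate_descent(lambda x: 2 * (x - np.array([1.0, -2.0])), np.zeros(2))
```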


Dual coordinate descent

• Coordinate descent applied to the dual problem

• Commonly used to solve the dual problem for SVMs
– Allows for application of the kernel trick
– Coordinate descent used for the optimization
• In this paper: dual logistic regression, optimized using coordinate descent


Dual form of SVM

• SVM

• Dual form
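For reference, the standard L2-regularized, hinge-loss SVM and its dual (bias term omitted; constants may differ from the slides) are:

```latex
\begin{aligned}
\text{Primal:}\quad & \min_{w}\ \tfrac{1}{2}\, w^\top w + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\, w^\top x_i\bigr)\\[4pt]
\text{Dual:}\quad & \min_{\alpha}\ \tfrac{1}{2}\, \alpha^\top Q\, \alpha - \mathbf{1}^\top \alpha
  \quad \text{subject to } 0 \le \alpha_i \le C, \qquad Q_{ij} = y_i y_j\, x_i^\top x_j.
\end{aligned}
```

Each $\alpha_i$ has only a separable box constraint, so one coordinate at a time can be minimized in closed form, and kernels enter only through the inner products in $Q$.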


Dual form of LR

• LR:

• Dual form (we let )
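For reference, the L2-regularized logistic regression problem and the form of its dual used in the cited Yu, Huang, and Lin paper are roughly as follows (constants and notation may differ from the slides):

```latex
\begin{aligned}
\text{Primal:}\quad & \min_{w}\ \tfrac{1}{2}\, w^\top w + C \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i\, w^\top x_i}\bigr)\\[4pt]
\text{Dual:}\quad & \min_{\alpha}\ \tfrac{1}{2}\, \alpha^\top Q\, \alpha + \sum_{i=1}^{n}\Bigl[\alpha_i \log \alpha_i + (C - \alpha_i)\log(C - \alpha_i)\Bigr]\\
& \text{subject to } 0 \le \alpha_i \le C, \qquad Q_{ij} = y_i y_j\, x_i^\top x_j.
\end{aligned}
```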


Coordinate descent for dual LR

• Along each coordinate direction:


Coordinate descent for dual LR

• No analytical expression available
– Use numerical optimization (Newton's method / bisection method / BFGS / …) to iterate along each direction
• Beware of the log terms! (the α log α terms have unbounded derivatives as α approaches 0 or C, and care is needed when taking the log of very small values)


Coordinate descent for dual ME

• Maximum Entropy (ME) is the extension of LR to multi-class problems
– In each iteration, solve at two levels:
• Outer level – consider one block of variables at a time
– Each block has all labels for one example
• Inner level – subproblem solved by dual coordinate descent

• Can also be solved similarly to online CRFs (exponentiated gradient methods)


Large scale linear classification

• NLP (usually) has a large number of features and examples

• Nonlinear classifiers (including kernel methods) are more accurate, but slow


Large scale linear classification

• Linear classifiers are less accurate, but at least an order of magnitude faster
– The loss in accuracy shrinks as the number of examples grows
• Speed usually depends on more than the algorithm's order
– Memory/disk capacity
– Parallelizability


Large scale linear classification

• Choice of optimization method depends on:
– Data properties
• Number of examples, features
• Sparsity
– Formulation of the problem
• Differentiability
• Convergence properties
– Primal vs. dual
– Low-order vs. high-order methods


Comparison of performance

• The performance gap narrows as the number of features increases

• Training and testing for linear classifiers are much faster


Thank you!

• Questions?