Transcript of STA141C: Big Data & High Performance Statistical Computing...
STA141C: Big Data & High Performance Statistical Computing
Lecture 8: Optimization
Cho-Jui Hsieh, UC Davis
May 9, 2017
Optimization
Numerical Optimization
Numerical Optimization:
min_X f(X)
Can be applied to many areas.
Machine Learning: find a model that minimizes the prediction error.
Properties of the Function
Smooth function: functions with continuous derivative.
Example: ridge regression
argmin_w (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2
Non-smooth function: Lasso, primal SVM
Lasso: argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1
SVM: argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)‖w‖^2
Convex Functions
A function is convex if:
∀x_1, x_2, ∀t ∈ [0, 1]: f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
No local optimum that is not a global optimum (why?)
Figure from Wikipedia
Convex Functions
If f(x) is twice differentiable, then
f is convex if and only if ∇²f(x) ⪰ 0
The optimal solution may not be unique: there is a set of optimal solutions S
Gradient: captures the first-order change of f:
f(x + αd) = f(x) + α∇f(x)^T d + O(α^2)
If f is convex and differentiable, we have the following optimality condition:
x^* ∈ S if and only if ∇f(x^*) = 0
Nonconvex Functions
If f is nonconvex, most algorithms can only converge to stationary points
x̄ is a stationary point if and only if ∇f(x̄) = 0
Three types of stationary points:
(1) Global optimum (2) Local optimum (3) Saddle point
Example: matrix completion, neural network, . . .
Example: f(x, y) = (1/2)(xy − a)^2
Coordinate Descent
Coordinate Descent
Update one variable at a time
Coordinate Descent: repeatedly perform the following loop
Step 1: pick an index i
Step 2: compute a step size δ* by (approximately) minimizing argmin_δ f(x + δ e_i)
Step 3: x_i ← x_i + δ*
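As a concrete illustration (not from the slides), here is a minimal sketch of cyclic coordinate descent for the ridge regression objective (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2, where each single-coordinate subproblem has a closed-form minimizer; the function name, iteration count, and the assumption λ > 0 are illustrative.

```python
import numpy as np

def ridge_coordinate_descent(X, y, lam, n_iters=50):
    """Cyclic coordinate descent sketch for 1/2 ||Xw - y||^2 + lam/2 ||w||^2.
    Each coordinate minimization has a closed form."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                      # residual y - Xw, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)      # precompute x_j^T x_j for each column
    for _ in range(n_iters):
        for j in range(d):             # cyclic order; could also randomly permute
            # exact minimizer over coordinate j with the others fixed
            w_new = (X[:, j] @ r + col_sq[j] * w[j]) / (col_sq[j] + lam)
            r += X[:, j] * (w[j] - w_new)   # update residual in O(n)
            w[j] = w_new
    return w
```

Maintaining the residual r = y − Xw makes each coordinate update cost O(n) instead of O(nd).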
Coordinate Descent (update sequence)
Three types of updating order:
Cyclic: update sequence
x_1, x_2, . . . , x_n (1st outer iteration), x_1, x_2, . . . , x_n (2nd outer iteration), . . .
Randomly permute the sequence for each outer iteration (faster convergence in practice)
Some interesting theoretical analysis for this recently:
C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016
Random: each time pick a random coordinate to update
Typical way: sample from the uniform distribution
Sample from the uniform distribution vs. sample from a biased distribution:
P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015
D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015
Greedy Coordinate Descent
Greedy: choose the most “important” coordinate to update
How to measure the importance?
By first derivative: |∇_i f(x)|
By first and second derivative: |∇_i f(x) / ∇²_ii f(x)|
By maximum reduction of the objective function:
i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δ e_i) )
Need to consider the time complexity for variable selection
Useful for kernel SVM (see lecture 6)
Extension: block coordinate descent
Variables are divided into blocks {X_1, . . . , X_p}, where each X_i is a subset of variables and
X_1 ∪ X_2 ∪ · · · ∪ X_p = {1, . . . , n}, X_i ∩ X_j = ∅ ∀i ≠ j
Each time, update one X_i by (approximately) solving the subproblem within the block
Example: alternating minimization for matrix completion (2 blocks). (See lecture 7.)
Coordinate Descent (convergence)
Converges to the optimum if f(·) is convex and smooth
Has a linear convergence rate if f(·) is strongly convex
Linear convergence: the error f(x^t) − f(x^*) decays as
β, β^2, β^3, . . .
for some β < 1.
Local linear convergence: an algorithm converges linearly once ‖x − x^*‖ ≤ K for some K > 0
Coordinate Descent: other names
Alternating minimization (matrix completion)
Iterative scaling (for log-linear models)
Decomposition method (for kernel SVM)
Gauss-Seidel (for linear systems when the matrix is positive definite)
. . .
Gradient Descent
Gradient Descent
Gradient descent algorithm: repeatedly perform the following update:
x^{t+1} ← x^t − α∇f(x^t)
where α > 0 is the step size
It is a fixed-point iteration method:
x − α∇f(x) = x if and only if x is an optimal solution
Step size too large ⇒ divergence; too small ⇒ slow convergence
Gradient Descent: step size
A twice-differentiable function has an L-Lipschitz continuous gradient if and only if
∇²f(x) ⪯ LI ∀x
In this case, Condition 2 is satisfied if α < 1/L
Theorem: gradient descent converges if α < 1/L
Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex
Gradient Descent
In practice, we do not know L . . .
Step size α too large: the algorithm diverges
Step size α too small: the algorithm converges very slowly
Gradient Descent: line search
d* is a “descent direction” if and only if (d*)^T ∇f(x) < 0
Armijo-rule backtracking line search:
Try α = 1, 1/2, 1/4, . . . until it satisfies
f(x + αd*) ≤ f(x) + γα (d*)^T ∇f(x)
where 0 < γ < 1
Figure from http://ool.sourceforge.net/ool-ref.html
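A minimal sketch of this backtracking rule, assuming f and grad_f are callables for the objective and its gradient (the names, the default γ, and the cap on the number of halvings are illustrative):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, gamma=0.01, max_halvings=50):
    """Armijo backtracking: try alpha = 1, 1/2, 1/4, ... until
    f(x + alpha*d) <= f(x) + gamma * alpha * d^T grad_f(x)."""
    fx = f(x)
    slope = d @ grad_f(x)          # negative when d is a descent direction
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            return alpha
        alpha *= 0.5
    return alpha
```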
Gradient Descent: line search
Gradient descent with line search:
Converges to optimal solutions if f is smooth
Converges linearly if f is strongly convex
However, each iteration requires evaluating f several times
Several other step-size selection approaches
Gradient Descent: applying to ridge regression
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: Solution w* := argmin_w (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y) + λw
4:   Choose step size α_t
5:   Update w ← w − α_t g
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X )) per iteration
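A hedged NumPy sketch of this procedure, using a fixed step size α = 1/L with L = ‖X‖_2^2 + λ (the Lipschitz constant of the gradient); the function name and iteration count are illustrative:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, alpha=None, n_iters=500):
    """Gradient descent sketch for 1/2 ||Xw - y||^2 + lam/2 ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    if alpha is None:
        L = np.linalg.norm(X, 2) ** 2 + lam   # Lipschitz constant of the gradient
        alpha = 1.0 / L                       # safe fixed step size
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * w       # gradient: X^T(Xw - y) + lam*w
        w -= alpha * g
    return w
```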
Proximal Gradient Descent
How can we apply gradient descent to solve the Lasso problem?
argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1   (the λ‖w‖_1 term is non-differentiable)
General composite function minimization:
argmin_x f(x) := g(x) + h(x)
where g is smooth and convex, h is convex but may be non-differentiable
Usually assume h is simple (for computational efficiency)
Proximal Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2) d^T ((1/α)I) d + h(x^t + d)
           = g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
This is called “proximal” gradient descent
Usually d* = argmin_d f̃_{x^t}(d) has a closed-form solution
Proximal Gradient Descent: ℓ1-regularization
The subproblem:
x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α)) d^T d + λ‖x^t + d‖_1
        = argmin_u (1/2)‖u − (x^t − α∇g(x^t))‖^2 + λα‖u‖_1
        = S(x^t − α∇g(x^t), αλ),
where S is the soft-thresholding operator defined by
S(a, z) = a − z if a > z;  a + z if a < −z;  0 if a ∈ [−z, z]
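A one-line NumPy sketch of this operator, applied elementwise (the function name is illustrative):

```python
import numpy as np

def soft_threshold(a, z):
    """Soft-thresholding S(a, z): shrink a toward 0 by z, elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)
```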
Proximal Gradient: soft-thresholding
Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/
Proximal Gradient Descent for Lasso
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: Solution w* := argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y)
4:   Choose step size α_t
5:   Update w ← S(w − α_t g, α_t λ)
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X )) per iteration
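A hedged NumPy sketch of this loop, with the soft-thresholding step written inline and a fixed step size α = 1/‖X‖_2^2; the function name and iteration count are illustrative:

```python
import numpy as np

def lasso_proximal_gradient(X, y, lam, alpha=None, n_iters=500):
    """Proximal gradient sketch for 1/2 ||Xw - y||^2 + lam ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    if alpha is None:
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth part
    for _ in range(n_iters):
        g = X.T @ (X @ w - y)                       # gradient of the smooth part
        z = w - alpha * g
        # soft-thresholding S(z, alpha*lam), elementwise
        w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return w
```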
Newton’s Method
Newton’s Method
Iteratively perform the following update:
x ← x − α ∇²f(x)^{-1} ∇f(x)
where α is the step size
If α = 1: converges quadratically when x^t is close enough to x^*:
‖x^{t+1} − x^*‖ ≤ K‖x^t − x^*‖^2
for some constant K. This means the error f(x^t) − f(x^*) decays quadratically:
β, β^2, β^4, β^8, β^16, . . .
Only a few iterations are needed to converge in this “quadratic convergence region”
Newton’s Method
However, Newton’s update rule is more expensive than gradient descent/coordinate descent
Newton’s Method
Need to compute ∇²f(x)^{-1} ∇f(x)
Closed-form solution: O(d^3) for solving a d-dimensional linear system
Usually solved by another iterative solver:
gradient descent
coordinate descent
conjugate gradient method
. . .
Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, . . .
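As a hedged sketch, the following applies Newton’s method (with α = 1 and a dense linear solve for the Newton system) to L2-regularized logistic regression; for large d one would replace np.linalg.solve with one of the iterative inner solvers listed above. The function name and iteration count are illustrative.

```python
import numpy as np

def newton_logistic_l2(X, y, lam, n_iters=20):
    """Newton's method sketch (step size alpha = 1) for
    sum_i log(1 + exp(-y_i * w^T x_i)) + lam/2 * ||w||^2, with y_i in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = y * (X @ w)
        p = 1.0 / (1.0 + np.exp(z))              # sigma(-y_i w^T x_i)
        grad = X.T @ (-y * p) + lam * w          # gradient
        D = p * (1.0 - p)                        # per-sample Hessian weights
        H = X.T @ (X * D[:, None]) + lam * np.eye(d)   # Hessian
        w -= np.linalg.solve(H, grad)            # Newton step: H^{-1} grad
    return w
```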
Stochastic Gradient Method
Stochastic Gradient Method: Motivation
Widely used for machine learning problems (with a large number of samples)
Given training samples x_1, . . . , x_n, we usually want to solve the following empirical risk minimization (ERM) problem:
argmin_w ∑_{i=1}^n ℓ_i(w),
where each ℓ_i(·) is the loss function on sample x_i
Minimize the summation of the individual losses on each sample
Stochastic Gradient Method
Assume the objective function can be written as
f(x) = (1/n) ∑_{i=1}^n f_i(x)
Stochastic gradient method: iteratively conducts the following updates
1. Choose an index i uniformly at random
2. x^{t+1} ← x^t − η_t ∇f_i(x^t)
η_t > 0 is the step size
Why does SG work?
E_i[∇f_i(x)] = (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x)
Is it a fixed-point method? No if η > 0, because x^* − η∇f_i(x^*) ≠ x^*
Is it a descent method? No, because f(x^{t+1}) is not necessarily less than f(x^t)
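A minimal generic sketch of this update loop, assuming grad_fi(x, i) returns ∇f_i(x); the step size must decay (as noted below), and the decay schedule η_t = η_0 t^{-a} and default constants here are illustrative:

```python
import numpy as np

def sgd(grad_fi, n, x0, eta0=0.1, a=0.5, n_iters=10000, seed=0):
    """Stochastic gradient sketch: pick i uniformly at random, then
    x <- x - eta_t * grad_fi(x, i), with decaying step size eta_t = eta0 * t^(-a)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)          # uniform random index
        eta = eta0 * t ** (-a)       # step size decaying to 0
        x -= eta * grad_fi(x, i)
    return x
```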
Stochastic Gradient
Step size η has to decay to 0
(e.g., η_t = C t^{-a} for some constants a, C)
SGD converges sub-linearly
Many variants proposed recently
SVRG, SAGA (2013, 2014): variance reduction
AdaGrad (2011): adaptive learning rate
RMSProp (2012): estimate the learning rate by a running average of gradients
Adam (2015): adaptive moment estimation
Widely used in machine learning
SGD in Deep Learning
AdaGrad: adaptive step size for each parameter
Update rule at the t-th iteration:
Compute g = ∇f(x)
Estimate the second moment: G_ii = G_ii + g_i^2 for all i
Parameter update: x_i ← x_i − (η/√G_ii) g_i for all i
Proposed for convex optimization in:
“Adaptive subgradient methods for online learning and stochastic optimization” (JMLR 2011)
“Adaptive Bound Optimization for Online Convex Optimization” (COLT 2010)
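A hedged sketch of this update rule, where grad_f(x) returns a (stochastic) gradient; the small eps added before dividing by √G_ii, used to avoid division by zero, is a standard implementation detail not shown on the slide:

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.01, eps=1e-8, n_iters=1000):
    """AdaGrad sketch: per-coordinate step size eta / sqrt(G_ii), where
    G_ii accumulates the squared gradients of coordinate i."""
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)
    for _ in range(n_iters):
        g = grad_f(x)
        G += g ** 2                        # accumulate second moment per coordinate
        x -= eta / (np.sqrt(G) + eps) * g  # adaptive per-parameter update
    return x
```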
Stochastic Gradient: applying to ridge regression
Objective function:
argmin_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ‖w‖^2
How to write this as argmin_w (1/n) ∑_{i=1}^n f_i(w)? How to decompose it into n components?
First approach: f_i(w) = (w^T x_i − y_i)^2 + λ‖w‖^2
Update rule:
w^{t+1} ← w^t − 2η_t(w^T x_i − y_i) x_i − 2η_t λ w
        = (1 − 2η_t λ) w − 2η_t (w^T x_i − y_i) x_i
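A hedged sketch of this first approach, sampling i uniformly and using a decaying step size η_t = C t^{-a}; the constants C, a and the epoch count are illustrative:

```python
import numpy as np

def ridge_sgd(X, y, lam, n_epochs=50, C=0.1, a=0.6, seed=0):
    """SGD sketch for (1/n) sum_i (w^T x_i - y_i)^2 + lam ||w||^2,
    with f_i(w) = (w^T x_i - y_i)^2 + lam ||w||^2 and eta_t = C * t^(-a)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(n_epochs):
        for i in rng.integers(0, n, size=n):   # sample indices uniformly
            eta = C * t ** (-a)                # decaying step size
            r = X[i] @ w - y[i]                # residual on sample i
            w = (1 - 2 * eta * lam) * w - 2 * eta * r * X[i]
            t += 1
    return w
```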
Coming up
Other classification models
Questions?