
1

An Evolving Gradient Resampling Method for Machine Learning

Jorge Nocedal, Northwestern University

NIPS, Montreal 2015

2

Collaborators

Figen Oztoprak, Stefan Solntsev, Richard Byrd

3

Outline

1.  How to improve upon the stochastic gradient method for risk minimization

2.  Noise reduction methods: dynamic sampling (batching); aggregated gradient methods (SAG, SVRG, etc.)

3.  Second-order methods

4.  Propose a noise reduction method that re-uses old gradients and also employs dynamic sampling

4

Organization of optimization methods

[Schematic: the stochastic gradient method, the batch gradient method, and the stochastic Newton method arranged along two axes — noise reduction (stochastic gradient → batch gradient) and condition number (stochastic gradient → stochastic Newton).]

5

Second-order methods

Along the condition-number axis, from the stochastic gradient method toward the stochastic Newton method:

•  Averaging (Polyak-Ruppert)
•  Momentum
•  Natural gradient (Fisher)
•  Quasi-Newton
•  Inexact Newton (Hessian-free)

6

Noise reducing methods

Along the noise-reduction axis, from the stochastic gradient method toward the batch gradient method:

•  Dynamic sampling methods
•  Aggregated gradient methods

This talk: combine both ideas. Why?

7

Objective Function

min_w F(w) = E[f(w; ξ)]

ξ = (x, y) is a random variable with distribution P; f(·; ξ) is the composition of a loss ℓ and a prediction h.

Sample gradient approximation over a batch (or mini-batch) S:

F_S(w) = (1/|S|) Σ_{i∈S} f(w; ξ_i)

w_{k+1} = w_k − α_k ∇f(w_k; ξ_k)    stochastic gradient (SG) method
w_{k+1} = w_k − α_k ∇F_S(w_k)       batch (mini-batch) method
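For concreteness, a minimal Python sketch of the two updates, assuming a user-supplied per-example gradient grad_f(w, xi) standing in for ∇f(w; ξ); the function name and data layout are mine, not from the talk.

```python
import numpy as np

def sg_step(w, alpha, grad_f, xi):
    """One stochastic gradient step: w_{k+1} = w_k - alpha_k * grad f(w_k; xi_k)."""
    return w - alpha * grad_f(w, xi)

def minibatch_step(w, alpha, grad_f, batch):
    """One mini-batch step: w_{k+1} = w_k - alpha_k * grad F_S(w_k),
    where grad F_S is the average of the per-example gradients over S."""
    g = np.mean([grad_f(w, xi) for xi in batch], axis=0)
    return w - alpha * g
```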

8

Transient behavior of SG

To ensure convergence of the SG method, α_k → 0 is needed to control the variance. The steplength is selected to achieve fast initial progress, but this will slow progress in later stages.

Expected function decrease:
E[F(w_{k+1}) − F(w_k)] ≤ −α_k ‖∇F(w_k)‖₂² + α_k² E‖∇f(w_k, ξ_k)‖₂²

Initially the gradient-decrease term dominates; later the variance in the gradient hinders progress (area of confusion).

Dynamic sampling methods reduce the gradient variance by increasing the batch size. What is the right rate?

Proposal: Gradient accuracy conditions

Consider the stochastic gradient method with a fixed steplength: w_{k+1} = w_k − α g(w_k, ξ_k)

Geometric noise reduction

If the variance of stochastic gradient decreases geometrically, the method yields linear convergence

Lemma. If there exist M > 0 and ζ ∈ (0,1) such that
E[‖g(w_k, ξ_k)‖₂²] − ‖∇F(w_k)‖₂² ≤ M ζ^{k−1},
then E[F(w_k) − F*] ≤ ν ρ^{k−1} for some ν > 0 and ρ ∈ (0,1).

This extends the classical convergence result for the gradient method in which the error in the gradient estimates decreases sufficiently rapidly to preserve linear convergence (Schmidt et al.; Pasupathy et al.).

Proposal: Gradient accuracy conditions

Optimal work complexity

Moreover, we obtain optimal complexity bounds

We can ensure the variance condition
E[‖g(w_k, ξ_k)‖₂²] − ‖∇F(w_k)‖₂² ≤ M ζ^{k−1}
by letting |S_k| = a^{k−1}, a > 1: since the variance of the sample-average gradient scales like 1/|S_k|, this gives a geometric decrease with ζ = 1/a.

The total number of stochastic gradient evaluations needed to achieve E[F(w_k) − F*] ≤ ε is then O(1/ε), with favorable constants for a ∈ (1, [1 − βcμ/2]⁻¹).
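A rough accounting (mine, not on the slide) of where O(1/ε) comes from, assuming the linear rate ρ from the lemma above: reaching accuracy ε takes about K ≈ log(1/ε)/log(1/ρ) iterations, and the geometrically growing batches sum to

\[
\sum_{k=1}^{K} |S_k| \;=\; \sum_{k=1}^{K} a^{k-1} \;\le\; \frac{a^{K}}{a-1},
\qquad
a^{K} \;=\; (1/\epsilon)^{\log a / \log(1/\rho)} \;=\; \mathcal{O}(1/\epsilon)
\quad \text{whenever } a \le 1/\rho .
\]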

Pasupathy, Glynn et al. 2014; Friedlander, Schmidt 2012; Homem-de-Mello, Shapiro 2012; Byrd, Chin, N., Wu 2013

∇F_S(w) = (1/|S|) Σ_{i∈S} ∇f(w; ξ_i)

Theorem. Suppose F is strongly convex. Consider w_{k+1} = w_k − (1/L) g_k, where S_k is chosen so that the variance condition holds and |S_k| ≥ γ^k for some γ > 1. Then
E[F(w_k) − F(w*)] ≤ C ρ^k,   ρ < 1,
and the number of gradient samples needed to achieve ε accuracy is
O( (κ/ε) · (ω d/λ) ),
where d = number of variables, κ = condition number, λ = smallest eigenvalue of the Hessian, and ‖Var ∇ℓ(w_k; i)‖₁ ≤ ω (population variance).

12

Dynamic sampling (batching)

At every iteration, choose a subset S of {1,…,n} and apply one step of an optimization algorithm to the function

F_S(w) = (1/|S|) Σ_{i∈S} f(w; ξ_i)

At the start, a small sample size |S| is chosen.
•  If the optimization step is likely to reduce F(w), the sample size is kept unchanged; a new sample S is chosen; the next optimization step is taken.
•  Else, a larger sample size is chosen, a new random sample S is selected, and a new iterate is computed.

Many optimization methods can be used. This approach creates the opportunity of employing second-order methods.
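A minimal sketch of this loop in Python, under assumptions of mine: opt_step applies one step of whichever optimizer is chosen to F_S, and step_looks_productive stands in for the acceptance test (e.g. a variance test) that decides whether the current sample size suffices.

```python
import numpy as np

def dynamic_sampling(w, data, opt_step, step_looks_productive,
                     s0=32, growth=2.0, max_iters=1000, seed=0):
    """Dynamic sampling (batching): take a step on F_S; if the step does not
    look productive, enlarge the sample, redraw it, and recompute the step."""
    rng = np.random.default_rng(seed)
    n, size = len(data), float(s0)
    for _ in range(max_iters):
        S = rng.choice(n, size=min(int(size), n), replace=False)
        w_new = opt_step(w, [data[i] for i in S])      # one step on F_S
        if not step_looks_productive(w, w_new, S):     # sample looks too noisy
            size *= growth                             # enlarge the sample size,
            S = rng.choice(n, size=min(int(size), n), replace=False)
            w_new = opt_step(w, [data[i] for i in S])  # redraw S, recompute step
        w = w_new
    return w
```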

13

How to implement this in practice?

1.  Predetermine a geometric increase (with a tuning parameter): |S_k| = a^{k−1}, a > 1.

2.  Use an angle test (i.e. a variance test): ensure that the bound ‖g(w_k) − ∇F(w_k)‖ ≤ θ‖g_k‖, θ < 1, is satisfied in expectation.

Popular: a combination of these two strategies (see the sketch below).
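One way to check that bound in expectation — a sketch under my own assumptions, since the slide does not spell out the estimator — is to approximate E‖g_S − ∇F(w_k)‖² by the trace of the sample covariance of the per-example gradients divided by |S|.

```python
import numpy as np

def norm_test(per_example_grads, theta=0.9):
    """Approximate check of E||g_S - grad F||^2 <= theta^2 ||g_S||^2 using
    the gradients of the examples in the current sample S."""
    G = np.asarray(per_example_grads)      # shape (|S|, d)
    g = G.mean(axis=0)                     # sampled gradient g_S
    var = G.var(axis=0, ddof=1).sum()      # trace of the sample covariance
    return var / len(G) <= theta**2 * np.dot(g, g)
```

If the test fails, the sample size can be increased, for example geometrically as in strategy 1.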

14

Numerical tests: Newton-CG method with dynamic sampling and an Armijo line search,
w_{k+1} = w_k − α_k ∇²F_{H_k}(w_k)⁻¹ g_k,   α_k ≈ 1.

Test problem:
•  From Google VoiceSearch
•  191,607 training points
•  129 classes; 235 features
•  30,315 parameters (variables)
•  Small version of a production problem
•  Multi-class logistic regression
•  Initial batch size: 1%; Hessian sample: 10%
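A sketch of such a subsampled (Hessian-free) Newton-CG step, under my assumptions: hvp(v) returns the Hessian-vector product ∇²F_{H_k}(w_k)·v on the Hessian sample, g is the sampled gradient, and the CG solve is truncated (inexact).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_cg_step(w, g, hvp, alpha=1.0, cg_maxiter=50):
    """One subsampled Newton-CG step: solve H_k p = -g approximately by CG,
    accessing the sampled Hessian H_k only through products hvp(v)."""
    d = w.size
    H = LinearOperator((d, d), matvec=hvp)
    p, _ = cg(H, -g, maxiter=cg_maxiter)   # truncated CG: inexact Newton step
    return w + alpha * p                   # alpha ~ 1 accepted by the line search
```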

15

[Figure: function value versus time for classical Newton-CG, the new dynamic Newton-CG, L-BFGS (m = 2, 20), batch L-BFGS, and stochastic gradient descent; the sample sizes change dynamically based on a variance estimate.]

16

However, not completely satisfactory

More investigation is needed, particularly:
•  Transition between stochastic and batch regimes
•  Coordination between step size and batch size
•  Use of second-order information (one stochastic gradient is not too noisy)
•  Can the idea of re-using gradients in a gradient aggregation approach help?

17

Stochastic process gradient methods

[Diagram: sample size ranging from 1 (SGD, α_k = 1/k) through |S_k| — the "twilight zone" — up to m (batch, α_k = 1).]

Transition from stochastic to batch regimes: gradient aggregation could smooth the transition.

18

Randomized Aggregated Gradient Methods (for empirical risk min)

SAG, SAGA, SVRG, etc. focus on minimizing the empirical risk.

Expected risk:   F(w) = E[f(w; ξ)]
Empirical risk:  F_m(w) = (1/m) Σ_{i=1}^m f(w; ξ_i) = (1/m) Σ_{i=1}^m f_i(w)

Iteration: w_{k+1} = w_k − α y_k, where y_k is a combination of gradients ∇f_i evaluated at previous iterates φ^j.

SAG (with j chosen at random):
y_k = (1/m)[∇f_j(w_k) − ∇f_j(φ_{k−1}^j)] + (1/m) Σ_{i=1}^m ∇f_i(φ_{k−1}^i)
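A minimal sketch of that SAG iteration in Python; the table phi_grads[i] stores ∇f_i(φ^i), the gradient most recently computed for example i, and grad_fi, grad_sum, and the other names are mine.

```python
import numpy as np

def sag_step(w, alpha, grad_fi, phi_grads, grad_sum, rng):
    """One SAG iteration: refresh one stored gradient and move along the
    average of all stored gradients (grad_sum caches sum_i phi_grads[i])."""
    m = len(phi_grads)
    j = rng.integers(m)               # choose j at random
    g_new = grad_fi(j, w)             # grad f_j(w_k)
    grad_sum = grad_sum + (g_new - phi_grads[j])
    phi_grads[j] = g_new
    y = grad_sum / m                  # equals the SAG direction y_k above
    return w - alpha * y, phi_grads, grad_sum
```

Only one new gradient is computed per iteration; the noise reduction is paid for with the memory for the table of m stored gradients.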

19

Example of Gradient Aggregation Methods

y_k = (1/m)[∇f_j(w_k) − ∇f_j(φ_{k−1}^j)] + (1/m) Σ_{i=1}^m ∇f_i(φ_{k−1}^i),
with F_m(w) = (1/m) Σ_{i=1}^m f(w; ξ_i) = (1/m) Σ_{i=1}^m f_i(w).

SAG, SAGA, and SVRG achieve a linear rate of convergence in expectation (after a full initialization pass).

20

EGR Method

The Evolving Gradient Resampling Method for Expected Risk Minimization

21

Proposed algorithm

1.  Minimizes the expected risk (not the training error)
2.  Stores previous gradients and updates several (s_k) of them at each iteration
3.  Computes additional (u_k) gradients at the current iterate
4.  The total number of stored gradients increases monotonically
5.  Shares properties with dynamic sampling and gradient aggregation methods

Goal: analyze an algorithm of this generality (interesting in its own right). Finding the right balance between re-using old information and batching can result in an efficient method.

22

The EGR method

y_k = 1/(t_k + u_k) ( Σ_{j∈S_k} [∇f_j(w_k) − ∇f_j(φ_{k−1}^j)] + Σ_{i=1}^{t_k} ∇f_i(φ_{k−1}^i) + Σ_{j∈U_k} ∇f_j(w_k) )

t_k : number of gradients in storage at the start of iteration k
U_k : indices of new gradients sampled (evaluated) at w_k,   u_k = |U_k|
S_k : indices of previously computed gradients that are updated (re-evaluated at w_k),   s_k = |S_k|

How should s_k and u_k be controlled?

Related work: Frostig et al. 2014; Babanezhad et al. 2015
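A sketch of one EGR iteration as I read the formula above; memory holds (example, stored gradient) pairs, and grad_f, draw_examples, and the variable names are assumptions of mine rather than the talk's notation.

```python
import numpy as np

def egr_step(w, alpha, grad_f, memory, s_k, u_k, draw_examples, rng):
    """One EGR iteration: refresh s_k stored gradients at w_k, add u_k new
    gradients evaluated at w_k, and step along the average of the store."""
    t_k = len(memory)                     # memory: list of [xi_i, grad_i] pairs
    if t_k > 0 and s_k > 0:               # re-evaluate s_k old gradients at w_k
        for j in rng.choice(t_k, size=min(s_k, t_k), replace=False):
            memory[j][1] = grad_f(w, memory[j][0])
    for xi in draw_examples(u_k):         # u_k fresh samples, evaluated at w_k
        memory.append([xi, grad_f(w, xi)])
    y = np.mean([g for _, g in memory], axis=0)   # average of t_k + u_k gradients
    return w - alpha * y, memory
```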

23

Algorithms included in framework

stochastic gradient method:  s_k = 0, u_k = 1
dynamic sampling method:     s_k = 0, u_k = function(k)
aggregated gradient:         s_k = constant, u_k = 0
EGR lin:   s_k = 0,   u_k = r
EGR quad:  s_k = rk,  u_k = r
EGR exp:   s_k = u_k ≈ a^k
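The variants differ only in how (s_k, u_k) grow with k; a small helper of my own devising makes the taxonomy concrete (the geometric choice for dynamic sampling is just one example of function(k)).

```python
def sample_sizes(variant, k, r=2, a=1.1):
    """Return (s_k, u_k): s_k old gradients refreshed, u_k new ones added."""
    if variant == "sg":                  # plain stochastic gradient
        return 0, 1
    if variant == "dynamic_sampling":    # growing batch, no re-use
        return 0, int(a ** k)
    if variant == "aggregated":          # SAG-like: fixed memory, refresh only
        return r, 0
    if variant == "egr_lin":
        return 0, r
    if variant == "egr_quad":
        return r * k, r
    if variant == "egr_exp":
        return int(a ** k), int(a ** k)
    raise ValueError(f"unknown variant: {variant}")
```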

Assumptions:
1) s_k = u_k = a^k, a ∈ ℝ₊ (geometric growth)
2) F strongly convex; the f_i have Lipschitz continuous gradients
3) tr(Var[∇f_i(w)]) ≤ v² for all w

Lemma (Lyapunov function):

( E[E_k[e_k]],  E[‖w_k − w*‖],  σ_k )ᵀ  ≤  M ( E[E_k[e_{k−1}]],  E[‖w_{k−1} − w*‖],  σ_{k−1} )ᵀ

where
e_k = (1/t_{k+1}) Σ_{i=1}^{t_{k+1}} ‖∇f_i(φ_k^i) − ∇f_i(w_k)‖,    σ_k = v²/t_{k+1},

M = [ (1−η)/(1+η)·(1+αL)    (1−η)/(1+η)·αL    (1−η)/(1+η)·α
      αL                     1−αμ              α
      0                      0                 1/(1+η)       ]

η : probability that an old gradient is recomputed
α : steplength
L : Lipschitz constant
μ : strong convexity parameter

Lemma. For sufficiently small α, the spectral radius of M satisfies ρ_M < 1.
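A quick numerical illustration of that lemma (the parameter values below are my own, purely illustrative choices): assemble M and check that its spectral radius drops below 1 for a small steplength.

```python
import numpy as np

# Illustrative values (mine): eta = prob. an old gradient is recomputed,
# L = Lipschitz constant, mu = strong convexity parameter, alpha = steplength.
eta, L, mu, alpha = 0.5, 10.0, 1.0, 0.01

c = (1 - eta) / (1 + eta)
M = np.array([
    [c * (1 + alpha * L), c * alpha * L,  c * alpha],
    [alpha * L,           1 - alpha * mu, alpha],
    [0.0,                 0.0,            1.0 / (1 + eta)],
])
rho_M = max(abs(np.linalg.eigvals(M)))
print(f"spectral radius of M: {rho_M:.4f}")   # below 1 for this small alpha
```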

Theorem: If α_k is chosen small enough, then E‖w_k − w*‖ ≤ C β^k with β < 1 (R-linear convergence).

SAG special case: t_k = m, u_k = 0, s_k = constant. This yields a simple proof of R-linear convergence of SAG, but with a larger constant.

27

Related Methods

y_k = (1/m)[∇f_j(w_k) − ∇f_j(φ_{k−1}^j)] + (1/m) Σ_{i=1}^m ∇f_i(φ_{k−1}^i)

•  Streaming SVRG (Frostig et al. 2014)
•  Stop Wasting My Gradients (Babanezhad et al. 2015)

28

SAG(A) initialization phase

y_k = (1/m)[∇f_j(w_k) − ∇f_j(φ_{k−1}^j)] + (1/m) Σ_{i=1}^m ∇f_i(φ_{k−1}^i)   (SAG)

1. Sample j ∈ {1,…,m} at random
2. Compute ∇f_j(w_k)
3. If j has been sampled earlier, replace the old gradient
4. Else, add the new ∇f_j(w_k) to the aggregated gradient (memory grows)
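A sketch of those four steps (the variable names and dictionary-based memory are mine); the aggregated sum grows as previously unseen indices are sampled, dividing by m as in the SAG formula above.

```python
import numpy as np

def sag_init_step(w, alpha, grad_fi, m, seen, grad_sum, rng):
    """One iteration of the SAG(A) initialization phase described above;
    seen maps sampled indices j to their stored gradients."""
    j = rng.integers(m)                 # 1. sample j in {1,...,m}
    g_new = grad_fi(j, w)               # 2. compute grad f_j(w_k)
    if j in seen:                       # 3. j seen before: replace old gradient
        grad_sum = grad_sum + (g_new - seen[j])
    else:                               # 4. new index: memory grows
        grad_sum = grad_sum + g_new
    seen[j] = g_new
    y = grad_sum / m                    # (1/m) * sum of stored gradients
    return w - alpha * y, seen, grad_sum
```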

29

Numerical Results

1.  Comparison of EGR with various growth rates
2.  Comparison with SGD
3.  Comparison with SAG-init and SAGA-init


30

Susy: EGR vs. SG vs. dynamic sampling


31

Random: comparison with SAG initialization


32

Random: larger initial batch for EGR


33

Alpha


34

EGR with various growth rates (Alpha)


35

Random


36

The End