
Summary of Review Paper: Optimization Methods for Large-Scale Machine Learning

Muthu Chidambaram

Department of Computer Science, University of Virginia

https://qdata.github.io/deep2Read/

Introduction

● Authors: Léon Bottou, Frank E. Curtis, Jorge Nocedal
● Overview of optimization methods for machine learning
● Characterization of large-scale machine learning as a distinctive setting
● Research directions for the next generation of optimization methods

Stochastic vs Batch Gradient Methods

● Stochastic Gradient Descent
○ Formulated as: w_{k+1} = w_k − α_k ∇f_{i_k}(w_k), with example i_k drawn at random
○ Uses information more efficiently
○ Computationally less expensive per iteration

● Batch Gradient Descent
○ Formulated as: w_{k+1} = w_k − (α_k / n) Σ_{i=1}^{n} ∇f_i(w_k), a step along the full-sample gradient
○ Better performance over a large number of epochs
○ Less noisy updates
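To make the two update rules concrete, here is a minimal sketch on a synthetic least-squares problem; the data, step sizes, and function names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 examples, 5 features
w_true = rng.normal(size=5)
y = X @ w_true

def grad_i(w, i):
    """Gradient of the squared loss on a single example i."""
    return (X[i] @ w - y[i]) * X[i]

def batch_step(w, alpha):
    """Batch gradient descent: average the gradient over all n examples."""
    g = X.T @ (X @ w - y) / len(y)
    return w - alpha * g

def sgd_step(w, alpha):
    """Stochastic gradient descent: one randomly sampled example per step."""
    i = rng.integers(len(y))
    return w - alpha * grad_i(w, i)

w_sgd = np.zeros(5)
w_batch = np.zeros(5)
for k in range(1000):
    w_sgd = sgd_step(w_sgd, alpha=0.01)
    w_batch = batch_step(w_batch, alpha=0.1)
```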

Notation

● f: composition of loss and prediction functions
● ξ: a random sample or set of samples from the data
● w: parameters of the prediction function
● f_i: loss with respect to a single sample

SGD Analysis: Lipschitz Continuous

SGD Analysis: Restrictions on Moments
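The slides do not reproduce the assumptions themselves; roughly, the review works with a Lipschitz-continuous gradient and bounds on the first and second moments of the stochastic gradient g(w_k, ξ_k), along the following lines (F here denotes the expected or empirical risk being minimized, and the constants μ, M, M_G are my notation; the exact statement in the paper may differ):

```latex
% Assumption: Lipschitz-continuous gradient of the objective F
\[
\|\nabla F(w) - \nabla F(\bar{w})\|_2 \;\le\; L\,\|w - \bar{w}\|_2
\quad \text{for all } w, \bar{w}.
\]
% Assumption: first and second moment limits on the stochastic gradient g(w_k, \xi_k)
\[
\nabla F(w_k)^\top\, \mathbb{E}_{\xi_k}\!\big[g(w_k, \xi_k)\big] \;\ge\; \mu\,\|\nabla F(w_k)\|_2^2 ,
\qquad
\mathbb{E}_{\xi_k}\!\big[\|g(w_k, \xi_k)\|_2^2\big] \;\le\; M + M_G\,\|\nabla F(w_k)\|_2^2 .
\]
```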

SGD Analysis: Strongly Convex (Fixed Stepsize)

SGD Analysis: Strongly Convex (Diminishing Stepsize)
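For reference, the strongly convex results summarized on these slides have roughly the following simplified form (c is the strong convexity constant, L the Lipschitz constant, M and μ the moment constants above; the precise constants in the paper differ slightly):

```latex
% Fixed stepsize \bar{\alpha}: linear convergence to a noise floor proportional to \bar{\alpha}
\[
\mathbb{E}[F(w_k) - F_*] \;\lesssim\;
\frac{\bar{\alpha} L M}{2 c \mu}
\;+\; (1 - \bar{\alpha} c \mu)^{\,k-1}\,\big(F(w_1) - F_*\big).
\]
% Diminishing stepsize \alpha_k = \beta / (\gamma + k): sublinear O(1/k) convergence
\[
\mathbb{E}[F(w_k) - F_*] \;\le\; \frac{\nu}{\gamma + k} \;=\; O(1/k)
\quad \text{for some constant } \nu.
\]
```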

Roles of Assumptions

● Strong Convexity
○ Key for ensuring the O(1/k) convergence rate

● Initialization
○ Can be used to reduce the influence of the initial optimality gap when running SGD with a diminishing stepsize
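For reference, the strong convexity assumption used above is the standard one (this definition is not transcribed from the slides):

```latex
% F is c-strongly convex if, for all w and \bar{w},
\[
F(\bar{w}) \;\ge\; F(w) + \nabla F(w)^\top (\bar{w} - w) + \tfrac{c}{2}\,\|\bar{w} - w\|_2^2 .
\]
```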

SGD Analysis: General Objectives


Complexity for Large-Scale Learning

● Consider an effectively unlimited supply of training examples
● The cost of each batch gradient evaluation grows linearly with the number of training examples
● The cost of each SGD iteration is independent of the number of training examples
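Up to condition-number and constant factors, the resulting time-to-accuracy comparison for a strongly convex objective is roughly the following (a simplified paraphrase of the review's comparison, with n the number of examples and ε the target optimality gap):

```latex
\[
\text{batch gradient descent:}\quad T \;\sim\; n \,\log(1/\epsilon),
\qquad
\text{SGD:}\quad T \;\sim\; 1/\epsilon ,
\]
% so when the data set is large relative to the required accuracy,
% SGD reaches a given expected risk with less total computation.
```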

SGD Noise Reduction Methods

● Dynamic sampling
○ Gradually increase the minibatch size across iterations

● Gradient aggregation
○ Store and reuse previous gradients to reduce variance

● Iterate averaging
○ Return an average of the iterates rather than the last iterate

SGD Noise Reduction Behavior

SGD Dynamic Sampling

● Increasing the minibatch size geometrically guarantees linear convergence (in expectation, for strongly convex objectives); a sketch of this schedule is given below

● Practical implementations use adaptive sampling rules
○ Not yet tried extensively in ML
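A minimal sketch of the geometric schedule on a synthetic least-squares problem; the growth factor tau, step size, and other names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

def minibatch_grad(w, batch_size):
    """Unbiased least-squares gradient estimate from a random minibatch."""
    idx = rng.integers(len(y), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

def dynamic_sampling_sgd(w0, alpha=0.05, b0=2, tau=1.2, iters=60):
    """SGD with a geometrically growing minibatch size b_k ~ b0 * tau**k."""
    w, b = w0.copy(), float(b0)
    for _ in range(iters):
        w = w - alpha * minibatch_grad(w, max(1, int(round(b))))
        b *= tau  # grow the sample size geometrically each iteration
    return w

w_hat = dynamic_sampling_sgd(np.zeros(5))
```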

SGD Gradient Aggregation

● Stochastic Variance Reduced Gradient (SVRG)
○ Periodically computes a full (batch) gradient at a reference point and uses it to reduce the variance of the stochastic steps that follow (see the sketch below)

● SAGA
○ Keeps the most recently computed gradient for each example and uses their average to build a variance-reduced, still unbiased step
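A minimal SVRG sketch for a least-squares objective; the step size, loop lengths, and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def grad_i(w, i):
    """Gradient of the squared loss on example i."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / len(y)

def svrg(w0, alpha=0.05, outer=20, inner=500):
    """SVRG: a periodic full gradient at a reference point w_ref is used to
    reduce the variance of each stochastic step."""
    w = w0.copy()
    for _ in range(outer):
        w_ref = w.copy()
        mu = full_grad(w_ref)          # full gradient at the reference point
        for _ in range(inner):
            i = rng.integers(len(y))
            g = grad_i(w, i) - grad_i(w_ref, i) + mu  # variance-reduced, still unbiased
            w = w - alpha * g
    return w

w_hat = svrg(np.zeros(5))
```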

SGD Iterate Averaging

● Run SGD as usual, but return an average of the computed iterates to reduce the noise in the final estimate (sketch below)
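A minimal sketch of Polyak-Ruppert style iterate averaging; `grad_sample` is an assumed callback supplied by the user, not an API from the paper.

```python
import numpy as np

def averaged_sgd(grad_sample, w0, alpha=0.01, iters=10_000, seed=0):
    """Iterate averaging: run plain SGD, but return the running average of the
    iterates instead of the final iterate. `grad_sample(w, rng)` is assumed to
    return an unbiased stochastic gradient at w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    w_bar = w0.copy()
    for k in range(1, iters + 1):
        w = w - alpha * grad_sample(w, rng)
        w_bar = w_bar + (w - w_bar) / k   # incremental running average
    return w_bar
```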

Second-Order Methods

● Motivation: SGD is not invariant to linear rescalings of the variables

● Hessian-free Newton Method
○ Uses second-order information without explicitly forming the Hessian

● Quasi-Newton and Gauss-Newton Methods
○ Mimic Newton's method using only a sequence of first-order information

● Natural Gradient
○ Defines the search direction in the space of realizable distributions rather than in parameter space

Second-Order Method Overview

Hessian-Free Inexact Newton Methods

● Solve the Newton system with the conjugate gradient (CG) method instead of a matrix factorization (see the sketch below)

○ Only requires Hessian-vector products
■ Similar in spirit to the kernel trick: the Hessian never has to be formed explicitly
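A minimal sketch of the idea: CG touches the Hessian only through a `hvp` callback, illustrated here on a small quadratic (the names and the toy problem are my own, not from the paper).

```python
import numpy as np

def cg_solve(hvp, g, max_iter=10, tol=1e-6):
    """Approximately solve H p = -g with conjugate gradients, using only
    Hessian-vector products hvp(v) = H @ v (the Hessian is never formed)."""
    p = np.zeros_like(g)
    r = -g - hvp(p)          # residual of H p = -g
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hd = hvp(d)
        alpha = rs_old / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return p

# Toy quadratic: F(w) = 0.5 w^T A w - b^T w, so the Hessian is A and grad = A w - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2)
grad = A @ w - b
step = cg_solve(lambda v: A @ v, grad)   # approximate Newton step
w = w + step
```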

Subsampled Hessian-Free Newton Methods

Stochastic Quasi-Newton Methods

● Approximate the (inverse) Hessian using only first-order information, i.e. differences of gradients (see the two-loop sketch below)
● Problems
○ Hessian approximations can be dense, even when the true Hessian is sparse
○ Limited-memory schemes only allow provably linear convergence
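As one concrete instance, the limited-memory BFGS (L-BFGS) two-loop recursion computes a quasi-Newton direction from stored parameter and gradient differences; this sketch follows the standard recursion, and the variable names are mine rather than the paper's.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: compute -H_k @ grad, where H_k is an implicit
    inverse-Hessian approximation built from the m most recent parameter
    differences s_i = w_{i+1} - w_i and gradient differences
    y_i = grad_{i+1} - grad_i. Only first-order information is used."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:                      # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        gamma = (s @ y) / (y @ y)
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += (a - beta) * s
    return -r                       # quasi-Newton search direction
```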

Gauss-Newton Methods

● Minimize a second-order model in which the prediction function is linearized inside the loss; the resulting Gauss-Newton matrix is a positive semidefinite Hessian approximation
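For the least-squares case, the standard form of this approximation is the following (notation mine: h is the prediction function, J its Jacobian with respect to w, and r the residual):

```latex
% Gauss-Newton approximation for f(w) = 1/2 * || h(x; w) - y ||^2:
\[
\nabla^2 f(w) \;=\; J(w)^\top J(w) \;+\; \sum_j r_j(w)\, \nabla^2 h_j(x; w)
\;\;\approx\;\; J(w)^\top J(w),
\qquad r(w) = h(x; w) - y .
\]
% Dropping the second term gives a positive semidefinite approximation.
```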

Natural Gradient Methods

● Invariant to invertible (smooth) reparameterizations of the parameters
● Can be viewed as gradient descent over prediction functions rather than over parameters
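A common way to write the natural gradient step is shown below; the notation here, including G for the Fisher information matrix of the model's predictive distribution p(y|x; w), is mine rather than the slides'.

```latex
\[
w_{k+1} \;=\; w_k \;-\; \alpha_k\, G(w_k)^{-1}\, \nabla F(w_k),
\qquad
G(w) \;=\; \mathbb{E}\!\left[ \nabla_w \log p(y \mid x; w)\,
                              \nabla_w \log p(y \mid x; w)^\top \right],
\]
% where F is the objective (expected or empirical risk) as before.
```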

Diagonal Scaling Methods

● Rescale the search direction with a diagonal (per-coordinate) transformation, as in the sketch below
● Examples
○ RMSProp
○ AdaGrad

● Structural Methods
○ Batch Normalization
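Minimal sketches of the two diagonal-scaling updates named above; the step sizes and epsilon values are typical defaults, not prescribed by the paper.

```python
import numpy as np

def adagrad_step(w, g, accum, alpha=0.01, eps=1e-8):
    """AdaGrad-style diagonal scaling: accumulate squared gradients per
    coordinate and divide the step by their square root."""
    accum = accum + g * g
    w = w - alpha * g / (np.sqrt(accum) + eps)
    return w, accum

def rmsprop_step(w, g, avg, alpha=0.001, rho=0.9, eps=1e-8):
    """RMSProp-style diagonal scaling: exponential moving average of squared
    gradients instead of the full sum."""
    avg = rho * avg + (1.0 - rho) * g * g
    w = w - alpha * g / (np.sqrt(avg) + eps)
    return w, avg

# Usage inside a training loop, with g an unbiased stochastic gradient:
# w, accum = adagrad_step(w, g, accum)
# w, avg   = rmsprop_step(w, g, avg)
```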