Gradient descent method


Transcript of Gradient descent method

Page 1: Gradient descent method

Gradient descent method (2013.11.10)

Sanghyuk Chun

Much of the content is from:

Large Scale Optimization Lecture 4 & 5 by Caramanis & Sanghavi

Convex Optimization Lecture 10 by Boyd & Vandenberghe

Convex Optimization textbook Chapter 9 by Boyd & Vandenberghe

Page 2: Gradient descent method

Contents

β€’ Introduction

β€’ Example code & Usage

β€’ Convergence Conditions

β€’ Methods & Examples

β€’ Summary


Page 3: Gradient descent method

Introduction: Unconstrained minimization problems, Description, Pros and Cons


Page 4: Gradient descent method

Unconstrained minimization problems

β€’ Recall: Constrained minimization problems

β€’ From Lecture 1, the formulation of a general constrained convex optimization problem is as follows: min f(x) s.t. x ∈ Ο‡

β€’ where f: Ο‡ β†’ R is convex and smooth

β€’ From Lecture 1, the formulation of an unconstrained optimization problem is as follows: min f(x)

β€’ where f: Rⁿ β†’ R is convex and smooth

β€’ In this problem, the necessary and sufficient condition for an optimal solution xβ‚€ is βˆ‡f(x) = 0 at x = xβ‚€


Page 5: Gradient descent method

Unconstrained minimization problems

β€’ Minimize f(x)

β€’ When f is differentiable and convex, a necessary and sufficient condition for a point x* to be optimal is βˆ‡f(x*) = 0

β€’ Minimizing f(x) is the same as finding a solution of βˆ‡f(x*) = 0

β€’ min f(x): analytically solving the optimality equation

β€’ βˆ‡f(x*) = 0: usually solved by an iterative algorithm


Page 6: Gradient descent method

Description of Gradient Descent Method

β€’ The idea relies on the fact that βˆ’βˆ‡f(x^(k)) is a descent direction

β€’ x^(k+1) = x^(k) βˆ’ Ξ·_k βˆ‡f(x^(k)), with f(x^(k+1)) < f(x^(k))

β€’ βˆ†x^(k) is the step, or search direction

β€’ Ξ·_k is the step size, or step length

β€’ Too small an Ξ·_k causes slow convergence

β€’ Too large an Ξ·_k can overshoot the minimum and diverge


Page 7: Gradient descent method

Description of Gradient Descent Method

β€’ Algorithm (Gradient Descent Method)

β€’ given a starting point x ∈ dom f

β€’ repeat
  1. βˆ†x := βˆ’βˆ‡f(x)
  2. Line search: choose step size Ξ· via exact or backtracking line search
  3. Update: x := x + Ξ·βˆ†x

β€’ until the stopping criterion is satisfied

β€’ Stopping criterion: usually β€–βˆ‡f(x)β€–β‚‚ ≀ Ξ΅

β€’ Very simple, but often very slow; rarely used in practice
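A minimal Python sketch of this loop, using a fixed step size in place of the line search step; the example objective, the step size, and the tolerance are illustrative assumptions rather than values from the slides:

import numpy as np

def gradient_descent(grad, x0, eta=0.1, eps=1e-6, max_iter=10000):
    # Minimal gradient descent: fixed step size eta,
    # stop when the gradient norm ||grad f(x)||_2 <= eps
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                   # search direction is -g
        if np.linalg.norm(g) <= eps:  # stopping criterion from the slide
            break
        x = x - eta * g               # update x := x + eta * delta_x
    return x

# Example usage on f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approximately [1, -3]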


Page 8: Gradient descent method

Pros and Cons

β€’ Pros

β€’ Can be applied in any dimension and space (even infinite-dimensional spaces)

β€’ Easy to implement

β€’ Cons

β€’ Local optima problem (for non-convex objectives)

β€’ Relatively slow close to the minimum

β€’ For non-differentiable functions, gradient methods are ill-defined


Page 9: Gradient descent method

Example Code & Usage: Example Code, Usage, Questions


Page 10: Gradient descent method

Gradient Descent Example Code

β€’ http://mirlab.org/jang/matlab/toolbox/machineLearning/


Page 11: Gradient descent method

Usage of Gradient Descent Method

β€’ Linear Regression

β€’ Minimize the loss function to choose the best hypothesis


Example of a loss function:

(data_predicted βˆ’ data_observed)Β²

Find the hypothesis (function) which minimizes the loss function
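As an illustration of this use, here is a small Python sketch (the data, learning rate, and iteration count are assumed for the example) that fits a one-variable hypothesis h(x) = w*x + b by gradient descent on the squared loss above:

import numpy as np

# Toy data roughly following y = 2x + 1 (assumed for illustration)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.9])

w, b, eta = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (w * X + b) - y          # data_predicted - data_observed
    grad_w = 2 * np.mean(err * X)  # d/dw of the mean squared loss
    grad_b = 2 * np.mean(err)      # d/db of the mean squared loss
    w, b = w - eta * grad_w, b - eta * grad_b

print(w, b)  # close to the underlying slope 2 and intercept 1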

Page 12: Gradient descent method

Usage of Gradient Descent Method

β€’ Neural Network

β€’ Backpropagation

β€’ SVM (Support Vector Machine)

β€’ Graphical models

β€’ Least Mean Squared Filter

…and many other applications!


Page 13: Gradient descent method

Questions

β€’ Does Gradient Descent Method always converge?

β€’ If not, what is condition for convergence?

β€’ How can we make the Gradient Descent Method faster?

β€’ What is a proper value for the step size Ξ·_k?


Page 14: Gradient descent method

Convergence Conditions: L-Lipschitz function, Strong Convexity, Condition number


Page 15: Gradient descent method

L-Lipschitz function

β€’ Definition

β€’ A function f: Rⁿ β†’ R is called L-Lipschitz if and only if β€–βˆ‡f(x) βˆ’ βˆ‡f(y)β€–β‚‚ ≀ Lβ€–x βˆ’ yβ€–β‚‚, βˆ€x, y ∈ Rⁿ

β€’ We denote this condition by f ∈ C_L, where C_L is the class of L-Lipschitz functions


Page 16: Gradient descent method

L-Lipschitz function

β€’ Lemma 4.1

β€’ If f ∈ C_L, then f(y) βˆ’ f(x) βˆ’ βŸ¨βˆ‡f(x), y βˆ’ x⟩ ≀ (L/2)β€–y βˆ’ xβ€–Β²

β€’ Theorem 4.2

β€’ If f ∈ C_L and f* = min_x f(x) > βˆ’βˆž, then the gradient descent algorithm with fixed step size satisfying Ξ· < 2/L will converge to a stationary point


Page 17: Gradient descent method

Strong Convexity and implications

β€’ Definition

β€’ If there exists a constant m > 0 such that βˆ‡Β²f(x) βͺ° mI for βˆ€x ∈ S, then the function f is strongly convex on S


Page 18: Gradient descent method

Strong Convexity and implications

β€’ Lemma 4.3

β€’ If f is strongly convex on S, we have the following inequality: f(y) β‰₯ f(x) + βŸ¨βˆ‡f(x), y βˆ’ x⟩ + (m/2)β€–y βˆ’ xβ€–Β² for βˆ€x, y ∈ S

β€’ Proof


f(x) βˆ’ f* ≀ (1/(2m))β€–βˆ‡f(x)β€–β‚‚Β²

useful as a stopping criterion (if you know m)

Page 19: Gradient descent method

Strong Convexity and implications


Proof

Page 20: Gradient descent method

Upper Bound of βˆ‡Β²f(x)

β€’ Lemma 4.3 implies that the sublevel sets contained in S are bounded, so in particular S is bounded. Therefore the maximum eigenvalue of βˆ‡Β²f(x) is bounded above on S

β€’ There exists a constant M such that βˆ‡Β²f(x) βͺ― MI for βˆ€x ∈ S

β€’ Lemma 4.4

β€’ For any x, y ∈ S, if βˆ‡Β²f(x) βͺ― MI for all x ∈ S, then f(y) ≀ f(x) + βŸ¨βˆ‡f(x), y βˆ’ x⟩ + (M/2)β€–y βˆ’ xβ€–Β²


Page 21: Gradient descent method

Condition Number

β€’ From Lemma 4.3 and 4.4 we have mI βͺ― βˆ‡Β²f(x) βͺ― MI for βˆ€x ∈ S, with m > 0, M > 0

β€’ The ratio k = M/m is thus an upper bound on the condition number of the matrix βˆ‡Β²f(x)

β€’ When the ratio is close to 1, we call the problem well-conditioned

β€’ When the ratio is much larger than 1, we call it ill-conditioned

β€’ When the ratio is exactly 1, this is the best case: the negative gradient points straight at the minimizer, so a single (exact line search) step reaches the optimal solution (there is no wrong direction)


Page 22: Gradient descent method

Condition Number

β€’ Theorem 4.5

β€’ Gradient descent for a strongly convex function f and step size Ξ· = 1/M will converge as f(x^(k)) βˆ’ f* ≀ c^k (f(x^(0)) βˆ’ f*), where c ≀ 1 βˆ’ m/M

β€’ This rate of convergence c is known as linear convergence

β€’ Since we usually do not know the value of M, we do line search

β€’ For exact line search, c = 1 βˆ’ m/M

β€’ For backtracking line search, c = 1 βˆ’ min{2mΞ±, 2Ξ²Ξ±m/M} < 1
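A small numerical sketch of how the ratio M/m drives this rate; the quadratic objective and the constants below are assumed purely for illustration:

import numpy as np

def observed_contraction(m, M, iters=50):
    # Gradient descent with eta = 1/M on f(x) = 0.5*(m*x1^2 + M*x2^2),
    # whose Hessian has eigenvalues m and M; here f* = 0
    f = lambda z: 0.5 * (m * z[0] ** 2 + M * z[1] ** 2)
    grad = lambda z: np.array([m * z[0], M * z[1]])
    x = np.array([1.0, 1.0])
    vals = [f(x)]
    for _ in range(iters):
        x = x - (1.0 / M) * grad(x)
        vals.append(f(x))
    return vals[-1] / vals[-2]  # per-step contraction of f(x^(k)) - f*

# The observed contraction stays below the bound c = 1 - m/M
print(observed_contraction(1.0, 2.0), "bound:", 1 - 1.0 / 2.0)      # well-conditioned: fast
print(observed_contraction(1.0, 100.0), "bound:", 1 - 1.0 / 100.0)  # ill-conditioned: slow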


Page 23: Gradient descent method

Methods & Examples: Exact Line Search, Backtracking Line Search, Coordinate Descent Method, Steepest Descent Method


Page 24: Gradient descent method

Exact Line Search

β€’ The optimal line search method, in which Ξ· is chosen to minimize f along the ray {x βˆ’ Ξ·βˆ‡f(x) : Ξ· β‰₯ 0}: Ξ· = argmin_{s β‰₯ 0} f(x βˆ’ sβˆ‡f(x))

β€’ Exact line search is used when the cost of the one-dimensional minimization problem is low compared to the cost of computing the search direction itself

β€’ It is not very practical
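For a quadratic objective the exact minimizer along the ray has a closed form, which allows a simple sketch; the matrix Q and vector b below are assumed example data:

import numpy as np

# f(x) = 0.5 * x^T Q x - b^T x, with Q symmetric positive definite
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.zeros(2)
for _ in range(100):
    g = Q @ x - b                  # gradient of f at x
    if np.linalg.norm(g) <= 1e-10:
        break
    # exact line search along -g: eta = argmin_s f(x - s*g) = (g^T g) / (g^T Q g)
    eta = (g @ g) / (g @ Q @ g)
    x = x - eta * g

print(x, np.linalg.solve(Q, b))    # gradient descent result vs. direct solve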

Page 25: Gradient descent method

Exact Line Search

β€’ Convergence Analysis

β€’ f(x^(k)) βˆ’ f* decreases by at least a constant factor in every iteration

β€’ It converges to 0 geometrically fast (linear convergence)


Page 26: Gradient descent method

Backtracking Line Search

β€’ It depends on two constants Ξ±, Ξ² with 0 < Ξ± < 0.5, 0 < Ξ² < 1

β€’ It starts with unit step size and then reduces it by the factor Ξ² until the stopping condition holds: f(x βˆ’ Ξ·βˆ‡f(x)) ≀ f(x) βˆ’ Ξ±Ξ·β€–βˆ‡f(x)β€–β‚‚Β²

β€’ Since βˆ’βˆ‡f(x) is a descent direction and βˆ’β€–βˆ‡f(x)β€–β‚‚Β² < 0, for small enough step size Ξ· we have f(x βˆ’ Ξ·βˆ‡f(x)) β‰ˆ f(x) βˆ’ Ξ·β€–βˆ‡f(x)β€–β‚‚Β² < f(x) βˆ’ Ξ±Ξ·β€–βˆ‡f(x)β€–β‚‚Β²

β€’ This shows that the backtracking line search eventually terminates

β€’ Ξ± is typically chosen between 0.01 and 0.3

β€’ Ξ² is often chosen between 0.1 and 0.8
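A Python sketch of this procedure; the helper names f and grad_f, the example objective, and the starting point are illustrative assumptions:

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.1, beta=0.7):
    # Start at eta = 1 and shrink by beta until
    # f(x - eta*g) <= f(x) - alpha * eta * ||g||_2^2
    g = grad_f(x)
    eta = 1.0
    while f(x - eta * g) > f(x) - alpha * eta * (g @ g):
        eta *= beta
    return eta

# Example usage on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - backtracking_step(f, grad_f, x) * grad_f(x)
print(x)  # close to the minimizer [0, 0]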


Page 27: Gradient descent method

Backtracking Line Search


Page 28: Gradient descent method

Backtracking Line Search

β€’ Convergence Analysis

β€’ Claim: Ξ· ≀ 1/M always satisfies the stopping condition

β€’ Proof


Page 29: Gradient descent method

Backtracking Line Search

β€’ Proof (cont)


Page 30: Gradient descent method

Line search types

β€’ Slide from Optimization Lecture 10 by Boyd


Page 31: Gradient descent method

Line search example

β€’ Slide from Optimization Lecture 10 by Boyd


Page 32: Gradient descent method

Coordinate Descent Method

β€’ Coordinate descent belongs to the class of non-derivative methods used for minimizing differentiable functions

β€’ Here, the cost is minimized along one coordinate direction in each iteration
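A sketch of cyclic coordinate descent for the quadratic f(x) = 0.5*x^T Q x - b^T x, where exact minimization along coordinate i gives a simple closed-form update; Q and b are assumed example data:

import numpy as np

# Symmetric positive definite Q (assumed example)
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.zeros(3)
for sweep in range(100):              # cyclic: visit coordinates 0, 1, 2, 0, 1, ...
    for i in range(len(x)):
        # exact minimization of f over x[i], the other coordinates held fixed
        x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]

print(x, np.linalg.solve(Q, b))       # coordinate descent result vs. direct solve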


Page 33: Gradient descent method

Coordinate Descent Method

β€’ Pros

β€’ It is well suited for parallel computation

β€’ Cons

β€’ It may not reach the local minimum, even for a convex function


Page 34: Gradient descent method

Convergence of Coordinate Descent

β€’ Lemma 5.4


Page 35: Gradient descent method

Coordinate Descent Method

β€’ Method of selecting the coordinate for the next iteration

β€’ Cyclic Coordinate Descent

β€’ Greedy Coordinate Descent

β€’ (Uniform) Random Coordinate Descent


Page 36: Gradient descent method

Steepest Descent Method

β€’ The gradient descent method takes many iterations

β€’ Steepest Descent Method aims at choosing the best direction at each iteration

β€’ Normalized steepest descent direction

β€’ βˆ†x_nsd = argmin{βˆ‡f(x)α΅€v : β€–vβ€– = 1}

β€’ Interpretation: for small v, f(x + v) β‰ˆ f(x) + βˆ‡f(x)α΅€v; the direction βˆ†x_nsd is the unit-norm step with the most negative directional derivative

β€’ Iteratively, the algorithm follows these steps

β€’ Calculate the direction of descent βˆ†x_nsd

β€’ Calculate the step size t

β€’ Update x⁺ = x + tβˆ†x_nsd


Page 37: Gradient descent method

Steepest Descent for various norms

β€’ The choice of norm used for the steepest descent direction can have a dramatic effect on the convergence rate

β€’ β„“2 norm

β€’ The steepest descent direction is βˆ†x_nsd = βˆ’βˆ‡f(x) / β€–βˆ‡f(x)β€–β‚‚

β€’ β„“1 norm

β€’ For β€–xβ€–β‚ = Σᡒ |xα΅’|, a descent direction is βˆ†x_nsd = βˆ’sign(βˆ‚f(x)/βˆ‚x_{i*}) e_{i*}, where i* = argmax_i |βˆ‚f(x)/βˆ‚x_i|

β€’ β„“βˆž norm

β€’ For β€–xβ€–βˆž = max_i |x_i|, a descent direction is βˆ†x_nsd = βˆ’sign(βˆ‡f(x)) (componentwise)

Page 38: Gradient descent method

Steepest Descent for various norms


Figure panels: Quadratic Norm, β„“1-Norm

Page 39: Gradient descent method

Steepest Descent for various norms

β€’ Example


Page 40: Gradient descent method

Steepest Descent Convergence Rate

β€’ Fact: any norm can be bounded in terms of β€–Β·β€–β‚‚, i.e., there exist Ξ³, Ξ³Μƒ ∈ (0, 1] such that β€–xβ€– β‰₯ Ξ³β€–xβ€–β‚‚ and β€–xβ€–* β‰₯ Ξ³Μƒβ€–xβ€–β‚‚

β€’ Theorem 5.5

β€’ If f is strongly convex with constants m and M, and the norm β€–Β·β€– has Ξ³, Ξ³Μƒ as above, then steepest descent with backtracking line search has linear convergence with rate c = 1 βˆ’ 2mΞ±Ξ³Μƒ² min{1, Ξ²Ξ³Β²/M}

β€’ Proof: will be given in Lecture 6


Page 41: Gradient descent method

Summary


Page 42: Gradient descent method

Summary

β€’ Unconstrained Convex Optimization Problem

β€’ Gradient Descent Method

β€’ Step size: trade-off between safety and speed

β€’ Convergence Conditions

β€’ L-Lipschitz Function

β€’ Strong Convexity

β€’ Condition Number


Page 43: Gradient descent method

Summary

β€’ Exact Line Search

β€’ Backtracking Line Search

β€’ Coordinate Descent Method

β€’ Good for parallel computation, but does not always converge

β€’ Steepest Descent Method

β€’ The choice of norm is important



END OF DOCUMENT