Gradient descent method
2013.11.10
Sanghyuk Chun
Many of the contents are from:
• Large Scale Optimization Lectures 4 & 5 by Caramanis & Sanghavi
• Convex Optimization Lecture 10 by Boyd & Vandenberghe
• Convex Optimization textbook, Chapter 9, by Boyd & Vandenberghe
1
Contents
• Introduction
• Example Code & Usage
• Convergence Conditions
• Methods & Examples
• Summary
2
Introduction: Unconstrained minimization problem, Description, Pros and Cons
3
Unconstrained minimization problems
• Recall: Constrained minimization problems
  • From Lecture 1, the formulation of a general constrained convex optimization problem is as follows:
  • min f(x) s.t. x ∈ X
  • where f: X → R is convex and smooth
• From Lecture 1, the formulation of an unconstrained optimization problem is as follows:
  • min f(x)
  • where f: R^n → R is convex and smooth
• In this problem, the necessary and sufficient condition for an optimal solution x0 is
  • ∇f(x) = 0 at x = x0 (a small worked example is given below)
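• Worked example (illustrative, not from the original slides): for f(x) = x² − 4x + 5, we have ∇f(x) = 2x − 4, and ∇f(x) = 0 gives the unique optimum x0 = 2 with f(x0) = 1; since f is convex, this condition is both necessary and sufficient.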
4
Unconstrained minimization problems
• Minimize f(x)
  • When f is differentiable and convex, a necessary and sufficient condition for a point x* to be optimal is ∇f(x*) = 0
• Minimizing f(x) is the same as finding a solution of ∇f(x*) = 0
  • Min f(x): analytically solving the optimality equation
  • ∇f(x*) = 0: usually solved by an iterative algorithm
5
Description of Gradient Descent Method
• The idea relies on the fact that −∇f(x^(k)) is a descent direction
• x^(k+1) = x^(k) − η_k ∇f(x^(k)) with f(x^(k+1)) < f(x^(k))
• Δx^(k) is the step, or search direction
• η_k is the step size, or step length
• Too small η_k will cause slow convergence
• Too large η_k could overshoot the minimum and diverge
6
Description of Gradient Descent Method
• Algorithm (Gradient Descent Method)
  • given a starting point x ∈ dom f
  • repeat
    1. Δx := −∇f(x)
    2. Line search: choose step size η via exact or backtracking line search
    3. Update: x := x + ηΔx
  • until the stopping criterion is satisfied
• Stopping criterion is usually ‖∇f(x)‖₂ ≤ ε
• Very simple, but often very slow; rarely used in practice (a minimal sketch is given below)
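A minimal sketch of the algorithm above in Python (illustrative only; the fixed step size, tolerance, and example function are assumptions, not part of the original slides):

    import numpy as np

    def gradient_descent(grad_f, x0, step_size=0.1, eps=1e-6, max_iter=10000):
        # Plain gradient descent; a fixed step size stands in for the line search step
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)                    # descent direction is -grad f(x)
            if np.linalg.norm(g) <= eps:     # stopping criterion ||grad f(x)||_2 <= eps
                break
            x = x - step_size * g            # update x := x - eta * grad f(x)
        return x

    # Example: minimize f(x, y) = (x - 1)^2 + 2*(y + 3)^2, optimum at (1, -3)
    grad = lambda v: np.array([2.0 * (v[0] - 1.0), 4.0 * (v[1] + 3.0)])
    print(gradient_descent(grad, x0=[0.0, 0.0]))   # approx. [1., -3.]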
7
Pros and Cons
• Pros
  • Can be applied in any dimension and any space (even infinite-dimensional spaces)
  • Easy to implement
• Cons
  • Local optima problem
  • Relatively slow close to the minimum
  • For non-differentiable functions, gradient methods are ill-defined
8
Example Code & Usage: Example Code, Usage, Questions
9
Gradient Descent Example Code
• http://mirlab.org/jang/matlab/toolbox/machineLearning/
10
Usage of Gradient Descent Method
• Linear Regression
  • Find the minimum of the loss function to choose the best hypothesis
11
Example of a loss function: (predicted value − observed value)²
Find the hypothesis (function) which minimizes the loss function (a small sketch is given below)
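As an illustration of this usage, gradient descent on the squared loss of a linear hypothesis might look like the sketch below (the synthetic data, learning rate, and iteration count are assumptions, not taken from the slides):

    import numpy as np

    # Synthetic data: y = 3*x + 2 plus a little noise
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=100)
    y = 3.0 * x + 2.0 + 0.01 * rng.standard_normal(100)

    w, b = 0.0, 0.0           # hypothesis h(x) = w*x + b
    eta = 0.1                 # step size
    for _ in range(2000):
        err = (w * x + b) - y             # predicted - observed
        grad_w = 2.0 * np.mean(err * x)   # d/dw of the mean squared loss
        grad_b = 2.0 * np.mean(err)       # d/db of the mean squared loss
        w -= eta * grad_w
        b -= eta * grad_b

    print(w, b)   # close to 3 and 2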
Usage of Gradient Descent Method
• Neural Networks
  • Back-propagation
• SVM (Support Vector Machine)
• Graphical models
• Least Mean Squares filter
…and many other applications!
12
Questions
• Does the Gradient Descent Method always converge?
• If not, what is the condition for convergence?
• How can we make the Gradient Descent Method faster?
• What is a proper value for the step size η_k?
13
Convergence Conditions: L-Lipschitz function, Strong Convexity, Condition number
14
L-Lipschitz function
• Definition
  • A function f: R^n → R is called L-Lipschitz if and only if
    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y ∈ R^n
  • We denote this condition by f ∈ C_L, where C_L is the class of L-Lipschitz functions (a small worked example is given below)
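• Worked example (illustrative, not from the original slides): for the quadratic f(x) = ½ xᵀQx with symmetric Q, ∇f(x) = Qx, so ‖∇f(x) − ∇f(y)‖₂ = ‖Q(x − y)‖₂ ≤ λ_max(Q)‖x − y‖₂; hence f ∈ C_L with L = λ_max(Q), and by Theorem 4.2 below any fixed step size η < 2/λ_max(Q) converges.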
15
L-Lipschitz function
• Lemma 4.1
  • If f ∈ C_L, then f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖y − x‖²
• Theorem 4.2
  • If f ∈ C_L and f* = min_x f(x) > −∞, then the gradient descent algorithm with fixed step size satisfying η < 2/L will converge to a stationary point
16
Strong Convexity and implications
• Definition
  • If there exists a constant m > 0 such that ∇²f(x) ⪰ mI for ∀x ∈ S, then the function f(x) is strongly convex on S
17
Strong Convexity and implications
• Lemma 4.3
  • If f is strongly convex on S, we have the following inequality:
  • f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (m/2)‖y − x‖² for ∀x, y ∈ S
• Proof
18
• Minimizing the right-hand side of Lemma 4.3 over y gives p* ≥ f(x) − (1/(2m))‖∇f(x)‖₂², which is useful as a stopping criterion (if you know m)
Strong Convexity and implications
19
Proof
Upper Bound of ∇²f(x)
• Lemma 4.3 implies that the sublevel sets contained in S are bounded, so in particular S is bounded. Therefore the maximum eigenvalue of ∇²f(x) is bounded above on S
  • There exists a constant M such that ∇²f(x) ⪯ MI for ∀x ∈ S
• Lemma 4.4
  • For any x, y ∈ S, if ∇²f(x) ⪯ MI for all x ∈ S, then
    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M/2)‖y − x‖²
20
Condition Number
• From Lemma 4.3 and 4.4 we have mI ⪯ ∇²f(x) ⪯ MI for ∀x ∈ S, with m > 0, M > 0
• The ratio κ = M/m is thus an upper bound on the condition number of the matrix ∇²f(x)
• When the ratio is close to 1, we call the problem well-conditioned
• When the ratio is much larger than 1, we call it ill-conditioned
• When the ratio is exactly 1, it is the best case: a single step leads to the optimal solution (there is no wrong direction)
21
Condition Number
• Theorem 4.5
  • Gradient descent for a strongly convex function f with step size η = 1/M will converge as
  • f(x^(k)) − f* ≤ c^k (f(x^(0)) − f*), where c ≤ 1 − m/M
• The rate of convergence c is known as linear convergence
• Since we usually do not know the value of M, we do line search
  • For exact line search, c = 1 − m/M
  • For backtracking line search, c = 1 − min{2mα, 2βαm/M} < 1
(a small numerical illustration is given below)
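A small numerical illustration of this linear rate (the quadratic test function and the fixed step η = 1/M are assumptions chosen to match Theorem 4.5):

    import numpy as np

    # Quadratic f(x) = 0.5 * x^T diag(m, M) x, so grad f(x) = diag(m, M) x and f* = 0
    m, M = 1.0, 100.0                  # condition number M/m = 100 (ill-conditioned)
    H = np.diag([m, M])
    f = lambda x: 0.5 * x @ H @ x

    x = np.array([1.0, 1.0])
    eta = 1.0 / M                      # step size from Theorem 4.5
    gaps = [f(x)]
    for _ in range(50):
        x = x - eta * (H @ x)
        gaps.append(f(x))

    # Each per-iteration ratio stays below c = 1 - m/M = 0.99, i.e., linear convergence
    print([round(gaps[i + 1] / gaps[i], 4) for i in range(5)])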
22
Methods & Examples: Exact Line Search, Backtracking Line Search, Coordinate Descent Method, Steepest Descent Method
23
Exact Line Search
• The optimal line search method, in which η is chosen to minimize f along the ray {x − η∇f(x) | η ≥ 0}: η = argmin_{s ≥ 0} f(x − s∇f(x))
• Exact line search is used when the cost of the one-variable minimization is low compared to the cost of computing the search direction itself
• It is not very practical (a worked sketch for a quadratic is given below)
24
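One case where exact line search is cheap is a quadratic objective, where the minimizing step size has a closed form. A hedged sketch (the test matrix, vector, and starting point are assumptions):

    import numpy as np

    # Quadratic objective f(x) = 0.5 * x^T Q x - b^T x with Q symmetric positive definite
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    x = np.zeros(2)
    for _ in range(30):
        d = b - Q @ x                  # descent direction -grad f(x)
        if np.linalg.norm(d) < 1e-10:
            break
        eta = (d @ d) / (d @ Q @ d)    # exact line search: argmin_eta f(x + eta*d)
        x = x + eta * d

    print(x, np.linalg.solve(Q, b))    # both are approx. the minimizer Q^{-1} b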
Exact Line Search
• Convergence Analysis
  • f(x^(k)) − f* decreases by at least a constant factor in every iteration
  • Converging to 0 geometrically fast (linear convergence)
25
Backtracking Line Search
• It depends on two constants α, β with 0 < α < 0.5, 0 < β < 1
• It starts with unit step size and then reduces it by the factor β until the stopping condition
  f(x − η∇f(x)) ≤ f(x) − αη‖∇f(x)‖²
• Since −∇f(x) is a descent direction and −‖∇f(x)‖² < 0, for a small enough step size η we have
  f(x − η∇f(x)) ≈ f(x) − η‖∇f(x)‖² < f(x) − αη‖∇f(x)‖²
• This shows that the backtracking line search eventually terminates
• α is typically chosen between 0.01 and 0.3
• β is often chosen to be between 0.1 and 0.8
(a minimal sketch is given below)
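A minimal Python sketch of the backtracking rule above (the test function and the particular α, β values are illustrative choices within the recommended ranges):

    import numpy as np

    def backtracking_step(f, grad_f, x, alpha=0.1, beta=0.5):
        # Return a step size eta found by backtracking line search
        g = grad_f(x)
        eta = 1.0                                    # start with unit step size
        # shrink by beta until f(x - eta*g) <= f(x) - alpha*eta*||g||^2
        while f(x - eta * g) > f(x) - alpha * eta * (g @ g):
            eta *= beta
        return eta

    # Example objective: f(x) = x1^2 + 10*x2^2
    f = lambda v: v[0] ** 2 + 10.0 * v[1] ** 2
    grad = lambda v: np.array([2.0 * v[0], 20.0 * v[1]])

    x = np.array([5.0, 5.0])
    for _ in range(100):
        x = x - backtracking_step(f, grad, x) * grad(x)
    print(x)   # approx. [0, 0]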
26
Backtracking Line Search
27
Backtracking Line Search
• Convergence Analysis
• Claim: η ≤ 1/M always satisfies the stopping condition
• Proof
28
Backtracking Line Search
• Proof (cont.)
29
Line search types
• Slide from Optimization Lecture 10 by Boyd
30
Line search example
• Slide from Optimization Lecture 10 by Boyd
31
Coordinate Descent Method
• Coordinate descent belongs to the class of derivative-free methods for minimizing differentiable functions
• Here, the cost is minimized along one coordinate direction in each iteration
32
Coordinate Descent Method
• Pros
  • It is well suited for parallel computation
• Cons
  • May not reach the minimum even for convex functions
33
Convergence of Coordinate Descent
• Lemma 5.4
34
Coordinate Descent Method
• Methods of selecting the coordinate for the next iteration (a small sketch of the cyclic variant is given below)
  • Cyclic Coordinate Descent
  • Greedy Coordinate Descent
  • (Uniform) Random Coordinate Descent
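A hedged sketch of cyclic coordinate descent on a smooth quadratic (the objective and the closed-form coordinate update are assumptions for illustration; greedy or random selection would only change how the index i is chosen):

    import numpy as np

    # f(x) = 0.5 * x^T Q x - b^T x; minimizing over a single coordinate has a closed form
    Q = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])

    x = np.zeros(3)
    for sweep in range(100):
        for i in range(3):            # cyclic: visit the coordinates in order
            # set df/dx_i = (Q x)_i - b_i to zero, solving for x_i alone
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]

    print(x, np.linalg.solve(Q, b))   # both are approx. the minimizer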
35
Steepest Descent Method
• The gradient descent method takes many iterations
• The Steepest Descent Method aims at choosing the best direction at each iteration
• Normalized steepest descent direction
  • Δx_nsd = argmin{∇f(x)ᵀv | ‖v‖ = 1}
  • Interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)ᵀv; the direction Δx_nsd is the unit-norm step with the most negative directional derivative
• Iteratively, the algorithm follows these steps
  • Calculate the direction of descent Δx_nsd
  • Calculate the step size t
  • x⁺ = x + t Δx_nsd
36
Steepest Descent for various norms
• The choice of the norm used for the steepest descent direction can have a dramatic effect on the convergence rate
• ℓ2 norm
  • The steepest descent direction is as follows:
  • Δx_nsd = −∇f(x) / ‖∇f(x)‖₂
• ℓ1 norm
  • For ‖x‖₁ = Σᵢ |xᵢ|, a descent direction is as follows:
  • Δx_nsd = −sign(∂f(x)/∂x_i*)·e_i*, where e_i* is the i*-th standard basis vector and i* = argmaxᵢ |∂f(x)/∂xᵢ|
• ℓ∞ norm
  • For ‖x‖∞ = maxᵢ |xᵢ|, a descent direction is as follows:
  • Δx_nsd = −sign(∇f(x))
(a small sketch comparing these directions is given below)
37
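A small sketch that computes these normalized directions for a given gradient (illustrative only; the gradient vector is an assumption):

    import numpy as np

    g = np.array([3.0, -0.5, 1.0])        # stands in for grad f(x)

    # l2 norm: move along the negative gradient, normalized
    dx_l2 = -g / np.linalg.norm(g)

    # l1 norm: move along the single coordinate with the largest |partial derivative|
    i_star = np.argmax(np.abs(g))
    dx_l1 = np.zeros_like(g)
    dx_l1[i_star] = -np.sign(g[i_star])

    # l-infinity norm: move against the sign of every gradient component
    dx_linf = -np.sign(g)

    print(dx_l2, dx_l1, dx_linf)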
Steepest Descent for various norms
38
(figure: steepest descent directions for a quadratic norm and the ℓ1-norm)
Steepest Descent for various norms
• Example
39
Steepest Descent Convergence Rate
• Fact: any norm can be bounded by ‖·‖₂, i.e., there exist γ, γ̃ ∈ (0, 1] such that ‖x‖ ≥ γ‖x‖₂ and ‖x‖* ≥ γ̃‖x‖₂
• Theorem 5.5
  • If f is strongly convex with constants m and M, and ‖·‖ has γ, γ̃ as above, then steepest descent with backtracking line search has linear convergence with rate
  • c = 1 − 2mα γ̃² min{1, βγ/M} < 1
• Proof: will be proved in Lecture 6
40
Summary
41
Summary
• Unconstrained Convex Optimization Problem
• Gradient Descent Method
  • Step size: trade-off between safety and speed
• Convergence Conditions
  • L-Lipschitz Function
  • Strong Convexity
  • Condition Number
42
Summary
• Exact Line Search
• Backtracking Line Search
• Coordinate Descent Method
  • Good for parallel computation, but does not always converge
• Steepest Descent Method
  • The choice of norm is important
43
44
END OF DOCUMENT