L. Vandenberghe ECE236C (Spring 2019)
4. Proximal gradient method
• motivation
• proximal mapping
• proximal gradient method with fixed step size
• proximal gradient method with line search
Proximal gradient method 4.1
Proximal mapping
the proximal mapping (or prox-operator) of a convex function h is defined as
prox_h(x) = argmin_u ( h(u) + (1/2)‖u − x‖₂² )

Examples
• h(x) = 0: prox_h(x) = x
• h(x) is the indicator function of a closed convex set C: prox_h is the projection on C
prox_h(x) = argmin_{u∈C} ‖u − x‖₂² = P_C(x)
• h(x) = ‖x‖₁: prox_h is the “soft-threshold” (shrinkage) operation

(prox_h(x))_i =  x_i − 1   if x_i ≥ 1
                 0         if |x_i| ≤ 1
                 x_i + 1   if x_i ≤ −1
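As an illustration (not part of the original slides), the soft-threshold mapping can be sketched in a few lines of NumPy; the function name `soft_threshold` is ours:

```python
import numpy as np

def soft_threshold(x, t=1.0):
    """Prox of t * ||.||_1: shrink each component of x toward zero by t."""
    # components with |x_i| <= t are set to zero; others move toward zero by t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = soft_threshold(np.array([2.0, 0.5, -3.0]))  # componentwise: 1, 0, -2
```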
Proximal gradient method 4.2
Proximal gradient method
unconstrained optimization with objective split in two components
minimize f (x) = g(x) + h(x)
• g convex, differentiable, dom g = Rn
• h convex with inexpensive prox-operator
Proximal gradient algorithm
x_{k+1} = prox_{t_k h}(x_k − t_k ∇g(x_k))
• tk > 0 is step size, constant or determined by line search
• can start at infeasible x0 (however xk ∈ dom f = dom h for k ≥ 1)
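A minimal sketch of this update in NumPy, assuming the caller supplies ∇g and the prox-operator (the function names and interface below are ours, not from the slides):

```python
import numpy as np

def proximal_gradient(grad_g, prox, x0, t, iters=100):
    """Fixed-step proximal gradient: x_{k+1} = prox_{t h}(x_k - t * grad_g(x_k))."""
    x = x0
    for _ in range(iters):
        x = prox(x - t * grad_g(x), t)
    return x

# toy instance: g(x) = (1/2)||x - a||_2^2 (so L = 1), h(x) = ||x||_1
a = np.array([3.0, 0.2, -2.0])
grad_g = lambda x: x - a
soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)  # prox of t||.||_1
x = proximal_gradient(grad_g, soft, np.zeros(3), t=1.0)          # t = 1/L
```

For this separable toy problem the minimizer is the soft-thresholding of a, and the iteration reaches it immediately since t = 1/L = 1.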
Proximal gradient method 4.3
Interpretation
x+ = prox_{th}(x − t∇g(x))
from definition of proximal mapping:
x+ = argmin_u ( h(u) + (1/2t)‖u − x + t∇g(x)‖₂² )
   = argmin_u ( h(u) + g(x) + ∇g(x)ᵀ(u − x) + (1/2t)‖u − x‖₂² )
x+ minimizes h(u) plus a simple quadratic local model of g(u) around x
Proximal gradient method 4.4
Examples
minimize g(x) + h(x)
Gradient method: special case with h(x) = 0
x+ = x − t∇g(x)
Gradient projection method: special case with h(x) = δC(x) (indicator of C)
x+ = P_C(x − t∇g(x))

[figure: a set C, a point x on its boundary, the gradient step x − t∇g(x) outside C, and its projection x+ back onto C]
Proximal gradient method 4.5
Examples
Soft-thresholding: special case with h(x) = ‖x‖₁

x+ = prox_{th}(x − t∇g(x))

where

(prox_{th}(u))_i =  u_i − t   if u_i ≥ t
                    0         if −t ≤ u_i ≤ t
                    u_i + t   if u_i ≤ −t

[figure: graph of the scalar soft-threshold function (prox_{th}(u))_i versus u_i, flat on the interval [−t, t]]
Proximal gradient method 4.6
Proximal mapping
if h is convex and closed (has a closed epigraph), then
prox_h(x) = argmin_u ( h(u) + (1/2)‖u − x‖₂² )

exists and is unique for all x
• will be studied in more detail in one of the next lectures
• from optimality conditions of minimization in the definition:
u = prox_h(x) ⇐⇒ x − u ∈ ∂h(u)
             ⇐⇒ h(z) ≥ h(u) + (x − u)ᵀ(z − u) for all z
Proximal gradient method 4.7
Projection on closed convex set
the proximal mapping of the indicator function δ_C is the Euclidean projection on C

prox_{δC}(x) = argmin_{u∈C} ‖u − x‖₂² = P_C(x)

u = P_C(x) ⇐⇒ (x − u)ᵀ(z − u) ≤ 0 for all z ∈ C

[figure: a point x outside the convex set C and its projection u = P_C(x)]
we will see that proximal mappings have many properties of projections
Proximal gradient method 4.8
Firm nonexpansiveness
proximal mappings are firmly nonexpansive (co-coercive with constant 1):
(prox_h(x) − prox_h(y))ᵀ(x − y) ≥ ‖prox_h(x) − prox_h(y)‖₂²

• follows from page 4.7: if u = prox_h(x), v = prox_h(y), then

  x − u ∈ ∂h(u),   y − v ∈ ∂h(v)

  combining this with monotonicity of the subdifferential (page 2.9) gives

  (x − u − y + v)ᵀ(u − v) ≥ 0
• a weaker property is nonexpansiveness (Lipschitz continuity with constant 1):

  ‖prox_h(x) − prox_h(y)‖₂ ≤ ‖x − y‖₂
follows from firm nonexpansiveness and Cauchy–Schwarz inequality
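Firm nonexpansiveness can be spot-checked numerically. The sketch below (ours, not from the slides) tests the prox of the 1-norm, i.e. soft-thresholding, on random pairs of points; this is a sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)  # prox of ||.||_1

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    d = prox(x) - prox(y)
    # firm nonexpansiveness: (prox(x) - prox(y))^T (x - y) >= ||prox(x) - prox(y)||_2^2
    assert d @ (x - y) >= d @ d - 1e-12
    # implied ordinary nonexpansiveness
    assert np.linalg.norm(d) <= np.linalg.norm(x - y) + 1e-12
```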
Proximal gradient method 4.9
Assumptions
minimize f (x) = g(x) + h(x)
• h is closed and convex (so that prox_{th} is well defined)
• g is differentiable with dom g = Rn, and L-smooth for the Euclidean norm, i.e.,
  (L/2) xᵀx − g(x) is convex

• there exists a constant m ≥ 0 such that

  g(x) − (m/2) xᵀx is convex
when m > 0 this is m-strong convexity for the Euclidean norm
• the optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)
Proximal gradient method 4.10
Implications of assumptions on g
Lower bound
• convexity of the function g(x) − (m/2) xᵀx implies (page 1.19):

  g(y) ≥ g(x) + ∇g(x)ᵀ(y − x) + (m/2)‖y − x‖₂²  for all x, y    (1)
• if m = 0, this means g is convex; if m > 0, strongly convex (lecture 1)
Upper bound
• convexity of the function (L/2) xᵀx − g(x) implies (page 1.12):

  g(y) ≤ g(x) + ∇g(x)ᵀ(y − x) + (L/2)‖y − x‖₂²  for all x, y    (2)
• this is equivalent to Lipschitz continuity and co-coercivity of gradient (lecture 1)
Proximal gradient method 4.11
Gradient map
G_t(x) = (1/t) (x − prox_{th}(x − t∇g(x)))

G_t(x) is the negative “step” in the proximal gradient update

x+ = prox_{th}(x − t∇g(x)) = x − t G_t(x)

• G_t(x) is not a gradient or subgradient of f = g + h
• from the subgradient definition of the prox-operator (page 4.7),

  G_t(x) ∈ ∇g(x) + ∂h(x − t G_t(x))

• G_t(x) = 0 if and only if x minimizes f (x) = g(x) + h(x)
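The characterization “G_t(x) = 0 if and only if x is a minimizer” can be checked on a toy problem whose minimizer is known in closed form; all names below are ours:

```python
import numpy as np

# toy problem: g(x) = (1/2)||x - a||_2^2, h = ||.||_1; the minimizer of g + h
# is x* = prox_h(a), i.e. soft-thresholding of a with threshold 1
a = np.array([3.0, 0.2, -2.0])
grad_g = lambda x: x - a
prox = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# gradient map G_t(x) = (x - prox_{th}(x - t * grad_g(x))) / t
G = lambda x, t: (x - prox(x - t * grad_g(x), t)) / t

x_star = prox(a, 1.0)                            # = [2, 0, -1]
assert np.linalg.norm(G(x_star, 0.5)) < 1e-12    # G_t vanishes at the minimizer
assert np.linalg.norm(G(np.zeros(3), 0.5)) > 0   # and not at a non-minimizer
```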
Proximal gradient method 4.12
Consequences of quadratic bounds on g
substitute y = x − t G_t(x) in the bounds (1) and (2): for all t,

  (mt²/2)‖G_t(x)‖₂² ≤ g(x − t G_t(x)) − g(x) + t∇g(x)ᵀG_t(x) ≤ (Lt²/2)‖G_t(x)‖₂²

• if 0 < t ≤ 1/L, then the upper bound implies

  g(x − t G_t(x)) ≤ g(x) − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂²    (3)

• if the inequality (3) is satisfied and G_t(x) ≠ 0, then mt ≤ 1
• if the inequality (3) is satisfied, then for all z,

  f (x − t G_t(x)) ≤ f (z) + G_t(x)ᵀ(x − z) − (t/2)‖G_t(x)‖₂² − (m/2)‖x − z‖₂²    (4)
(proof on next page)
Proximal gradient method 4.13
Proof of (4):

  f (x − t G_t(x))
    ≤ g(x) − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂² + h(x − t G_t(x))
    ≤ g(z) − ∇g(x)ᵀ(z − x) − (m/2)‖z − x‖₂² − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂² + h(x − t G_t(x))
    ≤ g(z) − ∇g(x)ᵀ(z − x) − (m/2)‖z − x‖₂² − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂² + h(z) − (G_t(x) − ∇g(x))ᵀ(z − x + t G_t(x))
    = g(z) + h(z) + G_t(x)ᵀ(x − z) − (t/2)‖G_t(x)‖₂² − (m/2)‖x − z‖₂²

• in the first step we add h(x − t G_t(x)) to both sides of the inequality (3)
• in the next step we use the lower bound on g(z) from (1)
• in step 3, we use G_t(x) − ∇g(x) ∈ ∂h(x − t G_t(x)) (see page 4.12)
Proximal gradient method 4.14
Progress in one iteration
for a step size t that satisfies the inequality (3), define

  x+ = x − t G_t(x)

• inequality (4) with z = x shows that the algorithm is a descent method:

  f (x+) ≤ f (x) − (t/2)‖G_t(x)‖₂²

• inequality (4) with z = x⋆ shows that

  f (x+) − f⋆ ≤ G_t(x)ᵀ(x − x⋆) − (t/2)‖G_t(x)‖₂² − (m/2)‖x − x⋆‖₂²
             = (1/2t) ( ‖x − x⋆‖₂² − ‖x − x⋆ − t G_t(x)‖₂² ) − (m/2)‖x − x⋆‖₂²
             = (1/2t) ( (1 − mt)‖x − x⋆‖₂² − ‖x+ − x⋆‖₂² )    (5)
             ≤ (1/2t) ( ‖x − x⋆‖₂² − ‖x+ − x⋆‖₂² )    (6)
Proximal gradient method 4.15
Analysis for fixed step size
add the inequalities (6) with x = x_i, x+ = x_{i+1}, t = t_i = 1/L from i = 0 to i = k − 1:

  ∑_{i=1}^{k} ( f (x_i) − f⋆) ≤ (1/2t) ∑_{i=0}^{k−1} ( ‖x_i − x⋆‖₂² − ‖x_{i+1} − x⋆‖₂² )
                              = (1/2t) ( ‖x_0 − x⋆‖₂² − ‖x_k − x⋆‖₂² )
                              ≤ (1/2t)‖x_0 − x⋆‖₂²

since f (x_i) is nonincreasing,

  f (x_k) − f⋆ ≤ (1/k) ∑_{i=1}^{k} ( f (x_i) − f⋆) ≤ (1/2kt)‖x_0 − x⋆‖₂²
Proximal gradient method 4.16
Distance to optimal set
• from (5) and f (x+) ≥ f⋆, the distance to the optimal set does not increase:

  ‖x+ − x⋆‖₂² ≤ (1 − mt)‖x − x⋆‖₂² ≤ ‖x − x⋆‖₂²

• for fixed step size t_k = 1/L,

  ‖x_k − x⋆‖₂² ≤ c^k ‖x_0 − x⋆‖₂²,   c = 1 − m/L

  i.e., linear convergence if g is strongly convex (m > 0)
Proximal gradient method 4.17
Example: quadratic program with box constraints
minimize (1/2) xᵀAx + bᵀx
subject to 0 ⪯ x ⪯ 1

[figure: relative suboptimality ( f (x_k) − f⋆)/ f⋆ versus k, decreasing roughly linearly on a log scale from 10⁰ to below 10⁻⁶ at k = 50]

n = 3000; fixed step size t = 1/λmax(A)
Proximal gradient method 4.18
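The experiment above can be imitated at a smaller scale. The sketch below (with our own problem size and random data, not the n = 3000 instance of the slide) runs gradient projection on a box-constrained QP and verifies the fixed-point optimality condition x = P_C(x − t∇g(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # (the slide uses n = 3000)
M = rng.normal(size=(n, n))
A = M @ M.T / n + np.eye(n)              # positive definite Hessian
b = rng.normal(size=n)
t = 1.0 / np.linalg.eigvalsh(A).max()    # fixed step size t = 1/lambda_max(A)

x = np.zeros(n)
for _ in range(1000):
    # projection on the box [0, 1]^n is elementwise clipping
    x = np.clip(x - t * (A @ x + b), 0.0, 1.0)

# at a fixed point of the update, x satisfies the optimality condition
assert np.allclose(x, np.clip(x - t * (A @ x + b), 0.0, 1.0), atol=1e-8)
```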
Example: 1-norm regularized least-squares
minimize (1/2)‖Ax − b‖₂² + ‖x‖₁

[figure: relative suboptimality ( f (x_k) − f⋆)/ f⋆ versus k, decreasing roughly linearly on a log scale from 10⁰ to about 10⁻⁵ at k = 100]

randomly generated A ∈ R^{2000×1000}; step t_k = 1/L with L = λmax(AᵀA)
Proximal gradient method 4.19
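A small-scale version of this experiment (sizes and names are ours; the slide's 2000×1000 instance is not reproduced) applies the soft-thresholding iteration, often called ISTA:

```python
import numpy as np

rng = np.random.default_rng(1)
m_rows, n = 40, 20                            # (the slide uses 2000 x 1000)
A = rng.normal(size=(m_rows, n))
b = rng.normal(size=m_rows)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()   # t = 1/L with L = lambda_max(A^T A)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(x))

x = np.zeros(n)
for _ in range(500):
    u = x - t * (A.T @ (A @ x - b))                   # gradient step on g
    x = np.sign(u) * np.maximum(np.abs(u) - t, 0.0)   # prox of t * ||.||_1

# with t = 1/L the method is a descent method, so f(x) <= f(x_0)
assert f(x) <= f(np.zeros(n))
```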
Line search
• the analysis for fixed step size (page 4.13) starts with the inequality
g(x − t G_t(x)) ≤ g(x) − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂²    (3)
this inequality is known to hold for 0 < t ≤ 1/L
• if L is not known, we can satisfy (3) by a backtracking line search:
start at some t := t̂ > 0 and backtrack (t := βt) until (3) holds
• the step size t selected by the line search satisfies t ≥ t_min = min{t̂, β/L}
• requires one evaluation of g and proxth per line search iteration
several other types of line search also work
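A single backtracking step satisfying condition (3) might be sketched as follows; the function name and interface are our assumptions:

```python
import numpy as np

def prox_grad_step_ls(g, grad_g, prox, x, t_hat=1.0, beta=0.5):
    """One proximal gradient step, backtracking until condition (3) holds."""
    t = t_hat
    while True:
        xp = prox(x - t * grad_g(x), t)
        G = (x - xp) / t                       # gradient map G_t(x)
        # condition (3): g(x - t G) <= g(x) - t * grad_g(x)^T G + (t/2) ||G||^2
        if g(xp) <= g(x) - t * (grad_g(x) @ G) + (t / 2) * (G @ G):
            return xp
        t *= beta                              # backtrack: t := beta * t

# toy instance: g(x) = (1/2)||x - a||_2^2 (L = 1, so t_hat = 1 is accepted)
a = np.array([3.0, -2.0])
g = lambda x: 0.5 * np.sum((x - a) ** 2)
grad_g = lambda x: x - a
prox = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
x = prox_grad_step_ls(g, grad_g, prox, np.zeros(2))
```

Each pass through the loop costs one evaluation of g and one prox, matching the count stated above.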
Proximal gradient method 4.20
Example
line search for gradient projection method
x+ = P_C(x − t∇g(x)) = x − t G_t(x)

[figure: a set C, a point x, and the candidate points P_C(x − t̂∇g(x)) and P_C(x − βt̂∇g(x)) produced during backtracking]

backtrack until P_C(x − t∇g(x)) satisfies the “sufficient decrease” inequality (3)
Proximal gradient method 4.21
Analysis with line search
from page 4.15, if (3) holds in iteration i, then f (x_{i+1}) < f (x_i) and

  t_i ( f (x_{i+1}) − f⋆) ≤ (1/2) ( ‖x_i − x⋆‖₂² − ‖x_{i+1} − x⋆‖₂² )

• adding the inequalities for i = 0 to i = k − 1 gives

  ( ∑_{i=0}^{k−1} t_i ) ( f (x_k) − f⋆) ≤ ∑_{i=0}^{k−1} t_i ( f (x_{i+1}) − f⋆) ≤ (1/2)‖x_0 − x⋆‖₂²

  the first inequality holds because f (x_i) is nonincreasing

• since t_i ≥ t_min, we obtain a similar 1/k bound as for a fixed step size:

  f (x_k) − f⋆ ≤ ‖x_0 − x⋆‖₂² / ( 2 ∑_{i=0}^{k−1} t_i ) ≤ ‖x_0 − x⋆‖₂² / (2k t_min)
Proximal gradient method 4.22
Distance to optimal set
from page 4.15, if (3) holds in iteration i, then

  ‖x_{i+1} − x⋆‖₂² ≤ (1 − m t_i)‖x_i − x⋆‖₂² ≤ (1 − m t_min)‖x_i − x⋆‖₂² = c ‖x_i − x⋆‖₂²

hence

  ‖x_k − x⋆‖₂² ≤ c^k ‖x_0 − x⋆‖₂²,   with c = 1 − m t_min = max{1 − βm/L, 1 − m t̂}

i.e., linear convergence if m > 0
Proximal gradient method 4.23
Summary: proximal gradient method
• minimizes sums of differentiable and non-differentiable convex functions
f (x) = g(x) + h(x)
• useful when nondifferentiable term h is simple (has inexpensive prox-operator)
• convergence properties are similar to standard gradient method (h(x) = 0)
• less general but faster than subgradient method
Proximal gradient method 4.24
References
• A. Beck, First-Order Methods in Optimization (2017), §10.4 and §10.6.
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009).
• Yu. Nesterov, Lectures on Convex Optimization (2018), §2.2.3–2.2.4.
• B. T. Polyak, Introduction to Optimization (1987), §7.2.1.
Proximal gradient method 4.25