Sparse Optimization
Lecture: Dual Methods, Part I
Instructor: Wotao Yin
July 2013
online discussions on piazza.com
Those who complete this lecture will know
• dual (sub)gradient iteration
• augmented ℓ1 iteration (linearized Bregman iteration)
• dual smoothing minimization
• augmented Lagrangian iteration
• Bregman iteration and addback iteration
Review
Last two lectures
• studied explicit and implicit (proximal) gradient updates
• derived the Lagrange dual problem
• overviewed the following dual methods
1. dual (sub)gradient method (a.k.a. Uzawa’s method)
2. dual proximal method (a.k.a., augmented Lagrangian method (ALM))
3. operator splitting methods applied to the dual of
min_{x,z} {f(x) + g(z) : Ax + Bz = b}
The operator splitting methods studied include
• forward-backward splitting
• Peaceman-Rachford splitting
• Douglas-Rachford splitting (giving rise to ADM or ADMM)
This lecture studies these dual methods in more detail and presents their
applications to sparse optimization models.
About sparse optimization
During the lecture, keep in mind that sparse optimization models typically have
one or more nonsmooth yet simple part(s) and one or more smooth parts with
dense data.
Common preferences:
• for the nonsmooth and simple part, a proximal operation is preferred over
subgradient descent
• if smoothing is applied, exact smoothing is preferred over inexact
smoothing
• for the smooth part with dense data, a simple (gradient) operator is preferred
over more complicated operators
• when applying divide-and-conquer, fewer divisions and simpler
subproblems are preferred
In general, the dual methods appear to be more versatile than the primal-only
methods (e.g., (sub)gradient and prox-linear methods)
Dual (sub)gradient ascent
Primal problem:
min_x f(x),  s.t. Ax = b.
Lagrangian relaxation:
L(x; y) = f(x) + y^T(Ax − b)
Lagrangian dual problem:
min_y d(y)   or   max_y −d(y)
If d is differentiable, you can apply
y^{k+1} ← y^k − c_k ∇d(y^k);
otherwise, apply
y^{k+1} ← y^k − c_k g,  where g ∈ ∂d(y^k).
Dual (sub)gradient ascent
Derive ∇d or ∂d
• by hand, or
• using L(x; y^k): compute x^k ← argmin_x L(x; y^k); then b − Ax^k ∈ ∂d(y^k).
Iteration:
x^{k+1} ← argmin_x L(x; y^k),
y^{k+1} ← y^k + c_k(Ax^{k+1} − b).
Application: augmented ℓ1 minimization, a.k.a. linearized Bregman.
Augmented ℓ1 minimization
Augment ℓ1 by ℓ2²:
(L1+LS)   min ‖x‖_1 + (1/2α)‖x‖_2^2   s.t. Ax = b
• primal objective becomes strongly convex (but still non-differentiable)
• hence, its dual is unconstrained and differentiable
• a sufficiently large but finite α leads to the exact ℓ1 solution
• related to: linearized Bregman algorithm, the elastic net model
• a test with Gaussian A and sparse x with Gaussian entries:
(Figure: solutions of min{‖x‖_1 : Ax = b}, min{‖x‖_2^2 : Ax = b}, and min{‖x‖_1 + (1/25)‖x‖_2^2 : Ax = b}; the last is exactly the same as the ℓ1 solution.)
Lagrangian dual of (L1+LS)
Theorem (Convex Analysis, Rockafellar [1970])
If a convex program has a strictly convex objective, it has a unique solution and
its Lagrangian dual program is differentiable.
Lagrangian (separable in x for fixed y):
L(x; y) = ‖x‖_1 + (1/2α)‖x‖_2^2 − y^T(Ax − b)
Lagrange dual problem:
min_y d(y) = −b^T y + (α/2)‖A^T y − Proj_{[−1,1]^n}(A^T y)‖_2^2
note: shrink(x, γ) = max{|x| − γ, 0} sign(x) = x − Proj_{[−γ,γ]}(x).
Objective gradient:
∇d(y) = −b + αA shrink(A^T y)
Dual gradient iteration:
x^{k+1} = α shrink(A^T y^k),
y^{k+1} = y^k + c_k(b − Ax^{k+1}).
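A minimal NumPy sketch of this dual gradient iteration (the linearized Bregman iteration). The random test problem, the iteration count, and the constant step c ≤ 1/(α‖A‖_2^2) (a safe choice since ∇d is α‖A‖_2^2-Lipschitz) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def shrink(z, gamma=1.0):
    # soft-thresholding: max(|z| - gamma, 0) * sign(z)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def augmented_l1_dual_gradient(A, b, alpha, c, num_iters=20000):
    """Dual gradient ascent for min ||x||_1 + 1/(2*alpha)*||x||_2^2  s.t.  Ax = b."""
    x = np.zeros(A.shape[1])
    y = np.zeros(A.shape[0])
    for _ in range(num_iters):
        x = alpha * shrink(A.T @ y)      # x^{k+1} = alpha * shrink(A^T y^k)
        y = y + c * (b - A @ x)          # y^{k+1} = y^k + c * (b - A x^{k+1})
    return x, y

# illustrative test: Gaussian A, sparse x0 with Gaussian entries
rng = np.random.default_rng(0)
m, n, k = 40, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x0
alpha = 10 * np.max(np.abs(x0))
c = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)
x, _ = augmented_l1_dual_gradient(A, b, alpha, c)
print(np.linalg.norm(x - x0) / np.linalg.norm(x0))   # relative recovery error
```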
How to choose α
• Exact smoothing: ∃ a finite α0 such that all α > α0 lead to the ℓ1 solution
• In practice, α = 10‖x_sol‖_∞ suffices¹, with recovery guarantees under RIP,
NSP, and other conditions.
• Although all α > α0 lead to the same and unique primal solution x∗, the dual
solution set Y∗ contains multiple solutions and depends on α.
• Dynamically adjusting α may not be a good idea for the dual algorithms.
¹ Lai and Yin [2012]
Exact regularization
Theorem (Friedlander and Tseng [2007], Yin [2010])
There exists a finite α0 > 0 such that whenever α > α0, the solution to
(L1+LS)   min ‖x‖_1 + (1/2α)‖x‖_2^2   s.t. Ax = b
is also a solution to
(L1)   min ‖x‖_1   s.t. Ax = b.
(Figure: two panels illustrating the L1 and L1+LS solutions.)
L1+LS in compressive sensing
If in some scenario, L1 gives exact or stable recovery provided that
#measurements m ≥ C · F(signal dim n, signal sparsity k),
then, after adding (1/2α)‖x‖_2^2, the condition becomes
#measurements m ≥ (C + O(1/2α)) · F(signal dim n, signal sparsity k).
Theorem (exact recovery, Lai and Yin [2012])
Under the assumptions
1. x0 is k-sparse, and A satisfies RIP with δ_{2k} ≤ 0.4404, and
2. α ≥ 10‖x0‖_∞,
(L1+LS) uniquely recovers x0.
The bound on δ_{2k} is tighter than that for ℓ1, and it depends on the bound on α.
Stable recovery
For approximately sparse signals and/or noisy measurements, consider:
min_x { ‖x‖_1 + (1/2α)‖x‖_2^2 : ‖Ax − b‖_2 ≤ σ }     (1)
Theorem (stable recovery, Lai and Yin [2012])
Let x0 be an arbitrary vector, S = {largest k entries of x0}, and Z = S^C. Let
b := Ax0 + n, where n is arbitrary noise. If A satisfies RIP with δ_{2k} ≤ 0.3814
and α ≥ 10‖x0‖_∞, then the solution x∗ of (1) with σ = ‖n‖_2 satisfies
‖x∗ − x0‖_2 ≤ C1 · ‖n‖_2 + C2 · ‖(x0)_Z‖_1 / √k,
where C1 and C2 are constants depending on δ_{2k}.
Implementation
• Since the dual is C¹ and unconstrained, various first-order techniques apply
• accelerated gradient descent²
• Barzilai-Borwein step size³ / (non-monotone) line search⁴
• quasi-Newton methods (with some caution since the dual is not C²)
• Fits the dual-decomposition framework, easy to parallelize (later lecture)
• Results generalize to ℓ1-like functions.
• Matlab codes and demos at
www.caam.rice.edu/~optimization/linearized_bregman/
² Nesterov [1983]
³ Barzilai and Borwein [1988]
⁴ Zhang and Hager [2004]
Review: convex conjugate
Recall the convex conjugate (the Legendre transform):
f∗(y) = sup_{x∈dom f} {y^T x − f(x)}
• f∗ is convex since it is a point-wise maximum of linear functions
• if f is proper, closed, and convex, then (f∗)∗ = f, i.e.,
f(x) = sup_{y∈dom f∗} {y^T x − f∗(y)}.
Examples:
• f(x) = ι_C(x), the indicator function, and f∗(y) = sup_{x∈C} y^T x, the support
function
• f(x) = ι_{−1≤x≤1} and f∗(y) = ‖y‖_1
• f(x) = ι_{‖x‖_2≤1} and f∗(y) = ‖y‖_2
• lots of smooth examples ......
Review: convex conjugate
One can introduce an alternative representation via convex conjugacy:
f(x) = sup_{y∈dom h∗} {y^T(Ax + b) − h∗(y)} = h(Ax + b).
Example:
• let C = {y = [y1; y2] : y1 + y2 = 1, y1, y2 ≥ 0} and A = [1; −1].
Since ‖x‖_1 = sup_y {(y1 − y2)^T x − ι_C(y)} = sup_y {y^T Ax − ι_C(y)}, we have
‖x‖_1 = ι∗_C(Ax),
where ι∗_C([x1; x2]) = max{x1, x2} entry-wise.
Dual smoothing
Idea: strongly convexify h∗ =⇒ f becomes differentiable and has Lipschitz ∇f
Represent f using h∗:
f(x) = sup_{y∈dom h∗} {y^T(Ax + b) − h∗(y)}
Strongly convexify h∗ by adding a strongly convex function d:
h∗_μ(y) := h∗(y) + μ d(y)
Obtain a differentiable approximation:
fμ(x) = sup_{y∈dom h∗} {y^T(Ax + b) − h∗_μ(y)}
fμ(x) is differentiable since h∗(y) + μ d(y) is strongly convex.
Example: augmented ℓ1
• primal problem: min{‖x‖_1 : Ax = b}
• dual problem: max{b^T y − ι_{[−1,1]^n}(A^T y)}
• f(y) = ι_{[−1,1]^n}(y) is non-differentiable
• let f∗(x) = ‖x‖_1 and represent f(y) = sup_x {y^T x − f∗(x)}
• add (μ/2)‖x‖_2^2 to f∗(x) and obtain
fμ(y) = sup_x {y^T x − (‖x‖_1 + (μ/2)‖x‖_2^2)} = (1/2μ)‖y − Proj_{[−1,1]^n}(y)‖_2^2
• fμ(y) is differentiable; ∇fμ(y) = (1/μ) shrink(y) (checked numerically in the sketch below).
• On the other hand, we can also smooth f∗(x) = ‖x‖_1 and obtain a differentiable
f∗_μ(x) by adding d(y) to f(y). (see the next slide ...)
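A small NumPy check of the two formulas above for fμ(y) and ∇fμ(y); the finite-difference test is purely illustrative.

```python
import numpy as np

def shrink(z, gamma=1.0):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def f_mu(y, mu):
    # f_mu(y) = 1/(2 mu) * || y - Proj_{[-1,1]^n}(y) ||_2^2
    return np.sum((y - np.clip(y, -1.0, 1.0)) ** 2) / (2.0 * mu)

def grad_f_mu(y, mu):
    # grad f_mu(y) = (1/mu) * shrink(y)
    return shrink(y) / mu

# finite-difference sanity check at a random point
rng = np.random.default_rng(1)
y, mu, eps = 3 * rng.standard_normal(8), 0.5, 1e-6
g_num = np.array([(f_mu(y + eps * e, mu) - f_mu(y - eps * e, mu)) / (2 * eps)
                  for e in np.eye(8)])
print(np.max(np.abs(g_num - grad_f_mu(y, mu))))   # should be tiny
```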
Example: smoothed absolute value
• Recall
f∗(x) = |x| = sup_y {yx − ι_{[−1,1]}(y)}.
Let d(y) = y²/2:
f∗_μ(x) = sup_y {yx − (ι_{[−1,1]}(y) + μy²/2)} = { x²/(2μ) if |x| ≤ μ;  |x| − μ/2 if |x| > μ },
which is the Huber function.
• Let d(y) = 1 − √(1 − y²):
f∗_μ(x) = sup_y {yx − (ι_{[−1,1]}(y) − μ√(1 − y²))} − μ = √(x² + μ²) − μ.
• Recall
|x| = sup_y {(y1 − y2)x − ι_C(y)}
for C = {y : y1 + y2 = 1, y1, y2 ≥ 0}. Let d(y) = y1 log y1 + y2 log y2 + log 2:
f∗_μ(x) = sup_y {(y1 − y2)x − (ι_C(y) + μd(y))} = μ log((e^{x/μ} + e^{−x/μ})/2).
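A minimal NumPy sketch evaluating the three closed forms above (Huber, the square-root smoothing, and the soft-max smoothing) and comparing them with |x| for a small μ; it only illustrates the formulas.

```python
import numpy as np

def huber(x, mu):
    # d(y) = y^2/2  ->  x^2/(2 mu) for |x| <= mu, |x| - mu/2 otherwise
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def sqrt_smooth(x, mu):
    # d(y) = 1 - sqrt(1 - y^2)  ->  sqrt(x^2 + mu^2) - mu
    return np.sqrt(x**2 + mu**2) - mu

def softmax_smooth(x, mu):
    # d(y) = y1 log y1 + y2 log y2 + log 2  ->  mu * log((e^{x/mu} + e^{-x/mu}) / 2)
    return mu * (np.logaddexp(x / mu, -x / mu) - np.log(2))

x = np.linspace(-2, 2, 9)
for f in (huber, sqrt_smooth, softmax_smooth):
    print(f.__name__, np.max(np.abs(f(x, 0.1) - np.abs(x))))  # all close to |x| for small mu
```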
Compare three smoothed functions
(Figure comparing the three smoothings of |x|; courtesy of L. Vandenberghe.)
Example: smoothed maximum eigenvalue
Let C = {Y ∈ S^n : tr Y = 1, Y ⪰ 0} and let X ∈ S^n:
f(X) = λmax(X) = sup_Y {Y • X − ι_C(Y)}
Negative entropy of {λ_i(Y)}:
d(Y) = Σ_{i=1}^n λ_i(Y) log λ_i(Y) + log n
(Courtesy of L. Vandenberghe)
Smoothed function:
fμ(X) = sup_Y {Y • X − (ι_C(Y) + μ d(Y))} = μ log(Σ_{i=1}^n e^{λ_i(X)/μ}) − μ log n
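A short NumPy sketch of the closed form above, compared against λmax(X) on a random symmetric matrix (illustrative only; the stable log-sum-exp is written out by hand).

```python
import numpy as np

def smoothed_lambda_max(X, mu):
    # f_mu(X) = mu * log( sum_i exp(lambda_i(X)/mu) ) - mu * log(n)
    lam = np.linalg.eigvalsh(X)
    n = lam.size
    m = lam.max()                                   # shift for a stable log-sum-exp
    return m + mu * np.log(np.sum(np.exp((lam - m) / mu))) - mu * np.log(n)

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
X = (B + B.T) / 2
for mu in (1.0, 0.1, 0.01):
    print(mu, smoothed_lambda_max(X, mu), np.linalg.eigvalsh(X).max())
```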
Application: smoothed minimization⁵
Instead of solving
min f(x),
solve
min fμ(x) = sup_{y∈dom h∗} {y^T(Ax + b) − [h∗(y) + μ d(y)]}
by gradient descent, with acceleration, line search, etc.
The gradient is given by
∇fμ(x) = A^T y, where y = argmax_{y∈dom h∗} {y^T(Ax + b) − [h∗(y) + μ d(y)]}.
If d(y) is strongly convex with modulus ν > 0, then
• h∗(y) + μ d(y) is strongly convex with modulus at least μν
• ∇fμ(x) is Lipschitz continuous with constant no more than ‖A‖²/(μν).
Error control by bounding |fμ(x)− f(x)| or ‖x∗μ − x∗‖.
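A concrete sketch of this scheme for f(x) = ‖Ax − b‖_1, so h = ‖·‖_1 and h∗ = ι_{[−1,1]^m}. With d(y) = ½‖y‖_2^2, fμ is the componentwise Huber function of Ax − b, the maximizer is y = Proj_{[−1,1]^m}((Ax − b)/μ), and ∇fμ is (‖A‖_2^2/μ)-Lipschitz, so plain gradient descent with step μ/‖A‖_2^2 applies. The test problem and iteration count below are placeholders.

```python
import numpy as np

def smoothed_l1_residual_descent(A, b, mu, num_iters=2000):
    """Gradient descent on f_mu(x) = sum_i huber_mu((Ax - b)_i), a smoothing of ||Ax - b||_1."""
    step = mu / np.linalg.norm(A, 2) ** 2           # 1 / Lipschitz constant of grad f_mu
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        y = np.clip((A @ x - b) / mu, -1.0, 1.0)    # maximizer in the sup defining f_mu
        x = x - step * (A.T @ y)                    # grad f_mu(x) = A^T y
    return x

# usage sketch: an overdetermined, robust-regression-style problem
rng = np.random.default_rng(3)
A = rng.standard_normal((200, 20))
b = A @ rng.standard_normal(20) + 0.01 * rng.standard_normal(200)
x = smoothed_l1_residual_descent(A, b, mu=1e-2)
print(np.sum(np.abs(A @ x - b)))                    # smoothed-l1 fit of the residual
```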
⁵ Nesterov [2005]
Augmented Lagrangian (a.k.a. method of multipliers)
Augment L(x; y^k) = f(x) − (y^k)^T(Ax − b) by adding (c/2)‖Ax − b‖_2^2.
Augmented Lagrangian:
L_A(x; y^k) = f(x) − (y^k)^T(Ax − b) + (c/2)‖Ax − b‖_2^2
Iteration:
x^{k+1} = argmin_x L_A(x; y^k),
y^{k+1} = y^k + c(b − Ax^{k+1}),
starting from k = 0 with y^0 = 0. c > 0 can change.
The objective of the first step is convex in x if f(·) is convex, and it is linear in y.
Equivalent to the dual implicit (proximal) iteration.
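A minimal NumPy sketch of the iteration. Purely for illustration it takes f(x) = ½‖x‖_2^2, so the x-subproblem has the closed form (I + cA^TA)x = A^T(y^k + cb) and the iterates approach the least-norm solution of Ax = b; a general f would need its own subproblem solver.

```python
import numpy as np

def alm_least_norm(A, b, c=1.0, num_iters=200):
    """Augmented Lagrangian iteration for min (1/2)||x||_2^2  s.t.  Ax = b (illustration only)."""
    m, n = A.shape
    y = np.zeros(m)
    M = np.eye(n) + c * (A.T @ A)                    # subproblem matrix (I + c A^T A)
    for _ in range(num_iters):
        x = np.linalg.solve(M, A.T @ (y + c * b))    # x^{k+1} = argmin_x L_A(x; y^k)
        y = y + c * (b - A @ x)                      # y^{k+1} = y^k + c (b - A x^{k+1})
    return x, y

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 80))
b = A @ rng.standard_normal(80)
x, _ = alm_least_norm(A, b)
print(np.linalg.norm(A @ x - b),                     # primal feasibility
      np.linalg.norm(x - np.linalg.pinv(A) @ b))     # distance to the least-norm solution
```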
Augmented Lagrangian (a.k.a. method of multipliers)
Recall the KKT conditions (omitting the complementarity part):
(primal feasibility)  Ax∗ = b
(dual feasibility)  0 ∈ ∂f(x∗) − A^T y∗
Compare the 2nd condition with the optimality condition of the ALM subproblem:
0 ∈ ∂f(x^{k+1}) − A^T(y^k + c(b − Ax^{k+1})) = ∂f(x^{k+1}) − A^T y^{k+1}
Conclusion: dual feasibility is maintained at (x^{k+1}, y^{k+1}) for every k.
Also, it “works toward” primal feasibility:
−(y^k)^T(Ax − b) + (c/2)‖Ax − b‖_2^2 = (c/2)⟨Ax − b, 2 Σ_{i=1}^{k} (Ax^i − b) + (Ax − b)⟩
It keeps adding penalty to the violation of Ax = b. In the limit, Ax∗ = b
holds (for polyhedral f(·), in finitely many steps).
Augmented Lagrangian (a.k.a. Method of Multipliers)
Compared to dual (sub)gradient ascent:
Pros:
• It converges for nonsmooth and extended-valued f (thanks to the proximal
term)
Cons:
• If f is nice and dual ascent works, ALM may be slower than dual ascent since
its subproblem is more difficult
• The term (c/2)‖Ax − b‖_2^2 in the x-subproblem couples different blocks of x
(unless A has a block-diagonal structure)
Application/alternative derivation: Bregman iterative regularization⁶
⁶ Osher, Burger, Goldfarb, Xu, and Yin [2005]; Yin, Osher, Goldfarb, and Darbon [2008]
Bregman Distance
Definition: let r(x) be a convex function; then
D_r(x, y; p) = r(x) − r(y) − ⟨p, x − y⟩, where p ∈ ∂r(y).
Not a distance, but it has a flavor of distance.
Examples: D_{ℓ2²}(u, u^k; p^k) versus D_{ℓ1}(u, u^k; p^k)
(Figure: the differentiable case vs. the non-differentiable case.)
Bregman iterative regularization
Iteration:
x^{k+1} = argmin_x D_r(x, x^k; p^k) + g(x),
p^{k+1} = p^k − ∇g(x^{k+1}),
starting from k = 0 and (x^0, p^0) = (0, 0). The update of p follows from
0 ∈ ∂r(x^{k+1}) − p^k + ∇g(x^{k+1}),
so in the next iteration, D_r(x, x^{k+1}; p^{k+1}) is well defined.
Bregman iteration is related, or equivalent, to
1. Proximal point iteration
2. Residual addback iteration
3. Augmented Lagrangian iteration
Bregman and Proximal Point
If r(x) = (c/2)‖x‖_2^2, the Bregman method reduces to the classical proximal point
method
x^{k+1} = argmin_x g(x) + (c/2)‖x − x^k‖_2^2.
Hence, the Bregman iteration with function r is an r-proximal algorithm for min g(x).
Traditionally, r = ℓ2² or another smooth function is used as the proximal function.
Few use non-differentiable convex functions like ℓ1 to generate the proximal
term because ℓ1 is not stable!
But using an ℓ1 proximal function has interesting properties.
Bregman convergence
If g(x) = p(Ax − b) is a strictly convex function that penalizes violations of Ax = b,
which is feasible, then the iteration
x^{k+1} = argmin_x D_r(x, x^k; p^k) + g(x)
converges to a solution of
min r(x), s.t. Ax = b.
Recall that the augmented Lagrangian algorithm has a similar property.
Next we restrict our analysis to
g(x) = (c/2)‖Ax − b‖_2^2.
Residual addback iteration
If g(x) = (c/2)‖Ax − b‖_2^2, we can derive the equivalent iteration
b^{k+1} = b + (b^k − Ax^k),
x^{k+1} = argmin_x r(x) + (c/2)‖Ax − b^{k+1}‖_2^2.
Interpretation:
• every iteration, the residual b − Ax^k is added back to b^k (see the sketch below);
• every subproblem is identical but with different data.
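A sketch of the addback loop for r(x) = ‖x‖_1, with the subproblem min ‖x‖_1 + (c/2)‖Ax − b^{k+1}‖_2^2 solved approximately by an inner proximal-gradient (ISTA) loop. The inner solver, iteration counts, and the optional noise-level stopping test (discussed a few slides below) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def shrink(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def solve_l1_ls(A, bk, c, x0, inner_iters=500):
    """Approximately solve min_x ||x||_1 + (c/2)||A x - bk||_2^2 by ISTA (proximal gradient)."""
    step = 1.0 / (c * np.linalg.norm(A, 2) ** 2)    # 1 / Lipschitz constant of the smooth part
    x = x0.copy()
    for _ in range(inner_iters):
        grad = c * (A.T @ (A @ x - bk))
        x = shrink(x - step * grad, step)           # prox of step * ||.||_1
    return x

def addback_l1(A, b, c, outer_iters=10, noise_level=None):
    """Residual addback (Bregman) iteration: b^{k+1} = b^k + (b - A x^k)."""
    x = np.zeros(A.shape[1])
    bk = np.zeros(A.shape[0])
    for _ in range(outer_iters):
        bk = bk + (b - A @ x)                       # add the residual back to the data
        x = solve_l1_ls(A, bk, c, x)                # identical subproblem, new data b^{k+1}
        if noise_level is not None and np.linalg.norm(A @ x - b) <= noise_level:
            break                                   # stop once the residual reaches the noise level
    return x
```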
Bregman = Residual addback
The equivalence of the two forms is given by p^k = −cA^T(Ax^k − b^k).
Proof by induction: assume both iterations have produced the same x^k so far.
(The inductive step is shown in a figure, with c = δ.)
Bregman = Residual addback = Augmented Lagrangian
Assume f(x) = r(x) and g(x) = (c/2)‖Ax − b‖_2^2.
Addback iteration:
b^{k+1} = b^k + (b − Ax^k) = ··· = b^0 + Σ_{i=0}^{k} (b − Ax^i).
Augmented Lagrangian iteration:
y^{k+1} = y^k + c(b − Ax^{k+1}) = ··· = y^0 + c Σ_{i=0}^{k+1} (b − Ax^i).
Bregman iteration:
p^{k+1} = p^k + cA^T(b − Ax^{k+1}) = ··· = p^0 + c Σ_{i=0}^{k+1} A^T(b − Ax^i).
Their equivalence is established by
y^k = c b^{k+1} and p^k = A^T y^k,  k = 0, 1, . . . ,
and initial values x^0 = 0, b^0 = 0, p^0 = 0, y^0 = 0.
Residual addback in a regularization perspective
Adding the residual b − Ax^k back to b^k is somewhat counter-intuitive. In the
regularized least-squares problem
min_x r(x) + (c/2)‖Ax − b‖_2^2,
the residual b − Ax^k contains both unwanted error and wanted features.
The question is how to extract the features from b − Ax^k.
An intuitive approach is to solve
y∗ = argmin_y r′(y) + (c′/2)‖Ay − (b − Ax^k)‖_2^2
and then let
x^{k+1} ← x^k + y∗.
However, the addback iteration keeps the same r and adds residuals back to
b^k. Surprisingly, this gives good denoising results.
Good denoising effect
Compare the addback (Bregman) iteration to BPDN: min{‖x‖_1 : ‖Ax − b‖_2 ≤ σ},
where b = Ax^o + n.
(Figure: recovered vs. true signal in four panels: (a) BPDN with σ = ‖n‖_2, (b) Bregman Iteration 1, (c) Bregman Iteration 3, (d) Bregman Iteration 5.)
Good denoising effect
From this example,
• given noisy observations and starting over-regularized, some
intermediate solutions have better fits and less noise than
x∗_σ = argmin{‖x‖_1 : ‖Ax − b‖_2 ≤ σ},
in fact, no matter how σ is chosen;
• the addback intermediate solutions are not on the path of
x∗_σ = argmin{‖x‖_1 : ‖Ax − b‖_2 ≤ σ} traced by varying σ > 0;
• recall that if the addback iteration is continued, it converges to the solution of
min_x ‖x‖_1, s.t. Ax = b.
Example of total variation denoising
Problem: u is a 2D image and b is its noisy observation; the noise is Gaussian.
Apply the addback iteration with A = I and
r(u) = TV(u) = ‖∇u‖_1.
The subproblem has the form
min_u TV(u) + (δ/2)‖u − f^k‖_2^2,
where the initial f^0 is a noisy observation of u_original.
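A sketch of this TV addback loop, using skimage.restoration.denoise_tv_chambolle as the TV–ℓ2 subproblem solver; mapping its weight parameter to 1/δ is an assumption about that routine's scaling, so treat this only as an illustration of the outer loop.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def tv_addback(noisy, delta, outer_iters=8):
    """Residual addback with A = I and r(u) = TV(u): f^{k+1} = f^k + (noisy - u^k)."""
    u = np.zeros_like(noisy)
    f = np.zeros_like(noisy)
    for _ in range(outer_iters):
        f = f + (noisy - u)                          # add the residual back to the data
        # subproblem  min_u TV(u) + (delta/2)||u - f||_2^2;
        # weight = 1/delta is an assumed mapping to this routine's parameterization
        u = denoise_tv_chambolle(f, weight=1.0 / delta)
    return u
```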
When to stop the addback iteration?
The 2nd curve shows that the optimal stopping iteration is 5.
The 1st curve shows that the residual just crosses the noise level there.
Solution: stop when ‖Ax^k − b‖_2 ≈ noise level. Some theoretical results exist.⁷
⁷ Osher, Burger, Goldfarb, Xu, and Yin [2005]
Numerical stability
The addback/augmented-Lagrangian iterations are more stable than the Bregman
iteration, though they are the same on paper.
Two reasons:
• p^k in the Bregman iteration may gradually lose the property p^k ∈ R(A^T)
due to accumulated round-off errors; the other two iterations explicitly
multiply by A^T and are thus more stable
• the addback iteration enjoys error forgetting and error cancellation when
r is polyhedral.
Numerical stability for ℓ1 minimization
ℓ1-based error-forgetting simulation:
• use five different subproblem solvers for the ℓ1 Bregman iteration
• for each subproblem, stop the solver at accuracy 10^{−6}
• track and plot ‖x^k − x∗‖_2 / ‖x∗‖_2 vs. iteration k
Errors made in subproblems get cancelled iteratively. See Yin and Osher [2012].
Summary
• Dual gradient method, after smoothing
• Exact smoothing for ℓ1
• Smooth a function by adding a strongly convex function to its convex
conjugate
• Augmented Lagrangian, Bregman, and residual addback iterations; their
equivalence
• Better denoising result of residual addback iteration
• Numerical stability, error forgetting and cancellation of residual addback
iteration; numerical instability of Bregman iteration
(Incomplete) References:
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
M.-J. Lai and W. Yin. Augmented ℓ1 and nuclear-norm models with a globally linearly
convergent algorithm. Submitted to SIAM Journal on Imaging Sciences, 2012.
M.P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM
Journal on Optimization, 18(4):1326–1350, 2007.
W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM
Journal on Imaging Sciences, 3(4):856–877, 2010.
Yurii Nesterov. A method of solving a convex programming problem with convergence
rate O(1/k2). Soviet Mathematics Doklady, 27:372–376, 1983.
J. Barzilai and J.M. Borwein. Two-point step size gradient methods. IMA Journal of
Numerical Analysis, 8(1):141–148, 1988.
Hongchao Zhang and William W Hager. A nonmonotone line search technique and its
application to unconstrained optimization. SIAM Journal on Optimization, 14(4):
1043–1056, 2004.
Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical
Programming, 103(1):127–152, 2005.
S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization
method for total variation-based image restoration. SIAM Journal on Multiscale
Modeling and Simulation, 4(2):460–489, 2005.
W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for
l1-minimization with applications to compressed sensing. SIAM Journal on Imaging
Sciences, 1(1):143–168, 2008.
W. Yin and S. Osher. Error forgetting of Bregman iteration. Journal of Scientific
Computing, 54(2):684–698, 2012.