Transcript of "Recent Advances in Approximate Message Passing" (schniter/pdf/h19_slides.pdf), 52 slides

Page 1:

Recent Advances in Approximate Message Passing

Phil Schniter

Supported in part by NSF grant CCF-1716388.

July 5, 2019

Page 2:

Overview

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 3:

Linear Regression

Outline

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 4:

Linear Regression

The Linear Regression Problem

Consider the following linear regression problem:

Recover xo from

y = Axo +w with

xo ∈ Rn unknown signal

A ∈ Rm×n known linear operator

w ∈ Rm white Gaussian noise.

Typical methodologies:

1 Optimization (or MAP estimation):

x̂ = argmin_x { (1/2)‖Ax − y‖² + R(x) }

2 Approximate MMSE:

x̂ ≈ E{x|y} for x ∼ p(x), y|x ∼ N(Ax, ν_w I)

3 Plug-and-play:1 iteratively apply a denoising algorithm like BM3D

4 Train a deep network to recover xo from y.

1. Venkatakrishnan, Bouman, Wohlberg ’13
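As a concrete reference for the recovery problem above, here is a minimal NumPy sketch (not from the slides; the dimensions, sparsity level, and noise variance are illustrative assumptions) that generates a synthetic instance y = A x_o + w with an i.i.d. Gaussian A and a sparse x_o:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 400, 1000, 40          # measurements, signal length, sparsity (assumed)
nu_w = 1e-3                      # noise variance (assumed)

# i.i.d. Gaussian operator with variance-1/m entries
A = rng.standard_normal((m, n)) / np.sqrt(m)

# k-sparse signal with Gaussian nonzero entries
x_o = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_o[support] = rng.standard_normal(k)

# noisy linear measurements y = A x_o + w
w = np.sqrt(nu_w) * rng.standard_normal(m)
y = A @ x_o + w
```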

Page 5:

Approximate Message Passing (AMP)

Outline

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 6:

Approximate Message Passing (AMP)

The AMP Methodology

All of the aforementioned methodologies can be addressed using the Approximate Message Passing (AMP) framework.

AMP tackles these problems via iterative denoising.

We will write the iteration-t denoiser as ηt(·) : Rn → Rn.

Each method defines the denoiser ηt(·) differently:

Optimization: ηt(r) = argmin_x { R(x) + (1/(2νt)) ‖x − r‖² } ≜ “prox_{R,νt}(r)”

MMSE: ηt(r) = E{ x | r = x + N(0, νt) }

Plug-and-play: ηt(r) = BM3D(r, νt)

Deep network: ηt(r) is learned from training data.
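For instance, with the sparsity-promoting choice R(x) = λ‖x‖₁, the optimization (proximal) denoiser above reduces to component-wise soft-thresholding. A minimal sketch, assuming an illustrative λ and test input:

```python
import numpy as np

def prox_l1(r, nu_t, lam=1.0):
    """Proximal (MAP) denoiser for R(x) = lam*||x||_1:
    argmin_x { lam*||x||_1 + (1/(2*nu_t))*||x - r||^2 },
    i.e., soft-thresholding with threshold lam*nu_t."""
    thresh = lam * nu_t
    return np.sign(r) * np.maximum(np.abs(r) - thresh, 0.0)

# Example: denoise a noisy version of a sparse vector
rng = np.random.default_rng(1)
x = np.zeros(10); x[[2, 7]] = [3.0, -2.0]
nu_t = 0.1
r = x + np.sqrt(nu_t) * rng.standard_normal(10)
print(prox_l1(r, nu_t))
```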

Page 7:

Approximate Message Passing (AMP)

The AMP Algorithm

initialize x^0 = 0, v^{−1} = 0
for t = 0, 1, 2, . . .
    v^t = y − A x^t + (n/m) v^{t−1} div(η_{t−1}(x^{t−1} + A^T v^{t−1}))   . . . corrected residual
    x^{t+1} = η_t(x^t + A^T v^t)   . . . denoising

where div(η_t(r)) ≜ (1/n) tr( ∂η_t(r)/∂r ) is the “divergence.”
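A minimal NumPy sketch of this recursion using a scalar soft-thresholding denoiser; the threshold rule α·sqrt(‖v^t‖²/m) and the iteration count are illustrative assumptions rather than part of the algorithm statement above:

```python
import numpy as np

def soft(r, tau):
    """Component-wise soft-thresholding denoiser eta(r)."""
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def amp(y, A, n_iter=30, alpha=1.0):
    """AMP iterations with a soft-threshold denoiser; the threshold
    alpha*sqrt(||v||^2/m) is one common (assumed) tuning."""
    m, n = A.shape
    x = np.zeros(n)
    v = np.zeros(m)
    div = 0.0                                   # divergence of the previous denoiser
    for _ in range(n_iter):
        v = y - A @ x + (n / m) * v * div       # Onsager-corrected residual
        r = x + A.T @ v                         # denoiser input r^t = x^t + A^T v^t
        tau = alpha * np.sqrt(np.sum(v**2) / m) # threshold from estimated noise level
        x = soft(r, tau)                        # denoising
        div = np.mean(np.abs(x) > 0)            # (1/n) sum_j eta'(r_j) for soft-thresholding
    return x
```

With an i.i.d. Gaussian A (entries of variance 1/m) and a sufficiently sparse x_o, amp(y, A) typically approaches x_o to within the noise level in a few tens of iterations.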

Note: Original version proposed by Donoho, Maleki, and Montanari in 2009.

They considered “scalar” denoisers, such that [η_t(r)]_j = η_t(r_j) ∀j.

For scalar denoisers, div(η_t(r)) = (1/n) Σ_{j=1}^n η_t′(r_j).

Can be recognized as iterative shrinkage/thresholding2 plus “Onsager correction.”

Can be derived using Gaussian & Taylor-series approximations of loopy belief-propagation (hence “AMP”).

2. Chambolle, DeVore, Lee, Lucier ’98

Page 8:

Approximate Message Passing (AMP)

AMP’s Denoising Property

Original AMP Assumptions

A ∈ Rm×n is drawn i.i.d. Gaussian

m, n → ∞ s.t. m/n → δ ∈ (0,∞) . . . “large-system limit”

[ηt(r)]j = ηt(rj) with Lipschitz η(·) . . . “scalar denoising”

Under these assumptions, the denoiser’s input r^t ≜ x^t + A^T v^t obeys3

r^t_j = x_{o,j} + N(0, ν_r^t)

That is, rt is a Gaussian-noise corrupted version of the true signal xo.

It should now be clear why we think of ηt(·) as a “denoiser.”

Furthermore, the effective noise variance can be consistently estimated:

ν̂_r^t ≜ (1/m) ‖v^t‖² −→ ν_r^t.

3. Bayati, Montanari ’11

Page 9:

Approximate Message Passing (AMP)

AMP’s State Evolution

Assume that the measurements y were generated via

y = Axo +N (0, νwI)

where xo empirically converges to some random variable Xo as n → ∞.

Define the iteration-t mean-squared error (MSE)

E^t ≜ (1/n) ‖x^t − x_o‖².

Under the above assumptions, AMP obeys the following state evolution (SE):4

for t = 0, 1, 2, . . .
    ν_r^t = ν_w + (n/m) E^t
    E^{t+1} = E{ [ η_t(X_o + N(0, ν_r^t)) − X_o ]² }
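A minimal sketch of iterating this SE numerically, with the expectation replaced by a Monte Carlo average over samples of X_o; the Bernoulli-Gaussian prior, the soft-threshold denoiser, and all parameter values are illustrative assumptions:

```python
import numpy as np

def soft(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def amp_state_evolution(delta=0.5, nu_w=1e-3, rho=0.1, alpha=1.0,
                        n_iter=20, n_mc=100_000, seed=0):
    """AMP state evolution for X_o ~ Bernoulli(rho)-Gaussian and a
    soft-threshold denoiser with threshold alpha*sqrt(nu_r)."""
    rng = np.random.default_rng(seed)
    X_o = rng.standard_normal(n_mc) * (rng.random(n_mc) < rho)   # samples of X_o
    E = np.mean(X_o**2)                     # MSE of the all-zero initialization
    history = []
    for _ in range(n_iter):
        nu_r = nu_w + E / delta             # nu_r^t = nu_w + (n/m) E^t,  delta = m/n
        R = X_o + np.sqrt(nu_r) * rng.standard_normal(n_mc)
        E = np.mean((soft(R, alpha * np.sqrt(nu_r)) - X_o) ** 2)   # E^{t+1}
        history.append(E)
    return history

print(amp_state_evolution()[-1])
```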

4. Bayati, Montanari ’11

Page 10:

Approximate Message Passing (AMP)

Achievability Analysis via the AMP SE

AMP’s SE can be applied to analyze achievability in various problems.

E.g., it yields a closed-form expression5 for the sparsity/sampling region where ℓ1-penalized regression is equivalent to ℓ0-penalized regression:

ρ(δ) = max_{c>0} [ 1 − (2/δ)[(1 + c²)Φ(−c) − c φ(c)] ] / [ 1 + c² − 2[(1 + c²)Φ(−c) − c φ(c)] ],
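This expression is straightforward to evaluate numerically. A sketch using SciPy's standard-normal cdf Φ and pdf φ, with a simple grid search over c (the grid range and resolution are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def rho_weak(delta, c_grid=np.linspace(1e-3, 10.0, 10_000)):
    """Weak l1/l0 equivalence threshold rho(delta), via a grid search over c."""
    c = c_grid
    g = (1 + c**2) * norm.cdf(-c) - c * norm.pdf(c)        # common term
    return np.max((1 - (2 / delta) * g) / (1 + c**2 - 2 * g))

for delta in (0.1, 0.3, 0.5):
    print(delta, rho_weak(delta))
```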

[Figure: phase-transition plot of sparsity rate ρ = k/m versus sampling rate δ = m/n, showing the MMSE-reconstruction boundary, the weak ℓ1/ℓ0 equivalence curve, and empirical AMP.]

5. Donoho, Maleki, Montanari ’09

Page 11:

Approximate Message Passing (AMP)

MMSE Optimality of AMP

Now suppose that the AMP Assumptions hold, and that

y = Axo +N (0, νwI),

where the elements of xo are i.i.d. draws of some random variable Xo.

Suppose also that ηt(·) is the MMSE denoiser, i.e.,

η_t(R) = E{ X_o | R = X_o + N(0, ν_r^t) }

Then, if the state evolution has a unique fixed point, the MSE of xt converges6

to the replica prediction of the MMSE as t → ∞.

Under the AMP Assumptions, the replica prediction of the MMSE was shown to be correct.7,8

6. Bayati, Montanari ’11; 7. Reeves, Pfister ’16; 8. Barbier, Dia, Macris, Krzakala ’16

Page 12:

Approximate Message Passing (AMP)

Universality of AMP State Evolution

Until now, it was assumed that A is drawn i.i.d. Gaussian.

The state evolution also holds when A is drawn from i.i.d. A_ij such that

E{A_ij} = 0

E{A_ij²} = 1/m

E{A_ij⁶} ≤ C/m³ for some fixed C > 0,

often abbreviated as “sub-Gaussian A_ij.”

The proof 9 assumes polynomial scalar denoising ηt(·) of bounded order.

9. Bayati, Lelarge, Montanari ’15

Page 13:

Approximate Message Passing (AMP)

Deriving AMP via Loopy BP (e.g., sum-product alg)

[Figure: bipartite factor graph with prior factors f(x_1), . . . , f(x_n), variable nodes x_1, . . . , x_n, likelihood factors N(y_i; [Ax]_i, ν_w), and messages p_{i→j}(x_j), p_{i←j}(x_j) along the edges.]

1 Message from y_i node to x_j node:

p_{i→j}(x_j) ∝ ∫_{ {x_l}_{l≠j} } N( y_i ; Σ_l a_il x_l , ν_w ) Π_{l≠j} p_{i←l}(x_l)
            = ∫_{z_i} N(y_i; z_i, ν_w) N( z_i ; ẑ_i(x_j), ν_i^z(x_j) ) ∼ N,
where z_i ≜ Σ_l a_il x_l ≈ N via the CLT.

To compute ẑ_i(x_j), ν_i^z(x_j), the means and variances of {p_{i←l}}_{l≠j} suffice, implying Gaussian message passing, similar to expectation-propagation. Remaining problem: we have 2mn messages to compute (too many!).

2 Exploiting similarity among the messages {p_{i←j}}_{i=1}^m, AMP employs a Taylor-series approximation of their difference whose error vanishes as m → ∞ for dense A (and similar for {p_{i←j}}_{j=1}^n as n → ∞). Finally, we need to compute only O(m+n) messages!


Page 14:

Approximate Message Passing (AMP)

Understanding AMP

The belief-propagation derivation of AMP provides very little insight!

Loopy BP is suboptimal, even if implemented exactly
The i.i.d. property of A is never used in the derivation

And the rigorous proofs of AMP’s state evolution are very technical!

As a middle ground, we suggest an alternate derivation that gives insight into how and why AMP works.

Based on the idea of “first-order cancellation”
We will assume equiprobable Bernoulli a_ij ∈ {±1/√m} and polynomial η(·)

Page 15:

Approximate Message Passing (AMP)

AMP as First-Order Cancellation

Recall the AMP recursion:

    v^t = y − A x^t + (n/m) v^{t−1} div(η(r^{t−1}))
    x^{t+1} = η( x^t + A^T v^t ),   with r^t ≜ x^t + A^T v^t

Notice that

[A x^t]_i = a_i^T η( x^{t−1} + Σ_l a_l v_l^{t−1} ),   where a_i^T is the ith row of A
         = a_i^T η( x^{t−1} + Σ_{l≠i} a_l v_l^{t−1} + a_i v_i^{t−1} ),
           where r_i^{t−1} ≜ x^{t−1} + Σ_{l≠i} a_l v_l^{t−1} removes the direct contribution of a_i from r^{t−1}
         = a_i^T [ η(r_i^{t−1}) + (∂η/∂r)(r_i^{t−1}) a_i v_i^{t−1} + O(1/m) ]   using a Taylor expansion
         = a_i^T η(r_i^{t−1}) + v_i^{t−1} Σ_j a_ij² η′(r_ij^{t−1}) + O(1/√m)
         = a_i^T η(r_i^{t−1}) + (n/m) v_i^{t−1} (1/n) Σ_j η′(r_ij^{t−1}) + O(1/√m)   since a_ij² = 1/m ∀ ij
         = a_i^T η(r_i^{t−1}) + (n/m) v_i^{t−1} div(η(r_i^{t−1})) + O(1/√m),

which uncovers the Onsager correction.

Page 16:

Approximate Message Passing (AMP)

AMP as First-Order Cancellation (cont.)

Now use [A x^t]_i to study the jth component of the denoiser input error e^t ≜ r^t − x_o:

e^t_j = Σ_i a_ij Σ_{l≠j} a_il [ x_{o,l} − η(r_il^{t−1}) ] + Σ_i a_ij w_i
        + Σ_i a_ij [ (n/m) v_i^{t−1} div(η(r^{t−1})) − (n/m) v_i^{t−1} div(η(r_i^{t−1})) ] + O(1/√m)

where the divergence difference can be absorbed into the O(1/√m) term. . .

      = Σ_i a_ij Σ_{l≠j} a_il ǫ_il^t + Σ_i a_ij w_i + O(1/√m),   with ǫ_il^t ≜ x_{o,l} − η(r_il^{t−1});
        the first sum ∼ N( 0, (1/m²) Σ_i Σ_{l≠j} (ǫ_il^t)² ) and the second ∼ N( 0, (1/m) Σ_i w_i² ),
        using the CLT and assuming independence of {a_il}_{l=1}^n and {r_il^{t−1}}_{l=1}^n

      ∼ N( 0, (n/m) E(t) + ν_w ) + O(1/√m)   . . . the AMP state evolution

where E(t) ≜ (1/n) Σ_{j=1}^n [ x_{o,j} − x_j^{(t)} ]²  and  ν_w ≜ (1/m) Σ_{i=1}^m w_i²

Page 17:

Approximate Message Passing (AMP)

AMP with Non-Separable Denoisers

Until now, we have focused on separable denoisers, i.e., [ηt(r)]j = ηt(rj) ∀j

Can we use sophisticated non-separable η(·) with AMP?

Yes! Many examples. . .

Markov chain,10 Markov field,11 and Markov tree12 denoisers in 2010–2012
Blockwise & TV denoising considered by Donoho, Johnstone, Montanari in 2011
BM3D denoising considered by Metzler, Maleki, Baraniuk in 2015

Rigorous state-evolution proven by Berthier, Montanari, Nguyen in 2017.

Assumes A drawn i.i.d. Gaussian
Assumes η is Lipschitz and “convergent under Gaussian inputs”

10. S ’10; 11. Som, S ’11; 12. Som, S ’12

Page 18:

Approximate Message Passing (AMP)

AMP at Large but Finite Dimensions

Until now, we have focused on the large-system limit m, n → ∞ with m/n → δ ∈ (0,∞)

The non-asymptotic case was analyzed by Rush and Venkataramanan.13

They showed that probability of ǫ-deviation between the finite and limiting SE falls exponentially in m, as long as the number of iterations t < o(log n / log log n)

13. Rush, Venkataramanan ’18

Page 19:

Approximate Message Passing (AMP)

AMP Summary: The good, the bad, and the ugly

The good:

With large i.i.d. sub-Gaussian A, AMP is rigorously characterized by a scalar state-evolution whose fixed points, when unique, are MMSE optimal under proper choice of denoiser.

Empirically, AMP behaves well with many other “sufficiently random” A

(e.g., randomly sub-sampled Fourier A & i.i.d. sparse x).

The bad:

With general A, AMP gives no guarantees.

The ugly:

With some A, AMP may fail to converge! (e.g., ill-conditioned or non-zero-mean A)

Page 20:

Vector AMP (VAMP)

Outline

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 21:

Vector AMP (VAMP)

Vector AMP (VAMP)

Recall goal is linear regression: Recover xo from y = Axo +N (0, I/γw).

Now it will be easier to work with inverse variances, i.e., precisions

VAMP is like AMP in many ways, but supports a larger class of random matrices.

VAMP yields a precise analysis for right-orthogonally invariant A:

svd(A) = U S V^T for
U: deterministic orthogonal
S: deterministic diagonal
V: “Haar,” uniform on the set of orthogonal matrices

of which i.i.d. Gaussian is a special case.

Can be derived as a form of message passing on a vector-valued factor graph.

[Figure: vector-valued factor graph p(x1) — x1 — δ(x1 − x2) — x2 — N(y; A x2, I/γw).]

Page 22:

Vector AMP (VAMP)

VAMP: The Algorithm

With SVD A = U Diag(s) V^T, damping ζ ∈ (0, 1], and Lipschitz η_1^t(·) : R^n → R^n.

Initialize r1, γ1.
For t = 1, 2, 3, . . .
    x1 ← η_1^t(r1)                                  denoising of r1 = x_o + N(0, I/γ1)
    ξ1 ← γ1 / div(η_1^t(r1))
    r2 ← (ξ1 x1 − γ1 r1)/(ξ1 − γ1)                  Onsager correction
    γ2 ← ξ1 − γ1
    x2 ← η2(r2; γ2)                                 LMMSE estimate of x ∼ N(r2, I/γ2) from y = Ax + N(0, I/γw)
    ξ2 ← γ2 / div(η2(r2; γ2))
    r1 ← ζ (ξ2 x2 − γ2 r2)/(ξ2 − γ2) + (1−ζ) r1     Onsager correction
    γ1 ← ζ (ξ2 − γ2) + (1−ζ) γ1                     damping

where η2(r2; γ2) = (γw A^T A + γ2 I)^{−1} (γw A^T y + γ2 r2)
                 = V ( γw Diag(s)² + γ2 I )^{−1} ( γw Diag(s) U^T y + γ2 V^T r2 )
      ξ2 = [ (1/n) Σ_{j=1}^n (γw s_j² + γ2)^{−1} ]^{−1}                    two mat-vec mults per iteration!
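A compact NumPy sketch of these iterations, with a soft-thresholding denoiser standing in for η_1^t (the denoiser and its λ, the initialization of r1 and γ1, and the fixed damping ζ are illustrative assumptions). For clarity the LMMSE stage is written as a direct solve; the SVD form above gives the same x2 with two matrix-vector multiplies per iteration:

```python
import numpy as np

def soft(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def vamp(y, A, gamma_w, lam=0.1, n_iter=30, zeta=0.8):
    """VAMP iterations with a soft-threshold denoiser as eta_1 (illustrative
    choice) and the exact LMMSE stage as eta_2 (written as a direct solve)."""
    m, n = A.shape
    AtA = A.T @ A
    Aty = A.T @ y
    s2 = np.maximum(np.linalg.eigvalsh(AtA), 0.0)   # eigenvalues of A^T A (= s_j^2, padded)

    r1 = np.zeros(n)
    gamma1 = 1.0                                    # assumed initialization
    for _ in range(n_iter):
        # --- denoising stage ---
        x1 = soft(r1, lam / gamma1)                 # prox of lam*||.||_1 given r1 ~ N(x, I/gamma1)
        alpha1 = np.clip(np.mean(np.abs(x1) > 0), 1e-6, 1 - 1e-6)   # divergence of eta_1
        xi1 = gamma1 / alpha1
        r2 = (xi1 * x1 - gamma1 * r1) / (xi1 - gamma1)              # Onsager correction
        gamma2 = xi1 - gamma1

        # --- LMMSE stage: (gamma_w A^T A + gamma2 I)^-1 (gamma_w A^T y + gamma2 r2) ---
        x2 = np.linalg.solve(gamma_w * AtA + gamma2 * np.eye(n),
                             gamma_w * Aty + gamma2 * r2)
        alpha2 = gamma2 * np.mean(1.0 / (gamma_w * s2 + gamma2))    # divergence of eta_2
        xi2 = gamma2 / alpha2
        r1 = zeta * (xi2 * x2 - gamma2 * r2) / (xi2 - gamma2) + (1 - zeta) * r1
        gamma1 = zeta * (xi2 - gamma2) + (1 - zeta) * gamma1
    return x1
```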

Page 23:

Vector AMP (VAMP)

VAMP’s Denoising Property

Original VAMP Assumptions

A ∈ Rm×n is right-orthogonally invariant

m,n → ∞ s.t. m/n → δ ∈ (0,∞) . . . “large-system limit”

[ηt1(r)]j = ηt1(rj) with Lipschitz ηt1(·) . . . “separable denoising”

Under Assumption 2, the elements of the denoiser’s input rt1 obey14

rt1,j = xo,j +N (0, νt1)

That is, rt1 is a Gaussian-noise corrupted version of the true signal xo.

As with AMP, we can interpret η1(·) as a “denoiser.”

14. Rangan, S, Fletcher ’16

Page 24:

Vector AMP (VAMP)

VAMP’s State Evolution

Assume empirical convergence of {s_j} → S and {(r^0_{1,j}, x_{o,j})} → (R_1^0, X_o), and define E_i^t ≜ (1/n) ‖x_i^t − x_o‖² for i = 1, 2.

Then under the VAMP Assumptions, VAMP obeys the following state evolution:

for t = 0, 1, 2, . . .
    E_1^t = E{ [ η_1^t(X_o + N(0, ν_1^t)) − X_o ]² }                MSE
    α_1^t = E{ η_1^t′(X_o + N(0, ν_1^t)) }                          divergence
    γ_2^t = γ_1^t (1 − α_1^t)/α_1^t ,    ν_2^t = [ E_1^t − (α_1^t)² ν_1^t ] / (1 − α_1^t)²
    E_2^t = E{ [ γ_w S² + γ_2^t ]^{−1} }                            MSE
    α_2^t = γ_2^t E{ [ γ_w S² + γ_2^t ]^{−1} }                      divergence
    γ_1^{t+1} = γ_2^t (1 − α_2^t)/α_2^t ,    ν_1^{t+1} = [ E_2^t − (α_2^t)² ν_2^t ] / (1 − α_2^t)²

Note: Above equations assume η2(·) uses true noise precision γw. If not, there are more complicated expressions for E_2^t and α_2^t.
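A minimal sketch of iterating this state evolution, with the expectations replaced by Monte Carlo averages; the Bernoulli-Gaussian X_o, the soft-threshold η_1 (a non-MMSE choice), the singular-value samples, and the initialization ν_1 = γ_1 = 1 are illustrative assumptions:

```python
import numpy as np

def soft(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def vamp_state_evolution(s, gamma_w, rho=0.1, lam=0.1, n_iter=20,
                         n_mc=100_000, seed=0):
    """VAMP SE with X_o ~ Bernoulli(rho)-Gaussian, a soft-threshold eta_1
    (a non-MMSE, illustrative choice), and singular values s zero-padded to
    length n so that s**2 follows the law of S^2."""
    rng = np.random.default_rng(seed)
    X_o = rng.standard_normal(n_mc) * (rng.random(n_mc) < rho)
    S2 = rng.choice(s**2, size=n_mc)      # Monte Carlo samples of S^2
    nu1, gamma1 = 1.0, 1.0                # assumed initialization
    for _ in range(n_iter):
        R1 = X_o + np.sqrt(nu1) * rng.standard_normal(n_mc)
        X1 = soft(R1, lam / gamma1)
        E1 = np.mean((X1 - X_o) ** 2)                                # MSE
        a1 = np.clip(np.mean(np.abs(X1) > 0), 1e-6, 1 - 1e-6)        # divergence
        gamma2 = gamma1 * (1 - a1) / a1
        nu2 = max((E1 - a1**2 * nu1) / (1 - a1)**2, 1e-12)           # guard vs. MC noise
        d = 1.0 / (gamma_w * S2 + gamma2)
        E2 = np.mean(d)                                              # MSE
        a2 = np.clip(gamma2 * np.mean(d), 1e-6, 1 - 1e-6)            # divergence
        gamma1 = gamma2 * (1 - a2) / a2
        nu1 = max((E2 - a2**2 * nu2) / (1 - a2)**2, 1e-12)
    return E1, E2
```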

Page 25:

Vector AMP (VAMP)

MMSE Optimality of VAMP

Now suppose that the VAMP Assumptions hold, and that

y = Axo +N (0, I/γw),

where the elements of xo are i.i.d. draws of some random variable Xo.

Suppose also that ηt1(·) is the MMSE denoiser, i.e.,

η_1^t(R1) = E{ X_o | R1 = X_o + N(0, ν_1^t) }

Then, if the state evolution has a unique fixed point, the MSE of xt1

converges15 to the replica prediction16 of the MMSE as t → ∞.

15. Rangan, S, Fletcher ’16; 16. Tulino, Caire, Verdu, Shamai ’13

Page 26:

Vector AMP (VAMP)

Experiment with MMSE Denoising

Comparison of several algorithms17 with MMSE denoising.

[Figure: median normalized MSE [dB] versus condition number κ(A) for AMP, S-AMP, damped GAMP, VAMP, and the replica MMSE. Setup: n = 1024, m/n = 0.5, A = U Diag(s) V^T with U, V ∼ Haar and s_j/s_{j−1} = φ ∀j (φ determines κ(A)), X_o ∼ Bernoulli-Gaussian with Pr{X_o ≠ 0} = 0.1, SNR = 40 dB.]

VAMP achieves the replica MMSE over a wide range of condition numbers.

17. S-AMP: Cakmak, Fleury, Winther ’14; damped GAMP: Vila, S, Rangan, Krzakala, Zdeborova ’15

Page 27:

Vector AMP (VAMP)

Experiment with MMSE Denoising (cont.)

Comparison of several algorithms with priors matched to data.

[Figure: median NMSE [dB] versus iterations for AMP, S-AMP, damped GAMP, VAMP, and the VAMP SE, at condition numbers 1 and 1000. Same setup: n = 1024, m/n = 0.5, A = U Diag(s) V^T with U, V ∼ Haar, geometric singular values s_j/s_{j−1} = φ, X_o ∼ Bernoulli-Gaussian with Pr{X_o ≠ 0} = 0.1, SNR = 40 dB.]

VAMP is relatively fast even when A is ill-conditioned.

Page 28:

Vector AMP (VAMP)

VAMP for Optimization

Consider the optimization problem

x̂ = argmin_x { (1/2) ‖Ax − y‖² + R(x) }

where R(·) is strictly convex and A is arbitrary (e.g., not necessarily RRI).

If we choose the denoiser

η_1^t(r) = argmin_x { R(x) + (γ_1^t/2) ‖x − r‖² } = prox_{R/γ_1^t}(r)

and the damping parameter

ζ ≤ 2 min{γ1, γ2} / (γ1 + γ2),

then a double-loop version of VAMP converges18 to the x̂ defined above.

Furthermore, if the γ1 and γ2 variables are fixed over the iterations, then VAMP reduces to the Peaceman-Rachford variant of ADMM.

18. Fletcher, Sahraee, Rangan, S ’16

Page 29:

Vector AMP (VAMP)

Example of AMP & VAMP on the LASSO Problem

[Figure: NMSE [dB] versus iterations for VAMP, AMP, Chambolle-Pock, and FISTA, for an i.i.d. Gaussian A matrix (left) and a column-correlated (0.99) A matrix (right).]

Solving LASSO to reconstruct a 40-sparse x ∈ R^1000 from noisy y ∈ R^400:

x̂ = argmin_x { (1/2) ‖y − Ax‖² + λ ‖x‖₁ }.

Page 30:

Vector AMP (VAMP)

Deriving VAMP from EC

Ideally, we would like to compute the exact posterior density

p(x|y) = p(x) ℓ(x;y) / Z(y)   for   Z(y) ≜ ∫ p(x) ℓ(x;y) dx,

but the high-dimensional integral in Z(y) is difficult to compute.

We might try to circumvent Z(y) through variational optimization:

p(x|y) = argmin_b D( b(x) ‖ p(x|y) ),   where D(·‖·) is KL divergence
        = argmin_b D( b(x) ‖ p(x) ) + D( b(x) ‖ ℓ(x;y) ) + H( b(x) )   . . . “Gibbs free energy”
        = argmin_{b1,b2,q} D( b1(x) ‖ p(x) ) + D( b2(x) ‖ ℓ(x;y) ) + H( q(x) )   ≜ J_Gibbs(b1, b2, q),   s.t. b1 = b2 = q,

but the density constraint keeps the problem difficult.

Page 31:

Vector AMP (VAMP)

Deriving VAMP from EC (cont.)

In expectation-consistent approximation (EC)19, the density constraint is relaxed to moment-matching constraints:

p(x|y) ≈ argmin_{b1,b2,q} J_Gibbs(b1, b2, q)
s.t.  E{x|b1} = E{x|b2} = E{x|q}
      tr(Cov{x|b1}) = tr(Cov{x|b2}) = tr(Cov{x|q}).

The stationary points of EC are the densities

b1(x) ∝ p(x) N(x; r1, I/γ1),   b2(x) ∝ ℓ(x;y) N(x; r2, I/γ2),   q(x) = N(x; x̂, I/ξ)
s.t.  E{x|b1} = E{x|b2} = x̂
      (1/n) tr(Cov{x|b1}) = (1/n) tr(Cov{x|b2}) = 1/ξ

VAMP iteratively solves for the quantities r1, γ1, r2, γ2, x̂, ξ above.

Leads to η_1^t(·) being the MMSE denoiser of r1 = x_o + N(0, I/γ_1^t).
In this setting, VAMP is simply an instance of expectation propagation (EP)20.
But VAMP is more general than EP, in that it allows non-MMSE denoisers η1.

19. Opper, Winther ’04; 20. Minka ’01

Page 32:

Vector AMP (VAMP)

Plug-and-play VAMP

Recall the scalar denoising step of VAMP (or AMP):

x1 = ηt1(r1) where r1 = xo +N (0, I/γt1)

For many signal classes (e.g., images), very sophisticated non-separable

denoisers η1(·) have been developed (e.g., BM3D, DnCNN).

These non-separable denoisers can be “plugged into” VAMP!

Their divergence can be approximated via Monte Carlo21

div(η_1^t(r)) ≈ (1/K) Σ_{k=1}^K (1/(nǫ)) p_k^T [ η_1^t(r + ǫ p_k) − η_1^t(r) ]

with random vectors p_k ∈ {±1}^n and small ǫ > 0. Empirically, K = 1 suffices.
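A minimal sketch of this estimator (the probe size ǫ, the number of probes K, and the soft-thresholding test denoiser are illustrative assumptions); note the 1/(nǫ) normalization, which matches the definition div(η) = (1/n) tr(∂η/∂r):

```python
import numpy as np

def mc_divergence(denoiser, r, eps=1e-3, K=1, seed=0):
    """Monte Carlo estimate of div(eta(r)) = (1/n) tr(d eta(r) / d r)
    using K random +/-1 probe vectors (K = 1 is usually enough)."""
    rng = np.random.default_rng(seed)
    n = r.size
    est = 0.0
    for _ in range(K):
        p = rng.choice([-1.0, 1.0], size=n)
        est += p @ (denoiser(r + eps * p) - denoiser(r)) / (n * eps)
    return est / K

# Sanity check with a denoiser whose divergence is known: soft-thresholding,
# whose exact divergence equals the fraction of nonzero outputs.
def soft(r, tau=0.5):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

r = np.random.default_rng(1).standard_normal(10_000)
print(mc_divergence(soft, r), np.mean(np.abs(soft(r)) > 0))
```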

A rigorous state-evolution has been established for plug-and-play VAMP.22

21. Ramani, Blu, Unser ’08; 22. Fletcher, Rangan, Sarkar, S ’18

Page 33:

Vector AMP (VAMP)

Experiment: Compressive Image Recovery with BM3D

Plug-and-play versions of VAMP and AMP behave similarly when A is i.i.d. Gaussian, but VAMP can handle a larger class of random matrices A.

[Figure: PSNR versus sampling rate M/N for i.i.d. Gaussian A (left), and PSNR versus condition number for spread-spectrum A with M/N = 0.2 (right), comparing VAMP-BM3D, AMP-BM3D, VAMP-L1, and AMP-L1.]

Results above are averaged over 128× 128 versions of

lena, barbara, boat, fingerprint, house, peppers

and 10 random realizations of A,w.

Page 34:

Unfolding AMP and VAMP into Deep Neural Networks

Outline

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 35:

Unfolding AMP and VAMP into Deep Neural Networks

Deep learning for sparse reconstruction

Until now we’ve focused on designing algorithms to recover x_o ∼ p(x) from measurements y = A x_o + w.

[Diagram: y → algorithm (using the model p(x), A) → x̂]

What about training deep networks to predict x_o from y? Can we increase accuracy and/or decrease computation?

[Diagram: y → deep network (trained on data {(x_d, y_d)}_{d=1}^D) → x̂]

Are there connections between these approaches?

Page 36:

Unfolding AMP and VAMP into Deep Neural Networks

Unfolding Algorithms into Networks

Consider, e.g., the classical sparse-reconstruction algorithm, ISTA.23

v^t = y − A x^t
x^{t+1} = η(x^t + A^T v^t)
⇔  x^{t+1} = η(S x^t + B y)   with   S ≜ I − A^T A,   B ≜ A^T

Gregor & LeCun24 proposed to “unfold” it into a deep net and “learn” improved parameters using training data, yielding “learned ISTA” (LISTA):

[Diagram: unfolded network — y enters each layer through B, and x^{t+1} = η(S x^t + B y) is applied layer after layer, with the parameters learned from training data.]

The same “unfolding & learning” idea can be used to improve AMP, yielding“learned AMP” (LAMP).25
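A minimal sketch of the unfolded forward pass, with per-layer parameters (B_t, S_t, λ_t) initialized to their ISTA values; the step size 1/‖A‖², the threshold, and the depth are illustrative assumptions, and the training loop that would learn these parameters (yielding LISTA/LAMP-style networks) is not shown:

```python
import numpy as np

def soft(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def ista_init_layers(A, T=10, lam=0.1):
    """Per-layer parameters (B_t, S_t, lambda_t) initialized to ISTA's values;
    learning them from training data (not shown here) yields LISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/||A||^2 for plain ISTA
    B = step * A.T
    S = np.eye(A.shape[1]) - B @ A
    return [(B.copy(), S.copy(), lam * step) for _ in range(T)]

def unfolded_forward(y, layers):
    """Forward pass through the unfolded network: x_{t+1} = soft(S_t x_t + B_t y)."""
    x = np.zeros(layers[0][1].shape[0])
    for B, S, lam_t in layers:
        x = soft(S @ x + B @ y, lam_t)
    return x
```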

23. Chambolle, DeVore, Lee, Lucier ’98; 24. Gregor, LeCun ’10; 25. Borgerding, S ’16

Page 37:

Unfolding AMP and VAMP into Deep Neural Networks

Onsager-Corrected Deep Networks

tth LISTA layer:

[Diagram: layer mapping (x^t, y) to x^{t+1} via matrices B_t, A_t and denoiser η(•; λ_t), structured to exploit the low-rank B_t A_t in the linear stage S_t = I − B_t A_t.]

tth LAMP layer:

[Diagram: layer mapping (x^t, v^t, y) to (x^{t+1}, v^{t+1}) via matrices B_t, A_t, a denoiser η(•; •) with threshold proportional to λ_t ‖v^t‖₂/√M, and an Onsager term (N/M) div(η).]

Onsager correction now aims to decouple errors across layers.

Page 38:

Unfolding AMP and VAMP into Deep Neural Networks

LAMP performance with soft-threshold denoising

LISTA beats AMP, FISTA, ISTA. LAMP beats LISTA in convergence speed and asymptotic MSE.

[Figure: average NMSE [dB] versus layers/iterations for ISTA, FISTA, AMP, LISTA (tied/untied), and LAMP (tied/untied); and a QQ plot of the LAMP input r^t against standard-normal quantiles.]

Page 39:

Unfolding AMP and VAMP into Deep Neural Networks

LAMP beyond soft-thresholding

So far, we used soft-thresholding to isolate the effects of Onsager correction.

What happens with more sophisticated (learned) denoisers?

[Figure: average NMSE [dB] versus layers for LISTA, LAMP-l1, LAMP-bg, LAMP-expo, LAMP-pwlin, LAMP-spline, and the support oracle.]

Here we learned the parameters of these denoiser families:

scaled soft-thresholding

conditional mean under BG

Exponential kernel26

Piecewise Linear26

Spline27

Big improvement!

26. Guo, Davies ’15; 27. Kamilov, Mansour ’16

Page 40:

Unfolding AMP and VAMP into Deep Neural Networks

LAMP versus VAMP

How does our best Learned AMP compare to MMSE VAMP?

[Figure: average NMSE [dB] versus layers/iterations for LAMP-pwlin, VAMP-bg, and the support oracle.]

VAMP wins!

So what about “learned VAMP”?

Page 41:

Unfolding AMP and VAMP into Deep Neural Networks

Learned VAMP

Suppose we unfold VAMP and learn (via backprop) the parameters {S_t, η_t}_{t=1}^T that minimize the training MSE.

[Diagram: unfolded VAMP network alternating learned denoising stages η_t(·) and linear stages S_t, with Onsager corrections in between; the internal signals behave as x_o + N(0, I/γ_1^t) and x_o + N(0, I/γ_2^t).]

Remarkably, backpropagation learns the parameters prescribed by VAMP!

Theory explains the deep network!

Onsager correction decouples the design of {S_t, η_t(·)}_{t=1}^T: layer-wise optimal S_t, η_t(·) ⇒ network-optimal {S_t, η_t(·)}_{t=1}^T

Page 42:

Extensions: GLMs, Parameter Learning, Bilinear Problems

Outline

1 Linear Regression

2 Approximate Message Passing (AMP)

3 Vector AMP (VAMP)

4 Unfolding AMP and VAMP into Deep Neural Networks

5 Extensions: GLMs, Parameter Learning, Bilinear Problems

Page 43:

Extensions: GLMs, Parameter Learning, Bilinear Problems

Generalized linear models

Until now we have considered the standard linear model: y = Axo +w.

One may also consider the generalized linear model (GLM), where

y ∼ p(y|z) with hidden z = Axo

which supports, e.g., the following channels (illustrated in the sketch below):

y_i = z_i + w_i: additive, possibly non-Gaussian noise
y_i = Q(z_i + w_i): quantization
y_i = sgn(z_i + w_i): binary classification
y_i = |z_i + w_i|: phase retrieval
Poisson y_i: photon-limited imaging
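A minimal sketch of these output channels applied to z = A x_o (the noise level, quantizer step, and the clipping used to make the Poisson rate non-negative are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def glm_outputs(z, nu_w=0.01, quant_step=0.5):
    """Illustrative GLM output channels y ~ p(y|z) applied to z = A x_o.
    The noise level, quantizer step, and Poisson-rate clipping are assumed."""
    w = np.sqrt(nu_w) * rng.standard_normal(z.shape)
    return {
        "awgn":      z + w,                                        # additive noise
        "quantized": quant_step * np.round((z + w) / quant_step),  # y = Q(z + w)
        "sign":      np.sign(z + w),                               # binary classification
        "magnitude": np.abs(z + w),                                # phase retrieval
        "poisson":   rng.poisson(np.maximum(z, 0.0)),              # photon-limited imaging
    }

# Example: apply each channel to a random hidden vector z
z = rng.standard_normal(8)
for name, y in glm_outputs(z).items():
    print(name, y)
```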

For this, there is a Generalized AMP29 with a rigorous state evolution.30

There is also a Generalized VAMP31 with a rigorous state evolution.32

29. Rangan ’11; 30. Javanmard, Montanari ’12; 31. S, Fletcher, Rangan ’16; 32. Fletcher, Rangan, S ’18

Page 44:

Extensions: GLMs, Parameter Learning, Bilinear Problems

Parameter learning

Consider inference under prior p(x; θ1) and likelihood ℓ(x; y, θ2), where the hyperparameters θ ≜ [θ1, θ2] are unknown.

θ1 might specify sparsity rate, or all parameters of a GMM
θ2 might specify the measurement noise variance, or forward model A

EM-inspired extensions of (G)AMP and (G)VAMP that simultaneously estimate x and learn θ from y have been developed.

Have rigorous state evolutions33,34
“Adaptive VAMP” yields asymptotically consistent34 estimates of θ

SURE-based auto-tuning AMP algorithms have also been proposed

for LASSO by Mousavi, Maleki, and Baraniuk
for parametric separable denoisers by Guo and Davies

33. Kamilov, Rangan, Fletcher, Unser ’12; 34. Fletcher, Sahraee, Rangan, S ’17

Page 45:

Extensions: GLMs, Parameter Learning, Bilinear Problems

Bilinear problems

So far we have considered (generalized) linear models.

AMP has also been applied to (generalized) bilinear models.

The typical problem is to recover B ∈ R^{m×k} and C ∈ R^{k×n} from
    Y = BC + W                          (standard bilinear model)
    Y ∼ p(Y|Z) for Z = BC               (generalized bilinear model)

The case where m,n→∞ for fixed k is well understood.35 (See Jean’s talk)

With m,n, k →∞, algorithms work (e.g., BiGAMP36) but are not well understood.

A more general bilinear problem is to recover b ∈ R^k and c ∈ R^n from
    y_i = b^T A_i c + w_i,  i = 1 . . . m
    y_i ∼ p(y_i | z_i) for z_i = b^T A_i c,  i = 1 . . . m
where {A_i} are known matrices.

Algorithms37 and replica analyses38 (for m,n, k →∞ and i.i.d. Ai) exist.

35. Montanari, Venkataramanan ’17; 36. Parker, S, Cevher ’14; 37. Parker, S ’16; 38. Schulke, S, Zdeborova ’16

Page 46:

Extensions: GLMs, Parameter Learning, Bilinear Problems

Conclusions

AMP and VAMP are computationally efficient algorithms for (generalized) linear regression.

With large random A, the ensemble behaviors of AMP and VAMP obey rigorous state evolutions whose fixed points, when unique, agree with the replica predictions of the MMSE.

AMP and VAMP support nonseparable (i.e., “plug-in”) denoisers, also withrigorous state evolutions.

For convex optimization problems, VAMP is provably convergent for any A.

Extensions of AMP and VAMP cover . . .

unfolded deep networks
the learning of unknown prior/likelihood parameters
bilinear problems

Not discussed: multilayer versions of AMP & VAMP.

Page 47:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References I

S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model-based reconstruction,” in Proc. IEEE Global Conf. Signal Info. Process., pp. 945–948, 2013.

D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, pp. 18914–18919, Nov. 2009.

A. Chambolle, R. A. DeVore, N. Lee, and B. J. Lucier, “Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage,” IEEE Trans. Image Process., vol. 7, pp. 319–335, Mar. 1998.

M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inform. Theory, vol. 57, pp. 764–785, Feb. 2011.

G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact,” in Proc. IEEE Int. Symp. Inform. Thy., 2016.

J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear estimation,” in Proc. Allerton Conf. Commun. Control Comput., pp. 625–632, 2016.

M. Bayati, M. Lelarge, and A. Montanari, “Universality in polytope phase transitions and message passing algorithms,” Ann. App. Prob., vol. 25, no. 2, pp. 753–822, 2015.

Page 48:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References II

P. Schniter, “Turbo reconstruction of structured sparse signals,” in Proc. Conf. Inform. Science & Syst., (Princeton, NJ), pp. 1–6, Mar. 2010.

S. Som and P. Schniter, “Approximate message passing for recovery of sparse signals with Markov-random-field support structure.” Internat. Conf. Mach. Learning—Workshop on Structured Sparsity: Learning and Inference, (Bellevue, WA), July 2011.

S. Som and P. Schniter, “Compressive imaging using approximate message passing and a Markov-tree prior,” IEEE Trans. Signal Process., vol. 60, pp. 3439–3448, July 2012.

D. L. Donoho, I. M. Johnstone, and A. Montanari, “Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising,” IEEE Trans. Inform. Theory, vol. 59, June 2013.

C. A. Metzler, A. Maleki, and R. G. Baraniuk, “BM3D-AMP: A new image recovery algorithm based on BM3D denoising,” in Proc. IEEE Int. Conf. Image Process., pp. 3116–3120, 2015.

R. Berthier, A. Montanari, and P.-M. Nguyen, “State evolution for approximate message passing with non-separable functions,” Inform. Inference, 2019.

Page 49:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References III

C. Rush and R. Venkataramanan, “Finite-sample analysis of approximate message passing algorithms,” IEEE Trans. Inform. Theory, vol. 64, no. 11, pp. 7264–7286, 2018.

S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” IEEE Trans. Inform. Theory, to appear (see also arXiv:1610.03082).

A. M. Tulino, G. Caire, S. Verdu, and S. Shamai (Shitz), “Support recovery with sparsely sampled free random matrices,” IEEE Trans. Inform. Theory, vol. 59, pp. 4243–4271, July 2013.

B. Cakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate message passing for general matrix ensembles,” in Proc. Inform. Theory Workshop, pp. 192–196, 2014.

J. Vila, P. Schniter, S. Rangan, F. Krzakala, and L. Zdeborova, “Adaptive damping and mean removal for the generalized approximate message passing algorithm,” in Proc. IEEE Int. Conf. Acoust. Speech & Signal Process., pp. 2021–2025, 2015.

A. K. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, “Expectation consistent approximate inference: Generalizations and convergence,” in Proc. IEEE Int. Symp. Inform. Thy., pp. 190–194, 2016.

M. Opper and O. Winther, “Expectation consistent approximate inference,” J. Mach. Learn. Res., vol. 1, pp. 2177–2204, 2005.

Page 50:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References IV

T. Minka, A Family of Approximate Algorithms for Bayesian Inference. PhD thesis, Dept. Comp. Sci. Eng., MIT, Cambridge, MA, Jan. 2001.

S. Ramani, T. Blu, and M. Unser, “Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms,” IEEE Trans. Image Process., vol. 17, no. 9, pp. 1540–1554, 2008.

A. K. Fletcher, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Neural Inform. Process. Syst. Conf., pp. 7440–7449, 2018.

M. Borgerding, P. Schniter, and S. Rangan, “AMP-inspired deep networks for sparse linear inverse problems,” IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4293–4308, 2017.

C. Guo and M. E. Davies, “Near optimal compressed sensing without priors: Parametric SURE approximate message passing,” IEEE Trans. Signal Process., vol. 63, pp. 2130–2141, 2015.

U. Kamilov and H. Mansour, “Learning optimal nonlinearities for iterative thresholding algorithms,” IEEE Signal Process. Lett., vol. 23, pp. 747–751, May 2016.

Page 51:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References V

S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inform. Thy., pp. 2168–2172, Aug. 2011 (full version at arXiv:1010.5141).

A. Javanmard and A. Montanari, “State evolution for general approximate message passing algorithms, with applications to spatial coupling,” Inform. Inference, vol. 2, no. 2, pp. 115–144, 2013.

P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the generalized linear model,” in Proc. Asilomar Conf. Signals Syst. Comput., pp. 1525–1529, 2016.

A. K. Fletcher, S. Rangan, and P. Schniter, “Inference in deep networks in high dimensions,” in Proc. IEEE Int. Symp. Inform. Thy., 2018.

U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, “Approximate message passing with consistent parameter estimation and applications to sparse learning,” IEEE Trans. Inform. Theory, vol. 60, pp. 2969–2985, May 2014.

A. K. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, “Rigorous dynamics and consistent estimation in arbitrarily conditioned linear systems,” in Proc. Neural Inform. Process. Syst. Conf., pp. 2542–2551, 2017.

Page 52:

Extensions: GLMs, Parameter Learning, Bilinear Problems

References VI

A. Mousavi, A. Maleki, and R. G. Baraniuk, “Consistent parameter estimation for LASSO and approximate message passing,” Ann. Statist., vol. 45, no. 6, pp. 2427–2454, 2017.

A. Montanari and R. Venkataramanan, “Estimation of low-rank matrices via approximate message passing,” arXiv:1711.01682, 2017.

J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part I: Derivation,” IEEE Trans. Signal Process., vol. 62, pp. 5839–5853, Nov. 2014.

J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part II: Applications,” IEEE Trans. Signal Process., vol. 62, pp. 5854–5867, Nov. 2014.

J. T. Parker and P. Schniter, “Parametric bilinear generalized approximate message passing,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 4, pp. 795–808, 2016.

C. Schulke, P. Schniter, and L. Zdeborova, “Phase diagram of matrix compressed sensing,” Physical Rev. E, vol. 94, pp. 062136(1–16), Dec. 2016.
