
AUTOMATIC ESTIMATION OF REGULARIZATION PARAMETERS: INITIAL STEPS

Rosemary Renaut
http://math.asu.edu/~rosie

QUANTITATIVE SUSCEPTIBILITY MAPPING (QSM), JULY 27, 2013

Acknowledgements: Cornell MRI Laboratory
Yi Wang, Tian Liu, Pascal Spincemaille, Shaui Wang


Outline

Motivating Example 3D Data

Context

Regularization Parameter Estimation

Using the Noise Properties

Conclusions and Future

Theoretical Discussion



Motivating Example for QSM

Liu et al., NeuroImage 59, 2012 (2560-2568): Morphology enabled dipole inversion for quantitative susceptibility mapping using structural consistency between the magnitude image and the susceptibility map.

Tissue local magnetic field ($b$) obtained as convolution of dipole kernel ($A$) with susceptibility ($x$):

$$b \approx Ax$$

Least squares or image based formulation: solve for $x$

$$\|W_b(Ax - b)\|^2 + \frac{1}{\sigma^2}\|W_G(\nabla x)\|^2$$

$W_b$: weighting matrix for the noise on the data $b$
$W_G$: weighting matrix for the gradient $\nabla x$, dependent on noise level
$\sigma$ is the unknown regularization parameter









Goals

1. Develop approach to automatically estimate parameter $\sigma = 1/\sqrt{\lambda}$
2. Use validated parameter estimation techniques
3. Employ statistical information from the data
4. Efficient implementation
5. Extend to L1 regularization




Context: L2 regularization

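Solve ill-conditioned $Ax \approx b$

Standard Tikhonov, $L$ approximates a derivative operator

$$x(\lambda) = \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|_2^2 + \frac{\lambda^2}{2}\|Lx\|_2^2 \right\}$$

$x(\lambda)$ solves the normal equations provided $\mathrm{null}(L) \cap \mathrm{null}(A) = \{0\}$

$$(A^TA + \lambda^2 L^TL)\,x(\lambda) = A^Tb$$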

Multiple approaches exist for estimating the parameter $\lambda$. In this slide's notation the penalty weight is $\lambda^2 = 1/\sigma^2$, i.e. $\lambda = 1/\sigma$.
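The Tikhonov normal equations decouple when the operator is diagonal, as for a convolution kernel in the Fourier domain. The following is a minimal sketch under two assumptions that are not in the talk: $L = I$ and $A = \mathrm{diag}(\gamma_i)$. The names `tikhonov_diagonal`, `gamma` and `s` are illustrative only.

```python
def tikhonov_diagonal(gamma, s, lam):
    """Solve (A^T A + lam^2 L^T L) x = A^T b for A = diag(gamma), L = I, b = s.

    Each component decouples: (gamma_i^2 + lam^2) x_i = gamma_i * s_i.
    """
    return [g * si / (g * g + lam * lam) for g, si in zip(gamma, s)]


# Hypothetical data: a decaying spectrum, so the problem is ill-conditioned.
gamma = [1.0, 0.1, 0.001]
s = [1.0, 0.1, 0.001]  # exact data for x_true = (1, 1, 1)

x_small = tikhonov_diagonal(gamma, s, lam=1e-6)  # almost no regularization
x_large = tikhonov_diagonal(gamma, s, lam=1.0)   # heavy damping of all components
```

With noise-free data a tiny `lam` recovers the true solution, while a large `lam` damps every component, most strongly those with small `gamma`. Choosing between these extremes when the data are noisy is the parameter estimation problem.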







Some Methods: assume variance $\tau^2$ in weighted $W_b b$

Morozov discrepancy principle: smooths; is a $\chi^2$ test on the residual (residual discrepancy, de Rochefort)

$$\|W_b(Ax(\sigma) - b)\|^2 \approx \tau^2$$

L-curve: well known; find the corner of the $(x, y)$ plot

$$\left(\log\|W_b(Ax(\sigma) - b)\|^2,\ \log\|Lx\|^2\right)$$

Generalized Cross Validation (GCV): minimization of

$$\frac{\|W_b(Ax(\sigma) - b)\|^2}{\left[\mathrm{Tr}\left(I_m - (A^TW_bA + \tfrac{1}{\sigma^2}L^TL)^{-1}A^TW_bA\right)\right]^2}$$

Unbiased Predictive Risk Estimation (UPRE): minimization of

$$\|W_b(Ax(\sigma) - b)\|^2 - 2\tau^2\left(m - \mathrm{Tr}\left((A^TW_bA + \tfrac{1}{\sigma^2}L^TL)^{-1}A^TW_bA\right)\right)$$

$\chi^2$ method: based on the noise distribution in the data.
And others, e.g. Residual Periodogram ...
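For a diagonal problem the residual and trace terms in these functionals have closed forms, which makes the difference between root finding and minimization concrete. The sketch below assumes $A = \mathrm{diag}(\gamma_i)$ with $m = n$, $L = I$, and whitened data; GCV is written in its standard form with a squared trace in the denominator. All function names and the sample data are hypothetical.

```python
def filter_terms(gamma, sigma):
    """Return 1 / (1 + sigma^2 gamma_i^2), the factor left on each residual component."""
    return [1.0 / (1.0 + (sigma * g) ** 2) for g in gamma]


def residual_norm2(gamma, s, sigma):
    """||A x(sigma) - b||^2 for A = diag(gamma), L = I and data s."""
    return sum((c * si) ** 2 for c, si in zip(filter_terms(gamma, sigma), s))


def trace_term(gamma, sigma):
    """m - Tr((A^T A + sigma^-2 I)^-1 A^T A), which is the sum of the filter terms."""
    return sum(filter_terms(gamma, sigma))


def gcv(gamma, s, sigma):
    return residual_norm2(gamma, s, sigma) / trace_term(gamma, sigma) ** 2


def upre(gamma, s, sigma, tau2=1.0):
    return residual_norm2(gamma, s, sigma) - 2.0 * tau2 * trace_term(gamma, sigma)


def argmin_on_grid(func, gamma, s, grid):
    """Brute-force sweep over candidate parameters."""
    return min(grid, key=lambda sigma: func(gamma, s, sigma))


# Hypothetical example data.
gamma = [2.0, 1.0, 0.5, 0.1]
s = [3.0, -1.0, 0.8, 1.1]
grid = [0.1 * 2 ** k for k in range(10)]

sigma_gcv = argmin_on_grid(gcv, gamma, s, grid)
sigma_upre = argmin_on_grid(upre, gamma, s, grid)
```

The sweep evaluates the functional at every grid point, which is why minimization-based methods need many candidate parameters, whereas the discrepancy and $\chi^2$ conditions can be solved as a single scalar root-finding problem.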






Some characteristics of the methods

Method       Idea  Many λ  Algorithm     Statistical  Unique
Discrepancy  Easy  No      Root finding  Yes          Yes
L-curve      Easy  Yes     Spline        No           No
GCV          Hard  Yes     Minimum       Yes          No
UPRE         Hard  Yes     Minimum       Yes          No
$\chi^2$     Ok    No      Root finding  Yes          Yes

1. In particular $\chi^2$ and UPRE rely on provision of statistics of the noise distribution
2. UPRE and GCV require a matrix trace estimation, which is expensive
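One common way to reduce the cost of the trace is a randomized estimator that only needs matrix-vector products. The sketch below uses Hutchinson's estimator with random $\pm 1$ probe vectors; this is a standard technique chosen here for illustration, and the talk does not say which trace estimator was used. The names `hutchinson_trace` and `matvec` are hypothetical.

```python
import random


def hutchinson_trace(matvec, n, num_probes=32, seed=0):
    """Estimate Tr(M) as the average of z^T M z over random +/-1 vectors z.

    `matvec(z)` must return M z as a list; M itself is never formed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Mz = matvec(z)
        total += sum(zi * mi for zi, mi in zip(z, Mz))
    return total / num_probes


# Hypothetical example: the influence matrix of a diagonal problem,
# M = diag(sigma^2 gamma_i^2 / (1 + sigma^2 gamma_i^2)).
gamma = [2.0, 1.0, 0.5, 0.1]
sigma = 1.5
phi = [(sigma * g) ** 2 / (1.0 + (sigma * g) ** 2) for g in gamma]

estimate = hutchinson_trace(lambda z: [p * zi for p, zi in zip(phi, z)], n=len(gamma))
exact = sum(phi)
```

For a diagonal matrix every probe returns the exact trace, so this example only checks the plumbing. For a general operator the estimate is unbiased and its accuracy improves with `num_probes`, each probe costing one application of the operator.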




Weighting for the noise - assume noise η in b

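Suppose $\eta \sim (0, C_b)$, i.e. $C_b$ is the covariance of the noise in $b$.
$C_b$ is SPD: hence $C_b = (C_b^{1/2})^2$ and is invertible.

Multiplying by $W_b^{1/2} = (C_b^{1/2})^{-1}$ whitens the noise in $b$:

$$W_b^{1/2}(A\hat{x} - b) = \bar{\eta}, \quad \text{where } \bar{\eta} \sim \left(0,\ W_b^{1/2} C_b (W_b^{1/2})^T\right) = (0, I_m)$$

i.e. we have the weighted form ($\|A\|_W^2 = A^TWA$)

$$x(\sigma) = \arg\min_x \left\{ \|Ax - b\|_{W_b}^2 + \frac{1}{\sigma^2}\|x\|^2 \right\}$$

More generally: $W_x = \frac{1}{\sigma^2} I$ and augmented residual $r(\sigma)$

$$x(W_x) = \arg\min_x \left\| \begin{pmatrix} W_b^{1/2} A \\ W_x^{1/2} \end{pmatrix} x - \begin{pmatrix} W_b^{1/2} b \\ 0_n \end{pmatrix} \right\|^2 := \arg\min_x \|r(\sigma)\|^2$$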
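The whitening and augmentation steps can be sketched for the simplest case of uncorrelated noise, where $C_b = \mathrm{diag}(\mathrm{std}_i^2)$ and $W_b^{1/2}$ is a row scaling. The diagonal covariance, the dense list-of-rows storage, and the names `whiten` and `augment` are assumptions made for illustration.

```python
def whiten(rows, rhs, noise_std):
    """Apply W_b^{1/2} = C_b^{-1/2} for C_b = diag(std_i^2): scale equation i by 1/std_i."""
    A_w = [[a / sd for a in row] for row, sd in zip(rows, noise_std)]
    b_w = [bi / sd for bi, sd in zip(rhs, noise_std)]
    return A_w, b_w


def augment(A_w, b_w, sigma):
    """Stack W_x^{1/2} = (1/sigma) I under the whitened system, with zeros on the right."""
    n = len(A_w[0])
    reg = [[(1.0 / sigma if i == j else 0.0) for j in range(n)] for i in range(n)]
    return A_w + reg, b_w + [0.0] * n


# Hypothetical 2x2 system with noise standard deviations 0.5 and 2.0.
A_w, b_w = whiten([[1.0, 2.0], [3.0, 4.0]], [1.0, 2.0], [0.5, 2.0])
A_aug, b_aug = augment(A_w, b_w, sigma=2.0)
```

Any ordinary least squares solver applied to `(A_aug, b_aug)` then returns $x(W_x)$, and the squared norm of its residual is the augmented functional whose statistics are used to select $\sigma$.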









Statistical Properties of the Augmented Regularized Residual

For a given solution

$$x(W_x) = W_x^{-1}A^T\left(AW_x^{-1}A^T + W_b^{-1}\right)^{-1} b$$

the augmented residual is

$$J(W_x) = b^T\left(AW_x^{-1}A^T + W_b^{-1}\right)^{-1} b = \|r(W_x)\|^2$$

Lemma (Distribution of the Cost Functional)

If $W_b$ and $W_x$ have been chosen appropriately, the functional $J$ is a random variable which follows a $\chi^2$ distribution with $m$ degrees of freedom:

$$J(W_x) \sim \chi^2(m), \qquad E(J(x(W_x))) = m, \qquad \mathrm{Var}(J) = 2m$$
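The lemma can be checked empirically in a small Monte Carlo experiment. The sketch below interprets "chosen appropriately" as an assumption: $x \sim N(0, W_x^{-1})$ with $W_x = \sigma^{-2} I$, whitened noise ($W_b = I$), and a diagonal operator $A = \mathrm{diag}(\gamma_i)$, so that $J = \sum_i b_i^2 / (1 + \sigma^2\gamma_i^2)$. The function name `sample_J` and the chosen spectrum are hypothetical.

```python
import random


def sample_J(gamma, sigma, rng):
    """Draw one realization of J = b^T (A W_x^{-1} A^T + W_b^{-1})^{-1} b
    for A = diag(gamma), W_x = sigma^-2 I and unit-variance noise."""
    J = 0.0
    for g in gamma:
        x = rng.gauss(0.0, sigma)        # assumed prior: x_i ~ N(0, sigma^2)
        b = g * x + rng.gauss(0.0, 1.0)  # data with whitened noise
        J += b * b / (1.0 + (sigma * g) ** 2)
    return J


rng = random.Random(1)
m = 50
gamma = [1.0 / (1 + i) for i in range(m)]
draws = [sample_J(gamma, sigma=2.0, rng=rng) for _ in range(4000)]

mean_J = sum(draws) / len(draws)
var_J = sum((d - mean_J) ** 2 for d in draws) / (len(draws) - 1)
```

Under these assumptions `mean_J` should be close to $m$ and `var_J` close to $2m$. If $\sigma$ in the weighting does not match the variance used to generate $x$, the sample mean moves away from $m$, which is the signal the $\chi^2$ method uses.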




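$\chi^2$ method to find the parameter (Mead and Renaut)

Find $W_x = \sigma^{-2} I$ such that

$$m - \sqrt{2m}\,z_{\alpha/2} < b^T\left(AW_x^{-1}A^T + W_b^{-1}\right)^{-1} b < m + \sqrt{2m}\,z_{\alpha/2}$$

Using the SVD $W_b^{1/2}A = U\Sigma V^T$, let $s = U^TW_b^{1/2}b$ and solve

$$F(\sigma) = s^T \mathrm{diag}\left(\frac{1}{1 + \sigma^2\sigma_i^2}\right) s - m = 0.$$

Spectral decompositions $A = G^*\Lambda G$: $s = G\tilde{b} = \hat{\tilde{b}}$, $\Lambda = \mathrm{diag}(\sigma_i)$

Large scale: implement using CG or other projected methods with mapped regularization $L$

$$\sigma^{(k+1)} = \sigma^{(k)}\left(1 + \alpha^{(k)}\,\frac{1}{2}\left(\frac{\sigma^{(k)}}{\|Lx(\sigma^{(k)})\|}\right)^2 \left(J(\sigma^{(k)}) - \tilde{m}\right)\right)$$

$\tilde{m}$: degrees of freedom in the residual. $\alpha$: a line search parameter.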
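The scalar equation $F(\sigma) = 0$ can be solved with a few Newton steps once the spectral coefficients are available. The sketch below is a variant, not the update from the slide: it applies Newton's method to $t = \sigma^2$, because $F$ is convex and decreasing in $t$, so the iteration started at $t = 0$ increases monotonically to the root without a line search. The name `chi2_sigma`, the argument `gamma` (standing for the singular values $\sigma_i$), and the test data are hypothetical.

```python
import math


def chi2_sigma(gamma, s, m=None, tol=1e-12, max_iter=200):
    """Solve F(sigma) = sum_i s_i^2 / (1 + sigma^2 gamma_i^2) - m = 0
    by Newton's method in t = sigma^2, starting from t = 0."""
    m = len(s) if m is None else m
    if sum(si * si for si in s) <= m:
        raise ValueError("no positive root: ||s||^2 does not exceed m")
    t = 0.0
    for _ in range(max_iter):
        F = sum(si * si / (1.0 + t * g * g) for g, si in zip(gamma, s)) - m
        if abs(F) <= tol * m:
            break
        dF = -sum(si * si * g * g / (1.0 + t * g * g) ** 2 for g, si in zip(gamma, s))
        t -= F / dF
    return math.sqrt(t)


# Hypothetical data constructed so that every s_i^2 equals its expected value
# 1 + sigma_true^2 gamma_i^2, which places the root exactly at sigma_true.
sigma_true = 2.0
gamma = [10.0 / (1 + i) ** 2 for i in range(40)]
s = [math.sqrt(1.0 + (sigma_true * g) ** 2) for g in gamma]

sigma_hat = chi2_sigma(gamma, s)
```

Each Newton step costs one pass over the spectral coefficients, compared with a full solve per candidate parameter in a sweep. With real noisy data the root differs from the generating value, and the confidence interval $m \pm \sqrt{2m}\,z_{\alpha/2}$ gives a natural stopping tolerance.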






Some Results: Simulated data with 10% colored noise - no masks

Figure: Estimates obtained automatically by the $\chi^2$ method (above) and the optimal estimates obtained by sweeping through 50 choices (below)


Example for Dipole Inversion: The SNR estimates

Figure: Estimates obtained automatically by the $\chi^2$ method, as compared to the optimum. SNR $= 10\log_{10}\left(\|x_{\mathrm{true}}\|^2 / \|x_{\mathrm{true}} - x\|^2\right)$. All image based methods, using CG


Computational Costs

Costs in seconds are for the $\chi^2$ and optimal search

$\chi^2$  Opt   $\chi^2$  Opt   $\chi^2$  Opt   $\chi^2$  Opt
57        526   131       578   83        619   167       683

The ratio for the cost increase of searching optimally:

9.29  4.41  7.44  4.09

Clear dependence on model of regularization and weighting.
$\chi^2$ finds the optimal parameter at reduced cost

Remarks

Noise distribution must be known
Parameters must be tuned relating to $W_G$, $W_b$ and the truncation for the dipole (see talk of Karin Shmueli; still relevant for the forward operation with regularization)







Phantom Data

Figure: Estimates obtained automatically by the $\chi^2$ method (left) and the optimal estimates obtained by sweeping through 50 choices (right)


Example comparing SNR estimates by k-space method: Simulation

Figure: Notice good estimates and minimal cost using $\chi^2$ applied to k-space data


Results in Fourier domain contaminated by aliasing/artifacts

Figure: Important to correctly identify noise levels and truncation for the dipole convolution


Observations / Conclusions

1. $\chi^2$ successfully applies for 3D inversion with noise information
2. $\chi^2$ has potential to steer toward optimal parameters
3. There are a number of theoretical results justifying the approach.
4. Still needs to be refined for use in spectral domain (to include gradients)
5. Efficient implementations require consideration of better Krylov methods.
6. Suggests use of $\chi^2$ in other formulations, e.g. L1










Extending for L1 using Augmented Lagrangian: Simple Example (here with UPRE)


Theoretical Results: Relating UPRE and $\chi^2$

1. UPRE is designed to minimize the bias in the solution
2. UPRE requires Trace operator (can be optimized)

Lemma (Connecting UPRE and $\chi^2$)

The $\sigma$ solving the $\chi^2$ functional provides a local minimum of the UPRE functional.

Proof: GSVD expansion for operators.

Lemma (Convergence with increasing resolution by $\chi^2$)

Suppose the kernel is square integrable. Then $\sigma^{(m)}_{\chi^2}$, as a function of the number of equations, converges with increasing $m$.

Remark
Both results assist in justification of use of the augmented discrepancy principle. Also, for certain kernels we may search extensively at low resolution.












Extending for L1 regularization

Finding the optimal parameter for the Tikhonov problem is a first step in Split Bregman (Goldstein and Osher, 2009)

Introduce $d \approx Lx$ and let $R(x) = \frac{1}{2\sigma^2}\|d - Lx\|_2^2 + \mu\|d\|_1$

$$(x, d)(\sigma, \mu) = \arg\min_{x,d} \left\{ \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2\sigma^2}\|d - Lx\|_2^2 + \mu\|d\|_1 \right\}$$

Alternating minimization separates steps for $d$ from $x$

Various versions of the iteration can be defined. Fundamentally:

$$S1: \quad x^{(k+1)} = \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2\sigma^2}\|Lx - (d^{(k)} - g^{(k)})\|_2^2 \right\}$$

$$S2: \quad d^{(k+1)} = \arg\min_d \left\{ \frac{1}{2\sigma^2}\|d - (Lx^{(k+1)} + g^{(k)})\|_2^2 + \mu\|d\|_1 \right\}$$

$$S3: \quad g^{(k+1)} = g^{(k)} + Lx^{(k+1)} - d^{(k+1)}.$$

Notice dimension increase of the problem
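The three steps can be traced on a toy problem where each has a closed form. The sketch below assumes $A = I$ and $L = I$, so the problem reduces to componentwise L1 denoising, whose exact solution is soft-thresholding of $b$ with threshold $\mu$. The names `shrink` and `split_bregman_denoise` are illustrative, and S1 uses $d^{(k)}$ from the previous sweep.

```python
def shrink(v, threshold):
    """Soft-thresholding: the closed-form minimizer of step S2 for one component."""
    if v > threshold:
        return v - threshold
    if v < -threshold:
        return v + threshold
    return 0.0


def split_bregman_denoise(b, mu, sigma=1.0, iters=60):
    """Iterate S1-S3 for the toy case A = I, L = I (componentwise L1 denoising)."""
    n = len(b)
    x, d, g = [0.0] * n, [0.0] * n, [0.0] * n
    w = 1.0 / sigma ** 2
    for _ in range(iters):
        # S1: minimize 1/2 (x - b)^2 + w/2 (x - (d - g))^2 for each component.
        x = [(bi + w * (di - gi)) / (1.0 + w) for bi, di, gi in zip(b, d, g)]
        # S2: soft-threshold L x + g with threshold mu * sigma^2.
        d = [shrink(xi + gi, mu * sigma ** 2) for xi, gi in zip(x, g)]
        # S3: update the Bregman variable.
        g = [gi + xi - di for gi, xi, di in zip(g, x, d)]
    return x


x_hat = split_bregman_denoise([3.0, 0.5, -2.0], mu=1.0)
expected = [shrink(bi, 1.0) for bi in [3.0, 0.5, -2.0]]
```

In this toy case the iterates approach `expected`, i.e. the soft-thresholded data. For a general $A$ and $L$ only S1 changes: it becomes a Tikhonov-type linear solve, which is where the choice of $\sigma$ enters.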






Focus: Tikhonov Step of the Algorithm


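$$x^{(k+1)} = \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2\sigma^2}\|Lx - (d^{(k)} - g^{(k)})\|_2^2 \right\}$$

Update for $x$: Introduce

$$h^{(k)} = d^{(k)} - g^{(k)}.$$

Then

$$x^{(k+1)} = \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2\sigma^2}\|Lx - h^{(k)}\|_2^2 \right\}.$$

Standard least squares update using a Tikhonov regularizer.
Depends on changing right hand side.
Also depends on parameter $\sigma$.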
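Setting the gradient of the $x$-update objective to zero gives its normal equations, which makes the role of the changing right hand side explicit. This is a worked form under the same null-space condition as for standard Tikhonov, $\mathrm{null}(L) \cap \mathrm{null}(A) = \{0\}$:

```latex
\[
\left(A^TA + \tfrac{1}{\sigma^2}L^TL\right) x^{(k+1)}
  = A^Tb + \tfrac{1}{\sigma^2}L^T h^{(k)} .
\]
```

For fixed $\sigma$ only the right hand side changes with $k$, so a factorization or preconditioner for the operator on the left can be reused across iterations. Setting $h^{(k)} = 0$ recovers the standard Tikhonov normal equations with $\lambda^2 = 1/\sigma^2$.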






Theoretical results: using Unbiased Predictive Risk for the SB Tik

Lemma
Suppose the noise in $h^{(k)}$ is stochastic, and inverse Gaussian covariance weighting is applied to both the data fit $Ax \approx b$ and the derivative $Lx \approx h$, for $b$ and $h$; then the optimal choice for $\sigma$ at all steps is $\sigma = 1$. Otherwise $h^{(k+1)}$ is deterministic and $\sigma$ changes with iteration.

Remark
Can we expect that $h^{(k)}$ is stochastic?

Remark
Because $h$ changes, the optimal choice for $\sigma$ changes with each iteration, converging as $h$ converges.





