Gradient Descent with Sparsification: An iterative algorithm for sparse recovery with restricted isometry property

Rahul Garg ([email protected]), Rohit Khandekar ([email protected])
IBM T. J. Watson Research Center, New York

Keywords: sparse regression, compressed sensing, gradient descent

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We present an algorithm for finding an $s$-sparse vector $x$ that minimizes the square error $\|y - \Phi x\|^2$ where $\Phi$ satisfies the restricted isometry property (RIP), with isometry constant $\delta_{2s} < 1/3$. Our algorithm, called GraDeS (Gradient Descent with Sparsification), iteratively updates $x$ as
\[ x \leftarrow H_s\Big(x + \frac{1}{\gamma}\,\Phi^\top (y - \Phi x)\Big) \]
where $\gamma > 1$ and $H_s$ sets all but the $s$ largest-magnitude coordinates to zero. GraDeS converges to the correct solution in a constant number of iterations. The condition $\delta_{2s} < 1/3$ is the most general for which a near-linear-time algorithm is known. In comparison, the best condition under which a polynomial-time algorithm is known is $\delta_{2s} < \sqrt{2} - 1$. Our Matlab implementation of GraDeS outperforms previously proposed algorithms such as Subspace Pursuit, StOMP, OMP, and Lasso by an order of magnitude. Curiously, our experiments also uncovered cases where L1-regularized regression (Lasso) fails but GraDeS finds the correct solution.

1. Introduction

Finding a sparse solution to a system of linear equations has been an important problem in multiple domains such as model selection in statistics and machine learning (Golub & Loan, 1996; Efron et al., 2004; Wainwright et al., 2006; Ranzato et al., 2007), sparse principal component analysis (Zou et al., 2006), image deconvolution and de-noising (Figueiredo & Nowak, 2005), and compressed sensing (Candès & Wakin, 2008). The recent results in the area of compressed sensing, especially those relating to the properties of random matrices (Candès & Tao, 2006; Candès et al., 2006), have greatly increased interest in this area, which is finding applications in diverse domains such as coding and information theory, signal processing, artificial intelligence, and imaging. Due to these developments, efficient algorithms for finding sparse solutions are becoming increasingly important.

Consider a system of linear equations of the form
\[ y = \Phi x \qquad (1) \]
where $y \in \mathbb{R}^m$ is an $m$-dimensional vector of "measurements", $x \in \mathbb{R}^n$ is the unknown signal to be reconstructed, and $\Phi \in \mathbb{R}^{m \times n}$ is the measurement matrix. The signal $x$ is represented in a suitable (possibly overcomplete) basis and is assumed to be "$s$-sparse" (i.e., at most $s$ out of $n$ components of $x$ are non-zero). The sparse reconstruction problem is
\[ \min_{\tilde{x} \in \mathbb{R}^n} \|\tilde{x}\|_0 \quad \text{subject to} \quad y = \Phi \tilde{x} \qquad (2) \]
where $\|\tilde{x}\|_0$ denotes the number of non-zero entries in $\tilde{x}$. This problem is not only NP-hard (Natarajan, 1995), but also hard to approximate within a factor $O(2^{\log^{1-\epsilon}(m)})$ of the optimal solution (Neylon, 2006).

RIP and its implications. The above problem becomes computationally tractable if the matrix $\Phi$ satisfies a restricted isometry property. Define the isometry constant of $\Phi$ as the smallest number $\delta_s$ such that the following holds for all $s$-sparse vectors $x \in \mathbb{R}^n$:
\[ (1 - \delta_s)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_s)\|x\|_2^2 \qquad (3) \]
It has been shown (Candès & Tao, 2005; Candès, 2008) that if $y = \Phi x^*$ for some $s$-sparse vector $x^*$ and $\delta_{2s} < \sqrt{2} - 1$, then $x^*$ is the unique solution of the convex program
\[ \min_{\tilde{x} \in \mathbb{R}^n} \|\tilde{x}\|_1 \quad \text{subject to} \quad y = \Phi \tilde{x} \qquad (4) \]
which can be solved in polynomial time.
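The update rule stated in the abstract is short enough to sketch directly. The following Matlab fragment is an illustrative sketch of that rule only, not the authors' released implementation; the names grades_sketch and hard_threshold, the fixed iteration budget, and the tolerance-based stopping test are our own assumptions.

```matlab
% Illustrative sketch of the GraDeS update x <- H_s(x + (1/gamma)*Phi'*(y - Phi*x)).
% Not the authors' implementation; names and the stopping rule are assumptions.
function x = grades_sketch(Phi, y, s, gamma, maxIter, tol)
    x = zeros(size(Phi, 2), 1);              % start from the all-zero vector
    for t = 1:maxIter
        r = y - Phi * x;                     % current residual
        if norm(r)^2 <= tol                  % potential ||y - Phi*x||^2 small enough
            break;
        end
        x = hard_threshold(x + (Phi' * r) / gamma, s);
    end
end

function v = hard_threshold(v, s)
    % H_s: keep the s largest-magnitude coordinates, zero out the rest
    [~, order] = sort(abs(v), 'descend');
    v(order(s+1:end)) = 0;
end
```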


2.1. Proof of Theorem 2.1

Suppose $y = \Phi x^*$ for an $s$-sparse vector $x^*$, and let $\Psi(x) = \|y - \Phi x\|^2$ denote the potential of the current iterate $x$. Let $x' = H_s\big(x + \frac{1}{\gamma}\Phi^\top(y - \Phi x)\big)$ be the next iterate, and let $\Delta x = x' - x$ and $\Delta x^* = x^* - x$. Since $\Delta x$ is $2s$-sparse,

\[ \Psi(x') - \Psi(x) = \|y - \Phi x'\|^2 - \|y - \Phi x\|^2 = -2(y - \Phi x)^\top \Phi\,\Delta x + \|\Phi\,\Delta x\|^2 \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x + (1 + \delta_{2s})\|\Delta x\|^2 \qquad (5) \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x + \gamma\|\Delta x\|^2 \qquad (6) \]

The inequality (5) follows from RIP; inequality (6) holds since $\gamma \ge 1 + \delta_{2s}$. Let $g = -2\Phi^\top(y - \Phi x)$. Now we prove an important component of our analysis.

Lemma 2.4 The vector $\Delta x$ achieves the minimum of $F(v) = g^\top v + \gamma\|v\|^2$ over all vectors $v$ such that $x + v$ is $s$-sparse, i.e.,

\[ \Delta x = \operatorname*{argmin}_{v \,:\, x + v \text{ is } s\text{-sparse}} \big( g^\top v + \gamma\|v\|^2 \big). \]


Proof: Consider any vector $v$ such that $x + v$ is $s$-sparse. Let $S_O$ be the coordinates of $\operatorname{supp}(x)$ that are zeroed out (so $v_i = -x_i$ for $i \in S_O$), let $S_I$ be the coordinates of $\operatorname{supp}(x)$ that remain in $\operatorname{supp}(x+v)$, and let $S_N$ be the new coordinates added to the support. For a fixed choice of these sets, $F(v)$ is minimized by setting $v_i = -g_i/(2\gamma)$ for every $i \in S_I \cup S_N$. Hence

\[ F(v) = \sum_{i \in S_O} \big( -g_i x_i + \gamma x_i^2 \big) + \sum_{i \in S_I \cup S_N} \Big( -\frac{g_i^2}{4\gamma} \Big) \]
\[ = \sum_{i \in S_O} \Big( -g_i x_i + \gamma x_i^2 + \frac{g_i^2}{4\gamma} \Big) - \sum_{i \in S_O \cup S_I \cup S_N} \frac{g_i^2}{4\gamma} \]
\[ = \sum_{i \in S_O} \gamma\Big( x_i - \frac{g_i}{2\gamma} \Big)^2 - \sum_{i \in S_O \cup S_I \cup S_N} \frac{g_i^2}{4\gamma} \]
\[ = \sum_{i \in S_O} \gamma\Big( x_i - \frac{g_i}{2\gamma} \Big)^2 - \sum_{i \in S_O \cup S_I \cup S_N} \gamma\Big( \frac{g_i}{2\gamma} \Big)^2 \]

Thus, in order to minimize $F(v)$, we have to pick the coordinates $i \in S_O$ to be those with the least value of $|x_i - g_i/(2\gamma)|$ and pick the coordinates $j \in S_N$ to be those with the largest value of $|g_j/(2\gamma)| = |x_j - g_j/(2\gamma)|$. Note that $-g_i/(2\gamma) = \frac{1}{\gamma}\big[\Phi^\top(y - \Phi x)\big]_i$, which is the expression used in the while loop of Algorithm 1. Hence the proof is complete from the definition of $H_s$.
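As a quick sanity check on Lemma 2.4, the following Matlab snippet compares the hard-thresholding choice against a brute-force search over all supports of size $s$ on a tiny instance. The sizes, the hard-coded iterate x, and the random vector g (standing in for $-2\Phi^\top(y - \Phi x)$) are illustrative assumptions of ours, not values from the paper.

```matlab
% Brute-force check of Lemma 2.4 on a tiny instance (sizes chosen for illustration).
n = 6; s = 2; gamma = 4/3;
x = zeros(n, 1); x([2 5]) = [1.3; -0.7];      % current s-sparse iterate
g = randn(n, 1);                              % plays the role of -2*Phi'*(y - Phi*x)
F = @(v) g' * v + gamma * (v' * v);

target = x - g / (2 * gamma);                 % equals x + (1/gamma)*Phi'*(y - Phi*x)
[~, order] = sort(abs(target), 'descend');
zH = zeros(n, 1); zH(order(1:s)) = target(order(1:s));    % the H_s step

bestF = inf;
for support = nchoosek(1:n, s)'               % every support of size s
    z = zeros(n, 1); z(support) = target(support);        % optimal z on this support
    bestF = min(bestF, F(z - x));
end
fprintf('F at H_s step: %.6f, brute-force minimum: %.6f\n', F(zH - x), bestF);
```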

From inequality (6), Lemma 2.4, and the fact that $x^* = x + \Delta x^*$ is $s$-sparse, we get

\[ \Psi(x') - \Psi(x) \le -2(y - \Phi x)^\top \Phi\,\Delta x^* + \gamma\|\Delta x^*\|^2 \]
\[ = -2(y - \Phi x)^\top \Phi\,\Delta x^* + (1 - \delta_{2s})\|\Delta x^*\|^2 + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x^* + (\Delta x^*)^\top \Phi^\top \Phi\,\Delta x^* + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \qquad (7) \]
\[ = \Psi(x^*) - \Psi(x) + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \]
\[ \le -\Psi(x) + \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\,\|\Phi\,\Delta x^*\|^2 \qquad (8) \]
\[ \le -\Psi(x) + \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\,\Psi(x) \qquad (9) \]

The inequalities (7) and (8) follow from RIP, while (9) follows from the fact that $\Phi\,\Delta x^* = y - \Phi x$ and hence $\|\Phi\,\Delta x^*\|^2 = \Psi(x)$; note also that $\Psi(x^*) = 0$ since $y = \Phi x^*$. Thus we get $\Psi(x') \le \Psi(x)\cdot\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}$. Now note that for $\gamma = 1 + \delta_{2s}$ (and likewise for $\gamma = 4/3$), the condition $\delta_{2s} < 1/3$ implies that $\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}} < 1$, and hence the potential decreases by a multiplicative factor in each iteration. If we start with $x = 0$, the initial potential is $\|y\|^2$. Thus after

\[ \left\lceil \frac{1}{\log\big( (1 - \delta_{2s}) / (\gamma - 1 + \delta_{2s}) \big)} \, \log\Big( \frac{\|y\|^2}{\epsilon} \Big) \right\rceil \]

iterations, the potential becomes less than $\epsilon$. Setting $\gamma = 1 + \delta_{2s}$, we get the desired bounds.

If the value of $\delta_{2s}$ is not known, by setting $\gamma = 4/3$, the same algorithm computes the above solution in

\[ \left\lceil \frac{1}{\log\big( (1 - \delta_{2s}) / (1/3 + \delta_{2s}) \big)} \, \log\Big( \frac{\|y\|^2}{\epsilon} \Big) \right\rceil \]

iterations.
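To make the two iteration bounds above concrete, the short calculation below evaluates them for assumed values of $\delta_{2s}$, $\|y\|^2$ and $\epsilon$ (our choices, not measurements from the paper).

```matlab
% Worked example of the iteration bounds (all numbers are assumed, not measured).
delta2s = 0.2; eps_acc = 1e-6; normy2 = 1.0;

gamma1 = 1 + delta2s;                          % choice of Theorem 2.1
iters1 = ceil(log(normy2 / eps_acc) / log((1 - delta2s) / (gamma1 - 1 + delta2s)));

gamma2 = 4/3;                                  % choice when delta2s is unknown
iters2 = ceil(log(normy2 / eps_acc) / log((1 - delta2s) / (1/3 + delta2s)));

fprintf('gamma = 1+delta_2s: %d iterations, gamma = 4/3: %d iterations\n', iters1, iters2);
```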

    2.2. Proof of Theorem 2.3

The recovery under noise corresponds to the case where there exists an $s$-sparse vector $x^*$ such that $y = \Phi x^* + e$ for an error vector $e$. Assuming $\Psi(x) > C^2\|e\|^2$, let $x' = H_s\big(x + \frac{1}{\gamma}\Phi^\top(y - \Phi x)\big)$ be the solution computed at the end of an iteration, and let $\Delta x = x' - x$ and $\Delta x^* = x^* - x$. The initial part of the analysis is the same as that in Section 2.1. Using (10), we get

\[ \|\Phi\,\Delta x^*\|^2 = \|y - \Phi x - e\|^2 = \Psi(x) - 2 e^\top (y - \Phi x) + \|e\|^2 \]
\[ \le \Psi(x) + \frac{2}{C}\|y - \Phi x\|^2 + \frac{1}{C^2}\Psi(x) = \Big( 1 + \frac{1}{C} \Big)^2 \Psi(x). \]

    Using the above inequality in inequality (8), we get

\[ \Psi(x') \le \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}} \Big( 1 + \frac{1}{C} \Big)^2 \Psi(x). \]

Our assumption $\delta_{2s} < \frac{C^2}{3C^2 + 4C + 2}$ implies that $\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\big( 1 + \frac{1}{C} \big)^2 = \frac{2\delta_{2s}}{1 - \delta_{2s}}\big( 1 + \frac{1}{C} \big)^2 < 1$. This implies that as long as $\Psi(x) > C^2\|e\|^2$, the potential $\Psi(x)$ decreases by a multiplicative factor. Lemma 2.5 thus follows.
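The following self-contained Matlab sketch illustrates the noisy setting analyzed in this subsection: it runs the same hard-thresholding iteration on a small random instance and prints how the final potential $\|y - \Phi x\|^2$ compares with $\|e\|^2$. The problem sizes, noise level, seed and iteration count are our illustrative assumptions, not values from the paper.

```matlab
% Illustrative noisy-recovery run (all parameters are assumptions for the sketch).
rng(0);
m = 300; n = 600; s = 5; gamma = 4/3;
Phi = randn(m, n) / sqrt(m);                   % columns have roughly unit norm
xstar = zeros(n, 1); xstar(randperm(n, s)) = randn(s, 1);
e = 0.01 * randn(m, 1);                        % small measurement noise
y = Phi * xstar + e;

x = zeros(n, 1);
for t = 1:50
    v = x + (Phi' * (y - Phi * x)) / gamma;    % gradient step
    [~, order] = sort(abs(v), 'descend');      % hard threshold H_s
    x = zeros(n, 1); x(order(1:s)) = v(order(1:s));
end
fprintf('Psi(x) = %.3e, ||e||^2 = %.3e, ||x - xstar|| = %.3e\n', ...
        norm(y - Phi * x)^2, norm(e)^2, norm(x - xstar));
```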


Run times (sec) for m = 4000-8000 (here n = 8000 and s = 500):

m      Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
4000   66.28        51.36   23.29   23.72              10.06          3.78
5000   69.54        62.13   29.45   22.71              10.41          3.71
6000   77.64        74.73   35.14   25.90              11.44          4.27
7000   89.99        86.84   41.97   30.08              12.24          4.23
8000   101.07       98.47   45.18   34.31              13.53          4.82


Figure 1. Dependence of the running times of different algorithms on the number of rows (m). Here the number of columns n = 8000 and sparsity s = 500.

    3. Implementation Results

To evaluate the usefulness of GraDeS in practice, it was compared with the LARS-based implementation of Lasso (Efron et al., 2004) (referred to as Lasso/LARS), Orthogonal Matching Pursuit (OMP), Stagewise OMP (StOMP) and Subspace Pursuit. The algorithms were selected to span the spectrum: at one extreme, Lasso/LARS, which provides the best recovery conditions but a poor bound on its runtime, and at the other extreme, near linear-time algorithms with weaker recovery conditions but very good run times.

Matlab implementations of all the algorithms were used for evaluation. The SparseLab (Donoho & Others, 2009) package, which contains the implementations of the Lasso/LARS, OMP and StOMP algorithms, was used. In addition, a publicly available optimized Matlab implementation of Subspace Pursuit, provided by one of its authors, was used. The algorithm GraDeS was also implemented in Matlab for a fair comparison. GraDeS(γ = 4/3), for which all our results hold, was evaluated along with GraDeS(γ = 3), which was a bit slower but had better recovery properties. All the experiments were run on a 3 GHz dual-core Pentium system with 3.5 GB memory running the Linux operating system. Care was taken to ensure that no other program was actively consuming CPU time while the experiments were in progress.

First, a random matrix $\Phi$ of size $m \times n$ with i.i.d. normally distributed entries and a random $s$-sparse vector $x^*$ were generated. The columns of $\Phi$ were normalized to zero mean and unit norm. The vector $y = \Phi x^*$ was computed. Then the matrix $\Phi$, the vector $y$ and the parameter $s$ were given to each of the algorithms as the input.
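The data-generation step just described can be sketched as follows; this is our reading of the setup, and the variable names and the use of randn/randperm are our own choices rather than the authors' code.

```matlab
% Sketch of the experimental setup described above (names are our assumptions).
m = 4000; n = 8000; s = 500;
Phi = randn(m, n);                                   % i.i.d. normal entries
Phi = Phi - repmat(mean(Phi, 1), m, 1);              % zero-mean columns
Phi = Phi ./ repmat(sqrt(sum(Phi.^2, 1)), m, 1);     % unit-norm columns

xstar = zeros(n, 1);
xstar(randperm(n, s)) = randn(s, 1);                 % random s-sparse signal
y = Phi * xstar;                                     % noiseless measurements
% Phi, y and s are then handed to each solver, and its output is compared to xstar.
```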

Run times (sec) for n = 6000-16000 (here m = 4000 and s = 500; Lasso/LARS omitted for n > 8000):

n       Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
6000    56.86        42.67   21.91   23.01              8.14           2.95
8000    71.11        54.76   26.73   25.16              10.72          4.05
10000                60.34   27.53   25.22              12.21          4.76
12000                66.89   32.84   23.82              15.01          5.93
14000                76.35   36.84   23.58              19.60          7.36
16000                89.07   40.08   29.74              20.22          8.94


Figure 2. Dependence of the running times of different algorithms on the number of columns (n). Here the number of rows m = 4000 and sparsity s = 500.

The output of every algorithm was compared with the correct solution $x^*$.

Figure 1 shows the runtime of all the algorithms as a function of the number of rows m. The algorithms Lasso/LARS and OMP take the maximum time, which increases linearly with m. They are followed by StOMP and Subspace Pursuit, which also show a linear increase in runtime as a function of m. The algorithm GraDeS(4/3) has the smallest runtime, which is a factor 20 better than the runtime of Lasso/LARS and a factor 7 better than that of Subspace Pursuit for m = 8000, n = 8000 and s = 500. GraDeS(3) takes more time than GraDeS(4/3), as expected. It is surprising to observe that, contrary to expectations, the runtime of GraDeS did not increase substantially as m was increased. It was found that GraDeS needed fewer iterations, offsetting the additional time needed to compute products of matrices with larger m.

Figure 2 shows the runtime of these algorithms as the number of columns, n, is varied. Here, all the algorithms seem to scale linearly with n, as expected. In this case, Lasso/LARS did not give correct results for n > 8000 and hence its runtime was omitted from the graph. As expected, the runtime of GraDeS is an order of magnitude smaller than that of OMP and Lasso/LARS.

Figure 3 shows the runtime of the algorithms as a function of the sparsity parameter s. The increase in the runtime of Lasso/LARS and OMP is super-linear (as opposed to a linear theoretical bound for OMP). Although the theoretical analysis of StOMP and Subspace Pursuit shows very little dependence on s, their actual run times do increase significantly as s is increased. In contrast, the run times of GraDeS increase only marginally (due to a small increase in the number of iterations needed for convergence) as s is increased.


Run times (sec) for s = 100-600 (here m = 5000 and n = 10000):

s     Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
100   11.47        11.05   6.65    1.26               7.95           2.65
200   24.65        23.75   11.31   4.23               9.23           3.25
300   39.06        38.17   15.08   8.72               10.05          3.84
400   58.18        54.27   25.14   17.95              12.07          4.24
500   86.33        75.36   31.02   22.28              13.53          5.02
600   115.17       94.46   38.48   44.24              14.88          5.51


Figure 3. Dependence of the running times of different algorithms on the sparsity (s). Here the number of rows m = 5000 and the number of columns n = 10000.

Table 2. A comparison of different algorithms for various parameter values. An entry Y indicates that the algorithm could recover a sparse solution, while an empty entry indicates otherwise.

m      n       s      Lasso/LARS   GraDeS (γ=4/3)   StOMP   GraDeS (γ=3)   OMP   Subspace Pursuit
3000   10000   2100
3000   10000   1050                                                         Y     Y
3000   8000    500                                           Y              Y     Y
3000   10000   600                                   Y       Y              Y     Y
6000   8000    500                                   Y       Y              Y     Y
3000   10000   300                Y                  Y       Y              Y     Y
4000   10000   500                Y                  Y       Y              Y     Y
8000   8000    500    Y           Y                  Y       Y              Y     Y

For s = 600, GraDeS(4/3) is a factor 20 faster than Lasso/LARS and a factor 7 faster than StOMP, the second best algorithm after GraDeS.

Table 2 shows the recovery results for these algorithms. Quite surprisingly, although the theoretical recovery conditions for Lasso are the most general (see Table 1), it was found that the LARS-based implementation of Lasso was the first to fail in recovery as s was increased.

A careful examination revealed that as s was increased, the output of Lasso/LARS became sensitive to error tolerance parameters used in the implementation. With careful tuning of these parameters, the recovery property of Lasso/LARS could be improved. The recovery could be improved further by replacing some incremental computations with slower non-incremental computations. One such computation was the incremental Cholesky factorization. These changes adversely impacted the running time of the algorithm, increasing its cost per iteration from K + O(s²) to 2K + O(s³).

Even after these changes, some instances were discovered for which Lasso/LARS did not produce the correct sparse solution (though it computed the optimal solution of the program (4)), but GraDeS(3) found the correct solution. However, instances for which Lasso/LARS gave the correct sparse solution but GraDeS failed were also found.

    4. Conclusions

In summary, we have presented an efficient algorithm for solving the sparse reconstruction problem provided the isometry constants of the constraint matrix satisfy $\delta_{2s} < 1/3$. Although the recovery conditions for GraDeS are stronger than those for L1-regularized regression (Lasso), our results indicate that whenever Lasso/LARS finds the correct solution, GraDeS also finds it. Conversely, there are cases where GraDeS (and other algorithms) find the correct solution but Lasso/LARS fails due to numerical issues. In the absence of efficient and numerically stable algorithms, it is not clear whether L1-regularization offers any advantages to practitioners over simpler and faster algorithms such as OMP or GraDeS when the matrix $\Phi$ satisfies RIP. A systematic study is needed to explore this. Finally, finding more general conditions than RIP under which the sparse reconstruction problem can be solved efficiently is a very challenging open question.

    References

Berinde, R., Indyk, P., & Ruzic, M. (2008). Practical near-optimal sparse recovery in the L1 norm. Allerton Conference on Communication, Control, and Computing. Monticello, IL.

Blumensath, T., & Davies, M. E. (2008). Iterative hard thresholding for compressed sensing. Preprint.

Candès, E., & Wakin, M. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25, 21-30.

Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences, Paris, Série I, 346, 589-592.

Candès, E. J., & Romberg, J. (2004). Practical signal recovery from random projections. SPIN Conference on Wavelet Applications in Signal and Image Processing.

Candès, E. J., Romberg, J. K., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52, 489-509.

Candès, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51, 4203-4215.

Candès, E. J., & Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52, 5406-5425.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129-159.

Dai, W., & Milenkovic, O. (2008). Subspace pursuit for compressive sensing: Closing the gap between performance and complexity. ArXiv e-prints.

Do, T. T., Gan, L., Nguyen, N., & Tran, T. D. (2008). Sparsity adaptive matching pursuit algorithm for practical compressed sensing. Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, California.

Donoho, D., & Others (2009). SparseLab: Seeking sparse solutions to linear systems of equations. http://sparselab.stanford.edu/.

Donoho, D. L., & Tsaig, Y. (2006). Fast solution of L1 minimization problems when the solution may be sparse. Preprint.

Donoho, D. L., Tsaig, Y., Drori, I., & Starck, J.-L. (2006). Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Preprint.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Figueiredo, M. A. T., & Nowak, R. D. (2005). A bound optimization approach to wavelet-based image deconvolution. IEEE International Conference on Image Processing (ICIP) (pp. 782-785).

Golub, G., & Loan, C. V. (1996). Matrix computations, 3rd ed. Johns Hopkins University Press.

Jung, H., Ye, J. C., & Kim, E. Y. (2007). Improved k-t BLAST and k-t SENSE using FOCUSS. Phys. Med. Biol., 52, 3201-3226.

Lustig, M. (2008). Sparse MRI. Ph.D. thesis, Stanford University.

Ma, S., Yin, W., Zhang, Y., & Chakraborty, A. (2008). An efficient algorithm for compressed MR imaging using total variation and wavelets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 3397-3415.

Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24, 227-234.

Needell, D., & Tropp, J. A. (2008). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26, 301-321.

Needell, D., & Vershynin, R. (2009). Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9, 317-334.

Neylon, T. (2006). Sparse solutions for linear prediction problems. Doctoral dissertation, Courant Institute, New York University.

Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2007). Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 1185-1192. MIT Press.

Sarvotham, S., Baron, D., & Baraniuk, R. (2006). Sudocodes: Fast measurement and reconstruction of sparse signals. IEEE International Symposium on Information Theory (ISIT). Seattle, Washington.

Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53.

Wainwright, M. J., Ravikumar, P., & Lafferty, J. D. (2006). High-dimensional graphical model selection using L1-regularized logistic regression. In Advances in Neural Information Processing Systems 19 (NIPS 2006), 1465-1472. Cambridge, MA: MIT Press.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 262-286.