Gradient Descent with Sparsification: An iterative algorithm for sparse recovery with restricted isometry property

Rahul Garg ([email protected]), Rohit Khandekar ([email protected])
IBM T. J. Watson Research Center, New York

Keywords: sparse regression, compressed sensing, gradient descent

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We present an algorithm for finding an $s$-sparse vector $x$ that minimizes the square error $\|y - \Phi x\|^2$ where $\Phi$ satisfies the restricted isometry property (RIP), with isometry constant $\delta_{2s} < 1/3$. Our algorithm, called GraDeS (Gradient Descent with Sparsification), iteratively updates $x$ as
\[ x \leftarrow H_s\Big(x + \frac{1}{\gamma}\,\Phi^\top (y - \Phi x)\Big) \]
where $\gamma > 1$ and $H_s$ sets all but the $s$ largest-magnitude coordinates to zero. GraDeS converges to the correct solution in a constant number of iterations. The condition $\delta_{2s} < 1/3$ is the most general for which a near-linear-time algorithm is known. In comparison, the best condition under which a polynomial-time algorithm is known is $\delta_{2s} < \sqrt{2} - 1$. Our Matlab implementation of GraDeS outperforms previously proposed algorithms such as Subspace Pursuit, StOMP, OMP, and Lasso by an order of magnitude. Curiously, our experiments also uncovered cases where L1-regularized regression (Lasso) fails but GraDeS finds the correct solution.

1. Introduction

Finding a sparse solution to a system of linear equations has been an important problem in multiple domains such as model selection in statistics and machine learning (Golub & Loan, 1996; Efron et al., 2004; Wainwright et al., 2006; Ranzato et al., 2007), sparse principal component analysis (Zou et al., 2006), image deconvolution and de-noising (Figueiredo & Nowak, 2005), and compressed sensing (Candès & Wakin, 2008). The recent results in the area of compressed sensing, especially those relating to the properties of random matrices (Candès & Tao, 2006; Candès et al., 2006), have greatly increased interest in this area, which is finding applications in diverse domains such as coding and information theory, signal processing, artificial intelligence, and imaging. Due to these developments, efficient algorithms for finding sparse solutions are becoming increasingly important.

Consider a system of linear equations of the form
\[ y = \Phi x \qquad (1) \]
where $y \in \mathbb{R}^m$ is an $m$-dimensional vector of "measurements", $x \in \mathbb{R}^n$ is the unknown signal to be reconstructed, and $\Phi \in \mathbb{R}^{m \times n}$ is the measurement matrix. The signal $x$ is represented in a suitable (possibly overcomplete) basis and is assumed to be "$s$-sparse" (i.e., at most $s$ out of $n$ components of $x$ are non-zero). The sparse reconstruction problem is
\[ \min_{\tilde{x} \in \mathbb{R}^n} \|\tilde{x}\|_0 \quad \text{subject to} \quad y = \Phi \tilde{x} \qquad (2) \]
where $\|\tilde{x}\|_0$ denotes the number of non-zero entries in $\tilde{x}$. This problem is not only NP-hard (Natarajan, 1995), but also hard to approximate within a factor $O(2^{\log^{1-\epsilon}(m)})$ of the optimal solution (Neylon, 2006).

RIP and its implications. The above problem becomes computationally tractable if the matrix $\Phi$ satisfies a restricted isometry property. Define the isometry constant of $\Phi$ as the smallest number $\delta_s$ such that the following holds for all $s$-sparse vectors $x \in \mathbb{R}^n$:
\[ (1 - \delta_s)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_s)\|x\|_2^2 \qquad (3) \]
It has been shown (Candès & Tao, 2005; Candès, 2008) that if $y = \Phi x^*$ for some $s$-sparse vector $x^*$ and $\delta_{2s} < \sqrt{2} - 1$, then $x^*$ is the unique solution of the convex program
\[ \min_{\tilde{x} \in \mathbb{R}^n} \|\tilde{x}\|_1 \quad \text{subject to} \quad y = \Phi \tilde{x} \qquad (4) \]
which can be solved in polynomial time.
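The update rule stated in the abstract is short enough to sketch directly. The following Matlab fragment is an illustrative sketch of that rule only, not the authors' released implementation; the names grades_sketch and hard_threshold, the fixed iteration budget, and the tolerance-based stopping test are our own assumptions.

```matlab
% Illustrative sketch of the GraDeS update x <- H_s(x + (1/gamma)*Phi'*(y - Phi*x)).
% Not the authors' implementation; names and the stopping rule are assumptions.
function x = grades_sketch(Phi, y, s, gamma, maxIter, tol)
    x = zeros(size(Phi, 2), 1);              % start from the all-zero vector
    for t = 1:maxIter
        r = y - Phi * x;                     % current residual
        if norm(r)^2 <= tol                  % potential ||y - Phi*x||^2 small enough
            break;
        end
        x = hard_threshold(x + (Phi' * r) / gamma, s);
    end
end

function v = hard_threshold(v, s)
    % H_s: keep the s largest-magnitude coordinates, zero out the rest
    [~, order] = sort(abs(v), 'descend');
    v(order(s+1:end)) = 0;
end
```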


2.1. Proof of Theorem 2.1

Suppose $y = \Phi x^*$ for an $s$-sparse vector $x^*$, and let $\Psi(x) = \|y - \Phi x\|^2$ denote the potential of the current iterate $x$. Let $x' = H_s\big(x + \frac{1}{\gamma}\Phi^\top(y - \Phi x)\big)$ be the next iterate, and let $\Delta x = x' - x$ and $\Delta x^* = x^* - x$. Since $\Delta x$ is $2s$-sparse,

\[ \Psi(x') - \Psi(x) = \|y - \Phi x'\|^2 - \|y - \Phi x\|^2 = -2(y - \Phi x)^\top \Phi\,\Delta x + \|\Phi\,\Delta x\|^2 \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x + (1 + \delta_{2s})\|\Delta x\|^2 \qquad (5) \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x + \gamma\|\Delta x\|^2 \qquad (6) \]

The inequality (5) follows from RIP; inequality (6) holds since $\gamma \ge 1 + \delta_{2s}$. Let $g = -2\Phi^\top(y - \Phi x)$. Now we prove an important component of our analysis.

Lemma 2.4 The vector $\Delta x$ achieves the minimum of $F(v) = g^\top v + \gamma\|v\|^2$ over all vectors $v$ such that $x + v$ is $s$-sparse, i.e.,

\[ \Delta x = \operatorname*{argmin}_{v \,:\, x + v \text{ is } s\text{-sparse}} \big( g^\top v + \gamma\|v\|^2 \big). \]


Proof: Consider any vector $v$ such that $x + v$ is $s$-sparse. Let $S_O$ be the coordinates of $\operatorname{supp}(x)$ that are zeroed out (so $v_i = -x_i$ for $i \in S_O$), let $S_I$ be the coordinates of $\operatorname{supp}(x)$ that remain in $\operatorname{supp}(x+v)$, and let $S_N$ be the new coordinates added to the support. For a fixed choice of these sets, $F(v)$ is minimized by setting $v_i = -g_i/(2\gamma)$ for every $i \in S_I \cup S_N$. Hence

\[ F(v) = \sum_{i \in S_O} \big( -g_i x_i + \gamma x_i^2 \big) + \sum_{i \in S_I \cup S_N} \Big( -\frac{g_i^2}{4\gamma} \Big) \]
\[ = \sum_{i \in S_O} \Big( -g_i x_i + \gamma x_i^2 + \frac{g_i^2}{4\gamma} \Big) - \sum_{i \in S_O \cup S_I \cup S_N} \frac{g_i^2}{4\gamma} \]
\[ = \sum_{i \in S_O} \gamma\Big( x_i - \frac{g_i}{2\gamma} \Big)^2 - \sum_{i \in S_O \cup S_I \cup S_N} \frac{g_i^2}{4\gamma} \]
\[ = \sum_{i \in S_O} \gamma\Big( x_i - \frac{g_i}{2\gamma} \Big)^2 - \sum_{i \in S_O \cup S_I \cup S_N} \gamma\Big( \frac{g_i}{2\gamma} \Big)^2 \]

Thus, in order to minimize $F(v)$, we have to pick the coordinates $i \in S_O$ to be those with the least value of $|x_i - g_i/(2\gamma)|$ and pick the coordinates $j \in S_N$ to be those with the largest value of $|g_j/(2\gamma)| = |x_j - g_j/(2\gamma)|$. Note that $-g_i/(2\gamma) = \frac{1}{\gamma}\big[\Phi^\top(y - \Phi x)\big]_i$, which is the expression used in the while loop of Algorithm 1. Hence the proof is complete from the definition of $H_s$.
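As a quick sanity check on Lemma 2.4, the following Matlab snippet compares the hard-thresholding choice against a brute-force search over all supports of size $s$ on a tiny instance. The sizes, the hard-coded iterate x, and the random vector g (standing in for $-2\Phi^\top(y - \Phi x)$) are illustrative assumptions of ours, not values from the paper.

```matlab
% Brute-force check of Lemma 2.4 on a tiny instance (sizes chosen for illustration).
n = 6; s = 2; gamma = 4/3;
x = zeros(n, 1); x([2 5]) = [1.3; -0.7];      % current s-sparse iterate
g = randn(n, 1);                              % plays the role of -2*Phi'*(y - Phi*x)
F = @(v) g' * v + gamma * (v' * v);

target = x - g / (2 * gamma);                 % equals x + (1/gamma)*Phi'*(y - Phi*x)
[~, order] = sort(abs(target), 'descend');
zH = zeros(n, 1); zH(order(1:s)) = target(order(1:s));    % the H_s step

bestF = inf;
for support = nchoosek(1:n, s)'               % every support of size s
    z = zeros(n, 1); z(support) = target(support);        % optimal z on this support
    bestF = min(bestF, F(z - x));
end
fprintf('F at H_s step: %.6f, brute-force minimum: %.6f\n', F(zH - x), bestF);
```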

From inequality (6), Lemma 2.4, and the fact that $x^* = x + \Delta x^*$ is $s$-sparse, we get

\[ \Psi(x') - \Psi(x) \le -2(y - \Phi x)^\top \Phi\,\Delta x^* + \gamma\|\Delta x^*\|^2 \]
\[ = -2(y - \Phi x)^\top \Phi\,\Delta x^* + (1 - \delta_{2s})\|\Delta x^*\|^2 + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \]
\[ \le -2(y - \Phi x)^\top \Phi\,\Delta x^* + (\Delta x^*)^\top \Phi^\top \Phi\,\Delta x^* + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \qquad (7) \]
\[ = \Psi(x^*) - \Psi(x) + (\gamma - 1 + \delta_{2s})\|\Delta x^*\|^2 \]
\[ \le -\Psi(x) + \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\,\|\Phi\,\Delta x^*\|^2 \qquad (8) \]
\[ \le -\Psi(x) + \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\,\Psi(x) \qquad (9) \]

The inequalities (7) and (8) follow from RIP, while (9) follows from the fact that $\Phi\,\Delta x^* = y - \Phi x$ and hence $\|\Phi\,\Delta x^*\|^2 = \Psi(x)$; note also that $\Psi(x^*) = 0$ since $y = \Phi x^*$. Thus we get $\Psi(x') \le \Psi(x)\cdot\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}$. Now note that for $\gamma = 1 + \delta_{2s}$ (and likewise for $\gamma = 4/3$), the condition $\delta_{2s} < 1/3$ implies that $\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}} < 1$, and hence the potential decreases by a multiplicative factor in each iteration. If we start with $x = 0$, the initial potential is $\|y\|^2$. Thus after

\[ \left\lceil \frac{1}{\log\big( (1 - \delta_{2s}) / (\gamma - 1 + \delta_{2s}) \big)} \, \log\Big( \frac{\|y\|^2}{\epsilon} \Big) \right\rceil \]

iterations, the potential becomes less than $\epsilon$. Setting $\gamma = 1 + \delta_{2s}$, we get the desired bounds.

If the value of $\delta_{2s}$ is not known, by setting $\gamma = 4/3$, the same algorithm computes the above solution in

\[ \left\lceil \frac{1}{\log\big( (1 - \delta_{2s}) / (1/3 + \delta_{2s}) \big)} \, \log\Big( \frac{\|y\|^2}{\epsilon} \Big) \right\rceil \]

iterations.
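To make the two iteration bounds above concrete, the short calculation below evaluates them for assumed values of $\delta_{2s}$, $\|y\|^2$ and $\epsilon$ (our choices, not measurements from the paper).

```matlab
% Worked example of the iteration bounds (all numbers are assumed, not measured).
delta2s = 0.2; eps_acc = 1e-6; normy2 = 1.0;

gamma1 = 1 + delta2s;                          % choice of Theorem 2.1
iters1 = ceil(log(normy2 / eps_acc) / log((1 - delta2s) / (gamma1 - 1 + delta2s)));

gamma2 = 4/3;                                  % choice when delta2s is unknown
iters2 = ceil(log(normy2 / eps_acc) / log((1 - delta2s) / (1/3 + delta2s)));

fprintf('gamma = 1+delta_2s: %d iterations, gamma = 4/3: %d iterations\n', iters1, iters2);
```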

    2.2. Proof of Theorem 2.3

The recovery under noise corresponds to the case where there exists an $s$-sparse vector $x^*$ such that $y = \Phi x^* + e$ for an error vector $e$. Assuming $\Psi(x) > C^2\|e\|^2$, let $x' = H_s\big(x + \frac{1}{\gamma}\Phi^\top(y - \Phi x)\big)$ be the solution computed at the end of an iteration, and let $\Delta x = x' - x$ and $\Delta x^* = x^* - x$. The initial part of the analysis is the same as that in Section 2.1. Using (10), we get

\[ \|\Phi\,\Delta x^*\|^2 = \|y - \Phi x - e\|^2 = \Psi(x) - 2 e^\top (y - \Phi x) + \|e\|^2 \]
\[ \le \Psi(x) + \frac{2}{C}\|y - \Phi x\|^2 + \frac{1}{C^2}\Psi(x) = \Big( 1 + \frac{1}{C} \Big)^2 \Psi(x). \]

    Using the above inequality in inequality (8), we get

\[ \Psi(x') \le \frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}} \Big( 1 + \frac{1}{C} \Big)^2 \Psi(x). \]

Our assumption $\delta_{2s} < \frac{C^2}{3C^2 + 4C + 2}$ implies that $\frac{\gamma - 1 + \delta_{2s}}{1 - \delta_{2s}}\big( 1 + \frac{1}{C} \big)^2 = \frac{2\delta_{2s}}{1 - \delta_{2s}}\big( 1 + \frac{1}{C} \big)^2 < 1$. This implies that as long as $\Psi(x) > C^2\|e\|^2$, the potential $\Psi(x)$ decreases by a multiplicative factor. Lemma 2.5 thus follows.
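The following self-contained Matlab sketch illustrates the noisy setting analyzed in this subsection: it runs the same hard-thresholding iteration on a small random instance and prints how the final potential $\|y - \Phi x\|^2$ compares with $\|e\|^2$. The problem sizes, noise level, seed and iteration count are our illustrative assumptions, not values from the paper.

```matlab
% Illustrative noisy-recovery run (all parameters are assumptions for the sketch).
rng(0);
m = 300; n = 600; s = 5; gamma = 4/3;
Phi = randn(m, n) / sqrt(m);                   % columns have roughly unit norm
xstar = zeros(n, 1); xstar(randperm(n, s)) = randn(s, 1);
e = 0.01 * randn(m, 1);                        % small measurement noise
y = Phi * xstar + e;

x = zeros(n, 1);
for t = 1:50
    v = x + (Phi' * (y - Phi * x)) / gamma;    % gradient step
    [~, order] = sort(abs(v), 'descend');      % hard threshold H_s
    x = zeros(n, 1); x(order(1:s)) = v(order(1:s));
end
fprintf('Psi(x) = %.3e, ||e||^2 = %.3e, ||x - xstar|| = %.3e\n', ...
        norm(y - Phi * x)^2, norm(e)^2, norm(x - xstar));
```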


Run times (sec) for m = 4000-8000 (here n = 8000 and s = 500):

m      Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
4000   66.28        51.36   23.29   23.72              10.06          3.78
5000   69.54        62.13   29.45   22.71              10.41          3.71
6000   77.64        74.73   35.14   25.90              11.44          4.27
7000   89.99        86.84   41.97   30.08              12.24          4.23
8000   101.07       98.47   45.18   34.31              13.53          4.82


Figure 1. Dependence of the running times of different algorithms on the number of rows (m). Here the number of columns n = 8000 and sparsity s = 500.

    3. Implementation Results

To evaluate the usefulness of GraDeS in practice, it was compared with the LARS-based implementation of Lasso (Efron et al., 2004) (referred to as Lasso/LARS), Orthogonal Matching Pursuit (OMP), Stagewise OMP (StOMP) and Subspace Pursuit. The algorithms were selected to span the spectrum: at one extreme, Lasso/LARS, which provides the best recovery conditions but a poor bound on its runtime, and at the other extreme, near linear-time algorithms with weaker recovery conditions but very good run times.

Matlab implementations of all the algorithms were used for evaluation. The SparseLab (Donoho & Others, 2009) package, which contains the implementations of the Lasso/LARS, OMP and StOMP algorithms, was used. In addition, a publicly available optimized Matlab implementation of Subspace Pursuit, provided by one of its authors, was used. The algorithm GraDeS was also implemented in Matlab for a fair comparison. GraDeS(γ = 4/3), for which all our results hold, was evaluated along with GraDeS(γ = 3), which was a bit slower but had better recovery properties. All the experiments were run on a 3 GHz dual-core Pentium system with 3.5 GB memory running the Linux operating system. Care was taken to ensure that no other program was actively consuming CPU time while the experiments were in progress.

First, a random matrix $\Phi$ of size $m \times n$ with i.i.d. normally distributed entries and a random $s$-sparse vector $x^*$ were generated. The columns of $\Phi$ were normalized to zero mean and unit norm. The vector $y = \Phi x^*$ was computed. Then the matrix $\Phi$, the vector $y$ and the parameter $s$ were given to each of the algorithms as the input.
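The data-generation step just described can be sketched as follows; this is our reading of the setup, and the variable names and the use of randn/randperm are our own choices rather than the authors' code.

```matlab
% Sketch of the experimental setup described above (names are our assumptions).
m = 4000; n = 8000; s = 500;
Phi = randn(m, n);                                   % i.i.d. normal entries
Phi = Phi - repmat(mean(Phi, 1), m, 1);              % zero-mean columns
Phi = Phi ./ repmat(sqrt(sum(Phi.^2, 1)), m, 1);     % unit-norm columns

xstar = zeros(n, 1);
xstar(randperm(n, s)) = randn(s, 1);                 % random s-sparse signal
y = Phi * xstar;                                     % noiseless measurements
% Phi, y and s are then handed to each solver, and its output is compared to xstar.
```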

Run times (sec) for n = 6000-16000 (here m = 4000 and s = 500; Lasso/LARS omitted for n > 8000):

n       Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
6000    56.86        42.67   21.91   23.01              8.14           2.95
8000    71.11        54.76   26.73   25.16              10.72          4.05
10000                60.34   27.53   25.22              12.21          4.76
12000                66.89   32.84   23.82              15.01          5.93
14000                76.35   36.84   23.58              19.60          7.36
16000                89.07   40.08   29.74              20.22          8.94


Figure 2. Dependence of the running times of different algorithms on the number of columns (n). Here the number of rows m = 4000 and sparsity s = 500.

The output of every algorithm was compared with the correct solution $x^*$.

Figure 1 shows the runtime of all the algorithms as a function of the number of rows m. The algorithms Lasso/LARS and OMP take the maximum time, which increases linearly with m. They are followed by StOMP and Subspace Pursuit, which also show a linear increase in runtime as a function of m. The algorithm GraDeS(4/3) has the smallest runtime, which is a factor 20 better than the runtime of Lasso/LARS and a factor 7 better than that of Subspace Pursuit for m = 8000, n = 8000 and s = 500. GraDeS(3) takes more time than GraDeS(4/3), as expected. It is surprising to observe that, contrary to expectations, the runtime of GraDeS did not increase substantially as m was increased. It was found that GraDeS needed fewer iterations, offsetting the additional time needed to compute products of matrices with larger m.

Figure 2 shows the runtime of these algorithms as the number of columns, n, is varied. Here, all the algorithms seem to scale linearly with n, as expected. In this case, Lasso/LARS did not give correct results for n > 8000 and hence its runtime was omitted from the graph. As expected, the runtime of GraDeS is an order of magnitude smaller than that of OMP and Lasso/LARS.

Figure 3 shows the runtime of the algorithms as a function of the sparsity parameter s. The increase in the runtime of Lasso/LARS and OMP is super-linear (as opposed to a linear theoretical bound for OMP). Although the theoretical analysis of StOMP and Subspace Pursuit shows very little dependence on s, their actual run times do increase significantly as s is increased. In contrast, the run times of GraDeS increase only marginally (due to a small increase in the number of iterations needed for convergence) as s is increased.


Run times (sec) for s = 100-600 (here m = 5000 and n = 10000):

s     Lasso/LARS   OMP     StOMP   Subspace Pursuit   GraDeS (γ=3)   GraDeS (γ=4/3)
100   11.47        11.05   6.65    1.26               7.95           2.65
200   24.65        23.75   11.31   4.23               9.23           3.25
300   39.06        38.17   15.08   8.72               10.05          3.84
400   58.18        54.27   25.14   17.95              12.07          4.24
500   86.33        75.36   31.02   22.28              13.53          5.02
600   115.17       94.46   38.48   44.24              14.88          5.51


Figure 3. Dependence of the running times of different algorithms on the sparsity (s). Here the number of rows m = 5000 and the number of columns n = 10000.

Table 2. A comparison of different algorithms for various parameter values. An entry Y indicates that the algorithm could recover a sparse solution, while an empty entry indicates otherwise.

m      n       s      Lasso/LARS   GraDeS (γ=4/3)   StOMP   GraDeS (γ=3)   OMP   Subspace Pursuit
3000   10000   2100
3000   10000   1050                                                         Y     Y
3000   8000    500                                           Y              Y     Y
3000   10000   600                                   Y       Y              Y     Y
6000   8000    500                                   Y       Y              Y     Y
3000   10000   300                Y                  Y       Y              Y     Y
4000   10000   500                Y                  Y       Y              Y     Y
8000   8000    500    Y           Y                  Y       Y              Y     Y

For s = 600, GraDeS(4/3) is a factor 20 faster than Lasso/LARS and a factor 7 faster than StOMP, the second best algorithm after GraDeS.

Table 2 shows the recovery results for these algorithms. Quite surprisingly, although the theoretical recovery conditions for Lasso are the most general (see Table 1), it was found that the LARS-based implementation of Lasso was the first to fail in recovery as s was increased.

A careful examination revealed that as s was increased, the output of Lasso/LARS became sensitive to error tolerance parameters used in the implementation. With careful tuning of these parameters, the recovery property of Lasso/LARS could be improved. The recovery could be improved further by replacing some incremental computations with slower non-incremental computations. One such computation was the incremental Cholesky factorization. These changes adversely impacted the running time of the algorithm, increasing its cost per iteration from K + O(s²) to 2K + O(s³).

Even after these changes, some instances were discovered for which Lasso/LARS did not produce the correct sparse solution (though it computed the optimal solution of the program (4)), but GraDeS(3) found the correct solution. However, instances for which Lasso/LARS gave the correct sparse solution but GraDeS failed were also found.

    4. Conclusions

In summary, we have presented an efficient algorithm for solving the sparse reconstruction problem provided the isometry constants of the constraint matrix satisfy $\delta_{2s} < 1/3$. Although the recovery conditions for GraDeS are stronger than those for L1-regularized regression (Lasso), our results indicate that whenever Lasso/LARS finds the correct solution, GraDeS also finds it. Conversely, there are cases where GraDeS (and other algorithms) find the correct solution but Lasso/LARS fails due to numerical issues. In the absence of efficient and numerically stable algorithms, it is not clear whether L1-regularization offers any advantages to practitioners over simpler and faster algorithms such as OMP or GraDeS when the matrix $\Phi$ satisfies RIP. A systematic study is needed to explore this. Finally, finding more general conditions than RIP under which the sparse reconstruction problem can be solved efficiently is a very challenging open question.

    References

Berinde, R., Indyk, P., & Ruzic, M. (2008). Practical near-optimal sparse recovery in the L1 norm. Allerton Conference on Communication, Control, and Computing. Monticello, IL.

Blumensath, T., & Davies, M. E. (2008). Iterative hard thresholding for compressed sensing. Preprint.

Candès, E., & Wakin, M. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25, 21-30.

Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences, Paris, Série I, 346, 589-592.

Candès, E. J., & Romberg, J. (2004). Practical signal recovery from random projections. SPIN Conference on Wavelet Applications in Signal and Image Processing.

Candès, E. J., Romberg, J. K., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52, 489-509.

Candès, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51, 4203-4215.

Candès, E. J., & Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52, 5406-5425.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129-159.

Dai, W., & Milenkovic, O. (2008). Subspace pursuit for compressive sensing: Closing the gap between performance and complexity. ArXiv e-prints.

Do, T. T., Gan, L., Nguyen, N., & Tran, T. D. (2008). Sparsity adaptive matching pursuit algorithm for practical compressed sensing. Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, California.

Donoho, D., & Others (2009). SparseLab: Seeking sparse solutions to linear systems of equations. http://sparselab.stanford.edu/.

Donoho, D. L., & Tsaig, Y. (2006). Fast solution of L1 minimization problems when the solution may be sparse. Preprint.

Donoho, D. L., Tsaig, Y., Drori, I., & Starck, J.-L. (2006). Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Preprint.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Figueiredo, M. A. T., & Nowak, R. D. (2005). A bound optimization approach to wavelet-based image deconvolution. IEEE International Conference on Image Processing (ICIP) (pp. 782-785).

Golub, G., & Loan, C. V. (1996). Matrix computations, 3rd ed. Johns Hopkins University Press.

Jung, H., Ye, J. C., & Kim, E. Y. (2007). Improved k-t BLAST and k-t SENSE using FOCUSS. Phys. Med. Biol., 52, 3201-3226.

Lustig, M. (2008). Sparse MRI. Ph.D. thesis, Stanford University.

Ma, S., Yin, W., Zhang, Y., & Chakraborty, A. (2008). An efficient algorithm for compressed MR imaging using total variation and wavelets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 3397-3415.

Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24, 227-234.

Needell, D., & Tropp, J. A. (2008). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26, 301-321.

Needell, D., & Vershynin, R. (2009). Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9, 317-334.

Neylon, T. (2006). Sparse solutions for linear prediction problems. Doctoral dissertation, Courant Institute, New York University.

Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2007). Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 1185-1192. MIT Press.

Sarvotham, S., Baron, D., & Baraniuk, R. (2006). Sudocodes: Fast measurement and reconstruction of sparse signals. IEEE International Symposium on Information Theory (ISIT). Seattle, Washington.

Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53.

Wainwright, M. J., Ravikumar, P., & Lafferty, J. D. (2006). High-dimensional graphical model selection using L1-regularized logistic regression. In Advances in Neural Information Processing Systems 19 (NIPS 2006), 1465-1472. Cambridge, MA: MIT Press.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 262-286.