For Most Large Underdetermined Systems of Linear Equations the Minimal ℓ¹-norm Solution is also the Sparsest Solution

    David L. Donoho

    Department of Statistics

    Stanford University

    September 16, 2004

    Abstract

We consider linear equations y = Φα where y is a given vector in Rⁿ and Φ is a given n by m matrix with n < m ≤ An, and we wish to solve for α ∈ Rᵐ. We suppose that the columns of Φ are normalized to unit ℓ² norm, and we place uniform measure on such Φ. We prove the existence of ρ = ρ(A) so that for large n, and for all Φ's except a negligible fraction, the following property holds: For every y having a representation y = Φα₀ by a coefficient vector α₀ ∈ Rᵐ with fewer than ρn nonzeros, the solution α₁ of the ℓ¹ minimization problem

    min ‖x‖₁ subject to Φx = y

is unique and equal to α₀.

In contrast, heuristic attempts to sparsely solve such systems (greedy algorithms and thresholding) perform poorly in this challenging setting.

The techniques include the use of random proportional embeddings and almost-spherical sections in Banach space theory, and deviation bounds for the eigenvalues of random Wishart matrices.

Key Words and Phrases. Solution of Underdetermined Linear Systems. Overcomplete Representations. Minimum ℓ¹ decomposition. Almost-Euclidean Sections of Banach Spaces. Eigenvalues of Random Matrices. Sign-Embeddings in Banach Spaces. Greedy Algorithms. Matching Pursuit. Basis Pursuit.

Acknowledgements. Partial support from NSF DMS 00-77261 and 01-40698 (FRG) and ONR. Thanks to Emmanuel Candès for calling my attention to a blunder in a previous version, to Noureddine El Karoui for discussions about singular values of random matrices, and to Yaacov Tsaig for the experimental results mentioned here.


    1 Introduction

Many situations in science and technology call for solutions to underdetermined systems of equations, i.e. systems of linear equations with fewer equations than unknowns. Examples in array signal processing, inverse problems, and genomic data analysis all come to mind. However, any sensible person working in such fields would have been taught to agree with the statement: "You have a system of linear equations with fewer equations than unknowns. There are infinitely many solutions." And indeed, they would have been taught well. However, the intuition imbued by that teaching would be misleading.

On closer inspection, many of the applications ask for sparse solutions of such systems, i.e. solutions with few nonzero elements; the interpretation being that we are sure that relatively few of the candidate sources, pixels, or genes are "turned on"; we just don't know a priori which ones those are. Finding sparse solutions to such systems would better match the real underlying situation. It would also in many cases have important practical benefits, i.e. allowing us to install fewer antenna elements, make fewer measurements, store less data, or investigate fewer genes.

The search for sparse solutions can transform the problem completely, in many cases making a unique solution possible (Lemma 2.1 below; see also [7, 8, 16, 14, 26, 27]). Unfortunately, this only seems to change the problem from an impossible one to an intractable one! Finding the sparsest solution to a general underdetermined system of equations is NP-hard [21]; many classic combinatorial optimization problems can be cast in that form.

In this paper we will see that for most underdetermined systems of equations, when a sufficiently sparse solution exists, it can be found by convex optimization. More precisely, for a given ratio m/n of unknowns to equations, there is a threshold ρ so that most large n by m matrices generate systems of equations with two properties:

(a) If we run convex optimization to find the ℓ¹-minimal solution, and happen to find a solution with fewer than ρn nonzeros, then this is the unique sparsest solution to the equations; and

(b) If the result does not happen to have ρn nonzeros, there is no solution with < ρn nonzeros.

In such cases, if a sparse solution would be very desirable (needing far fewer than ρn coefficients), it may be found by convex optimization. If it is of relatively small value (needing close to ρn coefficients), finding the optimal solution requires combinatorial optimization.
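The phenomenon described above can be sketched numerically. The following is an illustrative script of ours, not the paper's own code: it draws a random matrix with unit-norm columns, plants a sparse coefficient vector, and solves the ℓ¹ minimization problem as a linear program (the split x = u − v with u, v ≥ 0 is the standard LP reduction of the ℓ¹ objective); the sizes n, m, k are hypothetical choices for the demo.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(Phi, y):
    """Solve min ||x||_1 subject to Phi x = y, as a linear program.

    Standard reduction: write x = u - v with u, v >= 0, so ||x||_1 = sum(u + v).
    """
    n, m = Phi.shape
    res = linprog(np.ones(2 * m), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=[(0, None)] * (2 * m), method="highs")
    assert res.success
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(0)
n, m, k = 30, 60, 3                          # k-sparse coefficients, k well below n
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)           # columns on the unit sphere S^{n-1}
alpha0 = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
alpha0[support] = rng.standard_normal(k)
x1 = l1_minimize(Phi, Phi @ alpha0)
err = np.linalg.norm(x1 - alpha0)
print("recovery error:", err)
```

With k this small relative to n, the ℓ¹ minimizer typically coincides with the planted sparse vector, which is exactly the equivalence the paper studies.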

    1.1 Background: Signal Representation

    To place this result in context, we describe its genesis.

In recent years, a large body of research has focused on the use of overcomplete signal representations, in which a given signal S ∈ Rⁿ is decomposed as S = Σᵢ αᵢφᵢ using a dictionary of m > n atoms. Equivalently, we try to solve S = Φα for Φ an n by m matrix. Overcompleteness implies that m > n, so the problem is underdetermined. The goal is to use the freedom this allows to provide a sparse representation.

Motivations for this viewpoint were first obtained empirically in the early 1990s, where representations of signals were obtained using, e.g., combinations of several orthonormal bases by Coifman and collaborators [4, 5], combinations of several frames in Mallat and Zhang's work on Matching Pursuit [19], and by Chen, Donoho, and Saunders in the mid 1990s [3].

A theoretical perspective showing that there is a sound mathematical basis for overcomplete representation has come together rapidly in recent years; see [7, 8, 12, 14, 16, 26, 27]. An early


result was the following: suppose that Φ is the concatenation of two orthobases, so that m = 2n. Suppose that the coherence, the maximal inner product between any pair of columns of Φ, is at most M. Suppose that S = Φα₀ where α₀ has at most N nonzeros. If N < M⁻¹, α₀ provides the unique optimally sparse representation of S. Consider the solution α₁ to the problem

    min ‖α‖₁ subject to S = Φα.

If N ≤ (1 + M⁻¹)/2, we have α₁ = α₀. In short, we can recover the sparsest representation by solving a convex optimization problem.

As an example, a signal of length n which is a superposition of no more than √n/2 total spikes and sinusoids is uniquely representable in that form and can be uniquely recovered by ℓ¹ optimization (in this case M = 1/√n). The sparsity bound required in this result, comparable to √n, is disappointingly small; however, it was surprising at the time that any such result was possible. Many substantial improvements on these results have since been made [12, 8, 14, 16, 26, 27].

It was mentioned in [7] that the phenomena proved there represented only the tip of the iceberg. Computational results published there showed that for randomly-generated systems, one could get unique recovery even with as many as about n/5 nonzeros in a 2-fold overcomplete representation. Hence, empirically, even a mildly sparse representation could be exactly recovered by ℓ¹ optimization.

Very recently, Candès, Romberg and Tao [2] showed that for partial Fourier systems, formed by taking n rows at random from an m-by-m standard Fourier matrix, the resulting n by m matrix with overwhelming probability allowed exact equivalence between (P0) and (P1) in all cases where the number N of nonzeros was smaller than cn/log(n). This very inspiring result shows that equivalence is possible with a number of nonzeros almost proportional to n. Furthermore, [2] showed empirical examples where equivalence held with as many as n/4 nonzeros.

    1.2 This Paper

In previous work, equivalence between the minimal ℓ¹ solution and the optimally sparse solution required that the sparse solution have an asymptotically negligible fraction of nonzeros. The fraction O(n^{−1/2}) could be accommodated in results of [7, 12, 8, 14, 26], and O(1/log(n)) in [2].

In this paper we construct a large class of examples where equivalence holds even when the number of nonzeros is proportional to n. More precisely, we show that there is a constant ρ(A) > 0 so that all but a negligible proportion of large n by m matrices with n < m ≤ An have the following property: for every system S = Φα allowing a solution with fewer than ρn nonzeros, ℓ¹ minimization uniquely finds that solution. Here "proportion of matrices" is taken using the natural uniform measure on the space of matrices with columns of unit ℓ² norm.

In contrast, greedy algorithms and thresholding algorithms seem to fail in this setting.

An interesting feature of our analysis is its use of techniques from Banach space theory, in particular quantitative extensions of Dvoretsky's almost spherical sections theorem (by Milman, Kashin, Schechtman, and others), and other related tools exploiting randomness in high-dimensional spaces, including properties of the minimum eigenvalue of Wishart matrices.

Section 2 gives a formal statement of the result and the overall proof architecture; Sections 3-5 prove key lemmas; Section 6 discusses the failure of greedy and thresholding procedures; Section 7 describes a geometric interpretation of these results. Section 8 discusses a heuristic that correctly predicts the empirical equivalence breakdown point. Section 9 discusses stability and well-posedness.


    2 Overview

Let φ₁, φ₂, . . . , φ_m be random points on the unit sphere S^{n−1} in Rⁿ, independently drawn from the uniform distribution. Let Φ = [φ₁ . . . φ_m] be the matrix obtained by concatenating the resulting vectors. We denote this Φ_{n,m} when we wish to emphasize the size of the matrix.

For a vector S ∈ Rⁿ we are interested in the sparsest possible representation of S using columns of Φ; this is given by:

    (P0)  min ‖α‖₀ subject to Φα = S.

It turns out that, if (P0) has any sufficiently sparse solutions, then it will typically have a unique sparsest one.

Lemma 2.1 On an event E having probability 1, the matrix Φ has the following unique sparsest solution property:

For every vector α₀ having ‖α₀‖₀ < n/2, the vector S = Φα₀ generates an instance of problem (P0) whose solution is uniquely α₀.

Proof. With probability one, the φᵢ are in general position in Rⁿ. If there were two solutions α₀, α₁, both with fewer than n/2 nonzeros, we would have Φα₀ = Φα₁, implying Φ(α₁ − α₀) = 0, a linear relation involving n conditions satisfied using fewer than n points, contradicting general position. QED

In general, solving (P0) requires combinatorial optimization and is impractical. The ℓ¹ norm is in some sense the convex relaxation of the ℓ⁰ norm. So consider instead the minimal ℓ¹-norm representation:

    (P1)  min ‖α‖₁ subject to Φα = S.

This poses a convex optimization problem, and so in principle is more tractable than (P0). Surprisingly, when the answer to (P0) is sparse, it can be the same as the answer to (P1).

Definition 2.1 The Equivalence Breakdown Point of a matrix Φ, EBP(Φ), is the maximal number N such that, for every α₀ with fewer than N nonzeros, the corresponding vector S = Φα₀ generates a linear system Φα = S for which problems (P1) and (P0) have identical unique solutions, both equal to α₀.

Using known results, we have immediately that the EBP typically exceeds c·√(n/log(m)).

Lemma 2.2 For each η > 0,

    Prob{ EBP(Φ_{n,m}) > √(n/((8 + η) log(m))) } → 1,  n → ∞.

Proof. With overwhelming probability for large n, the mutual coherence M = max_{i≠j} |⟨φᵢ, φⱼ⟩| is of size O(√(log(m)/n)); combining this with the known bound EBP(Φ) ≥ (1 + M⁻¹)/2 [7, 12] gives the result. QED

Our main result improves the number of nonzeros from order √(n/log(m)) to order n:

Theorem 1 There is a constant ρ(A) > 0 so that for every sequence (m_n) with m_n ≤ An,

    Prob{ n⁻¹ EBP(Φ_{n,m_n}) ≥ ρ(A) } → 1,  n → ∞.

In words, the overwhelming majority of n by m matrices have the property that ℓ¹ minimization will exactly recover the sparsest solution whenever it has at most ρn nonzeros. An explicit lower bound for ρ can be given based on our proof, but it is exaggeratedly small. As we point out later, empirical studies observed in computer simulations set (3/10)n as the empirical breakdown point when A = 2, and a heuristic based on our proof quite precisely predicts the same breakdown point; see Section 8 below.
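The empirical breakdown point mentioned above can be probed by a rough Monte Carlo experiment. The sketch below is ours, not the paper's simulation code; the tiny sizes (n = 20, m = 40, i.e. A = 2) and trial count are hypothetical choices for speed, so the asymptotic (3/10)n figure is not sharp here, but the decay of the recovery rate with k/n is visible.

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(Phi, y):
    # (P1) as an LP via the split x = u - v with u, v >= 0.
    m = Phi.shape[1]
    res = linprog(np.ones(2 * m), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=[(0, None)] * (2 * m), method="highs")
    assert res.success
    return res.x[:m] - res.x[m:]

def recovery_rate(n, m, k, trials, rng):
    """Fraction of random k-sparse problems where l1 recovers the planted vector."""
    hits = 0
    for _ in range(trials):
        Phi = rng.standard_normal((n, m))
        Phi /= np.linalg.norm(Phi, axis=0)      # unit-norm columns
        alpha0 = np.zeros(m)
        alpha0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
        x1 = l1_solve(Phi, Phi @ alpha0)
        hits += np.linalg.norm(x1 - alpha0) < 1e-5
    return hits / trials

rng = np.random.default_rng(1)
n, m = 20, 40                                   # A = 2, as in the cited simulations
rates = {k: recovery_rate(n, m, k, trials=20, rng=rng) for k in (1, 3, 6, 9, 12)}
print(rates)
```

Sweeping k and locating where the rate drops through 1/2 is one crude way to estimate EBP for a given (n, m).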

The space of n × m matrices having columns with unit norm is, of course, the product of m copies of the sphere:

    S^{n−1} × · · · × S^{n−1}  (m terms).

Now the probability measure we are assuming on our random matrix Φ is precisely the canonical uniform measure on this space. Hence, the above result shows that having EBP(Φ) ≥ ρn is a generic property of matrices, experienced on a set of nearly full measure.

    2.1 Proof Outline

Let S = Φα₀ and let I = supp(α₀). Suppose there is an alternate decomposition

    S = Φ(α₀ + δ),

where the perturbation δ obeys Φδ = 0. Partitioning δ = (δ_I, δ_{I^c}), we have

    Φ_I δ_I = −Φ_{I^c} δ_{I^c}.

We will simply show that, on a certain event Ω_n(ρ, A),

    ‖δ_I‖₁ < ‖δ_{I^c}‖₁    (2.1)

uniformly over every I with |I| < ρn and over every δ ≠ 0 with Φδ = 0. Now

    ‖α₀ + δ‖₁ − ‖α₀‖₁ ≥ ‖δ_{I^c}‖₁ − ‖δ_I‖₁.

It is then always the case that any perturbation δ ≠ 0 increases the ℓ¹ norm relative to the unperturbed case δ = 0. In words, every perturbation hurts the ℓ¹ norm more off the support of α₀ than it helps the norm on the support of α₀, so it hurts the ℓ¹ norm overall, so every perturbation leads away from what, by convexity, must therefore be the global optimum. It follows that the ℓ¹ minimizer is unique whenever |I| < ρn and the event Ω_n(ρ, A) occurs.

Formally, the event Ω_n(ρ, A) is the intersection of 3 subevents Ω_n^i, i = 1, 2, 3. These depend on positive constants η_i and ρ_i to be chosen later. The subevents are:

Ω_n^1: The minimum singular value of Φ_I exceeds η₁, uniformly in I with |I| < ρ₁n.

Ω_n^2: Denote v = Φ_I δ_I. The ℓ¹ norm ‖v‖₁ exceeds η₂ √n ‖v‖₂, uniformly in I with |I| < ρ₂n.

Ω_n^3: Let δ_{I^c} obey v = −Φ_{I^c} δ_{I^c}. The ℓ¹ norm ‖δ_{I^c}‖₁ exceeds η₃ ‖v‖₁, uniformly in I with |I| < ρ₃n.


Lemmas 3.1, 4.1, and 5.1 show that one can choose the η_i and ρ_i so that the complement of each of the Ω_n^i, i = 1, 2, 3, tends to zero exponentially fast in n. We do so. It follows, with ρ₄ ≡ min_i ρ_i, that the intersection event E₄,n ≡ ∩_i Ω_n^i is overwhelmingly likely for large n.

When we are on the event E₄,n, Ω_n^1 gives us

    ‖δ_I‖₁ ≤ √|I| ‖δ_I‖₂ ≤ √|I| ‖v‖₂ / λ_min^{1/2}(Φ_I^T Φ_I) ≤ η₁⁻¹ |I|^{1/2} ‖v‖₂.

At the same time, Ω_n^2 gives us

    ‖v‖₁ ≥ η₂ √n ‖v‖₂.

Finally, Ω_n^3 gives us

    ‖δ_{I^c}‖₁ ≥ η₃ ‖v‖₁,

and hence, provided

    |I|^{1/2} < η₁ η₂ η₃ √n,

we have (2.1), and hence ℓ¹ succeeds. In short, we just need to bound the fraction |I|/n. Now pick ρ = min(ρ₄, (η₁ η₂ η₃)²), and set Ω_n(ρ, A) = E₄,n; we get EBP(Φ) ≥ ρn on Ω_n(ρ, A).

It remains to prove Lemmas 3.1, 4.1, and 5.1 supporting the above analysis.
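Two links in the chain above are deterministic linear-algebra facts rather than probabilistic ones, and can be checked directly: ‖δ_I‖₁ ≤ √|I|·‖δ_I‖₂ (Cauchy-Schwarz) and ‖δ_I‖₂ ≤ ‖Φ_I δ_I‖₂ / λ_min^{1/2}(Φ_I^T Φ_I). A small numerical sanity check of ours, with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 10
Phi_I = rng.standard_normal((n, k))
Phi_I /= np.linalg.norm(Phi_I, axis=0)        # unit-norm columns
delta_I = rng.standard_normal(k)
v = Phi_I @ delta_I                           # v = Phi_I delta_I

lam_min = np.linalg.eigvalsh(Phi_I.T @ Phi_I).min()
# Cauchy-Schwarz: ||delta_I||_1 <= sqrt(|I|) * ||delta_I||_2
lhs1 = np.abs(delta_I).sum()
rhs1 = np.sqrt(k) * np.linalg.norm(delta_I)
# smallest eigenvalue bound: ||delta_I||_2 <= ||v||_2 / lam_min^{1/2}
lhs2 = np.linalg.norm(delta_I)
rhs2 = np.linalg.norm(v) / np.sqrt(lam_min)
print(lhs1 <= rhs1, lhs2 <= rhs2)
```

The probabilistic content of Sections 3-5 is exactly to bound λ_min away from 0, ‖v‖₁/(√n‖v‖₂) away from 0, and ‖δ_{I^c}‖₁/‖v‖₁ away from 0, uniformly over small index sets I.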

    3 Controlling the Minimal Eigenvalues

We first show there is, with overwhelming probability, a uniform bound η₁(ρ, A) on the minimal singular value of every matrix Φ_I constructible from the matrix Φ with |I| < ρn. This is of independent interest; see Section 9.

Lemma 3.1 Let λ < 1. Define the event

    Ω_{n,m,λ,ρ} = { λ_min(Φ_I^T Φ_I) ≥ λ, for all I with |I| < ρn }.

There is ρ₁ = ρ₁(λ, A) > 0 so that, along sequences (m_n) with m_n ≤ An,

    P(Ω_{n,m_n,λ,ρ₁}) → 1,  n → ∞.

The bound η₁(ρ, A) > 0 is implied by this result; simply invert the relation ρ₁(λ, A) and put η₁ = λ^{1/2}.

    3.1 Individual Result

We first study λ_min(Φ_I^T Φ_I) for a single fixed I.

Lemma 3.2 Let λ > 0 be sufficiently small. There exist β = β(λ) > 0, ρ = ρ(λ) > 0 and n₁(λ), so that for k = |I| ≤ ρn we have

    P{ λ_min(Φ_I^T Φ_I) ≤ λ² } ≤ exp(−nβ),  n > n₁.


Effectively our idea is to show that Φ_I is related to matrices of iid Gaussians, for which such phenomena are already known.

Without loss of generality suppose that I = {1, . . . , k}. Let R_i, i = 1, . . . , k, be iid random variables distributed χ_n/√n, where χ_n denotes the χ distribution with n degrees of freedom. These can be generated by taking iid standard normal RVs Z_ij which are independent of (φ_i) and setting

    R_i = ( n⁻¹ Σ_{j=1}^n Z_ij² )^{1/2}.    (3.1)

Let x_i = R_i φ_i; then the x_i are iid N(0, (1/n)I_n), and we view them as the columns of X. With R = diag((R_i)_i), we have Φ_I = X R⁻¹, and so

    λ_min(Φ_I^T Φ_I) = λ_min((R⁻¹)^T X^T X R⁻¹) ≥ λ_min(X^T X) · (max_i R_i)⁻².    (3.2)

Hence, for given λ > 0 and Δ > 0, the two events

    E = { λ_min(X^T X) ≥ (λ + Δ)² },  F = { max_i R_i < 1 + Δ/λ }

together imply

    λ_min(Φ_I^T Φ_I) ≥ λ².

The following lemma will be proved in the next subsection:

Lemma 3.3 For u > 0,

    P{ max_i R_i > 1 + u } ≤ exp{−nu²/2}.    (3.3)

There we will also prove:

Lemma 3.4 Let X be an n by k matrix of iid N(0, 1/n) Gaussians, k < n. Let λ_min(X^T X) denote the minimum eigenvalue of X^T X. For t > 0 and k/n ≤ ρ,

    P{ λ_min(X^T X) < (1 − √ρ − t)² } ≤ exp(−nt²/2),  n > n₀(ρ, t).    (3.4)

Pick now λ > 0 and Δ > 0 with λ + Δ < 1, and choose ρ so that √ρ + λ + Δ < 1; finally, put t = 1 − √ρ − λ − Δ and u = Δ/λ. Then by Lemma 3.4,

    P(E^c) ≤ exp(−nt²/2),  n > n₀(ρ, t),

while by Lemma 3.3,

    P(F^c) ≤ exp(−nu²/2).

Setting β < min(t²/2, u²/2), we conclude that, for n₁ = n₁(λ, Δ, ρ),

    P{ λ_min(Φ_I^T Φ_I) < λ² } ≤ exp(−nβ),  n > n₁(λ, Δ, ρ).

QED
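The coupling used in this proof can be sketched directly (our illustration, not the paper's code): a Gaussian column x_i = Z_i/√n factors as R_i φ_i with φ_i uniform on the sphere and R_i as in (3.1), and the eigenvalue inequality (3.2) can be verified numerically. Sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 20
Z = rng.standard_normal((n, k))
X = Z / np.sqrt(n)                           # columns iid N(0, (1/n) I_n)
R = np.linalg.norm(X, axis=0)                # R_i = (n^{-1} sum_j Z_ij^2)^{1/2}, eq. (3.1)
Phi_I = X / R                                # unit-sphere columns; X = Phi_I diag(R)

# check the factorization and the eigenvalue inequality (3.2)
lam_phi = np.linalg.eigvalsh(Phi_I.T @ Phi_I).min()
lam_x = np.linalg.eigvalsh(X.T @ X).min()
bound = lam_x / R.max() ** 2
print(np.allclose(X, Phi_I * R), lam_phi >= bound)
```

Both checks are deterministic consequences of the factorization Φ_I = X R⁻¹, so they hold for every draw, not just typical ones.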


    3.2 Invoking Concentration of Measure

We now prove Lemma 3.3. Now (3.1) exhibits each R_i as a function of n iid standard normal random variables, Lipschitz with respect to the standard Euclidean metric, with Lipschitz constant 1/√n. Moreover, max_i R_i is itself such a Lipschitz function. By concentration of measure for Gaussian variables [18], (3.3) follows.

The proof of Lemma 3.4 depends on the observation (see Szarek [25], Davidson-Szarek [6] or El Karoui [13]) that the singular values of Gaussian matrices obey concentration of measure:

Lemma 3.5 Let X be an n by k matrix of iid N(0, 1/n) Gaussians, k < n. Let s_ℓ(X) denote the ℓ-th largest singular value of X, s₁ ≥ s₂ ≥ . . . . Let λ_{ℓ;k,n} = Median(s_ℓ(X)). Then

    P{ s_ℓ(X) < λ_{ℓ;k,n} − t } ≤ exp(−nt²/2).

The idea is that a given singular value, viewed as a function of the entries of the matrix, is Lipschitz with respect to the Euclidean metric on R^{nk}. Then one applies concentration of measure for scaled Gaussian variables.

As for the median λ_{k;k,n}, we remark that the well-known Marcenko-Pastur law implies that, if k_n/n → ρ,

    λ_{k_n;k_n,n} → 1 − √ρ,  n → ∞.

Hence, for given ε > 0 and all sufficiently large n > n₀(ρ, ε), λ_{k_n;k_n,n} > 1 − √ρ − ε. Observing that s_k(X)² = λ_min(X^T X) gives the conclusion (3.4).
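The Marcenko-Pastur prediction s_min ≈ 1 − √ρ is already visible at moderate sizes. A quick simulation of ours (the sizes and trial count are hypothetical; ρ = 1/4 so the predicted limit is 1/2):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 400, 100                              # rho = k/n = 1/4, so 1 - sqrt(rho) = 1/2
smins = []
for _ in range(20):
    X = rng.standard_normal((n, k)) / np.sqrt(n)   # iid N(0, 1/n) entries
    smins.append(np.linalg.svd(X, compute_uv=False).min())
smins = np.array(smins)
print("mean s_min:", smins.mean(), "predicted:", 1 - np.sqrt(k / n))
```

The tight clustering of the samples around the prediction is the concentration phenomenon that Lemma 3.5 quantifies.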

    3.3 Proof of Lemma 3.1

We now combine estimates for individual I's obeying |I| ≤ ρn to obtain the simultaneous result. We need a standard combinatorial fact, used here and below:

Lemma 3.6 For p ∈ (0, 1), let H(p) = p log(1/p) + (1 − p) log(1/(1 − p)) be the Shannon entropy. Then

    log ( N choose pN ) = N H(p) (1 + o(1)),  N → ∞.
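Lemma 3.6 is easy to check numerically with log-gamma (an illustrative check of ours, not part of the proof): the ratio log(N choose pN) / (N·H(p)) approaches 1 as N grows.

```python
import math

def log_binom(N, K):
    # log of the binomial coefficient via log-gamma, stable for large N
    return math.lgamma(N + 1) - math.lgamma(K + 1) - math.lgamma(N - K + 1)

def H(p):
    # Shannon entropy in nats, as in Lemma 3.6
    return p * math.log(1 / p) + (1 - p) * math.log(1 / (1 - p))

p = 0.25
for N in (100, 1000, 10000):
    print(N, log_binom(N, int(p * N)) / (N * H(p)))
```

The correction term is of order log(N), which is why the o(1) factor vanishes only slowly.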

Now for a given λ ∈ (0, 1), and each index set I, define the event

    Ω_{n,I;λ} = { λ_min(Φ_I^T Φ_I) ≥ λ }.

Then

    Ω_{n,m,λ,ρ} = ∩_{|I| ≤ ρn} Ω_{n,I;λ}.

By Boole's inequality,

    P(Ω_{n,m,λ,ρ}^c) ≤ Σ_{|I| ≤ ρn} P(Ω_{n,I;λ}^c),

so

    log P(Ω_{n,m,λ,ρ}^c) ≤ log #{I : |I| ≤ ρn} + max_I log P(Ω_{n,I;λ}^c),    (3.5)

and we want the right-hand side to tend to −∞. By Lemma 3.6,

    log #{I : |I| ≤ ρn} = log ( m_n choose ρn ) = A n H(ρ/A) (1 + o(1)).


Invoking now Lemma 3.2, we get a β > 0 so that for n > n₀(ρ, λ) we have

    log P(Ω_{n,I;λ}^c) ≤ −nβ.

We wish to show that the −nβ in this relation can outweigh the A n H(ρ/A) in the preceding one, giving a combined result in (3.5) tending to −∞. Now note that the Shannon entropy H(p) → 0 as p → 0. Hence for small enough ρ, A H(ρ/A) < β. Picking such a ρ (call it ρ₁) and setting β₁ = β − A H(ρ₁/A) > 0, we have for n > n₀ that

    log P(Ω_{n,m,λ,ρ₁}^c) ≤ A n H(ρ₁/A)(1 + o(1)) − nβ,

which implies an n₁ so that

    P(Ω_{n,m,λ,ρ₁}^c) ≤ exp(−β₁ n),  n > n₁(λ, ρ₁).

QED

    4 Almost-Spherical Sections

Dvoretsky's theorem [10, 22] says that every infinite-dimensional Banach space contains very high-dimensional subspaces on which the Banach norm is nearly proportional to the Euclidean norm. This is called the spherical sections property, as it says that slicing the unit ball in the Banach space by intersection with an appropriate finite-dimensional linear subspace will result in a slice that is effectively spherical. We need a quantitative refinement of this principle for the ℓ¹ norm in Rⁿ, showing that, with overwhelming probability, every operator Φ_I for |I| < ρn affords a spherical section of the ℓ¹_n ball. The basic argument we use derives from refinements of Dvoretsky's theorem in Banach space theory, going back to work of Milman and others [15, 24, 20].

Definition 4.1 Let |I| = k. We say that Φ_I offers an ε-isometry between ℓ²(I) and ℓ¹_n if

    (1 − ε) ‖α‖₂ ≤ √(π/(2n)) ‖Φ_I α‖₁ ≤ (1 + ε) ‖α‖₂,  for all α ∈ R^k.    (4.1)

Remarks: 1. The scale factor √(π/(2n)) embedded in the definition is reciprocal to the expected ℓ¹_n norm of a vector with iid N(0, 1/n) Gaussian entries. 2. In Banach space theory, the same notion would be called a (1 + ε)-isometry [15, 22].
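The scale factor in Definition 4.1 can be sanity-checked by simulation (an illustrative sketch of ours, with hypothetical sizes): for g iid N(0, 1/n) in Rⁿ one has E‖g‖₁ = √(2n/π), so for a Gaussian stand-in for Φ_I the rescaled ℓ¹ norm concentrates near ‖α‖₂.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 2000, 5
X = rng.standard_normal((n, k)) / np.sqrt(n)   # iid N(0, 1/n) columns (Gaussian model)
alpha = rng.standard_normal(k)
# X @ alpha is N(0, (1/n)||alpha||_2^2 I_n), so the rescaled l1 norm is near ||alpha||_2
scaled = np.sqrt(np.pi / (2 * n)) * np.abs(X @ alpha).sum()
print(scaled, np.linalg.norm(alpha))
```

The two printed numbers agree to within a few percent for n this large, which is the concentration that Section 4 makes uniform over all small I.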

Lemma 4.1 (Simultaneous ε-isometry) Consider the event Ω²_n (≡ Ω²_n(ε, ρ)) that every Φ_I with |I| ≤ ρn offers an ε-isometry between ℓ²(I) and ℓ¹_n. For each ε > 0, there is ρ₂(ε) > 0 so that

    P(Ω²_n(ε, ρ₂)) → 1,  n → ∞.

    4.1 Proof of Simultaneous Isometry

Our approach is based on a result for an individual I, which will later be extended to get a result for every I. This individual result is well known in Banach space theory, going back to [24, 17, 15]. For our proof, we repackage key elements from the proof of Theorem 4.4 in Pisier's book [22]. Pisier's argument shows that for one specific I, there is a positive probability that Φ_I offers an ε-isometry. We add extra bookkeeping to find that the probability is actually overwhelming, and later conclude that there is overwhelming probability that every Φ_I with |I| < ρn offers such an isometry.


Lemma 4.2 (Individual ε-isometry) Fix ε > 0. Choose η = η(ε) ∈ (0, ε) and then δ = δ(η) > 0 so that

    (1 − 3δ)(1 − δ)⁻¹ ≥ (1 − η)^{1/2} and (1 + δ)(1 − δ)⁻¹ ≤ (1 + η)^{1/2}.    (4.2)

Choose ρ₀ = ρ₀(ε) > 0 so that

    ρ₀ log(1 + 2/δ) < δ²/π,

and let β(ε) denote the difference between the two sides. For a subset I of {1, . . . , m} let Ω_{n,I} denote the event { Φ_I offers an ε-isometry to ℓ¹_n }. Then as n → ∞,

    max_{|I| ≤ ρ₀n} P(Ω_{n,I}^c) ≤ 2 exp(−β(ε) n (1 + o(1))).

This lemma will be proved in Section 4.2. We first show how it implies Lemma 4.1. With β(ε) as given in Lemma 4.2, we choose ρ₂(ε) < ρ₀(ε) satisfying

    A H(ρ₂/A) < β(ε),

where H(p) is the Shannon entropy, and let γ > 0 be the difference between the two sides. Now

    Ω²_n = ∩_{|I| ≤ ρ₂n} Ω_{n,I},

so Boole's inequality and Lemma 3.6 give

    P((Ω²_n)^c) ≤ exp(A n H(ρ₂/A)(1 + o(1))) · 2 exp(−β(ε) n (1 + o(1))) ≤ 2 exp(−γ n (1 + o(1))) → 0,  n → ∞. QED

4.2 Proof of Individual Isometry

Lemma 4.3 Let x_i ∈ Rⁿ. For each η > 0, choose δ obeying (4.2). Let N be a δ-net for S^{k−1} under the ℓ²_k metric. The validity on this net of the norm equivalence

    1 − δ ≤ √(π/(2n)) ‖Σ_i α_i x_i‖₁ ≤ 1 + δ,  α ∈ N,

implies norm equivalence on the whole space:

    (1 − η)^{1/2} ‖α‖₂ ≤ √(π/(2n)) ‖Σ_i α_i x_i‖₁ ≤ (1 + η)^{1/2} ‖α‖₂,  α ∈ R^k.

Lemma 4.4 There is a δ-net N for S^{k−1} under the ℓ²_k metric obeying

    log(#N) ≤ k log(1 + 2/δ).

So, given ε > 0 in the statement of our Lemma, invoke Lemma 4.3 to get workable η and δ, and invoke Lemma 4.4 to get a net N obeying the required bound. Corresponding to each element α of the net N, define now the event

    E_α = { 1 − δ ≤ √(π/(2n)) ‖Σ_i α_i x_i‖₁ ≤ 1 + δ }.

On the event E = ∩_{α∈N} E_α, we may apply Lemma 4.3 to conclude that the system (x_i : 1 ≤ i ≤ k) gives η-equivalence between the ℓ² norm on R^k and the ℓ¹ norm on Span(x_i).

Now E_α^c = { |f_α − E f_α| > δ }, where f_α ≡ √(π/(2n)) ‖Σ_i α_i x_i‖₁ and E f_α = 1 for α ∈ S^{k−1}. We note that f_α may be viewed as a function g of kn iid standard normal random variables, where g is a Lipschitz function on R^{kn} with respect to the ℓ² metric, having Lipschitz constant σ = √(π/(2n)). By concentration of measure for Gaussian variables [18, Section 1.2-1.3],

    P{ |f_α − E f_α| > t } ≤ 2 exp{−t²/(2σ²)}.

Hence

    P(E_α^c) ≤ 2 exp{−δ² n/π}.

From Lemma 4.4 we have

    log #N ≤ k log(1 + 2/δ),

and so, for k ≤ ρ₀n,

    log(P(E^c)) ≤ k log(1 + 2/δ) + log 2 − δ² n/π ≤ log(2) − n β(ε).

We conclude that the x_i give a near-isometry with overwhelming probability.

We now de-Gaussianize. We argue that, with overwhelming probability, we also get an ε-isometry of the desired type for Φ_I. Setting α_i = β_i R_i⁻¹, observe that

    Σ_i α_i x_i = Σ_i β_i φ_i.    (4.4)

Pick Δ so that

    (1 + Δ)(1 − ε) ≤ (1 − η)^{1/2},  (1 − Δ)(1 + ε) ≥ (1 + η)^{1/2}.    (4.5)

Consider the event

    G = { 1 − Δ < R_i < 1 + Δ : i = 1, . . . , k }.


On this event we have

    (1 − Δ) ‖α‖₂ ≤ ‖β‖₂ ≤ (1 + Δ) ‖α‖₂.

It follows that on the event G ∩ E, we have:

    (1 − η)^{1/2} (1 + Δ)⁻¹ ‖β‖₂ ≤ (1 − η)^{1/2} ‖α‖₂ ≤ √(π/(2n)) ‖Σ_i α_i x_i‖₁ ( = √(π/(2n)) ‖Σ_i β_i φ_i‖₁ by (4.4) ) ≤ (1 + η)^{1/2} ‖α‖₂ ≤ (1 + η)^{1/2} (1 − Δ)⁻¹ ‖β‖₂.

Taking into account (4.5), we indeed get an ε-isometry. Hence, Ω_{n,I} ⊃ G ∩ E.

Now

    P(G^c) = P{ max_i |R_i − 1| > Δ }.

By (3.1), we may also view |R_i − 1| as a function of n iid standard normal random variables, Lipschitz with respect to the standard Euclidean metric, with Lipschitz constant 1/√n. This gives

    P{ max_i |R_i − 1| > Δ } ≤ 2m exp{−nΔ²/2} = 2m exp{−n β_G(Δ)}.    (4.6)

Combining these we get that for |I| ≤ ρ₀n,

    P(Ω_{n,I}^c) ≤ P(E^c) + P(G^c) ≤ 2 exp(−β(ε) n) + 2m exp(−β_G(Δ) n).

We note that β_G(Δ) will certainly be larger than β(ε). QED

    5 Sign-Pattern Embeddings

Let I be any collection of indices in {1, . . . , m}; Range(Φ_I) is a linear subspace of Rⁿ, and on this subspace a subset Σ_I of possible sign patterns can be realized, i.e. sequences σ ∈ {−1, 1}ⁿ generated by

    σ(k) = sgn( Σ_{i∈I} β_i φ_i(k) ),  1 ≤ k ≤ n.

Our proof of Theorem 1 needs to show that for every v ∈ Range(Φ_I), some approximation y to a small multiple of sgn(v) satisfies |⟨y, φ_i⟩| ≤ 1 for i ∈ I^c.

Lemma 5.1 (Simultaneous Sign-Pattern Embedding) Positive functions Λ(ε) and ρ₃(ε; A) can be defined on (0, ε₀) so that Λ(ε) → 0 as ε → 0, yielding the following properties. For each ε < ε₀, there is an event Ω³_n (≡ Ω³_{n,ε}) with

    P(Ω³_n) → 1,  n → ∞.

On this event, for every subset I with |I| < ρ₃n, and for every sign pattern σ ∈ Σ_I, there is a vector y (≈ εσ) with

    ‖y − εσ‖₂ ≤ Λ(ε) ‖εσ‖₂    (5.1)

and

    |⟨φ_i, y⟩| ≤ 1,  i ∈ I^c.    (5.2)


In words, a small multiple εσ of any sign pattern σ almost lives in the dual ball {x : |⟨φ_i, x⟩| ≤ 1}. The key aspects are the proportional dimension of the constraint, ρ₃n, and the proportional distortion required to fit in the dual ball.

Before proving this result, we indicate how it supports our claim for Ω³_n in the proof of Theorem 1; namely, that if |I| < ρ₃n, then

    ‖δ_{I^c}‖₁ ≥ η₃ ‖v‖₁    (5.3)

whenever v = −Φ_{I^c} δ_{I^c}. By the duality theorem for linear programming, the value of the primal program

    min ‖δ_{I^c}‖₁ subject to −Φ_{I^c} δ_{I^c} = v    (5.4)

is at least the value of the dual

    max ⟨v, y⟩ subject to |⟨φ_i, y⟩| ≤ 1, i ∈ I^c.

Lemma 5.1 gives us a supply of dual-feasible vectors and hence a lower bound on the dual program. Take σ = sgn(v); we can find y which is dual-feasible and obeys

    ⟨v, y⟩ ≥ ⟨v, εσ⟩ − ‖y − εσ‖₂ ‖v‖₂ ≥ ε ‖v‖₁ − ε Λ(ε) √n ‖v‖₂;

picking ε sufficiently small and taking into account the spherical sections theorem, we arrange that ε Λ(ε) √n ‖v‖₂ ≤ (ε/4) ‖v‖₁ uniformly over v ∈ V_I where |I| < ρ₃n; (5.3) follows with η₃ = 3ε/4.
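The weak-duality step invoked here is elementary and easy to check numerically: for any y with |⟨φ_i, y⟩| ≤ 1 and any x with Φx = v, one has ⟨v, y⟩ = Σ x_i⟨φ_i, y⟩ ≤ ‖x‖₁, so every dual-feasible y lower-bounds the minimal ℓ¹ norm. A small illustration of ours (hypothetical sizes):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
n, m = 15, 30
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
v = Phi @ rng.standard_normal(m)

# primal optimum: min ||x||_1 subject to Phi x = v (split x = u - w, u, w >= 0)
res = linprog(np.ones(2 * m), A_eq=np.hstack([Phi, -Phi]), b_eq=v,
              bounds=[(0, None)] * (2 * m), method="highs")
assert res.success
primal = res.fun

# any dual-feasible y (all |<phi_i, y>| <= 1) lower-bounds the primal value
y = rng.standard_normal(n)
y /= max(1.0, np.abs(Phi.T @ y).max())        # rescale into the dual ball
dual_value = v @ y
print(dual_value <= primal)
```

The quality of the lower bound depends on how well y aligns with sgn(v), which is exactly what Lemma 5.1 engineers.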

    5.1 Proof of Simultaneous Sign-Pattern Embedding

The proof introduces a function β(ε), positive on (0, ε₀), which places a constraint on the size of ρ allowed. The bulk of the effort concerns the following lemma, which demonstrates approximate embedding of a single sign pattern in the dual ball. The β-function allows us to cover many individual such sign patterns, producing our result.

Lemma 5.2 (Individual Sign-Pattern Embedding) Let σ ∈ {−1, 1}ⁿ, let ε > 0, and set y₀ = εσ. There is an iterative algorithm, described below, producing a vector y as output which obeys

    |⟨φ_i, y⟩| ≤ 1,  i = 1, . . . , m.    (5.5)

Let (φ_i)_{i=1}^m be iid uniform on S^{n−1}; there is an event Ω_{σ,ε,n}, described below, having probability controlled by

    Prob(Ω_{σ,ε,n}^c) ≤ 2n exp{−nβ(ε)},    (5.6)

for a function β(ε) which can be explicitly given and which is positive for 0 < ε < ε₀. On this event,

    ‖y − y₀‖₂ ≤ Λ(ε) ‖y₀‖₂,    (5.7)

where Λ(ε) can be explicitly given and has Λ(ε) → 0 as ε → 0.

In short, with overwhelming probability (see (5.6)), a single sign pattern, shrunken appropriately, obeys (5.5) after a slight modification (indicated by (5.7)). Lemma 5.2 will be proven in a section of its own. We now show that it implies Lemma 5.1.

Lemma 5.3 Let V = Range(Φ_I) ⊂ Rⁿ. The number of different sign patterns generated by vectors v ∈ V obeys

    #Σ_I ≤ ( n choose 0 ) + ( n choose 1 ) + · · · + ( n choose |I| ).


Proof. This is known to statisticians as a consequence of the Vapnik-Chervonenkis VC-class theory. See Pollard [23, Chapter 4]. QED

Let again H(p) = p log(1/p) + (1 − p) log(1/(1 − p)) be the Shannon entropy. Notice that if |I| < ρn, then

    log(#Σ_I) ≤ n H(ρ)(1 + o(1)),

while also

    log #{I : |I| < ρn, I ⊂ {1, . . . , m}} ≤ n A H(ρ/A)(1 + o(1)).

Hence, the total number of all sign patterns generated by all operators Φ_I obeys

    log #{σ : σ ∈ Σ_I, |I| < ρn} ≤ n (H(ρ) + A H(ρ/A))(1 + o(1)).

Now the function β(ε) introduced in Lemma 5.2 is positive, and H(p) → 0 as p → 0. Hence, for each ε ∈ (0, ε₀), there is ρ₃(ε) > 0 obeying

    H(ρ₃) + A H(ρ₃/A) < β(ε).

Define Ω³_n as the intersection of the events Ω_{σ,ε,n} over all sign patterns σ generated by operators Φ_I with |I| < ρ₃n. By Boole's inequality and (5.6), P((Ω³_n)^c) tends to zero exponentially fast, and Lemma 5.1 follows. QED

5.2 Proof of Individual Sign-Pattern Embedding

5.2.1 The Algorithm

We first describe the algorithm of Lemma 5.2. Set y₀ = εσ and apply the initial threshold t₀ = 1/2: let I₀ be the collection of indices 1 ≤ i ≤ m where

    |⟨φ_i, y₀⟩| > 1/2,

and set y₁ = y₀ − P_{I₀} y₀, where P_{I₀} denotes orthogonal projection onto the span of (φ_i : i ∈ I₀); the inner products ⟨φ_i, y₁⟩ vanish for i ∈ I₀, and we say those indices have been killed. Repeat the process, this time on y₁, and with a new threshold t₁ = 3/4. Let I₁ be the collection of indices 1 ≤ i ≤ m where

    |⟨φ_i, y₁⟩| > 3/4,

and set

    y₂ = y₀ − P_{I₀∪I₁} y₀,


again killing the offending subspace. Continue in the obvious way, producing y₃, y₄, etc., with stage-dependent thresholds t_ℓ ≡ 1 − 2^{−ℓ−1} successively closer to 1. Set

    I_ℓ = { i : |⟨φ_i, y_ℓ⟩| > t_ℓ },

and, putting J_ℓ ≡ I₀ ∪ · · · ∪ I_ℓ,

    y_{ℓ+1} = y₀ − P_{J_ℓ} y₀.

If I_ℓ is empty, then the process terminates, and we set y = y_ℓ. Termination must occur at stage ℓ ≤ n. (In simulations, termination often occurs at ℓ = 1, 2, or 3.) At termination,

    |⟨φ_i, y⟩| ≤ 1 − 2^{−ℓ−1},  i = 1, . . . , m.

Hence y is definitely dual feasible. The only question is how close to y₀ it is.
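This thresh-and-project iteration can be implemented directly. The sketch below is ours, not the paper's code; the projection P_J is computed with a least-squares solve, and the sizes and ε are hypothetical.

```python
import numpy as np

def sign_pattern_embed(Phi, sigma, eps, max_stages=None):
    """Iteratively adjust y0 = eps*sigma until |<phi_i, y>| <= 1 for all i.

    At stage l, indices with |<phi_i, y>| > t_l = 1 - 2^{-l-1} are 'killed' by
    projecting y0 off the span of the corresponding columns of Phi.
    """
    n, m = Phi.shape
    y0 = eps * sigma
    y = y0.copy()
    J = np.zeros(m, dtype=bool)                  # indices killed so far
    for l in range(max_stages or n):
        t = 1.0 - 2.0 ** (-(l + 1))
        I_l = np.abs(Phi.T @ y) > t
        if not I_l.any():
            break                                # dual feasible: all products <= t < 1
        J |= I_l
        B = Phi[:, J]                            # span of the killed columns
        coef, *_ = np.linalg.lstsq(B, y0, rcond=None)
        y = y0 - B @ coef                        # y = y0 - P_J y0
    return y

rng = np.random.default_rng(7)
n, m = 100, 200
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
sigma = rng.choice([-1.0, 1.0], size=n)
eps = 0.2
y = sign_pattern_embed(Phi, sigma, eps)
move = np.linalg.norm(y - eps * sigma) / np.linalg.norm(eps * sigma)
print("max |<phi_i, y>|:", np.abs(Phi.T @ y).max(), "relative move:", move)
```

For ε this small, only a handful of indices exceed the first threshold, the iteration stops after a stage or two, and the relative move away from εσ is small, in line with (5.7).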

    5.2.2 Analysis Framework

In our analysis of the algorithm, we will study

    δ_ℓ = ‖y_ℓ − y_{ℓ−1}‖₂,

and

    |I_ℓ| = #{ i : |⟨φ_i, y_ℓ⟩| > 1 − 2^{−ℓ−1} }.

We will propose upper bounds δ̄_{ℓ,ε,n} and ν̄_{ℓ,ε,n} for these quantities, of the form

    δ̄_{ℓ,ε,n} = ‖y₀‖₂ ω(ε)^ℓ,  ν̄_{ℓ,ε,n} = n ν₀ 2^{2ℓ+2} ω^{2ℓ}(ε)/4;

here ν₀ can be taken in (0, 1), for example as 1/2; this choice determines the range (0, ε₀) for ε, and restricts the upper limit on ρ. ω(ε) ∈ (0, 1/2) is to be determined below; it will be chosen so that ω(ε) → 0 as ε → 0. We define sub-events

    E_ℓ = { δ_j ≤ δ̄_j, j = 1, . . . , ℓ;  |I_j| ≤ ν̄_j, j = 0, . . . , ℓ − 1 }.

Now define

    Ω_{σ,ε,n} = ∩_{ℓ=1}^n E_ℓ;

this event implies

    ‖y − y₀‖₂ ≤ ( Σ_ℓ δ̄_ℓ² )^{1/2} ≤ ‖y₀‖₂ ω(ε)/(1 − ω²(ε))^{1/2},

hence the function Λ(ε) referred to in the statement of Lemma 5.2 may be defined as

    Λ(ε) ≡ ω(ε)/(1 − ω²(ε))^{1/2},

and the desired property Λ(ε) → 0 as ε → 0 will follow from arranging for ω(ε) → 0 as ε → 0. We will show that, for ω(ε) > 0 chosen in conjunction with ν₀,

    P(E_{ℓ+1}^c | E_ℓ) ≤ 2 exp{−β(ε) n}.    (5.8)

This implies

    P(Ω_{σ,ε,n}^c) ≤ 2n exp{−β(ε) n},

and the Lemma follows. QED


5.2.3 Transfer to Gaussianity

We again Gaussianize. Let $\xi_i$ denote random points in $\mathbf{R}^n$ which are iid $N(0, \frac{1}{n} I_n)$. We will analyze the algorithm below as if the $\xi$'s rather than the $\phi$'s made up the columns of $\Phi$.

As already described in Section 4.2, there is a natural coupling between Spherical $\phi$'s and Gaussian $\xi$'s that justifies this transfer. As in Section 4.2, let $R_i$, $i = 1, \dots, m$, be iid random variables, independent of $(\phi_i)$, which are individually distributed as $\chi_n/\sqrt{n}$. Then define
$$\xi_i = R_i \phi_i, \qquad i = 1, \dots, m .$$
If the $\phi_i$ are uniform on $S^{n-1}$, then the $\xi_i$ are indeed $N(0, \frac{1}{n} I_n)$. The $R_i$ are all quite close to 1 for large $n$. According to (4.6), for fixed $\eta > 0$,
$$P\{\, 1 - \eta < R_i < 1 + \eta,\ i = 1, \dots, m \,\} \geq 1 - 2m \exp\{-n\eta^2/2\} .$$
Hence it should be plausible that the difference between the $\xi_i$ and the $\phi_i$ is immaterial. Arguing more formally, we notice the equivalence
$$|\langle \phi_i, y \rangle| < 1 \iff |\langle \xi_i, y \rangle| < R_i .$$
Running the algorithm using the $\xi$'s instead of the $\phi$'s, with thresholds calibrated to $1 - \eta$ via $t_0 = (1-\eta)/2$, $t_1 = (1-\eta) \cdot 3/4$, etc., will produce a result $y^*$ obeying $|\langle \xi_i, y^* \rangle| < 1 - \eta$, $\forall i$. Therefore, with overwhelming probability, the result will also obey $|\langle \phi_i, y^* \rangle| < 1$, $\forall i$.

However, such rescaling of thresholds is completely equivalent to rescaling of the input from $y_0$ to $y_0/(1-\eta)$. Hence, if we can prove results with functions $\varepsilon(\rho)$ and $\beta(\rho)$ for the Gaussian $\xi$'s, the same results are proven for the Spherical $\phi$'s with functions $\varepsilon'(\rho) = \varepsilon((1-\eta)\rho)$, $\eta'(\rho) = \eta((1-\eta)\rho)$, and $\beta'(\rho) = \min(\beta(\rho), \eta^2/2)$.
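The coupling is easy to visualize numerically. A minimal sketch (the sizes $n$, $m$ and the tolerance $\eta$ are arbitrary choices for the demonstration): draw $\xi_i \sim N(0, \frac{1}{n}I_n)$ and read off $R_i = \|\xi_i\|_2$, the factor linking $\xi_i$ to the spherical point $\phi_i = \xi_i/\|\xi_i\|_2$.

```python
import random, math

random.seed(1)
n, m, eta = 400, 1000, 0.35

# xi_i ~ N(0, (1/n) I_n); writing xi_i = R_i * phi_i with phi_i uniform on
# the sphere S^{n-1} makes R_i = ||xi_i||_2, distributed as chi_n / sqrt(n).
R = []
for _ in range(m):
    xi = [random.gauss(0, 1) / math.sqrt(n) for _ in range(n)]
    R.append(math.sqrt(sum(x * x for x in xi)))

print(min(R), max(R))
```

For these sizes, $2m\exp\{-n\eta^2/2\}$ is about $4\times 10^{-8}$, so with overwhelming probability every $R_i$ lands well inside $(1-\eta,\, 1+\eta)$, and $\frac{1}{m}\sum_i R_i^2 \approx 1$.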

5.2.4 Adapted Coordinates

It will be useful to have coordinates specially adapted to the analysis of the algorithm. Given $y_0$, $y_1$, ..., define $\theta_0, \theta_1, \dots$ by Gram-Schmidt orthonormalization. In terms of these coordinates we have the following equivalent construction. Let $\delta_0 = \|y_0\|_2$, and let $\zeta_i$, $1 \leq i \leq m$, be iid vectors $N(0, \frac{1}{n} I_n)$. We will sequentially construct vectors $\xi_i$, $i = 1, \dots, m$, in such a way that their joint distribution is iid $N(0, \frac{1}{n} I_n)$, but so that the algorithm has an explicit trajectory.

At stage 0, we realize $m$ scalar Gaussians $Z_i^0$ iid $N(0, \frac{1}{n})$, threshold at level $t_0$, say, and define $I_0$ to be the set of indices so that $|\delta_0 Z_i^0| > t_0$. For such indices $i$ only, we define
$$\xi_i = Z_i^0\, \theta_0 + P_{\theta_0^\perp} \zeta_i, \qquad i \in I_0 .$$
For all other $i$, we retain $Z_i^0$ for later use. We then define $y_1 = y_0 - P_{I_0} y_0$, $\delta_1 = \|y_1 - y_0\|_2$, and $\theta_1$ by orthonormalizing $y_1 - y_0$ with respect to $\theta_0$.

At stage 1, we realize $m$ scalar Gaussians $Z_i^1$ iid $N(0, \frac{1}{n})$, and define $I_1$ to be the set of indices not in $I_0$ so that $|\delta_0 Z_i^0 + \delta_1 Z_i^1| > t_1$. For such indices $i$ only, we define
$$\xi_i = Z_i^0\, \theta_0 + Z_i^1\, \theta_1 + P_{\{\theta_0, \theta_1\}^\perp} \zeta_i, \qquad i \in I_1 .$$
For $i$ neither in $I_0$ nor $I_1$, we retain $Z_i^1$ for later use. We then define $y_2 = y_0 - P_{I_0 \cup I_1} y_0$, $\delta_2 = \|y_2 - y_1\|_2$, and $\theta_2$ by orthonormalizing $y_2 - y_1$ with respect to $\theta_0$ and $\theta_1$.


Continuing in this way, at some stage $\ell^*$ we stop (i.e. $I_{\ell^*}$ is empty), and we define $\xi_i$ for all $i$ not in $I_0 \cup \dots \cup I_{\ell^*-1}$ (if there are any such) by
$$\xi_i = \sum_{j=0}^{\ell^*-1} Z_i^j\, \theta_j + P_{\{\theta_0, \dots, \theta_{\ell^*-1}\}^\perp} \zeta_i, \qquad i \notin I_0 \cup \dots \cup I_{\ell^*-1} .$$
We claim that we have produced a set of $m$ iid $N(0, \frac{1}{n} I_n)$'s for which the algorithm has the indicated trajectory we have just traced. A proof of this fact repeatedly uses independence properties of orthogonal projections of standard normal random vectors.

It is immediate that, for each $\ell$ up to termination, we have expressions for the key variables in the algorithm in terms of the coordinates. For example:
$$y_\ell - y_0 = \sum_{j=1}^{\ell} \delta_j\, \theta_j ; \qquad \|y_\ell - y_0\|_2 = \Big( \sum_{j=1}^{\ell} \delta_j^2 \Big)^{1/2} .$$

5.2.5 Control on $\delta_\ell$

We now develop a bound for
$$\delta_{\ell+1} = \|y_{\ell+1} - y_\ell\|_2 = \|P_{I_\ell}(y_{\ell+1} - y_\ell)\|_2 .$$
Recalling that
$$P_{I} v = \Phi_I (\Phi_I^T \Phi_I)^{-1} \Phi_I^T v,$$
and putting $\lambda(I) = \lambda_{\min}(\Phi_I^T \Phi_I)$, we have
$$\|P_{I_\ell}(y_{\ell+1} - y_\ell)\|_2 \leq \lambda(I_\ell)^{-1/2}\, \|\Phi_{I_\ell}^T (y_{\ell+1} - y_\ell)\|_2 .$$
But $\Phi_{I_\ell}^T y_{\ell+1} = 0$, because $y_{\ell+1}$ is orthogonal to every $\phi_i$, $i \in I_\ell$, by construction. Now for $i \in I_\ell$,
$$|\langle \phi_i, y_\ell \rangle| \leq |\langle \phi_i, y_\ell - y_{\ell-1} \rangle| + |\langle \phi_i, y_{\ell-1} \rangle| \leq \delta_\ell\, |Z_i^\ell| + t_{\ell-1},$$
and so
$$\|\Phi_{I_\ell}^T y_\ell\|_2 \leq t_{\ell-1}\, |I_\ell|^{1/2} + \delta_\ell \Big( \sum_{i \in I_\ell} (Z_i^\ell)^2 \Big)^{1/2} . \qquad (5.9)$$
We remark that
$$\{ i \in I_\ell \} \subset \{\, |\langle \phi_i, y_\ell \rangle| > t_\ell,\ |\langle \phi_i, y_{\ell-1} \rangle| < t_{\ell-1} \,\} \subset \{\, |\langle \phi_i, y_\ell - y_{\ell-1} \rangle| \geq t_\ell - t_{\ell-1} \,\};$$
putting $u_\ell = 2^{-\ell-1}/\delta_\ell$, this gives
$$\sum_{i \in I_\ell} (Z_i^\ell)^2 \leq \sum_{i \in J_{\ell-1}^c} (Z_i^\ell)^2\, 1_{\{|Z_i^\ell| > u_\ell\}} .$$
We conclude, since $t_{\ell-1} < 1$, that
$$\delta_{\ell+1}^2 \leq 2\,\lambda(I_\ell)^{-1} \Big( |I_\ell| + \delta_\ell^2 \sum_{i \in J_{\ell-1}^c} (Z_i^\ell)^2\, 1_{\{|Z_i^\ell| > u_\ell\}} \Big) . \qquad (5.10)$$


5.2.6 Large Deviations

Define the events
$$F_\ell = \{ \delta_\ell \leq \bar\delta_{\ell,\rho,n} \}, \qquad G_\ell = \{ |I_\ell| \leq \bar\nu_{\ell,\rho,n} \},$$
so that
$$E_{\ell+1} = F_{\ell+1} \cap G_\ell \cap E_\ell .$$
Put
$$\lambda_0(\rho) = 2\rho_0 .$$
On the event $E_\ell$, $|J_\ell| \leq \lambda_0(\rho)\, n$. Recall the quantity $\lambda_1(\rho, A)$ from Lemma 3.1. For some $\rho_1$, $\lambda_1(\lambda_0(\rho), A) \geq \lambda_2 > 0$ for all $\rho \in (0, \rho_1]$; we will restrict ourselves to this range of $\rho$. On $E_\ell$, $\lambda_{\min}(I_\ell) > \lambda_2$. Also on $E_\ell$, $u_j = 2^{-j-1}/\delta_j \geq 2^{-j-1}/\bar\delta_j = v_j$ (say), for $j \leq \ell$. Now
$$P\{ G_\ell^c \mid E_\ell \} \leq P\Big\{ \sum_i 1_{\{|Z_i^\ell| > v_\ell\}} > \bar\nu_\ell \Big\},$$
and
$$P\{ F_{\ell+1}^c \mid G_\ell, E_\ell \} \leq P\Big\{ 2\lambda_2^{-1} \Big( \bar\nu_\ell + \bar\delta_\ell^2 \sum_i (Z_i^\ell)^2\, 1_{\{|Z_i^\ell| > v_\ell\}} \Big) > \bar\delta_{\ell+1}^2 \Big\} .$$

We need two simple large deviations bounds.

Lemma 5.4 Let $Z_i$ be iid $N(0,1)$, $k \geq 0$, $t > 2$. Then
$$\frac{1}{m}\log P\Big\{ \sum_{i=1}^{m-k} Z_i^2\, 1_{\{|Z_i| > t\}} > \beta m \Big\} \leq e^{-t^2/4} - \beta/4,$$
and
$$\frac{1}{m}\log P\Big\{ \sum_{i=1}^{m-k} 1_{\{|Z_i| > t\}} > \beta m \Big\} \leq e^{-t^2/2} - \beta/4 .$$

Applying this,
$$\frac{1}{m}\log P\{ F_{\ell+1}^c \mid G_\ell, E_\ell \} \leq e^{-\tau_\ell^2/4} - \beta_\ell/4,$$
where
$$\tau_\ell^2 = n\, v_\ell^2 = 2^{-2\ell-2} \big/ \big( 2\rho\, \varepsilon^{2\ell}(\rho) \big),$$
and
$$\beta_\ell = \big( \rho_0\, 2^{-2\ell+1}\, \varepsilon^{2\ell}(\rho)/2 \big) \big/ 2 = \rho_0\, 2^{-2\ell}\, \varepsilon^{2\ell}(\rho)/4 .$$
By inspection, for small $\rho$ and $\varepsilon(\rho)$, the term of most concern is at $\ell = 0$; the other terms are always better. Putting
$$\omega(\rho) \equiv \omega(\rho; \varepsilon) = \rho_0\, \varepsilon^2(\rho)/4 - e^{-1/(16 \cdot 2\rho\, \varepsilon^2(\rho))},$$
and choosing $\varepsilon(\cdot)$ well, we get $\omega > 0$ on an interval $(0, \rho_2)$, and so
$$P\{ F_{\ell+1}^c \mid G_\ell, E_\ell \} \leq \exp(-n\, \omega(\rho)) .$$
A similar analysis holds for the $G_\ell$'s. We get the $\beta_0$ in the statement of the lemma by taking $\rho_0 = \min(\rho_1, \rho_2)$. QED

Remark: The large deviations bounds stated in Lemma 5.4 are far from best possible; we merely found them convenient in producing an explicit expression for $\omega$. Better bounds would be helpful in deriving reasonable estimates on the constant $\rho(A)$ in Theorem 1.
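The counting bound in Lemma 5.4 is easy to probe by simulation. A minimal Monte Carlo sketch (with $k = 0$ and arbitrarily chosen $m$, $t$, $\beta$; the numbers are illustrative only):

```python
import random, math

random.seed(2)
m, t, beta, trials = 100, 2.5, 0.2, 2000

# Empirical estimate of P{ sum_i 1{|Z_i| > t} > beta * m } for Z_i iid N(0,1)
hits = 0
for _ in range(trials):
    count = sum(1 for _ in range(m) if abs(random.gauss(0, 1)) > t)
    if count > beta * m:
        hits += 1
freq = hits / trials

# Lemma 5.4 (counting bound): P <= exp( m * (e^{-t^2/2} - beta/4) )
bound = math.exp(m * (math.exp(-t * t / 2) - beta / 4))
print(freq, bound)
```

With these parameters the bound evaluates to roughly $0.55$ while the empirical frequency is essentially zero, consistent with the remark that the stated bounds are far from best possible.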


6 Geometric Interpretation

Our result has an appealing geometric interpretation. Let $B^n$ denote the absolute convex hull of $\phi_1, \dots, \phi_m$:
$$B^n = \Big\{ x \in \mathbf{R}^n : x = \sum_i \alpha(i)\phi_i,\ \ \sum_i |\alpha(i)| \leq 1 \Big\} .$$
Equivalently, $B^n$ is exactly the set of vectors where $\mathrm{val}(P_1) \leq 1$. Similarly, let the octahedron $O^m \subset \mathbf{R}^m$ be the absolute convex hull of the standard Kronecker basis $(e_i)_{i=1}^m$:
$$O^m = \Big\{ \alpha \in \mathbf{R}^m : \alpha = \sum_i \alpha(i) e_i,\ \ \sum_{i=1}^m |\alpha(i)| \leq 1 \Big\} .$$

Note that each set is polyhedral, and it is almost true that the vertices $\{\pm e_i\}$ of $O^m$ map under $\Phi$ into vertices $\{\pm\phi_i\}$ of $B^n$. More precisely, the vertices of $B^n$ are among the image vertices $\{\pm\phi_i\}$; because $B^n$ is a convex hull, there is the possibility that for some $i$, $\phi_i$ lies strictly in the interior of $B^n$.

Now if $\phi_i$ were strictly in the interior of $B^n$, then we could write
$$\phi_i = \Phi\alpha_1, \qquad \|\alpha_1\|_1 < 1,$$
where $i \notin \mathrm{supp}(\alpha_1)$. It would follow that a singleton $\alpha_0$ generates $\phi_i$ through $\phi_i = \Phi\alpha_0$, so $\alpha_0$ necessarily solves $(P_0)$; but, as
$$\|\alpha_0\|_1 = 1 > \|\alpha_1\|_1,$$
$\alpha_0$ is not the solution of $(P_1)$. So, when any $\phi_i$ is strictly in the interior of $B^n$, $(P_1)$ and $(P_0)$ are inequivalent problems.

Now on the event $\Omega_n(\rho, A)$, $(P_1)$ and $(P_0)$ have the same solution whenever $(P_0)$ has a solution with $k = 1 < \rho n$ nonzeros. We conclude that on the event $\Omega_n(\rho, A)$, the vertices of $B^n$ are in one-one correspondence with the vertices of $O^m$. Letting $\mathrm{Skel}_0(C)$ denote the set of vertices of a polyhedral convex set $C$, this correspondence says:
$$\mathrm{Skel}_0(B^n) = \Phi[\mathrm{Skel}_0(O^m)] .$$

Something much more general is true. By $(k-1)$-face of a polyhedral convex set $C$ with vertex set $v = \{v_1, \dots\}$, we mean a $(k-1)$-simplex
$$(v_{i_1}, \dots, v_{i_k}) = \Big\{ x = \sum_j \gamma_j v_{i_j},\ \gamma_j \geq 0,\ \sum_j \gamma_j = 1 \Big\},$$
all of whose points are extreme points of $C$. By $(k-1)$-skeleton $\mathrm{Skel}_{k-1}(C)$ of a polyhedral convex set $C$, we mean the collection of all $(k-1)$-faces.

The 0-skeleton is the set of vertices, the 1-skeleton is the set of edges, etc. In general one can say that the $(k-1)$-faces of $B^n$ form a subset of the images under $\Phi$ of the $(k-1)$-faces of $O^m$:
$$\mathrm{Skel}_{k-1}(B^n) \subset \Phi[\mathrm{Skel}_{k-1}(O^m)], \qquad 1 \leq k < n .$$
Indeed, some of the image faces $\Phi(e_{i_1}, \dots, e_{i_k})$ could be at least partially interior to $B^n$, and hence they could not be part of the $(k-1)$-skeleton of $B^n$.

Our main result says that much more is true; Theorem 1 is equivalent to this geometric statement:


Theorem 2 There is a constant $\rho = \rho(A)$ so that for $n < m < An$, on an event $\Omega_n(\rho, A)$ whose complement has negligible probability for large $n$,
$$\mathrm{Skel}_{k-1}(B^n) = \Phi[\mathrm{Skel}_{k-1}(O^m)], \qquad 1 \leq k < \rho n .$$
In particular, with overwhelming probability, the topology of every $(k-1)$-skeleton of $B^n$ is the same as that of the corresponding $(k-1)$-skeleton of $O^m$, even for $k$ proportional to $n$. The topology of the skeleta of $O^m$ is of course obvious.

7 Other Algorithms Fail

Several algorithms besides $\ell^1$ minimization have been proposed for the problem of finding sparse solutions [21, 26, 9]. In this section we show that two standard approaches fail in the current setting, where $\ell^1$ of course succeeds.

7.1 Subset Selection Algorithms

Consider two algorithms which attempt to find sparse solutions to $S = \Phi\alpha$ by selecting subsets $I$ and then attempting to solve $S = \Phi_I \alpha_I$.

The first is simple thresholding. One sets a threshold $t$, selects a subset $I$ of terms highly correlated with $S$:
$$I = \{\, i : |\langle S, \phi_i \rangle| > t \,\},$$
and then attempts to solve $S = \Phi_I \alpha_I$. Statisticians have been using methods like this on noisy data for many decades; the approach is sometimes called subset selection by preliminary significance testing from univariate regressions.

The second is greedy subset selection. One selects a subset iteratively, starting from $R_0 = S$ and an empty selection, and proceeding stagewise, through stages $\ell = 0, 1, 2, \dots$. At the 0-th stage, one identifies the best-fitting single term:
$$i_0 = \mathrm{argmax}_i\, |\langle R_0, \phi_i \rangle|,$$
and then, putting $\beta_{i_0} = \langle R_0, \phi_{i_0} \rangle$, subtracts that term off:
$$R_1 = R_0 - \beta_{i_0}\, \phi_{i_0} ;$$
at stage 1 one behaves similarly, getting $i_1$ and $R_2$, etc. In general,
$$i_\ell = \mathrm{argmax}_i\, |\langle R_{\ell-1}, \phi_i \rangle|,$$
and
$$R_\ell = S - P_{\phi_{i_1}, \dots, \phi_{i_\ell}} S .$$
We stop as soon as $R_\ell = 0$. Procedures of this form have been used routinely by statisticians since the 1960s under the name stepwise regression; the same procedure is called Orthogonal Matching Pursuit in signal analysis, and greedy approximation in the approximation theory literature. For further discussion, see [9, 26, 27].

Under sufficiently strong conditions, both methods can work.

Theorem 3 (Tropp [26]) Suppose that the dictionary has coherence $M = \max_{i \neq j} |\langle \phi_i, \phi_j \rangle|$. Suppose that $\alpha_0$ has $k \leq M^{-1}/2$ nonzeros, and run the greedy algorithm with $S = \Phi\alpha_0$. The algorithm will stop after $k$ stages, having selected at each stage one of the terms corresponding to the $k$ nonzero entries in $\alpha_0$, at the end having precisely found the unique sparsest solution $\alpha_0$.


A parallel result can be given for thresholding.

Theorem 4 Let $\eta \in (0,1)$. Suppose that $\alpha_0$ has $k \leq \eta M^{-1}/2$ nonzeros, and that the nonzero coefficients obey
$$|\alpha_0(i)| \geq \eta\, \|\alpha_0\|_2 / \sqrt{k}$$
(thus, they are all about the same size). Choose a threshold so that exactly $k$ terms are selected. These $k$ terms will be exactly the nonzeros in $\alpha_0$, and solving $S = \Phi_I \alpha_I$ will recover the underlying optimal sparse solution $\alpha_0$.

Proof. We need to show that a certain threshold which selects exactly $k$ terms selects only terms in $I$. Consider the preliminary threshold $t_0 = \frac{\eta}{2} \cdot \frac{\|\alpha_0\|_2}{\sqrt{k}}$. Writing $\beta_i = \alpha_0(i)$, we have, for $i \in I$,
$$|\langle \phi_i, S \rangle| = \Big| \beta_i + \sum_{j \neq i} \alpha_0(j) \langle \phi_i, \phi_j \rangle \Big| \geq |\beta_i| - M \sum_{j \neq i} |\alpha_0(j)| \geq \|\alpha_0\|_2\, \big( \eta/\sqrt{k} - M\sqrt{k} \big) \geq \|\alpha_0\|_2\, \frac{\eta}{2\sqrt{k}} = t_0 ,$$
using $\sum_j |\alpha_0(j)| \leq \sqrt{k}\, \|\alpha_0\|_2$ and $Mk \leq \eta/2$. Hence for $i \in I$, $|\langle \phi_i, S \rangle| > t_0$, the inequalities being strict in general position. On the other hand, for $j \notin I$,
$$|\langle \phi_j, S \rangle| = \Big| \sum_i \alpha_0(i) \langle \phi_i, \phi_j \rangle \Big| \leq M \sqrt{k}\, \|\alpha_0\|_2 \leq t_0 .$$
Hence, for small enough $\epsilon > 0$, the threshold $t = t_0 + \epsilon$ selects exactly the terms in $I$. QED

7.2 Analysis of Subset Selection

The present article considers situations where the number of nonzeros is proportional to $n$. As it turns out, this is far beyond the range where previous general results about greedy algorithms and thresholding would work. Indeed, in this article's setting of a random dictionary $\Phi$, we have (see Lemma 2.2) coherence $M \approx \sqrt{2\log(m)/n}$. Theorems 3 and 4 therefore apply only for $|I| = o(n) \ll \rho n$. In fact, it is not merely that the theorems don't apply; the nice behavior mentioned in Theorems 3 and 4 is absent in this more challenging setting.
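The coherence scaling invoked here can be observed directly. A minimal sketch (sizes arbitrary; $\sqrt{2\log(m)/n}$ gives the order of magnitude, though the empirical maximum over all pairs of columns runs somewhat higher at these finite sizes):

```python
import random, math

random.seed(4)
n, m = 100, 200

def unit(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

phi = [unit([random.gauss(0, 1) for _ in range(n)]) for _ in range(m)]

# Mutual coherence: largest inner product among distinct columns
M = max(abs(sum(a * b for a, b in zip(phi[i], phi[j])))
        for i in range(m) for j in range(i + 1, m))
print(M, math.sqrt(2 * math.log(m) / n))
```

With $M$ on the order of $0.3$–$0.5$ here, the coherence-based guarantees of Theorems 3 and 4 cover only a handful of nonzeros, far below the proportional-to-$n$ regime of Theorem 1.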

Theorem 5 Let $n$, $m$, $A$, $\Phi$ and $\rho$ be as above. On an event $\Omega_n$ having overwhelming probability, there is a vector $S$ with unique sparsest representation using at most $k < \rho n$ nonzero elements, for which the following are true:

• The $\ell^1$-minimal solution is also the optimally sparse solution.
• The thresholding algorithm can only find a solution using $n$ nonzeros.
• The greedy algorithm makes a mistake in its first stage, selecting a term not appearing in the optimally sparse solution.

Proof. The statement about $\ell^1$ minimization is of course just a reprise of Theorem 1. The other two claims depend on the following.


Lemma 7.1 Let $n, m, A, \Phi$ and $\rho$ be as in Theorem 1. Let $I = \{1, \dots, k\}$, where $\rho n/2 < k < \rho n$. There exists $C > 0$ so that, for each $\epsilon_1, \epsilon_2 > 0$, for all sufficiently large $n$, with overwhelming probability some $S \in \mathrm{Range}(\Phi_I)$ has $\|S\|_2 = \sqrt{n}$, but
$$|\langle S, \phi_i \rangle| < C, \qquad i \in I,$$
and
$$\min_{i \in I} |\langle S, \phi_i \rangle| < \epsilon_2 .$$

The Lemma will be proved in the next subsection. Let's see what it says about thresholding. The construction of $S$ guarantees that it is a random variable independent of $\phi_i$, $i \notin I$. With $R_i$ as introduced in (4.3), the coefficients $\langle S, \phi_i \rangle R_i$, $i \in I^c$, are iid with standard normal distribution; and by (4.6) these differ trivially from $\langle S, \phi_i \rangle$. This implies that for $i \in I^c$, the coefficients $\langle S, \phi_i \rangle$ are iid with a distribution that is nearly standard normal. In particular, for some $a = a(C) > 0$, with overwhelming probability for large $n$, we will have
$$\#\{\, i \in I^c : |\langle S, \phi_i \rangle| > C \,\} > a\, m ,$$
and, if $\epsilon_2$ is the parameter used in the invocation of the Lemma above, with overwhelming probability for large $n$ we will also have
$$\#\{\, i \in I^c : |\langle S, \phi_i \rangle| > \epsilon_2 \,\} > n .$$
Hence, thresholding will actually select $a\,m$ terms not belonging to $I$ before any term belonging to $I$. Also, if the threshold is set so that thresholding selects $< n$ terms, then some terms from $I$ will not be among those terms (in particular, the terms where $|\langle \phi_i, S \rangle| < \epsilon_2$, for $\epsilon_2$ small).

With probability one, the points $\phi_i$ are in general position. Because of Lemma 2.1, we can only obtain a solution to the original equations if one of two things is true:

• We select all terms of $I$;
• We select $n$ terms (and then it doesn't matter which ones).

If any terms from $I$ are omitted by the selection $\hat I$, we cannot get a sparse representation. Since with overwhelming probability some of the $k$ terms appearing in $I$ are not among the $n$ best terms for the inner product with the signal, thresholding does not give a solution until $n$ terms are included, and there must be $n$ nonzero coefficients in the solution obtained.

Now let's see what the Lemma says about greedy subset selection. Recall that the $\langle S, \phi_i \rangle R_i$, $i \in I^c$, are iid with standard normal distribution, and these differ trivially from $\langle S, \phi_i \rangle$. Combining this with standard extreme value theory for iid Gaussians, we conclude that for each $\epsilon > 0$, with overwhelming probability for large $n$,
$$\max_{i \in I^c} |\langle S, \phi_i \rangle| > (1 - \epsilon)\sqrt{2\log m} .$$
On the other hand, with overwhelming probability,
$$\max_{i \in I} |\langle S, \phi_i \rangle| < C .$$
It follows that with overwhelming probability for large $n$, the first step of the greedy algorithm will select a term not belonging to $I$. QED

Not proved here, but strongly suspected, is that there exist $S$ so that greedy subset selection cannot find any exact solution until it has been run for at least $n$ stages.
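The extreme-value fact used against the greedy algorithm, $\max_{i \leq m} |Z_i| \approx \sqrt{2\log m}$ for iid standard Gaussians, can be checked by direct simulation (sizes are arbitrary choices for the demonstration):

```python
import random, math

random.seed(5)
m, trials = 2000, 50

# Average, over independent trials, of the maximum of m absolute Gaussians
maxima = [max(abs(random.gauss(0, 1)) for _ in range(m)) for _ in range(trials)]
avg = sum(maxima) / trials
print(avg, math.sqrt(2 * math.log(m)))
```

The average maximum sits near $\sqrt{2\log m} \approx 3.9$ for $m = 2000$, dwarfing the constant $C$ that bounds the in-support inner products.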


7.3 Proof of Lemma 7.1

Let $V = \mathrm{Range}(\Phi_I)$; pick any orthobasis $(\psi_j)$ for $V$, and let $Z_1, \dots, Z_k$ be iid standard Gaussian $N(0,1)$. Set $v = \sum_j Z_j \psi_j$. Then for all $j$, $\langle \psi_j, v \rangle \sim N(0, 1)$.

Let now $\eta_{ij}$ be the array defined by $\eta_{ij} = \langle \phi_i, \psi_j \rangle$. Note that the $\eta_{ij}$ are independent of $v$ and are approximately $N(0, \frac{1}{k})$. (More precisely, with $R_i$ the random variable defined earlier at (4.3), $R_i \eta_{ij}$ is exactly $N(0, \frac{1}{k})$.)

The proof of Lemma 5.2 shows that the Lemma actually has nothing to do with signs; it can be applied to any vector rather than some sign pattern vector $\sigma$. Make the temporary substitutions $n \to k$, $y_0 \to (Z_j)$, $\phi_i \to (\eta_{ij})$, and choose $\rho > 0$. Apply Lemma 5.2 to get a vector $y^*$ obeying
$$|\langle \eta_i, y^* \rangle| \leq 1, \qquad i = 1, \dots, k . \qquad (7.1)$$
Now define
$$S \equiv \frac{\sqrt{n}}{\|y^*\|_2} \sum_{j=1}^{k} y^*_j\, \psi_j .$$
Lemma 5.2 stipulated an event $E_n$ on which the algorithm delivers
$$\|y^* - v\|_2 \leq \eta(\rho)\, \|v\|_2 .$$
This event has probability exceeding $1 - \exp\{-\beta k\}$. On this event,
$$\|y^*\|_2 \geq (1 - \eta(\rho))\, \|v\|_2 .$$
Arguing as in (4.6), the event $F_n = \{ \|v\|_2 \geq (1-\epsilon)\sqrt{k} \}$ has
$$P(F_n^c) \leq 2\exp\{-k\epsilon^2/2\} .$$
Hence, on the event $E_n \cap F_n$,
$$\|y^*\|_2 \geq (1 - \eta(\rho))(1 - \epsilon)\sqrt{k} .$$
We conclude using (7.1) that, for $i = 1, \dots, k$,
$$|\langle \phi_i, S \rangle| = \frac{\sqrt{n}}{\|y^*\|_2}\, |\langle \eta_i, y^* \rangle| \leq \frac{\sqrt{n}}{(1 - \eta(\rho))(1 - \epsilon)\sqrt{k}} \leq \big( (1 - \eta(\rho))(1 - \epsilon)\sqrt{\rho/2} \big)^{-1} \equiv C, \ \text{say},$$
using $k > \rho n/2$. This is the first claim of the lemma.

For the second claim, notice that this would be trivial if the $\langle S, \phi_i \rangle$ were iid $N(0,1)$. This is not quite true, because of conditioning involved in the algorithm underlying Lemma 5.2. However, this conditioning only makes the indicated event even more likely than for an iid sequence. QED

8 Breakdown Point Heuristics

It can be argued that, in any particular application, we want to know whether we have equivalence for the one specific $I$ that supports the specific $\alpha_0$ of interest. Our proof suggests an accurate heuristic for predicting the sizes of subsets $|I|$ at which this restricted notion of equivalence can hold.

Definition 8.1 We say that local equivalence holds at a specific subset $I$ if, for all vectors $S = \Phi\alpha_0$ with $\mathrm{supp}(\alpha_0) \subset I$, the minimum $\ell^1$ solution $\alpha_1$ to $S = \Phi\alpha$ has $\alpha_1 = \alpha_0$.


It is clear that in the random dictionary $\Phi$, the probability of the event "local equivalence holds for $I$" depends only on $|I|$.

Definition 8.2 Let $E_{k,n} = \{$local equivalence holds at $I = \{1, \dots, k\}\}$. The events $E_{k,n}$ are decreasing with increasing $k$. The Local Equivalence Breakdown Point $\mathrm{LEBP}_n$ is the smallest $k$ for which the event $E_{k,n}^c$ occurs.

Clearly $\mathrm{EBP}_n \leq \mathrm{LEBP}_n$.

8.1 The Heuristic

Let $x$ be uniform on $S^{n-1}$ and consider the random $\ell^1$-minimization problem
$$(RP_1(n,m)) \qquad \min \|\alpha\|_1 \ \ \text{subject to}\ \ \Phi\alpha = x .$$
Here $\Phi$ is, as usual, iid uniform on $S^{n-1}$. Define the random variable $V_{n,m} = \mathrm{val}(RP_1(n,m))$. This is effectively the random variable at the heart of the event $\Omega_n^3$ in the proof of Theorem 1. A direct application of the Individual Sign-Pattern Lemma shows there is a constant $\gamma = \gamma(A)$ so that, with overwhelming probability for large $n$,
$$V_{n,m} \geq \gamma\sqrt{n} .$$
It follows that for the median
$$\mu(n, m) = \mathrm{med}\{V_{n,m}\},$$
we have
$$\mu(n, An) \geq \gamma\sqrt{n} .$$
There is numerical evidence, described below, that
$$\mu(n, An) \sim \sqrt{2n}\; \mu_0(A), \qquad n \to \infty, \qquad (8.1)$$
where $\mu_0$ is a decreasing function of $A$.

Heuristic for Breakdown of Local Equivalence. Let $\rho_+ = \rho_+(A)$ solve the equation
$$\sqrt{\rho/\pi} = \mu_0(A - \rho)\,(1 - \sqrt{\rho}) .$$
Then we anticipate that
$$\mathrm{LEBP}_n / n \to_P \rho_+, \qquad n \to \infty .$$

    8.2 Derivation

    We use the notation of Section 2.1. We derive LEBPn/n +. Consider the specific perturba-tion I given by the eigenvector emin of GI =

    TII with smallest eigenvalue. This eigenvector

    will be a random uniform point on Sk1 and so

    I1 =

    2

    |I|I2(1 + op(1)).


It generates $v_I = \Phi_I \iota_I$ with
$$\|v_I\|_2 = \lambda_{\min}^{1/2}\, \|\iota_I\|_2 .$$
Letting $\rho = |I|/n$, we have [11]
$$\lambda_{\min} = (1 - \rho^{1/2})^2\, (1 + o_p(1)) .$$
Now $v_I$ is, after normalization, a random point on $S^{n-1}$, independent of $\phi_i$ for $i \in I^c$. Considering the program
$$\min \|\alpha_{I^c}\|_1 \ \ \text{subject to}\ \ \Phi_{I^c}\alpha_{I^c} = v_I ,$$
we see that it has value $\approx V_{n, m-|I|}\, \|v_I\|_2$. Now if $\rho > \rho_+$, then
$$\|\iota_I\|_1 \approx \sqrt{2/\pi}\, \sqrt{|I|}\; \|\iota_I\|_2 = \sqrt{2\rho n/\pi}\; \|\iota_I\|_2 > \sqrt{2n}\; \mu_0(A - \rho)\, (1 - \rho^{1/2})\, \|\iota_I\|_2 \approx \mu(n, m - |I|)\, \|v_I\|_2 \approx \|\alpha_{I^c}\|_1 .$$
Hence, for a specific perturbation,
$$\|\iota_I\|_1 > \|\alpha_{I^c}\|_1 . \qquad (8.2)$$
If we pick $\alpha_0$ supported in $I$ with the specific sign pattern $\mathrm{sgn}(\alpha_0)(i) = \mathrm{sgn}(\iota_I)(i)$, $i \in I$, this equation implies that a small perturbation in the direction of $\iota_I$ can reduce the $\ell^1$ norm below that of $\alpha_0$. Hence local equivalence breaks down.

With work we can also argue in the opposite direction, that this is approximately the worst perturbation, and it cannot cause breakdown unless $\rho > \rho_+$. Note this is all conditional on the limit relation (8.1), which seems an interesting topic for further work.
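The Edelman asymptotics $\lambda_{\min} \approx (1 - \rho^{1/2})^2$ used above can be checked numerically. A minimal stdlib-only sketch: build the Gram matrix $G = \Phi_I^T\Phi_I$ of $k$ random unit columns and extract $\lambda_{\min}$ by power iteration on $cI - G$, where the shift $c = 4$ is an arbitrary value taken to dominate $\lambda_{\max}$; this is illustrative only.

```python
import random, math

random.seed(6)
n, k = 200, 50
rho = k / n

def unit(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

cols = [unit([random.gauss(0, 1) for _ in range(n)]) for _ in range(k)]
G = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
     for i in range(k)]

# Power iteration on B = cI - G: its top eigenvector is G's bottom eigenvector
c = 4.0
x = [random.gauss(0, 1) for _ in range(k)]
for _ in range(2000):
    y = [c * x[i] - sum(G[i][j] * x[j] for j in range(k)) for i in range(k)]
    s = math.sqrt(sum(v * v for v in y))
    x = [v / s for v in y]

Gx = [sum(G[i][j] * x[j] for j in range(k)) for i in range(k)]
lam_min = sum(x[i] * Gx[i] for i in range(k))      # Rayleigh quotient

print(lam_min, (1 - math.sqrt(rho)) ** 2)
```

For $\rho = 1/4$ the asymptotic prediction is $(1 - 1/2)^2 = 0.25$, and the simulated value lands close by.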

8.3 Empirical Evidence

Yaakov Tsaig of Stanford University performed several experiments showing the heuristic to be quite accurate. He studied the behavior of $\mu(n, An)/\sqrt{2n}$ as a function of $A$, finding that $\mu_0(A) = A^{-.42}$ fits well over a range of problem sizes. Combined with our heuristic, we get that, for $A = 2$, $\rho_+$ is nearly $.3$, i.e. local equivalence can hold up to about 30% nonzeros. Tsaig also performed direct simulations in which a vector $\alpha_0$ was generated at random with specified $|I|$, and a test was made to see whether the solution of $(P_1)$ with $S = \Phi\alpha_0$ recovered $\alpha_0$. It turns out that breakdown in local equivalence does indeed occur when $|I|$ is near $\rho_+ n$.

8.4 Geometric Interpretation

Let $B^{n,I}$ denote the $|I|$-dimensional convex body $\{ \sum_{i \in I} \alpha(i)\phi_i : \|\alpha\|_1 \leq 1 \}$. This is the image of a $|I|$-dimensional octahedron under $\Phi_I$. Note that
$$B^{n,I} \subset \mathrm{Range}(\Phi_I) \cap B^n ;$$
however, the inclusion can be strict. Local equivalence at $I$ happens precisely when
$$B^{n,I} = \mathrm{Range}(\Phi_I) \cap B^n .$$
This says that the faces of $O^m$ associated to $I$ are all mapped by $\Phi$ to faces of $B^n$.

Under our heuristic, for $|I| > (\rho_+ + \epsilon)n$, $\epsilon > 0$, each event $\{ B^{n,I} = \mathrm{Range}(\Phi_I) \cap B^n \}$ typically fails. This implies that most fixed sections of $B^n$ by subspaces $\mathrm{Range}(\Phi_I)$ have a different topology from that of the octahedron $B^{n,I}$.


9 Stability

Skeptics may object that our discussion of sparse solutions to underdetermined systems is irrelevant because the whole concept is not stable. Actually, the concept is stable, as an implicit result of Lemma 3.1. There we showed that, with overwhelming probability for large $n$, all singular values of every submatrix $\Phi_I$ with $|I| < \rho n$ exceed $\lambda_1(\rho, A)$. Now invoke:

Theorem 6 (Donoho, Elad, Temlyakov [9]) Let $\Phi$ be given, and set
$$\lambda(\rho, \Phi) = \min\{\, \lambda_{\min}^{1/2}(\Phi_I^T \Phi_I) : |I| < \rho n \,\} .$$
Suppose we are given the vector $Y = \Phi\alpha_0 + z$, with $\|z\|_2 \leq \epsilon$ and $\|\alpha_0\|_0 \leq \rho n/2$. Consider the optimization problem
$$(P_{0,\epsilon}) \qquad \min \|\alpha\|_0 \ \ \text{subject to}\ \ \|Y - \Phi\alpha\|_2 \leq \epsilon ,$$
and let $\alpha_{0,\epsilon}$ denote any solution. Then:

• $\|\alpha_{0,\epsilon}\|_0 \leq \|\alpha_0\|_0 \leq \rho n/2$; and
• $\|\alpha_{0,\epsilon} - \alpha_0\|_2 \leq 2\epsilon/\lambda$, where $\lambda = \lambda(\rho, \Phi)$.

Applying Lemma 3.1, we see that the problem of obtaining a sparse approximate solution to noisy data is a stable problem: if the noiseless data have a solution with at most $\rho n/2$ nonzeros, then an error of size $\epsilon$ in the measurements can lead to a reconstruction error of size at most $2\epsilon/\lambda_1(\rho, A)$. We stress that we make no claim here about stability of the $\ell^1$ reconstruction; only that stability by some method is in principle possible. Detailed investigation of stability is being pursued separately.

    References

    [1] H.H. Bauschke and J.M. Borwein (1996) On projection algorithms for solving convex fea-sibility problems, SIAM Review 38(3), pp. 367-426.

    [2] E.J. Candes, J. Romberg and T. Tao. (2004) Robust Uncertainty Principles: Exact SignalReconstruction from Highly Incomplete Frequency Information. Manuscript.

[3] Chen, S., Donoho, D.L., and Saunders, M.A. (1999) Atomic Decomposition by Basis Pursuit. SIAM J. Sci. Comput., 20(1), 33-61.

[4] R.R. Coifman, Y. Meyer, S. Quake, and M.V. Wickerhauser (1994) Signal Processing and Compression with Wavelet Packets. In Wavelets and Their Applications, J.S. Byrnes, J.L. Byrnes, K.A. Hargreaves and K. Berry, eds.

[5] R.R. Coifman and M.V. Wickerhauser (1992) Entropy-Based Algorithms for Best Basis Selection. IEEE Transactions on Information Theory, 38(2):713-718.

    [6] K.R. Davidson and S.J. Szarek (2001) Local Operator Theory, Random Matrices and Ba-nach Spaces. Handbook of the Geometry of Banach Spaces, Vol. 1 W.B. Johnson and J.Lindenstrauss, eds. Elsevier.

    [7] Donoho, D.L. and Huo, Xiaoming (2001) Uncertainty Principles and Ideal Atomic Decom-position. IEEE Trans. Info. Thry. 47 (no. 7), Nov. 2001, pp. 2845-62.


[8] Donoho, D.L. and Elad, Michael (2003) Optimally Sparse Representation from Overcomplete Dictionaries via $\ell^1$-norm minimization. Proc. Natl. Acad. Sci. USA, 100(5), 2197-2202.

[9] Donoho, D., Elad, M., and Temlyakov, V. (2004) Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise. Submitted. URL: http://www-stat.stanford.edu/~donoho/Reports/2004.

    [10] A. Dvoretsky (1961) Some results on convex bodies and Banach Spaces. Proc. Symp. onLinear Spaces. Jerusalem, 123-160.

    [11] A. Edelman, Eigenvalues and condition numbers of random matrices, SIAM J. Matrix Anal.Appl. 9 (1988), 543-560

[12] M. Elad and A.M. Bruckstein (2002) A generalized uncertainty principle and sparse representations in pairs of bases. IEEE Trans. Info. Thry. 48, 2558-2567.

    [13] Noureddine El Karoui (2004) New Results About Random Covariance Matrices and Statis-tical Applications. Ph.D. Thesis, Stanford University.

    [14] J.J. Fuchs (2002) On sparse representation in arbitrary redundant bases. Manuscript.

    [15] T. Figiel, J. Lindenstrauss and V.D. Milman (1977) The dimension of almost-sphericalsections of convex bodies. Acta Math. 139 53-94.

    [16] R. Gribonval and M. Nielsen. Sparse Representations in Unions of Bases. To appear IEEETrans Info Thry.

[17] W.B. Johnson and G. Schechtman (1982) Embedding $\ell_p^m$ into $\ell_1^n$. Acta Math. 149, 71-85.

    [18] Michel Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys andMonographs 89. American Mathematical Society 2001.

[19] S. Mallat and Z. Zhang (1993) Matching Pursuits with Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415.

    [20] V.D. Milman and G. Schechtman (1986) Asymptotic Theory of Finite-Dimensional NormedSpaces. Lect. Notes Math. 1200, Springer.

    [21] B.K. Natarajan (1995) Sparse Approximate Solutions to Linear Systems. SIAM J. Comput.24: 227-234.

    [22] G. Pisier (1989) The Volume of Convex Bodies and Banach Space Geometry. CambridgeUniversity Press.

    [23] D. Pollard (1989) Empirical Processes: Theory and Applications. NSF - CBMS RegionalConference Series in Probability and Statistics, Volume 2, IMS.

    [24] Gideon Schechtman (1981) Random Embeddings of Euclidean Spaces in Sequence Spaces.Israel Journal of Mathematics 40, 187-192.

[25] Szarek, S.J. (1990) Spaces with large distance to $\ell_\infty^n$ and random matrices. Amer. Jour. Math. 112, 819-842.


[26] J.A. Tropp (2003) Greed is Good: Algorithmic Results for Sparse Approximation. To appear, IEEE Trans. Info. Thry.

[27] J.A. Tropp (2004) Just Relax: Convex programming methods for Subset Selection and Sparse Approximation. Manuscript.