
Variable Selection in Sparse Regression with Quadratic Measurements

Jun Fan ∗ Lingchen Kong ∗ Liqun Wang† and Naihua Xiu ∗

Department of Applied Mathematics, Beijing Jiaotong University ∗

(E-mail: [email protected], [email protected], [email protected])

Department of Statistics, University of Manitoba †

(E-mail: [email protected])

Final Version, October 2016

Abstract

Regularization methods for high-dimensional variable selection and estimation have been intensively studied in recent years, and most of them are developed in the framework of linear regression models. However, in many real data problems, e.g., in compressive sensing, signal processing and imaging, the response variables are nonlinear functions of the unknown parameters. In this paper we introduce a so-called quadratic measurements regression model that extends the usual linear model. We study the $\ell_q$ regularized least squares method for variable selection and establish the weak oracle property of the corresponding estimator. Moreover, we derive a fixed point equation and use it to construct an efficient algorithm for numerical optimization. Numerical examples are given to demonstrate the finite sample performance of the proposed method and the efficiency of the algorithm.

Keywords: sparsity, $\ell_q$-regularization, moderate deviation, weak oracle property, optimization algorithm.

To appear in Statistica Sinica, 28 (2018), 1157-1178. doi:10.5705/ss.202015.0335

∗ Supported by the National Natural Science Foundation of China (11431002, 11171018)
† Supported by the Natural Sciences and Engineering Research Council of Canada (NSERC)



1 Introduction and Motivation

In the era of big data, massive and high-dimensional data are increasingly available in many scientific fields, e.g., genome and health science, economics and finance, astronomy and physics, signal processing and imaging. The large size and high dimensionality of such data pose significant challenges to traditional statistical methodologies; see, e.g., Donoho (2000) and Fan and Lv (2010) for excellent overviews. As pointed out by these authors, a common feature in high-dimensional data analysis is the sparsity of the predictors, and one of the main goals is to select the most relevant variables in order to accurately predict a response variable of interest.

Various regularization methods have been proposed in the literature, e.g., bridge regression (Frank and Friedman (1993)), the LASSO (Tibshirani (1996)), the SCAD and other folded-concave penalties (Fan and Li (2001)), the Elastic-Net penalty (Zou and Hastie (2005)), the adaptive LASSO (Zou (2006)), the group LASSO (Yuan and Lin (2006)), the Dantzig selector (Candes and Tao (2007)), and the MCP (Zhang (2010)). Recently, Lv and Fan (2009) pointed out that there is a distinction, as well as a close relation, between the model selection problem in statistics and the sparse recovery problem in compressive sensing and signal processing. Moreover, they proposed a unified approach to deal with both problems.

However, most existing statistical methods for variable selection are developed in the context of sparse linear regression. On the other hand, there is a large number of real data problems, especially in compressive sensing, signal processing and imaging, and statistics, where the regression relationship is nonlinear in the unknown parameters. The following are some examples.

Example 1.1. Compressive sensing has been intensively studied in the last decade, and its main goal is to reconstruct sparse signals from the observations. Recently, the theory has been extended to nonlinear compressive sensing and, in particular, to the so-called quadratic compressive sensing, which aims to find the sparse signal $\beta$ solving $\min_{\beta\in\mathbb{R}^p}\|\beta\|_0$ subject to $y_i = \beta^T Z_i\beta + x_i^T\beta + \varepsilon_i$, $i = 1, \cdots, n$, where $\|\beta\|_0$ is the number of nonzero entries of $\beta$, $y_i, \varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^p$ are real vectors and $Z_i \in \mathbb{R}^{p\times p}$ are real matrices. For more details see, e.g., Beck and Eldar (2013), Blumensath (2013) and Ohlsson et al (2013).

There is a special class of problems in optical imaging, where partially spatially incoherent light, such as in sub-wavelength optical imaging, results in a quadratic relationship between the input object $\beta$ and the image intensity $y_i$ of the form $y_i \approx \beta^T Z_i\beta$, $i = 1, \cdots, n$, where $Z_i$ is known from the mutual intensity and the impulse response function of the optical system (Shechtman et al (2011) and Shechtman et al (2012)).


Example 1.2. Phase retrieval plays an important role in X-ray crystallography, transmission electron microscopy, coherent diffractive imaging, etc. Generally speaking, the problem is to recover the lost phase information from the observed magnitudes. In particular, in the real phase retrieval problem the goal is to find $\beta \in \mathbb{R}^p$ in $y_i = \beta^T(z_i z_i^T)\beta + \varepsilon_i$, $i = 1, \cdots, n$, where $z_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ are observed variables and $\varepsilon_i$ are random errors (Candes, Strohmer and Voroninski (2013), Candes, Li and Soltanolkotabi (2015), Eldar and Mendelson (2014), Lecue and Mendelson (2013), Netrapalli, Jain and Sanghavi (2013), Cai, Li and Ma (2015)).

Example 1.3. In wireless ad hoc and sensor networks, localization is crucial for building low-cost, low-power and multi-functional sensor networks in which direct measurements of all nodes' locations via GPS or other similar means are not feasible (Biswas and Ye (2004), Meng, Ding and Dasgupta (2008), Wang et al (2008)). The most important element of any localization algorithm is the measurement of the distances between sensors and anchors. However, the acquired data are usually imprecise because of measurement noise and estimation errors. Suppose the $p$-dimensional vectors $x_1, x_2, \ldots, x_n$ are the known sensor positions and $\beta \in \mathbb{R}^p$ is the unknown signal source location to be determined. Then the measured distance $y_i$ from the source to each sensor node satisfies $y_i^2 = \|x_i - \beta\|_2^2 + \varepsilon_i$, $i = 1, \cdots, n$, where $\varepsilon_i$ is a random error. Again, the above relation can be written as $y_i^2 - \|x_i\|_2^2 = \beta^T\beta - 2x_i^T\beta + \varepsilon_i$.
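
For completeness, the last display simply follows by expanding the squared norm,
$$y_i^2 = \|x_i - \beta\|_2^2 + \varepsilon_i = \|x_i\|_2^2 - 2x_i^T\beta + \beta^T\beta + \varepsilon_i,$$
so that, with $Z_i \equiv I_p$, transformed response $y_i^2 - \|x_i\|_2^2$ and predictor vector $-2x_i$, the localization problem takes exactly the quadratic-plus-linear form studied in this paper.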

Example 1.4. Measurement error is ubiquitous in statistical data analysis. Wang (2003, 2004) showed that for a class of measurement error models to be identifiable and consistently estimable, at least the first two conditional moments of the response variable given the observed predictors are needed. Wang and Leblanc (2008) showed that in a general nonlinear model the second-order least squares estimator (SLSE) is asymptotically more efficient than the ordinary least squares estimator when the regression error has nonzero third moment, and that the two estimators have the same asymptotic variances when the error term has a symmetric distribution. In a linear model, the SLSE is based on the first two conditional moments $E(y_i|x_i) = x_i^T\beta$ and $E(y_i^2|x_i) = (x_i^T\beta)^2 + \sigma^2$, $i = 1, \cdots, n$, where $\beta$ is the vector of regression coefficients and $\sigma^2$ is the variance of the regression error. It is easy to see that the above second moment can be written as $E(y_i^2|x_i) = \theta^T Z_i\theta$ with $\theta = (\beta^T, \sigma)^T$ and
$$Z_i = \begin{pmatrix} x_i x_i^T & 0 \\ 0 & 1 \end{pmatrix}.$$

In these examples, the main goal is to recover sparse signals in regression setups where the response variable is a quadratic function of the unknown parameters, settings that are thus not covered by linear regression models. Despite their wide applications, however, the high-dimensional variable selection problem in such models has not been studied in the statistical literature.

In this paper we attempt to fill this gap. First, we introduce a so-called quadratic measurements regression (QMR) model as an extension of the usual linear model. Then we study the $\ell_q$-regularized least squares (q-RLS) estimation in this model and establish its weak oracle property (Lv and Fan (2009)). Moreover, using moderate deviations we show that the estimators of the nonzero coefficients have an exponential convergence rate. To deal with the problem of numerical optimization, we derive a fixed point equation that is necessary for global optimality. This allows us to construct an iterative algorithm and to establish its convergence. Finally, we present some numerical examples to demonstrate the efficiency of the proposed method and algorithm.

In section 2 we introduce the quadratic measurements model and the q-RLS estimation. In section 3 we establish the weak oracle property of the q-RLS estimator using the moderate deviation technique. In section 4 we deal with a special case, the purely quadratic measurements model, which has applications in some important problems. In section 5 we derive a fixed point equation and construct an algorithm for numerical minimization. In section 6 we present numerical examples to illustrate the proposed method and to demonstrate its finite sample performance. Finally, conclusions and discussions are given in section 7, while technical lemmas and proofs are given in the Appendices.

2 The quadratic measurements model

Motivated by the examples in the previous section, we define the quadratic measurements regression (QMR) model as
$$y_i = \beta^T Z_i\beta + x_i^T\beta + \varepsilon_i, \quad i = 1, \cdots, n, \qquad (1)$$
where $y_i \in \mathbb{R}$ is the observed response, $x_i \in \mathbb{R}^p$ is the vector of predictors, $Z_i \in \mathbb{S}^{p\times p}$ is a symmetric design matrix, $\beta \in \mathbb{R}^p$ is the vector of unknown parameters, and the $\varepsilon_i \in \mathbb{R}$ are independent and identically distributed random errors with mean 0 and variance $\sigma^2$. When $Z_i \equiv 0$, this reduces to the usual linear model
$$y_i = x_i^T\beta + \varepsilon_i, \quad i = 1, \cdots, n. \qquad (2)$$

In this paper we are mainly interested in the high-dimensional case where $p > n$, although our theory applies to the case $p \le n$ as well. Throughout the paper we assume that $\log p = o(n^{\varrho})$ for some constant $\varrho \in (0, 1)$, and $E\exp(\delta_0|\varepsilon_1|) < \infty$ for some $\delta_0 > 0$.
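
To make the setting concrete, the following is a minimal Python sketch (not part of the paper; all function names and default values are illustrative, and the standardization (4) below is omitted) of how data from the QMR model (1) can be simulated with symmetric matrices $Z_i$, predictors $x_i$ and a sparse $\beta^*$:

```python
import numpy as np

def simulate_qmr(n=100, p=50, s=5, sigma=0.1, seed=0):
    """Draw (y_i, x_i, Z_i), i = 1,...,n, from the QMR model (1)."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)      # sparse true signal
    beta_star[support] = rng.standard_normal(s)
    X = rng.standard_normal((n, p))                     # predictors x_i (rows)
    Z = rng.standard_normal((n, p, p))
    Z = (Z + Z.transpose(0, 2, 1)) / 2.0                # symmetric design matrices Z_i
    eps = sigma * rng.standard_normal(n)
    y = np.einsum('j,ijk,k->i', beta_star, Z, beta_star) + X @ beta_star + eps
    return y, X, Z, beta_star
```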


As mentioned earlier, in compressive sensing and signal processing the main goal is to identify and estimate the smallest possible number of nonzero coefficients. Thus we consider the problem of estimating the unknown parameters of model (1) under the sparsity constraint $\|\beta\|_0 \le s$, where $s < n$ is a certain integer, and accordingly we study the $\ell_q$-regularized least squares (q-RLS) problem
$$\min_{\beta\in\mathbb{R}^p} L_n(\beta) := \ell_n(\beta) + \lambda_n\|\beta\|_q^q, \qquad (3)$$
where $\ell_n(\beta) = \sum_{i=1}^n(y_i - \beta^T Z_i\beta - x_i^T\beta)^2$, $\lambda_n > 0$ and $q \in (0, 1)$. The $\ell_q$-regularization has been widely used in compressive sensing. Compared to $\ell_1$-regularization, this method tends to produce precise signal reconstruction with fewer measurements (Chartrand (2007)), and increases the robustness to noise and image non-sparsity (Saab, Chartrand and Yilmaz (2008)). Moreover, Krishnan and Fergus (2009) demonstrated very high efficiency of $\ell_{1/2}$ and $\ell_{2/3}$ regularization in image deconvolution.
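
In code, the loss $\ell_n$ and the penalized objective in (3) translate directly; the sketch below (ours, complementing the simulation sketch above, with illustrative names) evaluates both:

```python
import numpy as np

def loss(beta, y, X, Z):
    """Least squares loss l_n(beta) = sum_i (y_i - beta'Z_i beta - x_i'beta)^2."""
    fitted = np.einsum('j,ijk,k->i', beta, Z, beta) + X @ beta
    return np.sum((y - fitted) ** 2)

def objective(beta, y, X, Z, lam, q=0.5):
    """Penalized objective L_n(beta) = l_n(beta) + lambda * ||beta||_q^q."""
    return loss(beta, y, X, Z) + lam * np.sum(np.abs(beta) ** q)
```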

A minimizer $\hat\beta$ of the optimization problem (3) is called the q-RLS estimator; it is a generalization of the bridge estimator in linear models (Frank and Friedman (1993)). It is well known that the bridge estimator has various desirable properties, including sparsity and consistency (Knight and Fu (2000), Huang, Horowitz and Ma (2008)). A natural question is whether the q-RLS solution of (3) continues to enjoy these properties in the more general model (1). To answer this question, we study the moderate deviation (MD) of $\hat\beta$, which characterizes deviations from $\beta^*$ at rates slower than $n^{-1/2}$ (Kallenberg (1983)).

Although we are mainly interested in the variable selection problem, our results on identifiability and the numerical optimization algorithm apply also to the case $q \ge 1$. However, our consistency results for selection and estimation hold only for $q \in (0, 1)$; this is not surprising, as the analogous fact is well known for linear models (Fan and Li (2001), Zou (2006)).

Throughout the paper we use the following notation. For any $d$-dimensional vector $v = (v_1, \cdots, v_d)^T$, let $|v| = (|v_1|, \cdots, |v_d|)^T$, $v^2 = (v_1^2, \cdots, v_d^2)^T$, $\|v\|_2 = (\sum_{i=1}^d v_i^2)^{1/2}$, $\|v\|_1 = \sum_{i=1}^d |v_i|$ and $\|v\|_\infty = \max\{|v_1|, \cdots, |v_d|\}$. For any set $\Gamma \subseteq \{1, \cdots, d\}$, denote its cardinality by $|\Gamma|$ and its complement by $\Gamma^c = \{1, \cdots, d\}\setminus\Gamma$. For any $n\times d$ matrix $A = [a_{ij}]$, let $\|A\|_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^d a_{ij}^2}$ and $|A|_\infty = \max_{i,j}|a_{ij}|$. Denote by $A^\Gamma$ the sub-matrix of $A$ consisting of its columns indexed by $\Gamma \subseteq \{1, \cdots, d\}$, by $A^{\Gamma'}$ the sub-matrix consisting of its rows indexed by $\Gamma' \subseteq \{1, \cdots, n\}$, and by $A^{\Gamma'\Gamma}$ the sub-matrix consisting of the rows and columns of $A$ indexed by $\Gamma'$ and $\Gamma$, respectively. For a column or row vector $v$ we use the notation $v_\Gamma$. Finally, denote by $e_{d,j}$ the $j$th column of the $d\times d$ identity matrix $I_d$.


3 Weak oracle property

In this section we discuss the moderate deviation and consistency of the q-RLS estimator. Let $\beta^*$ be the true parameter value of model (1) and $\Gamma^* = \mathrm{supp}(\beta^*) := \{j : e_{p,j}^T\beta^* \ne 0,\ j = 1, \cdots, p\}$. Without loss of generality, let $|\Gamma^*| = s < n$. Let $X = (x_1, \cdots, x_n)^T$, where $x_i = (x_{i1}, \cdots, x_{ip})^T$, $i = 1, \cdots, n$. Then, following Huang, Horowitz and Ma (2008), we assume that there exist constants $0 < \underline{c} \le \bar{c} < \infty$ such that
$$\underline{c} \le \min\{|e_{p,j}^T\beta^*|,\ j \in \Gamma^*\} \le \max\{|e_{p,j}^T\beta^*|,\ j \in \Gamma^*\} \le \bar{c}.$$

Following the literature (e.g., Zou and Hastie (2005), Huang, Horowitz and Ma (2008), Fan, Fan and Barut (2014)), the data are assumed to be standardized so that
$$\sum_{i=1}^n y_i = 0, \quad \sum_{i=1}^n x_{ij} = 0, \quad \max\Big\{\sum_{i=1}^n x_{ij}^2,\ \sum_{i=1}^n |Z_i|_\infty^2\Big\} = n, \quad j = 1, \cdots, p. \qquad (4)$$
In the linear model, the third equality above reduces to $\sum_{i=1}^n x_{ij}^2 = n$.

3.1 Identifiability of β∗

For the sparse linear model, Donoho and Elad (2003) introduced the concept of spark and showed that the uniqueness of $\beta^*$ can be characterized by $\mathrm{spark}(X)$, defined as the minimum number of linearly dependent columns of the design matrix $X$. Another way to express this property is via the $s$-regularity of $X$, i.e., any $s$ columns of $X$ are linearly independent. Indeed, $X$ is $s$-regular if and only if $\mathrm{spark}(X) \ge s + 1$ (Beck and Eldar (2013)). Further, in the linear model, $-X$ is the Jacobian matrix of the residual function $R(\beta) = y - X\beta$, where $y = (y_1, \cdots, y_n)^T$. Correspondingly, under model (1) the residual function is $R(\beta) = (R_1(\beta), \cdots, R_n(\beta))^T$ with $R_i(\beta) = y_i - \beta^T Z_i\beta - x_i^T\beta$, and hence the Jacobian is $(-2Z_1\beta - x_1, \cdots, -2Z_n\beta - x_n)^T$.
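
For small $p$ and $s$, the $s$-regularity of a design matrix can be checked by brute force; the following sketch (ours, purely illustrative) makes the definition concrete:

```python
import numpy as np
from itertools import combinations

def is_s_regular(X, s):
    """Check whether every set of s columns of X is linearly independent,
    i.e., whether spark(X) >= s + 1."""
    n, p = X.shape
    if s > n:
        return False               # more than n columns can never be independent
    for cols in combinations(range(p), s):
        if np.linalg.matrix_rank(X[:, cols]) < s:
            return False
    return True

# Example: a random 10 x 6 Gaussian matrix is 3-regular with probability one.
X = np.random.default_rng(1).standard_normal((10, 6))
print(is_s_regular(X, 3))
```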

Definition 3.1. The affine transform $A(\beta) = (Z_1\beta + x_1, \cdots, Z_n\beta + x_n)^T$ is said to be uniformly $s$-regular if $A(\beta)^\Gamma$ has full column rank for any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta \in \mathbb{R}^p$ with $\mathrm{supp}(\beta) \subseteq \Gamma$.

Obviously, the uniform $s$-regularity of $A(\beta)$ implies the $s$-regularity of $X$. It is straightforward to verify that $A(\beta)$ is uniformly $s$-regular if and only if the submatrix $A_\Gamma(\beta_1) = (Z_1^{\Gamma\Gamma}\beta_1, \cdots, Z_n^{\Gamma\Gamma}\beta_1)^T + X^\Gamma$ has full column rank for any index set $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in \mathbb{R}^s$.

In the linear model, we have $A_\Gamma(\beta_1) = X^\Gamma$ since $Z_i \equiv 0$, and therefore the uniform $s$-regularity of $A(\beta)$ reduces to the $s$-regularity of $X$. On the other hand, if $Z_i \equiv I_p$ as in Example 1.3, then $A(\beta)$ is uniformly $s$-regular provided $\sum_{i=1}^n x_i = 0$ and $X$ is $s$-regular.


Proposition 3.1. Let $y_i = \beta^{*T}Z_i\beta^* + x_i^T\beta^*$, $i = 1, \cdots, n$. Then the system of equations $\beta^T Z_i\beta + x_i^T\beta = y_i$, $i = 1, \cdots, n$, has a unique solution $\beta^*$ satisfying $\|\beta^*\|_0 \le s$ if $A(\beta)$ is uniformly $2s$-regular.

3.2 Moderate deviation and consistency

It is well known that strong convexity is the standard condition for the existence of a unique solution to a convex optimization problem. When the objective function is twice differentiable, an equivalent condition is that the Hessian is uniformly positive definite. To establish the consistency of an M-estimator in high dimensions, Negahban et al (2012) introduced the concept of restricted strong convexity, which requires the objective function to be strongly convex on a certain set. To analyze the accuracy of a greedy method for the sparsity-constrained optimization problem, Bahmani, Raj and Boufounos (2013) used the stable restricted Hessian to characterize the curvature of the loss function over the sparse subspaces, which can be bounded locally from above and below with bounds of the same order. However, the calculation of the exact Hessian of our model is costly. Fortunately, the transform $A(\beta)$ has a special structure that allows us not only to use the Jacobian to obtain the gradient $\nabla\ell_n(\beta) = -2A(2\beta)^T R(\beta)$, but also to employ it to approximate the Hessian near $\beta^*$, as shown below. Hence we introduce the following conditions.

Condition 1 (Uniformly Stable Restricted Jacobian).

(a) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta \in \mathbb{R}^p$ satisfying $\mathrm{supp}(\beta) \subseteq \Gamma$, there exists a positive constant $c_1$ that bounds all eigenvalues of $n^{-1}(A(\beta)^\Gamma)^T A(\beta)^\Gamma$ from below.

(b) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta \in \mathbb{R}^p$ satisfying $\mathrm{supp}(\beta) \subseteq \Gamma$ and $\|\beta\| \le \big(2\bar{c} + 3\sqrt{(\sigma^2+1)/c_1}\big)\sqrt{s}$, there exists a positive constant $c_2$ that bounds all eigenvalues of $n^{-1}(A(\beta)^\Gamma)^T A(\beta)^\Gamma$ from above.

It is easy to see that (a) and (b) are respectively equivalent to the following conditions.

(a′) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in \mathbb{R}^s$, there exists a positive constant $c_1$ that bounds all eigenvalues of $n^{-1}A_\Gamma(\beta_1)^T A_\Gamma(\beta_1)$ from below.

(b′) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in S := \{u \in \mathbb{R}^s : \|u\| \le (2\bar{c} + 3\sqrt{(\sigma^2+1)/c_1})\sqrt{s}\}$, there exists a positive constant $c_2$ that bounds all eigenvalues of $n^{-1}A_\Gamma(\beta_1)^T A_\Gamma(\beta_1)$ from above.

Again, in the linear model (2), Condition 1 reduces to the first assumption of Condition 2 in Fan, Fan and Barut (2014), namely that the eigenvalues of $n^{-1}(X^\Gamma)^T X^\Gamma$ are bounded from below and above. For the general case, (a′) is similar to the restricted strong convexity in Negahban et al (2012). Indeed, the minimization problem (3) is derived from the original optimization problem $\min_{\beta\in\mathbb{R}^p}\ell_n(\beta)$ subject to $\|\beta\|_0 \le s$. So, we first consider the unconstrained optimization problem
$$\min_{\beta_1\in\mathbb{R}^s}\ \frac{1}{n}\tilde\ell_n(\beta_1) := \frac{1}{n}\sum_{i=1}^n\big(y_i - \beta_1^T Z_i^{\Gamma^*\Gamma^*}\beta_1 - x_{i\Gamma^*}^T\beta_1\big)^2, \qquad (5)$$
which is clearly non-convex and in general may not have a unique solution. However, one can calculate the Hessian matrix of the objective function $\frac{1}{n}\tilde\ell_n(\beta_1)$ at $\beta^*_{\Gamma^*}$ as
$$\nabla^2\Big(\frac{1}{n}\tilde\ell_n(\beta^*_{\Gamma^*})\Big) = \frac{2}{n}A_{\Gamma^*}(2\beta^*_{\Gamma^*})^T A_{\Gamma^*}(2\beta^*_{\Gamma^*}) - \frac{4}{n}\sum_{i=1}^n \varepsilon_i Z_i^{\Gamma^*\Gamma^*}.$$

Since $\|Z_i^{\Gamma^*\Gamma^*}\|_F^2 \le s^2|Z_i|_\infty^2$, the third equality of (4) implies that $\sum_{i=1}^n\|Z_i^{\Gamma^*\Gamma^*}\|_F^2 \le ns^2$. Further, it follows from Chebyshev's inequality and $s = o(\sqrt{n})$ that
$$\frac{1}{n}\Big\|\sum_{i=1}^n \varepsilon_i Z_i^{\Gamma^*\Gamma^*}\Big\|_F \xrightarrow{P} 0, \quad \text{as } n\to\infty.$$
Hence Condition 1 (a′) ensures that the Hessian matrix $\nabla^2(n^{-1}\tilde\ell_n(\beta^*_{\Gamma^*}))$ is strictly positive definite, and therefore problem (5) has a unique solution in a neighborhood of $\beta^*_{\Gamma^*}$ with probability approaching one. It follows that the minimization problem (3) may have a unique solution in a neighborhood of $\beta^*$, as in Negahban et al (2012). Moreover, (a′) implies that $A_\Gamma(\beta_1)$ has full column rank for any $\Gamma$ with $|\Gamma| = s$, and therefore $A(\beta)$ is uniformly $s$-regular.

Further, (b′) is similar to the upper bound of the stable restricted Hessian. In particular, if $s$ is finite, then (b′) implies that the curvature of the loss function is bounded from above at locations within a neighbourhood of the origin. From the proof in Appendix A, one can see that (b′) ensures a more accurate convergence rate.

Condition 2 (Asymptotic Property of Design Matrix). Let $\kappa_{1n} = |X|_\infty$ and $\kappa_{2n} = \max_{1\le i\le n}|Z_i|_\infty$ be such that, as $n\to\infty$,
$$\frac{\kappa_{1n}\sqrt{s}}{\sqrt{n}} \to 0, \qquad \frac{\kappa_{2n}s^{3/2}}{\sqrt{n}} \to 0. \qquad (6)$$

The first convergence in (6) is the same as in Fan, Fan and Barut (2014, Condition 2). The second convergence in (6), and (8) below, are required to deal with the quadratic term in the low-dimensional space $\mathbb{R}^s$.

Condition 3 (Partial Orthogonality). For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, there exists a positive constant $c_0$ such that
$$\frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n x_{i\Gamma}\otimes x_{i\Gamma^c}\Big|_\infty \le c_0, \qquad \frac{1}{\sqrt{n}}\Big(\Big|\sum_{i=1}^n x_{i\Gamma}\otimes Z_i^{\Gamma\Gamma^c}\Big|_\infty + \Big|\sum_{i=1}^n x_{i\Gamma^c}\otimes Z_i^{\Gamma\Gamma}\Big|_\infty\Big) \le c_0,$$
and
$$\frac{1}{\sqrt{n}}\Big(\Big|\sum_{i=1}^n Z_i^{\Gamma\Gamma}\otimes Z_i^{\Gamma^c\Gamma}\Big|_\infty + \Big|\sum_{i=1}^n Z_i^{\Gamma\Gamma}\otimes Z_i^{\Gamma^c\Gamma^c}\Big|_\infty\Big) \le c_0,$$
where $\otimes$ is the Kronecker product.

Again, in the linear model (2), Condition 3 coincides with the partial orthogonality condition of Huang, Horowitz and Ma (2008), that $n^{-1/2}|\sum_{i=1}^n x_{ij}x_{ik}| \le c_0$ for any $j \in \Gamma$, $k \in \Gamma^c$, which is an essential assumption for consistency when $p > n$.

Condition 4 (Asymptotic Property of Tuning Parameter). Let $\lambda_n \ge \sigma\underline{c}^{\,1-q}\sqrt{n\log p}$ be such that, as $n\to\infty$,
$$\frac{\sqrt{n^q s^{3-q}}\,(\log n)^{2-q}}{\lambda_n} \to 0, \qquad \frac{\lambda_n s^{\frac{4-q}{2(1-q)}}}{n} \to 0 \qquad (7)$$
and
$$\frac{\lambda_n\kappa_{1n}s\log n}{n} \to 0, \qquad \frac{\lambda_n\kappa_{2n}s^2\log n}{n} \to 0. \qquad (8)$$

The inequality here is equivalent to $\lambda_n > 2\sqrt{(1+C)n\log p}$ for some positive constant $C$, as used in Fan, Fan and Barut (2014). The first convergence in (7) is similar to the first one in their condition (4.4), that $\gamma_n s^{3/2}\kappa_{1n}^2(\log_2 n)^2 = o(\lambda_n^2/n)$, where $\gamma_n = \gamma_0\big(\sqrt{s\log n/n} + \lambda_n\|d_0\|_2/n\big)$, $\gamma_0$ is a positive constant and $d_0$ is an $s$-dimensional vector of nonnegative weights; the first convergence in (8) is similar to their second convergence in (4.4). In particular, the first convergence of this condition is trivial when $s$ is finite and the inequality in Condition 4 holds. The second convergence implies that $\lambda_n s^2/n \to 0$, i.e., the penalty parameter $\lambda_n$ is $o(n)$ if $s$ is finite. If, for example, $\lambda_n = n^\delta$ for a positive constant $\delta$, then Condition 4 implies that $\delta \in (1/2, 1)$ and $\log p = o(n^{2\delta-1})$. Thus, Condition 4 imposes a range for the penalty parameter with respect to the sample size $n$ and dimension $p$. It is easy to verify that Condition 4 also implies that $s = o(\sqrt{n})$, as needed to approximate the Hessian through the Jacobian.

The proof of the following is given in Appendix A.

Theorem 3.1. (Moderate Deviation). Under model (1), if Conditions 1-4 hold, then there exist a strict local minimizer $\hat\beta = (\hat\beta_{\Gamma^*}^T, \hat\beta_{\Gamma^{*c}}^T)^T$ of (3) and a positive constant $C_0 < \min\big\{\frac{1}{8\sigma^2}, \frac{1}{2c_2\sigma^2}, \frac{c_1^2}{8c_2\sigma^2}\big\}$ such that
$$P\big(\hat\beta_{\Gamma^{*c}} = 0\big) \ge 1 - \exp\big(-C_0 a_n^2\big) \qquad (9)$$
and
$$P\big(\|\hat\beta_{\Gamma^*} - \beta^*_{\Gamma^*}\|_2 \le r_n\big) \ge 1 - \exp\big(-C_0 a_n^2\big), \qquad (10)$$
where
$$\hat\beta_{\Gamma^*} \in \arg\min_{\beta_1\in\mathbb{R}^s} L_n(\beta_1) := \sum_{i=1}^n\big(y_i - \beta_1^T Z_i^{\Gamma^*\Gamma^*}\beta_1 - x_{i\Gamma^*}^T\beta_1\big)^2 + \lambda_n\|\beta_1\|_q^q,$$
$r_n = \frac{a_n}{\sqrt{n}} + \frac{2\underline{c}^{\,q-1}\lambda_n\sqrt{s}}{c_1 n}$, and $\{a_n\}$ is a sequence of positive numbers such that, as $n\to\infty$,
$$\frac{a_n}{\sqrt{s\log n}} \to \infty, \qquad (11)$$
$$\frac{a_n\kappa_{1n}\sqrt{s}}{\sqrt{n}} \to 0, \qquad \frac{a_n\kappa_{2n}s^{3/2}}{\sqrt{n}} \to 0, \qquad (12)$$
and
$$a_n\Big(\frac{n^{\frac{q}{2}}s^{\frac{4-q}{2}}}{\lambda_n}\Big)^{\frac{1}{2-q}} \to 0. \qquad (13)$$

Note that since $\max(\kappa_{1n}^2, \kappa_{2n}^2) \ge 1$, conditions (12) imply $a_n\sqrt{s/n} \to 0$. Again, if $s$ is finite and $\lambda_n = n^\delta$ for some $\delta \in (1/2, 1)$, then conditions (11)-(13) simplify to
$$\frac{a_n}{\sqrt{\log n}} \to \infty, \qquad \frac{a_n\kappa_{1n}}{\sqrt{n}} \to 0, \qquad \frac{a_n\kappa_{2n}}{\sqrt{n}} \to 0, \qquad \frac{a_n}{n^{\frac{2\delta-q}{2(2-q)}}} \to 0.$$
It follows that $\{a_n\}$ must tend to infinity faster than $\sqrt{\log n}$ but slower than $n^{\frac{2\delta-q}{2(2-q)}} = o(\sqrt{n})$. This differs from the case of the linear model with fixed dimension $p \ll n$, where only $a_n\kappa_{1n}/\sqrt{n} \to 0$ is required to establish the MD of M-estimators (Fan (2012), Fan, Yan and Xiu (2014)). Here we assume (11)-(13) to cover the case $p \gg n$.

By inequality (9), the q-RLS estimator correctly selects the nonzero variables with probability approaching one exponentially fast. It follows from (10) that the estimators of the nonzero variables are consistent with an exponential rate of convergence. Theorem 3.1 also implies that $P(\|\hat\beta - \beta^*\|_2 > r_n) \le \exp(-C_0 a_n^2)$, i.e., the tail probability decreases exponentially in $a_n^2$, like the tail probability of the Gaussian.

Theorem 3.1 gives general results on the MD. By taking $a_n = \sqrt{s}\log n$, we obtain the familiar forms of the convergence rate.

Theorem 3.2. (Weak Oracle Property). Under model (1), if Conditions 1-4 hold, then there exists a strict local minimizer $\hat\beta = (\hat\beta_{\Gamma^*}^T, \hat\beta_{\Gamma^{*c}}^T)^T$ of (3) such that, for sufficiently large $n$,
$$P\big(\hat\beta_{\Gamma^{*c}} = 0\big) \ge 1 - n^{-C_0 s\log n} \qquad (14)$$
and
$$P\Big(\|\hat\beta_{\Gamma^*} - \beta^*_{\Gamma^*}\|_2 \le \frac{\sqrt{s}\log n}{\sqrt{n}} + \frac{2\underline{c}^{\,q-1}\lambda_n\sqrt{s}}{c_1 n}\Big) \ge 1 - n^{-C_0 s\log n}. \qquad (15)$$


In particular, when $Z_i \equiv 0$, Conditions 1-4 reduce to conditions similar to those of Huang, Horowitz and Ma (2008) and Fan, Fan and Barut (2014) for the linear model (2). Consequently, we have the following result, which is similar to theirs.

Corollary 3.1. Under the linear model (2), the results of Theorem 3.2 hold provided Conditions 1-3, (7) and the first convergence in (8) of Condition 4 hold, that is,

(1) for each $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, the eigenvalues of $\frac{1}{n}(X^\Gamma)^T X^\Gamma$ are bounded from below and above by some positive constants $c_1$ and $c_2$, respectively;

(2) $\kappa_{1n}\sqrt{s}/\sqrt{n} \to 0$, as $n\to\infty$;

(3) for each $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, there exists a positive constant $c_0$ such that
$$n^{-1/2}\Big|\sum_{i=1}^n x_{ij}x_{ik}\Big| \le c_0, \quad \forall j \in \Gamma,\ k \in \Gamma^c;$$

(4) $\lambda_n \ge \sigma\underline{c}^{\,1-q}\sqrt{n\log p}$ and $\lambda_n^{-1}\sqrt{n^q s^{3-q}}\,(\log n)^{2-q} \to 0$, $n^{-1}\lambda_n s^{\frac{4-q}{2(1-q)}} \to 0$, $n^{-1}\lambda_n\kappa_{1n}s\log n \to 0$, as $n\to\infty$.

Remark 3.1. To deal with the case $p > n$, Huang, Horowitz and Ma (2008) showed that the marginal bridge estimators satisfy $P(\hat\beta_{\Gamma^{*c}} = 0) \to 1$ and $P(e_{p,j}^T\hat\beta \ne 0,\ j \in \Gamma^*) \to 1$. Here we provide the rate of this convergence. The result (15) is slightly different from Theorem 2 in Fan, Fan and Barut (2014), which states that
$$P\Big(\|\hat\beta_{\Gamma^*} - \beta^*_{\Gamma^*}\|_2 \le \gamma_0\Big(\frac{\sqrt{s\log n}}{\sqrt{n}} + \frac{\lambda_n\|d_0\|_2}{n}\Big)\Big) \ge 1 - O(n^{-cs}),$$
where $\gamma_0$ and $c$ are two positive constants and $d_0$ is an $s$-dimensional vector of nonnegative weights. To obtain an explicit constant in place of $c$, we use the factor $\sqrt{\log n}$ to dominate the constant $\gamma_0$, which results in a slower consistency rate. To compensate for this loss, the right hand side of (15) tends to one at a faster rate.

4 Purely quadratic model

In the previous sections we studied the regularized least squares method in the QMR model, which is an extension of the linear model. However, as demonstrated in Examples 1.1 and 1.2, many applications in phase retrieval and optical imaging involve models that do not contain the linear term. Therefore, in this section we consider the purely quadratic measurements model
$$y_i = \beta^T Z_i\beta + \varepsilon_i, \quad i = 1, \cdots, n. \qquad (16)$$


In particular, this covers the phase retrieval model where $Z_i = z_i z_i^T$, as demonstrated in Example 1.2. As this model differs from the general model (1), some theoretical conditions and results of the previous sections need to be modified.

4.1 Identifiability of β∗

Because of the absence of the linear term in model (16), $\beta^*$ and $-\beta^*$ are indistinguishable from the observed data. In the phase retrieval literature, e.g., Balan, Casazza and Edidin (2006) and Ohlsson and Eldar (2014), this problem is treated by identifying $\pm\beta$ for any $\beta \in \mathbb{R}^p$. Without loss of generality, we assume that the first nonzero element of $\beta^*$ is positive.

For the phase retrieval problem, Balan, Casazza and Edidin (2006) and Bandeira et al (2014) introduced the complement property, which is necessary and sufficient for identifiability. For sparse regression, Ohlsson and Eldar (2014) proposed the more general concept of the s-complement property. In the phase retrieval model where $Z_i = z_i z_i^T$, the s-complement property of $\{z_i\}$ means that either $\{z_{i\Gamma}\}_{i\in N}$ or $\{z_{i\Gamma}\}_{i\in N^c}$ spans $\mathbb{R}^s$ for every subset $N \subseteq \{1, \cdots, n\}$ and every $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$. Here the identifiability of $\beta^*$ in (1) is guaranteed by the uniform s-regularity of the affine transform $A(\beta)$. In model (16), the residual function $R(\beta) = (R_1(\beta), \cdots, R_n(\beta))^T$ with $R_i(\beta) = y_i - \beta^T Z_i\beta$ has Jacobian matrix $(-2Z_1\beta, \cdots, -2Z_n\beta)^T$. Hence Definition 3.1 is modified as follows.

Definition 4.1. The linear transform $B(\beta) = (Z_1\beta, \cdots, Z_n\beta)^T$ is said to be uniformly $s$-regular if $B(\beta)^\Gamma$ has full column rank for any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta \in \mathbb{R}^p\setminus\{0\}$ with $\mathrm{supp}(\beta) \subseteq \Gamma$.

If $Z_i = z_i z_i^T$, then the uniform $s$-regularity of $B(\beta)$ is equivalent to the s-complement property of $\{z_i\}$. Further, it is straightforward to verify that $B(\beta)$ is uniformly $s$-regular if and only if the submatrix $B_\Gamma(\beta_1) = (Z_1^{\Gamma\Gamma}\beta_1, \cdots, Z_n^{\Gamma\Gamma}\beta_1)^T$ has full column rank for any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in \mathbb{R}^s\setminus\{0\}$.

The proof of the following proposition is analogous to that of Theorem 4 in Ohlsson and Eldar (2014) and is therefore omitted.

Proposition 4.1. Under model (16), $\beta^*$ is the unique solution satisfying $\|\beta^*\|_0 \le s$ if $B(\beta)$ is uniformly $2s$-regular.

4.2 Weak oracle property

To derive the MD and consistency results under model (16), we modify Conditions 1-4 in section 3.2 as follows.


Condition 1′ (Uniformly Stable Restricted Jacobian).

(a) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in S_1 := \{u \in \mathbb{R}^s : |\{j : |e_{s,j}^T u| \ge \underline{c}/2\}| \ge s - [\frac{s}{2}]\}$, there exists a positive constant $c_1$ that bounds all eigenvalues of $n^{-1}B_\Gamma(\beta_1)^T B_\Gamma(\beta_1)$ from below.

(b) For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$ and any $\beta_1 \in S$, there exists a positive constant $c_2$ that bounds all eigenvalues of $n^{-1}B_\Gamma(\beta_1)^T B_\Gamma(\beta_1)$ from above.

Condition 2′ (Asymptotic Property of Design Matrix). As $n\to\infty$, $n^{-1/2}\kappa_{2n}s^{3/2} \to 0$.

Condition 3′ (Partial Orthogonality). For any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, there exists a positive constant $c_0$ such that
$$n^{-1/2}\Big(\Big|\sum_{i=1}^n Z_i^{\Gamma\Gamma}\otimes Z_i^{\Gamma^c\Gamma}\Big|_\infty + \Big|\sum_{i=1}^n Z_i^{\Gamma\Gamma}\otimes Z_i^{\Gamma^c\Gamma^c}\Big|_\infty\Big) \le c_0.$$

Condition 4′ (Asymptotic Property of Tuning Parameter). Let $\lambda_n \ge \sigma\underline{c}^{\,1-q}\sqrt{n\log p}$ be such that, as $n\to\infty$,
$$\frac{\sqrt{n^q s^{3-q}}\,(\log n)^{2-q}}{\lambda_n} \to 0, \qquad \frac{\lambda_n s^{\frac{4-q}{2(1-q)}}}{n} \to 0 \qquad \text{and} \qquad \frac{\lambda_n\kappa_{2n}s^2\log n}{n} \to 0.$$

For the phase retrieval model, since $B_\Gamma(\beta_1)^T B_\Gamma(\beta_1) = \sum_{i=1}^n(z_{i\Gamma}^T\beta_1)^2 z_{i\Gamma}z_{i\Gamma}^T$, Condition 1′ implies that at $\beta^*_{\Gamma^*}$,
$$c_1\|u\|^2 \le \frac{1}{n}\sum_{i=1}^n(z_{i\Gamma^*}^T\beta^*_{\Gamma^*})^2(z_{i\Gamma^*}^T u)^2 \le c_2\|u\|^2, \quad \forall u \in \mathbb{R}^s\setminus\{0\}.$$
This is similar to Corollary 7.6 of Candes, Li and Soltanolkotabi (2015), which states that for some $\delta \in (0, 1)$,
$$(1-\delta)\|h\|^2 \le n^{-1}\sum_{i=1}^n|\langle z_i, \beta^*\rangle|^2|\langle z_i, h\rangle|^2 \le (1+\delta)\|h\|^2$$
for all $h \in \mathbb{C}^p$ with $\|h\| = 1$, provided $n > p$ and the $z_i$ are sampled from suitable distributions such as the Gaussian.

Theorem 4.1. (Moderate Deviation). Under model (16), if Conditions 1′-4′ hold and
$$\hat\beta_{\Gamma^*} \in \arg\min_{\beta_1\in\mathbb{R}^s} L_n(\beta_1) := \sum_{i=1}^n\big(y_i - \beta_1^T Z_i^{\Gamma^*\Gamma^*}\beta_1\big)^2 + \lambda_n\|\beta_1\|_q^q,$$
then there exists a strict local minimizer $\hat\beta = (\hat\beta_{\Gamma^*}^T, \hat\beta_{\Gamma^{*c}}^T)^T$ of (3) such that (9) and (10) hold, with $\{a_n\}$ satisfying (11), (13) and the second condition in (12).

Similar to Theorem 3.2, we can obtain the familiar form of the convergence rate for $\hat\beta$ by taking $a_n = \sqrt{s}\log n$.


Theorem 4.2. (Weak Oracle Property). Under model (16) and Conditions 1′-4′, there exists a strict local minimizer $\hat\beta = (\hat\beta_{\Gamma^*}^T, \hat\beta_{\Gamma^{*c}}^T)^T$ of (3) such that (14) and (15) hold.

For the phase retrieval problem, Candes, Strohmer and Voroninski (2013) used convex relaxation to construct a consistent estimator of the matrix $\beta^*(\beta^*)^T$, but not of $\beta^*$ itself. Consistent estimation of $\beta^*$ was studied by Eldar and Mendelson (2014) and Lecue and Mendelson (2013). We obtain the following weak oracle property as a consequence of Theorem 4.2.

Corollary 4.1. Under model (16) with $Z_i = z_i z_i^T$, $i = 1, 2, \ldots, n$, the result of Theorem 4.2 holds if Conditions 2′ and 4′ and the following hold:

(1) for any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, there exist positive constants $c_1$ and $c_2$ such that $n^{-1}\sum_{i=1}^n(z_{i\Gamma}^T\beta_1)^2(z_{i\Gamma}^T u)^2 \ge c_1\|u\|^2$ for any $\beta_1 \in S_1$, and $n^{-1}\sum_{i=1}^n(z_{i\Gamma}^T\beta_1)^2(z_{i\Gamma}^T u)^2 \le c_2\|u\|^2$ for any $\beta_1 \in S$;

(2) for any $\Gamma \subseteq \{1, \cdots, p\}$ with $|\Gamma| = s$, there exists a positive constant $c_0$ such that
$$n^{-1/2}\Big(\Big|\sum_{i=1}^n(z_{i\Gamma}\otimes z_{i\Gamma^c})(z_{i\Gamma}\otimes z_{i\Gamma})^T\Big|_\infty + \Big|\sum_{i=1}^n(z_{i\Gamma}\otimes z_{i\Gamma^c})(z_{i\Gamma}\otimes z_{i\Gamma^c})^T\Big|_\infty\Big) \le c_0.$$

5 Optimization algorithm

The numerical computation of the q-RLS estimator as a solution of (3) is an important and challenging issue, since the $\ell_q$ $(0 < q < 1)$ regularized problem is nonconvex, nonsmooth, and non-Lipschitz. Recently, this type of problem has attracted much attention in the field of optimization, including the development of optimality conditions and computational algorithms; see, e.g., Xu et al (2012b), Chen, Niu and Yuan (2013), Lu (2014) and references therein. In this section, we propose an algorithm for the minimization problem (3). Since $n$ and $p$ are given, to simplify notation we omit the subscript $n$ of $\ell_n(\beta)$ and $\lambda_n$, so that (3) is written as
$$\min_{\beta\in\mathbb{R}^p} L(\beta) := \ell(\beta) + \lambda\|\beta\|_q^q, \qquad (17)$$
where $\lambda > 0$. To motivate our algorithm we establish a fixed point equation. We start by considering the simple minimization problem
$$\min_{u\in\mathbb{R}} \varphi_t(u) := \frac{1}{2}(u-t)^2 + \lambda|u|^q, \qquad (18)$$
where $t \in \mathbb{R}$, $\lambda > 0$ and $q \in (0, 1)$. For this problem, Chen, Xiu and Peng (2014) show that there exists an implicit function $h_{\lambda,q}(\cdot)$ such that the minimizer $u$ of (18) satisfies $u = h_{\lambda,q}(t)$. In particular, for $q = 1/2$, Xu et al (2012b) give the explicit expression $h_{\lambda,1/2}(t) = \frac{2}{3}t\big(1 + \cos\big(\frac{2\pi}{3} - \frac{2}{3}\phi_\lambda(t)\big)\big)$ with $\phi_\lambda(t) = \arccos\big(\frac{\lambda}{4}\big(\frac{|t|}{3}\big)^{-3/2}\big)$.


Theorem 5.1. There exist a function $h_{\lambda,q}(\cdot)$ and a constant $r > 0$ such that any minimizer $\beta$ of problem (17) satisfies
$$\beta = H_{\lambda\tau,q}\big(\beta - \tau\nabla\ell(\beta)\big) \qquad (19)$$
for any $\tau \in \big(0, \min\{G_r^{-1}, 1\}\big)$, where $G_r = \sup_{\beta\in B_r}\|\nabla^2\ell(\beta)\|_2$, $B_r = \{\beta \in \mathbb{R}^p : \|\beta\|_2 \le r\}$, and $H_{\lambda,q}(u) = \big(h_{\lambda,q}(u_1), \cdots, h_{\lambda,q}(u_p)\big)^T$ for $u = (u_1, \cdots, u_p)^T \in \mathbb{R}^p$.

Remark 5.1. The result of Theorem 5.1 remains true for any function $\ell$ that is bounded from below, twice continuously differentiable, and satisfies $\lim_{\|\beta\|\to\infty}\ell(\beta) = \infty$. An appropriate algorithm can then be derived in the same way as below.

Remark 5.2. The general $\ell_q$ minimization problem $\min_{\beta\in\mathbb{R}^p} f(\beta) + \lambda\|\beta\|_q^q$ with $\lambda > 0$, $q \in (0, 1)$ has been well studied in the optimization literature, and efficient algorithms have been proposed for $f(\beta) = \|X\beta - y\|^2$. For example, Chen, Xu and Ye (2010) derived lower bounds for the nonzero entries of local minimizers and presented a hybrid orthogonal matching pursuit-smoothing gradient method, while Xu et al (2012b) provided a globally necessary optimality condition for the case $q = 1/2$ and proposed an efficient iterative algorithm. More recently, the general $\ell_q$ problem has been studied by Chen, Niu and Yuan (2013), who proposed a smoothing trust region Newton method for solving a class of non-Lipschitz optimization problems. Lu (2014) studied iterative reweighted methods for a smooth function $f$ that is bounded from below and has an $L_f$-Lipschitz continuous gradient, i.e., $\|\nabla f(\beta) - \nabla f(\beta')\| \le L_f\|\beta - \beta'\|$. Bian, Chen and Ye (2015) proposed interior point algorithms for solving a class of non-Lipschitz nonconvex optimization problems with nonnegative bounded constraints. In these works the solution sequence of the algorithm converges to a stationary point derived from the Karush-Kuhn-Tucker conditions.

Based on (19), we propose a fixed point iterative algorithm (FPIA).

Algorithm 5.1.

Step 0. Given $\lambda > 0$, $\epsilon \ge 0$, $\gamma, \alpha \in (0, 1)$ and $\delta > 0$, choose an arbitrary $\beta^0$ and set $k = 0$.

Step 1. (a) Compute $\nabla\ell(\beta^k)$ from $\nabla\ell(\beta) = 2\sum_{i=1}^n(\beta^T Z_i\beta + x_i^T\beta - y_i)(2Z_i\beta + x_i)$;
(b) Compute $\beta^{k+1} = H_{\lambda\tau_k,q}\big(\beta^k - \tau_k\nabla\ell(\beta^k)\big)$ with $\tau_k = \gamma\alpha^{j_k}$, where $j_k$ is the smallest nonnegative integer such that
$$L(\beta^k) - L(\beta^{k+1}) \ge \frac{\delta}{2}\|\beta^k - \beta^{k+1}\|_2^2. \qquad (20)$$

Step 2. Stop if
$$\frac{\|\beta^{k+1} - \beta^k\|_2}{\max\{1, \|\beta^k\|_2\}} \le \epsilon.$$
Otherwise, replace $k$ by $k + 1$ and go to Step 1.


An important step here is the evaluation of the operator $H_{\lambda,q}(\cdot)$. As discussed before, $h_{\lambda,q}(\cdot)$ has an explicit expression when $q = 1/2$. For more general $q \in (0, 1)$, by Lemma B.1 in Appendix B there exists a constant $t^* > 0$ such that $h_{\lambda,q}(t) > 0$, $h_{\lambda,q}(t) - t + \lambda q\, h_{\lambda,q}(t)^{q-1} = 0$ and $1 + \lambda q(q-1)h_{\lambda,q}(t)^{q-2} > 0$ for $t > t^*$; and $h_{\lambda,q}(t) < 0$, $h_{\lambda,q}(t) - t - \lambda q\,|h_{\lambda,q}(t)|^{q-1} = 0$ and $1 + \lambda q(q-1)|h_{\lambda,q}(t)|^{q-2} > 0$ for $t < -t^*$. Hence one can use the function fsolve in Matlab to obtain the desired solution at each iteration.
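
To fix ideas, here is a minimal Python sketch of the FPIA (our own illustrative code, not the authors' implementation; default parameter values are ours), in which $h_{\lambda,q}$ is evaluated numerically with scipy.optimize.brentq in place of Matlab's fsolve, following the characterization quoted above: for each scalar $t$ the nonzero candidate is the larger root of $u - t + \lambda q u^{q-1} = 0$, and it is kept only if it improves on $u = 0$.

```python
import numpy as np
from scipy.optimize import brentq

def h_scalar(t, lam, q):
    """Minimizer of phi_t(u) = 0.5*(u - t)**2 + lam*|u|**q over u, for 0 < q < 1."""
    if t == 0.0:
        return 0.0
    sgn, ta = np.sign(t), abs(t)
    g = lambda u: u - ta + lam * q * u ** (q - 1.0)     # first-order condition on u > 0
    u0 = (lam * q * (1.0 - q)) ** (1.0 / (2.0 - q))     # point where g is minimal
    if u0 >= ta or g(u0) >= 0.0:                        # no admissible stationary point
        return 0.0
    u_hat = brentq(g, u0, ta)                           # larger root = local minimizer
    phi = lambda u: 0.5 * (u - ta) ** 2 + lam * abs(u) ** q
    return sgn * u_hat if phi(u_hat) < phi(0.0) else 0.0

def H(v, lam, q):
    """Componentwise thresholding operator H_{lam,q}."""
    return np.array([h_scalar(t, lam, q) for t in v])

def grad_loss(beta, y, X, Z):
    """grad l(beta) = 2 * sum_i (beta'Z_i beta + x_i'beta - y_i)(2 Z_i beta + x_i)."""
    r = np.einsum('j,ijk,k->i', beta, Z, beta) + X @ beta - y
    return 2.0 * ((2.0 * np.einsum('ijk,k->ij', Z, beta) + X).T @ r)

def L_obj(beta, y, X, Z, lam, q):
    r = y - np.einsum('j,ijk,k->i', beta, Z, beta) - X @ beta
    return np.sum(r ** 2) + lam * np.sum(np.abs(beta) ** q)

def fpia(y, X, Z, lam, q=0.5, gamma=0.9, alpha=0.5, delta=1e-4, eps=1e-6, max_iter=500):
    """Fixed point iterative algorithm (Algorithm 5.1) with Armijo-type step search (20)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        g = grad_loss(beta, y, X, Z)
        tau, beta_new = gamma, beta
        for _ in range(50):                             # backtracking: tau_k = gamma*alpha^j
            beta_new = H(beta - tau * g, lam * tau, q)
            if (L_obj(beta, y, X, Z, lam, q) - L_obj(beta_new, y, X, Z, lam, q)
                    >= 0.5 * delta * np.sum((beta - beta_new) ** 2)):
                break
            tau *= alpha
        if np.linalg.norm(beta_new - beta) <= eps * max(1.0, np.linalg.norm(beta)):
            return beta_new
        beta = beta_new
    return beta
```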

Another important step is the computation of the step length $\tau_k$, which represents a trade-off between the speed of reduction of the objective function $L$ and the search time for the optimal length. According to Theorem 5.1, the ideal choice of $\tau_k$ depends on the maximum eigenvalue of the Hessian $\nabla^2\ell(\beta^k)$ at the $k$th iteration, which is expensive to calculate. A more practical strategy is to perform an inexact line search to identify a step length that achieves an adequate reduction in $L$. One such technique is the so-called Armijo-type line search adopted in our algorithm. In our context this method requires finding the smallest nonnegative integer $j_k$ such that (20) holds. That this can be done successfully is assured by Lemmas B.3 and B.4 in Appendix B. We also verify the convergence property of the FPIA in Theorem B.1.

Remark 5.3. Xu et al (2012b) studied the $\ell_q$-regularized least squares method with $q = 1/2$ in the linear model and proposed several strategies, besides cross validation, for choosing the optimal regularization parameter $\lambda$. Analogous to their method, we can derive the range of the optimal regularization parameter in our problem as
$$\lambda \in \Big[\frac{\sqrt{96}}{9\tau}\big|[B_\tau(\beta)]_{s+1}\big|^{3/2},\ \frac{\sqrt{96}}{9\tau}\big|[B_\tau(\beta)]_{s}\big|^{3/2}\Big),$$
where $B_\tau(\beta) = \beta - \tau\nabla\ell(\beta)$ and $|[B_\tau(\beta)]_k|$ is the $k$th largest component of $B_\tau(\beta)$ in magnitude, $k = 1, \ldots, p$. Xu et al (2012b) suggest that $\lambda = \frac{\sqrt{96}}{9\tau}|[B_\tau(\beta)]_{s+1}|^{3/2}$ is a reliable choice, with an approximation such as $\beta \approx \beta^k$. They recommend this strategy for $s$-sparsity problems and cross validation for more general problems.

Our algorithm can also be used to compute the q-RLS estimator for $q \ge 1$. Indeed, similar to Lemma B.1, we can show that there exists a unique function $h_{\lambda,q}(t)$ such that the global minimizer of problem (18) is $u = h_{\lambda,q}(t)$. In particular, we can obtain explicit expressions of this function for $q = 1, 3/2, 2$ as $h_{\lambda,1}(t) = \max(0, t-\lambda) - \max(0, -t-\lambda)$, $h_{\lambda,3/2}(t) = \big(\sqrt{\frac{9}{16}\lambda^2 + |t|} - \frac{3}{4}\lambda\big)^2\mathrm{sign}(t)$, and $h_{\lambda,2}(t) = t/(1 + 2\lambda)$.
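
In code, these three closed-form maps read as follows (a direct transcription of the formulas above; ours for illustration):

```python
import numpy as np

def h_q1(t, lam):       # soft thresholding: minimizer of 0.5*(u-t)^2 + lam*|u|
    return np.maximum(0.0, t - lam) - np.maximum(0.0, -t - lam)

def h_q32(t, lam):      # minimizer of 0.5*(u-t)^2 + lam*|u|^(3/2)
    return np.sign(t) * (np.sqrt(9.0 * lam**2 / 16.0 + np.abs(t)) - 3.0 * lam / 4.0) ** 2

def h_q2(t, lam):       # minimizer of 0.5*(u-t)^2 + lam*u^2 (ridge shrinkage)
    return t / (1.0 + 2.0 * lam)
```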


6 Numerical Examples

In this section we present two examples to illustrate the proposed approach and demonstrate the finite sample performance of the q-RLS estimator. The first example uses the second-order least squares method described in Example 1.4, and the second is the quadratic equations problem considered by Beck and Eldar (2013). In a phase diagram study, Xu et al (2012a) point out that the $\ell_q$-regularization method yields sparser solutions for smaller values of $q$ in the range $[1/2, 1)$, while there is no significant difference for $q \in (0, 1/2]$. In view of these findings, we use $q = 1/2$ in both examples. In addition, following the literature, we use 5-fold cross validation to choose the parameter $\lambda$. In each simulation 100 Monte Carlo samples were generated, and in each case the true value $\beta^*$ was generated randomly with $s$ nonzero components from the standard normal distribution. The numerical optimization is done using the FPIA with the iteration stopping criterion
$$\frac{\|\beta^{k+1} - \beta^k\|}{\max\{1, \|\beta^{k+1}\|\}} \le 10^{-6},$$
or until the maximum running time of 5000 seconds is reached.

To evaluate the selection and estimation accuracy of our method, we calculated the mean squared error (MSE), which is the average of $\|\hat\beta - \beta^*\|_2^2$; the false positive (FP), which is the number of zero coefficients incorrectly identified as nonzero; and the false negative (FN), which is the number of nonzero coefficients incorrectly identified as zero. We also report the rate of successful recovery (SR) using the criterion $\hat\Gamma = \Gamma^*$ and $\|\hat\beta - \beta^*\|_2^2 \le 2.5\times 10^{-5}$, where $\hat\Gamma = \{j : \hat\beta_j \ne 0\}$ and $\Gamma^* = \{j : \beta^*_j \ne 0\}$.
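
These criteria translate directly into code; a short sketch (ours, with the tolerance $2.5\times 10^{-5}$ from the text) for a single replication:

```python
import numpy as np

def selection_metrics(beta_hat, beta_star, tol=2.5e-5):
    """FP, FN, squared error and successful-recovery indicator for one replication."""
    est_support = beta_hat != 0
    true_support = beta_star != 0
    fp = int(np.sum(est_support & ~true_support))    # zeros declared nonzero
    fn = int(np.sum(~est_support & true_support))    # nonzeros declared zero
    sq_err = float(np.sum((beta_hat - beta_star) ** 2))
    sr = bool(np.array_equal(est_support, true_support) and sq_err <= tol)
    return fp, fn, sq_err, sr
```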

6.1 Example 1: Second-order least squares method

We applied the second-order least squares method described in Example 1.4 to the variable selection problem in the linear model (2). It is known that in low-dimensional set-ups the SLS estimator is asymptotically more efficient than the ordinary least squares estimator when the error distribution is asymmetric. Therefore it is interesting to see whether this robustness property carries over to high-dimensional regularized estimation. In particular, we considered the q-regularized second-order least squares (q-RSLS) problem
$$\min_\theta \sum_{i=1}^n \rho_i(\theta)^T W_i\rho_i(\theta) + \lambda\|\beta\|_q^q,$$
where $\theta = (\beta^T, \sigma^2)^T$, $\rho_i(\theta) = \big(y_i - x_i^T\beta,\ y_i^2 - (x_i^T\beta)^2 - \sigma^2\big)^T$ and $W_i$ is a $2\times 2$ nonnegative definite weight matrix. Here the objective function becomes that of the q-regularized least squares (q-RLS) method if the weight is taken to be $W_i = \mathrm{diag}(1, 0)$. To simplify computation, we used the weight
$$W_i = \begin{pmatrix} 0.75 & 0.1 \\ 0.1 & 0.25 \end{pmatrix},$$
which is not necessarily optimal according to Wang and Leblanc (2008).
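
For concreteness, a sketch of this q-RSLS objective (our illustrative code; $\theta$ stacks $\beta$ and $\sigma^2$, and the minimization itself would again be carried out by an algorithm such as the FPIA of Section 5):

```python
import numpy as np

W = np.array([[0.75, 0.10],
              [0.10, 0.25]])       # common weight W_i used in the simulations

def qrsls_objective(theta, y, X, lam, q=0.5, W=W):
    """sum_i rho_i(theta)' W rho_i(theta) + lam * ||beta||_q^q,
    with theta = (beta, sigma^2) and rho_i = (y_i - x_i'b, y_i^2 - (x_i'b)^2 - sigma^2)."""
    beta, sigma2 = theta[:-1], theta[-1]
    m1 = X @ beta                                    # first conditional moment
    rho = np.stack([y - m1, y ** 2 - m1 ** 2 - sigma2], axis=1)   # n x 2
    quad = np.einsum('ij,jk,ik->', rho, W, rho)                   # sum of rho_i' W rho_i
    return quad + lam * np.sum(np.abs(beta) ** q)
```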

We considered five error distributions: $e_1 = \log N(0, 0.1^2) - e^{-0.005}$, $e_2 = (\chi^2(5) - 5)/100$, $e_3 = 0.01\,t$, $e_4 = U[-0.1, 0.1]$ and $e_5 = N(0, 0.1^2)$; a noiseless case with $\varepsilon_i \equiv 0$ is also reported in the tables. In each case, we took dimension $p = 400$ with sparsity $s = 8$ and sample size $n = 200$.

The results in Table 1 show that both q-RSLS and q-RLS perform well in identifying zero coefficients; this is expected for $\ell_q$-regularized methods with $q = 1/2$. Although both methods have fairly low FP values, the values for q-RLS are about three times higher than those for q-RSLS. Moreover, the MSE of the q-RSLS estimator is about three times smaller than that of the q-RLS estimator. The results in Table 2 show clearly that q-RSLS has a much higher SR rate than q-RLS, and this is true not only for the skewed error distributions, such as the log-normal and chi-square, but also for the normal and uniform distributions.

Table 1: Selection and estimation results of Example 1.

error      method   FP mean  FP se   FN mean  FN se   MSE
e1         q-RSLS    0.12    0.04    0.00    0.00    3.41e-05
           q-RLS     0.27    0.05    0.00    0.00    1.38e-04
e2         q-RSLS    0.12    0.04    0.00    0.00    2.91e-05
           q-RLS     0.21    0.05    0.00    0.00    9.34e-05
e3         q-RSLS    0.09    0.03    0.00    0.00    1.32e-05
           q-RLS     0.22    0.05    0.00    0.00    9.51e-05
e4         q-RSLS    0.09    0.03    0.00    0.00    3.34e-05
           q-RLS     0.29    0.05    0.00    0.00    1.64e-04
e5         q-RSLS    0.11    0.03    0.00    0.00    2.14e-05
           q-RLS     0.24    0.05    0.00    0.00    1.30e-04
Noiseless  q-RSLS    0.10    0.03    0.00    0.00    3.80e-05
           q-RLS     0.19    0.04    0.00    0.00    1.02e-04


Table 2: Rates of Successful Recovery of Example 1.

method    e1     e2     e3     e4     e5     Noiseless
q-RSLS    0.62   0.78   0.88   0.52   0.86   0.56
q-RLS     0.08   0.15   0.13   0.06   0.12   0.10

6.2 Example 2: Quadratic measurements

We considered model (16) with $\varepsilon_i \sim N(0, \sigma^2)$. A noise-free version of this model was considered by Beck and Eldar (2013). For the sake of comparison we set $\sigma = 0.01$ and generated the matrices as $Z_i = z_i z_i^T$, $i = 1, 2, \cdots, n$, with vectors $z_i \in \mathbb{R}^p$ drawn from the standard normal distribution. We considered $n = 80$, $p = 120$ with various sparsity levels $s = 3, 4, \cdots, 10$. For comparison, we calculated the q-RLS estimator for $q = 1/2, 1, 3/2, 2$.
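
A minimal sketch of this data-generating step (ours; settings as just described) is:

```python
import numpy as np

def simulate_example2(n=80, p=120, s=5, sigma=0.01, seed=0):
    """Purely quadratic measurements y_i = beta'(z_i z_i')beta + eps_i."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta_star[support] = rng.standard_normal(s)
    Zvec = rng.standard_normal((n, p))           # rows are the vectors z_i
    y = (Zvec @ beta_star) ** 2 + sigma * rng.standard_normal(n)
    return y, Zvec, beta_star
```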

The results are given in Table 3; the results for $q = 2$ are omitted since they are very similar to those for $q = 3/2$. They show clearly that the FP values with $q = 1/2$ are much lower than in the other cases. In particular, the FP values with $q = 3/2, 2$ equal the number of true zero coefficients, which means that no variable selection was performed. The MSE and FN values with $q = 1/2$ are both very small; this demonstrates that the q-RLS with $q = 1/2$ is efficient and stable in variable selection and estimation. Compared to the results in Beck and Eldar (2013), our SR rates are lower when $s = 3, 4$ but significantly higher when $s = 5, 6, 7, 8, 9, 10$.

To assess the effectiveness of our numerical algorithm FPIA, we also ran simulations with $n = 3p/4$, $s = 0.05p$, and $p = 100, 200, 300, 400, 500$. The results in Table 4 show that, as the dimension increases, the FP and FN, as well as the MSE, remain fairly low and stable. In all cases, the rates of successful recovery are over 50% and reach 86% when $p = 200$.



Table 3: Selection and estimation results of Example 2.

‖β∗‖0  method    FP mean  FP se   FN mean  FN se   MSE       SR
3      q = 1/2     3.95    0.64    0.36    0.09    1.56e-03  0.57
       q = 1      64.36    5.84    1.07    0.14    1.47e-01  0.00
       q = 3/2   117.00    0.00    0.00    0.00    2.04e-01  0.00
4      q = 1/2     3.73    0.65    0.09    0.04    6.98e-04  0.62
       q = 1      62.64    5.79    1.63    0.20    2.97e-01  0.00
       q = 3/2   116.00    0.00    0.00    0.00    2.15e-01  0.00
5      q = 1/2     4.66    0.71    0.05    0.02    4.97e-05  0.61
       q = 1      76.75    5.38    1.55    0.23    3.08e-01  0.00
       q = 3/2   115.00    0.00    0.00    0.00    3.18e-01  0.00
6      q = 1/2     5.99    0.88    0.04    0.02    4.75e-05  0.58
       q = 1      81.88    5.07    1.50    0.26    2.60e-01  0.00
       q = 3/2   114.00    0.00    0.00    0.00    4.36e-01  0.00
7      q = 1/2     4.70    0.84    0.07    0.03    3.37e-05  0.63
       q = 1      83.32    4.91    1.01    0.30    3.27e-01  0.00
       q = 3/2   113.00    0.00    0.00    0.00    5.76e-01  0.00
8      q = 1/2     3.76    0.77    0.32    0.14    5.22e-02  0.67
       q = 1      87.54    4.37    1.28    0.30    2.78e-01  0.00
       q = 3/2   111.99    0.01    0.00    0.00    8.02e-01  0.00
9      q = 1/2     4.01    0.97    0.34    0.16    4.92e-02  0.73
       q = 1      86.05    4.38    1.53    0.34    3.30e-01  0.00
       q = 3/2   111.00    0.00    0.00    0.00    6.35e-01  0.00
10     q = 1/2     5.46    0.46    0.11    0.03    2.68e-02  0.58
       q = 1      84.69    4.22    1.50    0.36    3.56e-01  0.00
       q = 3/2   110.00    0.00    0.00    0.00    6.57e-01  0.00


Table 4: The successful recoveries of Example 2.

p     ‖β∗‖0   FP mean  FP se   FN mean  FN se   MSE       SR
100     5       2.99    0.60    0.12    0.07    1.90e-03  0.73
200    10       3.40    0.80    0.05    0.02    2.49e-05  0.86
300    15       9.50    1.20    0.09    0.03    5.17e-04  0.53
400    20      11.34    1.43    0.11    0.05    5.26e-04  0.53
500    25      13.07    2.56    0.07    0.03    5.45e-04  0.51

7 Conclusions and Discussion

Although the problem of high-dimensional variable selection with quadratic measurements arises in many areas of physics and engineering, such as compressive sensing, signal processing and imaging, it has not been studied in the statistical literature. We proposed a quadratic measurements regression model and studied the $\ell_q$-regularization method in this model. We have established the weak oracle property of the q-RLS estimator in the high-dimensional case where $n$ and $p$ are allowed to diverge, including the case $p \gg n$. To compute the q-RLS estimator, we have derived a fixed point equation, designed an efficient algorithm and established its convergence. We have presented two numerical examples to illustrate the proposed method. The numerical results show that the method performs very well in most cases.

In general, the classical moderate deviation principle is stated in the form
$$P(\|\hat\beta - \beta^*\| > r_n) = \exp\Big(-\frac{I(\beta^*)}{2}(r_n\sqrt{n})^2 + o\big((r_n\sqrt{n})^2\big)\Big),$$
where $I(\beta^*)$ is the rate function. We have derived an upper bound for the rate function, with a speed of convergence $a_n^2$ that is slower than the standard $(r_n\sqrt{n})^2$ (Theorem 3.1). The result of Theorem 3.1 implies that the q-RLS estimator can correctly select the nonzero variables with probability converging to one. Compared to the linear model, the quadratic measurements model is more complex, and therefore it is harder to obtain the MD rate. Under some further assumptions, it is possible to establish more accurate results. Another open question is the asymptotic normality of the q-RLS estimator for model (1). It deserves further research.

We have studied the generalized bridge estimator because of the simplicity and tractability of its numerical optimization. We focused on the $\ell_q$ regularization with $q < 1$, mainly because in phase retrieval and compressive sensing the primary goal is to find the smallest set of predictors, and the $\ell_q$ method with $q < 1$ helps to achieve this goal. Our identification results and numerical optimization algorithm apply when $q \ge 1$; of course, in such cases the consistency results do not hold in general, just as in linear models. It is also interesting to investigate the SCAD and other regularization methods in quadratic measurements models. Our model (1) can be viewed as a special case of the partially linear index model $y = \sum_{j=1}^d f_j(\beta^T w_j) + x^T\beta + \varepsilon$. While it is interesting to study the regularized estimation problem in this model, the theory and methods are much more complicated.


Acknowledgments

We are grateful to the Editor, an associate editor and two anonymous reviewers for their comments and suggestions that helped to improve the previous version of this paper.

References

Bahmani, S., Raj, B. and Boufounos, P. T. (2013). Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14, 807-841.

Balan, R., Casazza, P. and Edidin, D. (2006). On signal reconstruction without phase. Appl. Comput. Harmon. Anal. 20, 345-356.

Bandeira, A. S., Cahill, J., Mixon, D. G. and Nelson, A. A. (2014). Saving phase: Injectivity and stability for phase retrieval. Appl. Comput. Harmon. Anal. 37, 106-125.

Beck, A. and Eldar, Y. C. (2013). Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM J. Optim. 23, 1480-1509.

Bian, W., Chen, X. and Ye, Y. (2015). Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program. 149, 301-327.

Biswas, P. and Ye, Y. (2004). Semidefinite programming for ad hoc wireless sensor network localization. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, Berkeley, CA, 46-54.

Blumensath, T. (2013). Compressed sensing with nonlinear observations and related nonlinear optimization problems. IEEE Trans. Inform. Theory 59, 3466-3474.

Buhlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.

Cai, T., Li, X. and Ma, Z. (2015). Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. arXiv preprint arXiv:1506.03382.

Candes, E., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics 66, 1241-1274.

Candes, E., Li, X. and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inform. Theory 61, 1985-2007.


Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35, 2313-2351.

Chartrand, R. (2007). Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14, 707-710.

Chen, X., Niu, L. and Yuan, Y. (2013). Optimality conditions and smoothing trust region Newton method for non-Lipschitz optimization. SIAM J. Optim. 23, 1528-1552.

Chen, X., Xu, F. and Ye, Y. (2010). Lower bound theory of nonzero entries in solutions of $\ell_2$-$\ell_p$ minimization. SIAM J. Sci. Comput. 32, 2832-2852.

Chen, Y., Xiu, N. and Peng, D. (2014). Global solutions of non-Lipschitz $S_2$-$S_p$ minimization over positive semidefinite cone. Optimization Letters 8, 2053-2064.

Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1-32.

Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences 100, 2197-2202.

Eldar, Y. C. and Mendelson, S. (2014). Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmon. Anal. 36, 473-494.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348-1360.

Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20, 101-148.

Fan, J., Fan, Y. and Barut, E. (2014). Adaptive robust variable selection. Ann. Statist. 42, 324-351.

Fan, Jun (2012). Moderate deviations for M-estimators in linear models with φ-mixing errors. Acta Math. Sin. (Engl. Ser.) 28, 1275-1294.

Fan, Jun, Yan, Ailing and Xiu, Naihua (2014). Asymptotic properties for M-estimators in linear models with dependent random errors. J. Stat. Plan. Infer. 148, 49-66.

Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109-148.


Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36, 587-613.

Kallenberg, W. C. M. (1983). On moderate deviation theory in estimation. Ann. Statist. 11, 498-504.

Knight, K. and Fu, W. J. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356-1378.

Krishnan, D. and Fergus, R. (2009). Fast image deconvolution using hyper-Laplacian priors. Advances in Neural Information Processing Systems, 1033-1041.

Lecue, G. and Mendelson, S. (2013). Minimax rate of convergence and the performance of ERM in phase recovery. Preprint. Available at arXiv:1311.5024.

Lu, Z. (2014). Iterative reweighted minimization methods for $l_p$ regularized unconstrained nonlinear programming. Math. Program. 147, 277-307.

Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37, 3498-3528.

Meng, C., Ding, Z. and Dasgupta, S. (2008). A semidefinite programming approach to source localization in wireless sensor networks. IEEE Signal Processing Letters 15, 253-256.

Netrapalli, P., Jain, P. and Sanghavi, S. (2013). Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems, 2796-2804.

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27, 538-557.

Ohlsson, H. and Eldar, Y. C. (2014). On conditions for uniqueness in sparse phase retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1841-1845.

Ohlsson, H., Yang, A. Y., Dong, R., Verhaegen, M. and Sastry, S. S. (2014). Quadratic basis pursuit. Regularization, Optimization, Kernels, and Support Vector Machines, 195.

Saab, R., Chartrand, R. and Yilmaz, O. (2008). Stable sparse approximations via nonconvex optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3885-3888.


Shechtman, Y., Eldar, Y. C., Szameit, A. and Segev, M. (2011). Sparsity-based sub-wavelength imaging with partially spatially incoherent light via quadratic compressed sensing. Optics Express 19, 14807-14822.

Shechtman, Y., Szameit, A., Bullkich, E., et al. (2012). Sparsity-based single-shot sub-wavelength coherent diffractive imaging. Nature Materials 11, 455-459.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267-288.

Wang, L. (2003). Estimation of nonlinear models with Berkson measurement error models. Statist. Sinica 13, 1201-1210.

Wang, L. (2004). Estimation of nonlinear models with Berkson measurement errors. Ann. Statist. 32, 2343-2775.

Wang, L. and Leblanc, A. (2008). Second-order nonlinear least squares estimation. Ann. Inst. Stat. Math. 60, 883-900.

Wang, Z., Zheng, S., Boyd, S. and Ye, Y. (2008). Further relaxations of the SDP approach to sensor network localization. SIAM J. Optim. 19, 655-671.

Xu, Z., Guo, H., Wang, Y. and Zhang, H. (2012a). Representative of $L_{1/2}$ regularization among $L_q$ $(0 < q \le 1)$ regularizations: an experimental study based on phase diagram. Acta Autom. Sinica 38, 1225-1228.

Xu, Z., Chang, X., Xu, F. and Zhang, H. (2012b). $L_{1/2}$ regularization: A thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems 23, 1013-1027.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68, 49-67.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894-942.

Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 27, 576-593.

Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101, 1418-1429.


Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301-320.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36, 1509-1533.

Appendix A Proofs of MD and weak oracle property

A.1 Proofs of Theorems 3.1 and 3.2

Without loss of generality, in the following we let $\Gamma^* = \{1,\cdots,s\}$ and $\beta^* = (\beta_1^{*T}, 0^T)^T$. Correspondingly, we partition $Z_i$ and $x_i$ as
$$Z_i = \begin{pmatrix} Z_i^{11} & Z_i^{12} \\ Z_i^{21} & Z_i^{22} \end{pmatrix} \quad\text{and}\quad x_i = (x_i^{1T}, x_i^{2T})^T,$$
where $Z_i^{11}$ is an $s\times s$ symmetric matrix and $Z_i^{22}$ is a $(p-s)\times(p-s)$ symmetric matrix. For convenience, we also denote
$$L_n(\beta_1) := \sum_{i=1}^n \big(y_i - \beta_1^T Z_i^{11}\beta_1 - x_i^{1T}\beta_1\big)^2 + \lambda_n\|\beta_1\|_q^q$$
and $C_1 = 2c + 3\sqrt{(\sigma^2+1)/c_1}$.

We first prove some lemmas.

Lemma A.1. Let $\{w_n\}$ be a sequence of real numbers and assume that $\{b_n\}$ and $\{B_n\}$ are two sequences of positive numbers tending to infinity. If
$$B_n \ge \sum_{i=1}^n w_i^2 \quad\text{and}\quad \frac{b_n}{\sqrt{B_n}}\max_{1\le i\le n}|w_i| \to 0,$$
then, for any $\tau > 0$,
$$\limsup_{n\to\infty} b_n^{-2}\log P\Big(\Big|\sum_{i=1}^n w_i\varepsilon_i\Big| > b_n\sqrt{B_n}\,\tau\Big) \le -\frac{\tau^2}{2\sigma^2},$$
or, equivalently,
$$P\Big(\Big|\sum_{i=1}^n w_i\varepsilon_i\Big| > b_n\sqrt{B_n}\,\tau\Big) \le \exp\Big(-\frac{b_n^2\tau^2}{2\sigma^2} + o(b_n^2)\Big).$$

Proof. The result follows by the same argument as the proof of Lemma 3.2 in Fan, Yan and Xiu (2014) and is therefore omitted.
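As a quick illustration (ours, not part of the paper), the exponential rate in Lemma A.1 can be compared with Monte Carlo exceedance frequencies. The sketch below is written under simplifying assumptions: the errors are taken to be i.i.d. $N(0,\sigma^2)$, a single $n$ is fixed, $B_n = \sum_i w_i^2$ is the smallest admissible choice, and the $o(b_n^2)$ term in the bound is ignored, so the comparison is only indicative.

import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0
n, reps = 500, 20000
w = rng.uniform(-1.0, 1.0, size=n)            # fixed weights w_1, ..., w_n
B_n = np.sum(w ** 2)                          # smallest admissible B_n
b_n = 1.0                                     # here b_n * max|w_i| / sqrt(B_n) is small

eps = rng.normal(0.0, sigma, size=(reps, n))  # illustrative Gaussian errors
S = eps @ w                                   # sum_i w_i * eps_i, one value per replication
freq = np.mean(np.abs(S) > b_n * np.sqrt(B_n) * tau)
bound = np.exp(-b_n ** 2 * tau ** 2 / (2.0 * sigma ** 2))
print(f"empirical exceedance {freq:.4f}  vs  exp(-b_n^2 tau^2/(2 sigma^2)) = {bound:.4f}")

For Gaussian errors the empirical frequency (about 0.046 here) stays below the displayed rate (about 0.135); since the lemma is asymptotic, other error distributions satisfying the paper's conditions may require larger $n$ before the bound becomes informative.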


Lemma A.2. Assume that Conditions 1-2 and 4 hold. Let $\{a_n\}$ be a sequence of positive numbers satisfying (11) and (12). Then, for any $\tau > 0$,
$$P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u + x_i^1)\varepsilon_i\Big\| > \tau\Big) \le \exp\Big(-\frac{a_n^2\tau^2}{2c^2\sigma^2} + o(a_n^2)\Big).$$

Proof. Let $A = \{v\in\mathbb{R}^s : \|v\|\le 1\}$ and denote $r_n = 1/n$. Then by Lemma 14.27 in Buhlmann and van de Geer (2011), we have
$$A \subseteq \bigcup_{j=1}^{m_n} B(v_j, r_n),$$
where $m_n = (1+2n)^s$ and $B(v_j, r_n) = \{v\in\mathbb{R}^s : \|v - v_j\|\le r_n,\ v_j\in A\}$ for $j = 1,\cdots,m_n$. By a similar method to the proof of the second result of Lemma 5.1 in Fan, Yan and Xiu (2014), we use Lemma A.1 with $B_n = nc^2$ and $b_n = a_n$ to obtain that for any $\tau_1 > 0$ and $\varepsilon_1\in(0,\tau_1/2)$,
$$P\Big(\frac{1}{a_n\sqrt{n}}\Big\|\sum_{i=1}^n(Z_i^{11}u + x_i^1)\varepsilon_i\Big\| > \tau_1\Big) \le m_n\exp\Big(-\frac{a_n^2(\tau_1-\varepsilon_1)^2}{2c^2\sigma^2} + o(a_n^2)\Big). \qquad (21)$$
Further, denote $r_n' = C_1\sqrt{s}/n$. Again, by Lemma 14.27 of Buhlmann and van de Geer (2011), we have
$$S \subseteq \bigcup_{j=1}^{m_n} B(u_j, r_n'),$$
where $B(u_j, r_n') = \{u\in\mathbb{R}^s : \|u - u_j\|\le r_n',\ u_j\in S\}$ for $j = 1,\cdots,m_n$. Analogously to (21), we obtain that for any $\varepsilon\in(0,\tau/2)$ and $\varepsilon_1\in(0,(\tau-\varepsilon)/2)$,
$$P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u + x_i^1)\varepsilon_i\Big\| > \tau\Big) \le m_n^2\exp\Big(-\frac{a_n^2(\tau-\varepsilon-\varepsilon_1)^2}{2c^2\sigma^2} + o(a_n^2)\Big).$$
From (11) we conclude that $a_n^{-2}\log m_n^2 = 2a_n^{-2}s\log(1+2n) \to 0$, which together with the above inequality implies that
$$\limsup_{n\to\infty} a_n^{-2}\log P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u + x_i^1)\varepsilon_i\Big\| > \tau\Big) \le -\frac{(\tau-\varepsilon-\varepsilon_1)^2}{2c^2\sigma^2}.$$
Since $\varepsilon$ and $\varepsilon_1$ are arbitrary, we have, for large enough $n$,
$$P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u + x_i^1)\varepsilon_i\Big\| > \tau\Big) \le \exp\Big(-\frac{a_n^2\tau^2}{2c^2\sigma^2} + o(a_n^2)\Big).$$


Lemma A.3. Under the assumptions of Lemma A.2, there exists $\hat\beta_1 = \arg\min_{\beta_1\in\mathbb{R}^s} L_n(\beta_1)$ such that
$$P\big(\|\hat\beta_1 - \beta_1^*\| \le r_n\big) \ge 1 - \exp\Big(-\frac{(1+c_1^2/4)a_n^2}{2c^2\sigma^2} + o(a_n^2)\Big). \qquad (22)$$

Proof. To show the existence of a minimizer $\hat\beta_1$, we consider the level set $\{\beta_1\in\mathbb{R}^s : L_n(\beta_1)\le L_n(\beta_1^*)\}$. It is apparent that
$$\inf_{\beta_1\in\mathbb{R}^s} L_n(\beta_1) = \inf_{\beta_1\in\{\beta_1\in\mathbb{R}^s : L_n(\beta_1)\le L_n(\beta_1^*)\}} L_n(\beta_1).$$
Since $L_n(\cdot)$ is continuous and coercive (note that $\lambda_n\|\beta_1\|_q^q\to\infty$ as $\|\beta_1\|\to\infty$), and the level set is non-empty and closed, $L_n(\cdot)$ has at least one minimizer $\hat\beta_1$ in the level set.

Now we prove (22). For notational convenience, we denote $\tilde Z = (\tilde Z_1,\cdots,\tilde Z_n)$, $\Sigma_n = \tilde Z\tilde Z^T/n$ and $\varepsilon = (\varepsilon_1,\cdots,\varepsilon_n)^T$, where $\tilde Z_i = Z_i^{11}(\hat\beta_1+\beta_1^*) + x_i^1$. Obviously, Condition 1 implies that $\Sigma_n$ is invertible. Then by the definition of $\hat\beta_1$ we have $L_n(\hat\beta_1)\le L_n(\beta_1)$ for any $\beta_1\in\mathbb{R}^s$, which implies
$$\begin{aligned} \sum_{i=1}^n\varepsilon_i^2 + \lambda_n\sum_{j=1}^s|\beta_{1j}^*|^q &\ge \sum_{i=1}^n\big(\varepsilon_i - (\hat\beta_1-\beta_1^*)^T\tilde Z_i\big)^2 + \lambda_n\sum_{j=1}^s|e_{s,j}^T\hat\beta_1|^q\\ &= \sum_{i=1}^n\varepsilon_i^2 - 2(\hat\beta_1-\beta_1^*)^T\tilde Z\varepsilon + \lambda_n\sum_{j=1}^s|e_{s,j}^T\hat\beta_1|^q + n(\hat\beta_1-\beta_1^*)^T\Sigma_n(\hat\beta_1-\beta_1^*) \end{aligned}$$
and therefore
$$n(\hat\beta_1-\beta_1^*)^T\Sigma_n(\hat\beta_1-\beta_1^*) \le 2(\hat\beta_1-\beta_1^*)^T\tilde Z\varepsilon + \lambda_n\sum_{j=1}^s\big(|e_{s,j}^T\beta_1^*|^q - |e_{s,j}^T\hat\beta_1|^q\big). \qquad (23)$$
By the same method as in the proof of relation (8) in Huang, Horowitz and Ma (2008), we conclude from Condition 1, the second convergence of Condition 4 and the strong law of large numbers that, for large enough $n$,
$$\|\hat\beta_1-\beta_1^*\|\le C_1\sqrt{s} \quad\text{and}\quad \|\hat\beta_1+\beta_1^*\|\le C_1\sqrt{s}, \quad a.s.,$$
and
$$\|\hat\beta_1-\beta_1^*\|^2 \le \frac{2}{nc_1}\|\hat\beta_1-\beta_1^*\|\,\|\tilde Z\varepsilon\| + \frac{\eta_n}{nc_1} \le \frac{2C_1\sqrt{s}}{nc_1}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u+x_i^1)\varepsilon_i\Big\| + \frac{\lambda_n s c^q}{nc_1}, \quad a.s., \qquad (24)$$


where $\eta_n = \lambda_n\sum_{j=1}^s\big(|e_{s,j}^T\beta_1^*|^q - |e_{s,j}^T\hat\beta_1|^q\big)$. Therefore,
$$\begin{aligned} 1 &= P\Big(\|\hat\beta_1-\beta_1^*\|^2 \le \frac{2C_1\sqrt{s}}{nc_1}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u+x_i^1)\varepsilon_i\Big\| + \frac{\lambda_n s c^q}{nc_1}\Big)\\ &\le P\Big(\|\hat\beta_1-\beta_1^*\|^2 \le \frac{2C_1\sqrt{s}\,a_n}{c_1\sqrt{n}} + \frac{\lambda_n s c^q}{nc_1}\Big) + P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S}\Big\|\sum_{i=1}^n(Z_i^{11}u+x_i^1)\varepsilon_i\Big\| > 1\Big), \end{aligned}$$
which together with Lemma A.2 yields
$$P\big(\|\hat\beta_1-\beta_1^*\| > r_n'\big) \le \exp\Big(-\frac{a_n^2}{2c^2\sigma^2} + o(a_n^2)\Big), \qquad (25)$$
where $r_n' = \Big(\frac{2C_1a_n\sqrt{s}}{c_1\sqrt{n}} + \frac{\lambda_n s c^q}{nc_1}\Big)^{1/2}$.

Since $r_n'\to 0$ as $n\to\infty$, it follows that for large enough $n$,
$$\frac12|e_{s,j}^T\beta_1^*| \le |e_{s,j}^T\hat\beta_1| \le \frac32|e_{s,j}^T\beta_1^*|, \quad j=1,\cdots,s,$$
when $\|\hat\beta_1-\beta_1^*\|\le r_n'$. By the mean value theorem and the Cauchy-Schwarz inequality, we have, for large enough $n$,
$$\eta_n \le 2c^{q-1}\lambda_n\sqrt{s}\,\|\hat\beta_1-\beta_1^*\|$$
when $\|\hat\beta_1-\beta_1^*\|\le r_n'$. Combining the above inequality, (23), the Cauchy-Schwarz inequality and Condition 1, we have, for large enough $n$,
$$\|\hat\beta_1-\beta_1^*\| \le \frac{2}{nc_1}\|\tilde Z\varepsilon\| + \frac{2c^{q-1}\lambda_n\sqrt{s}}{nc_1}$$
when $\|\hat\beta_1-\beta_1^*\|\le r_n'$. Therefore, it follows from the first inequality of (24) that for large enough $n$,
$$\begin{aligned} 1 &= P\Big(\|\hat\beta_1-\beta_1^*\|^2 \le \frac{2}{nc_1}\|\hat\beta_1-\beta_1^*\|\,\|\tilde Z\varepsilon\| + \frac{\eta_n}{nc_1}\Big)\\ &\le P\Big(\|\hat\beta_1-\beta_1^*\| \le \frac{2}{nc_1}\|\tilde Z\varepsilon\| + \frac{2c^{q-1}\lambda_n\sqrt{s}}{c_1 n}\Big) + P\big(\|\hat\beta_1-\beta_1^*\| > r_n'\big)\\ &\le P\Big(\|\hat\beta_1-\beta_1^*\| \le \frac{a_n}{\sqrt{n}} + \frac{2c^{q-1}\lambda_n\sqrt{s}}{c_1 n}\Big) + P\Big(\frac{1}{a_n\sqrt{n}}\sup_{\|u\|\le C_1\sqrt{s}}\Big\|\sum_{i=1}^n(Z_i^{11}u+x_i^1)\varepsilon_i\Big\| > \frac{c_1}{2}\Big) + P\big(\|\hat\beta_1-\beta_1^*\| > r_n'\big). \end{aligned}$$
Then, by Lemma A.2 and (25) we have
$$P\big(\|\hat\beta_1-\beta_1^*\| \ge r_n\big) \le \exp\Big(-\frac{a_n^2}{2c^2\sigma^2} + o(a_n^2)\Big) + \exp\Big(-\frac{c_1^2a_n^2}{8c^2\sigma^2} + o(a_n^2)\Big),$$


which yields
$$\limsup_{n\to\infty} a_n^{-2}\log P\big(\|\hat\beta_1-\beta_1^*\|\ge r_n\big) \le -\frac{1}{2c^2\sigma^2} - \frac{c_1^2}{8c^2\sigma^2}.$$
Thus, we have
$$P\big(\|\hat\beta_1-\beta_1^*\|\ge r_n\big) \le \exp\Big(-\frac{(1+c_1^2/4)a_n^2}{2c^2\sigma^2} + o(a_n^2)\Big),$$
which yields (22).

Proof of Theorem 3.1. Denote
$$b_n = a_n + \frac{2c^{q-1}\lambda_n\sqrt{s}}{c_1\sqrt{n}} \quad\text{and}\quad r_n = \Big(\frac{a_n}{\sqrt{n}} + \frac{2c^{q-1}\lambda_n\sqrt{s}}{c_1 n}\Big)\sqrt{s}.$$
We first show that
$$\lambda_n^{-1}r_n^{1-q}b_n\sqrt{n}\,s \to 0 \quad\text{and}\quad \lambda_n^{-1}r_n^{2-q}\sqrt{n}\,s^2 \to 0, \quad\text{as } n\to\infty. \qquad (26)$$
The first convergence of (26) follows from the second convergence of Condition 4 and (13). Since $\lambda_n s^2/n\to 0$, the inequality of Condition 2 implies that
$$\frac{s^{3/2}}{\sqrt{n}} \le \frac{\lambda_n s^2}{n}\cdot\frac{\sqrt{n}}{\lambda_n} \le \frac{\lambda_n s^2}{n}\cdot\frac{1}{\sigma c^{1-q}\sqrt{\log p}} \to 0, \qquad (27)$$
which yields
$$\lambda_n^{-1}r_n^{2-q}\sqrt{n}\,s^2 = \lambda_n^{-1}r_n^{1-q}b_n\sqrt{n}\,s\cdot\frac{s^{3/2}}{\sqrt{n}} \to 0.$$
For any $u = (u_1^T, u_2^T)^T\in\mathbb{R}^p$ with $u_1\in\mathbb{R}^s$, we show that there exists a sufficiently large constant $C$ such that
$$P\Big(L_n(\hat\beta_1, 0) = \inf_{\|u\|_1\le C} L_n(\beta_1^* + r_nu_1,\, r_nu_2)\Big) \ge 1 - \exp\big(-C_0a_n^2 + o(a_n^2)\big), \qquad (28)$$
which implies that, with probability at least $1-\exp\big(-C_0a_n^2 + o(a_n^2)\big)$, $(\hat\beta_1^T, 0^T)^T$ is a local minimizer in the ball $\{\beta^* + r_nu : \|u\|_1\le C\}$, so that both (9) and (10) hold.

Denote $\zeta_{1i} = Z_i^{11}(2\beta_1^* + r_nu_1) + x_i^1$ and $\zeta_{2i} = 2Z_i^{21}(\beta_1^* + r_nu_1) + x_i^2 + r_nZ_i^{22}u_2$, and define the event
$$E_1 := \Big\{\Big\|\sum_{i=1}^n \zeta_{2i}\varepsilon_i\Big\|_\infty \le 4(1+c^2)^{1/2}\, b_n\sqrt{n}\,s\Big\},$$
where $b_n = a_n + \frac{2c^{q-1}\lambda_n\sqrt{s}}{c_1\sqrt{n}}$. For any $u_2\in\mathbb{R}^{p-s}$, we show that under the event $E_1$,
$$L_n(\beta_1^* + r_nu_1,\, r_nu_2) \ge L_n(\beta_1^* + r_nu_1,\, 0). \qquad (29)$$


Clearly, $L_n(\beta_1^*+r_nu_1, r_nu_2) = L_n(\beta_1^*+r_nu_1, 0)$ when $\|u_2\|_1 = 0$. We proceed to show (29) for $\|u_2\|_1 > 0$. It follows that
$$\begin{aligned} &L_n(\beta_1^*+r_nu_1, r_nu_2) - L_n(\beta_1^*+r_nu_1, 0)\\ &\quad= -2r_n\sum_{i=1}^n u_2^T\zeta_{2i}\varepsilon_i + r_n^2\sum_{i=1}^n (u_2^T\zeta_{2i})^2 + 2r_n^2\sum_{i=1}^n u_1^T\zeta_{1i}\,u_2^T\zeta_{2i} + \lambda_n r_n^q\|u_2\|_q^q\\ &\quad\ge -2r_n\sum_{i=1}^n u_2^T\zeta_{2i}\varepsilon_i + 2r_n^2\sum_{i=1}^n u_1^T\zeta_{1i}\,u_2^T\zeta_{2i} + \lambda_n r_n^q\|u_2\|_q^q. \end{aligned} \qquad (30)$$
We now use the fact that $|u^TAv| \le \|u\|_1\|Av\|_\infty \le |A|_\infty\|u\|_1\|v\|_1$ for any $n\times d$ matrix $A$ and vectors $u\in\mathbb{R}^n$, $v\in\mathbb{R}^d$ to bound $|\sum_{i=1}^n u_1^T\zeta_{1i}u_2^T\zeta_{2i}|$. Noting that
$$\Big|\sum_{i=1}^n u_1^T\zeta_{1i}\,u_2^T\zeta_{2i}\Big| \le \|u_1\|_1\|u_2\|_1\Big|\sum_{i=1}^n \zeta_{1i}\zeta_{2i}^T\Big|_\infty,$$
we then estimate the upper bound of $|\sum_{i=1}^n\zeta_{1i}\zeta_{2i}^T|_\infty$. Recalling the definition of $|\cdot|_\infty$, we calculate $e_{s,j}^T\zeta_{1i}\zeta_{2i}^Te_{p-s,k}$ for each $j = 1,\cdots,s$ and $k = 1,\cdots,p-s$. It is easy to check that
$$\begin{aligned} e_{s,j}^T\zeta_{1i}\zeta_{2i}^Te_{p-s,k} ={}& 2(2\beta_1^*+r_nu_1)^TZ_i^{11}e_{s,j}\,e_{p-s,k}^TZ_i^{21}(\beta_1^*+r_nu_1) + 2x_i^{1T}e_{s,j}\,e_{p-s,k}^TZ_i^{21}(\beta_1^*+r_nu_1)\\ &+ (2\beta_1^*+r_nu_1)^TZ_i^{11}e_{s,j}\,e_{p-s,k}^Tx_i^2 + x_i^{1T}e_{s,j}\,e_{p-s,k}^Tx_i^2\\ &+ r_n(2\beta_1^*+r_nu_1)^TZ_i^{11}e_{s,j}\,e_{p-s,k}^TZ_i^{22}u_2 + r_nx_i^{1T}e_{s,j}\,e_{p-s,k}^TZ_i^{22}u_2. \end{aligned}$$
So,
$$\begin{aligned} &\Big|\sum_{i=1}^n e_{s,j}^T\zeta_{1i}\zeta_{2i}^Te_{p-s,k}\Big|\\ &\le 2\|2\beta_1^*+r_nu_1\|_1\|\beta_1^*+r_nu_1\|_1\Big|\sum_{i=1}^n Z_i^{11}e_{s,j}e_{p-s,k}^TZ_i^{21}\Big|_\infty + 2\|\beta_1^*+r_nu_1\|_1\Big|\sum_{i=1}^n e_{s,j}^Tx_i^1\,e_{p-s,k}^TZ_i^{21}\Big|_\infty\\ &\quad+ \|2\beta_1^*+r_nu_1\|_1\Big|\sum_{i=1}^n Z_i^{11}e_{s,j}e_{p-s,k}^Tx_i^2\Big|_\infty + \Big|\sum_{i=1}^n x_i^{1T}e_{s,j}e_{p-s,k}^Tx_i^2\Big|_\infty\\ &\quad+ r_n\|2\beta_1^*+r_nu_1\|_1\|u_2\|_1\Big|\sum_{i=1}^n Z_i^{11}e_{s,j}e_{p-s,k}^TZ_i^{22}\Big|_\infty + r_n\|u_2\|_1\Big|\sum_{i=1}^n x_i^{1T}e_{s,j}e_{p-s,k}^TZ_i^{22}\Big|_\infty\\ &\le \Big(2(2cs+r_nC)(cs+r_nC) + 2(cs+r_nC) + (2cs+r_nC) + 1 + r_n(2cs+r_nC)C + r_nC\Big)\sqrt{n}\,c_0, \end{aligned}$$
where the last inequality follows from Condition 3. Since $r_n\to 0$ as $n\to\infty$, we conclude that for large enough $n$,
$$\Big|\sum_{i=1}^n e_{s,j}^T\zeta_{1i}\zeta_{2i}^Te_{p-s,k}\Big| \le (12c^2s^2 + 12cs + 1)\sqrt{n}\,c_0 \le (12c^2+1)c_0\sqrt{n}\,s^2,$$


and therefore
$$\Big|\sum_{i=1}^n u_1^T\zeta_{1i}\,u_2^T\zeta_{2i}\Big| \le \|u_1\|_1\|u_2\|_1\Big|\sum_{i=1}^n\zeta_{1i}\zeta_{2i}^T\Big|_\infty \le (12c^2+1)c_1C\sqrt{n}\,s^2\|u_2\|_1. \qquad (31)$$
Note that
$$\Big|\sum_{i=1}^n u_2^T\zeta_{2i}\varepsilon_i\Big| \le \|u_2\|_1\Big\|\sum_{i=1}^n\zeta_{2i}\varepsilon_i\Big\|_\infty$$
and $\|u_2\|_q^q \ge C^{q-1}\|u_2\|_1$. Under the event $E_1$, it follows from (26), (30) and (31) that
$$\begin{aligned} &L_n(\beta_1^*+r_nu_1, r_nu_2) - L_n(\beta_1^*+r_nu_1, 0)\\ &\quad\ge -2r_n\Big\|\sum_{i=1}^n\zeta_{2i}\varepsilon_i\Big\|_\infty\|u_2\|_1 - (12c^2+1)c_1Cr_n^2\sqrt{n}\,s^2\|u_2\|_1 + C^{q-1}\lambda_nr_n^q\|u_2\|_1\\ &\quad\ge \lambda_nr_n^q\|u_2\|_1\Big(-2\big(2(1+c^2)\big)^{1/2}\lambda_n^{-1}r_n^{1-q}b_n\sqrt{n}\,s - (12c^2+1)c_1C\lambda_n^{-1}r_n^{2-q}\sqrt{n}\,s^2 + C^{q-1}\Big) > 0, \end{aligned}$$
when $\|u_2\|_1 > 0$. That is, (29) holds.

On the other hand, under the event $\{\|\hat\beta_1-\beta_1^*\|\le r_n\}$, we conclude from $\|\hat\beta_1-\beta_1^*\|_1\le\sqrt{s}\,\|\hat\beta_1-\beta_1^*\|$ that $\|\hat\beta-\beta^*\|_1 = \|\hat\beta_1-\beta_1^*\|_1\le r_n\sqrt{s}$, which yields
$$\inf_{\|u\|_1\le C} L_n(\beta_1^*+r_nu_1, r_nu_2) \le L_n(\hat\beta) = L_n(\hat\beta_1, 0) \le L_n(\beta_1^*+r_nu_1, 0).$$
Combining this and (29), we have $L_n(\hat\beta) = \inf_{\|u\|_1\le C} L_n(\beta_1^*+r_nu_1, r_nu_2)$ under the event $E_1\cap\{\|\hat\beta_1-\beta_1^*\|\le r_n\}$. That is,
$$E_1\cap\{\|\hat\beta_1-\beta_1^*\|\le r_n\} \subseteq \Big\{\hat\beta\in\arg\inf_{\|u\|_1\le C} L_n(\beta_1^*+r_nu_1, r_nu_2)\Big\}. \qquad (32)$$

To complete the proof of (28), we need to verify that
$$P\Big(\Big\|\sum_{i=1}^n\zeta_{2i}\varepsilon_i\Big\|_\infty > 4(1+c^2)^{1/2}b_n\sqrt{n}\,s\Big) \le \exp\Big(-\frac{b_n^2}{4\sigma^2} + o(b_n^2)\Big). \qquad (33)$$
Denote the $j$th element of $\zeta_{2i}$ by $\zeta_{2ij}$. Since $\|\beta_1^*+r_nu_1\|_1\le\|\beta_1^*\|_1 + r_n\|u_1\|_1\le cs + r_nC$, we use the Cauchy-Schwarz inequality to obtain
$$\begin{aligned} |\zeta_{2ij}| &\le 2|e_{p-s,j}^TZ_i^{21}(\beta_1^*+r_nu_1)| + |x_{ij}| + r_n|e_{p-s,j}^TZ_i^{22}u_2|\\ &\le 2\|e_{p-s,j}^TZ_i^{21}\|_\infty\|\beta_1^*+r_nu_1\|_1 + \kappa_{1n} + r_n\|e_{p-s,j}^TZ_i^{22}\|_\infty\|u_2\|_1\\ &\le (2c+3r_nC)\kappa_{2n}s + \kappa_{1n}. \end{aligned} \qquad (34)$$


By similar calculations, we have
$$\begin{aligned} \sum_{i=1}^n\zeta_{2ij}^2 &\le 4\sum_{i=1}^n\Big(4\big(e_{p-s,j}^TZ_i^{21}\beta_1^*\big)^2 + 4r_n^2\big(e_{p-s,j}^TZ_i^{21}u_1\big)^2 + x_{ij}^2 + r_n^2\big(e_{p-s,j}^TZ_i^{22}u_2\big)^2\Big)\\ &\le 4\sum_{i=1}^n\Big(4\|e_{p,j}^TZ_i\|_\infty^2\big(\|\beta_1^*\|_1^2 + r_n^2\|u_1\|_1^2 + r_n^2\|u_2\|_1^2\big) + x_{ij}^2\Big)\\ &\le 4\sum_{i=1}^n\Big(4\|Z_i\|_\infty^2\big(\|\beta_1^*\|_1^2 + r_n^2\|u\|_1^2\big) + x_{ij}^2\Big)\\ &\le 4(4c^2 + 4r_n^2C^2 + 1)ns^2. \end{aligned}$$
Write $B_n = 4(4c^2+4r_n^2C^2+1)ns^2$. Since the limits (8) and (12) imply respectively that
$$\frac{\lambda_n\sqrt{s}}{\sqrt{n}}\cdot\frac{\kappa_{2n}s+\kappa_{1n}}{\sqrt{n}\,s}\to 0 \quad\text{and}\quad a_n\cdot\frac{\kappa_{2n}s+\kappa_{1n}}{\sqrt{n}\,s}\to 0,$$
it follows from (34) and $r_n\to 0$ that
$$\frac{b_n\max_{1\le i\le n}|\zeta_{2ij}|}{\sqrt{B_n}}\to 0, \quad\text{as } n\to\infty.$$
We use Lemma A.1 to obtain that
$$P\Big(\Big|\sum_{i=1}^n\zeta_{2ij}\varepsilon_i\Big| > b_n\sqrt{B_n}\Big) \le \exp\Big(-\frac{b_n^2}{2\sigma^2} + o(b_n^2)\Big),$$
which, combined with the relation $r_n\to 0$, leads to
$$P\Big(\Big|\sum_{i=1}^n\zeta_{2ij}\varepsilon_i\Big| > 4(1+c^2)^{1/2}b_n\sqrt{n}\,s\Big) \le \exp\Big(-\frac{b_n^2}{2\sigma^2} + o(b_n^2)\Big) \qquad (35)$$
for each $j$. Note that the first relation of Condition 4 implies that
$$b_n > \frac{2c^{q-1}\lambda_n\sqrt{s}}{\sqrt{n}} \ge 2\sigma\sqrt{\log p}.$$
Therefore we conclude that
$$P\Big(\Big\|\sum_{i=1}^n\zeta_{2i}\varepsilon_i\Big\|_\infty > 4(1+c^2)^{1/2}b_n\sqrt{n}\,s\Big) \le \sum_{j=s+1}^{p}P\Big(\Big|\sum_{i=1}^n\zeta_{2ij}\varepsilon_i\Big| > 4(1+c^2)^{1/2}b_n\sqrt{n}\,s\Big) \le \exp\Big(-\frac{b_n^2}{4\sigma^2} + o(b_n^2)\Big),$$
which yields (33). Further, by Lemma A.3, (32) and (33), we have
$$P\Big(\hat\beta\in\arg\inf_{\|u\|_1\le C} L_n(\beta_1^*+r_nu_1, r_nu_2)\Big) \ge P\big(E_1\cap\{\|\hat\beta_1-\beta_1^*\|\le r_n\}\big) \ge 1-\exp\big(-C_0a_n^2 + o(a_n^2)\big).$$


Proof of Theorem 3.2. It suffices to show that the sequence $a_n = \sqrt{s}\,\log n$ satisfies (11)-(13). First, it is clear that $a_n/\sqrt{s\log n}\to\infty$. Further, it follows from (27) that
$$\frac{a_n}{\sqrt{n}} = \frac{\sqrt{s}\,\log n}{\sqrt{n}} \le \frac{\big(\max(s,\log n)\big)^{3/2}}{\sqrt{n}} \to 0.$$
Moreover, the inequality in Condition 4 and (8) imply that
$$\frac{a_n\kappa_{1n}\sqrt{s}}{\sqrt{n}} = \frac{\lambda_n\kappa_{1n}s\log n}{n}\cdot\frac{\sqrt{n}}{\lambda_n} \le \frac{\lambda_n\kappa_{1n}s\log n}{n}\cdot\frac{1}{\sigma c^{1-q}\sqrt{\log p}} \to 0$$
and
$$\frac{a_n\kappa_{2n}s^{3/2}}{\sqrt{n}} = \frac{\lambda_n\kappa_{2n}s^2\log n}{n}\cdot\frac{\sqrt{n}}{\lambda_n} \le \frac{\lambda_n\kappa_{2n}s^2\log n}{n}\cdot\frac{1}{\sigma c^{1-q}\sqrt{\log p}} \to 0.$$
Therefore, by the first convergence of Condition 4, we obtain
$$\frac{a_n^{2-q}n^{q/2}s^{(4-q)/2}}{\lambda_n} = \frac{n^{q/2}s^{3-q}(\log n)^{2-q}}{\lambda_n}\to 0,$$
which completes the proof. $\Box$

A.2 Proofs of Theorems 4.1 and 4.2

Here we use the notation of Appendix A.1 and provide two lemmas, Lemmas A.4 and A.5, corresponding to Lemmas A.2 and A.3 there.

Lemma A.4. For the model (16), assume that Conditions 1′-2′ and 4′ hold. Let $\{a_n\}$ be a sequence of positive numbers satisfying (11) and (12). Then, for any $\tau > 0$,
$$P\Big(\frac{1}{a_n\sqrt{n}}\sup_{u\in S'}\Big\|\sum_{i=1}^n(Z_i^{11}u)\varepsilon_i\Big\| > \tau\Big) \le \exp\Big(-\frac{a_n^2\tau^2}{2c^2\sigma^2} + o(a_n^2)\Big),$$
where $S' = S_1\cap S$.

Proof. First note that
$$\big\{u\in\mathbb{R}^s : \|u\|_0 \ge s - [\tfrac{s}{2}]\big\} \subseteq \bigcup_{k=[\frac{s}{2}]}^{s}\big\{u\in\mathbb{R}^s : \|u\|_0 = k\big\},$$
and Lemma 14.27 of Buhlmann and van de Geer (2011) implies that, in the subspace $\mathbb{R}^k$,
$$\Big\{v\in\mathbb{R}^k : \min_{1\le l\le k}|e_{k,l}^Tv|\ge\frac{c}{2},\ \|v\|\le C_1\sqrt{s}\Big\} \subseteq \bigcup_{j=1}^{(1+2n)^k}\Big\{v\in\mathbb{R}^k : \|v-v_j\|\le\frac{1}{n},\ \min_{1\le l\le k}|e_{k,l}^Tv_j|\ge\frac{c}{2},\ \|v_j\|\le C_1\sqrt{s}\Big\}.$$


Since
$$\sum_{k=[\frac{s}{2}]}^{s} C_s^k(1+2n)^k \le (1+2n)^s\sum_{k=[\frac{s}{2}]}^{s}C_s^k \le (2+4n)^s$$
and
$$\Big\{u\in\mathbb{R}^s : \big|\big\{j : |e_{s,j}^Tu|\ge\tfrac{c}{2}\big\}\big|\ge s-[\tfrac{s}{2}]\Big\} = \Big\{u\in\mathbb{R}^s : \|u\|_0\ge s-[\tfrac{s}{2}],\ |e_{s,j}^Tu|\ge c/2,\ j\in\mathrm{supp}(u)\Big\},$$
we have
$$S' \subseteq \bigcup_{j=1}^{(2+4n)^s}\Big\{u\in\mathbb{R}^s : \|u-u_j\|\le\frac{1}{n},\ u_j\in S'\Big\}.$$
Then we can use the same method as in the proof of Lemma A.2 to obtain the desired result.

Lemma A.5. Under the assumptions of Lemma A.4, $L_n(\beta_1)$ has two minimizers $\hat\beta_1$ and $-\hat\beta_1$ such that
$$P\big(\|(-\hat\beta_1)-(-\beta_1^*)\|\le r_n\big) = P\big(\|\hat\beta_1-\beta_1^*\|\le r_n\big) \ge 1-\exp\Big(-\frac{(1+c_1^2/4)a_n^2}{2c^2\sigma^2} + o(a_n^2)\Big).$$

Proof. Define
$$S_1(\beta_1^*) = \Big\{u\in\mathbb{R}^s : \big|\big\{j : |e_{s,j}^Tu + e_{s,j}^T\beta_1^*|\ge c/2\big\}\big| \ge s-[\tfrac{s}{2}]\Big\}$$
and
$$S_1(-\beta_1^*) = \Big\{u\in\mathbb{R}^s : \big|\big\{j : |e_{s,j}^Tu - e_{s,j}^T\beta_1^*|\ge c/2\big\}\big| \ge s-[\tfrac{s}{2}]\Big\}.$$
We first show that
$$S_1(\beta_1^*)\cup S_1(-\beta_1^*) = \mathbb{R}^s. \qquad (36)$$
It is obvious that $S_1(\beta_1^*)\cup S_1(-\beta_1^*)\subseteq\mathbb{R}^s$. To show the opposite inclusion, we need the following two facts: for any $u\in\mathbb{R}^s$,
$$\big|\big\{j : |e_{s,j}^Tu+e_{s,j}^T\beta_1^*|\ge c/2\big\}\big| \le [\tfrac{s}{2}] \iff \big|\big\{j : |e_{s,j}^Tu+e_{s,j}^T\beta_1^*| < c/2\big\}\big| \ge s-[\tfrac{s}{2}]$$
and
$$\big\{j : |e_{s,j}^Tu+e_{s,j}^T\beta_1^*| < c/2\big\} \subseteq \big\{j : |e_{s,j}^Tu-e_{s,j}^T\beta_1^*|\ge c/2\big\}.$$
It is clear that the first fact holds, so we only need to check the second. Note that, for each $j$ with $|e_{s,j}^Tu+e_{s,j}^T\beta_1^*| < c/2$, it is easy to verify that
$$-2e_{s,j}^T\beta_1^* - c/2 < e_{s,j}^Tu - e_{s,j}^T\beta_1^* < -2e_{s,j}^T\beta_1^* + c/2.$$


Combining this and the assumption $0 < c \le \min\{|e_{p,j}^T\beta^*|,\ j\in\Gamma^*\}$, we have
$$e_{s,j}^Tu - e_{s,j}^T\beta_1^* \begin{cases} < -3c/2, & \text{if } e_{s,j}^T\beta_1^* \ge c;\\ > 3c/2, & \text{if } e_{s,j}^T\beta_1^* \le -c, \end{cases}$$
which yields $|e_{s,j}^Tu - e_{s,j}^T\beta_1^*|\ge c/2$. Therefore the second fact holds. It follows that, for any $\beta_1\notin S_1(\beta_1^*)$, i.e., $\big|\{j : |e_{s,j}^T\beta_1+e_{s,j}^T\beta_1^*|\ge c/2\}\big|\le[\frac{s}{2}]$, the two facts above imply $\beta_1\in S_1(-\beta_1^*)$, which further implies that (36) holds.

Note that for any $\beta_1\in S_1(\beta_1^*)$ we have $-\beta_1\in S_1(-\beta_1^*)$, and for any $\beta_1\in S_1(-\beta_1^*)$ we have $-\beta_1\in S_1(\beta_1^*)$; that is, $S_1(-\beta_1^*) = -S_1(\beta_1^*)$. Since $L_n(\beta_1)$ is an even function, it follows from (36) that
$$\min_{\beta_1\in\mathbb{R}^s} L_n(\beta_1) = \min_{\beta_1\in S_1(\beta_1^*)} L_n(\beta_1) = \min_{\beta_1\in S_1(-\beta_1^*)} L_n(\beta_1).$$
By the same method as in the proof of Lemma A.3, we can show that there exists a minimizer $\hat\beta_1 = \arg\min_{\beta_1\in S_1(\beta_1^*)} L_n(\beta_1)$ such that (22) holds. The desired result follows, and the proof is complete.

Proof of Theorem 4.1. From Lemmas A.4 and A.5, we can use the same method as for model (1) to prove that, under the event $E_1\cap\{\|\hat\beta_1-\beta_1^*\|\le r_n\}$, $(\hat\beta_1^T, 0^T)^T$ is a local minimizer in the ball $\{\beta^*+r_nu : \|u\|_1\le C\}$ and $(-\hat\beta_1^T, 0^T)^T$ is a local minimizer in the ball $\{-\beta^*+r_nu : \|u\|_1\le C\}$. As mentioned before, we identify vectors $\beta,\beta'\in\mathbb{R}^p$ that satisfy $\beta' = \pm\beta$. Then there exists a strict local minimizer $\hat\beta$ such that both results (9) and (10) remain true. $\Box$

The proof of Theorem 4.2 is analogous to that of Theorem 3.2.

Appendix B Analysis of the optimization algorithm

Lemma B.1. [Chen, Xiu and Peng (2014)] Let $t\in\mathbb{R}$, $\lambda>0$ and $q\in(0,1)$ be given, and let $t^* = (2-q)\big(q(1-q)^{q-1}\lambda\big)^{1/(2-q)}$. For any $t_0 > t^*$, there exists a unique implicit function $u = h_{\lambda,q}(t)$ on $(t^*,\infty)$ such that $u_0 = h_{\lambda,q}(t_0)$, $u = h_{\lambda,q}(t) > 0$,
$$h_{\lambda,q}(t) - t + \lambda q\,h_{\lambda,q}(t)^{q-1} = 0,$$
and $u = h_{\lambda,q}(t)$ is continuously differentiable with
$$h_{\lambda,q}'(t) = \frac{1}{1+\lambda q(q-1)h_{\lambda,q}(t)^{q-2}} > 0.$$
For any $t_0 < -t^*$, there exists a unique function $u = h_{\lambda,q}(t)$ on $(-\infty,-t^*)$ such that $u_0 = h_{\lambda,q}(t_0)$, $u = h_{\lambda,q}(t) < 0$,
$$h_{\lambda,q}(t) - t - \lambda q\,|h_{\lambda,q}(t)|^{q-1} = 0,$$
and $u = h_{\lambda,q}(t)$ is continuously differentiable with
$$h_{\lambda,q}'(t) = \frac{1}{1+\lambda q(q-1)|h_{\lambda,q}(t)|^{q-2}} > 0.$$


Furthermore, the global solution $\bar u$ of problem (18) satisfies
$$\bar u = H_{\lambda,q}(t) := \begin{cases} h_{\lambda,q}(t), & \text{if } t < -t^*;\\ -(2\lambda(1-q))^{\frac{1}{2-q}} \text{ or } 0, & \text{if } t = -t^*;\\ 0, & \text{if } -t^* < t < t^*;\\ (2\lambda(1-q))^{\frac{1}{2-q}} \text{ or } 0, & \text{if } t = t^*;\\ h_{\lambda,q}(t), & \text{if } t > t^*. \end{cases}$$
In particular, $h_{\lambda,1/2}(t) = \frac{2}{3}t\Big(1+\cos\big(\frac{2\pi}{3} - \frac{2}{3}\varphi_\lambda(t)\big)\Big)$ with $\varphi_\lambda(t) = \arccos\Big(\frac{\lambda}{4}\big(\frac{|t|}{3}\big)^{-3/2}\Big)$.

Lemma B.2. For $q\in(0,1)$ and $\lambda>0$, let $\bar u = \arg\min_{u\in\mathbb{R}^p}\frac12\|u-b\|_2^2 + \lambda\|u\|_q^q$ for any $b\in\mathbb{R}^p$. Then $\bar u = H_{\lambda,q}(b)$.

The result is an immediate consequence of Lemma B.1 and therefore the proof is omitted.
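For illustration, the following minimal numerical sketch (ours, not the authors' code) evaluates the componentwise operator of Lemma B.2, that is, the global solution of the scalar problem $\min_u \frac{1}{2}(u-t)^2+\lambda|u|^q$, which we take to be the scalar form of problem (18). It compares the zero candidate with the largest positive stationary point from the equation in Lemma B.1, located by bisection, and cross-checks the case $q=1/2$ against the closed-form $h_{\lambda,1/2}$ displayed above; the switch point used in that branch, $\tfrac{3}{2}\lambda^{2/3}$, is the value of $|t|$ at which the zero and nonzero candidates tie in this comparison. All function names are our own.

import numpy as np

def prox_lq(t, lam, q, tol=1e-12):
    """Global minimizer of 0.5*(u - t)^2 + lam*|u|^q for scalar t and 0 < q < 1."""
    if t == 0.0:
        return 0.0
    s, a = np.sign(t), abs(t)
    f = lambda u: 0.5 * (u - a) ** 2 + lam * u ** q       # objective on u >= 0
    g = lambda u: u - a + lam * q * u ** (q - 1.0)        # stationarity equation of Lemma B.1
    u_min = (lam * q * (1.0 - q)) ** (1.0 / (2.0 - q))    # minimizer of g on (0, infinity)
    if u_min >= a or g(u_min) > 0.0:                      # no positive stationary point
        return 0.0
    lo, hi = u_min, a                                     # g(lo) <= 0 <= g(hi), g increasing here
    while hi - lo > tol * max(1.0, a):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) <= 0.0 else (lo, mid)
    u = 0.5 * (lo + hi)
    return s * u if f(u) < f(0.0) else 0.0                # compare with the zero candidate

def half_threshold(t, lam):
    """Closed form for q = 1/2, set to zero below the tie point (3/2)*lam^(2/3)."""
    a = abs(t)
    if a <= 1.5 * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 4.0) * (a / 3.0) ** (-1.5))
    return (2.0 / 3.0) * t * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))

if __name__ == "__main__":
    lam, q = 0.5, 0.5
    for t in (-2.0, -0.9, 0.3, 1.1, 2.5):
        print(t, prox_lq(t, lam, q), half_threshold(t, lam))

Applied componentwise to a vector $b$, this reproduces $H_{\lambda,q}(b)$ of Lemma B.2 up to the tie-breaking choice at the threshold, where the lemma allows either candidate.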

Proof of Theorem 5.1. For any $\tau > 0$, define the auxiliary problem
$$\min_{\beta\in\mathbb{R}^p} F_\tau(\beta, u) := \ell(u) + \langle\nabla\ell(u), \beta - u\rangle + \frac{1}{2\tau}\|\beta - u\|_2^2 + \lambda\|\beta\|_q^q, \quad \forall u\in\mathbb{R}^p. \qquad (37)$$
It is easy to check that problem (37) is equivalent to the minimization problem
$$\min_{\beta\in\mathbb{R}^p} \frac{1}{2}\big\|\beta - (u - \tau\nabla\ell(u))\big\|_2^2 + \lambda\tau\|\beta\|_q^q.$$
For any $r > 0$, let $B_r = \{\beta\in\mathbb{R}^p : \|\beta\|_2 \le r\}$ and $G_r = \sup_{\beta\in B_r}\|\nabla^2\ell(\beta)\|_2$. For any $\tau\in(0, G_r^{-1}]$ and $\beta, u\in B_r$, we have
$$\begin{aligned} L(\beta) &= \ell(u) + \langle\nabla\ell(u),\beta-u\rangle + \frac12(\beta-u)^T\nabla^2\ell(\xi)(\beta-u) + \lambda\|\beta\|_q^q\\ &= F_\tau(\beta,u) + \frac12(\beta-u)^T\nabla^2\ell(\xi)(\beta-u) - \frac{1}{2\tau}\|\beta-u\|_2^2\\ &\le F_\tau(\beta,u) + \frac12\|\nabla^2\ell(\xi)\|_2\|\beta-u\|_2^2 - \frac{1}{2\tau}\|\beta-u\|_2^2\\ &\le F_\tau(\beta,u) + \frac{G_r}{2}\|\beta-u\|_2^2 - \frac{1}{2\tau}\|\beta-u\|_2^2\\ &\le F_\tau(\beta,u), \end{aligned} \qquad (38)$$
where $\xi = u + \alpha(\beta - u)$ for some $\alpha\in(0,1)$ and the second inequality follows from $\|\xi\|_2\le r$.

Further, let $\bar\beta\in\arg\min_{\beta\in\mathbb{R}^p} F_\tau(\beta,\hat\beta)$, where $\hat\beta$ is the global minimizer of (17). Since $L(\beta)\ge 0$ and $\lim_{\|\beta\|_2\to\infty} L(\beta) = \infty$, there exists a positive constant $r_1$ such that $\|\hat\beta\|_2\le r_1$. Note that
$$\nabla\ell(\beta) = 2\sum_{i=1}^m(\beta^TZ_i\beta + x_i^T\beta - y_i)(2Z_i\beta + x_i), \qquad (39)$$


which implies that $\nabla\ell(\beta)$ is continuously differentiable. Then, take
$$r_2 = r_1 + \sup_{\beta\in B_{r_1}}\|\nabla\ell(\beta)\|_2.$$
Hence it follows from Lemma B.2 that $\|\bar\beta\|_2\le r_2$ for any $\tau\in(0,1]$. By the definitions of $\hat\beta$ and $\bar\beta$, we obtain from the inequality (38) that, for any $\tau\in\big(0,\min\{G_{r_2}^{-1},1\}\big)$,
$$F_\tau(\bar\beta,\hat\beta) \le F_\tau(\hat\beta,\hat\beta) = L(\hat\beta) \le L(\bar\beta) \le F_\tau(\bar\beta,\hat\beta),$$
which leads to $F_\tau(\hat\beta,\hat\beta) = F_\tau(\bar\beta,\hat\beta)$. Therefore $\hat\beta$ is also a minimizer of problem (37) with $u = \hat\beta$. The result then follows from Lemma B.2. $\Box$

Lemma B.3. Let $g_k = \|\nabla\ell(\beta^k)\|_2$ and $G_k = \sup_{\beta\in B_k}\|\nabla^2\ell(\beta)\|_2$, where $B_k = \{\beta\in\mathbb{R}^p : \|\beta\|_2\le\|\beta^k\|_2 + g_k\}$. For any $\delta>0$ and $\gamma,\alpha\in(0,1)$, define
$$j_k = \begin{cases} 0, & \text{if } \gamma(G_k+\delta)\le 1;\\ -[\log_\alpha \gamma(G_k+\delta)] + 1, & \text{otherwise.} \end{cases}$$
Then (20) holds.

Proof. From the definitions of $\tau_k$ and $j_k$, it is easy to check that
$$G_k - \frac{1}{\tau_k} \le -\delta. \qquad (40)$$
Indeed, when $\gamma(G_k+\delta)\le 1$ we have $\tau_k = \gamma$, which yields
$$G_k - \frac{1}{\tau_k} = \frac{\gamma G_k - 1}{\gamma} \le -\delta.$$
If $\gamma(G_k+\delta) > 1$, then
$$\tau_k = \gamma\alpha^{j_k} \le \gamma\alpha^{-\log_\alpha \gamma(G_k+\delta)} = \frac{1}{G_k+\delta},$$
which also leads to (40).

Note that
$$\beta^{k+1} \in \arg\min_{\beta\in\mathbb{R}^p} F_{\tau_k}(\beta, \beta^k) \qquad (41)$$
and
$$\|\beta^{k+1}\|_2 \le \|\beta^k - \tau_k\nabla\ell(\beta^k)\|_2 \le \|\beta^k\|_2 + g_k,$$


which yields $\beta^{k+1}\in B_k$. Similar to (38), we obtain from (40) that
$$\begin{aligned} L(\beta^{k+1}) &\le F_{\tau_k}(\beta^{k+1},\beta^k) + \frac12\|\beta^{k+1}-\beta^k\|_2^2\Big(\|\nabla^2\ell(\xi^k)\|_2 - \frac{1}{\tau_k}\Big)\\ &\le F_{\tau_k}(\beta^{k+1},\beta^k) + \frac12\|\beta^{k+1}-\beta^k\|_2^2\Big(G_k - \frac{1}{\tau_k}\Big)\\ &\le F_{\tau_k}(\beta^{k+1},\beta^k) - \frac{\delta}{2}\|\beta^{k+1}-\beta^k\|_2^2, \end{aligned}$$
where $\xi^k = \beta^k + \varrho(\beta^{k+1}-\beta^k)$ for some $\varrho\in(0,1)$, and $\xi^k\in B_k$ gives the second inequality. Combining this and (41), we have
$$L(\beta^k) - L(\beta^{k+1}) = F_{\tau_k}(\beta^k,\beta^k) - L(\beta^{k+1}) \ge F_{\tau_k}(\beta^{k+1},\beta^k) - L(\beta^{k+1}) \ge \frac{\delta}{2}\|\beta^{k+1}-\beta^k\|_2^2,$$
which completes the proof.
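To make the iteration analyzed in Lemma B.3 concrete, here is a small sketch (our own illustration, not the authors' implementation) of an FPIA-type loop for $L(\beta)=\ell(\beta)+\lambda\|\beta\|_q^q$ with the gradient (39): each step applies the componentwise thresholding of Lemma B.2 to $\beta^k-\tau_k\nabla\ell(\beta^k)$, and the step size is chosen by plain backtracking that enforces the sufficient-decrease inequality (20) directly instead of computing $G_k$, a practical simplification of the $j_k$ rule above. The scalar prox repeats, in compact form, the construction sketched after Lemma B.2; the zero initialization, tolerances and array layout are our choices.

import numpy as np

def prox_lq_scalar(t, lam, q, tol=1e-10):
    # global minimizer of 0.5*(u - t)^2 + lam*|u|^q: compare 0 with the largest stationary point
    if t == 0.0:
        return 0.0
    s, a = np.sign(t), abs(t)
    g = lambda u: u - a + lam * q * u ** (q - 1.0)
    u_min = (lam * q * (1.0 - q)) ** (1.0 / (2.0 - q))
    if u_min >= a or g(u_min) > 0.0:
        return 0.0
    lo, hi = u_min, a
    while hi - lo > tol * max(1.0, a):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) <= 0.0 else (lo, mid)
    u = 0.5 * (lo + hi)
    return s * u if 0.5 * (u - a) ** 2 + lam * u ** q < 0.5 * a ** 2 else 0.0

def fpia(y, Z, x, lam, q=0.5, gamma=0.9, alpha=0.5, delta=1e-4, max_iter=500, tol=1e-8):
    """Sketch of an FPIA-type loop for sum_i (y_i - b'Z_i b - x_i'b)^2 + lam*||b||_q^q.
    Z has shape (n, p, p) with symmetric slices, x has shape (n, p), y has shape (n,)."""
    n, p = x.shape
    beta = np.zeros(p)

    def resid(b):
        return np.einsum('ijk,j,k->i', Z, b, b) + x @ b - y      # b'Z_i b + x_i'b - y_i

    def grad(b):                                                 # gradient (39)
        r = resid(b)
        return 2.0 * np.einsum('i,ij->j', r, 2.0 * np.einsum('ijk,k->ij', Z, b) + x)

    def L(b):
        return np.sum(resid(b) ** 2) + lam * np.sum(np.abs(b) ** q)

    for _ in range(max_iter):
        g = grad(beta)
        tau = gamma
        while True:                                              # backtracking on the decrease in (20)
            cand = np.array([prox_lq_scalar(t, lam * tau, q) for t in beta - tau * g])
            if L(beta) - L(cand) >= 0.5 * delta * np.sum((cand - beta) ** 2) or tau < 1e-12:
                break
            tau *= alpha
        done = np.linalg.norm(cand - beta) <= tol * max(1.0, np.linalg.norm(beta))
        beta = cand
        if done:
            break
    return beta

For $q=1/2$ the inner scalar prox could be replaced by the closed-form $h_{\lambda\tau_k,1/2}$ of Lemma B.1, which avoids the bisection entirely.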

Lemma B.4. Let $\{\beta^k\}$ and $\{\tau_k\}$ be generated by FPIA. Then,
(i) $\{\beta^k\}$ is bounded; and
(ii) there is a nonnegative integer $\bar j$ such that $\tau_k\in[\gamma\alpha^{\bar j},\gamma]$.

Proof. Lemma B.3 implies that $\{L(\beta^k)\}$ is strictly decreasing. From this, $\ell(\cdot)\ge 0$ and the definition of $L(\cdot)$, it is easy to check that $\{\beta^k\}$ is bounded. Since $\ell(\cdot)$ is twice continuously differentiable, it then follows from the boundedness of $\{\beta^k\}$ that there exist two positive constants $\bar g$ and $\bar G$ such that $\sup_{k\ge 0} g_k \le \bar g$ and $\sup_{k\ge 0} G_k \le \bar G$. Define $\bar j = \max\big(0, [-\log_\alpha\gamma(\bar G+\delta)]+1\big)$. Then $0\le j_k\le\bar j$, which together with the definition of $\tau_k$ implies that $\tau_k\in[\gamma\alpha^{\bar j},\gamma]$.

Now we consider the convergence of the sequence $\{\beta^k\}$. To this end we slightly modify $h_{\lambda,q}(\cdot)$ as follows:
$$h_{\lambda,q}(t) := \begin{cases} h_{\lambda,q}(t), & \text{if } t < -t^*;\\ 0, & \text{if } |t|\le t^*;\\ h_{\lambda,q}(t), & \text{if } t > t^*. \end{cases} \qquad (42)$$
Then we have the following result.

Theorem B.1. Let $\{\beta^k\}$ be the sequence generated by FPIA. Then,
(i) $\{L(\beta^k)\}$ converges to $L(\bar\beta)$, where $\bar\beta$ is any accumulation point of $\{\beta^k\}$;
(ii) $\lim_{k\to\infty}\frac{\|\beta^{k+1}-\beta^k\|_2}{\tau_k} = 0$;
(iii) any accumulation point of $\{\beta^k\}$ is a stationary point of the minimization problem (17) when $\gamma \le \Big(\frac{q}{16(1-q)}\,\bar g^{-1}\Big)^{\frac{2-q}{1-q}}\big(\lambda(1-q)\big)^{\frac{1}{1-q}}$ and $\bar g = \sup_{k\ge 0}\|\nabla\ell(\beta^k)\|_2$.


Proof. (i) Since $\{\beta^k\}$ is bounded, it has at least one accumulation point. Since $\{L(\beta^k)\}$ is monotonically decreasing and $L(\cdot)\ge 0$, $\{L(\beta^k)\}$ converges to a constant $\bar L\,(\ge 0)$. Since $L(\beta)$ is continuous, we have $L(\beta^k)\to\bar L = L(\bar\beta)$ as $k\to\infty$, where $\bar\beta$ is an accumulation point of $\{\beta^k\}$.

(ii) From the definition of $\beta^{k+1}$ and (20), we have
$$\sum_{k=0}^{n}\|\beta^{k+1}-\beta^k\|_2^2 \le \frac{2}{\delta}\sum_{k=0}^{n}\big[L(\beta^k)-L(\beta^{k+1})\big] = \frac{2}{\delta}\big[L(\beta^0)-L(\beta^{n+1})\big] \le \frac{2}{\delta}L(\beta^0).$$
Hence $\sum_{k=0}^{\infty}\|\beta^{k+1}-\beta^k\|_2^2 < \infty$ and $\|\beta^{k+1}-\beta^k\|_2\to 0$ as $k\to\infty$. Then the second result of Lemma B.4 leads to result (ii).

(iii) Since $\{\beta^k\}$ and $\{\tau_k\}$ have convergent subsequences, without loss of generality, assume that
$$\beta^k\to\bar\beta \quad\text{and}\quad \tau_k\to\bar\tau, \quad\text{as } k\to\infty. \qquad (43)$$
It suffices to prove that $\bar\beta$ and $\bar\tau$ satisfy (19). Note that
$$\big\|\bar\beta - H_{\lambda\bar\tau,q}\big(\bar\beta - \bar\tau\nabla\ell(\bar\beta)\big)\big\|_2 \le \|\bar\beta-\beta^{k+1}\|_2 + \big\|H_{\lambda\tau_k,q}\big(\beta^k-\tau_k\nabla\ell(\beta^k)\big) - H_{\lambda\bar\tau,q}\big(\bar\beta-\bar\tau\nabla\ell(\bar\beta)\big)\big\|_2 = I_1 + I_2. \qquad (44)$$
Result (ii) and (43) imply that $I_1\to 0$ as $k\to\infty$.

To complete the proof, we need to show that $I_2\to 0$ for $q\in(0,1)$. For $i = 1,\cdots,p$, denote
$$v_i^k = e_{p,i}^T\big(\beta^k - \tau_k\nabla\ell(\beta^k)\big), \quad \bar v_i = e_{p,i}^T\big(\bar\beta - \bar\tau\nabla\ell(\bar\beta)\big), \quad t_i^* = \frac{2-q}{2(1-q)}\big[2\lambda\bar\tau(1-q)\big]^{1/(2-q)}$$
and $\beta_i = \big(2\lambda\bar\tau(1-q)\big)^{1/(2-q)}$. Then it suffices to prove that
$$h_{\lambda\tau_k,q}(v_i^k) \to h_{\lambda\bar\tau,q}(\bar v_i) \qquad (45)$$
when $v_i^k\to\bar v_i$ as $k\to\infty$. We only give the proof of (45) for $\bar v_i > 0$ because the case $\bar v_i < 0$ can be proved similarly.

For $\bar v_i < t_i^*$, the limit (43) and the definition of $h_{\lambda\tau,q}$ imply that $h_{\lambda\tau_k,q}(v_i^k) = 0 = h_{\lambda\bar\tau,q}(\bar v_i)$. For $\bar v_i > t_i^*$, one can conclude from (43) and the continuity of $h_{\lambda\tau,q}$ on $(t_i^*,\infty)$ that $h_{\lambda\tau_k,q}(v_i^k)\to h_{\lambda\bar\tau,q}(\bar v_i)$. For $\bar v_i = t_i^*$, we show that any subsequence of $\{v_i^k\}$ converging to $\bar v_i$, without loss of generality say $\{v_i^k\}$ itself, must satisfy
$$v_i^k \le t_i^*, \quad\text{for large enough } k. \qquad (46)$$


We prove the above inequality by contradiction. Denote $\Delta = \frac{q}{16(1-q)}\big(\lambda(1-q)\big)^{1/(2-q)}$ and $\delta_i = \frac{t_i^*-\beta_i}{4}$. Note that $t_i^* > \beta_i$ and $\delta_i = 2\Delta(2\bar\tau)^{1/(2-q)} > 0$. The second limit of (43) implies $\bar\tau\ge\frac12\tau_k$ and hence $\delta_i\ge 2\Delta\,\tau_k^{1/(2-q)}$ for large enough $k$. Since $\tau_k^{\frac{1-q}{2-q}}\Delta^{-1}\le\gamma^{\frac{1-q}{2-q}}\Delta^{-1}\le\bar g^{-1}$, for large enough $k$ we have
$$\tau_k\|\nabla\ell(\beta^k)\|_2 \le \tau_k\,\bar g \le \Delta\,\tau_k^{1/(2-q)} \le \frac{\delta_i}{2}$$
and therefore
$$e_{p,i}^T\beta^k = v_i^k + \tau_k\big[\nabla\ell(\beta^k)\big]_i \ge v_i^k - \tau_k\|\nabla\ell(\beta^k)\|_2 \ge v_i^k - \frac12\delta_i.$$
Combining this, result (ii) and $v_i^k\to t_i^*$, we have
$$e_{p,i}^T\beta^{k+1} \ge e_{p,i}^T\beta^k - \frac12\delta_i \ge v_i^k - \delta_i \ge t_i^* - 2\delta_i = \beta_i + 2\delta_i, \quad\text{for large enough } k. \qquad (47)$$
Note that $h_{\lambda\tau,q}$ is continuous on $(t_i^*,\infty)$ and $\lim_{k\to\infty} h_{\lambda\tau_k,q}(v_i^k) = \beta_i$. For large enough $k$, we have $e_{p,i}^T\beta^{k+1} = h_{\lambda\tau_k,q}(v_i^k)\in[\beta_i-\delta_i,\beta_i+\delta_i]$, which contradicts (47). So (46) holds. By the definition of $h_{\lambda,q}(\cdot)$, we have $h_{\lambda\tau_k,q}(v_i^k) = 0 = h_{\lambda\bar\tau,q}(\bar v_i)$.