
Boosted Lasso

Peng Zhao

University of California, Berkeley, USA.

Bin Yu

University of California, Berkeley, USA.

Summary. In this paper, we propose the Boosted Lasso (BLasso) algorithm, which is able to produce an approximation to the complete regularization path for general Lasso problems. BLasso is derived as a coordinate descent method with a fixed small step size applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step and uses differences of functions instead of gradients. The forward step is similar to Boosting and Forward Stagewise Fitting, but the backward step is new and crucial for BLasso to approximate the Lasso path in all situations. For cases with a finite number of base learners, when the step size goes to zero, the BLasso path is shown to converge to the Lasso path. For nonparametric learning problems with a large or an infinite number of base learners, BLasso remains valid since its forward steps are Boosting steps and its backward steps only involve the base learners that are included in the model from previous iterations. Experimental results are also provided to demonstrate the difference between BLasso and Boosting or Forward Stagewise Fitting. In addition, we extend BLasso to the case of a general convex loss penalized by a general convex function and illustrate this extended BLasso with examples.

Keywords: Coordinate Descent; Backward; Boosting; Lasso; Regularization Path.

1. Introduction

Lasso (Tibshirani, 1996) is an important idea that originated recently in statistics. It regularizes or shrinks a fitted model through an L1 penalty or constraint. Its popularity can be explained in several ways. Since nonparametric models that fit training data well often have low biases but large variances, prediction accuracies can sometimes be improved by shrinking a model or making it more sparse. The regularization resulting from the L1 penalty leads to sparse solutions, that is, there are few basis functions with nonzero weights (among all possible choices). This statement is proved asymptotically by Knight and Fu (2000) and for the finite case by Donoho (2004) in the specialized setting of over-complete libraries and large under-determined systems of linear equations. Furthermore, the sparse models induced by Lasso are more interpretable and often preferred in sciences and social sciences.

Another vastly popular and recent idea is Boosting. Since its inception in 1990 (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996), it has become one of the most successful machine learning methods and is now understood as an iterative method leading to an additive model (Breiman, 1998, Mason et al., 1999 and Friedman et al., 2001).

While it is a natural idea to combine Boosting and Lasso to obtain a regularized Boosting procedure, it is also intriguing that Boosting, without any additional regularization, has its own resistance to overfitting. For specific cases such as L2Boost (Friedman, 2001), this resistance is understood to some extent (Buhlmann and Yu, 2001). However, it was not until later, when Forward Stagewise Fitting (FSF) was introduced as a boosting-based procedure with much more cautious steps, that a similarity between FSF and Lasso was observed (Hastie et al., 2001, Efron et al., 2004).

This link between Lasso and FSF is formally described for the linear regression case through LARS (Least Angle Regression, Efron et al. 2004). It is also known that for special cases (such as orthogonal designs) FSF can approximate the Lasso path arbitrarily closely, but in general it is unclear what regularization criterion FSF optimizes. As can be seen in our experiments (Figure 1), FSF solutions can be significantly different from the Lasso solutions in the case of strongly correlated predictors, which are common in high-dimensional data problems. However, it is still used as an approximation to Lasso because it is often computationally prohibitive to solve Lasso with general loss functions for many regularization parameters through Quadratic Programming.

In this paper, we propose a new algorithm, Boosted Lasso (BLasso), which approximates the Lasso path in all situations. The motivation comes from a critical observation that both FSF and Boosting only work in a forward fashion (hence the name FSF). They always take steps that reduce the empirical loss the most, regardless of the impact on model complexity (or the L1 penalty in the Lasso case). This often proves to be too greedy: the algorithms are not able to correct mistakes made in early stages. Taking a coordinate descent viewpoint of the Lasso minimization with a fixed step size, we introduce an innovative "backward" step. This step utilizes the same minimization rule as the forward step to define each fitting stage, with an additional rule to force the model complexity to decrease. By combining backward and forward steps, Boosted Lasso is able to go back and forth and approximates the Lasso path correctly.

BLasso can be seen as a marriage between two families of successful methods. Computationally, BLasso works similarly to Boosting and FSF. It isolates the sub-optimization problem at each step from the whole process, i.e. in the language of the Boosting literature, each base learner is learned separately. This way BLasso can deal with different loss functions and large classes of base learners like trees, wavelets and splines by fitting a base learner at each step and aggregating the base learners as the algorithm progresses. FSF has only implicit local regularization, in the sense that at any iteration FSF with a fixed small step size searches only over those models which are one small step away from the current one in all possible directions corresponding to the base learners. In contrast, BLasso can be proven to converge to the Lasso solutions, which have explicit global L1 regularization, for cases with a finite number of base learners. The algorithm uses only the differences of the loss function and basic operations, without taking any first or second order derivative or matrix inversion. The fact that BLasso can be generalized to give the regularized path for other convex penalties also comes as a pleasant surprise, which further shows a connection between statistical estimation methods and the Barrier Method (Fiacco and McCormick, 1968, and cf. Boyd and Vandenberghe 2004) in the constrained convex optimization literature. For the original Lasso problem, i.e. the least squares problem with L1 penalty, algorithms that give the entire Lasso path have been established (the Shooting algorithm by Fu 1998 and LARS by Efron et al. 2004). These algorithms are not suitable for nonparametric learning where the number of base learners can be very large or infinite. BLasso remains valid since it treats each fitting step as a sub-optimization problem. Even for cases where the number of base learners is small, BLasso is still an attractive alternative to existing algorithms because of its simple and intuitive nature and the fact that it can be stopped early to find a solution from the regularization path.

The rest of the paper is organized as follows. A brief review of Boosting and FSF is provided in Section 2.1 and of Lasso in Section 2.2. Section 3 introduces BLasso. Section 4 discusses the backward step, gives the intuition behind BLasso and explains why FSF is unable to give the Lasso path. Section 5 contains a more detailed discussion on solving the L1 penalized least squares problem using BLasso. Section 6 discusses BLasso for nonparametric learning problems. Section 7 introduces a Generalized BLasso algorithm which deals with general convex penalties. In Section 8, results of experiments with both simulated and real data are reported to demonstrate the attractiveness of BLasso and the differences between BLasso and FSF. Finally, Section 9 contains a discussion on the choice of step sizes, a summary of the paper, and future research directions.

2. Boosting, Forward Stagewise Fitting and the Lasso

Boosting is an iterative fitting procedure that builds up a model stage by stage. FSF can be viewed as Boosting with a fixed small step size at each stage, and it produces solutions that are often close to the Lasso solutions (path). This section gives a brief review of the FSF and Boosting algorithms, followed by a review of Lasso.

2.1. Boosting and Forward Stagewise Fitting
The boosting algorithms can be seen as functional gradient descent techniques (Friedman et al. 2000, Mason et al. 1999). The task is to estimate the function F : R^d → R that minimizes an expected loss

E[C(Y, F(X))],   C(·, ·) : R × R → R+,   (1)

based on data Zi = (Yi, Xi) (i = 1, ..., n). The univariate Y can be continuous (regression problem) or discrete (classification problem). The most prominent examples of the loss function C(·, ·) include the Classification Margin, Logit Loss and L2 Loss functions.

The family of F(·) being considered is the set of ensembles of "base learners"

D = {F : F(x) = ∑_j βj hj(x), x ∈ R^d, βj ∈ R},   (2)

where the family of base learners can be very large or contain infinitely many members, e.g. trees, wavelets and splines.

Let β = (β1, ..., βj, ...)^T; we can reparametrize the problem using

L(Z; β) := C(Y, F(X)),   (3)

where the specification of F is hidden by L to make our notation simpler. To find an estimate for β, we set up an empirical minimization problem:

β̂ = arg min_β ∑_{i=1}^n L(Zi; β).   (4)

Despite the fact that the empirical loss function is often convex in β, exact minimization is usually a formidable task for a moderately rich function family of base learners, and with such function families the exact minimization leads to overfitted models. Boosting is a progressive procedure that iteratively builds up the solution (and it is often stopped early to avoid overfitting):


(ĵ, ĝ) = arg min_{j, g} ∑_{i=1}^n L(Zi; βt + g 1j),   (5)

βt+1 = βt + ĝ 1ĵ,   (6)

where 1j is the jth standard basis vector, i.e. the vector with all 0s except for a 1 in the jth coordinate, and g ∈ R is a step size parameter. However, because the family of base learners is usually large, finding ĵ is still difficult, and approximate solutions are found instead. A typical strategy that Boosting implements is functional gradient descent. This gradient descent view has been recognized and refined by various authors including Breiman (1998), Mason et al. (1999), Friedman et al. (2000), Friedman (2001), and Buhlmann and Yu (2001). The well-known AdaBoost, LogitBoost and L2Boosting can all be viewed as implementations of this strategy for different loss functions.

Forward Stagewise Fitting (FSF) is a similar method for approximating the minimization problem described by (5) with some additional regularization. Instead of optimizing the step size as in (6), FSF updates βt by a small fixed step size ε, as in Friedman (2001):

βt+1 = βt + ε · sign(ĝ) 1ĵ.

When FSF was introduced (Hastie et al. 2001, Efron et al. 2002), it was only described for the L2 regression setting. For general loss functions, it can be defined by removing the minimization over g in (5):

(ĵ, ŝ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; βt + s 1j),   (7)

βt+1 = βt + ŝ 1ĵ.   (8)

This description looks different from the FSF described in Efron et al. (2002), but the underlying mechanics of the algorithm remain unchanged (see Section 5). Initially all coefficients are zero. At each successive step, a predictor or coordinate is selected that reduces the empirical loss the most. Its corresponding coefficient βĵ is then incremented or decremented by a small amount, while all other coefficients βj, j ≠ ĵ, are left unchanged.
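To make the FSF update (7)-(8) concrete, here is a minimal sketch in Python; the function name `fsf_step`, the loss interface and the toy data are our own illustrative choices and are not part of the paper.

```python
import numpy as np

def fsf_step(beta, empirical_loss, eps=0.01):
    """One Forward Stagewise Fitting step as in (7)-(8): try +/-eps on every
    coordinate and keep the move with the smallest empirical loss, where
    `empirical_loss` maps a coefficient vector to sum_i L(Z_i; beta)."""
    best_loss, best_move = np.inf, None
    for j in range(len(beta)):
        for s in (eps, -eps):
            candidate = beta.copy()
            candidate[j] += s
            value = empirical_loss(candidate)
            if value < best_loss:
                best_loss, best_move = value, (j, s)
    j, s = best_move
    updated = beta.copy()
    updated[j] += s
    return updated

# Toy usage with a squared-error loss on simulated data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(50)
loss = lambda b: np.sum((y - X @ b) ** 2)

beta = np.zeros(5)
for _ in range(200):          # 200 small forward steps
    beta = fsf_step(beta, loss, eps=0.05)
```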

By taking small steps, FSF imposes some implicit regularization. After T < ∞ iterations, many of the coefficients will be zero, namely those that have yet to be incremented. The others will tend to have absolute values smaller than the unregularized solutions. This shrinkage/sparsity property is reflected in the similarity between the solutions given by FSF and Lasso, which is reviewed next.

2.2. Lasso
Let T(β) denote the L1 penalty of β = (β1, ..., βj, ...)^T, that is, T(β) = ‖β‖1 = ∑_j |βj|, and let Γ(β; λ) denote the Lasso (least absolute shrinkage and selection operator) loss function

Γ(β; λ) = ∑_{i=1}^n L(Zi; β) + λ T(β).   (9)


The Lasso estimate β̂ = (β̂1, ..., β̂j, ...)^T is defined by

β̂λ = arg min_β Γ(β; λ).

The parameter λ ≥ 0 controls the amount of regularization applied to the estimate. Setting λ = 0 reduces the Lasso problem to minimizing the unregularized empirical loss. On the other hand, a very large λ will completely shrink β̂ to 0, thus leading to the empty or null model. In general, moderate values of λ will cause shrinkage of the solutions towards 0, and some coefficients may end up being exactly 0. This sparsity in Lasso solutions has been researched extensively, e.g. Efron et al. (2004) and Donoho (2004). (Sparsity can also result from other penalizations, as in Gao and Bruce, 1997 and Fan and Li, 2001.)
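As a concrete reading of (9), here is a small illustrative helper (our own code, with squared error standing in for a generic empirical loss L) that evaluates the Lasso loss Γ(β; λ); larger values of λ make the penalty term dominate, matching the discussion above.

```python
import numpy as np

def lasso_loss(beta, X, y, lam):
    """Gamma(beta; lambda) of (9) with a squared-error empirical loss:
    sum_i (y_i - x_i beta)^2 + lambda * ||beta||_1."""
    empirical = np.sum((y - X @ beta) ** 2)
    penalty = np.sum(np.abs(beta))
    return empirical + lam * penalty
```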

Computation of the solution to the Lasso problem for a fixed λ has been studied for special cases. Specifically, for least squares regression, it is a quadratic programming problem with linear inequality constraints; for the 1-norm SVM, it can be transformed into a linear programming problem. But to get a model that performs well on future data, we need to select an appropriate value for the tuning parameter λ. Practical algorithms have been proposed to give the entire regularization path for the squared loss function (the Shooting algorithm by Fu, 1998 and LARS by Efron et al., 2004) and the SVM (1-norm SVM by Zhu et al., 2003).

But how to give the entire regularization path of the Lasso problem for general convex loss functions remained open. More importantly, existing methods are only efficient when the number of base learners is small, that is, they cannot deal with the large numbers of base learners that often arise in nonparametric learning, e.g. trees, wavelets and splines. FSF exists as a compromise since, like Boosting, it is a nonparametric learning algorithm that works with different loss functions and large numbers of base learners, but it has only implicit local regularization and is often too greedy compared to Lasso, as can be seen from our analysis (Section 4) and experiment (Figure 1).

Next we propose the Boosted Lasso (BLasso) algorithm, which works in a computationally efficient fashion like FSF. But unlike FSF, BLasso is able to approximate the Lasso path for general convex loss functions arbitrarily closely.

3. Boosted Lasso

We start with the description of the BLasso algorithm. A small step size constant ε > 0 and a small tolerance parameter ξ ≥ 0 are needed to run the algorithm. The step size ε controls how well BLasso approximates the Lasso path. The tolerance ξ controls how large a descent needs to be for a backward step to be taken, and it is set to be smaller than ε to have a good approximation. When the number of base learners is finite, the tolerance can be set to 0 or a small number to accommodate the numerical precision of the computer.

Boosted Lasso (BLasso)

Step 1 (initialization). Given data Zi = (Yi, Xi), i = 1, ..., n, take an initial forward step

(ĵ, ŝĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; s 1j),    β0 = ŝĵ 1ĵ.

Then calculate the initial regularization parameter

λ0 = (1/ε) ( ∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; β0) ).

Set the active index set I_A^0 = {ĵ}. Set t = 0.

Step 2 (Backward and Forward steps). Find the "backward" step that leads to the minimal empirical loss

ĵ = arg min_{j ∈ I_A^t} ∑_{i=1}^n L(Zi; βt + sj 1j),   where sj = −sign(βt,j) ε.   (10)

Take the step if it leads to a decrease of more than ξ in the Lasso loss; otherwise force a forward step (as in (7) and (8) of FSF) and relax λ if necessary. That is, if Γ(βt + sĵ 1ĵ; λt) − Γ(βt; λt) < −ξ, then

βt+1 = βt + sĵ 1ĵ,

λt+1 = λt,

I_A^{t+1} = I_A^t \ {ĵ}   if βt+1,ĵ = 0.

Otherwise,

(ĵ, ŝ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; βt + s 1j),   (11)

βt+1 = βt + ŝ 1ĵ,   (12)

λt+1 = min[ λt, (1/ε) ( ∑_{i=1}^n L(Zi; βt) − ∑_{i=1}^n L(Zi; βt+1) ) ],

I_A^{t+1} = I_A^t ∪ {ĵ}.

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λt ≤ 0.
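The following is a compact Python sketch of the algorithm above for a finite dictionary, with a squared-error empirical loss for concreteness; the name `blasso_path`, the data layout (base learners as columns of X) and the toy example are our own illustrative choices, not the authors' implementation. The forward search loops over all coordinates explicitly; for a large dictionary it would be replaced by a boosting-style sub-optimization, as discussed in Section 6.

```python
import numpy as np

def blasso_path(X, y, eps=0.1, xi=1e-10):
    """Sketch of BLasso for a squared-error empirical loss on a finite
    dictionary of m base learners (the columns of X).  Returns the list
    of (lambda_t, beta_t) pairs visited along the path."""
    n, m = X.shape
    loss = lambda b: np.sum((y - X @ b) ** 2)
    gamma = lambda b, lam: loss(b) + lam * np.sum(np.abs(b))

    def best_forward(b):
        # Search +/- eps moves on every coordinate for the smallest loss.
        best = (np.inf, None, None)
        for j in range(m):
            for s in (eps, -eps):
                c = b.copy(); c[j] += s
                if loss(c) < best[0]:
                    best = (loss(c), j, s)
        return best

    # Step 1: initial forward step and initial lambda_0.
    beta = np.zeros(m)
    _, j0, s0 = best_forward(beta)
    new_beta = beta.copy(); new_beta[j0] += s0
    lam = (loss(beta) - loss(new_beta)) / eps
    beta, active = new_beta, {j0}
    path = [(lam, beta.copy())]

    # Steps 2 and 3: backward step if it descends the Lasso loss by more
    # than xi, otherwise a forced forward step that may relax lambda.
    while lam > 0:
        candidate, jb = None, None
        if active:
            jb = min(active, key=lambda j: loss(
                beta - np.sign(beta[j]) * eps * np.eye(m)[j]))
            candidate = beta - np.sign(beta[jb]) * eps * np.eye(m)[jb]
        if candidate is not None and gamma(candidate, lam) - gamma(beta, lam) < -xi:
            beta = candidate
            if abs(beta[jb]) < eps / 2:     # coefficient stepped back to zero
                beta[jb] = 0.0
                active.discard(jb)
        else:
            _, jf, sf = best_forward(beta)
            new_beta = beta.copy(); new_beta[jf] += sf
            lam = min(lam, (loss(beta) - loss(new_beta)) / eps)
            beta = new_beta
            active.add(jf)
        path.append((lam, beta.copy()))
    return path

# Toy usage on simulated data with two strongly correlated base learners.
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 6))
X[:, 5] = X[:, 4] + 0.01 * rng.standard_normal(80)
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.5, 0.0]) + 0.1 * rng.standard_normal(80)
path = blasso_path(X, y, eps=0.05)
```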

We will discuss forward and backward steps in depth in the next section. Immediately, the following properties can be proved for BLasso (see the Appendix for the proof).

Lemma 1.

(a) For any λ ≥ 0, if there exist j and s with |s| = ε such that Γ(s 1j; λ) ≤ Γ(0; λ), we have λ0 ≤ λ.

(b) For any t such that λt+1 = λt, we have Γ(βt+1; λt) ≤ Γ(βt; λt).

(c) For ξ = 0 and any t such that λt+1 < λt, we have Γ(βt; λt) < Γ(βt ± ε 1j; λt) for every j, and ‖βt+1‖1 = ‖βt‖1 + ε.

Lemma 1 (a) guarantees that it is safe for BLasso to start with the initial λ0, which is the largest λ that would allow an ε step away from 0 (i.e., larger λ's correspond to β̂λ = 0). Lemma 1 (b) says that for each value of λ, BLasso performs coordinate descent until there is no descent step. Then, by Lemma 1 (c), the value of λ is reduced and a forward step is forced. Since the minimizers corresponding to adjacent λ's are usually close, this procedure moves from one solution to the next within a few steps and effectively approximates the Lasso path. In fact, we have a convergence result for BLasso:

Theorem 1. For a finite number of base learners and ξ = 0, if L(Z; β) is strictly convex and continuously differentiable in β, then as ε → 0, the BLasso path converges to the Lasso path.

Many popular loss functions, e.g. the squared loss, the logistic loss and negative log-likelihood functions of exponential families, are convex and continuously differentiable. Other functions like the hinge loss (SVM) are continuous and convex but not differentiable. In fact, BLasso does not use any gradient or higher order derivatives but only the differences of the loss function. When the function is differentiable, because of the small fixed step size, the differences are good approximations to the gradients, but they are simpler to compute. When the loss function is not differentiable, BLasso still runs because it does not rely on derivatives. It is theoretically possible that BLasso's coordinate descent strategy gets stuck at nonstationary points for functions like the hinge loss. However, as illustrated in our third experiment, BLasso may still work for the 1-norm SVM empirically.

Note that Theorem 1 does not cover nonparametric learning problems with an infinite number of base learners. In fact, for problems with a large or an infinite number of base learners, the minimization in (11) can be carried out approximately by functional gradient descent, but a tolerance ξ > 0 needs to be chosen to avoid oscillation between forward and backward steps caused by slow descent. We will return to this topic in Section 6.

4. The Backward Boosting Step

We now explain the motivation and working mechanics of BLasso. Observe that FSF only uses "forward" steps, that is, it only takes steps that lead to a direct reduction of the empirical loss. Compared to classical model selection methods like Forward Selection and Backward Elimination, or the growing and pruning of a classification tree, a "backward" counterpart is missing. Without the backward step, FSF can be too greedy and deviate from the Lasso path in some cases (cf. Figure 1). This backward step arises naturally in BLasso because of our coordinate descent view of the minimization of the Lasso loss.

For a given β ≠ 0 and λ > 0, consider the impact of a small change of size ε > 0 in βj on the Lasso loss Γ(β; λ). For |s| = ε,

∆jΓ(Z; β) = ( ∑_{i=1}^n L(Zi; β + s 1j) − ∑_{i=1}^n L(Zi; β) ) + λ ( T(β + s 1j) − T(β) )

         := ∆j( ∑_{i=1}^n L(Zi; β) ) + λ ∆jT(β).   (13)

Since T(β) is simply the L1 norm of β, ∆jT(β) reduces to a simple form:

∆jT(β) = ‖β + s 1j‖1 − ‖β‖1 = |βj + s| − |βj| = sign+(βj, s) · ε,   (14)

where sign+(βj, s) = 1 if sβj > 0 or βj = 0, sign+(βj, s) = −1 if sβj < 0, and sign+(βj, s) = 0 if s = 0.
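To make (14) concrete, here is a small illustrative check (our own code) of sign+ and the resulting penalty change ∆jT; it assumes, as in BLasso, that coefficients sit on an ε-grid so a backward step never overshoots zero.

```python
import numpy as np

def sign_plus(beta_j, s):
    """sign+ as defined after (14)."""
    if s == 0:
        return 0
    return 1 if (s * beta_j > 0 or beta_j == 0) else -1

def penalty_change(beta, j, s):
    """Delta_j T of (14): the exact change in ||beta||_1 when beta_j moves by s."""
    new_beta = beta.copy()
    new_beta[j] += s
    return np.sum(np.abs(new_beta)) - np.sum(np.abs(beta))

beta, eps = np.array([0.4, 0.0, -0.2]), 0.1
# A backward move on the first coordinate (toward zero) and a forward move on
# the second coordinate (away from zero) both match sign_plus(beta_j, s) * eps.
assert np.isclose(penalty_change(beta, 0, -eps), sign_plus(beta[0], -eps) * eps)
assert np.isclose(penalty_change(beta, 1, +eps), sign_plus(beta[1], +eps) * eps)
```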


Equation (14) shows that an ε step's impact on the penalty is a fixed ε for any j; only the sign of the impact may vary. Suppose that, for a given β, the "forward" steps for different j have impacts on the penalty of the same sign; then ∆jT is a constant in (13) for all j, and minimizing the Lasso loss using fixed-size steps is equivalent to minimizing the empirical loss directly. At the "early" stages of FSF, all forward steps move coefficients away from zero, so the signs of the "forward" steps' impacts on the penalty are all positive. As the algorithm proceeds into "later" stages, some of the signs may change to negative, and minimizing the empirical loss is no longer equivalent to minimizing the Lasso loss. Hence in the beginning FSF carries out a steepest descent algorithm that minimizes the Lasso loss and follows Lasso's regularization path, but as it goes into later stages the equivalence is broken and they part ways. In fact, except for special cases like orthogonally designed covariates, FSF usually goes into "later" stages, where the signs of the impacts on the penalty in some directions can change from positive to negative. These directions then reduce the empirical loss and the penalty simultaneously, and therefore should be preferred over other directions. Moreover, there can also be occasions where a step goes "backward" to reduce the penalty with a small sacrifice in empirical loss. In general, to minimize the Lasso loss, one needs to go "back and forth" to trade off the penalty against the empirical loss for different regularization parameters. We call a direction that leads to a reduction of the penalty a "backward" direction and define a backward step as follows:

For a given β, a backward step is such that

∆β = s 1j,   subject to βj ≠ 0, sign(s) = −sign(βj) and |s| = ε.

Making such a step will reduce the penalty by a fixed amount λ · ε, but its impact on the empirical loss can be different; therefore, as in (10), we want

ĵ = arg min_j ∑_{i=1}^n L(Zi; β + sj 1j)   subject to βj ≠ 0 and sj = −sign(βj) ε,

i.e. ĵ is picked such that the empirical loss after making the step is as small as possible. While forward steps try to reduce the Lasso loss through minimizing the empirical loss, the backward steps try to reduce the Lasso loss through minimizing the Lasso penalty. Although rare, it is possible for a step to reduce both the empirical loss and the Lasso penalty; it then works both as a forward step and a backward step. We do not distinguish such steps as they do not create any confusion.

In summary, by allowing the backward steps, we are able to work with the Lasso loss directly and take backward steps to correct earlier forward steps that were too greedy.

5. Least Squares Problem

For the most common special case, least squares regression, the forward and backward steps in BLasso become very simple and more intuitive. To see this, we write out the empirical loss function L(Zi; β) in its L2 form,

∑_{i=1}^n L(Zi; β) = ∑_{i=1}^n (Yi − Xiβ)² = ∑_{i=1}^n (Yi − Ŷi)² = ∑_{i=1}^n ηi²,   (15)

where Ŷ = (Ŷ1, ..., Ŷn)^T are the "fitted values" and η = (η1, ..., ηn)^T are the "residuals".


Recall that in a penalized regression setup Xi = (Xi1, ..., Xim), and every covariate Xj = (X1j, ..., Xnj)^T is normalized, i.e. ‖Xj‖² = ∑_{i=1}^n Xij² = 1 and ∑_{i=1}^n Xij = 0. For a given β = (β1, ..., βm)^T, the impact of a step s of size |s| = ε along βj on the empirical loss function can be written as:

∆j( ∑_{i=1}^n L(Zi; β) ) = ∑_{i=1}^n [ (Yi − Xi(β + s 1j))² − (Yi − Xiβ)² ]

                         = ∑_{i=1}^n [ (ηi − s Xij)² − ηi² ] = ∑_{i=1}^n ( −2s ηi Xij + s² Xij² )

                         = −2s (η · Xj) + s².   (16)

The last line of these equations delivers a strong message: in least squares regression, given the step size, the impact on the empirical loss function is solely determined by the inner product (correlation) between the fitted residuals and each coordinate. Specifically, it is proportional to the negative inner product between the fitted residuals and the covariate, plus the step size squared. Therefore coordinate steepest descent with a fixed step size on the empirical loss function is equivalent to finding, at each step, the covariate that is best correlated with the fitted residuals and then proceeding in that direction. This is in principle the same as FSF.

Translating this for the forward step, where originally

(ĵ, ŝĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; β + s 1j),

we get

ĵ = arg max_j |η · Xj|   and   ŝ = sign(η · Xĵ) ε,   (17)

which coincides exactly with the stagewise procedure described in Efron et al. (2002) and is in general the same principle as L2 Boosting, i.e. recursively refitting the regression residuals along the most correlated direction, except for the difference in step size choice (Friedman 2001, Buhlmann and Yu 2003). Also, under this simplification, a backward step becomes

ĵ = arg min_j ( −sj (η · Xj) )   subject to βj ≠ 0 and sj = −sign(βj) ε.   (18)

Clearly, both forward and backward steps are based only on the correlations between the fitted residuals and the covariates. It follows that BLasso in the L2 case reduces to finding the best direction among both forward and backward directions by examining the inner products, and then deciding whether to go forward or backward based on the regularization parameter. This not only simplifies the minimization procedure but also significantly reduces the computational complexity for large datasets, since the inner product between ηt and Xj can be updated by

(ηt+1)′ Xj = (ηt − s Xĵt)′ Xj = (ηt)′ Xj − s Xĵt′ Xj,   (19)

which takes only one operation if Xĵt′ Xj is precalculated.
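A minimal sketch (our own code) of the correlation-based forward and backward selections (17)-(18) together with the inner-product update (19); it assumes the covariates are already centered and normalized as above, and that `backward_choice` is only called when at least one coefficient is nonzero.

```python
import numpy as np

def forward_choice(corr, eps):
    """(17): the forward direction maximizes |eta . X_j|."""
    j = int(np.argmax(np.abs(corr)))
    return j, np.sign(corr[j]) * eps

def backward_choice(corr, beta, eps):
    """(18): among coordinates with beta_j != 0, pick the backward step
    s_j = -sign(beta_j) * eps whose cost -s_j (eta . X_j), i.e.
    sign(beta_j) * eps * (eta . X_j), is smallest."""
    active = np.flatnonzero(beta)
    costs = np.sign(beta[active]) * eps * corr[active]
    return int(active[int(np.argmin(costs))])

def update_corr(corr, gram, j_step, s):
    """(19): after beta_{j_step} moves by s, every inner product eta' X_j
    loses s * X_{j_step}' X_j, i.e. one precomputed row of the Gram matrix."""
    return corr - s * gram[j_step]

# Toy usage: precompute X'X and eta'X once, then iterate without touching X.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
Xc = X - X.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)              # centered, unit-length columns
y = X @ np.array([1.0, 0.0, -0.5, 0.0]) + 0.05 * rng.standard_normal(100)

gram, corr = X.T @ X, X.T @ y                    # eta = y when beta = 0
beta, eps = np.zeros(4), 0.01
for _ in range(100):                             # forward-only illustration
    j, s = forward_choice(corr, eps)
    beta[j] += s
    corr = update_corr(corr, gram, j, s)
# backward_choice(corr, beta, eps) would supply the competing backward move.
```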

Therefore, when the number of base learners is small, BLasso can run without the original data, based on precalculated X′X and Y′X and using (19), which makes its computational complexity independent of the number of observations. When the number of base learners is large (or infinite), this strategy becomes inefficient (or unsuitable). The forward step is then done by a sub-optimization procedure, e.g. fitting smoothing splines. For the backward step, the inner products between base learners with nonzero coefficients need to be calculated only once; the inner products between these base learners and the residuals can then be updated by (19). This makes the backward steps' computational complexity proportional to the number of base learners that are already chosen instead of the number of all possible base learners. Therefore BLasso works efficiently not only for cases with large sample size but also for cases where a class of a large or an infinite number of possible base learners is given but we only look for a sparse model.

As mentioned earlier, there are already established algorithms for solving the least squares Lasso problem, e.g. the Shooting algorithm (Fu 1998) and LARS (Efron et al. 2004). These algorithms are efficient for giving the exact Lasso path when the number of base learners is small, but neither algorithm is adequate for nonparametric learning problems with a large or an infinite number of base learners. Also, given the entire Lasso path, one solution needs to be selected. Previous algorithms suggest using (generalized) cross-validation procedures that minimize a (generalized) cross-validation loss over a grid of λ. These strategies still apply to BLasso, which can also take advantage of early stopping as in the original boosting literature and potentially save computation on further unnecessary iterations. In particular, in cases where the Ordinary Least Squares (OLS) model performs well, BLasso can be modified to start from the OLS model, go backward, and stop within a few iterations. Overall, for the least squares problem, BLasso is an algorithm with Boosting's computational efficiency and flexibility that also produces Lasso solutions like LARS and the Shooting algorithm.

6. Boosted Lasso for Nonparametric Learning

As we discussed for the least squares problem, Boosted Lasso can be used to deal with nonparametric learning problems with a large or an infinite number of base learners. This is because Boosted Lasso has two types of steps, forward and backward, for neither of which a large or an infinite number of base learners creates a problem.

The forward steps are similar to the steps in Boosting, which can handle a large or an infinite number of base learners. That is, the forward step, as in Boosting, is a sub-optimization problem by itself, and Boosting's functional gradient descent strategy applies even with a large or an infinite number of base learners. For example, in the case of classification with trees, one can use the classification margin or the logistic loss function as the loss function and use a reweighting procedure to find the appropriate tree at each step (for details see e.g. Breiman 1998 and Friedman et al. 2000). In the case of regression with the L2 loss function, the minimization as in (11) is equivalent to refitting the residuals as described in the last section.

The backward steps consider only those base learners already in the model, and the number of these learners is always finite. This is because, for an iterative procedure like Boosted Lasso, we usually stop early to avoid overfitting and to get a sparse model, which results in a finite number of base learners altogether. Even if the algorithm is kept running, it usually reaches a close-to-perfect fit without too many iterations (cf. the "strong learnability" in Schapire 1990). Therefore, the backward step's computational complexity is limited because it only involves base learners that are already included from previous steps.

There is, however, a difference in the BLasso algorithm between the case with a small number of base learners and the one with a large or an infinite number of base learners. In the finite case, since BLasso requires a backward step to be strictly descending and relaxes λ whenever no descending step is available, it never reaches the same solution more than once, and the tolerance constant ξ can be set to 0 or a very small number to accommodate the program's numerical accuracy. In the nonparametric learning case, oscillation could occur when BLasso keeps going back and forth in different directions while improving the penalized loss function only by a diminishing amount; therefore a positive tolerance ξ is mandatory. Since a step's impact on the penalized loss function is on the order of ε, we suggest using ξ = c · ε, where c ∈ (0, 1) is a small constant.

7. Generalized Boosted Lasso

As stated earlier, BLasso not only works for general convex loss functions, but also extends to convex penalties other than the L1 penalty. For the Lasso problem, BLasso does a fixed-step-size coordinate descent to minimize the penalized loss. Since the penalty is the special L1 norm and (14) holds, a step's impact on the penalty has a fixed size ε with either a positive or a negative sign, and the coordinate descent takes the form of "backward" and "forward" steps. This reduces the minimization of the penalized loss function to unregularized minimizations of the loss function as in (11) and (10). For general convex penalties, since a step on different coordinates does not necessarily have the same impact on the penalty, one is forced to work with the penalized function directly. Assume T(β): R^m → R is a convex penalty function. We now describe the Generalized Boosted Lasso algorithm:

Generalized Boosted Lasso

Step 1 (initialization). Given data Zi = (Yi, Xi), i = 1, ..., n, a fixed small step size ε > 0 and a small tolerance parameter ξ > 0, take an initial forward step

(ĵ, ŝĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; s 1j),    β0 = ŝĵ 1ĵ.

Then calculate the corresponding regularization parameter

λ0 = ( ∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; β0) ) / ( T(β0) − T(0) ).

Set t = 0.

Step 2 (steepest descent on the Lasso loss). Find the steepest coordinate descent direction on the penalized loss

(ĵ, ŝĵ) = arg min_{j, s=±ε} Γ(βt + s 1j; λt).

Update β if this step reduces the Lasso loss; otherwise force β to minimize L and recalculate the regularization parameter. That is, if Γ(βt + ŝĵ 1ĵ; λt) − Γ(βt; λt) < −ξ, then

βt+1 = βt + ŝĵ 1ĵ,    λt+1 = λt.

Otherwise,

(ĵ, ŝĵ) = arg min_{j, |s|=ε} ∑_{i=1}^n L(Zi; βt + s 1j),

βt+1 = βt + ŝĵ 1ĵ,

λt+1 = min[ λt, ( ∑_{i=1}^n L(Zi; βt) − ∑_{i=1}^n L(Zi; βt+1) ) / ( T(βt+1) − T(βt) ) ].

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λt ≤ 0.
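As an illustration of how a generic convex penalty T enters Step 2, here is a hedged sketch (our own code, squared-error loss and a bridge-type penalty chosen for concreteness) of one Generalized BLasso iteration; `generalized_blasso_step` and its interface are illustrative names, not the authors' implementation.

```python
import numpy as np

def generalized_blasso_step(beta, lam, loss, penalty, eps=0.1, xi=1e-10):
    """One pass of Step 2 above: try every +/-eps coordinate move on the
    penalized loss Gamma(beta; lam) = loss(beta) + lam * penalty(beta);
    if no move descends by more than xi, force the move that minimizes the
    loss alone and relax lam as in the lambda update of the algorithm."""
    gamma = lambda b, l: loss(b) + l * penalty(b)
    m = len(beta)
    candidates = []
    for j in range(m):
        for s in (eps, -eps):
            c = beta.copy(); c[j] += s
            candidates.append(c)

    best = min(candidates, key=lambda c: gamma(c, lam))
    if gamma(best, lam) - gamma(beta, lam) < -xi:
        return best, lam                       # descent on the penalized loss

    forced = min(candidates, key=loss)         # forced loss-minimizing step
    denom = penalty(forced) - penalty(beta)
    if denom != 0:
        lam = min(lam, (loss(beta) - loss(forced)) / denom)
    return forced, lam

# Example: a bridge-type penalty with exponent 1.5 on a toy regression problem.
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 3))
y = X @ np.array([1.5, 0.0, -1.0]) + 0.1 * rng.standard_normal(60)
loss = lambda b: np.sum((y - X @ b) ** 2)
penalty = lambda b: np.sum(np.abs(b) ** 1.5)
eps = 0.1

# Step 1 written out directly: forced forward step and lambda_0.
beta = np.zeros(3)
moves = [beta + s * np.eye(3)[j] for j in range(3) for s in (eps, -eps)]
first = min(moves, key=loss)
lam = (loss(beta) - loss(first)) / (penalty(first) - penalty(beta))
beta = first

while lam > 0:
    beta, lam = generalized_blasso_step(beta, lam, loss, penalty, eps=eps)
```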

In the Generalized Boosted Lasso algorithm, explicit "forward" or "backward" steps are no longer seen. However, the mechanics remain the same: minimize the penalized loss function for each λ, and relax the regularization by reducing λ through a "forward" step when the minimum of the penalized loss function for the current λ is reached.

Another method that works in a similar fashion is the Barrier Method proposed by Fiacco and McCormick (1968) (cf. Boyd and Vandenberghe, 2004). It is an advanced optimization method for constrained convex minimization problems which solves a sequence of unconstrained minimization problems that approach the constrained problem. We note that BLasso also solves a sequence of optimization problems indexed by λ. Both methods use the last solution found as the starting point for the next unconstrained minimization problem. However, there exists a fundamental difference: the Barrier Method aims to get to the end of the path (λ = 0), so it jumps quickly from solution to solution according to a geometric schedule. In comparison, BLasso is designed to produce a well-sampled solution path and therefore places solution points evenly on the path (more discussion is offered when we compare BLasso with the algorithm in Rosset (2004) in the discussion section). This subtle connection between BLasso and the extensively studied Barrier Method provides opportunities to explore further convergence properties of BLasso and to study the interplay between optimization and statistical estimation theories.

8. Experiments

In this section, three experiments are carried out with BLasso. The first experiment runs BLasso on the diabetes dataset (cf. Efron et al. 2004) with an added artificial covariate under the classical Lasso setting, i.e. L2 regression with an L1 penalty. This added covariate is strongly correlated with a couple of the original covariates. In this case, BLasso is seen to produce a path almost exactly the same as the Lasso path, while FSF's path departs drastically from Lasso's due to the added strongly correlated covariate.

In the second experiment, we use the diabetes dataset for L2 regression without the added covariate but with several different penalties, so the Generalized BLasso is used. The penalties are bridge penalties (cf. Fu 1998) with bridge parameters γ = 1, γ = 1.1, γ = 2, γ = 4 and γ = ∞. Generalized BLasso produced the solution paths successfully for all cases.

The last experiment is on classification, where we use simulated data to illustrate how BLasso solves a regularized classification problem under the 1-norm SVM setting. Here, since the loss function is not continuously differentiable, it is theoretically possible that BLasso does not converge to the true regularization path. However, as we illustrate, BLasso runs without a problem and produces reasonable solutions.

8.1. L2 Regression with L1 Penalty (Classical Lasso)
The dataset used in this experiment is from a diabetes study in which 442 diabetes patients were measured on 10 baseline variables. A prediction model was desired for the response variable, a quantitative measure of disease progression one year after baseline. One additional variable, X11 = −X7 + X8 + 5X9 + e, where e is i.i.d. Gaussian noise (mean zero and Var(e) = 1/442), is added to make the difference between FSF and Lasso solutions more visible. This added covariate is strongly correlated with X9, with correlations as follows (in the order of X1, X2, ..., X10):

(0.25, 0.24, 0.47, 0.39, 0.48, 0.38, −0.58, 0.76, 0.94, 0.47).

The classical Lasso, i.e. L2 regression with the L1 penalty, is used for this purpose. Let X1, X2, ..., Xm be n-vectors representing the covariates and Y the vector of responses for the n cases; m = 11 and n = 442 in this study. Location and scale transformations are made so that all covariates are standardized to have mean 0 and unit length, and the response has mean zero.

The penalized loss function has the form:

Γ(β; λ) = ∑_{i=1}^n (Yi − Xiβ)² + λ ‖β‖1.   (20)

Figure 1. Estimates of regression coefficients βj, j = 1, 2, ..., 10, 11, for the diabetes data. Left panel: Lasso solutions (produced using a simplex search method on the penalized empirical loss function for each λ) as a function of t = ‖β‖1. Middle panel: BLasso solutions, which are indistinguishable from the Lasso solutions. Right panel: FSF solutions, which are different from Lasso and BLasso.

The middle panel of Figure 1 shows the coefficient plot for BLasso applied to the modified diabetes data. The left (Lasso) and middle (BLasso) panels are indistinguishable from each other. Both FSF and BLasso pick up the added artificial and strongly correlated X11 (the solid line) in the earlier stages, but due to its greedy nature FSF is not able to remove X11 in the later stages; thus every parameter estimate is affected, leading to solutions significantly different from Lasso's.

The BLasso solutions were built up in 8700 steps (with the step size ε = 0.5 chosen small so that the coefficient paths are smooth), 840 of which were backward steps. In comparison, FSF took 7300 pure forward steps. BLasso's backward steps mainly concentrate around the stages where FSF and BLasso tend to differ.


8.2. L2 Regression with Lγ Penalties (the Bridge Regression)
To demonstrate the Generalized BLasso algorithm for different penalties, we use the Bridge Regression setting with the diabetes dataset (without the covariate added in the first experiment). The Bridge Regression method was first proposed by Frank and Friedman (1993) and later more carefully discussed and implemented by Fu (1998). It is a generalization of ridge regression (L2 penalty) and Lasso (L1 penalty). It considers a linear (L2) regression problem with an Lγ penalty for γ ≥ 1 (to maintain the convexity of the penalty function).

The penalized loss function has the form:

Γ(β; λ) = ∑_{i=1}^n (Yi − Xiβ)² + λ ‖β‖γ,   (21)

where γ is the bridge parameter. The data used in this experiment are centered and rescaled as in the first experiment.
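Under the Generalized BLasso sketch given in Section 7, the only experiment-specific ingredient here is the penalty. The following illustrative helper (our own code) assumes, as the contour plots in Figure 2 suggest, that the Lγ penalty is ∑_j |βj|^γ for finite γ and max_j |βj| for γ = ∞:

```python
import numpy as np

def bridge_penalty(gamma):
    """Bridge penalty family: T(beta) = sum_j |beta_j|^gamma for finite
    gamma >= 1 (gamma = 1 gives the Lasso penalty, gamma = 2 ridge);
    for gamma = infinity use the max penalty max_j |beta_j|."""
    if np.isinf(gamma):
        return lambda beta: float(np.max(np.abs(beta)))
    return lambda beta: float(np.sum(np.abs(beta) ** gamma))

# One penalty per panel of Figure 2; each can be passed as the `penalty`
# argument of the generalized_blasso_step sketch given in Section 7.
penalties = {g: bridge_penalty(g) for g in (1, 1.1, 2, 4, np.inf)}
```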

Figure 2. Upper panel: solution paths produced by BLasso for different bridge parameters, from left to right: Lasso (γ = 1), near-Lasso (γ = 1.1), Ridge (γ = 2), over-Ridge (γ = 4), max (γ = ∞). The Y-axis is the parameter estimate and has the range [−800, 800]. The X-axis for each of the left 4 plots is ∑_i |βi|; the one for the 5th plot is max(|βi|) because ∑_i |βi| is unsuitable. Lower panel: the corresponding penalty equal contours for |β1|^γ + |β2|^γ = 1.

Generalized BLasso successfully produced the paths for all 5 cases, which are verified by pointwise minimization using the simplex method (γ = 1, γ = 1.1, γ = 4 and γ = ∞) or the closed-form solution (γ = 2). It is interesting to notice the phase transition from the near-Lasso to the Lasso, as the solution paths are similar but only the Lasso has sparsity. Also, as γ grows larger, the estimates for different βj tend to have more similar sizes, and in the extreme γ = ∞ there is a "branching" phenomenon: the estimates stay together in the beginning and branch out in different directions as the path progresses. Since no differentiation is carried out or assumed, Generalized BLasso produced all of these significantly different solution paths by simply plugging different penalty functions into the algorithm.

8.3. Classification with 1-norm SVM (Hinge Loss)
To demonstrate the Generalized BLasso algorithm for a nondifferentiable loss function, we now look at binary classification. As in Zhu et al. (2003), we generate 50 training data points in each of two classes. The first class has two standard normal independent inputs X1 and X2 and class label Y = −1. The second class also has two standard normal independent inputs, but conditioned on 4.5 ≤ (X1)² + (X2)² ≤ 8, and has class label Y = 1. We wish to find a classification rule from the training data so that, when given a new input, we can assign a class Y from {1, −1} to it.

The 1-norm SVM (Zhu et al. 2003) is used to estimate β:

(β̂0, β̂) = arg min_{β0, β} ∑_{i=1}^n ( 1 − Yi(β0 + ∑_{j=1}^m βj hj(Xi)) )_+ + λ ∑_{j=1}^5 |βj|,   (22)

where hj ∈ D are basis functions and λ is the regularization parameter. The dictionary of basis functions is D = {√2 X1, √2 X2, √2 X1X2, (X1)², (X2)²}. Notice that β0 is left unregularized, so the penalty function is not the full L1 penalty. The fitted model is

f̂(x) = β̂0 + ∑_{j=1}^m β̂j hj(x),

and the classification rule is given by sign(f̂(x)).
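To spell out how this setup plugs into the Generalized BLasso sketch of Section 7, here is an illustrative construction (our own code) of the basis dictionary, the hinge loss of (22) and the penalty that leaves β0 unregularized; the data generation follows the description above.

```python
import numpy as np

def basis(X):
    """Evaluate the dictionary D at inputs X (n x 2) and prepend an
    intercept column: [1, sqrt(2)X1, sqrt(2)X2, sqrt(2)X1X2, X1^2, X2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)),
                            np.sqrt(2) * x1, np.sqrt(2) * x2,
                            np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def hinge_loss(coef, H, y):
    """Empirical hinge loss of (22); coef[0] is beta_0, coef[1:] are beta_j."""
    margins = y * (H @ coef)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def penalty_no_intercept(coef):
    """Penalty of (22): sum of |beta_j| over the five basis coefficients,
    leaving beta_0 unregularized."""
    return np.sum(np.abs(coef[1:]))

# Simulated data as described above.
rng = np.random.default_rng(3)
X_neg = rng.standard_normal((50, 2))                 # class y = -1
X_pos = []
while len(X_pos) < 50:                               # class y = +1 on the ring
    x = rng.standard_normal(2)
    if 4.5 <= x @ x <= 8.0:
        X_pos.append(x)
X = np.vstack([X_neg, np.array(X_pos)])
y = np.concatenate([-np.ones(50), np.ones(50)])

H = basis(X)
loss = lambda coef: hinge_loss(coef, H, y)           # plug into Generalized BLasso
```

With `loss` and `penalty_no_intercept` as the two ingredients, the `generalized_blasso_step` sketch from Section 7 can, in principle, trace a path over the six coefficients analogous to the left panel of Figure 3.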

Figure 3. Estimates of the 1-norm SVM coefficients β̂j, j = 1, 2, ..., 5, for the simulated two-class classification data. Left panel: BLasso solutions as a function of t = ∑_{j=1}^5 |β̂j|. Right panel: scatter plot of the data points with labels: '+' for y = −1; 'o' for y = 1.


Since the loss function is not differentiable and the penalty function is not the original Lasso penalty, we do not have a theoretical guarantee that BLasso works. Nonetheless, the solution path produced by Generalized BLasso has the same sparsity and piecewise linearity as the 1-norm SVM solutions shown in Zhu et al. (2003). It takes Generalized BLasso 490 iterations to generate the solutions. The covariates enter the regression equation sequentially as t increases, in the following order: the two quadratic terms first, followed by the interaction term and then the two linear terms. Like the 1-norm SVM in Zhu et al. (2003), BLasso correctly picks up the quadratic terms early. The interaction term and the linear terms, which are not in the true model, come up much later. In other words, the BLasso results are in good agreement with Zhu et al.'s 1-norm SVM results, and we regard this as a confirmation of BLasso's effectiveness in this nondifferentiable example.

9. Discussion and Concluding Remarks

As seen from the experiments, BLasso is effective for solving the Lasso problem and general convex penalized loss minimization problems. One practical issue left undiscussed is the choice of the step size ε. In general, BLasso takes O(1/ε) steps to produce the whole path. Depending on the actual loss function, base learners and minimization method used in each step, the actual computational complexity varies. Although choosing a smaller step size gives a smoother solution path and more accurate estimates, we observe that the actual coefficient estimates are fairly accurate even for relatively large step sizes.

Figure 4. Estimates of the regression coefficient β3 for the diabetes data. Solutions are plotted as functions of λ. Dotted line: estimates using step size ε = 0.05. Solid line: estimates using step size ε = 10. Dash-dot line: estimates using step size ε = 50.

For the diabetes data, using a small step size ε = 0.05, the solution path cannot be distinguished from the exact regularization path. However, even when the step size is as large as ε = 10 or ε = 50, the solutions are still good approximations.

BLasso has only one step size parameter. This parameter controls both how closely BLasso approximates the minimizing coefficients for each λ and how close two adjacent λ's on the regularization path are placed. As can be seen from Figure 4, a smaller step size leads to a closer approximation to the solutions and also a finer grid for λ. We argue that, if λ is sampled on a coarse grid, we should not spend computational power on finding a much more accurate approximation of the coefficients for each λ; instead, the available computational power should be balanced between these two coupled tasks. BLasso's one-parameter setup automatically balances these two aspects of the approximation, which is graphically expressed by the staircase shape of the solution paths.

Another algorithm similar to Generalized BLasso was developed independently by Rosset (2004). There, starting from λ = 0, a solution is generated by taking a small Newton-Raphson step for each λ, and then λ is increased by a fixed amount. The algorithm assumes twice-differentiability of both the loss function and the penalty function and involves calculation of the Hessian matrix. It works in a very similar fashion to the Barrier Method with a linear update schedule for λ (the Barrier Method uses a geometric schedule); the Barrier Method usually uses a second order optimization method for each λ as well. In comparison, BLasso uses only the differences of the loss function and involves only basic operations. BLasso's step size is defined in the original parameter space, which makes the solutions evenly spread in β's space rather than in λ. In general, since λ is approximately the reciprocal of the size of the penalty, as the fitted model grows larger and λ becomes smaller, changing λ by a fixed amount makes the algorithm in Rosset (2004) move too fast in β's space. On the other hand, when the model is close to empty and the penalty function is very small, λ is very large, but the algorithm still uses the same small steps, so computation is spent generating solutions that are too close to each other.

One of our current research topics is to apply BLasso in an online or time series setting. Since BLasso has both forward and backward steps, we believe that an adaptive online learning algorithm can be devised based on BLasso so that it goes back and forth to track the best regularization parameter and the corresponding model.

We end with a summary of our main contributions:

(a) By combining both forward and backward steps, a Boosted Lasso (BLasso) algorithm is constructed to minimize an L1 penalized convex loss function. While it maintains the simplicity and flexibility of Boosting (or Forward Stagewise Fitting) algorithms, BLasso efficiently produces the Lasso solutions for general loss functions and large classes of base learners. This can be proven rigorously under the assumption that the loss function is convex and continuously differentiable.

(b) The backward steps introduced in this paper are critical for producing the Lasso path. Without them, the FSF algorithm can be too greedy and in general does not produce Lasso solutions, especially when the base learners are strongly correlated, as in cases where the number of base learners is larger than the number of observations.

(c) When the loss function is the squared loss, BLasso takes a simpler and more intuitive form and depends only on the inner products of variables. For problems with a large number of observations and fewer base learners, after initialization BLasso does not depend on the number of observations. For nonparametric learning problems, which previous methods do not cover due to a large or an infinite number of base learners, BLasso remains computationally efficient since the forward steps are boosting steps and the backward steps deal only with base learners that have already been included in the model.

(d) We generalize BLasso to deal with other convex penalties (Generalized BLasso) and show a connection with the Barrier Method for constrained convex optimization.


A. Appendix: Proofs

First, we offer a proof for Lemma 1.

Proof. (Lemma 1)

(a) Suppose there exist λ, j and s with |s| = ε such that Γ(s 1j; λ) ≤ Γ(0; λ). We have

∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; s 1j) ≥ λ T(s 1j) − λ T(0),

therefore

λ ≤ (1/ε) { ∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; s 1j) }

  ≤ (1/ε) { ∑_{i=1}^n L(Zi; 0) − min_{j, |s|=ε} ∑_{i=1}^n L(Zi; s 1j) }

  = (1/ε) { ∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; β0) }

  = λ0.

(b) Since a backward step is only taken when Γ(βt+1; λt) < Γ(βt; λt), we only need to consider forward steps. When a forward step is forced, if Γ(βt+1; λt) > Γ(βt; λt), then

∑_{i=1}^n L(Zi; βt) − ∑_{i=1}^n L(Zi; βt+1) < λt T(βt+1) − λt T(βt),

therefore

λt+1 = (1/ε) { ∑_{i=1}^n L(Zi; βt) − ∑_{i=1}^n L(Zi; βt+1) } < λt,

which contradicts the assumption.

(c) Since λt+1 < λt and λ cannot be relaxed by a backward step, we immediately have ‖βt+1‖1 = ‖βt‖1 + ε. Then from

λt+1 = (1/ε) { ∑_{i=1}^n L(Zi; βt) − ∑_{i=1}^n L(Zi; βt+1) }

we get

Γ(βt; λt+1) = Γ(βt+1; λt+1).

Adding (λt − λt+1) times the corresponding penalty term to both sides, and recalling that T(βt+1) = ‖βt+1‖1 > ‖βt‖1 = T(βt), we get

Γ(βt; λt) < Γ(βt+1; λt) = min_{j, |s|=ε} Γ(βt + s 1j; λt) ≤ Γ(βt ± ε 1j; λt)

for every j. This completes the proof.


Theorem 1 claims that "the BLasso path converges to the Lasso path", by which we mean:

(a) As ε → 0, for every t such that λt+1 < λt, Γ(βt; λt) − min_β Γ(β; λt) = o(1).

(b) For each ε > 0, it takes a finite number of steps to run BLasso.

Proof. (Theorem 1)

(a) Since L(Z; β) is strictly convex in β, Γ(β; λ) is strictly convex in β for every λ, so Γ(β; λ) has a unique minimum β*(λ). For the j with β*_j(λ) ≠ 0, Γ is differentiable in that direction and dΓ = 0; for the j with β*_j(λ) = 0, the right and left partial derivatives of Γ w.r.t. βj are both greater than or equal to 0 and at least one of them is strictly greater than 0. Now, recall that Lemma 1 says BLasso does steepest coordinate descent with fixed step size ε. We want to show it does not get stuck near a nonstationary point as ε → 0.

At βt, we look for the steepest descent on the surface of a polytope defined by β = βt + ∆β where ‖∆β‖1 = ε, i.e.

min_{∆β} ∆Γ(βt + ∆β; λ)   subject to ‖∆β‖1 = ε.   (23)

Here ∆Γ = ∆L + λ ∆T. Since L is continuously differentiable w.r.t. β, we have

∆L = ∑_j (∂L/∂βj) ∆βj + o(ε),   (24)

and since T(β) = ∑_j |βj|, we have

∆T = ∑_j sign+(βt,j, ∆βj) |∆βj|.   (25)

Therefore, as ε → 0, (23) becomes a linear programming problem. This indicates that, near a nonstationary point, if there is a descending point on the polytope then there is a descending direction on the vertices. More precisely, if there exists ∆β such that ∆Γ(βt + ∆β; λ) < 0 and is not o(ε), then there exists a descending coordinate direction j, which contradicts Lemma 1(c). Therefore, for any j, either ∆Γ(βt + ∆βj; λ) = o(ε), or it is not o(ε) but greater than zero in both the positive and negative directions, where the latter corresponds to zero entries in βt. Thus, by the continuous differentiability assumption, βt is in an o(1) neighborhood of β*(λt), which implies Γ(βt; λt) − Γ(β*(λt); λt) = o(1). This concludes the proof of (a).

(b) First, suppose we have λt+1 < λt, λt′+1 < λt′ and t < t′. Immediately, we have λt > λt′, and then

Γ(βt′; λt′) < Γ(βt; λt′) < Γ(βt; λt) < Γ(βt′; λt).

Therefore

Γ(βt′; λt) − Γ(βt′; λt′) > Γ(βt; λt) − Γ(βt; λt′),

from which we get

T(βt′) > T(βt).

So the BLasso solution before each time λ gets relaxed strictly increases in L1 norm. Since the L1 norm can only change on an ε-grid, λ can only be relaxed a finite number of times until BLasso reaches the unregularized solution. Now, for each value of λ, since BLasso is always strictly descending, the BLasso solutions never repeat. By the same ε-grid argument, BLasso can take only a finite number of steps before λ has to be relaxed. Combining the two arguments, we conclude that for each ε > 0 it takes only a finite number of steps to run BLasso.

Acknowledgments. B. Yu gratefully acknowledges partial support from NSF grants FD01-12731 and CCR-0106656, ARO grant DAAD19-01-1-0643, and the Miller Research Professorship in Spring 2004 from the Miller Institute at the University of California at Berkeley. Both authors thank Dr. Chris Holmes and Mr. Guilherme V. Rocha for their very helpful comments and discussions on the paper.

References

[1] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization, Cambridge University Press.

[2] Breiman, L. (1998). "Arcing Classifiers", Ann. Statist. 26, 801-824.

[3] Buhlmann, P. and Yu, B. (2001). "Boosting with the L2 Loss: Regression and Classification", J. Am. Statist. Ass. 98, 324-340.

[4] Donoho, D. (2004). "For Most Large Underdetermined Systems of Linear Equations, the Minimal l1-norm Near-solution Approximates the Sparsest Near-solution", Technical report, Statistics Department, Stanford University.

[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). "Least Angle Regression", Ann. Statist. 32, 407-499.

[6] Fan, J. and Li, R. Z. (2001). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties", J. Am. Statist. Ass. 96, 1348-1360.

[7] Fiacco, A. V. and McCormick, G. P. (1968). Nonlinear Programming: Sequential Unconstrained Minimization Techniques, Society for Industrial and Applied Mathematics, 1990. First published in 1968 by Research Analysis Corporation.

[8] Frank, I. E. and Friedman, J. H. (1993). "A Statistical View of Some Chemometrics Regression Tools", Technometrics 35, 109-148.

[9] Freund, Y. (1995). "Boosting a Weak Learning Algorithm by Majority", Information and Computation 121, 256-285.

[10] Freund, Y. and Schapire, R. E. (1996). "Experiments with a New Boosting Algorithm", Machine Learning: Proc. Thirteenth International Conference, 148-156. Morgan Kauffman, San Francisco.

[11] Friedman, J. H., Hastie, T. and Tibshirani, R. (2000). "Additive Logistic Regression: a Statistical View of Boosting", Ann. Statist. 28, 337-407.

[12] Friedman, J. H. (2001). "Greedy Function Approximation: a Gradient Boosting Machine", Ann. Statist. 29, 1189-1232.

[13] Fu, W. J. (1998). "Penalized Regressions: the Bridge versus the Lasso", J. Comput. Graph. Statist. 7, 397-416.

[14] Knight, K. and Fu, W. (2000). "Asymptotics for Lasso-type Estimators", Ann. Statist. 28, 1356-1378.

[15] Gao, H. Y. and Bruce, A. G. (1997). "WaveShrink With Firm Shrinkage", Statistica Sinica 7, 855-874.

[16] Hansen, M. and Yu, B. (2001). "Model Selection and the Principle of Minimum Description Length", J. Am. Statist. Ass. 96, 746-774.

[17] Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York.

[18] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (1999). "Functional Gradient Techniques for Combining Hypotheses", in Advances in Large Margin Classifiers, MIT Press.

[19] Rosset, S. (2004). "Tracking Curved Regularized Optimization Solution Paths", NIPS 2004.

[20] Schapire, R. E. (1990). "The Strength of Weak Learnability", Machine Learning 5(2), 197-227.

[21] Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso", J. R. Statist. Soc. B 58(1), 267-288.

[22] Zhang, T. and Yu, B. (2003). "Boosting with Early Stopping: Convergence and Consistency", Ann. Statist., to appear.

[23] Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2003). "1-norm Support Vector Machines", Advances in Neural Information Processing Systems 16, MIT Press.