ℓ1 Trend Filtering

Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky

    To appear in SIAM Review

    Abstract

The problem of estimating underlying trends in time series data arises in a variety of disciplines. In this paper we propose a variation on Hodrick-Prescott (H-P) filtering, a widely used method for trend estimation. The proposed ℓ1 trend filtering method substitutes a sum of absolute values (i.e., an ℓ1-norm) for the sum of squares used in H-P filtering to penalize variations in the estimated trend. The ℓ1 trend filtering method produces trend estimates that are piecewise linear, and therefore is well suited to analyzing time series with an underlying piecewise linear trend. The kinks, knots, or changes in slope of the estimated trend can be interpreted as abrupt changes or events in the underlying dynamics of the time series. Using specialized interior-point methods, ℓ1 trend filtering can be carried out with not much more effort than H-P filtering; in particular, the number of arithmetic operations required grows linearly with the number of data points. We describe the method and some of its basic properties, and give some illustrative examples. We show how the method is related to ℓ1 regularization based methods in sparse signal recovery and feature selection, and list some extensions of the basic method.

Key words: detrending, ℓ1 regularization, Hodrick-Prescott filtering, piecewise linear fitting, sparse signal recovery, feature selection, time series analysis, trend estimation.

    1 Introduction

    1.1 Trend filtering

We are given a scalar time series y_t, t = 1, . . . , n, assumed to consist of an underlying slowly varying trend x_t and a more rapidly varying random component z_t. Our goal is to estimate the trend component x_t, or equivalently, estimate the random component z_t = y_t − x_t. This

Information Systems Laboratory, Electrical Engineering Department, Stanford University, Stanford, CA 94305-9510, USA. Email: {sjkim,deneb1,boyd,gorin}@stanford.edu


can be considered as an optimization problem with two competing objectives: we want x_t to be smooth, and we want z_t (our estimate of the random component, sometimes called the residual) to be small. In some contexts, estimating x_t is called smoothing or filtering.

Trend filtering comes up in several applications and settings including macroeconomics (e.g., [53, 86]), geophysics (e.g., [2, 9, 10]), financial time series analysis (e.g., [97]), social sciences (e.g., [66]), revenue management (e.g., [91]), and biological and medical sciences (e.g., [44, 68]). Many trend filtering methods have been proposed, including Hodrick-Prescott (H-P) filtering [53, 64], moving average filtering [75], exponential smoothing [70], bandpass filtering [22, 5], smoothing splines [81], de-trending via rational square-wave filters [79], a jump process approach [106], median filtering [100], a linear programming (LP) approach with fixed kink points [72], and wavelet transform analysis [24]. (All these methods except for the jump process approach, the LP approach, and median filtering are linear filtering methods; see [5] for a survey of linear filtering methods in trend estimation.) The most widely used methods are moving average filtering, exponential smoothing, and H-P filtering, which is especially popular in economics and related disciplines since its application to business cycle theory [53]. The idea behind H-P filtering can be found in several fields, and can be traced back at least to work in 1961 by Leser [64] in statistics.

1.2 ℓ1 trend filtering

In this paper we propose ℓ1 trend filtering, a variation on H-P filtering which substitutes a sum of absolute values (i.e., an ℓ1-norm) for the sum of squares used in H-P filtering to penalize variations in the estimated trend. (The term filtering is used in analogy with H-P filtering. Like H-P filtering, ℓ1 trend filtering is a batch method for estimating the trend component from the whole history of observations.)

We will see that the proposed ℓ1 trend filter method shares many properties with the H-P filter, and has the same (linear) computational complexity. The principal difference is that the ℓ1 trend filter produces trend estimates that are smooth in the sense of being piecewise linear. The ℓ1 trend filter is thus well suited to analyzing time series with an underlying piecewise linear trend. The kinks, knots, or changes in slope of the estimated trend can be interpreted as abrupt changes or events in the underlying dynamics of the time series; the ℓ1 trend filter can be interpreted as detecting or estimating changes in an underlying linear trend. Using specialized interior-point methods, ℓ1 trend filtering can be carried out with not much more effort than H-P filtering; in particular, the number of arithmetic operations required grows linearly with the number of data points.

1.3 Outline

In the next section we set up our notation and give a brief summary of H-P filtering, listing some properties for later comparison with our proposed ℓ1 trend filter. The ℓ1 trend filter is described in §3, and compared to the H-P filter. We give some illustrative examples in §4.

In §5 we give the optimality condition for the underlying optimization problem that defines the ℓ1 trend filter, and use it to derive some of the properties given in §3. We also


derive a Lagrange dual problem that is interesting on its own, and is also used in a primal-dual interior-point method we describe in §6. We list a number of extensions of the basic idea in §7.

2 Hodrick-Prescott filtering

In H-P filtering, the trend estimate x_t is chosen to minimize the weighted sum objective function

(1/2)\sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} (x_{t-1} - 2x_t + x_{t+1})^2,    (1)

where λ ≥ 0 is the regularization parameter used to control the trade-off between smoothness of x_t and the size of the residual y_t − x_t. The first term in the objective function measures the size of the residual; the second term measures the smoothness of the estimated trend. The argument appearing in the second term, x_{t−1} − 2x_t + x_{t+1}, is the second difference of the time series at time t; it is zero when and only when the three points x_{t−1}, x_t, x_{t+1} are on a line. The second term in the objective is zero if and only if x_t is affine, i.e., has the form x_t = α + βt for some constants α and β. (In other words, the graph of x_t is a straight line.) The weighted sum objective (1) is strictly convex and coercive in x, and so has a unique minimizer, which we denote x^{hp}.

We can write the objective (1) as

(1/2)\|y - x\|_2^2 + \lambda\|Dx\|_2^2,

where x = (x_1, . . . , x_n) ∈ R^n, y = (y_1, . . . , y_n) ∈ R^n, ‖u‖_2 = (\sum_i u_i^2)^{1/2} is the Euclidean or ℓ2 norm, and D ∈ R^{(n−2)×n} is the second-order difference matrix

D = \begin{bmatrix}
1 & -2 & 1 & & & & \\
  & 1 & -2 & 1 & & & \\
  &   & \ddots & \ddots & \ddots & & \\
  &   &   & 1 & -2 & 1 & \\
  &   &   &   & 1 & -2 & 1
\end{bmatrix}.    (2)

(D is Toeplitz with first row [1 −2 1 0 · · · 0]; entries not shown above are zero.) The H-P trend estimate is

x^{hp} = (I + 2\lambda D^T D)^{-1} y.    (3)

H-P filtering is supported in several standard software packages for statistical data analysis, e.g., SAS, R, and Stata.

We list some basic properties of H-P filtering that we refer to later when we compare it to our proposed trend estimation method.

Linear computational complexity. The H-P estimated trend x^{hp} in (3) can be computed in O(n) arithmetic operations, since the coefficient matrix I + 2λD^T D is banded (with bandwidth five).


Linearity. From (3) we see that the H-P estimated trend x^{hp} is a linear function of the time series data y.

Convergence to original data as λ → 0. The relative fitting error satisfies the inequality

\frac{\|y - x^{hp}\|_2}{\|y\|_2} \le \frac{32\lambda}{1 + 32\lambda}.    (4)

This shows that as the regularization parameter λ decreases to zero, x^{hp} converges to the original time series data y.

Convergence to best affine fit as λ → ∞. As λ → ∞, the H-P estimated trend converges to the best affine (straight-line) fit to the time series data,

x^{ba}_t = \alpha^{ba} + \beta^{ba} t,

with intercept and slope

\alpha^{ba} = \frac{\sum_{t=1}^{n} t^2 \sum_{t=1}^{n} y_t - \sum_{t=1}^{n} t \sum_{t=1}^{n} t y_t}{n \sum_{t=1}^{n} t^2 - \left(\sum_{t=1}^{n} t\right)^2},
\qquad
\beta^{ba} = \frac{n \sum_{t=1}^{n} t y_t - \sum_{t=1}^{n} t \sum_{t=1}^{n} y_t}{n \sum_{t=1}^{n} t^2 - \left(\sum_{t=1}^{n} t\right)^2}.

Commutability with affine adjustment. We can change the order of H-P filtering and affine adjustment of the original time series data without affecting the result: for any α and β, the H-P trend estimate of the time series data ỹ_t = y_t − α − βt is x̃^{hp}_t = x^{hp}_t − α − βt. (A special case of affine adjustment is linear detrending, with α = α^{ba}, β = β^{ba}, which corresponds to subtracting the best affine fit from the original data.)

Regularization path. The H-P trend estimate x^{hp} is a smooth function of the regularization parameter λ, as it varies over [0, ∞). As λ decreases to zero, x^{hp} converges to the original data y; as λ increases, x^{hp} becomes smoother, and converges to x^{ba}, the best affine fit to the time series data.

We can derive the relative fitting error inequality (4) as follows. From the optimality condition y − x^{hp} = 2λD^T D x^{hp} we obtain

y - x^{hp} = 2\lambda D^T D (I + 2\lambda D^T D)^{-1} y.

The spectral norm of D is no more than 4:

\|Dx\|_2 = \|x_{1:n-2} - 2x_{2:n-1} + x_{3:n}\|_2 \le \|x_{1:n-2}\|_2 + 2\|x_{2:n-1}\|_2 + \|x_{3:n}\|_2 \le 4\|x\|_2,

where x_{i:j} = (x_i, . . . , x_j). The eigenvalues of D^T D lie between 0 and 16, so the eigenvalues of 2λD^T D(I + 2λD^T D)^{-1} lie between 0 and 32λ/(1 + 32λ). It follows that

\|y - x^{hp}\|_2 \le (32\lambda/(1 + 32\lambda))\|y\|_2.
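To make the linear-complexity claim concrete, the following Python sketch (our illustration, not the authors' code; it assumes NumPy and SciPy are installed, and all function names are ours) forms D sparsely and computes x^{hp} from (3); the coefficient matrix I + 2λD^T D is pentadiagonal, which is what makes the solve O(n).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def second_diff_matrix(n):
    """Sparse (n-2) x n second-order difference matrix D from (2)."""
    e = np.ones(n - 2)
    return sp.diags([e, -2 * e, e], offsets=[0, 1, 2], shape=(n - 2, n), format="csc")

def hp_filter(y, lam):
    """H-P trend estimate x^hp = (I + 2*lam*D^T D)^{-1} y, cf. (3).

    The coefficient matrix is banded (pentadiagonal), so a sparse
    factorization solves the system with O(n) effort.
    """
    n = len(y)
    D = second_diff_matrix(n)
    A = sp.csc_matrix(sp.identity(n) + 2.0 * lam * (D.T @ D))
    return spla.spsolve(A, y)

# Toy usage: a noisy ramp plus sinusoid.
rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.1 * t + 5.0 * np.sin(t / 20.0) + rng.normal(0.0, 1.0, size=t.size)
x_hp = hp_filter(y, lam=400.0)
print(np.linalg.norm(y - x_hp) / np.linalg.norm(y))  # relative fitting error, cf. the bound (4)
```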


3 ℓ1 trend filtering

We propose the following variation on H-P filtering, which we call ℓ1 trend filtering. We choose the trend estimate as the minimizer of the weighted sum objective function

(1/2)\sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} |x_{t-1} - 2x_t + x_{t+1}|,    (5)

which can be written in matrix form as

(1/2)\|y - x\|_2^2 + \lambda\|Dx\|_1,

where ‖u‖_1 = \sum_i |u_i| denotes the ℓ1 norm of the vector u. As in H-P filtering, λ is a nonnegative parameter used to control the trade-off between smoothness of x and size of the residual. The weighted sum objective (5) is strictly convex and coercive in x, and so has a unique minimizer, which we denote x^{lt}. (The superscript lt stands for 'ℓ1 trend'.)
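As a concrete reference implementation (ours, not the specialized solver of §6), the ℓ1 trend filtering problem can be handed directly to a generic convex solver. The sketch below assumes the CVXPY and NumPy packages; for large n the method of §6 is far faster, but for moderate n this transcription of the objective above is a convenient way to experiment.

```python
import numpy as np
import cvxpy as cp

def second_diff_matrix(n):
    """Dense (n-2) x n second-order difference matrix D from (2)."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def l1_trend_filter(y, lam):
    """Solve (1/2)||y - x||_2^2 + lam*||D x||_1 with a generic solver."""
    n = len(y)
    D = second_diff_matrix(n)
    x = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value

# The second differences of the estimate are (approximately) sparse;
# entries of |D @ x| above a small tolerance mark the kink points of
# the piecewise-linear trend discussed in Section 3.1.
```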

We list some basic properties of ℓ1 trend filtering, pointing out similarities and differences with H-P filtering.

Linear computational complexity. There is no analytic formula or expression for x^{lt}, analogous to (3). But like x^{hp}, x^{lt} can be computed numerically in O(n) arithmetic operations. (We describe an efficient method for computing x^{lt} in §6. Its worst-case complexity is O(n^{1.5}), but in practice its computational effort is linear in n.)

Nonlinearity. The ℓ1 trend estimate x^{lt} is not a linear function of the original data y. (In contrast, x^{hp} is a linear function of y.)

Convergence to original data as λ → 0. The maximum fitting error satisfies the bound

\|y - x^{lt}\|_\infty \le 4\lambda,    (6)

where ‖u‖_∞ = max_i |u_i| denotes the ℓ∞ norm of the vector u. (Cf. the analogous bound for H-P trend estimation, given in (4).) This implies that x^{lt} → y as λ → 0.

Finite convergence to best affine fit as λ → ∞. As in H-P filtering, x^{lt} → x^{ba} as λ → ∞. For ℓ1 trend estimation, however, the convergence occurs for a finite value of λ,

\lambda_{max} = \|(DD^T)^{-1} D y\|_\infty.    (7)

For λ ≥ λ_{max}, we have x^{lt} = x^{ba}. (In contrast, x^{hp} → x^{ba} only in the limit as λ → ∞.) This maximum value λ_{max} is readily computed with O(n) arithmetic steps. (The derivation is given in §5.1.)

Commutability with affine adjustment. As in H-P filtering, we can swap the order of affine adjustment and trend filtering, without affecting the result.


Piecewise-linear regularization path. The ℓ1 trend estimate x^{lt} is a piecewise-linear function of the regularization parameter λ, as it varies over [0, ∞): there are values λ_1, . . . , λ_k, with 0 = λ_k < · · · < λ_1 = λ_{max}, for which

x^{lt} = \frac{\lambda_i - \lambda}{\lambda_i - \lambda_{i+1}}\, x^{(i+1)} + \frac{\lambda - \lambda_{i+1}}{\lambda_i - \lambda_{i+1}}\, x^{(i)}, \qquad \lambda_{i+1} \le \lambda \le \lambda_i, \quad i = 1, \ldots, k-1,

where x^{(i)} is x^{lt} with λ = λ_i. (So x^{(1)} = x^{ba}, x^{(k)} = y.)

Linear extension property. Let x̃^{lt} denote the ℓ1 trend estimate for the extended data (y_1, . . . , y_n, y_{n+1}). There is an interval [l, u], with l < u, for which

x̃^{lt} = (x^{lt}, 2x^{lt}_n - x^{lt}_{n-1}),

provided y_{n+1} ∈ [l, u]. In other words, if the new observation is inside this interval, the ℓ1 trend estimate linearly extends the last affine segment.

    3.1 Piecewise linearity

The basic reason the ℓ1 trend estimate x^{lt} might be preferred over the H-P trend estimate x^{hp} is that it is piecewise linear in t: there are (integer) times 1 = t_1 < t_2 < · · · < t_{p−1} < t_p = n for which

x^{lt}_t = \alpha_k + \beta_k t, \qquad t_k \le t \le t_{k+1}, \quad k = 1, \ldots, p-1.    (8)

In other words, over each (integer) interval [t_k, t_{k+1}], x^{lt} is an affine function of t. We can interpret α_k and β_k as the local intercept and slope in the kth interval. These local trend parameters are not independent: they must give consistent values for x^{lt} at the join or kink points, i.e.,

\alpha_k + \beta_k t_{k+1} = \alpha_{k+1} + \beta_{k+1} t_{k+1}, \qquad k = 1, \ldots, p-2.

The points t_2, . . . , t_{p−1} are called kink points. We say that x^{lt} is piecewise linear with p − 2 kink points. (The kink point t_k can be eliminated if β_k = β_{k−1}, so we generally assume that β_k ≠ β_{k−1}.)

In one extreme case, we have p = 2, which corresponds to no kink points. In this case t_1 = 1, t_2 = n, and x^{lt} = x^{ba} is affine. In the other extreme case, there is a kink at every time point: we have t_i = i, i = 1, . . . , p = n. In this case the piecewise linear form (8) is vacuous; it imposes no constraints on x^{lt}. This corresponds to λ = 0, and x^{lt} = y.

The kink points correspond to changes in slope of the estimated trend, and can be interpreted as abrupt changes or events in the underlying dynamics of the time series. The number of kinks in x^{lt} typically decreases as the regularization parameter λ increases, but counterexamples show this need not happen.

Piecewise linearity of the trend estimate is not surprising: it is well known that when an ℓ1 norm term is added to an objective to be minimized, or constrained, the solution typically has the argument of the ℓ1 norm term sparse (i.e., with many zero elements). In this context, we would predict that Dx^{lt} (the second-order difference of the estimated trend) will have many zero elements, which means that the estimated trend is piecewise linear.


The general idea of ℓ1 regularization for the purposes of sparse signal recovery or feature selection has been used in geophysics since the early 1970s; see, e.g., [23, 67, 92]. In signal processing, the idea of ℓ1 regularization comes up in several contexts including basis pursuit (denoising) [20, 21], image decomposition [32, 88], signal recovery from incomplete measurements [17, 16, 27, 28, 96], sensor selection [56], fault identification [108], and wavelet thresholding [29]. In statistics, the idea of ℓ1 regularization is used in the well-known Lasso algorithm [93] for ℓ1-regularized linear regression, its extensions such as the fused Lasso [94], the elastic net [107], the group Lasso [105], and the monotone Lasso [50], and ℓ1-regularized logistic regression [61, 62, 77]. The idea of ℓ1 regularization has been used in other contexts including portfolio optimization [69], control design [49], computer-aided design of integrated circuits [14], decoding of linear codes [18], and machine learning (sparse principal component analysis [26] and graphical model selection [3, 25, 98]).

We note that ℓ1 trend filtering is related to segmented regression, a statistical regression method in which the variables are segmented into groups and regression analysis is performed on each segment. Segmented regression arises in a variety of contexts including abrupt change detection and time series segmentation (especially as a preprocessing step for mining time series databases); the reader is referred to a survey [1] and references therein. There are two types of time series segmentation. One does not require the fits for two consecutive segments to have consistent values at their join point; see, e.g., [71, 63, 87, 80, 102]. The other requires the fits for two consecutive segments to be consistent at their join point, which is often called joinpoint regression; see, e.g., [33, 36, 37, 58, 89, 104]. We can think of ℓ1 trend filtering as producing a segmented linear regression, with an affine fit on each segment, with consistent values at the join points. In ℓ1 trend filtering, the segmentation and the affine fit on each segment are found by solving one optimization problem.

In time series segmentation, we can use the principle of dynamic programming (DP) to find the best fit that minimizes the fitting error among all functions that consist of k affine segments, with or without the requirement of consistency at the join points. In an early paper [6, 7], Bellman showed how DP can be used for segmented linear regression without the requirement of consistency at the join points. The DP argument can find the best fit in O(n^2 k) arithmetic operations [45]. The DP algorithm with the consistency requirement at the join points is, however, far more involved than in the case when it is absent. As a heuristic, ℓ1 trend filtering produces a segmented linear regression in O(n) arithmetic operations. Another heuristic based on grid search is described in [58], and an implementation, called the Joinpoint Regression Program, is available from http://srab.cancer.gov/joinpoint/.

3.2 ℓ1 trend filtering and sparse approximation

To see the connection more clearly between ℓ1 trend filtering and ℓ1-regularization-based sparse approximation, we note that the ℓ1 trend filtering problem is equivalent to the ℓ1-regularized least squares problem

minimize \quad (1/2)\|A\theta - y\|_2^2 + \lambda \sum_{i=3}^{n} |\theta_i|,    (9)


where θ = (θ_1, . . . , θ_n) ∈ R^n is the variable and A is the lower triangular matrix

A = \begin{bmatrix}
1 & & & & & \\
1 & 1 & & & & \\
1 & 2 & 1 & & & \\
1 & 3 & 2 & 1 & & \\
\vdots & \vdots & \vdots & & \ddots & \\
1 & n-1 & n-2 & \cdots & 2 & 1
\end{bmatrix} \in R^{n \times n}.

The solution θ^{lt} to this problem and the ℓ1 trend estimate are related by

x^{lt} = A\theta^{lt}.    (10)

We can give a simple interpretation of the coefficients: θ^{lt}_1 is the offset (θ^{lt}_1 = x^{lt}_1), θ^{lt}_2 is the (right) first-order difference at x^{lt}_1 (θ^{lt}_2 = x^{lt}_2 − x^{lt}_1), and for t ≥ 3, θ^{lt}_t is the second-order difference of x^{lt} at t − 1 (θ^{lt}_t = (Dx^{lt})_{t−2}). This interpretation shows again the equivalence between the ℓ1 trend filtering problem and the ℓ1-regularized least squares problem (9). (This interpretation also shows that ℓ1 trend filtering is a special type of basis pursuit denoising [21] and is related to multivariate adaptive regression splines (MARS) [38, 51, 52] that use truncated linear functions as basis functions.)
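The change of variables can be checked numerically. The short script below (our addition, assuming NumPy) builds A as displayed above and the matrix D from (2), and verifies that DA = [0 0 I], i.e., that θ_t for t ≥ 3 is exactly the second difference of x = Aθ, which is the content of the equivalence between (5) and (9).

```python
import numpy as np

n = 8

# Lower-triangular A from Section 3.2: row t is [1, t-1, t-2, ..., 2, 1, 0, ..., 0].
A = np.zeros((n, n))
for t in range(1, n + 1):
    A[t - 1, 0] = 1.0
    for j in range(2, t + 1):
        A[t - 1, j - 1] = t - j + 1

# Second-order difference matrix D from (2).
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

# With x = A @ theta we have (D x)_{t-2} = theta_t for t >= 3,
# so D @ A must equal [0 0 | I]; this is the substitution behind (9)-(10).
DA = D @ A
assert np.allclose(DA[:, :2], 0.0)
assert np.allclose(DA[:, 2:], np.eye(n - 2))
```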

From a standard result in ℓ1-regularized least squares [31, 83], the solution θ^{lt} to (9) is a piecewise-linear function of the regularization parameter λ, as it varies over [0, ∞). From (10), we can see that the regularization path of ℓ1 trend filtering is piecewise linear.

    4 Illustrative examples

Our first example uses synthetic data, generated as

y_t = x_t + z_t, \quad t = 1, \ldots, n, \qquad x_{t+1} = x_t + v_t, \quad t = 1, \ldots, n-1,    (11)

with initial condition x_1 = 0. Here x_t is the true underlying trend, z_t is the irregular component or noise, and v_t is the trend slope. The noises z_t are IID N(0, σ²). The trend slopes v_t are chosen from a simple Markov process (independent of z). With probability p, we have v_{t+1} = v_t, i.e., no slope change in the underlying trend. (Thus, the mean time between slope changes is 1/(1 − p).) With probability 1 − p, we choose v_{t+1} from a uniform distribution on [−b, b]. We choose the initial slope v_1 from a uniform distribution on [−b, b]. The change in x_t between two successive changes in slope is given by the product of two independent random variables: the time between changes (which is geometrically distributed with mean 1/(1 − p)), and the slope (which is uniform over [−b, b]). It has zero mean and variance (1 + p)(1 − p)^{−2} b²/3. The standard deviation of the change in x_t between successive changes in slope is thus \sqrt{(1 + p)/3}\,(b/(1 − p)).

For our example, we use the parameter values

n = 1000, \quad p = 0.99, \quad \sigma = 20, \quad b = 0.5.


Thus, the mean time between slope changes is 100, and the standard deviation of the change in x_t between slope changes is 40.7. The particular sample we generated had 8 changes in slope.
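For readers who want to reproduce an experiment of this kind, here is one way (ours; the authors' exact code and random seed are not given, so a sample generated this way will differ in detail from Figure 1) to draw data from the model (11) with the parameters above, assuming NumPy.

```python
import numpy as np

def generate_synthetic(n=1000, p=0.99, sigma=20.0, b=0.5, seed=0):
    """Draw one sample path (y, x) from the model (11)."""
    rng = np.random.default_rng(seed)
    v = np.empty(n - 1)                        # trend slopes v_t
    v[0] = rng.uniform(-b, b)                  # initial slope, uniform on [-b, b]
    for t in range(1, n - 1):
        # With probability p keep the slope; otherwise redraw it uniformly on [-b, b].
        v[t] = v[t - 1] if rng.random() < p else rng.uniform(-b, b)
    x = np.concatenate(([0.0], np.cumsum(v)))  # x_1 = 0, x_{t+1} = x_t + v_t
    z = rng.normal(0.0, sigma, size=n)         # irregular component, IID N(0, sigma^2)
    return x + z, x

y, x_true = generate_synthetic()
```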

The ℓ1 trend estimates were computed using two solvers: cvx [43], a Matlab-based modeling system for convex optimization (which calls SDPT3 [95] or SeDuMi [90], a Matlab-based solver for convex problems), and a C implementation of the specialized primal-dual interior-point method described in §6. The run times on a 3GHz Pentium IV were around a few seconds and 0.01 seconds, respectively.

The results are shown in figure 1. The top left plot shows the true trend x_t, and the top right plot shows the noise-corrupted time series y_t. In the middle left plot, we show x^{lt} for λ = 35000, which results in 4 kink points in the estimated trend. The middle right plot shows the H-P trend estimate with λ adjusted to give the same fitting error as x^{lt}, i.e., ‖y − x^{lt}‖_2 = ‖y − x^{hp}‖_2. Even though x^{lt} is not a particularly good estimate of x_t, it has identified some of the slope change points fairly well. The bottom left plot shows x^{lt} for λ = 5000, which yields 7 kink points in x^{lt}. The bottom right plot shows x^{hp}, with the same fitting error. In this case the estimate of the underlying trend is quite good. Note that the trend estimation error for x^{lt} is better than for x^{hp}, especially around the kink points.

Our next example uses real data, 2000 consecutive daily closing values of the S&P 500 Index, from March 25, 1999 to March 9, 2007, after logarithmic transform. The data are shown in the top plot of figure 2. In the middle plot, we show x^{lt} for λ = 100, which results in 8 kink points in the estimated trend. The bottom plot shows the H-P trend estimate with the same fitting error.

In this example (in contrast to the previous one) we cannot say that the ℓ1 trend estimate is better than the H-P trend estimate. Each of the two trend estimates is a smoothed version of the original data; by construction, they have the same ℓ2 fitting error. If for some reason you believe that the (log of the) S&P 500 Index is driven by an underlying trend that is piecewise-linear, you might prefer the ℓ1 trend estimate over the H-P trend estimate.


[Figure 1: six panels (plots not reproduced in this transcript); all panels share horizontal axis t = 0, . . . , 1000 and vertical range −120 to 40.]

Figure 1: Trend estimation on synthetic data. Top left. The true trend x_t. Top right. Observed time series data y_t. Middle left. ℓ1 trend estimate x^{lt} with four total kinks (λ = 35000). Middle right. H-P trend estimate x^{hp} with same fitting error. Bottom left. x^{lt} with seven total kinks (λ = 5000). Bottom right. H-P trend estimate x^{hp} with same fitting error.


[Figure 2: three panels (plots not reproduced in this transcript); each panel plots log-price over 1/2000 through 1/2006, vertical range 6.7 to 7.3.]

Figure 2: Trend estimation results for the S&P 500 Index for the period of March 25, 1999 to March 9, 2007. Top. Original data. Middle. ℓ1 trend estimate x^{lt} for λ = 100. Bottom. H-P trend estimate x^{hp} with same fitting error.


    5 Optimality condition and dual problem

    5.1 Optimality condition

The objective function (5) of the ℓ1 trend filtering problem is convex but not differentiable, so we use a first-order optimality condition based on subdifferential calculus. We obtain the following necessary and sufficient condition for x to minimize (5): there exists ν ∈ R^{n−2} such that

y - x = D^T\nu, \qquad
\nu_t \in
\begin{cases}
\{+\lambda\}, & (Dx)_t > 0, \\
\{-\lambda\}, & (Dx)_t < 0, \\
[-\lambda, \lambda], & (Dx)_t = 0,
\end{cases}
\qquad t = 1, \ldots, n-2.    (12)

(Here, we use the chain rule for subdifferentials: if f is convex, then the subdifferential of h(x) = f(Ax + b) is given by ∂h(x) = A^T ∂f(Ax + b). See, e.g., [8, Prop. B.24] or [11, Chap. 2] for more on subdifferential calculus.) Since DD^T is invertible, the optimality condition (12) can be written as

\left((DD^T)^{-1} D(y - x)\right)_t \in
\begin{cases}
\{+\lambda\}, & (Dx)_t > 0, \\
\{-\lambda\}, & (Dx)_t < 0, \\
[-\lambda, \lambda], & (Dx)_t = 0,
\end{cases}
\qquad t = 1, \ldots, n-2.

The maximum fitting error bound in (6) follows from the optimality condition above. For any ν ∈ R^{n−2} with ν_t ∈ [−λ, λ],

-4\lambda \le (D^T\nu)_t \le 4\lambda, \qquad t = 1, \ldots, n.

It follows from (12) that the minimizer x of (5) satisfies

-4\lambda \le x_t - y_t \le 4\lambda, \qquad t = 1, \ldots, n.

We can now derive the formula (7) for λ_{max}. Since x^{ba} is affine, Dx^{ba} = 0, so the condition that x^{ba} is optimal is that \left((DD^T)^{-1}D(y - x^{ba})\right)_t ∈ [−λ, λ] for t = 1, . . . , n − 2, i.e.,

\lambda \ge \|(DD^T)^{-1} D(y - x^{ba})\|_\infty = \|(DD^T)^{-1} D y\|_\infty.
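Computing λ_max is cheap in practice: DD^T is a banded (pentadiagonal) positive definite matrix, so the solve in (7) costs O(n). Below is a minimal sketch of ours, assuming NumPy and SciPy; the generic sparse solver used here does not exploit the band structure as aggressively as a dedicated banded solver, but it suffices for illustration.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def lambda_max(y):
    """lambda_max = ||(D D^T)^{-1} D y||_inf from (7)."""
    n = len(y)
    e = np.ones(n - 2)
    D = sp.diags([e, -2 * e, e], offsets=[0, 1, 2], shape=(n - 2, n), format="csc")
    # D D^T is pentadiagonal and positive definite, so this solve is cheap.
    u = spla.spsolve(sp.csc_matrix(D @ D.T), D @ y)
    return np.max(np.abs(u))

# For any lam >= lambda_max(y), the l1 trend estimate equals the best affine fit x^ba.
```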

We can use the optimality condition (12) to see whether the linear extension property holds for a new observation y_{n+1}. From the optimality condition (12), we can see that if y_{n+1} satisfies

\left\| (\tilde D \tilde D^T)^{-1} \tilde D \left( (y, y_{n+1}) - (x^{lt}, 2x^{lt}_n - x^{lt}_{n-1}) \right) \right\|_\infty \le \lambda,

where \tilde D ∈ R^{(n−1)×(n+1)} is the second-order difference matrix on R^{n+1}, then the ℓ1 trend estimate for (y, y_{n+1}) is given by (x^{lt}, 2x^{lt}_n − x^{lt}_{n−1}) ∈ R^{n+1}. From this inequality, we can easily find the bounds l and u such that if l ≤ y_{n+1} ≤ u, then the linear extension property holds.


    5.2 Dual problem

To derive a Lagrange dual of the primal problem of minimizing (5), we first introduce a new variable z ∈ R^{n−2}, as well as a new equality constraint z = Dx, to obtain the equivalent formulation

minimize   (1/2)\|y - x\|_2^2 + \lambda\|z\|_1
subject to z = Dx.

Associating a dual variable ν ∈ R^{n−2} with the equality constraint, the Lagrangian is

L(x, z, \nu) = (1/2)\|y - x\|_2^2 + \lambda\|z\|_1 + \nu^T(Dx - z).

The dual function is

\inf_{x,z} L(x, z, \nu) =
\begin{cases}
-(1/2)\nu^T D D^T \nu + y^T D^T \nu, & -\lambda\mathbf{1} \preceq \nu \preceq \lambda\mathbf{1}, \\
-\infty, & \text{otherwise}.
\end{cases}

The dual problem is

minimize   g(\nu) = (1/2)\nu^T D D^T \nu - y^T D^T \nu
subject to -\lambda\mathbf{1} \preceq \nu \preceq \lambda\mathbf{1}.    (13)

(Here a ⪯ b means a_i ≤ b_i for all i.) The dual problem (13) is a (convex) quadratic program (QP) with variable ν ∈ R^{n−2}. We say that ν ∈ R^{n−2} is strictly dual feasible if it satisfies −λ1 ≺ ν ≺ λ1.

From the solution ν^{lt} of the dual (13), we can recover the ℓ1 trend estimate via

x^{lt} = y - D^T\nu^{lt}.    (14)
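As a numerical sanity check of this duality relationship (ours, assuming CVXPY and NumPy), one can solve the box-constrained QP (13) with a generic solver and recover the trend through (14); for small n the result should match a direct solve of the primal problem (5).

```python
import numpy as np
import cvxpy as cp

def l1_trend_filter_dual(y, lam):
    """Solve the dual QP (13) and recover x^lt via (14)."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    nu = cp.Variable(n - 2)
    # (1/2) nu^T D D^T nu - y^T D^T nu, written with sum_squares to keep it DCP.
    objective = 0.5 * cp.sum_squares(D.T @ nu) - nu @ (D @ y)
    constraints = [nu <= lam, nu >= -lam]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return y - D.T @ nu.value

# x_from_dual = l1_trend_filter_dual(y, lam) should agree (to solver tolerance)
# with minimizing the primal objective (5) directly.
```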

    6 A primal-dual interior point method

The QP (13) can be solved by standard convex optimization methods, including general interior-point methods [13, 73, 74, 103], and more specialized methods such as path following [76, 31]. These methods can exploit the special structure of the problem, i.e., the bandedness of the quadratic form in the objective, to solve the problem very efficiently. To see how this can be done, we describe a simple primal-dual method in this section. For more detail on these (and related) methods, see, e.g., [13, §11.7] or [103].

The worst-case number of iterations in primal-dual interior-point methods for the QP (13) is O(n^{1/2}) [73]. In practice, primal-dual interior-point methods solve QPs in a number of iterations that is just a few tens, almost independent of the problem size or data. Each iteration is dominated by the cost of computing the search direction, which, if done correctly for the particular QP (13), is O(n). It follows that the overall complexity in practice is O(n), the same as solving the H-P filtering problem (but with a larger constant hidden in the O(n) notation).

The search direction is the Newton step for the system of nonlinear equations

r_t(\nu, \mu_1, \mu_2) = 0,    (15)


where t > 0 is a parameter and

r_t(\nu, \mu_1, \mu_2) =
\begin{bmatrix} r_{dual} \\ r_{cent} \end{bmatrix} =
\begin{bmatrix}
DD^T\nu - Dy + \mu_1 - \mu_2 \\
-\mathrm{diag}(\mu_1)(\nu - \lambda\mathbf{1}) - (1/t)\mathbf{1} \\
\phantom{-}\mathrm{diag}(\mu_2)(\nu + \lambda\mathbf{1}) - (1/t)\mathbf{1}
\end{bmatrix}    (16)

is the residual. (The first component is the dual feasibility residual, and the second is the centering residual.) Here μ_1, μ_2 ∈ R^{n−2} are (positive) dual variables for the inequality constraints in (13), and ν is strictly dual feasible. As t → ∞, r_t(ν, μ_1, μ_2) = 0 reduces to the Karush-Kuhn-Tucker (KKT) condition for the QP (13). The basic idea is to take Newton steps for solving the set of nonlinear equations r_t(ν, μ_1, μ_2) = 0, for a sequence of increasing values of t.

The Newton step is characterized by

r_t(\nu + \Delta\nu, \mu_1 + \Delta\mu_1, \mu_2 + \Delta\mu_2) \approx r_t(\nu, \mu_1, \mu_2) + Dr_t(\nu, \mu_1, \mu_2)(\Delta\nu, \Delta\mu_1, \Delta\mu_2) = 0,

where Dr_t is the derivative (Jacobian) of r_t. This can be written as

\begin{bmatrix}
DD^T & I & -I \\
I & J_1 & 0 \\
-I & 0 & J_2
\end{bmatrix}
\begin{bmatrix} \Delta\nu \\ \Delta\mu_1 \\ \Delta\mu_2 \end{bmatrix}
= -
\begin{bmatrix}
DD^T\nu - Dy + \mu_1 - \mu_2 \\
f_1 + (1/t)\,\mathrm{diag}(\mu_1)^{-1}\mathbf{1} \\
f_2 + (1/t)\,\mathrm{diag}(\mu_2)^{-1}\mathbf{1}
\end{bmatrix},    (17)

where

f_1 = \nu - \lambda\mathbf{1} \in R^{n-2}, \qquad
f_2 = -\nu - \lambda\mathbf{1} \in R^{n-2}, \qquad
J_i = \mathrm{diag}(\mu_i)^{-1}\,\mathrm{diag}(f_i) \in R^{(n-2)\times(n-2)}.

(Here diag(w) is the diagonal matrix with diagonal entries w.) By eliminating (Δμ_1, Δμ_2), we obtain the reduced system

\left(DD^T - J_1^{-1} - J_2^{-1}\right)\Delta\nu
= -\left(DD^T\nu - Dy - (1/t)\,\mathrm{diag}(f_1)^{-1}\mathbf{1} + (1/t)\,\mathrm{diag}(f_2)^{-1}\mathbf{1}\right).

The matrix DD^T − J_1^{-1} − J_2^{-1} is banded (with bandwidth 5), so we can solve this reduced system in O(n) arithmetic operations. The other two components of the search step, Δμ_1 and Δμ_2, can be computed as

\Delta\mu_1 = -\left(\mu_1 + (1/t)\,\mathrm{diag}(f_1)^{-1}\mathbf{1} + J_1^{-1}\Delta\nu\right),
\qquad
\Delta\mu_2 = -\left(\mu_2 + (1/t)\,\mathrm{diag}(f_2)^{-1}\mathbf{1} - J_2^{-1}\Delta\nu\right),

in O(n) arithmetic operations (since the matrices J_1 and J_2 are diagonal).

A C implementation of a primal-dual interior-point method for ℓ1 trend filtering is available online from www.stanford.edu/~boyd/l1_tf. For a typical problem with n = 10000 data points, it computes x^{lt} in around one second, on a 3GHz Pentium IV. Problems with one million data points require around 100 seconds, consistent with linear computational complexity in n.


    7 Extensions and variations

The basic ℓ1 trend estimation method described above can be extended in many ways, some of which we describe here. In each case, the computation reduces to solving one or a few convex optimization problems, and so is quite tractable; the interior-point method described above is readily extended to handle these problems.

    7.1 Polishing

One standard trick is to use the basic ℓ1 trend filtering problem as a method to identify the kink points in the estimated trend. Once the kink points {t_1, . . . , t_p} are identified, we use a standard least-squares method to fit the data over all piecewise linear functions with the given kink points:

minimize   \sum_{k=1}^{p-1} \sum_{t_k \le t \le t_{k+1}} (y_t - \alpha_k - \beta_k t)^2
subject to \alpha_k + \beta_k t_{k+1} = \alpha_{k+1} + \beta_{k+1} t_{k+1}, \quad k = 1, \ldots, p-2,

where the variables are the local trend parameters α_k and β_k. This technique is described (in another context) in, e.g., [13, §6.5].
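One convenient way to carry out this polishing step (a sketch of ours, assuming NumPy; the kink times in the usage line are hypothetical) is to note that a continuous piecewise-linear function with interior kinks t_2, . . . , t_{p−1} is exactly a linear combination of the truncated-linear basis functions 1, t, and (t − t_k)_+, so the equality-constrained least-squares problem above can be solved as an ordinary unconstrained least-squares fit in that basis.

```python
import numpy as np

def polish(y, kinks):
    """Least-squares refit of a continuous piecewise-linear trend with given kinks.

    `kinks` lists the interior kink times t_2, ..., t_{p-1} (1-indexed), e.g.
    obtained by thresholding |D x^lt| after a first l1 trend filtering pass.
    Continuity at the kinks is automatic in the truncated-linear basis.
    """
    n = len(y)
    t = np.arange(1, n + 1, dtype=float)
    basis = [np.ones(n), t] + [np.maximum(t - tk, 0.0) for tk in kinks]
    B = np.column_stack(basis)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

# Example with hypothetical kink locations: x_polished = polish(y, kinks=[120, 480, 770])
```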

7.2 Iterative weighted ℓ1 heuristic

The basic ℓ1 trend filtering method is equivalent to

minimize   \|Dx\|_1
subject to \|y - x\|_2 \le s,    (18)

with an appropriate choice of parameter s. In this formulation, we minimize ‖Dx‖_1 (our measure of smoothness of the estimated trend) subject to a budget on residual norm. This problem can be considered a heuristic for the problem of finding the piecewise linear trend with the smallest number of kinks, subject to a budget on residual norm:

minimize   \mathbf{card}(Dx)
subject to \|y - x\|_2 \le s,

where card(z) is the number of nonzero elements in a vector z. Solving this problem exactly is intractable; all known methods require an exhaustive combinatorial search over all, or at least very many, possible combinations of kink points.

The standard heuristic for solving this problem is to replace card(Dx) with ‖Dx‖_1, which gives us our basic ℓ1 trend filter, i.e., the solution to (18). This basic method can be improved by an iterative method that varies the individual weights on the second-order differences in x_t. We start by solving (18). We then define a weight vector as

w_t := 1/(\epsilon + |(Dx)_t|), \qquad t = 1, \ldots, n-2,


where ε is a small positive constant. This assigns the largest weight, 1/ε, when (Dx)_t = 0; it assigns large weight when |(Dx)_t| is small; and it assigns relatively small weight when |(Dx)_t| is larger. We then recompute x_t as the solution of the problem

minimize   \|\mathrm{diag}(w)Dx\|_1
subject to \|y - x\|_2 \le s.

We then update the weights as above and repeat. This iteration typically converges in 10 or fewer steps. It often gives a modest decrease in the number of kink points card(Dx), for the same residual, compared to the basic ℓ1 trend estimation method. The idea behind this heuristic has been used in portfolio optimization with transaction costs [69], where an interpretation of the heuristic for cardinality minimization is given. The reader is referred to [19] for a more extensive discussion of the iterative heuristic.
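A sketch of this reweighting loop, assuming CVXPY and NumPy (the default ε, iteration count, and function names are ours, chosen only for illustration):

```python
import numpy as np
import cvxpy as cp

def reweighted_l1_trend(y, s, epsilon=1e-3, iters=5):
    """Iteratively reweighted l1 heuristic of Section 7.2.

    Each pass minimizes ||diag(w) D x||_1 subject to ||y - x||_2 <= s and then
    updates w_t = 1 / (epsilon + |(D x)_t|).
    """
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    w = np.ones(n - 2)
    x = cp.Variable(n)
    for _ in range(iters):
        objective = cp.norm1(cp.multiply(w, D @ x))
        cp.Problem(cp.Minimize(objective), [cp.norm(y - x, 2) <= s]).solve()
        w = 1.0 / (epsilon + np.abs(D @ x.value))
    return x.value
```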

    7.3 Convex constraints and penalty functions

We can add convex constraints on the estimated trend, or use a more general convex penalty function to measure the residual. In both cases, the resulting trend estimation problem is convex, and therefore tractable. We list a few examples here.

Perhaps the simplest constraints are lower and upper bounds on x_t, or on the first or second differences of x_t, as in

|x_t| \le M, \quad t = 1, \ldots, n, \qquad |x_{t+1} - x_t| \le S, \quad t = 1, \ldots, n-1.

Here we impose a magnitude limit M, and a maximum slew rate (or slope) S, on the estimated trend. Another interesting convex constraint that can be imposed on x_t is monotonicity, i.e.,

x_1 \le x_2 \le \cdots \le x_{n-1} \le x_n.

Minimizing (5) subject to this monotonicity constraint is an extension of isotonic regression, which has been extensively studied in statistics [4, 82]. (Related work on ℓ1-regularized isotonic regression, in an engineering context, includes [41, 42].)

We can also replace the square function used to penalize the residual term y_t − x_t with a more general convex function φ. Thus, we compute our trend estimate x_t as the minimizer of (the convex function)

\sum_{t=1}^{n} \phi(y_t - x_t) + \lambda\|Dx\|_1.

For example, using φ(u) = |u|, we assign less penalty (compared to φ(u) = (1/2)u²) to large residuals, but larger penalty to small residuals. This results in a trend estimation method that is more robust to outliers than the basic ℓ1 trend method since it allows occasional large errors in the residual. Another example is the Huber penalty function used in robust least squares, given by

\phi_{hub}(u) =
\begin{cases}
u^2, & |u| \le M, \\
M(2|u| - M), & |u| > M,
\end{cases}


where M ≥ 0 is a constant [54]. The use of an asymmetric linear penalty function of the form

\phi(u) =
\begin{cases}
\tau u, & u > 0, \\
-(1 - \tau)u, & \text{otherwise},
\end{cases}

where τ ∈ (0, 1) indicates the quantile of interest, is related to quantile smoothing splines. (The reader is referred to [59] for more on the use of this penalty function in quantile regression and [60] for more on quantile smoothing splines.)

In all of these extensions, the resulting convex problem can be solved with a computational effort that is O(n), since the system of equations that must be solved at each step of an interior-point method is banded.
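For instance, the Huber-penalized variant can be written directly in a modeling language; the sketch below (ours, assuming CVXPY and NumPy) uses CVXPY's huber atom, which agrees with the definition of φ_hub above (u² for |u| ≤ M and M(2|u| − M) beyond).

```python
import numpy as np
import cvxpy as cp

def robust_l1_trend_filter(y, lam, M=1.0):
    """Huber-penalized l1 trend filtering:
    minimize sum_t phi_hub(y_t - x_t) + lam * ||D x||_1."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    x = cp.Variable(n)
    objective = cp.sum(cp.huber(y - x, M)) + lam * cp.norm1(D @ x)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```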

    7.4 Multiple components

We can easily extend basic ℓ1 trend filtering to analyze time series data that involve other components, e.g., occasional spikes (outliers), level shifts, seasonal components, cyclic (sinusoidal) components, or other regression components. The problem of decomposing given time series data into multiple components has been a topic of extensive research; see, e.g., [12, 30, 46, 47] and references therein. Compared with standard decomposition methods, the extensions described here are well suited to the case when the underlying trend, once the other components have been subtracted out, is piecewise linear.

Spikes. Suppose the time series data y has occasional spikes or outliers u in addition to trend and irregular components. Our prior information on the component u is that it is sparse. We can extract the underlying trend and the spike signal by adding one more regularization term to (5), and minimizing the modified objective

(1/2)\|y - x - u\|_2^2 + \lambda\|Dx\|_1 + \rho\|u\|_1,

where the variables are x (the trend component) and u (the spike component). Here the parameter λ ≥ 0 is used to control the smoothness (or number of slope changes) of the estimated trend, and ρ ≥ 0 is used to control the number of spikes.
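A direct transcription of this trend-plus-spikes decomposition (ours, assuming CVXPY and NumPy):

```python
import numpy as np
import cvxpy as cp

def trend_plus_spikes(y, lam, rho):
    """Decompose y into a piecewise-linear trend x and a sparse spike component u:
    minimize (1/2)||y - x - u||_2^2 + lam*||D x||_1 + rho*||u||_1."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    x, u = cp.Variable(n), cp.Variable(n)
    objective = (0.5 * cp.sum_squares(y - x - u)
                 + lam * cp.norm1(D @ x) + rho * cp.norm1(u))
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value, u.value
```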

Level shifts. Suppose the time series data y has occasional abrupt level shifts. Level shifts can be modeled as a piecewise constant component w. To extract the level shift component w as well as the trend x, we add the scaled total variation of w, \rho\sum_{t=2}^{n}|w_t - w_{t-1}|, to the weighted sum (5) and minimize the modified objective

(1/2)\|y - x - w\|_2^2 + \lambda\|Dx\|_1 + \rho\sum_{t=2}^{n}|w_t - w_{t-1}|,

over x ∈ R^n and w ∈ R^n. Here the parameter λ ≥ 0 is used to control the smoothness of the estimated trend x, and ρ ≥ 0 is used to control the frequency of level shifts in w.


Periodic components. Suppose the time series data y has an additive deterministic periodic component s with known period p:

s_{t+p} = s_t, \qquad t = 1, \ldots, n-p.

The periodic component s is called seasonal when it models seasonal fluctuations; removing effects of the seasonal component from y in order to better estimate the trend component is called seasonal adjustment. (The corresponding decomposition problem has been studied extensively in the literature; see, e.g., [15, 30, 51, 57, 65, 85].)

Seasonal adjustment is readily incorporated in ℓ1 trend filtering: we simply solve the (convex) problem

minimize   (1/2)\|y - x - s\|_2^2 + \lambda\|Dx\|_1
subject to s_{t+p} = s_t, \quad t = 1, \ldots, n-p,
           \sum_{k=1}^{p} s_k = 0,

where the variables are x (the estimated trend) and s (the estimated seasonal component). The last equality constraint means that the periodic component sums to zero over the period; without this constraint, the decomposition is not unique [35, §6.2.8]. To smooth the periodic component, we can add a penalty term to the objective, or impose a constraint on the variation of s. As a generalization of this formulation, the problem of jointly estimating multiple periodic components (with different periods) as well as a trend can be cast as a convex problem.

When the periodic component is sinusoidal, i.e., s_t = a\sin\omega t + b\cos\omega t, where ω is the known frequency, the decomposition problem simplifies to

minimize   (1/2)\sum_{t=1}^{n}(y_t - x_t - a\sin\omega t - b\cos\omega t)^2 + \lambda\|Dx\|_1,

where the variables are x ∈ R^n and a, b ∈ R. (H-P filtering has also been extended to estimate trend and cyclic components; see, e.g., [40, 48].)

Regression components. Suppose that the time series data y has autoregressive (AR) components in addition to the trend x and the irregular component z:

y_t = x_t + a_1 y_{t-1} + \cdots + a_r y_{t-r} + z_t,

where the a_i are model coefficients. (This model is a special type of multiple structural change time series model [99].) We can estimate the trend component and the AR model coefficients by solving the ℓ1-regularized least squares problem

minimize   (1/2)\sum_{t=1}^{n}(y_t - x_t - a_1 y_{t-1} - \cdots - a_r y_{t-r})^2 + \lambda\|Dx\|_1,

where the variables are x ∈ R^n and a = (a_1, . . . , a_r) ∈ R^r. (We assume that y_{1-r}, . . . , y_0 are given.)


    7.5 Vector time series

The basic ℓ1 trend estimation method can be generalized to handle vector time series data. Suppose that y_t ∈ R^k, for t = 1, . . . , n. We can find our trend estimate x_t ∈ R^k, t = 1, . . . , n, as the minimizer of (the convex function)

\sum_{t=1}^{n}\|y_t - x_t\|_2^2 + \lambda\sum_{t=2}^{n-1}\|x_{t-1} - 2x_t + x_{t+1}\|_2,

where λ ≥ 0 is the usual parameter. Here we use the sum of the ℓ2 norms of the second differences as our measure of smoothness. (If we use the sum of ℓ1 norms, then the individual components of x_t can be estimated separately.) Compared to estimating trends separately in each time series, this formulation couples together changes in the slopes of individual entries at the same time index, so the trend component found tends to show simultaneous trend changes, in all components of x_t, at common kink points. (The idea behind this penalty is used in the group Lasso [105] and in compressed sensing involving complex quantities, and is related to total variation in two- or higher-dimensional data [17, 84].) The common kink points can be interpreted as common abrupt changes or events in the underlying dynamics of the vector time series.

    7.6 Spatial trend estimation

Suppose we are given two-dimensional data y_{i,j}, on a uniform grid (i, j) ∈ {1, . . . , m} × {1, . . . , n}, assumed to consist of a relatively slowly varying spatial trend component x_{i,j} and a more rapidly varying component v_{i,j}. The values of the trend component at node (i, j) and its four horizontally or vertically adjacent nodes lie on a linear surface if both the horizontal and vertical second-order differences, x_{i−1,j} − 2x_{i,j} + x_{i+1,j} and x_{i,j−1} − 2x_{i,j} + x_{i,j+1}, are zero. As in the vector time series case, we minimize a weighted sum of the fitting error \sum_{i=1}^{m}\sum_{j=1}^{n}(y_{i,j} - x_{i,j})^2 and the penalty

\sum_{i=2}^{m-1}\sum_{j=2}^{n-1}\left[(x_{i-1,j} - 2x_{i,j} + x_{i+1,j})^2 + (x_{i,j-1} - 2x_{i,j} + x_{i,j+1})^2\right]^{1/2}

on slope changes in the horizontal and vertical directions. It is possible to use more sophisticated measures of smoothness, for example, determined by a 9-point approximation that includes the 4 diagonally adjacent nodes.

The resulting trend estimates tend to be piecewise linear, i.e., there are regions over which x_{i,j} is affine. The boundaries between regions can be interpreted as curves along which the underlying gradient changes rapidly.

    7.7 Continuous-time trend filtering

Suppose that we have noisy measurements (t_i, y_i), i = 1, . . . , n, of a slowly varying continuous function at irregularly spaced t_i (in increasing order). In this section we consider the problem


of estimating the underlying continuous trend from the finite number of data points. This problem involves an infinite-dimensional set of functions, unlike the trend filtering problems considered above. (Related segmented regression problems have been studied in [39, 55, 63].)

We first consider a penalized least squares problem of the form

minimize   (1/2)\sum_{i=1}^{n}(y_i - x(t_i))^2 + \lambda\int_{t_1}^{t_n}(x''(t))^2\,dt,    (19)

over the space of all functions on the interval [t_1, t_n] with square integrable second derivative. Here, λ is a parameter used to control the smoothness of the solution. The solution is a cubic spline with knots at the t_i, i.e., a piecewise polynomial of degree 3 on R with continuous first and second derivatives; see, e.g., [34, 52, 101]. H-P filtering can be viewed as an approximate discretization of this continuous function estimation problem, when the t_i are regularly spaced: t_i = t_1 + (i − 1)h for some h > 0. If the second derivative of x at t_i is approximated as

x''(t_i) \approx \frac{x(t_{i-1}) - 2x(t_i) + x(t_{i+1})}{h^2}, \qquad i = 2, \ldots, n-1,

then the objective of the continuous-time problem (19) reduces to the weighted sum objective (1) of H-P filtering with regularization parameter λ/h³ in place of λ.

We next turn to the continuous-time ℓ1 trend filtering problem

minimize   (1/2)\sum_{i=1}^{n}(y_i - x(t_i))^2 + \lambda\int_{t_1}^{t_n}|x''(t)|\,dt    (20)

over

X = \left\{ x : [t_1, t_n] \to R \;\middle|\; x(t) = \alpha_0 + \alpha_1 t + \int_{t_1}^{t_n}\max(t - s, 0)\,d\mu(s), \;\; \alpha_0, \alpha_1 \in R, \; V(\mu) < \infty \right\},

where V(μ) is the total variation of the measure μ on [t_1, t_n]. (This function space includes piecewise linear continuous functions with a finite number of knots; see [78].) The difference from (19) is that in the integral term the second derivative is penalized using the absolute value function.

A standard result in interpolation theory [78] is that the solution of the interpolation problem

minimize   \int_{t_1}^{t_n}|x''(t)|\,dt
subject to x(t_i) = y_i, \quad i = 1, \ldots, n,

over X is continuous piecewise linear with knots at the points t_i. From this, we can see that the solution to the continuous-time ℓ1 trend filtering problem (20) is also continuous piecewise linear with knots at the points t_i, i.e., it is a linear spline. The second derivative of a piecewise linear function x with knots at the points t_i is given by

x''(t) = \sum_{i=2}^{n-1}\left(\frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i} - \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}}\right)\delta(t - t_i),


where δ is the Dirac delta function. (The coefficients are the slope changes at the kink points.) The integral of the absolute value of the second derivative is

\int_{t_1}^{t_n}|x''(t)|\,dt = \sum_{i=2}^{n-1}\left|\frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i} - \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}}\right|.

Thus the continuous-time ℓ1 trend filtering problem (20) is equivalent to the (finite-dimensional) convex problem

minimize   (1/2)\sum_{i=1}^{n}(y_i - x_i)^2 + \lambda\sum_{i=2}^{n-1}\left|\frac{x_{i+1} - x_i}{t_{i+1} - t_i} - \frac{x_i - x_{i-1}}{t_i - t_{i-1}}\right|    (21)

with variables (x_1, . . . , x_n) ∈ R^n. From the optimal points (t_i, x_i^*), we can easily recover the solution to the original continuous trend filtering problem: the piecewise linear function that connects the points (t_i, x_i^*),

x(t) = \frac{t - t_i}{t_{i+1} - t_i}\,x_{i+1}^* + \frac{t_{i+1} - t}{t_{i+1} - t_i}\,x_i^*, \qquad t \in (t_i, t_{i+1}),

is the optimal continuous trend that minimizes (20). When the t_i are regularly spaced, this problem reduces to the basic ℓ1 trend filtering problem considered in §3. For the same reason, we can solve (21) (and hence (20)) in O(n) arithmetic operations.

    7.8 Segmented polynomial regression

Thus far our focus has been on fitting a piecewise linear function to the given data. We can extend the idea to fitting a piecewise polynomial of degree k − 1 to the data. Using a weighted ℓ1-norm of the kth-order difference of x as a penalty term, the extension can be formulated as

minimize   (1/2)\sum_{i=1}^{n}(y_i - x_i)^2 + \lambda\|D^{(k,n)}x\|_1.    (22)

Here D^{(k,n)} ∈ R^{(n−k)×n} is the kth-order difference matrix on R^n, defined recursively as

D^{(k,n)} = D^{(1,n-k+1)}\,D^{(k-1,n)}, \qquad k = 2, 3, \ldots,

where D^{(1,p)} ∈ R^{(p−1)×p} is the first-order difference matrix on R^p,

D^{(1,p)} = \begin{bmatrix}
-1 & 1 & & & \\
& -1 & 1 & & \\
& & \ddots & \ddots & \\
& & & -1 & 1
\end{bmatrix}.

(Entries not shown above are zero.) This problem can be efficiently solved with a computational effort that is O(k²n), since the system of equations that must be solved at each step of an interior-point method is banded with bandwidth linear in k.

The resulting trend estimate x tends to be piecewise polynomial of degree k − 1, i.e., there are regions over which x_t is a polynomial of degree k − 1. The case k = 1 corresponds to piecewise constant fitting and the case k = 2 corresponds to piecewise linear fitting.
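A sketch of problem (22), assuming CVXPY and NumPy (ours; it builds D^{(k,n)} by the recursion above with forward first differences, whose sign convention is immaterial under the ℓ1 norm, and it uses a single scalar weight λ rather than a general weighted ℓ1-norm):

```python
import numpy as np
import cvxpy as cp

def kth_order_diff(k, n):
    """k-th order difference matrix D^(k,n) on R^n, built recursively as
    D^(k,n) = D^(1,n-k+1) D^(k-1,n); for k = 2 this reproduces D in (2)."""
    def first_order(p):
        M = np.zeros((p - 1, p))
        for i in range(p - 1):
            M[i, i], M[i, i + 1] = -1.0, 1.0
        return M
    if k == 1:
        return first_order(n)
    return first_order(n - k + 1) @ kth_order_diff(k - 1, n)

def piecewise_poly_trend(y, lam, k):
    """Trend estimate from (22): tends to be piecewise polynomial of degree k - 1."""
    n = len(y)
    x = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(y - x) + lam * cp.norm1(kth_order_diff(k, n) @ x)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value

# k = 1 gives piecewise constant fits (total variation denoising);
# k = 2 recovers the basic l1 trend filter of Section 3.
```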


    Acknowledgments

    The authors thank Trevor Hastie, Johan Lim, Michael Lustig, Almir Mutapcic, and RobertTibshirani for helpful comments and suggestions. This work was funded in part by the Pre-court Institute on Energy Efficiency, by Army award W911NF-07-1-0029, by NSF award0529426, by NASA award NNX07AEIIA, by AFOSR award FA9550-06-1-0514, and byAFOSR award FA9550-06-1-0312.


    References

[1] E. Keogh, S. Chu, D. Hart, and M. Pazzani. Segmenting time series: A survey and novel approach. Data Mining in Time Series Databases, pages 1–22, 2004.

[2] R. Baillie and S. Chung. Modeling and forecasting from trend stationary long memory models with applications to climatology. International Journal of Forecasting, 18(2):215–226, 2002.

[3] O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

    [4] R. Barlow, D. Bartholomew, J. Bremner, and H. Brunk. Statistical Inference UnderOrder Restrictions; The Theory and Application of Isotonic Regression. Wiley, NewYork, 1972.

    [5] M. Baxter and R. King. Measuring business cycles: approximate band-pass filters foreconomic time series. The Review of Economics and Statistics, 81(4):575593, 1999.

    [6] R. Bellman. On the approximation of curves by line segments using dynamic program-ming. Communications of the ACM, 4(6):284, 1961.

    [7] R. Bellman and R. Roth. Curve fitting by segmented straight lines. Journal of theAmerican Statistical Association, 64(327):10791084, 1969.

    [8] D. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.

[9] P. Bloomfield. Trends in global temperature. Climate Change, 21:1–16, 1992.

[10] P. Bloomfield and D. Nychka. Climate spectra and detecting climate change. Climate Change, 21:275–287, 1992.

    [11] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer,2000.

    [12] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting & Control.Prentice Hall, third edition, 1994.

    [13] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,2004.

    [14] S. Boyd, L. Vandenberghe, A. El Gamal, and S. Yun. Design of robust global powerand ground networks. In Proceedings of ACM/SIGDA International Symposium onPhysical Design (ISPD), pages 6065, 2001.

    [15] J. Burman. Seasonal adjustment by signal extraction. Journal of the Royal StatisticalSociety, Series A, 143:321337, 1980.

    [16] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inac-curate measurements. Communications on Pure and Applied Mathematics, 59(8):12071223, 2005.


    [17] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signalreconstruction from highly incomplete frequency information. IEEE Transactions onInformation Theory, 52(2):489509, 2006.

    [18] E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions onInformation Theory, 51(12):42034215, 2005.

[19] E. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 2008. To appear. Available from http://www.stanford.edu/~boyd/papers/rwl1.html.

    [20] S. Chen and D. Donoho. Basis pursuit. In Proceedings of the Twenty-Eighth AsilomarConference on Signals, Systems and Computers, volume 1, pages 4144, 1994.

    [21] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAMReview, 43(1):129159, 2001.

    [22] L. Christiano and T. Fitzgerald. The band-pass filter. International Economic Review,44(2):435465, 2003.

    [23] J. Claerbout and F. Muir. Robust modeling of erratic data. Geophysics, 38(5):826844,1973.

    [24] P. Craigmile, P. Guttorp, and D. Percival. Trend assessment in a long memory de-pendence model using the discrete wavelet transform. Environmetrics, 15(4):313335,2004.

    [25] J. Dahl, V. Roychowdhury, and L. Vandenberghe. Maximum likelihood estimation ofGaussian graphical models: Numerical implementation and topology selection. Sub-mitted. Available from www.ee.ucla.edu/~vandenbe/covsel.html, 2005.

[26] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 41–48. MIT Press, 2005.

    [27] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory,52(4):12891306, 2006.

    [28] D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete rep-resentations in the presence of noise. IEEE Transactions On Information Theory,52(1):618, 2006.

[29] D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B, 57(2):301–337, 1995.

    [30] J. Durbin and S. Koopman. Time Series Analysis by State Space Methods. OxfordUniversity Press, 2001.

    [31] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annalsof Statistics, 32(2):407499, 2004.


    [32] M. Elad, J. Starck, D. Donoho, and P. Querre. Simultaneous cartoon and textureimage inpainting using morphological component analysis (MCA). Applied and Com-putational Harmonic Analysis, 19:340358, 2005.

    [33] J. Ertel and E. Fowlkes. Some algorithms for linear spline and piecewise multiple linearregression. Journal of the American Statistical Association, 71(355):640648, 1976.

    [34] R. Eubank. Nonparametric Regression and Spline Smoothing. Marcel Dekker, 1999.

    [35] J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods.Springer-Verlag, New York, 2003.

    [36] P. Feder. The log likelihood ratio in segmented regression. Annals of Statistics, 3(1):8497, 1975.

    [37] P. Feder. On asymptotic distribution theory in segmented regression problemsIdentified case. Annals of Statistics, 3(1):4983, 1975.

[38] J. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.

    [39] A. Gallant and W. Fuller. Fitting segmented polynomial regression models whosejoin points have to be estimated. Journal of the American Statistical Association,68(341):144147, 1973.

    [40] V. Gomez. The use of butterworth filters for trend and cycle estimation in economictime series. Journal of Business and Economic Statistics, 19(3):365373, 2001.

    [41] D. Gorinevsky. Monotonic regression filters for trending gradual deterioration faults.In Proceedings of American Control Conference (ACC), pages 53945399, 2004.

[42] D. Gorinevsky, S.-J. Kim, S. Boyd, G. Gordon, S. Beard, and F.-K. Chang. Optimal estimation of accumulating damage trend from a series of SHM images. In Proceedings of the 6th International Workshop on Structural Health Monitoring, 2007.

    [43] M. Grant, S. Boyd, and Y. Ye. cvx: a Matlab Software for Disciplined Convex Pro-gramming, 2007. Available from www.stanford.edu/~boyd/cvx/.

    [44] S. Greenland and M. Longnecker. Methods for trend estimation from summarized dose-response data, with applications to meta-analysis. American Journal of Epidemiology,135(11):13011309, 1992.

    [45] S. Guthery. Partition regression. Journal of the American Statistical Association,69(348):945947, 1974.

    [46] J. Hamilton. Time Series Analysis. Princeton University Press, 1994.

    [47] A. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cam-bridge University Press, 1991.

    [48] A. Harvey and T. Trimbur. General model-based filters for extracting cycles and trendsin economic time series. Review of Economics and Statistics, 85(2):244255, 2003.


    [49] A. Hassibi, J. How, and S. Boyd. Low-authority controller design via convex optimiza-tion. In Proceedings of the IEEE Conference on Decision and Control, pages 140145,1999.

    [50] T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression andthe monotone lasso. Electronic Journal of Statistics, 1:129, 2007.

    [51] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC,1990.

    [52] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. SpringerSeries in Statistics. Springer-Verlag, New York, 2001.

    [53] R. Hodrick and E. Prescott. Postwar U.S. business cycles: An empirical investigation.Journal of Money, Credit, and Banking, 29(1):116, 1997. Discussion paper No. 451.Center for Mathematical Studies in Economics and Management Science, NorthwesternUniversity, Evanston, IL, 1981.

    [54] P. Huber. Robust Statistics. John Wiley & Sons, 1981.

    [55] D. Hudson. Fitting segmented curves whose join points have to be estimated. Journalof the American Statistical Association, 61(316):10971129, 1966.

[56] S. Joshi and S. Boyd. Sensor selection via convex optimization. IEEE Transactions on Signal Processing, 2008. To appear. Available from http://www.stanford.edu/~boyd/papers/sensor_selection.html.

    [57] P. Kenny and J. Durbin. Local trend estimation and seasonal adjustment of economicand social time series. Journal of Royal Statistical Society. Series A, 145:141, 1982.

    [58] H.-J. Kim, M. Fay, E. Feuer, and D. Midthune. Permutation tests for joinpoint regres-sion with applications to cancer rates. Statistics in Medicine, 19(3):335351, 2000.

    [59] R. Koenker and G. Bassett. Regression quantile. Econometrica, 46(1):3350, 1978.

    [60] R. Koenker, P. Ng, and S. Portnoy. Quantile smoothing splines. Biometrika, 81(4):673680, 1994.

    [61] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale 1-regularizedlogistic regression. Journal on Machine Learning Research, 8:15191555, 2007.

    [62] S. Lee, H. Lee, P. Abeel, and A. Ng. Efficient l1-regularized logistic regression. InProceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), 2006.

    [63] P. Lerman. Fitting segmented regression models by grid search. Applied Statistics,29(1):7784, 1980.

    [64] C. Leser. A simple method of trend construction. Journal of the Royal StatisticalSociety. Series B (Methodological), 23(1):91107, 1961.

    [65] C. Leser. Estimation of quasi-linear trend and seasonal variation. Journal of theAmerican Statistical Association, 58(304):10331043, 1963.


    [66] S. Levitt. Understanding why crime fell in the 1990s: Four factors that explain thedecline and six that do not. Journal of Economic Perspectives, 18(1):163190, 2004.

    [67] S. Levy and P. Fullagar. Reconstruction of a sparse spike train from a portion of itsspectrum and application to high-resolution deconvolution. Geophysics, 46(9):12351243, 1981.

    [68] W. Link and J. Sauer. Estimating equations estimates of trend. Bird Populations,2:2332, 1994.

    [69] M. Lobo, M. Fazel, and S. Boyd. Portfolio optimization with linear and fixed transac-tion costs. Annals of Operations Research, 152(1):341365, 2006.

    [70] R. Lucas. Two illustrations of the quantity theory of money. American EconomicReview, 70:100514, 1980.

    [71] V. McGee and W. Carleton. Piecewise regression. Journal of the American StatisticalAssociation, 65(331):11091124, 1970.

    [72] G. Mosheiov and A. Raveh. On trend estimation of time-series: A simpler linearprogramming approach. Journal of the Operations Research Society, 48(1):9096, 1997.

    [73] Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Methods in Convex Pro-gramming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, PA,1994.

    [74] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in OperationsResearch. Springer, 1999.

    [75] D. Osborne. Moving average detrending and the analysis of business cycles. OxfordBulletin of Economics and Statistics, 57:547558, 1995.

    [76] M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in leastsquares problems. IMA Journal of Numerical Analysis, 20(3):389403, 2000.

    [77] M. Park and T. Hastie. An 1 regularization-path algorithm for generalized linearmodels. Journal of the Royal Statistical Society, Series B, 69(4):659677, 2007.

    [78] A. Pinkus. On smoothest interpolants. SIAM Journal on Mathematical Analysis,19(6):143141, 1988.

    [79] D. Pollock. Trend estimation and de-trending via rational square-wave filters. Journalof Econometrics, 99(2):317334, 2000.

    [80] R. Quandt. The estimation of the parameter of a linear regression system obeying twoseparate regimes. Journal of the American Statistical Association, 53(284):873880,1958.

    [81] C. Reinsch. Smoothing by spline functions. Numerische Mathematik, 10:177183, 1976.

[82] T. Robertson, F. Wright, and R. Dykstra. Order Restricted Statistical Inference. Wiley, New York, 1988.


    [83] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics,35(3):10121030, 2007.

    [84] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removalalgorithms. Physics D, 60:259268, 1992.

    [85] E. Schlicht. A seasonal adjustment principle and a seasonal adjustment method derivedfrom this principle. Journal of the American Statistical Association, 76(374):374378,1981.

    [86] K. Singleton. Econometric issues in the analysis of equilibrium business cycle models.Journal of Monetary Economics, 21(2-3):361386, 1988.

[87] A. Smith and D. Cook. Straight lines with a change-point: A Bayesian analysis of some renal transplant data. Applied Statistics, 29(2):180–189, 1980.

    [88] J. Starck, M. Elad, and D. Donoho. Image decomposition via the combination of sparserepresentations and a variational approach. IEEE Transactions on Image Processing,14(10):15701582, 2005.

    [89] H. Stone. Approximation of curves by line segments. Mathematics of Computation,15(73):4047, 1961.

    [90] J. Sturm. Using SEDUMI 1.02, a Matlab Toolbox for Optimization Over SymmetricCones, 2001. Available from fewcal.kub.nl/sturm/software/sedumi.html .

    [91] R. Talluri and G. van Ryzin. Theory and Practice of Revenue Management. Springer,2005.

    [92] H. Taylor, S. Banks, and J. McCoy. Deconvolution with the l1 norm. Geophysics,44(1):3952, 1979.

    [93] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the RoyalStatistical Society, Series B, 58(1):267288, 1996.

    [94] R. Tibshirani, M. Saunders, S. Rosset, and J. Zhu. Sparsity and smoothness via thefused Lasso. Journal of the Royal Statistical Society Series B, 67(1):91108, 2005.

[95] K. Toh, R. Tutuncu, and M. Todd. SDPT3 4.0 (beta) (software package), 2001. Available from http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.

    [96] J. Tropp. Just relax: Convex programming methods for identifying sparse signals innoise. IEEE Transactions on Information Theory, 53(3):10301051, 2006.

    [97] R. Tsay. Analysis of Financial Time Series. Wiley-Interscience, second edition, 2005.

    [98] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical modelselection using 1-regularized logistic regression. In Advances in Neural InformationProcessing Systems (NIPS) 19, pages 14651472, 2007.

[99] J. Wang. A Bayesian time series model of multiple structural changes in level, trend, and variance. Journal of Business and Economic Statistics, 18(3):374–386, 2000.


    [100] Y. Wen and B. Zeng. A simple nonlinear filter for economic time series analysis.Economics Letters, 64:151160, 1999.

[101] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

[102] K. Worsley. Testing for a two-phase multiple regression. Technometrics, 53(1):34–42, 1983.

    [103] S. Wright. Primal-dual interior-point methods. Society for Industrial and AppliedMathematics, Philadelphia, PA, USA, 1997.

    [104] B. Yu, M. Barrett, H.-J. Kim, and E. Feuer. Estimating joinpoints in continuous timescale for multiple change-point models. Computational Statistics & Data Analysis,51(5):24202427, 2007.

    [105] M. Yuan and L. Lin. Model selection and estimation in regression with grouped vari-ables. Journal of the Royal Statistical Society, Series B, 68(1):4967, 2006.

    [106] S. Zhao and G. Wei. Jump process for the trend estimation of time series. Computa-tional Statistics & Data Analysis, 42:219241, 2003.

    [107] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journalof the Royal Statistical Society, Series B, 67(2):301320, 2005.

    [108] A. Zymnis, S. Boyd, and D. Gorinevsky. Relaxed maximum a posteriori faultidentification, 2007. Technical Report, Stanford University. Available from http://www.stanford.edu/~boyd/papers/fault_det.html.
