Variable Selection for High-Dimensional Data with Spatial...

Variable Selection for High-Dimensional Datawith Spatial-Temporal Effects and Extensions to

Multitask Regression and MulticategoryClassification

Tong Tong Wu

Department of Epidemiology and BiostatisticsSchool of Public Health

University of Maryland, College Park

October 31, 2009

1 / 39

Overview and Introduction Variable Selection for High-Dimensional Data with Spatial-Temporal Effects Extensions to Multicategory Classification and Multitask Regression Discussion References

Outline

1 Overview and Introduction

2 Variable Selection for High-Dimensional Data withSpatial-Temporal Effects

Models for Time-Course DataPenalties for Structured Predictor SpacesCoordinate Descent Algorithms

3 Extensions to Multicategory Classification and MultitaskRegression

4 Discussion

2 / 39


Overview and Introduction

3 / 39


Figure 1: Overview of the proposed research

4 / 39


Objectives: NSA Mathematical Science Program

1 Development of new variable selection method forhigh-dimensional data with spatial-temporal effects inregression

1 Models for time-course data (temporal effects): AR(q), kernelcombined regression models

2 New penalty functions for structured predictor spaces (spatialeffects): linear chain, undirected graph

3 Fast and stable algorithms for high-dimensionalspatial-temporal data

4 Asymptotic properties5 Finite sample properties by numerical experiments

2 Extensions to multicategory classification and multitaskregression

3 Applications in biology: pancreatic cancer pathways, cancersubtype classification

5 / 39


Variable Selection

Design of effective and efficient tools to extract scientificinsights from these massive and noisy high-throughput data

Two approaches for dimension reduction: feature selectionand feature extraction

Selection of a possible best subset from the original variablesto build robust learning models

Goals of variable selection

Alleviating the effect of the curse of dimensionalityAcquiring better understanding about dataEnhancing generalization capability and predictionImproving stabilityAvoiding bias in hypothesis tests during or after variableselectionRetaining scientific interpretabilitySpeeding up learning process

6 / 39


Lasso Penalized Regression

Lasso (Least absolute shrinkage and selection operator) penalizedregression (Tibshirani 1996) can be phrased as

min f (θ) = g(θ) + λ

p∑j=1

|βj |

g(θ): loss function

g(θ) =∑n

i=1 |yi − µ−∑p

j=1 xijβj | for `1 regression

g(θ) = 12

∑ni=1(yi − µ−

∑pj=1 xijβj)

2 for `2 regression

θ = (µ, β1, . . . , βp)t : parameter vector

λ∑p

j=1 |βj |: lasso penalty

λ: positive tuning constant that determines the strength of thelasso penalty

7 / 39


Lasso Penalty λ∑p

j=1 |βj |

Shrinking each βj toward the origin

Discouraging models with large numbers of marginallyrelevant predictors

More effective in deleting irrelevant predictors than a ridgepenalty λ

∑pj=1 β

2j because |b| > b2 for small b

8 / 39


Lasso vs. Ridge

In `2 regression:Pn

i=1(yi − µ−Pp

j=1 xijβj)2 = (β − β0)TXTX (β − β0) + c

9 / 39


Variable Selection for High-Dimensional Datawith Spatial-Temporal Effects

10 / 39


Spatial-Temporal Models

Idea: Minimize a regularized objective function

min f (θ) = g(θ) + λP(β)

g(θ): loss function for time-course model (temporal effects)

P(β): penalty function for selecting structured predictors(spatial effects)

11 / 39


Regression Models for Time-Course Data

Regression model for subject i at time ts , s = 1, . . . ,T

yi (ts) = µ(ts) + x ti β(ts) + εi (ts)

where

yi (t): responses at time txi = (xi1, . . . , xip)t : fixed p predictorsβ(t) = (β1(t), . . . , βp(t))t : time-varying coefficientsµ(t): time-varying intercept

12 / 39



Regression model for subject i :

0B@ yi (t1)...

yi (tT )

1CA =

0B@ µ(t1)...

µ(tT )

1CA +

0B@ β1(t1) . . . βp(t1)...

β1(tT ) . . . βp(tT )

1CA0B@ xi1

...xip

1CA

+

0B@ εi (t1)...

εi (tT )

1CAyi ≡ µ+ Bxi + εi

where B is a T × p coefficient matrix whose sth row is β(ts)t

Linear system for n observations:

0B@ y t1...y tn

1CA = 1µt +

0B@ x ti...x tn

1CA0B@ β1(t1) . . . βp(t1)

...β1(tT ) . . . βp(tT )

1CAt

+

0B@ εt1...εtn

1CA13 / 39



Stationary process

E [ε(t)] = 0 and Cov(εt , εt+s) = Cov(εt , εt−s) = σ2ρs

where ρs is the error autocorrelation at lag s

Error covariance matrix Σ for each subject

Σ = σ2

1 ρ1 ρ2 . . . ρT−1

ρ1 1 ρ1 . . . ρT−2

ρ2 ρ1 1 . . . ρT−3...

. . ....

ρn−1 ρn−2 ρn−3 . . . 1

= σ2V

Loglikelihood

−n

2log(det Σ)− 1

2

n∑i=1

(yi − µ− Bxi )tΣ−1(yi − µ− Bxi )

14 / 39


AR(q) Model

Autocorrelated regression errors εt

εt = φ1εt−1 + φ2εt−2 + · · ·+ φqεt−q + vt ,

where the “random shocks” vt ∼ N(0, σ2v )

Given that first q observations for each subject are fixed, theconditional loglikelihood of the remaining T − q observations(y(tq+1), . . . , y(tT ))

−n

2log σ

2v −

1

2σ2v

nXi=1

TXs=q+1

nyi (ts )− µ(ts )− xt

i β(ts )−qX

j=1

φj

hyi (ts−j )− µ(ts )− xt

i β(ts−j )io2

| {z }Minimizing the least-squares type loss function

g1(θ) =nX

i=1

TXs=q+1

nyi (ts)− µ(ts)− x t

i β(ts)−qX

j=1

φj

hyi (ts−j )− µ(ts)− x t

i β(ts−j )io2

15 / 39


Kernel Smoothed Regression Model

Minimizing the kernel smoothed loss function at time ts forθ(ts)

θ(ts) = arg minn∑

i=1

T∑r=1

w(ts , tr ){

yi (tr )− µ(ts)− x ti β(ts)

}2

where∑T

r=1 w(ts , tr ) = 1

Kernel smoothed response

yi (ts) =T∑

r=1

w(ts , tr )yi (tr )

Minimizing the least-squares type loss function

g2(θ) =n∑

i=1

T∑s=1

{yi (ts)− µ(ts)− x t

i β(ts)}2

16 / 39


Structured Predictor Spaces

Grouped predictors (Wu and Lange 2008; Yuan and Lin 2006; Zhao

et al. 2009; Zou and Hastie 2005)

Example: in gene microarray experiments, genes cansometimes be grouped into biochemical pathways subject togenetic coregulation and their expression levels are expectedto be highly correlated

Order restrictions (Efron et al. 2004; Turlach et al. 2005; Wu et al.

2009; Yuan et al. 2007)

Example: a higher order term should be selected only whenthe corresponding lower order terms are present in the model

Ordered predictors (Tibshirani et al. 2005; Li and Zhang 2008)

Example: genes or SNPs are located on chromosomes as asequence

17 / 39


Structured Predictor Spaces

Important to incorporate the structural information into themodel building process when the covariate space is highlystructured

Two structures of predictor spaces

Linear chain: sequencing and mapping information of genomesUndirected graph: networking information

18 / 39


Linear Chain Structure

A genome map describes the order of genes or other markersand the spacing between them

The existing works incorporating linear chain structures onlyconsider the sequencing of markers (e.g. fused LASSO(Tibshirani et al. 2005), Bayesian variable selection (Li andZhang 2008; Tai and Pan 2009))

The distance between two markers, which providesinformation on the recombination of the two markers andtheir participation together in relevant biological processes,however, has not been taken into consideration

The sequencing and mapping information is widely availablefor public use in the internet

19 / 39


Penalty Function for Linear Chain Structure

Adaptive fused lasso penalty

P1(β) = γ

p∑j=2

1

|mj −mj−1||βj − βj−1| ≡

p∑j=2

γj |βj − βj−1|,

where mj is the map distance of marker j , andγj = γ/|mj −mj−1| is the adaptive weight

20 / 39


Undirected Graphical Structure

Example: Nodes in the graph represent genes and edgesrepresent interactions between genes

Bayesian approach: Li and Zhang (2008); Tai and Pan (2009)

Frequentist approach: Li and Li (2008); Pan et al. (2009)

Employing a different penalty function to incorporate themapping information for a more straightforward biologicalreason

21 / 39


Figure 2: Gene regulatory networks. Each node is a single gene, and anytwo genes are connected if in the same pathway (reproduced withpermission from Sinauer Associates, Inc. (Gifford et al. 2006))

22 / 39


Penalty Function for Undirected Graphical Structure

Consider a weighted graph G = (V ,E ,W ), where

V : the set of nodes corresponding to the p predictorsE = {u ∼ v}: the set of edges indicating whether thepredictors u and v are linked in the networkW : the weights of the edges with w(u, v) denoting the weightof edge e = (u ∼ v)

In this study, the edge weights represent the map distancesbetween nodes

Penalty for graphical structure:

P2(β) = γ∑

i

∑j∼i

w(i , j)|βi − βj | ≡∑

i

∑j∼i

γij |βi − βj |

where γij = γw(i , j) = γ/|mi −mj |

23 / 39


Spatial-Temporal Models

The spatial-temporal model therefore can be implemented byminimizing the objective function

f (θ) = gk(θ) + λ

p∑j=1

|βj |+ γlPl(β), k, l = 1, 2 (1)

The two `1 penalties encourage sparsity in the coefficients andtheir differences, respectively

The objective function is convex but nondifferentiable, with alarge number of predictors

Use the cyclic coordinate descent algorithm highlighted in Wuand Lange (2008) and Friedman et al. (2007)

Allow response to be related to a different subset of predictorsat each time point

24 / 39


Literature Review: Computational Algorithms

The inefficiency of the original lasso limited its impact on statisticalpractice (Madigan and Ridgeway 2004), even with the more recentalgorithms (Osborne et al. 2000) due to their complexity

Efron et al. (2004): LARS (Least Angle Regression, Lasso,and Forward Stagewise)

Fu (1998) and Daubechies et al. (2004): coordinate descentfor lasso penalized `2 regression, but no follow up forunderdetermined problems

Wang et al. (2006): solution path for penalized `1 regressionusing linear programming

Park and Hastie (2006): solution path for penalized `2

regression and generalized linear models

Wu and Lange (2008) and Friedman et al. (2007):independent and concurrent work on coordinate descent!

25 / 39


Coordinate Descent Algorithms

The difficulties of minimizing the objective function (1) lie inthree facts

1 Nondifferentiability due to the `1 penalties2 p � n3 Impossibility of matrix operations for huge p

Coordinate descent: simplicity, speed, and stability (Friedman

et al. 2007; Wu and Lange 2008; Wu et al. 2009)

Decisive advantages in dealing with data sets in our setting

Standard version of coordinate descent: cycling through theparameters and updating each in turn

26 / 39


Coordinate Descent Algorithms

Directional derivatives along forward or backward directions, e.g.ek is the coordinate direction along which βk varies

Forward directional derivative of the objective function

dekf (θ) = lim

τ↓0

f (θ + τek)− f (θ)

τ= dek

g(θ) +

{λ βk ≥ 0−λ βk < 0

Backward directional derivative of the objective function

d−ekf (θ) = lim

τ↓0

f (θ − τek)− f (θ)

τ= d−ek

g(θ) +

{−λ βk > 0λ βk ≤ 0

27 / 39


Procedures

1 Start all parameters at the origin2 Evaluate both dek

f (θ) and d−ekf (θ) for βk

If both are nonnegative ⇒ skip the update for βk

If either directional derivative is negative ⇒ solve for theminimum in that directionBoth directional derivatives cannot be negative because thiscontradicts the convexity of f (θ)s

3 After the direction is chosen, Newton’s method can be usedfor updating the parameter

4 Check if the objective function is driven downhill

5 Halve the step size if the descent property fails

28 / 39


Underdetermined Problems

p � n with just a few relevant predictors

Fast speed of cyclic coordinate descent

Most updates are skipped and the corresponding parametersnever budge from their starting values of 0Complete absence of matrix operations

Numerical stability from the descent property of each update

29 / 39


Extensions to Multicategory Classificationand Multitask Regression

30 / 39


Overview of Vertex Discriminant Analysis (VDA)

A new method of supervised learning

Based on linear discrimination among the vertices of a regularsimplex in Euclidean space

Each vertex represents a different category

A regression problem involving ε-insensitive residuals andpenalties on the coefficients of the linear predictors

Ridge penalty (VDAR)Lasso penalty (VDAL)Euclidean penalty (VDAE, VDALE)

Minimizing the objective function by a primal MM algorithmor a coordinate descent algorithm

Competitive in statistical accuracy and computational speed

31 / 39


Regularized ε-insensitive Loss Function for VDA

VDA (Vertex Discriminant Analysis) discriminates among kcategories by minimizing ε-insensitive loss plus penalties

f (θ) =n∑

i=1

g(yi − Bxi − µ) + λL

k−1∑j=1

p∑l=1

|βjl |+ λE

p∑l=1

‖β(l)‖2

where

yi ∈ Rk−1 is the vertex assignment for case i

β(l) is the lth column of a k × p matrix B of regressioncoefficients

µ is a (k − 1)× 1 column vector of intercepts

g(v): modified ε-insensitive loss

g(v) =

‖v‖2 − ε if ‖v‖2 ≥ ε+ δ(‖v‖2−ε+δ)3(3δ−‖v‖2+ε)

16δ3 if ‖v‖2 ∈ (ε− δ, ε+ δ)0 if ‖v‖2 ≤ ε− δ

32 / 39


−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.

5−

1.0

−0.

50.

00.

51.

01.

5

Indicator Vertices for 3 Classes

z1

z2

●●●

●

●●●

●

●

Figure 3: Plot of the indicator vertices for 3 classes, with a ε-radius circle aroundeach vertex

33 / 39


VDA with Spatial Effects

Spatial information can be incorporated into classification byadding the network-constraint penalty to the regularizedobjective function

f (θ) =n∑

i=1

g(yi − Bxi − µ) + λL

k−1∑j=1

p∑l=1

|βjl |+ λE

p∑l=1

‖β(l)‖2 + γlPl(A)

l = 1, 2

Multiresponse regression problem: yi ∈ Rk−1 for k categories

Cyclic coordinate descent algorithm

34 / 39


Multitask Regression with Spatial Effects

Learning multiple related tasks from data simultaneously canbe advantageous to learning these tasks independently(Bakker and Heskes 2003; Caruana 1997; Evgeniou and Pontil2004)

Main task is trained with other related problems at the sametime using a shared representation

Example: 24 soil samples with measurements of 14 quantitiesat 770 different wavelengths (CMIS/CSIRO)

35 / 39


Multitask Regression with Spatial Effects

Assume we have k responses and share the same set ofcovariates

For subject i , yi = (yi1, . . . , yik)t and xi = (xi1, . . . , xip)t

Further assume each task comes from a different regressionmodel of x ’s

Multiresponse regression model y t1...

y tn

= 1µt +

x ti...x tn

β11 . . . β1p

. . .

βk1 . . . βkp

t

36 / 39


Discussion

37 / 39


Connections to MCAI 2.0

Applications in biology1 Pancreatic cancer pathways2 Cancer subtype classification

Model verification/checking by MACI 2.0

38 / 39


Thank you very much!

39 / 39


Bakker, B. and Heskes, T. (2003), “Task clustering and gating forBayesian multitask learning,” Journal of Machine LearningResearch, 4, 83–99.

Caruana, R. (1997), “Multitask Learning,” Machine Learning, 28,41–75.

Daubechies, I., Defrise, M., and De Mol, C. (2004), “An iterativethresholding algorithm for linear inverse problems with a sparsityconstraint,” Communications on Pure and Applied Mathematics,57, 1413–1457.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004),“Least angle regression,” Ann Stat, 32, 407–499.

Evgeniou, T. and Pontil, M. (2004), “Regularized multitasklearning,” Proceedings of the tenth ACM SIGKDD internationalconference on Knowledge discovery and data mining, 109–117.

Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. (2007),“Pathwise coordinate optimization,” Anns Appl Stat.

Fu, W. J. (1998), “Penalized regressions: the bridge versus thelasso,” J Comp and Graph Stat, 7, 397–416.

39 / 39


Gifford, M. L., Gutierrez, R. A., and Coruzzi, G. M. (2006), “Essay12.2: Modeling the Virtual Plant: A Systems Approach toNitrogen-Regulatory Gene Networks. Plant Physiology, FourthEdition Online,” online.

Li, C. and Li, H. (2008), “Network-constrained regularization andvariable selection for analysis of genomic data,” Bioinformatics,24, 1175–1182.

Li, F. and Zhang, N. R. (2008), “Bayesian Variable Selection inStructured High-Dimensional Covariate Spaces with Applicationsin Genomics,” manuscript.

Madigan, D. and Ridgeway, G. (2004), “Discussion of “least angleregression” by Efron et al.” Annals of Statistics, 32, 465–469.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), “A newapproach to variable selection in least squares problems,” IMAJournal of Numerical Analysis, 20, 389–403.

Pan, W., Xie, B., and Shen, X. (2009), “Incorporating PredictorNetwork in Penalized Regression with Application to MicroarrayData,” Biometrics, online publication.

39 / 39


Park, M. Y. and Hastie, T. (2006), “Penalized logistic regressionfor detecting gene interactions,” Tech. Rep. 2006-15,Department of Statistics, Stanford University.

Tai, F. and Pan, W. (2009), “Bayesian Variable Selection inRegression with Networked Predictors,” manuscript.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K.(2005), “Sparsity and smoothness via the fused lasso,” J. Roy.Stat. Soc., Series B, 67, 91–108.

Turlach, B. A., Venables, W. N., and Wright, S. J. (2005),“Simultaneous variable selection,” Technometrics, 47, 349–363.

Wang, L., Gordon, M. D., and Zhu, J. (2006), “Regularized leastabsolute deviations regression and an efficient algorithm forparameter tuning,” Proceedings of the Sixth InternationalConference on Data Mining (ICDM’06). IEEE Computer Society,690–700.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K.(2009), “Genomewide association analysis by lasso penalizedlogistic regression,” Bioinformatics, 25, 714–721.

39 / 39


Wu, T. T. and Lange, K. (2008), “Coordinate descent algorithmsfor lasso penalized regression,” Ann. Appl. Stat., 2, 224–244.

Yuan, M., Joseph, R., and Lin, Y. (2007), “An efficient variableselection approach for analyzing designed experiments,”Technometrics, 49, 430–439.

Yuan, M. and Lin, Y. (2006), “Model selection and estimation inregression with grouped variables,” J Roy Stat Soc, Series B, 68,49–67.

Zhao, P., Rocha, G., and Yu, B. (2009), “The composite absolutepenalties family for grouped and hierarchical variable selection,”Annals of Statistics, 37, 3468–3497.

Zou, H. and Hastie, T. (2005), “Regularization and variableselection via the elastic net,” J Roy Stat Soc, Series B, 67,301–320.

39 / 39

Variable Selection for High-Dimensional Data with Spatial...

Documents

Transcript of Variable Selection for High-Dimensional Data with Spatial...