Robust high-dimensional linear regression: A statistical perspective


Transcript of: Robust high-dimensional linear regression: A statistical perspective

Page 1:

Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh

University of Wisconsin - Madison, Departments of ECE & Statistics

STOC workshop on robustness and nonconvexity, Montreal, Canada

June 23, 2017

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 1 / 26

Page 2:

Introduction: Robust regression

Robust statistics introduced in 1960s (Huber, Tukey, Hampel, et al.)

Goals:
1. Develop estimators T(·) that are reliable under deviations from model assumptions
2. Quantify performance with respect to deviations

Local stability captured by influence function

IF(x; T, F) = lim_{t→0} [T((1 − t)F + tδ_x) − T(F)] / t

Global stability captured by breakdown point

ε*(T; X_1, …, X_n) = min{ m/n : sup_{X^m} ‖T(X^m) − T(X)‖ = ∞ }

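These two notions can be illustrated numerically. The sketch below (Python, not from the talk) uses the sensitivity curve n·(T(X ∪ {x}) − T(X)), a finite-sample analogue of the influence function, to contrast the mean and the median:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

def sensitivity_curve(T, sample, x):
    # Finite-sample analogue of IF(x; T, F): contaminate the empirical
    # distribution with a point mass at x, with weight t = 1/(n+1)
    n = len(sample)
    return (n + 1) * (T(np.append(sample, x)) - T(sample))

for x in [5.0, 50.0, 500.0]:
    print(x, sensitivity_curve(np.mean, sample, x),
          sensitivity_curve(np.median, sample, x))
```

The mean's sensitivity is exactly x − mean(sample), unbounded in the outlier location x; the median's stays bounded, which also reflects the gap in breakdown points (0 for the mean, 1/2 for the median).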


Page 6:

High-dimensional linear models

[Diagram: y (n × 1) = X (n × p) β* (p × 1) + ε (n × 1)]

Linear model: y_i = x_i^T β* + ε_i,  i = 1, …, n

When p ≫ n, assume sparsity: ‖β*‖₀ ≤ k



Page 8:

Robust M-estimators

Generalization of OLS appropriate for robust statistics:

β̂ ∈ arg min_β (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i)

Extensive theory for p fixed, n→∞

[Figure: loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual, from Patrick Breheny, BST 764: Applied Statistical Modeling]

[Figure: Belgian phone calls, 1950–1970, millions of calls per year: least squares vs. Huber and Tukey regression fits, from Patrick Breheny, BST 764: Applied Statistical Modeling]

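As a concrete low-dimensional sketch (illustrative, not the talk's method): the Huber M-estimator can be computed by iteratively reweighted least squares, where an observation with residual r gets weight ψ(r)/r ≤ 1, so large residuals are downweighted relative to OLS:

```python
import numpy as np

def huber_regression(X, y, c=1.345, n_iter=50):
    """Huber M-estimation via iteratively reweighted least squares (IRLS).
    Weight w_i = psi(r_i)/r_i = min(1, c/|r_i|) downweights large residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS start (non-robust)
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.ones_like(r)
        big = np.abs(r) > c
        w[big] = c / np.abs(r[big])               # psi(r)/r for clipped residuals
        XtW = X.T * w                             # X^T diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)  # weighted least squares
    return beta

# Heavy contamination in the response: OLS is pulled away, Huber is not
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.normal(size=n)
y[:20] += 30.0                                    # 10% gross outliers
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hub = huber_regression(X, y)
print(np.linalg.norm(beta_ols - beta_star), np.linalg.norm(beta_hub - beta_star))
```

The cutoff c = 1.345 is the standard tuning for unit error scale; in practice the residual scale would also be estimated robustly.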


Page 10:

Classes of loss functions

Bounded ℓ′ limits the influence of outliers:

IF((x, y); T, F) = lim_{t→0⁺} [T((1 − t)F + tδ_(x,y)) − T(F)] / t ∝ ℓ′(x^T β − y) · x

where F = F_β and T is the M-estimation functional

Redescending M-estimators have a finite rejection point:

ℓ′(u) = 0, for |u| ≥ c

[Figure: loss functions (least squares, absolute value, Huber, Tukey), from Patrick Breheny, BST 764: Applied Statistical Modeling]

But bad for optimization!!



Page 13:

High-dimensional M-estimators

Natural idea: For p > n, use regularized version:

β̂ ∈ arg min_β { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + λ‖β‖₁ }

Complications:

Optimization for nonconvex ℓ?

Statistical theory? Are certain losses provably better than others?



Page 15:

Overview of results

When ‖ℓ′‖∞ < C, global optima of the high-dimensional M-estimator satisfy

‖β̂ − β*‖₂ ≤ C √(k log p / n),

regardless of the distribution of ε_i

Compare to Lasso theory: requires sub-Gaussian ε_i's

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β* satisfy

‖β̂ − β*‖₂ ≤ C′ √(k log p / n)

* in order to verify the RE condition w.h.p., also need Var(ε_i) < cr²

Local optima may be obtained via two-step algorithm



Page 20:

Theoretical insight

Lasso analysis (e.g., van de Geer ’07, Bickel et al. ’08):

β̂ ∈ arg min_β { (1/n)‖y − Xβ‖₂² + λ‖β‖₁ }   (call the objective L_n(β))

Rearranging the basic inequality L_n(β̂) ≤ L_n(β*) and assuming λ ≥ 2‖X^T ε / n‖∞, obtain

‖β̂ − β*‖₂ ≤ cλ√k

Sub-Gaussian assumptions on the x_i's and ε_i's provide O(√(k log p / n)) bounds, minimax optimal

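The choice λ ≍ √(log p / n) can be sanity-checked by simulation: for sub-Gaussian design and noise, ‖X^T ε/n‖∞ concentrates around σ√(2 log p / n) (illustrative sketch; sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 200
X = rng.normal(size=(n, p))        # sub-Gaussian design
eps = rng.normal(size=n)           # sub-Gaussian noise, sigma = 1
stat = np.max(np.abs(X.T @ eps / n))
pred = np.sqrt(2 * np.log(p) / n)  # prediction from the sub-Gaussian maximal inequality
print(stat, pred)                  # the two should be the same order
```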


Page 23:

Theoretical insight

Key observation: For a general loss function, if λ ≥ 2‖X^T ℓ′(ε) / n‖∞, obtain

‖β̂ − β*‖₂ ≤ cλ√k

ℓ′(ε) is sub-Gaussian whenever ℓ′ is bounded ⟹ can achieve estimation error

‖β̂ − β*‖₂ ≤ c √(k log p / n),

without assuming ε_i is sub-Gaussian

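A quick numerical check of this observation (illustrative; the Huber ψ with standard cutoff c = 1.345 stands in for a generic bounded ℓ′): with Cauchy errors, ‖X^T ε/n‖∞ blows up, but ‖X^T ℓ′(ε)/n‖∞ still scales like √(log p / n):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 200
X = rng.normal(size=(n, p))
eps = rng.standard_t(df=1, size=n)      # Cauchy errors: heavy-tailed, not sub-Gaussian
psi_eps = np.clip(eps, -1.345, 1.345)   # bounded l' (Huber psi) => psi(eps) is sub-Gaussian
raw = np.max(np.abs(X.T @ eps / n))
clipped = np.max(np.abs(X.T @ psi_eps / n))
print(raw, clipped, np.sqrt(2 * np.log(p) / n))
```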


Page 26:

Technical challenges

Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general ℓ

When ℓ is nonconvex, local optima β̃ may exist that are not global optima

Want error bounds on ‖β̃ − β*‖₂ as well, or algorithms to find β̂ efficiently



Page 29:

Related work: Nonconvex regularized M-estimators

Composite objective function

β̂ ∈ arg min_{‖β‖₁ ≤ R} { L_n(β) + Σ_{j=1}^p ρ_λ(β_j) }

Assumptions:
- L_n satisfies restricted strong convexity with curvature α (Negahban et al. '12)
- ρ_λ has bounded subgradient at 0, and ρ_λ(t) + µt² is convex
- α > µ

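A standard penalty satisfying these assumptions is SCAD, which appears again in the simulations later in the talk. The sketch below checks the two penalty conditions numerically; the amenability constant µ = 1/(2(a − 1)) is my reading of the convention "ρλ(t) + µt² convex" and should be treated as illustrative:

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty (Fan & Li '01): lam*|t| near zero, a quadratic
    transition on [lam, a*lam], constant beyond a*lam."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2),
    )

lam, a = 1.0, 3.7
mu = 1 / (2 * (a - 1))             # SCAD's most negative curvature is -2*mu
grid = np.linspace(-6.0, 6.0, 2001)

# SCAD itself is nonconvex (negative second differences somewhere),
# but scad(t) + mu*t^2 is convex (second differences all nonnegative).
plain = np.diff(scad(grid, lam, a), 2)
augmented = np.diff(scad(grid, lam, a) + mu * grid ** 2, 2)
print(plain.min(), augmented.min())
```

The bounded-subgradient condition at 0 holds as well, since SCAD behaves like λ|t| near the origin.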


Page 31:

Stationary points (L. & Wainwright ’15)

[Diagram: all stationary points lie within O(√(k log p / n)) of the global optimum β̂]

Stationary points statistically indistinguishable from global optima

⟨∇L_n(β̃) + ∇ρ_λ(β̃), β − β̃⟩ ≥ 0, ∀β feasible

Under suitable distributional assumptions, for λ ≍ √(log p / n) and R ≍ 1/λ,

‖β̃ − β*‖₂ ≤ c √(k log p / n) ≈ statistical error



Page 33:

Mathematical statement

Theorem (L. & Wainwright ’15)

Suppose R is chosen s.t. β* is feasible, and λ satisfies

max{ ‖∇L_n(β*)‖∞, α √(log p / n) } ≲ λ ≲ α/R.

For n ≥ (Cτ²/α²) R² log p, any stationary point β̃ satisfies

‖β̃ − β*‖₂ ≲ λ√k / (α − µ), where k = ‖β*‖₀.

New ingredient for the robust setting: ℓ is convex only in a local region ⟹ need for local consistency results



Page 35:

Local statistical consistency

[Figures: loss functions (least squares, absolute value, Huber, Tukey) and Belgian phone calls, linear vs. robust regression, from Patrick Breheny, BST 764: Applied Statistical Modeling]

Challenge in robust statistics: population-level nonconvexity of the loss ⟹ need for local optimization theory


Page 36:

Local RSC condition

Local RSC condition: For ∆ := β₁ − β₂,

⟨∇L_n(β₁) − ∇L_n(β₂), ∆⟩ ≥ α‖∆‖₂² − τ (log p / n) ‖∆‖₁²,  ∀ ‖β_j − β*‖₂ ≤ r, j = 1, 2

How is such a result possible?


Loss function has directions of both positive and negative curvature.

Negative directions are forbidden by regularizer.

Only requires restricted curvature within a constant-radius region around β*

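For the squared loss the left-hand side equals ‖X∆‖₂²/n, so the condition can be sanity-checked by simulation even with p ≫ n; the constants α and τ below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 400                    # p >> n
X = rng.normal(size=(n, p))
alpha, tau = 0.25, 2.0             # illustrative RSC constants

holds = True
for _ in range(200):               # random sparse directions delta
    delta = np.zeros(p)
    support = rng.choice(p, size=5, replace=False)
    delta[support] = rng.normal(size=5)
    lhs = np.linalg.norm(X @ delta) ** 2 / n   # <grad Ln(b1) - grad Ln(b2), delta> for squared loss
    rhs = (alpha * np.linalg.norm(delta) ** 2
           - tau * (np.log(p) / n) * np.linalg.norm(delta, 1) ** 2)
    holds = holds and (lhs >= rhs)
print(holds)
```

For dense ∆ the term τ (log p / n)‖∆‖₁² dominates, so the inequality holds trivially; the sparse directions above are the binding case.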


Page 38:

Consistency of local stationary points

[Diagram: stationary points within radius r of β* lie within O(√(k log p / n)) of β*]

Theorem (L. ’17)

Suppose L_n satisfies α-local RSC and ρ_λ is µ-amenable, with α > µ. Suppose ‖ℓ′‖∞ ≤ C and λ ≍ √(log p / n). For n ≳ (τ/(α − µ)) k log p, any stationary point β̃ s.t. ‖β̃ − β*‖₂ ≤ r satisfies

‖β̃ − β*‖₂ ≲ λ√k / (α − µ).



Page 40:

Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator

β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + ρ_λ(β) },

where ℓ satisfies α-local RSC, find a stationary point such that ‖β̃ − β*‖₂ ≤ r



Page 42:

Wisdom from Huber

Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ. — Huber 1981, pp. 191–192


Page 43:

Two-step algorithm (L. ’17)

Use composite gradient descent (Nesterov '07): iterative method to solve

β̂ ∈ arg min_{β∈Ω} { L_n(β) + ρ_λ(β) },

L_n differentiable, ρ_λ convex & subdifferentiable

[Figure: L_n and its quadratic surrogate L_n(β^t) + ⟨∇L_n(β^t), β − β^t⟩ + (L/2)‖β − β^t‖₂², whose minimizer is β^{t+1}]

Updates:

β^{t+1} ∈ arg min_{β∈Ω} { L_n(β^t) + ⟨∇L_n(β^t), β − β^t⟩ + (L/2)‖β − β^t‖₂² + ρ_λ(β) }

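When ρ_λ = λ‖·‖₁, each update has a closed form: a gradient step followed by componentwise soft-thresholding (for a nonconvex amenable ρ_λ such as SCAD the prox is a different closed-form shrinkage, but the update has the same structure). A sketch on the squared loss:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: componentwise shrinkage toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def composite_gradient_step(beta, grad, L, lam):
    """One composite gradient update for L_n + lam*||.||_1: the quadratic
    surrogate is minimized in closed form by a soft-thresholded gradient step."""
    return soft_threshold(beta - grad / L, lam / L)

# Demo on the (convex) squared loss, i.e. the Lasso
rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.0, 1.5]
y = X @ beta_star + 0.1 * rng.normal(size=n)

L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of grad L_n
lam = 0.1
beta = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ beta - y) / n
    beta = composite_gradient_step(beta, grad, L, lam)
print(beta[:5])
```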


Page 46:

Two-step algorithm (L. ’17)

Two-step M-estimator: finds local stationary points of nonconvex, robust loss + µ-amenable penalty

β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + ρ_λ(β) }

Algorithm
1. Run composite gradient descent on convex, robust loss + ℓ₁-penalty until convergence; output β̂_H
2. Run composite gradient descent on nonconvex, robust loss + µ-amenable penalty, with input β⁰ = β̂_H

Important: We want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators

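The two steps can be sketched end-to-end (illustrative choices throughout: Huber ψ for the convex stage, Tukey's biweight ψ for the redescending stage, an ℓ₁ penalty in both stages for simplicity, and heavy-tailed t₁ errors):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def psi_huber(u, c=1.345):
    # monotone psi: convex loss
    return np.clip(u, -c, c)

def psi_tukey(u, c=4.685):
    # redescending psi: zero influence beyond the rejection point c
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def prox_gradient(X, y, psi, lam, beta0, step=0.1, n_iter=500):
    """Composite gradient descent for (1/n) sum l(x_i^T beta - y_i) + lam*||beta||_1,
    where psi = l'; the l1 prox is soft-thresholding."""
    beta = beta0.copy()
    n = len(y)
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(4)
n, p, k = 200, 50, 3
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = [2.0, -2.0, 2.0]
y = X @ beta_star + rng.standard_t(df=1, size=n)           # Cauchy-tailed errors
lam = np.sqrt(np.log(p) / n)

beta_h = prox_gradient(X, y, psi_huber, lam, np.zeros(p))  # step 1: convex stage
beta_t = prox_gradient(X, y, psi_tukey, lam, beta_h)       # step 2: warm-started redescending stage
print(np.linalg.norm(beta_h - beta_star), np.linalg.norm(beta_t - beta_star))
```

This is exactly the schedule in Huber's quote above: a monotone ψ first, then iterations with the nonmonotone ψ from that warm start.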


Page 50: Robust high-dimensional linear regression: A statistical perspective

Simulation

[Figure: two panels. Left: ℓ2-error ‖β̂ − β*‖₂ versus n/(k log p) for robust regression losses. Right: empirical variance of the first component versus n/(k log p). Curves for p = 128, 256, 512, with Huber and Cauchy losses.]

ℓ2-error and empirical variance of M-estimators when errors follow a Cauchy distribution (SCAD regularizer)

Can prove geometric convergence of two-step algorithm to desirable local optima (L. ’17)

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 21 / 26

Page 52: Robust high-dimensional linear regression: A statistical perspective

Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions:

bounded influence ⇐⇒ ‖ℓ′‖∞ ≤ C ⇐⇒ O(√(k log p / n)) consistency
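As a quick numerical check of the middle condition: the Huber loss derivative is clamped at the tuning constant c (here the conventional c = 1.345), while the squared-error derivative grows without bound. A toy sketch:

```python
import numpy as np

c = 1.345
r = np.linspace(-100.0, 100.0, 2001)   # residual grid, including gross outliers
psi_huber = np.clip(r, -c, c)          # Huber: l'(r) = max(-c, min(r, c)), bounded
psi_ls = r                             # least squares: l'(r) = r, unbounded

print(np.abs(psi_huber).max())         # 1.345
print(np.abs(psi_ls).max())            # 100.0
```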

Two-step optimization procedure: First step for consistency, second step for efficiency

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 22 / 26

Page 54: Robust high-dimensional linear regression: A statistical perspective

Trailer

Problem: Loss function ℓ is in some sense calibrated to the scale of εᵢ

Better objective (joint location/scale estimator):

(β̂, σ̂) ∈ arg min_{β,σ} { L_n(β, σ) + λ‖β‖₁ },  where L_n(β, σ) = (1/n) ∑_{i=1}^n ℓ((y_i − x_iᵀβ)/σ) · σ + aσ

However, location/scale estimation is notoriously difficult even in low dimensions
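For intuition, the joint objective is easy to evaluate directly: since ℓ(r/σ)σ is the perspective of a convex ℓ, the unpenalized part stays jointly convex in (β, σ). The sketch below uses a Huber ℓ and, for a fixed β, minimizes over σ on a grid; the constant a, the penalty λ, and the grid are arbitrary illustrative choices.

```python
import numpy as np

def huber(r, c=1.345):
    # Huber loss: quadratic near zero, linear in the tails
    t = np.abs(r)
    return np.where(t <= c, 0.5 * r ** 2, c * t - 0.5 * c ** 2)

def objective(beta, sigma, X, y, a_const, lam, c=1.345):
    # L_n(beta, sigma) + lam * ||beta||_1, with
    # L_n(beta, sigma) = (1/n) sum_i l((y_i - x_i' beta)/sigma) * sigma + a*sigma
    Ln = np.mean(huber((y - X @ beta) / sigma, c)) * sigma + a_const * sigma
    return Ln + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)                 # unit-scale errors, true beta = 0
beta = np.zeros(5)
grid = np.linspace(0.2, 5.0, 481)
vals = np.array([objective(beta, s, X, y, a_const=0.45, lam=0.1) for s in grid])
sigma_hat = grid[vals.argmin()]          # lands near the true scale of 1
```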

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 23 / 26

Page 57: Robust high-dimensional linear regression: A statistical perspective

Trailer

Another idea: MM-estimator

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ((y_i − x_iᵀβ)/σ̂₀) + λ‖β‖₁ },

using a robust estimate of scale σ̂₀ based on a preliminary estimate β̂₀

How to obtain (β0, σ0)?

S-estimators/LMS:

β̂₀ ∈ arg min_β σ̂(r(β)),  where σ̂(r) = r_(n−⌊nδ⌋)

LTS:

β̂₀ ∈ arg min_β { (1/n) ∑_{i=1}^{n−⌊nα⌋} (y_i − x_iᵀβ)²_(i) + λ‖β‖₁ },  where (·)²_(i) denotes the i-th smallest squared residual
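The LTS criterion is simple to write down; a hedged sketch (the trimming fraction α and penalty λ are illustrative) shows why it is insensitive to a handful of gross outliers: the largest squared residuals are dropped before summing.

```python
import numpy as np

def lts_objective(beta, X, y, alpha=0.25, lam=0.1):
    # Penalized least trimmed squares: keep only the n - floor(n*alpha)
    # smallest squared residuals, so the largest ones (outliers) never enter
    n = X.shape[0]
    h = n - int(np.floor(n * alpha))
    sq = np.sort((y - X @ beta) ** 2)    # ordered squared residuals
    return sq[:h].sum() / n + lam * np.abs(beta).sum()

rng = np.random.default_rng(2)
n, p = 100, 5
beta_star = np.zeros(p); beta_star[0] = 2.0
X = rng.normal(size=(n, p))
y = X @ beta_star + 0.1 * rng.normal(size=n)
y[:10] += 50.0                           # 10 gross outliers in the response
# The trimmed criterion still prefers beta_star over a bad candidate,
# because the 10 contaminated residuals are discarded by the trimming.
```

Minimizing this objective exactly is combinatorial in which observations are kept; practical solvers alternate between fitting and re-trimming, which is outside this sketch.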

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 24 / 26

Page 61: Robust high-dimensional linear regression: A statistical perspective

Trailer

Maybe an entirely different approach is necessary . . .

Loh (2017). Scale estimation for high-dimensional robust regression.

Coming soon?

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 25 / 26

Page 62: Robust high-dimensional linear regression: A statistical perspective

Thank you!

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 26 / 26