Robust high-dimensional linear regression: A statistical perspective


Transcript of: Robust high-dimensional linear regression: A statistical perspective

Page 1:

Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh

University of Wisconsin - Madison, Departments of ECE & Statistics

STOC workshop on robustness and nonconvexity, Montreal, Canada

June 23, 2017

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 1 / 26

Page 2:

Introduction: Robust regression

Robust statistics introduced in 1960s (Huber, Tukey, Hampel, et al.)

Goals:
1. Develop estimators T(·) that are reliable under deviations from model assumptions
2. Quantify performance with respect to deviations

Local stability captured by influence function

IF(x; T, F) = lim_{t→0} [T((1 − t)F + tδ_x) − T(F)] / t

Global stability captured by breakdown point

ε*(T; X_1, …, X_n) = min{ m/n : sup_{X^m} ‖T(X^m) − T(X)‖ = ∞ }

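These two notions can be illustrated numerically. The sketch below (Python, not from the talk) uses the sensitivity curve n·(T(X ∪ {x}) − T(X)), a finite-sample analogue of the influence function, to contrast the mean and the median:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

def sensitivity_curve(T, sample, x):
    # Finite-sample analogue of IF(x; T, F): contaminate the empirical
    # distribution with a point mass at x, with weight t = 1/(n+1)
    n = len(sample)
    return (n + 1) * (T(np.append(sample, x)) - T(sample))

for x in [5.0, 50.0, 500.0]:
    print(x, sensitivity_curve(np.mean, sample, x),
          sensitivity_curve(np.median, sample, x))
```

The mean's sensitivity is exactly x − mean(sample), unbounded in the outlier location x; the median's stays bounded, which also reflects the gap in breakdown points (0 for the mean, 1/2 for the median).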


Page 6:

High-dimensional linear models

[Diagram: y (n × 1) = X (n × p) β* (p × 1) + ε (n × 1)]

Linear model: y_i = x_i^T β* + ε_i,  i = 1, …, n

When p ≫ n, assume sparsity: ‖β*‖₀ ≤ k



Page 8:

Robust M-estimators

Generalization of OLS appropriate for robust statistics:

β̂ ∈ arg min_β (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i)

Extensive theory for p fixed, n→∞

[Figure: loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual, from Patrick Breheny, BST 764: Applied Statistical Modeling]

[Figure: Belgian phone calls, 1950–1970, millions of calls per year: least squares vs. Huber and Tukey regression fits, from Patrick Breheny, BST 764: Applied Statistical Modeling]

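As a concrete low-dimensional sketch (illustrative, not the talk's method): the Huber M-estimator can be computed by iteratively reweighted least squares, where an observation with residual r gets weight ψ(r)/r ≤ 1, so large residuals are downweighted relative to OLS:

```python
import numpy as np

def huber_regression(X, y, c=1.345, n_iter=50):
    """Huber M-estimation via iteratively reweighted least squares (IRLS).
    Weight w_i = psi(r_i)/r_i = min(1, c/|r_i|) downweights large residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS start (non-robust)
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.ones_like(r)
        big = np.abs(r) > c
        w[big] = c / np.abs(r[big])               # psi(r)/r for clipped residuals
        XtW = X.T * w                             # X^T diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)  # weighted least squares
    return beta

# Heavy contamination in the response: OLS is pulled away, Huber is not
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.normal(size=n)
y[:20] += 30.0                                    # 10% gross outliers
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hub = huber_regression(X, y)
print(np.linalg.norm(beta_ols - beta_star), np.linalg.norm(beta_hub - beta_star))
```

The cutoff c = 1.345 is the standard tuning for unit error scale; in practice the residual scale would also be estimated robustly.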


Page 10:

Classes of loss functions

Bounded ℓ′ limits the influence of outliers:

IF((x, y); T, F) = lim_{t→0⁺} [T((1 − t)F + tδ_(x,y)) − T(F)] / t ∝ ℓ′(x^T β − y) · x

where F = F_β and T is the M-estimation functional

Redescending M-estimators have a finite rejection point:

ℓ′(u) = 0, for |u| ≥ c

[Figure: loss functions (least squares, absolute value, Huber, Tukey), from Patrick Breheny, BST 764: Applied Statistical Modeling]

But bad for optimization!!



Page 13:

High-dimensional M-estimators

Natural idea: For p > n, use regularized version:

β̂ ∈ arg min_β { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + λ‖β‖₁ }

Complications:

Optimization for nonconvex ℓ?

Statistical theory? Are certain losses provably better than others?



Page 15:

Overview of results

When ‖ℓ′‖∞ < C, global optima of the high-dimensional M-estimator satisfy

‖β̂ − β*‖₂ ≤ C √(k log p / n),

regardless of the distribution of ε_i

Compare to Lasso theory: requires sub-Gaussian ε_i's

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β* satisfy

‖β̂ − β*‖₂ ≤ C′ √(k log p / n)

* in order to verify the RE condition w.h.p., also need Var(ε_i) < cr²

Local optima may be obtained via two-step algorithm



Page 20:

Theoretical insight

Lasso analysis (e.g., van de Geer ’07, Bickel et al. ’08):

β̂ ∈ arg min_β { (1/n)‖y − Xβ‖₂² + λ‖β‖₁ }   (call the objective L_n(β))

Rearranging the basic inequality L_n(β̂) ≤ L_n(β*) and assuming λ ≥ 2‖X^T ε / n‖∞, obtain

‖β̂ − β*‖₂ ≤ cλ√k

Sub-Gaussian assumptions on the x_i's and ε_i's provide O(√(k log p / n)) bounds, minimax optimal

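The choice λ ≍ √(log p / n) can be sanity-checked by simulation: for sub-Gaussian design and noise, ‖X^T ε/n‖∞ concentrates around σ√(2 log p / n) (illustrative sketch; sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 200
X = rng.normal(size=(n, p))        # sub-Gaussian design
eps = rng.normal(size=n)           # sub-Gaussian noise, sigma = 1
stat = np.max(np.abs(X.T @ eps / n))
pred = np.sqrt(2 * np.log(p) / n)  # prediction from the sub-Gaussian maximal inequality
print(stat, pred)                  # the two should be the same order
```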


Page 23:

Theoretical insight

Key observation: For a general loss function, if λ ≥ 2‖X^T ℓ′(ε) / n‖∞, obtain

‖β̂ − β*‖₂ ≤ cλ√k

ℓ′(ε) is sub-Gaussian whenever ℓ′ is bounded ⟹ can achieve estimation error

‖β̂ − β*‖₂ ≤ c √(k log p / n),

without assuming ε_i is sub-Gaussian

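A quick numerical check of this observation (illustrative; the Huber ψ with standard cutoff c = 1.345 stands in for a generic bounded ℓ′): with Cauchy errors, ‖X^T ε/n‖∞ blows up, but ‖X^T ℓ′(ε)/n‖∞ still scales like √(log p / n):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 200
X = rng.normal(size=(n, p))
eps = rng.standard_t(df=1, size=n)      # Cauchy errors: heavy-tailed, not sub-Gaussian
psi_eps = np.clip(eps, -1.345, 1.345)   # bounded l' (Huber psi) => psi(eps) is sub-Gaussian
raw = np.max(np.abs(X.T @ eps / n))
clipped = np.max(np.abs(X.T @ psi_eps / n))
print(raw, clipped, np.sqrt(2 * np.log(p) / n))
```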


Page 26:

Technical challenges

Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general ℓ

When ℓ is nonconvex, local optima β̃ may exist that are not global optima

Want error bounds on ‖β̃ − β*‖₂ as well, or algorithms to find β̂ efficiently



Page 29:

Related work: Nonconvex regularized M-estimators

Composite objective function

β̂ ∈ arg min_{‖β‖₁ ≤ R} { L_n(β) + Σ_{j=1}^p ρ_λ(β_j) }

Assumptions:
- L_n satisfies restricted strong convexity with curvature α (Negahban et al. '12)
- ρ_λ has bounded subgradient at 0, and ρ_λ(t) + µt² is convex
- α > µ

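A standard penalty satisfying these assumptions is SCAD, which appears again in the simulations later in the talk. The sketch below checks the two penalty conditions numerically; the amenability constant µ = 1/(2(a − 1)) is my reading of the convention "ρλ(t) + µt² convex" and should be treated as illustrative:

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty (Fan & Li '01): lam*|t| near zero, a quadratic
    transition on [lam, a*lam], constant beyond a*lam."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2),
    )

lam, a = 1.0, 3.7
mu = 1 / (2 * (a - 1))             # SCAD's most negative curvature is -2*mu
grid = np.linspace(-6.0, 6.0, 2001)

# SCAD itself is nonconvex (negative second differences somewhere),
# but scad(t) + mu*t^2 is convex (second differences all nonnegative).
plain = np.diff(scad(grid, lam, a), 2)
augmented = np.diff(scad(grid, lam, a) + mu * grid ** 2, 2)
print(plain.min(), augmented.min())
```

The bounded-subgradient condition at 0 holds as well, since SCAD behaves like λ|t| near the origin.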


Page 31:

Stationary points (L. & Wainwright ’15)

[Diagram: all stationary points lie within O(√(k log p / n)) of the global optimum β̂]

Stationary points statistically indistinguishable from global optima

⟨∇L_n(β̃) + ∇ρ_λ(β̃), β − β̃⟩ ≥ 0, ∀β feasible

Under suitable distributional assumptions, for λ ≍ √(log p / n) and R ≍ 1/λ,

‖β̃ − β*‖₂ ≤ c √(k log p / n) ≈ statistical error



Page 33:

Mathematical statement

Theorem (L. & Wainwright ’15)

Suppose R is chosen s.t. β* is feasible, and λ satisfies

max{ ‖∇L_n(β*)‖∞, α √(log p / n) } ≲ λ ≲ α/R.

For n ≥ (Cτ²/α²) R² log p, any stationary point β̃ satisfies

‖β̃ − β*‖₂ ≲ λ√k / (α − µ), where k = ‖β*‖₀.

New ingredient for the robust setting: ℓ is convex only in a local region ⟹ need for local consistency results



Page 35:

Local statistical consistency

[Figures: loss functions (least squares, absolute value, Huber, Tukey) and Belgian phone calls, linear vs. robust regression, from Patrick Breheny, BST 764: Applied Statistical Modeling]

Challenge in robust statistics: population-level nonconvexity of the loss ⟹ need for local optimization theory


Page 36:

Local RSC condition

Local RSC condition: For ∆ := β₁ − β₂,

⟨∇L_n(β₁) − ∇L_n(β₂), ∆⟩ ≥ α‖∆‖₂² − τ (log p / n) ‖∆‖₁²,  ∀ ‖β_j − β*‖₂ ≤ r, j = 1, 2

How is such a result possible?


Loss function has directions of both positive and negative curvature.

Negative directions are forbidden by regularizer.

Only requires restricted curvature within a constant-radius region around β*

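For the squared loss the left-hand side equals ‖X∆‖₂²/n, so the condition can be sanity-checked by simulation even with p ≫ n; the constants α and τ below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 400                    # p >> n
X = rng.normal(size=(n, p))
alpha, tau = 0.25, 2.0             # illustrative RSC constants

holds = True
for _ in range(200):               # random sparse directions delta
    delta = np.zeros(p)
    support = rng.choice(p, size=5, replace=False)
    delta[support] = rng.normal(size=5)
    lhs = np.linalg.norm(X @ delta) ** 2 / n   # <grad Ln(b1) - grad Ln(b2), delta> for squared loss
    rhs = (alpha * np.linalg.norm(delta) ** 2
           - tau * (np.log(p) / n) * np.linalg.norm(delta, 1) ** 2)
    holds = holds and (lhs >= rhs)
print(holds)
```

For dense ∆ the term τ (log p / n)‖∆‖₁² dominates, so the inequality holds trivially; the sparse directions above are the binding case.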


Page 38:

Consistency of local stationary points

[Diagram: stationary points within radius r of β* lie within O(√(k log p / n)) of β*]

Theorem (L. ’17)

Suppose L_n satisfies α-local RSC and ρ_λ is µ-amenable, with α > µ. Suppose ‖ℓ′‖∞ ≤ C and λ ≍ √(log p / n). For n ≳ (τ/(α − µ)) k log p, any stationary point β̃ s.t. ‖β̃ − β*‖₂ ≤ r satisfies

‖β̃ − β*‖₂ ≲ λ√k / (α − µ).



Page 40:

Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator

β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + ρ_λ(β) },

where ℓ satisfies α-local RSC, find a stationary point such that ‖β̃ − β*‖₂ ≤ r



Page 42:

Wisdom from Huber

Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ. — Huber 1981, pp. 191–192


Page 43:

Two-step algorithm (L. ’17)

Use composite gradient descent (Nesterov '07): iterative method to solve

β̂ ∈ arg min_{β∈Ω} { L_n(β) + ρ_λ(β) },

L_n differentiable, ρ_λ convex & subdifferentiable

[Figure: L_n and its quadratic surrogate L_n(β^t) + ⟨∇L_n(β^t), β − β^t⟩ + (L/2)‖β − β^t‖₂², whose minimizer is β^{t+1}]

Updates:

β^{t+1} ∈ arg min_{β∈Ω} { L_n(β^t) + ⟨∇L_n(β^t), β − β^t⟩ + (L/2)‖β − β^t‖₂² + ρ_λ(β) }

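When ρ_λ = λ‖·‖₁, each update has a closed form: a gradient step followed by componentwise soft-thresholding (for a nonconvex amenable ρ_λ such as SCAD the prox is a different closed-form shrinkage, but the update has the same structure). A sketch on the squared loss:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: componentwise shrinkage toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def composite_gradient_step(beta, grad, L, lam):
    """One composite gradient update for L_n + lam*||.||_1: the quadratic
    surrogate is minimized in closed form by a soft-thresholded gradient step."""
    return soft_threshold(beta - grad / L, lam / L)

# Demo on the (convex) squared loss, i.e. the Lasso
rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.0, 1.5]
y = X @ beta_star + 0.1 * rng.normal(size=n)

L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of grad L_n
lam = 0.1
beta = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ beta - y) / n
    beta = composite_gradient_step(beta, grad, L, lam)
print(beta[:5])
```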


Page 46:

Two-step algorithm (L. ’17)

Two-step M-estimator: finds local stationary points of nonconvex, robust loss + µ-amenable penalty

β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/n) Σ_{i=1}^n ℓ(x_i^T β − y_i) + ρ_λ(β) }

Algorithm
1. Run composite gradient descent on convex, robust loss + ℓ₁-penalty until convergence; output β̂_H
2. Run composite gradient descent on nonconvex, robust loss + µ-amenable penalty, with input β⁰ = β̂_H

Important: We want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators

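The two steps can be sketched end-to-end (illustrative choices throughout: Huber ψ for the convex stage, Tukey's biweight ψ for the redescending stage, an ℓ₁ penalty in both stages for simplicity, and heavy-tailed t₁ errors):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def psi_huber(u, c=1.345):
    # monotone psi: convex loss
    return np.clip(u, -c, c)

def psi_tukey(u, c=4.685):
    # redescending psi: zero influence beyond the rejection point c
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def prox_gradient(X, y, psi, lam, beta0, step=0.1, n_iter=500):
    """Composite gradient descent for (1/n) sum l(x_i^T beta - y_i) + lam*||beta||_1,
    where psi = l'; the l1 prox is soft-thresholding."""
    beta = beta0.copy()
    n = len(y)
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(4)
n, p, k = 200, 50, 3
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = [2.0, -2.0, 2.0]
y = X @ beta_star + rng.standard_t(df=1, size=n)           # Cauchy-tailed errors
lam = np.sqrt(np.log(p) / n)

beta_h = prox_gradient(X, y, psi_huber, lam, np.zeros(p))  # step 1: convex stage
beta_t = prox_gradient(X, y, psi_tukey, lam, beta_h)       # step 2: warm-started redescending stage
print(np.linalg.norm(beta_h - beta_star), np.linalg.norm(beta_t - beta_star))
```

This is exactly the schedule in Huber's quote above: a monotone ψ first, then iterations with the nonmonotone ψ from that warm start.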


Page 50: Robust high-dimensional linear regression: A statistical perspective

Simulation

[Figure: two panels. Left: ℓ2-error ‖β̂ − β*‖₂ versus n/(k log p) for robust regression losses. Right: empirical variance of the first component versus n/(k log p). Curves for p = 128, 256, 512, with Huber and Cauchy losses.]

ℓ2-error and empirical variance of M-estimators when errors follow a Cauchy distribution (SCAD regularizer)

Can prove geometric convergence of two-step algorithm to desirable local optima (L. ’17)

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 21 / 26

Page 52: Robust high-dimensional linear regression: A statistical perspective

Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions:

bounded influence ⇐⇒ ‖ℓ′‖∞ ≤ C ⇐⇒ O(√(k log p / n)) consistency
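As a quick numerical check of the middle condition: the Huber loss derivative is clamped at the tuning constant c (here the conventional c = 1.345), while the squared-error derivative grows without bound. A toy sketch:

```python
import numpy as np

c = 1.345
r = np.linspace(-100.0, 100.0, 2001)   # residual grid, including gross outliers
psi_huber = np.clip(r, -c, c)          # Huber: l'(r) = max(-c, min(r, c)), bounded
psi_ls = r                             # least squares: l'(r) = r, unbounded

print(np.abs(psi_huber).max())         # 1.345
print(np.abs(psi_ls).max())            # 100.0
```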

Two-step optimization procedure: First step for consistency, second step for efficiency

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 22 / 26

Page 54: Robust high-dimensional linear regression: A statistical perspective

Trailer

Problem: Loss function ℓ is in some sense calibrated to the scale of εᵢ

Better objective (joint location/scale estimator):

(β̂, σ̂) ∈ arg min_{β,σ} { L_n(β, σ) + λ‖β‖₁ },  where L_n(β, σ) = (1/n) ∑_{i=1}^n ℓ((y_i − x_iᵀβ)/σ) · σ + aσ

However, location/scale estimation is notoriously difficult even in low dimensions
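For intuition, the joint objective is easy to evaluate directly: since ℓ(r/σ)σ is the perspective of a convex ℓ, the unpenalized part stays jointly convex in (β, σ). The sketch below uses a Huber ℓ and, for a fixed β, minimizes over σ on a grid; the constant a, the penalty λ, and the grid are arbitrary illustrative choices.

```python
import numpy as np

def huber(r, c=1.345):
    # Huber loss: quadratic near zero, linear in the tails
    t = np.abs(r)
    return np.where(t <= c, 0.5 * r ** 2, c * t - 0.5 * c ** 2)

def objective(beta, sigma, X, y, a_const, lam, c=1.345):
    # L_n(beta, sigma) + lam * ||beta||_1, with
    # L_n(beta, sigma) = (1/n) sum_i l((y_i - x_i' beta)/sigma) * sigma + a*sigma
    Ln = np.mean(huber((y - X @ beta) / sigma, c)) * sigma + a_const * sigma
    return Ln + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)                 # unit-scale errors, true beta = 0
beta = np.zeros(5)
grid = np.linspace(0.2, 5.0, 481)
vals = np.array([objective(beta, s, X, y, a_const=0.45, lam=0.1) for s in grid])
sigma_hat = grid[vals.argmin()]          # lands near the true scale of 1
```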

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 23 / 26

Page 57: Robust high-dimensional linear regression: A statistical perspective

Trailer

Another idea: MM-estimator

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ((y_i − x_iᵀβ)/σ̂₀) + λ‖β‖₁ },

using a robust estimate of scale σ̂₀ based on a preliminary estimate β̂₀

How to obtain (β0, σ0)?

S-estimators/LMS:

β̂₀ ∈ arg min_β σ̂(r(β)),  where σ̂(r) = r_(n−⌊nδ⌋)

LTS:

β̂₀ ∈ arg min_β { (1/n) ∑_{i=1}^{n−⌊nα⌋} (y_i − x_iᵀβ)²_(i) + λ‖β‖₁ },  where (·)²_(i) denotes the i-th smallest squared residual
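The LTS criterion is simple to write down; a hedged sketch (the trimming fraction α and penalty λ are illustrative) shows why it is insensitive to a handful of gross outliers: the largest squared residuals are dropped before summing.

```python
import numpy as np

def lts_objective(beta, X, y, alpha=0.25, lam=0.1):
    # Penalized least trimmed squares: keep only the n - floor(n*alpha)
    # smallest squared residuals, so the largest ones (outliers) never enter
    n = X.shape[0]
    h = n - int(np.floor(n * alpha))
    sq = np.sort((y - X @ beta) ** 2)    # ordered squared residuals
    return sq[:h].sum() / n + lam * np.abs(beta).sum()

rng = np.random.default_rng(2)
n, p = 100, 5
beta_star = np.zeros(p); beta_star[0] = 2.0
X = rng.normal(size=(n, p))
y = X @ beta_star + 0.1 * rng.normal(size=n)
y[:10] += 50.0                           # 10 gross outliers in the response
# The trimmed criterion still prefers beta_star over a bad candidate,
# because the 10 contaminated residuals are discarded by the trimming.
```

Minimizing this objective exactly is combinatorial in which observations are kept; practical solvers alternate between fitting and re-trimming, which is outside this sketch.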

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 24 / 26

Page 61: Robust high-dimensional linear regression: A statistical perspective

Trailer

Maybe an entirely different approach is necessary . . .

Loh (2017). Scale estimation for high-dimensional robust regression.

Coming soon?

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 25 / 26

Page 62: Robust high-dimensional linear regression: A statistical perspective

Thank you!

Po-Ling Loh (UW-Madison) Robust high-dimensional regression June 23, 2017 26 / 26