
Transcript of: Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

Page 1

Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

Tomoyuki Obuchi¹ and Ayaka Sakata²

¹ Dept. of Math. and Comp. Sci., Tokyo Tech.
² The Institute of Statistical Mathematics

TO, AS: arXiv:1902.10375

Page 2

• Penalized linear regression

$$\hat{x}(\eta) = \operatorname*{argmin}_{x}\left\{ \frac{1}{2}\|y - Ax\|_2^2 + J(x;\eta) \right\}$$

• Representative penalty: the $\ell_p$ norm,

$$J(x;\eta=\lambda) = \lambda\|x\|_p^p$$

• $p \ge 1$: convex • $p \le 1$: sparsity-inducing

→ p = 1 enjoys both properties and is nice for variable selection (LASSO)
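To make the p = 1 case concrete, here is a minimal sketch (ours, not from the slides) that fits a LASSO on a random sparse instance with scikit-learn; the sizes, seed, and regularization value are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M = 200, 100                        # dimension and number of measurements
A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
x0 = np.zeros(N)
x0[:20] = rng.normal(size=20)          # sparse ground truth
y = A @ x0 + 0.01 * rng.normal(size=M)

# sklearn's Lasso minimizes (1/(2M)) ||y - Ax||_2^2 + alpha * ||x||_1,
# i.e. alpha corresponds to lambda / M in the slides' convention.
fit = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000).fit(A, y)
print("nonzeros:", np.count_nonzero(fit.coef_))
```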

Page 3

Statistical bias in LASSO

• LASSO for 1-dimensional estimation:

$$\hat{\theta} = \operatorname*{argmin}_{\theta}\left\{ \frac{1}{2\sigma^2}(\theta - w)^2 + \lambda|\theta| \right\}$$

[Figure: the 1-D LASSO estimator θ̂ plotted against w; outside a flat region around the origin the curve runs parallel to, but below, the identity line, i.e. a constant bias.]
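The 1-D problem above has the closed-form soft-thresholding solution; a small sketch (standard result, our code) that makes the constant bias visible:

```python
import numpy as np

def lasso_1d(w, lam, sigma2=1.0):
    """argmin_theta { (theta - w)^2 / (2 sigma2) + lam * |theta| }."""
    return np.sign(w) * np.maximum(np.abs(w) - lam * sigma2, 0.0)

w = np.linspace(-4.0, 4.0, 9)
print(w)
print(lasso_1d(w, lam=1.0))   # zero near the origin; outside, shifted toward
                              # zero by the constant lam*sigma2 (the bias)
```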

Page 4

Piecewise continuous nonconvex penalty (PCNP)

• p < 1 can reduce the bias, but…
  • Nonconvexity → possible local minima
  • Discontinuity of the estimator → algorithmic instability

• Two representative PCNPs:
  • Smoothly Clipped Absolute Deviation (SCAD) penalty
  • Minimax Concave Penalty (MCP)
  • Nonconvex, but the estimator is continuous

We hereafter focus only on SCAD.

Page 5

SCAD estimator

• SCAD penalty, $\eta = \{a, \lambda\}$:

$$J(\theta;\eta) = \begin{cases} \lambda|\theta| & (|\theta| \le \lambda) \\[4pt] -\dfrac{\theta^2 - 2a\lambda|\theta| + \lambda^2}{2(a-1)} & (\lambda < |\theta| \le a\lambda) \\[4pt] \dfrac{(a+1)\lambda^2}{2} & (|\theta| > a\lambda) \end{cases}$$

[Figure: J(x) versus x for (a=5, λ=1), (a=3, λ=1), (a=2, λ=1); the penalty grows linearly near the origin and saturates at (a+1)λ²/2.]

• E.g. the 1-D estimator:

$$\hat{\theta} = \operatorname*{argmin}_{\theta}\left\{ \frac{1}{2\sigma^2}(\theta - w)^2 + J(\theta;\eta) \right\}$$

• SCAD estimator: continuous, and no bias (for large |w|)

[Figure: θ̂ versus w; the curve is continuous and coincides with the identity for large |w|.]
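As a sketch (ours), the SCAD penalty and the closed-form 1-D SCAD estimator of Fan & Li (2001); we simplify to σ² = 1 and assume a > 2. The output is continuous in w and equals w beyond aλ, i.e. no bias there:

```python
import numpy as np

def scad_penalty(theta, lam, a):
    """SCAD penalty J(theta; eta = {a, lam})."""
    t = np.abs(theta)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def scad_1d(w, lam, a):
    """argmin_theta { (theta - w)^2 / 2 + J(theta) }, closed form for a > 2."""
    t = np.abs(w)
    soft = np.sign(w) * np.maximum(t - lam, 0.0)           # |w| <= 2*lam
    mid = ((a - 1) * w - np.sign(w) * a * lam) / (a - 2)   # 2*lam < |w| <= a*lam
    return np.where(t <= 2 * lam, soft, np.where(t <= a * lam, mid, w))

w = np.linspace(-6.0, 6.0, 13)
print(scad_1d(w, lam=1.0, a=3.0))   # continuous; equals w for |w| > a*lam
```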

Page 6

Our Contributions

• Clarifying the region where local minima emerge
  • Phase transition (with replica symmetry breaking)

• Quantitative analysis of reconstruction performance
  • SCAD outperforms LASSO in the weak-noise region

• Developing an approximate CV formula
  • Fast CV becomes possible
  • A method to avoid the unstable parameter region

Page 7

Contents

1. Analytical performance analysis on simulated data

2. Approximate CV formula

3. Numerical experiments


Page 9

Problem Setting

• Generative process (all i.i.d.; these distributional assumptions are crucial for the analysis):

$$y = Ax^0 + \Delta, \qquad \Delta_\mu \sim \mathcal{N}(0, \sigma_\Delta^2),$$
$$x_i^0 \sim (1-\rho_0)\,\delta(x_i^0) + \rho_0\,\mathcal{N}(0, \sigma_x^2), \qquad A_{\mu i} \sim \mathcal{N}(0, N^{-1})$$

• Quantities of interest:

$$\epsilon_y = \frac{1}{2M}\|y - \hat{y}\|_2^2, \quad \hat{y} = A\hat{x} \quad \text{(output MSE)}$$
$$\epsilon_x = \frac{1}{2N}\|x^0 - \hat{x}\|_2^2 \quad \text{(input MSE)}$$

TP, FP: true and false positive rates of the support $S = \{i \mid x_i^0 \ne 0\}$

We investigate the typical values of these in the high-dimensional limit $N \to \infty$ with $\alpha = M/N = O(1)$.
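A direct transcription (ours) of this generative process; the sizes, seed, and the value of σx² are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
alpha, rho0 = 0.5, 0.2            # measurement ratio and sparsity
sigma2_x, sigma2_D = 1.0, 0.01    # signal and noise variances
M = int(alpha * N)

x0 = rng.normal(0.0, np.sqrt(sigma2_x), size=N) * (rng.random(N) < rho0)
A = rng.normal(0.0, np.sqrt(1.0 / N), size=(M, N))   # columns ~ unit norm
y = A @ x0 + rng.normal(0.0, np.sqrt(sigma2_D), size=M)

S = np.flatnonzero(x0)            # true support
print(len(S), "nonzeros out of", N)
```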

Page 10

Stat. Mech. Formulation

• Hamiltonian, Boltzmann distribution, partition function:

$$H(x) = \frac{1}{2}\|y - Ax\|_2^2 + J(x;\eta), \qquad P(x) = \frac{1}{Z}e^{-\beta H(x)}, \qquad Z = \int dx\, e^{-\beta H(x)}$$

In the limit $\beta \to \infty$, $P(x) \to \delta(x - \hat{x}(\eta \mid y, A))$: the measure concentrates on the solution of the original problem.

• Computing the "free energy" (moment-generating function) $f(\beta)$, averaged with respect to $y$ and $A$:

$$-\beta f(\beta) = \frac{1}{N}\left[\log Z\right]_{y,A}$$

• Any quantity of interest can be computed from $f(\beta)$.

However, the average with respect to $y$ and $A$ cannot be performed directly… ← Replica method (with the replica symmetric assumption)

Page 11

Equations to be solved

• Replica symmetric (RS) free energy:

$$f(\beta \to \infty) = \operatorname*{Extr}_{\Omega,\hat{\Omega}}\left\{ \frac{Q - 2m + \rho_0\sigma_x^2 + \alpha\sigma_\Delta^2}{2(1+\chi/\alpha)} + m\hat{m} - \frac{Q\hat{Q} - \chi\hat{\chi}}{2} + \frac{\xi(\hat{\chi};\hat{Q})}{2} \right\}$$

with order parameters $\Omega = \{Q, \chi, m\}$, $\hat{\Omega} = \{\hat{Q}, \hat{\chi}, \hat{m}\}$, and

$$\xi(\hat{\chi};\hat{Q}) \equiv 2\int Dz\, L(\sqrt{\hat{\chi}}\,z;\hat{Q}), \qquad L(h;\hat{Q}) \equiv \min_x\left\{ \frac{\hat{Q}}{2}x^2 - hx + J(x;\eta) \right\},$$

$$\int Dz\,(\cdots) \equiv \int_{-\infty}^{\infty}\frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2}\,(\cdots), \qquad x^*(h;\hat{Q}^{-1}) = \operatorname*{argmin}_x\left\{ \frac{\hat{Q}}{2}x^2 - hx + J(x;\eta) \right\}.$$

• At the extremum:

$$\epsilon_x = \frac{1}{2}\left( \rho_0\sigma_x^2 - 2m + Q \right), \qquad \epsilon_y = \frac{\hat{\chi}}{2},$$

$$\mathrm{TP} = \int Dz\; \mathbb{1}\!\left[ x^*(\hat{\sigma}_+ z;\hat{Q}^{-1}) \ne 0 \right], \qquad \mathrm{FP} = \int Dz\; \mathbb{1}\!\left[ x^*(\hat{\sigma}_- z;\hat{Q}^{-1}) \ne 0 \right],$$

where $\hat{\sigma}_- = \sqrt{\hat{\chi}}$, $\hat{\sigma}_+ = \sqrt{\hat{\chi} + \hat{m}^2\sigma_x^2}$, and averages over the effective field width are taken with $P(\hat{\sigma}) = (1-\rho_0)\,\delta(\hat{\sigma} - \hat{\sigma}_-) + \rho_0\,\delta(\hat{\sigma} - \hat{\sigma}_+)$.
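Everything above reduces to one-dimensional scalar problems plus Gaussian integrals. A brute-force numerical sketch (ours, with a naive grid minimizer and Gauss-Hermite quadrature; not the paper's code) of how x*, TP and FP could be evaluated once saddle-point values (Q̂, χ̂, m̂) are in hand:

```python
import numpy as np

def scad_penalty(theta, lam, a):
    t = np.abs(theta)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

GRID = np.linspace(-30.0, 30.0, 120001)   # crude grid; refine as needed

def x_star(h, Qhat, lam, a):
    """Brute-force argmin_x { (Qhat/2) x^2 - h x + J(x; lam, a) }."""
    vals = 0.5 * Qhat * GRID**2 - h * GRID + scad_penalty(GRID, lam, a)
    return GRID[np.argmin(vals)]

# probabilists' Gauss-Hermite nodes: integral Dz f(z) ~ sum_k w_k f(z_k)
z, w = np.polynomial.hermite_e.hermegauss(101)
w = w / np.sqrt(2.0 * np.pi)

def tp_fp(Qhat, chihat, mhat, sigma2_x, lam, a):
    sm = np.sqrt(chihat)                       # sigma_hat_minus
    sp = np.sqrt(chihat + mhat**2 * sigma2_x)  # sigma_hat_plus
    tp = sum(wk for zk, wk in zip(z, w) if x_star(sp * zk, Qhat, lam, a) != 0.0)
    fp = sum(wk for zk, wk in zip(z, w) if x_star(sm * zk, Qhat, lam, a) != 0.0)
    return tp, fp

print(tp_fp(Qhat=1.0, chihat=0.5, mhat=0.8, sigma2_x=1.0, lam=1.0, a=3.0))
# the arguments above are illustrative numbers only, not actual saddle values
```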

Page 12

Stability and Multiple Solutions

• The RS solution is sometimes unstable.
• The instability can be signaled by a formula (not shown here): the spin-glass transition, or Almeida-Thouless (AT) instability.
• In the RSB (replica symmetry breaking) phase, exponentially many (w.r.t. N) local minima exist.

[Figure: cartoon of the cost landscape, ε_y versus configuration, in the RS phase (single valley) and the RSB phase (many valleys), separated by the SG transition.]

Page 13: Cross validation in sparse linear regression with ...

10-1

100

101

2

3

4

5

6

7

a

=0.5, 0=0.2,

2=0.01

Phase diagrams

100

101

2

6.5

11

15.5

20

a

=0.5, 0=0.2,

2=1

• that gives minimum input MSE is in stable region. • For large noise, LASSO is sufficient.

(λ, a)

13/28

Green line: Minimum of input MSE for each

Green dot: Minimum of input MSE along the green line

Blue line: AT line (Above the line, our analysis is stable)

λ

Page 14

ROC curve

• Receiver operating characteristic (ROC) curve: plot of TP against FP.

• A criterion: the "optimal point" on the ROC curve is the minimum of

$$R(\eta) = (\mathrm{TP}(\eta) - 1)^2 + (\mathrm{FP}(\eta) - 0)^2, \qquad \eta = \{\lambda, a\}$$

At the optimal point, the support recovery error is expected to be minimized.

• Here, we identify the optimal value of λ at a fixed value of a, and compare it with the λ that gives the minimum input MSE.
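The criterion as code (a trivial sketch of ours; `tp_fp_of` is a hypothetical callable standing in for whatever produces the rates, theory or simulation):

```python
import numpy as np

def optimal_eta(lams, aas, tp_fp_of):
    """Grid search for the eta = (lam, a) minimizing R(eta)."""
    best, best_R = None, np.inf
    for lam in lams:
        for a in aas:
            tp, fp = tp_fp_of(lam, a)
            R = (tp - 1.0) ** 2 + (fp - 0.0) ** 2   # distance to corner (0, 1)
            if R < best_R:
                best, best_R = (lam, a), R
    return best, best_R
```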

Page 15

ROC curve

[Figure: ROC curves (TP versus FP) for α = 0.5, ρ0 = 0.2, with σΔ² = 0.1 (left) and σΔ² = 0.0001 (right); markers show the minimum of R and the minimum of εx.]

• The minimum locations of the input MSE and of R are close. This property is absent in LASSO [Obuchi and Kabashima, JSTAT (2016)].

• The input MSE is unknown in general settings, but it is related to the cross-validation (CV) error; hence we may minimize the CV error to determine the optimal support.

Page 16

Verification of Theoretical Result

[Figure: input MSE and ROC curve, theory versus numerical simulation (N = 1000, 10 samples).]

The analytically derived lines match the numerical simulations in the RS phase.

Page 17

Contents

1. Analytical performance analysis on simulated data

2. Approximate CV formula

3. Numerical experiments

Page 18

LOOCV and Linear Approximation

• Leave-one-out CV (LOOCV). Define:

$$\mathcal{H}(x \mid D) \equiv \frac{1}{2}\sum_\mu \Big( y_\mu - \sum_i A_{\mu i}x_i \Big)^2 + J(x;\eta), \qquad x^{\backslash\mu} = \operatorname*{argmin}_x \mathcal{H}(x \mid D^{\backslash\mu}),$$

$$\epsilon_{\mathrm{LOO}}(\eta) = \frac{1}{2M}\sum_\mu \Big( y_\mu - \sum_i A_{\mu i}x_i^{\backslash\mu}(\eta) \Big)^2 \quad \leftarrow \text{requires } M \text{ full refits: large cost!}$$

• Approximation: expand with respect to $d = \hat{x} - \hat{x}^{\backslash\mu}$. Treating the left-out term as a perturbing field $h_\mu(x)$,

$$\mathcal{H}(x \mid D) - \mathcal{H}(x^{\backslash\mu} \mid D^{\backslash\mu}) \sim \sum_\mu d^{\top} h_\mu(x), \qquad x^{\backslash\mu} \sim \hat{x} - \chi^{\backslash\mu} h_\mu(\hat{x}), \quad \chi^{\backslash\mu} = \frac{\partial x^{\backslash\mu}}{\partial h}.$$
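For contrast with the formula on the next slide, the literal LOOCV is just M full refits. A sketch (ours) with scikit-learn's Lasso standing in for the solver (the talk itself uses SCAD + CCD):

```python
import numpy as np
from sklearn.linear_model import Lasso

def literal_loocv(A, y, alpha):
    """epsilon_LOO by brute force: one full refit per left-out row."""
    M = len(y)
    err = 0.0
    for mu in range(M):
        keep = np.arange(M) != mu
        fit = Lasso(alpha=alpha, fit_intercept=False,
                    max_iter=100_000).fit(A[keep], y[keep])
        err += (y[mu] - A[mu] @ fit.coef_) ** 2
    return err / (2 * M)
```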

Page 19

Approximate CV Formula

• Approximate CV formula:

$$\epsilon_{\mathrm{LOO}} \approx \frac{1}{2M}\sum_{\mu=1}^{M} \Theta_\mu \left( y_\mu - a_\mu^{\top}\hat{x} \right)^2,$$

$$\Theta_\mu = \left( 1 - (a_\mu)_{S_A}^{\top} \Big( A_{S_A}^{\top} A_{S_A} + \big[\partial^2 J(\hat{x}_{S_A};\eta)\big]_{S_A S_A} \Big)^{-1} (a_\mu)_{S_A} \right)^{-2},$$

where $S_A$ is the support of $\hat{x}$ and the matrix in the inverse is the cost function's Hessian restricted to the support. The formula is computable from the full solution $\hat{x}$ alone.

• Delicate points:
  • Invariance of the support between the full and LOO solutions is assumed (approximately correct; exact in the limit N → ∞).
  • Regularity of the cost-function Hessian is assumed; it is actually violated in the RSB phase.
  • The computational cost is O(|S_A|³).
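A sketch (ours) of the formula. For SCAD, the penalty's second derivative on the support is 0 for |x| < λ, −1/(a−1) for λ < |x| < aλ, and 0 for |x| > aλ; ignoring the measure-zero kink points and assuming a nonempty support are our simplifications:

```python
import numpy as np

def scad_d2(x, lam, a):
    """Second derivative of the SCAD penalty away from its kinks."""
    t = np.abs(x)
    return np.where((t > lam) & (t < a * lam), -1.0 / (a - 1.0), 0.0)

def approx_loocv(A, y, xhat, lam, a):
    S = np.flatnonzero(xhat)                  # support S_A (assumed nonempty)
    AS = A[:, S]
    H = AS.T @ AS + np.diag(scad_d2(xhat[S], lam, a))  # Hessian on support
    G = AS @ np.linalg.solve(H, AS.T)         # diag holds a_mu^T H^{-1} a_mu
    theta = (1.0 - np.diag(G)) ** (-2.0)      # the Theta_mu factors
    r = y - A @ xhat                          # full-solution residuals
    return np.mean(theta * r**2) / 2.0        # (1/2M) sum Theta_mu r_mu^2
```

The single linear solve on the support is where the O(|S_A|³) cost quoted above comes from.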

Page 20

Contents

1. Analytical performance analysis on simulated data

2. Approximate CV formula

3. Numerical experiments

Page 21

Experimental Setting

• Generative process: identical to the theoretical setting (all i.i.d.):

$$y = Ax^0 + \Delta, \quad \Delta_\mu \sim \mathcal{N}(0,\sigma_\Delta^2), \quad x_i^0 \sim (1-\rho_0)\,\delta(x_i^0) + \rho_0\,\mathcal{N}(0,\sigma_x^2), \quad A_{\mu i} \sim \mathcal{N}(0, N^{-1})$$

• Optimization algorithm: Cyclic Coordinate Descent (CCD), coordinate-wise updates optimizing the cost function (see the sketch below).

• A technique: λ annealing, i.e. pathwise optimization with gradually changing λ.
  • Faster convergence
  • Robust solutions even in the RSB region
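A sketch (ours) of CCD with λ annealing. We assume approximately unit-norm columns, which holds for the N(0, 1/N) design above, so each coordinate update is the exact 1-D SCAD problem; the schedule, sweep count, and tolerance are arbitrary:

```python
import numpy as np

def scad_1d(w, lam, a):
    """1-D SCAD estimator (Fan & Li closed form; assumes a > 2)."""
    t = np.abs(w)
    soft = np.sign(w) * np.maximum(t - lam, 0.0)
    mid = ((a - 1) * w - np.sign(w) * a * lam) / (a - 2)
    return np.where(t <= 2 * lam, soft, np.where(t <= a * lam, mid, w))

def ccd(A, y, lam, a, x0=None, n_sweeps=500, tol=1e-8):
    """Cyclic coordinate descent; assumes (approximately) unit-norm columns."""
    M, N = A.shape
    x = np.zeros(N) if x0 is None else x0.copy()
    r = y - A @ x                        # running residual
    for _ in range(n_sweeps):
        dmax = 0.0
        for i in range(N):
            w = x[i] + A[:, i] @ r       # effective 1-D observation
            xi = float(scad_1d(w, lam, a))
            r += A[:, i] * (x[i] - xi)   # keep residual consistent
            dmax = max(dmax, abs(xi - x[i]))
            x[i] = xi
        if dmax < tol:
            break
    return x

def ccd_annealed(A, y, lams, a):
    """Pathwise CCD: sweep lambda from large to small with warm starts."""
    x, path = None, []
    for lam in sorted(lams, reverse=True):
        x = ccd(A, y, lam, a, x0=x)
        path.append((lam, x.copy()))
    return path
```

The warm start along the path is what makes the annealed solver both faster and less likely to be trapped by local minima of the nonconvex cost.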

Page 22

Approx. CV: Sample Dependence

SCAD parameter a = 3; α = 0.5, ρ0 = 0.2, σΔ² = 0.1, N = 100

[Figure: approximate versus literal CV error along the λ path for Sample No. 1 and Sample No. 4; error bars are over the components of the data.]

• The CV error fluctuates from sample to sample.
• The approximate CV error is valid in the RS phase for both samples.

Page 23

Approx. CV: Sample Dependence

SCAD parameter a = 4; α = 0.5, ρ0 = 0.2, σΔ² = 0.1, N = 100

[Figure: the same comparison for Sample No. 1 and Sample No. 4 at a = 4; error bars are over the components of the data.]

The sample dependence becomes milder as the SCAD parameter a increases.

Page 24

"Phase Diagram" for Given Data

• A "phase" is defined for the infinite set of samples distributed according to a probability distribution.

• In practical problems:
  • An appropriate parameter region for the given data is required.
  • In particular, for finite-size systems the sample dependence is large.

• We propose a method to obtain a "phase diagram" for given data; in other words, we identify the parameter region that should be ruled out as a candidate.

Page 25

Approx. CV: Instability Detection for the "Phase Diagram"

SCAD parameter a = 3; α = 0.5, ρ0 = 0.2, σΔ² = 0.1, N = 100; Sample No. 4

We use our approximate CV formula to detect the "RSB" region:

• Detect "irregular" data points along the λ path.
• Find the maximum λ among the irregular data points.
• Any λ smaller than this maximum is inappropriate, in the sense that the instability appears there. (One possible detection rule is sketched below.)
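The slide does not spell out the detection rule, so the following is only our reading: flag points on the λ path where the support Hessian entering the approximate-CV formula loses positive definiteness (the regularity condition stated earlier), then rule out every λ below the largest flagged one:

```python
import numpy as np

def scad_d2(x, lam, a):
    t = np.abs(x)
    return np.where((t > lam) & (t < a * lam), -1.0 / (a - 1.0), 0.0)

def irregular(A, xhat, lam, a):
    """Heuristic flag (our assumption): support Hessian not positive definite."""
    S = np.flatnonzero(xhat)
    if len(S) == 0:
        return False
    AS = A[:, S]
    H = AS.T @ AS + np.diag(scad_d2(xhat[S], lam, a))
    return np.linalg.eigvalsh(H).min() <= 0.0

def lambda_cutoff(A, path, a):
    """path: list of (lam, xhat) pairs; returns the largest flagged lambda."""
    bad = [lam for lam, xh in path if irregular(A, xh, lam, a)]
    return max(bad) if bad else None
```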

Page 26

Corresponding "phase diagram" for Sample No. 4

[Figure: the (λ, a) plane for α = 0.5, ρ0 = 0.2, σΔ² = 0.1, N = 100, sample ID 4.]

Blue line: AT line (RS-RSB transition)
Black region: "RSB region" for Sample No. 4
Green line: minimum of CV error

What happens in the "RSB" region (black)? Starting from 10 different initial conditions without annealing, the literal CV value fluctuates across initializations in the "RSB" region.

Page 27

Application to Supernovae Data Analysis

Type Ia supernovae dataset (http://heracles.astro.berkeley.edu/sndb/), a = 4

[Figure: CV error versus λ for a = 4, showing the approximate CV error, the literal CV error and its minimum, the instability ("phase") boundary, and the minimum of the CV error within the stable region.]

Page 28

Application to Supernovae Data Analysis

[Figure: as functions of the SCAD parameter a: (left) K, the number of parameters in the model; (right) the CV error. Both panels mark the CV-error minimum.]

Sparsest model within the one-sigma rule: K = 3
← Identical to the solution of a Monte Carlo method solving the L0 problem (TO et al., 2016, 2018).

Our method is consistent with the L0 result while SCAD is computationally much more reasonable.
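The one-sigma (one-standard-error) rule as a sketch (ours): among candidates whose CV error lies within one standard error of the minimum, take the sparsest; the arrays below are hypothetical per-candidate values, not the paper's data:

```python
import numpy as np

def one_sigma_choice(cv_mean, cv_se, K):
    """Sparsest model whose CV error is within one SE of the minimum."""
    cv_mean, cv_se, K = map(np.asarray, (cv_mean, cv_se, K))
    i_min = np.argmin(cv_mean)
    ok = cv_mean <= cv_mean[i_min] + cv_se[i_min]   # admissible candidates
    return int(K[ok].min())                         # sparsest among them

print(one_sigma_choice([0.017, 0.016, 0.015, 0.018],
                       [0.002, 0.002, 0.002, 0.002],
                       [3, 5, 8, 10]))              # -> 3
```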

Page 29

Summary

• Theoretical analysis of the SCAD estimator in linear regression
  • Emergence of local minima = phase transition with RSB
  • Analytical evidence that SCAD outperforms LASSO

• Invention of an approximate CV formula
  • The scaling is O(N³), but still practical over a wide range of N
  • Approximate-CV instability ↔ local minima / RSB: instability detection in the CV formula also signals RSB
  • Numerical results fully support the theoretical analysis
  • A MATLAB package of the approximate CV formula + CCD algorithm: https://github.com/T-Obuchi/SLRpackage_AcceleratedCV_matlab

• Future work
  • Characterization of the λ-annealed solution path
  • Applications; different models (non-L2 cost functions)

TO, AS: arXiv:1902.10375