Joint estimation of the conditional mean and the conditional variance in high dimensions
Joint work with M. Hebiri, K. Meziani, J. Salmon
SAPS IX, Le Mans, FRANCE
Arnak S. Dalalyan
ENSAE / CREST / GENES
I. Problem presentation
Continuous-time autoregression
Observations : (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) W_t,   t ∈ T = [0, T],

where b* : R^d → R and s* : R^d → R_+ are such that

    Drift : b*(x_t) = E[y_t | x_t].
    Volatility : s*²(x_t) = Var[y_t | x_t].

(W_t)_{t≥0} is a standard Wiener process (w.r.t. a filtration (F_t)_{t≥0}).

The goal is to estimate the function b*.
© Dalalyan, A.S., Dec. 19, 2012
Discrete time (auto-)regression
Observations : a finite collection (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) ξ_t,   t ∈ T = {1, …, T},

where b* : R^d → R and s* : R^d → R_+ are such that

    Conditional mean : E[y_t | x_t] = b*(x_t).
    Conditional variance : Var[y_t | x_t] = s*²(x_t).

Therefore the ξ_t's satisfy E[ξ_t | x_t] = 0 and Var[ξ_t | x_t] = 1. They are often assumed Gaussian N(0, 1) for simplicity.
The goal is to estimate the functions b∗ and s∗.
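As a sanity check on this model, the data-generating process is easy to simulate. A minimal sketch, where the particular choices b*(x) = sin(x), s*(x) = 0.5 + 0.4 cos²(x) and d = 1 are made up purely for illustration :

```python
import math
import random

random.seed(0)

# Hypothetical choices of b* and s*, for illustration only
def b_star(x):           # conditional mean
    return math.sin(x)

def s_star(x):           # conditional standard deviation, always > 0
    return 0.5 + 0.4 * math.cos(x) ** 2

T = 10_000
data = []
for _ in range(T):
    x = random.uniform(-3.0, 3.0)
    xi = random.gauss(0.0, 1.0)        # E[xi | x] = 0, Var[xi | x] = 1
    y = b_star(x) + s_star(x) * xi     # y_t = b*(x_t) + s*(x_t) xi_t
    data.append((x, y))
```

By symmetry of this made-up b*, E[y_t] = 0 here, so the empirical mean of the simulated y_t should be close to zero.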
Sparsity Assumption
• In these settings, estimating b* and s* under no further assumption is an ill-posed problem.

• Sparsity scenario : b* and s* belong to some low-dimensional spaces.

Example : homoscedastic regression

    ∀x,  b*(x) = Σ_{j=1}^p f_j(x) β*_j = [f_1(x), …, f_p(x)] β*,   and   s*(x) ≡ σ*

    ↪→ Dictionary {f_1, …, f_p} of functions from R^d to R
    ↪→ Unknown vector (β*, σ*) ∈ R^p × R, with β* a sparse vector
    ↪→ Sparsity index : p* = |β*|_0 := Σ_{j=1}^p 1(β*_j ≠ 0), with p* ≪ p
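The sparsity index is simply the number of nonzero coordinates of β*; a trivial sketch, with a made-up coefficient vector :

```python
# Sparsity index p* = |beta*|_0 = sum_j 1(beta*_j != 0)
def sparsity_index(beta):
    return sum(1 for b in beta if b != 0)

# Made-up example : p = 8 dictionary functions, only 2 of them active (p* << p)
beta_star = [0.0, 1.3, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0]
p_star = sparsity_index(beta_star)  # p* = 2
```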
Main assumptions on b∗ and s∗
Re-parametrize by the inverse of the conditional volatility s* :

    r*(x) = 1 / s*(x)   and   f*(x) = b*(x) / s*(x).

Group Sparsity Assumption

For p given functions f_1, …, f_p mapping R^d into R, there is a vector φ* ∈ R^p such that f*(x) = Σ_{j=1}^p φ*_j f_j(x). Furthermore, for a given partition G_1, …, G_K of {1, …, p}, the vector φ* is group-sparse, that is,

    Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K.

Low Dimensional Volatility Assumption

For q given functions r_1, …, r_q mapping R^d into R_+, there is a vector α* ∈ R^q such that r*(x) = Σ_{ℓ=1}^q α*_ℓ r_ℓ(x) for almost every x ∈ R^d.
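The group-sparsity quantity above is just a count of groups with nonzero Euclidean norm; a short sketch, where the partition and the vector are made up for illustration :

```python
import math

# Card({k : |phi*_{G_k}|_2 != 0}) for a partition G_1, ..., G_K of the indices
def active_groups(phi, groups):
    count = 0
    for G in groups:
        group_norm = math.sqrt(sum(phi[j] ** 2 for j in G))
        if group_norm > 0.0:
            count += 1
    return count

# Made-up example : p = 6 coordinates split into K = 3 groups, one group active
phi_star = [1.0, -2.0, 0.0, 0.0, 0.0, 0.0]
groups = [[0, 1], [2, 3], [4, 5]]
k_active = active_groups(phi_star, groups)  # 1, far below K = 3
```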
Motivations for these assumptions
• The group sparsity assumption is relevant in a sparse additive model, that is, when

    ↪→ f*(x) = ψ_1(x_1) + … + ψ_d(x_d) with ψ_j ≡ 0 for most j,
    ↪→ each component is projected on a basis : ψ_j(x_j) ≈ Σ_{ℓ=1}^{K_j} φ*_{ℓ,j} f_ℓ(x_j),
    ↪→ which yields group sparsity of φ = (φ_{ℓ,j}).

• Low dimensionality of the volatility occurs, for instance, when the noise is block-wise homoscedastic or periodic.
II. Relation to previous work
Homoscedastic regression
The model is

    Y = X β* + σ* ξ

    Observations : Y = [y_1, …, y_T]ᵀ ∈ R^T
    Noise : ξ = [ξ_1, …, ξ_T]ᵀ ∈ R^T
    Design matrix : X = (x_{t,j}) with x_{t,j} = f_j(x_t) ∈ R
    Coefficients : β* = [β*_1, …, β*_p]ᵀ ∈ R^p
    Standard deviation : s*(x_t) ≡ σ* ∈ R_+*

Recall that the sparsity assumption postulates that |β*|_0 = p* ≪ p.
Most popular methods : Lasso and Dantzig selector
• The Lasso of Tibshirani (1996) is defined as

    β̂^Lasso = arg min_{β ∈ R^p} ( |Y − Xβ|²_2 / (2σ*²) + λ Σ_{j=1}^p |X_j|_2 |β_j| )

• The Dantzig selector of Candès and Tao (2007) is

    β̂^DS = arg min_{β ∈ R^p} { Σ_{j=1}^p |X_j|_2 |β_j| : max_{j=1,…,p} |X_jᵀ(Y − Xβ)| / |X_j|_2 ≤ λ }

For a tuning parameter satisfying λ ∝ 1/σ*, sharp oracle inequalities are available, e.g., Bickel et al. (2009) :

    E( (1/T) |X(β̂ − β)|²_2 ) ≤ C p* log(p) / T.

To correctly tune the parameter λ, knowledge of σ* is necessary.
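The Lasso criterion above can be minimized by cyclic coordinate descent with soft-thresholding. A self-contained sketch, written for the unscaled form ½|Y − Xβ|²_2 + λ Σ_j |β_j| (i.e., taking σ* = 1 and absorbing the column norms |X_j|_2 into λ); the toy data and the value of λ are made up :

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)|Y - X beta|_2^2 + lam * sum_j |beta_j|."""
    T, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # |X_j|_2^2
    resid = Y - X @ beta                   # current residual Y - X beta
    for _ in range(n_iter):
        for j in range(p):
            # correlation of X_j with the partial residual excluding coordinate j
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            # soft-thresholding step
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Toy data (made up) : T = 200, p = 10, true support {0, 3}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[0], beta_true[3] = 2.0, -1.5
Y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = lasso_cd(X, Y, lam=5.0)
```

With this λ the estimate should recover the true support {0, 3} and shrink the remaining coordinates to (near) zero, illustrating why the method is attractive under the sparsity scenario.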
Joint estimation of β∗ and σ∗
• Scaled Lasso, Städler et al. (2010) :

    (β̂^ScL, σ̂^ScL) = arg min_{β,σ} ( T log(σ) + |Y − Xβ|²_2 / (2σ²) + (λ/σ) Σ_{j=1}^p |X_j|_2 |β_j| ).

  This can be recast as a convex problem (set ρ := 1/σ and φ := β/σ) :

    arg min_{φ,ρ} ( −T log(ρ) + |ρY − Xφ|²_2 / 2 + λ Σ_{j=1}^p |X_j|_2 |φ_j| ).

• Square-root Lasso, Antoniadis (2010), Belloni et al. (2011), Sun & Zhang (2012), Gautier & Tsybakov (2011) :

    β̂^SqR-Lasso = arg min_β ( |Y − Xβ|_2 + λ Σ_{j=1}^p |X_j|_2 |β_j| ),   σ̂^SqR-Lasso = T^{−1/2} |Y − X β̂^SqR-Lasso|_2.

• A scaled DS version was proposed by Dalalyan & Chen (2012) : sharp analysis and computational advantages.
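The substitution ρ := 1/σ, φ := β/σ leaves the Scaled Lasso criterion value unchanged, which is easy to verify numerically; a small sketch on made-up data :

```python
import numpy as np

def scaled_lasso_obj(beta, sigma, X, Y, lam):
    # T log(sigma) + |Y - X beta|_2^2 / (2 sigma^2) + (lam/sigma) sum_j |X_j|_2 |beta_j|
    T = len(Y)
    col_norms = np.linalg.norm(X, axis=0)
    return (T * np.log(sigma)
            + np.sum((Y - X @ beta) ** 2) / (2 * sigma ** 2)
            + (lam / sigma) * np.sum(col_norms * np.abs(beta)))

def reparam_obj(phi, rho, X, Y, lam):
    # -T log(rho) + |rho Y - X phi|_2^2 / 2 + lam * sum_j |X_j|_2 |phi_j|
    T = len(Y)
    col_norms = np.linalg.norm(X, axis=0)
    return (-T * np.log(rho)
            + np.sum((rho * Y - X @ phi) ** 2) / 2
            + lam * np.sum(col_norms * np.abs(phi)))

# Sanity check on made-up data : the two criteria coincide at rho = 1/sigma, phi = beta/sigma
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
Y = rng.standard_normal(50)
beta = np.array([1.0, 0.0, -0.5, 2.0])
sigma, lam = 0.8, 0.3
a = scaled_lasso_obj(beta, sigma, X, Y, lam)
b = reparam_obj(beta / sigma, 1.0 / sigma, X, Y, lam)
```

Since the map (β, σ) ↦ (β/σ, 1/σ) is a bijection on R^p × R_+*, minimizing the (convex) reparametrized criterion is equivalent to minimizing the original one.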
III. Main results
Estimation for heteroscedastic regression : continuous time

Observations : (x_t, y_t)_{t≤T} obeying y_t = b*(x_t) + s*(x_t) W_t, thus

    y_t / s*(x_t) = ( Σ_{j=1}^p φ*_j f_j(x_t) ) + W_t = X(t) φ* + W_t,   t ∈ T = [0, T].

Scaled Heteroscedastic Lasso :

    φ̂^ScHeL = arg min_{φ ∈ R^p} { (1/2) ∫_0^T ( y_t / s*(x_t) − X(t)φ )² dt + λ Σ_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

with |X_G φ_G|²_2 = ∫_0^T ( Σ_{j∈G} X_j(t) φ_j )² dt = ∫_0^T ( Σ_{j∈G} f_j(x_t) φ_j )² dt.

Computation : the estimator φ̂^ScHeL can be efficiently computed even for very large dimensions p by solving a second-order cone program.
Estimation for heteroscedastic regression : discrete time

Observations : (x_t, y_t)_{t=1,…,T} obeying

    y_t = b*(x_t) + s*(x_t) ξ_t = r*(x_t)^{−1} ( f*(x_t) + ξ_t ).

Under our assumptions,

    f*(x_t) = Σ_{j=1}^p φ*_j f_j(x_t) = X(t) φ*,   r*(x_t) = Σ_{ℓ=1}^q α*_ℓ r_ℓ(x_t) = R(t) α*.

Thus [f*(x_1), …, f*(x_T)]ᵀ = X φ* and [r*(x_1), …, r*(x_T)]ᵀ = R α*. This leads to

    D_Y R α* = X φ* + ξ,   D_Y = diag(y_t ; t = 1, …, T).

Scaled Heteroscedastic Lasso : (φ̂^ScHeL, α̂^ScHeL) is a solution to

    min_{φ ∈ R^p, α ∈ R^q} { −Σ_{t=1}^T log(R(t)α) + (1/2) |D_Y R α − X φ|²_2 + λ Σ_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

where the first two terms form the negative Gaussian log-likelihood and the last term is a sparsity-promoting penalty.
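The criterion above is straightforward to evaluate for a candidate pair (φ, α); a sketch of the objective function on a tiny instance, where the data, the groups and λ are all invented for illustration :

```python
import numpy as np

def schel_objective(phi, alpha, X, R, Y, groups, lam):
    """Penalized negative Gaussian log-likelihood of the ScHeL criterion:
    -sum_t log(R(t) alpha) + (1/2)|D_Y R alpha - X phi|_2^2
    + lam * sum_k |X_{G_k} phi_{G_k}|_2."""
    Ralpha = R @ alpha                     # candidate values of r*(x_t), must be > 0
    if np.any(Ralpha <= 0):
        return np.inf                      # outside the likelihood's domain
    resid = Y * Ralpha - X @ phi           # D_Y R alpha - X phi
    neg_loglik = -np.sum(np.log(Ralpha)) + 0.5 * np.sum(resid ** 2)
    penalty = lam * sum(np.linalg.norm(X[:, G] @ phi[G]) for G in groups)
    return neg_loglik + penalty

# Tiny made-up instance : T = 5, p = 4 split into K = 2 groups, q = 2
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 4))
R = np.abs(rng.standard_normal((5, 2))) + 0.1   # positive entries, so R alpha > 0
Y = rng.standard_normal(5)
phi = np.array([1.0, 0.0, 0.0, -0.5])
alpha = np.array([0.5, 0.2])
groups = [[0, 1], [2, 3]]
val = schel_objective(phi, alpha, X, R, Y, groups, lam=0.1)
```

Minimizing this jointly convex criterion over (φ, α) is what the second-order cone solvers mentioned below handle; the sketch only evaluates it at one point.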
Finite sample risk bounds for the ScHeL
Theorem. Consider either the continuous-time model or the discrete-time model with sub-Gaussian errors ξ. Let K* (resp. p*) be the number of relevant groups (resp. coordinates of φ*). Let ε ∈ (0, 1) be a tolerance level and set

    λ_k = 4( √|G_k| + √log(K/ε) ).

Under some assumptions, with probability at least 1 − 2ε,

    |X(φ̂ − φ*)|_2 ≤ D_{T,ε}^{3/2} √(q log(2q/ε)) + D_{T,ε} √(p* + K* log(K/ε)),
    |R(α̂ − α*)|_2 ≤ D_{T,ε}^{3/2} √(q log(2q/ε)) + D_{T,ε} √(p* + K* log(K/ε)),

where D_{T,ε} ∝ ( max_t |f*(x_t)| + log(2T/ε) ).
Summary
New procedures, named ScHeL and ScHeDs :

    • Suitable for fitting the heteroscedastic regression model.
    • Simultaneous estimation of the mean and the variance functions.
    • Take group sparsity into account.
    • Implemented using two different solvers :
        ↪→ primal-dual interior point method (highly accurate),
        ↪→ optimal first-order method (moderately accurate but with cheap iterations).
    • Competitive with state-of-the-art algorithms,
        ↪→ applicable in a much more general framework.

Manuscript, proofs and codes available on request.
[Figure omitted : three panels plotting mean error and running time against the dimension p, for p up to 2500.]

FIGURE : Comparing the interior point (IP) vs. optimal first-order (OFO) method. Top & middle : MSE of β̂^ScHeDs on the nonzero and zero coefficients. Bottom : running times (sec).
References I
A. Antoniadis, Comments on : ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 257–258.

A. Belloni, V. Chernozhukov, and L. Wang, Square-root Lasso : pivotal recovery of sparse signals via conic programming, Biometrika 98 (2011), no. 4, 791–806.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Statist. 37 (2009), no. 4, 1705–1732.

E. J. Candès and T. Tao, The Dantzig selector : statistical estimation when p is much larger than n, Ann. Statist. 35 (2007), no. 6, 2392–2404.

A. S. Dalalyan and Y. Chen, Fused sparsity and robust estimation for linear models with unknown variance, Advances in Neural Information Processing Systems 25 : NIPS, 2012, pp. 1268–1276.

E. Gautier and A. Tsybakov, High-dimensional instrumental variables regression and confidence sets, September 2011.

N. Städler, P. Bühlmann, and S. van de Geer, ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 209–256.

T. Sun and C.-H. Zhang, Scaled sparse linear regression, Biometrika 99 (2012), no. 4, 879–898.

R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.