Introduction and Scope Joint Convex Penalty Solving the optimization problem Simulation Study Conclusion References
A Joint Convex Penalty for Inverse Covariance Matrix Estimation
Ashwini Maurya
Department of Statistics and Probability, Michigan State University
March 27, 2014
Ashwini Maurya A Joint Convex Penalty for Inverse Covariance Matrix Estimation
Outline
1 Introduction and Scope
2 Joint Convex Penalty
3 Solving the optimization problem
4 Simulation Study
5 Conclusion
6 References
Covariance Matrix
Let X = (X1, · · · , Xn)T be an observation matrix from a p-dimensional multivariate distribution with mean vector µ = (µ1, · · · , µp)T and covariance matrix Σ. The sample covariance matrix is defined by:

S_{i,j} = (1/n) ∑_{m=1}^{n} (X_{m,i} − X̄i)(X_{m,j} − X̄j);  i, j = 1, 2, . . . , p
When p > n, the sample covariance matrix is singular, and therefore the estimation of Σ, and hence of Σ−1, is ill-posed. We assume that Σ−1 is sparse, i.e.,

#{(i, j) : (Σ−1)_{i,j} ≠ 0} := g ≪ p(p + 1)/2.

How can we estimate the non-zero entries of the inverse covariance matrix?
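A minimal numerical sketch of the definition above, assuming the 1/n normalization (the helper name is illustrative); it also checks the rank deficiency that makes the p > n case ill-posed:

```python
import numpy as np

def sample_covariance(X):
    """Entry-wise sample covariance S[i, j] = (1/n) * sum_m (X[m,i] - Xbar_i)(X[m,j] - Xbar_j).

    X is an (n, p) observation matrix; the 1/n normalization is an
    assumption (some texts use 1/(n - 1)).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)        # center each column
    return (Xc.T @ Xc) / n

# When p > n the resulting S is rank-deficient, hence singular.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 20))      # n = 10 < p = 20
S = sample_covariance(X)
print(np.linalg.matrix_rank(S))    # at most n - 1 = 9
```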
Scope of Inverse Covariance Matrix
Estimation of the inverse covariance matrix is important in a number of statistical analyses, including:

Gaussian Graphical Modeling: In Gaussian graphical modeling, a zero entry in the inverse covariance matrix corresponds to conditional independence between the corresponding variables.

Linear or Quadratic Discriminant Analysis: When the features are assumed to have a multivariate Gaussian density, the resulting discriminant rule requires an estimate of the inverse covariance matrix.

Principal Component Analysis (PCA): In multivariate high-dimensional data it is often desirable to transform the high-dimensional feature space to a lower dimension without losing much information. The eigen-decomposition of the covariance matrix is a popular method for computing PCA estimates.
Penalization methods
It is well known that ℓ1 minimization leads to sparse solutions. Some popular methods for estimating a sparse inverse covariance matrix based on ℓ1 minimization include:

Graphical Lasso (Friedman et al., 2008)

Sparse Permutation Invariant Covariance Matrix Estimation (Rothman et al., 2008)

High Dimensional Inverse Covariance Matrix Estimation (Yuan, 2010)

Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation (Hsieh et al., 2011)
Beyond `1 Penalization
It is well known that the eigenvalues of the sample covariance matrix are over-dispersed [Marchenko-Pastur 1967, Johnstone 2001].

In the high-dimensional setting (where p ≫ n), an estimate of the covariance matrix based on the sample covariance matrix will be singular. In other words, most of the eigenvalues of the estimated covariance matrix will be zero or close to zero.

Consequently, the corresponding eigenvalues of the inverse covariance matrix will be very large.
[Figure: histograms of eigenvalues (frequency versus eigenvalue). Left: true eigenvalues (range 0–1). Middle: sample eigenvalues with n = 500, p = 200 (range 0–2.5). Right: sample eigenvalues with n = 100, p = 200 (range 0–6). The sample spectra become increasingly over-dispersed as p/n grows.]
Beyond `1 Penalization and A Joint Convex Penalty
Motivated by this observation, we include an additional penalty which shrinks the eigen-spectrum of the inverse covariance matrix estimate.

Let X = (X1, · · · , Xn)T be an observation matrix from a p-dimensional multivariate normal distribution with mean vector µ = (µ1, · · · , µp)T and covariance matrix Σ.

The penalized maximum likelihood estimate of the inverse covariance matrix is given by:

argmin_{W ≻ 0} F(W) = −log(det(W)) + tr(SW) + λ‖W‖1 + τ‖W‖∗    (2.1)

where W is the inverse covariance matrix, S is the sample covariance matrix, and λ and τ are non-negative real constants.
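The objective in (2.1) can be evaluated directly; a minimal sketch (the function name is illustrative, and the trace norm is computed as the eigenvalue sum, which is valid for W ≻ 0):

```python
import numpy as np

def jcp_objective(W, S, lam, tau):
    """F(W) = -log det(W) + tr(SW) + lam*||W||_1 + tau*||W||_*  (eq. 2.1).

    For positive definite W the trace norm ||W||_* equals the sum of
    its eigenvalues.  A sketch, not the author's code.
    """
    sign, logdet = np.linalg.slogdet(W)
    assert sign > 0, "W must be positive definite"
    l1 = np.abs(W).sum()                    # entry-wise l1 norm
    nuclear = np.linalg.eigvalsh(W).sum()   # = sum of eigenvalues for W > 0
    return -logdet + np.trace(S @ W) + lam * l1 + tau * nuclear

p = 5
S = np.eye(p)
F = jcp_objective(np.eye(p), S, lam=0.1, tau=0.1)
print(F)   # 5 + 0.1*5 + 0.1*5 = 6.0
```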
The first penalty term ‖W‖1 in equation (2.1) is defined as the sum of the absolute values of the entries of the inverse covariance matrix W.

The second penalty term ‖W‖∗ is the trace norm, defined as the sum of the singular values of W (for positive definite matrices, eigenvalues and singular values coincide).

The ℓ1 norm is convex and smooth except at the origin; the trace norm is likewise a non-smooth convex function. Consequently, the optimization problem is a convex optimization problem with a non-smooth objective.

We implement a proximal gradient method to solve this optimization problem.
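Both penalties admit closed-form proximal operators for symmetric matrices, which is what makes a proximal gradient approach attractive here: entry-wise soft-thresholding for the ℓ1 norm, and soft-thresholding of the eigenvalues for the trace norm. A sketch (helper names are illustrative):

```python
import numpy as np

def soft_threshold(A, t):
    """Entry-wise prox of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def prox_trace_norm(A, t):
    """Prox of t*||.||_* for symmetric A: soft-threshold the eigenvalues."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.maximum(vals - t, 0.0)) @ vecs.T

A = np.diag([3.0, 1.0, 0.5])
print(np.diag(prox_trace_norm(A, 1.0)))   # [2. 0. 0.]
```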
Proximal Gradient Algorithm
Let h(x) be a smooth convex function bounded below by some finite constant. The proximal gradient method generates a sequence of solutions by solving

x_k = argmin_x { h(x) + ‖x − x_{k−1}‖² },  k = 1, 2, . . .    (3.1)

The above sequence converges weakly to argmin_x h(x) [Rockafellar, 1976].
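Applied to (2.1), each iteration takes a gradient step on the smooth part −log det(W) + tr(SW), whose gradient is S − W⁻¹, followed by the proximal maps of the two penalties. The sketch below applies the two proxes in sequence, a common approximation to the exact joint prox; it is not the paper's exact algorithm:

```python
import numpy as np

def prox_grad_step(W, S, lam, tau, eta):
    """One proximal gradient step for eq. (2.1); a sketch under stated assumptions.

    The smooth part g(W) = -log det(W) + tr(SW) has gradient S - W^{-1}.
    The l1 and trace-norm proxes are applied in sequence, which only
    approximates the joint prox of lam*||.||_1 + tau*||.||_*.
    """
    G = S - np.linalg.inv(W)                                  # gradient of smooth part
    V = W - eta * G                                           # gradient step
    V = np.sign(V) * np.maximum(np.abs(V) - eta * lam, 0.0)   # l1 prox
    V = 0.5 * (V + V.T)                                       # re-symmetrize
    vals, vecs = np.linalg.eigh(V)
    vals = np.maximum(vals - eta * tau, 1e-8)                 # trace-norm prox, keep W > 0
    return (vecs * vals) @ vecs.T

# Tiny example: iterate from the identity with S = identity.
p = 4
S, W = np.eye(p), np.eye(p)
for _ in range(50):
    W = prox_grad_step(W, S, lam=0.01, tau=0.01, eta=0.2)
print(W.shape)
```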
Convergence Analysis
Theorem
Let {Wn, n = 1, 2, . . .} be the sequence generated by the proximal gradient algorithm, and let M < ∞ be a constant such that ∑_{n=1}^{∞} |⟨W∗ − Wn, ∇‖Wn‖1⟩| < M. We have

F(Wk) − F(W∗) ≤ (γL‖W0 − W∗‖F² + 18c² + M) / (4k),

where L > 0 is the least upper Lipschitz constant of the gradient of the negative log-likelihood function, and γ > 1 and c > 0 are constants.
Simulation
Extensive simulations were carried out; here we report the results for two types of inverse covariance matrices:

(i) Hub Graph: The rows/columns are partitioned into J equally-sized disjoint groups: {V1 ∪ V2 ∪ · · · ∪ VJ} = {1, 2, . . . , p}, and each group is associated with a pivotal row k. Let |V1| = s. We set w_{i,j} = w_{j,i} = ρ for i ∈ Vk and w_{i,j} = w_{j,i} = 0 otherwise. In our experiment, J = [p/s], k = 1, s + 1, 2s + 1, . . ., and we always set ρ = 1/(s + 1) with J = 20.

(ii) Neighborhood Graph: We first sample (y1, y2, . . . , yn) uniformly from a unit square. We then set w_{i,j} = w_{j,i} = ρ with probability (√2π)⁻¹ exp(−4‖y_i − y_j‖²). The remaining entries of W are set to zero. The number of nonzero off-diagonal elements in each row or column is restricted to be smaller than [1/ρ]; ρ is set to 0.245.
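The Hub construction above can be sketched as follows (a unit diagonal is assumed for invertibility, which the slide does not specify; the helper name is illustrative):

```python
import numpy as np

def hub_precision(p, s, rho=None):
    """Hub-type inverse covariance matrix, per the construction above.

    Rows are split into groups of size s; each group's pivotal row is
    its first index, and w[i, pivot] = w[pivot, i] = rho within the
    group.  rho defaults to 1/(s + 1).  The unit diagonal is an
    assumption, added so the matrix is a valid precision matrix.
    """
    rho = 1.0 / (s + 1) if rho is None else rho
    W = np.eye(p)
    for start in range(0, p - s + 1, s):      # pivotal rows 0, s, 2s, ...
        pivot = start
        for i in range(start, start + s):
            if i != pivot:
                W[i, pivot] = W[pivot, i] = rho
    return W

W = hub_precision(p=200, s=10)                # J = 20 groups, as in the slides
print(np.count_nonzero(W - np.diag(np.diag(W))) // 2)  # off-diagonal pairs
```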
We compare the proposed method (JCP) to the graphical lasso and SPICE in terms of Kullback-Leibler loss (KL loss) for varying values of the sample size (n) and the number of covariates (p). We choose n = 50, 100 and p = 50, 100, 200.

For both the Hub and Neighborhood type inverse covariance matrices, the JCP method outperforms SPICE in terms of KL loss for all n and p.

JCP also performs better than the graphical lasso for p = 50 and p = 100, whereas JCP performs similarly to the graphical lasso for p = 200.

The extensive simulation analysis (see the paper) also shows that the loss of the estimated inverse covariance matrix depends on the underlying structure of the matrix.
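A common form of the Gaussian KL loss used in such comparisons is tr(ΣŴ) − log det(ΣŴ) − p, where Σ is the true covariance and Ŵ the estimated precision matrix; the paper's exact definition may differ by constants. A sketch:

```python
import numpy as np

def kl_loss(W_hat, Sigma):
    """Gaussian KL loss  tr(Sigma W_hat) - log det(Sigma W_hat) - p.

    Zero when W_hat equals the true inverse covariance Sigma^{-1};
    a common convention, possibly differing from the paper's by constants.
    """
    p = Sigma.shape[0]
    M = Sigma @ W_hat
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

Sigma = np.eye(3)
print(kl_loss(np.eye(3), Sigma))   # 0.0 -- an exact estimate gives zero loss
```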
Simulation (Hub-type Inverse Covariance Matrix)

Left graph n = 50, right graph n = 100, for varying p.

[Figure: KL loss versus the number of variables p ∈ {50, 100, 200} for JCP, Graphical Lasso, and SPICE.]
Simulation (Neighborhood-type Inverse Covariance Matrix)

Left graph n = 50, right graph n = 100, for varying p.

[Figure: KL loss versus the number of variables p ∈ {50, 100, 200} for JCP, Graphical Lasso, and SPICE.]
Conclusion
We impose a joint convex penalty, which has shown better performance than the other methods in the simulation study. In practice the underlying inverse covariance matrix may have structure beyond sparsity; in that case a suitable penalty can be chosen to estimate the inherent structure of the matrix.

The proposed joint convex penalty is more flexible: it can penalize entries of the inverse covariance matrix differently, rather than by the same amount of a single chosen regularization parameter as in the graphical lasso and SPICE.

Under mild conditions, the proposed proximal gradient algorithm is shown to achieve a sub-linear rate of convergence for this problem, which can further be improved to a linear rate, making it a suitable choice for large-scale optimization problems.
Selected References
Maurya, A., A joint convex penalty for inverse covariance matrix estimation. Computational Statistics and Data Analysis, 2014.

Friedman, J., Hastie, T., Tibshirani, R., Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.

Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494-515, 2008.
Thank You!