
Selection of the Regularization Parameter in Graphical Models using Network Characteristics

Natalia Bochkina

University of Edinburgh, Maxwell Institute and the Alan Turing Institute

joint work with Adria Caballe Mestres (University of Edinburgh and BioSS) and Claus Mayer (BioMathematics and Statistics Scotland)

27 July 2016

Natalia Bochkina (University of Edinburgh) 27 July 2016 1 / 35

Outline

1 Sparse High Dimensional Gaussian Graphical Models

2 Network-based estimation of hyperparameter

3 Simulated data

4 Tumour gene expression data

5 Summary


Sparse High Dimensional Gaussian Graphical Models Model

Gaussian graphical models

Suppose we observe n replicates of p variables:

$Y_i = (Y_{1,i}, \dots, Y_{p,i}) \sim N(\mu, \Omega^{-1})$ independently for $i = 1, \dots, n$,

where Ω is the p × p precision matrix and µ is the vector of the means (assume µ = 0).

The matrix Ω represents the conditional dependence structure among the variables, with zero values representing conditional independence.

Aim: estimate the underlying graph of the conditional dependence structure determined by Ω.

Applications: networks in genetics and genomics, financial models . . .


Sparse High Dimensional Gaussian Graphical Models Estimation of precision matrix

Gaussian graphical models, p is large

For p small compared to n, the maximum likelihood estimate (MLE) is

$\hat\Omega_{ML} = \arg\max_{\Omega \succ 0} [\log\det\Omega - \mathrm{tr}(S\Omega)]$, (1)

where S is the sample covariance matrix: $S = n^{-1} \sum_{i=1}^{n} Y_i Y_i^T$.

Problem: if p is so large that S is not of full rank, then the MLE is not unique.
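The rank problem can be checked numerically. The following is a minimal sketch, assuming simulated standard normal data (µ = 0, Ω = I); the sizes n = 20 and p = 50 and the seed are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                       # fewer replicates than variables (illustrative sizes)
Y = rng.standard_normal((n, p))     # rows are Y_i, simulated with mu = 0 and Omega = I
S = Y.T @ Y / n                     # S = n^{-1} sum_i Y_i Y_i^T

# The rank of S is at most min(n, p), so S is singular whenever p > n.
rank_S = np.linalg.matrix_rank(S)
print(rank_S, "<", p)
```

Because S is singular here, log det Ω − tr(SΩ) has no unique maximiser, which is what motivates the sparsity assumption.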

Additional assumption on Ω is sparsity.

Penalised maximum likelihood estimator (with a convex penalty) is

$\hat\Omega_{pML} = \arg\max_{\Omega \succ 0} [\log\det\Omega - \mathrm{tr}(S\Omega) - \lambda \|\Omega\|_1]$, (2)

where $\|\Omega\|_1 = \sum_{i,j=1}^{p} |\Omega_{ij}|$ is the elementwise $\ell_1$ norm of the matrix Ω.

For λ large enough, estimator ΩpML is sparse.
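The penalised objective in (2) is easy to evaluate for a candidate Ω. The sketch below only evaluates the criterion and does not maximise it; the function name `penalised_loglik` is hypothetical, and Ω is assumed to be symmetric.

```python
import numpy as np

def penalised_loglik(omega, S, lam):
    """Evaluate log det(Omega) - tr(S Omega) - lam * sum_ij |Omega_ij|.

    Returns -inf when Omega is not positive definite (Omega assumed symmetric).
    """
    eig = np.linalg.eigvalsh(omega)
    if eig.min() <= 0:
        return -np.inf
    log_det = np.log(eig).sum()
    return log_det - np.trace(S @ omega) - lam * np.abs(omega).sum()

# Usage: identity precision and covariance with lambda = 0.5
value = penalised_loglik(np.eye(3), np.eye(3), 0.5)
```

Note that the penalty, as written in (2), sums over all entries of Ω, including the diagonal.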


Sparse High Dimensional Gaussian Graphical Models Algorithms and consistency

Methods for estimating hyperparameter λ

Additional penalty term for λ:

$\hat\Omega_{ppML} = \arg\max_{\Omega \succ 0,\ \lambda > 0} [L(\Omega) - \lambda \|\Omega\|_1 - \mathrm{pen}(\lambda)]$

Methods such as AIC and BIC are suboptimal when p is large.

Bayesian model with a spike-and-slab prior for elements of Ω: computationally intensive for large p (p ≥ 10^3).

Two-step procedure:

Step 1: Use the pMLE estimator Ω̂_pML = Ω̂_pML(λ) for given λ.
Step 2: Choose λ to minimise R(λ, Ω̂_pML(λ)), e.g. by cross-validation for Ω. Cross-validation overfits when p is large (Liu et al., 2011).

Stability selection by Meinshausen and Bühlmann (2010): controls FDR.

StARS: Stability Approach to Regularization Selection by Liu et al. (2011). Additional tuning parameter; can lead to overfitting in certain graph topologies.

. . .


Sparse High Dimensional Gaussian Graphical Models Algorithms and consistency

Instability of estimated graph structure

Small variation in the penalty (||Ω||_1) can lead to a significant change in the estimated graph structure.

[Figure: estimated graphs for λ = 0.83 and λ = 0.85; plot of the l_1 norm against λ, for λ between 0.70 and 0.95.]


Network-based estimation of hyperparameter

Estimation of the hyperparameter in sparse graphical models using network characteristics


Network-based estimation of hyperparameter Network approach

Network-based approach to estimating λ

We propose to estimate λ using network characteristics of underlying graph.

Notation: Graph G(V, E) with nodes V, edges E, adjacency matrix A.
In graphical models: A_ij = I(Ω_ij ≠ 0) for i ≠ j, and A_ii = 0.
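The mapping from a precision matrix to its adjacency matrix can be sketched as follows. The function name `adjacency_from_precision` and the tolerance `tol` (used to treat numerically tiny entries as zero) are assumptions for illustration.

```python
import numpy as np

def adjacency_from_precision(omega, tol=1e-8):
    """A_ij = I(Omega_ij != 0) for i != j, and A_ii = 0."""
    A = (np.abs(np.asarray(omega)) > tol).astype(int)
    np.fill_diagonal(A, 0)
    return A

omega = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, -0.7],
                  [0.0, -0.7, 2.0]])
A = adjacency_from_precision(omega)   # edges 0-1 and 1-2, no edge 0-2
```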

Network-based estimation of λ

Given λ, estimate Ω = Ωλ by penalised MLE:

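$\hat\Omega_\lambda = \arg\max_{\Omega \succ 0} [\log\det\Omega - \mathrm{tr}(S\Omega) - \lambda \|\Omega\|_1]$

Choose $\hat\lambda = \arg\min_\lambda R(\lambda, \hat A_\lambda)$, where Â_λ is the adjacency matrix of the conditional dependence graph of Ω̂_λ.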

The loss function for estimating λ depends only on the adjacency matrix of the underlying conditional dependence graph.

Main a priori assumption: presence of weakly connected clusters.


Network-based estimation of hyperparameter Network approach

Network characteristics

Correlation coefficient between nodes Vi , Vj ∈ G(V ,E):

$\sigma_{ij} = \frac{|nei(V_i) \cap nei(V_j)|}{\sqrt{|nei(V_i)|\,|nei(V_j)|}}$,

where nei(V_i) is the set of neighbours of node V_i (Estrada, 2011).
Corresponding dissimilarity measure: δ_ij = 1 − σ_ij.
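For a symmetric 0/1 adjacency matrix with zero diagonal, the number of common neighbours of nodes i and j is the (i, j) entry of A·A, so the graph correlations can be computed in a few lines. This is a minimal sketch; the function name is hypothetical, and setting σ_ij = 0 when either node has no neighbours is an assumed convention, since the formula is undefined there.

```python
import numpy as np

def graph_correlation(A):
    """sigma_ij = |nei(V_i) ∩ nei(V_j)| / sqrt(|nei(V_i)| |nei(V_j)|)."""
    A = np.asarray(A, dtype=float)
    common = A @ A                      # common[i, j] = number of shared neighbours
    deg = A.sum(axis=1)                 # deg[i] = |nei(V_i)|
    denom = np.sqrt(np.outer(deg, deg))
    sigma = np.zeros_like(common)       # assumed convention: 0 when a node is isolated
    np.divide(common, denom, out=sigma, where=denom > 0)
    return sigma

# Path graph 0 - 1 - 2: nodes 0 and 2 share their only neighbour
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
sigma = graph_correlation(A)
dissimilarity = 1.0 - sigma
```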

Mean Geodesic Distance: measure of connectivity between nodes

$H(\lambda) = \frac{1}{p(p-1)} \sum_{i<j} d_{ij}\, I(d_{ij} < \infty)$,

where d_ij is the length of the shortest path between nodes i and j (Costa and Rodrigues, 2007).
...
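The Mean Geodesic Distance can be computed with a breadth-first search from each node. The sketch below follows the formula as written on the slide (normalisation by p(p − 1), sum over pairs i < j with finite distance); the function name is hypothetical and p > 1 is assumed.

```python
from collections import deque

def mean_geodesic_distance(A):
    """H = 1/(p(p-1)) * sum_{i<j} d_ij * I(d_ij < inf) for an unweighted graph."""
    p = len(A)
    total = 0
    for source in range(p):
        dist = {source: 0}                  # only reachable nodes get an entry
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(p):
                if A[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # count each pair once (source < target); unreachable pairs contribute 0
        total += sum(d for target, d in dist.items() if target > source)
    return total / (p * (p - 1))

# Path 0 - 1 - 2 plus an isolated node 3: finite distances are 1, 1 and 2
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
H = mean_geodesic_distance(A)
```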


Network-based estimation of hyperparameter Novel approaches

General algorithm

Fix a sequence (grid) of values of λ: (λ_1, ..., λ_N).

For each λ_ℓ, estimate Ω by penalised MLE, and hence the adjacency matrix Â_ℓ of the corresponding graph.

Choose $\lambda_{\ell^\star}$, where $\ell^\star = \arg\min_\ell R(\lambda_\ell, \hat A_\ell)$.

Can be interpreted as a point estimator of a modularised Bayesian model.
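The general algorithm is a grid search with a pluggable estimator and risk. The following sketch assumes the caller supplies `estimate_precision(S, lam)` (for example a graphical lasso solver) and `risk(lam, A)`; both names, and the tolerance `tol`, are hypothetical. The toy estimator and risk in the usage lines are stand-ins to make the sketch executable, not the methods of the talk.

```python
import numpy as np

def select_lambda(S, lambdas, estimate_precision, risk, tol=1e-8):
    """Return the grid value minimising risk(lam, A_lam), plus all risk values."""
    risks = []
    for lam in lambdas:
        omega = estimate_precision(S, lam)          # penalised MLE for this lambda
        A = (np.abs(omega) > tol).astype(int)       # adjacency of the estimated graph
        np.fill_diagonal(A, 0)
        risks.append(risk(lam, A))
    best = int(np.argmin(risks))
    return lambdas[best], risks

# Toy usage: hard-threshold a fixed matrix, and prefer graphs with exactly one edge
M = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]])
toy_estimator = lambda S, lam: np.where(np.abs(M) > lam, M, 0.0)
toy_risk = lambda lam, A: abs(A.sum() / 2 - 1)
lam_star, risks = select_lambda(np.eye(3), [0.1, 0.3, 0.6], toy_estimator, toy_risk)
```

Keeping the risk as a function of (λ, A) only mirrors the point made earlier: the loss depends on the estimate solely through its adjacency matrix.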

We propose two risk functions R(λ,A):

Path Connectivity: uses λ corresponding to the biggest structural change in the complexity of the graph. Complexity of the graph is measured by the Mean Geodesic Distance.

Augmented MSE: mimics a cross-validation approach with the loss depending on the adjacency matrix of the graph.


Network-based estimation of hyperparameter Path Connectivity

Path Connectivity

Consider the Mean Geodesic Distance

$H(\lambda) = \frac{1}{p(p-1)} \sum_{i<j} d_{ij}\, I(d_{ij} < \infty)$,

where d_ij is the length of the shortest path between nodes V_i and V_j.
Choose λ: the largest change in graph structure measured by H(λ).

[Figure: connectivity measure (conn) plotted against λ, for λ between 0.3 and 0.6.]

Use finite differences with bandwidth h: H(λ + h) − H(λ).


Path Connectivity: motivation

Use the normalised difference between mean geodesic distances:

$R(\lambda_k, A) = \frac{H(\lambda_k + h) - H(\lambda_k)}{k^{-1} \sum_{j=1}^{k} [H(\lambda_j + h) - H(\lambda_j)]}$, $\lambda_k = \lambda_0 + (k-1)h$
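On an equally spaced grid, λ_k + h is the next grid point, so the normalised difference reduces to consecutive differences of H divided by their running mean. The sketch below assumes H has already been evaluated on the grid; the function name is hypothetical, and picking the index with the largest absolute value is an assumption about how "largest change" is operationalised, since the sign convention is not stated here.

```python
import numpy as np

def path_connectivity_risk(H):
    """R_k = D_k / (k^{-1} sum_{j<=k} D_j), with D_k = H(lambda_k + h) - H(lambda_k)."""
    H = np.asarray(H, dtype=float)
    diffs = H[1:] - H[:-1]                                   # finite differences D_k
    running_mean = np.cumsum(diffs) / np.arange(1, diffs.size + 1)
    R = np.full(diffs.shape, np.nan)                         # NaN where the mean is 0
    np.divide(diffs, running_mean, out=R, where=running_mean != 0)
    return R

H_grid = [0.0, 1.0, 2.0, 10.0]        # illustrative values of H on a grid of lambdas
R = path_connectivity_risk(H_grid)    # [1.0, 1.0, 2.4]
k_star = int(np.nanargmax(np.abs(R))) # assumed selection rule: largest relative jump
```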

[Figure: estimated graphs. (a) Optimal λ_{k⋆}: FP = 46, TP = 116. (b) λ_{k⋆−1}: FP = 54, TP = 120.]



Path Connectivity: estimators of λ

[Figure: density of the estimated λ (λ between 0.20 and 0.40) for clustered and non-clustered graphs.]


Network-based estimation of hyperparameter Augmented MSE

Augmented MSE

Ideally, we would like to use a cross-validation approach over some characteristic of the conditional dependence graph, which is an unbiased estimator of the corresponding oracle risk.
E.g. the MSE of estimating a characteristic (d_ij):

$R(\lambda) = E \sum_{i<j} (d_{ij} - \hat d_{ij}(\lambda))^2$,

where $\hat d_{ij}(\lambda)$ are based on the GLasso estimator $\hat\Omega(\lambda)$, with the corresponding oracle

$\lambda_{oracle} = \arg\min_\lambda R(\lambda)$.

However, we do not observe the conditional dependence graph, i.e. we do not have unbiased estimators of d_ij.

A priori information

the network contains clusters (possibly overlapping) ⇒ use an algorithm that estimates global characteristics well (number of clusters, degrees, ...) to produce an original “estimate”



Augmented MSE of graph correlations

network characteristic: graph correlations

$\rho_{ij} = \frac{|nei(V_i) \cap nei(V_j)|}{\sqrt{|nei(V_i)|\,|nei(V_j)|}}$

original "estimate": output of a clustering algorithm.
AGNES (Kaufman and Rousseeuw, 2009): estimates well global characteristics such as the average degree of the graph, eigenvalues of A, etc.

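A-MSE estimator of λ

Given Ω̂_λ from GLasso and its adjacency matrix Â_λ, choose

$\hat\lambda_{AMSE} = \arg\min_\lambda R(\lambda, \hat A_\lambda) = \arg\min_\lambda E\Big(\sum_{i>j} |\tilde\rho_{ij} - \hat\rho^{\lambda}_{ij}|^q\Big)$, $q \geq 1$,

where E is the average over subsamples, $\hat\rho^{\lambda}_{ij}$ are the graph correlations of Â_λ, and $(\tilde\rho_{ij})$ are the graph correlations in the “original graph estimate”.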
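For a single value of λ, the A-MSE risk can be sketched as follows. The sketch assumes the adjacency matrix of the original graph estimate (`A_ref`, e.g. from a clustering step) and the GLasso adjacency matrices from each subsample (`A_subsamples`) are already available; these names, the helper `graph_correlation`, and the convention of zero correlation for isolated nodes are assumptions for illustration.

```python
import numpy as np

def graph_correlation(A):
    """rho_ij = |nei(V_i) ∩ nei(V_j)| / sqrt(|nei(V_i)| |nei(V_j)|); 0 if undefined."""
    A = np.asarray(A, dtype=float)
    common = A @ A
    deg = A.sum(axis=1)
    denom = np.sqrt(np.outer(deg, deg))
    rho = np.zeros_like(common)
    np.divide(common, denom, out=rho, where=denom > 0)
    return rho

def amse_risk(A_ref, A_subsamples, q=2):
    """Average over subsamples of sum over pairs |rho_ref_ij - rho_lambda_ij|^q."""
    rho_ref = graph_correlation(A_ref)
    pairs = np.triu_indices_from(rho_ref, k=1)      # each unordered pair once
    losses = [np.sum(np.abs(rho_ref[pairs] - graph_correlation(A)[pairs]) ** q)
              for A in A_subsamples]
    return float(np.mean(losses))

A_ref = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])   # reference: path 0 - 1 - 2
empty = np.zeros((3, 3), dtype=int)
risk_value = amse_risk(A_ref, [A_ref, empty])          # losses 0 and 1, average 0.5
```

Evaluating `amse_risk` on a grid of λ values and taking the minimiser gives the selection rule described above.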



Augmented MSE of graph order 2 connectivity

network characteristic: graph order 2 connectivity

$\delta_{ij} = I(|nei(V_i) \cap nei(V_j)| > 0) = I(\rho_{ij} \neq 0)$,

i.e. the indicator function of whether nodes i and j are connected or share a connection.
original "estimate": clustering algorithm (AGNES)

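Risk:

$R(\lambda, \hat A_\lambda) = E \sum_{i<j} (\delta_{ij} - \hat\delta_{ij}(\lambda))^2 = C - E(TP(\lambda) - FP(\lambda))$,

where $C = \sum_{i<j} I[\delta_{ij} = 1]$ does not depend on λ, so minimising the risk maximises E(TP(λ) − FP(λ)); the difference TP − FP is also known as the Youden index. Here

$FP(\lambda) = \sum_{i<j} I[\delta_{ij} = 0, \hat\delta_{ij}(\lambda) = 1]$, $TP(\lambda) = \sum_{i<j} I[\delta_{ij} = 1, \hat\delta_{ij}(\lambda) = 1]$.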

Similarly, we can estimate λ using this risk with δ_ij replaced by its value in the original graph estimate.
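For 0/1 indicators the squared loss counts mismatches, i.e. false positives plus false negatives, and the number of false negatives is C − TP with C the number of reference pairs equal to 1. The squared loss therefore equals C + FP − TP. The snippet below checks this identity numerically on random indicator vectors; the vectors, their length and the seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
delta_ref = rng.integers(0, 2, size=200)    # reference indicators delta_ij
delta_lam = rng.integers(0, 2, size=200)    # estimated indicators for some lambda

TP = int(np.sum((delta_ref == 1) & (delta_lam == 1)))
FP = int(np.sum((delta_ref == 0) & (delta_lam == 1)))
C = int(np.sum(delta_ref == 1))             # constant: does not depend on lambda
squared_loss = int(np.sum((delta_ref - delta_lam) ** 2))

# squared loss = FP + FN = FP + (C - TP)
print(squared_loss == C - (TP - FP))
```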



A-MSE and oracle tuning parameter

[Figure: difference between the A-MSE estimate of λ and the oracle λ (vertical axis, from −0.1 to 0.1) against n = 50, 100, 200, 500. Panels: (c) p=50, (d) p=170, (e) p=290, (f) p=500.]

The oracle value of λ is within the 95% confidence interval for the median of λ̂_AMSE.


Simulated data

Comparison on simulated data

Compare 6 approaches:

StARS, AGNES, A-MSE (graph correlations), PC, AIC and BIC

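Table columns: method; penalized likelihood; uses network characteristics; subsampling; fully automatic; fast; very sparse graph estimates.

[Table: check marks per method; the column positions of the marks are not recoverable from the extracted text.]
PC: X X X X
A-MSE: X X X X
AGNES: X X X
StARS: X X
BIC: X X X X
AIC: X X X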

Compare on 3 graph structure scenarios: hubs, power law and random networks.


Simulated data Graph topologies

Graph topologies in biological data

Networks with hubs. Typical in biological networks.

Power-law networks. The distribution of the number of connections ξ of each node is

$P(\xi = k) = \frac{k^{-\alpha}}{\varsigma(\alpha)}$, $k \geq 1$,

for some constant α and the normalizing function ς(α).
Peng et al. (2009): α = 2.3 provides a distribution that is close to what is expected in biological networks.

Random networks:

$P(\xi = k) = \binom{p}{k} \theta^k (1 - \theta)^{p-k}$,

where the parameter θ determines the proportion of edges (or sparsity) in the graph.
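The power-law degree distribution with α = 2.3 can be explored numerically. The sketch below draws node degrees with NumPy's `Generator.zipf`, which samples from k^(−α)/ζ(α) for k ≥ 1, and approximates the normalising constant by a truncated sum. It only samples degrees; it does not build a graph with those degrees, and the sample size, truncation point and seed are illustrative assumptions.

```python
import numpy as np

alpha = 2.3                                  # value suggested by Peng et al. (2009)
rng = np.random.default_rng(0)
degrees = rng.zipf(alpha, size=1000)         # P(xi = k) = k^(-alpha) / zeta(alpha), k >= 1

# Normalising constant approximated by a truncated sum (tail beyond 10^6 is negligible)
k = np.arange(1, 1_000_001, dtype=float)
normaliser = np.sum(k ** (-alpha))
pmf_first = k[:5] ** (-alpha) / normaliser   # P(xi = 1), ..., P(xi = 5)

print(pmf_first)
print(np.mean(degrees == 1))                 # empirical share of degree-1 nodes
```

Most of the probability mass sits on degree 1, with a heavy tail producing a few highly connected nodes.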


Simulated data Graph topologies

Examples of simulated graphs

[Figure: examples of simulated graphs. Panels: (g) p=50, hubs-based; (h) p=170, hubs-based; (i) p=290, hubs-based; (j) p=50, power-law; (k) p=170, power-law; (l) p=290, power-law.]

Simulated data Performance

Average ranks for the MSE of the precision matrix

dimension p = 50

          Hubs-based                  Power law
n         50    100   200   500      50    100   200   500
AGNES     3.05  3.55  4.06  4.40     3.12  3.73  4.40  4.71
A-MSE     4.33  4.90  5.22  5.38     4.92  5.47  5.67  5.78
PC        5.23  5.80  5.58  5.15     4.58  5.13  4.85  4.49
StARS     1.27  1.49  1.18  1.28     1.17  1.43  1.04  1.07
BIC       5.38  3.73  3.14  3.06     5.33  3.66  3.08  3.02
AIC       1.73  1.52  1.82  1.73     1.90  1.58  1.96  1.92




Average ranks for the MSE of the dissimilarity matrix

[Table: columns Hubs-based and Power law, each for n = 50, 100, 200, 500; values not recovered.]






True discovery rate TDR = TP/(TP + FP)

[Figure: TDR (from 0 to 1) against n = 50, 100, 200, 500, in panels p=50, p=170, p=290 and p=500, for AGNES, A-MSE, PC, StARS, BIC and AIC.]

TDR increases with n for AGNES, A-MSE and PC, and decreases for AIC and BIC.


ROC curves

[Figure: six ROC panels, TPR against FPR, with one curve per method.]

Dots: optimal graph selected by the corresponding method.

Simulated data Summary

Summary

AGNES is the best approach to recover global network characteristics (e.g. the proportion of edges, Mean Geodesic Distance) but generally leads to complex graphs that are difficult to interpret.

Augmented MSE: gives sparser graphs than AGNES and achieves better results in estimating the adjacency matrix A; more interpretable graphs.

Path Connectivity is computationally the fastest method and is only slightly worse than A-MSE in terms of MSE(A). It generally obtains simple graph structures which are easier to interpret.

The choice of method depends on the relative cost of False Positives compared to that of True Positives.


Tumour gene expression data


Gene expression data set, colorectal tumour study (Hinoue et al., 2012).
25 patients.
Paired samples: the gene expression profiling is obtained in each patient for a colorectal tumour sample and its healthy adjacent colonic tissue.
Total number of genes: 25,000.
7,579 genes were analysed (selected as differentially expressed between the conditions).



Dependence structure for tumour gene expression data: healthy

[Figure: estimated graphs for the healthy samples. Path Connectivity: clusters 1 to 10. A-MSE: clusters 1 to 17.]



Dependence structure for tumour gene expression data: tumour

[Figure: estimated graphs for the tumour samples. Path Connectivity: clusters 1 to 13. A-MSE: clusters 1 to 15.]



PC graph for gene expression data

10 clusters in the healthy samples.
13 clusters in the tumour samples.

Overlap between cluster 4 in the healthy samples (84 genes) and cluster 2 in the tumour samples (88 genes), which share 38 genes.

Overlap expected by chance: ∼ 4.45 genes.

Genes in Cluster 4 (normal) and Cluster 2 (tumour):
P53-signaling pathway (P53 being the classical cancer gene)
DNA replication
adaptive immune system


Summary

Summary and future work

Summary

Propose a network-based method to choose the hyperparameter in Gaussian graphical models.
Estimation of the conditional dependence graph is more stable than with approaches which depend on Ω only via ||Ω||_1.
Estimated graphs are more interpretable.
Choice of method should be determined by the relative cost of FP vs TP.
Bayesian interpretation? A point estimator under a modularised DAG.

R package: "GMRPS"; the paper is on arXiv:1509.05326.

Current and future work

Asymptotic/non-asymptotic properties. In particular, given n and p, how large is the conditional dependence graph that can be estimated reliably?
Other risk functions, notably based on the second (and other) eigenvalues of A.
Test for the difference between the conditional dependence graphs in different groups of samples.
“Differential” network: difference between networks in two conditions.


Summary

References

Cai, T., W. Liu, and X. Luo (2011). A Constrained l1 Minimization Approach to Sparse Precision Matrix Estimation. Journal of the American Statistical Association 106(494), 594–607.

Costa, L. and F. Rodrigues (2007). Characterization of complex networks: A survey of measurements. Advances in Physics 56(1), 167–242.

Estrada, E. (2011). The structure of complex networks. New York: Oxford University Press.

Hinoue, T., D. J. Weisenberger, C. P. E. Lange, H. Shen, H.-M. Byun, D. Van Den Berg, S. Malik, F. Pan, H. Noushmehr, C. M. van Dijk, R. A. E. M. Tollenaar, and P. W. Laird (2012, February). Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Research 22(2), 271–82.

Kaufman, L. and P. Rousseeuw (2009). Finding groups in data: an introduction to cluster analysis. New Jersey: John Wiley & Sons.

Liu, H., K. Roeder, and L. Wasserman (2011). Stability approach to regularization selection (StARS) for high dimensional graphical models. Journal of Computational and Graphical Statistics, 1.

Meinshausen, N. and P. Bühlmann (2010). Stability Selection. Journal of the Royal Statistical Society, Series B 72, 417–473.

Peng, J., P. Wang, N. Zhou, and J. Zhu (2009, June). Partial Correlation Estimation by Joint Sparse Regression Models. Journal of the American Statistical Association 104(486), 735–746.


Summary

Thank you!


Summary

Simulated data

$Y_i \sim N_p(0, \Omega^{-1})$, $i = 1, \dots, n$

3 graph structure scenarios: hubs, power law and random networks.

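Then $\Omega = \Omega^{(0)} + \delta I$,

where the off-diagonal elements of Ω are (Cai et al., 2011)

$\Omega^{(0)}_{ij} = \begin{cases} \mathrm{Unif}(0.5, 0.9) & \text{if } A_{ij} = 1 \text{ and } \mathrm{Bern}(0.5) = 1; \\ \mathrm{Unif}(-0.9, -0.5) & \text{if } A_{ij} = 1 \text{ and } \mathrm{Bern}(0.5) = 0; \\ 0 & \text{if } A_{ij} = 0, \end{cases}$

with δ such that Ω is a positive definite matrix.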

Each simulation is repeated 50 times.
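The construction of Ω from an adjacency matrix can be sketched as below. The slide only requires δ to make Ω positive definite; the specific choice here (the smallest shift that makes all eigenvalues non-negative, plus a `margin`) is an assumption, as are the function name and the example graph.

```python
import numpy as np

def simulate_precision(A, rng, margin=0.1):
    """Omega = Omega0 + delta * I, with off-diagonal entries +/- Unif(0.5, 0.9) on edges."""
    A = np.asarray(A)
    p = A.shape[0]
    magnitude = rng.uniform(0.5, 0.9, size=(p, p))
    sign = np.where(rng.random((p, p)) < 0.5, 1.0, -1.0)    # Bern(0.5) sign per entry
    upper = np.triu(magnitude * sign * A, k=1)              # keep i < j, zero elsewhere
    omega0 = upper + upper.T                                # symmetric, zero diagonal
    min_eig = np.linalg.eigvalsh(omega0).min()
    delta = max(0.0, -min_eig) + margin                     # assumed choice of delta
    return omega0 + delta * np.eye(p)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
omega = simulate_precision(A, rng)

# n replicates Y_i ~ N_p(0, Omega^{-1})
Y = rng.multivariate_normal(np.zeros(3), np.linalg.inv(omega), size=50)
```

With this choice the smallest eigenvalue of Ω equals `margin`, so Ω is positive definite by construction.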


Summary

Path connectivity and 2nd eigenvalue of A

[Figure: H(λ) × 100 and the second eigenvalue of A (evalue2) plotted against λ, for λ between 0.30 and 0.55.]

