Joint gene network inference with multiple samples: a bootstrapped consensual approach
Nathalie Villa-Vialaneix, http://www.nathalievilla.org
Annual meeting of the NETBIO group, Paris, 10 September 2013
Joint work with Matthieu Vignes, Nathalie Viguerie and Magali SanCristobal
Consensus LASSO (SAMM, UP1), Nathalie Villa-Vialaneix, Paris, 10 September 2013
Short overview on network inference with GGM
Outline
1 Short overview on network inference with GGM
2 Inference with multiple samples
3 Simulations
Framework
Data: large-scale gene expression data
X = (X_i^j): a matrix with n ≃ 30-50 individuals (rows) and p ≃ 10^3-10^4 variables (gene expressions, columns).
What we want to obtain: a graph/network with
β’ nodes: genes;
β’ edges: strong links between gene expressions.
Using partial correlations
correlation is not causality...
[Diagram: x → y and x → z, producing a strong indirect correlation between y and z.]

set.seed(2807); x <- rnorm(100)
y <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, y)   # [1] 0.998826
z <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, z)   # [1] 0.998751
cor(y, z)   # [1] 0.9971105

Partial correlation
cor(lm(x ~ z)$residuals, lm(y ~ z)$residuals)   # [1] 0.7801174
cor(lm(x ~ y)$residuals, lm(z ~ y)$residuals)   # [1] 0.7639094
cor(lm(y ~ x)$residuals, lm(z ~ x)$residuals)   # [1] -0.1933699
Theoretical framework
Gaussian Graphical Models (GGM) [Schäfer and Strimmer, 2005, Meinshausen and Bühlmann, 2006, Friedman et al., 2008]
Gene expressions: X ∼ N(0, Σ).
Sparse approach: partial correlations are estimated by using linear models and a sparse penalty: ∀ j,

X_j = β_j^T X_{\j} + ε ;   argmax_{(β_{jj'})_{j'}} [ log ML_j − λ Σ_{j'≠j} |β_{jj'}| ].

In the Gaussian framework, β_{jj'} = −S_{jj'}/S_{jj}, where S = Σ^{−1} (the concentration matrix) is related to partial correlations by π_{jj'} = −S_{jj'}/√(S_{jj} S_{j'j'}).
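The identity β_{jj'} = −S_{jj'}/S_{jj} linking regression coefficients to the concentration matrix can be checked numerically; a minimal numpy sketch (illustrative only, with an arbitrary positive-definite covariance, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
B = rng.normal(size=(p, p))
Sigma = B @ B.T + p * np.eye(p)      # a random positive-definite covariance
S = np.linalg.inv(Sigma)             # concentration matrix S = Sigma^{-1}

j = 2
idx = [i for i in range(p) if i != j]
# population regression coefficients of X_j on X_{-j}
beta = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[np.ix_(idx, [j])]).ravel()
target = -S[idx, j] / S[j, j]        # identity: beta_{jj'} = -S_{jj'} / S_{jj}
assert np.allclose(beta, target)
```

The check is exact (it follows from the block-inverse formula), so no sampling is involved.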
Inference with multiple samples
Motivation for multiple-network inference
Pan-European project Diogenes1 (with Nathalie Viguerie, INSERM): gene expressions (lipid tissues) from 204 obese women before and after a low-calorie diet (LCD).
• Assumption: a common functioning exists regardless of the condition;
• Which genes are linked independently from / depending on the condition?
1 http://www.diogenes-eu.org/; see also [Viguerie et al., 2012]
Inference with multiple samples
Naive approach: independent estimations
Notations: p genes measured in k samples, each corresponding to a specific condition: (X_j^c)_{j=1,...,p} ∼ N(0, Σ^c), for c = 1, ..., k. For c = 1, ..., k, n_c independent observations (X_{ij}^c)_{i=1,...,n_c}, with Σ_c n_c = n.

Independent inference
For all c = 1, ..., k and all j = 1, ..., p, the regressions

X_j^c = X_{\j}^c β_j^c + ε_j^c

are estimated (independently) by maximizing the pseudo-likelihood

L(S | X) = Σ_c Σ_j log P(X_j^c | X_{\j}^c, S_j^c),   with S the concentration matrix.
Related papers
Problem: the previous estimation does not use the fact that the different networks should be somehow alike!
Previous proposals
• [Chiquet et al., 2011]: replace Σ^c by Σ̃^c = (1/2) Σ^c + (1/2) Σ and add a sparse penalty;
• [Chiquet et al., 2011]: LASSO and Group-LASSO type penalties to force consistent or sign-coherent edges between conditions;
• [Danaher et al., ]: add a sparse penalty plus the penalty Σ_{c,c'} ‖S^c − S^{c'}‖_{L1};
• [Mohan et al., 2012]: add a group-LASSO-like penalty Σ_{c,c'} Σ_j ‖S_j^c − S_j^{c'}‖_{L2} that focuses on differences due to a small number of nodes only.
Consensus LASSO
Proposal: infer multiple networks by forcing them toward a consensual network, i.e., explicitly constraining the differences between conditions to be under control, but with an L2 penalty to allow for more differences than with Group-LASSO type penalties.

Original optimization:

max_{(β_{jk}^c)_{k,j,c=1,...,C}}  Σ_c log ML_j^c − λ Σ_{k,j} |β_{jk}^c|.

[Ambroise et al., 2009, Chiquet et al., 2011]: this is equivalent to minimizing p problems of dimension k(p − 1):

(1/2) β_j^T Σ̃_{\j\j} β_j + β_j^T Σ̃_{j\j} + λ ‖β_j‖_{L1}

where Σ̃_{\j\j} is the block-diagonal matrix Diag(Σ̃_{\j\j}^1, ..., Σ̃_{\j\j}^k), and similarly for Σ̃_{j\j}.

Add a constraint to force inference toward a "consensus" β^cons:

(1/2) β_j^T Σ̃_{\j\j} β_j + β_j^T Σ̃_{j\j} + λ ‖β_j‖_{L1} + μ Σ_c w_c ‖β_j^c − β_j^cons‖²_{L2}

with:
• w_c: a real number used to weight the conditions (w_c = 1 or w_c = 1/n_c);
• μ: a regularization parameter;
• β_j^cons: whatever you want...?
Choice of a consensus

β_j^cons = Σ_c (n_c/n) β_j^c is a good choice because:
• the consensual penalty is then quadratic with respect to β_j;
• thus, solving the optimization problem is equivalent to minimizing

(1/2) β_j^T S_j(μ) β_j + β_j^T Σ̃_{j\j} + λ Σ_c (1/n_c) ‖β_j^c‖_1

with S_j(μ) = Σ̃_{\j\j} + 2μ A^T A, where A is a matrix that does not depend on j.

Convex part + L1-norm penalty
Similar to standard LASSO problems: use of an "active set" approach as described in [Osborne et al., 2000, Chiquet et al., 2011].
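That the consensus penalty really is a fixed quadratic form can be checked directly: stacking the β_j^c into one vector, μ Σ_c w_c ‖β_j^c − β_j^cons‖² equals μ‖Aβ_j‖², which is how the 2μAᵀA term enters S_j(μ). A numpy sketch (dimensions and weights are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)
k, q = 3, 4                      # conditions, coefficients per condition
n_c = np.array([20., 30., 50.]); n = n_c.sum()
w = 1.0 / n_c                    # condition weights (here w_c = 1/n_c)
mu = 1.0

beta = rng.normal(size=(k, q))   # stacked coefficients, one row per condition
beta_cons = (n_c / n) @ beta     # consensus = n_c/n weighted mean

# direct evaluation of the penalty mu * sum_c w_c ||beta^c - beta^cons||^2
direct = mu * sum(w[c] * np.sum((beta[c] - beta_cons) ** 2) for c in range(k))

# same penalty as a quadratic form mu * ||A b||^2 on the stacked vector b
D = np.eye(k) - np.outer(np.ones(k), n_c / n)    # deviation-from-consensus map
A = np.kron(np.diag(np.sqrt(w)) @ D, np.eye(q))  # fixed matrix, independent of j
b = beta.ravel()
quad = mu * b @ (A.T @ A) @ b
assert np.allclose(direct, quad)
```

The matrix A depends only on k, q, the n_c and the w_c, which is why it can be built once and reused for every gene j.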
Bootstrap estimation
Bootstrapped Consensus LASSO
1: Require: list of genes {1, ..., p}; gene expressions X; condition ids c_i ∈ {1, ..., C}
2: Initialize: ∀ j, j' ∈ {1, ..., p}, N_c(j, j') ← 0; μ fixed
3: for b = 1 → P do
4:   take a bootstrap sample B_b
5:   estimate (β_j^{c,λ,b})_{j,c,λ} with the previous method for several λ (in decreasing order)
6:   find λ_max, the largest λ such that Σ_{j,j',c} I{β_j^{c,λ,b} ≠ 0} > T1, and set (β_j^{c,b})_{j,c} := (β_j^{c,λ_max,b})_{j,c}
7:   if β_j^{c,b} ≠ 0 then
8:     N_c(j, j') ← N_c(j, j') + 1
9:   end if
10: end for
11: Select edges with N_c(j, j') > T2 (T2 chosen)
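The loop above can be sketched in Python; note that infer_edges below is a simple correlation-threshold placeholder standing in for one consensus-LASSO fit, and all sizes and thresholds are made up for illustration:

```python
import numpy as np

def infer_edges(X, lam):
    """Stand-in for one consensus-LASSO fit on a bootstrap sample:
    keep pairs whose absolute sample correlation exceeds lam
    (a placeholder, NOT the actual estimator of the slides)."""
    C = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return np.abs(C) > lam

rng = np.random.default_rng(42)
n, p, P, T1, T2 = 80, 6, 100, 2, 80        # toy sizes and thresholds
X = rng.normal(size=(n, p))                # toy data: gene 0 drives 1 and 2
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n)
X[:, 2] = X[:, 0] + 0.3 * rng.normal(size=n)

N = np.zeros((p, p), dtype=int)            # N(j, j'): selection counts
lams = np.linspace(0.9, 0.1, 17)           # decreasing regularization path
for b in range(P):
    Xb = X[rng.integers(0, n, size=n)]     # bootstrap sample B_b
    for lam in lams:                       # stop at lam_max: the largest lam
        E = infer_edges(Xb, lam)           # whose network has more than T1 edges
        if E.sum() // 2 > T1:
            break
    N += E                                 # accumulate selection counts
edges = np.argwhere(np.triu(N > T2, k=1))  # keep edges selected more than T2 times
```

With this toy setup the three strongly correlated pairs (0,1), (0,2), (1,2) are selected in every bootstrap sample and survive the final T2 threshold.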
Simulations
Simulated data
Expression data with known co-expression network
• original network (scale-free) taken from http://www.comp-sys-bio.org/AGN/data.html (100 nodes, ∼ 200 edges, loops removed);
• rewire a ratio r of the edges to generate k "children" networks (sharing approximately 100(1 − 2r)% of their edges);
• generate "expression data" with a random Gaussian process from each child.
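The rewiring step can be sketched in a few lines of Python (the graph representation and the exact rewiring rule are illustrative assumptions, not the generator actually used):

```python
import random

def rewire(edges, p, r, rng):
    """Rewire a ratio r of the edges of an undirected graph on nodes 0..p-1:
    delete round(r * |E|) random edges, then add the same number of new
    random edges (simple sketch; the AGN generator may differ)."""
    edges = set(edges)
    k = round(r * len(edges))
    removed = rng.sample(sorted(edges), k)
    edges -= set(removed)
    while removed:
        u, v = rng.sample(range(p), 2)
        e = (min(u, v), max(u, v))
        if e not in edges:          # only add edges that are not already there
            edges.add(e)
            removed.pop()
    return edges

rng = random.Random(0)
parent = {(i, i + 1) for i in range(99)}    # toy "parent": a 100-node path
child = rewire(parent, 100, 0.05, rng)      # one "child" network
shared = len(parent & child) / len(parent)  # roughly 1 - r of edges shared
```

Applying rewire twice to the same parent gives two children that share approximately 100(1 − 2r)% of their edges, as stated above.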
An example with k = 2, r = 5%
Choice for T2
Data: r = 0.05, k = 2 and n1 = n2 = 20; 100 bootstrap samples, μ = 1, T1 = 250 or 500.

[Figure: two precision/recall curves, one per value of T1; the highlighted points correspond to edges selected at least 30 times (T1 = 250) and at least 40 or 43 times (T1 = 500).]

Dots correspond to the best F = 2 × (precision × recall)/(precision + recall).
⇒ The best F corresponds to selecting a number of edges approximately equal to the number of edges of the original network.
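Selecting the point with the best F along such a curve is a one-liner; a tiny Python illustration (the precision/recall points are hypothetical, not taken from the slides):

```python
def f_measure(precision, recall):
    # F = 2 * precision * recall / (precision + recall)
    return 2 * precision * recall / (precision + recall)

# hypothetical (precision, recall) points along a selection-threshold curve
curve = [(0.9, 0.2), (0.7, 0.45), (0.5, 0.6), (0.3, 0.7)]
best = max(curve, key=lambda pr: f_measure(*pr))  # -> (0.7, 0.45)
```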
Choice for T1 and μ
μ ∈ {0.1, 1}; T1 ∈ {250, 300, 500}; "% of improvement" is the improvement brought by bootstrapping.

network sizes    μ     T1    % of improvement
rewired edges: 5%
20-20            1     500   30.69
20-30            0.1   500   11.87
30-30            1     300   20.15
50-50            1     300   14.36
20-20-20-20-20   1     500   86.04
30-30-30-30      0.1   500   42.67
rewired edges: 20%
20-20            0.1   300   -17.86
20-30            0.1   300   -18.35
30-30            1     500   -7.97
50-50            0.1   300   -7.83
20-20-20-20-20   0.1   500   10.27
30-30-30-30      1     500   13.48
Comparisons (best/worst-case F for different parameters)

Method           gLasso   cLasso        gLasso+boot   cLasso+boot
nc: rewired edges: 5%
20-20            0.19     0.22 (0.18)   0.27 (0.26)   0.29 (0.27)
20-30            0.26     0.30 (0.26)   0.31 (0.29)   0.33 (0.32)
30-30            0.28     0.31 (0.27)   0.35 (0.31)   0.38 (0.36)
50-50            0.36     0.43 (0.36)   0.47 (0.46)   0.49 (0.49)
20-20-20-20-20   0.19     0.23 (0.18)   0.39 (0.38)   0.43 (0.40)
30-30-30-30      0.30     0.36 (0.29)   0.49 (0.48)   0.51 (0.50)
nc: rewired edges: 20%
20-20            0.21     0.23 (0.19)   0.18 (0.17)   0.19 (0.17)
20-30            0.26     0.26 (0.25)   0.20 (0.19)   0.22 (0.20)
30-30            0.28     0.31 (0.29)   0.27 (0.27)   0.29 (0.28)
50-50            0.42     0.43 (0.41)   0.38 (0.37)   0.40 (0.38)
20-20-20-20-20   0.20     0.22 (0.20)   0.22 (0.20)   0.24 (0.24)
30-30-30-30      0.27     0.29 (0.27)   0.30 (0.30)   0.33 (0.31)
Not shown here, but when the percentage of rewired edges is larger (20%), intertwined LASSO performs better (and its performances are not improved by bootstrapping).
Real data
204 obese women; expression of 221 genes before and after a LCD. μ = 1; T1 = 1000 (target density: 4%).
Distribution of the number of times an edge is selected over 100 bootstrap samples:

[Figure: histogram of the selection counts (0 to 100) over the 100 bootstrap samples; 70% of the pairs of nodes are never selected.]

⇒ T2 = 80
Networks

[Figure: inferred networks before and after the diet; labeled genes: AZGP1, CIDEA, FADS1, FADS2, GPD1L, PCK2.]

Densities about 1.3%; some interactions (both shared and specific) make sense for the biologist.
Thank you for your attention...
Programs available in the R package therese (on R-Forge). Joint work with:
Magali SanCristobal (LGC, INRA de Toulouse), Matthieu Vignes (MIAT, INRA de Toulouse) and Nathalie Viguerie (I2MC, INSERM Toulouse).
References
Ambroise, C., Chiquet, J., and Matias, C. (2009). Inferring sparse Gaussian graphical models with latent structure. Electronic Journal of Statistics, 3:205-238.
Chiquet, J., Grandvalet, Y., and Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21(4):537-553.
Danaher, P., Wang, P., and Witten, D. The joint graphical lasso for inverse covariance estimation across multiple classes. Preprint arXiv 1111.0324v3. Submitted for publication.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441.
Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462.
Mohan, K., Chung, J., Han, S., Witten, D., Lee, S., and Fazel, M. (2012). Structured learning of Gaussian graphical models. In Proceedings of NIPS (Neural Information Processing Systems) 2012, Lake Tahoe, Nevada, USA.
Osborne, M., Presnell, B., and Turlach, B. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9(2):319-337.
Schäfer, J. and Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754-764.
Viguerie, N., Montastier, E., Maoret, J., Roussel, B., Combes, M., Valle, C., Villa-Vialaneix, N., Iacovoni, J., Martinez, J., Holst, C., Astrup, A., Vidal, H., Clément, K., Hager, J., Saris, W., and Langin, D. (2012). Determinants of human adipose tissue gene expression: impact of diet, sex, metabolic status and cis genetic regulation. PLoS Genetics, 8(9):e1002959.