Methods for High Dimensional Interactions
Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: Gestational Diabetes
• Large Data: Child's epigenome (p ≈ 450k)
• Phenotype: Obesity measures
Differential Correlation between environments
[Figure: (a) Gestational diabetes affected pregnancy; (b) Controls]
Gene Expression: COPD patients
[Figure: (a) Gene Exp.: Never Smokers; (b) Gene Exp.: Current Smokers; (c) Correlations: Never Smokers; (d) Correlations: Current Smokers]
NIH MRI brain study
• Environment: Age
• Large Data: Cortical Thickness (p ≈ 80k)
• Phenotype: Intelligence
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high-dimensional data set (p ≫ n)
• Yn×1: phenotype
• En×1: environmental factor that has a widespread effect on X and can modify the relation between X and Y

Objective
• Which elements of X that are associated with Y depend on E?
conceptual model
• Environment (Maternal care, Age, Diet): defines the groups E = 0 and E = 1
• Large Data (p ≫ n): Gene Expression, DNA Methylation, Brain Imaging, measured in each environment
• Phenotype: Behavioral development, IQ scores, Death
• Environment → Phenotype: epidemiological study
• Environment → Large Data → Phenotype: (epi)genetic/imaging associations
Is this mediation analysis?
• No
• We are not making any causal claims, i.e., about the direction of the arrows
• Such an analysis requires many untestable assumptions → not well understood for high-dimensional data
analysis strategies

Single-Marker or Single-Variable Tests
• marginal correlations (univariate p-value)
• multiple testing adjustment

Multivariate Regression Approaches Including Penalization Methods
• LASSO (convex penalty with one tuning parameter)
• MCP, SCAD, Dantzig selector (non-convex penalties with two tuning parameters)
• group-level penalization (group LASSO, SCAD and MCP)

Clustering Together with Regression
• cluster features based on Euclidean distance, correlation, or connectivity
• regression with a group-level summary (PCA, average)
ECLUST - our proposed method: 3 phases
Starting from the original data, split by environment (E = 0 and E = 1):
1) Gene Similarity: computed separately within each environment
2) Cluster Representation: each cluster is summarized by an n × 1 variable
3) Penalized Regression: Yn×1 ∼ (cluster summaries) + (cluster summaries) × E
"The objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data."
— Sir R. A. Fisher, 1922
Underlying model

Y = β0 + β1U + β2U · E + ε   (1)
X ∼ F(α0 + α1U, ΣE)   (2)

• U: unobserved latent variable
• X: observed data, which is a function of U
• ΣE: environment-sensitive correlation matrix
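The generative model in equations (1)–(2) can be simulated in a few lines of numpy. This is a minimal sketch, not the authors' simulation design: the Gaussian choice for F, the compound-symmetry form of ΣE, and all numeric values (α's, β's, correlation levels) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 50

# Latent variable U and binary environment E, as in equations (1)-(2)
U = rng.normal(size=n)
E = rng.integers(0, 2, size=n)

# X ~ F(alpha0 + alpha1 * U, Sigma_E): here F is Gaussian and the residual
# correlation is stronger when E = 1 (an environment-sensitive Sigma_E)
alpha0, alpha1 = 0.0, 1.0

def compound_symmetry(rho, p):
    """p x p correlation matrix with constant off-diagonal rho."""
    s = np.full((p, p), rho)
    np.fill_diagonal(s, 1.0)
    return s

X = np.empty((n, p))
for e, rho in ((0, 0.2), (1, 0.8)):
    idx = np.where(E == e)[0]
    noise = rng.multivariate_normal(np.zeros(p), compound_symmetry(rho, p),
                                    size=idx.size)
    X[idx] = alpha0 + alpha1 * U[idx, None] + noise

# Y = beta0 + beta1 * U + beta2 * U * E + eps, equation (1)
beta0, beta1, beta2 = 0.0, 2.0, 1.5
Y = beta0 + beta1 * U + beta2 * U * E + rng.normal(scale=0.5, size=n)
```

Note that Y depends on X only through the latent U, which is why cluster summaries of X can act as proxies for U.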
advantages and disadvantages

General Approach | Advantages | Disadvantages
Single-Marker | simple, easy to implement | multiple testing burden, power, interpretability
Penalization | multivariate, variable selection, sparsity, efficient optimization algorithms | poor sensitivity with correlated data, ignores structure in design matrix, interpretability
Environment Cluster with Regression | multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability | difficult to identify relevant clusters, clustering is unsupervised
Methods to detect gene clusters

Table 1: Methods to detect gene clusters

General Approach | Formula
Correlation | Pearson, Spearman, biweight midcorrelation
Correlation Scoring | |ρE=1 − ρE=0|
Weighted Correlation Scoring | c |ρE=1 − ρE=0|
Fisher's Z Transformation | |z_ij0 − z_ij1| / √(1/(n0 − 3) + 1/(n1 − 3))
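The Fisher's Z row of Table 1 can be computed for every gene pair at once. A minimal numpy sketch; the clipping constant and the toy data are assumptions for illustration:

```python
import numpy as np

def fisher_z_stat(X0, X1):
    """Differential-correlation statistic for every gene pair.

    X0, X1: (n0 x p) and (n1 x p) data matrices for the two environments.
    Returns the p x p matrix |z_ij0 - z_ij1| / sqrt(1/(n0-3) + 1/(n1-3)),
    where z = arctanh(rho) is Fisher's Z transform.
    """
    n0, n1 = X0.shape[0], X1.shape[0]
    r0 = np.corrcoef(X0, rowvar=False)
    r1 = np.corrcoef(X1, rowvar=False)
    # clip so arctanh(+/-1) on the diagonal does not produce infinities
    z0 = np.arctanh(np.clip(r0, -0.999999, 0.999999))
    z1 = np.arctanh(np.clip(r1, -0.999999, 0.999999))
    se = np.sqrt(1.0 / (n0 - 3) + 1.0 / (n1 - 3))
    return np.abs(z0 - z1) / se

rng = np.random.default_rng(0)
X0 = rng.normal(size=(40, 10))             # E = 0: independent features
u = rng.normal(size=(60, 1))
X1 = u + 0.5 * rng.normal(size=(60, 10))   # E = 1: strongly correlated
T = fisher_z_stat(X0, X1)
```

Large entries of T flag gene pairs whose correlation differs between environments, which is the similarity used for clustering.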
Cluster Representation

Table 2: Methods to create cluster representations

General Approach | Type
Unsupervised | average, K principal components
Supervised | partial least squares
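The two unsupervised summaries in Table 2 (cluster average and first principal component) can be sketched in a few lines of numpy; the toy cluster below is hypothetical:

```python
import numpy as np

def cluster_summaries(Xc):
    """Two unsupervised one-number-per-subject summaries of a cluster.

    Xc: n x k matrix of the features in one cluster. Returns the cluster
    average and the first principal component score (each of length n).
    """
    avg = Xc.mean(axis=1)
    Xc_centered = Xc - Xc.mean(axis=0)
    # first right singular vector of the centered matrix = first PC loading
    _, _, vt = np.linalg.svd(Xc_centered, full_matrices=False)
    pc1 = Xc_centered @ vt[0]
    return avg, pc1

rng = np.random.default_rng(1)
Xc = rng.normal(size=(50, 8))   # hypothetical cluster of 8 features
avg, pc1 = cluster_summaries(Xc)
```

Either summary replaces the k columns of the cluster with a single n × 1 variable for the penalized-regression phase.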
Model

g(µ) = β0 + β1X1 + · · · + βpXp + βE E   (main effects)
       + α1E (X1E) + · · · + αpE (XpE)   (interactions)

• g(·) is a known link function
• µ = E[Y | X, E, β, α]
• β = (β1, β2, . . . , βp, βE) ∈ R^(p+1)
• α = (α1E, . . . , αpE) ∈ R^p
Variable Selection

arg min over β0, β, α of: (1/2) ‖Y − g(µ)‖² + λ (‖β‖1 + ‖α‖1)

• ‖Y − g(µ)‖² = Σi (yi − g(µi))²
• ‖β‖1 = Σj |βj|
• ‖α‖1 = Σj |αjE|
• λ ≥ 0: tuning parameter
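For the identity link, the objective above is an ordinary lasso on the augmented design [X, E, X·E] and can be minimized with a proximal-gradient (ISTA) loop. This is a generic sketch under that assumption, not the authors' solver; the data, λ, and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Z, y, lam, n_iter=500):
    """ISTA for (1/2)||y - Z b||^2 + lam * ||b||_1."""
    L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        grad = Z.T @ (Z @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(2)
n, p = 80, 10
X = rng.normal(size=(n, p))
E = rng.integers(0, 2, size=n).astype(float)
# augmented design: main effects X and E, plus interactions X * E
Z = np.hstack([X, E[:, None], X * E[:, None]])
# true model: main effect of X1 plus an X1-by-E interaction
y = 2.0 * X[:, 0] + 1.5 * X[:, 0] * E + rng.normal(scale=0.1, size=n)
b = lasso_ista(Z, y, lam=5.0)
```

Note that this plain lasso penalizes main effects and interactions independently; the strong heredity penalty introduced later couples them.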
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
• Interpretability: a model with an interaction but no main effects is generally not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E
Model

g(µ) = β0 + β1X1 + · · · + βpXp + βE E   (main effects)
       + α1E (X1E) + · · · + αpE (XpE)   (interactions)

Reparametrization¹: αjE = γjE βj βE

Strong heredity principle²: αjE ≠ 0 ⇒ βj ≠ 0 and βE ≠ 0

¹ Choi et al. 2010, JASA
² Chipman 1996, Canadian Journal of Statistics
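The reparametrization αjE = γjE βj βE makes strong heredity automatic: if βj = 0 (or βE = 0), then αjE = 0 no matter how large γjE is. A tiny numeric illustration; all parameter values below are hypothetical:

```python
import numpy as np

# Hypothetical parameter values illustrating alpha_jE = gamma_jE * beta_j * beta_E
beta = np.array([1.2, 0.0, -0.7])   # main effects beta_1..beta_3
beta_E = 0.9                        # main effect of the environment
gamma = np.array([2.0, 3.0, 0.0])   # interaction parameters gamma_jE

alpha = gamma * beta * beta_E       # implied interaction coefficients

# strong heredity: a zero main effect (beta_2 = 0) forces the interaction
# alpha_2E to zero even though gamma_2E = 3.0
assert alpha[1] == 0.0
```

Estimating γ instead of α therefore builds the heredity constraint into the parametrization rather than enforcing it as a side condition.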
Strong Heredity Model with Penalization

arg min over β0, β, γ of:
(1/2) ‖Y − g(µ)‖² + λβ (w1 β1 + · · · + wq βq + wE βE) + λγ (w1E γ1E + · · · + wqE γqE)

with adaptive weights

wj = |1 / βj|,  wjE = |βj βE / αjE|
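The adaptive weights can be evaluated from initial coefficient estimates (for example, least-squares or ridge fits). The numbers below are hypothetical initial estimates, used only to show the computation:

```python
import numpy as np

# Hypothetical initial estimates of the main effects and interactions
beta_hat = np.array([2.1, 0.3, -1.4])
beta_E_hat = 0.8
alpha_hat = np.array([0.9, 0.05, -0.6])

# w_j = |1 / beta_j|: small main effects get penalized heavily
w = np.abs(1.0 / beta_hat)
# w_jE = |beta_j * beta_E / alpha_jE|: interactions weak relative to their
# parent main effects get penalized heavily
w_E = np.abs(beta_hat * beta_E_hat / alpha_hat)
```

Variables with strong initial signals thus receive light penalties, while weak ones are shrunk aggressively, sharpening selection.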
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user-specified interaction terms
• Automatically determines the optimal tuning parameters through cross-validation
• Can also be applied to genetic data (SNPs)
The most popular way of feature screening

How do you fit statistical models when you have over 100,000 features?

Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
• keep all features with correlation greater than some threshold
• however, this procedure assumes a linear relationship between X and Y
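The marginal screening step above can be sketched in numpy; the threshold of 0.3 and the toy data are assumptions:

```python
import numpy as np

def screen_by_correlation(X, y, threshold):
    """Keep features whose absolute Pearson correlation with y exceeds
    the threshold (the marginal screening step described above)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.where(np.abs(r) > threshold)[0], r

rng = np.random.default_rng(5)
n, p = 200, 1000
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(size=n)   # only feature 0 is linearly related to y
keep, r = screen_by_correlation(X, y, threshold=0.3)
```

A feature related to y only non-linearly (say, through X²) would have near-zero marginal correlation and be dropped, which motivates the KS-based screening on the next slide.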
Non-linear feature screening: Kolmogorov-Smirnov Test

Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic

Kj = sup_x |Fj(x | Y = 1) − Fj(x | Y = 0)|   (3)

Figure 8: Depiction of the KS statistic
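Equation (3) only needs the two class-conditional empirical CDFs. A minimal numpy sketch with assumed toy data:

```python
import numpy as np

def ks_statistic(xj, y):
    """K_j = sup_x |F_j(x | Y=1) - F_j(x | Y=0)|, with the empirical CDFs
    evaluated at the pooled sample points."""
    x0, x1 = xj[y == 0], xj[y == 1]
    grid = np.sort(xj)
    F0 = np.searchsorted(np.sort(x0), grid, side="right") / x0.size
    F1 = np.searchsorted(np.sort(x1), grid, side="right") / x1.size
    return np.abs(F1 - F0).max()

rng = np.random.default_rng(4)
n = 300
y = rng.integers(0, 2, size=n)
x_null = rng.normal(size=n)              # same distribution in both classes
x_sig = rng.normal(size=n) + 2.0 * y     # location shift when Y = 1
k_null = ks_statistic(x_null, y)
k_sig = ks_statistic(x_sig, y)
```

Because Kj compares whole distributions rather than means, it picks up any distributional difference between the Y = 0 and Y = 1 classes, not just linear shifts.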
Non-linear Interaction Models

After feature screening, we can fit non-linear relationships between X and Y:

Yi = β0 + Σj f(Xij) + Σj f(Xij, Ei) + εi   (4)
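One concrete (assumed) choice for f in equation (4) is a piecewise-linear hinge basis, with basis-by-E columns capturing f(X, E); the knot locations and data below are illustrative:

```python
import numpy as np

def hinge_basis(x, knots):
    """Piecewise-linear basis: [x, max(0, x - k) for each knot k]."""
    cols = [x] + [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-2, 2, size=n)
E = rng.integers(0, 2, size=n).astype(float)
# non-linear main effect plus a non-linear interaction with E
y = np.sin(x) + 1.0 * np.sin(x) * E + rng.normal(scale=0.1, size=n)

knots = [-1.0, 0.0, 1.0]
B = hinge_basis(x, knots)
# design for f(X) + f(X, E): basis main effects plus basis-by-E interactions
Z = np.column_stack([np.ones(n), B, B * E[:, None], E])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ coef
```

With many screened features, the same expansion is applied per feature and the coefficients are then penalized as before.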
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can be exploited to aid the analysis of large data
• We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high-dimensional data (p ≫ n) and an environmental factor
• Dimension reduction is achieved by leveraging the environmental-class-conditional correlations
• We also develop and implement a strong heredity framework within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
ECLUST method
1. an environmental exposure (currently only binary)
2. a high-dimensional dataset that can be affected by the exposure
3. a single phenotype (continuous or binary)
4. there must be a high-dimensional signature of the exposure
Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, Andrée-Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta, Kathleen Klein
• Mothers and children from the study