Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data...

Blessing of heterogeneous large-scale datafor high-dimensional causal inference

Peter Buhlmann

Seminar fur Statistik, ETH Zurich

joint work with

Jonas PetersMPI Tubingen

Nicolai MeinshausenSfS ETH Zurich

Heterogeneous large-scale data

Big Data

the talk is not (yet) on “really big data”

but we will take advantage of heterogeneityoften arising with large-scale data where

i.i.d./homogeneity assumption is not appropriate

causal inference = intervention analysis

in genomics (for yeast or plants):if we would make an intervention at a single (or many) gene(s),what would be its (their) effect on a response of interest?

want to infer/predict such effects without actually doing theinterventione.g. from observational data (cf. Pearl; Spirtes, Scheines & Glymour)(from observations of a “steady-state system”)

or, as is mainly our focus: fromI observational and interventional data with well-specified

interventionsI changing environments or experimental settings with

“vaguely” or “un-” specified interventionsthat is, from “heterogeneous” data

Genomics

1. Flowering of Arabidopsis Thaliana

phenotype/response variable of interest:Y = days to bolting (flowering)“covariates” X = gene expressions from p = 21′326 genes

question: infer/predict the effect of knocking-out a single geneon the phenotype/response variable Y?

using statistical method based on n = 47 observational data; validated the top-predictions with randomized experiments

with some moderate success(Stekhoven, Moraes, Sveinbjornsson, Hennig, Maathuis & PB, 2012)

2. Gene expressions of yeast

p = 5360 genesphenotype of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes

and thenphenotype of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes

and so on

infer/predict the effects of single gene deletions on all othergenes

Effects of single gene deletions on all other genes (yeast)

(Maathuis, Colombo, Kalisch & PB, 2010)• p = 5360 genes (expression of genes)• 231 single gene deletions ; 1.2 · 106 intervention effects• the truth is “known in good approximation”

(thanks to intervention experiments)

goal: prediction of the true large intervention effectsbased on observational data with no knock-downs

n = 63observational data

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800

1,000 IDALassoElastic−netRandom

REGRESSION

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800


REGRESSION

because: for

Y =

p∑j=1

βjX (j) + ε

βj measures effect of X (j) on Y whenkeeping all other variables {X (k); k 6= j} fixed

but when doing an intervention at a gene ; some/many othergenes might change as well and cannot be kept fixed

causal inference framework: allows to definedynamic notion of intervention effect“without keeping other variables fixed”

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800


“predictions of single genedeletions effects in yeast”

was a good finding for a difficultproblem...

... but some criticisms apply:I not very “robust”:

depends somewhat how we define the “truth”(has been found recently by J. Mooij, J. Peters and others)

I old, publicly available data (Hughes et al., 2000)I no collaborator... just (publicly available︸︷︷︸

this is good!

) data

I we didn’t use the interventional data for training

A new and “better” attempt for the “same” problemsingle gene knock-down in yeast and measure genomewideexpression (observational and interventional data)

goal: predict unseen gene knock-down effectsbased on both observational and interventional data

collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)

data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)

Graphical and Structural equation modelslate 1980s: Pearl; Spirtes, Glymour, Scheines; Dawid; Lauritzen;. . .

powerful language of graphs make formulations and modelsmore transparent (Evans and Richardson, 2011)

variables X1, . . . ,Xp+1 (Xp+1 = Y is the response of interest)

directed acyclic graph (DAG) D0 encoding the true underlyingcausal influence diagram

structural equation model (SEM):

Xj ← f 0j (XpaD0 (j), εj), j = 1, . . . ,p + 1,

ε1, . . . , εp+1 independent

e.g. linear Xj ←∑

k∈paD0 (j)

β0jkXk + εj j = 1, . . . ,p + 1

causal variables for Y = Xp+1: S0 = {j ; j ∈ paD0(Y )}causal coefficients for Y in linear SEM: β0

Yj , j ∈ paD0(Y )

severe issues of identifiability !

(X ,Y ) ∼ N2(0,Σ)

X XY Y

X causes Y Y causes X

agenda for estimation(Chickering, 2002; Shimizu, 2005; Kalisch & PB, 2007;... )

1. estimate the Markov equivalence class of DAGs D0: Dsevere issues of identifiability !

2. derive causal variables: the ones which are causal in allDAGs from D; derive bounds for causal effects based on D(Maathuis, Kalisch & PB, 2009)

goal: construction of confidence statements(without knowing the structure of the underlying graph)

problem:direct likelihood-based inference:

(as in e.g. Mladen Kolar’s talk for undirected graphs)difficult, because of severe identifiability issues !

nevertheless: our goal is to construct a procedureS ⊆ {1, . . . ,p} such that

P[ S ⊆ S0︸︷︷︸no false positives

] ≥ 1− α

without the need to specify identifiable componentsthat is: if a variable Xj cannot be identified to be causal⇒ j /∈ S

we do this with an entirely new framework(which might be “much simpler” for causal inference)NOT or AVOIDINGgraphical model fitting, potential outcome models,...and maybe more accessible to experts in regression modeling?

Causal inference using invariant prediction

Peters, PB and Meinshausen (2015)

a main message:

causal structure/components remain the samefor different sub-populations

while the non-causal components can change acrosssub-populations

thus:; look for “stability” of structures among

different sub-populations

goal:find the causal variables (components) among a p-dimensionalpredictor variable X for a specific response variable Y

consider data

(X e,Y e) ∼ F e, e ∈ E

with response variables Y e and predictor variables X e

here: e ∈ E︸︷︷︸space of exp. sett.

denotes an experimental setting

“heterogeneous” data fromdifferent environments/experiments e ∈ E

(aspect of “Big Data”)

data

(X e,Y e) ∼ F e, e ∈ E

example 1: E = {1,2} encoding observational (1) and allpotentially unspecific interventional data (2)

example 2: E = {1,2} encoding observational data (1) and(repeated) data from one specific intervention (2)

example 3: E = {1,2,3} ... or E = {1,2,3, . . . ,26} ...

do not need data from carefullydesigned (randomized) experiments

Invariance Assumption

for a set S∗ ⊆ {1, . . . ,p}: L(Y e|X eS∗) is invariant across e ∈ E

for linear model setting: there exists a vector γ∗ such that

∀e ∈ E : Y e = X eγ∗ + εe, εe ⊥ X eS∗ , S∗ = {j ; γ∗j 6= 0}

εe ∼ Fε the same for all eX e has an arbitrary distribution, different across e

γ∗, S∗ is interesting in its own right!

namely the parameter and structure which remain invariantacross experimental settings, or across heterogeneous groups

Moreover: link to causality

assume:

in short : L(Y e|X eS∗) is invariant across e ∈ E

Proposition (Peters, PB & Meinshausen, 2015)If E does not affect the structural equation for Y in a SEM:

e.g. linear SEM: Y e ←∑

k∈pa(Y )

βYk︸︷︷︸∀e

X ek + εe

Y︸︷︷︸∼Fε∀e

then S0 = pa(Y )︸︷︷︸causal var.

satisfies the Invariance Assumption (w.r.t. E)

the causal variables lead to invariance (of conditional distr.)

if E does not affect structural equation for Y :

S0 = pa(Y ) satisfies the Invariance Assumption

this assumption holds for example for:I do-intervention (Pearl) at variables different than YI noise (or “soft”) intervention (Eberhardt & Scheines, 2007)

at variables different than Y

in addition: there might be many other S∗ satisfying theInvariance Assumptionbut uniqueness is not really important (see later)

how do we know whetherE is not affecting structural equation for Y ?

if E does affect structural equation for Y :we will argue:

“robustness” of our procedure (proposed later)

; no causal statementsno false positivesconservative, but on the safe side

Invariance Assumption: plausible to hold with real data

two-dimensional conditional distributions of observational (blue)and interventional (orange) data(no intervention at displayed variables X ,Y )

seeminglyno invarianceof conditional d.

plausibleinvarianceof conditional d.

A procedure: population caserequire and exploit the Invariance Assumption

L(Y e|X eS∗) the same across e ∈ E

H0,γ,S(E) : γk = 0 if k /∈ S and∃Fε such that ∀ e ∈ E :

Y e = X eγ + εe, εe ⊥ X eS , ε

e ∼ Fε the same for all e

if H0,γ,S(E) is true:I model is correctI S, γ are plausible causal variables/predictors and

coefficientsand

H0,S(E) : there exists γ such that H0,γ,S(E) holds

S is called “plausible causal predictors” if H0,S(E) holds

identifiable causal predictors under E :

is defined as the set S(E), where

S(E) =⋂{ S; H0,S(E) holds︸︷︷︸plausible causal predictors

}

the intersection of all plausible causal predictors

under the Invariance Assumption we have: for any S∗,

S(E) ⊆ S∗

and this is key to obtain confidence bounds for identifiablecausal predictors

we have by definition:

S(E1) ⊆ S(E2) if E1 ⊆ E2

withI more interventionsI more “heterogeneity”I more “diversity in complex data”

we can identify more causal predictors

identifiable causal predictors : S(E)↗ as E ↗

question: when is S(E) = S0 ?but it is not important that it is “equal to” or unique(see later)

Theorem (Peters, PB and Meinshausen, 2015)

S(E) = S0 = (parental set of Y in the causal DAG)

if there is:I a single do-intervention for each variable other than Y and |E| = pI a single noise intervention for each variable other than Y and |E| = pI a simultaneous noise intervention and |E| = 2

the conditions can be relaxed such that it is not necessary to intervene at allthe variables

Statistical confidence sets for causal predictors

“the finite sample version of S(E) =⋂

S{S; H0,S(E) is true}”

for “any” S ⊆ {1, . . . ,p}:test whether H0,S(E) is accepted or rejected

S(E) =⋂S

{H0,S accepted at level α}

for H0,S(E):test constancy of regression param. and its residual error distr.across e ∈ E

by weakening H0,S(E) to H0,S(E):

∃β, σ : βepred(S) ≡ β, σe

pred(S) ≡ σ ∀e ∈ E

where

βepred(S) = argminβ;βk=0 (k /∈S)E|Y e − X eβ|2,

σepred(S) = (E|Y e − X eβe

pred(S)|2)1/2

note: H0,S(E) true =⇒ H0,S(E) true

testing H0,S(E): assuming Gaussian errors

DT Σ−1D D

σ2ne∼ F (ne,n−e − |S| − 1)

D = Y e − Y e, Y e based on data \{e}ΣD = Ine + Xe,S(X T

−e,SX−e,S)−1X Te,S

reject H0,S(E) if p-value < α/|E|

S(E) =⋂S

{H0,S accepted at level α}

for some significance level 0 < α < 1

going through all sets S?

going through all sets S?

1. start with S = ∅: if H0,∅(E) accepted =⇒ S(E) = ∅2. consider small sets S of cardinality 1,2, . . .

and construct corresponding intersections S∩ withpreviously considered accepted sets S (H0,S(E) accepted)

for S with H0,S accepted :

S∩ ← S∩ ∩ S

if intersection S∩ = ∅ =⇒ S(E) = ∅if not:

discard all S with S ⊇ S∩

and continue with the remaining sets3. for large p:

restrict search space by variables from Lasso regression;need a faithfulness assumption (and sparsity and assumptionson X e for justification)

confidence sets with invariant prediction

1. for each S ⊆ {1, . . . ,p} construct a set ΓS(E) as follows:

if H0,S(E) rejected at α/(2|E|) ; set ΓS(E) = ∅otherw. ΓS(E) = classic. (1− α/2) CI for βS based on XS

2. set

Γ(E) =⋃

S⊆{1,...,p}

ΓS(E) (est. plausible causal coeff.)

we then obtain:

confidence set for S∗ : S(E)

confidence set for γ∗ : Γ(E)

Theorem (Peters, PB and Meinshausen, 2015)

assume: linear model, Gaussian errors

then, for any γ∗,S∗ satisfying the Invariance Assumption:

P[S(E) ⊆ S∗] ≥ 1− α : confidence w.r.t. true causal var.P[γ∗ ∈ Γ(E)] ≥ 1− α : confidence set for true causal param.

and can choose S∗ = S0 = causal variables in linear SEMif E does not affect struct. eqn. of Y

“on the safe side” (conservative)

we do not need to care about identifiability: if the effect is notidentifiable, the method will not wrongly claim an effect

we do not require that S∗ is minimal and unique

“the first” result on statistical confidence for potentiallynon-identifiable causal predictors when structure is unknown(route via graphical modeling for confidence sets seems awkward)

leading to (hopefully) morereliable causal inferential statements

“Robustness”

if Invariance Assumption does not holdbecause E has a direct effect on Y

; for all S: L(Y e|X eS) is not invariant across e ∈ E

; S(E) = ∅ (at least as n→∞)still on the safe side but no power

Empirical results: simulations100 different scenarios, 1000 data sets per scenario:|E| = 2, nobs = ninterv ∈ {100, . . . ,500}, p ∈ {5, . . . ,40}

SU

CC

ES

S P

RO

BA

BIL

ITY

Mar

gina

lM

argi

nal

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Reg

ress

ion

Reg

ress

ion

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●●

●

● ●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

Ges

Ges

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

Gie

s (u

nkno

wn)

Gie

s (u

nkno

wn)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

Gie

s (k

now

n)G

ies

(kno

wn) ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●● ●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

Ling

amLi

ngam

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

Inva

riant

pre

dict

ion

Inva

riant

pre

dict

ion

0.00

0.25

0.50

0.75

1.00

power todetect causal predictors

FW

ER

Mar

gina

l Mar

gina

l

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Reg

ress

ion R

egre

ssio

n

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●●

●

● ●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

Ges

Ges

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●●

●●

●●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Gie

s (u

nkno

wn)

Gie

s (u

nkno

wn)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

Gie

s (k

now

n)G

ies

(kno

wn)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●● ●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

Ling

amLi

ngam

●

●

● ●●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●●

●

● ●● ●●● ●● ●●● ● ●●●● ●●● ●●●●●● ●●

●●●●

●●●● ●●●●● ●●● ●● ●●● ●●

●

●● ●● ● ●●

●●● ●●● ●●

●●● ●● ●●●●● ●●● ●●● ● ●●●●●

●● ●● ●● ●● ●●

●

●

Inva

riant

pre

dict

ion In

varia

nt p

redi

ctio

n

0.00

0.50

1.00

familywise error rate:P[S(E) 6⊆ S∗], aimed at 0.05

Single gene deletion experiments in yeast

p = 6170 genesresponse of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes

and thenresponse of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes

and so on

infer/predict the effects of a single gene knock-down on allother genes

collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)

data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)

Kemmeren et al. (2014):genome-wide mRNA expressions in yeast: p = 6170 genes

I nobs = 160 “observational” samples of wild-typesI nint = 1479 “interventional” samples

each of them corresponds to a single gene deletion strain

for our method:I we use |E| = 2 (observational and interventional data)I training-test data:• training: all observational and 2/3 of interventional data• test: other 1/3 of gene deletion interventions• repeat this for the three blocks of interventional data

I since every interventional data point is used once as aresponse variable:we use coverage 1− α/nint with α = 0.01 and nint = 1479

Results

8 genes are significant (α = 0.01 level) causal variables(each of the 8 genes “causes” another gene)

validation with test datamethod invar.pred. GIES PC-IDA marg.corr. rand.guess.

no. true pos. 6 2 2 2 *(out of 8)

*: quantiles for selecting true positives among 7 random draws2 (95%), 3 (99%)

; our invariant prediction method has most power !and it should exhibit control against false positive selections

# INTERVENTION PREDICTIONS

# S

TR

ON

G IN

TE

RV

EN

TIO

N E

FF

EC

TS

0 5 10 15 20 25

02

46

8

PERFECTINVARIANTHIDDEN−INVARIANTPCRFCIREGRESSION (CV−Lasso)GES and GIESRANDOM (99% prediction− interval)

I : invariant prediction methodH: invariant prediction with some hidden variables

Validation (Meinshausen, Hauser, Mooij, Peters, Versteeg & PB, 2015)with intervention experiments: strong intervention effect (SIE)with yeastgenome.org database: scores A-F

rank cause effect SIE A B C D E F1 YMR104C YMR103C X2 YPL273W YMR321C X3 YCL040W YCL042W X4 YLL019C YLL020C X5 YMR186W YPL240C X X X X X X6 YDR074W YBR126C X X X X X X7 YMR173W YMR173W-A X8 YGR162W YGR264C9 YOR027W YJL077C X

10 YJL115W YLR170C11 YOR153W YDR011W X X12 YLR270W YLR345W13 YOR153W YBL005W14 YJL141C YNR007C15 YAL059W YPL211W16 YLR263W YKL098W17 YGR271C-A YDR339C18 YLL019C YGR130C19 YCL040W YML100W20 YMR310C YOR224C

SIE: correctly predicting a strong intervention effect which is inthe 1%- or 99% tail of the observational data

Flow cytometry data (Sachs et al., 2005)

I p = 11 abundances of chemical reagentsI 8 different environments (not “well-defined” interventions)

(one of them observational; 7 different reagents added)I each environment contains ne ≈ 700− 1′000 samples

goal:recover network of causal relations (linear SEM)

Raf

Mek

PLCg

PIP2

PIP3

Erk

Akt

PKA

PKC

p38

JNK

approach: invariant causal prediction(one variable the response Y ; the other 10 the covariates X ;

do this 11 times with every variable once the response)

main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail

main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail

instead of requiring invariance among 8 environments; require invariance for pairs of environments Eij (28 pairs intotal) and do Bonferroni correction by taking the union

S(E) = ∪i<j Sα/28(Eij)

this is weakening the assumption that an intervention shouldnot directly influence the response Y

and if a pair of interventions does nevertheless; S = ∅ (“robustness” as discussed before)

Raf

Mek

PLCg

PIP2

PIP3

Erk

Akt

PKA

PKC

p38

JNK

blue edges: only invariant causal prediction approach (ICP)red: only ICP allowing hidden variables and feedbackpurple: both ICP with and without hidden variablessolid: all relations that have been reported in literaturebroken: new findings not reported in the literature

; reasonable consensus with existing resultsbut no real ground-truth available

serves as an illustration that we can work with “vaguely definedinterventions”

Concluding thoughts

generalize Invariance Assumption and statistical testing tononparametric/nonlinear modelsin particular additive models

∀e ∈ E : Y e = f ∗(X eS∗) + εe, εe ∼ Fε, εe ⊥ XS∗

∀e ∈ E : Y e =∑j∈S∗

f ∗j (X ej ) + εe, εe ∼ Fε, εe ⊥ XS∗

the statistical significance testing becomes more difficultimproved identifiability with nonlinear SEMs (Mooij et al., 2009)

generalize to include hidden variables

S0 = pa(Y )∩observed variables, S0H = pa(Y )∩hidden variables

presented procedure still OK if (essentially):- no interventions at Y , at S0

H and ancestors(S0H)

- no hidden confounder between Y and S0

Y

H1 H2H3

S0X1

X8

X12

more general hidden variable models can be treated with a“related” technique

(Rothenhausler, Heinze, Peters & Meinshausen, 2015)

I

C

Y

E

W

I

X Y

W

1

(X ,Y ) form a DAGI

C

Y

E

W

I

X Y

W

1

X = (C,E)without knowing C,E

C ← Bc←i I + Bc←w W + εC

Y ← By←w W + By←cC + εY

E ← Be←i I + Be←w W + Be←cC + Be←y Y + εE

provocative next step: how about using “Big Data”?

; structure the “large-scale” data into differentunknown groups of experimental settings Ethat is: learn E from data

“optimal” E is a trade-off betweenidentifiability and statistical power

learning/estimating E

problem: given a “large bag of data”, can we estimate theunknown underlying different experimental conditions?

mathematically: denote byJ all experimental settings, satisfying Invariance Assumption; mixture distribution for e ∈ constr. E : F e =

∑j∈J we

j F j

when pooling two constr. experimental settings e1 and e2:; new mixture distribution with weights (we1 + we2)/2

- mixture modeling- change point modeling for (time) ordered datamight be useful to estimate E

causal components remain the same fordifferent sub-populations or experimental settings

; exploit the power of heterogeneity in complex data!

and confidence bounds follow naturally

Thank you!

SoftwareR-package: pcalg

(Kalisch, Machler, Colombo, Maathuis & PB, 2010–2015)R-package: InvariantCausalPrediction (Meinshausen, 2014)

References to some of our own work:I Peters, J., Buhlmann, P. and Meinshausen, N. (2015). Causal inference using invariant prediction:

identification and confidence intervals. To appear in J. Royal Statistical Society, Series B (with discussion).Preprint arXiv:1501.01332

I Meinshausen, N., Hauser, A. Mooij, J., Peters, J., Versteeg, P. and Buhlmann, P. and (2015). Causalinference from gene perturbation experiments: methods, software and validation. Preprint.

I Hauser, A. and Buhlmann, P. (2015). Jointly interventional and observational data: estimation ofinterventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal StatisticalSociety, Series B, 77, 291-318.

I Hauser, A. and Buhlmann, P. (2012). Characterization and greedy learning of interventional Markovequivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.

I Kalisch, M., Machler, M., Colombo, D., Maathuis, M.H. and Buhlmann, P. (2012). Causal inference usinggraphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.

I Stekhoven, D.J., Moraes, I., Sveinbjornsson, G., Hennig, L., Maathuis, M.H. and Buhlmann, P. (2012).Causal stability ranking. Bioinformatics 28, 2819-2823.

I Maathuis, M.H., Colombo, D., Kalisch, M. and Buhlmann, P. (2010). Predicting causal effects in large-scalesystems from observational data. Nature Methods 7, 247-248.

I Maathuis, M.H., Kalisch, M. and Buhlmann, P. (2009). Estimating high-dimensional intervention effects fromobservational data. Annals of Statistics 37, 3133-3164.

Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data...

Documents

Transcript of Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data...