Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Transcript of Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
UAI 2010 Tutorial, July 8, Catalina Island.

Non-Gaussian Methods for Learning Linear Structural Equation Models

Shohei Shimizu and Yoshinobu Kawahara, Osaka University

Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.
Abstract

• Linear structural equation models (linear SEMs) can be used to model the data-generating processes of variables.
• We review a new approach to learning or estimating linear SEMs.
• The new estimation approach utilizes the non-Gaussianity of data for model identification and uniquely estimates a much wider variety of models.
Outline

• Part I. Overview (70 min.): Shohei
• Break (10 min.)
• Part II. Recent advances (40 min.): Yoshi
– Time series
– Latent confounders
Motivation (1/2)

• Suppose that data X was randomly generated from one of the following two data-generating processes:

Model 1:                 Model 2:
x1 = e1                  x1 = b12 x2 + e1
x2 = b21 x1 + e2         x2 = e2
(graph: x1 → x2)         (graph: x2 → x1)

where e1 and e2 are latent variables (disturbances, errors).
• We want to estimate or identify which model generated the data X, based on the data X only.
Motivation (2/2)

• We want to identify which model generated the data X, based on the data X only.
• If e1 and e2 are Gaussian, it is well known that we cannot identify the data-generating process.
– Models 1 and 2 fit the data equally well.
• If e1 and e2 are non-Gaussian, an interesting result is obtained: we can identify which of Models 1 and 2 generated the data.
• This tutorial reviews how such non-Gaussian methods work.
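The asymmetry that these non-Gaussian methods exploit can be previewed numerically. Below is a minimal numpy sketch (illustrative only: the coefficient 0.8 and the uniform disturbances are assumptions, not taken from the slides). Data is generated from Model 1, and the dependence between regressor and regression residual is compared in both directions using a simple tanh-based nonlinear correlation. In the true direction the residual is (nearly) independent of the regressor; in the reverse direction it is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Model 1 with non-Gaussian (uniform) disturbances: x1 = e1, x2 = 0.8*x1 + e2.
e1 = rng.uniform(-1, 1, n)
e2 = rng.uniform(-1, 1, n)
x1 = e1
x2 = 0.8 * x1 + e2

def residual(y, x):
    """Residual of the least-squares simple regression of y on x."""
    return y - (np.cov(x, y)[0, 1] / np.var(x)) * x

def dependence(a, b):
    """Tanh-based nonlinear correlation: near zero if a and b are independent."""
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

# True direction: the residual of x2 on x1 is (essentially) e2, independent of x1.
dep_forward = dependence(x1, residual(x2, x1))
# Reverse direction: the residual of x1 on x2 still mixes e1 and e2.
dep_backward = dependence(x2, residual(x1, x2))

print(dep_forward < dep_backward)  # True: the true direction is detectable
```

With Gaussian disturbances the two dependence scores would be statistically indistinguishable, which is exactly the Gaussian non-identifiability discussed above.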
Problem formulation
Basic problem setup (1/3)

• Assume that the data-generating process of continuous observed variables xi is graphically represented by a directed acyclic graph (DAG).
– Acyclicity means that there are no directed cycles.
• [Figures: an example directed acyclic graph (DAG) and an example directed cyclic graph over x1, x2, x3 with external influences e1, e2, e3.]
• x1 is a parent of x3, etc.
Basic problem setup (2/3)

• Further assume linear relations among the variables xi. Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989):

xi = Σ_{j: parents of xi} bij xj + ei    or    x = Bx + e

where
– The ei are continuous latent variables that are not determined inside the model, which we call external influences (disturbances, errors).
– The ei are of non-zero variance and are independent.
– The `path-coefficient' matrix B = [bij] corresponds to a DAG.
Example of a linear acyclic SEM

• A three-variable linear acyclic SEM:

x1 = 1.5 x3 + e1          [x1]   [  0    0   1.5 ] [x1]   [e1]
x2 = -1.3 x1 + e2    or   [x2] = [-1.3   0    0  ] [x2] + [e2]
x3 = e3                   [x3]   [  0    0    0  ] [x3]   [e3]
                                         B

• B corresponds to the data-generating DAG (x3 →(1.5) x1 →(-1.3) x2):
– bij = 0  ⇔ no directed edge from xj to xi
– bij ≠ 0 ⇔ a directed edge from xj to xi
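As a quick sketch (numpy; the uniform disturbances are an illustrative assumption, any non-degenerate distribution works here), data from this example can be generated by solving x = Bx + e, i.e., x = (I − B)^{-1} e:

```python
import numpy as np

# Path-coefficient matrix of the example:
# x1 = 1.5*x3 + e1, x2 = -1.3*x1 + e2, x3 = e3.
B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
n = 1000
# Non-Gaussian (uniform) external influences, one column per variable.
E = rng.uniform(-1, 1, size=(n, 3))

# x = Bx + e  =>  x = (I - B)^{-1} e  (applied row-wise to the whole sample).
X = E @ np.linalg.inv(np.eye(3) - B).T

# The generated data satisfies the structural equations exactly.
print(np.allclose(X[:, 0], 1.5 * X[:, 2] + E[:, 0]),
      np.allclose(X[:, 1], -1.3 * X[:, 0] + E[:, 1]),
      np.allclose(X[:, 2], E[:, 2]))  # True True True
```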
Assumption of acyclicity

• Acyclicity ensures the existence of an ordering of the variables xi that makes B lower-triangular with zeros on the diagonal (Bollen, 1989).
• For the example, the ordering is x3 < x1 < x2:

[x3]   [  0    0   0 ] [x3]   [e3]
[x1] = [ 1.5   0   0 ] [x1] + [e1]
[x2]   [  0  -1.3  0 ] [x2]   [e2]
            B_perm

• E.g., x3 may be an ancestor of x1 and x2, but not vice versa.
Assumption of independence between external influences

• It implies that there are no latent confounders (Spirtes et al., 2000).
– A latent confounder is a latent variable that is a parent of two or more observed variables, e.g., f with f → x1 and f → x2.
• Such a latent confounder makes the external influences dependent (Part II).
Basic problem setup (3/3): Learning linear acyclic SEMs

• Assume that data X is randomly sampled from a linear acyclic SEM (with no latent confounders): x = Bx + e.
• Goal: estimate the path-coefficient matrix B by observing the data X only!
– B corresponds to the data-generating DAG.
Problems: Identifiability problems of conventional methods
Under what conditions is B identifiable?

• `B is identifiable' ≡ `B is uniquely determined or estimated from p(x)'.
• Linear acyclic SEM: x = Bx + e.
– B and p(e) induce p(x).
– If p(x) is different for different B, then B is uniquely determined.
Conventional estimation principle: causal Markov condition

• If the data-generating model is a (linear) acyclic SEM, the causal Markov condition holds:
– Each observed variable xi is independent of its non-descendants in the DAG conditional on its parents (Pearl & Verma, 1991):

p(x) = Π_{i=1}^{p} p(xi | parents of xi)
Conventional methods based on the causal Markov condition

• Methods based on conditional independencies (Spirtes & Glymour, 1991)
– Many linear acyclic SEMs give the same set of conditional independences and fit the data equally well.
• Scoring methods based on Gaussianity (Chickering, 2002)
– Many linear acyclic SEMs give the same Gaussian distribution and fit the data equally well.
• In many cases, the path-coefficient matrix B is not uniquely determined.
Example

• Two models with Gaussian e1 and e2, where E(e1) = E(e2) = 0 and var(x1) = var(x2) = 1:

Model 1:                 Model 2:
x1 = e1                  x1 = 0.8 x2 + e1
x2 = 0.8 x1 + e2         x2 = e2

• Both introduce no conditional independence: cov(x1, x2) = 0.8 ≠ 0.
• Both induce the same Gaussian distribution:

(x1, x2)ᵀ ~ N( (0, 0)ᵀ, [[1, 0.8], [0.8, 1]] )
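The indistinguishability can be checked directly: a zero-mean linear SEM x = Bx + e with independent disturbances implies the covariance (I − B)^{-1} diag(var(e)) (I − B)^{-T}. A small numpy sketch (the disturbance variances 1 and 0.36 follow from the normalization var(x1) = var(x2) = 1):

```python
import numpy as np

def implied_cov(B, e_vars):
    """Covariance implied by x = Bx + e with independent disturbances:
    (I - B)^{-1} diag(e_vars) (I - B)^{-T}."""
    A = np.linalg.inv(np.eye(len(e_vars)) - B)
    return A @ np.diag(e_vars) @ A.T

# Model 1: x1 = e1, x2 = 0.8 x1 + e2   (var(e1) = 1, var(e2) = 1 - 0.8^2)
cov1 = implied_cov(np.array([[0.0, 0.0],
                             [0.8, 0.0]]), [1.0, 0.36])
# Model 2: x1 = 0.8 x2 + e1, x2 = e2   (var(e1) = 1 - 0.8^2, var(e2) = 1)
cov2 = implied_cov(np.array([[0.0, 0.8],
                             [0.0, 0.0]]), [0.36, 1.0])

print(cov1)                      # [[1, 0.8], [0.8, 1]]
print(np.allclose(cov1, cov2))   # True: identical Gaussian distributions
```

Since a zero-mean Gaussian distribution is fully determined by its covariance, no method can tell the two models apart from Gaussian data.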
A solution: the non-Gaussian approach
A new direction: the non-Gaussian approach

• Non-Gaussian data arises in many applications:
– Neuroinformatics (Hyvärinen et al., 2001); bioinformatics (Sogawa et al., ICANN2010); social sciences (Micceri, 1989); economics (Moneta, Entner, et al., 2010)
• Utilizing non-Gaussianity for model identification:
– Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001)
• The path-coefficient matrix B is uniquely estimated if the ei are non-Gaussian.
– Shimizu, Hoyer, Hyvärinen and Kerminen (JMLR, 2006)
Illustrative example: Gaussian vs. non-Gaussian

[Figure: scatterplots of (x1, x2) for Model 1 (x1 = e1, x2 = 0.8 x1 + e2) and Model 2 (x1 = 0.8 x2 + e1, x2 = e2), with Gaussian vs. non-Gaussian (uniform) external influences; E(e1) = E(e2) = 0, var(x1) = var(x2) = 1. With Gaussian ei the two models give the same distribution; with uniform ei the two scatterplots are clearly different.]
Linear Non-Gaussian Acyclic Model: LiNGAM (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)

• Non-Gaussian version of the linear acyclic SEM:

xi = Σ_{j: parents of xi} bij xj + ei    or    x = Bx + e

where the external influence variables ei (disturbances, errors) are
– of non-zero variance, and
– non-Gaussian and mutually independent.
Identifiability of the LiNGAM model

• The LiNGAM model can be shown to be identifiable.
– B is uniquely estimated.
• To see the identifiability, it is helpful to review independent component analysis (ICA) (Hyvärinen et al., 2001).
Independent Component Analysis (ICA) (Jutten & Herault, 1991; Comon, 1994)

• The observed random vector x is modeled by

x = As    or    xi = Σ_{j=1}^{p} aij sj

where
– the mixing matrix A = [aij] is square and of full column rank;
– the latent variables sj (independent components) are of non-zero variance, non-Gaussian, and mutually independent.
• Then A is identifiable up to permutation P and scaling D of the columns:

A_ica = APD
Estimation of ICA

• Most estimation methods estimate W = A^{-1} (Hyvärinen et al., 2001):

x = As = W^{-1} s

• Most of the methods minimize the mutual information (or an approximation of it) of the estimated independent components ŝ = W_ica x.
• W is estimated up to permutation P and scaling D of the rows:

W_ica = PDW (= PDA^{-1})

• Consistent and computationally efficient algorithms:
– fixed-point (FastICA) (Hyvärinen, 1999); gradient-based (Amari, 1998)
– semiparametric: no specific distributional assumption
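The permutation and scaling indeterminacy can be made concrete with a toy mixing: if W = A^{-1} unmixes x = As, then any PDW unmixes it just as well, only returning the components permuted and rescaled. A sketch (the particular A, P and D below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two non-Gaussian (uniform) independent components.
S = rng.uniform(-1, 1, size=(1000, 2))
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # mixing matrix
X = S @ A.T                     # x = A s

W = np.linalg.inv(A)            # the 'true' unmixing matrix
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # a row permutation ...
D = np.diag([5.0, -0.5])        # ... and a row rescaling
W_ica = P @ D @ W               # an equally valid ICA solution

S_hat = X @ W_ica.T
# The recovered components are the originals, permuted and rescaled:
print(np.allclose(S_hat[:, 0], -0.5 * S[:, 1]))  # True
print(np.allclose(S_hat[:, 1], 5.0 * S[:, 0]))   # True
```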
Back to the LiNGAM model

Identifiability of LiNGAM (1/3): ICA achieves half of the identification

• The LiNGAM model is ICA.
– The observed variables xi are linear combinations of the non-Gaussian independent external influences ei:

x = Bx + e  ⇔  x = (I − B)^{-1} e = Ae,    i.e.,  e = Wx  with  W = I − B

• ICA gives W_ica = PDW = PD(I − B).
– P: unknown permutation matrix
– D: unknown scaling (diagonal) matrix
• We need to determine P and D to identify B.
Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6)

• ICA gives W_ica = PD(I − B).
– P: permutation matrix; D: scaling (diagonal) matrix
• We want to find a permutation matrix P̃ that cancels the permutation, i.e., P̃P = I:

P̃ W_ica = P̃ P D W = DW

• We can show (Shimizu et al., UAI2005) (illustrated in the next slides):
– If P̃P = I, i.e., no permutation is made on the rows of DW, then P̃ W_ica = DW has no zeros on the diagonal (obvious by definition).
– If P̃P ≠ I, i.e., any nonidentical permutation is made on the rows of DW, then P̃ W_ica has a zero on the diagonal.
Identifiability of LiNGAM (2/3): No permutation indeterminacy (2/6)

• By definition, W = I − B has all unities on the diagonal.
– The diagonal elements of B are all zeros.
• Acyclicity ensures the existence of an ordering of the variables that makes B lower-triangular, and then W = I − B is also lower-triangular.
• So, without loss of generality, W can be assumed to be lower-triangular:

     [ 1  0  0 ]
W =  [ *  1  0 ]
     [ *  *  1 ]

No zeros on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (3/6)

• Premultiplying W by a scaling (diagonal) matrix D does NOT change the zero/non-zero pattern of W:

     [ 1  0  0 ]           [ d11   0    0  ]
W =  [ *  1  0 ]     DW =  [  *   d22   0  ]
     [ *  *  1 ]           [  *    *   d33 ]

No zeros on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (4/6)

• Any other permutation of the rows of DW changes the zero/non-zero pattern of DW and brings a zero onto the diagonal. Exchanging the 1st and 2nd rows:

      [ d11   0    0  ]              [  *   d22   0  ]
DW =  [  *   d22   0  ]     P̃DW =   [ d11   0    0  ]
      [  *    *   d33 ]              [  *    *   d33 ]

Zero on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (5/6)

• Likewise, exchanging the 1st and 3rd rows:

      [ d11   0    0  ]              [  *    *   d33 ]
DW =  [  *   d22   0  ]     P̃DW =   [  *   d22   0  ]
      [  *    *   d33 ]              [ d11   0    0  ]

Zero on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (6/6)

• We can find the correct P̃ by finding the P̃ that gives no zeros on the diagonal of P̃ W_ica (Shimizu et al., UAI2005).
• Thus, we can solve the permutation indeterminacy and obtain:

P̃ W_ica = P̃ P D W = DW = D(I − B)
Identifiability of LiNGAM (3/3): No scaling indeterminacy

• Now we have P̃ W_ica = D(I − B).
• Then D = diag(P̃ W_ica).
• Divide each row of P̃ W_ica by its corresponding diagonal element to get I − B, i.e., B:

diag(P̃ W_ica)^{-1} P̃ W_ica = D^{-1} D(I − B) = I − B
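Putting the two steps together: given W_ica = PD(I − B) with unknown P and D, search the row permutations for the one with no zeros on the diagonal, then rescale the rows. A sketch using the earlier three-variable example (exhaustive search over permutations is for illustration only; practical implementations use an assignment algorithm):

```python
import numpy as np
from itertools import permutations

# True model: x3 -> x1 -> x2 (variables ordered x1, x2, x3).
B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])
W = np.eye(3) - B

# What ICA hands us: W scrambled by an unknown row permutation and scaling.
P = np.eye(3)[[1, 2, 0]]
D = np.diag([2.0, -0.5, 3.0])
W_ica = P @ D @ W

# Step 1: find the (unique) row permutation with no zeros on the diagonal.
for perm in permutations(range(3)):
    W_p = W_ica[list(perm)]
    if np.all(np.abs(np.diag(W_p)) > 1e-12):
        break

# Step 2: divide each row by its diagonal element to undo the scaling.
W_hat = W_p / np.diag(W_p)[:, None]
B_hat = np.eye(3) - W_hat
print(np.allclose(B_hat, B))  # True: B is recovered exactly
```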
Estimation of the LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
Two estimation algorithms

• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)
• DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI2009)
• Both estimate an ordering of the variables that makes the path-coefficient matrix B lower-triangular.
– Acyclicity ensures the existence of such an ordering.
• [Figure: x_perm = B_perm x_perm + e_perm with lower-triangular B_perm, and a full DAG over x3, x1, x2 containing redundant edges.]
Once such an ordering is found…

• Many existing (covariance-based) methods can then:
– prune the redundant path-coefficients
   • sparse methods like the adaptive lasso (Zou, 2006)
– find significant path-coefficients
   • testing, bootstrapping (Shimizu et al., 2006; Hyvärinen et al., 2010)
• [Figure: the full DAG over x3, x1, x2 is pruned down to the data-generating DAG.]
1. Outline of the ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)

1. Estimate B by ICA + permutation.
2. Prune the redundant edges.
[Figure: a fully connected DAG over x1, x2, x3 with coefficients b13, b23 is pruned to remove the redundant edges.]
ICA-LiNGAM algorithm (1/2): Step 1. Estimation of B

1. Perform ICA (here, FastICA (Hyvärinen, 1999)) to obtain an estimate Ŵ_ica of W_ica = PDW = PD(I − B).
2. Find a permutation P̂ that makes the diagonal elements of P̂ Ŵ_ica as large (non-zero) as possible in absolute value:

P̂ = argmin_P Σ_i 1 / |(P Ŵ_ica)_ii|    (Hungarian algorithm, Kuhn, 1955)

3. Normalize each row of P̂ Ŵ_ica; then we get an estimate of I − B and hence B̂.
ICA-LiNGAM algorithm (2/2): Step 2. Pruning

• Find an ordering of the variables that makes the estimated B as close to lower-triangular as possible.
– Find a permutation matrix Q̂ that minimizes the sum of the squared elements in the upper-triangular part of the permuted B̂:

Q̂ = argmin_Q Σ_{i≤j} (Q B̂ Qᵀ)_{ij}²

– An approximate algorithm exists for a large number of variables (Hoyer et al., ICA2006).
• [Figure: small estimated coefficients in the upper-triangular part, e.g., 0.1 and -0.01, are pruned to zero.]
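For a small number of variables the pruning permutation Q can be sketched by exhaustive search (the noisy B̂ below is made up for illustration):

```python
import numpy as np
from itertools import permutations

# A noisy estimate of B for the chain x3 -> x1 -> x2 (values made up):
# small spurious coefficients remain where the true entries are zero.
B_hat = np.array([[ 0.00, 0.01,  1.49],
                  [-1.31, 0.00, -0.02],
                  [ 0.03, 0.01,  0.00]])

# Find the simultaneous row/column permutation minimizing the squared mass
# in the upper-triangular part (exhaustive here; approximate methods exist
# for many variables).
best_perm = min(permutations(range(3)),
                key=lambda p: np.sum(np.triu(B_hat[np.ix_(p, p)]) ** 2))
print(best_perm)  # (2, 0, 1): the ordering x3 < x1 < x2

# Pruning: zero out the (small) upper-triangular part in that ordering.
B_pruned = np.tril(B_hat[np.ix_(best_perm, best_perm)], k=-1)
print(B_pruned)
```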
Basic properties of the ICA-LiNGAM algorithm

• ICA-LiNGAM algorithm = ICA + permutations
– Computationally efficient with the help of well-developed ICA techniques.
• Potential problems:
– ICA is an iterative search method: it may get stuck in a local optimum if the initial guess or step size is badly chosen.
– The permutation algorithms are not scale-invariant: they may provide different estimates for different scales of the variables.
Estimation of the LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
2. DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI2009)

• An alternative estimation method without ICA.
– Estimates an ordering of the variables that makes the path-coefficient matrix B lower-triangular.
• [Figure: x_perm = B_perm x_perm + e_perm with lower-triangular B_perm, and a full DAG over x3, x1, x2 containing redundant edges.]
• Many existing (covariance-based) methods can then do further pruning or find significant path coefficients (Zou, 2006; Shimizu et al., 2006; Hyvärinen et al., 2010).
Basic idea (1/2): An exogenous variable can be at the top of a right ordering

• An exogenous variable is a variable with no parents (Bollen, 1989), here x3.
– The corresponding row of B has all zeros.
• So, an exogenous variable can be at the top of an ordering that makes B lower-triangular with zeros on the diagonal:

[x3]   [  0    0   0 ] [x3]   [e3]
[x1] = [ 1.5   0   0 ] [x1] + [e1]        (DAG: x3 → x1 → x2)
[x2]   [  0  -1.3  0 ] [x2]   [e2]
Basic idea (2/2): Regress the exogenous variable out

• Compute the residuals r_i^(3) (i = 1, 2) of regressing the other variables xi on the exogenous x3.
– The residuals form a LiNGAM model.
– The ordering of the residuals r1^(3) and r2^(3) is equivalent to that of the corresponding original variables:

[x3]   [  0    0   0 ] [x3]   [e3]          [r1^(3)]   [  0   0 ] [r1^(3)]   [e1]
[x1] = [ 1.5   0   0 ] [x1] + [e1]    →     [r2^(3)] = [-1.3  0 ] [r2^(3)] + [e2]
[x2]   [  0  -1.3  0 ] [x2]   [e2]

• r1^(3) being exogenous implies that `x1 can be at the second top'.
Outline of DirectLiNGAM

• Iteratively find exogenous variables until all the variables are ordered:
1. Find an exogenous variable, here x3.
– Put x3 at the top of the ordering.
– Regress x3 out.
2. Find an exogenous residual, here r1^(3).
– Put x1 at the second top of the ordering.
– Regress r1^(3) out.
3. Put x2 at the third top of the ordering and terminate.
• The estimated ordering is x3 < x1 < x2.
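The loop can be sketched in a few lines of numpy. This is an illustrative toy implementation, not the authors' code: it uses the tanh-based nonlinear correlation described on a later slide as the independence measure, and simple least-squares regression residuals.

```python
import numpy as np

def residual(y, x):
    """Least-squares simple regression residual of y on x."""
    return y - (np.cov(x, y)[0, 1] / np.var(x)) * x

def dependence(a, b):
    """Tanh-based nonlinear correlation (near zero under independence)."""
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

def direct_lingam_order(X):
    """DirectLiNGAM ordering sketch: repeatedly pick the variable most
    independent of its regression residuals, then regress it out."""
    X = X.copy()
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        scores = [(sum(dependence(X[:, j], residual(X[:, i], X[:, j]))
                       for i in remaining if i != j), j)
                  for j in remaining]
        _, j_exo = min(scores)          # the most exogenous-looking variable
        order.append(j_exo)
        for i in remaining:             # replace the others by their residuals
            if i != j_exo:
                X[:, i] = residual(X[:, i], X[:, j_exo])
        remaining.remove(j_exo)
    order.append(remaining[0])
    return order

# Data from the example chain x3 -> x1 -> x2 with uniform disturbances.
rng = np.random.default_rng(0)
n = 50_000
e = rng.uniform(-1, 1, size=(n, 3))
x3 = e[:, 2]
x1 = 1.5 * x3 + e[:, 0]
x2 = -1.3 * x1 + e[:, 1]
X = np.column_stack([x1, x2, x3])

print(direct_lingam_order(X))  # expected: [2, 0, 1], i.e. x3 < x1 < x2
```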
Identification of an exogenous variable (two-variable case)

i) x1 is exogenous:
   x1 = e1 (b12 = 0)
   x2 = b21 x1 + e2 (b21 ≠ 0)
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1 = x2 − b21 x1 = e2
⇒ x1 and r2^(1) are independent.

ii) x1 is NOT exogenous:
   x1 = b12 x2 + e1 (b12 ≠ 0)
   x2 = e2
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1
          = {1 − b12 cov(x1, x2)/var(x1)} e2 − (cov(x1, x2)/var(x1)) e1
⇒ x1 and r2^(1) are NOT independent.
Need to use the Darmois-Skitovich theorem (Darmois, 1953; Skitovich, 1953)

ii) x1 is NOT exogenous:
   x1 = b12 x2 + e1 (b12 ≠ 0); x2 = e2
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1 = {1 − b12 cov(x1, x2)/var(x1)} e2 − (cov(x1, x2)/var(x1)) e1

Darmois-Skitovich theorem: define two variables y1 and y2 as
   y1 = Σ_{j=1}^{p} a1j ej,    y2 = Σ_{j=1}^{p} a2j ej,
where the ej are independent random variables. If there exists a non-Gaussian ei for which a1i a2i ≠ 0, then y1 and y2 are dependent.

Here both x1 and r2^(1) have non-zero coefficients on the non-Gaussian e1 (since b12 ≠ 0), so x1 and r2^(1) are NOT independent.
Identification of an exogenous variable (p-variable case)

• Lemma 1: xj and its residuals r_i^(j) = xi − (cov(xi, xj)/var(xj)) xj are independent for all i ≠ j  ⇔  xj is exogenous.
• In practice, we can identify an exogenous variable by finding the variable that is most independent of its residuals.
Independence measures

• Evaluate the independence between a variable and a residual by a nonlinear correlation corr(g(xj), r_i^(j)) with g(·) = tanh(·).
• Taking the sum over all the residuals, we get:

T = Σ_{i≠j} { |corr(g(xj), r_i^(j))| + |corr(xj, g(r_i^(j)))| }

• More sophisticated measures can be used as well (Bach & Jordan, 2002; Gretton et al., 2005; Kraskov et al., 2004).
– A kernel-based independence measure (Bach & Jordan, 2002) often gives more accurate estimates (Sogawa et al., IJCNN2010).
Important properties of DirectLiNGAM

• DirectLiNGAM repeats:
– least-squares simple linear regression, and
– evaluation of pairwise independence between each variable and its residuals.
• No algorithmic parameters like step size, initial guesses, or convergence criteria.
• Guaranteed convergence to the right solution in a fixed number of steps (the number of variables) if the data strictly follows the model.
Estimation of the LiNGAM model: Summary (1/2)

• Two estimation algorithms:
– ICA-LiNGAM: estimation using ICA
   • Pros: fast.
   • Cons: possible local optima; not scale-invariant.
– DirectLiNGAM: alternative estimation without ICA
   • Pros: guaranteed convergence; scale-invariant.
   • Cons: less fast.
– Cf. Neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer, personal comm., July 2010).
Estimation of the LiNGAM model: Summary (2/2)

• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
– Sample size: both need a sample size of at least 1000 for more than 10 variables.
– Scalability: both can analyze 100 variables. The performance depends on the sample size etc., of course!
– Scale invariance: ICA-LiNGAM is less robust to changes in the scales of the variables.
– Local optima? For fewer than 10 variables, ICA-LiNGAM is often a bit better; for more than 10 variables, DirectLiNGAM is often better, perhaps because the problem of local optima becomes more serious.
Testing and reliability evaluation
Testing testable assumptions

• Non-Gaussianity of the external influences ei:
– Gaussianity tests
• Could detect violations of some assumptions:
– Local tests:
   • independence of the external influences ei
   • conditional independencies between the observed variables xi (causal Markov condition)
   • linearity
– Overall fit of the model assumptions:
   • chi-square test using 3rd- and/or 4th-order moments (Shimizu & Kano, 2008)
• Still under development.
Reliability evaluation

• Need to evaluate the statistical reliability of LiNGAM results:
– sample fluctuations;
– smaller non-Gaussianity makes the model closer to being NOT identifiable.
• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira, ICANN2010; Hyvärinen et al., JMLR, 2010):
– If either the sample size is small or the magnitude of the non-Gaussianity is small, LiNGAM would give very different results for bootstrap samples.
Extensions
Extensions (a partial list)

• Relaxing the assumptions of the LiNGAM model:
– Acyclic → cyclic (Lacerda et al., UAI2008)
– Single homogeneous population → heterogeneous population (Shimizu et al., ICONIP2007)
– i.i.d. sampling → time structures (Part II) (Hyvärinen et al., JMLR, 2010; Kawahara, Shimizu & Washio, 2010)
– No latent confounders → allow latents (Part II) (Hoyer et al., IJAR, 2008; Kawahara, Bollen et al., 2010)
– Linear → non-linear (Hoyer et al., NIPS2008; Zhang & Hyvärinen, UAI2009; Tillman, Gretton & Spirtes, NIPS2009)
Application areas so far
Non-Gaussian SEMs have been applied to…

• Neuroinformatics: brain connectivity analysis (Hyvärinen et al., JMLR, 2010)
• Bioinformatics: gene network estimation (Sogawa et al., ICANN2010)
• Economics (Wan & Tan, 2009; Moneta, Entner, Hoyer & Coad, 2010)
• Genetics (Ozaki & Ando, 2009)
• Environmental sciences (Niyogi et al., 2010)
• Physics (Kawahara, Shimizu & Washio, 2010)
• Sociology (Kawahara, Bollen, Shimizu & Washio, 2010)
Final summary of Part I

• The use of non-Gaussianity in linear SEMs is useful for model identification.
• Non-Gaussian data is encountered in many applications.
• The non-Gaussian approach can be a good option.
• Links to code and papers: http://homepage.mac.com/shoheishimizu/lingampapers.html
FAQs
Q. My data is Gaussian. LiNGAM will not be useful.

• A. You're right. Try Gaussian methods.
• Comment: Hoyer et al. (UAI2008) showed to what extent one can identify the model for a mixture of Gaussian and non-Gaussian external influence variables.
Q. I applied LiNGAM, but the result is not reasonable given background knowledge.

• A. You might first want to check:
– Some model assumptions might be violated.
   → Try other extensions of LiNGAM, or non-parametric methods such as PC or FCI (Spirtes et al., 2000).
– Small sample size or small non-Gaussianity.
   → Try a bootstrap to see whether the result is reliable.
– The background knowledge might be wrong.
Q. Relation to the causal Markov condition?

• A. The following 3 estimation principles are equivalent (Zhang & Hyvärinen, ECML2009; Hyvärinen et al., JMLR, 2010):
1. Maximize independence between the external influences ei.
2. Minimize the sum of the entropies of the external influences ei.
3. Causal Markov condition (each variable is independent of its non-descendants in the DAG conditional on its parents) AND maximization of independence between the parents of each variable and its corresponding external influence ei.
Q. I am a psychometrician and am more interested in latent factors.

• A. Shimizu, Hoyer, and Hyvärinen (2009) propose LiNGAM for latent factors:

f = Bf + d    (LiNGAM for latent factors)
x = Gf + e    (measurement model)
Others

• Q. Prior knowledge?
– It is possible to incorporate prior knowledge. The accuracy of DirectLiNGAM is often greatly improved even if the amount of prior knowledge is not so large (Inazumi et al., LVA/ICA2010).
• Q. Sparse LiNGAM?
– Zhang et al. (ICA2009) and Hyvärinen et al. (JMLR, 2010); ICA + adaptive lasso (Zou, 2006).
• Q. Bayesian approach?
– Hoyer and Hyttinen (UAI2009); Henao and Winther (NIPS2009).
• Q. Can the idea be applied to discrete variables?
– One proposal by Peters et al. (AISTATS2010).
– Comment: if your discrete variables are close to continuous, e.g., ordinal scales with many points, LiNGAM might work.
Q. Nonlinear extensions?

• A. Several nonlinear SEMs have been proposed (DAG; no latent confounders):
1. xi = Σ_j f_ij(parent xj) + ei    (Imoto et al., 2002)
2. xi = f_i(parents of xi) + ei    (Hoyer et al., NIPS2008)
3. xi = f_{i,1}^{-1}( f_{i,2}(parents of xi) + ei )    (Zhang et al., UAI2009)
• For the two-variable case, unique identification is possible except for several combinations of nonlinearities and distributions (Hoyer et al., NIPS2008; Zhang & Hyvärinen, UAI2009).
Nonlinear extensions (continued)

• Proposals aiming at computational efficiency (Mooij et al., ICML2009; Tillman et al., NIPS2009; Zhang & Hyvärinen, ECML2009; UAI2009).
• Pros:
– Nonlinear models are more general than linear models.
• Cons:
– Computationally demanding.
   • Currently: at most 7 or 8 variables.
   • Perhaps the assumption of Gaussian external influences might help: Imoto et al. (2002) analyze 100 variables.
– More difficult to allow other possible violations of the LiNGAM assumptions, latent confounders, etc.
Q. My data follows neither such linear SEMs nor such nonlinear SEMs as discussed here.

• A. Try non-parametric methods, e.g.:
– DAG: PC (Spirtes & Glymour, 1991)
– DAG with latent confounders: FCI (Spirtes et al., 1995)
– Model: xi = f_i(parents of xi, ei)
• You will probably get a (probably large) equivalence class of models rather than a single model, but that would be the best you currently can do.
Q. Deterministic relations?

• A. LiNGAM is not applicable.
• See Daniusis et al. (UAI2010) for a method to analyze deterministic relations.
Extra slides

Very brief review of linear SEMs and causality

See Pearl (2000) for details.
We can give a causal interpretation to each path-coefficient bij (Pearl, 2000)

• Example:

x1 = e1
x2 = b21 x1 + e2        (DAG: x1 → x2 → x3)
x3 = b32 x2 + e3

• Causal interpretation of b32: if you change x2 from a constant c0 to a constant c1, then E(x3) will change by b32 (c1 − c0).
`Change x2 from c0 to c1'

• It means `fix x2 at c0 regardless of the variables that previously determine x2, and change the constant from c0 to c1' (Pearl, 2000).
• `Fix x2 at c0 regardless of the variables that determine x2' is denoted by do(x2 = c0).
– In the SEM, `replace the function determining x2 with the constant c0':

Original model:           Mutilated model with do(x2 = c0):
x1 = e1                   x1 = e1
x2 = b21 x1 + e2          x2 = c0
x3 = b32 x2 + e3          x3 = b32 x2 + e3

(Assumption of invariance: the other functions don't change.)
Average causal effect: Definition (Rubin, 1974; Pearl, 2000)

• The average causal effect of x2 on x3 when changing x2 from c0 to c1:

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0)) = b32 (c1 − c0)

– Computed based on the mutilated models with do(x2 = c1) or do(x2 = c0).
• Average causal effects can be computed based on the estimated path-coefficient matrix B.
Cf. Conducting experiments with random assignment is a method to estimate average causal effects

• Example: a drug trial.
• Random assignment of subjects into two groups fixes x2 to c1 or c0 regardless of the variables that determine x2:
– experimental group with x2 = c1 (take the drug);
– control group with x2 = c0 (placebo).

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0))
= E(x3) for the experimental group − E(x3) for the control group