Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Transcript
Page 1: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

UAI 2010 Tutorial, July 8, Catalina Island.

Non-Gaussian Methods for Learning Linear Structural Equation Models

Shohei Shimizu and Yoshinobu Kawahara, Osaka University

Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.

Page 2: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Abstract

• Linear structural equation models (linear SEMs) can be used to model data generating processes of variables.

• We review a new approach to learn or estimate linear SEMs.

• The new estimation approach utilizes non-Gaussianity of data for model identification and uniquely estimates a much wider variety of models.

Page 3: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Outline

• Part I. Overview (70 min.): Shohei

• Break (10 min.)

• Part II. Recent advances (40 min.): Yoshi
– Time series
– Latent confounders

Page 4: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Motivation (1/2)

• Suppose that data X was randomly generated from either of the following two data generating processes:

Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ (graph: $x_1 \rightarrow x_2$)

Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ (graph: $x_2 \rightarrow x_1$)

where $e_1$ and $e_2$ are latent variables (disturbances, errors).

• We want to estimate or identify which model generated the data X based on the data X only.

Page 5: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Motivation (2/2)

• We want to identify which model generated the data X based on the data X only.

• If $e_1$ and $e_2$ are Gaussian, it is well known that we cannot identify the data generating process.
– Models 1 and 2 equally fit data.

• If $e_1$ and $e_2$ are non-Gaussian, an interesting result is obtained: we can identify which of Models 1 and 2 generated the data.

• This tutorial reviews how such non-Gaussian methods work.

Page 6: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Problem formulation

Page 7: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic problem setup (1/3)

• Assume that the data generating process of continuous observed variables $x_i$ is graphically represented by a directed acyclic graph (DAG).
– Acyclicity means that there are no directed cycles.

[Figure: an example of a directed acyclic graph (DAG) and an example of a directed cyclic graph, each over $x_1, x_2, x_3$ with external influences $e_1, e_2, e_3$. In the DAG, $x_3$ is a parent of $x_1$, etc.]

Page 8: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic problem setup (2/3)

• Further assume linear relations of the variables $x_i$.
• Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989):

$$x_i = \sum_{j:\ \mathrm{parents\ of\ } i} b_{ij} x_j + e_i \quad \text{or} \quad \mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e}$$

where
– The $e_i$ are continuous latent variables that are not determined inside the model, which we call external influences (disturbances, errors).
– The $e_i$ are of non-zero variance and are independent.
– The 'path-coefficient' matrix $\mathbf{B} = [b_{ij}]$ corresponds to a DAG.
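To make the setup concrete, here is a minimal simulation sketch in Python/NumPy (the coefficient matrix is the three-variable example on the next slide; the uniform disturbances and sample size are illustrative choices). Since B corresponds to a DAG, $\mathbf{I} - \mathbf{B}$ is invertible and data can be drawn as $\mathbf{x} = (\mathbf{I} - \mathbf{B})^{-1}\mathbf{e}$:

```python
import numpy as np

# Minimal sketch: draw samples from a linear acyclic SEM x = Bx + e
# by solving x = (I - B)^{-1} e.
rng = np.random.default_rng(0)

B = np.array([[0.0, 0.0, 1.5],    # x1 = 1.5 * x3 + e1
              [-1.3, 0.0, 0.0],   # x2 = -1.3 * x1 + e2
              [0.0, 0.0, 0.0]])   # x3 = e3

n = 1000
e = rng.uniform(-1.0, 1.0, size=(n, 3))       # independent, non-Gaussian
X = np.linalg.solve(np.eye(3) - B, e.T).T     # each row is one sample of x
print(X.shape)  # (1000, 3)
```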

Page 9: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Example of linear acyclic SEMs

• A three-variable linear acyclic SEM:

$$x_1 = 1.5 x_3 + e_1, \quad x_2 = -1.3 x_1 + e_2, \quad x_3 = e_3$$

or

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \underbrace{\begin{bmatrix} 0 & 0 & 1.5 \\ -1.3 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}}_{\mathbf{B}} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix}$$

• B corresponds to the data-generating DAG ($x_3 \xrightarrow{1.5} x_1 \xrightarrow{-1.3} x_2$):

$$b_{ij} = 0 \Leftrightarrow \text{no directed edge from } x_j \text{ to } x_i$$
$$b_{ij} \neq 0 \Leftrightarrow \text{a directed edge from } x_j \text{ to } x_i$$

Page 10: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Assumption of acyclicity

• Acyclicity ensures existence of an ordering of the variables $x_i$ that makes B lower-triangular with zeros on the diagonal (Bollen, 1989). For the example, permuting to the order $x_3, x_1, x_2$:

$$\begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} = \underbrace{\begin{bmatrix} 0 & 0 & 0 \\ 1.5 & 0 & 0 \\ 0 & -1.3 & 0 \end{bmatrix}}_{\mathbf{B}_{\mathrm{perm}}} \begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} e_3 \\ e_1 \\ e_2 \end{bmatrix}$$

The ordering is $x_3 < x_1 < x_2$: e.g., $x_3$ may be an ancestor of $x_1, x_2$, but not vice versa.

Page 11: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Assumption of independence between external influences

• It implies that there are no latent confounders (Spirtes et al., 2000).
– A latent confounder is a latent variable that is a parent of two or more observed variables.

• Such a latent confounder makes the external influences dependent (Part II):

[Figure: a latent confounder $f$ is a parent of both $x_1$ and $x_2$; absorbing $f$ into the external influences yields dependent $e_1'$ and $e_2'$.]

Page 12: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic problem setup (3/3): Learning linear acyclic SEMs

• Assume that data X is randomly sampled from a linear acyclic SEM (with no latent confounders):

$$\mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e}$$

• Goal: Estimate the path-coefficient matrix B by observing data X only!
– B corresponds to the data-generating DAG.

Page 13: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Problems: Identifiability problems of conventional methods

Page 14: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Under what conditions is B identifiable?

• 'B is identifiable' $\equiv$ 'B is uniquely determined or estimated from $p(\mathbf{x})$'.

• Linear acyclic SEM:

$$\mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e}$$

– B and $p(\mathbf{e})$ induce $p(\mathbf{x})$.
– If $p(\mathbf{x})$ are different for different B, then B is uniquely determined.

Page 15: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Conventional estimation principle: Causal Markov condition

• If the data-generating model is a (linear) acyclic SEM, the causal Markov condition holds:
– Each observed variable $x_i$ is independent of its non-descendants in the DAG conditional on its parents (Pearl & Verma, 1991):

$$p(\mathbf{x}) = \prod_{i=1}^{p} p(x_i \mid \mathrm{parents\ of\ } x_i)$$

Page 16: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Conventional methods based on causal Markov condition

• Methods based on conditional independencies (Spirtes & Glymour, 1991):
– Many linear acyclic SEMs give the same set of conditional independences and equally fit data.

• Scoring methods based on Gaussianity (Chickering, 2002):
– Many linear acyclic SEMs give the same Gaussian distribution and equally fit data.

• In many cases, the path-coefficient matrix B is not uniquely determined.

Page 17: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Example

• Two models with Gaussian $e_1$ and $e_2$, where $E(e_1) = E(e_2) = 0$ and $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$:

Model 1: $x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$
Model 2: $x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$

• Both introduce no conditional independence:

$$\mathrm{cov}(x_1, x_2) = 0.8 \neq 0$$

• Both induce the same Gaussian distribution:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix} \right)$$
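A quick numerical check of this non-identifiability (a sketch; the sample size and seed are arbitrary): data simulated from the two models have the same mean and covariance, so nothing based on the Gaussian distribution alone can distinguish them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Model 1: x1 = e1, x2 = 0.8*x1 + e2, with var(x1) = var(x2) = 1.
e1 = rng.normal(0, 1.0, n)
e2 = rng.normal(0, np.sqrt(1 - 0.8**2), n)  # so that var(x2) = 1
x1_m1 = e1
x2_m1 = 0.8 * x1_m1 + e2

# Model 2: x2 = e2', x1 = 0.8*x2 + e1', with var(x1) = var(x2) = 1.
e2p = rng.normal(0, 1.0, n)
e1p = rng.normal(0, np.sqrt(1 - 0.8**2), n)
x2_m2 = e2p
x1_m2 = 0.8 * x2_m2 + e1p

# Both samples have (approximately) covariance [[1, 0.8], [0.8, 1]].
print(np.cov(x1_m1, x2_m1))
print(np.cov(x1_m2, x2_m2))
```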

Page 18: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

A solution: Non-Gaussian approach

Page 19: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

A new direction: Non-Gaussian approach

• Non-Gaussian data in many applications:
– Neuroinformatics (Hyvarinen et al., 2001); Bioinformatics (Sogawa et al., ICANN2010); Social sciences (Micceri, 1989); Economics (Moneta, Entner, et al., 2010)

• Utilize non-Gaussianity for model identification:
– Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001)

• The path-coefficient matrix B is uniquely estimated if the $e_i$ are non-Gaussian.
– Shimizu, Hoyer, Hyvarinen and Kerminen (JMLR, 2006)

Page 20: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Illustrative example: Gaussian vs. non-Gaussian

[Figure: scatter plots of $(x_1, x_2)$ under Gaussian and non-Gaussian (uniform) external influences, for Model 1 ($x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$) and Model 2 ($x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$), where $E(e_1) = E(e_2) = 0$ and $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$. In the Gaussian case the two models produce indistinguishable distributions; in the uniform case they do not.]
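The asymmetry can also be seen numerically. A sketch (the choice of the higher-order moments $E[x_1^3 x_2]$ and $E[x_1 x_2^3]$ as the distinguishing statistic is an illustrative one, not part of the tutorial):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

def simulate(x1_is_cause, gaussian):
    # cause = e_c, effect = 0.8 * cause + e_e, with var(cause) = 1, var(e_e) = 0.36
    if gaussian:
        e_c, e_e = rng.normal(0, 1, n), rng.normal(0, 0.6, n)
    else:  # uniform disturbances with the same variances
        e_c = rng.uniform(-np.sqrt(3), np.sqrt(3), n)
        e_e = rng.uniform(-0.6 * np.sqrt(3), 0.6 * np.sqrt(3), n)
    cause = e_c
    effect = 0.8 * cause + e_e
    return (cause, effect) if x1_is_cause else (effect, cause)

for gaussian in (True, False):
    for name, x1_is_cause in (("Model 1", True), ("Model 2", False)):
        x1, x2 = simulate(x1_is_cause, gaussian)
        print(("Gaussian" if gaussian else "uniform"), name,
              round(np.mean(x1**3 * x2), 2), round(np.mean(x1 * x2**3), 2))
# Gaussian: both moments are about 2.4 under both models (indistinguishable).
# Uniform: the moments differ (about 1.44 vs 1.79) and swap between the
# models, exposing which variable is the cause.
```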

Page 21: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Linear Non-Gaussian Acyclic Model: LiNGAM
(Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)

• Non-Gaussian version of the linear acyclic SEM:

$$x_i = \sum_{j:\ \mathrm{parents\ of\ } i} b_{ij} x_j + e_i \quad \text{or} \quad \mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e}$$

where
– The external influence variables $e_i$ (disturbances, errors) are:
• of non-zero variance;
• non-Gaussian and mutually independent.

Page 22: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM model

• The LiNGAM model can be shown to be identifiable:
– B is uniquely estimated.

• To see the identifiability, it is helpful to review independent component analysis (ICA) (Hyvarinen et al., 2001).

Page 23: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Independent Component Analysis (ICA)
(Jutten & Herault, 1991; Comon, 1994)

• Observed random vector x is modeled by

$$x_i = \sum_{j=1}^{p} a_{ij} s_j \quad \text{or} \quad \mathbf{x} = \mathbf{A}\mathbf{s}$$

where
– The mixing matrix $\mathbf{A} = [a_{ij}]$ is square and of full column rank.
– The latent variables $s_j$ (independent components) are of non-zero variance, non-Gaussian and mutually independent.

• Then, A is identifiable up to permutation P and scaling D of the columns:

$$\mathbf{A}_{\mathrm{ica}} = \mathbf{A}\mathbf{P}\mathbf{D}$$

Page 24: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Estimation of ICA

• Most estimation methods estimate $\mathbf{W} = \mathbf{A}^{-1}$ (Hyvarinen et al., 2001):

$$\mathbf{x} = \mathbf{A}\mathbf{s} = \mathbf{W}^{-1}\mathbf{s}$$

• Most of the methods minimize mutual information (or its approximation) of the estimated independent components $\hat{\mathbf{s}} = \mathbf{W}_{\mathrm{ica}}\mathbf{x}$.

• W is estimated up to permutation P and scaling D of the rows:

$$\mathbf{W}_{\mathrm{ica}} = \mathbf{P}\mathbf{D}\mathbf{W} \ \left( = \mathbf{P}\mathbf{D}\mathbf{A}^{-1} \right)$$

• Consistent and computationally efficient algorithms:
– Fixed point (FastICA) (Hyvarinen, 1999); gradient-based (Amari, 1998)
– Semiparametric: no specific distributional assumption
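A minimal usage sketch, assuming scikit-learn's FastICA implementation (the 2x2 mixing matrix and uniform sources are illustrative choices, not from the tutorial):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Recover the unmixing matrix W = A^{-1} up to row permutation and scaling.
rng = np.random.default_rng(3)
S = rng.uniform(-1, 1, size=(5000, 2))   # non-Gaussian independent sources
A = np.array([[1.0, 0.5], [0.2, 1.0]])   # mixing matrix
X = S @ A.T                              # x = A s

ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
W_ica = ica.components_                  # estimate of W, up to P and D
print(W_ica @ A)  # approximately a permuted, scaled identity matrix
```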

Page 25: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Back to LiNGAM model

Page 26: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (1/3): ICA achieves half of identification

• The LiNGAM model is ICA:
– Observed variables $x_i$ are linear combinations of non-Gaussian independent external influences $e_i$:

$$\mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e} \Leftrightarrow \mathbf{x} = (\mathbf{I} - \mathbf{B})^{-1}\mathbf{e} = \mathbf{A}\mathbf{e} = \mathbf{W}^{-1}\mathbf{e}, \qquad \mathbf{W} = \mathbf{I} - \mathbf{B}$$

• ICA gives $\mathbf{W}_{\mathrm{ica}} = \mathbf{P}\mathbf{D}\mathbf{W} = \mathbf{P}\mathbf{D}(\mathbf{I} - \mathbf{B})$.
– P: unknown permutation matrix
– D: unknown scaling (diagonal) matrix

• Need to determine P and D to identify B.

Page 27: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6)

• ICA gives $\mathbf{W}_{\mathrm{ica}} = \mathbf{P}\mathbf{D}\mathbf{W} = \mathbf{P}\mathbf{D}(\mathbf{I} - \mathbf{B})$.
– P: permutation matrix; D: scaling (diagonal) matrix

• We want to find such a permutation matrix $\tilde{\mathbf{P}}$ that cancels the permutation, i.e., $\tilde{\mathbf{P}}\mathbf{P} = \mathbf{I}$:

$$\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}} = \tilde{\mathbf{P}}\mathbf{P}\mathbf{D}\mathbf{W} = \mathbf{D}\mathbf{W}$$

• We can show (Shimizu et al., UAI05; illustrated in the next slides):
– If $\tilde{\mathbf{P}}\mathbf{P} = \mathbf{I}$, i.e., no permutation is made on the rows of $\mathbf{W}_{\mathrm{ica}}$, then $\mathbf{D}\mathbf{W}$ has no zero in the diagonal (obvious by definition).
– If $\tilde{\mathbf{P}}\mathbf{P} \neq \mathbf{I}$, i.e., any nonidentical permutation is made on the rows of $\mathbf{W}_{\mathrm{ica}}$, then $\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}}$ has a zero in the diagonal.

Page 28: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (2/6)

• By definition, $\mathbf{W} = \mathbf{I} - \mathbf{B}$ has all unities in the diagonal.
– The diagonal elements of B are all zeros.

• Acyclicity ensures existence of an ordering of variables that makes B lower-triangular, and then $\mathbf{W} = \mathbf{I} - \mathbf{B}$ is also lower-triangular.

• So, without loss of generality, W can be assumed to be lower-triangular:

$$\mathbf{W} = \begin{bmatrix} 1 & 0 & 0 \\ * & 1 & 0 \\ * & * & 1 \end{bmatrix} \quad \text{(no zeros in the diagonal!)}$$

Page 29: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (3/6)

• Premultiplying W by a scaling (diagonal) matrix D does NOT change the zero/non-zero pattern of W:

$$\mathbf{W} = \begin{bmatrix} 1 & 0 & 0 \\ * & 1 & 0 \\ * & * & 1 \end{bmatrix} \quad \Rightarrow \quad \mathbf{D}\mathbf{W} = \begin{bmatrix} d_{11} & 0 & 0 \\ * & d_{22} & 0 \\ * & * & d_{33} \end{bmatrix} \quad \text{(no zeros in the diagonal!)}$$

Page 30: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (4/6)

• Any other permutation of the rows of $\mathbf{D}\mathbf{W}$ changes the zero/non-zero pattern of $\mathbf{D}\mathbf{W}$ and brings a zero into the diagonal. Exchanging the 1st and 2nd rows:

$$\mathbf{D}\mathbf{W} = \begin{bmatrix} d_{11} & 0 & 0 \\ * & d_{22} & 0 \\ * & * & d_{33} \end{bmatrix} \quad \Rightarrow \quad \tilde{\mathbf{P}}\mathbf{D}\mathbf{W} = \begin{bmatrix} * & d_{22} & 0 \\ d_{11} & 0 & 0 \\ * & * & d_{33} \end{bmatrix} \quad \text{(zero in the diagonal!)}$$

Page 31: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (5/6)

• Similarly, exchanging the 1st and 3rd rows:

$$\mathbf{D}\mathbf{W} = \begin{bmatrix} d_{11} & 0 & 0 \\ * & d_{22} & 0 \\ * & * & d_{33} \end{bmatrix} \quad \Rightarrow \quad \tilde{\mathbf{P}}\mathbf{D}\mathbf{W} = \begin{bmatrix} * & * & d_{33} \\ * & d_{22} & 0 \\ d_{11} & 0 & 0 \end{bmatrix} \quad \text{(zero in the diagonal!)}$$

Page 32: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (2/3): No permutation indeterminacy (6/6)

• We can find the correct $\tilde{\mathbf{P}}$ by finding the $\tilde{\mathbf{P}}$ that gives no zero on the diagonal of $\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}}$ (Shimizu et al., UAI05).

• Thus, we can solve the permutation indeterminacy and obtain:

$$\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}} = \underbrace{\tilde{\mathbf{P}}\mathbf{P}}_{=\,\mathbf{I}}\mathbf{D}\mathbf{W} = \mathbf{D}\mathbf{W} = \mathbf{D}(\mathbf{I} - \mathbf{B})$$

Page 33: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identifiability of LiNGAM (3/3): No scaling indeterminacy

• Now we have: $\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}} = \mathbf{D}(\mathbf{I} - \mathbf{B})$

• Then, $\mathbf{D} = \mathrm{diag}(\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}})$.

• Divide each row of $\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}}$ by its corresponding diagonal element to get $\mathbf{I} - \mathbf{B}$, i.e., B:

$$\mathrm{diag}(\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}})^{-1}\,\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}} = \mathbf{D}^{-1}\mathbf{D}(\mathbf{I} - \mathbf{B}) = \mathbf{I} - \mathbf{B}$$
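The whole identification argument can be checked numerically. A sketch that recovers B from a synthetically permuted and rescaled $\mathbf{W}_{\mathrm{ica}}$ by brute force over row permutations (fine for small p; the arbitrary D and P below are illustrative, and later slides describe the practical algorithm):

```python
import itertools
import numpy as np

# Given W_ica = P D (I - B) with unknown P and D: (i) find the row
# permutation whose diagonal has no zeros, (ii) normalize each row by
# its diagonal element, (iii) read off B.
B = np.array([[0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
W = np.eye(3) - B
D = np.diag([2.0, -0.5, 3.0])       # arbitrary scaling
P = np.eye(3)[[2, 0, 1]]            # arbitrary row permutation
W_ica = P @ D @ W                   # what ICA would hand us

for perm in itertools.permutations(range(3)):
    W_perm = W_ica[list(perm)]      # candidate row re-ordering
    if np.all(np.abs(np.diag(W_perm)) > 1e-12):
        B_hat = np.eye(3) - W_perm / np.diag(W_perm)[:, None]
        break

print(np.round(B_hat, 6))  # recovers B exactly
```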

Page 34: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Estimation of LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm

Page 35: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Two estimation algorithms

• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)

• DirectLiNGAM algorithm (Shimizu, Hyvarinen, Kawahara & Washio, UAI09)

• Both estimate an ordering of the variables that makes the path-coefficient matrix B lower-triangular:

$$\mathbf{x}_{\mathrm{perm}} = \underbrace{\begin{bmatrix} 0 & & \\ * & \ddots & \\ * & * & 0 \end{bmatrix}}_{\mathbf{B}_{\mathrm{perm}}} \mathbf{x}_{\mathrm{perm}} + \mathbf{e}_{\mathrm{perm}}$$

– Acyclicity ensures existence of such an ordering.

[Figure: the ordering first yields a full DAG over $x_3, x_1, x_2$, including redundant edges.]

Page 36: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Once such an ordering is found…

• Many existing (covariance-based) methods can do:
– Pruning the redundant path-coefficients
• Sparse methods like the adaptive lasso (Zou, 2006)
– Finding significant path-coefficients
• Testing, bootstrapping (Shimizu et al., 2006; Hyvarinen et al., 2010)

[Figure: the full DAG over $x_3, x_1, x_2$ is pruned to the true DAG by setting redundant entries of the strictly lower-triangular $\mathbf{B}_{\mathrm{perm}}$ to zero.]

Page 37: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

1. Outline of ICA-LiNGAM algorithm
(Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)

1. Estimate B by ICA + permutation.
2. Prune redundant edges.

[Figure: step 1 yields a full DAG over $x_1, x_2, x_3$; step 2 prunes redundant edges (e.g., $b_{13}$, $b_{23}$).]

Page 38: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

ICA-LiNGAM algorithm (1/2): Step 1. Estimation of B

1. Perform ICA (here, FastICA (Hyvarinen, 1999)) to obtain an estimate of $\mathbf{W}_{\mathrm{ica}} = \mathbf{P}\mathbf{D}\mathbf{W} = \mathbf{P}\mathbf{D}(\mathbf{I} - \mathbf{B})$.

2. Find a permutation $\hat{\mathbf{P}}$ that makes the diagonal elements of $\hat{\mathbf{P}}\mathbf{W}_{\mathrm{ica}}$ as large (non-zero) as possible in absolute value, using the Hungarian algorithm (Kuhn, 1955):

$$\hat{\mathbf{P}} = \arg\min_{\tilde{\mathbf{P}}} \sum_{i} \frac{1}{\left| (\tilde{\mathbf{P}}\mathbf{W}_{\mathrm{ica}})_{ii} \right|}$$

3. Normalize each row of $\hat{\mathbf{P}}\mathbf{W}_{\mathrm{ica}}$; then we get an estimate of $\mathbf{I} - \mathbf{B}$, and hence $\hat{\mathbf{B}}$.
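A sketch of this permutation step, casting the objective as a linear assignment problem solved with SciPy's Hungarian-algorithm routine (the input matrix is the illustrative one from the identifiability example above; the 1e-12 floor that avoids division by zero is an implementation convenience):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

W_ica = np.array([[0.0, 0.0, 3.0],
                  [2.0, 0.0, -3.0],
                  [-0.65, -0.5, 0.0]])

# cost[k, i]: price of placing row k at diagonal position i
cost = 1.0 / np.maximum(np.abs(W_ica), 1e-12)

rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
perm = rows[np.argsort(cols)]              # row perm[i] sits at position i
W_hat = W_ica[perm]

# Normalize each row by its diagonal element, then B_hat = I - W_hat.
B_hat = np.eye(3) - W_hat / np.diag(W_hat)[:, None]
print(np.round(B_hat, 6))
```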

Page 39: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

ICA-LiNGAM algorithm (2/2): Step 2. Pruning

• Find such an ordering of variables that makes the estimated B as close to lower-triangular as possible.
– Find a permutation matrix Q that minimizes the sum of squares of the elements in the upper triangular part of the permuted $\hat{\mathbf{B}}$:

$$\hat{\mathbf{Q}} = \arg\min_{\mathbf{Q}} \sum_{i \leq j} \left( \mathbf{Q}\hat{\mathbf{B}}\mathbf{Q}^{T} \right)_{ij}^{2}$$

– Approximate algorithm for large numbers of variables (Hoyer et al., ICA06)

[Figure: small estimated coefficients (e.g., 0.1, -0.01) end up in the upper-triangular part and are set to zero, pruning the redundant edges.]
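A sketch of this ordering search by brute force over permutations, which is feasible only for a small number of variables (the noisy $\hat{\mathbf{B}}$ below is illustrative):

```python
import itertools
import numpy as np

B_hat = np.array([[0.0, 0.1, 3.0],    # small 0.1 and -0.01 entries are
                  [5.0, 0.0, 0.1],    # estimation noise on zero coefficients
                  [-0.01, 0.0, 0.0]])
p = B_hat.shape[0]

def upper_sq_sum(M):
    # sum of squared elements with i <= j (upper triangle incl. diagonal)
    return np.sum(np.triu(M, k=0) ** 2)

best = min(itertools.permutations(range(p)),
           key=lambda perm: upper_sq_sum(B_hat[np.ix_(perm, perm)]))
print(best)                                   # ordering closest to lower-triangular
print(np.round(B_hat[np.ix_(best, best)], 3))
```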

Page 40: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic properties of ICA-LiNGAM algorithm

• ICA-LiNGAM algorithm = ICA + permutations
– Computationally efficient with the help of well-developed ICA techniques.

• Potential problems:
– ICA is an iterative search method:
• May get stuck in a local optimum if the initial guess or step size is badly chosen.
– The permutation algorithms are not scale-invariant:
• May provide different estimates for different scales of variables.

Page 41: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Estimation of LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm

Page 42: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

2. DirectLiNGAM algorithm
(Shimizu, Hyvarinen, Kawahara & Washio, UAI2009)

• Alternative estimation method without ICA
– Estimates an ordering of the variables that makes the path-coefficient matrix B lower-triangular.

[Figure: the ordering yields a full DAG over $x_3, x_1, x_2$ with redundant edges, as before.]

• Many existing (covariance-based) methods can then do further pruning or find significant path coefficients (Zou, 2006; Shimizu et al., 2006; Hyvarinen et al., 2010).

Page 43: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic idea (1/2): An exogenous variable can be at the top of a right ordering

• An exogenous variable is a variable with no parents (Bollen, 1989), here $x_3$.
– The corresponding row of B has all zeros.

• So, an exogenous variable can be at the top of such an ordering that makes B lower-triangular with zeros on the diagonal:

$$\begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 1.5 & 0 & 0 \\ 0 & -1.3 & 0 \end{bmatrix} \begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} e_3 \\ e_1 \\ e_2 \end{bmatrix} \qquad (x_3 \rightarrow x_1 \rightarrow x_2)$$

Page 44: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Basic idea (2/2): Regress exogenous $x_3$ out

• Compute the residuals $r_i^{(3)}$ $(i = 1, 2)$ regressing the other variables $x_i$ $(i = 1, 2)$ on the exogenous $x_3$:
– The residuals $r_1^{(3)}$ and $r_2^{(3)}$ form a LiNGAM model.
– The ordering of the residuals is equivalent to that of the corresponding original variables.

$$\begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 1.5 & 0 & 0 \\ 0 & -1.3 & 0 \end{bmatrix} \begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} e_3 \\ e_1 \\ e_2 \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} r_1^{(3)} \\ r_2^{(3)} \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ -1.3 & 0 \end{bmatrix} \begin{bmatrix} r_1^{(3)} \\ r_2^{(3)} \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$$

• Exogenous $r_1^{(3)}$ implies '$x_1$ can be at the second top'.

Page 45: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Outline of DirectLiNGAM

• Iteratively find exogenous variables until all the variables are ordered:

1. Find an exogenous variable, here $x_3$.
– Put $x_3$ at the top of the ordering.
– Regress $x_3$ out.

2. Find an exogenous residual, here $r_1^{(3)}$.
– Put $x_1$ at the second top of the ordering.
– Regress $r_1^{(3)}$ out.

3. Put $x_2$ at the third top of the ordering and terminate.

The estimated ordering is $x_3 < x_1 < x_2$.

[Figure: Step 1 operates on $x_3, x_1, x_2$; Step 2 on the residuals $r_1^{(3)}, r_2^{(3)}$; Step 3 on $r_2^{(3,1)}$.]

Page 46: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identification of an exogenous variable (two-variable case)

i) $x_1$ is exogenous: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$.

Regressing $x_2$ on $x_1$:

$$r_2^{(1)} = x_2 - \frac{\mathrm{cov}(x_2, x_1)}{\mathrm{var}(x_1)} x_1 = x_2 - b_{21} x_1 = e_2$$

$x_1$ and $r_2^{(1)}$ are independent.

ii) $x_1$ is NOT exogenous: $x_1 = b_{12} x_2 + e_1$ $(b_{12} \neq 0)$, $x_2 = e_2$.

Regressing $x_2$ on $x_1$:

$$r_2^{(1)} = x_2 - \frac{\mathrm{cov}(x_2, x_1)}{\mathrm{var}(x_1)} x_1 = \left\{ 1 - \frac{b_{12}\,\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)} \right\} x_2 - \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)} e_1$$

$x_1$ and $r_2^{(1)}$ are NOT independent.
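A numerical sketch of this two-variable check (the tanh-based score below is a crude stand-in for a proper independence measure, and the coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
e1, e2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
x1 = e1                  # x1 is exogenous
x2 = 0.8 * x1 + e2       # x1 -> x2

def residual(target, regressor):
    C = np.cov(target, regressor)
    return target - (C[0, 1] / C[1, 1]) * regressor

def dep(a, b):
    """Crude dependence score; near 0 when a and b are independent."""
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

print(dep(x1, residual(x2, x1)))  # near 0: x1 and r2^(1) are independent
print(dep(x2, residual(x1, x2)))  # clearly larger: x2 and r1^(2) are dependent
```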

Page 47: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Need to use the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953)

Darmois-Skitovitch theorem: define two variables $x_1$ and $x_2$ as

$$x_1 = \sum_{j=1}^{p} a_{1j} e_j, \qquad x_2 = \sum_{j=1}^{p} a_{2j} e_j,$$

where the $e_j$ are independent random variables. If there exists a non-Gaussian $e_i$ for which $a_{1i} a_{2i} \neq 0$, then $x_1$ and $x_2$ are dependent.

Applied to case ii) above ($x_1 = b_{12} x_2 + e_1$ with $b_{12} \neq 0$, $x_2 = e_2$): both $x_1$ and the residual

$$r_2^{(1)} = \left\{ 1 - \frac{b_{12}\,\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)} \right\} x_2 - \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)} e_1$$

are linear combinations of $e_1$ and $e_2$ in which the non-Gaussian $e_1$ has non-zero coefficients, so $x_1$ and $r_2^{(1)}$ are NOT independent.

Page 48: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Identification of an exogenous variable (p-variable case)

• Lemma 1: $x_j$ is exogenous $\Leftrightarrow$ $x_j$ and its residuals

$$r_i^{(j)} = x_i - \frac{\mathrm{cov}(x_i, x_j)}{\mathrm{var}(x_j)} x_j$$

are independent for all $i \neq j$.

• In practice, we can identify an exogenous variable by finding the variable that is most independent of its residuals.

Page 49: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Independence measures

• Evaluate independence between a variable and a residual by a nonlinear correlation, with $g(\cdot) = \tanh(\cdot)$:

$$\mathrm{corr}\left( g(x_j),\ r_i^{(j)} \right)$$

• Taking the sum over all the residuals, we get:

$$T = \sum_{i \neq j} \left\{ \left| \mathrm{corr}\left( g(x_j), r_i^{(j)} \right) \right| + \left| \mathrm{corr}\left( x_j, g(r_i^{(j)}) \right) \right| \right\}$$

• Can use more sophisticated measures as well (Bach & Jordan, 2002; Gretton et al., 2005; Kraskov et al., 2004).
– A kernel-based independence measure (Bach & Jordan, 2002) often gives more accurate estimates (Sogawa et al., IJCNN10).

Page 50: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Important properties of DirectLiNGAM

• DirectLiNGAM repeats:
– Least squares simple linear regression
– Evaluation of pairwise independence between each variable and its residuals

• No algorithmic parameters like step size, initial guesses, or convergence criteria.

• Guaranteed convergence to the right solution in a fixed number of steps (the number of variables) if the data strictly follows the model.
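Putting the pieces together, a compact sketch of the DirectLiNGAM loop under these assumptions (it uses the tanh-based nonlinear correlation from the previous slide as the independence score; the real algorithm can use kernel measures, and the data below come from the running example):

```python
import numpy as np

def dep(a, b):
    # tanh-based nonlinear correlation as a pairwise dependence score
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

def direct_lingam_order(X):
    """X: (n, p) data matrix; returns an estimated causal ordering."""
    R = X.astype(float).copy()
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        scores = {}
        for j in remaining:                 # candidate exogenous variable
            t = 0.0
            for i in remaining:
                if i == j:
                    continue
                C = np.cov(R[:, i], R[:, j])
                t += dep(R[:, j], R[:, i] - (C[0, 1] / C[1, 1]) * R[:, j])
            scores[j] = t
        ex = min(scores, key=scores.get)    # most independent of its residuals
        order.append(ex)
        remaining.remove(ex)
        for i in remaining:                 # regress the chosen variable out
            C = np.cov(R[:, i], R[:, ex])
            R[:, i] -= (C[0, 1] / C[1, 1]) * R[:, ex]
    return order + remaining

# Usage with the running example (x3 -> x1 -> x2):
rng = np.random.default_rng(5)
e = rng.uniform(-1, 1, size=(50_000, 3))
B = np.array([[0.0, 0.0, 1.5], [-1.3, 0.0, 0.0], [0.0, 0.0, 0.0]])
X = np.linalg.solve(np.eye(3) - B, e.T).T
print(direct_lingam_order(X))  # expected: [2, 0, 1], i.e., x3 < x1 < x2
```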

Page 51: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Estimation of LiNGAM model: Summary (1/2)

• Two estimation algorithms:
– ICA-LiNGAM: Estimation using ICA
• Pros: Fast
• Cons: Possible local optima; not scale-invariant
– DirectLiNGAM: Alternative estimation without ICA
• Pros: Guaranteed convergence; scale-invariant
• Cons: Less fast
– Cf. Neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer, personal comm., July 2010).

Page 52: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Estimation of LiNGAM model: Summary (2/2)

• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
• Sample size: Both need a sample size of at least 1000 for more than 10 variables.
• Scalability: Both can analyze 100 variables. The performances depend on the sample size etc., of course!
• Scale invariance: ICA-LiNGAM is less robust to changing scales of variables.
• Local optima?
– For fewer than 10 variables, ICA-LiNGAM is often a bit better.
– For more than 10 variables, DirectLiNGAM is often better, perhaps because the problem of local optima becomes more serious.

Page 53: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Testing and reliability evaluation

Page 54: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Testing testable assumptions

• Non-Gaussianity of external influences $e_i$:
– Gaussianity tests

• Could detect violations of some assumptions:
– Local tests:
• Independence of external influences $e_i$
• Conditional independencies between observed variables $x_i$ (causal Markov condition)
• Linearity
– Overall fit of the model assumptions:
• Chi-square test using 3rd- and/or 4th-order moments (Shimizu & Kano, 2008)
• Still under development

Page 55: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Reliability evaluation

• Need to evaluate statistical reliability of LiNGAM results:
– Sample fluctuations
– Smaller non-Gaussianity makes the model closer to being NOT identifiable.

• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira, ICANN2010; Hyvarinen et al., JMLR, 2010):
– If either the sample size is small or the magnitude of non-Gaussianity is small, LiNGAM would give very different results for bootstrap samples.

Page 56: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Extensions

Page 57: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Extensions (a partial list)

• Relaxing the assumptions of the LiNGAM model:
– Acyclic → cyclic (Lacerda et al., UAI2008)
– Single homogeneous population → heterogeneous population (Shimizu et al., ICONIP2007)
– i.i.d. sampling → time structures (Part II) (Hyvarinen et al., JMLR, 2010; Kawahara, S & Washio, 2010)
– No latent confounders → allow latents (Part II) (Hoyer et al., IJAR, 08; Kawahara, Bollen et al., 2010)
– Linear → non-linear (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09; Tilmann, Gretton & Spirtes, NIPS09)

Page 58: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Application areas so far

Page 59: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Non-Gaussian SEMs have been applied to…

• Neuroinformatics
– Brain connectivity analysis (Hyvarinen et al., JMLR, 2010)
• Bioinformatics
– Gene network estimation (Sogawa et al., ICANN2010)
• Economics (Wan & Tan, 2009; Moneta, Entner, Hoyer & Coad, 2010)
• Genetics (Ozaki & Ando, 2009)
• Environmental sciences (Niyogi et al., 2010)
• Physics (Kawahara, Shimizu & Washio, 2010)
• Sociology (Kawahara, Bollen, Shimizu & Washio, 2010)

Page 60: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Final summary of Part I

• Use of non-Gaussianity in linear SEMs is useful for model identification.
• Non-Gaussian data is encountered in many applications.
• The non-Gaussian approach can be a good option.
• Links to codes and papers: http://homepage.mac.com/shoheishimizu/lingampapers.html

Page 61: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

FAQs

Page 62: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. My data is Gaussian. LiNGAM will not be useful.

• A. You're right. Try Gaussian methods.

• Comment: Hoyer et al. (UAI2008) showed 'to what extent one can identify the model for a mixture of Gaussian and non-Gaussian external influence variables'.

Page 63: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. I applied LiNGAM, but the result is not reasonable given background knowledge.

• A. You might first want to check:
– Some model assumptions might be violated.
→ Try other extensions of LiNGAM or non-parametric methods, PC or FCI etc. (Spirtes et al., 2000).
– Small sample size or small non-Gaussianity.
→ Try bootstrap to see if the result is reliable.
– Background knowledge might be wrong.

Page 64: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. Relation to causal Markov condition?

• A. The following three estimation principles are equivalent (Zhang & Hyvarinen, ECML09; Hyvarinen et al., JMLR, 2010):
1. Maximize independence between external influences $e_i$.
2. Minimize the sum of entropies of external influences $e_i$.
3. Causal Markov condition (each variable is independent of its non-descendants in the DAG conditional on its parents) AND maximization of independence between the parents of each variable and its corresponding external influence $e_i$.

Page 65: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. I am a psychometrician and am more interested in latent factors.

• A. Shimizu, Hoyer, and Hyvarinen (2009) propose LiNGAM for latent factors:

$$\mathbf{f} = \mathbf{B}\mathbf{f} + \mathbf{d} \quad \text{-- LiNGAM for latent factors}$$
$$\mathbf{x} = \mathbf{G}\mathbf{f} + \mathbf{e} \quad \text{-- Measurement model}$$

Page 66: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Others

• Q. Prior knowledge?
– It is possible to incorporate prior knowledge. The accuracy of DirectLiNGAM is often greatly improved even if the amount of prior knowledge is not so large (Inazumi et al., LVA/ICA2010).

• Q. Sparse LiNGAM?
– Zhang et al. (ICA09) and Hyvarinen et al. (JMLR, 2010).
– ICA + adaptive lasso (Zou, 2006).

• Q. Bayesian approach?
– Hoyer and Hyttinen (UAI09); Henao and Winther (NIPS09).

• Q. Can the idea be applied to discrete variables?
– One proposal by Peters et al. (AISTATS2010).
– Comment: if your discrete variables are close to continuous, e.g., ordinal scales with many points, LiNGAM might work.

Page 67: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. Nonlinear extensions?

• A. Several nonlinear SEMs have been proposed:
– DAG; no latent confounders.

1. $x_i = \sum_{j:\ \mathrm{parents}} f_{ij}(x_j) + e_i$ -- Imoto et al. (2002)
2. $x_i = f_i(\mathrm{parents\ of\ } x_i) + e_i$ -- Hoyer et al. (NIPS08)
3. $x_i = f_{i,2}\left( f_{i,1}(\mathrm{parents\ of\ } x_i) + e_i \right)$ -- Zhang et al. (UAI09)

• For two-variable cases, unique identification is possible except for several combinations of nonlinearities and distributions (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09).

Page 68: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Nonlinear extensions (continued)

• Proposals aiming at computational efficiency (Mooij et al., ICML09; Tilmann et al., NIPS09; Zhang & Hyvarinen, ECML09; UAI09).

• Pros:
– Nonlinear models are more general than linear models.

• Cons:
– Computationally demanding.
• Current: at most 7 or 8 variables.
• Perhaps the assumption of Gaussian external influences might help: Imoto et al. (2002) analyze 100 variables.
– More difficult to allow other possible violations of LiNGAM assumptions, latent confounders etc.

Page 69: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. My data follows neither such linear SEMs nor such nonlinear SEMs as you have talked about.

• A. Try non-parametric methods, e.g.,
– DAG: PC (Spirtes & Glymour, 1991)
– DAG with latent confounders: FCI (Spirtes et al., 1995)

$$x_i = f_i(\mathrm{parents\ of\ } x_i,\ e_i)$$

• Probably you will get a (possibly large) class of equivalent models rather than a single model, but that would be the best you currently can do.

Page 70: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Q. Deterministic relations?

• A. LiNGAM is not applicable.

• See Daniusis et al. (UAI2010) for a method to analyze deterministic relations.

Page 71: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Extra slides

Page 72: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Very brief review of linear SEMs and causality

See Pearl (2000) for details.

Page 73: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

We can give a causal interpretation to each path coefficient $b_{ij}$ (Pearl, 2000).

• Example ($x_1 \rightarrow x_2 \rightarrow x_3$):

$$x_1 = e_1, \quad x_2 = b_{21} x_1 + e_2, \quad x_3 = b_{32} x_2 + e_3$$

• Causal interpretation of $b_{32}$: if you change $x_2$ from a constant $c_0$ to a constant $c_1$, then $E(x_3)$ will change by $b_{32}(c_1 - c_0)$.

Page 74: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

'Change $x_2$ from $c_0$ to $c_1$'

• It means 'fix $x_2$ at $c_0$ regardless of the variables that previously determine $x_2$, and change the constant from $c_0$ to $c_1$' (Pearl, 2000).

• 'Fix $x_2$ at $c_0$ regardless of the variables that determine $x_2$' is denoted by $do(x_2 = c_0)$.
– In SEM, 'replace the function determining $x_2$ with a constant $c_0$':

Original model: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, $x_3 = b_{32} x_2 + e_3$

Mutilated model with $do(x_2 = c_0)$: $x_1 = e_1$, $x_2 = c_0$, $x_3 = b_{32} x_2 + e_3$

(Assumption of invariance: the other functions don't change.)

Page 75: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Average causal effect: Definition (Rubin, 1974; Pearl, 2000)

• Average causal effect of $x_2$ on $x_3$ when changing $x_2$ from $c_0$ to $c_1$:

$$E(x_3 \mid do(x_2 = c_1)) - E(x_3 \mid do(x_2 = c_0)) = b_{32}(c_1 - c_0)$$

– Computed based on the mutilated models with $do(x_2 = c_1)$ or $do(x_2 = c_0)$.

• Average causal effects can be computed based on the estimated path-coefficient matrix B.
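A small numeric check of this definition (the coefficients $b_{21} = 0.5$, $b_{32} = 2.0$ are illustrative, not from the slides): in the mutilated model, $x_2$ is clamped to a constant, so $E(x_3 \mid do(x_2 = c)) = b_{32} c$ and the average causal effect is $b_{32}(c_1 - c_0)$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
b21, b32 = 0.5, 2.0   # b21 becomes irrelevant once x2 is clamped

def mean_x3_under_do(c):
    e3 = rng.uniform(-1, 1, n)   # zero-mean external influence
    x2 = np.full(n, c)           # do(x2 = c): function for x2 replaced
    return np.mean(b32 * x2 + e3)

c0, c1 = 0.0, 1.0
print(mean_x3_under_do(c1) - mean_x3_under_do(c0))  # approx b32 * (c1 - c0) = 2.0
```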

Page 76: Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Cf. Conducting experiments with random assignment is a method to estimate average causal effects

• Example: a drug trial
• Random assignment of subjects into two groups fixes $x_2$ to $c_1$ or $c_0$ regardless of the variables that determine $x_2$:
– Experimental group with $x_2 = c_1$ (take the drug)
– Control group with $x_2 = c_0$ (placebo)

$$E(x_3 \mid do(x_2 = c_1)) - E(x_3 \mid do(x_2 = c_0)) = E(x_3)\ \text{for the experimental group} - E(x_3)\ \text{for the control group}$$