Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Transcript of Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
UAI 2010 Tutorial, July 8, Catalina Island.

Non-Gaussian Methods for Learning Linear Structural Equation Models

Shohei Shimizu and Yoshinobu Kawahara, Osaka University

Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.
Abstract

• Linear structural equation models (linear SEMs) can be used to model the data-generating processes of variables.
• We review a new approach to learning or estimating linear SEMs.
• The new estimation approach utilizes the non-Gaussianity of data for model identification and uniquely estimates a much wider variety of models.
Outline

• Part I. Overview (70 min.): Shohei
• Break (10 min.)
• Part II. Recent advances (40 min.): Yoshi
– Time series
– Latent confounders
Motivation (1/2)

• Suppose that data X was randomly generated from one of the following two data-generating processes:

Model 1:                 Model 2:
x1 = e1                  x1 = b12 x2 + e1
x2 = b21 x1 + e2         x2 = e2
(graph: x1 → x2)         (graph: x2 → x1)

where e1 and e2 are latent variables (disturbances, errors).
• We want to estimate or identify which model generated the data X, based on the data X only.
Motivation (2/2)

• We want to identify which model generated the data X, based on the data X only.
• If e1 and e2 are Gaussian, it is well known that we cannot identify the data-generating process.
– Models 1 and 2 fit the data equally well.
• If e1 and e2 are non-Gaussian, an interesting result is obtained: we can identify which of Models 1 and 2 generated the data.
• This tutorial reviews how such non-Gaussian methods work.
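The asymmetry that these non-Gaussian methods exploit can be previewed numerically. Below is a minimal numpy sketch (illustrative only: the coefficient 0.8 and the uniform disturbances are assumptions, not taken from the slides). Data is generated from Model 1, and the dependence between regressor and regression residual is compared in both directions using a simple tanh-based nonlinear correlation. In the true direction the residual is (nearly) independent of the regressor; in the reverse direction it is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Model 1 with non-Gaussian (uniform) disturbances: x1 = e1, x2 = 0.8*x1 + e2.
e1 = rng.uniform(-1, 1, n)
e2 = rng.uniform(-1, 1, n)
x1 = e1
x2 = 0.8 * x1 + e2

def residual(y, x):
    """Residual of the least-squares simple regression of y on x."""
    return y - (np.cov(x, y)[0, 1] / np.var(x)) * x

def dependence(a, b):
    """Tanh-based nonlinear correlation: near zero if a and b are independent."""
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

# True direction: the residual of x2 on x1 is (essentially) e2, independent of x1.
dep_forward = dependence(x1, residual(x2, x1))
# Reverse direction: the residual of x1 on x2 still mixes e1 and e2.
dep_backward = dependence(x2, residual(x1, x2))

print(dep_forward < dep_backward)  # True: the true direction is detectable
```

With Gaussian disturbances the two dependence scores would be statistically indistinguishable, which is exactly the Gaussian non-identifiability discussed above.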
Problem formulation
Basic problem setup (1/3)

• Assume that the data-generating process of continuous observed variables xi is graphically represented by a directed acyclic graph (DAG).
– Acyclicity means that there are no directed cycles.
• [Figures: an example directed acyclic graph (DAG) and an example directed cyclic graph over x1, x2, x3 with external influences e1, e2, e3.]
• x1 is a parent of x3, etc.
Basic problem setup (2/3)

• Further assume linear relations among the variables xi. Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989):

xi = Σ_{j: parents of xi} bij xj + ei    or    x = Bx + e

where
– The ei are continuous latent variables that are not determined inside the model, which we call external influences (disturbances, errors).
– The ei are of non-zero variance and are independent.
– The `path-coefficient' matrix B = [bij] corresponds to a DAG.
Example of a linear acyclic SEM

• A three-variable linear acyclic SEM:

x1 = 1.5 x3 + e1          [x1]   [  0    0   1.5 ] [x1]   [e1]
x2 = -1.3 x1 + e2    or   [x2] = [-1.3   0    0  ] [x2] + [e2]
x3 = e3                   [x3]   [  0    0    0  ] [x3]   [e3]
                                         B

• B corresponds to the data-generating DAG (x3 →(1.5) x1 →(-1.3) x2):
– bij = 0  ⇔ no directed edge from xj to xi
– bij ≠ 0 ⇔ a directed edge from xj to xi
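As a quick sketch (numpy; the uniform disturbances are an illustrative assumption, any non-degenerate distribution works here), data from this example can be generated by solving x = Bx + e, i.e., x = (I − B)^{-1} e:

```python
import numpy as np

# Path-coefficient matrix of the example:
# x1 = 1.5*x3 + e1, x2 = -1.3*x1 + e2, x3 = e3.
B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
n = 1000
# Non-Gaussian (uniform) external influences, one column per variable.
E = rng.uniform(-1, 1, size=(n, 3))

# x = Bx + e  =>  x = (I - B)^{-1} e  (applied row-wise to the whole sample).
X = E @ np.linalg.inv(np.eye(3) - B).T

# The generated data satisfies the structural equations exactly.
print(np.allclose(X[:, 0], 1.5 * X[:, 2] + E[:, 0]),
      np.allclose(X[:, 1], -1.3 * X[:, 0] + E[:, 1]),
      np.allclose(X[:, 2], E[:, 2]))  # True True True
```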
Assumption of acyclicity

• Acyclicity ensures the existence of an ordering of the variables xi that makes B lower-triangular with zeros on the diagonal (Bollen, 1989).
• For the example, the ordering is x3 < x1 < x2:

[x3]   [  0    0   0 ] [x3]   [e3]
[x1] = [ 1.5   0   0 ] [x1] + [e1]
[x2]   [  0  -1.3  0 ] [x2]   [e2]
            B_perm

• E.g., x3 may be an ancestor of x1 and x2, but not vice versa.
Assumption of independence between external influences

• It implies that there are no latent confounders (Spirtes et al., 2000).
– A latent confounder is a latent variable that is a parent of two or more observed variables, e.g., f with f → x1 and f → x2.
• Such a latent confounder makes the external influences dependent (Part II).
Basic problem setup (3/3): Learning linear acyclic SEMs

• Assume that data X is randomly sampled from a linear acyclic SEM (with no latent confounders): x = Bx + e.
• Goal: estimate the path-coefficient matrix B by observing the data X only!
– B corresponds to the data-generating DAG.
Problems: Identifiability problems of conventional methods
Under what conditions is B identifiable?

• `B is identifiable' ≡ `B is uniquely determined or estimated from p(x)'.
• Linear acyclic SEM: x = Bx + e.
– B and p(e) induce p(x).
– If p(x) is different for different B, then B is uniquely determined.
Conventional estimation principle: causal Markov condition

• If the data-generating model is a (linear) acyclic SEM, the causal Markov condition holds:
– Each observed variable xi is independent of its non-descendants in the DAG conditional on its parents (Pearl & Verma, 1991):

p(x) = Π_{i=1}^{p} p(xi | parents of xi)
Conventional methods based on the causal Markov condition

• Methods based on conditional independencies (Spirtes & Glymour, 1991)
– Many linear acyclic SEMs give the same set of conditional independences and fit the data equally well.
• Scoring methods based on Gaussianity (Chickering, 2002)
– Many linear acyclic SEMs give the same Gaussian distribution and fit the data equally well.
• In many cases, the path-coefficient matrix B is not uniquely determined.
Example

• Two models with Gaussian e1 and e2, where E(e1) = E(e2) = 0 and var(x1) = var(x2) = 1:

Model 1:                 Model 2:
x1 = e1                  x1 = 0.8 x2 + e1
x2 = 0.8 x1 + e2         x2 = e2

• Both introduce no conditional independence: cov(x1, x2) = 0.8 ≠ 0.
• Both induce the same Gaussian distribution:

(x1, x2)ᵀ ~ N( (0, 0)ᵀ, [[1, 0.8], [0.8, 1]] )
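The indistinguishability can be checked directly: a zero-mean linear SEM x = Bx + e with independent disturbances implies the covariance (I − B)^{-1} diag(var(e)) (I − B)^{-T}. A small numpy sketch (the disturbance variances 1 and 0.36 follow from the normalization var(x1) = var(x2) = 1):

```python
import numpy as np

def implied_cov(B, e_vars):
    """Covariance implied by x = Bx + e with independent disturbances:
    (I - B)^{-1} diag(e_vars) (I - B)^{-T}."""
    A = np.linalg.inv(np.eye(len(e_vars)) - B)
    return A @ np.diag(e_vars) @ A.T

# Model 1: x1 = e1, x2 = 0.8 x1 + e2   (var(e1) = 1, var(e2) = 1 - 0.8^2)
cov1 = implied_cov(np.array([[0.0, 0.0],
                             [0.8, 0.0]]), [1.0, 0.36])
# Model 2: x1 = 0.8 x2 + e1, x2 = e2   (var(e1) = 1 - 0.8^2, var(e2) = 1)
cov2 = implied_cov(np.array([[0.0, 0.8],
                             [0.0, 0.0]]), [0.36, 1.0])

print(cov1)                      # [[1, 0.8], [0.8, 1]]
print(np.allclose(cov1, cov2))   # True: identical Gaussian distributions
```

Since a zero-mean Gaussian distribution is fully determined by its covariance, no method can tell the two models apart from Gaussian data.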
A solution: the non-Gaussian approach
A new direction: the non-Gaussian approach

• Non-Gaussian data arises in many applications:
– Neuroinformatics (Hyvärinen et al., 2001); bioinformatics (Sogawa et al., ICANN2010); social sciences (Micceri, 1989); economics (Moneta, Entner, et al., 2010)
• Utilizing non-Gaussianity for model identification:
– Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001)
• The path-coefficient matrix B is uniquely estimated if the ei are non-Gaussian.
– Shimizu, Hoyer, Hyvärinen and Kerminen (JMLR, 2006)
Illustrative example: Gaussian vs. non-Gaussian

[Figure: scatterplots of (x1, x2) for Model 1 (x1 = e1, x2 = 0.8 x1 + e2) and Model 2 (x1 = 0.8 x2 + e1, x2 = e2), with Gaussian vs. non-Gaussian (uniform) external influences; E(e1) = E(e2) = 0, var(x1) = var(x2) = 1. With Gaussian ei the two models give the same distribution; with uniform ei the two scatterplots are clearly different.]
Linear Non-Gaussian Acyclic Model: LiNGAM (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)

• Non-Gaussian version of the linear acyclic SEM:

xi = Σ_{j: parents of xi} bij xj + ei    or    x = Bx + e

where the external influence variables ei (disturbances, errors) are
– of non-zero variance, and
– non-Gaussian and mutually independent.
Identifiability of the LiNGAM model

• The LiNGAM model can be shown to be identifiable.
– B is uniquely estimated.
• To see the identifiability, it is helpful to review independent component analysis (ICA) (Hyvärinen et al., 2001).
Independent Component Analysis (ICA) (Jutten & Herault, 1991; Comon, 1994)

• The observed random vector x is modeled by

x = As    or    xi = Σ_{j=1}^{p} aij sj

where
– the mixing matrix A = [aij] is square and of full column rank;
– the latent variables sj (independent components) are of non-zero variance, non-Gaussian, and mutually independent.
• Then A is identifiable up to permutation P and scaling D of the columns:

A_ica = APD
Estimation of ICA

• Most estimation methods estimate W = A^{-1} (Hyvärinen et al., 2001):

x = As = W^{-1} s

• Most of the methods minimize the mutual information (or an approximation of it) of the estimated independent components ŝ = W_ica x.
• W is estimated up to permutation P and scaling D of the rows:

W_ica = PDW (= PDA^{-1})

• Consistent and computationally efficient algorithms:
– fixed-point (FastICA) (Hyvärinen, 1999); gradient-based (Amari, 1998)
– semiparametric: no specific distributional assumption
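The permutation and scaling indeterminacy can be made concrete with a toy mixing: if W = A^{-1} unmixes x = As, then any PDW unmixes it just as well, only returning the components permuted and rescaled. A sketch (the particular A, P and D below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two non-Gaussian (uniform) independent components.
S = rng.uniform(-1, 1, size=(1000, 2))
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # mixing matrix
X = S @ A.T                     # x = A s

W = np.linalg.inv(A)            # the 'true' unmixing matrix
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # a row permutation ...
D = np.diag([5.0, -0.5])        # ... and a row rescaling
W_ica = P @ D @ W               # an equally valid ICA solution

S_hat = X @ W_ica.T
# The recovered components are the originals, permuted and rescaled:
print(np.allclose(S_hat[:, 0], -0.5 * S[:, 1]))  # True
print(np.allclose(S_hat[:, 1], 5.0 * S[:, 0]))   # True
```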
Back to the LiNGAM model

Identifiability of LiNGAM (1/3): ICA achieves half of the identification

• The LiNGAM model is ICA.
– The observed variables xi are linear combinations of the non-Gaussian independent external influences ei:

x = Bx + e  ⇔  x = (I − B)^{-1} e = Ae,    i.e.,  e = Wx  with  W = I − B

• ICA gives W_ica = PDW = PD(I − B).
– P: unknown permutation matrix
– D: unknown scaling (diagonal) matrix
• We need to determine P and D to identify B.
Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6)

• ICA gives W_ica = PD(I − B).
– P: permutation matrix; D: scaling (diagonal) matrix
• We want to find a permutation matrix P̃ that cancels the permutation, i.e., P̃P = I:

P̃ W_ica = P̃ P D W = DW

• We can show (Shimizu et al., UAI2005) (illustrated in the next slides):
– If P̃P = I, i.e., no permutation is made on the rows of DW, then P̃ W_ica = DW has no zeros on the diagonal (obvious by definition).
– If P̃P ≠ I, i.e., any nonidentical permutation is made on the rows of DW, then P̃ W_ica has a zero on the diagonal.
Identifiability of LiNGAM (2/3): No permutation indeterminacy (2/6)

• By definition, W = I − B has all unities on the diagonal.
– The diagonal elements of B are all zeros.
• Acyclicity ensures the existence of an ordering of the variables that makes B lower-triangular, and then W = I − B is also lower-triangular.
• So, without loss of generality, W can be assumed to be lower-triangular:

     [ 1  0  0 ]
W =  [ *  1  0 ]
     [ *  *  1 ]

No zeros on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (3/6)

• Premultiplying W by a scaling (diagonal) matrix D does NOT change the zero/non-zero pattern of W:

     [ 1  0  0 ]           [ d11   0    0  ]
W =  [ *  1  0 ]     DW =  [  *   d22   0  ]
     [ *  *  1 ]           [  *    *   d33 ]

No zeros on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (4/6)

• Any other permutation of the rows of DW changes the zero/non-zero pattern of DW and brings a zero onto the diagonal. Exchanging the 1st and 2nd rows:

      [ d11   0    0  ]              [  *   d22   0  ]
DW =  [  *   d22   0  ]     P̃DW =   [ d11   0    0  ]
      [  *    *   d33 ]              [  *    *   d33 ]

Zero on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (5/6)

• Likewise, exchanging the 1st and 3rd rows:

      [ d11   0    0  ]              [  *    *   d33 ]
DW =  [  *   d22   0  ]     P̃DW =   [  *   d22   0  ]
      [  *    *   d33 ]              [ d11   0    0  ]

Zero on the diagonal!
Identifiability of LiNGAM (2/3): No permutation indeterminacy (6/6)

• We can find the correct P̃ by finding the P̃ that gives no zeros on the diagonal of P̃ W_ica (Shimizu et al., UAI2005).
• Thus, we can solve the permutation indeterminacy and obtain:

P̃ W_ica = P̃ P D W = DW = D(I − B)
Identifiability of LiNGAM (3/3): No scaling indeterminacy

• Now we have P̃ W_ica = D(I − B).
• Then D = diag(P̃ W_ica).
• Divide each row of P̃ W_ica by its corresponding diagonal element to get I − B, i.e., B:

diag(P̃ W_ica)^{-1} P̃ W_ica = D^{-1} D(I − B) = I − B
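Putting the two steps together: given W_ica = PD(I − B) with unknown P and D, search the row permutations for the one with no zeros on the diagonal, then rescale the rows. A sketch using the earlier three-variable example (exhaustive search over permutations is for illustration only; practical implementations use an assignment algorithm):

```python
import numpy as np
from itertools import permutations

# True model: x3 -> x1 -> x2 (variables ordered x1, x2, x3).
B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])
W = np.eye(3) - B

# What ICA hands us: W scrambled by an unknown row permutation and scaling.
P = np.eye(3)[[1, 2, 0]]
D = np.diag([2.0, -0.5, 3.0])
W_ica = P @ D @ W

# Step 1: find the (unique) row permutation with no zeros on the diagonal.
for perm in permutations(range(3)):
    W_p = W_ica[list(perm)]
    if np.all(np.abs(np.diag(W_p)) > 1e-12):
        break

# Step 2: divide each row by its diagonal element to undo the scaling.
W_hat = W_p / np.diag(W_p)[:, None]
B_hat = np.eye(3) - W_hat
print(np.allclose(B_hat, B))  # True: B is recovered exactly
```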
Estimation of the LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
Two estimation algorithms

• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)
• DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI2009)
• Both estimate an ordering of the variables that makes the path-coefficient matrix B lower-triangular.
– Acyclicity ensures the existence of such an ordering.
• [Figure: x_perm = B_perm x_perm + e_perm with lower-triangular B_perm, and a full DAG over x3, x1, x2 containing redundant edges.]
Once such an ordering is found…

• Many existing (covariance-based) methods can then:
– prune the redundant path-coefficients
   • sparse methods like the adaptive lasso (Zou, 2006)
– find significant path-coefficients
   • testing, bootstrapping (Shimizu et al., 2006; Hyvärinen et al., 2010)
• [Figure: the full DAG over x3, x1, x2 is pruned down to the data-generating DAG.]
1. Outline of the ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)

1. Estimate B by ICA + permutation.
2. Prune the redundant edges.
[Figure: a fully connected DAG over x1, x2, x3 with coefficients b13, b23 is pruned to remove the redundant edges.]
ICA-LiNGAM algorithm (1/2): Step 1. Estimation of B

1. Perform ICA (here, FastICA (Hyvärinen, 1999)) to obtain an estimate Ŵ_ica of W_ica = PDW = PD(I − B).
2. Find a permutation P̂ that makes the diagonal elements of P̂ Ŵ_ica as large (non-zero) as possible in absolute value:

P̂ = argmin_P Σ_i 1 / |(P Ŵ_ica)_ii|    (Hungarian algorithm, Kuhn, 1955)

3. Normalize each row of P̂ Ŵ_ica; then we get an estimate of I − B and hence B̂.
ICA-LiNGAM algorithm (2/2): Step 2. Pruning

• Find an ordering of the variables that makes the estimated B as close to lower-triangular as possible.
– Find a permutation matrix Q̂ that minimizes the sum of the squared elements in the upper-triangular part of the permuted B̂:

Q̂ = argmin_Q Σ_{i≤j} (Q B̂ Qᵀ)_{ij}²

– An approximate algorithm exists for a large number of variables (Hoyer et al., ICA2006).
• [Figure: small estimated coefficients in the upper-triangular part, e.g., 0.1 and -0.01, are pruned to zero.]
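For a small number of variables the pruning permutation Q can be sketched by exhaustive search (the noisy B̂ below is made up for illustration):

```python
import numpy as np
from itertools import permutations

# A noisy estimate of B for the chain x3 -> x1 -> x2 (values made up):
# small spurious coefficients remain where the true entries are zero.
B_hat = np.array([[ 0.00, 0.01,  1.49],
                  [-1.31, 0.00, -0.02],
                  [ 0.03, 0.01,  0.00]])

# Find the simultaneous row/column permutation minimizing the squared mass
# in the upper-triangular part (exhaustive here; approximate methods exist
# for many variables).
best_perm = min(permutations(range(3)),
                key=lambda p: np.sum(np.triu(B_hat[np.ix_(p, p)]) ** 2))
print(best_perm)  # (2, 0, 1): the ordering x3 < x1 < x2

# Pruning: zero out the (small) upper-triangular part in that ordering.
B_pruned = np.tril(B_hat[np.ix_(best_perm, best_perm)], k=-1)
print(B_pruned)
```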
Basic properties of the ICA-LiNGAM algorithm

• ICA-LiNGAM algorithm = ICA + permutations
– Computationally efficient with the help of well-developed ICA techniques.
• Potential problems:
– ICA is an iterative search method: it may get stuck in a local optimum if the initial guess or step size is badly chosen.
– The permutation algorithms are not scale-invariant: they may provide different estimates for different scales of the variables.
Estimation of the LiNGAM model

1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
2. DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI2009)

• An alternative estimation method without ICA.
– Estimates an ordering of the variables that makes the path-coefficient matrix B lower-triangular.
• [Figure: x_perm = B_perm x_perm + e_perm with lower-triangular B_perm, and a full DAG over x3, x1, x2 containing redundant edges.]
• Many existing (covariance-based) methods can then do further pruning or find significant path coefficients (Zou, 2006; Shimizu et al., 2006; Hyvärinen et al., 2010).
Basic idea (1/2): An exogenous variable can be at the top of a right ordering

• An exogenous variable is a variable with no parents (Bollen, 1989), here x3.
– The corresponding row of B has all zeros.
• So, an exogenous variable can be at the top of an ordering that makes B lower-triangular with zeros on the diagonal:

[x3]   [  0    0   0 ] [x3]   [e3]
[x1] = [ 1.5   0   0 ] [x1] + [e1]        (DAG: x3 → x1 → x2)
[x2]   [  0  -1.3  0 ] [x2]   [e2]
Basic idea (2/2): Regress the exogenous variable out

• Compute the residuals r_i^(3) (i = 1, 2) of regressing the other variables xi on the exogenous x3.
– The residuals form a LiNGAM model.
– The ordering of the residuals r1^(3) and r2^(3) is equivalent to that of the corresponding original variables:

[x3]   [  0    0   0 ] [x3]   [e3]          [r1^(3)]   [  0   0 ] [r1^(3)]   [e1]
[x1] = [ 1.5   0   0 ] [x1] + [e1]    →     [r2^(3)] = [-1.3  0 ] [r2^(3)] + [e2]
[x2]   [  0  -1.3  0 ] [x2]   [e2]

• r1^(3) being exogenous implies that `x1 can be at the second top'.
Outline of DirectLiNGAM

• Iteratively find exogenous variables until all the variables are ordered:
1. Find an exogenous variable, here x3.
– Put x3 at the top of the ordering.
– Regress x3 out.
2. Find an exogenous residual, here r1^(3).
– Put x1 at the second top of the ordering.
– Regress r1^(3) out.
3. Put x2 at the third top of the ordering and terminate.
• The estimated ordering is x3 < x1 < x2.
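The loop can be sketched in a few lines of numpy. This is an illustrative toy implementation, not the authors' code: it uses the tanh-based nonlinear correlation described on a later slide as the independence measure, and simple least-squares regression residuals.

```python
import numpy as np

def residual(y, x):
    """Least-squares simple regression residual of y on x."""
    return y - (np.cov(x, y)[0, 1] / np.var(x)) * x

def dependence(a, b):
    """Tanh-based nonlinear correlation (near zero under independence)."""
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1]))

def direct_lingam_order(X):
    """DirectLiNGAM ordering sketch: repeatedly pick the variable most
    independent of its regression residuals, then regress it out."""
    X = X.copy()
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        scores = [(sum(dependence(X[:, j], residual(X[:, i], X[:, j]))
                       for i in remaining if i != j), j)
                  for j in remaining]
        _, j_exo = min(scores)          # the most exogenous-looking variable
        order.append(j_exo)
        for i in remaining:             # replace the others by their residuals
            if i != j_exo:
                X[:, i] = residual(X[:, i], X[:, j_exo])
        remaining.remove(j_exo)
    order.append(remaining[0])
    return order

# Data from the example chain x3 -> x1 -> x2 with uniform disturbances.
rng = np.random.default_rng(0)
n = 50_000
e = rng.uniform(-1, 1, size=(n, 3))
x3 = e[:, 2]
x1 = 1.5 * x3 + e[:, 0]
x2 = -1.3 * x1 + e[:, 1]
X = np.column_stack([x1, x2, x3])

print(direct_lingam_order(X))  # expected: [2, 0, 1], i.e. x3 < x1 < x2
```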
Identification of an exogenous variable (two-variable case)

i) x1 is exogenous:
   x1 = e1 (b12 = 0)
   x2 = b21 x1 + e2 (b21 ≠ 0)
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1 = x2 − b21 x1 = e2
⇒ x1 and r2^(1) are independent.

ii) x1 is NOT exogenous:
   x1 = b12 x2 + e1 (b12 ≠ 0)
   x2 = e2
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1
          = {1 − b12 cov(x1, x2)/var(x1)} e2 − (cov(x1, x2)/var(x1)) e1
⇒ x1 and r2^(1) are NOT independent.
Need to use the Darmois-Skitovich theorem (Darmois, 1953; Skitovich, 1953)

ii) x1 is NOT exogenous:
   x1 = b12 x2 + e1 (b12 ≠ 0); x2 = e2
Regressing x2 on x1:
   r2^(1) = x2 − (cov(x1, x2)/var(x1)) x1 = {1 − b12 cov(x1, x2)/var(x1)} e2 − (cov(x1, x2)/var(x1)) e1

Darmois-Skitovich theorem: define two variables y1 and y2 as
   y1 = Σ_{j=1}^{p} a1j ej,    y2 = Σ_{j=1}^{p} a2j ej,
where the ej are independent random variables. If there exists a non-Gaussian ei for which a1i a2i ≠ 0, then y1 and y2 are dependent.

Here both x1 and r2^(1) have non-zero coefficients on the non-Gaussian e1 (since b12 ≠ 0), so x1 and r2^(1) are NOT independent.
Identification of an exogenous variable (p-variable case)

• Lemma 1: xj and its residuals r_i^(j) = xi − (cov(xi, xj)/var(xj)) xj are independent for all i ≠ j  ⇔  xj is exogenous.
• In practice, we can identify an exogenous variable by finding the variable that is most independent of its residuals.
Independence measures

• Evaluate the independence between a variable and a residual by a nonlinear correlation corr(g(xj), r_i^(j)) with g(·) = tanh(·).
• Taking the sum over all the residuals, we get:

T = Σ_{i≠j} { |corr(g(xj), r_i^(j))| + |corr(xj, g(r_i^(j)))| }

• More sophisticated measures can be used as well (Bach & Jordan, 2002; Gretton et al., 2005; Kraskov et al., 2004).
– A kernel-based independence measure (Bach & Jordan, 2002) often gives more accurate estimates (Sogawa et al., IJCNN2010).
Important properties of DirectLiNGAM

• DirectLiNGAM repeats:
– least-squares simple linear regression, and
– evaluation of pairwise independence between each variable and its residuals.
• No algorithmic parameters like step size, initial guesses, or convergence criteria.
• Guaranteed convergence to the right solution in a fixed number of steps (the number of variables) if the data strictly follows the model.
Estimation of the LiNGAM model: Summary (1/2)

• Two estimation algorithms:
– ICA-LiNGAM: estimation using ICA
   • Pros: fast.
   • Cons: possible local optima; not scale-invariant.
– DirectLiNGAM: alternative estimation without ICA
   • Pros: guaranteed convergence; scale-invariant.
   • Cons: less fast.
– Cf. Neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer, personal comm., July 2010).
Estimation of the LiNGAM model: Summary (2/2)

• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
– Sample size: both need a sample size of at least 1000 for more than 10 variables.
– Scalability: both can analyze 100 variables. The performance depends on the sample size etc., of course!
– Scale invariance: ICA-LiNGAM is less robust to changes in the scales of the variables.
– Local optima? For fewer than 10 variables, ICA-LiNGAM is often a bit better; for more than 10 variables, DirectLiNGAM is often better, perhaps because the problem of local optima becomes more serious.
Testing and reliability evaluation
Testing testable assumptions

• Non-Gaussianity of the external influences ei:
– Gaussianity tests
• Could detect violations of some assumptions:
– Local tests:
   • independence of the external influences ei
   • conditional independencies between the observed variables xi (causal Markov condition)
   • linearity
– Overall fit of the model assumptions:
   • chi-square test using 3rd- and/or 4th-order moments (Shimizu & Kano, 2008)
• Still under development.
Reliability evaluation

• Need to evaluate the statistical reliability of LiNGAM results:
– sample fluctuations;
– smaller non-Gaussianity makes the model closer to being NOT identifiable.
• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira, ICANN2010; Hyvärinen et al., JMLR, 2010):
– If either the sample size is small or the magnitude of the non-Gaussianity is small, LiNGAM would give very different results for bootstrap samples.
Extensions
Extensions (a partial list)

• Relaxing the assumptions of the LiNGAM model:
– Acyclic → cyclic (Lacerda et al., UAI2008)
– Single homogeneous population → heterogeneous population (Shimizu et al., ICONIP2007)
– i.i.d. sampling → time structures (Part II) (Hyvärinen et al., JMLR, 2010; Kawahara, Shimizu & Washio, 2010)
– No latent confounders → allow latents (Part II) (Hoyer et al., IJAR, 2008; Kawahara, Bollen et al., 2010)
– Linear → non-linear (Hoyer et al., NIPS2008; Zhang & Hyvärinen, UAI2009; Tillman, Gretton & Spirtes, NIPS2009)
Application areas so far
Non-Gaussian SEMs have been applied to…

• Neuroinformatics: brain connectivity analysis (Hyvärinen et al., JMLR, 2010)
• Bioinformatics: gene network estimation (Sogawa et al., ICANN2010)
• Economics (Wan & Tan, 2009; Moneta, Entner, Hoyer & Coad, 2010)
• Genetics (Ozaki & Ando, 2009)
• Environmental sciences (Niyogi et al., 2010)
• Physics (Kawahara, Shimizu & Washio, 2010)
• Sociology (Kawahara, Bollen, Shimizu & Washio, 2010)
Final summary of Part I

• The use of non-Gaussianity in linear SEMs is useful for model identification.
• Non-Gaussian data is encountered in many applications.
• The non-Gaussian approach can be a good option.
• Links to code and papers: http://homepage.mac.com/shoheishimizu/lingampapers.html
FAQs
Q. My data is Gaussian. LiNGAM will not be useful.

• A. You're right. Try Gaussian methods.
• Comment: Hoyer et al. (UAI2008) showed to what extent one can identify the model for a mixture of Gaussian and non-Gaussian external influence variables.
Q. I applied LiNGAM, but the result is not reasonable given background knowledge.

• A. You might first want to check:
– Some model assumptions might be violated.
   → Try other extensions of LiNGAM, or non-parametric methods such as PC or FCI (Spirtes et al., 2000).
– Small sample size or small non-Gaussianity.
   → Try a bootstrap to see whether the result is reliable.
– The background knowledge might be wrong.
Q. Relation to the causal Markov condition?

• A. The following 3 estimation principles are equivalent (Zhang & Hyvärinen, ECML2009; Hyvärinen et al., JMLR, 2010):
1. Maximize independence between the external influences ei.
2. Minimize the sum of the entropies of the external influences ei.
3. Causal Markov condition (each variable is independent of its non-descendants in the DAG conditional on its parents) AND maximization of independence between the parents of each variable and its corresponding external influence ei.
Q. I am a psychometrician and am more interested in latent factors.

• A. Shimizu, Hoyer, and Hyvärinen (2009) propose LiNGAM for latent factors:

f = Bf + d    (LiNGAM for latent factors)
x = Gf + e    (measurement model)
Others

• Q. Prior knowledge?
– It is possible to incorporate prior knowledge. The accuracy of DirectLiNGAM is often greatly improved even if the amount of prior knowledge is not so large (Inazumi et al., LVA/ICA2010).
• Q. Sparse LiNGAM?
– Zhang et al. (ICA2009) and Hyvärinen et al. (JMLR, 2010); ICA + adaptive lasso (Zou, 2006).
• Q. Bayesian approach?
– Hoyer and Hyttinen (UAI2009); Henao and Winther (NIPS2009).
• Q. Can the idea be applied to discrete variables?
– One proposal by Peters et al. (AISTATS2010).
– Comment: if your discrete variables are close to continuous, e.g., ordinal scales with many points, LiNGAM might work.
Q. Nonlinear extensions?

• A. Several nonlinear SEMs have been proposed (DAG; no latent confounders):
1. xi = Σ_j f_ij(parent xj) + ei    (Imoto et al., 2002)
2. xi = f_i(parents of xi) + ei    (Hoyer et al., NIPS2008)
3. xi = f_{i,1}^{-1}( f_{i,2}(parents of xi) + ei )    (Zhang et al., UAI2009)
• For the two-variable case, unique identification is possible except for several combinations of nonlinearities and distributions (Hoyer et al., NIPS2008; Zhang & Hyvärinen, UAI2009).
Nonlinear extensions (continued)

• Proposals aiming at computational efficiency (Mooij et al., ICML2009; Tillman et al., NIPS2009; Zhang & Hyvärinen, ECML2009; UAI2009).
• Pros:
– Nonlinear models are more general than linear models.
• Cons:
– Computationally demanding.
   • Currently: at most 7 or 8 variables.
   • Perhaps the assumption of Gaussian external influences might help: Imoto et al. (2002) analyze 100 variables.
– More difficult to allow other possible violations of the LiNGAM assumptions, latent confounders, etc.
Q. My data follows neither such linear SEMs nor such nonlinear SEMs as discussed here.

• A. Try non-parametric methods, e.g.:
– DAG: PC (Spirtes & Glymour, 1991)
– DAG with latent confounders: FCI (Spirtes et al., 1995)
– Model: xi = f_i(parents of xi, ei)
• You will probably get a (probably large) equivalence class of models rather than a single model, but that would be the best you currently can do.
Q. Deterministic relations?

• A. LiNGAM is not applicable.
• See Daniusis et al. (UAI2010) for a method to analyze deterministic relations.
Extra slides

Very brief review of linear SEMs and causality

See Pearl (2000) for details.
We can give a causal interpretation to each path-coefficient bij (Pearl, 2000)

• Example:

x1 = e1
x2 = b21 x1 + e2        (DAG: x1 → x2 → x3)
x3 = b32 x2 + e3

• Causal interpretation of b32: if you change x2 from a constant c0 to a constant c1, then E(x3) will change by b32 (c1 − c0).
`Change x2 from c0 to c1'

• It means `fix x2 at c0 regardless of the variables that previously determine x2, and change the constant from c0 to c1' (Pearl, 2000).
• `Fix x2 at c0 regardless of the variables that determine x2' is denoted by do(x2 = c0).
– In the SEM, `replace the function determining x2 with the constant c0':

Original model:           Mutilated model with do(x2 = c0):
x1 = e1                   x1 = e1
x2 = b21 x1 + e2          x2 = c0
x3 = b32 x2 + e3          x3 = b32 x2 + e3

(Assumption of invariance: the other functions don't change.)
Average causal effect: Definition (Rubin, 1974; Pearl, 2000)

• The average causal effect of x2 on x3 when changing x2 from c0 to c1:

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0)) = b32 (c1 − c0)

– Computed based on the mutilated models with do(x2 = c1) or do(x2 = c0).
• Average causal effects can be computed based on the estimated path-coefficient matrix B.
Cf. Conducting experiments with random assignment is a method to estimate average causal effects

• Example: a drug trial.
• Random assignment of subjects into two groups fixes x2 to c1 or c0 regardless of the variables that determine x2:
– experimental group with x2 = c1 (take the drug);
– control group with x2 = c0 (placebo).

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0))
= E(x3) for the experimental group − E(x3) for the control group