Kenji Fukumizu
Institute of Statistical Mathematics, ROIS
Department of Statistical Science, Graduate University for Advanced Studies
July 25, 2008 / Statistical Learning Theory II
Independence and Conditional Independence with RKHS
Statistical Inference with Reproducing Kernel Hilbert Space
Outline
1. Introduction
2. Covariance operators on RKHS
3. Independence with RKHS
4. Conditional independence with RKHS
5. Summary
Covariance on RKHS

$(X, Y)$: random variable taking values on $\Omega_X \times \Omega_Y$.
$(H_X, k_X)$, $(H_Y, k_Y)$: RKHS with measurable kernels on $\Omega_X$ and $\Omega_Y$, resp.
Assume $E[k_X(X,X)] < \infty$ and $E[k_Y(Y,Y)] < \infty$.

Cross-covariance operator: $\Sigma_{YX} : H_X \to H_Y$, defined by

$$\langle g, \Sigma_{YX} f \rangle = E[g(Y)f(X)] - E[g(Y)]\,E[f(X)] = \mathrm{Cov}(f(X), g(Y)) \quad \text{for all } f \in H_X,\ g \in H_Y.$$

- c.f. Euclidean case: $V_{YX} = E[YX^T] - E[Y]E[X]^T$ (covariance matrix), with $b^T V_{YX}\, a = \mathrm{Cov}(a^T X, b^T Y)$.

Proposition: regarded as an element of $H_Y \otimes H_X$,

$$\Sigma_{YX} \equiv E[\Phi_Y(Y) \otimes \Phi_X(X)] - m_Y \otimes m_X = m_{P_{XY}} - m_{P_X \otimes P_Y}.$$
Characterization of Independence

Independence and the cross-covariance operator.

Theorem: If the product kernel $k_X k_Y$ is characteristic on $\Omega_X \times \Omega_Y$, then

$$X \perp\!\!\!\perp Y \iff \Sigma_{XY} = O.$$

Proof: $\Sigma_{XY} = O \iff m_{P_{XY}} = m_{P_X \otimes P_Y} \iff P_{XY} = P_X \otimes P_Y$ (by the characteristic assumption).

- c.f. for Gaussian variables: $V_{XY} = O$, i.e. uncorrelated $\iff$ independent.
- c.f. characteristic function: $X \perp\!\!\!\perp Y \iff E[e^{\sqrt{-1}(uX + vY)}] = E[e^{\sqrt{-1}\,uX}]\, E[e^{\sqrt{-1}\,vY}]$.
Estimation of the Cross-Covariance Operator

$(X_1, Y_1), \ldots, (X_N, Y_N)$: i.i.d. sample on $\Omega_X \times \Omega_Y$.

An estimator of $\Sigma_{YX}$ is defined by

$$\hat\Sigma_{YX}^{(N)} = \frac{1}{N} \sum_{i=1}^{N} \{k_Y(\cdot, Y_i) - \hat m_Y\} \otimes \{k_X(\cdot, X_i) - \hat m_X\}.$$

Theorem: $\big\|\hat\Sigma_{YX}^{(N)} - \Sigma_{YX}\big\|_{HS} = O_p(1/\sqrt{N}) \quad (N \to \infty)$.

This is a corollary to the $\sqrt{N}$-consistency of the empirical mean, because the norm in $H_Y \otimes H_X$ is equal to the Hilbert-Schmidt norm of the corresponding operator $H_X \to H_Y$.
Hilbert-Schmidt Operator

An operator $A : H_1 \to H_2$ between Hilbert spaces is called Hilbert-Schmidt if, for complete orthonormal systems (CONS) $\{\varphi_i\}$ of $H_1$ and $\{\psi_j\}$ of $H_2$,

$$\sum_j \sum_i |\langle \psi_j, A\varphi_i \rangle|^2 < \infty.$$

Hilbert-Schmidt norm:

$$\|A\|_{HS}^2 = \sum_j \sum_i \langle \psi_j, A\varphi_i \rangle^2 \qquad \text{(c.f. Frobenius norm of a matrix)}.$$

- Fact: if $A : H_1 \to H_2$ is regarded as an element $F_A \in H_2 \otimes H_1$, then $\|A\|_{HS} = \|F_A\|$.
- Proof: since $\{\psi_j \otimes \varphi_i\}$ is a CONS of $H_2 \otimes H_1$,
$$\|F_A\|^2 = \sum_j \sum_i \langle F_A, \psi_j \otimes \varphi_i \rangle^2_{H_2 \otimes H_1} = \sum_j \sum_i \langle \psi_j, A\varphi_i \rangle^2 = \|A\|_{HS}^2.$$
Measuring Dependence

Dependence measure:
$$M_{YX} = \|\Sigma_{YX}\|_{HS}^2.$$

Empirical dependence measure:
$$\hat M_{YX}^{(N)} = \big\|\hat\Sigma_{YX}^{(N)}\big\|_{HS}^2.$$

With $k_X k_Y$ characteristic, $M_{YX} = 0 \iff X \perp\!\!\!\perp Y$, so $M_{YX}$ and $\hat M_{YX}^{(N)}$ can be used as measures of dependence.
HS Norm of the Cross-Covariance Operator I: Integral Expression

$$M_{YX} = \|\Sigma_{YX}\|_{HS}^2 = E\big[k_X(X,\tilde X)\,k_Y(Y,\tilde Y)\big] - 2\,E\big[E[k_X(X,\tilde X) \mid \tilde X]\; E[k_Y(Y,\tilde Y) \mid \tilde Y]\big] + E[k_X(X,\tilde X)]\;E[k_Y(Y,\tilde Y)],$$

where $(\tilde X, \tilde Y)$ is an independent copy of $(X, Y)$.

Note: a Hilbert-Schmidt norm always has an integral expression of this kind.
HS Norm of the Cross-Covariance Operator II: Empirical Estimator

Gram matrix expression:

$$\hat M_{YX}^{(N)} = \big\|\hat\Sigma_{YX}^{(N)}\big\|_{HS}^2 = \frac{1}{N^2}\sum_{i,j=1}^{N} k_X(X_i,X_j)\,k_Y(Y_i,Y_j) - \frac{2}{N^3}\sum_{i,j,k=1}^{N} k_X(X_i,X_j)\,k_Y(Y_i,Y_k) + \frac{1}{N^4}\sum_{i,j=1}^{N} k_X(X_i,X_j)\sum_{k,\ell=1}^{N} k_Y(Y_k,Y_\ell).$$

Or, equivalently,

$$\hat M_{YX}^{(N)} = \frac{1}{N^2}\,\mathrm{Tr}[G_X G_Y], \qquad G_X = Q_N K_X Q_N, \quad Q_N = I_N - \frac{1}{N}\mathbf 1_N \mathbf 1_N^T.$$

The HS norm needs to be evaluated only on the subspaces $\mathrm{Span}\{k_X(\cdot,X_i) - \hat m_X^{(N)}\}_{i=1}^N$ and $\mathrm{Span}\{k_Y(\cdot,Y_i) - \hat m_Y^{(N)}\}_{i=1}^N$, which yields the Gram matrix expression.
Application: ICA

Independent Component Analysis (ICA)

- Assumption:
  - m independent source signals $s_1(t), \ldots, s_m(t)$;
  - m observations of linearly mixed signals: $X(t) = A\,S(t)$, with A an m x m invertible matrix.
- Problem: restore the independent signals S from the observations X, as $\hat S = BX$ with B an m x m orthogonal matrix.

(The slide shows a diagram of sources $s_1(t), s_2(t), s_3(t)$ mixed by A into observations $x_1(t), x_2(t), x_3(t)$.)
ICA with HSIC

Given i.i.d. m-dimensional observations $X^{(1)}, \ldots, X^{(N)}$, minimize

$$L(B) = \sum_{a=1}^{m} \sum_{b > a} \mathrm{HSIC}(Y_a, Y_b), \qquad Y = BX.$$

- The pairwise-independence criterion is applicable.
- The objective function is non-convex, so optimization is not easy. An approximate Newton method has been proposed: Fast Kernel ICA (FastKICA; Shen et al. 2007). (Software downloadable at Arthur Gretton's homepage.)
- For other methods for ICA, see, for example, Hyvärinen et al. (2001).
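The pairwise-HSIC objective can be evaluated directly. The following toy sketch (my own illustration, not FastKICA; sources, rotation, and bandwidth are arbitrary choices) checks that for two rotated uniform sources the objective is smaller at the true unmixing rotation than at the mixed coordinates:

```python
import numpy as np

def hsic_1d(x, y, sigma=1.0):
    """(1/N^2) Tr[G_x G_y] for scalar samples, Gaussian kernels."""
    N = len(x)
    Q = np.eye(N) - np.ones((N, N)) / N
    gram = lambda v: np.exp(-(v[:, None] - v[None, :])**2 / (2 * sigma**2))
    Gx, Gy = Q @ gram(x) @ Q, Q @ gram(y) @ Q
    return np.trace(Gx @ Gy) / N**2

def ica_objective(B, X):
    """L(B) = sum_{a < b} HSIC(Y_a, Y_b) with Y = B X."""
    Y = B @ X
    m = Y.shape[0]
    return sum(hsic_1d(Y[a], Y[b]) for a in range(m) for b in range(a + 1, m))

rng = np.random.default_rng(1)
N = 300
S = rng.uniform(-1.0, 1.0, size=(2, N))      # independent non-Gaussian sources
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = R @ S                                     # mixed observations

L_mixed = ica_objective(np.eye(2), X)         # B = I leaves the signals mixed
L_unmixed = ica_objective(R.T, X)             # B = R^{-1} recovers the sources
```

A full ICA method would minimize L(B) over the orthogonal group; here we only compare two candidate values of B.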
Experiments (Speech Signals)

Three speech signals $s_1(t), s_2(t), s_3(t)$ are mixed by a randomly generated matrix A into $x_1(t), x_2(t), x_3(t)$; FastKICA estimates the demixing matrix B to recover the sources. (The slide shows the waveforms of the sources and the mixed observations.)
Independence Test with Kernels I

Independence test with positive definite kernels:
- Null hypothesis H0: X and Y are independent.
- Alternative H1: X and Y are not independent.

The empirical measure

$$\hat M_{YX}^{(N)} = \big\|\hat\Sigma_{YX}^{(N)}\big\|_{HS}^2 = \frac{1}{N^2}\sum_{i,j=1}^{N} k_X(X_i,X_j)\,k_Y(Y_i,Y_j) - \frac{2}{N^3}\sum_{i,j,k=1}^{N} k_X(X_i,X_j)\,k_Y(Y_i,Y_k) + \frac{1}{N^4}\sum_{i,j=1}^{N} k_X(X_i,X_j)\sum_{k,\ell=1}^{N} k_Y(Y_k,Y_\ell)$$

can be used as a test statistic.
Independence Test with Kernels II

Asymptotic distribution under the null hypothesis.

Theorem (Gretton et al. 2008): If X and Y are independent, then

$$N \hat M_{YX}^{(N)} \Rightarrow \sum_{i=1}^{\infty} \lambda_i Z_i^2 \quad \text{in law} \quad (N \to \infty),$$

where the $Z_i$ are i.i.d. $\sim N(0,1)$ and $\{\lambda_i\}_{i=1}^{\infty}$ are the eigenvalues of the integral operator

$$\lambda_i\, \varphi_i(u_1) = \int h(u_1, u_2, u_3, u_4)\, \varphi_i(u_2)\, dP_U(u_2)\, dP_U(u_3)\, dP_U(u_4),$$

with $U_a = (X_a, Y_a)$, $k_X^{a,b} = k_X(X_a, X_b)$, and

$$h(u_1, u_2, u_3, u_4) = \frac{1}{4!} \sum_{(a,b,c,d):\ \text{permutations of } (1,2,3,4)} \big( k_X^{a,b} k_Y^{a,b} - 2\, k_X^{a,b} k_Y^{a,c} + k_X^{a,b} k_Y^{c,d} \big).$$

The proof is easy by the theory of U- (or V-) statistics (see, e.g., Serfling 1980, Chapter 5).
Independence Test with Kernels III

Consistency of the test.

Theorem (Gretton et al. 2008): If $M_{YX}$ is not zero, then

$$\sqrt{N}\big(\hat M_{YX}^{(N)} - M_{YX}\big) \Rightarrow N(0, \sigma^2) \quad \text{in law} \quad (N \to \infty),$$

where

$$\sigma^2 = 16\,\big( E_a\big[ E_{b,c,d}[h(U_a, U_b, U_c, U_d)]^2 \big] - M_{YX}^2 \big).$$
Example of Independence Test

Synthesized data: two d-dimensional samples

$$(X_1^{(1)}, \ldots, X_d^{(1)}), \ldots, (X_1^{(N)}, \ldots, X_d^{(N)}) \quad \text{and} \quad (Y_1^{(1)}, \ldots, Y_d^{(1)}), \ldots, (Y_1^{(N)}, \ldots, Y_d^{(N)}).$$

(The slide shows test results plotted against the strength of dependence between X and Y.)
Comparison: Power Divergence (Ku & Fine 2005; Read & Cressie)

- Make a partition $\{A_j\}_{j \in J}$: each dimension is divided into q parts so that each bin contains almost the same number of data points.
- Power-divergence statistic:

$$T_N^{(\lambda)} = \frac{2}{\lambda(\lambda+1)} \sum_{j \in J} N \hat p_j \left\{ \left( \frac{\hat p_j}{\prod_k \hat p^{(k)}_{j_k}} \right)^{\lambda} - 1 \right\},$$

where $\hat p_j$ is the relative frequency in $A_j$ and $\hat p^{(k)}_r$ is the marginal relative frequency in the r-th interval of the k-th dimension. (The slide notes the special cases $I_2$ = mean square contingency and $I_0$ = mutual information.)

- Null distribution under independence: $T_N^{(\lambda)}$ converges in law to a $\chi^2$ distribution, with degrees of freedom determined by the partition.

Limitations:
- All the standard tests assume vector (numerical / discrete) data.
- They are often weak for high-dimensional data.
Independence Test on Text

- Data: official records of the Canadian Parliament in English and French.
  - Dependent data: 5-line-long parts from English texts and their French translations.
  - Independent data: 5-line-long parts from English texts and random 5-line parts from French texts.
- Kernels: bag-of-words (BOW) and spectral kernel.

Acceptance rate of H0 (α = 5%), from Gretton et al. (2007):

| Topic | Match | BOW (N=10) HSICg | BOW (N=10) HSICp | Spec (N=10) HSICg | Spec (N=10) HSICp | BOW (N=50) HSICg | BOW (N=50) HSICp | Spec (N=50) HSICg | Spec (N=50) HSICp |
|---|---|---|---|---|---|---|---|---|---|
| Agriculture | Random | 1.00 | 0.94 | 1.00 | 0.95 | 1.00 | 0.93 | 1.00 | 0.95 |
| Agriculture | Same | 0.99 | 0.18 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Fishery | Random | 1.00 | 0.94 | 1.00 | 0.94 | 1.00 | 0.93 | 1.00 | 0.95 |
| Fishery | Same | 1.00 | 0.20 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Immigration | Random | 1.00 | 0.96 | 1.00 | 0.91 | 0.99 | 0.94 | 1.00 | 0.95 |
| Immigration | Same | 1.00 | 0.09 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Re: Statistics on RKHS

Linear statistics on RKHS. Plan: define the basic statistics on the RKHS and derive nonlinear / nonparametric statistical methods in the original space.

| Basic statistics on Euclidean space | Basic statistics on RKHS |
|---|---|
| Mean | Mean element |
| Covariance | Cross-covariance operator $\Sigma_{YX}$ |
| Conditional covariance | Conditional cross-covariance operator |

Feature map: $\Phi : \Omega \text{ (original space)} \to H \text{ (RKHS)}$, $X \mapsto \Phi(X) = k(\cdot, X)$.
Conditional Independence

Definition. X, Y, Z: random variables with joint p.d.f. $p_{XYZ}(x, y, z)$. X and Y are conditionally independent given Z, written $X \perp\!\!\!\perp Y \mid Z$, if

$$p_{Y|XZ}(y \mid x, z) = p_{Y|Z}(y \mid z), \tag{A}$$

or, equivalently,

$$p_{XY|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\; p_{Y|Z}(y \mid z). \tag{B}$$

(A) says: with Z known, the information of X is unnecessary for inference on Y.
Review: Conditional Covariance of Gaussian Variables

Jointly Gaussian variable $Z = (X, Y)$ with $X = (X_1, \ldots, X_p)$, $Y = (Y_1, \ldots, Y_q)$: an m ( = p + q) dimensional Gaussian variable

$$Z \sim N(\mu, V), \qquad \mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad V = \begin{pmatrix} V_{XX} & V_{XY} \\ V_{YX} & V_{YY} \end{pmatrix}.$$

The conditional probability of Y given X is again Gaussian: $Y \mid X = x \sim N(\mu_{Y|X}, V_{YY|X})$, with

- conditional mean: $\mu_{Y|X} \equiv E[Y \mid X = x] = \mu_Y + V_{YX} V_{XX}^{-1}(x - \mu_X)$;
- conditional covariance: $V_{YY|X} \equiv \mathrm{Var}[Y \mid X = x] = V_{YY} - V_{YX} V_{XX}^{-1} V_{XY}$, the Schur complement of $V_{XX}$ in $V$.

Note: $V_{YY|X}$ does not depend on x.
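The Schur-complement identity can be checked numerically. This small sketch (my own illustration, not from the slides) verifies that the (Y,Y) block of $V^{-1}$ is exactly the inverse of $V_{YY|X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 2
# A random positive-definite joint covariance of Z = (X, Y).
M = rng.normal(size=(p + q, p + q))
V = M @ M.T + (p + q) * np.eye(p + q)
Vxx, Vxy = V[:p, :p], V[:p, p:]
Vyx, Vyy = V[p:, :p], V[p:, p:]

# Conditional covariance = Schur complement of Vxx in V.
Vyy_x = Vyy - Vyx @ np.linalg.solve(Vxx, Vxy)

# Block-inversion identity: (V^{-1})_{YY} = (V_{YY|X})^{-1}.
Vinv = np.linalg.inv(V)
ok = np.allclose(np.linalg.inv(Vinv[p:, p:]), Vyy_x)
```

The same identity underlies the block-matrix manipulations on the next slide.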
Conditional Independence for Gaussian Variables

Two characterizations (X, Y, Z jointly Gaussian):

- Conditional covariance:
$$X \perp\!\!\!\perp Y \mid Z \iff V_{XY|Z} = O, \quad \text{i.e.} \quad V_{XY} - V_{XZ} V_{ZZ}^{-1} V_{ZY} = O.$$

- Comparison of conditional variance:
$$X \perp\!\!\!\perp Y \mid Z \iff V_{YY|[X,Z]} = V_{YY|Z}.$$

Proof sketch: writing the conditional covariance of Y given the joint variable [X, Z] and applying the block-matrix inversion formula,

$$V_{YY|[X,Z]} = V_{YY} - \begin{pmatrix} V_{YX} & V_{YZ} \end{pmatrix} \begin{pmatrix} V_{XX} & V_{XZ} \\ V_{ZX} & V_{ZZ} \end{pmatrix}^{-1} \begin{pmatrix} V_{XY} \\ V_{ZY} \end{pmatrix} = V_{YY|Z} - V_{YX|Z}\, V_{XX|Z}^{-1}\, V_{XY|Z},$$

so $V_{YY|[X,Z]} = V_{YY|Z}$ holds if and only if $V_{XY|Z} = O$.
Linear Regression and Conditional Covariance

Review: linear regression.
- X, Y: random vectors (not necessarily Gaussian) of dimensions p and q (resp.). Let $\tilde X = X - E[X]$, $\tilde Y = Y - E[Y]$.
- Linear regression predicts Y using a linear combination of X, minimizing the mean square error:
$$\min_{A:\ q \times p \text{ matrix}} E\big\|\tilde Y - A \tilde X\big\|^2.$$
- The residual error is given by the conditional covariance matrix:
$$\min_{A:\ q \times p \text{ matrix}} E\big\|\tilde Y - A \tilde X\big\|^2 = \mathrm{Tr}\big[V_{YY|X}\big].$$
- For Gaussian variables,
$$X \perp\!\!\!\perp Y \mid Z \quad \big( \iff V_{YY|[X,Z]} = V_{YY|Z} \big)$$
can be interpreted as: "If Z is known, X is not necessary for linear prediction of Y."
Conditional Covariance on RKHS

Conditional cross-covariance operator. X, Y, Z: random variables on $\Omega_X, \Omega_Y, \Omega_Z$ (resp.); $(H_X, k_X), (H_Y, k_Y), (H_Z, k_Z)$: RKHS defined on $\Omega_X, \Omega_Y, \Omega_Z$ (resp.).

- Conditional cross-covariance operator:
$$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}\ :\ H_X \to H_Y.$$
- Conditional covariance operator:
$$\Sigma_{YY|Z} \equiv \Sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}\ :\ H_Y \to H_Y.$$
- $\Sigma_{ZZ}^{-1}$ may not exist as a bounded operator, but the definitions can be justified, as follows.
Decomposition of the Covariance Operator

There exists a bounded operator $W_{YX}$ with $\|W_{YX}\| \le 1$, $\mathrm{Range}(W_{YX}) \subset \overline{\mathrm{Range}(\Sigma_{YY})}$, and $\mathrm{Ker}(W_{YX})^{\perp} \subset \overline{\mathrm{Range}(\Sigma_{XX})}$, such that

$$\Sigma_{YX} = \Sigma_{YY}^{1/2}\, W_{YX}\, \Sigma_{XX}^{1/2}.$$

Rigorous definition of the conditional covariance operators:

$$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YY}^{1/2}\, W_{YZ} W_{ZX}\, \Sigma_{XX}^{1/2}, \qquad \Sigma_{YY|Z} \equiv \Sigma_{YY} - \Sigma_{YY}^{1/2}\, W_{YZ} W_{ZY}\, \Sigma_{YY}^{1/2}.$$
Two Characterizations of Conditional Independence with Kernels

(1) Conditional covariance operator (FBJ 2004, 2008).

- Conditional variance ($k_Z$ characteristic):
$$\langle g, \Sigma_{YY|Z}\, g \rangle = E\big[\mathrm{Var}[g(Y) \mid Z]\big] = \inf_{f \in H_Z} E\big[\big(\tilde g(Y) - \tilde f(Z)\big)^2\big],$$
where $\tilde g(Y) = g(Y) - E[g(Y)]$ and $\tilde f(Z) = f(Z) - E[f(Z)]$.
- c.f. Gaussian variables: $b^T V_{YY|Z}\, b = \mathrm{Var}[b^T Y \mid Z] = \min_a E\big[(b^T \tilde Y - a^T \tilde Z)^2\big]$.
- Conditional independence (all the kernels characteristic):
$$X \perp\!\!\!\perp Y \mid Z \iff \Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}.$$
- c.f. Gaussian variables: $X \perp\!\!\!\perp Y \mid Z \iff V_{YY|[X,Z]} = V_{YY|Z}$ — "X is not necessary for predicting g(Y)."
(2) Conditional cross-covariance operator (FBJ 2004).

- Conditional covariance ($k_Z$ characteristic):
$$\langle g, \Sigma_{YX|Z}\, f \rangle = E\big[\mathrm{Cov}[g(Y), f(X) \mid Z]\big].$$
- c.f. Gaussian variables: $a^T V_{XY|Z}\, b = \mathrm{Cov}[a^T X, b^T Y \mid Z]$.
- Conditional independence: with the augmented variables $\ddot X = (X, Z)$ and $\ddot Y = (Y, Z)$,
$$X \perp\!\!\!\perp Y \mid Z \iff \Sigma_{\ddot Y \ddot X|Z} = O \quad \big( \iff \Sigma_{Y \ddot X|Z} = O \big).$$
- c.f. Gaussian variables: $X \perp\!\!\!\perp Y \mid Z \iff V_{XY|Z} = O$.
Proof of (1), partial: relation between the residual error and the operator $\Sigma_{YY|Z}$.

$$\begin{aligned}
E\big[\big((g(Y) - E[g(Y)]) - (f(Z) - E[f(Z)])\big)^2\big]
&= \langle f, \Sigma_{ZZ} f \rangle - 2 \langle f, \Sigma_{ZY}\, g \rangle + \langle g, \Sigma_{YY}\, g \rangle \\
&= \big\|\Sigma_{ZZ}^{1/2} f\big\|^2 - 2 \big\langle \Sigma_{ZZ}^{1/2} f,\ W_{ZY} \Sigma_{YY}^{1/2} g \big\rangle + \big\|\Sigma_{YY}^{1/2} g\big\|^2 \\
&= \big\|\Sigma_{ZZ}^{1/2} f - W_{ZY} \Sigma_{YY}^{1/2} g\big\|^2 + \big\|\Sigma_{YY}^{1/2} g\big\|^2 - \big\|W_{ZY} \Sigma_{YY}^{1/2} g\big\|^2 \\
&= \big\|\Sigma_{ZZ}^{1/2} f - W_{ZY} \Sigma_{YY}^{1/2} g\big\|^2 + \big\langle g, \big(\Sigma_{YY} - \Sigma_{YY}^{1/2} W_{YZ} W_{ZY} \Sigma_{YY}^{1/2}\big) g \big\rangle.
\end{aligned}$$

The first term can be made arbitrarily small by choosing f, because $\mathrm{Range}(W_{ZY}) \subset \overline{\mathrm{Range}(\Sigma_{ZZ})}$; the second term is $\langle g, \Sigma_{YY|Z}\, g \rangle$.
Proof of (1): conditional independence.

Lemma (law of total variance): $\mathrm{Var}[Y] = E\big[\mathrm{Var}[Y \mid X]\big] + \mathrm{Var}\big[E[Y \mid X]\big]$.

From the lemma,

$$\mathrm{Var}[g(Y) \mid Z] = E\big[\mathrm{Var}[g(Y) \mid X, Z] \mid Z\big] + \mathrm{Var}\big[E[g(Y) \mid X, Z] \mid Z\big].$$

Taking $E_Z[\cdot]$,

$$E\big[\mathrm{Var}[g(Y) \mid Z]\big] - E\big[\mathrm{Var}[g(Y) \mid X, Z]\big] = E\big[\mathrm{Var}[E[g(Y) \mid X, Z] \mid Z]\big].$$

The left-hand side is 0 by $\Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}$, hence $\mathrm{Var}\big[E[g(Y) \mid X, Z] \mid Z = z\big] = 0$ for almost every $z$ w.r.t. $P_Z$, so $E[g(Y) \mid X, Z] = E[g(Y) \mid Z]$ almost everywhere w.r.t. $P_{XZ}$, and therefore $P_{Y|XZ} = P_{Y|Z}$ ($k_Y$ characteristic).
Why is the "extended variable" needed in (2)?

The operator captures only the averaged conditional covariance — the left-hand side below is not a function of z (c.f. the Gaussian case, where the conditional covariance is constant in z):

$$\langle g, \Sigma_{YX|Z}\, f \rangle = E\big[\mathrm{Cov}[g(Y), f(X) \mid Z]\big] \neq \mathrm{Cov}[g(Y), f(X) \mid Z = z].$$

Consequently,

$$\Sigma_{YX|Z} = O \;\Rightarrow\; p_{XY}(x, y) = \int p_{X|Z}(x|z)\, p_{Y|Z}(y|z)\, p_Z(z)\, dz,$$

which is weaker than $p_{XY|Z}(x, y|z) = p_{X|Z}(x|z)\, p_{Y|Z}(y|z)$. However, if X is replaced by $[X, Z]$,

$$\Sigma_{Y[X,Z]|Z} = O \;\Rightarrow\; p_{XZY}(x, z', y) = \int p_{XZ|Z}(x, z'|z)\, p_{Y|Z}(y|z)\, p_Z(z)\, dz,$$

and since $p_{XZ|Z}(x, z'|z) = p_{X|Z}(x|z)\, \delta(z - z')$, this gives

$$p_{XYZ}(x, y, z') = p_{X|Z}(x|z')\, p_{Y|Z}(y|z')\, p_Z(z'), \quad \text{i.e.} \quad p_{XY|Z}(x, y|z') = p_{X|Z}(x|z')\, p_{Y|Z}(y|z').$$
Empirical Estimator of the Conditional Covariance Operator

Given $(X_1, Y_1, Z_1), \ldots, (X_N, Y_N, Z_N)$:

- Empirical conditional covariance operator:
$$\hat\Sigma_{YX|Z}^{(N)} := \hat\Sigma_{YX}^{(N)} - \hat\Sigma_{YZ}^{(N)} \big(\hat\Sigma_{ZZ}^{(N)} + \varepsilon_N I\big)^{-1} \hat\Sigma_{ZX}^{(N)},$$
where $\hat\Sigma_{YZ}^{(N)} \to \Sigma_{YZ}$ etc., and $\Sigma_{ZZ}^{-1}$ is replaced by the regularized inverse $\big(\hat\Sigma_{ZZ}^{(N)} + \varepsilon_N I\big)^{-1}$, since the empirical operators have finite rank.
- Estimator of the Hilbert-Schmidt norm:
$$\big\|\hat\Sigma_{YX|Z}^{(N)}\big\|_{HS}^2 = \mathrm{Tr}\big[G_X S_Z G_Y S_Z\big],$$
with the centered Gram matrices $G_X = Q_N K_X Q_N$, $Q_N = I_N - \frac{1}{N}\mathbf 1_N \mathbf 1_N^T$, and
$$S_Z = I_N - G_Z \big(G_Z + N \varepsilon_N I_N\big)^{-1} = N \varepsilon_N \big(G_Z + N \varepsilon_N I_N\big)^{-1}.$$
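A rough numerical sketch of this estimator (my own toy check, not from the slides; the Gaussian kernels, σ = 1, and ε = 10^-2 are arbitrary choices): the statistic should be smaller when X and Y are conditionally independent given Z than when they are not.

```python
import numpy as np

def gauss_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2 / (2 * sigma**2))

def cond_cov_stat(x, y, z, eps=1e-2, sigma=1.0):
    """Tr[G_X S_Z G_Y S_Z] with S_Z = N eps (G_Z + N eps I)^{-1}."""
    N = len(x)
    Q = np.eye(N) - np.ones((N, N)) / N
    Gx = Q @ gauss_gram(x, sigma) @ Q
    Gy = Q @ gauss_gram(y, sigma) @ Q
    Gz = Q @ gauss_gram(z, sigma) @ Q
    Sz = N * eps * np.linalg.inv(Gz + N * eps * np.eye(N))
    return np.trace(Gx @ Sz @ Gy @ Sz)

rng = np.random.default_rng(3)
N = 200
Z = rng.normal(size=N)
X = Z + 0.3 * rng.normal(size=N)
Y_ci = Z + 0.3 * rng.normal(size=N)     # X indep. of Y_ci given Z
Y_dep = X + 0.3 * rng.normal(size=N)    # depends on X even given Z

t_ci = cond_cov_stat(X, Y_ci, Z)
t_dep = cond_cov_stat(X, Y_dep, Z)
```

Since $G_X$, $G_Y$, and $S_Z$ are all symmetric positive semi-definite, the statistic is nonnegative.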
Statistical Consistency

Consistency of the conditional covariance operator.

Theorem (FBJ 2008; Sun et al. 2007): Assume $\varepsilon_N \to 0$ and $N \varepsilon_N \to \infty$. Then

$$\big\|\hat\Sigma_{YX|Z}^{(N)} - \Sigma_{YX|Z}\big\|_{HS} \to 0 \quad (N \to \infty).$$

In particular, $\big\|\hat\Sigma_{YX|Z}^{(N)}\big\|_{HS} \to \big\|\Sigma_{YX|Z}\big\|_{HS}$.
Normalized Covariance Operator

Recall $\Sigma_{YX} = \Sigma_{YY}^{1/2}\, W_{YX}\, \Sigma_{XX}^{1/2}$.

- Normalized cross-covariance operator (NOCCO):
$$W_{YX} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX}\, \Sigma_{XX}^{-1/2}.$$
- Normalized conditional cross-covariance operator (NOC3O):
$$W_{YX|Z} = W_{YX} - W_{YZ} W_{ZX} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX|Z}\, \Sigma_{XX}^{-1/2} = \Sigma_{YY}^{-1/2}\big(\Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}\big)\Sigma_{XX}^{-1/2}.$$
- Characterization of (conditional) independence, with characteristic kernels:
$$W_{YX} = O \iff X \perp\!\!\!\perp Y, \qquad \tilde W_{YX|Z} = O \iff X \perp\!\!\!\perp Y \mid Z,$$
where $\tilde W_{YX|Z}$ denotes the normalized conditional operator with the augmented variables $\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$.
Measures for Conditional Independence

Assume $W_{XY}$ etc. are Hilbert-Schmidt.

- Dependence measure: $\mathrm{NOCCO} = \|W_{YX}\|_{HS}^2$.
- Conditional dependence measure: $\mathrm{NOC3O} = \|\tilde W_{XY|Z}\|_{HS}^2$ (X and Y augmented).
- Independence / conditional independence:
$$\mathrm{NOCCO} = 0 \iff X \perp\!\!\!\perp Y, \qquad \mathrm{NOC3O} = 0 \iff X \perp\!\!\!\perp Y \mid Z.$$
Kernel-Free Integral Expression

Let $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ be the probability on $\Omega_X \times \Omega_Y$ defined by

$$E_Z\big[P_{X|Z} \otimes P_{Y|Z}\big](A \times B) = \int P_{X|Z}(A \mid Z = z)\; P_{Y|Z}(B \mid Z = z)\; dP_Z(z).$$

Theorem: Assume $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ have densities $p_{XY}(x,y)$ and $p^{\perp}_{XY|Z}(x,y)$, resp., that the kernels of $H_Z$ and $H_X \otimes H_Y$ are characteristic, and that $W_{YX}$ and $W_{YZ} W_{ZX}$ are Hilbert-Schmidt. Then

$$\|W_{YX|Z}\|_{HS}^2 = \iint \left( \frac{p_{XY}(x,y) - p^{\perp}_{XY|Z}(x,y)}{p_X(x)\, p_Y(y)} \right)^2 p_X(x)\, p_Y(y)\; dx\, dy.$$

In the unconditional case,

$$\|W_{YX}\|_{HS}^2 = \iint \left( \frac{p_{XY}(x,y)}{p_X(x)\, p_Y(y)} - 1 \right)^2 p_X(x)\, p_Y(y)\; dx\, dy.$$

These are kernel-free expressions, though the definitions are given by kernels!
- A kernel-free value is desired as a "measure" of dependence: if the unnormalized operators are used, the measures depend on the choice of kernel.
- In the unconditional case, $\mathrm{NOCCO} = \|W_{YX}\|_{HS}^2$ is equal to the mean square contingency, a very popular measure of dependence for discrete variables.
- In the conditional case, if the augmented variables are used,

$$\|W_{\ddot Y \ddot X|Z}\|_{HS}^2 = \iiint \left( \frac{p_{XYZ}(x,y,z) - p_{X|Z}(x|z)\, p_{Y|Z}(y|z)\, p_Z(z)}{p_{XZ}(x,z)\, p_{YZ}(y,z)} \right)^2 p_{XZ}(x,z)\, p_{YZ}(y,z)\; dx\, dy\, dz$$

(the conditional mean square contingency).
Empirical Estimators

- Empirical estimation is straightforward with the empirical cross-covariance operator $\hat\Sigma_{YX}^{(N)}$: replace the covariances in $W_{YX} = \Sigma_{YY}^{-1/2} \Sigma_{YX} \Sigma_{XX}^{-1/2}$ by the empirical ones given by the data $\Phi_X(X_1), \ldots, \Phi_X(X_N)$ and $\Phi_Y(Y_1), \ldots, \Phi_Y(Y_N)$.
- Regularization for inversion: $\hat\Sigma_{XX}^{-1} \to \big(\hat\Sigma_{XX}^{(N)} + \varepsilon_N I\big)^{-1}$.
- Resulting estimators:
$$\mathrm{NOCCO}_{emp} = \mathrm{Tr}[R_X R_Y] \quad \text{(dependence measure)},$$
$$\mathrm{NOC3O}_{emp} = \mathrm{Tr}\big[R_{\ddot X} R_{\ddot Y} - 2\, R_{\ddot X} R_{\ddot Y} R_Z + R_{\ddot X} R_Z R_{\ddot Y} R_Z\big] \quad \text{(conditional dependence measure)},$$
where
$$R_X \equiv G_X \big(G_X + N \varepsilon_N I\big)^{-1}, \qquad G_X = \big(I_N - \tfrac{1}{N}\mathbf 1_N \mathbf 1_N^T\big) K_X \big(I_N - \tfrac{1}{N}\mathbf 1_N \mathbf 1_N^T\big), \qquad K_X = \big(k_X(X_i, X_j)\big)_{i,j=1}^{N}.$$
- $\mathrm{NOCCO}_{emp}$ and $\mathrm{NOC3O}_{emp}$ give kernel estimates of the mean square contingency and the conditional mean square contingency, resp.
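A short sketch of $\mathrm{NOCCO}_{emp} = \mathrm{Tr}[R_X R_Y]$ (my own illustration; the bandwidth and ε are arbitrary choices, not values from the slides):

```python
import numpy as np

def gauss_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2 / (2 * sigma**2))

def R_matrix(x, eps=1e-2, sigma=1.0):
    """R = G (G + N eps I)^{-1} with the centered Gram matrix G."""
    N = len(x)
    Q = np.eye(N) - np.ones((N, N)) / N
    G = Q @ gauss_gram(x, sigma) @ Q
    return G @ np.linalg.inv(G + N * eps * np.eye(N))

def nocco_emp(x, y, eps=1e-2, sigma=1.0):
    """NOCCO_emp = Tr[R_X R_Y], a kernel estimate of the mean square contingency."""
    return np.trace(R_matrix(x, eps, sigma) @ R_matrix(y, eps, sigma))

rng = np.random.default_rng(4)
N = 200
x = rng.normal(size=N)
y_dep = x + 0.3 * rng.normal(size=N)
y_ind = rng.normal(size=N)

n_dep = nocco_emp(x, y_dep)
n_ind = nocco_emp(x, y_ind)
```

The normalization by $(G + N\varepsilon I)^{-1}$ is what makes the value comparatively insensitive to the kernel scale, in line with the kernel-free expression above.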
Consistency

Theorem (Fukumizu et al. 2008): Assume that $W_{YX|Z}$ is Hilbert-Schmidt, and that the regularization coefficient satisfies $\varepsilon_N \to 0$ and $N^{1/3} \varepsilon_N \to \infty$. Then

$$\big\|\hat W_{YX|Z}^{(N)} - W_{YX|Z}\big\|_{HS} \to 0 \quad (N \to \infty).$$

In particular, $\big\|\hat W_{YX|Z}^{(N)}\big\|_{HS} \to \big\|W_{YX|Z}\big\|_{HS}$, i.e. $\mathrm{NOC3O}_{emp}$ ($\mathrm{NOCCO}_{emp}$) converges to the population value $\mathrm{NOC3O}$ ($\mathrm{NOCCO}$, resp.).
Choice of Kernel

How to choose a kernel?
- No definitive solutions have been proposed yet.
- For statistical tests, a comparison of power or efficiency would be desirable.
- Other suggestions:
  - Make a relevant supervised problem, and use cross-validation.
  - Some heuristics:
    - Heuristic for Gaussian kernels (Gretton et al. 2007): $\sigma = \mathrm{median}\{\|X_i - X_j\| \mid i \neq j\}$.
    - Speed of asymptotic convergence (Fukumizu et al. 2008): under independence,
$$\lim_{N \to \infty} \mathrm{Var}\big[N \times \mathrm{HSIC}_{emp}^{(N)}\big] = 2\, \|\Sigma_{XX}\|_{HS}^2\, \|\Sigma_{YY}\|_{HS}^2.$$
      Compare the bootstrapped variance with the theoretical one, and choose the parameter that gives the minimum discrepancy.
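The median heuristic is easy to implement; a small sketch (my own, not from the slides):

```python
import numpy as np

def median_heuristic(X):
    """sigma = median of the pairwise distances ||X_i - X_j||, i != j."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d = np.sqrt(d2)
    iu = np.triu_indices_from(d, k=1)   # pairs with i < j only
    return np.median(d[iu])
```

For example, for the points 0, 1, 3 on the line the pairwise distances are {1, 2, 3}, so the heuristic returns 2.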
Conditional Independence Test

Permutation test, with the statistic

$$T_N = \big\|\hat\Sigma_{YX|Z}^{(N)}\big\|_{HS}^2 \quad \text{or} \quad T_N = \big\|\hat W_{YX|Z}^{(N)}\big\|_{HS}^2.$$

- If Z takes values in a finite set {1, ..., L}, set $A_\ell = \{i \mid Z_i = \ell\}$ $(\ell = 1, \ldots, L)$; otherwise, partition the values of Z into L subsets $C_1, \ldots, C_L$, and set $A_\ell = \{i \mid Z_i \in C_\ell\}$.
- Repeat the following process B times (b = 1, ..., B):
  1. Generate pseudo conditionally independent data $D^{(b)}$ by permuting the X data within each $A_\ell$.
  2. Compute $T_N^{(b)}$ for the data $D^{(b)}$.
- Set the threshold by the $(1-\alpha)$-percentile of the empirical distribution of $T_N^{(b)}$, which approximates the null distribution under the conditional-independence assumption.
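This permutation scheme can be sketched as follows (my own illustration, not from the slides; it uses quantile bins of Z, the Tr[G_X S_Z G_Y S_Z] statistic, and arbitrary kernel/regularization parameters):

```python
import numpy as np

def gauss_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2 / (2 * sigma**2))

def stat(x, y, z, eps=1e-2):
    """T_N = Tr[G_X S_Z G_Y S_Z], the squared HS norm of the empirical cond. cov. op."""
    N = len(x)
    Q = np.eye(N) - np.ones((N, N)) / N
    Gx, Gy, Gz = (Q @ gauss_gram(v) @ Q for v in (x, y, z))
    Sz = N * eps * np.linalg.inv(Gz + N * eps * np.eye(N))
    return np.trace(Gx @ Sz @ Gy @ Sz)

def cond_perm_test(x, y, z, L=4, B=50, seed=0):
    """Permute X within L quantile bins of Z; return the permutation p-value."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(z, np.linspace(0, 1, L + 1))
    bins = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, L - 1)
    t_obs = stat(x, y, z)
    count = 0
    for _ in range(B):
        xp = x.copy()
        for l in range(L):
            idx = np.where(bins == l)[0]
            xp[idx] = xp[rng.permutation(idx)]   # shuffle X within the bin
        if stat(xp, y, z) >= t_obs:
            count += 1
    return (1 + count) / (1 + B)

rng = np.random.default_rng(5)
N = 150
z = rng.normal(size=N)
x = z + 0.3 * rng.normal(size=N)
y = x + 0.3 * rng.normal(size=N)   # conditionally dependent on x given z
pval = cond_perm_test(x, y, z)
```

Shuffling within bins approximately preserves the conditional law of X given Z while breaking any residual X-Y link, which is exactly the null being tested.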
Causality of Time Series

Granger causality (Granger 1969). X(t), Y(t): two time series, t = 1, 2, 3, ...
- Problem: is {X(1), ..., X(t)} a cause of Y(t+1)?
- Granger causality. Model: AR,
$$Y(t) = c + \sum_{i=1}^{p} a_i\, Y(t-i) + \sum_{j=1}^{p} b_j\, X(t-j) + U_t,$$
assuming no inverse causal relation. Test
$$H_0:\ b_1 = b_2 = \cdots = b_p = 0.$$
X is called a Granger cause of Y if $H_0$ is rejected.
- F-test.
  - Linear estimation: fit the full model
$$Y(t) = c + \sum_{i=1}^{p} a_i\, Y(t-i) + \sum_{j=1}^{p} b_j\, X(t-j) + U_t \;\Rightarrow\; \hat c,\ \hat a_i,\ \hat b_j,$$
  and the restricted model (under $H_0$)
$$Y(t) = c + \sum_{i=1}^{p} a_i\, Y(t-i) + W_t \;\Rightarrow\; \hat{\hat c},\ \hat{\hat a}_i.$$
  - Test statistic: with the residual sums of squares
$$ERR_1 = \sum_{t=p+1}^{N} \big(Y(t) - \hat Y(t)\big)^2, \qquad ERR_0 = \sum_{t=p+1}^{N} \big(Y(t) - \hat{\hat Y}(t)\big)^2,$$
$$T \equiv \frac{(ERR_0 - ERR_1)/p}{ERR_1/(N - 2p - 1)} \;\Rightarrow\; F_{p,\, N-2p-1} \quad (N \to \infty) \quad \text{under } H_0,$$
  where the p.d.f. of the $F_{d_1, d_2}$ distribution is
$$f(x) = \frac{1}{B(d_1/2,\, d_2/2)} \left(\frac{d_1}{d_2}\right)^{d_1/2} x^{d_1/2 - 1} \left(1 + \frac{d_1}{d_2}\, x\right)^{-(d_1 + d_2)/2}, \qquad x > 0.$$
- Software:
  - Matlab: Econometrics toolbox (www.spatial-econometrics.com)
  - R: lmtest package
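The F-test above can be sketched with ordinary least squares (my own illustration; the AR coefficients and lag order p = 1 are arbitrary choices for a toy example where X drives Y but not vice versa):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500
x = np.zeros(N); y = np.zeros(N)
for t in range(1, N):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()   # X Granger-causes Y

def rss(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    r = target - design @ coef
    return r @ r

def granger_F(cause, effect, p=1):
    """T = ((ERR0 - ERR1)/p) / (ERR1/(N - 2p - 1)) for lag order p."""
    n = len(effect)
    target = effect[p:]
    ones = np.ones(n - p)
    lags_e = np.column_stack([effect[p - i - 1 : n - i - 1] for i in range(p)])
    lags_c = np.column_stack([cause[p - j - 1 : n - j - 1] for j in range(p)])
    err1 = rss(np.column_stack([ones, lags_e, lags_c]), target)  # full model
    err0 = rss(np.column_stack([ones, lags_e]), target)          # restricted
    return ((err0 - err1) / p) / (err1 / (n - 2 * p - 1))

T_xy = granger_F(x, y)   # should be large: reject H0 "X does not cause Y"
T_yx = granger_F(y, x)   # should stay near the F null range
```

In practice one would compare T against the $(1-\alpha)$-quantile of $F_{p,\,N-2p-1}$ (e.g. via `scipy.stats.f`).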
- Granger causality is widely used and influential in econometrics; Clive Granger received the Nobel Prize in 2003.
- Limitations:
  - Linearity: a linear AR model is used, so no nonlinear dependence is considered.
  - Stationarity: stationary time series are assumed.
  - Hidden causes: hidden common causes (other time series) cannot be considered.
  Hence "Granger causality" is not necessarily "causality" in the general sense.
- There are many extensions. With kernel dependence measures, it is easily extended to incorporate nonlinear dependence.
  Remark: there are few good conditional independence tests for continuous variables.
![Page 47: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/47.jpg)
47
Kernel Method for Causality of Time Series

Causality by conditional independence
– Extended notion of Granger causality
  X is NOT a cause of Y if
    $p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p},\, X_{t-1}, \ldots, X_{t-p}) = p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})$,
  i.e.  $Y_t \perp\!\!\!\perp (X_{t-1}, \ldots, X_{t-p}) \mid (Y_{t-1}, \ldots, Y_{t-p})$.
– Kernel measures for causality
  $HSCIC = (N - p + 1)\, \|\hat{\Sigma}_{Y \ddot{X}^p \mid \ddot{Y}^p}\|_{HS}^2$
  $HSNCIC = (N - p + 1)\, \|\hat{W}_{Y \ddot{X}^p \mid \ddot{Y}^p}\|_{HS}^2$
  where
    $\mathbf{X}^p = \{(X_{t-1}, X_{t-2}, \ldots, X_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N\}$
    $\mathbf{Y}^p = \{(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N\}$
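The lagged data sets $\mathbf{X}^p$ and $\mathbf{Y}^p$ above are simply sliding windows over the series. A minimal sketch in NumPy (the function name `lagged_matrix` is ours):

```python
import numpy as np

def lagged_matrix(x, p):
    """Stack (x[t-1], ..., x[t-p]) as a row for each t = p+1, ..., N (1-based).

    Returns an (N - p, p) array; row k corresponds to time t = p + 1 + k.
    """
    x = np.asarray(x)
    N = len(x)
    # column j (j = 1..p) holds the lag-j values x[t-j]
    return np.column_stack([x[p - j: N - j] for j in range(1, p + 1)])

x = np.arange(1, 8)        # toy series x[t] = t, t = 1..7
Xp = lagged_matrix(x, 2)   # rows (x[t-1], x[t-2]) for t = 3..7
```

The kernel measures are then computed on the Gram matrices of these windowed samples, exactly as in the conditional (in)dependence measures of the earlier sections.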
![Page 48: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/48.jpg)
48
Coupled Hénon map
– X = (x_1, x_2), Y = (y_1, y_2):

  $x_1(t+1) = 1.4 - x_1(t)^2 + 0.3\, x_2(t)$
  $x_2(t+1) = x_1(t)$

  $y_1(t+1) = 1.4 - \bigl(\gamma\, x_1(t)\, y_1(t) + (1-\gamma)\, y_1(t)^2\bigr) + 0.1\, y_2(t)$
  $y_2(t+1) = y_1(t)$
[Scatter plots: the $(x_1, x_2)$ attractor of X, and $(x_1, y_1)$ for γ = 0, γ = 0.25, γ = 0.8]

Example
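The coupled map above is straightforward to simulate by iterating the update equations; a minimal sketch (the function name and initial conditions are ours, not from the slides):

```python
import numpy as np

def coupled_henon(T, gamma, x0=(0.1, 0.1), y0=(0.1, 0.1)):
    """Iterate the coupled Henon map for T steps; X drives Y when gamma > 0."""
    x = np.empty((T, 2))
    y = np.empty((T, 2))
    x[0], y[0] = x0, y0
    for t in range(T - 1):
        # driving system X: standard Henon map
        x[t + 1, 0] = 1.4 - x[t, 0] ** 2 + 0.3 * x[t, 1]
        x[t + 1, 1] = x[t, 0]
        # driven system Y: coupling strength gamma mixes in x_1(t)
        y[t + 1, 0] = (1.4 - (gamma * x[t, 0] * y[t, 0]
                              + (1 - gamma) * y[t, 0] ** 2) + 0.1 * y[t, 1])
        y[t + 1, 1] = y[t, 0]
    return x, y

x, y = coupled_henon(1000, gamma=0.25)
```

By construction only the past of X enters the update of Y, so X is a cause of Y for γ > 0 while Y is never a cause of X, which is what the causality tests below are meant to recover.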
![Page 49: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/49.jpg)
49
Causality in coupled Hénon map– X is a cause of Y if γ > 0.
– Y is not a cause of X for all γ.
– Permutation tests for non-causality with NOCCO
  H0: $Y_t$ is not a cause of $X_{t+1}$  ($X_{t+1} \perp\!\!\!\perp Y_t \mid X_t$)
  H0: $X_t$ is not a cause of $Y_{t+1}$  ($Y_{t+1} \perp\!\!\!\perp X_t \mid Y_t$)

  [Table (x1 – y1): number of times accepting H0 among 100 datasets (α = 5%), N = 100; rows NOCCO and Granger, columns γ = 0.0, …, 0.6, for each of the two null hypotheses]
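A permutation test compares the dependence statistic on the original pairing with its distribution under random permutations that destroy the X–Y link. A minimal sketch of the unconditional version, using a biased HSIC estimator as the statistic (the slide's normalized conditional statistic is not reproduced here; all names are ours). Note that a plain permutation is only valid for unconditional independence; conditional non-causality tests need permutation schemes that preserve the conditioning variables.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix; x is (n,) or (n, d)."""
    x = np.asarray(x, float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic(Kx, Ky):
    """Biased empirical HSIC: (1/n^2) tr(Kx H Ky H), H the centering matrix."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(Kx @ H @ Ky @ H)) / n ** 2

def independence_permutation_test(x, y, n_perm=200, seed=0):
    """p-value for H0: X independent of Y, by permuting the pairing."""
    rng = np.random.default_rng(seed)
    Kx, Ky = rbf_gram(x), rbf_gram(y)
    stat = hsic(Kx, Ky)
    # permuting rows and columns of Ky breaks any dependence on x
    exceed = sum(
        hsic(Kx, Ky[np.ix_(idx, idx)]) >= stat
        for idx in (rng.permutation(len(Ky)) for _ in range(n_perm))
    )
    return (1 + exceed) / (1 + n_perm)
```

H0 is accepted when the p-value exceeds α; counting acceptances over repeated datasets gives entries of the kind reported in the table above.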
![Page 50: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/50.jpg)
50
Summary

Dependence analysis with RKHS
– Covariance and conditional covariance operators on RKHS can capture the (in)dependence and conditional (in)dependence of random variables.
– Simple empirical estimators are available for the Hilbert-Schmidt norms of these operators.
– Statistical tests of independence and conditional independence are possible with kernel measures.
  • Applications: dimension reduction for regression (FBJ04, FBJ08), causal inference (Sun et al. 2007).
– Further study is required on the choice of kernel.
![Page 51: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/51.jpg)
51
References

Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics. To appear, 2008.
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 21, 489-496. MIT Press, 2008.
Fukumizu, K., F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5(Jan):73-99, 2004.
Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems 20, 585-592. MIT Press, 2008.
Gretton, A., K.M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. Advances in Neural Information Processing Systems 19, 513-520. 2007.
Gretton, A., O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. Proc. Algorithmic Learning Theory (ALT 2005), 63-78. 2005.
![Page 52: Independence and Conditional Independence with …fukumizu/H20_kernel/Kernel_9...Objective function is non-convex. Optimization is not easy. ÆApproximate Newton method has been proposed](https://reader030.fdocuments.us/reader030/viewer/2022041010/5eb6f95b9b31ab6618601414/html5/thumbnails/52.jpg)
52
Shen, H., S. Jegelka, and A. Gretton. Fast kernel ICA using an approximate Newton method. AISTATS 2007.
Serfling, R.J. Approximation Theorems of Mathematical Statistics. Wiley-Interscience, 1980.
Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML 2007), 855-862. 2007.