Independence and Conditional Independence with RKHS


Transcript of "Independence and Conditional Independence with RKHS"

Page 1

Kenji Fukumizu, Institute of Statistical Mathematics, ROIS

Department of Statistical Science, Graduate University for Advanced Studies

July 25, 2008 / Statistical Learning Theory II

Independence and Conditional Independence with RKHS

Statistical Inference with Reproducing Kernel Hilbert Space

Page 2

Outline

1. Introduction

2. Covariance operators on RKHS

3. Independence with RKHS

4. Conditional independence with RKHS

5. Summary

Page 3

Outline

1. Introduction

2. Covariance operators on RKHS

3. Independence with RKHS

4. Conditional independence with RKHS

5. Summary

Page 4

Covariance on RKHS

(X, Y): random variable taking values on $\Omega_X \times \Omega_Y$.
$(H_X, k_X)$, $(H_Y, k_Y)$: RKHS with measurable kernels on $\Omega_X$ and $\Omega_Y$, resp.
Assume $E[k_X(X,X)] < \infty$ and $E[k_Y(Y,Y)] < \infty$.

Cross-covariance operator: $\Sigma_{YX} : H_X \to H_Y$ is the operator satisfying

$\langle g, \Sigma_{YX} f \rangle = E[g(Y)f(X)] - E[g(Y)]E[f(X)] = \mathrm{Cov}[f(X), g(Y)]$  for all $f \in H_X$, $g \in H_Y$.

– c.f. Euclidean case: $V_{YX} = E[YX^T] - E[Y]E[X]^T$: covariance matrix, for which $\langle b, V_{YX} a \rangle = \mathrm{Cov}[(a, X), (b, Y)]$.

Proposition
Regarded as an element of $H_Y \otimes H_X$,
$\Sigma_{YX} \equiv E[\Phi_Y(Y) \otimes \Phi_X(X)] - m_Y \otimes m_X = m_{P_{YX}} - m_{P_Y \otimes P_X}$.

Page 5

Characterization of Independence

Independence and cross-covariance operator.

Theorem
If the product kernel $k_X k_Y$ is characteristic on $\Omega_X \times \Omega_Y$, then

X and Y are independent $\iff \Sigma_{XY} = O$.

proof)
$\Sigma_{XY} = O \iff m_{P_{XY}} = m_{P_X \otimes P_Y} \iff$ (by the characteristic assumption) $P_{XY} = P_X \otimes P_Y$.

– c.f. for Gaussian variables: X ⫫ Y $\iff V_{XY} = O$, i.e. uncorrelated.

– c.f. characteristic function: X ⫫ Y $\iff E[e^{\sqrt{-1}(uX + vY)}] = E[e^{\sqrt{-1}\,uX}]\, E[e^{\sqrt{-1}\,vY}]$.

Page 6

Estimation of the cross-covariance operator

$(X_1, Y_1), \ldots, (X_N, Y_N)$: i.i.d. sample on $\Omega_X \times \Omega_Y$.

An estimator of $\Sigma_{YX}$ is defined by

$\hat\Sigma_{YX}^{(N)} = \frac{1}{N} \sum_{i=1}^N \{ k_Y(\cdot, Y_i) - \hat m_Y^{(N)} \} \otimes \{ k_X(\cdot, X_i) - \hat m_X^{(N)} \}$

Theorem
$\| \hat\Sigma_{YX}^{(N)} - \Sigma_{YX} \|_{HS} = O_p(1/\sqrt{N}) \quad (N \to \infty)$

This is a corollary to the $\sqrt{N}$-consistency of the empirical mean element, because the norm in $H_Y \otimes H_X$ is equal to the Hilbert-Schmidt norm of the corresponding operator $H_X \to H_Y$.

Page 7

Hilbert-Schmidt Operator

– Hilbert-Schmidt operator
$A : H_1 \to H_2$ (operator between Hilbert spaces) is called Hilbert-Schmidt if, for complete orthonormal systems $\{\varphi_i\}$ of $H_1$ and $\{\psi_j\}$ of $H_2$,

$\sum_j \sum_i \langle \psi_j, A\varphi_i \rangle^2 < \infty.$

Hilbert-Schmidt norm:
$\|A\|_{HS}^2 = \sum_j \sum_i \langle \psi_j, A\varphi_i \rangle^2$   (c.f. Frobenius norm of a matrix)

– Fact: If $A : H_1 \to H_2$ is regarded as an element $F_A \in H_2 \otimes H_1$, then $\|F_A\| = \|A\|_{HS}$.
(∵ $\{\psi_j \otimes \varphi_i\}$ is a CONS of $H_2 \otimes H_1$, so
$\|F_A\|^2 = \sum_j \sum_i \langle F_A, \psi_j \otimes \varphi_i \rangle^2_{H_2 \otimes H_1} = \sum_j \sum_i \langle \psi_j, A\varphi_i \rangle^2 = \|A\|_{HS}^2$.)
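In finite dimensions the HS norm is exactly the Frobenius norm, which a few lines of numpy can confirm; the random matrix and the standard-basis CONS below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))        # operator A : H1 = R^3 -> H2 = R^4

phis = np.eye(3)                   # CONS {phi_i} of H1 (standard basis)
psis = np.eye(4)                   # CONS {psi_j} of H2

# ||A||_HS^2 = sum_j sum_i <psi_j, A phi_i>^2
hs_sq = sum((psis[:, j] @ A @ phis[:, i]) ** 2
            for j in range(4) for i in range(3))

print(np.isclose(hs_sq, np.linalg.norm(A, 'fro') ** 2))   # True
```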

Page 8

Outline

1. Introduction

2. Covariance operators on RKHS

3. Independence with RKHS

4. Conditional independence with RKHS

5. Summary

Page 9

Measuring Dependence

Dependence measure:
$M_{YX} = \| \Sigma_{YX} \|_{HS}^2$

Empirical dependence measure:
$\hat M_{YX}^{(N)} = \| \hat\Sigma_{YX}^{(N)} \|_{HS}^2$

$M_{YX} = 0 \iff$ X ⫫ Y, with $k_X k_Y$ characteristic, so $M_{YX}$ and $\hat M_{YX}^{(N)}$ can be used as measures of dependence.

Page 10

HS norm of cross-cov. operator I

Integral expression:

$M_{YX} = \|\Sigma_{YX}\|_{HS}^2 = E[k_X(X,\tilde X)\, k_Y(Y,\tilde Y)] - 2\, E\big[ E[k_X(X,\tilde X) \mid \tilde X]\; E[k_Y(Y,\tilde Y) \mid \tilde Y] \big] + E[k_X(X,\tilde X)]\; E[k_Y(Y,\tilde Y)]$

where $(\tilde X, \tilde Y)$ is an independent copy of (X, Y).

Proof: expand $\|\Sigma_{YX}\|_{HS}^2 = \| m_{P_{XY}} - m_{P_X \otimes P_Y} \|^2_{H_X \otimes H_Y}$ and apply the reproducing property of the mean element; the three terms are $\langle m_{P_{XY}}, m_{P_{XY}} \rangle$, $-2 \langle m_{P_{XY}}, m_{P_X \otimes P_Y} \rangle$, and $\langle m_{P_X \otimes P_Y}, m_{P_X \otimes P_Y} \rangle$.

Note: a Hilbert-Schmidt norm always has an integral expression.

Page 11

HS norm of cross-cov. operator II

Empirical estimator (Gram matrix expression):

$\hat M_{YX}^{(N)} = \| \hat\Sigma_{YX}^{(N)} \|_{HS}^2 = \frac{1}{N^2} \sum_{i,j=1}^N k_X(X_i,X_j) k_Y(Y_i,Y_j) - \frac{2}{N^3} \sum_{i,j,k=1}^N k_X(X_i,X_j) k_Y(Y_i,Y_k) + \frac{1}{N^4} \sum_{i,j=1}^N \sum_{k,\ell=1}^N k_X(X_i,X_j) k_Y(Y_k,Y_\ell)$

The HS norm can be evaluated within the finite-dimensional subspaces $\mathrm{Span}\{ k_X(\cdot, X_i) - \hat m_X^{(N)} \}$ and $\mathrm{Span}\{ k_Y(\cdot, Y_i) - \hat m_Y^{(N)} \}$.

Or equivalently,

$\hat M_{YX}^{(N)} = \frac{1}{N^2} \mathrm{Tr}[ G_X G_Y ]$,  where $G_X = Q_N K_X Q_N$, $Q_N = I_N - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T$   (centered Gram matrices).
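A minimal numpy sketch of the Gram-matrix formula $\hat M_{YX}^{(N)} = \mathrm{Tr}[G_X G_Y]/N^2$ above; the Gaussian kernel and its bandwidths are illustrative choices, not fixed by the slides.

```python
import numpy as np

def gram_gauss(X, sigma):
    """Gaussian Gram matrix K_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sx=1.0, sy=1.0):
    """Empirical HSIC: Tr[G_X G_Y] / N^2 with centered Gram matrices."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N          # centering matrix Q_N
    GX = Q @ gram_gauss(X, sx) @ Q               # G_X = Q_N K_X Q_N
    GY = Q @ gram_gauss(Y, sy) @ Q
    return np.trace(GX @ GY) / N ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
print(hsic(X, X + 0.1 * rng.normal(size=(200, 1))))   # dependent: larger value
print(hsic(X, rng.normal(size=(200, 1))))             # independent: near 0
```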

Page 12

Application: ICA

Independent Component Analysis (ICA)
– Assumption
  • m independent source signals $S(t) = (s_1(t), \ldots, s_m(t))$
  • m observations of linearly mixed signals: $X(t) = A\,S(t)$, A: m × m invertible matrix
– Problem
  • Restore the independent signals S from the observations X: $\hat S = BX$, B: m × m orthogonal matrix (orthogonality is justified after whitening the observations).

[Diagram: sources $s_1(t), s_2(t), s_3(t)$ mixed by A into observations $x_1(t), x_2(t), x_3(t)$.]

Page 13

ICA with HSIC

$X^{(1)}, \ldots, X^{(N)}$: i.i.d. observations (m-dimensional).

Minimize
$L(B) = \sum_{a=1}^m \sum_{b > a} \mathrm{HSIC}(Y_a, Y_b)$,  where $Y = BX$.

The pairwise-independence criterion is applicable (in the ICA model, pairwise independence of the recovered components suffices). A sketch of evaluating this objective follows this slide.

The objective function is non-convex, so optimization is not easy. An approximate Newton method has been proposed: Fast Kernel ICA (FastKICA; Shen et al. 2007). (Software downloadable at Arthur Gretton's homepage.)

Other methods for ICA: see, for example, Hyvärinen et al. (2001).
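As referenced above, a sketch of evaluating the pairwise objective $L(B)$ for a candidate demixing matrix B; the Gaussian kernel, its bandwidth, and the Laplace toy sources are illustrative assumptions (FastKICA's approximate Newton update itself is not reproduced here).

```python
import numpy as np

def gram_gauss(x, sigma=1.0):
    sq = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_1d(x, y):
    N = len(x)
    Q = np.eye(N) - np.ones((N, N)) / N
    GX, GY = Q @ gram_gauss(x) @ Q, Q @ gram_gauss(y) @ Q
    return np.trace(GX @ GY) / N ** 2

def ica_objective(B, X):
    """L(B) = sum_{a < b} HSIC(Y_a, Y_b) with Y = B X  (X: m x N observations)."""
    Y = B @ X
    m = Y.shape[0]
    return sum(hsic_1d(Y[a], Y[b]) for a in range(m) for b in range(a + 1, m))

# toy check: for mixed independent sources, B = A^{-1} should typically score lower
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 300))                  # independent non-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])
print(ica_objective(np.linalg.inv(A), A @ S), ica_objective(np.eye(2), A @ S))
```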

Page 14

Experiments (speech signal)

[Diagram: three speech signals $s_1(t), s_2(t), s_3(t)$ mixed by a randomly generated matrix A into $x_1(t), x_2(t), x_3(t)$; the demixing matrix B is estimated by FastKICA.]

Page 15

Independence test with kernels I

Independence test with positive definite kernels:
– Null hypothesis H0: X and Y are independent.
– Alternative H1: X and Y are not independent.

$\hat M_{YX}^{(N)} = \| \hat\Sigma_{YX}^{(N)} \|_{HS}^2 = \frac{1}{N^2} \sum_{i,j=1}^N k_X(X_i,X_j) k_Y(Y_i,Y_j) - \frac{2}{N^3} \sum_{i,j,k=1}^N k_X(X_i,X_j) k_Y(Y_i,Y_k) + \frac{1}{N^4} \sum_{i,j=1}^N \sum_{k,\ell=1}^N k_X(X_i,X_j) k_Y(Y_k,Y_\ell)$

can be used as a test statistic.
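Besides the asymptotic null distribution on the next slide, the null can be approximated by permuting the Y sample, which destroys the dependence while preserving the marginals; a minimal sketch (B = 200 permutations and the linear kernel in the demo are arbitrary illustrative choices).

```python
import numpy as np

def hsic_stat(KX, KY):
    """N * M_hat = Tr[G_X G_Y] / N from raw Gram matrices (centering applied here)."""
    N = KX.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N
    return np.trace(Q @ KX @ Q @ KY) / N

def perm_test(KX, KY, B=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    t0 = hsic_stat(KX, KY)
    null = [hsic_stat(KX, KY[np.ix_(p, p)])     # permute the Y sample
            for p in (rng.permutation(KY.shape[0]) for _ in range(B))]
    return t0 > np.quantile(null, 1 - alpha)    # True = reject H0

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1)); y = x + 0.3 * rng.normal(size=(100, 1))
print(perm_test(x @ x.T, y @ y.T))              # dependent data: expect True
```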

Page 16

Independence test with kernels II

Asymptotic distribution under the null hypothesis.

Theorem (Gretton et al. 2008)
If X and Y are independent, then

$N \hat M_{YX}^{(N)} \Rightarrow \sum_{i=1}^\infty \lambda_i Z_i^2$  in law  $(N \to \infty)$,

where $Z_i$: i.i.d. ~ N(0,1), and $\{\lambda_i\}_{i=1}^\infty$ are the eigenvalues of the integral operator

$\int h(u_a, u_b, u_c, u_d)\, \varphi_i(u_b)\, dP_U(u_b)\, dP_U(u_c)\, dP_U(u_d) = \lambda_i \varphi_i(u_a)$

with kernel

$h(u_a, u_b, u_c, u_d) = \frac{1}{4!} \sum_{(a,b,c,d)} \big( k_{X,ab}\, k_{Y,ab} - 2\, k_{X,ab}\, k_{Y,ac} + k_{X,ab}\, k_{Y,cd} \big)$,

the sum running over the permutations of the four indices, where $U_a = (X_a, Y_a)$ and $k_{X,ab} = k_X(X_a, X_b)$.

– The proof is easy by the theory of U- (or V-) statistics (see e.g. Serfling 1980, Chapter 5).

Page 17

Independence test with kernels III

Consistency of the test.

Theorem (Gretton et al. 2008)
If $M_{YX}$ is not zero, then

$\sqrt{N} \big( \hat M_{YX}^{(N)} - M_{YX} \big) \Rightarrow N(0, \sigma^2)$  in law  $(N \to \infty)$,

where $\sigma^2 = 16 \big( E_a\big[ E_{b,c,d}[ h(U_a, U_b, U_c, U_d) ]^2 \big] - M_{YX}^2 \big)$.

Page 18

Example of Independence Test

Synthesized data
– Data: two d-dimensional samples $(X_1^{(i)}, \ldots, X_d^{(i)})$ and $(Y_1^{(i)}, \ldots, Y_d^{(i)})$, $i = 1, \ldots, N$.

[Plot: test performance against the strength of dependence.]

Page 19

Power Divergence (Ku & Fine 2005; Read & Cressie)

– Make a partition $\{A_j\}_{j \in J}$: each dimension is divided into q parts so that each bin contains almost the same number of data.

– Power-divergence statistic:

$T_N^{(\lambda)} = \frac{2}{\lambda(\lambda+1)} \sum_{j \in J} N \hat p_j \Big\{ \Big( \frac{\hat p_j}{\prod_k \hat p_{j_k}^{(k)}} \Big)^{\lambda} - 1 \Big\}$

where $\hat p_j$ is the frequency in $A_j$ and $\hat p_r^{(k)}$ is the marginal frequency in the r-th interval of the k-th dimension. ($I_2$ = mean square contingency, $I_0$ = MI.)

– Null distribution under independence: $T_N \Rightarrow \chi^2$ with degrees of freedom determined by the partition (for m variables with q bins each, $q^m - m(q-1) - 1$).

Limitations
– All the standard tests assume vector (numerical / discrete) data.
– They are often weak for high-dimensional data.

Page 20

Independence Test on Text

– Data: official records of the Canadian Parliament in English and French.
  • Dependent data: 5-line-long parts from English texts and their French translations.
  • Independent data: 5-line-long parts from English texts and random 5-line parts from French texts.
– Kernels: bag-of-words (BOW) and spectral kernel.

Acceptance rate of H0 (α = 5%) (Gretton et al. 07):

Topic        Match   BOW(N=10)     Spec(N=10)    BOW(N=50)     Spec(N=50)
                     HSICg HSICp   HSICg HSICp   HSICg HSICp   HSICg HSICp
Agriculture  Random  1.00  0.94    1.00  0.95    1.00  0.93    1.00  0.95
Agriculture  Same    0.99  0.18    1.00  0.00    0.00  0.00    0.00  0.00
Fishery      Random  1.00  0.94    1.00  0.94    1.00  0.93    1.00  0.95
Fishery      Same    1.00  0.20    1.00  0.00    0.00  0.00    0.00  0.00
Immigration  Random  1.00  0.96    1.00  0.91    0.99  0.94    1.00  0.95
Immigration  Same    1.00  0.09    1.00  0.00    0.00  0.00    0.00  0.00

Page 21

Outline

1. Introduction

2. Covariance operators on RKHS

3. Independence with RKHS

4. Conditional independence with RKHS

5. Summary

Page 22

Re: Statistics on RKHS

Linear statistics on RKHS
– Basic statistics on Euclidean space ↔ basic statistics on RKHS:
  Mean ↔ mean element $m_X$
  Covariance ↔ cross-covariance operator $\Sigma_{YX}$
  Conditional covariance ↔ conditional cross-covariance operator
– Plan: define the basic statistics on RKHS and derive nonlinear / nonparametric statistical methods in the original space.

[Diagram: feature map Φ from the original space Ω into the RKHS H, $X \mapsto \Phi(X) = k(\cdot, X)$.]

Page 23

Conditional Independence

Definition
X, Y, Z: random variables with joint p.d.f. $p_{XYZ}(x, y, z)$. X and Y are conditionally independent given Z if

$p_{Y|XZ}(y \mid x, z) = p_{Y|Z}(y \mid z)$   or   $p_{XY|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)$.

X ⫫ Y | Z: with Z known, the information of X is unnecessary for the inference on Y.

[Diagram: two graphical models (A) and (B) relating X, Y, Z.]

Page 24

Review: Conditional Covariance

Conditional covariance of Gaussian variables
– Jointly Gaussian variable: $Z = (X, Y)$ with $X = (X_1, \ldots, X_p)$, $Y = (Y_1, \ldots, Y_q)$: m (= p + q)-dimensional Gaussian variable,

$Z \sim N(\mu, V)$,  $\mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}$,  $V = \begin{pmatrix} V_{XX} & V_{XY} \\ V_{YX} & V_{YY} \end{pmatrix}$.

– The conditional probability of Y given X is again Gaussian: $Y \mid X = x \sim N(\mu_{Y|X}, V_{YY|X})$, where

Cond. mean: $\mu_{Y|X} \equiv E[Y \mid X = x] = \mu_Y + V_{YX} V_{XX}^{-1} (x - \mu_X)$
Cond. covariance: $V_{YY|X} \equiv \mathrm{Var}[Y \mid X = x] = V_{YY} - V_{YX} V_{XX}^{-1} V_{XY}$   (Schur complement of $V_{XX}$ in V)

Note: $V_{YY|X}$ does not depend on x.
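A quick numerical illustration of the Schur-complement formula; the particular covariance matrix and the Monte-Carlo residual check are illustrative.

```python
import numpy as np

# joint covariance of (X, Y) with p = q = 1
V = np.array([[2.0, 0.8],
              [0.8, 1.0]])
VXX, VXY, VYX, VYY = V[:1, :1], V[:1, 1:], V[1:, :1], V[1:, 1:]

# V_{YY|X} = V_YY - V_YX V_XX^{-1} V_XY  (Schur complement of V_XX in V)
print(VYY - VYX @ np.linalg.inv(VXX) @ VXY)     # [[0.68]]

# Monte-Carlo check: variance of the residual of Y regressed on X
rng = np.random.default_rng(0)
X, Y = rng.multivariate_normal(np.zeros(2), V, size=200_000).T
print(np.var(Y - (np.cov(Y, X)[0, 1] / np.var(X)) * X))   # ~ 0.68
```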

Page 25

Conditional Independence for Gaussian Variables

Two characterizations (X, Y, Z Gaussian):

– Conditional covariance:
X ⫫ Y | Z $\iff V_{XY|Z} = O$, i.e. $V_{XY} - V_{XZ} V_{ZZ}^{-1} V_{ZY} = O$.

– Comparison of conditional variances:
X ⫫ Y | Z $\iff V_{YY|[X,Z]} = V_{YY|Z}$.

(∵ Conditioning on the joint variable [X, Z],

$V_{YY|[X,Z]} = V_{YY} - \begin{pmatrix} V_{YX} & V_{YZ} \end{pmatrix} \begin{pmatrix} V_{XX} & V_{XZ} \\ V_{ZX} & V_{ZZ} \end{pmatrix}^{-1} \begin{pmatrix} V_{XY} \\ V_{ZY} \end{pmatrix}$,

and factoring the block inverse via the Schur complement of $V_{ZZ}$,

$\begin{pmatrix} V_{XX} & V_{XZ} \\ V_{ZX} & V_{ZZ} \end{pmatrix}^{-1} = \begin{pmatrix} I & O \\ -V_{ZZ}^{-1} V_{ZX} & I \end{pmatrix} \begin{pmatrix} V_{XX|Z}^{-1} & O \\ O & V_{ZZ}^{-1} \end{pmatrix} \begin{pmatrix} I & -V_{XZ} V_{ZZ}^{-1} \\ O & I \end{pmatrix}$,

one obtains

$V_{YY|[X,Z]} = V_{YY|Z} - V_{YX|Z}\, V_{XX|Z}^{-1}\, V_{XY|Z}$,

so the two sides agree exactly when $V_{XY|Z} = O$.)

Page 26

Linear Regression and Conditional Covariance

Review: linear regression
– X, Y: random vectors (not necessarily Gaussian) of dim p and q (resp.).
– Linear regression: predict Y using a linear combination of X. Minimize the mean square error:

$\min_{A:\ q \times p\ \text{matrix}} E\| \tilde Y - A \tilde X \|^2$,  where $\tilde X = X - E[X]$, $\tilde Y = Y - E[Y]$.

– The residual error is given by the conditional covariance matrix:

$\min_{A:\ q \times p\ \text{matrix}} E\| \tilde Y - A \tilde X \|^2 = \mathrm{Tr}\big[ V_{YY|X} \big]$.

– For Gaussian variables,

X ⫫ Y | Z ($\iff V_{YY|[X,Z]} = V_{YY|Z}$) can be interpreted as: "If Z is known, X is not necessary for linear prediction of Y."

Page 27

Conditional Covariance on RKHS

Conditional cross-covariance operator
X, Y, Z: random variables on $\Omega_X, \Omega_Y, \Omega_Z$ (resp.). $(H_X, k_X)$, $(H_Y, k_Y)$, $(H_Z, k_Z)$: RKHS defined on $\Omega_X, \Omega_Y, \Omega_Z$ (resp.).

– Conditional cross-covariance operator:
$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX} : H_X \to H_Y$

– Conditional covariance operator:
$\Sigma_{YY|Z} \equiv \Sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY} : H_Y \to H_Y$

– $\Sigma_{ZZ}^{-1}$ may not exist as a bounded operator, but the definitions can be justified (next slide).

Page 28

Decomposition of covariance operator

$\Sigma_{YX} = \Sigma_{YY}^{1/2}\, W_{YX}\, \Sigma_{XX}^{1/2}$

such that $W_{YX}$ is a bounded operator with $\|W_{YX}\| \le 1$ and
$\mathrm{Range}(W_{YX}) \subset \overline{\mathrm{Range}(\Sigma_{YY})}$,  $\mathrm{Ker}(W_{YX})^{\perp} \subset \overline{\mathrm{Range}(\Sigma_{XX})}$.

Rigorous definition of conditional covariance operators:
$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YY}^{1/2}\, W_{YZ} W_{ZX}\, \Sigma_{XX}^{1/2}$
$\Sigma_{YY|Z} \equiv \Sigma_{YY} - \Sigma_{YY}^{1/2}\, W_{YZ} W_{ZY}\, \Sigma_{YY}^{1/2}$

Page 29

Two Characterizations of Conditional Independence with Kernels

(1) Conditional covariance operator (FBJ04, 08)

– Conditional variance ($k_Z$ characteristic):
$\langle g, \Sigma_{YY|Z}\, g \rangle = E[\mathrm{Var}[g(Y) \mid Z]] = \inf_{f \in H_Z} E\big[ \big( \tilde g(Y) - \tilde f(Z) \big)^2 \big]$
c.f. Gaussian: $b^T V_{YY|Z}\, b = \mathrm{Var}[b^T Y \mid Z] = \min_a E\big( b^T \tilde Y - a^T \tilde Z \big)^2$

– Conditional independence (all the kernels are characteristic):
X ⫫ Y | Z $\iff \Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}$
("X is not necessary for predicting g(Y).")
c.f. Gaussian: X ⫫ Y | Z $\iff V_{YY|[X,Z]} = V_{YY|Z}$

Page 30

(2) Conditional cross-covariance operator (FBJ04)

– Conditional covariance ($k_Z$ characteristic):
$\langle g, \Sigma_{YX|Z} f \rangle = E\big[ \mathrm{Cov}[g(Y), f(X) \mid Z] \big]$
c.f. Gaussian: $a^T V_{XY|Z}\, b = \mathrm{Cov}[a^T X, b^T Y \mid Z]$

– Conditional independence: with the extended variables $\ddot X = (X, Z)$ and $\ddot Y = (Y, Z)$,
X ⫫ Y | Z $\iff \Sigma_{\ddot Y \ddot X | Z} = O$ ($\iff \Sigma_{\ddot X \ddot Y | Z} = O$)
c.f. Gaussian: X ⫫ Y | Z $\iff V_{XY|Z} = O$

Page 31

– Proof of (1) (partial): relation between the residual error and the operator $\Sigma_{YY|Z}$.

$E\big( \{g(Y) - E[g(Y)]\} - \{f(Z) - E[f(Z)]\} \big)^2$
$= \langle f, \Sigma_{ZZ} f \rangle - 2 \langle f, \Sigma_{ZY}\, g \rangle + \langle g, \Sigma_{YY}\, g \rangle$
$= \| \Sigma_{ZZ}^{1/2} f \|^2 - 2 \langle \Sigma_{ZZ}^{1/2} f,\ W_{ZY} \Sigma_{YY}^{1/2} g \rangle + \| \Sigma_{YY}^{1/2} g \|^2$
$= \| \Sigma_{ZZ}^{1/2} f - W_{ZY} \Sigma_{YY}^{1/2} g \|^2 + \| \Sigma_{YY}^{1/2} g \|^2 - \| W_{ZY} \Sigma_{YY}^{1/2} g \|^2$
$= \| \Sigma_{ZZ}^{1/2} f - W_{ZY} \Sigma_{YY}^{1/2} g \|^2 + \langle g, \big( \Sigma_{YY} - \Sigma_{YY}^{1/2} W_{YZ} W_{ZY} \Sigma_{YY}^{1/2} \big) g \rangle$

The first term can be made arbitrarily small by choosing f, because $\mathrm{Range}(W_{ZY}) \subset \overline{\mathrm{Range}(\Sigma_{ZZ})}$; the second term is $\langle g, \Sigma_{YY|Z}\, g \rangle$.

Page 32

Proof of (1): conditional independence

Lemma (law of total variance): $\mathrm{Var}[Y] = E[\mathrm{Var}[Y \mid X]] + \mathrm{Var}[E[Y \mid X]]$.

Conditioning everything further on Z,

$\mathrm{Var}[g(Y) \mid Z] = E\big[ \mathrm{Var}[g(Y) \mid X, Z] \mid Z \big] + \mathrm{Var}\big[ E[g(Y) \mid X, Z] \mid Z \big]$.

Taking $E_Z[\cdot]$,

$E\big[ \mathrm{Var}[g(Y) \mid Z] \big] - E\big[ \mathrm{Var}[g(Y) \mid X, Z] \big] = E\big[ \mathrm{Var}[ E[g(Y) \mid X, Z] \mid Z ] \big]$.

From $\Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}$, the L.H.S. is 0 for every g, so

$\mathrm{Var}\big[ E[g(Y) \mid X, Z] \mid Z \big] = 0$  for $P_Z$-almost every z
$\Rightarrow E[g(Y) \mid X, Z] = E[g(Y) \mid Z]$  for $P_{XZ}$-almost every (x, z)
$\Rightarrow P_{Y|XZ} = P_{Y|Z}$  ($k_Y$ characteristic).

Page 33

– Why is the "extended variable" needed in (2)?

$\langle g, \Sigma_{YX|Z} f \rangle = E\big[ \mathrm{Cov}[g(Y), f(X) \mid Z] \big] \ne \mathrm{Cov}[g(Y), f(X) \mid Z = z]$:
the l.h.s. involves an expectation over Z, so it is not a function of z (c.f. the Gaussian case, where the conditional covariance does not depend on the conditioning value).

Hence

$\Sigma_{YX|Z} = O \;\Rightarrow\; p_{XY}(x, y) = \int p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)\, p_Z(z)\, dz$,

which is weaker than conditional independence, $p_{XY|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)$.

However, if X is replaced by [X, Z]:

$\Sigma_{[X,Z]Y|Z} = O \;\Rightarrow\; p_{XZY}(x, z', y) = \int p_{XZ|Z}(x, z' \mid z)\, p_{Y|Z}(y \mid z)\, p_Z(z)\, dz$,

where $p_{XZ|Z}(x, z' \mid z) = p_{X|Z}(x \mid z)\, \delta(z - z')$, so

$p_{XZY}(x, z', y) = p_{X|Z}(x \mid z')\, p_{Y|Z}(y \mid z')\, p_Z(z')$,

i.e. $p_{XY|Z}(x, y \mid z') = p_{X|Z}(x \mid z')\, p_{Y|Z}(y \mid z')$.

Page 34

Empirical Estimator of the Conditional Covariance Operator

$(X_1, Y_1, Z_1), \ldots, (X_N, Y_N, Z_N)$: i.i.d. sample.

– Empirical conditional covariance operator (finite-rank operators):

$\hat\Sigma_{YX|Z}^{(N)} := \hat\Sigma_{YX}^{(N)} - \hat\Sigma_{YZ}^{(N)} \big( \hat\Sigma_{ZZ}^{(N)} + \varepsilon_N I \big)^{-1} \hat\Sigma_{ZX}^{(N)}$

where $\hat\Sigma_{YZ}^{(N)} \to \Sigma_{YZ}$ etc., and $\big( \hat\Sigma_{ZZ}^{(N)} + \varepsilon_N I \big)^{-1} \to \Sigma_{ZZ}^{-1}$: regularization for the inversion.

– Estimator of the Hilbert-Schmidt norm:

$\| \hat\Sigma_{YX|Z}^{(N)} \|_{HS}^2 = \frac{1}{N^2} \mathrm{Tr}\big[ G_X S_Z G_Y S_Z \big]$

where $G_X = Q_N K_X Q_N$, $Q_N = I_N - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T$ (centered Gram matrix), and

$S_Z = I_N - G_Z (G_Z + N \varepsilon_N I_N)^{-1} = N \varepsilon_N (G_Z + N \varepsilon_N I_N)^{-1}$.
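A numpy sketch of the estimator $\mathrm{Tr}[G_X S_Z G_Y S_Z]/N^2$ above; the Gaussian kernels and the value of $\varepsilon_N$ are illustrative choices, and data are assumed to be 2-d arrays (N × d). (For the conditional-independence tests on later slides, X would first be augmented to $\ddot X = (X, Z)$.)

```python
import numpy as np

def centered_gram(X, sigma=1.0):
    """G = Q_N K Q_N for a Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2 * sigma ** 2))
    N = len(X)
    Q = np.eye(N) - np.ones((N, N)) / N
    return Q @ K @ Q

def cond_cov_hs(X, Y, Z, eps=1e-3):
    """||Sigma_hat_{YX|Z}||_HS^2 = Tr[G_X S_Z G_Y S_Z] / N^2,
    with S_Z = N eps (G_Z + N eps I)^{-1}."""
    N = X.shape[0]
    GX, GY, GZ = centered_gram(X), centered_gram(Y), centered_gram(Z)
    SZ = N * eps * np.linalg.inv(GZ + N * eps * np.eye(N))
    return np.trace(GX @ SZ @ GY @ SZ) / N ** 2
```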

Page 35

Statistical Consistency

Consistency of the conditional covariance operator estimator.

Theorem (FBJ08, Sun et al. 07)
Assume $\varepsilon_N \to 0$ and $N \varepsilon_N \to \infty$. Then

$\| \hat\Sigma_{YX|Z}^{(N)} - \Sigma_{YX|Z} \|_{HS} \to 0 \quad (N \to \infty)$.

In particular,

$\| \hat\Sigma_{YX|Z}^{(N)} \|_{HS} \to \| \Sigma_{YX|Z} \|_{HS} \quad (N \to \infty)$.

Page 36

Normalized Covariance Operator

Normalized cross-covariance operator (NOCCO):
$W_{YX} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX}\, \Sigma_{XX}^{-1/2}$   (recall $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$)

Normalized conditional cross-covariance operator (NOC3O):
$W_{YX|Z} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX|Z}\, \Sigma_{XX}^{-1/2} = \Sigma_{YY}^{-1/2} \big( \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX} \big) \Sigma_{XX}^{-1/2}$,
i.e. $\tilde W_{YX|Z} = W_{YX} - W_{YZ} W_{ZX}$.

Characterization of (conditional) independence, with characteristic kernels:
$W_{YX} = O \iff$ X ⫫ Y
$\tilde W_{YX|Z} = O \iff$ X ⫫ Y | Z  (with the augmented variables; next slide)

Page 37

Measures for Conditional Independence

Assume $W_{YX}$ etc. are Hilbert-Schmidt.

– Dependence measure:
$\mathrm{NOCCO} = \| W_{YX} \|_{HS}^2$

– Conditional dependence measure (X and Y augmented: $\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$):
$\mathrm{NOC3O} = \| \tilde W_{\ddot Y \ddot X | Z} \|_{HS}^2$

– Independence / conditional independence:
$\mathrm{NOCCO} = 0 \iff$ X ⫫ Y
$\mathrm{NOC3O} = 0 \iff$ X ⫫ Y | Z

Page 38

Kernel-free Integral Expression

Let $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ be the probability on $\Omega_X \times \Omega_Y$ defined by

$E_Z[P_{X|Z} \otimes P_{Y|Z}](A \times B) = \int P_{X|Z}(A \mid Z = z)\, P_{Y|Z}(B \mid Z = z)\, dP_Z(z)$.

Theorem
Assume $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ have densities $p_{XY}(x, y)$ and $p_{X \perp Y|Z}(x, y)$, resp., the kernels are characteristic, and $W_{YX}$, $W_{YZ}$, $W_{ZX}$ are Hilbert-Schmidt. Then

$\| W_{YX|Z} \|_{HS}^2 = \int\!\!\int \left( \frac{ p_{XY}(x,y) - p_{X \perp Y|Z}(x,y) }{ p_X(x)\, p_Y(y) } \right)^2 p_X(x)\, p_Y(y)\, dx\, dy$.

In the unconditional case,

$\| W_{YX} \|_{HS}^2 = \int\!\!\int \left( \frac{ p_{XY}(x,y) }{ p_X(x)\, p_Y(y) } - 1 \right)^2 p_X(x)\, p_Y(y)\, dx\, dy$.

– A kernel-free expression, though the definitions are given by kernels!

Page 39

– A kernel-free value is desired as a "measure" of dependence; c.f. if unnormalized operators are used, the measures depend on the choice of kernel.

– In the unconditional case, $\mathrm{NOCCO} = \| W_{YX} \|_{HS}^2$ is equal to the mean square contingency, which is a very popular measure of dependence for discrete variables.

– In the conditional case, if the augmented variables are used,

$\| \tilde W_{\ddot Y \ddot X | Z} \|_{HS}^2 = \int\!\!\int\!\!\int \left( \frac{ p_{XYZ}(x,y,z) - p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)\, p_Z(z) }{ p_{XZ}(x,z)\, p_{YZ}(y,z) } \right)^2 p_{XZ}(x,z)\, p_{YZ}(y,z)\, dx\, dy\, dz$

(the conditional mean square contingency).

Page 40

Empirical Estimators

– Empirical estimation is straightforward with the empirical cross-covariance operator $\hat\Sigma_{YX}^{(N)}$: replace the covariances in $W_{YX} = \Sigma_{YY}^{-1/2} \Sigma_{YX} \Sigma_{XX}^{-1/2}$ by the empirical ones given by the data $\Phi_X(X_1), \ldots, \Phi_X(X_N)$ and $\Phi_Y(Y_1), \ldots, \Phi_Y(Y_N)$.

– Inversion regularization: $\big( \hat\Sigma_{XX}^{(N)} \big)^{-1} \to \big( \hat\Sigma_{XX}^{(N)} + \varepsilon_N I \big)^{-1}$.

– Dependence measure:
$\mathrm{NOCCO}_{emp} = \mathrm{Tr}[ R_X R_Y ]$, where $R_X \equiv G_X \big( G_X + N \varepsilon_N I_N \big)^{-1}$,
$G_X = \big( I_N - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T \big) K_X \big( I_N - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T \big)$, $K_X = \big( k_X(X_i, X_j) \big)_{i,j=1}^N$.

– Conditional dependence measure (with the augmented variables $\ddot X$, $\ddot Y$):
$\mathrm{NOC3O}_{emp} = \mathrm{Tr}\big[ R_{\ddot X} R_{\ddot Y} - 2 R_{\ddot X} R_{\ddot Y} R_Z + R_{\ddot X} R_Z R_{\ddot Y} R_Z \big]$

– $\mathrm{NOCCO}_{emp}$ and $\mathrm{NOC3O}_{emp}$ give kernel estimates of the mean square contingency and the conditional mean square contingency, resp.
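A sketch of $\mathrm{NOCCO}_{emp}$ and $\mathrm{NOC3O}_{emp}$; the Gaussian kernel, the value of $\varepsilon_N$, and augmentation by plain concatenation of 2-d data arrays are illustrative assumptions.

```python
import numpy as np

def gram_gauss(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def R_mat(X, eps):
    """R = G (G + N eps I)^{-1} with the centered Gram matrix G."""
    N = len(X)
    Q = np.eye(N) - np.ones((N, N)) / N
    G = Q @ gram_gauss(X) @ Q
    return G @ np.linalg.inv(G + N * eps * np.eye(N))

def nocco_emp(X, Y, eps=1e-3):
    return np.trace(R_mat(X, eps) @ R_mat(Y, eps))

def noc3o_emp(X, Y, Z, eps=1e-3):
    RX = R_mat(np.hstack([X, Z]), eps)   # augmented variable (X, Z)
    RY = R_mat(np.hstack([Y, Z]), eps)   # augmented variable (Y, Z)
    RZ = R_mat(Z, eps)
    return np.trace(RX @ RY - 2 * RX @ RY @ RZ + RX @ RZ @ RY @ RZ)
```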

Page 41

Consistency

Theorem (Fukumizu et al. 2008)
Assume that $W_{YX|Z}$ is Hilbert-Schmidt, and the regularization coefficient satisfies $\varepsilon_N \to 0$ and $N^{1/3} \varepsilon_N \to \infty$. Then

$\| \hat W_{YX|Z}^{(N)} - W_{YX|Z} \|_{HS} \to 0 \quad (N \to \infty)$.

In particular,

$\| \hat W_{YX|Z}^{(N)} \|_{HS} \to \| W_{YX|Z} \|_{HS} \quad (N \to \infty)$,

i.e. $\mathrm{NOC3O}_{emp}$ ($\mathrm{NOCCO}_{emp}$) converges to the population value NOC3O (NOCCO, resp.).

Page 42

Choice of Kernel

How to choose a kernel?
– No definitive solution has been proposed yet.
– For statistical tests, comparison of power or efficiency would be desirable.
– Other suggestions:
  • Make a relevant supervised problem and use cross-validation.
  • Some heuristics (below).

– Heuristic for Gaussian kernels (Gretton et al. 2007):
$\sigma = \mathrm{median}\{ \| X_i - X_j \| : i \ne j \}$

– Speed of asymptotic convergence (Fukumizu et al. 2008): under independence,
$\lim_{N \to \infty} \mathrm{Var}\big[ N \times \mathrm{HSIC}_{emp}^{(N)} \big] = 2\, \| \Sigma_{XX} \|_{HS}^2 \, \| \Sigma_{YY} \|_{HS}^2$
Compare the bootstrapped variance with the theoretical one, and choose the parameter giving the minimum discrepancy.
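The median heuristic in code (one common convention; variants divide by √2 or use squared distances).

```python
import numpy as np

def median_heuristic(X):
    """sigma = median{ ||X_i - X_j|| : i != j } for the Gaussian kernel width."""
    d = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    return np.median(d[np.triu_indices_from(d, k=1)])
```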

Page 43

Conditional Independence Test

Permutation test: approximate the null distribution under the conditional independence assumption.

Test statistic: $T_N = \| \hat\Sigma_{YX|Z}^{(N)} \|_{HS}^2$  or  $T_N = \| \hat W_{YX|Z}^{(N)} \|_{HS}^2$.

– If Z takes values in a finite set {1, …, L}, set $A_\ell = \{ i \mid Z_i = \ell \}$ ($\ell = 1, \ldots, L$); otherwise, partition the values of Z into L subsets $C_1, \ldots, C_L$ and set $A_\ell = \{ i \mid Z_i \in C_\ell \}$.

– Repeat the following process B times (b = 1, …, B):
  1. Generate pseudo conditionally independent data $D^{(b)}$ by permuting the X data within each $A_\ell$.
  2. Compute $T_N^{(b)}$ for the data $D^{(b)}$.

– Set the threshold at the (1−α)-percentile of the empirical distribution of $T_N^{(b)}$.

[Diagram: $(X_i, Y_i)$ pairs grouped by the clusters $C_1, C_2, \ldots, C_L$ of Z values; X is permuted within each cluster.]
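A sketch of the within-cluster permutation scheme above; the quantile binning of a one-dimensional Z and the abstract statistic callable are illustrative assumptions.

```python
import numpy as np

def cond_perm_test(X, Y, Z, stat, L=5, B=200, alpha=0.05, seed=0):
    """Permutation test of X indep. Y given Z: permute X within bins of Z.
    stat(X, Y, Z) -> float, e.g. one of the T_N statistics above."""
    rng = np.random.default_rng(seed)
    t0 = stat(X, Y, Z)
    # partition the Z values into L subsets C_1..C_L (1-d quantile bins here)
    edges = np.quantile(Z.ravel(), np.linspace(0, 1, L + 1)[1:-1])
    bins = np.searchsorted(edges, Z.ravel())
    null = []
    for _ in range(B):
        Xp = X.copy()
        for l in range(L):
            idx = np.where(bins == l)[0]
            Xp[idx] = X[idx[rng.permutation(len(idx))]]   # permute within A_l
        null.append(stat(Xp, Y, Z))
    return t0 > np.quantile(null, 1 - alpha)   # True = reject cond. indep.
```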

Page 44

Causality of Time Series

Granger causality (Granger 1969)
X(t), Y(t): two time series, t = 1, 2, 3, …
– Problem: is {X(1), …, X(t)} a cause of Y(t+1)?
– Granger causality. Model: AR,

$Y(t) = c + \sum_{i=1}^p a_i Y(t-i) + \sum_{j=1}^p b_j X(t-j) + U_t$

(no inverse causal relation assumed). Test

H0: $b_1 = b_2 = \cdots = b_p = 0$.

X is called a Granger cause of Y if H0 is rejected.

Page 45

– F-test
  • Linear estimation:
    Full model: $Y(t) = c + \sum_{i=1}^p a_i Y(t-i) + \sum_{j=1}^p b_j X(t-j) + U_t$ → estimates $\hat c, \hat a_i, \hat b_j$
    H0 model: $Y(t) = c + \sum_{i=1}^p a_i Y(t-i) + W_t$ → estimates $\hat{\hat c}, \hat{\hat a}_i$
  • Test statistic:
    $ERR_1 = \sum_{t=p+1}^N \big( Y(t) - \hat Y(t) \big)^2$,  $ERR_0 = \sum_{t=p+1}^N \big( Y(t) - \hat{\hat Y}(t) \big)^2$

    $T \equiv \frac{ (ERR_0 - ERR_1)/p }{ ERR_1 / (N - (2p+1)) } \Rightarrow F_{p,\, N-(2p+1)}$  under H0  $(N \to \infty)$

    p.d.f. of $F_{d_1, d_2}$: $f(x) = \frac{1}{B(d_1/2,\, d_2/2)} \left( \frac{d_1 x}{d_1 x + d_2} \right)^{d_1/2} \left( 1 - \frac{d_1 x}{d_1 x + d_2} \right)^{d_2/2} x^{-1}$

– Software
  • Matlab: Econometrics toolbox (www.spatial-econometrics.com)
  • R: lmtest package
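A self-contained version of the linear Granger F-test (plain OLS via lstsq; in practice the Matlab and R packages above would be used). The residual degrees of freedom follow from the regression actually fitted, which may differ slightly from the slide's $N - (2p+1)$ depending on how many initial samples are dropped.

```python
import numpy as np
from scipy.stats import f as f_dist

def granger_ftest(x, y, p):
    """F-test of H0: b_1 = ... = b_p = 0 in
    Y(t) = c + sum_i a_i Y(t-i) + sum_j b_j X(t-j) + U_t."""
    y_t = y[p:]                                         # targets, t = p+1..N
    lags_y = np.column_stack([y[p - i:-i] for i in range(1, p + 1)])
    lags_x = np.column_stack([x[p - i:-i] for i in range(1, p + 1)])
    ones = np.ones((len(y_t), 1))
    D1 = np.hstack([ones, lags_y, lags_x])              # full model
    D0 = np.hstack([ones, lags_y])                      # H0 model
    err1 = y_t - D1 @ np.linalg.lstsq(D1, y_t, rcond=None)[0]
    err0 = y_t - D0 @ np.linalg.lstsq(D0, y_t, rcond=None)[0]
    ERR1, ERR0 = err1 @ err1, err0 @ err0
    df2 = len(y_t) - (2 * p + 1)                        # residual dof
    T = ((ERR0 - ERR1) / p) / (ERR1 / df2)
    return T, f_dist.sf(T, p, df2)                      # statistic, p-value

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + 0.1 * rng.normal()
print(granger_ftest(x, y, p=2))    # tiny p-value: X Granger-causes Y
```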

Page 46

– Granger causality is widely used and influential in econometrics; Clive Granger received the Nobel Prize in 2003.

– Limitations
  • Linearity: a linear AR model is used, so no nonlinear dependence is considered.
  • Stationarity: stationary time series are assumed.
  • Hidden causes: hidden common causes (other time series) cannot be considered.

  "Granger causality" is not necessarily "causality" in the general sense.

– There are many extensions.

– With kernel dependence measures, it is easily extended to incorporate nonlinear dependence.
  Remark: there are few good conditional independence tests for continuous variables.

Page 47

Kernel Method for Causality of Time Series

Causality by conditional independence
– Extended notion of Granger causality: X is NOT a cause of Y if

$p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p}, X_{t-1}, \ldots, X_{t-p}) = p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})$,

i.e. $Y_t$ ⫫ $(X_{t-1}, \ldots, X_{t-p})$ | $(Y_{t-1}, \ldots, Y_{t-p})$.

– Kernel measures for causality: with

$\mathbf{X}^p = \{ (X_{t-1}, \ldots, X_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N \}$,
$\mathbf{Y}^p = \{ (Y_{t-1}, \ldots, Y_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N \}$,

$\mathrm{HSCIC} = \| \hat\Sigma_{Y_t \ddot{\mathbf{X}}^p \mid \mathbf{Y}^p} \|_{HS}^2$,  $\mathrm{HSNCIC} = \| \hat W_{Y_t \ddot{\mathbf{X}}^p \mid \mathbf{Y}^p} \|_{HS}^2$.

Page 48

Example: Coupled Hénon Map

– X = $(x_1, x_2)$, Y = $(y_1, y_2)$:

$x_1(t+1) = 1.4 - x_1(t)^2 + 0.3\, x_2(t)$
$x_2(t+1) = x_1(t)$
$y_1(t+1) = 1.4 - \{ \gamma\, x_1(t)\, y_1(t) + (1 - \gamma)\, y_1(t)^2 \} + 0.1\, y_2(t)$
$y_2(t+1) = y_1(t)$

[Plots: $x_1$ vs $x_2$, and $x_1$ vs $y_1$ for γ = 0, 0.25, 0.8.]
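The coupled map is straightforward to simulate; a sketch (initial conditions, assumed to lie in the attractor's basin, and the burn-in length are arbitrary choices).

```python
import numpy as np

def coupled_henon(T, gamma, burn=500):
    """Simulate the coupled Henon maps above; X drives Y when gamma > 0."""
    x1, x2, y1, y2 = 0.1, 0.2, 0.15, 0.25   # illustrative initial conditions
    out = np.empty((T, 4))
    for t in range(T + burn):
        x1n = 1.4 - x1 ** 2 + 0.3 * x2                              # x_1(t+1)
        y1n = 1.4 - (gamma * x1 * y1 + (1 - gamma) * y1 ** 2) + 0.1 * y2
        x1, x2 = x1n, x1
        y1, y2 = y1n, y1
        if t >= burn:
            out[t - burn] = (x1, x2, y1, y2)
    return out   # columns: x1(t), x2(t), y1(t), y2(t)

traj = coupled_henon(100, gamma=0.25)   # e.g. for the gamma = 0.25 panel
```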

Page 49

Causality in the coupled Hénon map

– X is a cause of Y if γ > 0.
– Y is not a cause of X for all γ.

– Permutation tests for non-causality with NOC3O, N = 100: H0: $Y_t$ is not a cause of $X_{t+1}$ (i.e. $X_{t+1}$ ⫫ $Y_t$ | $X_t$), and H0: $X_t$ is not a cause of $Y_{t+1}$ (i.e. $Y_{t+1}$ ⫫ $X_t$ | $Y_t$), for γ = 0.0, 0.1, …, 0.6.

[Table: number of times accepting H0 among 100 datasets (α = 5%), comparing the NOC3O test with the Granger test; among the recoverable entries, at γ = 0.0 the counts are NOC3O 94 and 97 and Granger 92 and 96 for the two hypotheses.]

Page 50

Summary

Dependence analysis with RKHS
– Covariance and conditional covariance operators on RKHS can capture the (in)dependence and conditional (in)dependence of random variables.
– Simple estimators can be obtained for the Hilbert-Schmidt norms of the operators.
– Statistical tests of independence and conditional independence are possible with kernel measures.
  • Applications: dimension reduction for regression (FBJ04, FBJ08), causal inference (Sun et al. 2007).
– Further studies are required for kernel choice.

Page 51

References

Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, to appear, 2008.

Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 21, 489-496, MIT Press, 2008.

Fukumizu, K., F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73-99, 2004.

Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems 20, 585-592, MIT Press, 2008.

Gretton, A., K.M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513-520, 2007.

Gretton, A., O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. Proc. Algorithmic Learning Theory (ALT 2005), 63-78, 2005.

Page 52

Shen, H., S. Jegelka, and A. Gretton. Fast kernel ICA using an approximate Newton method. AISTATS 2007.

Serfling, R.J. Approximation Theorems of Mathematical Statistics. Wiley-Interscience, 1980.

Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML 2007), 855-862, 2007.