Lecture 2: Mappings of Probabilities to RKHS and Applications

Lecture 2: Mappings of Probabilities to RKHS and Applications — MLSS Tübingen, 2015 — Arthur Gretton, Gatsby Unit, CSML, UCL

Transcript of Lecture 2: Mappings of Probabilities to RKHS and Applications

Page 1:

Lecture 2: Mappings of Probabilities to RKHS and Applications

MLSS Tübingen, 2015

Arthur Gretton

Gatsby Unit, CSML, UCL

Page 2:

Outline

• Kernel metric on the space of probability measures

– Function revealing differences in distributions

– Distance between means in space of features (RKHS)

– Independence measure: features of joint minus product of marginals

• Characteristic kernels: feature space mappings of probabilities unique

• Two-sample, independence tests for (almost!) any data type

– distributions on strings, images, graphs, groups (rotation matrices), semigroups, ...

• Advanced topics

– testing on big data, kernel choice

– Energy distance/distance covariance: special case of kernel statistic

Page 3:

Feature mean difference

• Simple example: 2 Gaussians with different means

• Answer: t-test

[Figure: two Gaussian densities P_X and Q_X with different means]

Page 4:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕ(x) = x2

[Figure: two Gaussian densities P_X and Q_X with the same mean and different variances]

Page 5:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕ(x) = x2

[Figure: left, two Gaussian densities P_X and Q_X with different variances; right, densities of the feature X², which now differ in mean]

Page 6:

Feature mean difference

• Gaussian and Laplace distributions

• Same mean and same variance

• Difference in means using higher order features...RKHS

[Figure: Gaussian and Laplace densities P_X and Q_X with the same mean and variance]

Page 7:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X ,

define feature map ϕx ∈ F ,

ϕx = [. . . ϕi(x) . . .] ∈ ℓ2

• For positive definite k(x, x′),

k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F ,

f(x) = 〈f, ϕx〉F

Page 8:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X ,

define feature map ϕx ∈ F ,

ϕx = [. . . ϕi(x) . . .] ∈ ℓ2

• For positive definite k(x, x′),

k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F ,

f(x) = 〈f, ϕx〉F

The mean trick

• Given P a Borel probability

measure on X , define feature

map µP ∈ F

µP = [. . .EP [ϕi(x)] . . .]

• For positive definite k(x, x′),

EP,Qk(x, y) = 〈µP, µQ〉F

for x ∼ P and y ∼ Q.

• The mean trick: (we call µP a

mean/distribution embedding)

EP(f(x)) =: 〈µP, f〉F

Page 9:

What does µP look like?

We plot the function µP

• Mean embedding µP ∈ F

〈µP(·), f(·)〉F = EPf(x).

• What does prob. feature map

look like?

µP(x) = 〈µP(·), ϕ(x)〉F = 〈µP(·), k(·, x)〉F = EP k(x′, x) for x′ ∼ P.

Expectation of kernel!

• Empirical estimate:

µP(x) = (1/m) ∑_{i=1}^{m} k(x_i, x),   x_i ∼ P
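As an illustration of the empirical estimate above, a minimal numpy sketch follows, assuming a Gaussian kernel k(x, x′) = exp(−(x − x′)²/(2σ²)); the kernel family and bandwidth are illustrative choices, not part of the slides.

```python
import numpy as np

def mean_embedding(X, sigma=1.0):
    """Return x -> (1/m) * sum_i k(x_i, x) for a Gaussian kernel (assumed choice)."""
    X = np.asarray(X, dtype=float)
    def mu_P(x):
        d = X[:, None] - np.atleast_1d(x)[None, :]   # differences, shape (m, len(x))
        return np.exp(-d**2 / (2 * sigma**2)).mean(axis=0)
    return mu_P

# Example: embed samples from a two-component mixture and evaluate on a grid
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-1, 0.3, 500), rng.normal(1, 0.3, 500)])
mu = mean_embedding(samples, sigma=0.5)
print(mu(np.linspace(-3, 3, 7)))   # a smooth, kernel-weighted view of the sample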

Page 10:

What does µP look like?

We plot the function µP

• Mean embedding µP ∈ F

〈µP(·), f(·)〉F = EPf(x).

• What does prob. feature map

look like?

µP(x) = 〈µP(·), ϕ(x)〉F = 〈µP(·), k(·, x)〉F = EP k(x′, x) for x′ ∼ P.

Expectation of kernel!

• Empirical estimate:

µP(x) = (1/m) ∑_{i=1}^{m} k(x_i, x),   x_i ∼ P

[Figure: histogram of samples X from P, with the empirical mean embedding µP(x) overlaid]

Page 11:

Does the feature space mean exist?

Does there exist an element µP ∈ F such that

EPf(x) = EP〈f(·), ϕ(x)〉F = 〈f(·),EPϕ(x)〉F = 〈f(·), µP(·)〉F ∀f ∈ F

Page 12:

Does the feature space mean exist?

Does there exist an element µP ∈ F such that

EPf(x) = EP〈f(·), ϕ(x)〉F = 〈f(·),EPϕ(x)〉F = 〈f(·), µP(·)〉F ∀f ∈ F

Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner

integrable [Steinwart and Christmann, 2008]) under the condition

EP ‖ϕ(x)‖F = EP √(k(x, x)) < ∞

Page 13:

Function Showing Difference in Distributions

• Are P and Q different?

Page 14:

Function Showing Difference in Distributions

• Are P and Q different?

[Figure: scatter plot of samples from P and Q on [0, 1] × [−1, 1]]

Page 16:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

[Figure: samples from P and Q with a smooth witness function f(x) overlaid]

Page 18:

Function Showing Difference in Distributions

• What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

[Figure: samples from P and Q with a bounded continuous (non-smooth) function f(x) overlaid]

Page 20:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Gauss P vs Laplace Q

[Figure: witness function f for the Gauss and Laplace densities, overlaid on the two densities]

Page 21:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

Page 22:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

• MMD(P,Q;F ) = 0 iff P = Q when F =the unit ball in a characteristic

RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

Page 23:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

• MMD(P,Q;F ) = 0 iff P = Q when F =the unit ball in a characteristic

RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

How do smooth functions relate to feature maps?

Page 24:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )²

[Figure: witness function f for the Gauss and Laplace densities, overlaid on the two densities]

Page 25:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )²,   using EP(f(x)) =: ⟨µP, f⟩F

Page 26:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )² = ( sup_{f∈F} ⟨f, µP − µQ⟩F )²,   using EP(f(x)) =: ⟨µP, f⟩F

Page 27:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )² = ( sup_{f∈F} ⟨f, µP − µQ⟩F )² = ‖µP − µQ‖²_F,   using ‖θ‖F = sup_{f∈F} ⟨f, θ⟩F

Function view and feature view equivalent
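Because the witness function achieving the supremum is proportional to µP − µQ, it can be drawn directly from samples. Below is a minimal numpy sketch, assuming a Gaussian kernel with an arbitrary bandwidth; compare with the "Witness f for Gauss and Laplace densities" figure.

```python
import numpy as np

def witness(X, Y, grid, sigma=1.0):
    """Unnormalised witness f(t) = mu_P(t) - mu_Q(t), Gaussian kernel (assumed)."""
    def emb(Z, t):
        d = np.asarray(Z, float)[:, None] - t[None, :]
        return np.exp(-d**2 / (2 * sigma**2)).mean(axis=0)
    return emb(X, grid) - emb(Y, grid)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 2000)                   # P: Gaussian
y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), 2000)   # Q: Laplace, same mean and variance
grid = np.linspace(-6, 6, 13)
print(np.round(witness(x, y, grid, sigma=0.5), 3))
```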

Page 28:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

Page 29:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩


Page 31:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

Page 32:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), ϕ(x)⟩ + ...

Page 33:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

Page 34:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 35:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Then E k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)
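A direct numpy transcription of this unbiased estimate, with a Gaussian kernel assumed for concreteness (any characteristic kernel would do):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 for samples X ~ P and Y ~ Q, each of shape (m, d)."""
    m = X.shape[0]
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # drops the i = j terms
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))   # different mean: estimate positive in expectation
Z = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as X: estimate near zero
print(mmd2_unbiased(X, Y), mmd2_unbiased(X, Z))
```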

Page 36:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

Page 37:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: the kernel on pairs is the product of the kernels on each component, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′)]

Page 38:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

HSIC using expectations of kernels:

Define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_XY E_X′Y′ [k(x, x′) l(y, y′)] + E_X E_X′ [k(x, x′)] E_Y E_Y′ [l(y, y′)] − 2 E_X′Y′ [ E_X k(x, x′) E_Y l(y, y′) ].

Page 39:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

Page 40:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Page 41:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Empirical HSIC(P_XY, P_X P_Y):

(1/n²) (HKH ⊙ HLH)_{++}  =  (1/n²) trace(KHLH)
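A minimal numpy sketch of this empirical statistic, assuming Gaussian kernels on each space (the kernel choice and bandwidths are illustrative, not prescribed by the slides):

```python
import numpy as np

def gaussian_gram(Z, sigma):
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC = (1/n^2) trace(K H L H) for paired samples X, Y of shape (n, d)."""
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y = X + 0.3 * rng.normal(size=(300, 1))            # dependent pair: HSIC clearly above zero
print(hsic_biased(X, Y), hsic_biased(X, rng.normal(size=(300, 1))))
```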

Page 42:

Characteristic kernels (Via Fourier, on the torus T)

Page 43:

Characteristic Kernels (via Fourier)

Reminder:

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]

In the next slides:

1. Characteristic property on [−π, π] with periodic boundary

2. Characteristic property on Rd

Page 44:

Characteristic Kernels (via Fourier)

Reminder: Fourier series

• Function on [−π, π] with periodic boundary:

f(x) = ∑_{ℓ=−∞}^{∞} f_ℓ exp(iℓx) = ∑_{ℓ=−∞}^{∞} f_ℓ (cos(ℓx) + i sin(ℓx)).

[Figure: top-hat function f(x) and its Fourier series coefficients f_ℓ]

Page 45:

Characteristic Kernels (via Fourier)

Reminder: Fourier series of kernel

k(x, y) = k(x − y) = k(z),   k(z) = ∑_{ℓ=−∞}^{∞} k_ℓ exp(iℓz).

E.g., k(x) = (1/2π) ϑ(x/(2π), iσ²),   k_ℓ = (1/2π) exp(−σ²ℓ²/2).

ϑ is the Jacobi theta function, close to Gaussian when σ² sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k_ℓ]

Page 46:

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

• Fourier series for P is characteristic function φP

• Fourier series for mean embedding is product of fourier series!

(convolution theorem)

µP(x) = EP k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   µ_{P,ℓ} = k_ℓ × φ_{P,ℓ}

Page 47:

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

• Fourier series for P is characteristic function φP

• Fourier series for mean embedding is product of fourier series!

(convolution theorem)

µP(x) = EP k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   µ_{P,ℓ} = k_ℓ × φ_{P,ℓ}

• MMD can be written in terms of Fourier series:

MMD(P, Q; F) := ‖ ∑_{ℓ=−∞}^{∞} [ (φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ ] exp(iℓx) ‖_F

Page 48:

A simpler Fourier expression for MMD

• From previous slide,

MMD(P, Q; F) := ‖ ∑_{ℓ=−∞}^{∞} [ (φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ ] exp(iℓx) ‖_F

• The squared norm of a function f in F is:

‖f‖²_F = ⟨f, f⟩_F = ∑_{ℓ=−∞}^{∞} |f_ℓ|² / k_ℓ.

• Simple, interpretable expression for squared MMD:

MMD²(P, Q; F) = ∑_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ|² / k_ℓ = ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ
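A small numerical sketch of this expression: estimate φ_{P,ℓ} and φ_{Q,ℓ} from samples on [−π, π] and weight the squared differences by kernel coefficients k_ℓ. The Gaussian-spectrum coefficients and the perturbation frequency below are assumptions chosen to mimic the examples that follow: a spectrum with zeros at the perturbed frequency cannot see the difference.

```python
import numpy as np

def char_coeffs(samples, ells):
    """Empirical Fourier coefficients phi_ell = E[exp(-i * ell * x)] for x on [-pi, pi]."""
    return np.exp(-1j * np.outer(ells, samples)).mean(axis=1)

def mmd2_fourier(x, y, k_ell, ells):
    """Truncated MMD^2 = sum_ell |phi_P,ell - phi_Q,ell|^2 * k_ell."""
    diff = char_coeffs(x, ells) - char_coeffs(y, ells)
    return np.sum(np.abs(diff) ** 2 * k_ell)

ells = np.arange(-10, 11)
k_gauss = np.exp(-0.25 * ells**2 / 2) / (2 * np.pi)   # Gaussian-spectrum coefficients (sigma = 0.5)
k_holes = k_gauss * (np.abs(ells) != 3)               # toy spectrum with zeros at |ell| = 3

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, 5000)
y = np.mod(x + 0.2 * np.sin(3 * x) + np.pi, 2 * np.pi) - np.pi  # differs mainly at frequency 3
print(mmd2_fourier(x, y, k_gauss, ells))   # clearly positive
print(mmd2_fourier(x, y, k_holes, ells))   # far smaller: the difference is largely invisible here
```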

Page 49:

Example

• Example: P differs from Q at one frequency

[Figure: densities P(x) and Q(x), where P differs from Q at one frequency]

Page 50:

Characteristic Kernels (2)

• Example: P differs from Q at (roughly) one frequency

[Figure: P(x) and Q(x), and their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}]

Page 51:

Characteristic Kernels (2)

• Example: P differs from Q at (roughly) one frequency

[Figure: P(x) and Q(x), their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}, and the characteristic function difference φ_{P,ℓ} − φ_{Q,ℓ}]

Page 52:

Example

Is the Gaussian-spectrum kernel characteristic?

[Figure: the Gaussian-spectrum kernel k(x) and its Fourier series coefficients]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 53:

Example

Is the Gaussian-spectrum kernel characteristic? YES

[Figure: the Gaussian-spectrum kernel and its Fourier series coefficients, which are everywhere non-zero]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 54:

Example

Is the triangle kernel characteristic?

[Figure: the triangle kernel and its Fourier series coefficients]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 55:

Example

Is the triangle kernel characteristic? NO

[Figure: the triangle kernel and its Fourier series coefficients, some of which are zero]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 56:

Characteristic kernels (Via Fourier, on Rd)

Page 57:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

Page 58:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

• Characteristic function of P via Fourier transform

φP(ω) = ∫_{R^d} e^{i x⊤ω} dP(x)

Page 59:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

• Characteristic function of P via Fourier transform

φP(ω) = ∫_{R^d} e^{i x⊤ω} dP(x)

• Translation invariant kernels: k(x, y) = k(x − y) = k(z)

• Bochner's theorem:

k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω)

– Λ finite non-negative Borel measure


Page 61:

Characteristic Kernels (via Fourier)

• Fourier representation of MMD:

MMD²(P, Q; F) := ∫ |φP(ω) − φQ(ω)|² dΛ(ω)

– φP characteristic function of P

Proof: Using Bochner's theorem (a) and Fubini's theorem (b),

MMD²(P, Q) = ∫∫_{R^d} k(x − y) d(P − Q)(x) d(P − Q)(y)

(a) = ∫∫∫_{R^d} e^{−i(x−y)⊤ω} dΛ(ω) d(P − Q)(x) d(P − Q)(y)

(b) = ∫ [ ∫_{R^d} e^{−i x⊤ω} d(P − Q)(x) ] [ ∫_{R^d} e^{i y⊤ω} d(P − Q)(y) ] dΛ(ω)

= ∫ |φP(ω) − φQ(ω)|² dΛ(ω)

Page 62:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X) on the real line, differing at (roughly) one frequency]

Page 63:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: P(X) and Q(X), and the magnitudes of their characteristic functions |φP| and |φQ|]

Page 64:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: P(X), Q(X), |φP|, |φQ|, and the characteristic function difference |φP − φQ| over frequency ω]

Page 65:

Example

• Example: P differs from Q at (roughly) one frequency

Gaussian kernel

Difference |φP − φQ|

[Figure: Gaussian kernel spectrum overlaid on the difference |φP − φQ| as a function of frequency ω]

Page 66:

Example

• Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the Gaussian kernel spectrum is non-zero at every frequency, so the difference is detected]

Page 67:

Example

• Example: P differs from Q at (roughly) one frequency

Sinc kernel

Difference |φP − φQ|

[Figure: sinc kernel spectrum overlaid on the difference |φP − φQ| as a function of frequency ω]

Page 68:

Example

• Example: P differs from Q at (roughly) one frequency

NOT characteristic

[Figure: the sinc kernel spectrum vanishes where P and Q differ, so the difference is missed]

Page 69:

Example

• Example: P differs from Q at (roughly) one frequency

Triangle (B-spline) kernel

Difference |φP − φQ|

[Figure: triangle (B-spline) kernel spectrum overlaid on the difference |φP − φQ|]

Page 70:

Example

• Example: P differs from Q at (roughly) one frequency

???

[Figure: the B-spline kernel spectrum has isolated zeros near the frequency where P and Q differ]

Page 71:

Example

• Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the B-spline kernel spectrum vanishes only at isolated points, so the kernel remains characteristic]

Page 72:

Summary: Characteristic Kernels

Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Main theorem: A translation invariant k is characteristic for prob. measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum is zero on at most a countable set) [COLT08, JMLR10]

Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval). 1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535]

Page 73:

k characteristic iff supp(Λ) = Rd

Proof: supp Λ = Rd =⇒ k characteristic:

Recall Fourier definition of MMD:

MMD²(P, Q) = ∫_{R^d} |φP(ω) − φQ(ω)|² dΛ(ω).

Characteristic functions φP(ω) and φQ(ω) uniformly continuous, hence their

difference cannot be non-zero only on a countable set.

Map φP uniformly continuous: ∀ǫ > 0, ∃δ > 0 such that ∀(ω1, ω2) ∈ Ω for which d(ω1, ω2) < δ, we have

d(φP(ω1), φP(ω2)) < ǫ. Uniform: δ depends only on ǫ, not on ω1, ω2.

Page 74:

k characteristic iff supp(Λ) = Rd

Proof: k characteristic =⇒ supp Λ = Rd :

Proof by contrapositive.

Given supp Λ ⊊ R^d, hence ∃ an open interval U such that Λ(ω) is zero on U.

Construct densities p(x), q(x) such that φP, φQ differ only inside U

Page 75:

Further extensions

• Similar reasoning wherever extensions of Bochner’s theorem exist:

[Fukumizu et al., 2009]

– Locally compact Abelian groups (periodic domains, as we saw)

– Compact, non-Abelian groups (orthogonal matrices)

– The semigroup R₊^n (histograms)

• Related kernel statistics: Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

Page 76:

Statistical hypothesis testing

Page 77:

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

Page 78:

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?


Page 80:

Statistical test using MMD (1)

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P ≠ Q)

Page 81:

Statistical test using MMD (1)

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P ≠ Q)

• Observe samples x := {x1, . . . , xn} from P and y := {y1, . . . , yn} from Q

• If empirical MMD(x,y;F ) is

– “far from zero”: reject H0

– “close to zero”: accept H0

Page 82:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

Page 83:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

• An unbiased empirical estimate (quadratic cost):

MMD² = 1/(n(n−1)) ∑_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ],   the summand denoted h((x_i, y_i), (x_j, y_j))

Page 84:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

• An unbiased empirical estimate (quadratic cost):

MMD2= 1

n(n−1)

i 6=j

k(xi, xj)− k(xi, yj)− k(yi, xj) + k(yi, yj)︸ ︷︷ ︸h((xi,yi),(xj ,yj))

• When P 6= Q, asymptotically normal

(√n)(MMD

2 −MMD2)∼ N (0, σ2u)

[Hoeffding, 1948, Serfling, 1980]

• Expression for the variance: zi := (xi, yi)

σ2u = 4(Ez

[(Ez′h(z, z

′))2]−[Ez,z′(h(z, z

′))]2)

Page 85:

Statistical test using MMD (3)

• Example: Laplace distributions with different variance

[Figure: two Laplace densities with different variances, and the empirical distribution of MMD under H1 with a Gaussian fit]

Page 86:

Statistical test using MMD (4)

• When P = Q, U-statistic degenerate: Ez′ [h(z, z′)] = 0 [Anderson et al., 1994]

• Distribution is

n MMD² ∼ ∑_{l=1}^{∞} λ_l [ z_l² − 2 ]

• where

– z_l ∼ N(0, 2) i.i.d.

– ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), where k̃ is the centred kernel

Page 87:

Statistical test using MMD (4)

• When P = Q, U-statistic degenerate: Ez′ [h(z, z′)] = 0 [Anderson et al., 1994]

• Distribution is

n MMD² ∼ ∑_{l=1}^{∞} λ_l [ z_l² − 2 ]

• where

– z_l ∼ N(0, 2) i.i.d.

– ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), where k̃ is the centred kernel

[Figure: density of n × MMD² under H0 — the χ² sum matches the empirical PDF]

Page 88:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

MMD2= KP,P +KQ,Q − 2KP,Q

[Figure: MMD density under H0 and H1, with the 1−α null quantile and the Type II error marked]

Page 89:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

Page 90:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

• Permutation for empirical CDF [Arcones and Gine, 1992, Alba Fernandez et al., 2008]

• Pearson curves by matching first four moments [Johnson et al., 1994]

• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]

• Consistent test using kernel eigenspectrum [NIPS09b]

Page 91:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

• Permutation for empirical CDF [Arcones and Gine, 1992, Alba Fernandez et al., 2008]

• Pearson curves by matching first four moments [Johnson et al., 1994]

• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]

• Consistent test using kernel eigenspectrum [NIPS09b]

[Figure: CDF of the MMD under H0 and the Pearson curve fit]

Page 92:

Approximate null distribution of MMD via permutation

Empirical MMD:

w = (1, 1, . . . , 1, −1, −1, . . . , −1)⊤   (n entries +1 followed by n entries −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} ≈ MMD²

Page 93:

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernandez et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (signs permuted; equal number of +1 and −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} = [?]

Page 94:

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernandez et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (signs permuted; equal number of +1 and −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} = [?]

[Figure: the pooled kernel matrix multiplied elementwise (⊙) by the permuted sign pattern ww⊤]

Figure thanks to Kacper Chwialkowski.
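A sketch of the permutation threshold: relabel the pooled sample (equivalently, permute the signs in w), recompute the statistic, and take the empirical (1 − α) quantile. The biased 1/n² statistic and the precomputed pooled Gram matrix are assumptions made here for compactness.

```python
import numpy as np

def mmd2_biased(K, n):
    """Biased MMD^2 from the pooled (2n x 2n) Gram matrix: mean(K_PP) + mean(K_QQ) - 2 mean(K_PQ)."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def mmd_permutation_test(K, n, alpha=0.05, n_perm=500, rng=None):
    """Reject H0: P = Q if the statistic exceeds the permutation (1 - alpha) quantile."""
    rng = rng or np.random.default_rng()
    stat = mmd2_biased(K, n)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(2 * n)                 # random relabelling of the pooled sample
        null[b] = mmd2_biased(K[np.ix_(idx, idx)], n)
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold
```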

Page 95:

Approximate null distribution of MMD² via permutation

MMD²_p ≈ (1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij}

[Figure: MMD density under H0 — the permutation-based null PDF matches the true null PDF]

Page 96:

Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike

burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

Page 97:

Neuro data: consistent test w/o permutation

• Maximum mean discrepancy (MMD): distance between P and Q

MMD(P,Q;F ) := ‖µP − µQ‖2F

• Is MMD significantly > 0?

• P = Q, null distrib. of MMD:

n MMD →_D ∑_{l=1}^{∞} λ_l (z_l² − 2),

– λ_l is the l-th eigenvalue of the kernel k(x_i, x_j)

[Figure: Type II error vs sample size m for P ≠ Q (neuro data), spectral vs permutation thresholds]

Use Gram matrix spectrum for λl: consistent test without permutation

Page 98:

Hypothesis testing with HSIC

Page 99:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

– Statistical testing: how do we decide when this is large enough that the null hypothesis P = P_x P_y is unlikely?

– Formally: given P = PxPy, what is the threshold T such that

P(HSIC > T ) < α for small α?

Page 100:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

• Associated U-statistic degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D ∑_{l=1}^{∞} λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},   h_{ijqr} = (1/4!) ∑_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

Page 101:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

• Associated U-statistic degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D ∑_{l=1}^{∞} λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},   h_{ijqr} = (1/4!) ∑_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

• First two moments [NIPS07b]

E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy)

var(HSIC_b) = [ 2(n−4)(n−5) / (n)_4 ] ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³).

Page 102:

Statistical testing with HSIC

• Given P = PxPy, what is the threshold T such that P(HSIC > T ) < α

for small α?

• Null distribution via permutation [Feuerverger, 1993]

– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.

– Repeat for many different permutations, get empirical CDF

– Threshold T is 1− α quantile of empirical CDF

Page 103:

Statistical testing with HSIC

• Given P = PxPy, what is the threshold T such that P(HSIC > T ) < α

for small α?

• Null distribution via permutation [Feuerverger, 1993]

– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.

– Repeat for many different permutations, get empirical CDF

– Threshold T is 1− α quantile of empirical CDF

• Approximate null distribution via moment matching [Kankainen, 1995]:

n HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / ( β^α Γ(α) )

where

α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
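A sketch of the resulting threshold computation; the shape/scale identification below follows from matching the first two moments of n·HSIC_b, and scipy's gamma quantile is just one way to invert the CDF. Estimates of E(HSIC_b) and var(HSIC_b) under independence are taken as inputs (e.g. from the trace formulas above).

```python
from scipy.stats import gamma

def hsic_gamma_threshold(mean_hsic, var_hsic, n, alpha=0.05):
    """(1 - alpha) quantile of the Gamma approximation to n * HSIC_b under independence.

    Compare n * HSIC_b against the returned value to decide whether to reject independence.
    """
    a = mean_hsic ** 2 / var_hsic          # shape parameter alpha
    scale = n * var_hsic / mean_hsic       # scale parameter beta, so the mean is n * E(HSIC_b)
    return gamma.ppf(1 - alpha, a=a, scale=scale)
```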

Page 104:

Experiment: dependence testing for translation

Are the French text extracts translations of the English?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables senateurs, ma question s'adresse au leader du gouvernement au Senat et concerne l'aide financiere qu'on a annoncee pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

· · ·

?⇐⇒

Y2: Il est evident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas reduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmente le financement federal pour le developpement des jeunes enfants.

· · ·

Page 105:

Experiment: dependence testing for translation

• (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

• Translation example: [NIPS07b]

Canadian Hansard

(agriculture)

• 5-line extracts,

k-spectrum kernel, k = 10,

repetitions=300,

sample size 10

K

⇒HSIC⇐

L

• k-spectrum kernel: average Type II error 0 (α = 0.05)

Page 106:

Experiment: dependence testing for translation

• (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

• Translation example: [NIPS07b]

Canadian Hansard

(agriculture)

• 5-line extracts,

k-spectrum kernel, k = 10,

repetitions=300,

sample size 10

K

⇒HSIC⇐

L

• k-spectrum kernel: average Type II error 0 (α = 0.05)

• Bag of words kernel: average Type II error 0.18

Page 107:

Kernel two-sample tests for big data, optimal kernel choice

Page 108:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 109:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_m} from P, Q, respectively:

The earlier estimate: (quadratic time)

EP k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)

Page 110:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_m} from P, Q, respectively:

The earlier estimate: (quadratic time)

EP k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)

New, linear time estimate:

EP k(x, x′) = (2/m) [ k(x_1, x_2) + k(x_3, x_4) + . . . ] = (2/m) ∑_{i=1}^{m/2} k(x_{2i−1}, x_{2i})

Page 111:

Linear time MMD

Shorter expression with explicit k dependence:

MMD2 =: ηk(p, q) = Exx′yy′hk(x, x′, y, y′) =: Evhk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

Page 112:

Linear time MMD

Shorter expression with explicit k dependence:

MMD2 =: ηk(p, q) = Exx′yy′hk(x, x′, y, y′) =: Evhk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

The linear time estimate again:

η̂_k = (2/m) ∑_{i=1}^{m/2} h_k(v_i),

where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and

h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1})
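A numpy sketch of the linear-time estimate η̂_k over non-overlapping pairs, again assuming a Gaussian kernel; it also returns the sample variance of the h_k(v_i), which estimates the σ²_k used for the threshold on the following slides.

```python
import numpy as np

def linear_time_mmd(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate and the sample variance of the h_k(v_i) terms."""
    m = (min(len(X), len(Y)) // 2) * 2            # use an even number of points
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h.var(ddof=1)                # eta_k_hat = (2/m) sum_i h_k(v_i); sigma_k^2 estimate

# Asymptotic threshold from the following slides: t = sqrt(2 * var / m) * Phi^{-1}(1 - alpha).
```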

Page 113:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Page 114:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD

• Very simple asymptotic null distribution (a Gaussian, vs an infinite

weighted sum of χ2)

• Both test statistic and threshold computable in O(m), with storage O(1).

• Given unlimited data, a given Type II error can be attained with less

computation

Page 115:

Asymptotics of linear time MMD

By central limit theorem,

m1/2 (ηk − ηk(p, q))D→ N (0, 2σ2k)

• assuming 0 < E(h2k) <∞ (true for bounded k)

• σ2k = Evh2k(v)− [Ev(hk(v))]

2 .

Hypothesis test

Hypothesis test of asymptotic level α:

t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α), where Φ^{−1} is the inverse CDF of N(0, 1).

[Figure: null distribution of the linear time MMD η̂_k, with the rejection threshold t_{k,α} at the (1 − α) quantile and the Type I error shown as the tail mass above it]
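
Putting statistic, variance estimate, and threshold together, a self-contained sketch under the same Gaussian-kernel assumption as above (SciPy supplies Φ^{−1}):

```python
import numpy as np
from scipy.stats import norm

def gaussian_k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def linear_time_mmd_test(X, Y, sigma=1.0, alpha=0.05):
    m = min(len(X), len(Y)); m -= m % 2
    x1, x2, y1, y2 = X[0:m:2], X[1:m:2], Y[0:m:2], Y[1:m:2]
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    eta_hat = h.mean()                           # linear-time MMD^2 estimate
    sigma_k = h.std(ddof=1)                      # estimate of sigma_k from the h_k(v_i)
    # threshold t_{k,alpha} = m^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha)
    threshold = sigma_k * np.sqrt(2.0 / m) * norm.ppf(1 - alpha)
    return eta_hat > threshold, eta_hat, threshold
```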

Type II error

[Figure: null vs alternative distributions of η̂_k; the alternative is centred at η_k(p, q), and the Type II error is the alternative's mass falling below the test threshold]

The best kernel: minimizes Type II error

A Type II error occurs when η̂_k falls below the threshold t_{k,α} even though η_k(p, q) > 0.

Prob. of a Type II error:

P(η̂_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √m / (σ_k √2) ),

where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize the Type II error prob. is:

k* = argmax_{k∈K} η_k(p, q) σ_k^{−1},

where K is the family of kernels under consideration.

Learning the best kernel in a family

Define the family of kernels as follows:

K := { k : k = Σ_{u=1}^{d} β_u k_u, ‖β‖_1 = D, β_u ≥ 0, ∀u ∈ {1, . . . , d} }.

Properties, provided at least one β_u > 0:

• all k ∈ K are valid kernels,
• if all k_u are characteristic, then k is characteristic.
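
For concreteness, a sketch of such a family (the choice of Gaussian base kernels with different bandwidths is an illustrative assumption, not the lecture's prescription):

```python
import numpy as np

def base_kernels(a, b, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Return a list of Gram matrices [K_1, ..., K_d], one per base kernel k_u."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return [np.exp(-d2 / (2 * s ** 2)) for s in bandwidths]

def combined_kernel(a, b, beta, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """k(a, b) = sum_u beta_u k_u(a, b), assuming beta_u >= 0."""
    Ks = base_kernels(a, b, bandwidths)
    return sum(b_u * K_u for b_u, K_u in zip(beta, Ks))

rng = np.random.default_rng(0)
beta = np.array([0.25, 0.25, 0.25, 0.25])        # ||beta||_1 = D = 1
K = combined_kernel(rng.normal(size=(5, 2)), rng.normal(size=(3, 2)), beta)
```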

Test statistic

The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^{d} β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v).

Denote:

• β = (β_1, β_2, . . . , β_d)ᵀ ∈ R^d,
• h = (h_1, h_2, . . . , h_d)ᵀ ∈ R^d, where
  h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)
• η = E_v(h) = (η_1, η_2, . . . , η_d)ᵀ ∈ R^d.

Quantities for the test:

η_k(p, q) = E(βᵀh) = βᵀη,   σ²_k := βᵀ cov(h) β.
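
Estimating these quantities from the paired samples v_i is direct; a sketch under the same Gaussian-bandwidth assumption as above (the training arrays below are placeholders, not the lecture's data):

```python
import numpy as np

def h_vectors(X, Y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Return an (m/2, d) array whose (i, u) entry is h_u(v_i)."""
    m = min(len(X), len(Y)); m -= m % 2
    x1, x2, y1, y2 = X[0:m:2], X[1:m:2], Y[0:m:2], Y[1:m:2]
    cols = []
    for s in bandwidths:
        k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * s ** 2))
        cols.append(k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))    # placeholder training samples from P
Y_train = rng.normal(0.2, 1.0, size=(500, 2))    # placeholder training samples from Q
H = h_vectors(X_train, Y_train)
eta_hat = H.mean(axis=0)              # estimate of eta = (eta_1, ..., eta_d)
Q_hat = np.cov(H, rowvar=False)       # estimate of cov(h); sigma_k^2 = beta' cov(h) beta
```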

Optimization of ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:

η̂_k = βᵀη̂,   σ̂_{k,λ} = √( βᵀ(Q̂ + λ_m I)β ),

where Q̂ is the empirical estimate of cov(h).

Note: η̂_k and σ̂_{k,λ} are computed on training data, vs η̂_k and σ̂_k on the data to be tested (why?)

Objective:

β* = argmax_{β⪰0} η̂_k(p, q) σ̂_{k,λ}^{−1} = argmax_{β⪰0} (βᵀη̂)(βᵀ(Q̂ + λ_m I)β)^{−1/2} =: α(β; η̂, Q̂)

Optimization of ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry.

Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.

Thus: α(β*; η̂, Q̂) > 0

Solve the easier problem: β* = argmax_{β⪰0} α²(β; η̂, Q̂).

Quadratic program:

min { βᵀ(Q̂ + λ_m I)β : βᵀη̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?
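
A sketch of this program using a general-purpose constrained solver (my own stand-in; the paper's implementation may use a dedicated QP solver), with eta_hat and Q_hat as computed on the training split above:

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(eta_hat, Q_hat, lam=1e-4):
    """min beta'(Q + lam I)beta s.t. beta'eta_hat = 1, beta >= 0; then rescale.
    Assumes eta_hat has at least one positive entry (see slide)."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    obj = lambda b: b @ A @ b
    grad = lambda b: 2 * A @ b
    cons = [{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}]
    bnds = [(0.0, None)] * d
    b0 = np.full(d, 1.0 / (max(eta_hat.max(), 1e-12) * d))   # rough starting point
    res = minimize(obj, b0, jac=grad, bounds=bnds, constraints=cons, method="SLSQP")
    beta = np.maximum(res.x, 0.0)
    # rescaling beta leaves the ratio eta_k / sigma_k unchanged; normalize ||beta||_1 = 1
    return beta / max(beta.sum(), 1e-12)
```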

Test procedure

1. Split the data into training and test sets.
2. On the training data:
   (a) Compute η̂_u for all k_u ∈ K
   (b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K
3. On the test data:
   (a) Compute η̂_{k*} using k* = Σ_{u=1}^{d} β*_u k_u
   (b) Compute the test threshold t_{α,k*} using σ̂_{k*}
4. Reject the null if η̂_{k*} > t_{α,k*}

Convergence bounds

Assume a bounded kernel, and σ_k bounded away from 0.

If λ_m = Θ(m^{−1/3}) then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

Idea:

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
  ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |
  ≤ [√d / (D √λ_m)] ( C_1 sup_{k∈K} |η̂_k − η_k| + C_2 sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ) + C_3 D² λ_m


Experiments

Competing approaches

• Median heuristic
• Max. MMD: choose the k_u ∈ K with the largest η̂_u
  – same as maximizing βᵀη̂ subject to ‖β‖_1 ≤ 1
• ℓ2 statistic: maximize βᵀη̂ subject to ‖β‖_2 ≤ 1
• Cross validation on training set

Also compare with:

• Single kernel that maximizes the ratio η_k(p, q) σ_k^{−1}
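
For reference, the median heuristic above is usually taken to mean the following bandwidth rule (stated here as an assumption about the usual convention, not as the lecture's definition):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    """Gaussian kernel bandwidth = median pairwise Euclidean distance of the pooled sample."""
    Z = np.vstack([X, Y])
    return np.median(pdist(Z))
```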

Blobs: data

Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions.

We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data — samples from p in coordinates (x1, x2) and from q in (y1, y2), each a grid of Gaussian blobs on roughly [5, 35]²]

Ratio ε = 3.2 of largest to smallest eigenvalues of blobs in q.
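
A toy re-creation of this kind of data (my own sketch; the exact grid size, spacing, and blob orientation used in the experiments are not specified here and are assumptions):

```python
import numpy as np

def sample_blobs(n, grid=5, spacing=5.0, eps=3.2, stretch=False, seed=0):
    """Grid of Gaussian blobs; if stretch, each blob has covariance eigenvalue ratio eps."""
    rng = np.random.default_rng(seed)
    centres = spacing * (1 + rng.integers(0, grid, size=(n, 2)))   # random blob per point
    if stretch:
        theta = np.pi / 4                                          # rotate the anisotropy
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        cov = R @ np.diag([eps, 1.0]) @ R.T
    else:
        cov = np.eye(2)
    return centres + rng.multivariate_normal(np.zeros(2), cov, size=n)

P_sample = sample_blobs(10_000, stretch=False)
Q_sample = sample_blobs(10_000, stretch=True, eps=3.2)
```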

Blobs: results

[Figure: Type II error vs eigenvalue ratio ε for the kernel choices max ratio, opt, l2, maxmmd, xval, xvalc, med; highlighted in turn are the choice that optimizes the ratio η_k(p, q) σ_k^{−1}, the choice that maximizes η_k(p, q) with a β constraint, and the median heuristic]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averaged over 617 trials.

Feature selection: data

Idea: no single best kernel.

Each of the k_u is univariate (along a single coordinate).

[Figure: selection data — samples from p and q, shown in the first two coordinates (x1, x2)]

Feature selection: results

[Figure: Type II error vs dimension for max ratio, opt, l2, maxmmd; the single best kernel vs a linear combination of kernels]

m = 10,000, average over 5000 trials.

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

• ω_c: carrier frequency
• a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from the same artist (in this case, Magnetic Fields).

• Music sampled at 8kHz (very low)
• Carrier frequency is 24kHz
• AM signal observed at 120kHz
• Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
• Total dataset size is 30,000 samples from each of p, q
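
For intuition, a toy construction of one such extract (a sketch only: the sine-wave "audio" and the resampling-by-repetition scheme are placeholders, not the actual music data or processing):

```python
import numpy as np

def am_modulate(s, fs_audio=8_000, fs_out=120_000, f_carrier=24_000, a=0.2, l=2.0):
    """Upsample s to fs_out by sample repetition, then form u(t) = sin(w_c t)[a s(t) + l]."""
    reps = fs_out // fs_audio
    s_up = np.repeat(s, reps)
    t = np.arange(len(s_up)) / fs_out
    return np.sin(2 * np.pi * f_carrier * t) * (a * s_up + l)

s = np.sin(2 * np.pi * 440 * np.arange(8_000) / 8_000)   # stand-in one-second "audio"
u = am_modulate(s)
extract = u[:1000]                                        # one sample of length N = 1000
```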

[Figure: amplitude modulated extracts — samples from P and samples from Q]

Results: AM signals

[Figure: Type II error vs added noise for the kernel choices max ratio, opt, med, l2, maxmmd]

m = 10,000 (for training and test) and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.

Observations on kernel choice

• It is possible to choose the best kernel for a kernel two-sample test
• Kernel choice matters for "difficult" problems, where the distributions differ on a lengthscale different to that of the data
• Ongoing work:
  – quadratic time statistic
  – avoid training/test split

Summary

• MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]
  – high dimensionality
  – non-Euclidean data (strings, graphs)
  – nonparametric hypothesis tests
• Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
• Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]
  – Easy to check: does the spectrum cover R^d


Co-authors

• From UCL:

– Luca Baldassarre

– Steffen Grunewalder

– Guy Lever

– Sam Patterson

– Massimiliano Pontil

– Dino Sejdinovic

• External:

– Karsten Borgwardt, MPI

– Wicher Bergsma, LSE

– Kenji Fukumizu, ISM

– Zaid Harchaoui, INRIA

– Bernhard Schoelkopf, MPI

– Alex Smola, CMU/Google

– Le Song, Georgia Tech

– Bharath Sriperumbudur,

Cambridge

Selected references

Characteristic kernels and mean embeddings:

• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction:

• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
• Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
• Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels. JMLR.

Local departures from the null

What is a hard testing problem?

• First version: for fixed m, "closer" P and Q have higher Type II error

[Figure: two panels of samples from P and Q (x ∈ [0, 1], y ∈ [−1, 1])]

Local departures from the null

What is a hard testing problem?

• As m increases, distinguish "closer" P and Q with fixed Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, Type II error approaches a constant
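
To make this concrete, a sketch with one made-up choice of g (any bounded, zero-integral perturbation that keeps f_Q non-negative would do):

```python
import numpy as np
from scipy.stats import norm

def f_P(x):
    return norm.pdf(x)                         # standard normal density

def g(x, freq=3.0):
    # odd perturbation: integrates to zero exactly, and |g| <= f_P pointwise
    return np.sin(freq * x) * norm.pdf(x)

def f_Q(x, m, c=1.0):
    delta = c / np.sqrt(m)                     # delta ~ m^{-1/2}
    return f_P(x) + delta * g(x)               # valid density for delta <= 1
```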

Page 156: Lecture2: MappingsofProbabilities toRKHSandApplicationsmlss.tuebingen.mpg.de/2015/slides/gretton/part_2.pdf · Lecture2: MappingsofProbabilities toRKHSandApplications MLSST¨ubingen,2015

More general local departures from null

• Example: fP and fQ probability densities, fQ = fP + δg, where δ ∈ R, g

some fixed function such that fQ is a valid density

−6 −4 −2 0 2 4 60

0.05

0.1

0.15

0.2

0.25

0.3

0.35

X

P(X

)

VS

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

Local departures from the null

What is a hard testing problem?

• As we see more samples m, distinguish "closer" P and Q with the same Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, Type II error approaches a constant
• ...but other choices are also possible – how to characterize them all?

General characterization of local departures from H_0:

• Write µ_Q = µ_P + g_m, where g_m ∈ F is chosen such that µ_P + g_m is a valid distribution embedding
• Minimum distinguishable distance [JMLR12]

  ‖g_m‖_F = c m^{−1/2}

Page 159: Lecture2: MappingsofProbabilities toRKHSandApplicationsmlss.tuebingen.mpg.de/2015/slides/gretton/part_2.pdf · Lecture2: MappingsofProbabilities toRKHSandApplications MLSST¨ubingen,2015

More general local departures from null

• More advanced example of a local departure from the null

• Recall: µQ = µP + gm, and ‖gm‖F = cm−1/2

−6 −4 −2 0 2 4 60

0.05

0.1

0.15

0.2

0.25

0.3

0.35

X

P(X

)

VS

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

Kernels vs kernels

• How does MMD relate to a Parzen window density estimate? [Anderson et al., 1994]

  f̂_P(x) = (1/m) Σ_{i=1}^{m} κ(x_i − x), where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

• L2 distance between Parzen density estimates:

  D_2(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^{m} κ(x_i − z) − (1/m) Σ_{i=1}^{m} κ(y_i − z) ]² dz
                  = (1/m²) Σ_{i,j=1}^{m} k(x_i − x_j) + (1/m²) Σ_{i,j=1}^{m} k(y_i − y_j) − (2/m²) Σ_{i,j=1}^{m} k(x_i − y_j),

  where k(x − y) = ∫ κ(x − z) κ(y − z) dz.

• For f_Q = f_P + δg, the minimum distance to discriminate f_P from f_Q is δ = m^{−1/2} h_m^{−d/2}, where h_m is the width of κ.
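
A numerical check of the identity above (a sketch under the assumption of a 1-D Gaussian Parzen window κ of width h, whose self-convolution k = κ∗κ is a Gaussian of width √2 h):

```python
import numpy as np

def parzen_l2_sq(X, Y, h=0.5):
    """Direct numerical integration of the squared L2 distance between Parzen estimates."""
    z = np.linspace(-10, 10, 20001)
    kappa = lambda d: np.exp(-d**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    fP = kappa(X[:, None] - z[None, :]).mean(axis=0)
    fQ = kappa(Y[:, None] - z[None, :]).mean(axis=0)
    return np.trapz((fP - fQ) ** 2, z)

def mmd_sq_biased(X, Y, h=0.5):
    """Kernel-sum form with k(x - y) = integral of kappa(x - z) kappa(y - z) dz."""
    k = lambda d: np.exp(-d**2 / (4 * h**2)) / (2 * h * np.sqrt(np.pi))
    Kxx = k(X[:, None] - X[None, :])
    Kyy = k(Y[:, None] - Y[None, :])
    Kxy = k(X[:, None] - Y[None, :])
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, 200), rng.normal(0.5, 1, 200)
print(parzen_l2_sq(X, Y), mmd_sq_biased(X, Y))   # the two values should agree closely
```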

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001]

If F is universal, then MMD(P, Q; F) = 0 iff P = Q

Characteristic Kernels (via universality)

Proof:

First, it is clear that P = Q implies MMD(P, Q; F) is zero.

Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F such that

‖f − g‖_∞ ≤ ε.

We next make the expansion

|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.

The first and third terms satisfy

|E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.

Next, write

E_P g(x) − E_Q g(y) = 〈g(·), µ_P − µ_Q〉_F = 0,

since MMD(P, Q; F) = 0 implies µ_P = µ_Q. Hence

|E_P f(x) − E_Q f(y)| ≤ 2ε

for all f ∈ C(X) and ε > 0, which implies P = Q.

References

V. Alba Fernandez, M. Jimenez-Gamero, and J. Munoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
M. Arcones and E. Gine. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.
R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyvaskyla, 1995.
S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.
A. Muller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Scholkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.