Lecture 2: Mappings of Probabilities to RKHS and Applications

Lecture 2: Mappings of Probabilities to RKHS and Applications — MLSS Tübingen, 2015 — Arthur Gretton, Gatsby Unit, CSML, UCL

Transcript of Lecture 2: Mappings of Probabilities to RKHS and Applications

Page 1:

Lecture 2: Mappings of Probabilities to RKHS and Applications

MLSS Tübingen, 2015

Arthur Gretton

Gatsby Unit, CSML, UCL

Page 2:

Outline

• Kernel metric on the space of probability measures

– Function revealing differences in distributions

– Distance between means in space of features (RKHS)

– Independence measure: features of joint minus product of marginals

• Characteristic kernels: feature space mappings of probabilities unique

• Two-sample, independence tests for (almost!) any data type

– distributions on strings, images, graphs, groups (rotation matrices), semigroups, ...

• Advanced topics

– testing on big data, kernel choice

– Energy distance/distance covariance: special case of kernel statistic

Page 3:

Feature mean difference

• Simple example: 2 Gaussians with different means

• Answer: t-test

[Figure: two Gaussian densities P_X and Q_X with different means]

Page 4:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕ(x) = x2

[Figure: two Gaussian densities P_X and Q_X with the same mean and different variances]

Page 5:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕ(x) = x2

[Figure: left, two Gaussian densities P_X and Q_X with different variances; right, densities of the feature X², which now differ in mean]

Page 6:

Feature mean difference

• Gaussian and Laplace distributions

• Same mean and same variance

• Difference in means using higher order features...RKHS

[Figure: Gaussian and Laplace densities P_X and Q_X with the same mean and variance]

Page 7:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X ,

define feature map ϕx ∈ F ,

ϕx = [. . . ϕi(x) . . .] ∈ ℓ2

• For positive definite k(x, x′),

k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F ,

f(x) = 〈f, ϕx〉F

Page 8:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X ,

define feature map ϕx ∈ F ,

ϕx = [. . . ϕi(x) . . .] ∈ ℓ2

• For positive definite k(x, x′),

k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F ,

f(x) = 〈f, ϕx〉F

The mean trick

• Given P a Borel probability

measure on X , define feature

map µP ∈ F

µP = [. . .EP [ϕi(x)] . . .]

• For positive definite k(x, x′),

EP,Qk(x, y) = 〈µP, µQ〉F

for x ∼ P and y ∼ Q.

• The mean trick: (we call µP a

mean/distribution embedding)

EP(f(x)) =: 〈µP, f〉F

Page 9:

What does µP look like?

We plot the function µP

• Mean embedding µP ∈ F

〈µP(·), f(·)〉F = EPf(x).

• What does prob. feature map

look like?

µP(x) = 〈µP(·), ϕ(x)〉F = 〈µP(·), k(·, x)〉F = EP k(x′, x) for x′ ∼ P.

Expectation of kernel!

• Empirical estimate:

µP(x) = (1/m) ∑_{i=1}^{m} k(x_i, x),   x_i ∼ P
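As an illustration of the empirical estimate above, a minimal numpy sketch follows, assuming a Gaussian kernel k(x, x′) = exp(−(x − x′)²/(2σ²)); the kernel family and bandwidth are illustrative choices, not part of the slides.

```python
import numpy as np

def mean_embedding(X, sigma=1.0):
    """Return x -> (1/m) * sum_i k(x_i, x) for a Gaussian kernel (assumed choice)."""
    X = np.asarray(X, dtype=float)
    def mu_P(x):
        d = X[:, None] - np.atleast_1d(x)[None, :]   # differences, shape (m, len(x))
        return np.exp(-d**2 / (2 * sigma**2)).mean(axis=0)
    return mu_P

# Example: embed samples from a two-component mixture and evaluate on a grid
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-1, 0.3, 500), rng.normal(1, 0.3, 500)])
mu = mean_embedding(samples, sigma=0.5)
print(mu(np.linspace(-3, 3, 7)))   # a smooth, kernel-weighted view of the sample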

Page 10:

What does µP look like?

We plot the function µP

• Mean embedding µP ∈ F

〈µP(·), f(·)〉F = EPf(x).

• What does prob. feature map

look like?

µP(x) = 〈µP(·), ϕ(x)〉F = 〈µP(·), k(·, x)〉F = EP k(x′, x) for x′ ∼ P.

Expectation of kernel!

• Empirical estimate:

µP(x) = (1/m) ∑_{i=1}^{m} k(x_i, x),   x_i ∼ P

[Figure: histogram of samples X from P, with the empirical mean embedding µP(x) overlaid]

Page 11:

Does the feature space mean exist?

Does there exist an element µP ∈ F such that

EPf(x) = EP〈f(·), ϕ(x)〉F = 〈f(·),EPϕ(x)〉F = 〈f(·), µP(·)〉F ∀f ∈ F

Page 12:

Does the feature space mean exist?

Does there exist an element µP ∈ F such that

EPf(x) = EP〈f(·), ϕ(x)〉F = 〈f(·),EPϕ(x)〉F = 〈f(·), µP(·)〉F ∀f ∈ F

Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner

integrable [Steinwart and Christmann, 2008]) under the condition

EP ‖ϕ(x)‖F = EP √(k(x, x)) < ∞

Page 13:

Function Showing Difference in Distributions

• Are P and Q different?

Page 14:

Function Showing Difference in Distributions

• Are P and Q different?

[Figure: scatter plot of samples from P and Q on [0, 1] × [−1, 1]]

Page 16:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

[Figure: samples from P and Q with a smooth witness function f(x) overlaid]

Page 18:

Function Showing Difference in Distributions

• What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

[Figure: samples from P and Q with a bounded continuous (non-smooth) function f(x) overlaid]

Page 20:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Gauss P vs Laplace Q

[Figure: witness function f for the Gauss and Laplace densities, overlaid on the two densities]

Page 21:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

Page 22:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

• MMD(P,Q;F ) = 0 iff P = Q when F =the unit ball in a characteristic

RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

Page 23:

Function Showing Difference in Distributions

• Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [ EP f(x) − EQ f(y) ].

• Classical results: MMD(P,Q;F ) = 0 iff P = Q, when

– F =bounded continuous [Dudley, 2002]

– F = bounded variation 1 (Kolmogorov metric) [Muller, 1997]

– F = bounded Lipschitz (Earth mover’s distances) [Dudley, 2002]

• MMD(P,Q;F ) = 0 iff P = Q when F =the unit ball in a characteristic

RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

How do smooth functions relate to feature maps?

Page 24:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )²

[Figure: witness function f for the Gauss and Laplace densities, overlaid on the two densities]

Page 25:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )²,   using EP(f(x)) =: ⟨µP, f⟩F

Page 26:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )² = ( sup_{f∈F} ⟨f, µP − µQ⟩F )²,   using EP(f(x)) =: ⟨µP, f⟩F

Page 27:

Function view vs feature mean view

• The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = ( sup_{f∈F} [ EP f(x) − EQ f(y) ] )² = ( sup_{f∈F} ⟨f, µP − µQ⟩F )² = ‖µP − µQ‖²_F,   using ‖θ‖F = sup_{f∈F} ⟨f, θ⟩F

Function view and feature view equivalent
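Because the witness function achieving the supremum is proportional to µP − µQ, it can be drawn directly from samples. Below is a minimal numpy sketch, assuming a Gaussian kernel with an arbitrary bandwidth; compare with the "Witness f for Gauss and Laplace densities" figure.

```python
import numpy as np

def witness(X, Y, grid, sigma=1.0):
    """Unnormalised witness f(t) = mu_P(t) - mu_Q(t), Gaussian kernel (assumed)."""
    def emb(Z, t):
        d = np.asarray(Z, float)[:, None] - t[None, :]
        return np.exp(-d**2 / (2 * sigma**2)).mean(axis=0)
    return emb(X, grid) - emb(Y, grid)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 2000)                   # P: Gaussian
y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), 2000)   # Q: Laplace, same mean and variance
grid = np.linspace(-6, 6, 13)
print(np.round(witness(x, y, grid, sigma=0.5), 3))
```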

Page 28:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

Page 29:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩


Page 31:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

Page 32:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), ϕ(x)⟩ + ...

Page 33:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

Page 34:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 35:

Empirical estimate of MMD

• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD² = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} [ k(x_i, x_j) + k(y_i, y_j) ] − (1/m²) ∑_{i=1}^{m} ∑_{j=1}^{m} [ k(y_i, x_j) + k(x_i, y_j) ]

• Proof:

‖µP − µQ‖²_F = ⟨µP − µQ, µP − µQ⟩_F = ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩ = EP[µP(x)] + ...

= EP ⟨µP(·), k(x, ·)⟩ + ...

= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Then E k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)
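A direct numpy transcription of this unbiased estimate, with a Gaussian kernel assumed for concreteness (any characteristic kernel would do):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 for samples X ~ P and Y ~ Q, each of shape (m, d)."""
    m = X.shape[0]
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # drops the i = j terms
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))   # different mean: estimate positive in expectation
Z = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as X: estimate near zero
print(mmd2_unbiased(X, Y), mmd2_unbiased(X, Z))
```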

Page 36:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

Page 37:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: the kernel on pairs is the product of the kernels on each component, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′)]

Page 38:

MMD for independence: HSIC

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993]and [Szekely and Rizzo, 2009, Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

HSIC using expectations of kernels:

Define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_XY E_X′Y′ [k(x, x′) l(y, y′)] + E_X E_X′ [k(x, x′)] E_Y E_Y′ [l(y, y′)] − 2 E_X′Y′ [ E_X k(x, x′) E_Y l(y, y′) ].

Page 39:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

Page 40:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Page 41:

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

-"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0&

#;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:&

=&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2&

,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

-"2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':&

!#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.&

@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',&

#;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&

<%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&&

$'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%&

2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Empirical HSIC(P_XY, P_X P_Y):

(1/n²) (HKH ⊙ HLH)_{++}  =  (1/n²) trace(KHLH)
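A minimal numpy sketch of this empirical statistic, assuming Gaussian kernels on each space (the kernel choice and bandwidths are illustrative, not prescribed by the slides):

```python
import numpy as np

def gaussian_gram(Z, sigma):
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC = (1/n^2) trace(K H L H) for paired samples X, Y of shape (n, d)."""
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y = X + 0.3 * rng.normal(size=(300, 1))            # dependent pair: HSIC clearly above zero
print(hsic_biased(X, Y), hsic_biased(X, rng.normal(size=(300, 1))))
```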

Page 42:

Characteristic kernels (Via Fourier, on the torus T)

Page 43:

Characteristic Kernels (via Fourier)

Reminder:

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]

In the next slides:

1. Characteristic property on [−π, π] with periodic boundary

2. Characteristic property on Rd

Page 44:

Characteristic Kernels (via Fourier)

Reminder: Fourier series

• Function on [−π, π] with periodic boundary:

f(x) = ∑_{ℓ=−∞}^{∞} f_ℓ exp(iℓx) = ∑_{ℓ=−∞}^{∞} f_ℓ (cos(ℓx) + i sin(ℓx)).

[Figure: top-hat function f(x) and its Fourier series coefficients f_ℓ]

Page 45:

Characteristic Kernels (via Fourier)

Reminder: Fourier series of kernel

k(x, y) = k(x − y) = k(z),   k(z) = ∑_{ℓ=−∞}^{∞} k_ℓ exp(iℓz).

E.g., k(x) = (1/2π) ϑ(x/(2π), iσ²),   k_ℓ = (1/2π) exp(−σ²ℓ²/2).

ϑ is the Jacobi theta function, close to Gaussian when σ² sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k_ℓ]

Page 46:

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

• Fourier series for P is characteristic function φP

• Fourier series for mean embedding is product of fourier series!

(convolution theorem)

µP(x) = EP k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   µ_{P,ℓ} = k_ℓ × φ_{P,ℓ}

Page 47:

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

• Fourier series for P is characteristic function φP

• Fourier series for mean embedding is product of fourier series!

(convolution theorem)

µP(x) = EP k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   µ_{P,ℓ} = k_ℓ × φ_{P,ℓ}

• MMD can be written in terms of Fourier series:

MMD(P, Q; F) := ‖ ∑_{ℓ=−∞}^{∞} [ (φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ ] exp(iℓx) ‖_F

Page 48:

A simpler Fourier expression for MMD

• From previous slide,

MMD(P, Q; F) := ‖ ∑_{ℓ=−∞}^{∞} [ (φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ ] exp(iℓx) ‖_F

• The squared norm of a function f in F is:

‖f‖²_F = ⟨f, f⟩_F = ∑_{ℓ=−∞}^{∞} |f_ℓ|² / k_ℓ.

• Simple, interpretable expression for squared MMD:

MMD²(P, Q; F) = ∑_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k_ℓ|² / k_ℓ = ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ
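A small numerical sketch of this expression: estimate φ_{P,ℓ} and φ_{Q,ℓ} from samples on [−π, π] and weight the squared differences by kernel coefficients k_ℓ. The Gaussian-spectrum coefficients and the perturbation frequency below are assumptions chosen to mimic the examples that follow: a spectrum with zeros at the perturbed frequency cannot see the difference.

```python
import numpy as np

def char_coeffs(samples, ells):
    """Empirical Fourier coefficients phi_ell = E[exp(-i * ell * x)] for x on [-pi, pi]."""
    return np.exp(-1j * np.outer(ells, samples)).mean(axis=1)

def mmd2_fourier(x, y, k_ell, ells):
    """Truncated MMD^2 = sum_ell |phi_P,ell - phi_Q,ell|^2 * k_ell."""
    diff = char_coeffs(x, ells) - char_coeffs(y, ells)
    return np.sum(np.abs(diff) ** 2 * k_ell)

ells = np.arange(-10, 11)
k_gauss = np.exp(-0.25 * ells**2 / 2) / (2 * np.pi)   # Gaussian-spectrum coefficients (sigma = 0.5)
k_holes = k_gauss * (np.abs(ells) != 3)               # toy spectrum with zeros at |ell| = 3

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, 5000)
y = np.mod(x + 0.2 * np.sin(3 * x) + np.pi, 2 * np.pi) - np.pi  # differs mainly at frequency 3
print(mmd2_fourier(x, y, k_gauss, ells))   # clearly positive
print(mmd2_fourier(x, y, k_holes, ells))   # far smaller: the difference is largely invisible here
```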

Page 49:

Example

• Example: P differs from Q at one frequency

[Figure: densities P(x) and Q(x), where P differs from Q at one frequency]

Page 50:

Characteristic Kernels (2)

• Example: P differs from Q at (roughly) one frequency

[Figure: P(x) and Q(x), and their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}]

Page 51:

Characteristic Kernels (2)

• Example: P differs from Q at (roughly) one frequency

[Figure: P(x) and Q(x), their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}, and the characteristic function difference φ_{P,ℓ} − φ_{Q,ℓ}]

Page 52:

Example

Is the Gaussian-spectrum kernel characteristic?

[Figure: the Gaussian-spectrum kernel k(x) and its Fourier series coefficients]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 53:

Example

Is the Gaussian-spectrum kernel characteristic? YES

[Figure: the Gaussian-spectrum kernel and its Fourier series coefficients, which are everywhere non-zero]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 54:

Example

Is the triangle kernel characteristic?

[Figure: the triangle kernel and its Fourier series coefficients]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 55:

Example

Is the triangle kernel characteristic? NO

[Figure: the triangle kernel and its Fourier series coefficients, some of which are zero]

MMD²(P, Q; F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k_ℓ

Page 56:

Characteristic kernels (Via Fourier, on Rd)

Page 57:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

Page 58:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

• Characteristic function of P via Fourier transform

φP(ω) = ∫_{R^d} e^{i x⊤ω} dP(x)

Page 59:

Characteristic Kernels (via Fourier)

• Can we prove characteristic on Rd?

• Characteristic function of P via Fourier transform

φP(ω) = ∫_{R^d} e^{i x⊤ω} dP(x)

• Translation invariant kernels: k(x, y) = k(x − y) = k(z)

• Bochner's theorem:

k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω)

– Λ finite non-negative Borel measure


Page 61:

Characteristic Kernels (via Fourier)

• Fourier representation of MMD:

MMD²(P, Q; F) := ∫ |φP(ω) − φQ(ω)|² dΛ(ω)

– φP characteristic function of P

Proof: Using Bochner's theorem (a) and Fubini's theorem (b),

MMD²(P, Q) = ∫∫_{R^d} k(x − y) d(P − Q)(x) d(P − Q)(y)

(a) = ∫∫∫_{R^d} e^{−i(x−y)⊤ω} dΛ(ω) d(P − Q)(x) d(P − Q)(y)

(b) = ∫ [ ∫_{R^d} e^{−i x⊤ω} d(P − Q)(x) ] [ ∫_{R^d} e^{i y⊤ω} d(P − Q)(y) ] dΛ(ω)

= ∫ |φP(ω) − φQ(ω)|² dΛ(ω)

Page 62:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X) on the real line, differing at (roughly) one frequency]

Page 63:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: P(X) and Q(X), and the magnitudes of their characteristic functions |φP| and |φQ|]

Page 64:

Example

• Example: P differs from Q at (roughly) one frequency

[Figure: P(X), Q(X), |φP|, |φQ|, and the characteristic function difference |φP − φQ| over frequency ω]

Page 65:

Example

• Example: P differs from Q at (roughly) one frequency

Gaussian kernel

Difference |φP − φQ|

[Figure: Gaussian kernel spectrum overlaid on the difference |φP − φQ| as a function of frequency ω]

Page 66:

Example

• Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the Gaussian kernel spectrum is non-zero at every frequency, so the difference is detected]

Page 67:

Example

• Example: P differs from Q at (roughly) one frequency

Sinc kernel

Difference |φP − φQ|

[Figure: sinc kernel spectrum overlaid on the difference |φP − φQ| as a function of frequency ω]

Page 68:

Example

• Example: P differs from Q at (roughly) one frequency

NOT characteristic

[Figure: the sinc kernel spectrum vanishes where P and Q differ, so the difference is missed]

Page 69:

Example

• Example: P differs from Q at (roughly) one frequency

Triangle (B-spline) kernel

Difference |φP − φQ|

[Figure: triangle (B-spline) kernel spectrum overlaid on the difference |φP − φQ|]

Page 70:

Example

• Example: P differs from Q at (roughly) one frequency

???

[Figure: the B-spline kernel spectrum has isolated zeros near the frequency where P and Q differ]

Page 71:

Example

• Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the B-spline kernel spectrum vanishes only at isolated points, so the kernel remains characteristic]

Page 72:

Summary: Characteristic Kernels

Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Main theorem: A translation invariant k is characteristic for prob. measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum is zero on at most a countable set) [COLT08, JMLR10]

Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval). 1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535]

Page 73:

k characteristic iff supp(Λ) = Rd

Proof: supp Λ = Rd =⇒ k characteristic:

Recall Fourier definition of MMD:

MMD²(P, Q) = ∫_{R^d} |φP(ω) − φQ(ω)|² dΛ(ω).

Characteristic functions φP(ω) and φQ(ω) uniformly continuous, hence their

difference cannot be non-zero only on a countable set.

Map φP uniformly continuous: ∀ǫ > 0, ∃δ > 0 such that ∀(ω1, ω2) ∈ Ω for which d(ω1, ω2) < δ, we have

d(φP(ω1), φP(ω2)) < ǫ. Uniform: δ depends only on ǫ, not on ω1, ω2.

Page 74:

k characteristic iff supp(Λ) = Rd

Proof: k characteristic =⇒ supp Λ = Rd :

Proof by contrapositive.

Given supp Λ ⊊ R^d, hence ∃ an open interval U such that Λ(ω) is zero on U.

Construct densities p(x), q(x) such that φP, φQ differ only inside U

Page 75:

Further extensions

• Similar reasoning wherever extensions of Bochner’s theorem exist:

[Fukumizu et al., 2009]

– Locally compact Abelian groups (periodic domains, as we saw)

– Compact, non-Abelian groups (orthogonal matrices)

– The semigroup R₊^n (histograms)

• Related kernel statistics: Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

Page 76:

Statistical hypothesis testing

Page 77:

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

Page 78:

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?


Page 80:

Statistical test using MMD (1)

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P ≠ Q)

Page 81:

Statistical test using MMD (1)

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P ≠ Q)

• Observe samples x := {x1, . . . , xn} from P and y := {y1, . . . , yn} from Q

• If empirical MMD(x,y;F ) is

– “far from zero”: reject H0

– “close to zero”: accept H0

Page 82:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

Page 83:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

• An unbiased empirical estimate (quadratic cost):

MMD² = 1/(n(n−1)) ∑_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ],   the summand denoted h((x_i, y_i), (x_j, y_j))

Page 84:

Statistical test using MMD (2)

• “far from zero” vs “close to zero” - threshold?

• One answer: asymptotic distribution of MMD2

• An unbiased empirical estimate (quadratic cost):

MMD2= 1

n(n−1)

i 6=j

k(xi, xj)− k(xi, yj)− k(yi, xj) + k(yi, yj)︸ ︷︷ ︸h((xi,yi),(xj ,yj))

• When P 6= Q, asymptotically normal

(√n)(MMD

2 −MMD2)∼ N (0, σ2u)

[Hoeffding, 1948, Serfling, 1980]

• Expression for the variance: zi := (xi, yi)

σ2u = 4(Ez

[(Ez′h(z, z

′))2]−[Ez,z′(h(z, z

′))]2)

Page 85:

Statistical test using MMD (3)

• Example: Laplace distributions with different variance

[Figure: two Laplace densities with different variances, and the empirical distribution of MMD under H1 with a Gaussian fit]

Page 86:

Statistical test using MMD (4)

• When P = Q, U-statistic degenerate: Ez′ [h(z, z′)] = 0 [Anderson et al., 1994]

• Distribution is

n MMD² ∼ ∑_{l=1}^{∞} λ_l [ z_l² − 2 ]

• where

– z_l ∼ N(0, 2) i.i.d.

– ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), where k̃ is the centred kernel

Page 87:

Statistical test using MMD (4)

• When P = Q, U-statistic degenerate: Ez′ [h(z, z′)] = 0 [Anderson et al., 1994]

• Distribution is

n MMD² ∼ ∑_{l=1}^{∞} λ_l [ z_l² − 2 ]

• where

– z_l ∼ N(0, 2) i.i.d.

– ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), where k̃ is the centred kernel

[Figure: density of n × MMD² under H0 — the χ² sum matches the empirical PDF]

Page 88:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

MMD2= KP,P +KQ,Q − 2KP,Q

[Figure: MMD density under H0 and H1, with the 1−α null quantile and the Type II error marked]

Page 89:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

Page 90:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

• Permutation for empirical CDF [Arcones and Gine, 1992, Alba Fernandez et al., 2008]

• Pearson curves by matching first four moments [Johnson et al., 1994]

• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]

• Consistent test using kernel eigenspectrum [NIPS09b]

Page 91:

Statistical test using MMD (5)

• Given P = Q, want threshold T such that P(MMD > T ) ≤ 0.05

• Permutation for empirical CDF [Arcones and Gine, 1992, Alba Fernandez et al., 2008]

• Pearson curves by matching first four moments [Johnson et al., 1994]

• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]

• Consistent test using kernel eigenspectrum [NIPS09b]

[Figure: CDF of the MMD under H0 and the Pearson curve fit]

Page 92:

Approximate null distribution of MMD via permutation

Empirical MMD:

w = (1, 1, . . . , 1, −1, −1, . . . , −1)⊤   (n entries +1 followed by n entries −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} ≈ MMD²

Page 93:

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernandez et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (signs permuted; equal number of +1 and −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} = [?]

Page 94:

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernandez et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (signs permuted; equal number of +1 and −1)

(1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij} = [?]

[Figure: the pooled kernel matrix multiplied elementwise (⊙) by the permuted sign pattern ww⊤]

Figure thanks to Kacper Chwialkowski.
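A sketch of the permutation threshold: relabel the pooled sample (equivalently, permute the signs in w), recompute the statistic, and take the empirical (1 − α) quantile. The biased 1/n² statistic and the precomputed pooled Gram matrix are assumptions made here for compactness.

```python
import numpy as np

def mmd2_biased(K, n):
    """Biased MMD^2 from the pooled (2n x 2n) Gram matrix: mean(K_PP) + mean(K_QQ) - 2 mean(K_PQ)."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def mmd_permutation_test(K, n, alpha=0.05, n_perm=500, rng=None):
    """Reject H0: P = Q if the statistic exceeds the permutation (1 - alpha) quantile."""
    rng = rng or np.random.default_rng()
    stat = mmd2_biased(K, n)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(2 * n)                 # random relabelling of the pooled sample
        null[b] = mmd2_biased(K[np.ix_(idx, idx)], n)
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold
```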

Page 95:

Approximate null distribution of MMD² via permutation

MMD²_p ≈ (1/n²) ∑_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww⊤ )_{ij}

[Figure: MMD density under H0 — the permutation-based null PDF matches the true null PDF]

Page 96:

Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike

burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

Page 97:

Neuro data: consistent test w/o permutation

• Maximum mean discrepancy (MMD): distance between P and Q

MMD(P,Q;F ) := ‖µP − µQ‖2F

• Is MMD significantly > 0?

• P = Q, null distrib. of MMD:

n MMD →_D ∑_{l=1}^{∞} λ_l (z_l² − 2),

– λ_l is the l-th eigenvalue of the kernel k(x_i, x_j)

[Figure: Type II error vs sample size m for P ≠ Q (neuro data), spectral vs permutation thresholds]

Use Gram matrix spectrum for λl: consistent test without permutation

Page 98:

Hypothesis testing with HSIC

Page 99:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

– Statistical testing: how do we decide when this is large enough that the null hypothesis P = P_x P_y is unlikely?

– Formally: given P = PxPy, what is the threshold T such that

P(HSIC > T ) < α for small α?

Page 100:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

• Associated U-statistic degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D ∑_{l=1}^{∞} λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},   h_{ijqr} = (1/4!) ∑_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

Page 101:

Distribution of HSIC at independence

• (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

• Associated U-statistic degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D ∑_{l=1}^{∞} λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},   h_{ijqr} = (1/4!) ∑_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

• First two moments [NIPS07b]

E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy)

var(HSIC_b) = [ 2(n−4)(n−5) / (n)_4 ] ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³).

Page 102:

Statistical testing with HSIC

• Given P = PxPy, what is the threshold T such that P(HSIC > T ) < α

for small α?

• Null distribution via permutation [Feuerverger, 1993]

– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.

– Repeat for many different permutations, get empirical CDF

– Threshold T is 1− α quantile of empirical CDF

Page 103:

Statistical testing with HSIC

• Given P = PxPy, what is the threshold T such that P(HSIC > T ) < α

for small α?

• Null distribution via permutation [Feuerverger, 1993]

– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.

– Repeat for many different permutations, get empirical CDF

– Threshold T is 1− α quantile of empirical CDF

• Approximate null distribution via moment matching [Kankainen, 1995]:

n HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / ( β^α Γ(α) )

where

α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
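A sketch of the resulting threshold computation; the shape/scale identification below follows from matching the first two moments of n·HSIC_b, and scipy's gamma quantile is just one way to invert the CDF. Estimates of E(HSIC_b) and var(HSIC_b) under independence are taken as inputs (e.g. from the trace formulas above).

```python
from scipy.stats import gamma

def hsic_gamma_threshold(mean_hsic, var_hsic, n, alpha=0.05):
    """(1 - alpha) quantile of the Gamma approximation to n * HSIC_b under independence.

    Compare n * HSIC_b against the returned value to decide whether to reject independence.
    """
    a = mean_hsic ** 2 / var_hsic          # shape parameter alpha
    scale = n * var_hsic / mean_hsic       # scale parameter beta, so the mean is n * E(HSIC_b)
    return gamma.ppf(1 - alpha, a=a, scale=scale)
```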

Page 104:

Experiment: dependence testing for translation

Are the French text extracts translations of the English?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables senateurs, ma question s'adresse au leader du gouvernement au Senat et concerne l'aide financiere qu'on a annoncee pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

· · ·

?⇐⇒

Y2: Il est evident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas reduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmente le financement federal pour le developpement des jeunes enfants.

· · ·

Page 105:

Experiment: dependence testing for translation

• (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

• Translation example: [NIPS07b]

Canadian Hansard

(agriculture)

• 5-line extracts,

k-spectrum kernel, k = 10,

repetitions=300,

sample size 10

K

⇒HSIC⇐

L

• k-spectrum kernel: average Type II error 0 (α = 0.05)

Page 106:

Experiment: dependence testing for translation

• (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

• Translation example: [NIPS07b]

Canadian Hansard

(agriculture)

• 5-line extracts,

k-spectrum kernel, k = 10,

repetitions=300,

sample size 10

K

⇒HSIC⇐

L

• k-spectrum kernel: average Type II error 0 (α = 0.05)

• Bag of words kernel: average Type II error 0.18

Page 107:

Kernel two-sample tests for big data, optimal kernel choice

Page 108:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 109:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_m} from P, Q, respectively:

The earlier estimate: (quadratic time)

EP k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)

Page 110:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²_F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_m} from P, Q, respectively:

The earlier estimate: (quadratic time)

EP k(x, x′) = 1/(m(m−1)) ∑_{i=1}^{m} ∑_{j≠i} k(x_i, x_j)

New, linear time estimate:

EP k(x, x′) = (2/m) [ k(x_1, x_2) + k(x_3, x_4) + . . . ] = (2/m) ∑_{i=1}^{m/2} k(x_{2i−1}, x_{2i})

Page 111:

Linear time MMD

Shorter expression with explicit k dependence:

MMD2 =: ηk(p, q) = Exx′yy′hk(x, x′, y, y′) =: Evhk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

Page 112:

Linear time MMD

Shorter expression with explicit k dependence:

MMD2 =: ηk(p, q) = Exx′yy′hk(x, x′, y, y′) =: Evhk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

The linear time estimate again:

η̂_k = (2/m) ∑_{i=1}^{m/2} h_k(v_i),

where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and

h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1})
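A numpy sketch of the linear-time estimate η̂_k over non-overlapping pairs, again assuming a Gaussian kernel; it also returns the sample variance of the h_k(v_i), which estimates the σ²_k used for the threshold on the following slides.

```python
import numpy as np

def linear_time_mmd(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate and the sample variance of the h_k(v_i) terms."""
    m = (min(len(X), len(Y)) // 2) * 2            # use an even number of points
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h.var(ddof=1)                # eta_k_hat = (2/m) sum_i h_k(v_i); sigma_k^2 estimate

# Asymptotic threshold from the following slides: t = sqrt(2 * var / m) * Phi^{-1}(1 - alpha).
```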

Page 113:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Page 114:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD

• Very simple asymptotic null distribution (a Gaussian, vs an infinite

weighted sum of χ2)

• Both test statistic and threshold computable in O(m), with storage O(1).

• Given unlimited data, a given Type II error can be attained with less

computation

Page 115:

Asymptotics of linear time MMD

By central limit theorem,

m1/2 (ηk − ηk(p, q))D→ N (0, 2σ2k)

• assuming 0 < E(h2k) <∞ (true for bounded k)

• σ2k = Evh2k(v)− [Ev(hk(v))]

2 .

Hypothesis test

Hypothesis test of asymptotic level α:

t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α), where Φ^{−1} is the inverse CDF of N(0, 1).

[Figure: null distribution of the linear time MMD η̂_k, with the rejection threshold t_{k,α} at the (1 − α) quantile and the Type I error shown as the tail mass above it]
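
Putting statistic, variance estimate, and threshold together, a self-contained sketch under the same Gaussian-kernel assumption as above (SciPy supplies Φ^{−1}):

```python
import numpy as np
from scipy.stats import norm

def gaussian_k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def linear_time_mmd_test(X, Y, sigma=1.0, alpha=0.05):
    m = min(len(X), len(Y)); m -= m % 2
    x1, x2, y1, y2 = X[0:m:2], X[1:m:2], Y[0:m:2], Y[1:m:2]
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    eta_hat = h.mean()                           # linear-time MMD^2 estimate
    sigma_k = h.std(ddof=1)                      # estimate of sigma_k from the h_k(v_i)
    # threshold t_{k,alpha} = m^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha)
    threshold = sigma_k * np.sqrt(2.0 / m) * norm.ppf(1 - alpha)
    return eta_hat > threshold, eta_hat, threshold
```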

Type II error

[Figure: null vs alternative distributions of η̂_k; the alternative is centred at η_k(p, q), and the Type II error is the alternative's mass falling below the test threshold]

The best kernel: minimizes Type II error

A Type II error occurs when η̂_k falls below the threshold t_{k,α} even though η_k(p, q) > 0.

Prob. of a Type II error:

P(η̂_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √m / (σ_k √2) ),

where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize the Type II error prob. is:

k* = argmax_{k∈K} η_k(p, q) σ_k^{−1},

where K is the family of kernels under consideration.

Learning the best kernel in a family

Define the family of kernels as follows:

K := { k : k = Σ_{u=1}^{d} β_u k_u, ‖β‖_1 = D, β_u ≥ 0, ∀u ∈ {1, . . . , d} }.

Properties, provided at least one β_u > 0:

• all k ∈ K are valid kernels,
• if all k_u are characteristic, then k is characteristic.
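
For concreteness, a sketch of such a family (the choice of Gaussian base kernels with different bandwidths is an illustrative assumption, not the lecture's prescription):

```python
import numpy as np

def base_kernels(a, b, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Return a list of Gram matrices [K_1, ..., K_d], one per base kernel k_u."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return [np.exp(-d2 / (2 * s ** 2)) for s in bandwidths]

def combined_kernel(a, b, beta, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """k(a, b) = sum_u beta_u k_u(a, b), assuming beta_u >= 0."""
    Ks = base_kernels(a, b, bandwidths)
    return sum(b_u * K_u for b_u, K_u in zip(beta, Ks))

rng = np.random.default_rng(0)
beta = np.array([0.25, 0.25, 0.25, 0.25])        # ||beta||_1 = D = 1
K = combined_kernel(rng.normal(size=(5, 2)), rng.normal(size=(3, 2)), beta)
```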

Test statistic

The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^{d} β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v).

Denote:

• β = (β_1, β_2, . . . , β_d)ᵀ ∈ R^d,
• h = (h_1, h_2, . . . , h_d)ᵀ ∈ R^d, where
  h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)
• η = E_v(h) = (η_1, η_2, . . . , η_d)ᵀ ∈ R^d.

Quantities for the test:

η_k(p, q) = E(βᵀh) = βᵀη,   σ²_k := βᵀ cov(h) β.
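
Estimating these quantities from the paired samples v_i is direct; a sketch under the same Gaussian-bandwidth assumption as above (the training arrays below are placeholders, not the lecture's data):

```python
import numpy as np

def h_vectors(X, Y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Return an (m/2, d) array whose (i, u) entry is h_u(v_i)."""
    m = min(len(X), len(Y)); m -= m % 2
    x1, x2, y1, y2 = X[0:m:2], X[1:m:2], Y[0:m:2], Y[1:m:2]
    cols = []
    for s in bandwidths:
        k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * s ** 2))
        cols.append(k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))    # placeholder training samples from P
Y_train = rng.normal(0.2, 1.0, size=(500, 2))    # placeholder training samples from Q
H = h_vectors(X_train, Y_train)
eta_hat = H.mean(axis=0)              # estimate of eta = (eta_1, ..., eta_d)
Q_hat = np.cov(H, rowvar=False)       # estimate of cov(h); sigma_k^2 = beta' cov(h) beta
```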

Optimization of ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:

η̂_k = βᵀη̂,   σ̂_{k,λ} = √( βᵀ(Q̂ + λ_m I)β ),

where Q̂ is the empirical estimate of cov(h).

Note: η̂_k and σ̂_{k,λ} are computed on training data, vs η̂_k and σ̂_k on the data to be tested (why?)

Objective:

β* = argmax_{β⪰0} η̂_k(p, q) σ̂_{k,λ}^{−1} = argmax_{β⪰0} (βᵀη̂)(βᵀ(Q̂ + λ_m I)β)^{−1/2} =: α(β; η̂, Q̂)

Optimization of ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry.

Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.

Thus: α(β*; η̂, Q̂) > 0

Solve the easier problem: β* = argmax_{β⪰0} α²(β; η̂, Q̂).

Quadratic program:

min { βᵀ(Q̂ + λ_m I)β : βᵀη̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?
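
A sketch of this program using a general-purpose constrained solver (my own stand-in; the paper's implementation may use a dedicated QP solver), with eta_hat and Q_hat as computed on the training split above:

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(eta_hat, Q_hat, lam=1e-4):
    """min beta'(Q + lam I)beta s.t. beta'eta_hat = 1, beta >= 0; then rescale.
    Assumes eta_hat has at least one positive entry (see slide)."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    obj = lambda b: b @ A @ b
    grad = lambda b: 2 * A @ b
    cons = [{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}]
    bnds = [(0.0, None)] * d
    b0 = np.full(d, 1.0 / (max(eta_hat.max(), 1e-12) * d))   # rough starting point
    res = minimize(obj, b0, jac=grad, bounds=bnds, constraints=cons, method="SLSQP")
    beta = np.maximum(res.x, 0.0)
    # rescaling beta leaves the ratio eta_k / sigma_k unchanged; normalize ||beta||_1 = 1
    return beta / max(beta.sum(), 1e-12)
```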

Test procedure

1. Split the data into training and test sets.
2. On the training data:
   (a) Compute η̂_u for all k_u ∈ K
   (b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K
3. On the test data:
   (a) Compute η̂_{k*} using k* = Σ_{u=1}^{d} β*_u k_u
   (b) Compute the test threshold t_{α,k*} using σ̂_{k*}
4. Reject the null if η̂_{k*} > t_{α,k*}

Convergence bounds

Assume a bounded kernel, and σ_k bounded away from 0.

If λ_m = Θ(m^{−1/3}) then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

Idea:

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
  ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |
  ≤ [√d / (D √λ_m)] ( C_1 sup_{k∈K} |η̂_k − η_k| + C_2 sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ) + C_3 D² λ_m


Experiments

Competing approaches

• Median heuristic
• Max. MMD: choose the k_u ∈ K with the largest η̂_u
  – same as maximizing βᵀη̂ subject to ‖β‖_1 ≤ 1
• ℓ2 statistic: maximize βᵀη̂ subject to ‖β‖_2 ≤ 1
• Cross validation on training set

Also compare with:

• Single kernel that maximizes the ratio η_k(p, q) σ_k^{−1}
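
For reference, the median heuristic above is usually taken to mean the following bandwidth rule (stated here as an assumption about the usual convention, not as the lecture's definition):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    """Gaussian kernel bandwidth = median pairwise Euclidean distance of the pooled sample."""
    Z = np.vstack([X, Y])
    return np.median(pdist(Z))
```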

Blobs: data

Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions.

We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data — samples from p in coordinates (x1, x2) and from q in (y1, y2), each a grid of Gaussian blobs on roughly [5, 35]²]

Ratio ε = 3.2 of largest to smallest eigenvalues of blobs in q.
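
A toy re-creation of this kind of data (my own sketch; the exact grid size, spacing, and blob orientation used in the experiments are not specified here and are assumptions):

```python
import numpy as np

def sample_blobs(n, grid=5, spacing=5.0, eps=3.2, stretch=False, seed=0):
    """Grid of Gaussian blobs; if stretch, each blob has covariance eigenvalue ratio eps."""
    rng = np.random.default_rng(seed)
    centres = spacing * (1 + rng.integers(0, grid, size=(n, 2)))   # random blob per point
    if stretch:
        theta = np.pi / 4                                          # rotate the anisotropy
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        cov = R @ np.diag([eps, 1.0]) @ R.T
    else:
        cov = np.eye(2)
    return centres + rng.multivariate_normal(np.zeros(2), cov, size=n)

P_sample = sample_blobs(10_000, stretch=False)
Q_sample = sample_blobs(10_000, stretch=True, eps=3.2)
```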

Blobs: results

[Figure: Type II error vs eigenvalue ratio ε for the kernel choices max ratio, opt, l2, maxmmd, xval, xvalc, med; highlighted in turn are the choice that optimizes the ratio η_k(p, q) σ_k^{−1}, the choice that maximizes η_k(p, q) with a β constraint, and the median heuristic]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averaged over 617 trials.

Feature selection: data

Idea: no single best kernel.

Each of the k_u is univariate (along a single coordinate).

[Figure: selection data — samples from p and q, shown in the first two coordinates (x1, x2)]

Feature selection: results

[Figure: Type II error vs dimension for max ratio, opt, l2, maxmmd; the single best kernel vs a linear combination of kernels]

m = 10,000, average over 5000 trials.

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

• ω_c: carrier frequency
• a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from the same artist (in this case, Magnetic Fields).

• Music sampled at 8kHz (very low)
• Carrier frequency is 24kHz
• AM signal observed at 120kHz
• Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
• Total dataset size is 30,000 samples from each of p, q
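
For intuition, a toy construction of one such extract (a sketch only: the sine-wave "audio" and the resampling-by-repetition scheme are placeholders, not the actual music data or processing):

```python
import numpy as np

def am_modulate(s, fs_audio=8_000, fs_out=120_000, f_carrier=24_000, a=0.2, l=2.0):
    """Upsample s to fs_out by sample repetition, then form u(t) = sin(w_c t)[a s(t) + l]."""
    reps = fs_out // fs_audio
    s_up = np.repeat(s, reps)
    t = np.arange(len(s_up)) / fs_out
    return np.sin(2 * np.pi * f_carrier * t) * (a * s_up + l)

s = np.sin(2 * np.pi * 440 * np.arange(8_000) / 8_000)   # stand-in one-second "audio"
u = am_modulate(s)
extract = u[:1000]                                        # one sample of length N = 1000
```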

[Figure: amplitude modulated extracts — samples from P and samples from Q]

Results: AM signals

[Figure: Type II error vs added noise for the kernel choices max ratio, opt, med, l2, maxmmd]

m = 10,000 (for training and test) and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.

Observations on kernel choice

• It is possible to choose the best kernel for a kernel two-sample test
• Kernel choice matters for "difficult" problems, where the distributions differ on a lengthscale different to that of the data
• Ongoing work:
  – quadratic time statistic
  – avoid training/test split

Summary

• MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]
  – high dimensionality
  – non-Euclidean data (strings, graphs)
  – nonparametric hypothesis tests
• Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
• Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]
  – Easy to check: does the spectrum cover R^d


Co-authors

• From UCL:

– Luca Baldassarre

– Steffen Grunewalder

– Guy Lever

– Sam Patterson

– Massimiliano Pontil

– Dino Sejdinovic

• External:

– Karsten Borgwardt, MPI

– Wicher Bergsma, LSE

– Kenji Fukumizu, ISM

– Zaid Harchaoui, INRIA

– Bernhard Schoelkopf, MPI

– Alex Smola, CMU/Google

– Le Song, Georgia Tech

– Bharath Sriperumbudur,

Cambridge

Selected references

Characteristic kernels and mean embeddings:

• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction:

• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
• Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
• Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels. JMLR.

Local departures from the null

What is a hard testing problem?

• First version: for fixed m, "closer" P and Q have higher Type II error

[Figure: two panels of samples from P and Q (x ∈ [0, 1], y ∈ [−1, 1])]

Local departures from the null

What is a hard testing problem?

• As m increases, distinguish "closer" P and Q with fixed Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, Type II error approaches a constant
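
To make this concrete, a sketch with one made-up choice of g (any bounded, zero-integral perturbation that keeps f_Q non-negative would do):

```python
import numpy as np
from scipy.stats import norm

def f_P(x):
    return norm.pdf(x)                         # standard normal density

def g(x, freq=3.0):
    # odd perturbation: integrates to zero exactly, and |g| <= f_P pointwise
    return np.sin(freq * x) * norm.pdf(x)

def f_Q(x, m, c=1.0):
    delta = c / np.sqrt(m)                     # delta ~ m^{-1/2}
    return f_P(x) + delta * g(x)               # valid density for delta <= 1
```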

Page 156: Lecture2: MappingsofProbabilities toRKHSandApplicationsmlss.tuebingen.mpg.de/2015/slides/gretton/part_2.pdf · Lecture2: MappingsofProbabilities toRKHSandApplications MLSST¨ubingen,2015

More general local departures from null

• Example: fP and fQ probability densities, fQ = fP + δg, where δ ∈ R, g

some fixed function such that fQ is a valid density

−6 −4 −2 0 2 4 60

0.05

0.1

0.15

0.2

0.25

0.3

0.35

X

P(X

)

VS

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

X

Q(X

)

Local departures from the null

What is a hard testing problem?

• As we see more samples m, distinguish "closer" P and Q with the same Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, Type II error approaches a constant
• ...but other choices are also possible – how to characterize them all?

General characterization of local departures from H_0:

• Write µ_Q = µ_P + g_m, where g_m ∈ F is chosen such that µ_P + g_m is a valid distribution embedding
• Minimum distinguishable distance [JMLR12]

  ‖g_m‖_F = c m^{−1/2}

Page 159: Lecture2: MappingsofProbabilities toRKHSandApplicationsmlss.tuebingen.mpg.de/2015/slides/gretton/part_2.pdf · Lecture2: MappingsofProbabilities toRKHSandApplications MLSST¨ubingen,2015

More general local departures from null

• More advanced example of a local departure from the null

• Recall: µQ = µP + gm, and ‖gm‖F = cm−1/2

−6 −4 −2 0 2 4 60

0.05

0.1

0.15

0.2

0.25

0.3

0.35

X

P(X

)

VS

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

−6 −4 −2 0 2 4 6

0.1

0.2

0.3

0.4

X

Q(X

)

Kernels vs kernels

• How does MMD relate to a Parzen window density estimate? [Anderson et al., 1994]

  f̂_P(x) = (1/m) Σ_{i=1}^{m} κ(x_i − x), where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

• L2 distance between Parzen density estimates:

  D_2(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^{m} κ(x_i − z) − (1/m) Σ_{i=1}^{m} κ(y_i − z) ]² dz
                  = (1/m²) Σ_{i,j=1}^{m} k(x_i − x_j) + (1/m²) Σ_{i,j=1}^{m} k(y_i − y_j) − (2/m²) Σ_{i,j=1}^{m} k(x_i − y_j),

  where k(x − y) = ∫ κ(x − z) κ(y − z) dz.

• For f_Q = f_P + δg, the minimum distance to discriminate f_P from f_Q is δ = m^{−1/2} h_m^{−d/2}, where h_m is the width of κ.
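
A numerical check of the identity above (a sketch under the assumption of a 1-D Gaussian Parzen window κ of width h, whose self-convolution k = κ∗κ is a Gaussian of width √2 h):

```python
import numpy as np

def parzen_l2_sq(X, Y, h=0.5):
    """Direct numerical integration of the squared L2 distance between Parzen estimates."""
    z = np.linspace(-10, 10, 20001)
    kappa = lambda d: np.exp(-d**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    fP = kappa(X[:, None] - z[None, :]).mean(axis=0)
    fQ = kappa(Y[:, None] - z[None, :]).mean(axis=0)
    return np.trapz((fP - fQ) ** 2, z)

def mmd_sq_biased(X, Y, h=0.5):
    """Kernel-sum form with k(x - y) = integral of kappa(x - z) kappa(y - z) dz."""
    k = lambda d: np.exp(-d**2 / (4 * h**2)) / (2 * h * np.sqrt(np.pi))
    Kxx = k(X[:, None] - X[None, :])
    Kyy = k(Y[:, None] - Y[None, :])
    Kxy = k(X[:, None] - Y[None, :])
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, 200), rng.normal(0.5, 1, 200)
print(parzen_l2_sq(X, Y), mmd_sq_biased(X, Y))   # the two values should agree closely
```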

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001]

If F is universal, then MMD(P, Q; F) = 0 iff P = Q

Characteristic Kernels (via universality)

Proof:

First, it is clear that P = Q implies MMD(P, Q; F) is zero.

Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F such that

‖f − g‖_∞ ≤ ε.

We next make the expansion

|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.

The first and third terms satisfy

|E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.

Next, write

E_P g(x) − E_Q g(y) = 〈g(·), µ_P − µ_Q〉_F = 0,

since MMD(P, Q; F) = 0 implies µ_P = µ_Q. Hence

|E_P f(x) − E_Q f(y)| ≤ 2ε

for all f ∈ C(X) and ε > 0, which implies P = Q.

References

V. Alba Fernandez, M. Jimenez-Gamero, and J. Munoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
M. Arcones and E. Gine. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.
R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyvaskyla, 1995.
S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.
A. Muller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Scholkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.