Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels on SNP data embedded in a non-Euclidean metric Animal Breeding & Genomics Seminar Gota Morota April 10, 2012 1 / 37


Presented at the Animal Breeding & Genomics Seminar. University of Wisconsin-Madison.

Transcript of Diffusion kernels on SNP data embedded in a non-Euclidean metric

Page 1: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels on SNP data embedded in a non-Euclidean metric

Animal Breeding & Genomics Seminar

Gota Morota

April 10, 2012

1 / 37

Page 2: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Kernel functions

Definition: A kernel is a weighting function that provides a similarity measure.

1. Define a function that measures distance (a metric) between genotypes.

2. Compute a similarity based on this metric space.

The kernel is a function of a distance under a certain metric space, $f(\|x - x'\|)$.

• Euclidean distance

• Manhattan distance

• Mahalanobis distance

• Minkowski distance

2 / 37

Page 3: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Metric (Distance function)

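Definition: A function that defines a distance between two points.

If one picks the Euclidean metric, the Matérn covariance function offers flexible kernels:

$$K(x, x') = \sigma_K^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{\|x - x'\|}{h} \right)^{\nu} K_\nu\!\left( \sqrt{2\nu}\, \frac{\|x - x'\|}{h} \right)$$

where $K_\nu$ is the modified Bessel function of the second kind.

• Gaussian kernel: $\nu = \infty$, $\exp(-\theta \|x - x'\|^2)$

• Exponential kernel: $\nu = 1/2$, $\exp(-\theta \|x - x'\|)$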

A choice of a metric determines characteristics of a kernel

3 / 37

Page 4: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Euclidean Metric

Definition: The distance function given by the Pythagorean theorem ($a^2 + b^2 = c^2$).

Euclidean distance on $\mathbb{R}^2$

$x_i = (x_{i1}, x_{i2})$, $x_j = (x_{j1}, x_{j2})$

$$\|x_i - x_j\| = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}$$

Figure 1: Euclidean distance between two points A and B

Euclidean distance on $\mathbb{R}^p$

$$\|x_i - x_j\| = \sqrt{(x_{i1} - x_{j1})^2 + \cdots + (x_{ik} - x_{jk})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

4 / 37

Page 5: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Euclidean space

Euclidean distance is a metric on a metric space called Euclidean space.

Figure 2: 3-dimensional Euclidean space, $-\infty < X, Y, Z < \infty$

Suppose we observe two individuals with 3 SNP genotypes.

• ID1 = x1 = (0,2,2)

• ID2 = x2 = (2,1,0)

Euclidean distance on $\mathbb{R}^3$

$$\|x_1 - x_2\| = \sqrt{(0 - 2)^2 + (2 - 1)^2 + (2 - 0)^2} = 3$$
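The distance for the two example individuals can be checked with a short Python sketch. The function names `euclidean` and `manhattan` are illustrative, not from the slides; the Manhattan distance is included only because it is one of the candidate distances listed earlier.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two genotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute per-locus differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

id1 = (0, 2, 2)
id2 = (2, 1, 0)
print(euclidean(id1, id2))  # 3.0
print(manhattan(id1, id2))  # 5
```

The two distances already disagree on this pair (3 versus 5), which shows how much the choice of metric affects the resulting similarity.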

5 / 37

Page 6: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Metric on graphs

A graph consists of vertices and edges.

Figure 3: [three-dimensional grid graph; axes: 1st, 2nd and 3rd Genotype, each with values 0, 1, 2]

6 / 37

Page 7: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Metric on graphs (continued)

Figure 4: [three-dimensional grid graph with axes 1st, 2nd and 3rd Genotype (values 0, 1, 2); vertices are labeled with their genotype triples, e.g. (1,0,0), (1,1,1) and (2,2,2)]

7 / 37

Page 8: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Metric on graphs (continued)

Two individuals with 3 SNP genotypes previously shown.

• ID1 = x1 = (0,2,2), ID2 = x2 = (2,1,0)

Figure 5: [three-dimensional grid graph with axes 1st, 2nd and 3rd Genotype (values 0, 1, 2); the vertices (2,1,0) and (0,2,2) are marked]

8 / 37

Page 9: Diffusion kernels on SNP data embedded in a non-Euclidean metric

The purpose of this study

1. Is the Euclidean distance adequate for genotypes?

2. The metric on graphs seems to be given by the Manhattan distance, but how should the degree of similarity be expressed?

• Embed SNP data in a non-Euclidean metric space

• Define a metric for discrete genotypes on graphs and construct a kernel on this metric

Develop a kernel that is suited to all kinds of kernel-based genomic analyses

9 / 37

Page 10: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on one-dimensional graphs ($\mathbb{Z}_3^1$)

We have three possible genotypes: 0 ('aa'), 1 ('Aa') and 2 ('AA').

Graph (1), path graph:

0 − 1 − 2

Graph (2), complete graph:

0 − 1
 \ /
  2

1. Graph (1), path graph
• the influence of genotype 1 ('Aa') diffuses to genotypes 0 ('aa') and 2 ('AA')
• the influence of genotype 0 ('aa') diffuses only to genotype 1 ('Aa')
• the influence of genotype 2 ('AA') diffuses only to genotype 1 ('Aa')

2. Graph (2), complete graph
• the distance from genotype 0 ('aa') to genotype 2 ('AA') is the same as that from 0 ('aa') to 1 ('Aa')

• it is more reasonable to assume that genotype 'Aa' is closer than 'aa' to 'AA', which has two copies of the 'A' allele

• genotype 0 ('aa') requires two mutations to become genotype 2 ('AA'), while genotype 1 ('Aa') requires only one mutation

10 / 37

Page 13: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on two-dimensional graphs ($\mathbb{Z}_3^2$)

Two-dimensional graphs are given by the Cartesian graph product (□) of two one-dimensional graphs 0 − 1 − 2.

0 − 1 − 2 □ 0 − 1 − 2    (3)

Let Γ1 and Γ2 be two graphs. Consider a graph with vertex set V(Γ1) × V(Γ2), with x, x′ ∈ V(Γ1) and y, y′ ∈ V(Γ2).

Cartesian graph product: The Cartesian graph product connects two vertices (x, y) and (x′, y′) if and only if x = x′, y ∼ y′ or y = y′, x ∼ x′, where “∼” means connected.

11 / 37

Page 14: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Example of the Cartesian graph product (□)

Cartesian graph product of two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

First, list all possible configurations of vertices:

02 12 22

01 11 21

00 10 20

12 / 37

Page 15: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Example of the Cartesian graph product (□) (continued)

Cartesian graph product of two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

The Cartesian graph product connects two vertices (x, y) and (x′, y′) if and only if x = x′, y ∼ y′ or y = y′, x ∼ x′, where “∼” means connected.

• 00 and 01: 0 = 0, 0 ∼ 1 → connected
• 01 and 02: 0 = 0, 1 ∼ 2 → connected

Before:

02   12   22

01   11   21

00   10   20

After:

02   12   22
|
01   11   21
|
00   10   20

13 / 37

Page 18: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Example of the Cartesian graph product (□) (continued)

Cartesian graph product of two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

The Cartesian graph product connects two vertices (x, y) and (x′, y′) if and only if x = x′, y ∼ y′ or y = y′, x ∼ x′, where “∼” means connected.

• 00 and 10: 0 = 0, 0 ∼ 1 → connected
• 00 and 11: 0 ≠ 1, 0 ≠ 1 → not connected
• 00 and 12: 0 ≠ 1, 0 ≠ 2 → not connected

Before:

02   12   22
|
01   11   21
|
00   10   20

After:

02   12   22
|
01   11   21
|
00 − 10   20

14 / 37

Page 21: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on two-dimensional graphs ($\mathbb{Z}_3^2$) (continued)

A graph obtained from the Cartesian graph product of path graphs of any size takes the form of a grid.

02 − 12 − 22
|    |    |
01 − 11 − 21
|    |    |
00 − 10 − 20

A SNP grid of p loci is a p-dimensional grid with vertices in $\mathbb{Z}_3^p$, with two vertices x and x′ adjacent if and only if

$$\sum_{i=1}^{p} |x_i - x'_i| = 1,$$

i.e., two vertices are adjacent if and only if exactly one SNP locus differs by 1.
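The equivalence between the Cartesian-product rule and the sum-of-absolute-differences rule can be checked by brute force. The sketch below is only an illustration for p = 3; all function and variable names are made up for this example.

```python
from itertools import product

PATH_EDGES = {(0, 1), (1, 2)}  # the path graph 0 - 1 - 2

def path_adjacent(a, b):
    """True if genotypes a and b are joined by an edge of the path graph."""
    return (a, b) in PATH_EDGES or (b, a) in PATH_EDGES

def product_adjacent(x, y):
    """Cartesian-product rule: exactly one coordinate moves along a path
    edge and every other coordinate is unchanged."""
    differing = [i for i in range(len(x)) if x[i] != y[i]]
    return len(differing) == 1 and path_adjacent(x[differing[0]], y[differing[0]])

def l1_adjacent(x, y):
    """SNP-grid rule: sum_i |x_i - y_i| = 1."""
    return sum(abs(a - b) for a, b in zip(x, y)) == 1

p = 3
vertices = list(product(range(3), repeat=p))
edges = [(x, y) for x in vertices for y in vertices
         if x < y and product_adjacent(x, y)]
rules_agree = all(product_adjacent(x, y) == l1_adjacent(x, y)
                  for x in vertices for y in vertices)
print(len(vertices), len(edges), rules_agree)  # 27 54 True
```

For p = 3 this gives the 27 vertices of the SNP grid and 54 edges (18 along each of the three axes).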

15 / 37

Page 22: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on three-dimensional graphs ($\mathbb{Z}_3^3$)

Cartesian graph product of three one-dimensional graphs.

0 − 1 − 2 □ 0 − 1 − 2 □ 0 − 1 − 2

In general, the p-dimensional SNP grid graph is $\square_{i=1}^{p} \Gamma$, where Γ = 0 − 1 − 2.

Figure 6: [three-dimensional SNP grid graph; axes: 1st, 2nd and 3rd Genotype, each with values 0, 1, 2; vertices are labeled with their genotype triples, e.g. (1,0,0), (1,1,1) and (2,2,2)]

16 / 37

Page 23: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Graph Laplacians

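The Laplacian of the graph 0 − 1 − 2 is

$$L(\Gamma) = -A(\Gamma) + \Lambda = -\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

where A is the adjacency matrix and Λ is a diagonal matrix with $\Lambda_{ii} = \sum_{j=1}^{n} A_{ij}$.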
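As a small illustration, the Laplacian can be built directly from an adjacency matrix. The helper name `laplacian` is hypothetical; the matrix is stored as nested Python lists to keep the sketch dependency-free.

```python
def laplacian(adjacency):
    """Return L = Lambda - A, where Lambda is the diagonal matrix of row sums of A."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]

# Adjacency matrix of the path graph 0 - 1 - 2.
A_path = [[0, 1, 0],
          [1, 0, 1],
          [0, 1, 0]]
L_path = laplacian(A_path)
for row in L_path:
    print(row)
# [1, -1, 0]
# [-1, 2, -1]
# [0, -1, 1]
```

Every row of a graph Laplacian sums to zero, which is a quick sanity check on any hand-built matrix.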

17 / 37

Page 24: Diffusion kernels on SNP data embedded in a non-Euclidean metric

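Graph Laplacians (continued)

The Laplacian of the graph 0 − 1 − 2 □ 0 − 1 − 2 is a square matrix of dimension $3^2 \times 3^2$. With rows and columns ordered 00, 01, 02, 10, 11, 12, 20, 21, 22:

$$L(\Gamma) = \begin{pmatrix}
2 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 3 & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & -1 & 2 & 0 & 0 & -1 & 0 & 0 & 0 \\
-1 & 0 & 0 & 3 & -1 & 0 & -1 & 0 & 0 \\
0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0 \\
0 & 0 & -1 & 0 & -1 & 3 & 0 & 0 & -1 \\
0 & 0 & 0 & -1 & 0 & 0 & 2 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & -1 & 3 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 2
\end{pmatrix}$$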
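The 9 × 9 matrix does not have to be written out by hand. A standard fact about Cartesian graph products is that the Laplacian of the product is the Kronecker sum of the factor Laplacians, L ⊗ I + I ⊗ L. The sketch below (helper names are illustrative) rebuilds the matrix that way, with vertices ordered 00, 01, 02, 10, ..., 22.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def add(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

L1 = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]  # Laplacian of 0 - 1 - 2
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Kronecker sum: the first coordinate varies slowest in the vertex order.
L2 = add(kron(L1, I3), kron(I3, L1))
print([L2[i][i] for i in range(9)])  # [2, 3, 2, 3, 4, 3, 2, 3, 2]
print(L2[0])                         # [2, -1, 0, -1, 0, 0, 0, 0, 0]
```

The diagonal reproduces the vertex degrees of the 3 × 3 grid: 2 at the corners, 3 on the sides and 4 at the center.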

18 / 37

Page 25: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on graphs at time t

• $k_x$ is a function that measures the spread of the 'influence' of genotype x over the other genotypes.

• At time 0, $k_x(0, x') = 1_{x' = x}$.

• Define the time-t diffusion of the 'influence' of genotype x on genotype x′ to be

$$k_x(t, x') = k_x(t-1, x') + \sum_{x'' :\, |x' - x''| = 1} \alpha \left( k_x(t-1, x'') - k_x(t-1, x') \right)$$

19 / 37

Page 26: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on graphs at time t (continued)

$$k_x(t, x') = k_x(t-1, x') + \sum_{x'' :\, |x' - x''| = 1} \alpha \left( k_x(t-1, x'') - k_x(t-1, x') \right)$$

• x′ ∈ {0, 1, 2} is the genotype code; α ∈ {0.1, 0.2} is the diffusion rate.

• $k_x(t, x')$ is the time-t diffusion of the influence of genotype x on genotype x′.

α = 0.1         x′ = 0   x′ = 1   x′ = 2
k1(0, x′)       0        1        0
k1(1, x′)       0.1      0.8      0.1
k1(2, x′)       0.17     0.66     0.17
k1(3, x′)       0.219    0.562    0.219
k1(15, x′)      0.331    0.336    0.331

α = 0.2         x′ = 0   x′ = 1   x′ = 2
k1(0, x′)       0        1        0
k1(1, x′)       0.2      0.6      0.2
k1(2, x′)       0.28     0.44     0.28
k1(3, x′)       0.312    0.376    0.312
k1(15, x′)      0.333    0.333    0.333

α = 0.2         x′ = 0   x′ = 1   x′ = 2
k2(0, x′)       0        0        1
k2(1, x′)       0        0.2      0.8
k2(2, x′)       0.04     0.28     0.68
k2(5, x′)       0.171    0.330    0.498
k2(18, x′)      0.324    0.333    0.342
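The recursion is easy to iterate by hand or in code. The following sketch (the function name `diffuse` and the neighbour table are illustrative) applies it on the path graph 0 − 1 − 2 and reproduces the first rows for genotype 1 with α = 0.1.

```python
def diffuse(k0, alpha, steps):
    """Iterate k(t) = k(t-1) + alpha * sum over neighbours (k_neighbour - k_self)
    on the path graph 0 - 1 - 2, starting from the vector k0."""
    neighbours = {0: [1], 1: [0, 2], 2: [1]}
    k = list(k0)
    for _ in range(steps):
        k = [k[x] + alpha * sum(k[nb] - k[x] for nb in neighbours[x])
             for x in range(3)]
    return k

# Influence of genotype 1 (start vector (0, 1, 0)) with alpha = 0.1.
for t in (1, 2, 3):
    print(t, [round(v, 3) for v in diffuse([0, 1, 0], 0.1, t)])
# 1 [0.1, 0.8, 0.1]
# 2 [0.17, 0.66, 0.17]
# 3 [0.219, 0.562, 0.219]
```

The total influence stays equal to 1 at every step, and for large t the vector approaches the uniform value 1/3 on all three genotypes.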

20 / 37

Page 27: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion on graphs at time t (continued)

Writing in vector form, with $k_x(t, x') = [k_x(t)]_{x'}$, we get

$$k_x(t) = k_x(t-1) + \alpha H k_x(t-1) = (I + \alpha H)\, k_x(t-1) = (I + \alpha H)^t\, k_x(0)$$

• H is the negative of the graph Laplacian
• in order to make 'time' continuous, let α = θh (θ > 0) and t = 1/h
• by using a small h, we obtain a fine discretization of the 'diffusion time'

$$\lim_{h \to 0} (I + \theta h H(\Gamma))^{1/h} = \exp(\theta H) = \sum_{k=0}^{\infty} \frac{\theta^k}{k!} H^k = I + \theta H + \frac{\theta^2}{2} H^2 + \frac{\theta^3}{3!} H^3 + \cdots + \frac{\theta^n}{n!} H^n + \cdots$$
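The limit can be observed numerically. This sketch takes n = 1/h diffusion steps of size θ/n on the path graph 0 − 1 − 2 and compares the top-left entry with the closed-form value (e^{−3θ} + 3e^{−θ} + 2)/6 of exp(θH) for this graph. The function names and the choice θ = 0.5 are arbitrary illustrations.

```python
import math

H = [[-1, 1, 0], [1, -2, 1], [0, 1, -1]]  # negative Laplacian of 0 - 1 - 2

def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def discrete_kernel(theta, n):
    """(I + (theta / n) H)^n, i.e. n diffusion steps with h = 1 / n."""
    step = [[(1.0 if i == j else 0.0) + theta / n * H[i][j] for j in range(3)]
            for i in range(3)]
    out = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(n):
        out = matmul(out, step)
    return out

theta = 0.5
exact_00 = (math.exp(-3 * theta) + 3 * math.exp(-theta) + 2) / 6
for n in (1, 10, 100, 1000):
    # The gap to the continuous-time value shrinks as the step size decreases.
    print(n, abs(discrete_kernel(theta, n)[0][0] - exact_00))
```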

21 / 37

Page 28: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels

Definition: Suppose a graph Γ has graph Laplacian L(Γ). Then exp(θH(Γ)), or equivalently exp(−θL(Γ)), is called the diffusion kernel or heat kernel for graph Γ, where θ is the rate of diffusion.

Putting K = exp(θH) and taking the derivative with respect to θ gives

$$\frac{d}{d\theta} K = HK \qquad (4)$$

which is a diffusion equation (heat equation) on a graph with H = −L(Γ).

22 / 37

Page 29: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Gaussian kernels

Definition: A Gaussian kernel is a space-continuous diffusion kernel.

• in order to make 'space' continuous, we create an infinite number of 'fake' genotypes between and outside of 0 and 2
• i.e., consider genotypes such as 1.23 or −10.5
• each genotype x is connected to only two genotypes, x + dx and x − dx, for some infinitesimal dx
• H becomes an infinite matrix, and H(x, x′) is −2 for x′ = x and 1 for x′ = x + dx or x′ = x − dx

$$H(\Gamma) = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$

⇒ an infinite matrix with diagonal elements equal to −2, entries equal to 1 for neighbors, and 0 otherwise

23 / 37

Page 30: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Gaussian kernels (continued)

• a vector of genotypes: $x = (-\infty, \cdots, x - dx, x, x + dx, \cdots, \infty)$

• an influence function: $f = (f(-\infty), \cdots, f(x - dx), f(x), f(x + dx), \cdots, f(\infty))$

• Approximating dx by h and dividing H by $h^2$, the entry of $H f^T / h^2$ indexed by the genotype x is

$$\frac{1}{h^2} [H(x, \cdot) f^T] = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} = \frac{\frac{f(x+h) - f(x)}{h} - \frac{f(x) - f(x-h)}{h}}{h} \approx f''(x)$$

• Thus, with space continuity, H acts like $\frac{d^2}{dx^2}$. Using this analogy back in (4), we get the heat equation

$$\frac{\partial}{\partial \theta} K_\theta(x) = \frac{\partial^2}{\partial x^2} K_\theta(x)$$
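The claim that a row of H divided by h² approximates a second derivative can be tried on any smooth function. The sketch below uses sin, whose second derivative is −sin; the function name `second_difference` and the evaluation point are arbitrary choices for illustration.

```python
import math

def second_difference(f, x, h):
    """[f(x + h) - 2 f(x) + f(x - h)] / h^2, one row of H f^T / h^2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

x = 0.7
for h in (0.5, 0.1, 0.01):
    # The second column approaches the third (the exact value -sin(x)) as h shrinks.
    print(h, second_difference(math.sin, x, h), -math.sin(x))
```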

24 / 37

Page 31: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Gaussian kernels (continued)

• The solution to this partial differential equation (PDE) with a Dirac delta initial condition concentrated at x = 0, $K_0(x) = \delta(x)$, is given by

$$G_\theta(x) = \frac{1}{\sqrt{4\pi\theta}} \exp\left( -\frac{x^2}{4\theta} \right)$$

• This is a Gaussian density in one-dimensional space with $\theta = \sigma_e^2 / 2$.

• With the initial condition $K_0(x) = f(x)$, the solution to this PDE is

$$K_\theta(x) = \int_{\mathbb{R}} f(x')\, G_\theta(x - x')\, dx'$$

This kernel $g_\theta(x, x') = G_\theta(x - x')$ is the Gaussian kernel with bandwidth θ.
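The link between the bandwidth θ and the Gaussian variance can be checked numerically. This sketch assumes the standard heat-kernel form G_θ(x) = exp(−x²/(4θ)) / √(4πθ) and uses a plain midpoint rule; the function names, integration range and θ = 0.8 are illustrative choices.

```python
import math

def heat_kernel(x, theta):
    """G_theta(x) = exp(-x^2 / (4 theta)) / sqrt(4 pi theta)."""
    return math.exp(-x * x / (4 * theta)) / math.sqrt(4 * math.pi * theta)

def midpoint_integral(g, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dx) for i in range(n)) * dx

theta = 0.8
mass = midpoint_integral(lambda x: heat_kernel(x, theta))
variance = midpoint_integral(lambda x: x * x * heat_kernel(x, theta))
# The total mass should be close to 1 and the variance close to 2 * theta.
print(mass, variance, 2 * theta)
```

A variance of 2θ is the same statement as θ = σ²/2.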

25 / 37

Page 32: Diffusion kernels on SNP data embedded in a non-Euclidean metric

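Gaussian kernels (continued)

• For example, allow the additional genotypes 0.25, 0.50, 0.75, 1.25, 1.50 and 1.75.
• Now each coordinate of x takes one of 9 real values instead of the 3 values in $\mathbb{Z}_3$.

Figure 7: [three-dimensional grid with axes 1st, 2nd and 3rd Genotype (values 0, 1, 2), showing the intermediate point (0,1.75,2)]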

26 / 37

Page 33: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Computation of diffusion kernels

Kernel notation:

• K: the kernel matrix indexed by the observed covariates

• 𝒦: the infinite-dimensional kernel for the Gaussian, and the $3^p \times 3^p$ dimensional kernel for the diffusion kernel

Gaussian kernels

• 𝒦: infinite-dimensional kernel

• $K = \exp(-\theta \|x - x'\|^2)$

• we have a closed form for K, so there is no need to deal with 𝒦

Diffusion kernels

• 𝒦: $3^p \times 3^p$ dimensional kernel

• K: is there any way to compute K directly so that we don't need to deal with 𝒦?

• closed form for K?

27 / 37

Page 34: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Computation of diffusion kernels (continued)

Let K1(θ) and K2(θ) be the kernels for the two graphs Γ1 and Γ2. The diffusion kernel for Γ = Γ1 □ Γ2 is

$$K_1(\theta) \otimes K_2(\theta),$$

where ⊗ is the tensor product.

Suppose Γ1 = 0 − 1 − 2 and K(Γ1) is a diffusion kernel on Γ1.

SNP grid graph on p dimensions

• $\square_{i=1}^{p} \Gamma_1$

SNP grid kernel on p dimensions

• $\bigotimes_{i=1}^{p} K(\Gamma_1)$

We just need to compute K(Γ1) = exp(−θL(Γ1)) and take the tensor product p times!
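The tensor-product property can be verified for p = 2 by comparing two routes: exponentiating the 9 × 9 Laplacian of the product graph directly, and taking the Kronecker product of two 3 × 3 kernels. The sketch below is an illustration only: it builds the product Laplacian as a Kronecker sum, uses a truncated Taylor series as a stand-in for a proper matrix exponential routine, and all names and the value θ = 0.3 are arbitrary.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two square matrices."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def expm(a, terms=40):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

theta = 0.3
L1 = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Laplacian of the product graph as the Kronecker sum of the factor Laplacians.
L2 = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(kron(L1, I3), kron(I3, L1))]

K1 = expm([[-theta * v for v in row] for row in L1])
K2_direct = expm([[-theta * v for v in row] for row in L2])
K2_kron = kron(K1, K1)
max_gap = max(abs(K2_direct[i][j] - K2_kron[i][j])
              for i in range(9) for j in range(9))
print(max_gap)  # expected to be at rounding-error level
```

For p loci the direct route would need a 3^p × 3^p matrix, whereas the tensor-product route only ever exponentiates the 3 × 3 matrix.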

28 / 37

Page 35: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Matrix exponentiation

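Γ1 = 0 − 1 − 2

$$H = -L(\Gamma_1) = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$

We make use of the matrix diagonalization $H = T D T^{-1}$ to obtain

$$K_\theta = \exp(\theta H) = T \exp(\theta D) T^{-1} = \frac{1}{6} \begin{pmatrix} e^{-3\theta} + 3e^{-\theta} + 2 & -2e^{-3\theta} + 2 & e^{-3\theta} - 3e^{-\theta} + 2 \\ -2e^{-3\theta} + 2 & 4e^{-3\theta} + 2 & -2e^{-3\theta} + 2 \\ e^{-3\theta} - 3e^{-\theta} + 2 & -2e^{-3\theta} + 2 & e^{-3\theta} + 3e^{-\theta} + 2 \end{pmatrix}$$

Here, exp(θD) reduces to simple componentwise exponentiation of the diagonal entries because D is a diagonal matrix of eigenvalues.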
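The diagonalization can be written out explicitly. For H = −L(0 − 1 − 2) the eigenvalues are 0, −1 and −3, with orthonormal eigenvectors proportional to (1, 1, 1), (1, 0, −1) and (1, −2, 1). The sketch below (function names are illustrative) rebuilds the kernel from these eigenpairs and compares it with the closed form.

```python
import math

# Orthonormal eigenpairs (eigenvalue, eigenvector) of H = -L(0 - 1 - 2).
EIGENPAIRS = [
    (0.0, [1 / math.sqrt(3)] * 3),
    (-1.0, [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)]),
    (-3.0, [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)]),
]

def kernel_spectral(theta):
    """K = T exp(theta D) T^{-1} = sum_k exp(theta * lambda_k) v_k v_k^T."""
    return [[sum(math.exp(theta * lam) * v[i] * v[j] for lam, v in EIGENPAIRS)
             for j in range(3)] for i in range(3)]

def kernel_closed_form(theta):
    """The 3 x 3 closed form, including the 1/6 factor."""
    a, b = math.exp(-3 * theta), math.exp(-theta)
    return [[(a + 3 * b + 2) / 6, (-2 * a + 2) / 6, (a - 3 * b + 2) / 6],
            [(-2 * a + 2) / 6, (4 * a + 2) / 6, (-2 * a + 2) / 6],
            [(a - 3 * b + 2) / 6, (-2 * a + 2) / 6, (a + 3 * b + 2) / 6]]

theta = 0.7
gap = max(abs(kernel_spectral(theta)[i][j] - kernel_closed_form(theta)[i][j])
          for i in range(3) for j in range(3))
print(gap)  # expected to be at rounding-error level
```

Because T is orthogonal here, T⁻¹ is simply the transpose of T, which is why the sum of outer products works.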

29 / 37

Page 36: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels indexed by the observed covariates

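Symmetric property

$$K_\theta(x_i, x'_i) = \frac{1}{6} \begin{cases} -2e^{-3\theta} + 2 & \text{if } |x_i - x'_i| = 1 \\ e^{-3\theta} - 3e^{-\theta} + 2 & \text{if } |x_i - x'_i| = 2 \\ e^{-3\theta} + 3e^{-\theta} + 2 & \text{if } x_i = x'_i \neq 1 \\ 4e^{-3\theta} + 2 & \text{if } x_i = x'_i = 1 \end{cases}$$

Thus,

$$K^{\otimes p}_\theta(x, x') \propto \prod_{i=1}^{p} \Big[ (e^{-3\theta} - 3e^{-\theta} + 2)\,\delta_{|x_i - x'_i| = 2} + (-2e^{-3\theta} + 2)\,\delta_{|x_i - x'_i| = 1} + (e^{-3\theta} + 3e^{-\theta} + 2)\,\delta_{x_i = x'_i \neq 1} + (4e^{-3\theta} + 2)\,\delta_{x_i = x'_i = 1} \Big]$$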

30 / 37

Page 37: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels indexed by the observed covariates (continued)

• let x and x′ be SNP data for p loci, and let $n_s$ be the number of loci for which $|x_i - x'_i| = s$
• let $m_{11}$ be the number of loci for which $x_i = x'_i = 1$, i.e., $m_{11}$ is the number of loci at which the two individuals share the heterozygous state

Using the fact that $n_0 + n_1 + n_2 = p$,

$$K^{\otimes p}_\theta(x, x') \propto (-2e^{-3\theta} + 2)^{n_1} (e^{-3\theta} - 3e^{-\theta} + 2)^{n_2} (e^{-3\theta} + 3e^{-\theta} + 2)^{n_0 - m_{11}} (4e^{-3\theta} + 2)^{m_{11}}$$

$$\propto \frac{(-2e^{-3\theta} + 2)^{n_1} (e^{-3\theta} - 3e^{-\theta} + 2)^{n_2} (4e^{-3\theta} + 2)^{m_{11}}}{(e^{-3\theta} + 3e^{-\theta} + 2)^{n_1 + n_2 + m_{11}}}$$

We obtain a SNP grid kernel.

31 / 37
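A minimal sketch of the normalized SNP grid kernel, counting n1, n2 and m11 from two genotype vectors coded 0/1/2. The function name `snp_grid_kernel` and the variable names are hypothetical; this is not the interface of the 'dkDNA' package.

```python
import math

def snp_grid_kernel(x, y, theta):
    """Normalised SNP grid kernel between two genotype vectors coded 0/1/2."""
    e3, e1 = math.exp(-3 * theta), math.exp(-theta)
    one_step = -2 * e3 + 2        # per-locus value when |x_i - y_i| = 1
    two_steps = e3 - 3 * e1 + 2   # per-locus value when |x_i - y_i| = 2
    same_homo = e3 + 3 * e1 + 2   # per-locus value when x_i = y_i != 1
    same_het = 4 * e3 + 2         # per-locus value when x_i = y_i = 1
    n1 = sum(abs(a - b) == 1 for a, b in zip(x, y))
    n2 = sum(abs(a - b) == 2 for a, b in zip(x, y))
    m11 = sum(a == b == 1 for a, b in zip(x, y))
    return (one_step ** n1 * two_steps ** n2 * same_het ** m11
            / same_homo ** (n1 + n2 + m11))

id1, id2 = (0, 2, 2), (2, 1, 0)
for theta in (0.1, 1.0, 10.0):
    # The similarity of this pair grows towards 1 as theta increases.
    print(theta, snp_grid_kernel(id1, id2, theta))
```

Only three counts per pair of individuals are needed, so the cost is linear in the number of loci and the 3^p × 3^p kernel is never formed.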

Page 38: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Example of computing a diffusion kernel

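Two individuals with 3 SNP genotypes previously shown.

• ID1 = x1 = (0,2,2)

• ID2 = x2 = (2,1,0)

[Figure: three-dimensional grid graph with axes 1st, 2nd and 3rd Genotype (values 0, 1, 2); the vertices (2,1,0) and (0,2,2) are marked]

Since

$$K_\theta(x_i, x'_i) = \frac{1}{6} \begin{cases} -2e^{-3\theta} + 2 & \text{if } |x_i - x'_i| = 1 \\ e^{-3\theta} - 3e^{-\theta} + 2 & \text{if } |x_i - x'_i| = 2 \\ e^{-3\theta} + 3e^{-\theta} + 2 & \text{if } x_i = x'_i \neq 1 \\ 4e^{-3\theta} + 2 & \text{if } x_i = x'_i = 1 \end{cases}$$

and the three loci differ by 2, 1 and 2 (so $n_1 = 1$, $n_2 = 2$, $m_{11} = 0$), the similarity between ID1 and ID2 is

$$K^{\otimes 3}_\theta(x, x') \propto \frac{(-2e^{-3\theta} + 2)^{1} (e^{-3\theta} - 3e^{-\theta} + 2)^{2}}{(e^{-3\theta} + 3e^{-\theta} + 2)^{1+2}}$$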

32 / 37

Page 39: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels for binary genotypes

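Here, $x \in \mathbb{Z}_2^p$

Γ = 0 − 2

$$L(\Gamma) = -H(\Gamma) = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$$

$$K^{\otimes p}_\theta(x, x') \propto \left( \frac{1 - \exp(-2\theta)}{1 + \exp(-2\theta)} \right)^{d(x, x')}$$

where d(x, x′) is the Hamming distance, that is, the number of coordinates at which x and x′ differ.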
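For the two-vertex graph, exp(−θL) with L = [[1, −1], [−1, 1]] has diagonal entries (1 + e^{−2θ})/2 and off-diagonal entries (1 − e^{−2θ})/2, so the per-locus ratio of off-diagonal to diagonal is (1 − e^{−2θ})/(1 + e^{−2θ}) = tanh θ. The sketch below uses that ratio; the function names are illustrative, and the genotypes are coded 0 and 2 as in the graph 0 − 2.

```python
import math

def hamming(x, y):
    """Number of coordinates at which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def binary_grid_kernel(x, y, theta):
    """Normalised diffusion kernel on the p-fold product of the two-vertex graph."""
    ratio = (1 - math.exp(-2 * theta)) / (1 + math.exp(-2 * theta))  # tanh(theta)
    return ratio ** hamming(x, y)

x, y = (0, 2, 2, 0), (2, 2, 0, 0)
print(hamming(x, y), binary_grid_kernel(x, y, 0.5))
```

Identical vectors get similarity 1, and each additional mismatching locus multiplies the similarity by tanh θ.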

33 / 37

Page 40: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Applications

A SNP kernel can be used in DNA-based genomic analyses including

• regressions

• classifications

• kernel association studies

• kernel principal component analyses

Application of using the diffusion kernel on real data

• 7902 Holstein bulls (USDA-ARS AIPL)

• 43382 SNPs

34 / 37

Page 41: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Diffusion kernels for different θ

Figure 8: Elements of four diffusion kernels based on four different bandwidth parameters (θ). [Four histograms of K(i,i') against frequency, one panel each for θ = 10, 11, 12 and 13; the K(i,i') axes span roughly 0.10 to 0.35, 0.45 to 0.70, 0.74 to 0.86 and 0.90 to 0.95, respectively]

35 / 37

Page 42: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Conclusion

Diffusion kernels

• various graph structures can be used to represent sets of discrete random variables, such as genotypes

• define the distance between two vertices, and project this information into a more interpretable $\mathbb{R}^n$

• are obtained by matrix exponentiation of the graph Laplacian

• in which scenarios can the Gaussian kernel approximate the diffusion kernel well?

R package ’dkDNA’ will be available on CRAN soon

• SNP grid kernel

• binary grid kernel

• other DNA structures/polymorphisms in future

• written in Fortran

36 / 37

Page 45: Diffusion kernels on SNP data embedded in a non-Euclidean metric

Acknowledgments

• Daniel Gianola

• Grace Wahba

• Masanori Koyama

• Chen Yao

37 / 37