Diffusion kernels on SNP data embedded in a non-Euclidean metric

Animal Breeding & Genomics Seminar

Gota Morota

April 10, 2012

1 / 37

Kernel functions

Definition: A kernel is a weighting function that provides a similarity measure.

1. define a function that measures distance (a metric) for genotypes

2. compute a similarity based on this metric space

⇓

a function of a distance under a certain metric space, $f(\|x - x'\|)$

• Euclidean distance

• Manhattan distance

• Mahalanobis distance

• Minkowski distance

2 / 37

Metric (Distance function)

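Definition: A function which defines a distance between two points.

If one picks the Euclidean metric, the Matérn covariance function offers flexible kernels:

$$K(x, x') = \sigma_K^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{\|x - x'\|}{h} \right)^{\nu} K_\nu\!\left( \sqrt{2\nu}\, \frac{\|x - x'\|}{h} \right)$$

where $K_\nu$ is the modified Bessel function of the second kind.

• Gaussian kernel: $\nu = \infty$, $\exp(-\theta \|x - x'\|^2)$

• Exponential kernel: $\nu = 1/2$, $\exp(-\theta \|x - x'\|)$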

⇓

A choice of a metric determines characteristics of a kernel

3 / 37

Euclidean Metric

Definition: The distance function given by the Pythagorean theorem ($a^2 + b^2 = c^2$).

Euclidean distance on $\mathbb{R}^2$

$x_i = (x_{i1}, x_{i2})$, $x_j = (x_{j1}, x_{j2})$

$$\|x_i - x_j\| = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}$$

Figure 1: Euclidean distance between two points A and B

Euclidean distance on $\mathbb{R}^p$

$$\|x_i - x_j\| = \sqrt{(x_{i1} - x_{j1})^2 + \cdots + (x_{ik} - x_{jk})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

4 / 37

Euclidean space

Euclidean distance is a metric on a metric space called Euclidean space.

Figure 2: 3-dimensional Euclidean space, $-\infty < X, Y, Z < \infty$

Suppose we observed two individuals with 3 SNP genotypes.

• ID1 = x1 = (0,2,2)

• ID2 = x2 = (2,1,0)

Euclidean distance on $\mathbb{R}^3$

$$\|x_1 - x_2\| = \sqrt{(0 - 2)^2 + (2 - 1)^2 + (2 - 0)^2} = 3$$
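As a quick check of this arithmetic, the sketch below computes the Euclidean distance between the two example genotype vectors and, for comparison, their Manhattan distance. The function names are illustrative and not taken from the talk.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length genotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance: sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


id1 = (0, 2, 2)
id2 = (2, 1, 0)

print(euclidean(id1, id2))  # 3.0
print(manhattan(id1, id2))  # 5
```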

5 / 37

Metric on graphs

A graph consists of vertices and edges.

Figure 3: 3-dimensional genotype grid. The three axes are the 1st, 2nd and 3rd genotype, each taking the values 0, 1, 2.

6 / 37

Metric on graphs (continued)

Figure 4: The same 3-dimensional genotype grid (axes: 1st, 2nd and 3rd genotype, values 0, 1, 2) with its vertices labeled by genotype triples, e.g. (1,0,0), (1,1,1), (2,2,2).

7 / 37

Metric on graphs (continued)

Two individuals with 3 SNP genotypes, as previously shown.

• ID1 = x1 = (0,2,2)

• ID2 = x2 = (2,1,0)

Figure 5: The two genotypes (0,2,2) and (2,1,0) marked on the 3-dimensional genotype grid.

8 / 37

The purpose of this study

1. Is the Euclidean distance adequate for genotypes?

2. The metric on graphs seems to be given by the Manhattan distance, but how can the degree of similarity be expressed?

• Embed SNP data in a non-Euclidean metric space

• Define a metric for discrete genotypes on graphs and construct a kernel on this metric

⇓

Develop a kernel that is suited for all kinds of kernel-based genomic analyses

9 / 37

Diffusion on one-dimensional graphs ($\mathbb{Z}_3^1$)

We have three possible genotypes: 0 ('aa'), 1 ('Aa') and 2 ('AA').

    0 − 1 − 2        (1)

    0 − 1
     \ /
      2              (2)

1. Graph (1): path graph

• genotype 1's ('Aa') influence diffuses to genotypes 0 ('aa') and 2 ('AA')

• genotype 0's ('aa') influence diffuses only to genotype 1 ('Aa')

• genotype 2's ('AA') influence diffuses only to genotype 1 ('Aa')

2. Graph (2): complete graph

• the distance from genotype 0 ('aa') to genotype 2 ('AA') is the same as that from 0 ('aa') to 1 ('Aa')

• it is more reasonable to assume that genotype 'Aa' is closer than 'aa' to 'AA', which has two copies of the 'A' allele

• genotype 0 ('aa') requires two mutations to become genotype 2 ('AA'), while genotype 1 ('Aa') requires only one mutation

10 / 37

Diffusion on two-dimensional graphs ($\mathbb{Z}_3^2$)

Two-dimensional graphs are given by the Cartesian graph product (□) of two one-dimensional graphs 0 − 1 − 2.

0 − 1 − 2 □ 0 − 1 − 2    (3)

Let Γ1 and Γ2 be two graphs. Consider a graph with vertex set V(Γ1) × V(Γ2), with vertices (x, y) and (x′, y′), where x, x′ ∈ V(Γ1) and y, y′ ∈ V(Γ2).

Cartesian graph product: The Cartesian graph product connects two vertices (x, y) and (x′, y′) if and only if x = x′, y ∼ y′ or y = y′, x ∼ x′, where “∼” means connected.

11 / 37

Example of the Cartesian graph product (□)

Cartesian graph product of the two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

First, list all possible configurations of vertices

02 12 22

01 11 21

00 10 20

12 / 37

Example of the Cartesian graph product (□) (continued)

Cartesian graph product of the two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

The Cartesian graph product connects two vertices (x, y) and (x′, y′) if and only if x = x′, y ∼ y′ or y = y′, x ∼ x′, where “∼” means connected.

• 0 = 0, 0 ∼ 1 → connected

• 0 = 0, 1 ∼ 2 → connected

    02   12   22            02   12   22
                            |
    01   11   21     ⇒      01   11   21
                            |
    00   10   20            00   10   20

13 / 37

Example of the Cartesian graph product (□) (continued)

Cartesian graph product of the two one-dimensional graphs

0 − 1 − 2 □ 0 − 1 − 2

• 0 = 0, 0 ∼ 1 → connected

• 0 ≠ 1, 0 ≠ 1 → not connected

• 0 ≠ 1, 0 ≠ 2 → not connected

    02   12   22            02   12   22
    |                       |
    01   11   21     ⇒      01   11   21
    |                       |
    00   10   20            00 − 10   20

14 / 37

Diffusion on two-dimensional graphs ($\mathbb{Z}_3^2$) (continued)

A graph from the Cartesian graph product between path graphs of any size takes the form of a grid.

    02 − 12 − 22
    |    |    |
    01 − 11 − 21
    |    |    |
    00 − 10 − 20

A SNP grid of p loci is a p-dimensional grid with vertices in $\mathbb{Z}_3^p$, with two vertices x and x′ adjacent if and only if

$$\sum_{i=1}^{p} |x_i - x'_i| = 1,$$

i.e., two vertices are adjacent if and only if exactly one SNP locus differs by 1.
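The adjacency rule can be tried out directly. The following minimal sketch enumerates all genotype vectors for p loci and keeps the pairs whose coordinate-wise absolute differences sum to 1. The function name `snp_grid_edges` is hypothetical, and brute-force enumeration is only feasible for very small p.

```python
from itertools import combinations, product


def snp_grid_edges(p):
    """Return the edges of the p-dimensional SNP grid on genotypes {0, 1, 2}."""
    vertices = list(product(range(3), repeat=p))
    return [
        (u, v)
        for u, v in combinations(vertices, 2)
        # adjacent iff the coordinate-wise absolute differences sum to 1
        if sum(abs(a - b) for a, b in zip(u, v)) == 1
    ]


edges_2d = snp_grid_edges(2)
print(len(edges_2d))                  # 12 edges in the 3 x 3 grid
print(((0, 0), (1, 0)) in edges_2d)   # True: one locus differs by 1
print(((0, 0), (1, 1)) in edges_2d)   # False: two loci differ
```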

15 / 37

Diffusion on three-dimensional graphs ($\mathbb{Z}_3^3$)

Cartesian graph product of three one-dimensional graphs.

0 − 1 − 2 □ 0 − 1 − 2 □ 0 − 1 − 2

In general, the p-dimensional SNP grid graph is $\square_{i=1}^{p} \Gamma$, where Γ = 0 − 1 − 2.

Figure 6: The 3-dimensional SNP grid graph, with axes 1st, 2nd and 3rd genotype (values 0, 1, 2) and vertices labeled by genotype triples.

16 / 37

Graph Laplacians

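The Laplacian of the graph 0 − 1 − 2 is

$$L(\Gamma) = -A(\Gamma) + \Lambda = -\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

where A is the adjacency matrix and Λ is a diagonal matrix with $\Lambda_{ii} = \sum_{j=1}^{n} A_{ij}$.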
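A minimal sketch of this construction, assuming the graph is given as an adjacency matrix stored as a list of lists (the helper name `laplacian` is illustrative):

```python
def laplacian(adjacency):
    """Graph Laplacian L = -A + Lambda, where Lambda holds the row sums of A."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


# Adjacency matrix of the path graph 0 - 1 - 2
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

L = laplacian(A)
for row in L:
    print(row)
# [1, -1, 0]
# [-1, 2, -1]
# [0, -1, 1]
```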

17 / 37

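Graph Laplacians (continued)

The Laplacian of the graph 0 − 1 − 2 □ 0 − 1 − 2 is a square matrix of dimension $3^2 \times 3^2$. Rows and columns are indexed by the vertices 00, 01, 02, 10, 11, 12, 20, 21, 22.

$$L(\Gamma) = \begin{array}{c|rrrrrrrrr} & 00 & 01 & 02 & 10 & 11 & 12 & 20 & 21 & 22 \\ \hline 00 & 2 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\ 01 & -1 & 3 & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\ 02 & 0 & -1 & 2 & 0 & 0 & -1 & 0 & 0 & 0 \\ 10 & -1 & 0 & 0 & 3 & -1 & 0 & -1 & 0 & 0 \\ 11 & 0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0 \\ 12 & 0 & 0 & -1 & 0 & -1 & 3 & 0 & 0 & -1 \\ 20 & 0 & 0 & 0 & -1 & 0 & 0 & 2 & -1 & 0 \\ 21 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 3 & -1 \\ 22 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 2 \end{array}$$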
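This 9 × 9 matrix does not have to be written out by hand. The sketch below builds it from the path-graph Laplacian using the standard identity for Cartesian products, L(Γ1 □ Γ2) = L(Γ1) ⊗ I + I ⊗ L(Γ2) (a Kronecker sum). The identity is not stated on the slide, and the helper names are illustrative.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


L1 = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]  # Laplacian of 0 - 1 - 2
I3 = identity(3)

# Kronecker sum: vertex (x, y) sits at row/column index 3 * x + y
L2 = add(kron(L1, I3), kron(I3, L1))

print([L2[i][i] for i in range(9)])  # [2, 3, 2, 3, 4, 3, 2, 3, 2]
print(L2[0])                         # [2, -1, 0, -1, 0, 0, 0, 0, 0]
```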

18 / 37

Diffusion on graphs at time t

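• $k_{\mathbf{x}}$ is a function which measures the spread of the 'influence' of the genotype $\mathbf{x}$ over other genotypes.

• $k_{\mathbf{x}}(0, x) = 1_{x = \mathbf{x}}$ at time 0, i.e., all of the influence initially sits on the source genotype $\mathbf{x}$.

• define the time-t diffusion of the 'influence' of genotype $\mathbf{x}$ on genotype $x$ to be

$$k_{\mathbf{x}}(t, x) = k_{\mathbf{x}}(t-1, x) + \sum_{|x - x'| = 1} \alpha \left( k_{\mathbf{x}}(t-1, x') - k_{\mathbf{x}}(t-1, x) \right)$$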

19 / 37

Diffusion on graphs at time t (continued)

$$k_{\mathbf{x}}(t, x) = k_{\mathbf{x}}(t-1, x) + \sum_{|x - x'| = 1} \alpha \left( k_{\mathbf{x}}(t-1, x') - k_{\mathbf{x}}(t-1, x) \right)$$

• x = (0, 1, 2) is the genotype code, α = (0.1, 0.2) is the diffusion rate.

• $k_{\mathbf{x}}(t, x)$ is the time-t diffusion of the influence of the source genotype $\mathbf{x}$ (subscript) on genotype $x$ (argument).

Source genotype 1, α = 0.1

              x = 0    x = 1    x = 2
k1(0, x)      0        1        0
k1(1, x)      0.1      0.8      0.1
k1(2, x)      0.17     0.66     0.17
k1(3, x)      0.219    0.562    0.219
k1(15, x)     0.331    0.336    0.331

Source genotype 1, α = 0.2

              x = 0    x = 1    x = 2
k1(0, x)      0        1        0
k1(1, x)      0.2      0.6      0.2
k1(2, x)      0.28     0.44     0.28
k1(3, x)      0.312    0.376    0.312
k1(15, x)     0.333    0.333    0.333

Source genotype 2, α = 0.2

              x = 0    x = 1    x = 2
k2(0, x)      0        0        1
k2(1, x)      0        0.2      0.8
k2(2, x)      0.04     0.28     0.68
k2(3, x)      0.088    0.312    0.600
k2(15, x)     0.316    0.333    0.351
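The recursion is easy to run by hand or in code. Below is a minimal sketch for the path graph 0 − 1 − 2, starting from source genotype 1 with α = 0.1; the function name `diffuse` and the neighbor dictionary are illustrative choices.

```python
def diffuse(source, alpha, steps):
    """Apply the discrete diffusion update `steps` times on the path graph 0 - 1 - 2."""
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    # time 0: all influence sits on the source genotype
    k = [1.0 if x == source else 0.0 for x in range(3)]
    for _ in range(steps):
        # each genotype gains alpha * (neighbor value - own value) per neighbor
        k = [
            k[x] + sum(alpha * (k[n] - k[x]) for n in neighbors[x])
            for x in range(3)
        ]
    return k


for t in range(4):
    print(t, [round(v, 3) for v in diffuse(1, 0.1, t)])
# 0 [0.0, 1.0, 0.0]
# 1 [0.1, 0.8, 0.1]
# 2 [0.17, 0.66, 0.17]
# 3 [0.219, 0.562, 0.219]
```

The total influence stays equal to 1 at every step, and the values approach 1/3 each as t grows.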

20 / 37

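Diffusion on graphs at time t (continued)

Writing in vector form, with $k_{\mathbf{x}}(t, x) = [k_{\mathbf{x}}(t)]_x$, we get

$$k_{\mathbf{x}}(t) = k_{\mathbf{x}}(t-1) + \alpha H k_{\mathbf{x}}(t-1) = (I + \alpha H)\, k_{\mathbf{x}}(t-1) = (I + \alpha H)^t\, k_{\mathbf{x}}(0)$$

• H is the negative of the graph Laplacian

• in order to make 'time' continuous, let α = θh (θ > 0) and t = 1/h

• by using a small h, we obtain a fine discretization of the 'diffusion time'

$$\lim_{h \to 0} (I + \theta h H(\Gamma))^{1/h} = \exp(\theta H) = \sum_{k=0}^{\infty} \frac{\theta^k}{k!} H^k = I + \theta H + \frac{\theta^2}{2} H^2 + \frac{\theta^3}{3!} H^3 + \cdots + \frac{\theta^n}{n!} H^n + \cdots$$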
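The limit can be checked numerically. The sketch below compares (I + θhH)^{1/h} for h = 1/n against a truncated power series for exp(θH) on the path graph 0 − 1 − 2. All helper names, the choice θ = 0.5, and the 40-term truncation are assumptions made for illustration.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm_series(a, terms=40):
    """Truncated power series I + A + A^2/2! + ... (fine for small matrices)."""
    result = identity(len(a))
    term = identity(len(a))
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, a)]  # A^k / k!
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def euler_power(H, theta, n):
    """(I + (theta / n) H)^n, i.e. n diffusion steps with alpha = theta * h, h = 1 / n."""
    size = len(H)
    step = [
        [(1.0 if i == j else 0.0) + theta / n * H[i][j] for j in range(size)]
        for i in range(size)
    ]
    result = identity(size)
    for _ in range(n):
        result = matmul(result, step)
    return result


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


H = [[-1, 1, 0], [1, -2, 1], [0, 1, -1]]  # negative Laplacian of 0 - 1 - 2
theta = 0.5
exact = expm_series([[theta * v for v in row] for row in H])

errors = {n: max_abs_diff(euler_power(H, theta, n), exact) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(n, err)  # the gap shrinks roughly in proportion to h = 1 / n
```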

21 / 37

Diffusion kernels

Definition: Suppose a graph Γ has graph Laplacian L(Γ). Then exp(θH(Γ)), or equivalently exp(−θL(Γ)), is called the diffusion kernel or heat kernel for the graph Γ, where θ is the rate of diffusion.

Putting K = exp(θH) and taking the derivative with respect to θ gives

$$\frac{d}{d\theta} K = HK \qquad (4)$$

which is a diffusion equation (heat equation) on a graph with H = −L(Γ).

22 / 37

Gaussian kernels

Definition: A Gaussian kernel is a space-continuous diffusion kernel.

• in order to make 'space' continuous, we create an infinite number of 'fake' genotypes between and outside of 0 and 2

• i.e., consider genotypes such as 1.23 or −10.5

• each genotype x is connected to only two genotypes, x + dx and x − dx, for some infinitesimal dx

• H becomes an infinite matrix, and H(x, x′) is −2 for x′ = x and 1 for x′ = x + dx or x′ = x − dx

$$H(\Gamma) = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$

⇒ Infinite matrix with diagonal elements equal to −2, 1 for its neighbors, and 0 otherwise

23 / 37

Gaussian kernels (continued)

• a vector of genotypes: $x = (-\infty, \cdots, x - dx, x, x + dx, \cdots, \infty)$

• an influence function: $f = (f(-\infty), \cdots, f(x - dx), f(x), f(x + dx), \cdots, f(\infty))$

• Approximating dx by h and dividing H by $h^2$, $Hf^T/h^2$ indexed by the genotype x will be

$$\frac{1}{h^2}\left[H(x, \cdot) f^T\right] = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} = \frac{\frac{f(x+h) - f(x)}{h} - \frac{f(x) - f(x-h)}{h}}{h} \approx f''(x)$$

• Thus, with space continuity, H acts like $\frac{d^2}{dx^2}$. Using this analogy back in (4), we get the heat equation

$$\frac{d}{d\theta} K_\theta(x) = \frac{d^2}{dx^2} K_\theta(x)$$
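The claim that a row of H/h² acts like a second derivative can be illustrated with a small numerical sketch. The test function (sine), the evaluation point x = 1 and the step sizes are arbitrary choices for illustration.

```python
import math


def second_difference(f, x, h):
    """Row of H / h^2 applied to f: weights 1, -2, 1 on x + h, x, x - h."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


x = 1.0
exact = -math.sin(x)  # second derivative of sin is -sin

for h in (0.1, 0.01, 0.001):
    approx = second_difference(math.sin, x, h)
    print(h, approx, abs(approx - exact))  # the gap shrinks as h decreases
```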

24 / 37

Gaussian kernels (continued)

• The solution to this partial differential equation (PDE) with a Dirac delta initial condition concentrated at x = 0, $k_0(x) = 1_{x=0}$, is given by

$$G_\theta(x) = \frac{1}{\sqrt{4\pi\theta}} \exp\left(-\frac{x^2}{4\theta}\right)$$

• This is a Gaussian density in one-dimensional space with $\theta = \sigma_e^2/2$.

• With the initial condition $K_0(x) = f(x)$, the solution to this PDE is

$$K_\theta(x) = \int_{\mathbb{R}} f(x')\, G_\theta(x - x')\, dx'$$

This kernel $g_\theta(x, x') = G_\theta(x - x')$ is the Gaussian kernel with bandwidth θ.

25 / 37

Gaussian kernels (continued)

• For example, allow the additional genotypes 0.25, 0.50, 0.75, 1.25, 1.50 and 1.75.

• each genotype now takes one of 9 values instead of the 3 values in $\mathbb{Z}_3$

Figure 7: The 3-dimensional genotype grid (axes: 1st, 2nd and 3rd genotype) with the intermediate genotype (0,1.75,2) marked.

26 / 37

Computation of diffusion kernels

Kernel notation:

• $K$: the kernel matrix indexed by the observed covariates

• $\mathcal{K}$: the infinite-dimensional kernel for the Gaussian, and the $3^p \times 3^p$ dimensional kernel for the diffusion kernel

Gaussian kernels

• $\mathcal{K}$: infinite-dimensional kernel

• $K = \exp(-\theta \|x - x'\|^2)$

• we have a closed form for $K$, so there is no need to deal with $\mathcal{K}$

Diffusion kernels

• $\mathcal{K}$: $3^p \times 3^p$ dimensional kernel

• $K$: is there any way to compute $K$ directly so that we don't need to deal with $\mathcal{K}$?

• closed form for $K$?

27 / 37

Computation of diffusion kernels (continued)

Let K1(θ) and K2(θ) be the kernels for the two graphs Γ1 and Γ2. The diffusion kernel for Γ = Γ1 □ Γ2 is

$$K_1(\theta) \otimes K_2(\theta),$$

where ⊗ is the tensor product.

Suppose Γ1 = 0 − 1 − 2 and K(Γ1) is a diffusion kernel on Γ1.

SNP grid graph on p dimensions

• $\square_{i=1}^{p} \Gamma_1$

SNP grid kernel on p dimensions

• $\bigotimes_{i=1}^{p} K(\Gamma_1)$

⇓

We just need to compute K(Γ1) = exp(−θL(Γ1)) and take the p-fold tensor product!
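For p = 2 the factorization can be verified numerically. The sketch below exponentiates the 9 × 9 product-graph Laplacian directly and compares the result with the Kronecker product of two 3 × 3 kernels. It assumes the standard Kronecker-sum form of the product Laplacian and uses a truncated power series for the matrix exponential; the helper names and θ = 0.3 are illustrative.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def kron(a, b):
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def expm_series(a, terms=40):
    """Truncated power series for the matrix exponential."""
    result = identity(len(a))
    term = identity(len(a))
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


theta = 0.3
L1 = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]  # Laplacian of 0 - 1 - 2
I3 = identity(3)

# One-locus kernel K(Gamma_1) = exp(-theta * L1)
K1 = expm_series([[-theta * v for v in row] for row in L1])

# Two-locus kernel computed the long way, from the 9 x 9 Laplacian
L2 = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(kron(L1, I3), kron(I3, L1))]
K2_direct = expm_series([[-theta * v for v in row] for row in L2])

# Two-locus kernel computed as a tensor (Kronecker) product
K2_tensor = kron(K1, K1)

gap = max(abs(x - y) for ra, rb in zip(K2_direct, K2_tensor) for x, y in zip(ra, rb))
print(gap)  # numerically zero
```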

28 / 37

Matrix exponentiation

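Γ1 = 0 − 1 − 2

$$H = -L(\Gamma_1) = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$

We make use of the matrix diagonalization $H = TDT^{-1}$ to obtain

$$K_\theta = \exp(\theta H) = T \exp(\theta D)\, T^{-1} = \frac{1}{6} \begin{pmatrix} e^{-3\theta} + 3e^{-\theta} + 2 & -2e^{-3\theta} + 2 & e^{-3\theta} - 3e^{-\theta} + 2 \\ -2e^{-3\theta} + 2 & 4e^{-3\theta} + 2 & -2e^{-3\theta} + 2 \\ e^{-3\theta} - 3e^{-\theta} + 2 & -2e^{-3\theta} + 2 & e^{-3\theta} + 3e^{-\theta} + 2 \end{pmatrix}$$

Here, exp(θD) reduces to exponentiating each diagonal entry, because D is a diagonal matrix of eigenvalues (0, −1 and −3 for this H).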
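The closed form can be cross-checked against a direct series evaluation of exp(θH), with H the negative Laplacian of 0 − 1 − 2. This is only a numerical sanity check under illustrative names and an arbitrary θ = 0.7, not a derivation.

```python
import math


def closed_form_kernel(theta):
    """One-locus diffusion kernel on 0 - 1 - 2, entries as in the closed form."""
    e3, e1 = math.exp(-3 * theta), math.exp(-theta)
    same_hom = (e3 + 3 * e1 + 2) / 6   # 0 vs 0 and 2 vs 2
    same_het = (4 * e3 + 2) / 6        # 1 vs 1
    one_apart = (-2 * e3 + 2) / 6      # 0 vs 1 and 1 vs 2
    two_apart = (e3 - 3 * e1 + 2) / 6  # 0 vs 2
    return [
        [same_hom, one_apart, two_apart],
        [one_apart, same_het, one_apart],
        [two_apart, one_apart, same_hom],
    ]


def expm_series(a, terms=40):
    """Truncated power series for the matrix exponential of a small matrix."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [
            [sum(term[i][m] * a[m][j] for m in range(n)) / k for j in range(n)]
            for i in range(n)
        ]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


theta = 0.7
H = [[-1, 1, 0], [1, -2, 1], [0, 1, -1]]
series = expm_series([[theta * v for v in row] for row in H])
closed = closed_form_kernel(theta)

gap = max(abs(x - y) for ra, rb in zip(series, closed) for x, y in zip(ra, rb))
print(gap)                                       # numerically zero
print([round(sum(row), 12) for row in closed])   # [1.0, 1.0, 1.0]
```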

29 / 37

Diffusion kernels indexed by the observed covariates (continued)

• let x and x′ be SNP data for p loci, and let $n_s$ be the number of loci for which $|x_i - x'_i| = s$

• let $m_{11}$ be the number of loci for which $x_i = x'_i = 1$, i.e., the number of loci at which the two individuals share the heterozygous state

Using the fact that $n_0 + n_1 + n_2 = p$, and dropping constant factors,

$$K_\theta^{\otimes p}(x, x') \propto (-2e^{-3\theta} + 2)^{n_1} (e^{-3\theta} - 3e^{-\theta} + 2)^{n_2} (e^{-3\theta} + 3e^{-\theta} + 2)^{n_0 - m_{11}} (4e^{-3\theta} + 2)^{m_{11}}$$

$$\propto \frac{(-2e^{-3\theta} + 2)^{n_1} (e^{-3\theta} - 3e^{-\theta} + 2)^{n_2} (4e^{-3\theta} + 2)^{m_{11}}}{(e^{-3\theta} + 3e^{-\theta} + 2)^{n_1 + n_2 + m_{11}}}$$

We obtain a SNP grid kernel.

31 / 37

Example of computing a diffusion kernel

Two individuals with 3 SNP genotypes previously shown.

• ID1 = x1 = (0,2,2)

• ID2 = x2 = (2,1,0)

[Figure: the two genotypes (2,1,0) and (0,2,2) marked on the 3-dimensional genotype grid, with axes 1st, 2nd and 3rd genotype]

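Since, for a single locus i,

$$K_\theta(x_i, x'_i) \propto \begin{cases} -2e^{-3\theta} + 2 & \text{if } |x_i - x'_i| = 1 \\ e^{-3\theta} - 3e^{-\theta} + 2 & \text{if } |x_i - x'_i| = 2 \\ e^{-3\theta} + 3e^{-\theta} + 2 & \text{if } x_i = x'_i \neq 1 \\ 4e^{-3\theta} + 2 & \text{if } x_i = x'_i = 1 \end{cases}$$

and ID1 and ID2 differ by 2, 1 and 2 at the three loci ($n_1 = 1$, $n_2 = 2$, $n_0 = m_{11} = 0$), the similarity between ID1 and ID2 is

$$K_\theta^{\otimes 3}(x, x') \propto \frac{(-2e^{-3\theta} + 2)^{1} (e^{-3\theta} - 3e^{-\theta} + 2)^{2}}{(e^{-3\theta} + 3e^{-\theta} + 2)^{1+2}}$$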
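A minimal sketch of this computation for arbitrary genotype vectors is given below. It uses the four per-locus cases and the same normalization (division by e^{−3θ} + 3e^{−θ} + 2 at every locus). The function names and the value θ = 1 are illustrative, and this is not the dkDNA implementation.

```python
import math


def locus_weight(a, b, theta):
    """Unnormalized one-locus kernel entry for genotypes a, b in {0, 1, 2}."""
    e3, e1 = math.exp(-3 * theta), math.exp(-theta)
    d = abs(a - b)
    if d == 1:
        return -2 * e3 + 2
    if d == 2:
        return e3 - 3 * e1 + 2
    if a == 1:  # both heterozygous
        return 4 * e3 + 2
    return e3 + 3 * e1 + 2  # both homozygous for the same allele


def snp_grid_kernel(x, y, theta):
    """Normalized SNP grid kernel: product of per-locus weights over the base weight."""
    base = math.exp(-3 * theta) + 3 * math.exp(-theta) + 2
    value = 1.0
    for a, b in zip(x, y):
        value *= locus_weight(a, b, theta) / base
    return value


id1 = (0, 2, 2)
id2 = (2, 1, 0)
theta = 1.0

print(snp_grid_kernel(id1, id2, theta))  # similarity between ID1 and ID2
print(snp_grid_kernel(id1, id1, theta))  # 1.0 for identical homozygous genotypes
```

With tens of thousands of loci the product of many factors below 1 would likely underflow, so summing logarithms of the per-locus ratios is probably the safer design in practice.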

32 / 37

Diffusion kernels for binary genotypes

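Here, $x \in \mathbb{Z}_2^p$

Γ = 0 − 2

$$L(\Gamma) = -H(\Gamma) = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$$

$$K_\theta^{\otimes p}(x, x') \propto \left( \frac{1 - \exp(-2\theta)}{1 + \exp(-2\theta)} \right)^{d(x, x')}$$

where d(x, x′) is the Hamming distance, that is, the number of coordinates at which x and x′ differ.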
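The binary case can be sketched in a few lines. Exponentiating the 2 × 2 Laplacian above (eigenvalues 0 and 2) gives diagonal entries (1 + e^{−2θ})/2 and off-diagonal entries (1 − e^{−2θ})/2, so each differing locus contributes the ratio (1 − e^{−2θ})/(1 + e^{−2θ}) = tanh(θ). The function names and example vectors are illustrative.

```python
import math


def hamming(x, y):
    """Number of coordinates at which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def two_vertex_kernel(theta):
    """exp(-theta * L) for L = [[1, -1], [-1, 1]]."""
    e2 = math.exp(-2 * theta)
    return [[(1 + e2) / 2, (1 - e2) / 2], [(1 - e2) / 2, (1 + e2) / 2]]


def binary_grid_kernel(x, y, theta):
    """Normalized binary grid kernel: per-locus ratio raised to the Hamming distance."""
    ratio = (1 - math.exp(-2 * theta)) / (1 + math.exp(-2 * theta))
    return ratio ** hamming(x, y)


# Genotypes coded 0 / 2 as on the slide
x = (0, 2, 2, 0)
y = (2, 2, 0, 0)
theta = 0.8

print(hamming(x, y))                    # 2
print(binary_grid_kernel(x, y, theta))  # tanh(theta) squared
```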

33 / 37

Applications

A SNP kernel can be used in DNA-based genomic analyses, including

• regressions

• classifications

• kernel association studies

• kernel principal component analyses

Application of the diffusion kernel to real data

• 7902 Holstein bulls (USDA-ARS AIPL)

• 43382 SNPs

34 / 37

Diffusion kernels for different θ

[Figure: four histograms of the kernel elements K(i,i') (x-axis) against frequency (y-axis), one panel each for θ = 10, 11, 12 and 13. In panel order, the K(i,i') axes span about 0.10 to 0.35, 0.45 to 0.70, 0.74 to 0.86 and 0.90 to 0.95.]

Figure 8: Elements of four diffusion kernels based on four different bandwidth parameters (θ).

35 / 37

Conclusion

Diffusion kernels

• various graph structures can be used to represent sets of discrete random variables, such as genotypes

• a diffusion kernel defines the distance between two vertices and projects this information into a more interpretable $\mathbb{R}^n$

• it is computed by matrix exponentiation of the graph Laplacian

• in which scenarios can the Gaussian kernel approximate the diffusion kernel well?

R package ’dkDNA’ will be available on CRAN soon

• SNP grid kernel

• binary grid kernel

• other DNA structures/polymorphisms in future

• written in Fortran

36 / 37

Acknowledgments

• Daniel Gianola

• Grace Wahba

• Masanori Koyama

• Chen Yao

37 / 37
