Local Genomic Relationship Matrices
-
Upload
deniz-akdemir -
Category
Documents
-
view
266 -
download
0
description
Transcript of Local Genomic Relationship Matrices
Locally Epistatic Genetic Relationship Matrices forImproved Genomic Association, Prediction and Selection
Deniz Akdemir
Cornell UniversityPlant Breeding & Genetics
January 24, 2013
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 1 / 42
Semi-Parametric Mixed Model
Selection in animal or plant breeding is usually based on estimates ofgenetic breeding values (GEBV) obtained with semi-parametric mixedmodels (SPMM).
A SPMM for the n × 1 response vector y is expressed as
y = Xβ + Zg + e (1)
where X is the n × p design matrix for the fixed effects, β is a p × 1vector of fixed effects coefficients, Z is the n× q design matrix for therandom effects; the random effects (g′, e′)′ are assumed to follow amultivariate normal distribution with mean 0 and covariance
(
σ2gK 0
0 σ2e In
)
where K is a q × q kernel matrix.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 2 / 42
Kernel Functions
A kernel function, k(., .) maps a pair of input points x and x′ into realnumbers.
A kernel function is by definition symmetric (k(x, x′) = k(x′, x)) andnon-negative.
Given the inputs for the n individuals we can compute a kernel matrixK whose entries are Kij = k(xi , xj).
Linear kernel function: k(x; y) = x′y.
Polynomial kernel function: k(x; y) = (x′y + c)d for c and d ∈ R .
Gaussian kernel function: k(x; y) = exp(−(x′ − y)′(x′ − y)/h) whereh > 0.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 3 / 42
Kernels, Heritability, Breeding value
Taylor expansions of these kernel functions reveal that each of thesekernels correspond to a different feature map.
For the marker based SPMM’s, a genetic kernel matrix calculatedusing a linear kernel matrix incorporates only additive effects ofmarkers.
A genetic kernel matrix based on the polynomial kernel of order kincorporates all of the one to k order monomials of markers in anadditive fashion.
The Gaussian kernel function allows us to implicitly incorporate theadditive and complex epistatic effects of the markers.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 4 / 42
Kernels, Heritability, Breeding value
Simulation studies and results from empirical experiments show thatthe prediction accuracies of models with Gaussian or polynomialkernel are usually better than the models with linear kernel.
However, it is not possible to know how much of the increase inaccuracy can be transfered into better generations because some ofthe predicted epistatic effects that will be lost by recombination.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 5 / 42
Kernels, Heritability, Breeding value
Commercial value of a line: overall genetic effect (additive+epistatic)
Breeding value of a line: potential for being a good parent (additive)
It can be argued that linear kernel model estimates the breeding valuewhere as the Gaussian kernel estimates the commercial value.
The breeder can take advantage of some of the epistatic markereffects in regions of low recombination.
The models introduced here aim to estimate local epistatic lineheritability by using the genetic map information and combine thelocal main and epistatic effects.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 6 / 42
Kernels, Heritability, Breeding value
Heritability is defined as the percentage of total variation that can beexplained by the genotypic component.
One can argue that SPMM’s with linear kernels produce estimates ofnarrow sense line heritability, and the SPMM’s with Gaussian Kernelproduces estimates of broad sense line heritability.
We expect that the estimates of local heritability developed in thispaper to be between narrow and broad sense heritability.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 7 / 42
Multiple Kernel Learning
In recent years, several methods have been proposed to combinemultiple kernel matrices instead of using a single one.
These kernel matrices may correspond to using different notions ofsimilarity or may be using information coming from multiple sources.
For example, genomic kernel + pedigree kernel, chromosome model,linear mixed models with linear covariance structure.
A good review and taxanomy of multiple kernel learning algorithms inthe machine learning literature can be found in Gonen and Alpaydin(2011)
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 8 / 42
Multiple Kernel Learning
Multiple kernel learning methods use multiple kernels by combiningthem into a single one via a combination function.
The most commonly used combination function is linear. Givenkernels K1,K2, . . . ,Kp, a linear kernel is of the form
K = η1K1 + η2K2 + . . .+ ηpKp.
The kernel weights η1, η2, . . . , ηp are usually assumed to be positiveand this corresponds to calculating a kernel in the combined featurespaces of the individual kernels. We will also assume that the weightssum to one.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 9 / 42
Interaction Terms
The components of K are usually input variables from differentsources or different kernels calculated from same input variables.
The kernel K can also include interaction components like Ki ⊙ Kj ,Ki ⊗ Kj , or perhaps −(Ki − Kj)⊙ (Kj − Ki ).
For example, if KE is the environment kernel matrix and KG is thegenetic kernel matrix, then a component KE ⊙ KG can be used tocapture the gene by environment interaction effects.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 10 / 42
Mixed Models with Heteregenous Covariance Structures
Models in Burgueno et al (2007)
(Separating main effects of genotype (x genotype) from genotypegenotype(x genotype) x environment interaction) Data from g genotypes,s sites, and r replicates
y = X + Zg (a+) + Zge(+) + Zr+
a ∼ N(0,Ga); ∼ N(0,Gi ); ∼ N(0,Gae); ∼ N(0,Gie); ∼ N(0,R);∼ N(0,N).R = Σr ⊗ I , N = Σn ⊗ I ; Σr and Σe diagonal. Ga = Σg ⊗ A.Gi = Σi ⊗ A⊙ A. Gae = Σge ⊗ A. Gie = Σie ⊗ A⊙ A.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 11 / 42
Mixed Models with Heteregenous Covariance Structures
Models in Burgueno et al (2007)
Why A⊙ A?”If one assumes no dominance, all terms will vanish except the terms foradditive and additive x additive variances, which will take the formCi ′i = 2fi ′iσ
2a + (2fi ′i )
2σ2aa where fi ′i is the COP between individuals i ′ and
i , σ2a is the additive genetic variance, and σaa is the additive x additive
genetic variance. Assuming linkage and identity equilibrium, it seemsjustified to use (2fi ′i )
2, which in matrix notation can be represented by(A⊙ A) = A as the coefficient of the additive x additive component(where ⊙ is the element-wise multiplication operator) (Falconer andMackay, 1996).”
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 12 / 42
Learning Kernel Weights
Qiu and Lane (2009) propose simple heuristics to select kernelweights in regression problems:
ηm =r2m
∑ph=1 r
2h
and
ηm =
∑ph=1Mh −Mm
(1− p)∑p
h=1Mh
where rm is the Pearson correlation coefficient between true responseand the predicted response and Mm is the mean square errorgenerated by the regression using the kernel matrix Km alone.
Another approach in Qiu and Lane (2009) uses the kernel alignment.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 13 / 42
Multiple Kernels Using Mixed Models 1
In the context of the SPMM’s we propose using weights that areproportional to the estimated variances attributed to the kernels.
One possible approach is to use a SPMM with multiple kernels in theform of
y = Xβ + Z1g1 + Z2g2 + . . .+ Zkgk + e (2)
where gj ∼ Nqk (0, σ2gjKj) for j = 1, 2, . . . , k . Let σ2
gjfor
j = 1, 2, . . . , k and σ2e be the estimated variance components.
Under this model calculate local heritabilities ash2m = σ2
gm/(∑k
ℓ=1 σ2gℓ+ σ2
e ) for m = 1, 2, . . . , k .
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 14 / 42
Multiple Kernels Using Mixed Models 2
Another model incorporates the marginal variance contribution foreach kernel matrix.
For this we use the following SPMM:
y = Xβ + Zjgj + Z−jg−j + e (3)
where gj ∼ Nqk (0, σ2gjKj) for j = 1, 2, . . . , k . Z−j and g
−j correspondto the design matrix and the random effects for the input componentsother than the ones in group j .
In this case calculate local heritabilities ash2m = σ2
gm/(σ2
gm+ σ2
g−m+ σ2
em) for m = 1, 2, . . . , k .
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 15 / 42
Multiple Kernels Using Mixed Models 3
A simpler approach is to use a separate SPMM for each kernel.
Let σ2gm
and σ2em
be the estimated variance components from theSPMM model in (1) with kernel K = Km.
Let h2m = σ2gm/(σ2
gm+ σ2
em).
Note that, in this case, the markers corresponding to the randomeffect g
−j which mainly accounts for the sample structure can now beincorporated by a fixed effects via their first principal components.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 16 / 42
Estimation of Parameters
After local heritabilities are obtained, calculate the kernel weightsη1, η2, . . . , ηp as
ηm =h2m
∑ph=1 h
2h
. (4)
The estimates of parameters for models in (1), (2) and (3) can be bymaximizing the likelihood or the restricted (or residual, or reduced)maximum likelihood (REML).
There are very fast algorithms devised for estimating the parametersof the single kernel model in (1).
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 17 / 42
Local kernels
Regions or groups in the genome and a separate kernel matrix foreach group and region.
The regions can be overlapping or discrete.
If the some markers are associated with each other in terms of linkageor function it might be useful to combine them together.
The whole genome can be divided physically into chromosomes,chromosome arms. Further divisions could be based on recombinationhot-spots, or just merely based on local proximity.
We could calculate a seperate kernel for introns and exons, noncoding, promoter or repressor sequences.
We can also use a grouping of markers based on their effects on lowlevel traits like lipids, metabolites, gene expressions, or based on theirallele frequencies.
When no such guide is present one can use a hierarchical clustering ofthe variables.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 18 / 42
Kernels: Kernel Scanning
This approach is similar to the one in Kizilkaya where thechromosomes are scanned with windows of 5 consecutive markers.
M: q × p matrix of p markers on q lines, which is partitioned withrespect to the chromosomes as (M1,M2, . . . ,Mc) where Mk has pkcolumns.
Cumulative distances based on LD between markers in eachchromosome: pk for k = 1, 2, . . . , c .
Based on pk calculate block diagonal p × p kernel matrix S .
k column of S is sk .
A local kernel matrix Kk at position k involves using diag(sk)1/2M in
kernel matrix calculations.
Kernel scanning approach involves calculation of a kernel matrix forselected marker across the genome at each marker location.
By adjusting the kernel width parameter, we are able to determine thesmoothness and locality of these kernel matrices.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 19 / 42
I shrunk the relationship matrices too!
When the number of markers in a region is less than the number ofindividuals in the training set the kernel matrix for this regionbecomes singular or ill conditioned.
In these cases we can use shrinkage approaches to obtain wellconditioned positive definite kernel.
Shrinkage towards the identity matrix is not suitable to use with theSPMM since this involves allocating a fixed proportion of the errorvariance to the variance of the random effects.
We instead propose and use shrinkage estimators which aim tointroduce sparsity in the off diagonal elements of the kernel matrix.Many algorithms have been devised for learning sparse covariancematrices in the recent years.
Some sparse covariance estimation techniques are implemented in anR package ”huge”.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 20 / 42
Sparse relationship matrix
In addition to possible decrease of computational burden by use ofsparse matrix methods in mixed model parameter estimation, we canproduce graphical representations of the kernel matrices.
For a normally distributed random vector, the independence betweentwo components is implied by zero covariance between thecomponents, more interestingly, the conditional independencebetween two components is implied by the zero components in theinverse covariance matrix.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 21 / 42
Sparse relationship matrix on graph
Figure: Meinshausen & Buhlmann graph estimationDeniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 22 / 42
Sparse relationship matrix on graph-GBS-Stem rust data
Figure: GBS-Stem RustDeniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 23 / 42
Sparse relationship matrix on graph cap123 FHB DONData
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 24 / 42
Hypothesis Testing 1
Under Model (1), twice the log-likelihood of y given the parameters β, σ2e
and λ = σ2g/σ
2e is, up to a constant,
L(β, σ2e , λ) = −n log σ2
e − log |Vλ| −(y − Xβ)′V−1
λ (y − Xβ)
σ2e
(5)
where Vλ = In + λZKZ ′ and n is the size of the vector y.Twice the residual log-likelihood (Patterson, Harville) is, up to a constant,
RL(β, σ2e , λ) = −(n−p) log σ2
e−log |Vλ|−log(X ′VλX )−(y − X βλ)
′V−1λ
(y − X βλ)
σ2e
(6)
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 25 / 42
Hypothesis Testing 2
From both of these likelihoods, a test statistic for the significance of thevariance component
H0 : λ = 0 (σ2g = 0)
HA : λ > 0 (σ2g > 0)
can be obtained by calculating the likelihood ratio statistic (Cox andHinkley 1974).
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 26 / 42
Hypothesis Testing 3
Under standard regularity conditions the null distribution of thelikelihood ratio test statistic has a χ2 distribution with degrees offreedom given by the difference in the number of parameters betweenthe null and alternative hypothesis.
However, Self and Liang (1987) showed that the asymptoticdistribution is a weighted mixture of χ2 distributions. For the SPMMin (1) they have recommended using a equally weighted mixture ofχ2(0) and χ2(1) distribution where χ2(0) distribution refers to adistribution degenerate at 0.
In simulation studies Pinheiro and Bates (2000) Morrell (1998) foundthat equal contributions work well with residual log-likelihood whereas a 0.65 to 0.35 mixture works better for the log-likelihood.
A finite sample null distribution was recommended for the SPMM in(1) in Ruppert (2003) and it was shown that the mixture proportionsdepended on the kernel matrix.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 27 / 42
Testing Scheme: Hierarchical testing scheme
Relevant regions are divided further subregions and the procedure isrepeated to a desired detail level.
Nevertheless, suitable hierarchical testing procedures have beendeveloped. Blanchard and Geman (2005) proposed and analyzedseveral hierarchical designs in terms of their cost / power properties.Multiple testing procedures where coarse to fine hypotheses are testedsequentially have been proposed to control the family wise error rateor false discovery rate (Benjamini and Yekutieli (2003), Meinshausen(2007)).
These procedures can be used along the ”keep rejecting until firstacceptance” scheme to test hypotheses in an hierarchy.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 28 / 42
Meinshausen’s hierarchical testing procedure
Meinshausen’s hierarchical testing procedure controls the family wiseerror by adjusting the significance levels of single tests in thehierarchy.
The procedure starts testing the root node H0 at level α.
When a parent hypothesis is rejected one continues with testing allthe child nodes of that parent.
The significance level to be used at each node H is adjusted by afactor proportional to the number of variables in that node:
αH = α|H|
|H0|
where |.| denotes the cardinality of a set.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 29 / 42
Meinshausen’s hierarchical testing procedure
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 30 / 42
Sim
ulated
Exam
ple
050
100150
200
0.0 0.2 0.4 0.6 0.8 1.0
Index
outhieri$hypothesistests[(Nparts + 1 + 1):length(outhieri$hypothesistests)]
12
3
0.50 0.55 0.60 0.65 0.70 0.75 0.80
Den
izAkdem
ir(P
lantBreed
ing&
Gen
etics)Locally
Epista
ticGen
eticRela
tionship
Matrices
January
24,2013
31/42
4 9 16 25 36 49 64 81 100 121 144 169 196 225 linear
0.4
0.5
0.6
0.7
0.8
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 32 / 42
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29
0.00
0.02
0.04
0.06
0.08
0.10
Figure: Local kernel weights for non overlapping sectionsDeniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 33 / 42
0 500 1000 1500 2000 2500 3000
0.00
0.05
0.10
0.15
0.20
1:length(Hkv)
Hkv
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 34 / 42
Balancing short and long term gains
While we may have confidence that GS can accelerate short-termgain, no such confidence is justified for long-term gain.
Weighted GS model of Jean Luc was used so that markers for whichthe favorable allele had a low frequency should be weighted moreheavily to avoid losing such alleles.
We recommend using the similar approach which aims to conserverare but favorable alleles for balancing short-term and long-term gainsfrom GS.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 35 / 42
Balancing short and long term gains
Let gji for regions j = 1, 2, . . . , k and individuals i = 1, 2, . . . ,N bethe EBLUPs of random effect components that correspond to the k
local kernels and let pji denote the estimated density value of thealleles in region j for individual i .
A selection criterion in the spirit of Goddart and Jean Luc for the ithindividual is then
ci =k
∑
j=1
gji
1 +√
pji. (7)
However, this criterion also down weights favorable but commonalleles.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 36 / 42
Balancing short and long term gains
A weighting scheme that up weights both rare and common favorablealleles.Let gj(n) be the maximum EBLUP value among the individuals for region j
and let ηj be the weight of the kernel j . The selection criterion
ci =k
∑
j=1
ηj2πh1sd(gj)h2sd(pj)
exp {−1
2
[
(gji − gj(n))2
h21var(gj)+
p2ji
h22var(pj)
]
} (8)
where the constants h1 and h2 are selected by the breeder based onpreferences given to short term and long term gains correspondingly.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 37 / 42
Balancing short and long term gains
3 380 709 33 31 39 8 92 70 73 49 4 790 94 46 93 419 5 347 69 19 84 1 423 54
Kernel scanning
−20
10
3 550 214 889 541 129 292 143 964 95 65 270 66 118 614 86 916 60 67 1 311 226
Kernel scanning with density weighting
−10
010
304 769 226 954 792 727 777 663 146 285 127 200 535 828 54 945 800
Kernel scanning (best 50)
010
25
186 411 226 117 285 957 849 663 194 828 610 673 773 671 764 341 800
Kernel scanning with density weighting (best 50)
04
8
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 38 / 42
Final remarks
Knowledge transfer between experiments
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 39 / 42
Robust to Missing Data
The local kernels use information collected over a region in the genomeand thus will not be effected by a few missing or outlier data points so thisapproach is also robust to missing data and outliers.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 40 / 42
Chromosomal Breeding/ gene transplant
Breed good chromosomes, combine them.
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 41 / 42
Chromosomal Breeding/ gene transplant
Deniz Akdemir (Plant Breeding & Genetics)Locally Epistatic Genetic Relationship Matrices January 24, 2013 42 / 42