IAP workshop, Ghent, Sept. 18th, 2008
Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana
Fanghong Zhang*, Stijn Vansteelandt*,
Olivier Thas*, Marnik Vuylsteke#
* Ghent University # VIB (Flanders Institute for Biotechnology)
Overview
Genetic background
Objectives
Data
Methodology
Results
Conclusions
Genetic background
Regulation of gene expression is affected either in:
- Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
- Trans: affecting the expression of both alleles in a heterozygous individual.
Genetic background
Why search for Cis-regulatory variants?
“low hanging fruit”: window is a small genomic region
Fast screening for markers in LD with expression trait.
How to search for Cis-regulatory variants?
Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006)
- Based on a diallel design, which is widely used in plant breeding to estimate GCA (general combining ability) and SCA (specific combining ability)
Genetic background
What is the GASED approach?
The expression of a gene in the kth offspring of the F1 cross i x j can be written as (c: cis-element, t: trans-element):
$y_{ijk} = \underbrace{c_i + c_i t_i}_{\text{from parent } i} + \underbrace{c_j + c_j t_j}_{\text{from parent } j} + \underbrace{c_i t_j + c_j t_i}_{\text{from both (cross-terms)}} + \varepsilon_{ijk}$
$y_{ijk} = gca_i + gca_j + sca_{ij} + \varepsilon_{ijk}$
In case there is no trans-effect: $sca_{ij} = 0$
In case there is a cis-effect: $gca_i \neq gca_j$
In case homozygous: genotypic variation
A cis-regulatory divergence completely explains the difference between two parental lines
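The decomposition can be checked numerically. The sketch below assumes the noise-free expression model y = (c_i + c_i t_i) + (c_j + c_j t_j) + (c_i t_j + c_j t_i); the function names and all numbers are hypothetical and not taken from the study. It illustrates that without trans effects the hybrid value is purely additive in the parental contributions, so the non-additive (sca-like) part vanishes.

```python
def f1_expression(c_i, t_i, c_j, t_j):
    """Noise-free F1 expression, assuming
    y = (c_i + c_i*t_i) + (c_j + c_j*t_j) + (c_i*t_j + c_j*t_i)."""
    from_parent_i = c_i + c_i * t_i
    from_parent_j = c_j + c_j * t_j
    cross_terms = c_i * t_j + c_j * t_i
    return from_parent_i + from_parent_j + cross_terms


def non_additive_part(c_i, t_i, c_j, t_j):
    """Deviation of the hybrid from the sum of the two parental contributions."""
    additive = (c_i + c_i * t_i) + (c_j + c_j * t_j)
    return f1_expression(c_i, t_i, c_j, t_j) - additive


# Hypothetical values: no trans effect, different cis effects.
cis_only = non_additive_part(c_i=2.0, t_i=0.0, c_j=0.5, t_j=0.0)
# Hypothetical values: a trans factor from parent j acts on both alleles.
with_trans = non_additive_part(c_i=2.0, t_i=0.0, c_j=0.5, t_j=0.3)
```

In the first case the non-additive part is zero; in the second it equals the cross-terms c_i t_j + c_j t_i.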
Objectives of this study
Using mixed model analysis to discover cis-regulated Arabidopsis genes:
Based on the GASED approach, partition the F1 hybrid genotypic variation for mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
Carry out systematic surveys of cis-regulatory variation to identify “superior alleles”.
Flow chart
Data contains all expressed genes (25527 genes)
Step I: Choose genes with significant genotypic variation: $\sigma^2_{genotype} > 0$
Step II: Choose genes from Step I with no trans-regulatory variation: $\sigma^2_{sca_{ij}} = 0$
Step III: Choose genes from Step II displaying significant allelic imbalance due to cis-regulatory variation: $gca_i \neq gca_j$
Step IV: Choose genes from Step III showing significant association with the identified haplotype blocks: $\beta_{SNP_i} \neq 0$
Data
Data acquisition:
1) Scan the arrays
2) Quantitate each spot
3) Subtract noise from background
4) Normalize
5) Export table
Data for us to analyze
Methodology - Step I
Mixed-Model Equations
Gene X: expression values = FIXED effects + RANDOM effect + Residual
Full model: $y_{klnm} = \mu + dye_k + replicate_l + genotype_n + array_m + error_{klnm}$
Reduced model: $y_{klnm} = \mu + dye_k + replicate_l + array_m + error_{klnm}$
error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$; array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
genotype ~ $N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
K = 55 x 55 marker-based relatedness matrix, calculated as $1 - d_R$; $d_R$ = Rogers’ distance (Rogers, 1972; Reif et al., 2005)
Methodology - Step I
Mixed-Model Equations
K = 55 x 55 marker-based relatedness matrix:
$d_R = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\frac{1}{2}\sum_{j=1}^{n_i}(p_{ij} - q_{ij})^2}$, $d_R \in [0, 1]$ (Rogers (1972); Reif et al. (2005))
$d_R(F_1, P_1) = d_R(F_1, P_2) = d_R(P_1, P_2)/2$ (Melchinger et al. (1991))
$p_{ij}$ and $q_{ij}$ are allele frequencies of the jth allele at the ith locus
$n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
m refers to the number of loci (i.e. m = 210,205)
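As an illustration of the relatedness matrix, here is a minimal sketch of Rogers' distance and K = 1 - d_R in plain Python. The function names and the four-locus toy data are hypothetical; the real analysis uses 55 genotypes and 210,205 loci. The example also checks the F1 property that the distance from a hybrid to either homozygous parent is half the distance between the parents.

```python
import math


def rogers_distance(p, q):
    """Rogers' distance between two lines.

    p and q are lists over loci; each entry lists the allele
    frequencies at that locus (same allele order in p and q).
    """
    m = len(p)
    total = 0.0
    for p_locus, q_locus in zip(p, q):
        squared = sum((pa - qa) ** 2 for pa, qa in zip(p_locus, q_locus))
        total += math.sqrt(0.5 * squared)
    return total / m


def relatedness_matrix(lines):
    """K[a][b] = 1 - d_R(line a, line b)."""
    return [[1.0 - rogers_distance(a, b) for b in lines] for a in lines]


# Hypothetical biallelic data at 4 loci: homozygous parents P1, P2 and their F1.
P1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
P2 = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
F1 = [[(a + b) / 2 for a, b in zip(l1, l2)] for l1, l2 in zip(P1, P2)]

d_parents = rogers_distance(P1, P2)  # the parents differ at 2 of the 4 loci
d_f1_p1 = rogers_distance(F1, P1)
d_f1_p2 = rogers_distance(F1, P2)
K = relatedness_matrix([P1, P2, F1])
```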
Methodology - Step I
Gene X: $H_0: \sigma^2_g = 0$ vs $H_a: \sigma^2_g > 0$
Likelihood ratio test (REML): LRT ~ $0.5\chi^2(0) + 0.5\chi^2(1)$, giving a p-value
Multiple testing correction: 25527 genes, adjusted q-value (FDR)
FDR: false discovery rate. How many of the called positives are false? A 5% FDR means that 5% of the calls are false positives.
John Storey et al. (2002): q-value to represent FDR
$qval(t) = \frac{\hat{\pi}_0 \, m \, t}{\#\{pval \le t\}}$
where $\hat{\pi}_0$ estimates the proportion of features that are truly null.
We use an adjusted q-value to represent FDR.
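The p-value of a likelihood ratio test against the 0.5χ²(0) + 0.5χ²(1) mixture can be sketched with the standard library only. This is an illustration, not the authors' code: the function names are hypothetical, and it assumes the convention p = 0.5 · P(χ²(1) > LRT), under which LRT = 0 gives p = 0.5.

```python
import math


def chi2_1_sf(x):
    """Upper tail P(chi2_1 > x) for x >= 0, via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_lrt_pvalue(lrt):
    """p-value for H0: variance component = 0 under 0.5*chi2(0) + 0.5*chi2(1).

    Assumed convention: p = 0.5 * P(chi2_1 > LRT), so LRT = 0 gives 0.5.
    """
    lrt = max(lrt, 0.0)  # guard against tiny negative values from the optimiser
    return 0.5 * chi2_1_sf(lrt)


p_zero = mixture_lrt_pvalue(0.0)
p_at_95 = mixture_lrt_pvalue(3.841458820694124)  # 95% quantile of chi2(1)
```

The second call returns about 0.025, half of the 0.05 that a plain χ²(1) reference distribution would give.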
Methodology - Step I
Multiple testing correction
Storey et al. estimate $\hat{\pi}_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1):
$qvalue(t) = \frac{\hat{\pi}_0 \, m \, t}{\#\{pvalue \le t\}}, \quad t \in (0, 1)$
We estimate $\hat{\pi}_{0\_adj} = m_0/m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and 50% are exactly 0.5:
$adjusted\_qvalue(t) = \frac{\hat{\pi}_{0\_adj} \, m \, t}{\#\{pvalue \le t\}}, \quad t \in (0, 0.5)$
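The q-value at a threshold t, π0 · m · t / #{p ≤ t}, can be sketched as follows. The function name and the toy p-values are hypothetical, and the estimate of π0 is taken as an input because the slides do not spell out how the adjusted estimate is computed; capping the result at 1 is an added convention.

```python
def q_value(pvalues, t, pi0):
    """Estimated FDR when every p-value <= t is called significant:
    pi0 * m * t / #{p <= t}, capped at 1 (the cap is an assumed convention)."""
    m = len(pvalues)
    called = sum(1 for p in pvalues if p <= t)
    if called == 0:
        return 0.0  # nothing is called, so there are no false discoveries
    return min(1.0, pi0 * m * t / called)


# Hypothetical p-values: three small ones, the rest at or below 0.5.
pvals = [0.001, 0.002, 0.003, 0.2, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5]
fdr_at_001 = q_value(pvals, t=0.01, pi0=0.7)
```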
Methodology - Step II
Mixed-Model Equations
Gene X: expression values = FIXED effects + RANDOM effect + Residual
Full model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + sca_{ij} + array_m + error_{klijm}$
Reduced model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$
$\Sigma_{genotype} = K\sigma^2_g = K(\sigma^2_{gca_i} + \sigma^2_{gca_j} + \sigma^2_{sca_{ij}})$
$L\,(I_{10}\,\sigma^2_{gca},\ I_{45}\,\sigma^2_{sca})\,L^{T}$
L is the Cholesky decomposition
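Step II refers to a Cholesky decomposition L. As an illustration only, the sketch below assumes L is the lower-triangular factor of the relatedness matrix K, so that K = L L^T; the function name and the 3 x 3 matrix are made up for the example.

```python
import math


def cholesky(K):
    """Lower-triangular L with K = L L^T (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


# Hypothetical relatedness matrix for two parents and their F1.
K = [[1.0, 0.5, 0.75],
     [0.5, 1.0, 0.75],
     [0.75, 0.75, 1.0]]
L = cholesky(K)

# Multiply L by its transpose to confirm the factorisation reproduces K.
rebuilt = [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
           for i in range(3)]
```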
Methodology - Step II
Gene X: $H_0: \sigma^2_{sca} = 0$ vs $H_a: \sigma^2_{sca} > 0$
Likelihood ratio test (REML): LRT ~ $0.5\chi^2(0) + 0.5\chi^2(1)$, giving a p-value
Multiple testing correction: 20976 genes, qa-value (FNR)
FNR: false non-discovery rate (Genovese et al., 2002). How many of the called negatives are false? A 5% FNR means that 5% of the called negatives are false negatives.
Since we are interested in selecting genes without an $sca_{ij}$ effect (the called negatives), we control the FNR instead of the FDR.
We use the qa-value to represent FNR.
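Methodology - Step II
Multiple testing correction
False non-discovery rate (FNR):
$FNR = E\left[\frac{T}{m - R} \,\middle|\, (m - R) > 0\right]\Pr(m - R > 0)$
$qaval(t) = 1 - \frac{\hat{\pi}_0 \, m \, (1 - t)}{\#\{pval > t\}}$
$\hat{\pi}_0$ is the estimate of the proportion of features that are truly null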
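The qa-value at a threshold t can be sketched in the same way as the q-value. The sketch assumes the form 1 - π0 · m · (1 - t) / #{p > t}, that is, one minus the estimated share of true nulls among the non-rejected genes. The function name and the toy p-values are hypothetical, π0 is taken as given, and flooring at 0 is an added convention.

```python
def qa_value(pvalues, t, pi0):
    """Estimated FNR when every p-value > t is called a negative:
    1 - pi0 * m * (1 - t) / #{p > t}, floored at 0 (assumed convention)."""
    m = len(pvalues)
    negatives = sum(1 for p in pvalues if p > t)
    if negatives == 0:
        return 0.0  # no negatives are called, so none can be false
    return max(0.0, 1.0 - pi0 * m * (1.0 - t) / negatives)


# Hypothetical p-values: 8 of the 10 exceed the threshold 0.2.
pvals = [0.01, 0.05, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
fnr_at_02 = qa_value(pvals, t=0.2, pi0=0.5)
```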
Methodology - Step III
Mixed-Model Equations
Model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$
Gene X: g1 = g2? g1 = g3? g1 = g4? ... g1 = g10? g2 = g3? g2 = g4? g2 = g5? ... g2 = g10? ... g9 = g10?
Test 45 pairs?
Two-sample dependent t-test on $gca_i$ and $gca_j$:
$t_{standard} = \frac{\hat{g}_1 - \hat{g}_2}{SE(\hat{g}_1 - \hat{g}_2)}$
$t_{non\text{-}standard} = \frac{(\hat{g}_1 - \hat{g}_2) - (g_1 - g_2)}{SE\left((\hat{g}_1 - \hat{g}_2) - (g_1 - g_2)\right)}$
$\hat{g}_1$ is the BLUP of $g_1$, $\hat{g}_2$ is the BLUP of $g_2$
Non-standard p-value
The true null p-values are not uniformly distributed on (0, 1)
Methodology - Step III
Multiple testing correction
Gene X: $H_0: gca_i = gca_j$ vs $H_a: gca_i \neq gca_j$
Two-sample t-test on the BLUPs
Simulate the H0 distribution from the real data: simulation-based p-value
1380 genes, q-value (FDR)
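A simulation-based p-value compares the observed statistic with statistics generated under H0. The slides do not describe how the null statistics were simulated, so the sketch below takes them as an input; the function name, the two-sided comparison and the +1 correction are assumptions made for the illustration.

```python
def simulation_pvalue(t_obs, t_null):
    """Two-sided empirical p-value of t_obs against simulated null statistics.

    The +1 terms keep the p-value strictly positive (an assumed convention).
    """
    exceed = sum(1 for t in t_null if abs(t) >= abs(t_obs))
    return (1 + exceed) / (1 + len(t_null))


# Hypothetical null statistics, standing in for the simulated H0 distribution.
null_stats = [-2.5, -1.0, -0.4, 0.1, 0.3, 0.8, 1.2, 1.9, 3.1]
p_extreme = simulation_pvalue(3.0, null_stats)   # only 3.1 is at least as extreme
p_typical = simulation_pvalue(0.2, null_stats)   # all but 0.1 are at least as extreme
```

Because the reference distribution is empirical, the resulting p-values do not rely on the t-statistic following its textbook distribution, which is the concern raised for tests on BLUPs.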
Methodology - Step IV
Mixed-Model Equations
Gene X (cis-regulated): tag SNPs SNP1, SNP2, SNP3, ..., SNPi along the chromosome
Expression values = FIXED effects + RANDOM effect + Residual
Full model: $y_{klim} = \mu + dye_k + replicate_l + \sum_i \beta_{SNP_i} \cdot SNP_i + genotype_i + array_m + error_{klim}$
Reduced model: $y_{klim} = \mu + dye_k + replicate_l + genotype_i + array_m + error_{klim}$
genotype ~ $N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
K = 55 x 55 marker-based relatedness matrix.
array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$; error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$
Methodology - Step IV
Gene X (cis-regulated): $H_0: \beta_{SNP_1} = \beta_{SNP_2} = \dots = \beta_{SNP_i} = 0$ vs $H_a$: at least one $\beta_{SNP} \neq 0$
Likelihood ratio test (ML): LRT ~ $\chi^2(2n)$, where n is the number of SNPs, giving a p-value
Multiple testing correction: 836 genes, q-value (FDR)
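Because the reference distribution has an even number of degrees of freedom (2n), its upper tail has a closed form and the p-value can be sketched without external libraries. The function names and the log-likelihood values are hypothetical; the sketch only illustrates the χ²(2n) reference distribution stated on the slide.

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi2_df > x) for even df, using the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term            # adds (x/2)^k / k!
        term *= half / (k + 1)   # next term of the series
    return math.exp(-half) * total


def snp_lrt_pvalue(loglik_full, loglik_reduced, n_snps):
    """p-value of LRT = 2 * (logL_full - logL_reduced) against chi2(2 * n_snps)."""
    lrt = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return chi2_sf_even_df(lrt, 2 * n_snps)


# Hypothetical ML log-likelihoods for a gene with one tag SNP.
p_one_snp = snp_lrt_pvalue(loglik_full=-100.0, loglik_reduced=-103.0, n_snps=1)
```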
Results
Data contains all expressed genes (25527 genes)
Step I: $\sigma^2_{genotype} > 0$, adjusted q-value < 0.0005: 20979 genes
Step II: $\sigma^2_{sca_{ij}} = 0$, adjusted qa-value < 0.01: 1328 genes
Step III: $gca_i \neq gca_j$, q-value < 0.01: 972 genes
Step IV: $\beta_{SNP_i} \neq 0$, q-value < 0.01: 859 genes
Results
Among all 25527 genes, 20979 genes have significant genotypic variation (q-value < 0.0005). (Step I)
Among these 20979 genes, 1328 genes show no trans-regulatory effect (qa-value < 0.01). (Step II)
Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
We confirm the discovery of these 972 cis-regulated genes in Step IV:
an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP), in LD, that controls expression;
indeed, 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.
Conclusions
The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (the p-values are more reliable).
In each step, statistical significance is controlled more accurately (through the FDR and the FNR).
Using simulation-based p-values when testing differences between random effects increases the power to detect association.
A comprehensive analysis of gene expression variation in plant populations has been described.
This mixed-model analysis strategy provides a detailed characterization of both the genetic and the positional effects in the genome.
This detailed statistical analysis provides a robust and useful framework for the future analysis of gene expression variation in large samples.
Advanced statistical methods look promising for making interesting discoveries in genetics.
Many thanks for your attention!