Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana

IAP workshop, Ghent, Sept. 18th, 2008

Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#

* Ghent University; # VIB (Flanders Institute for Biotechnology)



Overview

Genetic background

Objectives

Data

Methodology

Results

Conclusions


Genetic background

Regulatory variation affects gene expression in one of two ways:

- Cis: affecting the expression of only one of the two alleles in a heterozygous individual;

- Trans: affecting the expression of both alleles in a heterozygous individual.


Genetic background

Why search for Cis-regulatory variants?

“low hanging fruit”: window is a small genomic region

Fast screening for markers in LD with expression trait.

How to search for Cis-regulatory variants?

Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006)

- Based on a diallel design, which is widely used in plant breeding to estimate GCA (general combining ability) and SCA (specific combining ability)


Genetic background

What is the GASED approach?

The expression of a gene in the kth F1 offspring of the cross i × j can be written as (c: cis-element, t: trans-element):

$$y_{ijk} = \underbrace{c_i + c_i t_i}_{\text{from parent } i} + \underbrace{c_j + c_j t_j}_{\text{from parent } j} + \underbrace{c_i t_j + c_j t_i}_{\text{from both (cross-terms)}} + \varepsilon_{ijk}$$

$$y_{ijk} = gca_i + gca_j + sca_{ij} + \varepsilon_{ijk}$$

(genotypic variation: $gca_i + gca_j + sca_{ij}$)

In case there is no trans-effect: $sca_{ij} = 0$

In case there is a cis-effect: $gca_i \neq gca_j$

In the homozygous case:

A cis-regulatory divergence completely explains the difference between the two parental lines
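A minimal Python sketch of the gca/sca reading of a diallel, assuming the expected hybrid expression has the additive form gca_i + gca_j + sca_ij. The parent names, effect values and the function `expected_expression` are hypothetical and only illustrate the two cases above (no trans-effect, cis-effect present).

```python
from itertools import combinations

def expected_expression(gca, sca, i, j):
    """Expected expression of hybrid i x j: gca_i + gca_j + sca_ij."""
    key = (min(i, j), max(i, j))      # crosses are unordered
    return gca[i] + gca[j] + sca.get(key, 0.0)

# Hypothetical effects for three parental lines.
gca = {"A": 1.0, "B": 0.2, "C": -0.5}
sca = {}                              # no trans-effect: every sca_ij is 0

crosses = list(combinations(sorted(gca), 2))
for i, j in crosses:
    cis = gca[i] != gca[j]            # cis-effect: gca_i differs from gca_j
    print(i, j, expected_expression(gca, sca, i, j), "cis-effect" if cis else "no cis-effect")
```

With all sca terms at zero, the difference between two hybrids sharing a parent comes down to the difference between the gca values of the other two parents.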


Objectives of this study

Using mixed model analysis to discover Cis-regulated Arabidopsis genes

Based on the GASED approach, partition the between-F1-hybrid genotypic variation for mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.

Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.

Systematic surveys of cis-regulatory variation to identify “superior alleles”.


Flow chart

Data contains all expressed genes (25527 genes)

Step I: Choose genes with significant genotypic variation: $\sigma^2_{genotype} > 0$

Step II: Choose genes from Step I with no trans-regulatory variation: $\sigma^2_{sca_{ij}} = 0$

Step III: Choose genes from Step II displaying significant allelic imbalance due to cis-regulatory variation: $gca_i \neq gca_j$

Step IV: Choose genes from Step III showing significant association with the haplotype blocks found: $\beta_{SNP_i} \neq 0$
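The four steps act as successive filters on the gene list. The sketch below illustrates that flow in Python, assuming each gene already carries one multiplicity-adjusted value per step. The field names and the toy records are hypothetical; the cut-offs are the ones quoted on the Results slide of this deck.

```python
# Hypothetical field names; one (field, cut-off) pair per step.
STEPS = [
    ("adj_q_genotype", 0.0005),  # Step I: genotypic variation
    ("adj_qa_sca", 0.01),        # Step II: no trans-regulatory variation
    ("q_gca", 0.01),             # Step III: allelic imbalance
    ("q_snp", 0.01),             # Step IV: association with tag SNPs
]

def run_pipeline(genes):
    """Apply the steps in order; return survivors and the count after each step."""
    survivors = list(genes)
    counts = []
    for field, cutoff in STEPS:
        survivors = [g for g in survivors if g[field] < cutoff]
        counts.append(len(survivors))
    return survivors, counts

# Toy records, not real data.
toy_genes = [
    {"id": "g1", "adj_q_genotype": 0.0001, "adj_qa_sca": 0.005, "q_gca": 0.001, "q_snp": 0.002},
    {"id": "g2", "adj_q_genotype": 0.0001, "adj_qa_sca": 0.2, "q_gca": 0.001, "q_snp": 0.002},
    {"id": "g3", "adj_q_genotype": 0.3, "adj_qa_sca": 0.001, "q_gca": 0.001, "q_snp": 0.002},
    {"id": "g4", "adj_q_genotype": 0.0002, "adj_qa_sca": 0.001, "q_gca": 0.5, "q_snp": 0.9},
]
survivors, counts = run_pipeline(toy_genes)
print(counts, [g["id"] for g in survivors])
```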


Data

Data acquisition:

1) Scan the arrays

2) Quantitate each spot

3) Subtract background noise

4) Normalize

5) Export table

Data for us to analyze


Methodology - Step I

Mixed-Model Equations

Gene X: expression values = FIXED effects + RANDOM effect + Residual

Full model: $y_{klnm} = \mu + dye_k + replicate_l + genotype_n + array_m + error_{klnm}$

Reduced model: $y_{klnm} = \mu + dye_k + replicate_l + array_m + error_{klnm}$

$error \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$; $array \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$

$genotype \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$

K = 55 x 55 marker-based relatedness matrix, calculated as $1 - d_R$; $d_R$ = Rogers' distance (Rogers, 1972; Reif et al., 2005)


Methodology - Step I

Mixed-Model Equations

K = 55 x 55 marker-based relatedness matrix:

$$d_R = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\frac{1}{2}\sum_{j=1}^{n_i}\left(p_{ij} - q_{ij}\right)^2}, \qquad d_R \in [0, 1]$$

Rogers (1972); Reif et al. (2005)

$p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus in the two genotypes being compared

$n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)

m refers to the number of loci (i.e. m = 210,205)

$$d_R(F_1, P_1) = d_R(F_1, P_2) = d_R(P_1, P_2)/2$$

Melchinger et al. (1991)
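As an illustration of how one entry of K could be computed, here is a small Python sketch of Rogers' distance in its usual form: the mean over loci of the square root of half the summed squared allele-frequency differences. The function names and the three-locus toy genotypes are assumptions, not taken from the study.

```python
import math

def rogers_distance(p, q):
    """Rogers' distance between two genotypes.

    p, q: one list of allele frequencies per locus, same loci in both.
    """
    m = len(p)
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / m

def relatedness(p, q):
    """Entry of K, taken as 1 - d_R."""
    return 1.0 - rogers_distance(p, q)

# Toy inbred parents at three biallelic loci (hypothetical data).
P1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
P2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# The F1 hybrid carries one allele from each parent at every locus.
F1 = [[(a + b) / 2 for a, b in zip(x, y)] for x, y in zip(P1, P2)]

print(rogers_distance(P1, P2), rogers_distance(F1, P1), relatedness(P1, P2))
```

On this toy input the hybrid lies halfway between its parents, which is the relation attributed to Melchinger et al. (1991) on the slide.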


Methodology - Step I

Gene X: $H_0: \sigma^2_g = 0$ vs $H_a: \sigma^2_g > 0$

Likelihood ratio test (REML): $LRT \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$, giving a p-value

Multiple testing correction

25527 genes: adjusted q-value (FDR)

FDR: false discovery rate. How many of the called positives are false? A 5% FDR means that 5% of the calls are false positives.

Storey et al. (2002): q-value to represent FDR

$$qval(t) = \frac{\hat{\pi}_0\, m\, t}{\#\{pval \le t\}}$$

$\hat{\pi}_0$: estimate of the proportion of features that are truly null

We use an adjusted q-value to represent FDR
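The p-value under the 50:50 mixture of chi-square(0) and chi-square(1) can be computed with the Python standard library alone. This is a sketch, assuming the convention suggested by the deck's later remark that half of the true null p-values are exactly 0.5, i.e. the p-value is half the chi-square(1) tail probability even when the LRT is zero. The function name is hypothetical.

```python
import math

def mixture_pvalue(lrt):
    """p-value for a variance-component LRT under 0.5*chi2(0) + 0.5*chi2(1).

    Uses P(chi2_1 > x) = erfc(sqrt(x / 2)). A statistic of 0 maps to 0.5.
    """
    lrt = max(lrt, 0.0)               # guard against tiny negative values
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))

for stat in (0.0, 2.71, 3.84, 10.0):
    print(stat, mixture_pvalue(stat))
```

Halving the tail probability is what makes this test less conservative than a plain chi-square(1) reference distribution for a variance on the boundary of its parameter space.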


Methodology - Step I

Multiple testing correction

Storey et al. estimate $\hat{\pi}_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1):

$$qvalue(t) = \frac{\hat{\pi}_0\, m\, t}{\#\{pvalue \le t\}}, \qquad t \in (0, 1)$$

We estimate $\hat{\pi}_{0\_adj} = m_0/m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and the other 50% are exactly 0.5:

$$adjusted\_qvalue(t) = \frac{\hat{\pi}_{0\_adj}\, m\, t}{\#\{pvalue \le t\}}, \qquad t \in (0, 0.5)$$
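A short Python sketch of the q-value calculation, assuming the form pi0 * m * t / #{p-values <= t} evaluated at each observed p-value. The null proportion `pi0` is passed in as a number because the slides do not spell out how it is estimated; the final step that makes q-values non-decreasing in the p-value is the usual convention and is an addition here. The function name is hypothetical.

```python
def q_values(pvalues, pi0=1.0):
    """q-value for each p-value: pi0 * m * p / rank, made monotone."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of the raw estimates at all thresholds at or above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        raw = pi0 * m * pvalues[idx] / rank
        running_min = min(running_min, raw)
        q[idx] = running_min
    return q

pvals = [0.01, 0.02, 0.03, 0.5]       # toy p-values
print(q_values(pvals))                # pi0 = 1 (conservative)
print(q_values(pvals, pi0=0.5))       # a smaller null proportion scales q down
```

For the adjusted version described above, the same function would be called with the adjusted estimate of the null proportion in place of `pi0`.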


Methodology - Step II

Mixed-Model Equations

Gene X: expression values = FIXED effects + RANDOM effect + Residual

Full model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + sca_{ij} + array_m + error_{klijm}$

Reduced model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$

$$\Sigma_{genotype} = K\sigma^2_g = K\left(\sigma^2_{gca_i} + \sigma^2_{gca_j} + \sigma^2_{sca_{ij}}\right) = L\,\left(I_{10}\,\sigma^2_{gca},\; I_{45}\,\sigma^2_{sca}\right)\,L^{T}$$

L is the Cholesky decomposition
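For readers unfamiliar with the Cholesky factor mentioned on this slide, here is a plain-Python sketch, assuming L is the lower-triangular factor of a symmetric positive-definite K so that K = L Lᵀ. The 2 x 2 matrix is a toy example and not part of the study's 55 x 55 relatedness matrix.

```python
import math

def cholesky(K):
    """Lower-triangular L with K = L L^T (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(L[r][k] * L[c][k] for k in range(c))
            if r == c:
                L[r][c] = math.sqrt(K[r][r] - s)
            else:
                L[r][c] = (K[r][c] - s) / L[c][c]
    return L

K_toy = [[4.0, 2.0], [2.0, 3.0]]
L_toy = cholesky(K_toy)
print(L_toy)
```

Multiplying independent random effects by L gives effects whose covariance is proportional to K, which is one common way to fit a model with a known relatedness structure in software that only supports independent random effects.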


Methodology - Step II

Gene X: $H_0: \sigma^2_{sca} = 0$ vs $H_a: \sigma^2_{sca} > 0$

Likelihood ratio test (REML): $LRT \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$, giving a p-value

Multiple testing correction

20976 genes: qa-value (FNR)

FNR: false non-discovery rate (Genovese et al., 2002). How many of the called negatives are false? A 5% FNR means that 5% of the calls are false negatives.

Since we are interested in selecting genes called negative for the $sca_{ij}$ effect (no evidence of trans-regulatory variation), we control the FNR instead of the FDR.

We use the qa-value to represent the FNR


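Methodology - Step II

Multiple testing correction

False non-discovery rate (FNR):

$$FNR = E\left[\frac{T}{m - R}\;\middle|\;(m - R) > 0\right]\Pr\left((m - R) > 0\right)$$

$$qaval(t) = 1 - \frac{\hat{\pi}_0\, m\,(1 - t)}{\#\{pval > t\}}$$

$\hat{\pi}_0$ is the estimate of the proportion of features that are truly null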
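A Python sketch of an FNR estimate at a threshold t, assuming the estimator has the form 1 - pi0 * m * (1 - t) / #{p-values > t}: among the genes not rejected at t, roughly pi0 * m * (1 - t) are expected to be true nulls, and the remainder are false negatives. The function name, the toy p-values, and the truncation at zero are assumptions added for illustration.

```python
def qa_value(pvalues, t, pi0):
    """Estimated false non-discovery rate among genes with p-value > t."""
    m = len(pvalues)
    n_negative = sum(1 for p in pvalues if p > t)
    if n_negative == 0:
        return 0.0                    # nothing was called negative
    estimate = 1.0 - pi0 * m * (1.0 - t) / n_negative
    return max(estimate, 0.0)         # a rate cannot be negative

toy_pvalues = [0.01, 0.2, 0.6, 0.9]
print(qa_value(toy_pvalues, t=0.1, pi0=0.5))
```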


Methodology - Step III

Mixed-Model Equations

Model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$

Gene X (with $g_i$ denoting $gca_i$):

g1 = g2? g1 = g3? g1 = g4? … g1 = g10? g2 = g3? g2 = g4? g2 = g5? … g2 = g10? … g9 = g10?

Test 45 pairs?

Two sample dependent t-test

$$t_{standard} = \frac{\hat{g}_1 - \hat{g}_2}{SE(\hat{g}_1 - \hat{g}_2)}$$

$$t_{non\text{-}standard} = \frac{(\hat{g}_1 - \hat{g}_2) - (g_1 - g_2)}{SE\left((\hat{g}_1 - \hat{g}_2) - (g_1 - g_2)\right)}$$

$\hat{g}_1$ is the BLUP of $g_1$, $\hat{g}_2$ is the BLUP of $g_2$

Non-standard p-value: the distribution of true null p-values is not uniform on (0, 1)
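The 45 comparisons are simply all unordered pairs of the 10 parental lines. The sketch below enumerates them in Python and forms the standard t statistic, the BLUP difference divided by its standard error, for each pair. The BLUP values, the constant standard error and the function name are hypothetical placeholders; in practice both would come from the fitted mixed model, and the deck notes that the resulting p-values are non-standard.

```python
from itertools import combinations

def pairwise_t(blups, se):
    """Standard t statistic for every pair of parents.

    blups: parent -> BLUP of its gca effect
    se:    (parent_i, parent_j) -> standard error of the BLUP difference
    """
    return {
        (i, j): (blups[i] - blups[j]) / se[(i, j)]
        for i, j in combinations(sorted(blups), 2)
    }

# Hypothetical BLUPs for 10 parents and a constant standard error.
blups = {f"P{k}": 0.1 * k for k in range(1, 11)}
se = {pair: 0.5 for pair in combinations(sorted(blups), 2)}

t_stats = pairwise_t(blups, se)
print(len(t_stats))
```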


Methodology - Step III

Gene X: $H_0: gca_i = gca_j$ vs $H_a: gca_i \neq gca_j$

Two sample t-test testing BLUPs

Simulate the H0 distribution from real data: simulation-based p-value

Multiple testing correction

1380 genes: q-value (FDR)
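A simulation-based p-value compares the observed statistic with statistics generated under H0. The slides do not say how the null data are simulated, so the Python sketch below only covers the final counting step, assuming a list of null t statistics is already available. The two-sided comparison and the add-one correction are common conventions and are assumptions here, as is the function name.

```python
def simulation_pvalue(t_observed, t_null):
    """Two-sided empirical p-value from simulated null statistics."""
    exceed = sum(1 for t in t_null if abs(t) >= abs(t_observed))
    # Add-one correction keeps the p-value strictly above zero.
    return (1 + exceed) / (1 + len(t_null))

null_stats = [0.1, -0.5, 2.0, -3.0]   # toy values standing in for simulations
print(simulation_pvalue(1.5, null_stats))
```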


Methodology - Step IV

Gene X (cis-regulated): tag SNPs on the chromosome around the gene: SNP1 SNP2 SNP3 … SNPi

Mixed-Model Equations

FIXED effects + RANDOM effect + Residual

Full model: $y_{klim} = \mu + dye_k + replicate_l + \sum_i \beta_{SNP_i} * SNP_i + genotype_i + array_m + error_{klim}$

Reduced model: $y_{klim} = \mu + dye_k + replicate_l + genotype_i + array_m + error_{klim}$

$genotype \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$

K = 55 x 55 marker-based relatedness matrix.

$array \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$; $error \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$


Methodology - Step IV

Gene X (cis-regulated):

$H_0: \beta_{SNP_1} = \beta_{SNP_2} = \dots = \beta_{SNP_i} = 0$

$H_a$: at least one $\beta_{SNP_i} \neq 0$

Likelihood ratio test (ML): $LRT \sim \chi^2(2n)$, where n is the number of SNPs, giving a p-value

Multiple testing correction

836 genes: q-value (FDR)
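Because the reference distribution here has an even number of degrees of freedom (2n), its tail probability has a closed form, exp(-x/2) times the sum over k from 0 to n-1 of (x/2)^k / k!, so the p-value can be sketched without any statistics library. The function names and the log-likelihood values are hypothetical, and the statistic is assumed to be twice the difference in maximized log-likelihoods.

```python
import math

def chi2_even_sf(x, df):
    """P(chi-square with even df > x) via the closed-form finite sum."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k              # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def lrt_pvalue(loglik_full, loglik_reduced, n_snps):
    """p-value of the ML likelihood ratio test with 2 * n_snps df."""
    lrt = 2.0 * (loglik_full - loglik_reduced)
    return chi2_even_sf(lrt, 2 * n_snps)

# Hypothetical log-likelihoods for a gene with 2 tag SNPs.
print(lrt_pvalue(-100.0, -106.0, n_snps=2))
```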


Results

Data contains all expressed genes (25527 genes)

Step I: $\sigma^2_{genotype} > 0$, adjusted q-value < 0.0005: 20979 genes

Step II: $\sigma^2_{sca_{ij}} = 0$, adjusted qa-value < 0.01: 1328 genes

Step III: $gca_i \neq gca_j$, q-value < 0.01: 972 genes

Step IV: $\beta_{SNP_i} \neq 0$, q-value < 0.01: 859 genes


Results

Among all 25527 genes, 20979 genes show significant genotypic variation (q-value < 0.0005). (Step I)

Among these 20979 genes, 1328 genes show no trans-regulatory effect (qa-value < 0.01). (Step II)

Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)

We confirm the discovery of these 972 cis-regulated genes in Step IV:

an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP) in LD that controls expression;

We indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.


Conclusions

The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (the p-values are more reliable).

Statistical significance is controlled more accurately at each step (via FDR and FNR).

Using simulation-based p-values when testing differences between random effects increases the power to detect association.

A comprehensive analysis of gene expression variation in plant populations has been described.

This mixed-model analysis strategy provides a detailed characterization of both the genetic and the positional effects in the genome.

This detailed statistical analysis provides a robust and useful framework for future analyses of gene expression variation in large samples.

Advanced statistical methods look promising for making interesting discoveries in genetics.


Many thanks for your attention!