Fei Zou Department of Biostatistics Email: [email protected]

42
Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mappping Fei Zou Department of Biostatistics Email: [email protected]

description

Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mappping. Fei Zou Department of Biostatistics Email: [email protected]. Outline. Introduction Experimental crosses Existing QTL Mapping Methods Bayesian semi-parametric QTL Mapping Results - PowerPoint PPT Presentation

Transcript of Fei Zou Department of Biostatistics Email: [email protected]

Page 1: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Bayesian Variable Selection in Semiparametric Regression

Modeling with Applications to Genetic Mappping

Fei ZouDepartment of BiostatisticsEmail: [email protected]

Page 2: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Outline

• Introduction– Experimental crosses– Existing QTL Mapping Methods

• Bayesian semi-parametric QTL Mapping• Results• Remarks and Conclusions

Page 3: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt

Page 4: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Overview

• One gene one trait: very unlikely• The vast majority of biological traits are

caused by complex polygenes– Potentially interacting with each other

• Most traits have significant environmental exposure components– Potentially interacting with polygenes

Page 5: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

ParentsP1

P2

Experimental Crosses: F2

Page 6: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Experimental Crosses

• F2 Backcross(BC)

F1 F1

P1 P2

F2: BC:

P1 P2

P1 F1

AA BB

AB AB

BB AB AB

AA BB

AA AB

AB AA AB

Page 7: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

QTL Data Format

0: homozygous AA, 2: homozygous BB, 1: heterozygote AB.Marker positions:

Page 8: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Linkage Analysis

• Data structure: – Marker data (genotypes plus positions)– Phenotypic trait(s)– Other nongenetic covariates, such as age,

gender, environmental conditions etc• Quantitative trait loci (QTL): a particular

region of the genome containing one or more genes that are associated with the trait being assayed or measured

Page 9: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

QTL Mapping of Experimental Crosses

• Single QTL Mapping• Single marker analysis (Sax, 1923 Genetics)• Interval mapping: Lander & Botstein (1989,

Genetics)• Multiple QTL mapping

• Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics)

• Multiple interval mapping (Kao et al., 1999 Genetics)

• Bayesian analysis (Satagopan et al., 1997 Genetics)

Page 10: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Single QTL Interval Mapping

• For backcross, the model assumes

• QTL analysis:

• If QTL genotypes are observed, the analysis is trivial: simple t-test!

• However, QTL position is unknown and therefore QTL genotypes are unobserved

2(QTL genotype is AA) ~ ( , )i AAy N 2(QTL genotype is Aa) ~ ( , )i Aay N

0 : vs :AA Aa A AA AaH H

Page 11: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Interval Mapping

• For QTL between markers– QTL genotypes missing: can use marker genotypes to

infer the conditional probabilities of the QTL genotypes for a given QTL position

– Profile likelihood (LOD score) calculated across the whole genome or candidate regions using EM algorithm

– In any region where the profile exceeds a (genome-wide) significance threshold, a QTL declared at the position with the highest LOD score.

Page 12: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

0

2

4

6

8

Chromosome

lod

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16171819 X

Profile LOD

Page 13: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Multiple QTL Mapping

• Most complicated traits are caused by multiple (potentially interacting) genes, which also interact with environment stimuli – Single QTL interval mapping

• Ghost QTL (Lander & Botstein 1989)• Low power

Page 14: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Multiple QTL Mapping• Composite interval mapping (Zeng 1993, 1994;

Jansen & Stam1993): searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust the effects of other QTLs outside the region

• which background markers to include; window size etc

• Multiple interval mapping (Kao et al 1999): fitting multiple QTLs simultaneously

• Computationally intensive; how many QTLs to include?

Page 15: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Multiple QTL Mapping• Bayesian methods (Stephens and Fisch 1998

Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetic Research, and Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika)

• change of dimensionality, the acceptance probability for such dimension change, which in practice, may not be handled correctly (Ven 2004 Genetics)

Page 16: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Multiple QTL Mapping• Alternative, multiple QTL mapping can be

viewed as a variable selection problem– Forward and step-wise selection procedures (Broman

and Speed 2002 JRSSB) – LASSO, etc– Bayesian QTL mapping

• Xu (2003 Genetics), Wang et al (2005 Genetics) Huang et al (2007 Genetics): Bayesian shrinkage

• Yi et al (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)

• Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat)

• Software: R/qtlbim by Yi’s group

Page 17: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Multiple QTL Mapping

• Limitations of existing QTL mapping methods – do not model covariates at all or only

model covariate effect linearly – do not model interactions at all or

model only lower order interactions, such as two way interactions

Page 18: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

• The multiple QTL mapping is a very large variable selection problem: for p potential genes, with p being in the hundreds or thousands, there are possible main effect models, possible two-way interactions and possible higher order (k > 2) interactions.

2 p 22p

2pk

Page 19: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Semiparmetric Multiple (Potentially Interacting) QTL Mapping

• Goal: map multiple potentially interacting QTLs without specifically model all potential main and higher order interaction effects

• Semiparametric model:

where function is unspecified, QTL genotypes and represent all non- genetics factors/covariates.

• When equals : non-explicitly modeling the two way interaction between genes 1 and 2 and the gene-environmental interaction between gene 3 and covariate 1.

21 2 1( , ,... , ,..., ) , 1, , with (0, )~

iid

i i i ip i iq i iy x x x t t e i n e N

1 2, ,...i i ipx x x1,...,i iqt t

1 2 3 1i i i ix x x t

Page 20: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Bayesian Semi/non-parametric Methods

– Dirichlet process (Muller et al. 1996)– Splines (Smith and Kohn 1996; Denison et al. 1998

and DiMatteo et al. 2001)– Wavelets (Abramovich et al. 1998 JRSSB)– Kernel models (Liang et al 2007) – Gaussian process (Neal 1997; 1996)

• Gaussian process priors have a large support in the space of all smooth functions through an appropriate choice of covariance kernel.

• Gaussian process is flexible for curve estimation because of their flexible sample path shapes

• Gaussian process related to smoothing spline somehow (Wahba 1978 JRSSB)

Page 21: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Prior Specification on

• A Gaussian process such that all possible finite dimensional distributions follow multivariate normal with mean 0 and covariance function

where , s and s are hyperparameters and

1( ,..., )Tn

2 2' ' '

1 1

1cov( , ) exp[ ( ) ( ) ]p q

i i xk ik i k tj ij i jk j

x x t t

xk

1 1( ,..., , ,..., )i i ip i iqx x t t tj

Page 22: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

• Hyperparameter defines the vertical scale of variations, i.e., controls the magnitude of the exponential part. Hyperparameters related to length scales which characterize the distance in that particular direction over which y is expected to vary significantly– controls the smoothness of : when

the posterior mean of almost interpolates the data while centered around the prior mean function if

– When = 0, y is expected to be an essentially constant function of that input variable xj, which is therefore deemed irrelevant (Mackay 1998).

( )xk tj 1/ xk

xk

0

Page 23: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Priors on • The original papers on the Gaussian process

(Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on the parameters. However, does provide information about the relevance of any QTL with value near zero indicating an irrelevant QTL.

• For variable selection purpose, we can impose the following Gamma mixture priors on

1/xk xk

s

Page 24: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Prior Specifications

• Inverse Gamma distributions are used for the priors of and . 2

Page 25: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Simulations

• Set ups:– backcross population – 200 or 500 individuals– 151 evenly spaced markers at 5cM intervals– Four QTLs with varying heritabilities:

• Main effect model: all four QTL act additively• Main plus two way interactions• Four way interactions only

Page 26: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 27: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 28: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 29: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 30: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 31: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 32: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

n=500 and pure 4 way-interaction model

Page 33: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

n=500 and pure additive model

Page 34: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Real Data Analysis

• A mouse study– # samples: 187 backcross samples– # markers: 85 with average marker distance

20 cM – Phenotypes: inguinal, gonadal, retroperitoneal

and mesenteric fat pad weights

Page 35: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 36: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 37: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Remarks

• For studies with large # of samples and/or large # of markers, MCMC converges very slowly– We employed the hybrid Monte Carlo method, which

merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation.

– We also estimated the maximum a posteriori (MAP) via conjugate gradient method (Hestenes et al 1952 J. Research of National Bureau of Standards)

• point estimate

Page 38: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Real Study: Cardiovascular Disease

• 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease

• 820 individuals• Non-genetic covariates: gender, smoking

status, age

Page 39: Fei Zou Department of Biostatistics Email:  fzou@bios.unc
Page 40: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Remarks

• Semiparemetric mapping is powerful in mapping multiple (potentially interacting with higher orders) QTL – Picks up genes related to the trait regardless

of their marginal main effects or joint epistasis effects

– Cannot readily differentiates genetic contributions

• main effect? interaction? or both?• Fine tuned parametric model with selected genes

Page 41: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Remarks and Future Research• How to extend the methodologies to human

genome-wide association (GWA) studies, where hundreds of thousands of markers are available– Is it possible?– potential solutions: pathway analysis; data reduction

techniques• How to extend the method to human pedigree

analysis where mixed effect model is used for correlated family members?– Use inheritance vector: so far results are very

promising

Page 42: Fei Zou Department of Biostatistics Email:  fzou@bios.unc

Acknowledgement

• Joint work with – Hanwen Huang – Haibo Zhou– Fuxia Cheng– Ina Hoeschele

• Funding support– NIH R01 GM074175