Linear Modelling III
description
Transcript of Linear Modelling III
Linear Modelling III
Richard MottWellcome Trust Centre for Human
Genetics
Synopsis
• Logistic Regression• Mixed Models
Logistic Regression and Statistical Genetics
• Applicable to Association Studies• Data:
– Binary outcomes (eg disease status)– Dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the outcome• Rigorous tests of statistical significance• Flexible modelling language• Generalisation of Chi-Squared Test
Data
• Data for individual labelled i=1…n:– Binary Response yi
– Unphased Genotypes gij for markers j=1..m
Coding Unphased Genotypes
• Several possibilities:– AA, AG, GG original genotypes– 12, 21, 22– 1, 2, 3– 0, 1, 2 # of G alleles
• Missing Data– NA default in R
Using R
• Load genetic logistic regression tools• > source(‘logistic.R’)
• Read data table from file– > t <- read.table(‘geno.dat’, header=TRUE)
• Column names– names(t)– t$y response (0,1)– t$m1, t$m2, …. Genotypes for each marker
Contigency Tables in R
• ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31) 11 12 22 0 515 387 751 28 11 2>
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >
The Logistic Model
• Effect of SNP k on phenotype
• Null model
The Logistic Model
• Prob(Yi=1) = exp(hi)/(1+exp(hi))
• hi = Sj xij bj - Linear Predictor
• xij – Design Matrix (genotypes etc)
• bj – Model Parameters (to be estimated)
• Model is investigated by – estimating the bj’s by maximum likelihood– testing if the estimates are different from 0
The Logistic FunctionProb(Yi=1) = exp(hi)/(1+exp(hi))
h
Prob(Y=1)
Types of genetic effect at a single locus
AA AG GG
Recessive 0 0 1
Dominant 1 1 0
Additive 0 1 2
Genotype 0 a b
Additive Genotype Model• Code genotypes as
– AA x=0, – AG x=1, – GG x=2
• Linear Predictor– h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))• PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0))• PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))• PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2 b1 = 2PAA = 0.12 PAG = 0.50 PGG = 0.88
h
Prob(Y=1)
Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98
h
Prob(Y=1)
Recessive Model• Code genotypes as
– AA x=0, – AG x=0, – GG x=1
• Linear Predictor– h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))• PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0))• PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88
h
Prob(Y=1)
Genotype Model
• Each genotype has an independent probability• Code genotypes as (for example)
– AA x=0, z=0– AG x=1, z=0– GG x=0, z=1
• Linear Predictor– h = b0 + xb1+zb2 two parameters
• P(Y=1|x) = exp(b0 + xb1+zb2)/(1+exp(b0 + xb1+zb2))• PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0))• PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1))• PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27
h
Prob(Y=1)
Models in Rresponse ygenotype g
AA AG GG model DF
Recessive 0 0 1 y ~ recessive(g) 1
Dominant 0 1 1 y ~ dominant(g) 1
Additive 0 1 2 y ~ additive(g) 1
Genotype 0 a b y ~ genotype(g) 2
Data Transformation
• g <- t$m1• use these functions to treat a genotype vector
in a certain way:– a <- additive(g)– r <- recessive(g)– d <- dominant(g)– g <- genotype(g)
Fitting the Model• afit <- glm( t$y ~ additive(g),family=‘binomial’)• rfit <- glm( t$y ~ recessive(g),family=‘binomial’)• dfit <- glm( t$y ~ dominant(g),family=‘binomial’)• gfit <- glm( t$y ~ genotype(g),family=‘binomial’)
• Equivalent models:– genotype = dominant + recessive– genotype = additive + recessive– genotype = additive + dominant– genotype ~ standard chi-squared test of genotype
association
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 ***b1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733 . b2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 ---Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >
Analysis of DevianceChi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq")Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. DevNULL 1017 343.71genotype(t$m31) 2 3.96 1015 339.76
Model Comparison
• Compare general model with additive, dominant or recessive models:> afit <- glm(t$y ~ additive(t$m20))> gfit <- glm(t$y ~ genotype(t$m20))> anova(afit,gfit)Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance1 1016 38.301 2 1015 38.124 1 0.177>
Scanning all Markers> logscan(t,model=‘additive’) Deviance DF Pval LogPvalm1 8.604197e+00 1 3.353893e-03 2.474450800m2 7.037336e+00 1 7.982767e-03 2.097846522m3 6.603882e-01 1 4.164229e-01 0.380465360m4 3.812860e+00 1 5.086054e-02 1.293619014m5 7.194936e+00 1 7.310960e-03 2.136025588m6 2.449127e+00 1 1.175903e-01 0.929628598m7 2.185613e+00 1 1.393056e-01 0.856031566m8 1.227191e+00 1 2.679539e-01 0.571939852m9 2.532562e+01 1 4.842353e-07 6.314943565m10 5.729634e+01 1 3.748518e-14 13.426140380m11 3.107441e+01 1 2.483233e-08 7.604982503………
Multilocus Models
• Can test the effects of fitting two or more markers simultaneously
• Several multilocus models are possible• Interaction Model assumes that each
combination of genotypes has a different effect
• eg t$y ~ t$m10 * t$m15
Multi-Locus Models> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial’, test="Chisq")> anova(f)Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. DevNULL 1017 343.71genotype(t$m13) 2 108.68 1015 235.03genotype(t$m26) 2 1.14 1013 233.89genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86
> pchisq(6.03,3,lower.tail=F) calculate p-value [1] 0.1101597
Adding the effects of Sex and other Covariates
• Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age
• Fit models of the form• fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) • fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)
Adding the effects of Sex and other Covariates
• Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age
• anova(fit1,fit2)
Random EffectsMixed Models
• So far our models have had fixed effects – each parameter can take any value independently of the
others– each parameter estimate uses up a degree of freedom– models with large numbers of parameters have problems
• saturation - overfitting• poor predictive properties• numerically unstable and difficult to fit• in some cases we can treat parameters as being sampled from a
distribution – random effects• only estimate the parameters needed to specify that distribution• can result in a more robust model
Example of a Mixed Model• Testing genetic association across a large number of of families
– yi = trait value of i’th individual
– bi = genotype of individual at SNP of interest
– fi = family identifier (factor)
– ei = error, variance s2
– y = m + b + f + e
– If we treat family effects f as fixed then we must estimate a large number of parameters– Better to think of these effects as having a distribution N(0,t2) for some variance t which must be
estimated• individuals from the same family have the same f and are correlated
– cov = t2
• individuals from different families are uncorrelated
– genotype parameters b still treated as fixed effects• mixed model
Mixed Models• y = m + b + f + e
• Fixed effects model– E(y) = m + b + f – Var(y) = Is2
• I is identity matrix
• Mixed model– E(y) = m + b– Var(y) = Is2 + Ft2
• F is a matrix, Fij = 1 if i, j in same family• Need to estimate both s2 and t2
Generalised Least Squares• y = Xb + e• Var(y) = V (symmetric covariance matrix)
• V = Is2 (uncorrelated errors)• V = Is2 + Ft2 (single grouping random effect)• V = Is2 + Gt2 (G = genotype identity by state)
• GLS solution, if V is known
• unbiased, • efficient (minimum variance)• consistent (tends to true value in large samples)• asymptotically normally distributed
Multivariate Normal Distribution• Joint distribution of a vector of correlated observations• Another way of thinking about the data• Estimation of parameters in a mixed model is special case of likelihood
analysis of a MVN• y ~ MVN(m, V)
– m is vector of expected values– V is variance-covariance matrix– If V = Is2 then elements of y are uncorrelated and equivalent to n
independent Normal variables– probability density/Likelihood is
Henderson’s Mixed Model Equations
• General linear mixed model.• b are fixed effect• u are random effects• X, Z design matrices
Mixed Models
• Fixed effects models are special cases of mixed models
• Mixed models sometimes are more powerful (as fewer parameters) – But check that the assumption effects are sampled from a Normal
distribution is reasonable– Differences between estimated parameters and statistical significance in
fixed vs mixed models
• Shrinkage: often random effect estimates from a mixed model have smaller magnitudes than from a fixed effects model.
Mixed Models in R• Several R packages available, e.g.
– lme4 general purpose mixed models package– emma for genetic association in structured populations
• Formulas in lme4, eg test for association between genotype and trait• H0: E(y) = m • vs• H1: E(y) = m + b
• Var(y) = Is2 + Ft2
– h0 = lmer(y ~ 1 +(1|family), data=data)– h1 = lmer(y ~ genotype +(1|family), data=data)– anova(h0,h1)
Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2) etc– random intercept models
• y ~ fixed.effects +(1|Group1) + (x|Group1)– random slope model
Example: HDL Data> library(lme4)> b=read.delim("Biochemistry.txt”)cc=complete.cases(b$Biochem.HDL)
> f0=lm(Biochem.HDL ~ Family , data=b,subset=cc)> f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc)
> anova(f0)Analysis of Variance Table
Response: Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) Family 175 147.11 0.84064 5.2098 < 2.2e-16 ***Residuals 1681 271.24 0.16136 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> f1Linear mixed model fit by REML Formula: Biochem.HDL ~ (1 | Family) Data: b Subset: cc AIC BIC logLik deviance REMLdev 2161 2178 -1078 2149 2155Random effects: Groups Name Variance Std.Dev. Family (Intercept) 0.066633 0.25813 Residual 0.161487 0.40185 Number of obs: 1857, groups: Family, 176
Fixed effects: Estimate Std. Error t value(Intercept) 1.60615 0.02263 70.97> plot(resid(f0),resid(f1))