Association Analysis, Logistic Regression, R and S-PLUS Richard Mott


http://bioinformatics.well.ox.ac.uk/lectures/

Logistic Regression in Statistical Genetics

• Applicable to Association Studies
• Data:
  – Binary outcomes (e.g. disease status)
  – Dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the outcome
• Rigorous tests of statistical significance
• Flexible modelling language
• Generalisation of Chi-Squared Test

What is R?

• Statistical analysis package
• Free
• Similar to commercial package S-PLUS
• Runs on Unix, Windows, Mac
• www.r-project.org
• Many packages for statistical genetics, microarray analysis available in R
• Easily Programmable

Modelling in R

• Data for individual labelled i=1…n:
  – Response $y_i$
  – Genotypes $g_{ij}$ for markers j=1..m

Coding Unphased Genotypes

• Several possibilities:
  – AA, AG, GG (original genotypes)
  – 11, 12, 22
  – 1, 2, 3
  – 0, 1, 2 (number of G alleles)
• Missing Data
  – NA (default in R)
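The codings listed above are interchangeable. As a minimal sketch, assuming genotypes arrive as the strings AA, AG, GG (the vector `raw` is made up for illustration), they could be converted to a count of G alleles while missing values stay NA:

```r
# Illustrative genotype vector (made-up values, not from the lecture data)
raw <- c("AA", "AG", "GG", NA, "AG")

# Named lookup: genotype string -> number of G alleles
g_count <- c(AA = 0, AG = 1, GG = 2)

# Indexing by name maps each genotype; a missing genotype stays NA
x <- unname(g_count[raw])
print(x)
```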

Using R

• Load genetic logistic regression tools
  – > source('logistic.R')
• Read data table from file
  – > t <- read.table('geno.dat', header=TRUE)
• Column names
  – > names(t)
  – t$y response (0,1)
  – t$m1, t$m2, … genotypes for each marker

Contingency Tables in R

• ftable(t$y,t$m31) prints the contingency table

> ftable(t$y,t$m31)
    11  12  22
0  515 387  75
1   28  11   2
>

Chi-Squared Test in R

> chisq.test(t$y,t$m31)

Pearson's Chi-squared test

data:  t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31)
>
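The test can be re-run from the counts alone. The sketch below rebuilds the m31 contingency table as a matrix (the object names `tab` and `res` are illustrative) and applies chisq.test to it. The warning arises because one expected count, for y = 1 with genotype 22, is below 5.

```r
# Counts taken from the ftable output for marker m31
# rows: y = 0, 1; columns: genotypes 11, 12, 22
tab <- matrix(c(515, 387, 75,
                 28,  11,  2),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), m31 = c("11", "12", "22")))

# Pearson chi-squared test on the table (warning suppressed, see lead-in)
res <- suppressWarnings(chisq.test(tab))
print(res$statistic)   # Pearson X-squared
print(res$parameter)   # degrees of freedom
print(res$expected)    # expected counts under independence
```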

The Logistic Model

• $\mathrm{Prob}(Y_i=0) = \exp(\eta_i) / (1 + \exp(\eta_i))$

• $\eta_i = \sum_j x_{ij} b_j$ – Linear Predictor

• $x_{ij}$ – Design Matrix (genotypes etc)

• $b_j$ – Model Parameters (to be estimated)

• Model is investigated by
  – estimating the $b_j$'s by maximum likelihood
  – testing if the estimates are different from 0

The Logistic Function

$\mathrm{Prob}(Y_i=0) = \exp(\eta_i) / (1 + \exp(\eta_i))$

[Figure: plot of Prob(Y=0)]
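A short sketch of the logistic function in R, for readers who want to reproduce the curve. The helper name `logistic` is illustrative; `plogis` is the equivalent built-in.

```r
# Logistic function: maps a linear predictor eta to a probability in (0, 1)
logistic <- function(eta) exp(eta) / (1 + exp(eta))

eta <- seq(-6, 6, by = 0.5)
p <- logistic(eta)

# plogis() is R's built-in equivalent
stopifnot(isTRUE(all.equal(p, plogis(eta))))

# Uncomment to draw the S-shaped curve
# plot(eta, p, type = "l", xlab = "linear predictor", ylab = "probability")
```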

Types of genetic effect at a single locus

            AA  AG  GG
Recessive    0   0   1
Dominant     0   1   1
Additive     0   1   2
Genotype     0

Additive Genotype Model

• Code genotypes as
  – AA x=0
  – AG x=1
  – GG x=2
• Linear Predictor = b0 + xb1
• P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

Additive Model: b0 = -2 b1 = 2

PAA = 0.12 PAG = 0.50 PGG = 0.88

[Figure: plot of Prob(Y=0)]

Additive Model: b0 = 0 b1 = 2

PAA = 0.50 PAG = 0.88 PGG = 0.98

[Figure: plot of Prob(Y=0)]
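The probabilities quoted for the two additive examples follow directly from the logistic formula. A minimal sketch (the function name `additive_prob` is illustrative):

```r
# Probability under the additive model for x = 0, 1, 2 copies of G
additive_prob <- function(b0, b1, x = 0:2) {
  p <- plogis(b0 + x * b1)          # exp(eta) / (1 + exp(eta))
  names(p) <- c("PAA", "PAG", "PGG")
  round(p, 2)
}

additive_prob(b0 = -2, b1 = 2)  # 0.12 0.50 0.88
additive_prob(b0 =  0, b1 = 2)  # 0.50 0.88 0.98
```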

Recessive Model

• Code genotypes as
  – AA x=0
  – AG x=0
  – GG x=1
• Linear Predictor = b0 + xb1
• P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

Recessive Model: b0 = 0 b1 = 2

PAA = PAG = 0.50 PGG = 0.88

[Figure: plot of Prob(Y=0)]

Genotype Model

• Each genotype has an independent probability
• Code genotypes as (for example)
  – AA x=0, y=0
  – AG x=1, y=0
  – GG x=0, y=1
• Linear Predictor = b0 + xb1 + yb2 (two parameters)
• P(Y=0|x,y) = exp(b0 + xb1 + yb2)/(1+exp(b0 + xb1 + yb2))
• PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))

Genotype Model: b0 = 0 b1 = 2 b2 = -1

PAA = 0.5 PAG = 0.88 PGG = 0.27

[Figure: plot of Prob(Y=0)]
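The two indicator variables of the genotype model are what R builds when a genotype is treated as a factor. A small sketch, assuming AA is taken as the baseline level, that forms the design matrix and reproduces the three probabilities of the example:

```r
# Genotype as a factor with AA as the baseline level
g <- factor(c("AA", "AG", "GG"), levels = c("AA", "AG", "GG"))

# model.matrix() builds the intercept plus two indicator columns
# (the x and y indicators of the genotype model)
X <- model.matrix(~ g)
print(X)

# Parameters from the example: b0 = 0, b1 = 2, b2 = -1
b <- c(0, 2, -1)
p <- plogis(drop(X %*% b))
names(p) <- c("PAA", "PAG", "PGG")
print(round(p, 2))   # 0.50 0.88 0.27
```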

Models in R

response y, genotype g

            AA  AG  GG   model             DF
Recessive    0   0   1   y ~ recessive(g)   1
Dominant     0   1   1   y ~ dominant(g)    1
Additive     0   1   2   y ~ additive(g)    1
Genotype     0           y ~ genotype(g)    2

Data Transformation

• g <- t$m1
• use these functions to treat a genotype vector in a certain way:
  – a <- additive(g)
  – r <- recessive(g)
  – d <- dominant(g)
  – g <- genotype(g)
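The contents of logistic.R are not reproduced in these slides, so the following is only a guess at what such helpers might look like. It assumes genotypes are coded 11, 12, 22 and that allele 2 is the one being counted; the function bodies are hypothetical stand-ins, not the author's code.

```r
# Hypothetical stand-ins for the helpers loaded from logistic.R (not the author's code).
# Assumption: genotypes are coded 11, 12, 22 and allele 2 is the one being counted.
additive  <- function(g) unname(c("11" = 0, "12" = 1, "22" = 2)[as.character(g)])
recessive <- function(g) as.numeric(additive(g) == 2)   # coding 0 0 1
dominant  <- function(g) as.numeric(additive(g) >= 1)   # coding 0 1 1
genotype  <- function(g) factor(g, levels = c(11, 12, 22))

# Example vector including a missing genotype
g <- c(11, 12, 22, NA)
print(additive(g))
print(recessive(g))
print(dominant(g))
print(genotype(g))
```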

Fitting the Model

• afit <- glm( t$y ~ additive(g),family='binomial')
• rfit <- glm( t$y ~ recessive(g),family='binomial')
• dfit <- glm( t$y ~ dominant(g),family='binomial')
• gfit <- glm( t$y ~ genotype(g),family='binomial')
• Equivalent models:
  – genotype = dominant + recessive
  – genotype = additive + recessive
  – genotype = additive + dominant
  – genotype ~ standard chi-squared test of genotype association
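One of the equivalences can be checked numerically. The sketch below rebuilds individual-level data from the m31 contingency table shown earlier and fits the genotype model alongside an additive + recessive model. The codings `add` and `rec` are assumed stand-ins for the logistic.R helpers. Both fits should reach the same residual deviance because each has three free genotype probabilities.

```r
# Rebuild individual-level data from the m31 contingency table
counts <- c(515, 387, 75, 28, 11, 2)        # y=0: 11,12,22 then y=1: 11,12,22
y <- rep(c(0, 0, 0, 1, 1, 1), times = counts)
g <- rep(c(11, 12, 22, 11, 12, 22), times = counts)

# Assumed codings (stand-ins for the helpers in logistic.R)
add <- c(0, 1, 2)[match(g, c(11, 12, 22))]
rec <- as.numeric(g == 22)

gfit  <- glm(y ~ factor(g), family = binomial)   # genotype model, 2 df
arfit <- glm(y ~ add + rec, family = binomial)   # additive + recessive, 2 df

# Two parameterisations of the same model give the same deviance
print(c(genotype = deviance(gfit), add_rec = deviance(arfit)))
```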

Parameter Estimates

> summary(glm( t$y ~ genotype(t$m31), family='binomial'))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)          -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12    -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22    -0.7124     0.7423  -0.960   0.3372
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
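Because the genotype model has one free probability per genotype, its estimates can be recovered by hand from the m31 contingency table. As a worked sketch (variable names are illustrative): the intercept is the log odds of y = 1 versus y = 0 in the baseline genotype 11, and b1, b2 are differences in log odds relative to that baseline.

```r
# Counts from the m31 contingency table: rows y = 0, 1; columns 11, 12, 22
controls <- c("11" = 515, "12" = 387, "22" = 75)
cases    <- c("11" = 28,  "12" = 11,  "22" = 2)

# Log odds of y = 1 within each genotype
log_odds <- log(cases / controls)

b0 <- log_odds[["11"]]                       # intercept: baseline genotype 11
b1 <- log_odds[["12"]] - log_odds[["11"]]    # genotype 12 relative to 11
b2 <- log_odds[["22"]] - log_odds[["11"]]    # genotype 22 relative to 11
print(round(c(b0 = b0, b1 = b1, b2 = b2), 4))

# Standard error of the intercept from the baseline cell counts
se_b0 <- sqrt(1 / 28 + 1 / 515)
print(round(se_b0, 4))
```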

Analysis of Deviance
Chi-Squared Test

> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev
NULL                             1017     343.71
genotype(t$m31)  2     3.96      1015     339.76
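The deviances in this table can be reproduced from the m31 counts, and the drop in deviance converted into a p-value. The sketch below assumes the usual likelihood-ratio reading, in which the deviance drop is referred to a chi-squared distribution with 2 degrees of freedom; the helper `dev` is illustrative.

```r
# Counts from the m31 table: rows y = 0, 1; columns 11, 12, 22
controls <- c(515, 387, 75)
cases    <- c(28, 11, 2)

# Binomial deviance for 0/1 data: -2 * log-likelihood at the fitted probabilities
dev <- function(cases, controls, p) -2 * sum(cases * log(p) + controls * log(1 - p))

p_null <- sum(cases) / sum(cases + controls)   # one common probability
p_geno <- cases / (cases + controls)           # one probability per genotype

null_dev  <- dev(cases, controls, p_null)
resid_dev <- dev(cases, controls, p_geno)
drop_dev  <- null_dev - resid_dev

# Refer the drop in deviance to a chi-squared distribution on 2 df
p_value <- pchisq(drop_dev, df = 2, lower.tail = FALSE)
print(round(c(null = null_dev, resid = resid_dev, drop = drop_dev, p = p_value), 4))
```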

Model Comparison

• Compare general model with additive, dominant or recessive models:

> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table

Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>
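Note that glm() fits a Gaussian model when the family argument is omitted, so a logistic comparison needs family = 'binomial' as in the earlier fits. The sketch below shows the same additive-versus-genotype comparison with a binomial family. The data are simulated purely for illustration (the lecture data set is not available here), so the numbers it prints are not comparable to the table above.

```r
# Simulated data for illustration only (not the lecture data set)
set.seed(1)
n <- 1000
x <- rbinom(n, 2, 0.3)                      # allele count 0/1/2
y <- rbinom(n, 1, plogis(-2 + 0.5 * x))     # binary outcome with an additive effect

afit <- glm(y ~ x,         family = binomial)   # additive: 1 parameter for the marker
gfit <- glm(y ~ factor(x), family = binomial)   # genotype: 2 parameters

# Likelihood-ratio comparison of the nested models (1 df)
cmp <- anova(afit, gfit, test = "Chisq")
print(cmp)
```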

Scanning all Markers

> logscan(t,model='additive')
        Deviance DF         Pval      LogPval
m1  8.604197e+00  1 3.353893e-03  2.474450800
m2  7.037336e+00  1 7.982767e-03  2.097846522
m3  6.603882e-01  1 4.164229e-01  0.380465360
m4  3.812860e+00  1 5.086054e-02  1.293619014
m5  7.194936e+00  1 7.310960e-03  2.136025588
m6  2.449127e+00  1 1.175903e-01  0.929628598
m7  2.185613e+00  1 1.393056e-01  0.856031566
m8  1.227191e+00  1 2.679539e-01  0.571939852
m9  2.532562e+01  1 4.842353e-07  6.314943565
m10 5.729634e+01  1 3.748518e-14 13.426140380
m11 3.107441e+01  1 2.483233e-08  7.604982503
………
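The source of logscan() is not shown, so the following is a hypothetical sketch of a single-marker scan that produces the same four columns. It assumes a data frame with a 0/1 column y and markers already coded as 0/1/2 allele counts. The function name `scan_markers` and the simulated data are illustrative only.

```r
# Hypothetical sketch of a single-marker scan; NOT the author's logscan().
# Assumes a 0/1 response column and markers coded as 0/1/2 allele counts.
scan_markers <- function(d, response = "y") {
  markers <- setdiff(names(d), response)
  rows <- lapply(markers, function(m) {
    fit <- glm(d[[response]] ~ d[[m]], family = binomial)
    dev <- fit$null.deviance - fit$deviance      # drop in deviance due to the marker
    df  <- fit$df.null - fit$df.residual         # 1 for an additive coding
    p   <- pchisq(dev, df, lower.tail = FALSE)
    data.frame(Deviance = dev, DF = df, Pval = p, LogPval = -log10(p))
  })
  res <- do.call(rbind, rows)
  rownames(res) <- markers
  res
}

# Simulated data for illustration only
set.seed(1)
n <- 300
d <- data.frame(y  = rbinom(n, 1, 0.3),
                m1 = rbinom(n, 2, 0.4),
                m2 = rbinom(n, 2, 0.2))
res <- scan_markers(d)
print(res)
```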

Multilocus Models

• Can test the effects of fitting two or more markers simultaneously

• Several multilocus models are possible

• Interaction Model assumes that each combination of genotypes has a different effect

• eg t$y ~ t$m10 * t$m15

Multi-Locus Models

> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial')
> anova(f)
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)

                                Df Deviance Resid. Df Resid. Dev
NULL                                             1017     343.71
genotype(t$m13)                  2   108.68      1015     235.03
genotype(t$m26)                  2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)  3     6.03      1010     227.86

> pchisq(6.03,2,lower.tail=F)    # calculate p-value
[1] 0.04904584
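The p-value depends on the degrees of freedom passed to pchisq. The table lists Df = 3 for the interaction row, while the call above uses 2, so the sketch below computes the tail probability for both values to show how much the choice matters (variable names are illustrative).

```r
# Deviance for the interaction term, taken from the table
dev <- 6.03

# Upper-tail chi-squared probabilities for the two candidate degrees of freedom
p_df2 <- pchisq(dev, df = 2, lower.tail = FALSE)   # df used in the call above
p_df3 <- pchisq(dev, df = 3, lower.tail = FALSE)   # Df listed for the interaction row

print(c(df2 = p_df2, df3 = p_df3))
```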

Adding the effects of Sex and other Covariates

• Read in sex and other covariate data, e.g. age, from a file into variables, say a$sex, a$age
• Fit models of the form
  – fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
  – fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')

Adding the effects of Sex and other Covariates

• Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age

• anova(fit1,fit2)
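A self-contained sketch of this comparison follows. The data are simulated for illustration only; the variable names mirror the slide, and the marker is assumed to be already coded as an allele count. Passing test = "Chisq" makes anova report the likelihood-ratio p-value directly.

```r
# Simulated data for illustration only; variable names mirror the slide
set.seed(2)
n <- 800
sex <- rbinom(n, 1, 0.5)
age <- round(runif(n, 20, 70))
m10 <- rbinom(n, 2, 0.25)                               # allele count 0/1/2
y   <- rbinom(n, 1, plogis(-3 + 0.02 * age + 0.6 * m10))

fit1 <- glm(y ~ m10 + sex + age, family = binomial)     # marker plus covariates
fit2 <- glm(y ~ sex + age,       family = binomial)     # covariates only

# Smaller model first; the test has 1 df for the additive marker effect
cmp <- anova(fit2, fit1, test = "Chisq")
print(cmp)
```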

Multiple Testing

• Take care interpreting significance levels when performing multiple tests

• Linkage disequilibrium can reduce the effective number of independent tests

• Permutation is a safe procedure to determine significance

• Repeat j=1..N times:
  – Permute disease status y between individuals
  – Fit all markers
  – Record maximum deviance maxdev[j] over all markers
• Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker
  – logscan(t,permute=1000) (slow!)
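The permutation procedure can be sketched as below. This is a hypothetical illustration, not the implementation behind logscan(t,permute=1000). It uses simulated markers coded as allele counts, a small N to keep the run short, and counts ties (>=) as exceedances. The helper `marker_deviance` is illustrative.

```r
# Hypothetical permutation sketch; NOT the author's logscan(permute = ...).
# Drop in deviance for one marker under an additive coding
marker_deviance <- function(y, g) {
  fit <- glm(y ~ g, family = binomial)
  fit$null.deviance - fit$deviance
}

# Simulated data for illustration only
set.seed(42)
n <- 200
d <- data.frame(m1 = rbinom(n, 2, 0.3),
                m2 = rbinom(n, 2, 0.5),
                m3 = rbinom(n, 2, 0.4))
y <- rbinom(n, 1, 0.2)        # disease status, unrelated to any marker here

observed <- sapply(d, function(g) marker_deviance(y, g))

N <- 100                      # number of permutations (1000 on the slide)
maxdev <- numeric(N)
for (j in seq_len(N)) {
  yperm <- sample(y)          # permute disease status between individuals
  maxdev[j] <- max(sapply(d, function(g) marker_deviance(yperm, g)))
}

# Proportion of permutations whose maximum deviance reaches the observed one
perm_p <- sapply(observed, function(dev) mean(maxdev >= dev))
print(perm_p)
```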

Haplotype Association

• Haplotype Association
  – Different from multiple genotype models
  – Phase taken into account
  – Haplotype association can be modelled in a similar logistic framework
• Treat haplotypes as extended alleles
• Fit additive, recessive, dominant & genotype models as before
  – E.g. haplotypes are h = AAGCAT, ATGCTT, etc
  – y ~ additive(h)
  – y ~ dominant(h) etc
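One way to read "haplotypes as extended alleles" is to pick a haplotype of interest and count how many copies each individual carries. The sketch below makes that assumption explicit. The phased haplotype pairs `h1`, `h2` and the choice of `target` are made up for illustration, and the slides do not say how additive(h) is actually implemented.

```r
# Made-up phased haplotype pairs for five individuals (illustration only)
h1 <- c("AAGCAT", "ATGCTT", "AAGCAT", "ATGCTT", "AAGCAT")
h2 <- c("AAGCAT", "AAGCAT", "ATGCTT", "ATGCTT", "ATGCTA")

# Treat one haplotype as the "allele" of interest and count its copies
target <- "AAGCAT"
h_add <- (h1 == target) + (h2 == target)   # additive coding: 0, 1 or 2 copies
h_dom <- as.numeric(h_add >= 1)            # dominant coding
h_rec <- as.numeric(h_add == 2)            # recessive coding
print(cbind(h_add, h_dom, h_rec))
```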