drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure...


Stats in R 101

Good job you got some data!

Now to think about analysing it the right way to see if it shows anything…

Actual scripts are coloured in blue to help separate them from helpful notes

Configuring your Data

First configure your data in such a way so it’s readable in R.

If using an excel file, put it into this style of format and save as a .csv file!

NB. Any categories termed as ‘1, 2, 3’ etc. will be treated as numerical data, so best use letters
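As a minimal sketch of why letters are safer, the example below uses a small made-up data frame (the column names and values are hypothetical) where ‘Tube’ is coded as 1, 2, 3. R reads it as a number until you convert it with factor():

```r
# Hypothetical data: Tube is coded with numbers rather than letters
dt <- data.frame(Distance = c(3.1, 4.7, 6.6, 5.0),
                 Tube     = c(1, 2, 3, 1))

is.numeric(dt$Tube)         # TRUE: R treats Tube as a number
dt$Tube <- factor(dt$Tube)  # convert it to a category (factor)
levels(dt$Tube)             # "1" "2" "3"
```

If you can't easily rename the categories in Excel, converting the column like this once the data is loaded should have the same effect.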

If using survival data, see Configuring for Survival Analysis at the end of this document

Loading your Data into R

First load up R and change the working directory to the folder where your data is.

Do this in ‘File’ ‘Change dir…’ and follow the file path

Then to read the file you want:

dt=read.csv("Data.csv")

You can get a quick preview (the first 6 lines by default) by writing

head(dt)

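Or you can view your data in a table format like in Excel but in R (note that fix() opens an editor and saves any changes you make back to dt; use View(dt) if you only want to look):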

fix(dt)
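The loading steps can also be done entirely from the console (setwd() does the same job as the ‘Change dir…’ menu). The sketch below writes a tiny made-up file to a temporary folder first so that it runs anywhere; the file contents are hypothetical. Since R 4.0, read.csv() no longer turns text columns into factors by default, so stringsAsFactors = TRUE is passed here to get categories that summary() can count.

```r
# Write a tiny made-up data set to a temporary .csv so the example is self-contained
path <- file.path(tempdir(), "Data.csv")
write.csv(data.frame(Distance = c(3.2, 4.7, 6.6),
                     Genotype = c("Relish", "Relish", "Upd3")),
          path, row.names = FALSE)

dt <- read.csv(path, stringsAsFactors = TRUE)  # text columns become factors
str(dt)      # column types at a glance
head(dt, 3)  # first 3 rows
```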

Subset your data

It’s really useful to subset your data to look at changes in means etc., say by Genotype for example:

dtr=subset(dt, Genotype=="Relish")

Then you can get a general summary of the subset:

summary(dtr)


    Distance        Genotype     Tube    Replicate
 Min.   : 0.000   Relish:225   a:69    i  :75
 1st Qu.: 3.180   Upd3  :  0   b:81    ii :75
 Median : 4.730                c:75    iii:75
 Mean   : 4.952
 3rd Qu.: 6.620
 Max.   :12.940
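If you only want the group means, you don't strictly need to subset first. As a rough sketch on made-up data (the values are simulated, not from the climbing assay), tapply() gives one mean per Genotype, and droplevels() removes empty categories such as the ‘Upd3 : 0’ entry that summary() shows for a subset:

```r
set.seed(1)
# Hypothetical data: 5 flies per genotype
dt <- data.frame(Distance = c(rnorm(5, mean = 5), rnorm(5, mean = 7)),
                 Genotype = factor(rep(c("Relish", "Upd3"), each = 5)))

means <- tapply(dt$Distance, dt$Genotype, mean)  # one mean per genotype
means

# Subsetting keeps the unused 'Upd3' level; droplevels() removes it
dtr <- droplevels(subset(dt, Genotype == "Relish"))
levels(dtr$Genotype)
```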


General Stats

Normality

To figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether your data follows a normal distribution.

You can do this through a statistical test, the Shapiro-Wilk test. In this example we’re testing the distribution of the response variable ‘Distance’ in the data sets ‘dtr’ & ‘dtu’ (‘dtu’ being the equivalent subset for the Upd3 Genotype):

attach(dtr)

shapiro.test(Distance)

W = 0.987, p-value = 0.03772

detach(dtr)

attach(dtu)

shapiro.test(Distance)

W = 0.945, p-value = 6.382e-07

The data from neither Genotype are normally distributed (both p-values are below 0.05)…

This means I should use non-parametric tests which make fewer assumptions about the data distribution. Or I could transform it so it fits a normal distribution, but I won’t.
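As an alternative to attach() and detach(), the same check can be run on every Genotype in one go. This is only a sketch on simulated data (one roughly normal group and one skewed group, both invented for illustration):

```r
set.seed(1)
# Hypothetical data: 'Relish' drawn from a normal, 'Upd3' from a skewed distribution
dt <- data.frame(
  Distance = c(rnorm(50, mean = 5, sd = 2), rexp(50, rate = 1/5)),
  Genotype = rep(c("Relish", "Upd3"), each = 50))

# Split Distance by Genotype and run a Shapiro-Wilk test on each piece
tests <- lapply(split(dt$Distance, dt$Genotype), shapiro.test)
pvals <- sapply(tests, function(t) t$p.value)
pvals  # one p-value per genotype; below 0.05 suggests non-normal data
```

This avoids forgetting a detach(), which can leave the wrong ‘Distance’ column in use.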

You can also look at the distribution of your data using Quantile-Quantile (Q-Q) plots:

qqnorm(Distance); qqline(Distance); abline(0,1)

[Figure: Normal Q-Q Plot. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]


Binomial Tests

Probably the simplest Stats test to do is the binomial test.

You use this test to see whether your ratios between 2 (i.e. binomial) categories are different from an expected ratio.

A great example is looking at sex ratio.

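Say you were over-expressing a gene during development which you think is important in male-specific development, and you want to know whether the sex ratio differs from 50:50. You can load your data into R and do a binomial test, giving the number of ‘successes’ (17 here), the number of trials (83 here) and the expected proportion (p):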

binom.test(17, 83, p=0.5)

Exact binomial test

data: 17 and 83

number of successes = 17, number of trials = 83, p-value = 5.603e-08

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

0.1240805 0.3075558

sample estimates:

probability of success 0.2048193
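binom.test() also accepts the counts of the two categories directly as a vector of (successes, failures), which saves adding up the total yourself. As a sketch, assuming counts of 17 males and 83 females (the same numbers used for PopA in the next section):

```r
# Two-category counts: 17 males ('successes') and 83 females ('failures')
bt <- binom.test(c(17, 83), p = 0.5)

bt$parameter  # number of trials, worked out as 17 + 83
bt$estimate   # observed proportion of males
bt$p.value    # probability of a ratio this extreme if the true ratio is 50:50
```

Note this is a different calculation from binom.test(17, 83), where 83 is the total number of trials.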

Proportionality Tests

If you want to check difference in proportions between 2 populations (so with no expected ratio), you can use the proportionality test which is a version of a chi-squared test.

Say you wanted to compare the sex ratios between 2 populations (e.g. PopA = 17M 83F, PopB = 40M 60F). You have to input them into the test as ‘successes’ (x) and ‘attempts’ (n).

So say we take Males as ‘successes’ and the overall offspring in each population as ‘attempts’.

The Males in each population are 17 & 40 respectively, the total offspring is 100 in each population:

prop.test(x=c(17, 40), n=c(100, 100))

2-sample test for equality of proportions with continuity correction

X-squared = 11.8758, df = 1, p-value = 0.0005687

alternative hypothesis: two.sided

95 percent confidence interval:

-0.36099504 -0.09900496

sample estimates:

0.17 0.40
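To see the link with the chi-squared test, the same counts can be laid out as a 2x2 table and passed to chisq.test(). A short sketch (the object names are just for illustration):

```r
males  <- c(PopA = 17, PopB = 40)
totals <- c(PopA = 100, PopB = 100)

pt <- prop.test(x = males, n = totals)

# Same data as a 2x2 table of males and females per population
tab <- cbind(Male = males, Female = totals - males)
ct  <- chisq.test(tab)  # continuity correction is on by default for 2x2 tables

c(prop.test = unname(pt$statistic), chisq.test = unname(ct$statistic))
```

Both calls should report the same X-squared value, matching the output above.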


Simple 2-level Category Tests

Say you have a set-up with 2 Genotypes and you want to see whether there is a difference in how far they climb. This is pretty simple to do.

For Normal Data – You use a Welch t-test

t.test(Eaten~Genotype, dt)

Welch Two Sample t-test

t = 8.7303, df = 422.968, p-value < 2.2e-16

For Non-Parametric Data – You use a Wilcoxon Rank Sum test

wilcox.test(Distance ~ Genotype, dt)

Wilcoxon rank sum test with continuity correction

W = 32453, p-value = 3.399e-15

You can visualise the data pretty easily using the normal plot function:

plot(Distance~Genotype, dt)

[Figure: boxplot of Distance (y-axis) by Genotype (x-axis: Relish, Upd3)]

Or if you want something nicer, load the ggplot2 package (‘Packages’ then ‘Install Package’ if it’s the first time, or ‘load package’ if it was installed in a previous session, or load it from the console as below)

library(ggplot2)

qplot(Genotype, Distance, data=dt, geom=c("boxplot", "jitter"),
      fill=Genotype, main="Climbing Assay",
      xlab="Genotype", ylab="Distance Climbed (cm)")


[Figure: “Climbing Assay” boxplot with jittered points. x-axis: Genotype (Relish, Upd3); y-axis: Distance Climbed (cm); fill legend: Genotype]
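Recent versions of ggplot2 mark qplot() as deprecated, so here is a rough equivalent written with ggplot(). It is a sketch on simulated data (the values are invented), and it assumes ggplot2 is already installed:

```r
library(ggplot2)

set.seed(1)
# Hypothetical climbing data: 30 flies per genotype
dt <- data.frame(Genotype = rep(c("Relish", "Upd3"), each = 30),
                 Distance = c(rnorm(30, mean = 5, sd = 2),
                              rnorm(30, mean = 7, sd = 2)))

p <- ggplot(dt, aes(x = Genotype, y = Distance, fill = Genotype)) +
  geom_boxplot(outlier.shape = NA) +  # hide outlier dots, the jitter layer shows every point
  geom_jitter(width = 0.2) +
  labs(title = "Climbing Assay", x = "Genotype", y = "Distance Climbed (cm)")

print(p)
```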

Alternatively you can view the response variable between 2 populations as an overlapping Histogram

ggplot(dt, aes(x=Distance, fill=Genotype)) + geom_histogram(alpha=0.2, position="identity")

2-Level Boxplots

Say you had data with 2 factors, e.g. Genotype and Dose, you can make a boxplot like this too!

library(ggplot2)

dt$GD <- interaction(dt$Genotype, dt$Dose) # As 2-levels of factors, need to create an interaction


ggplot(aes(y = Lifespan, x = Genotype, fill = Dose), data = dt) + geom_boxplot(position = position_dodge(width = .9))

the ‘position = position_dodge(width = .9)’ bit separates the Dose sections under Genotype


Simple 2-level Numeric Tests

Say you want to see if there’s a relationship between two numeric variables, e.g. amount eaten and body size.

You can use a simple Pearson’s Correlation

library(Hmisc)

rcorr(Size, Eaten) # rcorr takes two numeric vectors (or a matrix), not a formula

     x    y
x 1.00 0.13
y 0.13 1.00

P
  x      y
x        0.0382
y 0.0382        # Significant!

or a simple linear model

m1=lm(Size~Eaten)

summary(m1)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -33.629    104.636  -0.321    0.750
Eaten          3.069      3.043   1.008    0.323

Multiple R-squared: 0.03764, Adjusted R-squared: 0.0006251

F-statistic: 1.017 on 1 and 26 DF, p-value: 0.3226
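If you don't have Hmisc installed, base R's cor.test() does a Pearson correlation too. The sketch below uses simulated data (invented numbers, 28 flies) and also shows that, for the same two variables, the correlation p-value and the p-value for the slope of a simple linear model are the same thing:

```r
set.seed(42)
# Hypothetical data: amount eaten and body size for 28 flies
Eaten <- runif(28, min = 20, max = 50)
Size  <- 100 + 2 * Eaten + rnorm(28, sd = 40)

ct <- cor.test(Size, Eaten, method = "pearson")
m1 <- lm(Size ~ Eaten)

p_cor <- ct$p.value
p_lm  <- summary(m1)$coefficients["Eaten", "Pr(>|t|)"]
c(correlation = p_cor, linear_model = p_lm)  # the two p-values agree

unname(ct$estimate)^2   # r squared ...
summary(m1)$r.squared   # ... equals the model's Multiple R-squared
```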

To plot correlations you can write:

plot(Eaten, Size, main="Body Size vs Food Eaten", xlab="Eaten(ul)", ylab="Size(mg)", pch=19)

You can add a line-of-best-fit by:

abline(lm(Size~Eaten), col="red")


3-or-more-level Category Tests

Say you want to assess the difference in the expression level of immune genes when flies are conditioned on 3 different diets.

If the expression data is normally distributed, you can use an ANOVA to assess the effect of each treatment:

fit=aov(Expression~Diet, dt)

summary(fit)

            Df Sum Sq Mean Sq F value  Pr(>F)
Diet         2  2.905  1.4523   29.96 1.4e-07 ***
Residuals   27  1.309  0.0485

If your response variable isn’t normally distributed, you can use a Kruskal-Wallis test, the non-parametric equivalent of a one-way ANOVA:

kruskal.test(Expression~Diet, dt)

Kruskal-Wallis rank sum test

Kruskal-Wallis chi-squared = 21.3008, df = 2, p-value = 2.369e-05

Great, there’s a significant effect of diet, but between which diets?

For this you need Post-Hoc analysis.

For the ANOVA this is pretty straightforward, you use a Tukey’s test:

TukeyHSD(fit)

Fit: aov(formula = Expression ~ Diet, data = dt)

          diff        lwr        upr     p adj
Low-High -0.76 -1.0041477 -0.5158523 0.0000001
Med-High -0.33 -0.5741477 -0.0858523 0.0065371
Med-Low   0.43  0.1858523  0.6741477 0.0004751

‘p adj’ is the important value here: every pair of diets is significantly different from each other

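For Kruskal-Wallis analysis, you use the crazily named Kruskal-Wallis-Nemenyi test. You need to load the PMCMR package (on newer R installations PMCMR has been superseded by PMCMRplus, where the equivalent function is kwAllPairsNemenyiTest):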

library(PMCMR)

posthoc.kruskal.nemenyi.test(Expression~Diet, dt)


    High    Low
Low 1.4e-05 -
Med 0.093   0.033

So ‘Low’ is different from both ‘Med’ & ‘High’ diets, but ‘Med’ & ‘High’ diets are not significantly different
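If the PMCMR package isn't available, one base R option is pairwise Wilcoxon tests with a multiple-testing correction. This is a different post-hoc method from the Nemenyi test, so the p-values won't match exactly, and the sketch below uses simulated expression data rather than the data above:

```r
set.seed(2)
# Hypothetical expression data: 10 samples per diet
dt <- data.frame(
  Expression = c(rnorm(10, 2.0, 0.2), rnorm(10, 1.2, 0.2), rnorm(10, 1.7, 0.2)),
  Diet = factor(rep(c("High", "Low", "Med"), each = 10)))

kruskal.test(Expression ~ Diet, dt)

# Pairwise rank-sum tests, with Holm correction for multiple comparisons
ph <- pairwise.wilcox.test(dt$Expression, dt$Diet, p.adjust.method = "holm")
ph$p.value  # same triangular layout as the table above
```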


Multi-factorial Tests

Sometimes you want to measure the effect of more than one factor, e.g. how many eggs flies have laid under different diets and temperatures. In this scenario you want to see whether there are independent effects of diet and temperature, but also whether there is an interaction between these variables on eggs laid.

For this you can use ANOVA for normal data when dealing with categorical factors.

NB. Use ‘*’ to look at both factors & their interaction, and ‘:’ for just the interaction

m1=aov(Eggs~Diet*Temperature, dt)

summary(m1)

                 Df Sum Sq Mean Sq F value Pr(>F)
Diet              2    312   156.2   0.196  0.823
Temperature       1    459   459.3   0.575  0.451
Diet:Temperature  2   1681   840.6   1.053  0.356
Residuals        54  43113   798.4
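You can check what ‘*’ and ‘:’ expand to without fitting anything, by asking R for the terms of the formula. A small sketch (no data needed, the variable names are just the ones used above):

```r
# '*' expands to both main effects plus their interaction
star  <- attr(terms(Eggs ~ Diet * Temperature), "term.labels")
star   # "Diet" "Temperature" "Diet:Temperature"

# ':' gives the interaction term only
colon <- attr(terms(Eggs ~ Diet:Temperature), "term.labels")
colon  # "Diet:Temperature"
```

This is why the ANOVA table above has three rows (Diet, Temperature, Diet:Temperature) before the Residuals.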

Again, we can use Tukey’s HSD to see the pairwise comparisons:

TukeyHSD(m1)

Diet
          diff       lwr      upr     p adj
Low-High  5.45 -16.08384 26.98384 0.8153221
Med-High  1.65 -19.88384 23.18384 0.9813831
Med-Low  -3.80 -25.33384 17.73384 0.9052964

Temperature
             diff       lwr      upr     p adj
Hot-Cold 5.533333 -9.093484 20.16015 0.4514811

`Diet:Temperature`
                    diff       lwr      upr     p adj
Low:Cold-High:Cold  -5.6 -42.93389 31.73389 0.9977239
Med:Cold-High:Cold   2.0 -35.33389 39.33389 0.9999853
High:Hot-High:Cold  -1.6 -38.93389 35.73389 0.9999952
Low:Hot-High:Cold   14.9 -22.43389 52.23389 0.8447721
Med:Hot-High:Cold   -0.3 -37.63389 37.03389 1.0000000
Med:Cold-Low:Cold    7.6 -29.73389 44.93389 0.9904956
High:Hot-Low:Cold    4.0 -33.33389 41.33389 0.9995517
Low:Hot-Low:Cold    20.5 -16.83389 57.83389 0.5877812
Med:Hot-Low:Cold     5.3 -32.03389 42.63389 0.9982504
High:Hot-Med:Cold   -3.6 -40.93389 33.73389 0.9997322
Low:Hot-Med:Cold    12.9 -24.43389 50.23389 0.9089225
Med:Hot-Med:Cold    -2.3 -39.63389 35.03389 0.9999707
Low:Hot-High:Hot    16.5 -20.83389 53.83389 0.7805979
Med:Hot-High:Hot     1.3 -36.03389 38.63389 0.9999983
Med:Hot-Low:Hot    -15.2 -52.53389 22.13389 0.8335464

When your data aren’t normally distributed, you can’t use Kruskal-Wallis as it only works on one-way analysis…

Ordinal Logistic Regressions aren’t much help either, as your response variable has to be categorical with >=2 levels…

So use Generalised Linear Mixed Models (GLMM)!


GLMMs

These are good as they work with both categorical & numerical factors.

Linear Mixed Models are used when you have random effects which allow us to account for inherent lumpiness of data caused by factors like the individual who recorded the data, differences in position, differences in time of the day etc.

You need to load the lme4 package.

If your data follows a Gaussian distribution, use a linear mixed model (lmer), otherwise, a glmer is good (e.g. with a Poisson family for count data such as eggs laid).

library(lme4)

m1=glmer(Eggs~Diet+Temperature+(1|Replicate), data=dt, family="poisson")

summary(m1)

If the variance of the random effect is close to 0, you can ignore its effects
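To read the random-effect variance without digging through summary(), you can pull it out with VarCorr(). This is only a sketch on simulated egg counts (the design, replicate names and effect sizes are all invented), and it assumes lme4 is installed:

```r
library(lme4)

set.seed(1)
# Hypothetical design: 3 diets x 2 temperatures x 5 replicates x 4 flies each
dt <- expand.grid(Diet = c("High", "Low", "Med"),
                  Temperature = c("Cold", "Hot"),
                  Replicate = paste0("r", 1:5),
                  Fly = 1:4)

# Give each replicate its own small shift, then simulate egg counts
rep_effect <- setNames(rnorm(5, sd = 0.3), paste0("r", 1:5))
dt$Eggs <- rpois(nrow(dt), exp(3 + rep_effect[as.character(dt$Replicate)]))

m1 <- glmer(Eggs ~ Diet + Temperature + (1 | Replicate),
            data = dt, family = poisson)

vc <- as.data.frame(VarCorr(m1))
vc[, c("grp", "vcov", "sdcor")]  # variance and SD of the Replicate effect
```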


Collating & Analysing Survival Data

Configuring for Survival Analysis

Say you have data in an Excel file like:

To get it ready for R you have to configure it as so:

You do this in Excel for each replicate, treatment and factor individually. The formula can be found in the ‘Survival Transformation Example.xls’ file.

Once you have transformed each of your replicates like above, collate them with the appropriate information. Before you save the file, you have to DELETE the first data column (column E in this example) as it is just a list of the days survival was recorded. Once this has been done you can save the file as a .csv called “Survival Data for R.csv”.

Once in this format you can use a function made by the fantastic Dr Weihao Zhang to transpose it in R:

Weihao’s Script for Transposing Survival Data

raw=read.csv("Survival Data for R.csv")

# make function: stacks the 'repeated' columns into one long column ('repdat'),
# repeating the 'constant' (category) columns alongside
make.rm<-function(constant,repeated,data,contrasts) {
  if(!missing(constant) && is.vector(constant)) {
    if(!missing(repeated) && is.vector(repeated)) {
      if(!missing(data)) {
        dd<-dim(data)
        replen<-length(repeated)
        if(missing(contrasts))
          contrasts<-
            ordered(sapply(paste("T",1:length(repeated),sep=""),rep,dd[1]))
        else
          contrasts<-matrix(sapply(contrasts,rep,dd[1]),ncol=dim(contrasts)[2])
        if(length(constant) == 1) cons.col<-rep(data[,constant],replen)
        else cons.col<-lapply(data[,constant],rep,replen)
        new.df<-data.frame(cons.col,
                           repdat=as.vector(data.matrix(data[,repeated])),
                           contrasts)
        return(new.df)
      }
    }
  }
  # only reached if the arguments were missing or invalid
  cat("Usage: make.rm(constant, repeated, data [, contrasts])\n")
  cat("\tWhere 'constant' is a vector of indices of non-repeated data and\n")
  cat("\t'repeated' is a vector of indices of the repeated measures data.\n")
}

dt=make.rm(1:4, 5:112, raw)

dt.no.NAs=dt[complete.cases(dt),]

write.csv(dt.no.NAs,file = "Transposed survival data.csv", row.names = FALSE)

# End of script

** Change the numbers in make.rm(1:4, 5:112, raw) depending on how many category columns (e.g. Genotype, Diet, Death etc.; 1:4 here) and data columns (i.e. days lifespan was recorded; 5:112 here) you have in your data set.

Delete ‘contrasts’ column and change 'repdat' to 'Lifespan' in the ‘Transposed survival data’ file, save it (again as a .csv file) & it’s ready for analysis!
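To get a feel for what the transposing step does, here is a stripped-down sketch of the same idea on a tiny made-up table (one category column and two data columns, all hypothetical). It is not a replacement for the script above, just the core step: stack the data columns into one ‘Lifespan’ column, repeat the category alongside, and drop the empty rows.

```r
# Hypothetical wide table: 1 category column, 2 data columns
raw <- data.frame(Genotype = c("A", "B"),
                  V1 = c(5, NA),
                  V2 = c(7, 6))

data_cols <- 2:3  # positions of the data columns

long <- data.frame(
  Genotype = rep(raw$Genotype, times = length(data_cols)),  # repeat categories
  Lifespan = as.vector(data.matrix(raw[, data_cols])))      # stack column by column

long <- long[complete.cases(long), ]  # drop rows with no value
long
```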


Analysing Survival Data

Once you have your data in R, there are a number of ways you can analyse your data

Load data

dt=read.csv("Survivals.csv")

Load (or install, then load) survival package

library(survival)

You want to have a general look at the data to see if there are any immediate differences (i.e. mean lifespan with standard errors).

You first have to create a function for the standard error calculation.

Thankfully the brilliant Dr Weihao Zhang also made a function for this called se:

se=function(x) sqrt(var(x,na.rm=T)/(length(x)-length(which(is.na(x)))))

So subset the data as you wish and work out the means & se for Lifespan:

dti=subset(dt, Dose!="PBS")

with(dti,tapply(Lifespan, list(Genotype, Dose), mean))


with(dti,tapply(Lifespan, list(Genotype, Dose), se))

             High      Low
+/TotM   7.680851 8.140000
tub/TotM 7.604167 8.115385

              High      Low
+/TotM   0.2528403 0.292784
tub/TotM 0.1900438 0.280935
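A quick way to convince yourself the se function copes with missing values is to try it on a few numbers. The version below is an equivalent rewrite (sum(!is.na(x)) counts the non-missing values, which is the same as the length-minus-NAs in the original), and the test vector is made up:

```r
# Standard error of the mean, ignoring missing values
se <- function(x) sqrt(var(x, na.rm = TRUE) / sum(!is.na(x)))

x <- c(1, 2, 3, NA)
se(x)                  # same as sqrt(var(c(1, 2, 3)) / 3)
mean(x)                # NA: mean() does not skip missing values by default
mean(x, na.rm = TRUE)  # 2
```

So if your Lifespan column contains NAs, pass na.rm = TRUE through tapply() for the means as well, e.g. tapply(Lifespan, list(Genotype, Dose), mean, na.rm = TRUE).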

So not much difference in the means between the Genotypes…

You can test these statistically using a Wilcoxon test, as lifespan data isn’t normally distributed (here ‘dtl’ and ‘dth’ are the Low and High Dose subsets of the data):

wilcox.test(Lifespan ~ Genotype, dtl)

W = 863.5, p-value = 0.4037

wilcox.test(Lifespan ~ Genotype, dth)

W = 798.5, p-value = 0.2855


Model Analysis

Now because I have infection data in this example, there’s not much point in me doing complex analysis on survival curves using model fitting, as the survivals crash. Instead I will use the Cox Proportional Hazards model. I will use the dti data set to see if there’s a Genotype effect:

coxph(Surv(Lifespan, Death)~Genotype,dti)

coxph(formula = Surv(Lifespan, Death) ~ Genotype, data = dti)

                    coef exp(coef) se(coef)    z    p
Genotype tub/TotM 0.0401      1.04    0.143 0.28 0.78

Likelihood ratio test=0.08 on 1 df, p=0.779 n= 197, number of events= 197

Nope, doesn’t look like it…
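If you'd rather pull the numbers out of the model than read them off the printout, summary() on a Cox model stores them as tables. The sketch below uses the ‘lung’ data set that ships with the survival package purely as a stand-in (its time, status and sex columns have nothing to do with the fly data):

```r
library(survival)

# Stand-in example: effect of sex on survival in the built-in 'lung' data
fit <- coxph(Surv(time, status) ~ sex, data = lung)
s   <- summary(fit)

s$coefficients  # coef, exp(coef), se(coef), z and p, as in the printout
s$conf.int      # hazard ratio with its lower and upper 95% limits
```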

What about a Genotype:Dose interaction?

m1<-coxph(Surv(Lifespan, Death)~Genotype*Dose,dti)

m2<-update(m1,~.- Genotype:Dose )

anova(m1,m2)

Cox model: response is Surv(Lifespan, Death)

Model 1: ~ Genotype * Dose

Model 2: ~ Genotype + Dose

   loglik  Chisq Df P(>|Chi|)
1 -841.03
2 -841.74 1.4183  1    0.2337

Nope, no interaction either

Plotting Cox HP

Say I wanted to make a graph of the Cox HP results.

Let’s go back to our Genotype results:

                    coef exp(coef) se(coef)    z    p
Genotype tub/TotM 0.0401      1.04    0.143 0.28 0.78


Here the exp(coef) is the hazard ratio relative to the first genotype assessed, which in this case is +/TotM, the control Genotype. (Factor levels are processed in alphabetical order, so the level you want to relate everything to has to alphabetically come first; thankfully ‘+’ comes before all letters!)
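If your control doesn't happen to come first alphabetically, you can set the reference level yourself with relevel() instead of renaming things. A small sketch with a made-up Genotype column:

```r
# Hypothetical genotype column
Genotype <- factor(c("tub/TotM", "+/TotM", "tub/TotM", "+/TotM"))
levels(Genotype)   # the first level listed is the reference

# Make 'tub/TotM' the reference level instead
Genotype2 <- relevel(Genotype, ref = "tub/TotM")
levels(Genotype2)
```

Doing this on the column in your data set before running coxph() should make every hazard ratio relative to the level you chose.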

The se(coef) is the standard error of the coefficient.
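One thing to bear in mind is that se(coef) is on the log scale (it belongs to coef, not to exp(coef)). If you want limits on the hazard ratio scale, a common approach is to build the interval around coef and then exponentiate. A sketch using the numbers from the Genotype output above:

```r
coef_g <- 0.0401  # coef from the Genotype model above
se_g   <- 0.143   # se(coef) from the same output

hr <- exp(coef_g)                                    # hazard ratio
ci <- exp(coef_g + c(-1, 1) * qnorm(0.975) * se_g)   # 95% interval on the ratio scale

round(c(HR = hr, lower = ci[1], upper = ci[2]), 3)
```

The interval spans 1, which agrees with the non-significant p-value.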

I would put these values into a new .csv file and plot them using the plotrix package:

dt=read.csv("Surv Fig.csv")

library(plotrix) # or install first!

par(mar=c(5, 5, 4, 2))

This sets the position of graph in space, best not to touch!

plotCI(x=c(1.5), 1.041, uiw=0.142, liw=0.142, col=c('grey80'), lwd=2, pch=16, err="y", xaxt="n", xlab="TotM RNAi", ylab="Hazard Ratio", main="Hazard Proportions of Treated Flies", cex.main=2, cex.lab=1.5, ylim=c(0.8, 1.2), xlim=c(1,2))

Lots of stuff in this line... uiw & liw are the error limits, pch is the point type, etc.

abline(h=1, col=1, lty=2, lwd=2)

This adds an additional line in, h=1 means that it's horizontal and is at 1, col is the colour, lty is line type, and lwd is line width

Or if using more data points

Make a .csv file of CoxHP data:

attach(dt)

par(mar=c(5, 5, 4, 2)) #sets position of graph in space

plotCI(x=c(1.5, 2, 3, 3.5), Mean, uiw=SE, liw=SE, col=c('black', 'black', 'grey65', 'grey65'), lwd=2, pch=16, err="y", xaxt="n", xlab="Dif Rel", ylab="Hazard Ratio", main="Hazard Proportions of Treated Flies", cex.main=2, cex.lab=1.5, ylim=c(0.4, 1.4), xlim=c(1, 4.5))


# Lots of stuff in this line... uiw & liw are the error limits, pch is the point type, etc.

chp<-as.character(dt[[2]]) # stating which column in csv file to put as x axis

axis(side=1,at=c(1.5, 2, 3, 3.5), labels=chp, cex.axis=1) # stating space of axis, axis labels, cex.axis describes axis font size

abline(h=1, col=1, lty=2, lwd=2) # adds an additional line in, h=1 means that it's horizontal and is at 1, col is the colour, lty is line type, and lwd is line width
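If plotrix isn't installed, a similar figure can be put together in base R using arrows() for the error bars. This is a rough sketch with invented values (the group names, means and SEs below are hypothetical, not results from this data set):

```r
# Hypothetical hazard ratios and error sizes for four groups
hr <- data.frame(Group = c("A", "B", "C", "D"),
                 Mean  = c(1.04, 0.90, 0.70, 1.10),
                 SE    = c(0.14, 0.10, 0.12, 0.15))
xpos <- c(1.5, 2, 3, 3.5)  # where each point sits on the x axis

par(mar = c(5, 5, 4, 2))
plot(xpos, hr$Mean, pch = 16, xaxt = "n", xlab = "", ylab = "Hazard Ratio",
     main = "Hazard Proportions of Treated Flies",
     xlim = c(1, 4.5), ylim = c(0.4, 1.4))

# Error bars: a flat-headed arrow from Mean - SE to Mean + SE
arrows(xpos, hr$Mean - hr$SE, xpos, hr$Mean + hr$SE,
       angle = 90, code = 3, length = 0.05, lwd = 2)

axis(side = 1, at = xpos, labels = hr$Group)  # group names on the x axis
abline(h = 1, lty = 2, lwd = 2)               # reference line at a ratio of 1
```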