identify= to point out Analyze= to figure out Characteristics= individual qualities
drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure...
Transcript of drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure...
![Page 1: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/1.jpg)
Stats in R 101
Good job you got some data!
Now to think about analysing it the right way to see if it shows anything…
Actual scripts are coloured in blue to help separate them from helpful notes
Configuring your Data
First configure your data in such a way so it’s readable in R.
If using an excel file, put it into this style of format and save as a .csv file!
NB. Any categories termed as ‘1, 2, 3’ etc. will be treated as numerical data, so best use letters
If using survival data, see Configuring for Survival Analysis at the end of this document
Loading your Data into R
First load up R and change the directory to the file where your data is.
Do this in ‘File’ ‘Change dir…’ and follow the file path
Then to read the file you want:
dt=read.csv("Data.csv")
You can get a summary (1st 5 lines) by writing
head(dt)
Or you can view your data in a table format like in Excel but in R by:
fix(dt)
Subset your data
It’s really useful to subset your data to look at changes in means etc., say by Genotype for example:
dtr=subset(dt, Genotype=="Relish")
Then you ask some general aspects of your data:
summary(dtr)
![Page 2: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/2.jpg)
Distance Genotype Tube Replicate
Min. : 0.000 Relish:225 a:69 i :75
1st Qu.: 3.180 Upd3 : 0 b:81 ii :75
Median : 4.730 c:75 iii:75
Mean : 4.952
3rd Qu.: 6.620
Max. :12.940
![Page 3: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/3.jpg)
General StatsNormality
To figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether your data follows a normal distribution.
You can do this through a statistical test, Shapiro: (in this example we’re testing the districbution of the response variable ‘Distance’ in the data sets ‘dtr’ & ‘dtu’
attach(dtr)
shapiro.test(Distance)
W = 0.987, p-value = 0.03772
detach(dtr)
attach(dtu)
shapiro.test(Distance)
W = 0.945, p-value = 6.382e-07
Both the data from the different Genotypes are not normal…
This means I should use non-parametric tests which make fewer assumptions about the data distribution. Or I could transform it so it fits a normal distribution, but I won’t.
You can also look at the distribution of your data using Quantile-Quantile (Q-Q) plots:
qqnorm(Distance); qqline(Distance); abline(0,1)
-3 -2 -1 0 1 2 3
02
46
810
12
Normal Q-Q Plot
Theoretical Quantiles
Sam
ple
Qua
ntile
s
![Page 4: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/4.jpg)
Binomial Tests
Probably the simplest Stats test to do is the binomial test.
You use this test to see whether your ratios between 2 (i.e. binomial) categories are different from an expected ratio.
A great example is looking at sex ratio.
Say you were over-expressing a gene during development which you think is important in male-specific development. You can load your data into R and do a binomial test:
binom.test(17, 83, p=0.5)
Exact binomial test
data: 17 and 83
number of successes = 17, number of trials = 83, p-value = 5.603e-08
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.1240805 0.3075558
sample estimates:
probability of success 0.2048193
Proportionality Tests
If you want to check difference in proportions between 2 populations (so with no expected ratio), you can use the proportionality test which is a version of a chi-squared test.
Say if you wanted to assess the sex ratios between 2 populations (i.e. PopA = 17M 83F, PopB = 40M 60F), you have to input them into the test as ‘successes’ (x) and ‘attempts’ (n).
So say we take Males as ‘successes’ and the overall offspring in each population as ‘attempts’.
The Males in each population are 17 & 40 respectively, the total offspring is 100 in each population:
prop.test(x=c(17, 40), n=(100, 100))
2-sample test for equality of proportions with continuity correction
X-squared = 11.8758, df = 1, p-value = 0.0005687
alternative hypothesis: two.sided
95 percent confidence interval:
-0.36099504 -0.09900496
sample estimates:
0.17 0.40
![Page 5: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/5.jpg)
Simple 2-level Category Tests
So say you have a set-up like 2 Genotypes that you want to see a difference in how far they climb. This is pretty simple to do.
For Normal Data – You use a Welch t-test
t.test(Eaten~Genotype, dt)
Welch Two Sample t-test
t = 8.7303, df = 422.968, p-value < 2.2e-16
For Non-Parametric Data – You use a Wilcox Rank Sum
wilcox.test(Distance ~ Genotype, dt)
Wilcoxon rank sum test with continuity correction
W = 32453, p-value = 3.399e-15
You can visualise the data pretty easily using the normal plot function:
plot(Distance~Genotype, dt)
Relish Upd3
02
46
810
12
Genotype
Dis
tanc
e
Or if you want something nicer, load the ggplot2 package (‘Packages’ ‘Install Package’ [if first time], or ‘load package’ if already installed in previous sessions, or load in terminal)
library(ggplot2)
qplot(Genotype, Distance, data=dt, geom=c("boxplot", "jitter"),
fill=Genotype, main="Climbing Assay",
xlab="Genotype", ylab="Distance Climbed (cm)")
![Page 6: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/6.jpg)
0
5
10
Relish Upd3Genotype
Dis
tanc
e C
limbe
d (c
m)
GenotypeRelish
Upd3
Climbing Assay
Alternatively you can view the response variable between 2 populations as an overlapping Histogram
ggplot(dt, aes(x=Distance, fill=Genotype)) + geom_histogram(alpha=0.2, position="identity")
2-Level Boxplots
Say if you had data with 2 levels e.g. Genotype and Dose, you can make a boxplot like this too!
library(ggplot2)
dt$GD <- interaction(dt$Genotype, dt$Dose) # As 2-levels of factors, need to create an interaction
![Page 7: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/7.jpg)
ggplot(aes(y = Lifespan, x = Genotype, fill = Dose), data = dt) + geom_boxplot(position = position_dodge(width = .9))
the ‘position = position_dodge(width = .9)’ bit separates the Dose sections under Genotype
![Page 8: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/8.jpg)
Simple 2-level Numeric Tests
Say you want to see if there’s a difference between two numeric factors, e.g. amount eaten and body size.
You can use a simple Pearson’s Correlation
library(Hmisc)
rcorr(Size~Eaten)
x y
x 1.00 0.13
y 0.13 1.00
P x y
0.0382 0.0382 # Significant!
or a simple linear model
m1=lm(Size~Eaten)
summary(m1)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -33.629 104.636 -0.321 0.750
Eaten 3.069 3.043 1.008 0.323
Multiple R-squared: 0.03764, Adjusted R-squared: 0.0006251
F-statistic: 1.017 on 1 and 26 DF, p-value: 0.3226
To plot correlations you can write:
plot(Eaten, Size, main=”Body Size vs Food Eaten”, xlab=”Eaten(ul)”, ylab=”Size(mg)”, pch=19)
you can add a line-of-best-fit by:
abline(lm(Size~Eaten), col=”red”)
![Page 9: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/9.jpg)
3-or-more-level Category Tests
Say you want to assess the difference between the expression level of immune genes when flies on conditioned to 3 different diets.
If the expression data is normally distributed, you can use an ANOVA to assess the effect of each treatment:
fit=aov(Expression~Diet, dt)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 2 2.905 1.4523 29.96 1.4e-07 ***
Residuals 27 1.309 0.0485
If your response variable isn’t normally distributed, you can use a Kruskal-Wallis test to do the same thing an ANOVA does:
kruskal.test(Expression~Diet, dt)
Kruskal-Wallis rank sum test
Kruskal-Wallis chi-squared = 21.3008, df = 2, p-value = 2.369e-05
Great, there’s a significant effect of diet, but which one?
For this you need Post-Hoc analysis.
For the ANOVA this is pretty straight forward, you use a Tukey’s test:
TukeyHSD(fit)
Fit: aov(formula = Expression ~ Diet, data = dt)
diff lwr upr p adj
Low-High -0.76 -1.0041477 -0.5158523 0.0000001
Med-High -0.33 -0.5741477 -0.0858523 0.0065371
Med-Low 0.43 0.1858523 0.6741477 0.0004751
P adj is the important value here, everything’s significant with each other
For Kruskal-Wallis analysis, you use the crazy named Kruskal-Wallis-Nemenyi test. You need to load the PMCMR package:
library(PMCMR)
posthoc.kruskal.nemenyi.test(Expression~Diet, dt)
![Page 10: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/10.jpg)
High Low
Low 1.4e-05 -
Med 0.093 0.033
So ‘Low’ is different from both ‘Med’ & ‘High’ diets, but ‘Med’ & ‘High’ diets are not significantly different
![Page 11: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/11.jpg)
Multi-factorial Tests
Sometimes you want to measure the effect of more than one factor e.g. how many eggs flies have laid under different diets and temperatures> in this scenario you want to see whether there are independent effects of diet and temperature, but also whether there is an interaction between these variables on eggs laid. For this you can use ANOVA for normal data when dealing with categorical factors. NB. Use ‘*’ to look at factor & interaction, and ‘:’ for just the interaction
m1=aov(Eggs~Diet*Temperature, dt)
summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 2 312 156.2 0.196 0.823
Temperature 1 459 459.3 0.575 0.451
Diet:Temperature 2 1681 840.6 1.053 0.356
Residuals 54 43113 798.4
Again, we can use Tuskey’s HSD to see the pairwise interactions:
TukeyHSD(m1)
Diet
diff lwr upr p adj
Low-High 5.45 -16.08384 26.98384 0.8153221
Med-High 1.65 -19.88384 23.18384 0.9813831
Med-Low -3.80 -25.33384 17.73384 0.9052964
Temperature
diff lwr upr p adj
Hot-Cold 5.533333 -9.093484 20.16015 0.4514811
`Diet:Temperature`
diff lwr upr p adj
Low:Cold-High:Cold -5.6 -42.93389 31.73389 0.9977239
Med:Cold-High:Cold 2.0 -35.33389 39.33389 0.9999853
High:Hot-High:Cold -1.6 -38.93389 35.73389 0.9999952
Low:Hot-High:Cold 14.9 -22.43389 52.23389 0.8447721
Med:Hot-High:Cold -0.3 -37.63389 37.03389 1.0000000
Med:Cold-Low:Cold 7.6 -29.73389 44.93389 0.9904956
High:Hot-Low:Cold 4.0 -33.33389 41.33389 0.9995517
![Page 12: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/12.jpg)
Low:Hot-Low:Cold 20.5 -16.83389 57.83389 0.5877812
Med:Hot-Low:Cold 5.3 -32.03389 42.63389 0.9982504
High:Hot-Med:Cold -3.6 -40.93389 33.73389 0.9997322
Low:Hot-Med:Cold 12.9 -24.43389 50.23389 0.9089225
Med:Hot-Med:Cold -2.3 -39.63389 35.03389 0.9999707
Low:Hot-High:Hot 16.5 -20.83389 53.83389 0.7805979
Med:Hot-High:Hot 1.3 -36.03389 38.63389 0.9999983
Med:Hot-Low:Hot -15.2 -52.53389 22.13389 0.8335464
When you have non-parametric distributions, you can’t use Kruskal-Wallis as it only works on one-way analysis…
Ordinal Logistic Regressions are pretty bad as your response variable has to be categorical with >=2 levels…
So use Generalised Linear Mixed Models (GLMM)!
![Page 13: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/13.jpg)
GLMMs
These are good as they work with both categorical & numerical factors.
Linear Mixed Models are used when you have random effects which allow us to account for inherent lumpiness of data caused by factors like the individual who recorded the data, differences in position, differences in time of the day etc.
Need to load package lme4
If your data follows a Gaussian distribution, use a linear mixed model (lmer), otherwise, a glmer is good.
m1=glmer(Eggs~Diet+Temperature+(1|Replicate), family=”poisson”)
summary(m1)
If the variance of the random effect is close to 0, you can ignore its effects
![Page 14: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/14.jpg)
Collating & Analysing Survival Data
Configuring for Survival Analysis
Say you have data in an Excel file like:
To get it ready for R you have to configure it as so:
You do this in Excel for each replicate, treatment and factor individually. The formula can be found in the ‘Survival Transformation Example.xls’ file.
Once you have transformed each of your replicates like above, collate them with the appropriate information. Before you save the file, you have to DELETE the first data column (column E in this example) as it is just a list of the days survival was recorded. Once this has been done you can save the file as a .csv called “Survival Data for R.csv”.
Once in this format you can use a function made by the fantastic Dr Weihao Zhang to transpose it in R:
Weihao’s Script for Transposing Survival Data
raw=read.csv("Survival Data for R.csv")
# make function
make.rm<-function(constant,repeated,data,contrasts) {
if(!missing(constant) && is.vector(constant)) {
if(!missing(repeated) && is.vector(repeated)) {
if(!missing(data)) {
dd<-dim(data)
![Page 15: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/15.jpg)
replen<-length(repeated)
if(missing(contrasts))
contrasts<-
ordered(sapply(paste("T",1:length(repeated),sep=""),rep,dd[1]))
else
contrasts<-matrix(sapply(contrasts,rep,dd[1]),ncol=dim(contrasts)[2])
if(length(constant) == 1) cons.col<-rep(data[,constant],replen)
else cons.col<-lapply(data[,constant],rep,replen)
new.df<-data.frame(cons.col,
repdat=as.vector(data.matrix(data[,repeated])),
contrasts)
return(new.df)
}
}
}
cat("Usage: make.rm(constant, repeated, data [, contrasts])\n")
cat("\tWhere 'constant' is a vector of indices of non-repeated data and\n")
cat("\t'repeated' is a vector of indices of the repeated measures data.\n")
}
dt=make.rm(1:4, 5:112, raw)
dt.no.NAs=dt[complete.cases(dt),]
write.csv(dt.no.NAs,file = "Transposed survival data.csv", row.names = FALSE)
# End of script
** Change highlighted numbers depending on how many category (e.g. Genotype, Diet, Death etc.) and data columns (i.e. days lifespan was recorded) you have in your data set.
Delete ‘contrasts’ column and change 'repdat' to 'Lifespan' in the ‘Transposed survival data’ file, save it (again as a .csv file) & it’s ready for analysis!
![Page 16: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/16.jpg)
Analysing Survival Data
Once you have your data in R, there are a number of ways you can analyse your data
Load data
dt=read.csv(“Survivals.csv”)
Load (or install, then load) survival package
library(survival)
You want to have a general look at the data to see if there are any immediate differences (i.e. mean lifespan with standard errors).
You first have to create a formula for the standard error calculation.
Thankfully the brilliant Dr Weihao Zhang also made a function for this called se:
se=function(x) sqrt(var(x,na.rm=T)/(length(x)-length(which(is.na(x)))))
So subset the data as you wish and work out the means & se for Lifespan:
dti=subset(dt, Dose!="PBS")
with(dti,tapply(Lifespan, list(Genotype, Dose), mean))
se=function(x) sqrt(var(x,na.rm=T)/(length(x)-length(which(is.na(x)))))
with(dti,tapply(Lifespan, list(Genotype, Dose), se))
High Low
+/TotM 7.680851 8.140000
tub/TotM 7.604167 8.115385
High Low
+/TotM 0.2528403 0.292784
tub/TotM 0.1900438 0.280935
So not much difference in the means between the Genotypes…
You can test these statistically using a Wilcox test (lifespan data isn’t normally distributed)
wilcox.test(Lifespan ~ Genotype, dtl)
W = 863.5, p-value = 0.4037
wilcox.test(Lifespan ~ Genotype, dth)
W = 798.5, p-value = 0.2855
![Page 17: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/17.jpg)
Model Analysis
Now because I have infection data in this example, there’s not much point in me doing complex analysis on survival curves using model fitting, as the survivals crash, so I will use the Cox Hazard Proportion model. I will use the dti data set to see if there’s a Genotype effect:
coxph(Surv(Lifespan, Death)~Genotype,dti)
coxph(formula = Surv(Lifespan, Death) ~ Genotype, data = dti)
coef exp(coef) se(coef) z p
Genotype tub/TotM 0.0401 1.04 0.143 0.28 0.78
Likelihood ratio test=0.08 on 1 df, p=0.779 n= 197, number of events= 197
Nope, doesn’t look like it…
What about a Genotype:Dose interaction?
m1<-coxph(Surv(Lifespan, Death)~Genotype*Dose,dti)
m2<-update(m1,~.- Genotype:Dose )
anova(m1,m2)
Cox model: response is Surv(Lifespan, Death)
Model 1: ~ Genotype * Dose
Model 2: ~ Genotype + Dose
loglik Chisq Df P(>|Chi|)
1 -841.03
2 -841.74 1.4183 1 0.2337
Nope, no interaction either
Plotting Cox HP
Say I wanted to make a graph of the Cox HP results.
Let’s go back to our Genotype results:
coef exp(coef) se(coef) z p
Genotype tub/TotM 0.0401 1.04 0.143 0.28 0.78
![Page 18: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/18.jpg)
Here the exp(coef) is the hazard proportion relative to the first genotype assessed, which in this case is +/TotM, the control Genotype. (Subsets are processed in alphabetical order, so the subset you want to relate everything to has to alphabetically come first, thankfully ‘+’ comes before all letters!).
The se(coef) is the standard error of the coefficient.
I would put these values into a new .csv file and plot them using the Plotrix package:
dt=read.csv("Surv Fig.csv")
library(plotrix) # or install first!
par(mar=c(5, 5, 4, 2))
This sets the position of graph in space, best not to touch!
plotCI(x=c(1.5), 1.041, uiw=0.142, liw=0.142, col=c('grey80'), lwd=2, pch=16, err="y", xaxt="n", xlab="TotM RNAi", ylab="Hazard Ratio", main="Hazard Proportions of Treated Flies", cex.main=2, cex.lab=1.5, ylim=c(0.8, 1.2), xlim=c(1,2))
Lots of stuff in this line... uiw & liw are the error limits, pch is the point type, etc.
abline(h=1, col=1, lty=2, lwd=2)
This adds an additional line in, h=1 means that it's horizontal and is at 1, col is the colour, lty is line type, and lwd is line width
Or if using more data points
Make a .csv file of CoxHP data:
attach(dt)
par(mar=c(5, 5, 4, 2)) #sets position of graph in space
plotCI(x=c(1.5, 2, 3, 3.5), Mean, uiw=SE, liw=SE, col=c('black', 'black', 'grey65', 'grey65'), lwd=2, pch=16, err="y", xaxt="n", xlab="Dif Rel", ylab="Hazard Ratio", main="Hazard Proportions of Treated Flies", cex.main=2, cex.lab=1.5, ylim=c(0.4, 1.4), xlim=c(1, 4.5))
![Page 19: drflyguy.weebly.com · Web viewTo figure out if your data shows what you want, you have to figure out which Stat tests are appropriate. First it’s always good to identify whether](https://reader035.fdocuments.us/reader035/viewer/2022070707/5ea53969bb119e105c71ce9d/html5/thumbnails/19.jpg)
# Lots of stuff in this line... uiw & liw are the error limits, pch is the point type, etc.
chp<-as.character(dt[[2]]) # stating which colomn in csv file to put as x axis
axis(side=1,at=c(1.5, 2, 3, 3.5), labels=chp, cex.axis=1) # stating space of axis, axis labels, cex.axis describes axis font size
abline(h=1, col=1, lty=2, lwd=2) # adds an additional line in, h=1 means that it's horizontal and is at 1, col is the colour, lty is line type, and lwd is line width