Basic statisticsDescriptive statistics and ANOVA
Thomas Alexander Gerds
Department of Biostatistics, University of Copenhagen
Contents
I Data are variableI Statistical uncertaintyI Summary and display of dataI Confidence intervalsI ANOVA
Data are variable
A statistician is used to receive a value, such as
3.17 %,
together with an explanation, such as
"this is the expression of 1-B6.DBA-GTM in mouse 12".
The value from the next mouse in the list is 4.88% . . .
Decomposing variance
Variability of data is usually a composite of
I Measurement error, sampling schemeI Random variationI GenotypeI Exposure, life style, environmentI Treatment
Statistical conclusions can often be obtained by explaining thesources of variation in the data.
Example 1
In the yeast experiment of Smith and Kruglyak (2008) 1 transcriptlevels were profiled in 6 replicates of the same strain called ’RM’ inglucose under controlled conditions.
1the article is available at http://biology.plosjournals.org
Example 1
In the same yeast experiment Smith and Kruglyak (2008) profiledalso 6 replicates of a different strain called ’By’ in glucose.Theorder in which the 12 samples were processed was at random tominimize a systematic experimental effect.
Sources of the variation of these 12 values
I Measurement errorI Study design/experimental environmentI Genotype
Example 1
Furthermore, Smith and Kruglyak (2008) cultured 6 ’RM’ and 6’By’ replicates in ethanol.The order in which the 24 samples wereprocessed was random to minimize a systematic experimental effect.
Sources of variation
I Measurement errorI Experimental environmentI GenesI Exposure, environmental factors
Example 2
Festing and Weigler in the Handbook of Laboratory Animal Science. . .
. . . consider the results of an experiment using a completelyrandomized design . . .
. . . in which adult C57BL/6 mice were randomly allocated to oneof four dose levels of a hormone compound.
The uterus weight was measured after an appropriate time interval.
Example 2
Conclusions from the figures
I The uterus weight depends on the doseI The variation of the data increases with increasing dose
Question: Why could these first conclusions be wrong?
Descriptive statistics (summarizing data)
Categorical variables: count (%).
Continuous variables:I raw values (if n is small)I range (min, max)I location: median (IQR=inter quartile range)I location: means (SD)
Sample: Table 1
22Quality of life (QOL), supportive care, and spirituality in hematopoietic
stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care inCancer, October 2012.
R excursion: calculating descriptive statistics in groups
library(Publish)library(data.table)data(Diabetes)setDT(Diabetes) ## make data.tableDiabetes[,.(mean.age=mean(age), sd.age=sd(age),median.
chol=median(chol,na.rm=TRUE)),by=location]
location mean.age sd.age median.chol1: Buckingham 47.07500 16.74849 2022: Louisa 46.63054 15.90929 206
R excursion: making table onelibrary(Publish)data(Diabetes)tab1 <- summary(utable(location∼gender + age + Q(chol) + BMI,
data=Diabetes))tab1
Variable Level Buckingham (n=200) Louisa (n=203) Total (n=403) p-valuegender female 114 (57.0) 120 (59.1) 234 (58.1)
male 86 (43.0) 83 (40.9) 169 (41.9) 0.7422age mean (sd) 47.1 (16.7) 46.6 (15.9) 46.9 (16.3) 0.7847chol median [iqr] 202.0 [174.0, 231.0] 206.0 [183.5, 229.0] 204.0 [179.0, 230.0] 0.2017
missing 1 0 1BMI mean (sd) 28.6 (7.0) 29.0 (6.2) 28.8 (6.6) 0.5424
missing 3 3 6
R excursion: exporting a table
Method 1: Write table to file
write.csv(tab1,file="tables/tab1.csv")
Then open file tab1.csv with Excel
Method 2: Use kable3 and include in dynamic report4
‘‘‘{r,results=’asis’}knitr::kable(tab1)‘‘‘
3https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
4https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable
R excursion: exporting a table
Method 1: Write table to file
write.csv(tab1,file="tables/tab1.csv")
Then open file tab1.csv with Excel
Method 2: Use kable3 and include in dynamic report4
‘‘‘{r,results=’asis’}knitr::kable(tab1)‘‘‘
3https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
4https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable
Dynamite plots are depreciated (DO NOT USE) 5
5http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots
Dot plots are appreciated when n is small
●
●●
●
●
●
●
●
●●
●
Mea
sure
men
t sca
le
−3
−2
−1
01
23
A B C
Figure: Group A (n=3), group B (n=3, one replicate), group C (n=4)
Box plots are appreciated when n is large
●●
●
●
●
●●
●
●
●●
●●●
Mea
sure
men
t sca
le
−4
−2
02
4
A B C
Figure: Group A (n=300), group B (n=400), group C (n=400)
Making boxplots with ggplot2
library(ggplot2)bp <- ggplot(Diabetes, aes(location,chol))bp <- bp + geom_boxplot(aes(fill=location))print(bp)
Find the ggplot2 cheat sheet via help menu in Rstudio
Making boxplots with ggplot2
●
●
●●
●
●
●
●
●
●●
●
●
●●
100
200
300
400
Buckingham Louisalocation
chol
location
Buckingham
Louisa
Making boxplots with ggplot2
bp+facet_grid(.∼gender)
female male
●
●
●●
●
●
●
●
●●
●
●
100
200
300
400
Buckingham Louisa Buckingham Louisalocation
chol
location
Buckingham
Louisa
Making dotplots with ggplot2dp <- ggplot(mice,aes(x=Dose,fill=Dose,y=BodyWeight))dp <- dp + geom_dotplot(binaxis="y")print(dp)
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●●
10
11
12
13
0 1 2.5 7.5 50Dose
Bod
yWei
ght
Dose
●
●
●
●
●
0
1
2.5
7.5
50
R excursion: exporting a figure
Write figure to pdf (vector graphics, also eps 6, infinite zoom)
pdf("figures/dotplot-mice-bodyweight.pdf")dpdev.off()
Write figure to jpg (image file, also tiff, giff etc)
jpeg("figures/dotplot-mice-bodyweight.jpg")dpdev.off()
6postscript("figures/dotplot-mice-bodyweight.eps")
Quantifying variability
A sample of data X1, . . . ,XN has a standard deviation (sd); it isdefined by
SD =
√√√√ 1N − 1
N∑i=1
(Xi − X )2; X =1N
N∑i=1
Xi
SD measures the variability of the measurements in the sample.
The variance of the sample is defined as SD2. The term ’standarddeviation’ relates to the normal distribution.
What is so special about the normal distribution?
I It is symmetric around the mean, thus the mean is equal tothe median.
I The mean is the most likely value. Mean and standarddeviation describe the full destribution.
I The distribution of measurements, like height, distance,volume is often normal.
I The distribution of statistics, like mean, proportion, meandifference, etc. are very often approximately normal.
Quantifying statistical uncertainty
For statistical inference and conclusion making, via p-values andconfidence intervals, it is crucial to quantify the variability of thestatistic (mean, proportion, mean difference, risk ratio, etc.):
The standard error is the standard deviation of the statistic.
The standard error is a measure of the statistical uncertainty.
Quantifying statistical uncertainty
Example: We want to estimate the unknown mean uterus weightfor untreated mice. The standard error of the mean is defined as
SE = SD/√N where N is the sample size.
Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:
I mean: β̂ = 0.0091I standard deviation: SD = 0.002108I empirical variance: var = 0.0000044I standard error: SE = 0.002108/2 = 0.001054
Quantifying statistical uncertainty
Example: We want to estimate the unknown mean uterus weightfor untreated mice. The standard error of the mean is defined as
SE = SD/√N where N is the sample size.
Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:
I mean: β̂ = 0.0091I standard deviation: SD = 0.002108I empirical variance: var = 0.0000044I standard error: SE = 0.002108/2 = 0.001054
The standard error is the standard deviation of the mean
Ute
rus
wei
ght (
g)
0.00
00.
005
0.01
00.
015
Ourstudy
Hypotheticalstudy 1
Hypotheticalstudy 47
Hypotheticalstudy 100
The unknown trueaverage uterus
weight●
●
●
●
The (hypothetical) mean values are approximately normallydistributed, even if the data are not normally distributed!
Variance vs statistical uncertainty
"’The terms standard error and standard deviation are oftenconfused. The contrast between these two terms reflects theimportant distinction between data description and inference, onethat all researchers should appreciate."’ 7
Rules:I The higher the unexplained variability of the data, the higher
the statistical uncertainty.I The higher the sample size, the lower the statistical
uncertainty.
7Altman & Bland, Statistics Notes, BMJ, 2005, Nagele P, Br J Anaesthesiol2003;90: 514-6
Constructing confidence limits
A 95% confidence interval for the parameter β is
[β̂ − 1.96 ∗ SE ; β̂ + 1.96 ∗ SE ]
Example: a confidence interval for the mean uterus weight ofuntreated mice is given by
95%CI = [0.0091− 1.96 ∗ 0.001054; 0.0091+ 1.96 ∗ 0.001054]= [0.007; 0.011].
The standard error SE measures the variability of the mean β̂around the (unknown) population value β, under the assumptionthat the model is correctly specified.
The idea of a 95% confidence interval
Ute
rus
wei
ght (
g)
0.00
00.
005
0.01
00.
015
Ourstudy
Hypotheticalstudy 1
Hypotheticalstudy 47
Hypotheticalstudy 100
The unknown trueaverage uterus
weight●
●
●
●
By construction, we expect at most 5 of the 100 confidenceintervals not to cover (include) the true value.
Confidence limits for the mean uterus weights (long code)
library(Publish)cidat <- mice[,{meanU=mean(UterusWeight)
seU=sqrt(var(UterusWeight)/.N).(meanU=meanU,
lowerU=meanU-seU*qnorm(1 - 0.05/2),upperU=meanU+seU*qnorm(1 - 0.05/2))},by=Dose]
publish(cidat,digits=1)
Dose meanU lowerU upperU0.0 0.009 0.007 0.011.0 0.025 0.020 0.032.5 0.051 0.046 0.067.5 0.089 0.079 0.10
50.0 0.087 0.066 0.11
Confidence limits for the mean uterus weights (short code)
library(Publish)cidat <- mice[,ci.mean(UterusWeight),by=Dose]publish(cidat,digits=1)
Dose mean se lower upper level statistic0.0 0.009 0.001 0.006 0.01 0.05 arithmetic1.0 0.025 0.002 0.018 0.03 0.05 arithmetic2.5 0.051 0.002 0.044 0.06 0.05 arithmetic7.5 0.089 0.005 0.072 0.11 0.05 arithmetic
50.0 0.087 0.011 0.053 0.12 0.05 arithmetic
Confidence limits for the geometric mean uterus weights(short code)
library(Publish)gcidat <- mice[,ci.mean(UterusWeight,statistic="
geometric"),by=Dose]publish(gcidat,digits=1)
Dose geomean se lower upper level statistic0.0 0.009 1.1 0.006 0.01 0.05 geometric1.0 0.024 1.1 0.018 0.03 0.05 geometric2.5 0.051 1.0 0.044 0.06 0.05 geometric7.5 0.089 1.1 0.073 0.11 0.05 geometric
50.0 0.085 1.1 0.057 0.13 0.05 geometric
ggplot2: Plot of means with confidence intervals
library(ggplot2)pom <- ggplot(cidat)+geom_pointrange(aes(x=Dose,
y=meanU,ymin=upperU,ymax=lowerU),color=4)
pom + coord_flip() + ylab("Uterus weight (g)")+xlab("Dose")
Plot of means with confidence intervals
●
●
●
●
●
0
1
2.5
7.5
50
0.03 0.06 0.09Uterus weight (g)
Dos
e
Publish: Plot of means with confidence intervals (code)
library(Publish)plotConfidence(x=cidat$mean,
lower=cidat$lower,upper=cidat$upper,labels=cidat$Dose,title.labels="Hormon dose",title.values=expression(paste("Mean (",CI[95],")")),cex=1.8,
stripes=TRUE,xratio=c(.2,.3),xlim=c(0,.15),xlab="Uterus weight (g)")
Publish: Plot of means with confidence intervals (result)
0
1
2.5
7.5
50
Hormon dose
0.01 (0.01−0.01)
0.02 (0.02−0.03)
0.05 (0.04−0.06)
0.09 (0.07−0.11)
0.09 (0.05−0.12)
Mean (CI95)
0.00 0.05 0.10 0.15Uterus weight (g)
●
●
●
●
●
Parameters
It is generally difficult to interpret a p-value without furtherquantification of the parameter of interest.
Parameters are interpretable characteristics that have to beestimated based on data.
Examples that we will study during the course:
I MeansI Mean differencesI ProbabilitiesI Risk ratios, odds ratios, hazard ratiosI Association parameters, regression coefficients
Juonala et al. (part I)
Aims: The objective was to produce reference values and to analyse theassociations of age and sex with carotid intima-media thickness (IMT),carotid compliance (CAC), and brachial flow-mediated dilatation (FMD) inyoung healthy adults.
Methods and results: We measured IMT, CAC, and FMD with ultrasound in2265 subjects aged 24–39 years. The mean values (mean ± SD) in men andwomen were 0.592± 0.10 vs. 0.572± 0.08mm (P < 0.0001) for IMT,2.00± 0.66 vs. 2.31± 0.77%/10 mmHg (P < 0.0001) for CAC, and6.95± 4.00 vs. 8.83± 4.56% (P < 0.0001) for FMD.
The sex differences in IMT (95% confidence interval= [-0.013; 0.004] mm,P = 0.37) and CAC (95% CI=[-0.01;0.18]%/10 mmHg, P = 0.09) becamenon-significant after adjustments with risk factors and carotid diameter.
Confidence intervals
A confidence interval is a range of values which covers the unknowntrue population parameter with high probability. Roughly theprobability is 100− α% where α is the level of significance.
For example:−0.013 to 0.004
is a 95% confidence interval for the unknown average difference inIMT between men and women.
Confidence intervals have the advantage over p-values, that theirabsolute value has a direct interpretation.8
8Confidence intervals rather than P values: estimation rather thanhypothesis testing. Statistics with Confidence, Altman et al.
Relation between confidence intervals and p-values
If we estimate the parameter β, e.g.
β = mean(IMT men )-mean(IMT women)
and have computed a 95% confidence interval for this parameter,
[lower95, upper95]
then the null hypothesis
β = 0 "There is no difference"
can be rejected at the 5% significance level if the value 0 is notincluded in the interval: 0 /∈ [lower95, upper95].
Example (DGA p.208)
22 cardiac bypass operation patients were randomized to 3 types ofventilation.
Outcome: Red cell folate level (µ g/l)Group Ventilation N Mean SdI 50% N2O, 50% O2 in 24 hours 8 316.6 58.7II 50% N2O, 50% O2 during operation 9 256.4 37.1III 30–50% O2 (no N2O) in 24 hours 5 278.0 33.8
ANOVA table for red cell folate levels
Source ofvariation
Degreesof free-dom
Sum ofsquares
Meansquares
F P
Betweengroups
2 15515.88 7757.9 3.71 0.04
Withingroups
19 39716.09 2090.3
Total 21 55231.97
What are sum of squares and degrees of freedom?
Recall the definition of the variance for a sample of N valuesX1, . . . ,XN with mean=X :
Var =1
N − 1{(X1 − X )2 + · · ·+ (XN − X )2}
Var =1
N − 1︸ ︷︷ ︸degrees of freedom
{(X1 − X )2 + · · ·+ (XN − X )2︸ ︷︷ ︸Sum of squares
}
In ANOVA terminology the variance is referred to as a mean squarewhich is short for: mean squared deviation from the mean.
What are sum of squares and degrees of freedom?
Recall the definition of the variance for a sample of N valuesX1, . . . ,XN with mean=X :
Var =1
N − 1{(X1 − X )2 + · · ·+ (XN − X )2}
Var =1
N − 1︸ ︷︷ ︸degrees of freedom
{(X1 − X )2 + · · ·+ (XN − X )2︸ ︷︷ ︸Sum of squares
}
In ANOVA terminology the variance is referred to as a mean squarewhich is short for: mean squared deviation from the mean.
What are sum of squares and degrees of freedom?
Recall the definition of the variance for a sample of N valuesX1, . . . ,XN with mean=X :
Var =1
N − 1{(X1 − X )2 + · · ·+ (XN − X )2}
Var =1
N − 1︸ ︷︷ ︸degrees of freedom
{(X1 − X )2 + · · ·+ (XN − X )2︸ ︷︷ ︸Sum of squares
}
In ANOVA terminology the variance is referred to as a mean squarewhich is short for: mean squared deviation from the mean.
ANOVA methods
I Independent observationsI t test for two groupsI One-way ANOVA for more groupsI More-way ANOVA for more grouping variables
I Dependent observations:I Repeated measures anovaI Mixed effect models
I Rank statistics (non-parametric ANOVA tests)I Nonparametric anova (Kruskal-Wallis test)
I Mixture of discrete and continuous factors:I Ancova
I Model comparison and model selection . . .
Typical F-test hypotheses
H0 Null hypothesis The red cell folate does not depend onthe treatment
H1 Alternativehypothesis
The red cell folate does depend on thetreatment
This means
H0 : Mean group I = Mean group II = Mean group III
H1 : Mean group I 6= Mean group IIor Mean group III 6= Mean group IIor Mean group I 6= Mean group III
Usually we want to know which treatment yields the best response.
F-test statistic
Central idea: The deviation of a subjects response from the grandmean of all responses is attributable to a deviation of that valuefrom its group mean plus the deviation of that group mean fromthe grand mean.
F =between-group variabilitywithin-group variability
=Variance of the mean response values between groups
Variance of the values within the groups
If the between-group variability is large relative to the within-groupvariability, then the grouping factor contributes to the systematicpart of the variability of the response values.
Conclusions from the ANOVA table
Source ofvariation
Degreesof free-dom
Sum ofsquares
Meansquares
F P
Betweengroups
2 15515.88 7757.9 3.710.04
Withingroups
19 39716.09 2090.3
Total 21 55231.97
Conclusion: The red cell folate depends significantly on thetreatment.
Take home messages
I The variation of data can be decomposed into a systematicand a random part.
I The standard deviation quantifies the variability of the data.I The standard error quantifies the uncertainty of statistical
conclusions.I ANOVA is an old and general statistical technique with many
different applications.
Top Related