Post on 10-Jul-2020
PRESENTING AND SUMMARIZING DATA. CONFIDENCE INTERVALS
Jesús Piedrafita Arilla jesus.piedrafita@uab.cat
Departament de Ciència Animal i dels Aliments
Experimental Design and Statistical Methods
Workshop
Items
• Types of variables
• Numerical methods for presenting data: means, variance, skewness, kurtosis
• Estimation
  – Point estimates
  – Interval estimates
• Distributions: normal, t, chi-square
• Starting with R
  – Website
  – Objects, workspace
  – Commands
    • Expressions
    • Assignments
• First operations in R
  – Creating a vector
  – Descriptive statistics
  – Distributions
  – Pie and bar charts
• Script 2
Variables
• Quantitative (numerical)
– Continuous (adult weight, percent of a fatty acid, … )
– Discrete (countable, finite or infinite): number of colonies, litter size
• Qualitative (categorical or classification)
– Ordinal (calving ease score, panel score)
– Nominal (gender, coat colour, …)
• Bar charts are preferable to pie charts
Variable: Set of observations of a particular character
Data: Values of a variable
Summary of numerical methods for presenting data
Descriptive statistics
– Measures of central tendency: arithmetic mean, median, mode
– Measures of variability: range, variance, standard deviation, coefficient of variation
– Measures of the shape of a distribution: skewness, kurtosis
– Measures of relative position: percentiles, quartiles (Q1, Q2, Q3), z-values
Descriptive statistics attempts to describe the distribution of the data.
Measures of central tendency
Arithmetic mean:

\bar{y} = \frac{\sum_i y_i}{n} \quad \text{or} \quad \bar{y} = \sum_i f_i y_i

The second formula is for grouped data, f_i being the proportion of each value.
Median: the value that is in the middle when the observations are sorted from smallest to largest. It is robust to the presence of extreme values (in this it differs from the mean).
Mode: value among the observations that has the highest frequency
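R has no built-in function for the mode of a sample; a minimal sketch, using the base functions table() and which.max(), is:

```r
# Sketch: the sample mode (value with the highest frequency) in R
y <- c(1, 2, 2, 3, 3, 3, 4)
tab <- table(y)                          # frequency of each distinct value
as.numeric(names(tab)[which.max(tab)])   # value with the highest frequency: 3
```

Note that which.max() returns the first maximum, so for multimodal data this sketch reports only one mode.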
Measures of variability
Sample variance of n observations (more variance indicates more dispersion):

s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n - 1} = \frac{SS}{n - 1}

Corrected sum of squares:

SS = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}

Range: difference between the maximum and the minimum values in a set of observations. Very affected by extreme values.

Sample standard deviation: s = \sqrt{s^2}. It maintains the unit of measurement of the raw data. Both the variance and the standard deviation are affected by extreme values.

Coefficient of variation, a relative (dimensionless) measure of variability:

CV = \frac{s}{\bar{y}} \; (\times 100\%)
Degrees of freedom
The concept of degrees of freedom (df) is central to the principle of estimating statistics of populations from samples of them. In short, think of df as a mathematical restriction that we need to put in place when we calculate an estimate of one statistic from an estimate of another.
Let us see an example. Normal distributions need only two parameters (mean and standard deviation) for their definition. The population values of the mean and standard deviation are referred to as μ and σ, respectively, and the sample estimates are ȳ and s.
In order to estimate σ, we must first have estimated μ. Thus, μ is replaced by ȳ in the formula for s. At this point, we need to apply the restriction that the deviations from ȳ must sum to zero. Thus, the degrees of freedom are n − 1.
When this principle of restriction is applied to regression and analysis of variance, the general result is that you lose one degree of freedom for each parameter estimated prior to estimating the (residual) standard deviation.
Another way of thinking about the restriction principle behind degrees of freedom is to imagine contingencies. For example, imagine you have four numbers (a, b, c and d) that must add up to a total of m; you are free to choose the first three numbers at random, but the fourth must be chosen so that it makes the total equal to m. Thus your degrees of freedom are 3.
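The sum-to-zero restriction behind the n − 1 degrees of freedom can be checked directly in R:

```r
# The n deviations from the sample mean always sum to (numerically) zero,
# which is why only n - 1 of them are free to vary
y <- c(4, 7, 1, 8)
sum(y - mean(y))    # zero, up to floating-point rounding
```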
Measures of the shape of a distribution
Skewness: measure of asymmetry of a frequency distribution. It is 0 for a symmetric distribution, positive (+) for a distribution with a long right tail and negative (−) for one with a long left tail:

sk = \frac{n}{(n-1)(n-2)} \sum_i \left( \frac{y_i - \bar{y}}{s} \right)^3

Kurtosis: measure of flatness or steepness of a distribution, or a measure of the heaviness of its tails. It is 0 for a normal distribution, positive (+) for heavy tails and negative (−) for light tails:

kt = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left( \frac{y_i - \bar{y}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
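Base R has no skewness or kurtosis functions; a sketch implementing the two formulas above directly is:

```r
# Sample skewness and kurtosis as defined above (no base-R functions exist)
sk <- function(y) {
  n <- length(y); s <- sd(y)
  n/((n - 1)*(n - 2)) * sum(((y - mean(y))/s)^3)
}
kt <- function(y) {
  n <- length(y); s <- sd(y)
  n*(n + 1)/((n - 1)*(n - 2)*(n - 3)) * sum(((y - mean(y))/s)^4) -
    3*(n - 1)^2/((n - 2)*(n - 3))
}
```

Contributed packages (for example moments or e1071) offer similar functions, with slightly different conventions for the denominator.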
Measures of the relative position
Percentiles: The percentile value (p) of an observation yi, in a data set
has 100p% of observations smaller than yi and 100(1-p)% observations greater than yi.
Quartiles: Percentiles 25% (Q1 or lower quartile), 50% (Q2 or median) and 75% (Q3 or upper quartile).
z-value: Deviation of an observation from the mean expressed in standard deviation units:

z_i = \frac{y_i - \bar{y}}{s}
IQR: Interquartile range. Q3-Q1. Little affected by extreme values (outliers).
Why R?
• Pros
– Free software for Statistical Analysis
– Powerful to manage data and draw graphics
– Many complementary packages available
– Programming allows a better understanding of statistical methods and R procedures
– Large internet community (websites, forums, …)
• Cons
– Programming makes the analysis slower. Some Java applications can be used (Deducer)
– Treating random effects is more complicated
Some interesting websites
• The Comprehensive R Archive Network – http://cran.r-project.org
• Cookbook for R – http://www.cookbook-r.com
• Quick R – http://www.statmethods.net
• R Statistics UCLA – http://statistics.ats.ucla.edu/stat/r
• Bioconductor – http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual
• R-bloggers – http://www.r-bloggers.com
Starting with R
R is open-source software that can be found at: http://www.r-project.org
It is integrated software for manipulating data, calculation and graphical procedures.
Follow carefully the installation instructions (Windows x32 or x64, Mac)
Starting with R (2)
• The entities that R creates and manipulates are called OBJECTS: – Scalars: numbers, characters, logic (booleans), factors.
– Vectors, matrices, scalar lists.
– Functions.
– Ad hoc objects.
• All the objects are saved in a WORKSPACE.
• During an R session all the objects are in memory and can be saved for the next sessions.
• It is recommended to use several workspaces for different analyses.
• Workspaces are loaded and saved with the instructions load and save.image (in the menu).
Starting with R (2b)
We will see later another way of working (through SCRIPTS), in my opinion better than that one.
Starting with R (3)
• Two types of commands:
– Expressions: the result is shown on the screen and is not saved. > 2+2
[1] 4
– Assignments: nothing is shown on the screen. > a <- 2+2
> a
[1] 4
a <- 2+2 indicates that we are assigning the
sum of 2+2 to the object a.
An alternative is 2+2 -> a.
Note that to recover the result of 2+2 we have
to type a, followed by ENTER.
ENTER executes the command
Starting with R (4)
• R is case sensitive and distinguishes between capital and lowercase letters.
# Two different objects
> b <- 3
> b
[1] 3
> B<-6
> B
[1] 6
The # symbol is for including comments, which are not executed.
> b <- 3 indicates that we are assigning the
number 3 to the object b.
Note that in the second assignment, B<-6, there are no spaces between B, <-, and 6; spaces around <- are optional in general.
Do not use c as the name of an object! It is reserved for creating vectors.
> ADG<-c(1.99, 1.72, 1.95, 1.67, 1.51, 1.32, 1.39, 1.64,
1.78, 1.50, 1.43, 1.37, 1.60, 1.58, 1.76, 1.57, 1.81,
1.21, 1.45, 1.58, 1.58, 1.68, 1.61, 1.61, 1.78, 1.95,
1.63, 1.68, 1.71, 1.74, 1.69, 1.68, 1.36, 1.30, 1.35,
1.24, 1.38, 1.32)
> ADG
 [1] 1.99 1.72 1.95 1.67 1.51 1.32 1.39 1.64 1.78 1.50 1.43 1.37 1.60 1.58 1.76 1.57 1.81
     1.21 1.45 1.58 1.58 1.68 1.61 1.61 1.78 1.95 1.63 1.68 1.71 1.74 1.69 1.68 1.36 1.30
[35] 1.35 1.24 1.38 1.32
A first dataset (distribution)
Remember that c() creates a vector
Imagine we have record of the average daily gain (ADG) during fattening of a group of bulls of the Bruna dels Pirineus beef breed.
We are going to create a vector and save it in the object ADG:
Note that the first 34 values were printed on the same line of the screen, hence the [35] index marker.
The first calculations (1)
We can compute several statistics:
> sum(ADG)        # sum of all values
[1] 60.12
> length(ADG)     # number of records in the sample (n)
[1] 38
> min(ADG)        # sample minimum
[1] 1.21
> max(ADG)        # sample maximum
[1] 1.99
> range(ADG)      # sample range (lower and upper values)
[1] 1.21 1.99
> mean(ADG)       # sample mean
[1] 1.582105
> median(ADG)     # sample median (central value of the distribution)
[1] 1.605
> var(ADG)        # sample variance
[1] 0.03962248
> sd(ADG)         # sample standard deviation
[1] 0.199054
Note that before doing calculations we must have defined a dataset.
The first calculations (1b)
When there are missing values, NA in R, use the following:
> sum(ADG,na.rm=TRUE)      # sum of all values
[1] 60.12
> sum(!is.na(ADG))         # number of non-missing records in the sample (n)
[1] 38
> min(ADG,na.rm=TRUE)      # sample minimum
[1] 1.21
> max(ADG,na.rm=TRUE)      # sample maximum
[1] 1.99
> range(ADG,na.rm=TRUE)    # sample range (lower and upper values)
[1] 1.21 1.99
> mean(ADG,na.rm=TRUE)     # sample mean
[1] 1.582105
> median(ADG,na.rm=TRUE)   # sample median (central value of the distribution)
[1] 1.605
> var(ADG,na.rm=TRUE)      # sample variance
[1] 0.03962248
> sd(ADG,na.rm=TRUE)       # sample standard deviation
[1] 0.199054
The first calculations (2)
More statistics:
> quantile(ADG)              # quantiles; the treatment of NAs also applies here
   0%   25%   50%   75%  100%
1.210 1.400 1.605 1.705 1.990
> IQR(ADG)                   # interquartile range, Q3 - Q1
[1] 0.305
> summary(ADG)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.210   1.400   1.605   1.582   1.705   1.990
> CV<-sd(ADG)/mean(ADG)*100  # coefficient of variation; not an R function, so we define it
> CV
[1] 12.58159
Statistical inference
Drawing conclusions based on data, taking into account the inherent random variation.
1. This is the second step, after the description of the data.
2. We want to extrapolate from the sample we observe to the population.
3. We need to assume a particular data distribution:
• Normal: adult weight, loin muscle area, average daily gain, …
• Bernoulli: ill vs. not ill.
• Poisson: number of microorganisms in a microscope field.
4. In inferential statistics we estimate (obtain an approximate value of) the true value of a parameter (a mean, for example) through an adequate statistic (the sample mean, for example).
5. There are many contexts in which inference is desirable, and there are many approaches to performing inference.
6. Some methods do not need to assume a distribution: non-parametric methods.
Parameters and statistics
Usually the parameters of the distribution are designated with Greek letters, whereas the corresponding statistics are designated with Latin letters. The next table includes some examples:

                     Parameter (population)   Statistic (sample)
Mean                 μ                        ȳ
Variance             σ²                       s²
Standard deviation   σ                        s
Proportion           π                        p

Estimator: an equation that allows us to estimate a parameter.
Estimate: the value obtained.
Estimation of parameters
1. Point estimation: a single value is obtained as an estimate of the parameter.
2. Interval estimation: we calculate an interval in which, with a certain probability, we can affirm that the true value of the parameter is found.
So far we have presented some point estimators of several parameters.
In practice, when we work with the unknown parameter of a population, in addition to this point estimate we are usually interested in an interval (confidence interval, CI) that gives an idea of the uncertainty of the estimate.
We will present the way to construct intervals through some classical examples. The procedure is based upon the distribution of the statistic.
Normal distribution
Y \sim N(\mu, \sigma^2) if its p.d.f. is

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}}

[Figures: portrait of Carl Friedrich Gauss; the standard normal density]
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Normal_distribution
Applications of the normal distribution
1. Sometimes we have to know whether a given sample is distributed
normally before we can apply a certain test to it.
2. Knowing whether a sample is distributed normally may confirm or
reject certain underlying hypotheses about the nature of the factors
affecting the phenomenon studied. If a variable is distributed
normally, we can think that the causing factors affecting this
variable are additive, independent and of equal variance.
• Skewness may suggest some type of selection.
• Bimodality may indicate a mixture of observations from two
populations.
• In many cases, transformations of non-normal variables change the
distribution of the transformed variable to normality.
3. If we assume a given distribution to be normal, we may make
predictions and tests of given hypotheses based upon this assumption.
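A quick sketch of such a normality check in R, using the ADG data introduced earlier, could combine the Shapiro-Wilk test with a histogram:

```r
# Sketch: checking normality of the ADG sample
ADG <- c(1.99, 1.72, 1.95, 1.67, 1.51, 1.32, 1.39, 1.64, 1.78, 1.50,
         1.43, 1.37, 1.60, 1.58, 1.76, 1.57, 1.81, 1.21, 1.45, 1.58,
         1.58, 1.68, 1.61, 1.61, 1.78, 1.95, 1.63, 1.68, 1.71, 1.74,
         1.69, 1.68, 1.36, 1.30, 1.35, 1.24, 1.38, 1.32)
shapiro.test(ADG)   # a large p-value gives no evidence against normality
hist(ADG)           # a histogram helps to spot skewness or bimodality
```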
One application of the standard normal
Remember that we defined the z-value as follows:
z_i = \frac{y_i - \bar{y}}{s}
where zi follows a standard normal distribution.
Imagine we have a value of 1.75 kg /day for ADG in the Bruna breed. We can compute the probability of having a value lower than this one (CDF) if the distribution is normal using R commands:
> z<-(1.75-1.582)/0.199;z
[1] 0.8442211
> pnorm(z)
[1] 0.8007271
The complement of pnorm(z), i.e. 1-pnorm(z), will be the probability of having a value bigger than 1.75, in this case about 0.2.
Distribution of the entire population
Suppose we want to measure the mean ADG of Bruna dels Pirineus bulls (as we have done in fact). Usually we do not have the entire population, but a sample. Let us assume that we take repeated samples with replacement of size n (in our case n = 38) from that entire population, which is normally distributed.
For each sample we will have a different, but close mean, for example 1.58, 1.60, 1.64, 1.53, 1.59, … and so on. It can be shown that:
\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)

The standard deviation of this distribution of estimated means, \sigma_{\bar{y}} = \sigma/\sqrt{n}, is called the standard error of the mean and is used to compute confidence intervals.
CI: mean of a normal, variance known
If we fix some confidence level (for example 95%), with 1 − α = 0.95, the true mean is found in the interval:

\left[ \bar{y} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\; \bar{y} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]

The endpoints are the confidence limits, and z_{1-\alpha/2} = 1.96 for 1 − α = 95%:

> qnorm(0.975)
[1] 1.959964

qnorm gives the quantile (z-value) of the cdf of a standard normal.

The interval length is 2 z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}.

Note that \sigma/\sqrt{n} is the standard error, i.e., the standard deviation of the distribution of means under repeated (infinite) sampling of size n.
CI: mean of a normal, variance known (example)
Assuming that the estimated variance for ADG is the true variance, and that ADG is normally distributed, the 95% confidence interval of the mean is:
[1.582 − 1.96 × 0.032, 1.582 + 1.96 × 0.032] = [1.519, 1.645]

With interval length: 2 × 1.96 × 0.032 = 0.125
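This calculation can be sketched in R with the values from the example:

```r
# Sketch: 95% CI for the mean, variance assumed known
ybar <- 1.582; sem <- 0.032         # sample mean and standard error from the slide
z <- qnorm(0.975)                   # 1.96 for a 95% confidence level
c(ybar - z*sem, ybar + z*sem)       # lower and upper limits, about 1.519 and 1.645
```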
CI: mean of a normal, variance unknown (1)
This is the common case; it is similar to the previous one, but uses the t-distribution instead of the normal distribution:

\left[ \bar{y} - t_{n-1,\,1-\alpha/2} \frac{s}{\sqrt{n}},\; \bar{y} + t_{n-1,\,1-\alpha/2} \frac{s}{\sqrt{n}} \right]

The length of the interval is 2 t_{n-1,\,1-\alpha/2} \frac{s}{\sqrt{n}}.

If we use the data of the average daily gain in beef, the t value with 37 df and α/2 = 0.025 is 2.03, so we have

[1.582 − 2.03 × 0.032, 1.582 + 2.03 × 0.032] = [1.517, 1.647]

with interval length 2 × 2.03 × 0.032 = 0.13.
t distribution (1)
T \sim t_{n-1} if its p.d.f. is

f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, with \nu = n - 1

Note that the t-distribution approaches the normal distribution as ν increases.

[Figure: t densities for ν = 1 and ν = 30 against the standard normal; portrait of Gosset]
http://en.wikipedia.org/wiki/Student%27s_t-distribution
http://en.wikipedia.org/wiki/William_Sealy_Gosset
t distribution (2)
Let z be a standard normal random variable with μ = 0 and σ = 1, and let χ² be a chi-square random variable with ν degrees of freedom. Then

t = \frac{z}{\sqrt{\chi^2/\nu}}

is a random variable with a central Student t distribution with ν degrees of freedom.

We can also take a normal random variable y with non-zero mean and σ = 1; then

t = \frac{y}{\sqrt{\chi^2/\nu}}

has a non-central t distribution, defined by ν degrees of freedom and the noncentrality parameter (included in y).
The chi-square distribution (also chi-squared or 2-distribution) with k degrees of
freedom is the distribution of a sum of the squares of k independent standard normal
random variables. It is one of the most widely used probability distributions in
inferential statistics, e.g. in hypothesis testing or in construction of confidence
intervals.
The best-known situations in which the chi-square distribution is used are the
common chi-square tests for goodness of fit of an observed distribution to a
theoretical one, and of the independence of two criteria of classification of qualitative
data.
The chi-square distribution is a special case of the gamma distribution, with p.d.f.:

f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, \quad x > 0

http://en.wikipedia.org/wiki/Chi_square_distribution
CI: mean of a normal, variance unknown (2)
In R we can compute t_{n-1,\,1-\alpha/2} as:
If we use the data of the average daily gain in beef, we have
> qt(0.975,length(ADG)-1)
[1] 2.026192
> YBAR<-mean(ADG)
> YBAR
[1] 1.582105
> SEM<-sd(ADG)/sqrt(length(ADG))
> SEM
[1] 0.03229081
> UCL<-YBAR+SEM*qt(0.975,length(ADG)-1)
> UCL
[1] 1.647533
> LCL<-YBAR-SEM*qt(0.975,length(ADG)-1)
> LCL
[1] 1.516678
CI: Interpretation
If we took all possible samples of size 38 of average daily gain in the Bruna dels Pirineus beef breed, and for each of them made the above calculations, approximately 95% of the intervals found would contain μ.
We do not know whether the interval we have found contains μ or not, because this parameter is unknown (in fact it is what we are looking for), but we are 95% confident that it does.
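This repeated-sampling interpretation can be illustrated by simulation; the population values below are hypothetical, chosen to resemble the ADG example:

```r
# Sketch: coverage of the 95% t-interval under repeated sampling
set.seed(1)
mu <- 1.58; sigma <- 0.20; n <- 38       # hypothetical population and sample size
covered <- replicate(10000, {
  y <- rnorm(n, mu, sigma)               # one sample of size n
  sem <- sd(y)/sqrt(n)
  tcrit <- qt(0.975, n - 1)
  (mean(y) - tcrit*sem) < mu && mu < (mean(y) + tcrit*sem)
})
mean(covered)                            # proportion of intervals containing mu, close to 0.95
```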
Creating and executing a script (1)
Usually we do not work by writing the commands we want to execute in the R console. Instead of that we use scripts.
A script is a sequence of commands that we write in the R editor and afterwards save in R format.
It is important to work in a particular directory where we will save both the scripts and the data needed to run the calculations.
To go to a new directory in Windows you can use
the Change directory option in File (“Archivo”) in
the console.
If you are a Mac user, you can write and execute in the console something similar to:
setwd("/Users/Documents/DEME/scripts-data")
To know which directory you are in (both in Windows and Mac): getwd()
Creating and executing a script (2)
Once in the working directory, to create a new script go to File and then click the New Script button (New Document on Macs).
To open a previously saved script, click Open Script and then click the desired script (for example sdescriptive.R).
To execute the script in Windows, mark with the cursor the line or lines to be executed and press the green-arrow icon (in the left figure).
On Macs, select the line or lines (or put the cursor at the end of an executable line) and then press cmd+enter.
A script to describe the data
Note that > is not necessary at the beginning of the line.
Piecharts in R (1)
> #Simple piechart
> SLICES <- c(2345,350,47,13)
> LBSL <- c("1. Easy", "2. Light assist.", "3. Strong assist.",
+ "4. Vet. assist.")
> pie(SLICES, labels = LBSL, main="PIE CHART OF CALVING EASE")
Let us construct a pie chart describing calving ease in Bruna P. breed:
Note that c() is the usual way to create a vector; in this case SLICES is the name assigned to it in the second line. In the third line, we indicate the names of the categories of calving ease in another vector: LBSL.
Note that we will use capital letters to name our variables. Lower-case letters will be reserved for R commands and text within quotes.
Note also that when a command spans two or more lines, the second and following lines start with the continuation prompt +.
[Figure: pie chart of calving ease with the four category labels]
Piecharts in R (2)
> #Pie chart with percentages and another set of colors
> SLICES <- c(2345,350,47,13)
> LBLS <- c("1. Easy","2. Light assist.","3. Strong assist.",
+ "4. Vet. assist.")
> PCT <- round(SLICES/sum(SLICES)*100,1)
> LBLS <- paste(LBLS, PCT)
> LBLS <- paste(LBLS,"%",sep="")
> COLORS <- c("green","blue","yellow","maroon")
> pie(SLICES, labels = LBLS, col=COLORS,
+ main="PIE CHART OF CALVING EASE")
We can choose the colours and add percentages:
[Figure: pie chart with percentages: 1. Easy 85.1%, 2. Light assist. 12.7%, 3. Strong assist. 1.7%, 4. Vet. assist. 0.5%]
More pie chart variants can be found on the internet or in some books, for example 3D pie charts, rainbow colours, etc.
Bar chart in R
Note that it is easier to visualize the distribution in this format.
> COUNTS <- c(2345,350,47,13)
> barplot(COUNTS, main="Calving ease in beef cattle",
+ ylab="Number of calvings per category",
+ names.arg = c("1. Easy","2. Light ass.","3. Str ass.","4. Vet ass."),
+ border="blue", density=c(20,30,40,50))
[Figure: bar chart "Calving ease in beef cattle"; y-axis "Number of calvings per category", 0-2000]