Use of Statistics by Scientist
LCGC Europe Online Supplement: statistics and data analysis
In this article we look at the initial steps in
data analysis (i.e., exploratory data analysis),
and how to calculate the basic summary
statistics (the mean and sample standard deviation). These two processes, which
increase our understanding of the data
structure, are vital if the correct selection of
more advanced statistical methods and
interpretation of their results are to be
achieved. From that base we will progress to
significance testing (t-tests and the F-test).
These statistics allow a comparison between
two sets of results in an objective and
unbiased way. For example, significance
tests are useful when comparing a new
analytical method with an old method or
when comparing the current day's production with that of the previous day.
Exploratory Data Analysis
Exploratory data analysis is a term used to
describe a group of techniques (largely
graphical in nature) that sheds light on the
structure of the data. Without this
knowledge the scientist, or anyone else,
cannot be sure they are using the correct
form of statistical evaluation.
The statistics and graphs referred to in this
first section are applicable to a single
column of data (i.e., univariate data), such as the number of analyses performed in a
laboratory each month. For small amounts
of data (
obtained). If any systematic trends are
observed (Figures 3(a)-3(c)) then the
reasons for this must be investigated.
Normal statistical methods assume a
random distribution about the mean with
time (Figure 3(d)) but if this is not the case
the interpretation of the statistics can be
erroneous.
Summary Statistics
Summary statistics are used to make sense
of large amounts of data. Typically, the
mean, sample standard deviation, range,
confidence intervals, quantiles (1), and
measures for skewness and
spread/peakedness of the distribution
(kurtosis) are reported (2). The mean and
sample standard deviation are the most
widely used and are discussed below
together with how they relate to the
confidence intervals for normally
distributed data.
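The summary statistics described above are easily reproduced outside a spreadsheet. As a sketch (here in Python rather than Excel or a calculator, which is an assumption of this example; the data are the Method 1 selenium results from Table 3 later in this supplement):

```python
import statistics

# Method 1 results from Table 3 of this supplement
data = [4.2, 4.5, 6.8, 7.2, 4.3]

mean = statistics.mean(data)        # arithmetic mean
s = statistics.stdev(data)          # sample standard deviation (n - 1 denominator)
data_range = max(data) - min(data)  # range

print(mean, round(s, 3), round(data_range, 1))  # 5.4 1.471 3.0
```

These values (mean 5.40, sample standard deviation 1.471) agree with those quoted in Table 3.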
The Mean
The average or arithmetic mean (3) is
generally the first statistic everyone is
taught to calculate. This statistic is easily found using a calculator or spreadsheet
and simply involves the summing of the
individual results (x1, x2, x3, ..., xn) and
division by the number of results (n),
where,
x̄ = (x1 + x2 + x3 + ... + xn)/n = (1/n) Σ xi,  summing over i = 1 to n
Unfortunately, the mean is often reported
as an estimate of the true value (μ) of
whatever is being measured without
considering the underlying distribution.
This is a mistake. Before any statistic is
calculated it is important that the raw data
should be carefully scrutinized and plotted
as described above. An outlying point can have a big effect on the mean (compare
Figure 1(a) with 1(b)).
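The pull an outlier exerts on the mean can be seen in a small sketch (hypothetical values, in Python rather than a calculator):

```python
# Illustrative sketch (hypothetical values): one outlier pulls the mean upwards
clean = [5, 6, 7, 5, 6, 7, 6]     # n = 7, mean = 6
with_outlier = clean + [20]       # the same data plus one aberrant result

mean_clean = sum(clean) / len(clean)
mean_outlier = sum(with_outlier) / len(with_outlier)
print(mean_clean, mean_outlier)   # 6.0 7.75
```

A single aberrant result shifts the mean by nearly two units, which is why plotting the raw data first is so important.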
The Standard Deviation (3)
The standard deviation is a measure of the
spread of data (dispersion) about the mean
and can again be calculated using a
calculator or spreadsheet. There is,
however, a slight added complication; if
you look at a typical scientific calculator
you will notice there are two types of
Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each value into a stem (the leading digits; in the figure below, from 0.1 to 0.6) and a leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note, the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range); thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.
Stem-and-leaf plot
Units = 0.1   1|2 = 0.12   Count = 42
5 1|22677
14 2|112224578
(15) 3|000011122333355
13 4|0047889
6 5|56669
1 6|3
*outlier
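The truncate-then-group construction described in Box 1 can be sketched in a few lines (a Python sketch with hypothetical values; the article itself assumes the plot is drawn by hand or by a statistics package):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values as stem (first decimal digit) | leaf (second decimal digit).
    Trailing decimal places are truncated, not rounded, as described in Box 1."""
    stems = defaultdict(list)
    for v in sorted(values):
        two_digits = int(v * 100 + 1e-6)   # first two decimal digits, truncated
        stems[two_digits // 10].append(two_digits % 10)
    return dict(sorted(stems.items()))

# Hypothetical values: 0.216 -> 2|1 and 0.350 -> 3|5, as in the text
plot = stem_and_leaf([0.216, 0.350, 0.12, 0.17, 0.63, 0.35, 0.31])
for stem, leaves in plot.items():
    print(f"{stem}|{''.join(str(leaf) for leaf in leaves)}")
```

This prints one line per stem (1|27, 2|1, 3|155, 6|3 for the data above), mirroring the layout of the plot in the box.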
figure 2 Frequency histogram and Box and Whisker plot. (Panels (a) and (b): frequency (number of data points in each bar) and a box plot marking the median, lower and upper quartile values, the interquartile range, 1.5 × interquartile whiskers, and outliers (*). The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.)
figure 3 Time-indexed plots of magnitude against time. ((a) n = 7, mean = 6, standard deviation = 2.16; (b) n = 9, mean = 6, standard deviation = 2.65; (c) n = 9, mean = 6, standard deviation = 2.06; (d) n = 9, mean = 6, standard deviation = 1.80.)
Significance Testing
Suppose, for example, we have the
following two sets of results for lead
content in water 17.3, 17.3, 17.4, 17.4
and 18.5, 18.6, 18.5, 18.6. It is fairly clear,
by simply looking at the data, that the two
sets are different. In reaching this
conclusion you have probably considered the amount of data, the average for each
set and the spread in the results. The
difference between two sets of data is,
however, not so clear in many situations.
The application of significance tests gives
us a more systematic way of assessing the
results with the added advantage of
allowing us to express our conclusion with
a stated degree of confidence.
What does significance mean?
In statistics the words significant and
significance have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of purely random variation.
As previously mentioned, to decide if one
set of results is significantly different from
another depends not only on the
magnitude of the difference in the means
but also on the amount of data available
and its spread. For example, consider the
blob plots shown in Figure 5. For the two
data sets shown in Figure 5(a), the means
for set (i) and set (ii) are numerically
different. From the limited amount of
information available, however, they are
from a statistical point of view the same.
For Figure 5(b), the means for set (i) and set (ii) are probably different but when
fewer data points are available, Figure 5(c),
we cannot be sure with any degree of
confidence that the means are different
even if they are a long way apart. With a
large number of data points, even a very
small difference, can be significant (Figure
5(d)). Similarly, when we are interested in
comparing the spread of results, for
example, when we want to know if
method (i) gives more consistent results
than method (ii), we have to take note of
the amount of information available (Figures 5(e)-(g)).
It is fortunate that tables are published
that show how large a difference needs to
be before it can be considered not to have
occurred by chance. These are the critical t-values for differences between means,
and critical F-values for differences
between the spread of results (4).
Note: Significance is a function of sample
size. Comparing very large samples will
nearly always lead to a significant
difference but a statistically significant
result is not necessarily an important result.
For example in Figure 5(d) there is a
statistically significant difference, but does
it really matter in practice?
What is a t-test?
A t-test is a statistical procedure that can
be used to compare mean values. A lot of
jargon surrounds these tests (see Table 1
for definition of the terms used below) but
they are relatively simple to apply using the
built-in functions of a spreadsheet like
Excel or a statistical software package.
Using a calculator is also an option but you
have to know the correct formula to apply
(see Table 2) and have access to statistical
tables to look up the so-called critical
values (4).
Three worked examples are shown in Box 2 (5) to illustrate how the different
t-tests are carried out and how to interpret
the results.
What is an F-test?
An F-test compares the spread of results in
two data sets to determine if they could
reasonably be considered to come from the
same parent distribution. The test can, therefore, be used to answer questions such as "are two methods equally precise?"
The measure of spread used in the F-test is
variance, which is simply the square of the standard deviation. The variances are
ratioed (i.e., divide the variance of one set of data by the variance of the other) to get the test value F = s1²/s2².
This F value is then compared with a critical
value that tells us how big the ratio needs
to be to rule out the difference in spread
occurring by chance. The Fcrit value is
found from tables using (n1 - 1) and (n2 - 1)
degrees of freedom, at the appropriate
level of confidence. [Note: it is usual to arrange s1 and s2 so
that F > 1]. If the standard deviations are to
be considered to come from the same
population then Fcrit > F. As an example we
use the data in Example 2 (see Box 2).
Fcrit = 9.605 for (5 - 1) and (5 - 1) degrees of freedom at the 97.5% confidence level.
As Fcrit > Fcalculated we can conclude that the spread of results in the two data sets is not significantly different and it is,
therefore, reasonable to combine the two
standard deviations as we have done.
F = s1²/s2² (arranged so that F > 1) = 2.750²/1.471² = 3.49
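The F-ratio arithmetic above is simple enough to check directly (a Python sketch, which is an assumption of this example; the critical value is the tabulated one quoted in the text):

```python
# Sketch of the F-test arithmetic from Example 2
s1, s2 = 1.471, 2.750            # sample standard deviations of the two data sets
n1, n2 = 5, 5

F = max(s1, s2) ** 2 / min(s1, s2) ** 2   # arranged so that F > 1
F_crit = 9.605                   # from tables: (5 - 1) and (5 - 1) d.o.f., 97.5% level

print(round(F, 2), F < F_crit)   # 3.49 True -> spreads not significantly different
```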
table 1 Definitions of statistical terms used in significance testing.

Jargon: Definition
Alternative hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed] or mean1 > mean2 [see one-tailed]).
Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level, against which the result of applying a significance test is compared.
Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).
One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ...
Box 2
Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.
To test the null hypothesis we calculate the t-value as below. Note, the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation) and the amount of data available (n).
In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means, or only that there is a difference? That is, are we performing a one-sided or two-sided t-test (see Table 1)? In the case above, it is the latter; therefore, the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n - 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.
t = (23.5 - 22.7)/(0.9/√10) = 2.81
tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom.
As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.
[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]
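The one-sample t calculation in Example 1 can be sketched as follows (Python rather than a calculator, which is an assumption of this example; the critical value is the tabulated one quoted above):

```python
import math

# Example 1: one-sample t-test of the new method mean against the long-term mean
long_term_mean = 22.7   # g/L, old method check samples
new_mean = 23.5         # g/L
s, n = 0.9, 10          # standard deviation and number of results for new method

t = abs(new_mean - long_term_mean) / (s / math.sqrt(n))
t_crit = 2.26           # two-sided, 95% confidence, 9 degrees of freedom (tables)

print(round(t, 2), t > t_crit)   # 2.81 True -> reject H0
```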
Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3.
Using the t-test for independent sample means we define the null hypothesis H0 as: x̄1 = x̄2.
This means there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviations (Sc) (see What is an F-test?). If the standard deviations are significantly different then the t-test for unequal variances should be used (Table 2).
Evaluating the test statistic:
Sc = √[(1.471² × (5 - 1) + 2.750² × (5 - 1))/(5 + 5 - 2)] = 2.205
t = (5.40 - 4.76)/(Sc × √(1/5 + 1/5))
The 95% critical value is 2.306 for
n = 8 (n1 + n2 - 2) degrees of freedom.
This exceeds the calculated value of
0.459, thus the null hypothesis (H0)
cannot be rejected and we conclude
there is no significant difference between
the means or the results given by the
two methods.
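The pooled two-sample calculation of Example 2 can be reproduced end to end (a Python sketch, which is an assumption of this example; the data are from Table 3):

```python
import math

# Example 2: pooled two-sample t-test on the selenium data of Table 3
method1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method2 = [9.2, 4.0, 1.9, 5.2, 3.5]

def mean(x):
    return sum(x) / len(x)

def stdev(x):
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

n1, n2 = len(method1), len(method2)
s1, s2 = stdev(method1), stdev(method2)

# Pooled standard deviation (valid because the F-test found no significant
# difference between the two spreads)
sc = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = abs(mean(method1) - mean(method2)) / (sc * math.sqrt(1/n1 + 1/n2))

print(round(sc, 3), round(t, 3))   # 2.205 0.459
```

Both the pooled standard deviation (2.205) and the test statistic (0.459) match the worked values in the text.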
Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.
The null hypothesis is H0: d̄ = 0 against the alternative H1: d̄ ≠ 0.
The test is a two-tailed test as we are interested in both d̄ > 0 and d̄ < 0.
The mean of the paired differences is d̄ = -0.475 and their sample standard deviation is sd = 0.700.
The tabulated value of tcrit (with
n = 7 degrees of freedom, at the 95%
confidence limit) is 2.365. Since the calculated value is less than the critical
value, H0 cannot be rejected and it
follows that there is no difference between
the two techniques.
t = 0.475 × √8/0.700 = 1.918
t = (5.40 - 4.76)/(2.205 × 0.632) = 0.64/1.395 = 0.459  (Example 2)
table 3 Results from two methods used to determine concentrations of selenium.

            Results                   x̄     s
Method 1    4.2  4.5  6.8  7.2  4.3   5.40  1.471
Method 2    9.2  4.0  1.9  5.2  3.5   4.76  2.750
table 4 Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
Matrix             1      2      3      4      5      6      7      8
A (mg/g)        2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
B (mg/g)        3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
Difference (d) -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02
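The paired-difference statistics of Example 3 follow directly from Table 4 (a Python sketch, which is an assumption of this example):

```python
import math

# Example 3: paired t-test on the vitamin data of Table 4
a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]   # Method A (mg/g)
b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]   # Method B (mg/g)

d = [x - y for x, y in zip(a, b)]     # paired differences
n = len(d)
d_mean = sum(d) / n
s_d = math.sqrt(sum((v - d_mean) ** 2 for v in d) / (n - 1))

t = abs(d_mean) * math.sqrt(n) / s_d
print(round(d_mean, 3), round(s_d, 3), round(t, 3))   # -0.475 0.7 1.918
```

The calculated t (1.918) is below the tabulated critical value of 2.365, so H0 is not rejected, as in the text.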
Two-way ANOVA
In a typical experiment things can be more
complex than described previously. For
example, in Example 2 the aim is to find
out if time and/or temperature have any
effect on protein yield when analysing
samples of tinned ham. When analysing
data from this type of experiment we use two-way ANOVA. Two-way ANOVA can
test the significance of each of two
experimental variables (factors or
treatments) with respect to the response,
such as an instrument's output. When
replicate measurements are made we can
also examine whether or not there are
significant interactions between variables.
An interaction is said to be present when
the response being measured changes
more than can be explained from the
change in level of an individual factor. This
is illustrated in Figure 2 for a process with two factors (Y and Z) when both factors
are studied at two levels (low and high). In
Figure 2(b), the changes in response
caused by Y depend on Z, and vice versa.
In two-way ANOVA we ask the
following questions:
Is there a significant interaction between
the two factors (variables)?
Does a change in any of the factors
affect the measured result?
It is important to check the answers in the
right order: Figure 3 illustrates the
decision process. In the case of Example2 the questions are:
Is there an interaction between
temperature and time which affects the
protein yield?
Does time and/or temperature affect the
protein yield?
Using the built-in functions of a spreadsheet (in this case Excel's data analysis tools, two-factor analysis with replication) we see that there is a significant interaction between time and temperature and a significant effect of temperature alone (both p-value < 0.05 and F > Fcrit). Following the process
outlined in Figure 3, we consider the
interaction question first by comparing the
mean squares (MS) for the within-group
variation with the interaction MS. This is
reported in the results table of Example 2.
F = 0.021911/0.004315 = 5.078
If the interaction is significant (F > Fcrit),
as in this case, then the individual factors
(time and temperature) should each be
compared with the MS for the interaction
(not the within-group MS) thus:
Ftemp = 0.024844/0.021911 = 1.134
Example 2 Two-way ANOVA
The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 C) and three times (30, 60 and 90 minutes). Three analyses, determining protein yield, were made at each temperature and time. The measurements are summarized in the table below and the results of the two-way ANOVA are given in the following table.
Protein yield at each combination of time and temperature (three replicate analyses per cell):

Temp (C)   30 min                60 min                90 min
415        27.13  27.20  27.13   27.29  27.13  27.23   27.03  27.13  27.07
435        27.20  26.97  27.13   27.07  27.10  27.03   27.20  27.23  27.27
460        27.03  27.10  27.13   27.10  27.07  27.03   27.03  27.07  26.90
Anova: Two-factor with replication

Source of Variation       SS        df   MS        F         P-value   F crit
Sample (=Time)            0.000867   2   0.000433  0.100429  0.904952  3.554561
Columns (=Temperature)    0.049689   2   0.024844  5.757940  0.011667  3.554561
Interaction               0.087644   4   0.021911  5.078112  0.006437  2.927749
Within                    0.077667  18   0.004315
Total                     0.215867  26
Example 1 An example of one-way ANOVA carried out by Excel
(Note: the data table has been split into two sections (A_1 to A_6, A_7 to A_12) for display purposes. The ANOVA is carried out on a single table.)
SS = sum of squares, df = degrees of freedom, MS = mean square (SS/df).
The P-value is < 0.05 (the F value is > Fcrit at the 95% confidence level for 11 and 36 degrees of freedom), therefore it can be concluded that there is a significant difference between the analysts' results.
Replicate results for each analyst:

             A_1    A_2    A_3    A_4    A_5    A_6
Replicate 1  34.10  35.84  36.67  40.54  41.19  41.22
Replicate 2  34.10  36.58  37.33  40.67  40.29  39.61
Replicate 3  34.69  31.30  36.96  40.81  40.99  37.89
Replicate 4  34.60  34.19  36.83  40.78  40.40  36.67

             A_7    A_8    A_9    A_10   A_11   A_12
Replicate 1  40.71  39.20  42.50  39.75  36.04  44.36
Replicate 2  40.91  39.30  42.30  39.69  37.03  45.73
Replicate 3  40.80  39.30  42.50  39.23  36.85  45.25
Replicate 4  38.42  39.30  42.50  39.73  36.24  45.34
Anova: Single Factor

Source of Variation   SS         df   MS         F          P-value   F crit
Between Groups        438.7988   11   39.890800  40.31545   6.6E-17   2.066606
Within Groups          35.6208   36    0.989467
Note: in the two-factor example above, the spreadsheet (Excel) labels the sources of variation as Sample, Columns, Interaction and Within. Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a measure of the within-group variation.
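The quantities in an "Anova: Single Factor" table can be computed from first principles (a Python sketch, which is an assumption of this example; the three illustrative groups below stand in for three analysts' replicate results):

```python
# One-way ANOVA from first principles, mirroring what Excel's
# "Anova: Single Factor" tool computes (illustrative data)
groups = [
    [34.10, 34.10, 34.69, 34.60],
    [40.54, 40.67, 40.81, 40.78],
    [42.50, 42.30, 42.50, 42.50],
]

N = sum(len(g) for g in groups)
k = len(groups)
grand_mean = sum(v for g in groups for v in g) / N

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

ms_between = ss_between / (k - 1)   # df between = k - 1
ms_within = ss_within / (N - k)     # df within = N - k
F = ms_between / ms_within          # compare with Fcrit for (k - 1, N - k) d.o.f.
print(F > 10)                       # True: very strong between-group difference
```

Here the between-group spread dwarfs the within-group spread, so F is enormous and the group means clearly differ.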
Ftime = 0.000433/0.021911 = 0.020
Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)
In other words, neither of the individual factors is significant when compared with the interaction of time and temperature; the interaction therefore dominates and is worth further investigation. If one or both of the individual factors were significant compared with the interaction, then the individual factor or factors would dominate and, for all practical purposes, any interaction could be ignored.
If the interaction term is not significant then it can be considered to be another small
error term and can thus be pooled with the within-group (error) sums of squares term. It is
the pooled value (SS2pooled) that is then used as the denominator in the F-test to
determine if the individual factors affect the measured results significantly. To combine the
sums of squares the following formula is used:
where dofinter and dofwithin are the degrees of freedom for the interaction term and
error term, and SSinter and SSwithin are the sums of squares for the interaction term and
error term, respectively.
(dofpooled = dofinter + dofwithin)
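The mean-square comparisons described above, together with the pooling formula, can be sketched using the values from the two-way ANOVA table in Example 2 (a Python sketch, which is an assumption of this example):

```python
# Mean-square comparisons for two-way ANOVA with replication (Figure 3),
# using the SS and df values from the Example 2 table
ss = {"time": 0.000867, "temp": 0.049689, "inter": 0.087644, "within": 0.077667}
df = {"time": 2, "temp": 2, "inter": 4, "within": 18}
ms = {k: ss[k] / df[k] for k in ss}

F_inter = ms["inter"] / ms["within"]   # ~5.078: interaction vs within-group
# The interaction is significant (Fcrit = 2.93), so each factor is
# compared with the interaction mean square:
F_temp = ms["temp"] / ms["inter"]      # ~1.134
F_time = ms["time"] / ms["inter"]      # ~0.020

# Had the interaction NOT been significant, it would be pooled with the error:
s2_pooled = (ss["inter"] + ss["within"]) / (df["inter"] + df["within"])
print(round(F_inter, 3), round(F_temp, 3), round(F_time, 3))
```

The three F values reproduce those quoted in the text (5.078, 1.134 and 0.020).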
Selecting the ANOVA method
One-way ANOVA should be used when there is only one factor being considered and
replicate data from changing the level of that factor are available. Two-way ANOVA (with
or without replication) is used when there are two factors being considered. If no replicate
data are collected then the interactions between the two factors cannot be calculated.
Higher level ANOVAs are also available for looking at more than two factors.
Advantages of ANOVA
Compared with using multiple t-tests, one-way and two-way ANOVA require fewer
measurements to discover significant effects (i.e., the tests are said to have more power).
This is one reason why ANOVA is used frequently when analysing data from statistically designed experiments.
Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex
experimental situations but a description of these is beyond the scope of this introductory
article. More details can be found in reference 6.
s²pooled = (SSinter + SSwithin)/(dofinter + dofwithin)
Interpretation of the result(s)
To reiterate the interpretation of ANOVA
results, a calculated F-value that is greater
than Fcrit for a stated level of confidence
(typically 95%) means that the difference
being tested is statistically significant at
that level. As an alternative to using the F-values, the p-value can be used to indicate the degree of confidence we have that there is a significant difference between means (i.e., (1 - p) × 100 is the percentage confidence). Normally a p-value of < 0.05 is considered to denote a significant difference.
Note: Extrapolation of ANOVA results is
not advisable, so in Example 2 for instance,
it is impossible to say if a time of 15 or 120
minutes would lead to a measurable effect
on protein yield. It is, therefore, always
more economic in the long run to design the experiment in advance, in order to cover the likely ranges of the parameter(s)
of interest.
Avoiding some of
the pitfalls using ANOVA
In ANOVA it is assumed that the data for
each variable are normally distributed.
Usually in ANOVA we don't have a large
amount of data so it is difficult to prove
any departure from normality. It has been
shown, however, that even quite large
deviations do not affect the decisions
made on the basis of the F-test.
A more important assumption in ANOVA is that the variance (spread) between groups is homogeneous (homoscedastic). If this is not the case (this often happens in chemistry, see Figure 1) then the F-test can suggest a statistically significant difference when none is present. The best way to avoid this pitfall is, as ever, to plot the data. There are also a number of tests for heteroscedasticity (e.g., Bartlett's test (5) and Levene's test (2)).
figure 1 Plot comparing the results from 12 analysts (analyte concentration (ppm) against analyst ID, showing each mean and the total standard deviation).
It may be possible to overcome this type of problem in the data structure by transforming the data, such as by taking logs (7).
If the variability within a group is
correlated with its mean value then
ANOVA may not be appropriate and/or it
may indicate the presence of outliers in the
data (Figure 4). Cochran's test (5) can be
used to test for variance outliers.
Conclusions
ANOVA is a powerful tool for
determining if there is a statistically
significant difference between two or
more sets of data.
One-way ANOVA should be used
when we are comparing several sets
of observations.
Two-way ANOVA is the method
used when there are two separate
factors that may be influencing a result.
Except for the smallest of data sets, ANOVA is best carried out using a
spreadsheet or statistical software
package.
You should always plot your data to
make sure the assumptions ANOVA is
based on are not violated.
Acknowledgements
The preparation of this paper was
supported under a contract with the UK
Department of Trade and Industry as part
of the National Measurement System Valid
Analytical Measurement Programme (VAM) (8).
References
(1) S. Burke, Scientific Data Management, 1(1), 32-38, September 1997.
(2) G.A. Milliken and D.E. Johnson, Analysis of Messy Data, Volume 1: Designed Experiments, Van Nostrand Reinhold Company, New York, USA (1984).
(3) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK (ISBN 0 13 030990 7).
(4) C. Chatfield, Statistics for Technology, Chapman & Hall, London, UK (ISBN 0 412 25340 2).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 442 6) (1997).
(6) K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press Inc. (ISBN 0 12 471252 5) (1979).
(7) ISO 4259: 1992, Petroleum Products - Determination and Application of Precision Data in Relation to Methods of Test, Annex E, International Organisation for Standardisation, Geneva, Switzerland (1992).
(8) M. Sargent, VAM Bulletin, Issue 13, 4-5, Laboratory of the Government Chemist (Autumn 1995).
Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).
figure 4 A plot of variance against the mean value, marking the region of significantly different means by ANOVA and an unreliable high mean (which may contain outliers).
figure 3 Comparing mean squares in two-way ANOVA with replication. (Flowchart: start by comparing the within-group mean squares with the interaction mean squares. If there is a significant difference (F > Fcrit), compare the interaction mean squares with the individual factor mean squares; if not, pool the within-group and interaction sums of squares and compare the pooled mean squares with the individual factor mean squares.)
figure 2 Interactive factors. ((a) Y and Z are independent; (b) Y and Z are interacting. Response plotted against factor Y from YLow to YHigh, at ZLow and ZHigh.)
Calibration is fundamental to achieving
consistency of measurement. Often
calibration involves establishing the
relationship between an instrument
response and one or more reference
values. Linear regression is one of the most
frequently used statistical methods in
calibration. Once the relationship between
the input value and the response value
(assumed to be represented by a straight
line) is established, the calibration model is used in reverse; that is, to predict a value
from an instrument response. In general,
regression methods are also useful for
establishing relationships of all kinds, not
just linear relationships. This paper
concentrates on the practical applications
of linear regression and the interpretation
of the regression statistics. For those of you
who want to know about the theory of
regression there are some excellent
references (1-6).
For anyone intending to apply linear
least-squares regression to their own data, it is recommended that a statistics/graphics
package is used. This will speed up the
production of the graphs needed to
confirm the validity of the regression
statistics. The built-in functions of a
spreadsheet can also be used if the
routines have been validated for accuracy
(e.g., using standard data sets (7)).
What is regression?
In statistics, the term regression is used to
describe a group of methods that
summarize the degree of association
between one variable (or set of variables)
and another variable (or set of variables).
The most common statistical method used
to do this is least-squares regression, which
works by finding the best curve through
the data that minimizes the sums of
squares of the residuals. The important
term here is the best curve, not the
method by which this is achieved. There
are a number of least-squares regression
models, for example, linear (the most
common type), logarithmic, exponential
and power. As already stated, this paper will concentrate on linear least-squares regression.
[You should also be aware that there are
other regression methods, such as ranked
regression, multiple linear regression, non-
linear regression, principal-component
regression, partial least-squares regression,
etc., which are useful for analysing instrument
or chemically derived data, but are beyond
the scope of this introductory text.]
What do the linear least-squares
regression statistics mean?
Correlation coefficient: Whether you use a calculator's built-in functions, a
spreadsheet or a statistics package, the
first statistic most chemists look at when
performing this analysis is the correlation
coefficient (r). The correlation coefficient
ranges from -1, a perfect negative relationship, through zero (no relationship), to +1, a perfect positive relationship (Figures 1(a)-(c)). The correlation coefficient
is, therefore, a measure of the degree of
linear relationship between two sets of
data. However, the r value is open to
misinterpretation (8) (Figures 1(d) and (e),
show instances in which the r values alone
would give the wrong impression of the
underlying relationship). Indeed, it is
possible for several different data sets to
yield identical regression statistics (r value,
residual sum of squares, slope and
intercept), but still not satisfy the linear
assumption in all cases (9). It, therefore,
remains essential to plot the data in order
to check that linear least-squares statistics
are appropriate.
As in the t-tests discussed in the first
paper (10) in this series, the statistical
significance of the correlation coefficient is dependent on the number of data points.
To test if a particular r value indicates a
statistically significant relationship we can
use the Pearson's correlation coefficient
test (Table 1). Thus, if we only have four
points (for which the number of degrees of
freedom is 2) a linear least-squares
correlation coefficient of 0.94 will not be
significant at the 95% confidence level.
However, if there are more than 60 points
an r value of just 0.26 (r² = 0.0676) would
indicate a significant, but not very strong,
positive linear relationship. In other words, a relationship can be statistically significant
but of no practical value. Note that the test
used here simply shows whether two sets
are linearly related; it does not prove
linearity or adequacy of fit.
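The significance check described above is easy to sketch in code. The data, the helper function and the variable names below are all illustrative; the critical value 0.576 is taken from Table 1 for 10 degrees of freedom at the 95% confidence level:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 12 hypothetical calibration points, so n - 2 = 10 degrees of freedom
x = list(range(1, 13))
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.8, 18.1, 20.2, 21.9, 24.1]

r = pearson_r(x, y)
critical = 0.576        # Table 1: 10 degrees of freedom, 95% confidence
significant = abs(r) > critical
```

As the text stresses, a significant result here only shows a linear association; it does not prove linearity or adequacy of fit.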
It is also important to note that a
significant correlation between one
variable and another should not be taken
as an indication of causality. For example,
there is a negative correlation between
time (measured in months) and catalyst
performance in car exhaust systems.
However, time is not the cause of the
deterioration, it is the build up of sulfur
and phosphorus compounds that
gradually poisons the catalyst. Causality is,
in fact, very difficult to prove unless the
chemist can vary systematically and
independently all critical parameters, while
measuring the response for each change.

Regression and Calibration
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.
One of the most frequently used statistical methods in calibration is linear regression. This third paper in our statistics refresher series concentrates on the practical applications of linear regression and the interpretation of the regression statistics.
Slope and intercept
In linear regression the relationship
between the X and Y data is assumed to
be represented by a straight line, Y = a +
bX (see Figure 2), where Y is the estimated
response/dependent variable, b is the slope
(gradient) of the regression line and a is
the intercept (Y value when X = 0). This
straight-line model is only appropriate if
the data approximately fits the assumption
of linearity. This can be tested for by plotting the data and looking for curvature
(e.g., Figure 1(d)) or by plotting the
residuals against the predicted Y values or
X values (see Figure 3).
Although the relationship may be known
to be non-linear (i.e., follow a different
functional form, such as an exponential
curve), it can sometimes be made to fit the
linear assumption by transforming the data
in line with the function, for example, by
taking logarithms or squaring the Y and/or
X data. Note that if such transformations
are performed, weighted regression (discussed later) should be used to obtain
an accurate model. Weighting is required
because of changes in the residual/error
structure of the regression model. Using
non-linear regression may, however, be a
better alternative to transforming the data
when this option is available in the
statistical packages you are using.
Residuals and residual standard error
A residual value is calculated by taking the
difference between the predicted value
and the actual value (see Figure 2). When the residuals are plotted against the
predicted (or actual) data values the plot
becomes a powerful diagnostic tool,
enabling patterns and curvature in the data
to be recognized (Figure 3). It can also be
used to highlight points of influence (see
Bias, leverage and outliers overleaf).
The residual standard error (RSE, also
known as the residual standard deviation,
RSD) is a statistical measure of the average
residual. In other words, it is an estimate
of the average error (or deviation) about
the regression line. The RSE is used to
calculate many useful regression statistics
including confidence intervals and outlier
test values.
The residual standard error is calculated as

RSE = s(y) × √[ (n-1) × (1 - r²) / (n-2) ]

where s(y) is the standard deviation of the y values in the calibration, n is the number of
data pairs and r is the least-squares regression correlation coefficient.

Confidence intervals
As with most statistics, the slope (b) and intercept (a) are estimates based on a finite
sample, so there is some uncertainty in the values. (Note: Strictly, the uncertainty arises
from random variability between sets of data. There may be other uncertainties, such as
measurement bias, but these are outside the scope of this article.) This uncertainty is
quantified in most statistical routines by displaying the confidence limits and other
statistics, such as the standard error and p values. Examples of these statistics are given in
Table 2.
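As a sanity check on the RSE formula, the sketch below (with made-up calibration data) computes the RSE both directly from the residuals and via the shortcut based on s(y) and r; the two routes agree:

```python
import math

# Hypothetical calibration data
x = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
y = [0.02, 0.21, 0.40, 0.64, 0.81, 1.02]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((v - mx) ** 2 for v in x)
syy = sum((v - my) ** 2 for v in y)
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))

b = sxy / sxx                      # slope
a = my - b * mx                    # intercept
r = sxy / math.sqrt(sxx * syy)    # correlation coefficient

# Direct definition: square root of (sum of squared residuals / (n - 2))
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
rse_direct = math.sqrt(rss / (n - 2))

# Shortcut from the text: RSE = s(y) * sqrt((n - 1) * (1 - r^2) / (n - 2))
s_y = math.sqrt(syy / (n - 1))
rse_shortcut = s_y * math.sqrt((n - 1) * (1 - r ** 2) / (n - 2))
```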
table 1 Pearson's correlation coefficient test.
Degrees of freedom Confidence level
(n-2) 95% (α = 0.05) 99% (α = 0.01)
2 0.950 0.990
3 0.878 0.959
4 0.811 0.917
5 0.754 0.875
6 0.707 0.834
7 0.666 0.798
8 0.632 0.765
9 0.602 0.735
10 0.576 0.708
11 0.553 0.684
12 0.532 0.661
13 0.514 0.641
14 0.497 0.623
15 0.482 0.606
20 0.423 0.537
30 0.349 0.449
40 0.304 0.393
60 0.250 0.325
Significant correlation when |r| ≥ table value
[Graph: critical values of the correlation coefficient (r) versus degrees of freedom (n-2) at the 95% and 99% confidence levels.]
The p value is the probability that a value could arise by chance if the true value was
zero. By convention a p value of less than 0.05 indicates a significant non-zero statistic.
Thus, examining the spreadsheet's results, we can see that there is no reason to reject the
hypothesis that the intercept is zero, but there is a significant non-zero positive
gradient/relationship. The confidence intervals for the regression line can be plotted for all
points along the x-axis and is dumbbell in shape (Figure 2). In practice, this means that the
model is more certain in the middle than at the extremes, which in turn has important
consequences for extrapolating relationships. When regression is used to construct a calibration model, the calibration graph is used
in reverse (i.e., we predict the X value from the instrument response [Y-value]). This
prediction has an associated uncertainty (expressed as a confidence interval).

X_predicted = (Ȳ - a) / b

Conf. interval for the prediction is:

X_predicted ± ( t × RSE / b ) × √[ 1/m + 1/n + (Ȳ - ȳ)² / ( b² × (n-1) × s(x)² ) ]

where a is the intercept and b is the slope obtained from the regression equation, Ȳ is the mean value of the response (e.g., instrument readings) for m replicates (replicates are repeat measurements made at the same level), ȳ is the mean of the y data for the n points in the calibration, t is the critical value obtained from t-tables for n-2 degrees of freedom and s(x) is the standard deviation for the
x data for the n points in the calibration.
RSE is the residual standard error for the
calibration.
If we want, therefore, to reduce the size
of the confidence interval of the prediction
there are several things that can be done.
1. Make sure that the unknown
determinations of interest are close to the centre of the calibration (i.e., close
to the values x̄, ȳ [the centroid point]).
This suggests that if we want a small
confidence interval at low values of x
then the standards/reference samples
used in the calibration should be
concentrated around this region. For
example, in analytical chemistry, a typical
pattern of standard concentrations
might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6
figure 3 Residuals plot (residuals versus X, with one point marked as a possible outlier).

figure 1 Correlation coefficients and goodness of fit (panels (a)–(g); the marked r values range from -1 through 0 to +1, including cases such as r = 0.99 and r = 0.9 where the r value alone misrepresents the underlying relationship).

figure 2 Calibration graph: fitted line Y = -0.046 + 0.1124X with r = 0.98731, showing the intercept, slope, residuals, confidence limits for the regression line and confidence limits for the prediction.
(i.e., only one or two standards are used at higher concentrations). While this will lead
to a smaller confidence interval at lower
concentrations the calibration model will
be prone to leverage errors (see below).
2. Increase the number of points in the
calibration (n). There is, however, little
improvement to be gained by going
above 10 calibration points unless
standard preparation and analysis is
rapid and cheap.
3. Increase the number of replicate
determinations for estimating the
unknown (m). Once again there is a law of diminishing returns, so the
number of replicates should typically
be in the range 2 to 5.
4. The range of the calibration can be
extended, providing the calibration is still
linear.
Bias, leverage and outliers
Points of influence, which may or may not
be outliers, can have a significant effect on
the regression model and therefore, on its
predictive ability. If a point is in the middle
of the model (i.e., close to x̄) but outlying
on the Y axis, its effect will be to move the
regression line up or down. The point is
then said to have influence because it
introduces an offset (or bias) in the
predicted values (see Figure 1(f)). If the
point is towards one of the extreme ends
of the plot its effect will be to tilt the
regression line. The point is then said to
have high leverage because it acts as a
lever and changes the slope of the
regression model (see Figure 1(g)).
Leverage can be a major problem if one or
two data points are a long way from all the other points along the X axis.
A leverage statistic (ranging between 1/n and 1) can be calculated for each value
of x. There is no set value above which this
leverage statistic indicates a point of
influence. A value of 0.9 is, however, used
by some statistical software packages.
where xi is the x value for which the leverage statistic is to be calculated, n is the
number of points in the calibration and x̄ is the mean of all the x values in the calibration.
To test if a data point (xi,yi) is an outlier (relative to the regression model) the following
outlier test can be applied.
where RSE is the residual standard error, s(y) is the standard deviation of the Y values, Yi is
the y value, n is the number of points, ȳ is the mean of all the y values in the calibration
and residual_max is the largest residual value. For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical
value is 2.37 (Table 3 for 10 data points). Although the point appears extreme, it could
reasonably be expected to arise by chance within the data set.
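The leverage statistic described above is straightforward to compute. In the sketch below (hypothetical data), the single point far from the others along the X axis receives a leverage close to 1:

```python
def leverage(xs, xi):
    # Leverage of a point at xi: 1/n + (xi - mean)^2 / sum((xj - mean)^2)
    n = len(xs)
    mx = sum(xs) / n
    return 1 / n + (xi - mx) ** 2 / sum((v - mx) ** 2 for v in xs)

# Five clustered x values plus one distant point acting as a lever
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = [leverage(x, xi) for xi in x]
# h[-1] (the distant point) dominates the fit
```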
Extrapolation and interpolation
We have already mentioned that the regression line is subject to some uncertainty and that
this uncertainty becomes greater at the extremes of the line. If we, therefore, try to
extrapolate much beyond the point where we have real data (±10%) there may be
relatively large errors associated with the predicted value. Conversely, interpolation near
the middle of the calibration will minimize the prediction uncertainty. It follows, therefore,
that when constructing a calibration graph, the standards should cover a larger range of
concentrations than the analyst is interested in. Alternatively, several calibration graphs covering smaller, overlapping, concentration ranges can be constructed.
Test value = |residual_max| / ( RSE × √[ 1 - 1/n - (Yi - ȳ)² / ( (n-1) × s(y)² ) ] )

Leverage_i = 1/n + (xi - x̄)² / Σ_{j=1…n} (xj - x̄)²
table 2 Statistics obtained using Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.
Coefficients Standard Error tStat p value Lower 95% Upper 95%
Intercept -0.046000012 0.039648848 -1.160185324 0.279423552 -0.137430479 0.045430455
Slope 0.112363638 0.00638999 17.58432015 1.11755E-07 0.097628284 0.127098992
*Note the large number of significant figures. In fact none of the values above warrant more than 3 significant figures!
figure 4 Plots of typical instrument response versus concentration: (a) response versus concentration; (b) residuals versus predicted value.
Weighted linear regression and calibration
In analytical science we often find that the precision changes with concentration. In
particular, the standard deviation of the data is proportional to the magnitude of the value
being measured (see Figure 4(a)). A residuals plot will tend to show this relationship even
more clearly (Figure 4(b)). When this relationship is observed (or if the data has been
transformed before regression analysis), weighted linear regression should be used for obtaining the calibration curve (3). The following description shows how the weighted
regression works. Don't be put off by the equations as most modern statistical software
packages will perform the calculations for you. They are only included in the text for
completeness.
Weighted regression works by giving points known to have a better precision a higher
weighting than those with lower precision. During method validation the way the standard
deviation varies with concentration should have been investigated. This relationship can
then be used to calculate the initial weightings
at each of the n
concentrations in the calibration.
These initial weightings can then be
standardized by multiplying by the number
of calibration points divided by the sum of
all the weights to give the final weights (Wi).
The regression model generated will be
similar to that for non-weighted linear
regression. The prediction confidence
intervals will, however, be different.
The weighted prediction (xw) for a given
instrument reading (y) for the regression
model forcing the line through the origin (y
= bx) is:
with
where Ȳ is the mean value of the
response (e.g., instrument readings) for m
replicates and xi and yi are the data pair for
the ith point.
By assuming the regression line goes
through the origin a better estimate of the slope is obtained, providing that the
assumption of a zero intercept is correct.
This may be a reasonable assumption in
some instrument calibrations. However, in
most cases, the regression line will no
longer represent the least-squares best
line through the data.
b(w) = [ Σ_{i=1…n} Wi xi yi ] / [ Σ_{i=1…n} Wi xi² ]

X(w)predicted = Ȳ / b(w)

Wi = wi × n / Σ_{j=1…n} wj

wi = 1 / si²
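The weighting scheme described above can be sketched as follows. The data are hypothetical, chosen so that the standard deviation is proportional to concentration (the situation in Figure 4):

```python
# Weighted regression through the origin, y = b(w) * x
x = [1.0, 2.0, 4.0, 8.0, 16.0]    # concentrations
y = [0.11, 0.20, 0.41, 0.79, 1.62]  # responses
s = [0.01, 0.02, 0.04, 0.08, 0.16]  # known standard deviation at each level

n = len(x)
w = [1 / si ** 2 for si in s]            # initial weightings, w_i = 1/s_i^2
W = [wi * n / sum(w) for wi in w]        # standardized weights (sum to n)

# Weighted slope for a line forced through the origin
b_w = (sum(Wi * xi * yi for Wi, xi, yi in zip(W, x, y))
       / sum(Wi * xi ** 2 for Wi, xi in zip(W, x)))

Y_bar = 0.60            # mean of m replicate readings of the unknown
x_w_pred = Y_bar / b_w  # weighted prediction
```

Note how the precise low-concentration points carry far more weight than the noisy high-concentration ones, which is exactly the intent of the method.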
table 3 Outlier test for simple linear least-squares regression.
Sample size Confidence table-value
(n) 95% 99%
5 1.74 1.75
6 1.93 1.98
7 2.08 2.17
8 2.20 2.23
9 2.29 2.44
10 2.37 2.55
12 2.49 2.70
14 2.58 2.82
16 2.66 2.92
18 2.72 3.00
20 2.77 3.06
25 2.88 3.25
30 2.96 3.36
35 3.02 3.40
40 3.08 3.43
45 3.12 3.47
50 3.16 3.51
60 3.23 3.57
70 3.29 3.62
80 3.33 3.68
90 3.37 3.73
100 3.41 3.78
[Graph: outlier test critical values versus number of samples (n) at the 95% and 99% confidence levels.]
References
(1) G.W. Snedecor and W.G. Cochran, Statistical Methods, The Iowa State University Press, USA, 6th edition (1967).
(2) N. Draper and H. Smith, Applied Regression Analysis, John Wiley & Sons Inc., New York, USA, 2nd edition (1981).
(3) BS ISO 11095: Linear Calibration Using Reference Materials (1996).
(4) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
(5) A.R. Hoshmand, Statistical Methods for Environmental and Agricultural Sciences, 2nd edition, CRC Press (ISBN 0-8493-3152-8) (1998).
(6) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 4226) (1997).
(7) Statistical Software Qualification: Reference Data Sets, Eds. B.P. Butler, M.G. Cox, S.L.R. Ellison and W.A. Hardcastle, Royal Society of Chemistry, London, UK (ISBN 0-85404-422-1) (1996).
(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1), 59 (1989).
(9) F.J. Anscombe, "Graphs in Statistical Analysis", American Statistician, 27, 17–21, February 1973.
(10) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(11) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).
Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).
The associated uncertainty for the weighted prediction, expressed as a confidence
interval is then:
Conf. interval for the prediction is
where t is the critical value obtained from t-tables for n-2 degrees of freedom at a
stated significance level (typically α = 0.05), Wi is the weighting for the
x data for the ith point in the calibration, m is the number of replicates and RSE(w) is the
weighted residual standard error for the calibration.
Conclusions
Always plot the data. Don't rely on the regression statistics to indicate a linear
relationship. For example, the correlation coefficient is not a reliable measure of
goodness-of-fit. Always examine the residuals plot. This is a valuable diagnostic tool.
Remove points of influence (leverage, bias and outlying points) only if a reason can be
found for their aberrant behaviour.
Be aware that a regression line is an estimate of the best line through the data and
that there is some uncertainty associated with it. The uncertainty, in the form of a
confidence interval, should be reported with the interpolated result obtained from any
linear regression calibrations.
Acknowledgement
The preparation of this paper was supported under a contract with the Department of
Trade and Industry as part of the National Measurement System Valid Analytical
Measurement Programme (VAM) (11).
Standard error for the calibration:

RSE(w) = √[ ( Σ_{j=1…n} Wj yj² - b(w)² × Σ_{j=1…n} Wj xj² ) / (n-1) ]

X(w)predicted ± ( t × RSE(w) / b(w) ) × √[ 1/(m × Wi) + Ȳ² / ( b(w)² × Σ_{j=1…n} Wj xj² ) ]
This is the last article in a series of short
papers introducing basic statistical methods
of use in analytical science. In the three
previous papers (1–3) we have assumed
the data has been tidy; that is, normally
distributed with no anomalous and/or
missing results. In the real world, however,
we often need to deal with messy data,
for example data sets that contain
transcription errors, unexpected extreme
results or are skewed. How we deal with
this type of data is the subject of this article.
Transcription errors
Transcription errors can normally be
corrected by implementing good quality
control procedures before statistical
analysis is carried out. For example, the
data can be independently checked or,
more rarely, the data can be entered, again
independently, into two separate files and
the files compared electronically to
highlight any discrepancies. There are also
a number of outlier tests that can be used
to highlight anomalous values before other
statistics are calculated. These tests do not
remove the need for good quality
assurance; rather they should be seen as
an additional quality check.
Missing data
No matter how well our experiments are
planned there will always be times when
something goes wrong, resulting in gaps in
the data. Some statistical procedures will
not work as well, or at all, with some data
missing. The best recourse is always to
repeat the experiment to generate the
complete data set. Sometimes, however,
this is not feasible, particularly where
readings are taken at set times or the cost
of retesting is prohibitive, so alternative
ways of addressing this problem are needed.
Current statistical software packages
typically deal with missing data by one of
three methods:
Casewise deletion excludes all examples
(cases) that have missing data in at least
one of the selected variables. For example,
in ICP–AAS (inductively coupled
plasma–atomic absorption spectroscopy)
calibrated with a number of standard
solutions containing several metal ions at
different concentrations, if the aluminium
value were missing for a particular test
portion, all the results for that test portion
would be disregarded (See Table 1).
This is the usual way of dealing with
missing data, but it does not guarantee
correct answers. This is particularly so, in
complex (multivariate) data sets where it is
possible to end up deleting the majority
of your data if the missing data are
randomly distributed across cases
and variables.
Pairwise deletion can be used as an
alternative to casewise deletion in
situations where parameters (correlation
coefficients, for example) are calculated on
successive pairs of variables (e.g., in a
recovery experiment we may be interested
in the correlations between material
recovered and extraction time, temperature,
particle size, polarity, etc. With pairwise
deletion, if one solvent polarity measurement
was missing only this single pair would be
deleted from the correlation and the
correlations for recovery versus extraction
time and particle size would be unaffected)
(see Table 2).
Pairwise deletion can, however, lead to
serious problems. For example, if there is a
hidden systematic distribution of missing
points then a bias may result when
calculating a correlation matrix (i.e., different
correlation coefficients in the matrix can be
based on different subsets of cases).
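Pairwise deletion is easy to demonstrate with the solvent-recovery data of Table 2, where one solvent polarity measurement is missing (represented below as None; the helper functions are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_r(x, y):
    # Pairwise deletion: drop only the pairs where either value is missing
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return pearson_r([a for a, _ in pairs], [b for _, b in pairs]), len(pairs)

recovery = [93, 105, 99, 73]
extraction_time = [20, 120, 180, 10]
polarity = [None, 1.8, 1.0, 1.5]   # one missing polarity measurement

r_time, n_time = pairwise_r(recovery, extraction_time)  # uses all 4 pairs
r_pol, n_pol = pairwise_r(recovery, polarity)           # uses only 3 pairs
```

Only the recovery-versus-polarity correlation loses a data point; the recovery-versus-extraction-time correlation is unaffected, which is the whole attraction (and the hidden danger) of pairwise deletion.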
Missing Values, Outliers, Robust Statistics & Non-parametric Methods
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.
This article, the fourth and final part of our statistics refresher series, looks at how to deal with messy data that contain transcription errors or extreme and skewed results.

table 1 Casewise deletion.

            Al    B     Fe    Ni
Solution 1  -     94.5  578   23.1
Solution 2  567   72.1  673   7.6
Solution 3  -     34.0  674   44.7
Solution 4  234   97.4  429   82.9

After casewise deletion (statistical analysis is only carried out on the reduced data set):

            Al    B     Fe    Ni
Solution 2  567   72.1  673   7.6
Solution 4  234   97.4  429   82.9

Mean substitution replaces all missing
data in a variable by the mean value for
that variable. Though this looks as if the
data set is now complete, mean substitution
has its own disadvantages. The variability
in the data set is artificially decreased in
direct proportion to the number of missing data points, leading to underestimates of
dispersion (the spread of the data). Mean
substitution may also considerably change
the values of some other statistics, such as
linear regression statistics (3), particularly
where correlations are strong (See Table 3).
Examples of these three approaches are
illustrated in Figure 1, for the calculation of
a correlation matrix, where the correlation
coefficient (r) (3) is determined for each
paired combination of the five variables,
A to E. Note how the r value can increase,
diminish or even reverse sign depending on
which method is chosen to handle the
missing data (i.e., the A, B correlation
coefficients).
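The artificial shrinkage in dispersion caused by mean substitution can be seen in a small sketch (hypothetical aluminium results with two missing values):

```python
import math

def sd(values):
    """Sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

# Hypothetical aluminium results; None marks a missing value
al = [567.0, None, 234.0, None, 410.0, 350.0]

observed = [v for v in al if v is not None]
mean_obs = sum(observed) / len(observed)

# Mean substitution: replace each missing value by the observed mean
filled = [v if v is not None else mean_obs for v in al]

sd_observed = sd(observed)
sd_filled = sd(filled)   # smaller: the substituted values add no spread
```

The mean is unchanged, but the standard deviation of the "completed" data set is smaller than that of the genuinely observed values.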
Extreme values,
stragglers and outliers
Extreme values are defined as observations
in a sample, so far separated in value from
the remainder as to suggest that they may
be from a different population, or the
result of an error in measurement (6).
Extreme values can also be subdivided into
stragglers, extreme values detected
between the 95% and 99% confidence
levels; and outliers, extreme values at >
99% confidence level.
It is tempting to remove extreme values
automatically from a data set, because
they can alter the calculated statistics, e.g.,
increase the estimate of variance (a
measure of spread), or possibly introduce a
bias in the calculated mean. There is one
golden rule however: no value should be
removed from a data set on statistical
grounds alone. Statistical grounds include
outlier testing.
Outlier tests tell you, on the basis of
some simple assumptions, where you are
most likely to have a technical error; they
do not tell you that the point is wrong.
No matter how extreme a value is in a set
of data, the suspect value could
nonetheless be a correct piece of
information (1). Only with experience or
the identification of a particular cause can
data be declared wrong and removed.
So, given that we understand that the
tests only tell us where to look, how do we
test for outliers? If we have good grounds
for believing our data is normally
distributed then a number of outlier tests
(sometimes called Q-tests) are available
that identify extreme values in an objective
way (7,8). Good grounds for believing the
data is normal are
past experience of similar data
passing normality tests, for example, the Kolmogorov–Smirnov–Lilliefors test,
Shapiro–Wilk test, skewness test,
kurtosis test (7,9) etc.
plots of the data, e.g., frequency
histograms, normal probability plots (1,7). Note that the tests used to check
table 2 Pairwise deletion.

          Recovery %   Extraction time (mins)   Particle size (µm)   Solvent polarity (pKa)
Sample 1  93           20                       90                   -
Sample 2  105          120                      150                  1.8
Sample 3  99           180                      50                   1.0
Sample 4  73           10                       500                  1.5

r (number of data points in the correlation):
Recovery vs Extraction time: 0.728886 (4)
Recovery vs Particle size: -0.87495 (4)
Recovery vs Solvent polarity: 0.033942 (3)

Pairwise deletion: statistical analysis is unaffected except when one of a pair of data points is missing.
table 3 Mean substitution.

            Al     B     Fe    Ni
Solution 1  -      94.5  578   23.1
Solution 2  567    72.1  673   7.6
Solution 3  -      34.0  674   44.7
Solution 4  234    97.4  429   82.9

With the missing Al values replaced by the mean of the observed Al values (400.5):

            Al     B     Fe    Ni
Solution 1  400.5  94.5  578   23.1
Solution 2  567    72.1  673   7.6
Solution 3  400.5  34.0  674   44.7
Solution 4  234    97.4  429   82.9

Mean substitution: statistical analysis is carried out on pseudo-completed data with no allowance made for errors in the estimated values.
Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., mean substitution, already discussed above, is a form of imputation). In its more general/systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then the m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data is randomly distributed and the model used to predict the imputed values is sensible.
normality usually require a significant
amount of data (a minimum of 10–15 results are recommended depending on
results are recommended depending on
the normality test applied). For this reason
there will be many examples in analytical
science where either it will be impractical
to carry out such tests, or the tests will not
tell us anything meaningful.
If we are not sure the data set is normally
distributed then robust statistics and/or
non-parametric (distribution independent)
tests can be applied to the data. These
three approaches (outlier tests, robust
estimates and non-parametric methods)
are examined in more detail below.
Outlier tests
In analytical chemistry it is rare that we
have large numbers of replicate data, and
small data sets often show fortuitous
grouping and consequent apparent
outliers. Outlier tests should, therefore, be
used with care and, of course, identified
data points should only be removed if a
technical reason can be found for their
aberrant behaviour.
Most outlier tests look at some measure
of the relative distance of a suspect point
from the mean value. This measure is then
assessed to see if the extreme value could
reasonably be expected to have arisen by
chance. Most of the tests look for single
extreme values (Figure 2(a)), but
sometimes it is possible for several
outliers to be present in the same data
set. These can be identified in one of two
ways:
by iteratively applying the outlier test
by using tests that look for pairs of
extreme values, i.e., outliers that are
masking each other (see Figure 2(b) and
2(c)).
Note, as a rule of thumb, if more than
20% of the data are identified as outlying
you should start to question your
assumption about the data distribution
and/or the quality of the data collected.
The appropriate outlier tests for the
three situations described in Figure 2 are:
2(a) Grubbs 1, Dixon or Nalimov; 2(b)
Grubbs 2 and 2(c) Grubbs 3.
We will concentrate on the three
Grubbs tests (7). The test values are
calculated using the formulae below, after
the data are arranged in ascending order.
figure 2 Outliers and masking: (a) a single outlier (at either end of the data); (b) one outlier at each end; (c) a pair of outliers at the same end (either end).
figure 1 Effect of missing data on a correlation matrix.

Data (cases 1–15 for variables/factors A–E; only cases 1–3 and 13–15 are shown). In the original figure, shading marks the data removed to show the effects of missing data and the mean values replacing missing data:

Case   A      B      C      D      E
1      105.1  101.7  115.1  101.0  95.2
2      77.0   72.9   77.5   72.7   61.6
3      86.0   82.2   78.9   78.0   91.7
...
13     90.0   77.4   100.8  97.0   111.1
14     90.0   91.3   89.2   81.3   100.5
15     96.9   103.0  97.5   98.5   96.8
mean   99.2   92.4   94.6   89.4   91.7

Correlation matrices with different approaches selected for missing data:

No missing data (15 cases):
     A     B     C     D
B    0.62
C    0.68  0.53
D    0.41  0.47  0.57
E    0.39  0.50  0.59  0.61

Casewise deletion (only 5 cases remain):
     A      B      C     D
B    -0.62
C    0.11   -0.21
D    0.50   -0.36  0.91
E    0.02   0.17   0.71  0.66

Pairwise deletion (variable number of cases; the number of data points in each correlation is given in parentheses):
     A          B          C          D
B    0.54 (12)
C    0.55 (12)  0.50 (11)
D    0.27 (12)  0.47 (11)  0.79 (11)
E    0.23 (11)  0.77 (10)  0.70 (10)  0.71 (10)

Mean substitution (15 cases):
     A      B     C     D
B    0.01
C    -0.05  0.40
D    0.02   0.47  0.47
E    0.36   0.25  0.43  0.46

Critical values of r at the 95% confidence level:
n    15     12     11     10     5
r    0.514  0.576  0.602  0.632  0.950

Note: at the 95% confidence level, a correlation is significant when it exceeds the critical value of r for the corresponding number of data points (n).
G1 = |x̄ - xi| / s

G2 = (xn - x1) / s

G3 = 1 - [ (n-3) × sn-2² ] / [ (n-1) × s² ]

where s is the standard deviation for the whole data set, xi is the suspected single outlier (i.e., the value furthest away from the mean), | | is the modulus (the value of a calculation ignoring the sign of the result), x̄ is the mean, n is the number of data points, xn and x1 are the most extreme values, and sn-2 is the standard deviation for the data set
excluding the suspected pair of outlier
values, i.e., the pair of values furthest away
from the mean.
If the test values (G1, G2, G3) are greater
than the critical value obtained from tables
(see Table 4) then the extreme value(s) are unlikely to have occurred by chance at the
stated confidence level (see Box 2).
Pitfalls of outlier tests
Figure 3 shows three situations where
outlier tests can misleadingly identify an
extreme value.
Figure 3(a) shows a situation common in
chemical analysis. Because of limited
measurement precision (rounding errors) it
is possible to end up comparing a result
which, no matter how close it is to the
other values, is an infinite number of
standard deviations away from the mean
of the remaining results. This value will
therefore always be flagged as an outlier.
In Figure 3(b) there is a genuine long tail
on the distribution that may cause
successive outlying points to be identified.
This type of distribution is surprisingly
common in some types of chemical
analysis, e.g., pesticide residues.
If there is very little data (Figure 3(c)) an
outlier can be identified by chance. In this
situation it is possible that the identified
point is closer to the true value and it is
the other values that are the outliers. This
occurs more often than we would like to
admit; how many times do your procedures
state "average the best two out of three
determinations"?
Outliers by variance
When the data are from different groups
(for example when comparing test
methods via interlaboratory comparison) it
is not only possible for individual points within a group to be outlying but also for the
group means to have outliers with respect to each other. Another type of outlier that can
occur is when the spread of data within one particular group is unusually small or large
when compared with the spread of the other groups (see Figure 4).
The same Grubbs tests that are used to determine the presence of within-group
outlying replicates may also be used to test for suspected outlying means. Cochran's test can be used to test for the third case, that of a suspected
outlying variance.
To carry out Cochran's test, the suspect variance is compared with the sum of all the
group variances. (The variance is a measure of spread and is simply the square of the
standard deviation (1).)
If this calculated ratio, Cn, exceeds the critical value obtained from statistical tables (7)
then the suspect group spread is extreme. The value of n is the average number of
results per group, taken over all groups.
Cochran's test assumes that the numbers of replicates within the groups are the same, or
at least similar (±1). It also assumes that none of the data have been rounded and that there
are sufficient replicates to give a reasonable estimate of each variance. Cochran's
test should not be used iteratively as this could lead to a large percentage of the
data being removed (see Box 3).
Robust statistics
Robust statistics include methods that are largely unaffected by the presence of extreme
values. The most commonly used of these statistics are as follows:
Median: The median is a measure of central tendency (1) and can be used instead of the
mean. To calculate the median (x̃) the data are arranged in order of magnitude and the
median is then the central member of the series (or the mean of the two central
members when there is an even number of data), i.e., there are equal numbers of
observations smaller and greater than the median. For a symmetrical distribution the mean
and median have the same value.
Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the
data similar to the standard deviation.
x̃ = x_m (where m = n/2, rounded up) when n is odd (1, 3, 5, …)
x̃ = (x_m + x_(m+1))/2 (where m = n/2) when n is even (2, 4, 6, …)
Cn = s²(suspect) / (s1² + s2² + … + sg²), where g is the number of groups
and n = (n1 + n2 + … + ng)/g
Box 2: Grubbs' tests (worked example).
The 13 replicates are arranged in ascending order (x1 … xn):
47.876 47.997 48.065 48.118 48.151 48.211 48.251 48.559 48.634 48.711 49.005 49.166 49.484
n = 13, mean = 48.479, s = 0.498, s²(n−2) = 0.123
G1 = (49.484 − 48.479)/0.498 = 2.02
G2 = (49.484 − 47.876)/0.498 = 3.23
G3 = 1 − (10 × 0.123)/(12 × 0.498²) = 0.587
Grubbs' critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, and G3 = 0.6705 and 0.7667 at the 95% and 99% confidence levels respectively. Since the test values are less than their respective critical values in all cases, it can be concluded that there are no outlying values.
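The Box 2 calculations can be reproduced with a short sketch using only Python's standard library (the variable names are ours; following the worked example, G3 is computed for the two highest values as the suspected pair):

```python
import statistics

# Replicate data from Box 2, arranged in ascending order
data = sorted([47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
               48.559, 48.634, 48.711, 49.005, 49.166, 49.484])

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation

# G1: distance of the single most extreme value from the mean, in units of s
g1 = max(data[-1] - mean, mean - data[0]) / s

# G2: range of the data in units of s (one suspected outlier at each end)
g2 = (data[-1] - data[0]) / s

# G3: test for two suspected outliers at the high end; s2_reduced is the
# variance after removing the two highest values, as in the worked example
s2_reduced = statistics.variance(data[:-2])
g3 = 1 - ((n - 3) * s2_reduced) / ((n - 1) * s ** 2)

print(round(g1, 2), round(g2, 2), round(g3, 3))
```

At full precision G3 comes out as 0.586; the 0.587 quoted in Box 2 reflects the rounded intermediate values (s = 0.498, s² = 0.123) used there.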
If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard
deviation; this scaled value is known as the MADE value:
MADE = 1.483 × MAD
For n values, MAD = median(|x_i − x̃|), i = 1, 2, …, n
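A minimal sketch of these robust estimates of spread in Python (the function names mad and made are our own; only the standard library is used):

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x_i - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

def made(values):
    """MAD scaled by 1.483, making it comparable with a standard deviation."""
    return 1.483 * mad(values)

# A single wild value barely moves these robust estimates
data = [48.1, 48.2, 48.3, 48.4, 59.9]
print(statistics.median(data), round(mad(data), 4), round(made(data), 4))
```

Here the absolute deviations from the median (48.3) are 0.2, 0.1, 0.0, 0.1 and 11.6, so the MAD is 0.1 and the MADE 0.1483, untroubled by the extreme value that would inflate an ordinary standard deviation.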
Other robust statistical estimates include
the trimmed mean and deviations, the
Winsorized mean and deviation, the least
median of squares (robust regression),
Levene's test (heterogeneity in ANOVA), etc. A
discussion of robust statistics in analytical
chemistry can be found elsewhere (10, 11).
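As an illustration of the first two of these, a trimmed mean discards a proportion of the values from each end of the ordered data, while a Winsorized mean replaces them with the nearest retained values. A sketch (function names and the 10% default are our own choices):

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the given proportion of values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)
    return statistics.mean(data[k:len(data) - k] if k else data)

def winsorized_mean(values, proportion=0.1):
    """Mean after replacing each trimmed value with its nearest retained neighbour."""
    data = sorted(values)
    k = int(len(data) * proportion)
    if k:
        data[:k] = [data[k]] * k
        data[-k:] = [data[-k - 1]] * k
    return statistics.mean(data)

# One extreme value drags the ordinary mean upwards; the robust means resist it
data = [9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.2, 10.3, 10.4, 15.0]
print(round(statistics.mean(data), 3))  # ordinary mean
print(round(trimmed_mean(data), 4), round(winsorized_mean(data), 4))
```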
Non-parametric tests
Typical statistical tests incorporate
assumptions about the underlying
distribution of data (such as normality),
and hence rely on distribution parameters.
Non-parametric tests are so called
because they make few or no assumptions
about the distributions, and do not rely on
distribution parameters. Their chief
advantage is improved reliability when the
distribution is unknown. There is at least
one non-parametric equivalent for each
parametric type of test (see Table 5). In a
short article, such as this, it is impossible to
describe the methodology for all these
tests but more information can be found in
other publications (12, 13).
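To give a flavour of how these tests work, one of the simplest from Table 5, the sign test for paired data, can be sketched with the standard library alone (the function name and the illustrative data are ours; it counts signs of the paired differences and uses the exact binomial tail under the null hypothesis that positive and negative differences are equally likely):

```python
from math import comb

def sign_test_p(pairs):
    """Two-sided sign test for paired data (a simple non-parametric
    alternative to the paired t-test). Ties are dropped; the p-value is
    the two-sided binomial tail probability under H0: P(+) = 0.5."""
    diffs = [b - a for a, b in pairs if b != a]
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)
    k = min(k, n - k)                       # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired results from an old and a new method on eight samples
pairs = [(10.2, 10.5), (9.8, 10.1), (10.0, 10.4), (10.1, 10.3),
         (9.9, 10.2), (10.3, 10.6), (10.0, 10.2), (10.1, 10.0)]
print(round(sign_test_p(pairs), 4))
```

With seven positive differences out of eight, the two-sided p-value is 0.0703, so the difference between the methods is not significant at the 95% confidence level; note that no assumption about the distribution of the differences was needed.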
Conclusions
• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics, so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.
Table 4: Grubbs' critical value table (5).
95% confidence level 99% confidence
level
n G(1) G(2) G(3) G(1) G(2) G(3)
3 1.153 2.00 --- 1.155 2.00 ---
4 1.463 2.43 0.9992 1.492 2.44 1.0000
5 1.672 2.75 0.9817 1.749 2.80 0.9965
6 1.822 3.01 0.9436 1.944 3.10 0.9814
7 1.938 3.22 0.8980 2.097 3.34 0.9560
8 2.032 3.40 0.8522 2.221 3.54 0.9250
9 2.110 3.55 0.8091 2.323 3.72 0.8918
10 2.176 3.68 0.7695 2.410 3.88 0.8586
12 2.285 3.91 0.7004 2.550 4.13 0.7957
13 2.331 4.00 0.6705 2.607 4.24 0.7667
15 2.409 4.17 0.6182 2.705 4.43 0.7141
20 2.557 4.49 0.5196 2.884 4.79 0.6091
25 2.663 4.73 0.4505 3.009 5.03 0.5320
30 2.745 4.89 0.3992 3.103 5.19 0.4732
35 2.811 5.026 0.3595 3.178 5.326 0.4270
40 2.866 5.150 0.3276 3.240 5.450 0.3896
50 2.956 5.350 0.2797 3.336 5.650 0.3328
60 3.025 5.500 0.2450 3.411 5.800 0.2914
70 3.082 5.638 0.2187 3.471 5.938 0.2599
80 3.130 5.730 0.1979 3.521 6.030 0.2350
90 3.171 5.820 0.1810 3.563 6.120 0.2147
100 3.207 5.900 0.1671 3.600 6.200 0.1980
110 3.239 5.968 0.1553 3.632 6.268 0.1838
120 3.267 6.030 0.1452 3.662 6.330 0.1716
130 3.294 6.086 0.1364 3.688 6.386 0.1611
140 3.318 6.137 0.1288 3.712 6.437 0.1519
Box 3: Cochran's test (worked example).
An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:
Std. dev.: 0.202 0.402 0.332 0.236 0.318 0.452 0.210 0.074 0.525 0.067 0.609 0.246 0.198
n = 85/13 = 6.54 ≈ 7
Cn = 0.609² / (0.202² + 0.402² + … + 0.246² + 0.198²) = 0.371/1.474 = 0.252
Cochran's critical value for n = 7 and g = 13 is 0.23 at the 95% confidence level (7).
As the test value is greater than the critical value it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates and this laboratory's results therefore need to be investigated further. It is normal practice in interlaboratory comparisons not to test for low-variance outliers, i.e., laboratories reporting unusually precise results.
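The Box 3 ratio can be checked with a few lines of Python (a sketch; the variable names are ours and the total of 85 determinations is taken from the worked example):

```python
# Standard deviations reported by the 13 laboratories in Box 3
std_devs = [0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
            0.074, 0.525, 0.067, 0.609, 0.246, 0.198]

variances = [s ** 2 for s in std_devs]

# Cochran's ratio: the suspect (largest) variance over the sum of all variances
c_n = max(variances) / sum(variances)

# n for the critical-value lookup: average number of replicates per group
n_bar = 85 / len(std_devs)  # 6.54, rounded to 7 for the table lookup

print(round(c_n, 3), round(n_bar, 2))
```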
• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.
NB: It should be noted that, following a
judgement in a US court, the Food and
Drug Administration (FDA), in its "Guide
to Inspection of Pharmaceutical Quality
Control Laboratories", has specifically
prohibited the use of outlier tests.
Acknowledgement
The preparation of this paper was supported
under a contract with the UK's Department
of Trade and Industry as part of the
National Measurement System Valid
Analytical Measurement (VAM) Programme
(14).
References
(1) S. Burke, Scientific Data Management, 1(1), 32–38, 1997.
(2) S. Burke, Scientific Data Management, 2(1), 36–41, 1998.
(3) S. Burke, Scientific Data Management, 2(2), 32–40, 1998.
(4) J.L. Schafer, Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall, 1997. ISBN 0-412-04061-1.
(5) R.J.A. Little and D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons, 1987. ISBN 0-471-80243-9.
(6) ISO 3534, Statistics: Vocabulary and Symbols, Part 1: Probability and general statistical terms, section 2.64, Geneva, 1993.
(7) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, 1997. ISBN 0-85404-442-6.
(8) V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd Edition, John Wiley, 1994.
(9) W.H. Kruskal and J.M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978. ISBN 0-02-917960-2.
(10) Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers, Part 2, Analyst, 114, 1693–1697, 1989.
(11) D.C. Hoaglin, F. Mosteller and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, 1983. ISBN 0-471-09777-2.
(12) M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods, John Wiley & Sons, New York, 1973.
(13) W.W. Daniel, Applied Nonparametric Statistics, Houghton Mifflin, Boston, 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5.
Figure 4: Different types of outlier in grouped data (box & whisker plot of analyte concentration against laboratory ID, with one laboratory showing an outlying mean and another an outlying variance).
Table 5: Parametric methods and their non-parametric equivalents (12, 13).

Differences between independent groups of data:
  Parametric: t-test for independent groups (2); ANOVA/MANOVA (2)
  Non-parametric: Wald–Wolfowitz runs test; Mann–Whitney U test; Kolmogorov–Smirnov two-sample test; Kruskal–Wallis analysis of ranks; Median test

Differences between dependent groups of data:
  Parametric: t-test for dependent groups (2); ANOVA with replication (2)
  Non-parametric: Sign test; Wilcoxon's matched pairs test; McNemar's χ² (Chi-square) test; Friedman's two-way ANOVA; Cochran Q test

Relationships between continuous variables:
  Parametric: Linear regression (3); Correlation coefficient (3)
  Non-parametric: Spearman R; Kendall Tau

Homogeneity of variance:
  Parametric: Bartlett's test (7)
  Non-parametric: Levene's test; Brown & Forsythe test

Relationships between counted variables:
  Parametric: χ² (Chi-square) test
  Non-parametric: Gamma coefficient; Phi coefficient; Fisher exact test; Kendall coefficient of concordance