Use of Statistics by Scientist
LCGC Europe Online Supplement: statistics and data analysis
In this article we look at the initial steps in
data analysis (i.e., exploratory data analysis),
and how to calculate the basic summary
statistics (the mean and sample standard deviation). These two processes, which
increase our understanding of the data
structure, are vital if the correct selection of
more advanced statistical methods and
interpretation of their results are to be
achieved. From that base we will progress to
significance testing (t-tests and the F-test).
These statistics allow a comparison between
two sets of results in an objective and
unbiased way. For example, significance
tests are useful when comparing a new
analytical method with an old method or
when comparing the current day's production with that of the previous day.
Exploratory Data Analysis
Exploratory data analysis is a term used to
describe a group of techniques (largely
graphical in nature) that sheds light on the
structure of the data. Without this
knowledge the scientist, or anyone else,
cannot be sure they are using the correct
form of statistical evaluation.
The statistics and graphs referred to in this
first section are applicable to a single
column of data (i.e., univariate data), such as the number of analyses performed in a
laboratory each month. For small amounts
of data (
obtained). If any systematic trends are
observed (Figures 3(a)-3(c)) then the
reasons for this must be investigated.
Normal statistical methods assume a
random distribution about the mean with
time (Figure 3(d)) but if this is not the case
the interpretation of the statistics can be
erroneous.
Summary Statistics
Summary statistics are used to make sense
of large amounts of data. Typically, the
mean, sample standard deviation, range,
confidence intervals, quantiles (1), and
measures for skewness and
spread/peakedness of the distribution
(kurtosis) are reported (2). The mean and
sample standard deviation are the most
widely used and are discussed below
together with how they relate to the
confidence intervals for normally
distributed data.
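The summary statistics described above are easily reproduced outside a spreadsheet. As a sketch (here in Python rather than Excel or a calculator, which is an assumption of this example; the data are the Method 1 selenium results from Table 3 later in this supplement):

```python
import statistics

# Method 1 results from Table 3 of this supplement
data = [4.2, 4.5, 6.8, 7.2, 4.3]

mean = statistics.mean(data)        # arithmetic mean
s = statistics.stdev(data)          # sample standard deviation (n - 1 denominator)
data_range = max(data) - min(data)  # range

print(mean, round(s, 3), round(data_range, 1))  # 5.4 1.471 3.0
```

These values (mean 5.40, sample standard deviation 1.471) agree with those quoted in Table 3.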
The Mean
The average or arithmetic mean (3) is
generally the first statistic everyone is
taught to calculate. This statistic is easily found using a calculator or spreadsheet
and simply involves the summing of the
individual results (x1, x2, x3, ..., xn) and
division by the number of results (n),
where,
x̄ = (x1 + x2 + x3 + ... + xn)/n = (1/n) Σ xi,  summing over i = 1 to n
Unfortunately, the mean is often reported
as an estimate of the true value (μ) of
whatever is being measured without
considering the underlying distribution.
This is a mistake. Before any statistic is
calculated it is important that the raw data
should be carefully scrutinized and plotted
as described above. An outlying point can have a big effect on the mean (compare
Figure 1(a) with 1(b)).
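The pull an outlier exerts on the mean can be seen in a small sketch (hypothetical values, in Python rather than a calculator):

```python
# Illustrative sketch (hypothetical values): one outlier pulls the mean upwards
clean = [5, 6, 7, 5, 6, 7, 6]     # n = 7, mean = 6
with_outlier = clean + [20]       # the same data plus one aberrant result

mean_clean = sum(clean) / len(clean)
mean_outlier = sum(with_outlier) / len(with_outlier)
print(mean_clean, mean_outlier)   # 6.0 7.75
```

A single aberrant result shifts the mean by nearly two units, which is why plotting the raw data first is so important.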
The Standard Deviation (3)
The standard deviation is a measure of the
spread of data (dispersion) about the mean
and can again be calculated using a
calculator or spreadsheet. There is,
however, a slight added complication; if
you look at a typical scientific calculator
you will notice there are two types of
Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each value into a stem (the leading digits; in the figure below, from 0.1 to 0.6) and a leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note, the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range); thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.
Stem-and-leaf plot
Units = 0.1   1|2 = 0.12   Count = 42
5 1|22677
14 2|112224578
(15) 3|000011122333355
13 4|0047889
6 5|56669
1 6|3
*outlier
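The truncate-then-group construction described in Box 1 can be sketched in a few lines (a Python sketch with hypothetical values; the article itself assumes the plot is drawn by hand or by a statistics package):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values as stem (first decimal digit) | leaf (second decimal digit).
    Trailing decimal places are truncated, not rounded, as described in Box 1."""
    stems = defaultdict(list)
    for v in sorted(values):
        two_digits = int(v * 100 + 1e-6)   # first two decimal digits, truncated
        stems[two_digits // 10].append(two_digits % 10)
    return dict(sorted(stems.items()))

# Hypothetical values: 0.216 -> 2|1 and 0.350 -> 3|5, as in the text
plot = stem_and_leaf([0.216, 0.350, 0.12, 0.17, 0.63, 0.35, 0.31])
for stem, leaves in plot.items():
    print(f"{stem}|{''.join(str(leaf) for leaf in leaves)}")
```

This prints one line per stem (1|27, 2|1, 3|155, 6|3 for the data above), mirroring the layout of the plot in the box.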
figure 2 Frequency histogram and Box and Whisker plot. (Panels (a) and (b): frequency (number of data points in each bar) and a box plot marking the median, lower and upper quartile values, the interquartile range, 1.5 × interquartile whiskers, and outliers (*). The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.)
figure 3 Time-indexed plots of magnitude against time. ((a) n = 7, mean = 6, standard deviation = 2.16; (b) n = 9, mean = 6, standard deviation = 2.65; (c) n = 9, mean = 6, standard deviation = 2.06; (d) n = 9, mean = 6, standard deviation = 1.80.)
Significance Testing
Suppose, for example, we have the
following two sets of results for lead
content in water 17.3, 17.3, 17.4, 17.4
and 18.5, 18.6, 18.5, 18.6. It is fairly clear,
by simply looking at the data, that the two
sets are different. In reaching this
conclusion you have probably considered the amount of data, the average for each
set and the spread in the results. The
difference between two sets of data is,
however, not so clear in many situations.
The application of significance tests gives
us a more systematic way of assessing the
results with the added advantage of
allowing us to express our conclusion with
a stated degree of confidence.
What does significance mean?
In statistics the words significant and
significance have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of purely random variation.
As previously mentioned, to decide if one
set of results is significantly different from
another depends not only on the
magnitude of the difference in the means
but also on the amount of data available
and its spread. For example, consider the
blob plots shown in Figure 5. For the two
data sets shown in Figure 5(a), the means
for set (i) and set (ii) are numerically
different. From the limited amount of
information available, however, they are
from a statistical point of view the same.
For Figure 5(b), the means for set (i) and set (ii) are probably different but when
fewer data points are available, Figure 5(c),
we cannot be sure with any degree of
confidence that the means are different
even if they are a long way apart. With a
large number of data points, even a very
small difference, can be significant (Figure
5(d)). Similarly, when we are interested in
comparing the spread of results, for
example, when we want to know if
method (i) gives more consistent results
than method (ii), we have to take note of
the amount of information available (Figures 5(e)-(g)).
It is fortunate that tables are published
that show how large a difference needs to
be before it can be considered not to have
occurred by chance. These are the critical t-values for differences between means,
and critical F-values for differences
between the spread of results (4).
Note: Significance is a function of sample
size. Comparing very large samples will
nearly always lead to a significant
difference but a statistically significant
result is not necessarily an important result.
For example in Figure 5(d) there is a
statistically significant difference, but does
it really matter in practice?
What is a t-test?
A t-test is a statistical procedure that can
be used to compare mean values. A lot of
jargon surrounds these tests (see Table 1
for definition of the terms used below) but
they are relatively simple to apply using the
built-in functions of a spreadsheet like
Excel or a statistical software package.
Using a calculator is also an option but you
have to know the correct formula to apply
(see Table 2) and have access to statistical
tables to look up the so-called critical
values (4).
Three worked examples are shown in Box 2 (5) to illustrate how the different
t-tests are carried out and how to interpret
the results.
What is an F-test?
An F-test compares the spread of results in
two data sets to determine if they could
reasonably be considered to come from the
same parent distribution. The test can, therefore, be used to answer questions such as "are two methods equally precise?"
The measure of spread used in the F-test is
variance, which is simply the square of the standard deviation. The variances are
ratioed (i.e., divide the variance of one set of data by the variance of the other) to get the test value F = s1²/s2².
This F value is then compared with a critical
value that tells us how big the ratio needs
to be to rule out the difference in spread
occurring by chance. The Fcrit value is
found from tables using (n1 - 1) and (n2 - 1)
degrees of freedom, at the appropriate
level of confidence. [Note: it is usual to arrange s1 and s2 so
that F > 1]. If the standard deviations are to
be considered to come from the same
population then Fcrit > F. As an example we
use the data in Example 2 (see Box 2).
Fcrit = 9.605 for (5 - 1) and (5 - 1) degrees of freedom at the 97.5% confidence level.
As Fcrit > Fcalculated we can conclude that the spread of results in the two data sets is not significantly different and it is,
therefore, reasonable to combine the two
standard deviations as we have done.
F = s1²/s2² (arranged so that F > 1) = 2.750²/1.471² = 3.49
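The F-ratio arithmetic above is simple enough to check directly (a Python sketch, which is an assumption of this example; the critical value is the tabulated one quoted in the text):

```python
# Sketch of the F-test arithmetic from Example 2
s1, s2 = 1.471, 2.750            # sample standard deviations of the two data sets
n1, n2 = 5, 5

F = max(s1, s2) ** 2 / min(s1, s2) ** 2   # arranged so that F > 1
F_crit = 9.605                   # from tables: (5 - 1) and (5 - 1) d.o.f., 97.5% level

print(round(F, 2), F < F_crit)   # 3.49 True -> spreads not significantly different
```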
table 1 Definitions of statistical terms used in significance testing.

Jargon: Definition
Alternative hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed] or mean1 > mean2 [see one-tailed]).
Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level, against which the result of applying a significance test is compared.
Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).
One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ...
Box 2
Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.
To test the null hypothesis we calculate the t-value as below. Note, the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation) and the amount of data available (n).
In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means, or only that there is a difference? That is, are we performing a one-sided or two-sided t-test (see Table 1)? In the case above, it is the latter; therefore, the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n - 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.
t = (23.5 - 22.7)/(0.9/√10) = 2.81
tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom.
As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.
[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]
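The one-sample t calculation in Example 1 can be sketched as follows (Python rather than a calculator, which is an assumption of this example; the critical value is the tabulated one quoted above):

```python
import math

# Example 1: one-sample t-test of the new method mean against the long-term mean
long_term_mean = 22.7   # g/L, old method check samples
new_mean = 23.5         # g/L
s, n = 0.9, 10          # standard deviation and number of results for new method

t = abs(new_mean - long_term_mean) / (s / math.sqrt(n))
t_crit = 2.26           # two-sided, 95% confidence, 9 degrees of freedom (tables)

print(round(t, 2), t > t_crit)   # 2.81 True -> reject H0
```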
Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3.
Using the t-test for independent sample means we define the null hypothesis H0 as: x̄1 = x̄2.
This means there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviations (Sc) (see What is an F-test?). If the standard deviations are significantly different then the t-test for unequal variances should be used (Table 2).
Evaluating the test statistic:
Sc = √[(1.471² × (5 - 1) + 2.750² × (5 - 1))/(5 + 5 - 2)] = 2.205
t = (5.40 - 4.76)/(Sc × √(1/5 + 1/5))
The 95% critical value is 2.306 for
n = 8 (n1 + n2 - 2) degrees of freedom.
This exceeds the calculated value of
0.459, thus the null hypothesis (H0)
cannot be rejected and we conclude
there is no significant difference between
the means or the results given by the
two methods.
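The pooled two-sample calculation of Example 2 can be reproduced end to end (a Python sketch, which is an assumption of this example; the data are from Table 3):

```python
import math

# Example 2: pooled two-sample t-test on the selenium data of Table 3
method1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method2 = [9.2, 4.0, 1.9, 5.2, 3.5]

def mean(x):
    return sum(x) / len(x)

def stdev(x):
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

n1, n2 = len(method1), len(method2)
s1, s2 = stdev(method1), stdev(method2)

# Pooled standard deviation (valid because the F-test found no significant
# difference between the two spreads)
sc = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = abs(mean(method1) - mean(method2)) / (sc * math.sqrt(1/n1 + 1/n2))

print(round(sc, 3), round(t, 3))   # 2.205 0.459
```

Both the pooled standard deviation (2.205) and the test statistic (0.459) match the worked values in the text.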
Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.
The null hypothesis is H0: d̄ = 0 against the alternative H1: d̄ ≠ 0.
The test is a two-tailed test as we are interested in both d̄ > 0 and d̄ < 0.
The mean of the paired differences is d̄ = -0.475 and their sample standard deviation is sd = 0.700.
The tabulated value of tcrit (with
n = 7 degrees of freedom, at the 95%
confidence limit) is 2.365. Since the calculated value is less than the critical
value, H0 cannot be rejected and it
follows that there is no difference between
the two techniques.
t = 0.475 × √8/0.700 = 1.918
t = (5.40 - 4.76)/(2.205 × 0.632) = 0.64/1.395 = 0.459  (Example 2)
table 3 Results from two methods used to determine concentrations of selenium.

            Results                   x̄     s
Method 1    4.2  4.5  6.8  7.2  4.3   5.40  1.471
Method 2    9.2  4.0  1.9  5.2  3.5   4.76  2.750
table 4 Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
Matrix             1      2      3      4      5      6      7      8
A (mg/g)        2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
B (mg/g)        3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
Difference (d) -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02
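The paired-difference statistics of Example 3 follow directly from Table 4 (a Python sketch, which is an assumption of this example):

```python
import math

# Example 3: paired t-test on the vitamin data of Table 4
a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]   # Method A (mg/g)
b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]   # Method B (mg/g)

d = [x - y for x, y in zip(a, b)]     # paired differences
n = len(d)
d_mean = sum(d) / n
s_d = math.sqrt(sum((v - d_mean) ** 2 for v in d) / (n - 1))

t = abs(d_mean) * math.sqrt(n) / s_d
print(round(d_mean, 3), round(s_d, 3), round(t, 3))   # -0.475 0.7 1.918
```

The calculated t (1.918) is below the tabulated critical value of 2.365, so H0 is not rejected, as in the text.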
Two-way ANOVA
In a typical experiment things can be more
complex than described previously. For
example, in Example 2 the aim is to find
out if time and/or temperature have any
effect on protein yield when analysing
samples of tinned ham. When analysing
data from this type of experiment we use two-way ANOVA. Two-way ANOVA can
test the significance of each of two
experimental variables (factors or
treatments) with respect to the response,
such as an instrument's output. When
replicate measurements are made we can
also examine whether or not there are
significant interactions between variables.
An interaction is said to be present when
the response being measured changes
more than can be explained from the
change in level of an individual factor. This
is illustrated in Figure 2 for a process with two factors (Y and Z) when both factors
are studied at two levels (low and high). In
Figure 2(b), the changes in response
caused by Y depend on Z, and vice versa.
In two-way ANOVA we ask the
following questions:
Is there a significant interaction between
the two factors (variables)?
Does a change in any of the factors
affect the measured result?
It is important to check the answers in the
right order: Figure 3 illustrates the
decision process. In the case of Example2 the questions are:
Is there an interaction between
temperature and time which affects the
protein yield?
Does time and/or temperature affect the
protein yield?
Using the built-in functions of a spreadsheet (in this case Excel's data analysis tools, two-factor analysis with replication) we see that there is a significant interaction between time and temperature and a significant effect of temperature alone (both p-value < 0.05 and F > Fcrit). Following the process
outlined in Figure 3, we consider the
interaction question first by comparing the
mean squares (MS) for the within-group
variation with the interaction MS. This is
reported in the results table of Example 2.
F = 0.021911/0.004315 = 5.078
If the interaction is significant (F > Fcrit),
as in this case, then the individual factors
(time and temperature) should each be
compared with the MS for the interaction
(not the within-group MS) thus:
Ftemp = 0.024844/0.021911 = 1.134
Example 2 Two-way ANOVA
The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 C) and three times (30, 60 and 90 minutes). Three analyses, determining protein yield, were made at each temperature and time. The measurements are summarized in the table below and the results of the two-way ANOVA are given in the following table.
Protein yield at each combination of time and temperature (three replicate analyses per cell):

Temp (C)   30 min                60 min                90 min
415        27.13  27.20  27.13   27.29  27.13  27.23   27.03  27.13  27.07
435        27.20  26.97  27.13   27.07  27.10  27.03   27.20  27.23  27.27
460        27.03  27.10  27.13   27.10  27.07  27.03   27.03  27.07  26.90
Anova: Two-factor with replication

Source of Variation       SS        df   MS        F         P-value   F crit
Sample (=Time)            0.000867   2   0.000433  0.100429  0.904952  3.554561
Columns (=Temperature)    0.049689   2   0.024844  5.757940  0.011667  3.554561
Interaction               0.087644   4   0.021911  5.078112  0.006437  2.927749
Within                    0.077667  18   0.004315
Total                     0.215867  26
Example 1 An example of one-way ANOVA carried out by Excel
(Note: the data table has been split into two sections (A_1 to A_6, A_7 to A_12) for display purposes. The ANOVA is carried out on a single table.)
SS = sum of squares, df = degrees of freedom, MS = mean square (SS/df).
The P-value is < 0.05 (the F value is > Fcrit at the 95% confidence level for 11 and 36 degrees of freedom), therefore it can be concluded that there is a significant difference between the analysts' results.
Replicate results for each analyst:

             A_1    A_2    A_3    A_4    A_5    A_6
Replicate 1  34.10  35.84  36.67  40.54  41.19  41.22
Replicate 2  34.10  36.58  37.33  40.67  40.29  39.61
Replicate 3  34.69  31.30  36.96  40.81  40.99  37.89
Replicate 4  34.60  34.19  36.83  40.78  40.40  36.67

             A_7    A_8    A_9    A_10   A_11   A_12
Replicate 1  40.71  39.20  42.50  39.75  36.04  44.36
Replicate 2  40.91  39.30  42.30  39.69  37.03  45.73
Replicate 3  40.80  39.30  42.50  39.23  36.85  45.25
Replicate 4  38.42  39.30  42.50  39.73  36.24  45.34
Anova: Single Factor

Source of Variation   SS         df   MS         F          P-value   F crit
Between Groups        438.7988   11   39.890800  40.31545   6.6E-17   2.066606
Within Groups          35.6208   36    0.989467
Note: in the two-factor example above, the spreadsheet (Excel) labels the sources of variation as Sample, Columns, Interaction and Within. Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a measure of the within-group variation.
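The quantities in an "Anova: Single Factor" table can be computed from first principles (a Python sketch, which is an assumption of this example; the three illustrative groups below stand in for three analysts' replicate results):

```python
# One-way ANOVA from first principles, mirroring what Excel's
# "Anova: Single Factor" tool computes (illustrative data)
groups = [
    [34.10, 34.10, 34.69, 34.60],
    [40.54, 40.67, 40.81, 40.78],
    [42.50, 42.30, 42.50, 42.50],
]

N = sum(len(g) for g in groups)
k = len(groups)
grand_mean = sum(v for g in groups for v in g) / N

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

ms_between = ss_between / (k - 1)   # df between = k - 1
ms_within = ss_within / (N - k)     # df within = N - k
F = ms_between / ms_within          # compare with Fcrit for (k - 1, N - k) d.o.f.
print(F > 10)                       # True: very strong between-group difference
```

Here the between-group spread dwarfs the within-group spread, so F is enormous and the group means clearly differ.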
Ftime = 0.000433/0.021911 = 0.020
Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)
In other words, neither of the individual factors is significant when compared with the interaction of time and temperature; the interaction therefore dominates and is worth further investigation. If one or both of the individual factors were significant compared with the interaction, then the individual factor or factors would dominate and, for all practical purposes, any interaction could be ignored.
If the interaction term is not significant then it can be considered to be another small
error term and can thus be pooled with the within-group (error) sums of squares term. It is
the pooled value (SS2pooled) that is then used as the denominator in the F-test to
determine if the individual factors affect the measured results significantly. To combine the
sums of squares the following formula is used:
where dofinter and dofwithin are the degrees of freedom for the interaction term and
error term, and SSinter and SSwithin are the sums of squares for the interaction term and
error term, respectively.
(dofpooled = dofinter + dofwithin)
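The mean-square comparisons described above, together with the pooling formula, can be sketched using the values from the two-way ANOVA table in Example 2 (a Python sketch, which is an assumption of this example):

```python
# Mean-square comparisons for two-way ANOVA with replication (Figure 3),
# using the SS and df values from the Example 2 table
ss = {"time": 0.000867, "temp": 0.049689, "inter": 0.087644, "within": 0.077667}
df = {"time": 2, "temp": 2, "inter": 4, "within": 18}
ms = {k: ss[k] / df[k] for k in ss}

F_inter = ms["inter"] / ms["within"]   # ~5.078: interaction vs within-group
# The interaction is significant (Fcrit = 2.93), so each factor is
# compared with the interaction mean square:
F_temp = ms["temp"] / ms["inter"]      # ~1.134
F_time = ms["time"] / ms["inter"]      # ~0.020

# Had the interaction NOT been significant, it would be pooled with the error:
s2_pooled = (ss["inter"] + ss["within"]) / (df["inter"] + df["within"])
print(round(F_inter, 3), round(F_temp, 3), round(F_time, 3))
```

The three F values reproduce those quoted in the text (5.078, 1.134 and 0.020).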
Selecting the ANOVA method
One-way ANOVA should be used when there is only one factor being considered and
replicate data from changing the level of that factor are available. Two-way ANOVA (with
or without replication) is used when there are two factors being considered. If no replicate
data are collected then the interactions between the two factors cannot be calculated.
Higher level ANOVAs are also available for looking at more than two factors.
Advantages of ANOVA
Compared with using multiple t-tests, one-way and two-way ANOVA require fewer
measurements to discover significant effects (i.e., the tests are said to have more power).
This is one reason why ANOVA is used frequently when analysing data from statistically designed experiments.
Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex
experimental situations but a description of these is beyond the scope of this introductory
article. More details can be found in reference 6.
s²pooled = (SSinter + SSwithin)/(dofinter + dofwithin)
Interpretation of the result(s)
To reiterate the interpretation of ANOVA
results, a calculated F-value that is greater
than Fcrit for a stated level of confidence
(typically 95%) means that the difference
being tested is statistically significant at
that level. As an alternative to using the F-values, the p-value can be used to indicate the degree of confidence we have that there is a significant difference between means (i.e., (1 - p) × 100 is the percentage confidence). Normally a p-value of < 0.05 is considered to denote a significant difference.
Note: Extrapolation of ANOVA results is
not advisable, so in Example 2 for instance,
it is impossible to say if a time of 15 or 120
minutes would lead to a measurable effect
on protein yield. It is, therefore, always
more economic in the long run to design the experiment in advance, in order to cover the likely ranges of the parameter(s)
of interest.
Avoiding some of
the pitfalls using ANOVA
In ANOVA it is assumed that the data for
each variable are normally distributed.
Usually in ANOVA we don't have a large
amount of data so it is difficult to prove
any departure from normality. It has been
shown, however, that even quite large
deviations do not affect the decisions
made on the basis of the F-test.
A more important assumption in ANOVA is that the variance (spread) between groups is homogeneous (homoscedastic). If this is not the case (this often happens in chemistry, see Figure 1) then the F-test can suggest a statistically significant difference when none is present. The best way to avoid this pitfall is, as ever, to plot the data. There are also a number of tests for heteroscedasticity (e.g., Bartlett's test (5) and Levene's test (2)).
figure 1 Plot comparing the results from 12 analysts (analyte concentration (ppm) against analyst ID, showing each mean and the total standard deviation).
It may be possible to overcome this type of problem in the data structure by transforming the data, such as by taking logs (7).
If the variability within a group is
correlated with its mean value then
ANOVA may not be appropriate and/or it
may indicate the presence of outliers in the
data (Figure 4). Cochran's test (5) can be
used to test for variance outliers.
Conclusions
ANOVA is a powerful tool for
determining if there is a statistically
significant difference between two or
more sets of data.
One-way ANOVA should be used
when we are comparing several sets
of observations.
Two-way ANOVA is the method
used when there are two separate
factors that may be influencing a result.
Except for the smallest of data sets, ANOVA is best carried out using a
spreadsheet or statistical software
package.
You should always plot your data to
make sure the assumptions ANOVA is
based on are not violated.
Acknowledgements
The preparation of this paper was
supported under a contract with the UK
Department of Trade and Industry as part
of the National Measurement System Valid
Analytical Measurement Programme (VAM) (8).
References
(1) S. Burke, Scientific Data Management, 1(1), 32-38, September 1997.
(2) G.A. Milliken and D.E. Johnson, Analysis of Messy Data, Volume 1: Designed Experiments, Van Nostrand Reinhold Company, New York, USA (1984).
(3) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK (ISBN 0 13 030990 7).
(4) C. Chatfield, Statistics for Technology, Chapman & Hall, London, UK (ISBN 0 412 25340 2).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 442 6) (1997).
(6) K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press Inc. (ISBN 0 12 471252 5) (1979).
(7) ISO 4259: 1992, Petroleum Products - Determination and Application of Precision Data in Relation to Methods of Test, Annex E, International Organisation for Standardisation, Geneva, Switzerland (1992).
(8) M. Sargent, VAM Bulletin, Issue 13, 4-5, Laboratory of the Government Chemist (Autumn 1995).
Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).
figure 4 A plot of variance against the mean value, marking the region of significantly different means by ANOVA and an unreliable high mean (which may contain outliers).
figure 3 Comparing mean squares in two-way ANOVA with replication. (Flowchart: start by comparing the within-group mean squares with the interaction mean squares. If there is a significant difference (F > Fcrit), compare the interaction mean squares with the individual factor mean squares; if not, pool the within-group and interaction sums of squares and compare the pooled mean squares with the individual factor mean squares.)
figure 2 Interactive factors. ((a) Y and Z are independent; (b) Y and Z are interacting. Response plotted against factor Y from YLow to YHigh, at ZLow and ZHigh.)
Calibration is fundamental to achieving
consistency of measurement. Often
calibration involves establishing the
relationship between an instrument
response and one or more reference
values. Linear regression is one of the most
frequently used statistical methods in
calibration. Once the relationship between
the input value and the response value
(assumed to be represented by a straight
line) is established, the calibration model is used in reverse; that is, to predict a value
from an instrument response. In general,
regression methods are also useful for
establishing relationships of all kinds, not
just linear relationships. This paper
concentrates on the practical applications
of linear regression and the interpretation
of the regression statistics. For those of you
who want to know about the theory of
regression there are some excellent
references (1-6).
For anyone intending to apply linear
least-squares regression to their own data, it is recommended that a statistics/graphics
package is used. This will speed up the
production of the graphs needed to
confirm the validity of the regression
statistics. The built-in functions of a
spreadsheet can also be used if the
routines have been validated for accuracy
(e.g., using standard data sets (7)).
What is regression?
In statistics, the term regression is used to
describe a group of methods that
summarize the degree of association
between one variable (or set of variables)
and another variable (or set of variables).
The most common statistical method used
to do this is least-squares regression, which
works by finding the best curve through
the data that minimizes the sums of
squares of the residuals. The important
term here is the best curve, not the
method by which this is achieved. There
are a number of least-squares regression
models, for example, linear (the most
common type), logarithmic, exponential
and power. As already stated, this paper will concentrate on linear least-squares regression.
[You should also be aware that there are
other regression methods, such as ranked
regression, multiple linear regression, non-
linear regression, principal-component
regression, partial least-squares regression,
etc., which are useful for analysing instrument
or chemically derived data, but are beyond
the scope of this introductory text.]
What do the linear least-squares
regression statistics mean?
Correlation coefficient: Whether you use a calculator's built-in functions, a
spreadsheet or a statistics package, the
first statistic most chemists look at when
performing this analysis is the correlation
coefficient (r). The correlation coefficient
ranges from -1, a perfect negative relationship, through zero (no relationship), to +1, a perfect positive relationship (Figures 1(a)-(c)). The correlation coefficient
is, therefore, a measure of the degree of
linear relationship between two sets of
data. However, the r value is open to
misinterpretation (8) (Figures 1(d) and (e),
show instances in which the r values alone
would give the wrong impression of the
underlying relationship). Indeed, it is
possible for several different data sets to
yield identical regression statistics (r value,
residual sum of squares, slope and
intercept), but still not satisfy the linear
assumption in all cases (9). It, therefore,
remains essential to plot the data in order
to check that linear least-squares statistics
are appropriate.
As in the t-tests discussed in the first
paper (10) in this series, the statistical
significance of the correlation coefficient is dependent on the number of data points.
To test if a particular r value indicates a
statistically significant relationship we can
use the Pearson's correlation coefficient
test (Table 1). Thus, if we only have four
points (for which the number of degrees of
freedom is 2) a linear least-squares
correlation coefficient of 0.94 will not be
significant at the 95% confidence level.
However, if there are more than 60 points
an r value of just 0.26 (r² = 0.0676) would
indicate a significant, but not very strong,
positive linear relationship. In other words, a relationship can be statistically significant
but of no practical value. Note that the test
used here simply shows whether two sets
are linearly related; it does not prove
linearity or adequacy of fit.
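The significance check described above is easy to sketch in code. The data, the helper function and the variable names below are all illustrative; the critical value 0.576 is taken from Table 1 for 10 degrees of freedom at the 95% confidence level:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 12 hypothetical calibration points, so n - 2 = 10 degrees of freedom
x = list(range(1, 13))
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.8, 18.1, 20.2, 21.9, 24.1]

r = pearson_r(x, y)
critical = 0.576        # Table 1: 10 degrees of freedom, 95% confidence
significant = abs(r) > critical
```

As the text stresses, a significant result here only shows a linear association; it does not prove linearity or adequacy of fit.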
It is also important to note that a
significant correlation between one
variable and another should not be taken
as an indication of causality. For example,
there is a negative correlation between
time (measured in months) and catalyst
performance in car exhaust systems.
However, time is not the cause of the
deterioration, it is the build up of sulfur
and phosphorus compounds that
gradually poisons the catalyst. Causality is,
in fact, very difficult to prove unless the
chemist can vary systematically and
independently all critical parameters, while
measuring the response for each change.

Regression and Calibration
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.
One of the most frequently used statistical methods in calibration is linear regression. This third paper in our statistics refresher series concentrates on the practical applications of linear regression and the interpretation of the regression statistics.
Slope and intercept
In linear regression the relationship
between the X and Y data is assumed to
be represented by a straight line, Y = a +
bX (see Figure 2), where Y is the estimated
response/dependent variable, b is the slope
(gradient) of the regression line and a is
the intercept (Y value when X = 0). This
straight-line model is only appropriate if
the data approximately fits the assumption
of linearity. This can be tested for by plotting the data and looking for curvature
(e.g., Figure 1(d)) or by plotting the
residuals against the predicted Y values or
X values (see Figure 3).
Although the relationship may be known
to be non-linear (i.e., follow a different
functional form, such as an exponential
curve), it can sometimes be made to fit the
linear assumption by transforming the data
in line with the function, for example, by
taking logarithms or squaring the Y and/or
X data. Note that if such transformations
are performed, weighted regression (discussed later) should be used to obtain
an accurate model. Weighting is required
because of changes in the residual/error
structure of the regression model. Using
non-linear regression may, however, be a
better alternative to transforming the data
when this option is available in the
statistical packages you are using.
Residuals and residual standard error
A residual value is calculated by taking the
difference between the predicted value
and the actual value (see Figure 2). When the residuals are plotted against the
predicted (or actual) data values the plot
becomes a powerful diagnostic tool,
enabling patterns and curvature in the data
to be recognized (Figure 3). It can also be
used to highlight points of influence (see
Bias, leverage and outliers overleaf).
The residual standard error (RSE, also
known as the residual standard deviation,
RSD) is a statistical measure of the average
residual. In other words, it is an estimate
of the average error (or deviation) about
the regression line. The RSE is used to
calculate many useful regression statistics
including confidence intervals and outlier
test values.
The residual standard error is calculated as

RSE = s(y) × √[ (n-1) × (1 - r²) / (n-2) ]

where s(y) is the standard deviation of the y values in the calibration, n is the number of
data pairs and r is the least-squares regression correlation coefficient.

Confidence intervals
As with most statistics, the slope (b) and intercept (a) are estimates based on a finite
sample, so there is some uncertainty in the values. (Note: Strictly, the uncertainty arises
from random variability between sets of data. There may be other uncertainties, such as
measurement bias, but these are outside the scope of this article.) This uncertainty is
quantified in most statistical routines by displaying the confidence limits and other
statistics, such as the standard error and p values. Examples of these statistics are given in
Table 2.
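As a sanity check on the RSE formula, the sketch below (with made-up calibration data) computes the RSE both directly from the residuals and via the shortcut based on s(y) and r; the two routes agree:

```python
import math

# Hypothetical calibration data
x = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
y = [0.02, 0.21, 0.40, 0.64, 0.81, 1.02]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((v - mx) ** 2 for v in x)
syy = sum((v - my) ** 2 for v in y)
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))

b = sxy / sxx                      # slope
a = my - b * mx                    # intercept
r = sxy / math.sqrt(sxx * syy)    # correlation coefficient

# Direct definition: square root of (sum of squared residuals / (n - 2))
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
rse_direct = math.sqrt(rss / (n - 2))

# Shortcut from the text: RSE = s(y) * sqrt((n - 1) * (1 - r^2) / (n - 2))
s_y = math.sqrt(syy / (n - 1))
rse_shortcut = s_y * math.sqrt((n - 1) * (1 - r ** 2) / (n - 2))
```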
table 1 Pearson's correlation coefficient test.
Degrees of freedom Confidence level
(n-2) 95% (α = 0.05) 99% (α = 0.01)
2 0.950 0.990
3 0.878 0.959
4 0.811 0.917
5 0.754 0.875
6 0.707 0.834
7 0.666 0.798
8 0.632 0.765
9 0.602 0.735
10 0.576 0.708
11 0.553 0.684
12 0.532 0.661
13 0.514 0.641
14 0.497 0.623
15 0.482 0.606
20 0.423 0.537
30 0.349 0.449
40 0.304 0.393
60 0.250 0.325
Significant correlation when |r| ≥ table value
[Graph: critical values of the correlation coefficient (r) versus degrees of freedom (n-2) at the 95% and 99% confidence levels.]
The p value is the probability that a value could arise by chance if the true value was
zero. By convention a p value of less than 0.05 indicates a significant non-zero statistic.
Thus, examining the spreadsheet's results, we can see that there is no reason to reject the
hypothesis that the intercept is zero, but there is a significant non-zero positive
gradient/relationship. The confidence intervals for the regression line can be plotted for all
points along the x-axis and is dumbbell in shape (Figure 2). In practice, this means that the
model is more certain in the middle than at the extremes, which in turn has important
consequences for extrapolating relationships. When regression is used to construct a calibration model, the calibration graph is used
in reverse (i.e., we predict the X value from the instrument response [Y-value]). This
prediction has an associated uncertainty (expressed as a confidence interval).

X_predicted = (Ȳ - a) / b

Conf. interval for the prediction is:

X_predicted ± ( t × RSE / b ) × √[ 1/m + 1/n + (Ȳ - ȳ)² / ( b² × (n-1) × s(x)² ) ]

where a is the intercept and b is the slope obtained from the regression equation, Ȳ is the mean value of the response (e.g., instrument readings) for m replicates (replicates are repeat measurements made at the same level), ȳ is the mean of the y data for the n points in the calibration, t is the critical value obtained from t-tables for n-2 degrees of freedom and s(x) is the standard deviation for the
x data for the n points in the calibration.
RSE is the residual standard error for the
calibration.
If we want, therefore, to reduce the size
of the confidence interval of the prediction
there are several things that can be done.
1. Make sure that the unknown
determinations of interest are close to the centre of the calibration (i.e., close
to the values x̄, ȳ [the centroid point]).
This suggests that if we want a small
confidence interval at low values of x
then the standards/reference samples
used in the calibration should be
concentrated around this region. For
example, in analytical chemistry, a typical
pattern of standard concentrations
might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6
figure 3 Residuals plot (residuals versus X, with one point marked as a possible outlier).

figure 1 Correlation coefficients and goodness of fit (panels (a)–(g); the marked r values range from -1 through 0 to +1, including cases such as r = 0.99 and r = 0.9 where the r value alone misrepresents the underlying relationship).

figure 2 Calibration graph: fitted line Y = -0.046 + 0.1124X with r = 0.98731, showing the intercept, slope, residuals, confidence limits for the regression line and confidence limits for the prediction.
(i.e., only one or two standards are used at higher concentrations). While this will lead
to a smaller confidence interval at lower
concentrations the calibration model will
be prone to leverage errors (see below).
2. Increase the number of points in the
calibration (n). There is, however, little
improvement to be gained by going
above 10 calibration points unless
standard preparation and analysis is
rapid and cheap.
3. Increase the number of replicate
determinations for estimating the
unknown (m). Once again there is a law of diminishing returns, so the
number of replicates should typically
be in the range 2 to 5.
4. The range of the calibration can be
extended, providing the calibration is still
linear.
Bias, leverage and outliers
Points of influence, which may or may not
be outliers, can have a significant effect on
the regression model and therefore, on its
predictive ability. If a point is in the middle
of the model (i.e., close to x̄) but outlying
on the Y axis, its effect will be to move the
regression line up or down. The point is
then said to have influence because it
introduces an offset (or bias) in the
predicted values (see Figure 1(f)). If the
point is towards one of the extreme ends
of the plot its effect will be to tilt the
regression line. The point is then said to
have high leverage because it acts as a
lever and changes the slope of the
regression model (see Figure 1(g)).
Leverage can be a major problem if one or
two data points are a long way from all the other points along the X axis.
A leverage statistic (ranging between 1/n and 1) can be calculated for each value
of x. There is no set value above which this
leverage statistic indicates a point of
influence. A value of 0.9 is, however, used
by some statistical software packages.
where xi is the x value for which the leverage statistic is to be calculated, n is the
number of points in the calibration and x̄ is the mean of all the x values in the calibration.
To test if a data point (xi,yi) is an outlier (relative to the regression model) the following
outlier test can be applied.
where RSE is the residual standard error, s(y) is the standard deviation of the Y values, Yi is
the y value, n is the number of points, ȳ is the mean of all the y values in the calibration
and residual_max is the largest residual value. For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical
value is 2.37 (Table 3 for 10 data points). Although the point appears extreme, it could
reasonably be expected to arise by chance within the data set.
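The leverage statistic described above is straightforward to compute. In the sketch below (hypothetical data), the single point far from the others along the X axis receives a leverage close to 1:

```python
def leverage(xs, xi):
    # Leverage of a point at xi: 1/n + (xi - mean)^2 / sum((xj - mean)^2)
    n = len(xs)
    mx = sum(xs) / n
    return 1 / n + (xi - mx) ** 2 / sum((v - mx) ** 2 for v in xs)

# Five clustered x values plus one distant point acting as a lever
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = [leverage(x, xi) for xi in x]
# h[-1] (the distant point) dominates the fit
```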
Extrapolation and interpolation
We have already mentioned that the regression line is subject to some uncertainty and that
this uncertainty becomes greater at the extremes of the line. If we, therefore, try to
extrapolate much beyond the point where we have real data (±10%) there may be
relatively large errors associated with the predicted value. Conversely, interpolation near
the middle of the calibration will minimize the prediction uncertainty. It follows, therefore,
that when constructing a calibration graph, the standards should cover a larger range of
concentrations than the analyst is interested in. Alternatively, several calibration graphs covering smaller, overlapping, concentration ranges can be constructed.
Test value = |residual_max| / ( RSE × √[ 1 - 1/n - (Yi - ȳ)² / ( (n-1) × s(y)² ) ] )

Leverage_i = 1/n + (xi - x̄)² / Σ_{j=1…n} (xj - x̄)²
table 2 Statistics obtained using Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.
Coefficients Standard Error tStat p value Lower 95% Upper 95%
Intercept -0.046000012 0.039648848 -1.160185324 0.279423552 -0.137430479 0.045430455
Slope 0.112363638 0.00638999 17.58432015 1.11755E-07 0.097628284 0.127098992
*Note the large number of significant figures. In fact none of the values above warrant more than 3 significant figures!
figure 4 Plots of typical instrument response versus concentration: (a) response versus concentration; (b) residuals versus predicted value.
Weighted linear regression and calibration
In analytical science we often find that the precision changes with concentration. In
particular, the standard deviation of the data is proportional to the magnitude of the value
being measured (see Figure 4(a)). A residuals plot will tend to show this relationship even
more clearly (Figure 4(b)). When this relationship is observed (or if the data has been
transformed before regression analysis), weighted linear regression should be used for obtaining the calibration curve (3). The following description shows how the weighted
regression works. Don't be put off by the equations as most modern statistical software
packages will perform the calculations for you. They are only included in the text for
completeness.
Weighted regression works by giving points known to have a better precision a higher
weighting than those with lower precision. During method validation the way the standard
deviation varies with concentration should have been investigated. This relationship can
then be used to calculate the initial weightings
at each of the n
concentrations in the calibration.
These initial weightings can then be
standardized by multiplying by the number
of calibration points divided by the sum of
all the weights to give the final weights (Wi).
The regression model generated will be
similar to that for non-weighted linear
regression. The prediction confidence
intervals will, however, be different.
The weighted prediction (xw) for a given
instrument reading (y) for the regression
model forcing the line through the origin (y
= bx) is:
with
where Ȳ is the mean value of the
response (e.g., instrument readings) for m
replicates and xi and yi are the data pair for
the ith point.
By assuming the regression line goes
through the origin a better estimate of the slope is obtained, providing that the
assumption of a zero intercept is correct.
This may be a reasonable assumption in
some instrument calibrations. However, in
most cases, the regression line will no
longer represent the least-squares best
line through the data.
b(w) = [ Σ_{i=1…n} Wi xi yi ] / [ Σ_{i=1…n} Wi xi² ]

X(w)predicted = Ȳ / b(w)

Wi = wi × n / Σ_{j=1…n} wj

wi = 1 / si²
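The weighting scheme described above can be sketched as follows. The data are hypothetical, chosen so that the standard deviation is proportional to concentration (the situation in Figure 4):

```python
# Weighted regression through the origin, y = b(w) * x
x = [1.0, 2.0, 4.0, 8.0, 16.0]    # concentrations
y = [0.11, 0.20, 0.41, 0.79, 1.62]  # responses
s = [0.01, 0.02, 0.04, 0.08, 0.16]  # known standard deviation at each level

n = len(x)
w = [1 / si ** 2 for si in s]            # initial weightings, w_i = 1/s_i^2
W = [wi * n / sum(w) for wi in w]        # standardized weights (sum to n)

# Weighted slope for a line forced through the origin
b_w = (sum(Wi * xi * yi for Wi, xi, yi in zip(W, x, y))
       / sum(Wi * xi ** 2 for Wi, xi in zip(W, x)))

Y_bar = 0.60            # mean of m replicate readings of the unknown
x_w_pred = Y_bar / b_w  # weighted prediction
```

Note how the precise low-concentration points carry far more weight than the noisy high-concentration ones, which is exactly the intent of the method.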
table 3 Outlier test for simple linear least-squares regression.
Sample size Confidence table-value
(n) 95% 99%
5 1.74 1.75
6 1.93 1.98
7 2.08 2.17
8 2.20 2.23
9 2.29 2.44
10 2.37 2.55
12 2.49 2.70
14 2.58 2.82
16 2.66 2.92
18 2.72 3.00
20 2.77 3.06
25 2.88 3.25
30 2.96 3.36
35 3.02 3.40
40 3.08 3.43
45 3.12 3.47
50 3.16 3.51
60 3.23 3.57
70 3.29 3.62
80 3.33 3.68
90 3.37 3.73
100 3.41 3.78
[Graph: outlier test critical values versus number of samples (n) at the 95% and 99% confidence levels.]
References
(1) G.W. Snedecor and W.G. Cochran, Statistical Methods, The Iowa State University Press, USA, 6th edition (1967).
(2) N. Draper and H. Smith, Applied Regression Analysis, John Wiley & Sons Inc., New York, USA, 2nd edition (1981).
(3) BS ISO 11095: Linear Calibration Using Reference Materials (1996).
(4) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
(5) A.R. Hoshmand, Statistical Methods for Environmental and Agricultural Sciences, 2nd edition, CRC Press (ISBN 0-8493-3152-8) (1998).
(6) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 4226) (1997).
(7) Statistical Software Qualification: Reference Data Sets, Eds. B.P. Butler, M.G. Cox, S.L.R. Ellison and W.A. Hardcastle, Royal Society of Chemistry, London, UK (ISBN 0-85404-422-1) (1996).
(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1), 59 (1989).
(9) F.J. Anscombe, "Graphs in Statistical Analysis", American Statistician, 27, 17–21, February 1973.
(10) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(11) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).
Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).
The associated uncertainty for the weighted prediction, expressed as a confidence
interval is then:
Conf. interval for the prediction is
where t is the critical value obtained from t-tables for n-2 degrees of freedom at a
stated significance level (typically α = 0.05), Wi is the weighting for the
x data for the ith point in the calibration, m is the number of replicates and RSE(w) is the
weighted residual standard error for the calibration.
Conclusions
Always plot the data. Don't rely on the regression statistics to indicate a linear
relationship. For example, the correlation coefficient is not a reliable measure of
goodness-of-fit. Always examine the residuals plot. This is a valuable diagnostic tool.
Remove points of influence (leverage, bias and outlying points) only if a reason can be
found for their aberrant behaviour.
Be aware that a regression line is an estimate of the best line through the data and
that there is some uncertainty associated with it. The uncertainty, in the form of a
confidence interval, should be reported with the interpolated result obtained from any
linear regression calibrations.
Acknowledgement
The preparation of this paper was supported under a contract with the Department of
Trade and Industry as part of the National Measurement System Valid Analytical
Measurement Programme (VAM) (11).
Standard error for the calibration:

RSE(w) = √[ ( Σ_{j=1…n} Wj yj² - b(w)² × Σ_{j=1…n} Wj xj² ) / (n-1) ]

X(w)predicted ± ( t × RSE(w) / b(w) ) × √[ 1/(m × Wi) + Ȳ² / ( b(w)² × Σ_{j=1…n} Wj xj² ) ]
This is the last article in a series of short
papers introducing basic statistical methods
of use in analytical science. In the three
previous papers (1–3) we have assumed
the data has been tidy; that is, normally
distributed with no anomalous and/or
missing results. In the real world, however,
we often need to deal with messy data,
for example data sets that contain
transcription errors, unexpected extreme
results or are skewed. How we deal with
this type of data is the subject of this article.
Transcription errors
Transcription errors can normally be
corrected by implementing good quality
control procedures before statistical
analysis is carried out. For example, the
data can be independently checked or,
more rarely, the data can be entered, again
independently, into two separate files and
the files compared electronically to
highlight any discrepancies. There are also
a number of outlier tests that can be used
to highlight anomalous values before other
statistics are calculated. These tests do not
remove the need for good quality
assurance; rather they should be seen as
an additional quality check.
Missing data
No matter how well our experiments are
planned there will always be times when
something goes wrong, resulting in gaps in
the data. Some statistical procedures will
not work as well, or at all, with some data
missing. The best recourse is always to
repeat the experiment to generate the
complete data set. Sometimes, however,
this is not feasible, particularly where
readings are taken at set times or the cost
of retesting is prohibitive, so alternative
ways of addressing this problem are needed.
Current statistical software packages
typically deal with missing data by one of
three methods:
Casewise deletion excludes all examples
(cases) that have missing data in at least
one of the selected variables. For example,
in ICP–AAS (inductively coupled
plasma–atomic absorption spectroscopy)
calibrated with a number of standard
solutions containing several metal ions at
different concentrations, if the aluminium
value were missing for a particular test
portion, all the results for that test portion
would be disregarded (See Table 1).
This is the usual way of dealing with
missing data, but it does not guarantee
correct answers. This is particularly so, in
complex (multivariate) data sets where it is
possible to end up deleting the majority
of your data if the missing data are
randomly distributed across cases
and variables.
Pairwise deletion can be used as an
alternative to casewise deletion in
situations where parameters (correlation
coefficients, for example) are calculated on
successive pairs of variables (e.g., in a
recovery experiment we may be interested
in the correlations between material
recovered and extraction time, temperature,
particle size, polarity, etc. With pairwise
deletion, if one solvent polarity measurement
was missing only this single pair would be
deleted from the correlation and the
correlations for recovery versus extraction
time and particle size would be unaffected)
(see Table 2).
Pairwise deletion can, however, lead to
serious problems. For example, if there is a
hidden systematic distribution of missing
points then a bias may result when
calculating a correlation matrix (i.e., different
correlation coefficients in the matrix can be
based on different subsets of cases).
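Pairwise deletion is easy to demonstrate with the solvent-recovery data of Table 2, where one solvent polarity measurement is missing (represented below as None; the helper functions are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_r(x, y):
    # Pairwise deletion: drop only the pairs where either value is missing
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return pearson_r([a for a, _ in pairs], [b for _, b in pairs]), len(pairs)

recovery = [93, 105, 99, 73]
extraction_time = [20, 120, 180, 10]
polarity = [None, 1.8, 1.0, 1.5]   # one missing polarity measurement

r_time, n_time = pairwise_r(recovery, extraction_time)  # uses all 4 pairs
r_pol, n_pol = pairwise_r(recovery, polarity)           # uses only 3 pairs
```

Only the recovery-versus-polarity correlation loses a data point; the recovery-versus-extraction-time correlation is unaffected, which is the whole attraction (and the hidden danger) of pairwise deletion.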
Missing Values, Outliers, Robust Statistics & Non-parametric Methods
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.
This article, the fourth and final part of our statistics refresher series, looks at how to deal with messy data that contain transcription errors or extreme and skewed results.

table 1 Casewise deletion.

            Al    B     Fe    Ni
Solution 1  -     94.5  578   23.1
Solution 2  567   72.1  673   7.6
Solution 3  -     34.0  674   44.7
Solution 4  234   97.4  429   82.9

After casewise deletion (statistical analysis is only carried out on the reduced data set):

            Al    B     Fe    Ni
Solution 2  567   72.1  673   7.6
Solution 4  234   97.4  429   82.9

Mean substitution replaces all missing
data in a variable by the mean value for
that variable. Though this looks as if the
data set is now complete, mean substitution
has its own disadvantages. The variability
in the data set is artificially decreased in
direct proportion to the number of missing data points, leading to underestimates of
dispersion (the spread of the data). Mean
substitution may also considerably change
the values of some other statistics, such as
linear regression statistics (3), particularly
where correlations are strong (See Table 3).
Examples of these three approaches are
illustrated in Figure 1, for the calculation of
a correlation matrix, where the correlation
coefficient (r) (3) is determined for each
paired combination of the five variables,
A to E. Note how the r value can increase,
diminish or even reverse sign depending on
which method is chosen to handle the
missing data (i.e., the A, B correlation
coefficients).
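The artificial shrinkage in dispersion caused by mean substitution can be seen in a small sketch (hypothetical aluminium results with two missing values):

```python
import math

def sd(values):
    """Sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

# Hypothetical aluminium results; None marks a missing value
al = [567.0, None, 234.0, None, 410.0, 350.0]

observed = [v for v in al if v is not None]
mean_obs = sum(observed) / len(observed)

# Mean substitution: replace each missing value by the observed mean
filled = [v if v is not None else mean_obs for v in al]

sd_observed = sd(observed)
sd_filled = sd(filled)   # smaller: the substituted values add no spread
```

The mean is unchanged, but the standard deviation of the "completed" data set is smaller than that of the genuinely observed values.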
Extreme values,
stragglers and outliers
Extreme values are defined as observations
in a sample, so far separated in value from
the remainder as to suggest that they may
be from a different population, or the
result of an error in measurement (6).
Extreme values can also be subdivided into
stragglers, extreme values detected
between the 95% and 99% confidence
levels; and outliers, extreme values at >
99% confidence level.
It is tempting to remove extreme values
automatically from a data set, because
they can alter the calculated statistics, e.g.,
increase the estimate of variance (a
measure of spread), or possibly introduce a
bias in the calculated mean. There is one
golden rule however: no value should be
removed from a data set on statistical
grounds alone. Statistical grounds include
outlier testing.
Outlier tests tell you, on the basis of
some simple assumptions, where you are
most likely to have a technical error; they
do not tell you that the point is wrong.
No matter how extreme a value is in a set
of data, the suspect value could
nonetheless be a correct piece of
information (1). Only with experience or
the identification of a particular cause can
data be declared wrong and removed.
So, given that we understand that the
tests only tell us where to look, how do we
test for outliers? If we have good grounds
for believing our data is normally
distributed then a number of outlier tests
(sometimes called Q-tests) are available
that identify extreme values in an objective
way (7,8). Good grounds for believing the
data is normal are
past experience of similar data
passing normality tests, for example, the Kolmogorov–Smirnov–Lilliefors test,
Shapiro–Wilk test, skewness test,
kurtosis test (7,9) etc.
plots of the data, e.g., frequency
histograms, normal probability plots (1,7). Note that the tests used to check
table 2 Pairwise deletion.

          Recovery %   Extraction time (mins)   Particle size (µm)   Solvent polarity (pKa)
Sample 1  93           20                       90                   -
Sample 2  105          120                      150                  1.8
Sample 3  99           180                      50                   1.0
Sample 4  73           10                       500                  1.5

r (number of data points in the correlation):
Recovery vs Extraction time: 0.728886 (4)
Recovery vs Particle size: -0.87495 (4)
Recovery vs Solvent polarity: 0.033942 (3)

Pairwise deletion: statistical analysis is unaffected except when one of a pair of data points is missing.
table 3 Mean substitution.

            Al     B     Fe    Ni
Solution 1  -      94.5  578   23.1
Solution 2  567    72.1  673   7.6
Solution 3  -      34.0  674   44.7
Solution 4  234    97.4  429   82.9

With the missing Al values replaced by the mean of the observed Al values (400.5):

            Al     B     Fe    Ni
Solution 1  400.5  94.5  578   23.1
Solution 2  567    72.1  673   7.6
Solution 3  400.5  34.0  674   44.7
Solution 4  234    97.4  429   82.9

Mean substitution: statistical analysis is carried out on pseudo-completed data with no allowance made for errors in the estimated values.
Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., mean substitution, already discussed above, is a form of imputation). In its more general/systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then the m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data is randomly distributed and the model used to predict the imputed values is sensible.
normality usually require a significant
amount of data (a minimum of 10–15 results are recommended depending on
results are recommended depending on
the normality test applied). For this reason
there will be many examples in analytical
science where either it will be impractical
to carry out such tests, or the tests will not
tell us anything meaningful.
If we are not sure the data set is normally
distributed then robust statistics and/or
non-parametric (distribution independent)
tests can be applied to the data. These
three approaches (outlier tests, robust
estimates and non-parametric methods)
are examined in more detail below.
Outlier tests
In analytical chemistry it is rare that we
have large numbers of replicate data, and
small data sets often show fortuitous
grouping and consequent apparent
outliers. Outlier tests should, therefore, be
used with care and, of course, identified
data points should only be removed if a
technical reason can be found for their
aberrant behaviour.
Most outlier tests look at some measure
of the relative distance of a suspect point
from the mean value. This measure is then
assessed to see if the extreme value could
reasonably be expected to have arisen by
chance. Most of the tests look for single
extreme values (Figure 2(a)), but
sometimes it is possible for several
outliers to be present in the same data
set. These can be identified in one of two
ways:
by iteratively applying the outlier test
by using tests that look for pairs of
extreme values, i.e., outliers that are
masking each other (see Figure 2(b) and
2(c)).
Note, as a rule of thumb, if more than
20% of the data are identified as outlying
you should start to question your
assumption about the data distribution
and/or the quality of the data collected.
The appropriate outlier tests for the
three situations described in Figure 2 are:
2(a) Grubbs 1, Dixon or Nalimov; 2(b)
Grubbs 2 and 2(c) Grubbs 3.
We will concentrate on the three
Grubbs tests (7). The test values are
calculated using the formulae below, after
the data are arranged in ascending order.
figure 2 Outliers and masking: (a) a single outlier (at either end of the data); (b) one outlier at each end; (c) a pair of outliers at the same end (either end).
figure 1 Effect of missing data on a correlation matrix.

Data (cases 1–15 for variables/factors A–E; only cases 1–3 and 13–15 are shown). In the original figure, shading marks the data removed to show the effects of missing data and the mean values replacing missing data:

Case   A      B      C      D      E
1      105.1  101.7  115.1  101.0  95.2
2      77.0   72.9   77.5   72.7   61.6
3      86.0   82.2   78.9   78.0   91.7
...
13     90.0   77.4   100.8  97.0   111.1
14     90.0   91.3   89.2   81.3   100.5
15     96.9   103.0  97.5   98.5   96.8
mean   99.2   92.4   94.6   89.4   91.7

Correlation matrices with different approaches selected for missing data:

No missing data (15 cases):
     A     B     C     D
B    0.62
C    0.68  0.53
D    0.41  0.47  0.57
E    0.39  0.50  0.59  0.61

Casewise deletion (only 5 cases remain):
     A      B      C     D
B    -0.62
C    0.11   -0.21
D    0.50   -0.36  0.91
E    0.02   0.17   0.71  0.66

Pairwise deletion (variable number of cases; the number of data points in each correlation is given in parentheses):
     A          B          C          D
B    0.54 (12)
C    0.55 (12)  0.50 (11)
D    0.27 (12)  0.47 (11)  0.79 (11)
E    0.23 (11)  0.77 (10)  0.70 (10)  0.71 (10)

Mean substitution (15 cases):
     A      B     C     D
B    0.01
C    -0.05  0.40
D    0.02   0.47  0.47
E    0.36   0.25  0.43  0.46

Critical values of r at the 95% confidence level:
n    15     12     11     10     5
r    0.514  0.576  0.602  0.632  0.950

Note: at the 95% confidence level, a correlation is significant when it exceeds the critical value of r for the corresponding number of data points (n).
G1 = |x̄ - xi| / s

G2 = (xn - x1) / s

G3 = 1 - [ (n-3) × sn-2² ] / [ (n-1) × s² ]

where s is the standard deviation for the whole data set, xi is the suspected single outlier (i.e., the value furthest away from the mean), | | is the modulus (the value of a calculation ignoring the sign of the result), x̄ is the mean, n is the number of data points, xn and x1 are the most extreme values, and sn-2 is the standard deviation for the data set
excluding the suspected pair of outlier
values, i.e., the pair of values furthest away
from the mean.
If the test values (G1, G2, G3) are greater
than the critical value obtained from tables
(see Table 4) then the extreme value(s) are unlikely to have occurred by chance at the
stated confidence level (see Box 2).
Pitfalls of outlier tests
Figure 3 shows three situations where
outlier tests can misleadingly identify an
extreme value.
Figure 3(a) shows a situation common in
chemical analysis. Because of limited
measurement precision (rounding errors) it
is possible to end up comparing a result
which, no matter how close it is to the
other values, is an infinite number of
standard deviations away from the mean
of the remaining results. This value will
therefore always be flagged as an outlier.
In Figure 3(b) there is a genuine long tail
on the distribution that may cause
successive outlying points to be identified.
This type of distribution is surprisingly
common in some types of chemical
analysis, e.g., pesticide residues.
If there is very little data (Figure 3(c)) an
outlier can be identified by chance. In this
situation it is possible that the identified
point is closer to the true value and it is
the other values that are the outliers. This
occurs more often than we would like to
admit; how many times do your procedures
state "average the best two out of three
determinations"?
Outliers by variance
When the data are from different groups
(for example when comparing test
methods via interlaboratory comparison) it
is not only possible for individual points within a group to be outlying but also for the
group means to have outliers with respect to each other. Another type of outlier that can
occur is when the spread of data within one particular group is unusually small or large
when compared with the spread of the other groups (see Figure 4).
The same Grubbs tests that are used to determine the presence of within-group
outlying replicates may also be used to test for suspected outlying means. Cochran's test can be used to test for the third case, that of a suspected
outlying variance.
To carry out Cochran's test, the suspect variance is compared with the sum of all the
group variances. (The variance is a measure of spread and is simply the square of the
standard deviation (1).)
If this calculated ratio, Cn, exceeds the critical value obtained from statistical tables (7)
then the suspect group spread is extreme. The value of n is the average number of
results per group, taken over all groups.
Cochran's test assumes that the numbers of replicates within the groups are the same, or
at least similar (±1). It also assumes that none of the data have been rounded and that there
are sufficient replicates to give a reasonable estimate of each variance. Cochran's
test should not be used iteratively as this could lead to a large percentage of the
data being removed (see Box 3).
Robust statistics
Robust statistics include methods that are largely unaffected by the presence of extreme
values. The most commonly used of these statistics are as follows:
Median: The median is a measure of central tendency (1) and can be used instead of the
mean. To calculate the median (x̃) the data are arranged in order of magnitude and the
median is then the central member of the series (or the mean of the two central
members when there is an even number of data), i.e., there are equal numbers of
observations smaller and greater than the median. For a symmetrical distribution the mean
and median have the same value.
Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the
data similar to the standard deviation.
x̃ = x_m (where m = n/2, rounded up) when n is odd (1, 3, 5, …)
x̃ = (x_m + x_(m+1))/2 (where m = n/2) when n is even (2, 4, 6, …)
Cn = s²(suspect) / (s1² + s2² + … + sg²), where g is the number of groups
and n = (n1 + n2 + … + ng)/g
Box 2: Grubbs' tests (worked example).
The 13 replicates are arranged in ascending order (x1 … xn):
47.876 47.997 48.065 48.118 48.151 48.211 48.251 48.559 48.634 48.711 49.005 49.166 49.484
n = 13, mean = 48.479, s = 0.498, s²(n−2) = 0.123
G1 = (49.484 − 48.479)/0.498 = 2.02
G2 = (49.484 − 47.876)/0.498 = 3.23
G3 = 1 − (10 × 0.123)/(12 × 0.498²) = 0.587
Grubbs' critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, and G3 = 0.6705 and 0.7667 at the 95% and 99% confidence levels respectively. Since the test values are less than their respective critical values in all cases, it can be concluded that there are no outlying values.
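The Box 2 calculations can be reproduced with a short sketch using only Python's standard library (the variable names are ours; following the worked example, G3 is computed for the two highest values as the suspected pair):

```python
import statistics

# Replicate data from Box 2, arranged in ascending order
data = sorted([47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
               48.559, 48.634, 48.711, 49.005, 49.166, 49.484])

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation

# G1: distance of the single most extreme value from the mean, in units of s
g1 = max(data[-1] - mean, mean - data[0]) / s

# G2: range of the data in units of s (one suspected outlier at each end)
g2 = (data[-1] - data[0]) / s

# G3: test for two suspected outliers at the high end; s2_reduced is the
# variance after removing the two highest values, as in the worked example
s2_reduced = statistics.variance(data[:-2])
g3 = 1 - ((n - 3) * s2_reduced) / ((n - 1) * s ** 2)

print(round(g1, 2), round(g2, 2), round(g3, 3))
```

At full precision G3 comes out as 0.586; the 0.587 quoted in Box 2 reflects the rounded intermediate values (s = 0.498, s² = 0.123) used there.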
If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard
deviation; this scaled value is known as the MADE value:
MADE = 1.483 × MAD
For n values, MAD = median(|x_i − x̃|), i = 1, 2, …, n
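A minimal sketch of these robust estimates of spread in Python (the function names mad and made are our own; only the standard library is used):

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x_i - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

def made(values):
    """MAD scaled by 1.483, making it comparable with a standard deviation."""
    return 1.483 * mad(values)

# A single wild value barely moves these robust estimates
data = [48.1, 48.2, 48.3, 48.4, 59.9]
print(statistics.median(data), round(mad(data), 4), round(made(data), 4))
```

Here the absolute deviations from the median (48.3) are 0.2, 0.1, 0.0, 0.1 and 11.6, so the MAD is 0.1 and the MADE 0.1483, untroubled by the extreme value that would inflate an ordinary standard deviation.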
Other robust statistical estimates include
the trimmed mean and deviations, the
Winsorized mean and deviation, the least
median of squares (robust regression),
Levene's test (heterogeneity in ANOVA), etc. A
discussion of robust statistics in analytical
chemistry can be found elsewhere (10, 11).
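As an illustration of the first two of these, a trimmed mean discards a proportion of the values from each end of the ordered data, while a Winsorized mean replaces them with the nearest retained values. A sketch (function names and the 10% default are our own choices):

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the given proportion of values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)
    return statistics.mean(data[k:len(data) - k] if k else data)

def winsorized_mean(values, proportion=0.1):
    """Mean after replacing each trimmed value with its nearest retained neighbour."""
    data = sorted(values)
    k = int(len(data) * proportion)
    if k:
        data[:k] = [data[k]] * k
        data[-k:] = [data[-k - 1]] * k
    return statistics.mean(data)

# One extreme value drags the ordinary mean upwards; the robust means resist it
data = [9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.2, 10.3, 10.4, 15.0]
print(round(statistics.mean(data), 3))  # ordinary mean
print(round(trimmed_mean(data), 4), round(winsorized_mean(data), 4))
```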
Non-parametric tests
Typical statistical tests incorporate
assumptions about the underlying
distribution of data (such as normality),
and hence rely on distribution parameters.
Non-parametric tests are so called
because they make few or no assumptions
about the distributions, and do not rely on
distribution parameters. Their chief
advantage is improved reliability when the
distribution is unknown. There is at least
one non-parametric equivalent for each
parametric type of test (see Table 5). In a
short article, such as this, it is impossible to
describe the methodology for all these
tests but more information can be found in
other publications (12, 13).
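To give a flavour of how these tests work, one of the simplest from Table 5, the sign test for paired data, can be sketched with the standard library alone (the function name and the illustrative data are ours; it counts signs of the paired differences and uses the exact binomial tail under the null hypothesis that positive and negative differences are equally likely):

```python
from math import comb

def sign_test_p(pairs):
    """Two-sided sign test for paired data (a simple non-parametric
    alternative to the paired t-test). Ties are dropped; the p-value is
    the two-sided binomial tail probability under H0: P(+) = 0.5."""
    diffs = [b - a for a, b in pairs if b != a]
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)
    k = min(k, n - k)                       # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired results from an old and a new method on eight samples
pairs = [(10.2, 10.5), (9.8, 10.1), (10.0, 10.4), (10.1, 10.3),
         (9.9, 10.2), (10.3, 10.6), (10.0, 10.2), (10.1, 10.0)]
print(round(sign_test_p(pairs), 4))
```

With seven positive differences out of eight, the two-sided p-value is 0.0703, so the difference between the methods is not significant at the 95% confidence level; note that no assumption about the distribution of the differences was needed.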
Conclusions
• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics, so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.
Table 4: Grubbs' critical value table (5).
95% confidence level 99% confidence
level
n G(1) G(2) G(3) G(1) G(2) G(3)
3 1.153 2.00 --- 1.155 2.00 ---
4 1.463 2.43 0.9992 1.492 2.44 1.0000
5 1.672 2.75 0.9817 1.749 2.80 0.9965
6 1.822 3.01 0.9436 1.944 3.10 0.9814
7 1.938 3.22 0.8980 2.097 3.34 0.9560
8 2.032 3.40 0.8522 2.221 3.54 0.9250
9 2.110 3.55 0.8091 2.323 3.72 0.8918
10 2.176 3.68 0.7695 2.410 3.88 0.8586
12 2.285 3.91 0.7004 2.550 4.13 0.7957
13 2.331 4.00 0.6705 2.607 4.24 0.7667
15 2.409 4.17 0.6182 2.705 4.43 0.7141
20 2.557 4.49 0.5196 2.884 4.79 0.6091
25 2.663 4.73 0.4505 3.009 5.03 0.5320
30 2.745 4.89 0.3992 3.103 5.19 0.4732
35 2.811 5.026 0.3595 3.178 5.326 0.4270
40 2.866 5.150 0.3276 3.240 5.450 0.3896
50 2.956 5.350 0.2797 3.336 5.650 0.3328
60 3.025 5.500 0.2450 3.411 5.800 0.2914
70 3.082 5.638 0.2187 3.471 5.938 0.2599
80 3.130 5.730 0.1979 3.521 6.030 0.2350
90 3.171 5.820 0.1810 3.563 6.120 0.2147
100 3.207 5.900 0.1671 3.600 6.200 0.1980
110 3.239 5.968 0.1553 3.632 6.268 0.1838
120 3.267 6.030 0.1452 3.662 6.330 0.1716
130 3.294 6.086 0.1364 3.688 6.386 0.1611
140 3.318 6.137 0.1288 3.712 6.437 0.1519
Box 3: Cochran's test (worked example).
An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:
Std. dev.: 0.202 0.402 0.332 0.236 0.318 0.452 0.210 0.074 0.525 0.067 0.609 0.246 0.198
n = 85/13 = 6.54 ≈ 7
Cn = 0.609² / (0.202² + 0.402² + … + 0.246² + 0.198²) = 0.371/1.474 = 0.252
Cochran's critical value for n = 7 and g = 13 is 0.23 at the 95% confidence level (7).
As the test value is greater than the critical value it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates and this laboratory's results therefore need to be investigated further. It is normal practice in interlaboratory comparisons not to test for low-variance outliers, i.e., laboratories reporting unusually precise results.
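The Box 3 ratio can be checked with a few lines of Python (a sketch; the variable names are ours and the total of 85 determinations is taken from the worked example):

```python
# Standard deviations reported by the 13 laboratories in Box 3
std_devs = [0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
            0.074, 0.525, 0.067, 0.609, 0.246, 0.198]

variances = [s ** 2 for s in std_devs]

# Cochran's ratio: the suspect (largest) variance over the sum of all variances
c_n = max(variances) / sum(variances)

# n for the critical-value lookup: average number of replicates per group
n_bar = 85 / len(std_devs)  # 6.54, rounded to 7 for the table lookup

print(round(c_n, 3), round(n_bar, 2))
```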
• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.
NB: It should be noted that, following a
judgement in a US court, the Food and
Drug Administration (FDA), in its "Guide
to Inspection of Pharmaceutical Quality
Control Laboratories", has specifically
prohibited the use of outlier tests.
Acknowledgement
The preparation of this paper was supported
under a contract with the UK's Department
of Trade and Industry as part of the
National Measurement System Valid
Analytical Measurement (VAM) Programme
(14).
References
(1) S. Burke, Scientific Data Management, 1(1), 32–38, 1997.
(2) S. Burke, Scientific Data Management, 2(1), 36–41, 1998.
(3) S. Burke, Scientific Data Management, 2(2), 32–40, 1998.
(4) J.L. Schafer, Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall, 1997. ISBN 0-412-04061-1.
(5) R.J.A. Little and D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons, 1987. ISBN 0-471-80243-9.
(6) ISO 3534, Statistics: Vocabulary and Symbols, Part 1: Probability and general statistical terms, section 2.64, Geneva, 1993.
(7) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, 1997. ISBN 0-85404-442-6.
(8) V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd Edition, John Wiley, 1994.
(9) W.H. Kruskal and J.M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978. ISBN 0-02-917960-2.
(10) Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers, Part 2, Analyst, 114, 1693–1697, 1989.
(11) D.C. Hoaglin, F. Mosteller and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, 1983. ISBN 0-471-09777-2.
(12) M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods, John Wiley & Sons, New York, 1973.
(13) W.W. Daniel, Applied Nonparametric Statistics, Houghton Mifflin, Boston, 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5.
Figure 4: Different types of outlier in grouped data (box & whisker plot of analyte concentration against laboratory ID, with one laboratory showing an outlying mean and another an outlying variance).
Table 5: Parametric methods and their non-parametric equivalents (12, 13).

Differences between independent groups of data:
  Parametric: t-test for independent groups (2); ANOVA/MANOVA (2)
  Non-parametric: Wald–Wolfowitz runs test; Mann–Whitney U test; Kolmogorov–Smirnov two-sample test; Kruskal–Wallis analysis of ranks; Median test

Differences between dependent groups of data:
  Parametric: t-test for dependent groups (2); ANOVA with replication (2)
  Non-parametric: Sign test; Wilcoxon's matched pairs test; McNemar's χ² (Chi-square) test; Friedman's two-way ANOVA; Cochran Q test

Relationships between continuous variables:
  Parametric: Linear regression (3); Correlation coefficient (3)
  Non-parametric: Spearman R; Kendall Tau

Homogeneity of variance:
  Parametric: Bartlett's test (7)
  Non-parametric: Levene's test; Brown & Forsythe test

Relationships between counted variables:
  Parametric: χ² (Chi-square) test
  Non-parametric: Gamma coefficient; Phi coefficient; Fisher exact test; Kendall coefficient of concordance