
FACULTY OF AGRICULTURAL AND APPLIED BIOLOGICAL SCIENCES

Department of Applied Mathematics, Biometrics and Process Control

APPLIED STATISTICS FOR THE FOOD SCIENCES

Dr. ir. O. Thas


Chapter 1

Introduction

1.1 Structure and aim of the course

These course notes are only a part of the course. Since the aim of this course is mainly to give you insight into the interpretation of statistical procedures and how they can be applied, rather than into the statistical theory itself, the theory classes will often make use of a series of applets that are meant to dynamically and interactively illustrate statistical concepts. This course is not a manual for these applets. Instead, this text contains a brief introduction and summary of the methods that are discussed in the classes. Some of the examples will be included as well.

The applets can be found at

http://fltbw2.rug.ac.be/ILO

The aim of the course is that you can interpret statistical analyses, such that you are able to understand the statistical considerations in the literature. A second aim is that you are able to perform a statistical analysis yourself.


1.2 Outline of the course

The following topics will be discussed in this course:

• random variables and distributions; populations and samples (week 1)

• the sample mean as an example of a “statistic” (week 2)

• hypothesis testing (applied to the mean) (weeks 3, 4)

• methods concerning proportions (week 5)

• ANOVA (analysis of variance) (weeks 6, 7, 8)

• regression (weeks 9, 10)

• logistic regression (week 11)


1.3 What is Statistics?

Statistics is the science that aims to formalize the process of “induction”. In many sciences where experiments play a central role, the scientist builds his theories by means of induction, i.e. from the results of an experiment (observations), he tries to generalize these specific observations to a theory that holds more generally. Let’s take a simple example: suppose you want to assess the effect of a particular diet on the weight of people. Then, you will typically take (sample) a group of people, and next you will split them into two groups (randomize them). One group will get no specific diet, and the other will get the diet. After some weeks you will weigh all those people so that you know the weight reduction. The problem to be solved is: “what is the difference in weight reduction between the two groups of people”, or, equivalently, “what is the effect of the diet in terms of weight reduction”. Of course, once you have the data, you can simply calculate the average weight reduction within the two treatment groups, and you might be willing to conclude that the diet is effective as soon as the average weight reduction is larger in the diet group as compared to the other group. This process of reasoning is “induction”. You want to generalize the results from a small experiment to all future people that will take this diet. Now, statistics is the science that says how you have to make these calculations and to what extent you may generalize the observations. In particular, statistics uses probability theory such that these generalized conclusions can be accompanied with probability statements, which allows you to formulate the conclusions of an experiment with a degree of certainty. In the above example, statistics can also be used the other way around: it may help you in determining a priori (i.e. before the start of the experiment) how large the samples (the two groups of people) must be in order to make the conclusions at the end with a prespecified degree of certainty.

The statistical process of induction is also known as inference. Statistics helps with inferring from the sample.

Yet another important methodological aspect of theory building in experimental sciences is “falsification”, which was originally introduced by Popper. This means that if you have postulated a theory, i.e. you have a hypothesis, then you must try to falsify the hypothesis and not only try to confirm it. Such an experiment must be designed accordingly. As we will see later in this course, statistical hypothesis testing agrees with this philosophy.


Chapter 2

Basic Concepts

In this chapter some of the basic concepts of probability theory and statistics will be briefly explained in a non-mathematical way. We will start with explaining the difference and the relationship between a “population” and a “sample”. From there, we go to the important concept of a “random variable”, which is characterized by a “distribution”. Finally, we will also say something about the different types of data (continuous, discrete, factors, ...).

Many of the concepts will be illustrated in the classroom by means of the applets. In this text, we will only briefly refer to these, though actually they form the backbone of this course!

2.1 Populations and Samples

The difference between a population and a sample is most easily explained in an example.

Example 2.1. Consider again the example given in the Introduction: we want to assess the effect of a particular diet on the weight reduction of people. Up to now we have very loosely used the term “people”. We expect, though, that not all people react equally to a certain diet. E.g. there may be a difference between men and women, differences between younger and elderly people, differences between people that do a lot of physical labour as compared to people having to sit all day, ...


In this example we could define the population as the group of people for which we want the conclusion of the experiment to hold. Thus, if at the end of the experiment we want to recommend the diet to all people of all sexes and of any age, then we define the population as such. If, on the other hand, we aim only at elderly women, then we restrict the definition of the population accordingly. □

In the above example, people constitute the basic elements of the population. In general we sometimes use the term “experimental unit” or just “element”. We will often assume that the population is infinitely large. In practice this will only mean that the population is very large as compared to the number of elements that will be sampled.

Since the concept “sample” is closely related to the population, we will first discuss the sample before giving the more general definitions.

Example 2.2. Example 1 continued. Once the population is defined, e.g. all people, then we should determine how to take a sample. A sample is a subset of the population that is subject to the experiment. Throughout this course we will always assume that every element of the population has the same chance of being selected in the sample. Thus, in the example, every human being on earth must have the same chance of being selected in the sample. You see immediately that this implies a problem: it will not be practically feasible to select all people with the same probability. E.g. very young babies cannot be selected at all (they only drink milk!), and, very likely, the scientist is not willing to perform a worldwide experiment (unless he has found sufficient funds to finance this expensive experiment). Thus, these practical restrictions imply that the definition of the population must be restricted to, e.g., people older than 12 years, living in Belgium.

Once a realistic population is defined, the sampling can start. Or, at least, the design of the sampling plan. As mentioned before, every element in the population must have the same probability of being sampled. Suppose, for simplicity, that you have a list of all people in the population, and their addresses. Typically, you will also have determined a priori how many people must be sampled. Then, you must only sample completely at random to assure the equal selection probability.

At this point it is already important to realize that the procedure described above implies that every time you would repeat the whole procedure (i.e. the sampling), other persons may be selected in the sample. This characteristic, which is due to the randomness of the sampling process, will be referred to


as repeated sampling. □

From the above example we have learnt that in practice the definition of the population and of the sample go hand in hand. It is time now to give the general definitions.

Definition 2.1 population: the population is the set of elements (units) which all have equal chance of being selected in the sample. Furthermore, the population is the set of elements about which you want the conclusions of the experiment to hold.

Definition 2.2 sample: the sample is a subset of the population. The elements of the sample are subject to the experiment.

Before we continue with explaining “randomization”, we will discuss the role of populations and samples in the induction process. In the introduction it was already mentioned that statistics is a formalism of induction. Population and sample are important concepts in this respect. If the sample is taken in a correct way from a well defined population (and the calculations are performed correctly), the induction is valid. Thus, the conclusions derived from the sample may be generalized to the whole population.

2.2 Randomization

Randomization is a very important concept in statistics. In the previous section we have actually already come across it once. There, we have said that a sample has to be taken at random such that every unit in the population has the same chance of being selected. This is one of the forms of randomization. The other type of randomization is explained in the next example.

Example 2.3. Example 1 continued. Once a sample is taken from the population, it has to be decided which people should have the diet and which should not have the diet (control group). Thus we have to split the sample into two groups. Suppose that we have decided that both groups are equally large.

In order to make valid conclusions (induction), the sample has to be split completely at random into the two groups, i.e. every element in the sample should have equal probability of being selected into each of the groups. Here we


have two groups, thus each person in the sample should have probability 1/2 of being in the diet group and probability 1/2 of being in the control group, with the additional assumption that both groups must be equally large.

Here is an easy counterexample to illustrate the importance. Suppose that, instead of splitting the sample completely at random, we would select only overweight people into the diet group. It may (a priori) be expected that these people react better to the diet as compared to normal-weight people. Thus, it may be expected (i.e. it has a high probability) that the calculations based on the sample indicate that the diet has very good results. It is obvious that this conclusion is not representative for the whole population. Indeed, we wanted the conclusions to hold for every individual in the population, and not only for the overweight people, and then even only as compared to normal-weight people. We say that such a procedure leads to biased results, i.e. the estimation of the effect of the diet is biased. When, on the other hand, the splitting would have been performed completely at random, the estimated diet effect would have been unbiased. □

The above example not only illustrates the meaning of randomization, but it also shows what is meant by “bias”. Later, we will see a more formal definition of “bias”, but already at this point we can intuitively see that bias means the difference between the real diet effect (= what we would measure if we were able to take the whole population as a sample, split completely at random into two groups) and the expected estimated diet effect in the sample when the sample is split into two groups according to some specific procedure. In particular, when this procedure is “completely at random”, then the bias is zero, and when this procedure is the one of the example (selecting overweight persons into the diet group), then there will be a substantial bias.

Thus, in conclusion, randomization is crucial in order to make valid (unbiased) conclusions. Also here, the “repeated sampling” concept comes into play. You can imagine that an experiment can be performed repeatedly (e.g. over time, or at least as a thought-experiment). Since the splitting of the sample into groups is a random process, the groups will be different every time you do the experiment. Thus, the results from the experiment will also very probably be different every time. We will see later that this is exactly the basis of the understanding of statistical methods, and the basis of the validity of the induction process.
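The bias in this counterexample can be imitated with a small simulation. The sketch below (Python; not part of the original course notes) uses entirely invented numbers: a true diet effect of −2 kg for everyone, plus an extra −3 kg for overweight people. Under randomization the estimated effect is close to the population-average effect; under the biased assignment it is not.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000  # total sample size; half will go to each group

# Hypothetical baseline characteristic: about half the sample is overweight.
overweight = rng.binomial(1, 0.5, size=n).astype(bool)

def weight_change(in_diet_group):
    # Invented model: the diet gives -2 kg on average, overweight people
    # respond with an extra -3 kg, plus person-to-person noise.
    effect = np.where(in_diet_group, -2.0 - 3.0 * overweight, 0.0)
    return effect + rng.normal(0.0, 1.0, size=n)

# Randomized split: every person has probability 1/2 for each group.
diet = np.zeros(n, dtype=bool)
diet[rng.permutation(n)[: n // 2]] = True
y = weight_change(diet)
print("randomized:", y[diet].mean() - y[~diet].mean())   # near -3.5, unbiased

# Biased split: only overweight people are put in the diet group.
y_b = weight_change(overweight)
print("biased:    ", y_b[overweight].mean() - y_b[~overweight].mean())  # near -5
```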

Yet another way to understand the necessity of randomization is the following example.


Example 2.4. Example 1 continued. Suppose that the diet effect actually depends on some genetic characteristics of the people in the sample. Since it is not feasible in practice to do a genome analysis on everyone in order to eventually assess the relation between some genetic markers and the diet effect, it is best to distribute the people, w.r.t. their genome, as equally as possible over the two groups. In this way the possible influence of the genome on the diet effect will be “averaged out” from the sample. Since we don’t have information on the genomes of the people in the sample, the only way to guarantee an equal distribution of genomes over the two groups is to randomize the people in the sample. Then, at least on average, the effect of the genome on the diet will be eliminated. (Note that “on average” refers here to “repeated sampling”.)

Note also that the result of such a randomization guarantees that the two groups resemble the whole population as closely as possible w.r.t. the genome distribution. Hence, the results from the study will very probably be representative for the whole population (this is important in the induction process, of course). Also, if the two groups, on average, do not differ w.r.t. any characteristic that might have an influence on the diet effect, then any substantial difference in weight reduction that is estimated from the sample may be attributed to the diet, since the diet is the only difference between the two groups! This is a very important reasoning; it is a cornerstone in establishing a causal relationship! □

Finally, we mention that the two groups in the example are in general often referred to as the two treatment arms. Typical characteristics of the elements in the sample (e.g. the genome in the previous example) that might possibly have an influence on the results of the experiment are called baseline characteristics or baseline variables.

2.3 Stratified Sampling

From the reasonings presented in the two previous sections, we have learnt that randomization and sampling completely at random are important steps in order to guarantee that the resulting treatment groups in the sample are as representative as possible for the complete population, and that the treatment groups do not differ on average w.r.t. any baseline characteristic.

Example 2.5. Consider an extreme example, where we take a sample of only 4 persons. The sample is taken completely at random from a population


with 50% men and 50% women, but since the sample size is extremely small, we end up with a sample of 4 men. We could say now that this sample is not representative for the population, although the sampling has been performed correctly. Such a sample may however occur, especially with small samples. Stratified sampling may reduce such problems.

If we know a priori that the population consists of 50% men and 50% women, and that gender may have an influence on the experiment, then we could stratify for gender. This means that we will perform the sampling such that in every treatment arm there will be 50% men and 50% women, and still have as much random sampling as possible.

Suppose the total sample size is 40. Since the process of “sampling completely at random” and the “randomization” over the treatment groups are perfectly consistent, the stratified sampling can be established by first sampling 10 men completely at random from the population. These 10 men form the 50% men in the diet group. Next, again completely at random, 10 women are selected from the population. These women are also put into the diet group. Finally, independently of the sampling process used to form the diet group, 10 men and 10 women are sampled from the population to constitute the control group.
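As a minimal sketch of this sampling plan (Python; not part of the original course, with an invented population list), the stratified sample can be drawn as follows:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical population: 5000 men and 5000 women, identified by labels.
men = np.array([f"man_{i}" for i in range(5000)])
women = np.array([f"woman_{i}" for i in range(5000)])

def sample_stratum(stratum, size):
    # Completely random sampling within a stratum, without replacement.
    return rng.choice(stratum, size=size, replace=False)

# Diet group: 10 men and 10 women, each drawn completely at random.
diet_group = np.concatenate([sample_stratum(men, 10), sample_stratum(women, 10)])

# Control group: again 10 men and 10 women, drawn independently.
# (With a large population the chance of overlap between groups is negligible.)
control_group = np.concatenate([sample_stratum(men, 10), sample_stratum(women, 10)])

print(len(diet_group), len(control_group))  # 20 and 20, each 50% men / 50% women
```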

2.4 Random Variables

A random variable is one of the most important concepts in statistics. It is the basis for the calculation of the probability statements that come along with statistical inference.

We will first illustrate a random variable based on Example 1, but suppose now that the response of the experiment is not the weight reduction of a person, measured in kilograms, but simply a binary response, say response = 1 if there is a weight reduction for the person, and response = 0 if there is no weight reduction. We could also say that the binary response indicates a success or no success of the diet. More formally, we say that the response variable Y is defined as

Y = { 1 if success; 0 if no success }    (2.1)

We could define such a response variable for each person in the sample.


Therefore, we introduce an index i, referring to person i. Thus,

Yi = { 1 if success for person i; 0 if no success for person i }    (2.2)

where i = 1, . . . , n (n persons in sample).

It is very clear what Y means as soon as the experiment is finished, for then the responses are measured. But let us think about the behaviour of Y before the experiment, or how Y would behave under “repeated sampling”. For simplicity, we will first suppose that there is only 1 person in the sample. Here is what may happen:

• we sample 1 person completely at random, give him the diet, and after 1 month we measure his weight, which results in Y = 1 or Y = 0. This is not deterministic! The result is not only determined by the true effect of the diet. As mentioned before, it may partly be determined by e.g. his genetic information, which we do not know and which is different among all people. Moreover, this one person is sampled completely at random from the population, which means that the sampling process is independent of e.g. his genetic information (in fact of any baseline characteristic). Thus, Y is also random. Say, e.g., that in our experiment the result is a success of the diet, thus Y = 1.

• Repeated sampling means that, at least in our minds, we repeat the whole experiment, independently of all previous experiments. Thus, in our simple experiment, we sample again 1 person completely at random from the population. Since we assumed the population to be (approximately) infinitely large, it will very probably be another person now. Moreover, this person will be independent of the previously sampled persons. Again we give him the diet and measure his weight after one month. Again, the response will be partly influenced by the diet itself, and partly by (unmeasurable) baseline characteristics of this person. Suppose that this time there was no weight reduction, thus now Y = 0.

• We could repeat this experiment over and over ... in our minds we could do it an infinite number of times.

Y is a random variable. It is not deterministic in the sense that in a repeated sampling experiment (the thought-experiment given above) under constant conditions (i.e. every time the same diet is given), the response, Y, is not


constant, and it cannot be predicted without (random) error. Fortunately there is probability theory, which allows us to study random variables and to state at least some of their properties. In our example, Y can only take two different values, 0 or 1, and is therefore called a Bernoulli random variable. Or, we say that Y is distributed as a Bernoulli random variable, or simply, Y is Bernoulli distributed. Typically, distributions are determined by some parameters (see later for a more precise description). In our example, the only parameter describing the random process is a probability, say π. This is the probability of success, and it may be directly interpreted in terms of the repeated sampling experiment: suppose the experiment is repeated an infinite number of times; then, at the end, we could compute the fraction of experiments that resulted in a success (i.e. the number of successes divided by the total number of experiments). This fraction is by definition exactly the probability of success π. We write

Y ∼ B(π). (2.3)

This reasoning can be put the other way around: if we know that Y is a Bernoulli variable with e.g. π = 0.80, then this means that if we would do the experiment (approximately) an infinite number of times, we could expect that 80% of all experiments result in a success. However, this direction of reasoning will not often be used in statistics. It is rather the other way around: based on an experiment we want to estimate the probability of success π. At this point, you can already feel that the more experiments (or the larger the sample size), the more accurately π can be estimated.

A probability statement as given above is denoted by

P {Y = 1} = π. (2.4)

Since a Bernoulli variable can take only two distinct values, we can very easily calculate the probability of no success,

P {Y = 0} = 1 − P {Y = 1} = 1 − π. (2.5)

This is simply based on a basic property of probabilities: they must sum to one, since Y must be either “1” or “0”.

Applet. Applets 1b and 3b illustrate the sampling from a Bernoulli distribution. The applets can be used to illustrate the sampling and the meaning of the parameter π. □
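In the same spirit as the applets, the repeated sampling experiment can be imitated numerically. A minimal sketch (Python; π = 0.80 is just an illustrative value, not a result from the course):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
pi = 0.80  # hypothetical probability of success

# Repeat the one-person experiment many times; each repetition gives Y = 0 or 1.
repetitions = 100_000
y = rng.binomial(n=1, p=pi, size=repetitions)  # n=1 makes these Bernoulli draws

# The fraction of successes over all repetitions approximates P{Y = 1} = pi.
print(y.mean())  # close to 0.80
```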

Back to the example. Suppose now that not only one person is sampled in the experiment, but that, more generally, n persons are sampled completely


at random. Then we could define the response variable as the total number of successes, which will of course be between 0 and n. Again you see that a similar repeated sampling experiment can be performed. But now, every repetition of the experiment results in a number between 0 and n. Again we call the response Y, which is now distributed as a binomial random variable, i.e.

Y ∼ Bin(n, π), (2.6)

where n and π are the parameters. Again π has the interpretation of the probability of success for a single person. In particular, this probability applies to all n persons in the sample.

When n > 1 the probability π may even be estimated from one sample, by simply dividing the number of successes by n, i.e. Y/n. Later we will discuss this in more detail. If, in the repeated sampling experiment, we would calculate this estimate for every repetition, then, by definition, the parameter π is the average of all the computed Y/n.

Note that there is a simple relation between a Bernoulli (Yi) and a binomial (Y) random variable: Y = Y1 + · · · + Yn = ∑_{i=1}^{n} Yi, both with the same success probability π = P {Yi = 1}.
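A short sketch of this relation (Python; the values n = 50 and π = 0.30 are arbitrary choices for illustration): summing n Bernoulli draws gives one binomial draw, and Y/n estimates π from a single sample.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n, pi = 50, 0.30  # hypothetical sample size and success probability

# One sample of n persons: n Bernoulli responses Y_1, ..., Y_n.
y_i = rng.binomial(n=1, p=pi, size=n)

# Their sum is one draw from Bin(n, pi) ...
y = y_i.sum()

# ... and Y/n is the estimate of pi based on this single sample.
print(y, y / n)
```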

Up to now we have only discussed so-called discrete random variables, in the sense that the variable can take only values in a finite set S. For the Bernoulli variable S = {0, 1}, and for the binomial variable S = {0, 1, . . . , n}. If, however, we go back to the original formulation of Example 1, then the response variable is the weight reduction in kilograms. For such a variable the set S of possible values is, at least theoretically, infinitely large, for the measurement is continuous. If the response variable, Y, is indeed defined as such, then we say that Y is a continuous random variable.

It is fairly simple to see that in theory we cannot use probability statements such as P {Y = 7.4}. The reason: if there is an infinite number of possible outcomes for a continuous random variable Y, then every probability of the above form is exactly zero. We will need another type of characterization for continuous random variables: distribution functions.


2.5 Distribution Functions

2.5.1 Some General Definitions

In the previous section we have argued that for continuous random variables Y, probability statements of the form P {Y = 7.4} cannot be used, for they are all zero. Nevertheless, we can use expressions of the form P {Y ≤ 7.4}, i.e. the probability that Y is smaller than or equal to 7.4 kilograms. What does this mean in a repeated sampling experiment? This means that if the experiment (with one person in a sample) were repeated an infinite number of times, and every time the weight reduction of the person in the sample were measured, then the number of experiments that resulted in a weight reduction ≤ 7.4 can be counted, and this number divided by the total number of experiments is, by definition, exactly equal to the probability P {Y ≤ 7.4}. (Essential for the possibility of computing probabilities is that the event of which you want to know the probability must be countable in a repeated sampling experiment.)

Definition 2.3 Distribution function: the distribution function of a random variable Y is defined as

F (x) = P {Y ≤ x} . (2.7)

Sometimes the distribution function is called the cumulative distribution function.

We will now give some mathematical properties of this function. They are illustrated in Applet 1b for the most common continuous distributions (e.g. the normal distribution).

Note that F(x) is a non-decreasing function in x. We will further assume that F(x) is a (right-)continuous function. Moreover, we will assume that its first derivative exists,

f(x) = dF(x)/dx,   (2.8)

which is called the density function. It is a non-negative continuous function. Once we know the relation between f and F, we can also define the distribution function as

F(x) = ∫_{−∞}^{x} f(y) dy,   (2.9)


which may be interpreted as the area under f between −∞ and x.

Based on basic probability calculations, we can also calculate

P {Y > x} = 1 − P {Y ≤ x} = 1 − F(x) = ∫_{x}^{+∞} f(y) dy,   (2.10)

which is the area under f between x and +∞. We often refer to such a probability as a tail probability. For a random variable Y with distribution function F we define the quantile yα implicitly as

P {Y > yα} = α, (2.11)

i.e. yα is such that the probability that Y is larger than yα is α.

Also probabilities of the type P {x1 < Y ≤ x2} can be calculated:

P {x1 < Y ≤ x2} = F(x2) − F(x1) = ∫_{x1}^{x2} f(y) dy,   (2.12)

i.e. it is the area under f between x1 and x2.

Applet. Applet 1b is used to illustrate the above concepts. □
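For concreteness, the same quantities can also be computed numerically rather than read from the applet. A minimal sketch using scipy’s distribution objects; the choice Y ∼ N(5, 3²), read as a hypothetical weight reduction in kg, is invented for the example:

```python
from scipy import stats

# Hypothetical continuous response: weight reduction Y ~ N(5, 3^2) kg.
Y = stats.norm(loc=5.0, scale=3.0)

print(Y.cdf(7.4))               # F(7.4) = P{Y <= 7.4}, Eq. (2.7)
print(Y.pdf(5.0))               # density f(5), the derivative of F, Eq. (2.8)
print(Y.sf(7.4))                # tail probability P{Y > 7.4} = 1 - F(7.4), Eq. (2.10)
print(Y.ppf(1 - 0.05))          # quantile y_0.05 with P{Y > y_0.05} = 0.05, Eq. (2.11)
print(Y.cdf(7.4) - Y.cdf(2.0))  # P{2 < Y <= 7.4} = F(7.4) - F(2.0), Eq. (2.12)
```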

2.5.2 Some Important Distributions

We will now briefly discuss four important continuous distributions: the normal, the t, the F and the χ2-distribution.

Applet. Applet 1b illustrates these four distributions and the interpretation of their parameters. □

The normal distribution

Without any doubt, the normal distribution is the most important distribution in statistics. Often we will assume that a random variable is normally distributed. A normally distributed random variable takes values in S = R, the set of real numbers ranging from −∞ up to +∞. In practice, however, responses are often bounded. Still, it is very often a reasonable approximation to use the normal distribution.

A normal distribution is characterized by its distribution function F, which is a function that is “parameterized” by two parameters: the mean µ, and the


variance σ2. We sometimes write Fµ,σ2 to stress this fact. The square root of σ2, σ, is called the standard deviation. We say that the mean determines the location of the distribution, and σ2 determines the spread or the scale. This is very clearly illustrated in Applet 1b. A normally distributed random variable is denoted by

Y ∼ N(µ, σ2). (2.13)

Note also that the distribution is symmetric about µ (i.e. the density f is symmetric). This implies e.g. that f(µ − x) = f(µ + x).

When µ = 0 and σ2 = 1, the corresponding normal distribution is called the standard normal distribution. We will often use the symbol Z to indicate a standard normally distributed random variable. This is a distribution that is symmetric about µ = 0. Therefore, f(z) = f(−z), and

F(−z) = P {Z ≤ −z} = P {Z > z} = 1 − F(z),   (2.14)

and thus, zα = −z1−α.

Suppose that Y ∼ N(µ, σ2), then the following important property holds:

Y ∼ N(µ, σ2)   (2.15)

Y − µ ∼ N(0, σ2)   (2.16)

(Y − µ)/σ ∼ N(0, 1),   (2.17)

i.e. the standardized variable (Y − µ)/σ is a standard normal variable. More generally, it holds for constants a and b that

Y ∼ N(µ, σ2)   (2.18)

Y − a ∼ N(µ − a, σ2)   (2.19)

(Y − a)/b ∼ N((µ − a)/b, σ2/b2).   (2.20)
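The standardization property can be checked numerically. A sketch (Python; µ = 510 and σ = 105 are invented values): standardized normal draws should have mean ≈ 0 and variance ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma = 510.0, 105.0  # hypothetical mean and standard deviation

y = rng.normal(loc=mu, scale=sigma, size=200_000)

# Standardize: (Y - mu) / sigma should behave as N(0, 1).
z = (y - mu) / sigma
print(z.mean(), z.var())  # approximately 0 and 1
```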

The t-distribution

A t-distribution is parameterized by one parameter, f, which is called the “degrees-of-freedom”. We say that a random variable, say T, is distributed as t with f degrees of freedom, i.e.

T ∼ tf . (2.21)


Like the normal distribution, T is defined over S = R.

The t-distribution is also a symmetric distribution. Thus, for the quantiles corresponding to tail probabilities α and 1 − α it holds that

tf,α = −tf,1−α. (2.22)

The F -distribution

The F-distribution is parameterized by two parameters, f1 and f2. Both are called “degrees-of-freedom”. In particular, f1 is the degrees-of-freedom of the numerator, and f2 is the degrees-of-freedom of the denominator. We say that, e.g., F is distributed as F with f1 and f2 degrees-of-freedom,

F ∼ Ff1,f2 . (2.23)

The F -distribution is not symmetric. F can take values in S = R+.

The χ2-distribution

Finally we give the χ2-distribution, defined over S = R+. It is parameterized by one parameter, f, the degrees-of-freedom. We say that X is χ2-distributed with f degrees-of-freedom,

X ∼ χ2f . (2.24)

It is not a symmetric distribution.
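Where the classes use tables or Applet 1b for these four distributions, the quantiles can also be computed directly. A sketch with scipy (the degrees of freedom below are arbitrary examples):

```python
from scipy import stats

alpha = 0.05

# Upper-tail quantiles y_alpha with P{Y > y_alpha} = alpha, cf. Eq. (2.11):
print(stats.norm.ppf(1 - alpha))               # standard normal: z_0.05 ~ 1.645
print(stats.t.ppf(1 - alpha, df=10))           # t with f = 10 degrees of freedom
print(stats.f.ppf(1 - alpha, dfn=3, dfd=20))   # F with f1 = 3 and f2 = 20
print(stats.chi2.ppf(1 - alpha, df=5))         # chi-squared with f = 5

# Symmetry of the normal and t distributions: y_alpha = -y_{1-alpha}.
print(stats.norm.ppf(alpha), -stats.norm.ppf(1 - alpha))
print(stats.t.ppf(alpha, df=10), -stats.t.ppf(1 - alpha, df=10))
```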

2.6 Expected Value

2.6.1 The Mean

A very important concept in statistics is the expected value or the expectation of a random variable Y, denoted by E {Y}. It is sometimes also referred to as the mean of the variable Y. The expectation can be defined mathematically, but here we rather prefer to explain it in the context of a repeated sampling experiment.

As an example, we consider a normally distributed random variable Y. In particular, Y ∼ N(µ, σ2). Suppose, in a repeated sampling experiment, that


we sample an infinite number of times from such a distribution, and, at the end, we compute the average of all Y’s. Then, by definition, this average is exactly equal to E {Y}, which for the normal distribution is equal to µ. Thus, for a normal distribution it holds that

E {Y } = µ. (2.25)

Remember that we have actually already encountered this type of calculation in a repeated sampling experiment (Section 2.4). In particular, Y was there a Bernoulli random variable with probability π. There we have argued that π is exactly the fraction of the number of successes over the total number of repeated sampling experiments. Since the number of successes is also given by the sum of the Y’s (remember they are 1 for a success, and 0 otherwise), the fraction is basically the average of the Y’s, and thus π = E {Y}. This is an important example in the sense that here we have a probability that can be defined as an expectation.

If Y ∼ Bin(n, π), then E {Y} = nπ, and for X ∼ χ2f, E {X} = f.

2.6.2 The Variance

Based on the definition of the expectation, the variance of a random variable can be defined as well:

Var {Y} = E{(Y − E {Y})2},   (2.26)

i.e. the variance of Y is the expectation of the squared deviation from its mean. And, since it is defined as an expectation, it can be interpreted in the repeated sampling context, but here in every repetition of the experiment (Y − E {Y})2 must be computed and averaged at the end.

For the normal distribution, it can be shown that

Var {Y } = σ2, (2.27)

for the Bernoulli distribution Var {Y} = π(1 − π), and for the binomial distribution Var {Y} = nπ(1 − π). Finally, if X ∼ χ2f, Var {X} = 2f.

Its name already suggests that the variance is some kind of measure of the variability of a random variable. To illustrate this interpretation, we consider a very extreme limiting case: suppose that Var {Y} = 0, i.e.


E{(Y − E {Y})2} = 0. Hence, if, in a repeated sampling experiment, the sampled Y were measured an infinite number of times, and every time (Y − E {Y})2 is computed, then, in order to end up with an average of these (Y − E {Y})2 equal to zero, almost every repetition must result in a (Y − E {Y})2 extremely close to zero, thus Y extremely close to its expectation E {Y}. Hence, when Var {Y} = 0, we may be extremely sure that every sampled Y is almost exactly equal to E {Y}. We say that there is almost no variability. Sometimes we will also say that there is almost no uncertainty about Y.

2.7 The Relation between Population and Distribution

In this section we will illustrate that there is a very tight relation between a population and a distribution function, as soon as the response variable is defined and the experimental circumstances are set. We will split the discussion into two parts. First we will show that once the population is defined, the distribution function is set as well. Next, we will go the other way around.

• Population −→ distribution function.

Example 2.6. Example 1 continued. As before, we define the population as the set of all people older than 12 years, living in Belgium. Further, we define the response variable as the Bernoulli variable Y indicating success (i.e. a weight reduction). We argue now that with these specifications the distribution of Y is determined, though not necessarily known (e.g. parameters may still be unknown).

Since Y is a binary variable, we know already that Y ∼ B(π), though π is still unknown to us. However, in theory, now that the population is defined, we could repeatedly sample from the population. By counting the fraction of successes in an infinite number of repeated experiments, P {Y = 1} is completely determined. Hence, so is the unknown parameter π = P {Y = 1}. Of course, P {Y = 0} = 1 − P {Y = 1} is then also known. □

Remark that in the above example, the link between distribution and population is only there thanks to the repeated sampling interpretation of probability. And the repeated sampling experiments cannot in general be performed in practice;


we should only use them as a theoretical tool. Here we used them to establish a relation between population and distribution.

• Distribution function −→ population. Suppose now that we have only defined a distribution, e.g. a Bernoulli distribution with probability π. We further know that the experiment consists in measuring success (e.g. weight reduction). Then we know that these specifications imply the existence of an infinitely large population, such that the relative frequency of success is exactly equal to π.

The arguments presented here are fairly easy to understand when Y is a discrete Bernoulli variable. Let’s now try to make the same exercise when Y is a continuous random variable, e.g. Y ∼ N(0, 1). We now claim that there is a link between the population and the complete distribution function F(x) = P {Y ≤ x} for all x. Well, the same reasoning as above applies now, except that it is not the relative frequencies of success that must be calculated from the repeated sampling experiments, but rather the relative frequencies of all the events of the form Y ∈ ]−∞, x]. All these relative frequencies must equal the theoretical probabilities

P {Y ∈]−∞, x]} = P {Y ≤ x} = F (x). (2.28)

In conclusion, the population may also be seen as a hypothetical infinitely large set of values that occur with the relative frequencies given by the distribution function. This construction of the population is purely mathematical, but of course, in statistics, there must be a one-to-one relation between this mathematical population and the real population from which a real sample can be taken. Thus, in Example 1, the population consists of the weight reductions of all people of 12 years and older, living in Belgium.

Finally, you may have noticed that we have often reasoned with a sample of size 1. Especially in the repeated sampling experiments, we have often said that each repetition consisted of sampling one unit and looking at one observation Y at a time. Of course, the same reasoning applies when larger samples are taken at each step in the repeated sampling experiments. In particular, remember that every sample has to be taken completely at random. This implies that any two observations in a sample are independently sampled. In the repeated sampling experiments, this means that there is no difference between two sampled observations in one sample, and two sampled observations in two different repeated samples. Hence, relative frequencies hold over all observations over all repeated samples.


2.8 Experimental versus Observational Studies

In this chapter we have stressed the importance of sampling completely at random, and the necessity of randomization. In practice, however, there are often situations in which this randomness cannot be assured. E.g. what is the effect of a fat diet on the risk of a heart attack? It is clear that you cannot randomize people over a low fat and a high fat diet for their whole life time. Still, statistical procedures exist that take this lack of randomness into account (or, better, procedures exist that let the randomness come into play at another stage!). Typically, such studies are the subject of “epidemiology”. These methods are, however, not treated in this course.

A study in which no explicit sampling is performed, as in the above example, is often called an observational study. Another typical example of an observational study is a survey in which several questions are asked of a sample of people. The aim is then often to find relations between the answers (e.g. the relation between smoking status and health problems). Here we should be careful: e.g. because of the lack of randomization (e.g. over the group of smokers and the group of non-smokers), we cannot infer causal relationships.

Experiments that are performed in laboratories are typically experimental studies, for the experiment can be designed and analyzed completely according to the experimenter’s demands. These demands include the practical implications of the typical assumptions that are needed to guarantee the correctness of a formal statistical inference, e.g. randomization.


Chapter 3

The Mean and the Variance

In this chapter we discuss the sample mean and the sample variance, which are computed from a sample. We will show that these quantities, which are generally referred to as “statistics”, are random variables and that they thus have a particular distribution. Based on the knowledge of these distributions we will derive some very important properties which should be kept in mind whenever one uses the sample mean and sample variance. We will also discuss confidence intervals.

3.1 Estimating a Distribution: Introductory Example

Example 3.1. Suppose that you need to report on the cholesterol levels in blood of a particular group of people, say the inhabitants of Italy. Measuring the cholesterol level in every person is of course not feasible. Therefore, we will take only a sample of people. Even if we were able to measure all cholesterol levels, it is clear that some people will have a low level, and others will have a high level. We could think of the set of all cholesterol levels as a population which is characterized by a distribution function or a density function (Figure 3.1). Based on the sample we now want to estimate the true distribution, and of course we want this estimate to be as close as possible to the true but unknown distribution.

One simple solution could be to draw a histogram. This could indeed be considered as an estimate of the distribution.


[Figure 3.1: The density function of the cholesterol level (left) and a histogram of a sample of 100 people (right)]

[Figure 3.2: A histogram of a sample of 100 people with changed positions of the bins]

Figure 3.1 shows a possible histogram of the cholesterol levels, based on a sample of 100 people. We will often use the histogram, but we will use it rather as an exploratory tool, because the histogram has one major drawback: it cannot simply be used in further calculations. This is easy to see: a histogram is not a single number. It is rather a collection of numbers. In particular it is a collection of frequencies (counts) and intervals (bins). Moreover, the exact position and width of the bins must be chosen by the researcher. This may make the histogram even a subjective tool. Figure 3.2 shows a histogram based on the same sample as the previous histogram, but now we have changed the positions of the bins. As you can see, the histogram looks quite different. It even gives almost the opposite conclusion with respect to the skewness of the distribution. Thus, histograms may be helpful as an exploratory tool, but they may also be misleading because of the arbitrariness of the positions of the bins. Furthermore, a histogram is not a simple number which can be used for further calculations.

In our discussion thus far, we have assumed nothing about the true distribution of cholesterol levels. Sometimes, however, such an assumption may


simplify the problem considerably. Here, we will assume that the distribution is a normal distribution (note that the distribution in Figure 3.1 indeed clearly shows a normal distribution). We have seen in the previous chapter that a normal distribution is completely determined by two parameters: the mean µ and the variance σ2. Thus, when the cholesterol level is indeed normally distributed, then it is sufficient to know these two parameters. In practice, this implies that we only have to estimate µ and σ2 from the sample. It is obvious that µ and σ2 are estimated by two “numbers” and that therefore they can be used in further calculations.

It can be shown that the mean µ can be estimated by means of the sample mean, which is simply the average of the observations in the sample. Suppose the sample has n observations denoted by Xi (i = 1, . . . , n); then the sample mean is given by

X̄ = (1/n) ∑_{i=1}^{n} Xi.   (3.1)

In general, we will denote an estimator of a parameter by the same symbol, but with a “hat” added to it. Thus, µ is estimated by µ̂ and σ2 by σ̂2. And, thus, µ̂ = X̄. The variance is estimated by means of the sample variance, which is given by

σ̂2 = S2 = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)2.   (3.2)

Later we will discuss X̄ and S2 in more detail.

In this example, we find for instance µ̂ = X̄ = 510.97 and σ̂2 = 11122.77 (or, σ̂ = 105.46).

Once µ̂ and σ̂2 are calculated from the sample, the true distribution is estimated as a normal distribution with mean µ̂ and variance σ̂2. Figure 3.3 shows the density function of this estimated normal distribution; the density function of the true distribution is added as a reference.

Finally, we like to note that if we would have assumed another distribution, e.g. a t-distribution, then it is not the mean and the variance that should have been estimated, for these are not parameters of the t-distribution. The “degrees-of-freedom” is the one and only parameter of the t-distribution, and it is this parameter that in this case should be estimated (we will not discuss this here any further). Later, we will argue that even if we do not assume a distribution of which the mean and/or variance are natural parameters, it is still meaningful to compute X̄ and S2. One of the reasons is that these quantities have a clear and important interpretation (the location and scale of the distribution). □


[Figure 3.3: The estimated density function (full line) and the density function of the true distribution (dotted line)]

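A hedged sketch of the computations in this example (Python). The course’s actual data set is not available here, so the 100 cholesterol levels are simulated from an invented normal population:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Simulated stand-in for the sample of 100 cholesterol levels
# (true mu = 500 and sigma = 100 are invented values).
x = rng.normal(loc=500.0, scale=100.0, size=100)

n = len(x)
mu_hat = x.mean()                          # sample mean, Eq. (3.1)
s2 = ((x - mu_hat) ** 2).sum() / (n - 1)   # sample variance, Eq. (3.2)

print(mu_hat, s2, np.sqrt(s2))
# The true distribution is then estimated as N(mu_hat, s2).
```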

3.2 “Statistics”

When observations are real numbers (e.g. the cholesterol level), we could e.g. assume that the observations are normally distributed, i.e.

Xi ∼ N(µ, σ2), (3.3)

where µ and σ2 are unknown.

In Example 3.1 we have shown that µ and σ2 can be estimated from a sample by means of the formulae in Equations 3.1 and 3.2. These expressions also show that X̄ and S2 are functions of the elements of the sample (i.e. the sampled observations). From the previous chapter we know that these sampled observations (Xi) are random variables (e.g. with the repeated sampling reasoning: in repeated experiments the observations are sampled completely at random, and in each of these repeated experiments the sampled observations will every time very probably be different). And, therefore, X̄ and S2 are random variables as well! Indeed, in the repeated sampling reasoning, X̄ and S2 will be different for every new sample. Thus X̄ and S2 are random variables themselves. As with every random variable, their behaviour can be characterized by means of a distribution (distribution function or density function).

Applet. The behaviour of X̄ in repeated sampling experiments is illustrated in Applet 3a. Applet 3c applies to S2. □


X̄ and S2 are only two examples of what we will call statistics. In general a statistic is a random variable that is defined as a function of the observations in the sample. Suppose X = {X1, . . . , Xn} denotes the sample, and g denotes some function; then

T = g(X) (3.4)

is called a statistic. Its distribution is completely determined by the distribution of the observations in X and by the function g. Later we will find the distributions of X̄ and S2 because these are two very important examples of statistics.

Sometimes, X̄ and S2 are referred to as summary statistics because they summarize the distributional behaviour of the observations in one number (Example 3.1 shows this very clearly when we are prepared to assume that the data are normally distributed). There are some other statistics that may be called summary statistics: e.g. the median, and the 1st and the 3rd quartile. The median, M, is defined as the element of the sample such that half of the observations (n/2) are smaller than M and half are larger than M (if n is even, M is typically the average of the two middle observations). The 1st quartile, Q1, is defined as the element of the sample such that n/4 observations are smaller than Q1. The 3rd quartile, Q3, is defined similarly, but with 3n/4 observations smaller than Q3.

Since X̄ and S2 are estimators of parameters (µ and σ2), and since they are just single numbers, they are also called point estimators. Later we will see that there are statistics that are not just estimators of a parameter. In this course we will discuss three types of statistics:

• point estimators (e.g. µ̂ = X̄ as a “point estimator” of µ)

• interval estimators (see later)

• test statistics (see later)
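All of the summary statistics mentioned above are one-line computations. A sketch (Python, reusing a simulated sample in the spirit of Example 3.1):

```python
import numpy as np

rng = np.random.default_rng(seed=8)
x = rng.normal(500.0, 100.0, size=100)  # hypothetical sample of size n = 100

print(x.mean())                      # sample mean, the point estimator of mu
print(x.var(ddof=1))                 # sample variance S^2 (divisor n - 1)
print(np.median(x))                  # the median M
print(np.quantile(x, [0.25, 0.75]))  # the 1st and 3rd quartiles Q1 and Q3
```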

3.3 The Distribution of the Sample Mean

The sample mean is given by Equation 3.1 and may be seen as a point estimator of the true mean of the distribution. When the distribution is a normal distribution, then the mean µ is one of the two parameters characterizing


the normal distribution. But also when the distribution is not a normal distribution, the mean is a meaningful quantity (see Section 2.6), i.e. the mean is always defined as the expected value. Thus, whatever distribution the observations Xi have, the expected value E {Xi} may be of interest. Even if the distribution is not normal, we will use the notation µ = E {Xi} to denote the mean or the expected value.

3.3.1 The Distribution of the Sample Mean under the Normal Assumption

Applet. Applet 3a may be used to illustrate the following properties of the distribution of the sample mean:

• the expected value of the sample mean equals the expected value of the distribution of the observations, i.e. E{X̄} = E {Xi}

• the variance of the sample mean decreases as the sample size (n) increases

• when the observations are sampled from a normal distribution (i.e. Xi ∼ N(µ, σ2)), then the sample mean is also normally distributed

Without proof we will now formulate the conclusions from the above applet in a more formal way. First we will assume that the observations are normally distributed. In the next section we will drop this assumption.

Suppose the observations are normally distributed, i.e.

Xi ∼ N(µ, σ2), (3.5)

and that we consider a sample X1, . . . , Xn of size n. Further, we will assume that all n observations are independent (e.g. by independent sampling). Sometimes, when we want to stress that the observations Xi are independent, we will write

Xi i.i.d. N(µ, σ2), (3.6)

where “i.i.d.” stands for “identically and independently distributed as ...”.

Applied Statistics for the Food Sciences Chapter 3 p. 28

Page 30: APPLIED STATISTICS FOR THE FOOD SCIENCESaremautd/CourseNotes2002_2003/notes stat… · Applied Statistics for the Food Sciences Chapter 2 p. 7. have two groups, thus each person in

Under these conditions it can be shown that

X̄ ∼ N(µ, σ2/n).   (3.7)

Thus, the sample mean X̄ is also normally distributed. Moreover, it has the same expected value as the observations, i.e. E{X̄} = E {Xi} (exercise: explain this in terms of repeated sampling experiments). The variance of X̄ is proportional to the variance σ2 of the observations, and inversely proportional to the sample size n, i.e. Var{X̄} = σ2/n. This is an extremely important property! It says that, as long as the sample size is finite, there is “sample variability”, i.e. in the repeated sampling experiments, the repeatedly calculated sample means will be different (i.e. the sample mean has a variance due to sampling). Thus, as we actually all know, the sample mean is not exactly equal to the true mean: the sample mean is only an estimator of the true mean. Moreover, since Var{X̄} = σ2/n, this sampling variability becomes smaller as the sample size increases. Thus, the larger the sample, the more accurate the sample mean is as an estimator of the true mean. And, in the limit as n approaches infinity, Var{X̄} even becomes zero, which means that, in the limit, there is no sampling variability left, and X̄ becomes exactly equal to the true mean. This property is an asymptotic property, i.e. it holds when the sample size goes to infinity. The property that for a point estimator T, Var {T} → 0 as n → ∞ is called consistency. Thus, we may conclude that X̄ is a consistent estimator of the true mean µ = E {Xi}.

The property that E{X}

= µ is also extremely important. We say of suchan estimator that it is unbiased. The bias can be determined as E

{X}− µ,

which is zero in our case.
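
These repeated-sampling properties are easy to verify numerically. The following minimal sketch is written in R, whose syntax is very close to the S-plus used later in these notes; the values of µ, σ and n are arbitrary illustrative choices, not taken from the text.

mu    <- 35                 # true mean (illustrative)
sigma <- 2                  # true standard deviation (illustrative)
n     <- 10                 # sample size

# 10000 repeated samples; each entry is one sample mean
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)      # close to mu: the sample mean is unbiased
var(xbar)       # close to sigma^2 / n: the sampling variability
sigma^2 / n     # theoretical variance of the sample mean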

3.3.2 The Distribution of the Sample Mean without the Normal Assumption

When the observations Xi are not normally distributed, then, in general, the distribution of X̄ is not a normal distribution. Its distribution depends on the distribution of the observations used in the calculation of X̄. Still the following important properties hold:

• E{X̄} = E{Xi} = µ

• Var{X̄} = Var{Xi}/n = σ²/n


(note that these properties of course also hold when the observations Xi are normally distributed; also note that we still use the notation E{Xi} = µ and Var{Xi} = σ², even for non-normal situations). Thus, also here, X̄ is a consistent estimator of the true mean. The above mentioned properties remain valid for discrete distributions!

In the previous chapter we have seen that the square root of the variance is called the standard deviation. Here, the same applies, except that we now have to mention to which statistic the standard deviation refers. Thus, √Var{X̄} is called the standard deviation of the mean. In some literature, you might read "standard error of the mean", or simply "standard error".

3.3.3 The Central Limit Theorem

Applet. Applets 3a and 3b may be used to illustrate the following property of the sample mean: when the observations are sampled from a distribution different from a normal distribution, then the distribution of the sample mean becomes approximately a normal distribution when the sample size is sufficiently large. �

Thus, the applet shows that, as long as the sample size is sufficiently large, the sample mean is approximately normally distributed, irrespective of the distribution of the observations Xi. This result is known as the Central Limit Theorem (CLT). We may formulate the CLT as one of the following:

• the sample mean X̄ is asymptotically normally distributed (here, "asymptotically" really means the limit "n → ∞")

X̄ → N(µ, σ²/n) (3.8)

• the sample mean X̄ is approximately normally distributed (here, "approximately" refers to large, but finite, sample sizes; in practice, often n = 30 or n = 50 is sufficiently large)

X̄ ·∼ N(µ, σ²/n) (3.9)

The CLT applies to discrete distributions as well. This is illustrated in the next example.


Example 3.2. An example that will also be used later is the following. A group of 50 people (a panel) is given 3 glasses of milk. Two glasses are filled with the same milk, and the 3rd glass is filled with another type of milk. There is no visual difference between the 3 glasses. Each member of the panel is asked to point out the glass that tastes different from the other two. From a statistical point of view we could say that the response variable for person i is defined as

Xi = 1 if person i selects the right glass of milk,
Xi = 0 if person i does not select the right glass of milk. (3.10)

Thus, Xi is Bernoulli distributed. A Bernoulli distribution is characterized by only one parameter: the probability π (i.e. the probability of selecting the right glass of milk). In Section 2.6 we have seen that E{Xi} = π. Thus, the sample mean X̄ = (1/50) Σ_{i=1}^{50} Xi is a good estimator of π. In particular, it is an unbiased estimator in the sense that E{X̄} = π, and it is a consistent estimator in the sense that Var{X̄} = Var{Xi}/n → 0 as n → ∞.

From Section 2.6 we also know that Var{Xi} = π(1 − π). Thus, Var{X̄} = π(1 − π)/n = π(1 − π)/50.

Thus far we already know the mean and the variance of the point estimator X̄ of π, but we still do not know its distribution. Fortunately, the CLT tells us that for sufficiently large n, the distribution of X̄ is approximately normal, even for discrete distributions. Thus we have

X̄ ·∼ N(π, π(1 − π)/50). (3.11)

Here, we believe that n = 50 is indeed sufficiently large to make the approximation hold. �
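
The CLT behind Equation 3.11 can be illustrated with a small simulation. The sketch below is in R; the value π = 1/3 (the probability of picking the odd glass purely by chance) is an assumption for illustration, since the text leaves π unspecified.

pi0 <- 1/3                  # assumed true probability (illustrative)
n   <- 50                   # panel size, as in the example

# 10000 repeated panels; each entry is a sample proportion (a sample mean)
xbar <- replicate(10000, mean(rbinom(n, size = 1, prob = pi0)))

hist(xbar, breaks = 30)     # roughly bell-shaped, centred at pi0
mean(xbar)                  # close to pi0
var(xbar)                   # close to pi0 * (1 - pi0) / n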

3.4 Accuracy of the Sample Mean and Sample Size

Equation 3.7 illustrates very well how the accuracy of an estimator is related to the sample size. Indeed, as we have seen before, Var{X̄} is a measure of the variability of X̄ from sample to sample in the repeated sampling experiments. Thus, the smaller Var{X̄}, the more accurately the mean µ is estimated by X̄.


Equation 3.7 shows that Var{X̄} depends on two quantities: the variance σ² = Var{Xi} of the observations, and the sample size n. We will suppose that we cannot influence the variance of the observations, i.e. σ² is an inherent property of the population of interest. What we do have under our control is the sample size n. In this section we will see an easy method that may sometimes be used to determine n such that a minimal level of accuracy is guaranteed. The method is first illustrated in the following example.

Example 3.3. Example 3.2 continued. The aim is to have an accurate estimate of the probability π. We will measure the accuracy by the absolute difference between the estimator X̄ and the true probability π, i.e. |X̄ − π|. Since X̄ is a random variable, |X̄ − π| is a random variable as well. Thus, we can only make probability statements about the accuracy. Since X̄ is the sample mean, i.e. it is the average of the n elements of the sample, we will write X̄n to stress the fact that it depends on the sample size n. Let δ > 0 denote the minimal accuracy imposed by the researcher (here, e.g., δ = 0.10). Then, we could e.g. determine n such that the accuracy is attained with a probability of 95%. Thus, the probability statement becomes

P{|X̄n − π| < δ} = 0.95, (3.12)

or, equivalently,

P{−δ < X̄n − π < δ} = 0.95. (3.13)

In general, probabilities can only be calculated when the distribution is known. Here, we know that for sufficiently large n, X̄n is approximately normally distributed. In Section 2.5 we have seen that a normally distributed random variable may be standardized by subtracting its expected value and dividing by its standard deviation. If we apply this here, we obtain

X̄ ∼ N(π, π(1 − π)/n) (3.14)
X̄ − π ∼ N(0, π(1 − π)/n) (3.15)
(X̄ − π)/√(π(1 − π)/n) ∼ N(0, 1). (3.16)

If we divide all parts of the inequality inside the probability operator in Equation 3.13 by √(π(1 − π)/n), then we obtain

P{ −δ/√(π(1 − π)/n) < (X̄n − π)/√(π(1 − π)/n) < δ/√(π(1 − π)/n) } = 0.95, (3.17)

where now the middle part inside the probability operator is a standard normally distributed random variable. For such a standard normal random variable, say Z, we know by the definition of the quantiles (Section 2.5) that

P{−z_{α/2} < Z < z_{α/2}} = 1 − α, (3.18)

which, with 1 − α = 0.95, gives

P {−z0.025 < Z < z0.025} = 0.95, (3.19)

and from the table of the quantiles of a standard normal distribution we find z_{0.025} = 1.96. Since Equations 3.17 and 3.19 are equivalent, we find

δ/√(π(1 − π)/n) = 1.96, (3.20)

from which we find

n = 1.96² π(1 − π)/δ². (3.21)

Substituting δ = 0.10 into Equation 3.21, we obtain

n = 384.16π(1− π). (3.22)

The problem with the above solution is, of course, that it still depends on the unknown parameter π. Indeed, π is unknown, otherwise we would not have been interested in estimating it! How should we proceed now? A simple solution is to calculate n for all possible values of π, for which we know that it is not smaller than 0 and not larger than 1 (because π is a probability). This is shown in Figure 3.4. It may now be concluded that n is maximal for π = 0.5, i.e. n = 96.04. Since n must be an integer, we will have to take n = 97, which is the smallest integer larger than 96.04. This solution is the safest solution, whatever the true value of π.
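
The computation behind Figure 3.4 can be reproduced with a few lines of R (a sketch; the grid step of 0.01 is an arbitrary choice):

delta   <- 0.10
pi.grid <- seq(0, 1, by = 0.01)
n.grid  <- 1.96^2 * pi.grid * (1 - pi.grid) / delta^2   # Equation 3.22

plot(pi.grid, n.grid, type = "l", xlab = "pi", ylab = "n")  # cf. Figure 3.4
max(n.grid)             # 96.04, attained at pi = 0.5
ceiling(max(n.grid))    # 97, the safe sample size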

In the above example we have encountered one problem: the solution for n, given by Equation 3.21, still contains one unknown parameter (π). Fortunately, π can only take values in the interval [0, 1], and for all these π the solution for n was practically feasible (i.e. n remained finite; it even never exceeded 100).



Figure 3.4: The solution for n as a function of the unknown parameter π

If, on the other hand, we had done the same exercise, but starting from a normal population (i.e. the observations are normally distributed), then we would have found the following solution for the sample size n,

n = 1.96²σ²/δ², (3.23)

where σ² is the variance of the normal distribution of the observations. Very often σ² is unknown to the researcher. Furthermore, since σ² has no upper bound, in theory at least, σ² may be infinitely large, and by Equation 3.23, n will become infinitely large as well. Clearly, this is no neat solution. However, when σ² is known by the researcher, then Equation 3.23 may be used.

Sometimes one does not know σ², but one can estimate σ² from previously obtained data, or from a small experiment. In that case we could use σ̂² = S² in Equation 3.23 instead of the unknown σ². There are, however, some theoretical problems with this approach. These will be discussed later in this chapter. First the distribution of the sample variance is discussed.

3.5 The Distribution of the Sample Variance

In the previous chapter it was illustrated how the variance σ², or the standard deviation σ, can be interpreted. It is the characteristic of the distribution that measures the scale. This is clear when looking at the density function: the larger σ, the wider the density, which means that the observations are more variable about the mean.


One can imagine that it is indeed important to know the variance of a distribution. Hence, it is important to have an estimate of the variance. Equation 3.2,

σ̂² = S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)², (3.24)

gives the formula. Exactly as with the mean and the sample mean, S² is the sample variance, and it is a point estimator of the variance σ². Therefore, we sometimes write σ̂² instead of S². Also, even if the observations are not normally distributed, the variance is still an interesting characteristic of the distribution.

Example 3.4. In a laboratory, one is often interested in knowing the variability due to a measurement method. This is the variability (or the variance) in the measurements that would appear if the measurement were repeated many times on the same "sample" (note: here "sample" is not meant in the statistical sense). Suppose we want to know the variance of measurements of the concentration of a chemical compound (e.g. NH4) (titrimetric measurements). Then, we could make a solution of this chemical compound, and distribute this solution over n containers. We may now expect that the true concentration is equal in all containers. Next, the concentration is measured n times independently. Thus, we can say that we have n independent observations from some distribution, with unknown mean (we do not know the exact concentration) and unknown variance. It is clear that the smaller the variance, the more accurate the measurement method will be. The variance can be estimated by using Equation 3.2. �

As with the sample mean, we will have to make a distinction between the situation where all data are normally distributed, and the case where the distribution of the observations is unknown.

3.5.1 The Distribution of the Sample Variance under the Normal Assumption

Applet. The distribution of the sample variance is illustrated in Applet 3c. In particular it shows that

• S² is an unbiased estimator of σ²

• the variance of S² decreases as n increases


Suppose that Xi i.i.d. N(µ, σ²) (i = 1, . . . , n). Then, it can be proven that

(n − 1)S²/σ² ∼ χ²_{n−1}. (3.25)

Since the mean of a χ²_{n−1} distributed random variable is equal to n − 1, we see immediately that

E{S²} = σ², (3.26)

i.e. S² is an unbiased estimator of σ².

We also know that the variance of a χ²_{n−1} distributed random variable is equal to 2(n − 1). Thus,

Var{S²} = 2σ⁴/(n − 1), (3.27)

from which we learn that the variance of the sample variance S² is inversely proportional to the sample size n. Thus, Var{S²} → 0 as n → ∞, and, hence, the sample variance is a consistent estimator of σ².
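
Both properties, and the χ²-distribution of (n − 1)S²/σ² itself, can be checked by simulation. A minimal sketch in R, with µ, σ and n chosen arbitrarily for illustration:

mu <- 0; sigma <- 2; n <- 10

# 10000 repeated samples; each entry is one sample variance S^2
s2 <- replicate(10000, var(rnorm(n, mean = mu, sd = sigma)))

mean(s2)                  # close to sigma^2 = 4 (unbiasedness)
var(s2)                   # close to 2 * sigma^4 / (n - 1)
2 * sigma^4 / (n - 1)

# (n - 1) S^2 / sigma^2 should follow a chi-squared distribution (3.25)
qqplot(qchisq(ppoints(10000), df = n - 1), (n - 1) * s2 / sigma^2)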

3.5.2 The Distribution of the Sample Variance without the Normal Assumption

As with the sample mean, an assertion as in Equation 3.25 cannot be proven when the data are not assumed to be normally distributed.

We will only mention here that, asymptotically (i.e. when n → ∞), S² is still an unbiased, consistent estimator of σ². There is a form of the CLT that applies to S², but we will not go into detail here.

3.5.3 Some Final Remarks

Some of the terminology may be confusing, especially here, where we also consider the variance of the sample variance. In the next few lines we have written a few statements. Convince yourself, as an exercise, that these statements are correct.

• the mean of the sample variance is the variance


• the mean of the sample mean is the mean

• under the normal assumption, the variance of the variance estimator is given by 2σ⁴/(n − 1)

3.6 Interval Estimator of the Mean

In Section 3.2 we have already mentioned that point estimators are not the only type of "statistics". They are not even the only type of estimators. In this section we will discuss the interval estimator of the mean. First we will argue why there is a need for another type of estimator.

3.6.1 Disadvantage of Point Estimators

It is easy to imagine that it is possible that in a first experiment n1 observations are sampled, resulting in a sample mean X̄1, and that, in a second experiment with n2 > n1 observations sampled from a distribution with the same variance (σ²) as in the first experiment, the sample mean X̄2 exactly equals X̄1. Although in this artificial example the two sample means are exactly equal, we know that

Var{X̄1} = σ²/n1 > σ²/n2 = Var{X̄2}, (3.28)

i.e. the second sample mean is more accurate than the first. However, if we were to report only the sample means, the reader of this report would not know anything about the variability of the estimators. The reader would not have a clue how much faith he must put in the sample mean as an estimator of the true mean.

The above mentioned criticism can be avoided by not only reporting the sample mean, but also

• the sample size n

• the variance, σ², or, when the true variance is not known, its estimator S²

Yet another way of presenting an estimator of the mean is by reporting the interval estimator of the mean. An interval estimator is not just a number,


but an interval which is associated with a certain probability statement. The smaller the interval, the more accurately the mean is estimated. The width of the interval immediately reflects the degree of uncertainty about the mean which is present in the dataset. Therefore, interval estimators are often preferred over point estimators.

3.6.2 Interval Estimator of the Mean

An interval is completely determined by a lower and an upper bound, denoted by L and U, respectively. Basically, we will calculate L and U from the sample. Hence, L and U may be called statistics. The interval itself, [L, U], is also generally known as the confidence interval of the mean.

We will determine L and U such that the true mean µ is in the interval [L, U] with probability 1 − α, i.e.

1− α = P {µ ∈ [L,U ]} (3.29)

= P {L ≤ µ ≤ U} . (3.30)

The solution is very simple when we assume that Xi i.i.d. N(µ, σ²). We start from the standardized sample mean, for which we know that

X̄ ∼ N(µ, σ²/n) (3.31)
X̄ − µ ∼ N(0, σ²/n) (3.32)
(X̄ − µ)/√(σ²/n) ∼ N(0, 1). (3.33)


Hence,

1 − α = P{ −z_{α/2} ≤ (X̄ − µ)/√(σ²/n) ≤ z_{α/2} } (3.34)
      = P{ −z_{α/2}√(σ²/n) ≤ X̄ − µ ≤ z_{α/2}√(σ²/n) } (3.35)
      = P{ X̄ − z_{α/2}√(σ²/n) ≤ µ ≤ X̄ + z_{α/2}√(σ²/n) }. (3.36)

Comparing Equations 3.30 and 3.36 gives us immediately

L = X̄ − z_{α/2}√(σ²/n) (3.37)
U = X̄ + z_{α/2}√(σ²/n). (3.38)

Hence, the interval estimator or the 1 − α confidence interval is given by

[ X̄ − z_{α/2}√(σ²/n), X̄ + z_{α/2}√(σ²/n) ]. (3.39)

Note that this interval is symmetric about the sample mean X̄, that the width of the interval increases with increasing σ², and that the width decreases with increasing sample size n.

Applet. Applet 5a shows, in the repeated sampling setting, the interpretation of the interval estimator (or confidence interval) of the mean. �
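
In R, the interval in Expression 3.39 may be computed directly. The sketch below uses made-up observations and an assumed known variance σ² = 0.16; both are hypothetical choices for illustration only.

x      <- c(35.6, 35.8, 35.0, 34.5, 34.9)   # hypothetical sample
sigma2 <- 0.16                              # assumed known variance
alpha  <- 0.05
n      <- length(x)

half <- qnorm(1 - alpha / 2) * sqrt(sigma2 / n)
c(mean(x) - half, mean(x) + half)           # the 1 - alpha confidence interval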

When the observations Xi are not normally distributed, then, at least for large samples, we could still rely on the CLT, which tells us that

X̄ ·∼ N(µ, σ²/n), (3.40)

and thus, at least approximately, the same solution for L and U is obtained.

Expression 3.39 gives the solution for the interval estimator of the mean, but, in practice, the variance σ² is most often not known. Fortunately it can be estimated from the sample by means of S² (Equation 3.2). However,


substituting σ² with its estimator S² will change the distribution of the standardized sample mean, on which the whole solution of L and U is based. In the next section we will point out the consequence of replacing the unknown constant σ² by its estimator S², which is a random variable.

3.7 Studentization

We have already discussed the standardization of a normally distributed random variable. We apply it here immediately to the sample mean,

X̄ ∼ N(µ, σ²/n) (3.41)
X̄ − µ ∼ N(0, σ²/n) (3.42)
(X̄ − µ)/√(σ²/n) ∼ N(0, 1). (3.43)

Thus, Z = (X̄ − µ)/√(σ²/n) follows a standard normal distribution. An important property of Z is that this distribution does not depend on any unknown parameter (i.e. the mean of Z is zero and its variance is 1, whatever the values of µ and σ²). In particular this is used for obtaining the interval estimator of the mean (see Equation 3.34).

When σ² is unknown, a natural reflex is to replace it with its estimator S². But replacing a constant with a random variable will have an effect on the distribution of a statistic.

When σ̂² = S² is used instead of σ² in a standardization, the standardization is called a studentization. We define

T = (X̄ − µ)/√(S²/n). (3.44)

Intuitively we could understand the difference between Z and T as follows. By replacing a constant by a random variable, additional variability is introduced. Thus, the variance of T will be larger than the variance of Z. Further, since S² is a consistent estimator of σ² (i.e. the variance of S² decreases with increasing sample size), we expect that the effect of using S² will decrease as n increases. This is indeed the case. It can be shown that

T ∼ t_{n−1}, (3.45)

i.e. the studentized sample mean, T, is t-distributed with n − 1 degrees of freedom. A property of the t-distribution is that t_{n−1} converges to a standard normal distribution as n → ∞. This latter property confirms our intuitive idea that the effect of using the estimator S² vanishes as the sample size becomes very large.

3.8 Interval Estimator of the Mean (Variance Unknown)

Expression 3.39 gives the interval estimator of the mean when the variance σ² is known. When, however, the variance is not known, we will replace σ² with its estimator S².

In order to find the solution to

1− α = P {L ≤ µ ≤ U} , (3.46)

we will now start from the distribution of the studentized sample mean.

(X̄ − µ)/√(S²/n) ∼ t_{n−1}. (3.47)

Hence,

1 − α = P{ −t_{n−1,α/2} ≤ (X̄ − µ)/√(S²/n) ≤ t_{n−1,α/2} } (3.48)
      = P{ −t_{n−1,α/2}√(S²/n) ≤ X̄ − µ ≤ t_{n−1,α/2}√(S²/n) } (3.49)
      = P{ X̄ − t_{n−1,α/2}√(S²/n) ≤ µ ≤ X̄ + t_{n−1,α/2}√(S²/n) }. (3.50)


We now find

L = X̄ − t_{n−1,α/2}√(S²/n) (3.51)
U = X̄ + t_{n−1,α/2}√(S²/n). (3.52)

Hence, the interval estimator or the 1 − α confidence interval is given by

[ X̄ − t_{n−1,α/2}√(S²/n), X̄ + t_{n−1,α/2}√(S²/n) ]. (3.53)

When an interval estimator is based on the CLT (i.e. the data are not normally distributed, but due to the CLT the sample mean is still approximately normally distributed for large samples), there is no need to switch from a standard normal distribution to a t-distribution when the unknown variance is replaced by a consistent estimator. Intuitively, this may be seen in the calculations presented in this section: when the CLT applies, the sample size must be large, and when the sample size is large, the t-distribution behaves very similarly to a standard normal distribution.

Example 3.5. Example 3.2 continued. We have already found that

X̄ ·∼ N(π, π(1 − π)/50), (3.54)

where we believed that n = 50 is indeed sufficiently large to make the approximation hold. Based on 3.54 we can find the 1 − α confidence interval of the mean (π) (cf. Equation 3.39):

[ X̄ − z_{α/2}√(σ²/50), X̄ + z_{α/2}√(σ²/50) ], (3.55)

where here σ² = π(1 − π). Thus, the confidence interval becomes

[ X̄ − z_{α/2}√(π(1 − π)/50), X̄ + z_{α/2}√(π(1 − π)/50) ], (3.56)

The parameter π is of course unknown (otherwise we would not have been interested in estimating it), but it can be estimated by X̄. Note that in the present example X̄ may also be denoted by π̂, for it is an estimator of π. Replacing π in Expression 3.56 with its estimator π̂ results in

[ π̂ − z_{α/2}√(π̂(1 − π̂)/50), π̂ + z_{α/2}√(π̂(1 − π̂)/50) ]. (3.57)

And since n = 50 is sufficiently large, the interval in Expression 3.57 will still contain π with a probability of (approximately) 1 − α. �
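
A sketch in R of the interval in Expression 3.57; the number of panel members selecting the right glass (22 out of 50) is made up for illustration.

n      <- 50
pi.hat <- 22 / n            # hypothetical observed proportion
alpha  <- 0.05

half <- qnorm(1 - alpha / 2) * sqrt(pi.hat * (1 - pi.hat) / n)
c(pi.hat - half, pi.hat + half)   # approximate 1 - alpha interval for pi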

3.9 Interval Estimator of the Variance

Also here the interval estimator is determined by a lower and an upper limit, denoted by L and U, respectively. These bounds must be such that the true variance σ² has a probability of 1 − α to be in the interval [L, U], i.e.

1 − α = P{L ≤ σ² ≤ U}. (3.58)

Under the assumption that the observations are i.i.d. normally distributed, we have seen in Section 3.5 that

(n − 1)S²/σ² ∼ χ²_{n−1}. (3.59)

Hence,

1 − α = P{ χ²_{n−1,1−α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,α/2} } (3.60)
      = P{ (n − 1)S²/χ²_{n−1,α/2} ≤ σ² ≤ (n − 1)S²/χ²_{n−1,1−α/2} }. (3.61)

Thus, we find the interval

[ (n − 1)S²/χ²_{n−1,α/2}, (n − 1)S²/χ²_{n−1,1−α/2} ]. (3.62)

Note that this interval is not symmetric about S²! The reason is that the χ²-distribution is asymmetric.

Applet. Applet 5b shows the interpretation of the interval estimator of the variance. �
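
A sketch in R of the interval in Expression 3.62, reusing the hypothetical cream data from the sketch in Section 3.8 (note that the document's upper quantile χ²_{n−1,α/2} corresponds to qchisq(1 − α/2, ...) in R):

x     <- c(35.6, 35.8, 35.0, 34.5, 34.9, 35.9, 36.0, 34.9, 35.5, 35.4)
alpha <- 0.05
n     <- length(x)

c((n - 1) * var(x) / qchisq(1 - alpha / 2, df = n - 1),
  (n - 1) * var(x) / qchisq(alpha / 2, df = n - 1))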


Chapter 4

Statistical Tests

In this chapter a third type of "statistic" is discussed: the test statistic. A test statistic is a statistic that is used in the construction of a statistical test, which is basically a decision rule for deciding, based on a sample, which of two hypotheses to choose.

We will discuss two important types of statistical tests: one-sample and two-sample t-tests. But before we come to that, a simple example is given to introduce and illustrate the most important concepts of statistical testing.

4.1 Statistical Testing: Introductory Example

4.1.1 Introduction

In a factory producing milk products, two types of cream are produced in batches: one with 35% and one with 40% fat. Something has gone wrong with the labelling of the bottles, so that there is now one batch of bottles of which nobody knows the fat content. Since there is some variability of the fat content from bottle to bottle, it is not sufficient to take one bottle and measure its fat content in order to decide for the whole batch. Therefore, a sample will have to be taken (e.g. 10 bottles), and the average fat content of these 10 bottles will be computed (i.e. the sample mean X̄). As we know from the previous chapter, the sample mean is a good estimator of the true mean µ, which is unknown in our example. But we know at least that


there are only two possible values for µ. These possibilities are stated in two competing hypotheses:

H0 : µ = 35

H1 : µ = 40

H0 and H1 are referred to as the null hypothesis and the alternative hypothesis, respectively.

We will throughout this example assume that all observations are normally distributed, with unknown mean µ (that is what we have to make a decision about) and known variance σ².

4.1.2 Decision Rule, Decision Errors and Associated Probabilities

Based on the sample mean X̄ we will apply some decision rule to make the final decision on which of the two hypotheses is the most likely. Note that we used the term "likely". This is because X̄ is a random variable and, by chance, we may have bad luck: X̄ may be very close to 40 although the sample came from a batch of the 35-group. Thus, using X̄ to make a decision is also a random phenomenon. This implies that (1) decision errors may occur, and (2) probabilities of making decision errors can be calculated.

Suppose we adopt the following simple, but logical decision rule:
"if X̄ ≤ 37.5 then H0 : µ = 35; if X̄ > 37.5 then H1 : µ = 40"

We will now take a closer look at the decision errors and the associated probabilities. Based on the assumption of normally distributed data, one of the two following situations will be true:

• suppose µ = 35 → X̄ ∼ N(35, σ²/n)

• suppose µ = 40 → X̄ ∼ N(40, σ²/n)

These two situations are shown in Figure 4.1.

In general one of four situations may occur:


Figure 4.1: The two possible distributions of the sample mean (σ² = 60, n = 10)

Table 4.1: Probabilities of decisions, conditional on the hypothesis (σ² = 60, n = 10). The probabilities in bold correspond to decision errors. Note that the rows represent the two possible decisions.

                 H0 : µ = 35    H1 : µ = 40
P{X̄ ≤ 37.5}     0.85           0.15
P{X̄ > 37.5}     0.15           0.85

• µ = 35, but based on the decision rule, we decide H1 : µ = 40 → decision error

• µ = 35, and based on the decision rule, we correctly decide H0 : µ = 35

• µ = 40, but based on the decision rule, we decide H0 : µ = 35 → decision error

• µ = 40, and based on the decision rule, we correctly decide H1 : µ = 40

These 4 situations may be presented in a table which contains the probabilities conditional on one of the two hypotheses. In order to calculate the probabilities explicitly, we have to know σ² and the sample size n; here we take e.g. σ² = 60 and n = 10. The table is given in Table 4.1. These probabilities may be seen as areas under the densities in Figure 4.1.
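
The probabilities in Table 4.1 may be verified with a few lines of R (a sketch, using only the normal distribution function; setting n <- 25 reproduces Table 4.2):

sigma2 <- 60; n <- 10
se <- sqrt(sigma2 / n)      # standard deviation of the sample mean

pnorm(37.5, mean = 35, sd = se)       # P{Xbar <= 37.5 | mu = 35}, about 0.85
1 - pnorm(37.5, mean = 35, sd = se)   # decision error, about 0.15
pnorm(37.5, mean = 40, sd = se)       # decision error, about 0.15
1 - pnorm(37.5, mean = 40, sd = se)   # P{Xbar > 37.5 | mu = 40}, about 0.85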

Thus, when the decision rule is applied, we see that the two probabilities of making decision errors are identical (0.15). This may sometimes be a disadvantage (see later). Yet another drawback of this approach is that both probabilities of the decision errors depend on the variance σ² and the sample size n.


Table 4.2: Probabilities of decisions, conditional on the hypothesis (σ² = 60, n = 25). The probabilities in bold correspond to decision errors. Note that the rows represent the two possible decisions.

                 H0 : µ = 35    H1 : µ = 40
P{X̄ ≤ 37.5}     0.95           0.05
P{X̄ > 37.5}     0.05           0.95

Figure 4.2: The two possible distributions of the sample mean (σ² = 60, n = 25)

This can be seen in Table 4.2, which shows similar probabilities, but now with n = 25 (see also Figure 4.2). Both probabilities of the decision errors are now only 5%. In a similar way these probabilities depend on the variance σ². Later, when we no longer assume that the variance is known, this might become very problematic, because then we would not know these probabilities prior to the experiment.

Now we will construct a decision rule that controls for one type of decision error, i.e. whatever the variance σ² and the sample size n, the probability of making this particular decision error remains constant. We say that with such a rule, the decision error rate is controlled for ("decision error rate" is the same as the "probability of making the decision error"). In particular we want

P {decide H1 : µ = 40|H0 : µ = 35} = 0.05. (4.1)

The decision error of concluding H1 when in reality H0 is true is called the type I error. The corresponding probability (here: 0.05) is referred to as the type I error rate or the type I error level.


The decision rule is again of the same type as the previous one, but now the critical value to which X̄ is compared is not fixed at 37.5, but rather is represented by X̄c. X̄c will now be determined such that Equation 4.1 holds true (i.e. the type I error rate is controlled). Thus,

P{decide H1 : µ = 40 | H0 : µ = 35} = 0.05
P{X̄ > X̄c | H0 : µ = 35} = 0.05
P{ (X̄ − 35)/(σ/√n) > (X̄c − 35)/(σ/√n) | H0 : µ = 35 } = 0.05

Since (X̄ − 35)/(σ/√n) is standard normally distributed, we immediately find that

(X̄c − 35)/(σ/√n) = z0.05 = 1.64 (4.2)

and thus

X̄c = 35 + 1.64 σ/√n. (4.3)

In the above example with σ² = 60 and n = 10, this means X̄c = 39.02, which is quite close to 40! When n = 25, this becomes X̄c = 37.54, which is already a bit closer to 35.

Note that in the above derivation all probabilities are conditional on H0, i.e. the probabilities are calculated as if H0 were true. We say that the calculations are under the null hypothesis.

The decision rule becomes, for general type I error level α:

• X̄ ≤ 35 + zα σ/√n −→ decide µ = 35

• X̄ > 35 + zα σ/√n −→ decide µ = 40

It is, however, not customary to write decision rules directly in terms of the sample mean X̄. Rather, they are written in terms of test statistics. Very often, these test statistics are statistics that have a distribution, under the null hypothesis, that does not depend on any population parameter (e.g. µ or σ). In this example the test statistic is

T = (X̄ − 35)/(σ/√n). (4.4)


Figure 4.3: The null distribution of the test statistic T = (X̄ − 35)/(σ/√n). The critical value z0.05 = 1.64 is shown.

Its distribution under the null hypothesis is of course the standard normal distribution. We write this as

T ∼_{H0} N(0, 1). (4.5)

The distribution of a test statistic under the null hypothesis is called the null distribution. This null distribution is shown in Figure 4.3.

It is obvious that T is indeed a statistic: T is a function of the observations in the sample, and since the observations are random variables, the test statistic is also random. However, once a particular sample is taken and observed, all the observations are fixed (this can be seen as one realization of an experiment in the repeated sampling philosophy). Therefore, we will make a distinction between the test statistic as a random variable, denoted by T, and the test statistic obtained after observing a particular sample. In the latter case, the observed test statistic is denoted by to.

We will also introduce yet another change in terminology. Instead of writing "decide H0" and "decide H1" in the decision rules, we will write "accept H0" and "reject H0", respectively.

From Equation 4.2 the decision rule may now be derived in terms of the test statistic (see also Figure 4.3).

• to = (X̄ − 35)/(σ/√n) ≤ zα −→ accept H0 : µ = 35

• to = (X̄ − 35)/(σ/√n) > zα −→ reject H0 and thus conclude H1 : µ = 40


Table 4.3: Decision Errors

                       H0 : µ = 35    H1 : µ = 40
accept H0 : µ = 35     OK             β = F(zα − 5/(σ/√n))
reject H0 : µ = 35     α              OK

Thus, we now have a decision rule which guarantees that the type I error rate is controlled at α. But what about the other decision error, i.e. the error that occurs when H0 is decided while in reality H1 is true? This type of decision error is called the type II error and the corresponding probability β is

β = P{accept H0 | H1 : µ = 40}
  = P{ (X̄ − 35)/(σ/√n) ≤ zα | H1 : µ = 40 }
  = P{ (X̄ − 40)/(σ/√n) + (40 − 35)/(σ/√n) ≤ zα | H1 : µ = 40 }
  = P{ (X̄ − 40)/(σ/√n) ≤ zα − (40 − 35)/(σ/√n) | H1 : µ = 40 }
  = F( zα − 5/(σ/√n) ),

where F(.) is the distribution function of a standard normal distribution. It is important to see that this probability still depends on σ and n. Therefore, we say that the type II error rate is not controlled for (this will become especially clear when σ is not known).

The two decision errors and their probabilities may be represented as in Table 4.3.

A probability that is closely related to β is the power of a statistical test. It is defined as

power = P{reject H0 | H1} = 1 − β. (4.6)

It is obvious that we want the statistical test to have a power as large as possible. Later we will see that a large power may be obtained by ensuring that the sample size n is large enough. At this point we still say that the power, too, is not controlled for.
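
The type II error rate and the power may be computed explicitly. A sketch in R of the formula derived above, β = F(zα − 5/(σ/√n)):

sigma <- sqrt(60)           # as in the example
alpha <- 0.05

beta <- function(n) pnorm(qnorm(1 - alpha) - 5 / (sigma / sqrt(n)))
beta(10)                    # type II error rate for n = 10
1 - beta(25)                # power for n = 25: a larger n gives more power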


4.1.3 Strong and Weak Conclusions

The construction of the statistical test implies that concluding H0 and concluding H1 are not equally strong conclusions. This can be seen from Table 4.3. We consider the two possible conclusions and their consequences.

• Suppose that we have concluded to reject H0. Then, according to Table 4.3, in reality there are two possibilities: (1) indeed H1 is true, and in that case we have made a correct decision, or (2) H0 is true, and then we have made a decision error (type I). The probability of making this error is exactly controlled at α, which is typically very small, e.g. α = 0.05. Thus, if we reject the null hypothesis, there is only a very small probability that we have made an error. Thus, rejecting H0 is a strong conclusion.

• Suppose that we have concluded to accept H0. Then, according to Table 4.3, there are two possibilities: (1) indeed H0 is true, and in that case we have made a correct decision, or (2) H1 is true, and then we have made a decision error (type II). The probability of making this error is given by β. We have already seen that β still depends on the variance σ² and the sample size n, and in later applications we will even see that β depends on unknown quantities, and, thus, the probability β is unknown as well. In practice it will often happen that β is even large, say larger than 20%, often even larger than 50%. Thus, when the null hypothesis is accepted, there is a possibly large probability that this is a decision error. Therefore, this type of conclusion is a weak conclusion. Accepting a null hypothesis is actually the same as saying that there was not enough evidence in the data to prove the opposite.

4.1.4 p-Value

Most statistical software packages calculate the p-value. The p-value may be used, instead of the observed test statistic to, in an equivalent decision rule.

The p-value is generally defined as

p = P {T is “more extreme” than observed to|H0} , (4.7)


Figure 4.4: The null distribution of the test statistic T = (X̄ − 35)/(σ/√n). The critical value z0.05 = 1.64 is shown, as well as an observed test statistic to = 0.95 < 1.64 and its corresponding p-value = 0.17 > 0.05.

where "more extreme" means "more extreme in the direction of the alternative hypothesis H1". This must be interpreted for each statistical test. In our example, "more extreme" means "larger than", because the larger the test statistic, the more evidence there is that H1 is true instead of H0. Thus,

p = P {T > to|H0} (4.8)

The decision rule in terms of p-values becomes

• p ≥ α −→ accept H0

• p < α −→ reject H0, conclude H1

Note that the p-value is a conditional probability, i.e. it is a probability calculated under the condition that the null hypothesis is true. Suppose that p is extremely small, say p = 0.00001. Then, this would mean that it is very unlikely to obtain a test statistic that is at least as extreme as the one observed from the present sample, given that the null hypothesis is true. Thus, it is very unlikely that the null hypothesis is true. Moreover, since we further know that, under the alternative hypothesis, more extreme values of T are to be expected, it is indeed a good conclusion to state that it is very unlikely that the null hypothesis is true, but that rather the alternative hypothesis is more likely to be true (hence for such a small p-value we reject the null hypothesis).


The above reasoning immediately implies that the p-value is a very valuable quantity, measuring in some way the evidence in the sample against the null hypothesis in favour of the alternative hypothesis.
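
For the introductory example, the p-value in Equation 4.8 is just a normal tail probability. A sketch in R, using the observed test statistic to = 0.95 shown in Figure 4.4:

t.obs <- 0.95
p <- 1 - pnorm(t.obs)       # P{T > t_o | H0}, with T standard normal under H0
p                           # about 0.17 > 0.05, so H0 is accepted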

4.2 Composite Hypothesis

Up to now we have considered only so-called "simple" alternative hypotheses, i.e. the alternative hypothesis (H1 : µ = 40) contains only one possibility for µ. Often, however, there are many more possibilities if the null hypothesis is not true. We will consider 3 types, which can be summarized as

H1 : µ ≠ 35 (two-sided hypothesis)
H1 : µ < 35 (one-sided to the left)
H1 : µ > 35 (one-sided to the right)

The corresponding statistical tests are called one-sample tests. "One-sample" refers to the fact that all calculations are based on only one sample from one population (e.g. n bottles of cream from one batch).

4.3 One-Sample t-Tests

Up to now we have assumed that the variance σ² is known. In practice, however, this is almost never the case. Therefore, we will from now on drop this assumption. The one-sample tests are then called one-sample t-tests.

4.3.1 Unknown Variance

If the variance σ² is not known, a straightforward solution is to replace σ² with its estimator σ̂² = S² in the formula of the test statistic (Equation 4.4). Thus, we now have

T = (X̄ − 35)/(S/√n). (4.9)

Under the null hypothesis H0 : µ = 35, however, the test statistic T is now no longer normally distributed. We have come across a similar situation in Section 3.7. There we have seen that, by replacing σ by S, the standard normal distribution changes to a t-distribution with n − 1 degrees of freedom.


Thus, we now have as null distribution

T = (X̄ − 35)/(S/√n) ∼_{H0} t_{n−1}. (4.10)

This result only holds under the assumption that the observations are normally distributed.

Remark: when n is very large, the CLT applies, and the sample mean X̄ will always be approximately normally distributed. In such a case we have, under the null hypothesis,

T = (X̄ − 35)/(S/√n) ·∼ N(0, 1). (4.11)

4.3.2 One-Sided One-Sample t-Test

Example 4.1. Example of Section 4.1 continued. Suppose that we are still interested in testing that the fat content is equal to 35 percent, but that we now do not know what the concentration may be if it is not equal to 35 percent (e.g. because the company makes many different types of cream with various fat contents). The only restriction there is, is that it cannot be smaller than 35%, because such creams are not produced. Then we would like to test

H0 : µ = 35

against

H1 : µ > 35.

Now, the variance is unknown and it will have to be estimated from the sample as well. �

Since the null hypothesis is the same as before, the null distribution is also unchanged:

T = (X̄ − 35)/(S/√n) ∼_{H0} t_{n−1}. (4.12)

Since the alternative hypothesis is different from the one considered in Section 4.1, we have to check whether the test statistic is indeed still a good statistic,


i.e. does T in some sense measure the deviation between the null hypothesis (H0 : µ = 35) and the alternative hypothesis (H1 : µ > 35)? The answer is "yes": when the alternative hypothesis is true, we expect that X̄ will probably be larger than 35, and, hence, the distribution of T will, under the alternative hypothesis, be shifted to the right. Therefore, we conclude that T is a suitable test statistic for testing H0 against H1. Furthermore, from this reasoning we also deduce that we will reject H0 for extremely large positive values of T. This latter property must be reflected in the decision rule, which will thus in general be of the following form:

• to = (X̄ − 35)/(S/√n) ≤ tc −→ accept H0

• to = (X̄ − 35)/(S/√n) > tc −→ reject H0, conclude H1

where tc is the critical value, which must be such that the type I error rate is controlled at the α level. Thus, tc is such that

P {T > tc|H0} = α (4.13)

The solution is simple: since we know that under the null hypothesis T is distributed as t_{n−1}, the value of tc is simply the α-quantile of this t-distribution, thus

tc = t_{n−1,α}. (4.14)

Hence, the decision rule becomes

• to = (X̄ − 35)/(S/√n) ≤ t_{n−1,α} −→ accept H0

• to = (X̄ − 35)/(S/√n) > t_{n−1,α} −→ reject H0, conclude H1

A major difference from the situation of the simple alternative hypothesis of Section 4.1 is that now we cannot calculate the probability of making a type II error (β). The reason is that this probability must be calculated under the alternative hypothesis, but now the alternative hypothesis contains an infinite number of possibilities for µ. Still, when we take one particular possibility of H1, e.g. µ = 35 + δ, where δ > 0, then we can calculate β as a function of δ. Thus,

β(δ) = P{accept H0 | µ = 35 + δ}. (4.15)


Figure 4.5: The null distribution (left) of the test statistic T = (X̄ − 35)/(S/√20). The critical value t_{19,0.05} = 1.73 is shown as a vertical line. The middle distribution is the distribution of T when µ = 40, and the right distribution corresponds to µ = 45. b1 and b2 are the two corresponding type II errors β1 and β2.

Figure 4.5 shows the null distribution and, for some possibilities of µ under H1, the corresponding distributions of T. It can be seen that as δ increases under H1, β(δ) decreases, or, equivalently, the power (= 1 − β(δ)) increases.

Example 4.2. Example of Section 4.1 continued. Suppose now H0 : µ = 35, but H1 : µ < 35. Since the null hypothesis remains the same, the null distribution of T is still valid as well. In order for T to be a suitable test statistic, we must think about the behaviour of T under the alternative hypothesis, in the sense that T must measure the deviation from H0 in the direction of the alternative hypothesis. Here, since under H1 the sample mean X̄ will probably be smaller than 35, we expect the distribution of T to be shifted to the left as compared to the null distribution. Thus, T is still a good test statistic, but now we will reject the null hypothesis for extreme negative values of T. The decision rule is thus of the form

• to = (X̄ − 35)/(S/√n) ≥ tc −→ accept H0

• to = (X̄ − 35)/(S/√n) < tc −→ reject H0, conclude H1

where tc is the critical value, which must be such that the type I error rate is controlled at the α level. Thus, tc is such that

P {T < tc|H0} = α (4.16)


The solution is simple: since we know that under the null hypothesis T is distributed as t_{n−1}, the value of tc is simply the (1 − α)-quantile of this t-distribution, thus

tc = t_{n−1,1−α} = −t_{n−1,α}, (4.17)

where the last equality holds because of the symmetry of the t-distribution. Hence, the decision rule becomes

• to = (X̄ − 35)/(S/√n) ≥ −t_{n−1,α} −→ accept H0

• to = (X̄ − 35)/(S/√n) < −t_{n−1,α} −→ reject H0, conclude H1

Example 4.3. Example of Section 4.1 continued. Suppose now that H0 : µ = 35 is to be tested against H1 : µ ≠ 35 (i.e. a two-sided alternative hypothesis). As before, the null distribution of T remains the same because the null hypothesis is not changed. Further, T still measures the deviation from the null hypothesis in the direction of the alternative hypothesis: extreme negative values of T indicate that µ < 35, which is part of H1, and extreme positive values of T indicate that µ > 35, which is also part of H1. This reasoning implies that the decision rule will now reject H0 for both extremely large negative and positive values of T. The type I error rate is now controlled by finding critical values t_{c,1} and t_{c,2} such that

P{reject H0 | H0} = α
P{T < t_{c,1} or T > t_{c,2} | H0} = α
P{|T| > tc | H0} = α, with tc = t_{c,2} = −t_{c,1},

where the last step is explained by the symmetry of the t-distribution (see Figure ??). Thus, by dividing the total type I error rate α equally over the two tails of the null distribution, we find

tc = t_{n−1,α/2}. (4.18)

And the decision rule is

• |to| = |(X̄ − 35)/(S/√n)| ≤ t_{n−1,α/2} −→ accept H0

• |to| = |(X̄ − 35)/(S/√n)| > t_{n−1,α/2} −→ reject H0, conclude H1


4.3.3 Example in S-plus

Suppose the following data is sampled:

35.6 35.8 35.0 34.5 34.9 35.9 36.0 34.9 35.5 35.4

• We test H0 : µ = 35 against H1 : µ ≠ 35, i.e. the two-sided alternative. We do this because, if the mean fat content is not 35 percent, there is no a priori knowledge that allows us to say whether the true mean will be larger or smaller than 35.

The test statistic to use is

T = (X̄ − 35)/(S/√n) (4.19)

of which we know that

T ∼_{H0} t_{n−1}. (4.20)

S-plus gives the following output:

One-sample t-Test

data:  cream
t = 2.2063, df = 9, p-value = 0.0548
alternative hypothesis: true mean is not equal to 35
95 percent confidence interval:
 34.99113 35.70887
sample estimates:
 mean of x
     35.35

Thus, to = 2.2063. This has to be compared to the critical value at the 5% level of significance, which is the α/2 = 0.025 quantile of t9:

t9,0.025 = qt(0.975,df=9) = 2.262157.

Since |to| = 2.2063 < 2.262157 we decide to accept the null hypothesis, and conclude that the mean fat content in the batch of cream bottles is 35 percent (although this is a weak conclusion because of the type II error rate, which is not controlled for).

The same conclusion may be reached by looking at the p-value. Here we have p = 0.0548, which is larger than α = 0.05, and thus we indeed


accept the null hypothesis. Note that the p-value is only slightly larger than α. This means that there actually is some evidence in the sample that the true mean fat content is different from 35 percent, but it is considered to be just not sufficient evidence to conclude the opposite. Here, you see clearly that the p-value is very valuable for nuancing the conclusion. Maybe, if the sample size had been larger, a significant result could have been established!

Finally, the S-Plus output also gives a 95% confidence interval for the mean. This may only be interpreted when the alternative hypothesis is two-sided. For one-sided alternatives, S-plus gives another type of confidence interval. Also the sample mean is given. (Both the two-sided and the one-sided analysis of this example are reproduced in the sketch following this list.)

• We do the same exercise again, but now we will test H0 : µ = 35 against H1 : µ > 35. This may be the case when e.g. we know that the mean fat content cannot possibly be smaller than 35 percent (e.g. a characteristic of the production process). Here, we do the same t-test, but in the decision rule we will only reject for large positive values of T. The critical value is now

t9,0.05 = qt(0.95,df=9) = 1.833113.

The S-Plus output is

One-sample t-Test

data:  cream
t = 2.2063, df = 9, p-value = 0.0274
alternative hypothesis: true mean is greater than 35
95 percent confidence interval:
 35.05919       NA
sample estimates:
 mean of x
     35.35

Of course, the observed test statistic is the same as before, but now it has to be compared to another critical value. Since to = 2.2063 > 1.833113 we now decide to reject the null hypothesis in favour of the alternative hypothesis, and conclude formally that the mean fat concentration in the batch of cream bottles is larger than 35 percent. This may also be concluded from the p-value, which is now p = 0.0274 < 0.05. This is a strong conclusion.

It is a general property of one-sided tests that they are more powerful(i.e. the power of the test is higher) than their two-sided counterparts.


Intuitively, the reason is that the alternative hypothesis is more specific, and the more specific the questions are that have to be answered, the more the statistical test can concentrate (i.e. specialize) on them.

• Actually we should have checked whether or not the assumption of normality of the observations holds. But since there are only 10 observations, this is hardly a sensible thing to do.
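
Both analyses of this example can be reproduced in R with the built-in t.test function (the outputs match the S-plus outputs shown above):

cream <- c(35.6, 35.8, 35.0, 34.5, 34.9, 35.9, 36.0, 34.9, 35.5, 35.4)

t.test(cream, mu = 35, alternative = "two.sided")   # p-value = 0.0548
t.test(cream, mu = 35, alternative = "greater")     # p-value = 0.0274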

4.4 Paired t-test

4.4.1 Construction of the Test

It sometimes happens that one sample is taken from a population, but that on each of the elements of the population (e.g. humans) two measurements are taken (e.g. a measurement before, and one after, a particular treatment). We denote the two observations on the ith element as Xi,1 and Xi,2, respectively.

We want to test the null hypothesis H0 : µ1 = µ2, where µ1 and µ2 are the means of the first and the second series of observations, respectively. It can be shown that this null hypothesis is equivalent to H0 : µ1−2 = 0, where µ1−2 is the mean of the transformed observations (differences)

Xi = Xi,1 − Xi,2. (4.21)

Hence, the initial null hypothesis may be tested by means of a one-sample t-test (i.e. the paired t-test is essentially a one-sample t-test on the differences of the two paired observations). This implies that the assumption underlying the construction of this test is that the transformed observations Xi = Xi,1 − Xi,2 must be normally distributed.

4.4.2 Example in S-Plus

The dataset turkey.sdd has 3 variables. We consider only the variables PRE and POST.A, which are the weights of 50 turkeys before and after a certain treatment. The researcher knows that the mean weight cannot increase by giving the turkeys the treatment. Thus the hypotheses are H0 : µPRE = µPOST.A against H1 : µPRE > µPOST.A.

The S-plus output is:


Figure 4.6: QQ-plot of the difference variable

Paired t-Test

data:  x: PRE in turkey , and y: POST.A in turkey
t = 1.9138, df = 49, p-value = 0.0307
alternative hypothesis: true mean of differences is greater than 0
95 percent confidence interval:
 0.03440644         NA
sample estimates:
 mean of x - y
     0.2775549

Since it is basically again a one-sample t-test, we give here only the interpretation based on the p-value.

p = 0.0307 < α = 0.05, and thus we decide to reject the null hypothesis, and conclude that the mean change in weight (PRE − POST.A) is significantly larger than zero at the 5% level.

Actually the assumptions have to be assessed. Thus we should calculate a new variable, defined as the difference between the PRE and the POST.A variables, and check the normality of this variable by means of e.g. a QQ-plot (Figure 4.6). The QQ-plot shows some minor deviation from the straight line, indicating that there might be a slight deviation from normality. But since there are 50 observations involved, the CLT comes into play, telling us that the sample mean will be approximately normally distributed whatever the distribution of the observations. These two arguments (an acceptable QQ-plot and a large sample) bring us to the conclusion that we do not have to worry about the assumption of normality.
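
A sketch in R of the complete paired analysis; it assumes that the turkey data is available as a data frame called turkey with columns PRE and POST.A, as in turkey.sdd.

diff <- turkey$PRE - turkey$POST.A      # the transformed observations (4.21)

qqnorm(diff); qqline(diff)              # assumption check, cf. Figure 4.6
t.test(turkey$PRE, turkey$POST.A,
       paired = TRUE, alternative = "greater")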


4.5 Two-Sample t-Test

4.5.1 Construction of Test Statistic: variances known

Example 4.4. The elasticity of puddings is an important physical property in the sense that customers do not only like puddings because of the taste, but also because of the experience of the pudding in the mouth. This particular feeling can be measured by e.g. the elasticity of the pudding. This elasticity is influenced by the ingredients of the pudding; it is believed to be especially influenced by the carrageen concentration. Therefore, an experiment is set up in which two formulas of puddings are compared w.r.t. their elasticity: $n_1$ puddings are prepared with a high concentration of carrageen, and $n_2$ puddings with a low concentration.

We are interested in testing whether the mean elasticities of both types of pudding are equal, i.e.

$$H_0: \mu_1 = \mu_2,$$

where $\mu_1$ and $\mu_2$ are the mean elasticities of the high and low concentration groups, respectively. Suppose that the researcher is certain that the elasticity is definitely not smaller in the high concentration group; then the alternative hypothesis of interest is

$$H_1: \mu_1 > \mu_2.$$

The above example is a special case of a setting in which one wants to compare the means of two possibly different populations, by sampling $n_1$ and $n_2$ observations from the populations. For this reason the statistical test for this type of problem is called a two-sample test. In the example, the alternative hypothesis is one-sided to the right ($H_1: \mu_1 > \mu_2$). Later we will consider the other possibilities for the alternative hypothesis as well.

We will make the following assumptions.

• the observations in the first sample are normally distributed, i.e. $X_{1,i}$ i.i.d. $N(\mu_1, \sigma_1^2)$ $(i = 1, \ldots, n_1)$

• the observations in the second sample are normally distributed, i.e. $X_{2,i}$ i.i.d. $N(\mu_2, \sigma_2^2)$ $(i = 1, \ldots, n_2)$


• the variances $\sigma_1^2$ and $\sigma_2^2$ are known (this assumption will later be dropped)

For the construction of the test statistic, we first note that $\bar{X}_1 - \bar{X}_2$ is an estimate of $\mu_1 - \mu_2$. Thus, we will reject $H_0$ in favour of $H_1$ for large values of $\bar{X}_1 - \bar{X}_2$. It can be shown that

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right), \qquad (4.22)$$

which becomes, under $H_0: \mu_1 = \mu_2$,

$$\bar{X}_1 - \bar{X}_2 \overset{H_0}{\sim} N\left(0,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right). \qquad (4.23)$$

And, by standardization,

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \overset{H_0}{\sim} N(0, 1). \qquad (4.24)$$

Thus, the following statistic seems to be a good test statistic (it has a nice null distribution and it measures the deviation from the null hypothesis in the direction of the alternative):

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \overset{H_0}{\sim} N(0, 1). \qquad (4.25)$$

Since we will reject $H_0$ for large values of $T$, the decision rule will be of the form

• $t_o = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \leq t_c$ $\longrightarrow$ accept $H_0$

• $t_o = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} > t_c$ $\longrightarrow$ reject $H_0$, conclude $H_1$

and the critical value $t_c$ must be determined such that the type I error rate is $\alpha$, i.e.

$$P\{\text{reject } H_0 \mid H_0\} = \alpha \qquad (4.26)$$
$$P\{T > t_c \mid H_0\} = \alpha. \qquad (4.27)$$

Since the null distribution of $T$ is a standard normal distribution, we find

$$t_c = z_\alpha. \qquad (4.28)$$
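As a worked sketch of this decision rule in S-Plus/R-style code (x1, x2 hypothetical sample vectors; sig1sq and sig2sq the known variances; all names invented for illustration):

alpha <- 0.05
to <- (mean(x1) - mean(x2)) /
      sqrt(sig1sq/length(x1) + sig2sq/length(x2))   # observed test statistic
tc <- qnorm(1 - alpha)                              # critical value z_alpha
to > tc                                             # TRUE -> reject H0, conclude H1
1 - pnorm(to)                                       # one-sided p-value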


4.5.2 Variances Unknown

Since usually the variances $\sigma_1^2$ and $\sigma_2^2$ are not known, they will have to be replaced by their estimates, $S_1^2$ and $S_2^2$. As in the one-sample case, we may expect that the null distribution will then no longer be a standard normal distribution. Unfortunately, it turns out that we may not make this simple replacement without further care. A distinction has to be made between two situations:

• $\sigma_1^2 = \sigma_2^2$

• $\sigma_1^2 \neq \sigma_2^2$

$\sigma_1^2 = \sigma_2^2$

When in reality the two variances are equal (note: this can be tested with a statistical test), the two point estimators of these variances, $S_1^2$ and $S_2^2$, can be pooled into one single estimator of the common variance $\sigma^2 = \sigma_1^2 = \sigma_2^2$. This estimator is the so-called pooled variance estimator, and it is given by

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}. \qquad (4.29)$$

Under these conditions it can be shown that

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}} \overset{H_0}{\sim} t_{n_1+n_2-2}. \qquad (4.30)$$

Still, the null hypothesis will be rejected for large values of the test statistic. Thus we can now easily find the decision rule:

• $t_o = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2/n_1 + S_p^2/n_2}} \leq t_{n_1+n_2-2,\alpha}$ $\longrightarrow$ accept $H_0$

• $t_o = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2/n_1 + S_p^2/n_2}} > t_{n_1+n_2-2,\alpha}$ $\longrightarrow$ reject $H_0$, conclude $H_1$

$\sigma_1^2 \neq \sigma_2^2$

In this situation, it is proposed to use the test statistic

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}. \qquad (4.31)$$


Unfortunately, the exact null distribution is not a simple distribution, but approximations exist. We will not give the details here, but in S-Plus this test can be performed. As with all statistical tests, the interpretation can be based on the p-value given by S-Plus.

When, on the other hand, the sample sizes $n_1$ and $n_2$ are large, the CLT comes into play, and the null distribution of $T$ may simply be approximated by a normal distribution.
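The two situations can be sketched as follows (x1, x2 hypothetical sample vectors; the var.equal argument is the R form of the call, S-Plus versions may differ):

n1 <- length(x1); n2 <- length(x2)
# pooled variance and equal-variance t-test, computed by hand
sp2 <- ((n1 - 1)*var(x1) + (n2 - 1)*var(x2)) / (n1 + n2 - 2)
to  <- (mean(x1) - mean(x2)) / sqrt(sp2/n1 + sp2/n2)
1 - pt(to, n1 + n2 - 2)                     # one-sided p-value, H1: mu1 > mu2
# the same test, and the approximate unequal-variance test, in one call each
t.test(x1, x2, var.equal = TRUE,  alternative = "greater")
t.test(x1, x2, var.equal = FALSE, alternative = "greater")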

4.5.3 Other Alternative Hypotheses

We have only given the solutions for the one-sided alternative hypothesis $H_1: \mu_1 > \mu_2$.

When $H_1: \mu_1 < \mu_2$, we will reject $H_0$ for extreme negative values of $T$. This implies that we only have to change the inequality signs in the decision rules given for $H_1: \mu_1 > \mu_2$.

When $H_1: \mu_1 \neq \mu_2$, we will reject $H_0$ for both extreme negative and extreme positive values of $T$. In that case, the probability $\alpha$ has to be divided over the two tails of the null distribution, and the critical value $t_c$ will thus be an $\alpha/2$ quantile. Since all null distributions considered in this section are symmetric, the decision rule will be based on $|t_o|$, rejecting for extremely large values of this absolute value.

4.5.4 Example in S-Plus

We consider the dataset nutrition.sdd and we use the variables gender and bm. The former is the gender of the subject and the latter is the basal metabolism of the subject. We are interested in testing whether men and women have the same mean basal metabolism. Let group 1 contain the men, and group 2 the women ($n_1 = 69$, $n_2 = 146$). Thus,

$$H_0: \mu_1 = \mu_2$$
$$H_1: \mu_1 \neq \mu_2.$$

Before we can test these hypotheses, we must decide which test to use. Since the variances are unknown, the choice of the test depends on whether the variances $\sigma_1^2$ and $\sigma_2^2$ are equal or not. One way to check this is to make a boxplot of both groups (see Figure 4.7).


Figure 4.7: Boxplots of bm for men and women

This plot suggests that the variances are more or less equal. Thus we will use the t-test with the pooled variance estimator, which here equals

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = 22568.38. \qquad (4.32)$$

The observed test statistic is thus

$$t_o = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}} = 9.863827. \qquad (4.33)$$

According to the decision rule, this value has to be compared to the critical value $t_{n_1+n_2-2,0.025} = 1.971164$. Since $|t_o| > 1.971164$ we reject the null hypothesis and conclude that the mean basal metabolisms of men and women are significantly different at the 5% level of significance. Moreover, since $\bar{X}_1 = 1380.87 > 1164.392 = \bar{X}_2$, we can conclude that the mean basal metabolism of men is larger than that of women.

The S-Plus output is (also assuming equal variances)

Standard Two-Sample t-Test

data:  x: BM with SEXE = men , and y: BM with SEXE = women
t = 9.8638, df = 213, p-value = 0
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 173.2176 259.7385
sample estimates:


 mean of x mean of y
   1380.87  1164.392

Of course, S-Plus has computed the same $t_o$. The same conclusion could be based on the p-value, which is reported as $p = 0$! This means that $p$ is extremely small, approximately 0. Thus we can very strongly reject the null hypothesis in favour of the alternative.

The t-test is only formally valid for normally distributed observations. The boxplots in Figure 4.7 suggest that the distributions are indeed symmetric, but the figure also shows some outliers. On the other hand, the sample sizes $n_1 = 69$ and $n_2 = 146$ are large, and thus the CLT could apply. (Note that if the CLT applies, there is no need to calculate the pooled variance estimator, but pooling remains the better approach when the data may be normally distributed.) In this specific example, both approaches seem to be justified. Moreover, since the p-value is extremely small, there will be no difference in the final conclusion.

4.6 Tests for Normality

In the previous chapter as well as in this chapter, we have come across many methods that are based on the assumption of normality, i.e. it is assumed that $X_i$ i.i.d. $N(\mu, \sigma^2)$. In this section we briefly discuss a statistical test for the hypothesis that the data are normally distributed. First, a graphical method for assessing normality is given: the QQ-plot.

4.6.1 QQ-Plot

A QQ-plot is a graphical tool to assess normality. It plots the observed data ($X_i$) against the expected observations if the data were indeed normally distributed. These expected outcomes can be calculated as

$$F^{-1}(\hat{F}(X_i)), \qquad (4.34)$$

where $\hat{F}(X_i)$ is the estimate of the distribution function at the observation $X_i$. Since the distribution function is basically a probability, it can be very easily estimated. The definition of the distribution function is

$$F(x) = P\{X \leq x\}, \qquad (4.35)$$


Figure 4.8: QQ-plots. The two upper graphs are examples where the data are normally distributed, and the two lower graphs are examples where the data are not normally distributed.

which is estimated as

$$\hat{F}(x) = \frac{\text{number of observations in the sample that are} \leq x}{n}. \qquad (4.36)$$

The distribution function $F$ in Equation 4.34 (actually its inverse is used there) is the distribution function of a normal distribution with the mean equal to the sample mean and the variance equal to the sample variance (i.e. $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = S^2$).

Since a QQ-plot is a plot of the observed versus the expected observations if they were normally distributed, we would expect the points in the plot to lie on a straight line. But since the sample is only a finite collection of observations randomly selected from e.g. a normal population, the plot may show sampling variability. Thus, even if the observations come from a normal distribution, the points may show some scattering around the straight line. But, as soon as the plot shows a clear systematic deviation from the straight line, there is evidence in the sample that the data are not normally distributed.

This plot is only useful when there is sufficient data, say at least n > 20.

Figure 4.8 shows some examples.
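A QQ-plot can also be constructed by hand along the lines of Equations 4.34-4.36. A minimal sketch (x a hypothetical numeric vector; the -0.5 is a common plotting-position adjustment, not part of Equation 4.36, that avoids evaluating the normal quantile function at probability 1):

x <- sort(x)
n <- length(x)
Fhat <- ((1:n) - 0.5) / n                        # estimated distribution function
expected <- qnorm(Fhat, mean(x), sqrt(var(x)))   # expected normal observations
plot(expected, x)                                # QQ-plot: points near a straight line
abline(0, 1)                                     # reference line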


4.6.2 Kolmogorov-Smirnov Test for Normality

Probably one of the best known statistical tests for normality is the Kolmogorov-Smirnov (KS) test. It tests the null hypothesis

H0 : X is normally distributed

against

H1 : X is not normally distributed.

Theoretically, it is possible to derive the construction of the KS test, i.e. we could explain the test statistic, derive its null distribution, and show how the decision rule is constructed and how the critical values have to be computed. Nevertheless, we will not do so here (it is not an easy statistical test). Fortunately, we have learnt that any statistical test can be interpreted through its p-value alone. This is the way we will use the KS test in this course. Thus, if a KS test results in $p > \alpha$, we will accept the null hypothesis of normality, and when $p \leq \alpha$, we will reject the null hypothesis and conclude that the data are not normally distributed.
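Although we do not derive the test, the statistic itself is easy to describe: it is the largest discrepancy between the estimated distribution function (Equation 4.36) and the fitted normal distribution function. A rough sketch of only the statistic (x hypothetical; the null distribution and p-value are left to the software):

x  <- sort(x)
n  <- length(x)
F0 <- pnorm(x, mean(x), sqrt(var(x)))     # fitted normal distribution function
# maximal distance between the empirical and the fitted distribution function
D  <- max((1:n)/n - F0, F0 - (0:(n - 1))/n)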

Example 4.5. To illustrate the KS test, we consider the data of the example given in Section 4.4.2. The S-Plus output is

One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = normal

data:  diff in turkey
ks = 0.1052, p-value = 0.6006
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters

From this output we see that $p = 0.6006 > \alpha = 0.05$. Thus, we accept the null hypothesis of normality. Although we know that accepting a null hypothesis is a weak conclusion, we would like to note that

• the p-value is considerably larger than $\alpha = 0.05$. Taking into account that the sample size is not very small ($n = 50$), this p-value implies that there is indeed not much evidence against normality.

• the QQ-plot (Figure 4.6) did not indicate any substantial systematic departure from the straight line.


Figure 4.9: QQ-plots of two non-normal distributions. Left: a symmetric distribution; right: an asymmetric distribution

These two arguments suggest that there is no good reason to say that the data are not normally distributed. Thus, we are quite convinced that the data are normally distributed, at least to a sufficient approximation to apply statistical tests in a valid way (see also the next section). □

4.6.3 Consequences of Nonnormal Data

In this section we comment briefly on the situation where the observations are not normally distributed (according to the KS test and/or the QQ-plot).

As we have seen, many of the statistical methods about the mean (confidence intervals and statistical tests) are based on the assumption that the observations are normally distributed (except when the CLT applies). Many studies have indicated, however, that these methods are not very sensitive to violations of the normality assumption. In particular, when the distribution of the observations is symmetric, the methods are not sensitive to this type of deviation from normality. The symmetry of a distribution can be assessed by means of e.g. a QQ-plot. Figure 4.9 shows a symmetric and an asymmetric non-normal distribution. (It may also be detected by means of e.g. a histogram or a boxplot.)

According to the above-mentioned argument, the statistical inference (confidence intervals and statistical tests) remains valid as long as the distribution of the observations is symmetric. This is true, at least approximately. Thus, one must be careful when the decision is not convincing, e.g. when the p-value is close to $\alpha$. E.g. when the sample size is small, say $n = 20$ (thus the CLT does not apply), the data are not normal but seem to be symmetric, and a statistical test results in $p = 0.053$, which is only slightly larger than $\alpha = 0.05$, then it is hard to formulate the conclusion formally. The reason is that, maybe, due to the small deviation from


normality, the calculated p-value is not exactly the probability that refers to the correct null distribution, and, maybe, the exact p-value (i.e. the p-value calculated under the exact null distribution, which is unfortunately unknown due to the non-normality) is a little bit smaller than the calculated one, such that this exact p-value might turn out to be smaller than $\alpha = 0.05$, resulting in the opposite conclusion.

The same reasoning applies to a situation where e.g. $p = 0.048$, which is only slightly smaller than the cut-off value $\alpha = 0.05$.

From this discussion we also learn, again, that it is important to always report the p-value.

In the next section an alternative class of tests is proposed. This class of tests can be used whenever the observations are not normally distributed.

4.7 Rank Tests

We will discuss this class of tests only very briefly. As before, in practice we will only use the p-value to interpret the test.

In general, rank tests are similar to t-tests, except that not the original observations are used, but instead their so-called ranks. Thus, the observations are rank transformed. The rank transformation is illustrated in Table 4.4 (the rank transformation is denoted as $R(Y)$). This illustrates clearly that, whatever the distribution of the observations, they are transformed to a scale which is independent of the original one. A very important feature of this transformation is that the effect of outliers is eliminated. E.g. in Table 4.4 the largest observation of $Y$ is an outlier, but once it is rank transformed, it just receives the largest rank $R = 10$, which is exactly one unit larger than the second largest rank ($R = 9$).

In the next two sections two important rank tests are briefly discussed. Because of their property of being independent of the distribution of the original data, they are often referred to as nonparametric tests.

4.7.1 Wilcoxon Signed Rank Test

The Wilcoxon signed rank test is a nonparametric alternative to the paired t-test. We will use the same notation as in Section 4.4.1.


Table 4.4: Observations Y and the rank-transformed data R

Y  12.1  14.3  15.1  15.5  15.9  16.0  16.4  18.9  20.1  40.1
R     1     2     3     4     5     6     7     8     9    10

As with the paired t-test, all paired observations are subtracted, i.e.

$$X_i = X_{i,1} - X_{i,2}, \qquad (4.37)$$

$(i = 1, \ldots, n)$. Some of these differences will be positive, and some will be negative. Next, the absolute values of the differences are rank transformed, resulting in the ranks $R(X_i)$, which are by definition all integers between 1 and $n$. The Wilcoxon signed rank test statistic is defined as the sum of the ranks $R(X_i)$ of only the positive $X_i$'s. The null distribution of this test statistic can be easily enumerated, because under the null hypothesis it only depends on all possible arrangements of the $n$ ranks over the positive and negative differences, whatever the original form of the distribution of $X_i$.
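A sketch of this statistic in S-Plus/R-style code (d the hypothetical vector of differences; no zero differences or ties assumed):

r <- rank(abs(d))     # ranks of the absolute differences
V <- sum(r[d > 0])    # signed rank statistic: rank sum of the positive differences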

Finally, we give the S-Plus output for the same example as in Section 4.4.2.

Wilcoxon signed-rank test

data:  x: PRE in turkey , and y: POST.A in turkey
signed-rank normal statistic without correction Z = 1.8679, p-value = 0.0618
alternative hypothesis: true mu is not equal to 0

From this analysis we see that $p = 0.0618 > 0.05$, and thus we conclude to accept the null hypothesis. Note that with the paired t-test we found $p < 0.05$, and there we rejected $H_0$. This seems to be a contradiction, but it is not. It can be shown that, if the data are normally distributed, the paired t-test is more powerful, i.e. the paired t-test has a higher chance of rejecting $H_0$ when $H_1$ is true. Thus, if the data are normally distributed, it is recommended to use the paired t-test. This also illustrates the importance of checking the assumptions underlying the paired t-test. Finally, we would like to remark that the p-values of the paired t-test and this Wilcoxon signed rank test are not much different, but unfortunately $\alpha = 0.05$ lies in between the two.


4.7.2 Wilcoxon Rank Sum Test

This test is also known as the Mann-Whitney test. It is an alternative to the two-sample t-test. We will use the same notation as in Section 4.5.

In a first step all data are pooled, i.e. $X_1 = X_{1,1}, X_2 = X_{1,2}, \ldots, X_{n_1} = X_{1,n_1}, X_{n_1+1} = X_{2,1}, \ldots, X_{n_1+n_2} = X_{2,n_2}$. Next these pooled observations are ranked, i.e. $R(X_i)$ $(i = 1, \ldots, n_1 + n_2)$. The test statistic is calculated as the sum of the ranks of the observations in the first sample.

Again, the null distribution can be exactly enumerated by considering all possible arrangements of the ranks under the null hypothesis. These arrangements do not depend on the distributions of the original observations. Therefore this test, too, is called a nonparametric test.
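A sketch of the statistic (x1, x2 hypothetical sample vectors; ties ignored):

R <- rank(c(x1, x2))          # ranks of the pooled observations
W <- sum(R[1:length(x1)])     # rank sum statistic: ranks of the first sample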

Finally, we give the S-Plus output for the same example as in Section 4.5.4.

Wilcoxon rank-sum test

data:  x: BM with SEX = men , and y: BM with SEX = women
rank-sum normal statistic without correction Z = 7.8811, p-value = 0
alternative hypothesis: true mu is not equal to 0

Here we have $p = 0 < 0.05$, and thus we reject $H_0$ and conclude that the mean BM is not the same for men and women. This is the same conclusion as obtained with the two-sample t-test.


Chapter 5

Analysis of Variance

In the previous chapter we have seen methods to test the hypothesis that two means are equal. In this chapter these methods are extended to test whether an arbitrary number of means are equal. In doing so we will have to introduce statistical models, which, in a later chapter, will be further extended to e.g. regression models.

5.1 Statistical Models

5.1.1 Two populations

In the previous chapter we considered two populations of which we were mainly interested in the means. We have seen methods to estimate these means (point and interval estimators) and we have introduced the two-sample t-test to test the equality of the two means. When we assume that the two variances are equal, this setting can be represented as a statistical model, or, in particular, an ANOVA model.

The assumptions underlying the traditional two-sample t-test are

• population 1: $Y_{1,j}$ i.i.d. $N(\mu_1, \sigma^2)$ $(j = 1, \ldots, n_1)$

• population 2: $Y_{2,j}$ i.i.d. $N(\mu_2, \sigma^2)$ $(j = 1, \ldots, n_2)$


Thus, for population $i = 1$ and $i = 2$, the transformed variable

$$Y_{i,j} - \mu_i = \varepsilon_{ij} \text{ i.i.d. } N(0, \sigma^2). \qquad (5.1)$$

This may also be written as

$$Y_{i,j} = \mu_i + \varepsilon_{ij}, \qquad (5.2)$$

where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$, $i = 1, 2$; $j = 1, \ldots, n_i$. Equation 5.2 is a statistical model. The index $i$ refers to the population, and the index $j$ refers to the $j$-th observation of the sample from population $i$. Typically, one of the first steps in the application of a statistical model is the fitting of the model to the data, i.e. the unknown parameters in the model must be estimated. In Model 5.2, 3 parameters must be estimated: $\mu_1$, $\mu_2$ and $\sigma^2$. Of course, the solution is given in the previous chapter:

$$\hat{\mu}_1 = \bar{X}_1, \quad \hat{\mu}_2 = \bar{X}_2, \quad \hat{\sigma}^2 = S_p^2,$$

and the corresponding interval estimators or confidence intervals are also as seen previously. Finally, hypotheses can be formulated in terms of the parameters (e.g. $H_0: \mu_1 = \mu_2$) and can be tested.

5.1.2 p Populations

Example 5.1. In the nutrition.sdd dataset, many variables are measured at 3 different time periods. An interesting research question is how the mean QI evolves over time. In the study, the researchers have taken 3 independent samples of men and women at 3 different periods. These 3 periods correspond to 3 populations, each with a possibly different mean QI. We will assume that the variance of the QI is the same in the 3 populations.

The corresponding statistical model is

$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad (5.3)$$

where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$ $(i = 1, 2, 3;\ j = 1, \ldots, n_i)$. □

More generally, we may be interested in the means of $p$ populations. Then the observations $Y_{ij}$ of population $i$ may be written as the statistical model

$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad (5.4)$$


where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$ $(i = 1, \ldots, p;\ j = 1, \ldots, n_i)$. Note that this last statement about the error terms $\varepsilon_{ij}$ constitutes the assumptions underlying the statistical model, which are often needed for the statistical analysis based on this model.

Before we say something more general about the parameter estimation and the analysis of variance, we shall introduce another formulation of the statistical model given in Equation 5.3. The reason for changing to another model formulation will become clear only later.

The model in Equation 5.3 may equivalently be written as

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad (5.5)$$

where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$ $(i = 1, \ldots, p;\ j = 1, \ldots, n_i)$, and with the additional restriction that

$$\sum_{i=1}^{p} \tau_i = 0 \qquad (5.6)$$

(this restriction is called the $\Sigma$-restriction). Further, $\mu$ is given by

$$\mu = \frac{1}{p}\sum_{i=1}^{p} \mu_i. \qquad (5.7)$$

By comparing the two equivalent models (Equations 5.3 and 5.5), we see immediately that

$$\mu_i = \mu + \tau_i. \qquad (5.8)$$

The parameter $\mu$ serves as some kind of general mean, and it is referred to as the constant or intercept of the model. The parameters $\tau_i$ $(i = 1, \ldots, p)$ are the effects. When the $p$ population means $\mu_i$ are equal, i.e. $\mu_1 = \mu_2 = \ldots = \mu_p$, then, by Equation 5.8,

$$\mu + \tau_1 = \mu + \tau_2 = \ldots = \mu + \tau_p, \qquad (5.9)$$

which can only be true if

$$\tau_1 = \tau_2 = \ldots = \tau_p = 0 \qquad (5.10)$$

(note that this solution still satisfies the $\Sigma$-restriction). Hence, the null hypothesis of equality of means is equivalent to Equation 5.10.


As mentioned before, the parameters $\tau_i$ are called the effects, or the treatment effects (often the $p$ populations correspond to $p$ different treatments; this is the historical reason for the name). In particular, these $p$ populations (or treatments) are specified by a factor. A factor may be seen as a discrete variable that indicates to which of the $p$ populations every observation belongs. For instance, let $X_{ij}$ denote this factor variable; then $X_{ij} = i$. We also say that this factor has $p$ levels, i.e. the factor determines $p$ populations or treatments.

5.2 Parameter Estimation

5.2.1 Least Squares Criterion

Basically, a statistical model reflects the assumptions made about the observations / populations. In particular, the ANOVA models introduced in this chapter may be seen as a parameterization of the means of $p$ populations of which it is further assumed that all observations are normally distributed with the same variance $\sigma^2$. I.e. we allow the $p$ populations to be different only in terms of their means. Equation 5.8 shows that $\tau_i$ is the difference between the mean of population $i$ and the general mean $\mu$.

The unknown parameters are $\mu, \tau_1, \tau_2, \ldots, \tau_p$ and $\sigma^2$. Based on a sample of observations from each population, these parameters must be estimated (i.e. point estimators). We will denote the point estimators by $\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p$ and $\hat{\sigma}^2$. Sometimes the latter is also denoted by $S^2$, which is again to be interpreted as a pooled variance estimator.

The general method of parameter estimation is the method of least squares. We will only briefly discuss this method here.

Consider again the ANOVA model

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad (5.11)$$

where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$ $(i = 1, \ldots, p;\ j = 1, \ldots, n_i)$. Define the fitted values

$$\hat{Y}_{ij} = \hat{\mu} + \hat{\tau}_i, \qquad (5.12)$$

and the residuals

$$e_{ij} = Y_{ij} - \hat{Y}_{ij}, \qquad (5.13)$$


and note that the residuals $e_{ij}$ may be seen as “estimates” of the error terms $\varepsilon_{ij}$ in the statistical model. Thus both the fitted values $\hat{Y}_{ij}$ and the residuals $e_{ij}$ depend on the parameter estimates $\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p$. To stress this dependence we write $\hat{Y}_{ij}(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p)$ and $e_{ij}(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p)$. Intuitively, it is clear that “good” parameter estimates will be such that the errors are small. Since the errors are estimated by the residuals, this requirement means that the residuals $e_{ij}$ should be small. In particular, the least squares method consists in finding the parameter estimates such that

$$L(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p) = \sum_{i=1}^{p}\sum_{j=1}^{n_i} e_{ij}^2(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p) \qquad (5.14)$$

is minimized.

We now try to explain why this is a good approach.

Each fitted value $\hat{Y}_{ij} = \hat{\mu} + \hat{\tau}_i$ is the estimator of the mean $\mu_i = \mu + \tau_i$ of the corresponding $i$-th population. Thus, for population $i$, the statistic

$$S_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}\left(Y_{ij} - \hat{Y}_{ij}\right)^2 \qquad (5.15)$$
$$\;\; = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} e_{ij}^2 \qquad (5.16)$$

is the estimator of $\sigma^2$. Equation 5.14 can now be written as

$$L(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p) = \sum_{i=1}^{p} (n_i - 1) S_i^2. \qquad (5.17)$$

Remember from the 2-sample t-test (Section 4.5) that, when $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we calculated a pooled variance estimator $S_p^2 = ((n_1 - 1)S_1^2 + (n_2 - 1)S_2^2)/(n_1 + n_2 - 2)$ as the estimator of the common $\sigma^2$. A straightforward extension is thus

$$S_p^2 = \frac{L(\hat{\mu}, \hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_p)}{\sum_{i=1}^{p} n_i - p} \qquad (5.18)$$

as a pooled variance estimator of the common variance $\sigma^2$. Let $N = \sum_{i=1}^{p} n_i$ denote the total sample size; then $\hat{\sigma}^2 = S_p^2 = L/(N - p)$ is our estimator of $\sigma^2$. We will often simply use the notation $S^2$ to denote this estimator. Without proof, we state that $S^2$ is an unbiased estimator of $\sigma^2$.


In conclusion, we could say that the parameters are estimated such that the residual variance (i.e. the unexplained variance) is minimized. In the next section, where the statistical properties of the parameter estimators are given, other arguments will be given to illustrate that the least squares method is indeed a good method.

In general, minimizing the least squares criterion $L$ may be done by means of a numerical algorithm. But for simple linear models (e.g. the ANOVA model), there exists an analytical solution of the minimization problem. In particular, the point estimators are given by

$$\hat{\mu} = \frac{1}{p}\sum_{i=1}^{p} \bar{Y}_i \qquad (5.19)$$

$$\hat{\tau}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} - \hat{\mu} = \bar{Y}_i - \hat{\mu} \qquad (5.20)$$

$$\hat{\sigma}^2 = L/(N - p) = \frac{1}{N - p}\sum_{i=1}^{p}\sum_{j=1}^{n_i} e_{ij}^2. \qquad (5.21)$$

We will also use the notation $\bar{Y}$ and $\bar{Y}_i$ to denote the sample mean over all samples, and the sample mean of the $i$-th sample, respectively.
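Equations 5.19-5.21 translate directly into code. A sketch assuming a hypothetical numeric response Y and a factor group indicating the population of each observation:

Ybar.i  <- tapply(Y, group, mean)     # sample means of the p groups
n.i     <- tapply(Y, group, length)   # sample sizes n_i
p <- length(Ybar.i); N <- length(Y)
mu.hat  <- mean(Ybar.i)               # Equation 5.19: unweighted mean of group means
tau.hat <- Ybar.i - mu.hat            # Equation 5.20
e  <- Y - Ybar.i[group]               # residuals e_ij
S2 <- sum(e^2) / (N - p)              # Equation 5.21: pooled variance estimator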

5.2.2 Statistical Properties of the Estimators

In this section we give the statistical properties of the estimators, i.e. how they are distributed. We will, however, not go into the theoretical details.

Equations 5.19 and 5.20 show that the estimators $\hat{\mu}$ and $\hat{\tau}_i$ are simple functions of sample means. From Chapter 3 we know that sample means are normally distributed when the observations are assumed to be normally distributed. In particular,

$$\bar{Y}_i \sim N\left(\mu_i,\ \frac{\sigma^2}{n_i}\right). \qquad (5.22)$$

From these properties, the distributions of the parameter estimators may be derived. Let

$$\frac{1}{n_r} = \frac{1}{p}\sum_{i=1}^{p} \frac{1}{n_i}, \qquad (5.23)$$


i.e. $n_r$ is the harmonic (reciprocal) mean of the sample sizes $n_i$ (note that if all sample sizes are equal, $n_1 = n_2 = \ldots = n_p = r$, then $n_r = r$). Then,

$$\hat{\mu} \sim N\left(\mu,\ \frac{\sigma^2}{p\, n_r}\right) \qquad (5.24)$$

$$\hat{\tau}_i \sim N\left(\tau_i,\ \frac{\sigma^2}{n_i} - \frac{\sigma^2}{p}\left(\frac{2}{n_i} - \frac{1}{n_r}\right)\right), \qquad (5.25)$$

where $\frac{\sigma^2}{p}\left(\frac{2}{n_i} - \frac{1}{n_r}\right) > 0$. Further, it can be shown that the variance of $\hat{\tau}_i$ is minimal if $n_1 = n_2 = \ldots = n_p = n_r$. Thus, such a study design is most efficient (it is called a balanced design). Clearly, the estimators are unbiased and consistent. Sometimes, we will use the notation $\sigma^2_{\hat{\mu}} = \frac{\sigma^2}{N}$ and $\sigma^2_{\hat{\tau}_i} = \frac{\sigma^2}{n_i} - \frac{\sigma^2}{p}\left(\frac{2}{n_i} - \frac{1}{n_r}\right)$.

The variances $\sigma^2_{\hat{\mu}}$ and $\sigma^2_{\hat{\tau}_i}$ may be estimated by simply replacing $\sigma^2$ with its unbiased estimator $S^2$ (Equation 5.18), resulting in

$$S^2_{\hat{\mu}} = \hat{\sigma}^2_{\hat{\mu}} = \frac{S^2}{N} \qquad (5.26)$$

$$S^2_{\hat{\tau}_i} = \hat{\sigma}^2_{\hat{\tau}_i} = \frac{S^2}{n_i} - \frac{S^2}{p}\left(\frac{2}{n_i} - \frac{1}{n_r}\right). \qquad (5.27)$$

An important characteristic is that both the estimated variance of $\hat{\mu}$ and that of $\hat{\tau}_i$ are proportional to the variance estimator $S^2$. This variance estimator, however, is proportional to the minimized least squares criterion $L$. Hence, among all possible definitions of estimators, the least squares estimators are those that have the smallest estimated variance! This may be considered an optimality property. Thus, the least squares estimators are optimal estimators. Finally, since $S^2$ is an unbiased estimator, the estimators $S^2_{\hat{\mu}}$ and $S^2_{\hat{\tau}_i}$ are unbiased as well.

The statistical properties of $S^2$ are more difficult. It may be shown that

$$\frac{(N - p)S^2}{\sigma^2} \sim \chi^2_{N-p}. \qquad (5.28)$$

It is obvious that this result only holds if indeed all observations have the same common variance $\sigma^2$.


5.2.3 Interval Estimators (confidence intervals)

Once the statistical properties of the point estimators are known, it is straightforward to obtain the confidence intervals (= interval estimators). Since this is completely analogous to Section 3.6, we will omit the details here and give only the solutions.

The lower and upper limits of the $(1 - \alpha)$-confidence interval of $\mu$ are given by

$$L = \hat{\mu} - S_{\hat{\mu}}\, t_{N-p,\alpha/2} \qquad (5.29)$$
$$U = \hat{\mu} + S_{\hat{\mu}}\, t_{N-p,\alpha/2}. \qquad (5.30)$$

The lower and upper limits of the $(1 - \alpha)$-confidence interval of $\tau_i$ are given by

$$L = \hat{\tau}_i - S_{\hat{\tau}_i}\, t_{N-p,\alpha/2} \qquad (5.31)$$
$$U = \hat{\tau}_i + S_{\hat{\tau}_i}\, t_{N-p,\alpha/2}. \qquad (5.32)$$

Of course, it is also interesting to know the confidence interval of the mean $\mu_i$ of the $i$-th population. One solution would be to use only the observations of the $i$-th sample, and to apply the methods given in Section 3.6. This is indeed a correct approach, but when the observations of all samples have a common variance $\sigma^2$, there is a more efficient method. In particular, in this case the lower and upper limits of the $(1 - \alpha)$-confidence interval of $\mu_i$ are given by

$$L = \hat{\mu} + \hat{\tau}_i - \frac{S}{\sqrt{n_i}}\, t_{N-p,\alpha/2} = \bar{Y}_i - \frac{S}{\sqrt{n_i}}\, t_{N-p,\alpha/2} \qquad (5.33)$$
$$U = \hat{\mu} + \hat{\tau}_i + \frac{S}{\sqrt{n_i}}\, t_{N-p,\alpha/2} = \bar{Y}_i + \frac{S}{\sqrt{n_i}}\, t_{N-p,\alpha/2}. \qquad (5.34)$$

The reason why this is a better approach is that now the observations of the $p - 1$ other samples are also used to estimate the common variance $\sigma^2$. Thus $S^2$ is a better estimator than the estimator $S_i^2$, which is calculated only from the $n_i$ observations in the $i$-th sample. The fact that $S^2$ is a better estimator is reflected in the degrees of freedom of the t-distribution that is used in the confidence interval: $N - p$ degrees of freedom instead of $n_i - 1$ if only the $i$-th sample had been used. (The larger the degrees of freedom, the smaller the t-quantile, resulting in a smaller confidence interval.)
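Continuing the earlier hypothetical sketch (Ybar.i, n.i, S2, N and p as before), the intervals of Equations 5.33-5.34 for all groups at once:

alpha <- 0.05
tq    <- qt(1 - alpha/2, N - p)    # t-quantile with N - p degrees of freedom
se    <- sqrt(S2 / n.i)            # standard errors based on the pooled S^2
lower <- Ybar.i - tq * se          # Equation 5.33
upper <- Ybar.i + tq * se          # Equation 5.34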


5.3 ANOVA Table and Hypothesis Tests

A central role in the analysis of variance is played by the ANOVA table, in which the decomposition of the Total Sum of Squares is presented. The results of the statistical tests are also presented in this table.

The hypotheses that are of interest here are

$$H_0: \mu_1 = \mu_2 = \ldots = \mu_p \quad \text{or, equivalently,} \quad H_0: \tau_1 = \tau_2 = \ldots = \tau_p = 0 \qquad (5.35)$$

versus

$$H_1: \text{at least two means are different.} \qquad (5.36)$$

5.3.1 Decomposition of the Total Sum of Squares

The name “analysis of variance” refers to the decomposition of the Total Sum of Squares, where the total sum of squares measures the total variability in the data, and where each of the components represents the variance that can be explained by a factor in the statistical model. All these components are “sums of squares” (SS); with each sum of squares a mean sum of squares (MS) and a number of degrees of freedom (df) are associated. The latter are determined such that each MS is an unbiased estimator of a variance. Furthermore, the MS is always defined as the SS divided by the corresponding df.

The total sum of squares is defined as

$$\text{SSTot} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}\right)^2. \qquad (5.37)$$

There are $N - 1$ degrees of freedom associated with SSTot, and the total mean sum of squares is given by

$$\text{MSTot} = \frac{\text{SSTot}}{N - 1}. \qquad (5.38)$$

The interpretation of SSTot is very easy. From Equations 5.37 and 5.38 it is seen that MSTot has the form of a sample variance estimator where no distinction between the $p$ populations is made (i.e. $\bar{Y}$ is used as the estimator of a common mean). Thus, since under the null hypothesis there is indeed one common mean in the $p$ populations, MSTot is an unbiased estimator of


$\sigma^2$. We will show that, when $H_0$ does not hold true, SSTot will be larger than what we would expect under $H_0$.

The second sum of squares is the Treatment Sum of Squares, which is defined as

$$\text{SST} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(\bar{Y}_i - \bar{Y}\right)^2 = \sum_{i=1}^{p} n_i\left(\bar{Y}_i - \bar{Y}\right)^2, \qquad (5.39)$$

which is associated with $p - 1$ degrees of freedom. The Treatment Mean Sum of Squares is given by

$$\text{MST} = \frac{\text{SST}}{p - 1}. \qquad (5.40)$$

Again, MST has the general form of a sample variance, but this time the $p$ sample means $\bar{Y}_i$ are treated as observations, and $\bar{Y}$ is considered as the estimator of the mean of the population from which the sample means $\bar{Y}_i$ are sampled. From this reasoning, it may be understood that MST will be small when in reality all $\mu_i$ are equal, and MST will be large when the $\mu_i$ are different. We could think of MST or SST as a measure of the sample information against the null hypothesis. This will be helpful in the construction of the statistical test.

The third sum of squares is the Error Sum of Squares or the Residual Sum of Squares, which is defined as

$$\text{SSE} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2 = \sum_{i=1}^{p}\sum_{j=1}^{n_i} e_{ij}^2 = L, \qquad (5.41)$$

which is associated with $N - p$ degrees of freedom. The Error Mean Sum of Squares, or Residual Mean Sum of Squares, is given by

$$\text{MSE} = \frac{\text{SSE}}{N - p} = S^2. \qquad (5.42)$$

Thus, MSE, which is exactly equal to $S^2$, is the unbiased estimator of the residual variance $\sigma^2$. This characteristic holds independently of whether or not the null hypothesis is true.

So far we have defined the sums of squares. The decomposition is given by the following important property, which holds for all samples:

$$\text{SSTot} = \text{SST} + \text{SSE}. \qquad (5.43)$$


Thus, SSTot, which is a measure of the total variability of the observations, is decomposed into (1) SST, which is the part of the total variance that can be explained by the differences in the $p$ sample means, and (2) SSE, the residual variance, which is the part of the total variance that cannot be explained by the differences in the $p$ sample means.

Finally, the same decomposition as in Equation 5.43 holds for the degrees of freedom, i.e.

$$N - 1 = (p - 1) + (N - p). \qquad (5.44)$$

5.3.2 Hypothesis Tests

As before, we will specify the statistical test by (1) looking for a suitable test statistic, (2) finding its null distribution, and (3) specifying the decision rule such that the type I error rate is exactly controlled at $\alpha$.

1. From the previous section it is clear that SST, or MST, is a good starting point for a test statistic. It can be proven that

$$E\{\text{MST}\} = \sigma^2 + \frac{\sum_{i=1}^{p} \tau_i^2}{p - 1}. \qquad (5.45)$$

Thus, indeed, the distribution of MST shifts to the right the more the $\tau_i$'s differ from zero. On the other hand, MST also depends on the variance $\sigma^2$. Therefore, we suggest normalizing MST by dividing it by $\text{MSE} = S^2$, which is an unbiased estimator of $\sigma^2$.

Thus, the test statistic is defined as

$$T = \frac{\text{MST}}{\text{MSE}}, \qquad (5.46)$$

and we will reject the null hypothesis for large values of $T$.

2. It can be shown that

$$T \overset{H_0}{\sim} F_{p-1,N-p}. \qquad (5.47)$$

(Note that the degrees of freedom of the $F$-distribution are those of MST and MSE.)

3. Since (1) the null hypothesis will be rejected for large values of the test statistic, and (2) its null distribution is $F_{p-1,N-p}$, the decision rule is


• $t_o = \frac{\text{MST}}{\text{MSE}} \leq F_{p-1,N-p;\alpha}$ $\longrightarrow$ accept $H_0$

• $t_o = \frac{\text{MST}}{\text{MSE}} > F_{p-1,N-p;\alpha}$ $\longrightarrow$ reject $H_0$, conclude $H_1$

Table 5.1: An ANOVA Table

Source      SS      df     MS      F-value    p-value
Treatment   SST     p-1    MST     MST/MSE    p
Error       SSE     N-p    MSE
Total       SSTot   N-1    MSTot

Since the null distribution is an $F$-distribution, the test is often called the $F$-test. Therefore, we will also often use the notation $F = \frac{\text{MST}}{\text{MSE}}$.

5.3.3 ANOVA Table

In an ANOVA table, the decomposition of the sums of squares and the result of the statistical test are summarized. In general the ANOVA table looks like Table 5.1. Sometimes MSTot is not mentioned in the table.
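The decomposition and the F-test can be reproduced with a few lines; a sketch with the same hypothetical Y, group and derived quantities as before:

Ybar <- mean(Y)                             # overall sample mean
SST  <- sum(n.i * (Ybar.i - Ybar)^2)        # treatment sum of squares (5.39)
SSE  <- sum((Y - Ybar.i[group])^2)          # error sum of squares (5.41)
Fobs <- (SST/(p - 1)) / (SSE/(N - p))       # F = MST/MSE
1 - pf(Fobs, p - 1, N - p)                  # p-value of the F-test
anova(aov(Y ~ group))                       # the same ANOVA table from the model fit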

5.4 Example

Dataset: nutrition.sdd
The daily energy expenditure (ENEX) is measured at three different time points (periods). In particular, at these three time points, independent samples are taken from the population.

An interesting question to the researcher is whether or not the mean ENEX is the same in the three periods, i.e.

$$H_0: \mu_1 = \mu_2 = \mu_3 \qquad (5.48)$$

versus

$$H_1: \text{at least two means are different.} \qquad (5.49)$$


Figure 5.1: A box-plot (left) and QQ-plots (right) of the ENEX at the 3 periods

(Note: the variable PERIOD is the factor variable; it has $p = 3$ levels in this example.)

The data are shown in Figure 5.1. The box-plot clearly suggests that the 3 variances are equal, and the QQ-plots illustrate that the observations in the 3 samples are normally distributed. Hence, all assumptions of the ANOVA are satisfied.

The S-Plus output is given below.

*** Analysis of Variance Model ***

Type III Sum of Squares

Df Sum of Sq Mean Sq F Value Pr(F)

PERIOD 2 13715701 6857851 28.19607 1.386768e-011

Residuals 212 51562663 243220

Estimated K Coefficients for K-level Factor:

$"(Intercept)":

(Intercept)

2358.651

$PERIOD:

period 1 period 2 period 3

356.2778 -103.1324 -253.1454

Tables of means


Grand mean

2335

PERIOD

period 1 period 2 period 3

2714.9 2255.5 2105.5

rep 63.0 73.0 79.0

Tables of adjusted means

Grand mean

2358.651

se 33.782

PERIOD

period 1 period 2 period 3

2714.9 2255.5 2105.5

se 62.1 57.7 55.5

Interpretation and conclusions:

1. From the ANOVA table we read $p = 1.386768 \times 10^{-11} \approx 0 < 0.05$. Hence, we may reject the null hypothesis very strongly. We conclude that, at the 5% level of significance, the mean ENEX is not the same in the three periods. (Later we will see methods to determine which means are different, and which means are not.)

2. Under the title “Estimated K Coefficients for K-level Factor:” we can find the parameter estimates. In particular,

$$\hat{\mu} = 2358.651 \qquad (5.50)$$
$$\hat{\tau}_1 = 356.2778 \qquad (5.51)$$
$$\hat{\tau}_2 = -103.1324 \qquad (5.52)$$
$$\hat{\tau}_3 = -253.1454 \qquad (5.53)$$

3. In the “Tables of means”, the sample means are found: $\bar{Y} = 2335$, $\bar{Y}_1 = 2714.9$, $\bar{Y}_2 = 2255.5$ and $\bar{Y}_3 = 2105.5$. You can also read the number of observations in each sample.


4. In the “Tables of adjusted means”, you can find the “Grand Mean”, which is exactly the same as $\hat{\mu} = (\bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3)/3$. The means given for the three periods are again the sample means $\bar{Y}_i$. The standard errors (se) that are given here are calculated only from the data of the corresponding sample, i.e. for sample $i$ the se is simply $S_i/\sqrt{n_i}$. We have seen that $S^2 = \text{MSE}$ is a better estimator for $\sigma^2$. Thus, better standard errors can be obtained by using $S = \sqrt{243220} = 493.17$ instead of $S_1$, $S_2$ and $S_3$, resulting in $S_{\hat{\mu}_1} = S/\sqrt{63} = 62.13$, $S_{\hat{\mu}_2} = S/\sqrt{73} = 57.72$ and $S_{\hat{\mu}_3} = S/\sqrt{79} = 55.49$. With these standard errors, confidence intervals of the corresponding $\mu_i$'s can be calculated. (Note: t-distributions with $N - p$ degrees of freedom must be used because $S^2$ is used as the variance estimator.)

5.5 The Kruskal-Wallis Test

When the observations in one of the $p$ populations are not normally distributed, the F-test in ANOVA and the confidence intervals may not be formally interpreted. However, when this occurs for population $i$ and the $i$-th sample size ($n_i$) is sufficiently large for applying the CLT in that sample, then F-tests and confidence intervals may still be interpreted. In the other cases, we can still rely on a nonparametric alternative to the F-test: the Kruskal-Wallis test. Like Wilcoxon's tests, the Kruskal-Wallis test is a rank test, which means that it is rather insensitive to outliers. We will not give the details of this test, but, as with all other tests, it can be interpreted by looking only at the p-value.
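Even though we skip the derivation, the standard form of the statistic is simple to sketch (ties ignored; Y, group, n.i and p as in the earlier hypothetical sketches):

R.i <- tapply(rank(Y), group, sum)            # rank sums per group
N   <- length(Y)
KW  <- 12/(N*(N + 1)) * sum(R.i^2/n.i) - 3*(N + 1)
1 - pchisq(KW, p - 1)                         # approximate p-value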

Example 5.2. We take the same example as in Section 5.4, but now we will act as if the observations were not normally distributed, such that we had to apply the Kruskal-Wallis test. Below is the S-Plus output.

Kruskal-Wallis rank sum test

data: ENEX and PERIOD from data set nutrition

Kruskal-Wallis chi-square = 45.3196, df = 2, p-value = 0

alternative hypothesis: two.sided

Also here, $p = 0 < 0.05$, which results in the same conclusion as in Section 5.4. □


5.6 Multiple Comparison of Means

5.6.1 Situation of the Problem: Multiplicity

Again we consider the example of Section 5.4. Based on the F-test we have concluded that at least two means are different, but we were not able to say which of the three means are different and which are equal. Methods that solve this type of question are called multiple comparison or pairwise comparison methods.

One solution, which at first sight may look very obvious, is to apply 2-sample t-tests to test the three null hypotheses $H_{0,1}: \mu_1 = \mu_2$, $H_{0,2}: \mu_2 = \mu_3$ and $H_{0,3}: \mu_1 = \mu_3$. (Note that all three null hypotheses are true if and only if the general null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ is true.) The type I error rate for this procedure is thus

$$P\{\text{type I error}\} = P\{\text{reject } H_{0,1} \text{ or reject } H_{0,2} \text{ or reject } H_{0,3} \mid H_0\}. \qquad (5.54)$$

Unfortunately, this probability is very hard to calculate, but basic probability calculus tells us that at least

$$P\{\text{reject } H_{0,1} \text{ or reject } H_{0,2} \text{ or reject } H_{0,3} \mid H_0\} \leq P\{\text{reject } H_{0,1} \mid H_0\} + P\{\text{reject } H_{0,2} \mid H_0\} + P\{\text{reject } H_{0,3} \mid H_0\}. \qquad (5.55)$$

Each of these three terms represents exactly the type I error rate that holds for the corresponding 2-sample t-test. Thus, if each of these t-tests is performed at the $\alpha$-level, then we have for the overall type I error rate

$$P\{\text{type I error}\} \leq \alpha + \alpha + \alpha = 3\alpha. \qquad (5.56)$$

Hence, the type I error rate is inflated, which is due to applying several tests to obtain a conclusion. This phenomenon is called multiplicity.

It is easy to extend the above discussion to the situation where $p$ means have to be compared in a pairwise fashion. In particular, when all $p$ means have to be compared, then $q = \frac{p!}{2(p-2)!}$ two-sample t-tests must be performed. The overall type I error rate is then (extending Equation 5.56)

$$P\{\text{type I error}\} \leq q\alpha. \qquad (5.57)$$


5.6.2 Bonferroni Correction

A simple approximate solution can be derived directly from Equation 5.57. Equation 5.57 suggests that each of the individual t-tests should be performed at a lower type I error rate. In particular, suppose that each of the $q$ t-tests is performed at some $\alpha_t$-level. If we want to have the overall type I error rate controlled at $\alpha$, then we could take $\alpha_t = \frac{\alpha}{q}$. Equation 5.57 gives

$$P\{\text{type I error}\} \leq q\alpha_t = \alpha. \qquad (5.58)$$

This guarantees that the overall type I error rate does not exceed $\alpha$. However, it turns out that, when $q$ increases, the difference between the true type I error rate and the upper bound $\alpha$ typically becomes larger and larger. (The property of having a testing procedure that results in a type I error rate smaller than the nominal $\alpha$-level is called conservativeness.) Thus, the Bonferroni correction is actually too strong a correction procedure, but it is a safe method which has the advantage of being very simple to apply.

In the above procedure 2-sample t-tests have to be performed. When the ANOVA assumptions hold, however, the data of all $p$ samples may be used to estimate the common variance $\sigma^2$. Therefore, $S^2 = \text{MSE}$ is used to calculate the t-tests. The null distribution of the t-tests is thus $t_{N-p}$ instead of $t_{n_i+n_j-2}$.

Since there exists an equivalence between $\alpha$-level t-tests and $(1 - \alpha)$-level confidence intervals, the results of a multiple comparison procedure may also be presented as $(1 - \alpha)$-simultaneous confidence intervals of the true differences in means $\mu_i - \mu_j$ ($i \neq j$). In particular, the lower and upper limits of the simultaneous $(1 - \alpha)$-confidence interval of $\mu_i - \mu_j$ are given by

$$L = \hat{\mu}_i - \hat{\mu}_j - \sqrt{\frac{S^2}{n_i} + \frac{S^2}{n_j}}\; t_{N-p,(\alpha/q)/2} \qquad (5.59)$$
$$U = \hat{\mu}_i - \hat{\mu}_j + \sqrt{\frac{S^2}{n_i} + \frac{S^2}{n_j}}\; t_{N-p,(\alpha/q)/2}. \qquad (5.60)$$

Note that $t_{N-p,(\alpha/q)/2}$ is the critical value of the individual t-tests in the Bonferroni correction procedure. Confidence intervals that do not contain 0 correspond to $\mu_i \neq \mu_j$ (this is the equivalence between confidence intervals and statistical tests).
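A sketch of one Bonferroni-corrected interval (hypothetical quantities as in the earlier sketches; i and j are the indices of the pair being compared):

alpha <- 0.05
q   <- p*(p - 1)/2                         # number of pairwise comparisons
tc  <- qt(1 - (alpha/q)/2, N - p)          # Bonferroni critical point
est <- Ybar.i[i] - Ybar.i[j]               # estimated difference in means
hw  <- tc * sqrt(S2/n.i[i] + S2/n.i[j])    # half-width of the interval
c(est - hw, est + hw)                      # Equations 5.59-5.60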

Example 5.3. Again we consider the example of Section 5.4. The S-Plus output is given below.


95 % simultaneous confidence intervals for specified

linear combinations, by the Bonferroni method

critical point: 2.4131

response variable: ENEX

intervals excluding 0 are flagged by '****'

Estimate Std.Error Lower Bound Upper Bound

period 1-period 2 459 84.8 255.0 664 ****

period 1-period 3 609 83.3 408.0 810 ****

period 2-period 3 150 80.1 -43.2 343

The critical point is $t_{N-p,(\alpha/q)/2} = t_{212,0.008333} = 2.4131$. From the simultaneous confidence intervals, we conclude that the mean ENEX in period 1 is significantly different from the mean ENEX in periods 2 and 3. There is insufficient evidence that there would be a difference in mean ENEX between periods 2 and 3; therefore we consider them equal (accepting $H_{0,2}: \mu_2 = \mu_3$).

Finally, it is interesting to remark that the key step in the Bonferroni method (Equation 5.55) does not depend on the type of the $q$ individual statistical tests. Thus, also when the data are not normally distributed, the Bonferroni method may be applied to e.g. Wilcoxon rank sum tests.

5.6.3 The Tukey Method

When all the ANOVA assumptions hold, the Tukey method is definitely the best choice to correct for multiplicity. The Tukey method consists in the construction of simultaneous confidence intervals for all $q$ differences $\mu_i - \mu_j$ in an exact way, i.e. the method guarantees that the coverage probability of the Tukey simultaneous confidence intervals is exactly $1 - \alpha$ (Bonferroni only guarantees that this coverage probability is not smaller than $1 - \alpha$). Then, by the equivalence between statistical tests and confidence intervals, it can be directly deduced which means are different and which are equal.

Example 5.4. Again we consider the example of Section 5.4. The S-Plus output is given below.

95 % simultaneous confidence intervals for specified


linear combinations, by the Tukey method

critical point: 2.3603

response variable: ENEX

intervals excluding 0 are flagged by '****'

Estimate Std.Error Lower Bound Upper Bound

period 1-period 2 459 84.8 259 660 ****

period 1-period 3 609 83.3 413 806 ****

period 2-period 3 150 80.1 -39 339

The same conclusion as with the Bonferroni method is obtained. As for the Bonferroni method, the critical point (here: 2.3603) is the critical value to be used in the individual 2-sample t-tests (or, equivalently, in the calculation of the simultaneous confidence intervals). □

5.6.4 LSD method

Finally, we briefly comment on the LSD (Least Significant Difference) method of Fisher. This method consists in applying 2-sample t-tests to test the partial hypotheses $\mu_i = \mu_j$ without correcting for multiplicity. Thus, this is not a correction procedure!

5.7 Two-way ANOVA

In this section, the ANOVA model of Section 5.1.2 is extended such that it includes the effects of two factor variables.

Example 5.5. Dataset: nutrition.sdd.
With the ANOVA model that we have seen so far, it is possible to model e.g. the mean QI for the 3 populations defined by the 3 periods. It is also possible to use the ANOVA model to model the mean QI for the 2 populations defined by the 2 genders (men and women). It is, however, also possible to consider the data as samples from 6 populations which are defined by the combinations of the 3 periods and the 2 genders. Figure 5.2 shows the data in this manner.


Figure 5.2: Box-plots of the 6 samples, obtained by considering all combinations of periods and genders

5.7.1 Two-way ANOVA Model

We consider two factor variables, denoted by $T$ and $B$. Suppose $T$ has $t$ levels, and $B$ has $b$ levels. By combining both factors, $tb$ populations are defined. We will allow these $tb$ populations to differ in mean, but we will assume that all $tb$ populations have the same variance $\sigma^2$. In particular, the population defined by the $i$-th level of factor $T$ and the $j$-th level of factor $B$ is assumed to be a normal distribution with variance $\sigma^2$ and mean $\mu_{ij}$. An observation from the sample from population $ij$ is denoted by $Y_{ijk}$, where $k = 1, \ldots, n_{ij}$. Thus, the model becomes

Yijk = µij + εijk, (5.61)

where εijk i.i.d. N(0, σ2) (i = 1, . . . , t; j = 1, . . . , b; k = 1, . . . , nij).

As in the one-factor ANOVA, the population means are modelled in terms of effects. Since we now have two factors, we will consider two sets of effects. Since there are bt possibly different population means $\mu_{ij}$, there should be exactly as many independent parameters parameterizing these means. By using factor effects, we could consider the additive model

$\mu_{ij} = \mu + \tau_i + \beta_j$,    (5.62)

where for both sets of effects the $\Sigma$-restrictions hold, i.e. $\sum_{i=1}^{t} \tau_i = \sum_{j=1}^{b} \beta_j = 0$. Although this model might hold in reality, there are only $1 + (t-1) + (b-1) = t + b - 1$ independent parameters involved. Imposing such a model implies a restriction (additivity of the factor effects).

A possible extension of Model 5.62 is

$\mu_{ij} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij}$,    (5.63)

where $\tau_i$ and $\beta_j$ are as before, and on the interaction parameters we impose the $\Sigma$-restrictions $\sum_{i=1}^{t} (\tau\beta)_{ij} = \sum_{j=1}^{b} (\tau\beta)_{ij} = 0$, so that only $(t-1)(b-1)$ of them are independent. Hence, we now have $1 + (t-1) + (b-1) + (t-1)(b-1) = tb$ independent parameters. The model in Equation 5.63 is called the saturated model. The effects $(\tau\beta)_{ij}$ are called the interaction effects (between the factors T and B), and the effects $\tau_i$ and $\beta_j$ are called the main effects.

5.7.2 Interaction

Example 5.6. Example 5.5 continued. In terms of this example we will explain the interpretation of interaction between the two factors (period and gender).

• Suppose there is no interaction, i.e. the additive model of Equation 5.62 applies. In terms of the models in Equations 5.62 and 5.63 this means that all $(\tau\beta)_{ij} = 0$. Thus, the mean of the i-th period and the j-th gender is given by adding the corresponding effects: $\mu_{ij} = \mu + \tau_i + \beta_j$. Or, put differently, the effect of period i is the same for both genders ($j = 1$ and $j = 2$).

Consider the difference in means of period i between e.g. the two genders, i.e.

$\mu_{i1} - \mu_{i2} = (\mu + \tau_i + \beta_1) - (\mu + \tau_i + \beta_2) = \beta_1 - \beta_2$,    (5.64)

which is independent of period i. Hence, the difference in means between the two genders is always $\beta_1 - \beta_2$, irrespective of the period. A similar reasoning holds for the differences in means of e.g. men ($j = 1$, say) between two periods; e.g.

$\mu_{11} - \mu_{21} = (\mu + \tau_1 + \beta_1) - (\mu + \tau_2 + \beta_1) = \tau_1 - \tau_2$,    (5.65)

and the same difference is found for the women (j = 2).

The level plot in the left panel of Figure 5.3 illustrates the no-interaction situation: parallel lines.


Figure 5.3: Level plots of an additive model (left) and an interaction model (right), showing the mean QI for all combinations of gender and period

• When there is an interaction effect of the two factors on the mean, the lines in the level plot will typically no longer be parallel. This is illustrated in the right panel of Figure 5.3. In that figure we see that the difference in mean between men and women is larger in period 2 than in period 1, and in period 3 the difference even has the opposite sign. Thus, the effects of gender are not additive, in the sense that the effect of gender differs from period to period. A consequence is that we cannot conclude anything about the effect of gender without specifying the period.

The interpretation of interaction can also be seen in the saturated model (Equation 5.63). The difference in mean in period i between men and women is given by

$\mu_{i1} - \mu_{i2} = (\mu + \tau_i + \beta_1 + (\tau\beta)_{i1}) - (\mu + \tau_i + \beta_2 + (\tau\beta)_{i2})$    (5.66)
$= \beta_1 - \beta_2 + ((\tau\beta)_{i1} - (\tau\beta)_{i2})$.    (5.67)

Thus, this difference indeed depends on the period.

In general, when there is interaction, the effect of factor T depends on the level j of factor B, and, equivalently, the effect of factor B depends on the level i of factor T.


5.7.3 Parameter Estimation

The method of least squares can be extended directly, resulting in point estimators $\hat{\mu}$, $\hat{\tau}_i$, $\hat{\beta}_j$ and $\widehat{(\tau\beta)}_{ij}$. Again it can be shown that these estimators are unbiased and consistent. Furthermore, they are normally distributed and their variance is again proportional to $\sigma^2$. The residual variance $\sigma^2$ can again be estimated by $S^2 = \mathrm{MSE}$ (see later).

Confidence intervals (i.e. interval estimators) can also be calculated as before. Only the degrees of freedom have to be changed to the number of degrees of freedom associated with MSE (see later).

5.7.4 Decomposition of the Total Sum of Squares

In Section 5.3.1 the total sum of squares was introduced. It is straightforwardly extended to

$\mathrm{SSTot} = \sum_{i=1}^{t} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} (Y_{ijk} - \bar{Y})^2$.    (5.68)

As before, SSTot does not depend on the statistical model. It is only a measure of the total variability in the dataset (pooled over all samples).

The decomposition of SSTot depends on the statistical model.

• Suppose the additive model (Equation 5.62) is used. Then the decomposition is

$\mathrm{SSTot} = \mathrm{SST} + \mathrm{SSB} + \mathrm{SSE}$,    (5.69)

where SST and SSB are the sums of squares of factor T and factor B, respectively. They are associated with $t-1$ and $b-1$ degrees of freedom, and SSTot is associated with $N-1$ degrees of freedom. The degrees of freedom of the residual sum of squares (SSE) are most easily calculated as $(N-1) - (t-1) - (b-1)$.

• Suppose the saturated model (Equation 5.63) is used. Then the decomposition is

$\mathrm{SSTot} = \mathrm{SST} + \mathrm{SSB} + \mathrm{SSTB} + \mathrm{SSE}$,    (5.70)

where SSTB is the sum of squares of the interaction between the factors T and B. It is associated with $(t-1)(b-1)$ degrees of freedom. Thus the degrees of freedom of SSE are calculated as $(N-1) - (t-1) - (b-1) - (t-1)(b-1)$.

5.7.5 Statistical Tests

Statistical tests can be constructed in a similar way as for the one-way ANOVA, but now three different null hypotheses may be of interest: the no-effect hypothesis of factor T, the no-effect hypothesis of factor B, and the no-interaction-effect hypothesis of TB.

Before we give some more details on the construction of these tests, we will argue that the no-interaction hypothesis must be tested first.

Suppose there is an interaction effect of TB. Then we have already shown that the difference in means between two levels of factor T depends on the level of factor B. Hence, it is meaningless to say something about the main effect of factor T without specifying a level of the other factor, i.e. the parameters $\tau_i$ have no clear interpretation in the presence of interaction (see also Equation 5.67). The same reasoning applies to the main effect of factor B. Thus, in conclusion, we will always first test for interaction, and when we conclude that interaction is present, we will not test the main effects. When, on the other hand, no interaction turns out to be present, we will first eliminate the interaction effects from the saturated model (i.e. we change to the additive model) before testing for the main effects.

The no-interaction null hypothesis is

$(\tau\beta)_{11} = (\tau\beta)_{12} = \ldots = (\tau\beta)_{tb} = 0$.    (5.71)

The test (F-test) is based on

$F = \dfrac{\mathrm{MSTB}}{\mathrm{MSE}} \overset{H_0}{\sim} F_{(t-1)(b-1),\; N-t-b-(t-1)(b-1)+1}$.    (5.72)

The test for the main effects of factor T is based on

$F = \dfrac{\mathrm{MST}}{\mathrm{MSE}} \overset{H_0}{\sim} F_{t-1,\; N-t-b+1}$.    (5.73)

(Note that the degrees of freedom of MSE are here those of the additive model, i.e. computed after the interaction effects have been eliminated.) In a similar way,


Table 5.2: The ANOVA Table of the interaction model

Source   SS      df                    MS      F-value     p-value
T        SST     t-1                   MST     MST/MSE     p
B        SSB     b-1                   MSB     MSB/MSE     p
TB       SSTB    (t-1)(b-1)            MSTB    MSTB/MSE    p
Error    SSE     (N-1)-(t-1)-(b-1)     MSE
                 -(t-1)(b-1)
Total    SSTot   N-1                   MSTot

the test for the main effects of factor B is based on

$F = \dfrac{\mathrm{MSB}}{\mathrm{MSE}} \overset{H_0}{\sim} F_{b-1,\; N-t-b+1}$.    (5.74)

5.7.6 The ANOVA Table

The ANOVA table of the saturated model is of the form given in Table 5.2. For the additive model, only the TB line must be deleted and the degrees of freedom of SSE must be increased by $(t-1)(b-1)$.
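As a sketch, the testing strategy of Section 5.7.5 could be carried out in R as follows (assuming again a data frame nutrition with columns QI, SEX and PERIOD; note that base R reports sequential rather than Type III sums of squares, so the numbers only coincide with the S-Plus output below for balanced designs).

# Step 1: fit the saturated model and look at the interaction line only
fit.sat <- lm(QI ~ SEX * PERIOD, data = nutrition)
anova(fit.sat)                  # test the SEX:PERIOD line first

# Step 2: if the interaction is not significant, refit the additive
# model and interpret the main effects of SEX and PERIOD
fit.add <- lm(QI ~ SEX + PERIOD, data = nutrition)
anova(fit.add)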

5.7.7 Example

Example 5.5 continued. The main purpose of the analysis is to test whether or not the mean QI depends on the gender and/or on the period. We start the analysis with the interaction model (Equation 5.63) (although all assumptions should of course be assessed as well, we do not show these analyses here).

The S-Plus output is shown below.

*** Analysis of Variance Model ***

Type III Sum of Squares

Df Sum of Sq Mean Sq F Value Pr(F)

SEX 1 70.713 70.71346 8.543015 0.0038503

PERIOD 2 28.876 14.43795 1.744273 0.1773064

SEX:PERIOD 2 25.786 12.89308 1.557635 0.2130687


Residuals 209 1729.965 8.27734

Estimated K Coefficients for K-level Factor:

$"(Intercept)":

(Intercept)

22.05973

$SEX:

men women

-0.6195476 0.6195476

$PERIOD:

period 1 period 2 period 3

0.08609352 0.4353718 -0.5214653

$"SEX:PERIOD":

menperiod 1 womenperiod 1 menperiod 2 womenperiod 2 menperiod 3

0.08940449 -0.08940449 0.4062557 -0.4062557 -0.4956602

womenperiod 3

0.4956602

In the ANOVA table we only look at the interaction effect. There we read $p = 0.213$. Since $p = 0.213 > 0.05$ we conclude that, at the 5% level of significance, there is no significant interaction effect between period and sex on the mean QI.

Since we have concluded that there is no interaction, we can adopt the additive model (Equation 5.62). The S-Plus output is given below.

*** Analysis of Variance Model ***

Type III Sum of Squares

Df Sum of Sq Mean Sq F Value Pr(F)

SEX 1 70.877 70.87659 8.517701 0.0038983

PERIOD 2 12.363 6.18139 0.742858 0.4769926

Residuals 211 1755.751 8.32109

Estimated K Coefficients for K-level Factor:

$"(Intercept)":

(Intercept)


22.07786

$SEX:

men women

-0.6202568 0.6202568

$PERIOD:

period 1 period 2 period 3

0.05222359 0.255565 -0.3077886

Now that the interaction effects are eliminated from the model, we can safely look at the main effects of both period and sex. For period, we read from the ANOVA table $p = 0.477$, which is greater than 0.05. Thus we conclude that, at the 5% level of significance, there are no differences in mean QI between the three periods. For sex, we have $p = 0.004 < 0.05$, which results in the rejection of the corresponding null hypothesis. Thus we may conclude that, at the 5% level of significance, the mean QI of the men is different from the mean QI of the women. Moreover, from the parameter estimates ($\hat{\beta}_1 = -0.620$ and $\hat{\beta}_2 = 0.620$) we may even conclude that the mean QI of men is smaller than the mean QI of women.


Chapter 6

Regression Analysis

6.1 Introductory Example

Dataset: fatplasma.sdd. Three different diets are compared w.r.t. the change (reduction) in fat concentration in blood plasma. The diets differ in their fat content. Since one suspects that the fat reduction very probably depends on the age of the subject, the study was designed such that one person of each of 5 age categories was included. Figure 6.1 shows the interaction plot. This plot indeed suggests that fat reduction depends on age. The significance of this observation can be assessed by means of the ANOVA methods discussed in the previous chapter. However, since the dataset even contains the exact ages of the subjects (i.e. not only the age category, which is a factor, but also the exact numeric value of the age), another statistical technique can be used: regression analysis.

For illustrating the regression methods, we will only consider the data of the "extremely low fat" diet. Figure 6.2 shows a scatter plot of the fat reduction w.r.t. the age of the subject. The plot suggests that there is more or less a linear relation between both (continuous) variables. In particular, a regression model will model the mean of the fat reduction distribution as a function of the age. If this specific linear association between both variables indeed exists, and if it is possible to estimate it based on a sample, then this linear relation can be used e.g. to estimate the mean fat reduction for people of an age that was not included in the sample used for the estimation (e.g. for a person of 40 years old).


Figure 6.1: Interaction plot of the fatplasma.sdd dataset

Figure 6.2: A scatter plot of fat reduction w.r.t. the age of the subjects

Thus, regression analysis will allow us to predict outcomes. Such a prediction can be considered as a point estimate. The regression methods will also allow us to calculate confidence intervals on such predictions.

In the same way as we reasoned with ANOVA, it is not because we see a linear relation in Figure 6.2, which is entirely based on a very small sample, that in reality there is indeed a true linear relation between age and mean fat reduction. In order to demonstrate that a linear relation is present, the regression analysis will provide us with statistical tests.

In this example, the age is referred to as the independent variable, the predictor, or the regressor. The fat reduction is again referred to as the dependent variable, or the response variable.


6.2 The Regression Model

6.2.1 Reformulation of the ANOVA Model

Example 6.1. The introductory example of Section 6.1 continued. If we consider the age as a factor and the fat reduction as the dependent variable, then the ANOVA model would be

$Y_{ij} = \mu + \tau_i + \varepsilon_{ij} = \mu_i + \varepsilon_{ij}$,    (6.1)

($i = 1, \ldots, 5$; $j = 1, \ldots, n_i$), where $\varepsilon_{ij}$ i.i.d. $N(0, \sigma^2)$. The model is equivalent to

$Y_{ij}$ i.i.d. $N(\mu_1, \sigma^2)$ if the age of subject $(i, j)$ is 15    (6.2)
$Y_{ij}$ i.i.d. $N(\mu_2, \sigma^2)$ if the age of subject $(i, j)$ is 25    (6.3)
$Y_{ij}$ i.i.d. $N(\mu_3, \sigma^2)$ if the age of subject $(i, j)$ is 36    (6.4)
$Y_{ij}$ i.i.d. $N(\mu_4, \sigma^2)$ if the age of subject $(i, j)$ is 47    (6.5)
$Y_{ij}$ i.i.d. $N(\mu_5, \sigma^2)$ if the age of subject $(i, j)$ is 60    (6.6)

Thus, the distribution of $Y_{ij}$ is always normal with the same variance $\sigma^2$; only the mean depends on the age of the subject. Let $X_{ij}$ denote the age of the corresponding subject. Then we could write

$\mu_i = \mu(X_{ij})$.    (6.7)

With this notation, the index i becomes obsolete, because its function is taken over by the specific age, which is given by the (independent) variable X. Let N denote the total number of observations, i.e. $N = \sum_{i=1}^{p} n_i$ (here, $p = 5$). Then, we adopt the following notation: the index i refers directly to the subject. Thus, we have $i = 1, \ldots, N$. The corresponding fat reductions are denoted by $Y_i$ and the ages by $X_i$.

With this new notation, the original ANOVA model can be written as

$Y_i = \mu(X_i) + \varepsilon_i$,    (6.8)

($i = 1, \ldots, N$), where $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$ and where $X_i$ takes one of the values in $\{15, 25, 36, 47, 60\}$. The latter restriction is still needed here to make the model equivalent to the ANOVA model, which is defined for only 5 different age levels. In a linear regression model, however, we will


assume that the mean $\mu(X_i)$ is a linear function of the independent variable $X_i$. In particular,

$\mu(X_i) = \mu + \beta X_i$,    (6.9)

where the parameters $\mu$ and $\beta$ are referred to as the intercept (or constant) and the regression coefficient (or slope), respectively. Furthermore, Equation 6.9 is often assumed to hold not only for the ages $X_i$ which are present in the dataset, but, more generally, for a whole interval of ages (e.g. ages in the interval [15, 64]). □

6.2.2 The Regression Model

In general the regression model is defined as

$Y_i = \mu + \beta X_i + \varepsilon_i$,    (6.10)

($i = 1, \ldots, N$), where $\mu$ and $\beta$ are parameters, and where $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$.

Thus, the major difference between a regression model and an ANOVA model is that now the mean of the dependent variable is determined by a continuous linear function of the independent variable (X), whereas with ANOVA the mean of the dependent variable was determined by the level of a factor variable.

6.3 Parameter Estimation

6.3.1 Least Squares Criterion

The unknown parameters in the regression model (Equation 6.10) are estimated by exactly the same method as for the ANOVA model: least squares. The unknown regression parameters are now $\mu$ and $\beta$; the parameter estimates are denoted by $\hat{\mu}$ and $\hat{\beta}$, respectively.

The fitted value, sometimes also called the predicted value, is denoted by $\hat{Y}_i = \hat{Y}(\hat{\mu}, \hat{\beta}, X_i)$ (the latter notation stresses that the prediction depends on the parameters as well as on the independent variable $X_i$). Thus,

$\hat{Y}_i = \hat{Y}(\hat{\mu}, \hat{\beta}, X_i) = \hat{\mu} + \hat{\beta} X_i$.    (6.11)


As with ANOVA, the residuals are given by

$e_i = e_i(\hat{\mu}, \hat{\beta}) = Y_i - \hat{Y}_i(\hat{\mu}, \hat{\beta}) = Y_i - \hat{Y}_i$.    (6.12)

Based on a sample, the least squares criterion is given by

$L(\mu, \beta) = \sum_{i=1}^{N} e_i^2(\mu, \beta)$.    (6.13)

The least squares parameter estimators are those values of $\mu$ and $\beta$ that minimize the least squares criterion $L(\mu, \beta)$ (i.e. those values of the unknown parameters that minimize the total squared error).

Explicit formulae for the parameter estimates exist, but we will not give them here.

The residual variance $\sigma^2$ is estimated as

$\hat{\sigma}^2 = S^2 = \dfrac{L(\hat{\mu}, \hat{\beta})}{N - 2}$.    (6.14)

Note that $N - 2$ is the total number of observations minus the number of unknown parameters in the regression model ($\mu$ and $\beta$). The residual variance of the ANOVA model was estimated in a similar way, i.e. the minimized least squares criterion divided by the total number of observations minus the number of unknown parameters ($\mu_1, \ldots, \mu_p$): $N - p$. As with ANOVA, $N - 2$ is referred to as the degrees of freedom of the variance estimator.
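In practice these computations are done by software. A minimal sketch in R, using the data of the example of Section 6.1 (the data frame name fatplasma.Extremely.low is taken from the S-Plus output shown later):

# Least squares fit of the simple linear regression model (6.10)
fit <- lm(FATREDUC ~ AGE, data = fatplasma.Extremely.low)
coef(fit)            # estimates of mu (intercept) and beta (slope)
summary(fit)$sigma   # S, the residual standard error on N - 2 df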

6.3.2 Statistical Properties of the Estimators

We will give here only a very general treatment of the statistical properties of the estimators (more is not needed for the interpretation of the output of statistical software).

Let $\theta$ denote any of the parameters $\mu$ or $\beta$, and let $\hat{\theta}$ denote the corresponding least squares estimator. Then it can be shown that

$\hat{\theta} \sim N(\theta,\; g(X_1, \ldots, X_N)\,\sigma^2)$,    (6.15)

where $g(X_1, \ldots, X_N)$ is a function of the observed independent variables for which $g(X_1, \ldots, X_N) \to 0$ as the sample size $N \to \infty$.


Thus, the parameter estimators are unbiased and consistent. Since $\mathrm{Var}[\hat{\theta}]$ is also a function of the independent variables, we may conclude that $\mathrm{Var}[\hat{\theta}]$ is a function of the study design, i.e. the accuracy of the parameter estimators depends on the choice of the values of the independent variables. We will not go into detail here, but this property opens up a whole range of techniques that allow us to choose the values of the independent variables such that $\mathrm{Var}[\hat{\theta}]$ is minimized, i.e. maximal accuracy is obtained.

As with the ANOVA model parameter estimators, the regression model parameter estimators have the property that their variance is proportional to the residual error variance $\sigma^2$. Moreover, $\sigma^2$ is the only unknown parameter in $\mathrm{Var}[\hat{\theta}]$. By replacing $\sigma^2$ by its estimator $\hat{\sigma}^2 = S^2$ (Equation 6.14), the normal distribution becomes a t-distribution. In particular,

$\dfrac{\hat{\theta} - \theta}{s_{\hat{\theta}}} \sim t_{N-2}$,    (6.16)

where $s_{\hat{\theta}} = \sqrt{g(X_1, \ldots, X_N)\,\hat{\sigma}^2}$. Based on Equation 6.16, confidence intervals (i.e. interval estimates) and statistical tests can be constructed.

6.3.3 Interval Estimators (confidence intervals)

From Equation 6.16, lower and upper limits of confidence intervals for $\mu$ and $\beta$ can be calculated. In particular, the $(1-\alpha)$ confidence interval for $\theta$, which is any of $\mu$ or $\beta$, is given by

$L = \hat{\theta} - s_{\hat{\theta}}\, t_{N-2;\alpha/2}$    (6.17)
$U = \hat{\theta} + s_{\hat{\theta}}\, t_{N-2;\alpha/2}$.    (6.18)
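In R these interval estimates are available directly (a sketch, with fit the simple regression fit from above):

# (1 - alpha) confidence intervals for mu and beta:
confint(fit, level = 0.95)
# Or by hand for the slope, following Equations 6.17-6.18:
est <- coef(summary(fit))      # estimates and standard errors
est["AGE", "Estimate"] +
  c(-1, 1) * qt(0.975, df = fit$df.residual) * est["AGE", "Std. Error"]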

6.3.4 Statistical Tests

Since there is an equivalence between $(1-\alpha)$ confidence intervals and $\alpha$-level statistical tests, we can immediately give the statistical test for testing

$H_0: \theta = 0$,    (6.19)


where $\theta$ is any of $\mu$ or $\beta$. Again based on Equation 6.16, we find the null distribution of the test statistic,

$T = \dfrac{\hat{\theta}}{s_{\hat{\theta}}} \overset{H_0}{\sim} t_{N-2}$.    (6.20)

Depending on the specific alternative hypothesis (one-sided or two-sided), the decision rule is found. For instance, for a two-sided alternative hypothesis ($H_1: \theta \neq 0$), the decision rule is

• $|t_o| = |\hat{\theta}/s_{\hat{\theta}}| \leq t_{N-2;\alpha/2}$ → accept $H_0$

• $|t_o| = |\hat{\theta}/s_{\hat{\theta}}| > t_{N-2;\alpha/2}$ → reject $H_0$, conclude $H_1$.

In practice, often only the hypothesis $H_0: \beta = 0$ is of interest.

6.3.5 Example

Example 6.1 continued. Although we have not yet seen everything about regression analysis, we already give a small example. Below the S-Plus output is given for the example of Section 6.1.

*** Linear Model ***

Call: lm(formula = FATREDUC ~ AGE, data = fatplasma.Extremely.low,

na.action = na.exclude)

Residuals:

1 10 5 14 9

0.06946 -0.008626 -0.1575 0.0736 0.02309

Coefficients:

Value Std. Error t value Pr(>|t|)

(Intercept) 0.3484 0.1226 2.8409 0.0656

AGE 0.0208 0.0031 6.7672 0.0066

Residual standard error: 0.109 on 3 degrees of freedom

Multiple R-Squared: 0.9385

F-statistic: 45.8 on 1 and 3 degrees of freedom, the p-value is 0.006593


In this output we find:

• The residuals $e_i$ for all 5 observations. (Note that the numbers above the residuals are the observation numbers (identification numbers) in the original dataset, i.e. the fatplasma.sdd dataset before splitting according to AGE.CLASS.)

As with ANOVA, the assumption of normality ($\varepsilon_i$ i.i.d. $N(0, \sigma^2)$) can be assessed by looking at the QQ-plot of these residuals (we have not done this here because it is quite meaningless with as few as 5 observations).

• The parameter estimates are

$\hat{\mu} = 0.3484$    (6.21)
$\hat{\beta} = 0.0208$    (6.22)

and their respective estimated standard deviations are

$s_{\hat{\mu}} = 0.1226$    (6.23)
$s_{\hat{\beta}} = 0.0031$.    (6.24)

From the estimated parameters, the estimated regression model can be written down:

$\hat{Y} = \hat{\mu} + \hat{\beta} X$    (6.25)
$= 0.3484 + 0.0208\, X$.    (6.26)

The interpretation of the model is as follows: for each increase in age of 1 year, the mean fat reduction increases by 0.0208 units.

• The test of $H_0: \beta = 0$ against $H_1: \beta \neq 0$ (two-sided) is performed by calculating $t_o = \hat{\beta}/s_{\hat{\beta}} = 6.7672$ (this is directly read from the output). We know that the null distribution of the test statistic is $t_{N-2}$. Thus, at $\alpha = 0.05$, the critical value is $t_{N-2;\alpha/2} = t_{3;0.025} = 3.1824$. Since $|t_o| > 3.1824$ we reject the null hypothesis at the 5% level of significance, in favour of the alternative hypothesis. Thus, we conclude that there is a significant (positive) linear relation between the age and the mean fat reduction. This could also have been concluded from the p-value given by S-Plus: $p = 0.0066 < 0.05$ (since p is very small, the conclusion of a positive linear relation can even be stated very strongly). See also the numerical check after this list.


• Further, S-Plus also gives the "Residual standard error": $S = \hat{\sigma} = 0.109$ (with $N - 2 = 3$ degrees of freedom).

• The "Multiple R-Squared" and the "F-statistic" are discussed later.

• Finally, S-Plus does not calculate confidence intervals for the parameters. But these can, of course, be calculated very easily by applying the methods of Section 6.3.3. The lower and upper limits of the 95% confidence interval for $\beta$ are given by

$L = \hat{\beta} - s_{\hat{\beta}}\, t_{3;0.025}$    (6.27)
$U = \hat{\beta} + s_{\hat{\beta}}\, t_{3;0.025}$,    (6.28)

where $\hat{\beta} = 0.0208$, $s_{\hat{\beta}} = 0.0031$ and $t_{3;0.025} = 3.1824$. Thus, the 95% confidence interval becomes

$[0.0109;\; 0.0307]$.    (6.29)

(Note that 0 is not within the confidence interval, which is equivalent to concluding that $\beta \neq 0$ at the 5% level of significance.)
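The critical value, the p-value and the confidence interval quoted in this list are easily verified numerically (a quick check in R, with the numbers taken from the output):

qt(1 - 0.05/2, df = 3)                 # t_{3;0.025} = 3.1824
2 * pt(-abs(6.7672), df = 3)           # two-sided p-value, approx. 0.0066
0.0208 + c(-1, 1) * 3.1824 * 0.0031    # 95% CI: [0.0109; 0.0307]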

6.4 Predictions

In Section 6.1 we have built the regression model from an ANOVA model by modelling the mean of the response variable as $\mu(X_i) = \mu + \beta X_i$, where $X_i$ takes values in a set of p values (e.g. $\{15, 25, 36, 47, 60\}$ in the example). However, the model $\mu(X) = \mu + \beta X$ also allows us to predict the mean response at other values of the predictor X (e.g. predicting the mean fat reduction for a subject of 50 years of age). For an arbitrary X, the corresponding prediction of the mean response is simply given by

$\hat{Y} = \hat{Y}(X) = \hat{\mu}(X) = \hat{\mu} + \hat{\beta} X$,    (6.30)

where, as before, $\hat{\mu}$ and $\hat{\beta}$ are the estimators of $\mu$ and $\beta$ based on the sample, which does not necessarily include the value of X.

The prediction $\hat{Y}(X)$ is to be interpreted as the estimate of the mean of the response variable Y at the value X of the predictor, i.e. $\hat{Y}(X)$ is an estimator of $\mu(X) = \mu + \beta X$. Therefore, we will often write $\hat{\mu}(X)$ instead, to stress the fact that it is an estimator of the mean.


Since $\hat{\mu}(X)$ is an estimator (in particular a point estimator), a confidence interval (interval estimator) can be calculated as well. The lower and upper limits of the $(1-\alpha)$ confidence interval are given by

$L = \hat{\mu}(X) - s_{\hat{\mu}(X)}\, t_{N-2;\alpha/2}$    (6.31)
$U = \hat{\mu}(X) + s_{\hat{\mu}(X)}\, t_{N-2;\alpha/2}$,    (6.32)

where the variance estimator $s^2_{\hat{\mu}(X)}$ is calculated as (without proof)

$s^2_{\hat{\mu}(X)} = s^2_{\hat{\mu}} + X^2 s^2_{\hat{\beta}} - 2X \dfrac{\bar{X}}{(n-1)s_X^2}\,\hat{\sigma}^2$,    (6.33)

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean of the predictors and $s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ is the sample variance of the predictors. Equation 6.33 can also be written as (without proof)

$s^2_{\hat{\mu}(X)} = \left(\dfrac{1}{n} + \dfrac{(X - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right)\hat{\sigma}^2$.    (6.34)

(Note that $s^2_{\hat{\mu}(X)}$ is also proportional to $\hat{\sigma}^2$.)

Sometimes, however, it is of more interest to predict one single observation of Y at a given value X, rather than predicting the mean $\mu(X)$ at X. It is obvious that the point estimator is again given by Equation 6.30 (this is our best possible guess of an observation of Y at X). The interval estimator, on the other hand, must take additional uncertainty into account. This can best be understood as follows. When a mean is estimated, the uncertainty (i.e. variance) decreases as the sample size increases; in the limit, when the sample size becomes infinitely large, the variance of the estimator of the mean becomes zero (consistent estimator). When, on the other hand, a single observation is to be predicted, we must realize that the statistical model states that, for a given value of X, $\mathrm{Var}\{Y\} = \sigma^2$, i.e. the variance of one single observation does not depend on the sample size and is always equal to the residual variance $\sigma^2$. It is thus exactly this variance that has to be added to the variance of the estimator of the mean response (Equation 6.34). In order to make the distinction clear between the estimator of the mean $\mu(X)$ and the predictor of a single response Y, the variance of the latter is denoted by $s^2_{\hat{Y}(X)}$. Thus,

$s^2_{\hat{Y}(X)} = \left(1 + \dfrac{1}{n} + \dfrac{(X - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right)\hat{\sigma}^2$.    (6.35)


The interval estimator (confidence interval) for the prediction of a single observation is given by

$L = \hat{Y}(X) - s_{\hat{Y}(X)}\, t_{N-2;\alpha/2}$    (6.36)
$U = \hat{Y}(X) + s_{\hat{Y}(X)}\, t_{N-2;\alpha/2}$.    (6.37)
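In R both types of intervals are produced by the predict function (a sketch; fit is the regression fit from before, and the age of 40 years is just an illustrative value):

new <- data.frame(AGE = 40)
# CI for the mean response mu(40) (Equations 6.31-6.32):
predict(fit, newdata = new, interval = "confidence", level = 0.95)
# Prediction interval for a single new observation at AGE = 40
# (Equations 6.36-6.37, with the extra sigma^2 term):
predict(fit, newdata = new, interval = "prediction", level = 0.95)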

6.5 The ANOVA Table and F -test

6.5.1 The Decomposition of the Total Sum of Squares

As with an ANOVA model, there exists a decomposition of SSTot, which is calculated in exactly the same way as with ANOVA, i.e.

$\mathrm{SSTot} = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$.    (6.38)

SSTot measures the total variability of the data. SSTot is associated with $N-1$ degrees of freedom. The regression sum of squares, SSR, is calculated as

$\mathrm{SSR} = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2$.    (6.39)

It measures the variability that is attributed to the regression relation between the mean of Y and the regressor X. SSR has 1 degree of freedom. Finally, the residual sum of squares, SSE, is given by

$\mathrm{SSE} = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$,    (6.40)

and, as before, SSE measures the variability in the data that is not explained by the regression relation. SSE has $N-2$ degrees of freedom.

The sums of squares SSR and SSE, divided by their degrees of freedom, 1 and $N-2$, result in the mean sums of squares $\mathrm{MSR} = \mathrm{SSR}/1$ and $\mathrm{MSE} = \mathrm{SSE}/(N-2)$, respectively.

It can be shown that again MSE is an unbiased estimator of the residual variance $\sigma^2$.


The following decomposition holds,

SSTot = SSR + SSE, (6.41)

and a similar decomposition holds for the degrees of freedom, i.e.

N − 1 = 1 + (N − 2). (6.42)

The decomposition of SSTot means: the total variability of the data (SSTot) is decomposed into a part (SSR) that is explained by the regression relation, and a residual unexplained part (SSE). The larger SSR is compared to SSE, the more evidence there is for a true linear regression relation $\mu(X) = \mu + \beta X$ ($\beta \neq 0$). This argument will again result in a statistical test for testing $H_0: \beta = 0$.

6.5.2 F -test

As explained in the previous section, SSR and SSE can be used to construct a test for $H_0: \beta = 0$. In particular, we will use MSR and MSE instead. It can be shown that

$F = \dfrac{\mathrm{MSR}}{\mathrm{MSE}} \overset{H_0}{\sim} F_{1,\,N-2}$.    (6.43)

From the above null distribution an α-level test can be constructed.

Note that both when $\beta < 0$ and when $\beta > 0$, SSR will tend to be large. Thus, $H_0$ will only be rejected for large values of F. Hence, the F-test can only be used to test against the two-sided alternative hypothesis $H_1: \beta \neq 0$, whereas the t-test of Section 6.3.4 can also be used for one-sided alternative hypotheses.

Finally, we remark that there is a one-to-one relation between the F-test statistic and the t-test statistic for testing $H_0: \beta = 0$. In particular, it can be shown that

$T^2 = \left(\dfrac{\hat{\beta}}{s_{\hat{\beta}}}\right)^2 = \dfrac{\mathrm{MSR}}{\mathrm{MSE}} = F$.    (6.44)

6.5.3 ANOVA-Table

Often a regression analysis is accompanied by an ANOVA table. Table 6.1 shows its general form.


Table 6.1: An ANOVA Table for a regression analysis

Source       SS      df     MS      F-value    p-value
Regression   SSR     1      MSR     MSR/MSE    p
Error        SSE     N-2    MSE
Total        SSTot   N-1    MSTot

6.5.4 Example

Example 6.1 continued. Below the ANOVA table of the S-Plus output is given.

*** Linear Model ***

Analysis of Variance Table

Response: FATREDUC

Terms added sequentially (first to last)

Df Sum of Sq Mean Sq F Value Pr(F)

AGE 1 0.5443411 0.5443411 45.79564 0.006593352

Residuals 3 0.0356589 0.0118863

Note that S-Plus does not give the SSTot line of the ANOVA table, but, of course, it can simply be calculated:

SSTot = SSR + SSE = 0.5443411 + 0.0356589 = 0.58. (6.45)

On the AGE line we read the F-test for testing $H_0: \beta = 0$ against $H_1: \beta \neq 0$. Since

$F = 45.79564 > F_{1,N-2;\alpha} = F_{1,3;0.05} = 10.13$,    (6.46)

we conclude that, at the 5% level of significance, the regression coefficient $\beta$ is different from zero. Thus, there is a significant linear relation between the mean fat reduction and the age of the subjects. From Section 6.3.5 we know that $\hat{\beta} = 0.0208$, and thus we may even conclude that there is a significant


positive regression relation. Of course, the same conclusion is obtained by only looking at the corresponding p-value ($p = 0.006593352 < 0.05$). Since the F-test and the t-test are equivalent, the p-values are exactly equal.
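A quick numerical check of this equivalence, with the values from the output:

6.7672^2    # = 45.795, which equals the F-value up to rounding (T^2 = F)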

6.6 Coefficient of Determination: R2

In Section 6.5.1 we have seen that SSR measures the variability explained by the regression, that SSE measures the residual (unexplained) variability, and that SSTot is a measure of the total variability of the observations in the sample. Since SSTot = SSR + SSE, it is also meaningful to consider the fraction of SSR over SSTot. This proportion is called the coefficient of determination, and it is calculated as

$R^2 = \dfrac{\mathrm{SSR}}{\mathrm{SSTot}}$,    (6.47)

i.e. $R^2$ is the proportion of explained variability over the total variability. Since $0 \leq R^2 \leq 1$, $R^2$ is easy to interpret: the larger $R^2$, the "better" the regression relation.

Example 6.2. In the S-Plus output of the example of Section 6.3.5, we find that

$R^2 = 0.9385$.    (6.48)

Thus, about 94% of the total variability of the fat reduction in the sample is explained by the regression relation between the mean fat reduction and the age. □
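Using the sums of squares from the ANOVA table of Section 6.5.4, this value is indeed recovered:

0.5443411 / 0.58    # SSR / SSTot = 0.9385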

6.7 Assessing the Assumptions and Diagnostics

We will illustrate the assessment of the assumptions by means of an example. We will, however, not use the example of Section 6.1 here, because with only N = 5 observations it is almost impossible to assess the assumptions.

Example 6.3. In the nutrition dataset, an interesting question concerns the relation between the QI and the ENEX. In particular, we want to study


Figure 6.3: Scatter plot of QI against ENEX. The line represents the fitted regression line.

how the QI depends on ENEX. The regression model is

$Y_i = \mu + \beta X_i + \varepsilon_i$,    (6.49)

where $Y_i$ and $X_i$ are the QI and ENEX of subject i, respectively, and where $\mu$ and $\beta$ are the unknown parameters which will be estimated from the sample. Further, it is assumed that $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$. The interesting research questions are:

• Is there a relation between QI and ENEX? (This will be checked by means of a t- or F-test for $H_0: \beta = 0$.)

• If $\beta \neq 0$, what is the exact relation between QI and ENEX? This question will be answered by estimating $\mu$ and $\beta$ from the sample. By presenting the confidence intervals, we will have an idea of the accuracy of the estimation. The established (estimated) regression model can be used for prediction.

Figure 6.3 shows the raw data in a scatter plot, as well as the fitted regression line.

At the beginning of this chapter we have explained how the regression model is constructed and how it must be seen as a summary of the assumptions. In particular, the regression model (Equation 6.49) states the following assumption: all observations $Y_i$ are normally distributed with mean $\mu + \beta X_i$ and constant variance $\sigma^2$. This assumption can be decomposed into:


• The normality of the observations is transferred to the error terms $\varepsilon_i$, which have to be normally distributed with mean 0 and constant variance $\sigma^2$.

This assumption can be assessed by checking the normality of the residuals $e_i = Y_i - \hat{Y}_i$ by means of e.g. a QQ-plot (see Figure 6.4). This plot shows a slight asymmetric deviation from normality, but for the moment we will not consider it a problem; we would only worry if the p-values of formal tests came close to $\alpha = 0.05$.

• The mean of the response Y at a given value of X is assumed to be equal to $\mu + \beta X$, i.e. it is assumed that there is a linear relation between the regressor X and the mean of the response. Moreover, it is assumed that the error term $\varepsilon_i$ has mean zero for all X. If this is indeed true, then there should be no dependence between the mean of the residuals $e_i$ and the $X_i$. This can be visualized by plotting the residuals against the regressor $X_i$.

Figure 6.5 shows this residual plot. The line through the points is a so-called smoother. It can be interpreted as a nonlinear line that fits the data best. If the linear model for the mean indeed holds, then the mean of the residuals will be equal to zero for all X. In the plot, this means that the smooth line should be more or less constant and equal to zero. In Figure 6.5 we see that this is more or less the case; at least there is no evidence for a systematic deviation.

• In the previous paragraphs we have already checked the normality of $\varepsilon_i$ and the constant zero mean of $\varepsilon_i$. Further, it is assumed that the error terms $\varepsilon_i$ have constant variance $\sigma^2$ for all X. Again this can be assessed by looking at the residuals in the residual plot of Figure 6.5. If the error terms indeed have constant variance, then we would expect the residuals $e_i$ to be spread around zero with more or less the same variability for every X. In Figure 6.5 the variability of the residuals indeed seems fairly constant. Maybe for regressor values $2000 < X < 3000$ a slight increase in variance might be observed, but, on the other hand, there are more observations in this region, so that there is more chance of observing extremely large or small observations, which visually gives the impression of an increased variance.

Another way to assess the assumption of constancy of variance is to plot the absolute values of the residuals against the predictor X. This is shown in Figure 6.6. If the assumption holds, then we expect the mean of the absolute values of the residuals to be constant w.r.t.


Figure 6.4: QQ-plot of the residuals

Figure 6.5: Residual plot

the regressor X. Figure 6.6 shows that, in this example, there is almost no substantial evidence that the variance changes with X.
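These diagnostic plots are easy to reproduce. A sketch in R, assuming the fit of QI on ENEX is stored in fit and the data in a data frame nutrition:

e <- resid(fit)
qqnorm(e); qqline(e)                      # normality of the residuals
plot(nutrition$ENEX, e); abline(h = 0)    # residuals vs. the regressor
lines(lowess(nutrition$ENEX, e))          # smoother: should stay near 0
plot(nutrition$ENEX, abs(e))              # |residuals|: constant variance?
lines(lowess(nutrition$ENEX, abs(e)))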

Finally, we show residual plots of an artificial example where the linear model does not hold and where the assumption of constancy of variance does not hold either. Figure 6.7 shows the scatter plot of the raw data, the residual plot and the plot of the absolute values of the residuals against the predictor.


Figure 6.6: Plot of the absolute values of the residuals against the regressor ENEX

Figure 6.7: The scatter plot (above) of an artificial example, its residual plot (below, left) and the plot of the absolute values of the residuals (below, right)


6.8 Multiple Linear Regression

In the previous sections of this chapter we have modelled the mean of the response variable as a linear function of one single continuous regressor. Geometrically, this resulted in a straight line. Such a regression model is referred to as a simple linear regression model. In this section we extend the linear regression model to a model in which the mean of the dependent variable is a function of more than one regressor. This extended model will be referred to as a multiple linear regression model.

6.8.1 Introductory Example

Dataset: CheeseTaste.sdd. As cheese ages, various chemical processes take place that determine the taste of the final product. This dataset contains concentrations of various chemicals in 30 samples of mature cheddar cheese, and a subjective measure of taste given by a panel of professional tasters. The chemicals are: acetic acid (Acetic), lactic acid (Lactic) and hydrogen sulphide (H2S).

In this example we will initially only be interested in the dependence of the mean taste score on the lactic acid and hydrogen sulphide concentrations. Figure 6.8 shows a scatter plot matrix. This plot suggests that there might be a linear relation between taste and both the lactic acid and the hydrogen sulphide concentration. Instead of considering two regression models (one with Lactic and another with H2S as a regressor), we will now consider one single regression model with the two regressors simultaneously in it.

6.8.2 Statistical Model (additive model)

Suppose we want to model the mean of the response variable Y as a linear function of two regressors $X_1$ and $X_2$. The n observations in the sample can then be denoted by $(Y_1, X_{11}, X_{21}), \ldots, (Y_n, X_{1n}, X_{2n})$.

The regression model now becomes

$Y_i = \mu + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$,    (6.50)

where $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$ and where $\mu$, $\beta_1$ and $\beta_2$ are the unknown parameters. Geometrically, the model in Equation 6.50 represents a plane (Figure 6.9).


Figure 6.8: The scatter plot matrix of the variables taste, Lactic and H2S

Basically, the regression model in Equation 6.50 implies the following assumptions:

• the error terms are normally distributed with mean zero and common variance $\sigma^2$;

• the mean of the response variable is a linear, additive function of the regressors $X_1$ and $X_2$

(note that additive again means that the (linear) effect of the regressor $X_1$ is additive to the (linear) effect of the regressor $X_2$).

The regression model can be further extended to include $(p-1) \geq 1$ regressors $X_1, \ldots, X_{p-1}$. We will not treat this here in detail, though all properties discussed in this section readily apply to it.

As with the simple linear regression model, the unknown parameters in the multiple linear regression model can be estimated by means of the least squares method. As before, the estimates are denoted by $\hat{\mu}$, $\hat{\beta}_1$ and $\hat{\beta}_2$. Furthermore, they are again normally distributed with means $\mu$, $\beta_1$ and $\beta_2$, respectively, and a variance that is proportional to the residual variance $\sigma^2$. The latter variance is again estimated as the residual mean squared error (for general p),

$\hat{\sigma}^2 = \mathrm{MSE} = \dfrac{1}{n-p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$,    (6.51)

where $\hat{Y}_i = \hat{\mu} + \hat{\beta}_1 X_{1i} + \ldots + \hat{\beta}_{p-1} X_{p-1,i}$ are the fitted values. Since the parameter estimators are again normally distributed and since their variances are


Figure 6.9: The geometrical representation of the additive regression model of Equation 6.50

proportional to the residual variance, the standardized parameter estimators, with the variance replaced by its estimator, are again t-distributed with $n-p$ degrees of freedom. Hence, statistical tests and confidence intervals can be computed in exactly the same way as before.

6.8.3 The F -Test and R2

As with the simple linear regression model, the total sum of squares (SSTot) can be decomposed into a term measuring the variability that can be explained by the regression (SSR) and a term measuring the residual, unexplained variability (SSE): SSTot = SSR + SSE (the corresponding degrees of freedom decompose similarly: $n-1 = (p-1) + (n-p)$). Sometimes the regression sum of squares, SSR, is further decomposed into terms measuring the variability that can be explained by each regressor separately, but we will not discuss this any further here.

Based on the decomposition of the total sum of squares a hypothesis test canbe constructed for testing

$H_0: \beta_1 = \ldots = \beta_{p-1} = 0$    (6.52)

against

$H_1:$ at least one $\beta_j$ is different from zero.    (6.53)

In particular, it is an F-test. The test statistic and its null distribution are

$F = \dfrac{\mathrm{MSR}}{\mathrm{MSE}} = \dfrac{\mathrm{SSR}/(p-1)}{\mathrm{SSE}/(n-p)} \overset{H_0}{\sim} F_{p-1,\,n-p}$.    (6.54)


Thus, if the null hypothesis is accepted, then there is no linear regression relation at all, and when the null hypothesis is rejected, then the mean response is linearly related to at least one predictor.

The $R^2$-value (coefficient of determination) can again be calculated in exactly the same way as before,

$R^2 = \dfrac{\mathrm{SSR}}{\mathrm{SSTot}}$.    (6.55)

Its interpretation is also as before: the relative fraction of the total variability (SSTot) that is explained by the regression (SSR).

6.8.4 Example

Example of Section 6.8.1 continued. We consider the regression model of Section 6.8.2, restated here as

$Y_i = \mu + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$,    (6.56)

where $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$, and where $X_1$ and $X_2$ represent Lactic and H2S, respectively, and Y is the dependent variable taste.

The S-Plus output of the regression analysis is given below.

*** Linear Model ***

Call: lm(formula = taste ~ Lactic + H2S, data = cheese, na.action =

na.exclude)

Residuals:

Min 1Q Median 3Q Max

-17.34 -6.53 -1.164 4.844 25.62

Coefficients:

Value Std. Error t value Pr(>|t|)

(Intercept) -27.5918 8.9818 -3.0720 0.0048

Lactic 19.8872 7.9590 2.4987 0.0188

H2S 3.9463 1.1357 3.4748 0.0017

Residual standard error: 9.942 on 27 degrees of freedom

Multiple R-Squared: 0.6517

F-statistic: 25.26 on 2 and 27 degrees of freedom, the p-value is 6.551e-007
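In R essentially the same analysis reads as follows (a sketch, assuming a data frame cheese with columns taste, Lactic and H2S as in the output above):

fit <- lm(taste ~ Lactic + H2S, data = cheese)
summary(fit)    # coefficient t-tests, residual SE, R^2 and overall F-test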


Figure 6.10: QQ-plot of the residuals

It is again important to realize that we can only formally interpret the statistical tests in the output if the assumptions are fulfilled:

• Normality of the error terms. This can be assessed by looking at the QQ-plot of the residuals (see Figure 6.10). Figure 6.10 does not suggest any serious deviation from the assumption of normality.

• Constancy of the variance, i.e. the variance of Y (or, equivalently, of $\varepsilon$) should be equal for all $X_1$ and all $X_2$. This assumption may be assessed by looking at the residual plots, constructed as the residuals against $X_1$ (Figure 6.11, left) and against $X_2$ (Figure 6.11, right). In neither of the two residual plots is there clear evidence against the assumption of constancy of the variance. (Note: (1) never mind the effect of one single outlier when checking for constancy of variance; (2) maybe Figure 6.11 suggests a slight increase in variance with increasing Lactic, but we will not consider this increase substantial.)

• Linearity of the model, i.e. the statistical model implies that in reality there is indeed a linear relation between the mean response and the two regressors $X_1$ and $X_2$.

The residual plots in Figure 6.11 may again be used. In particular, we now expect the residuals to be nicely spread around zero for all predictor values. Or, equivalently, we expect that the mean of the residuals does not depend on the predictors. The smooth line in the residual plots may help in assessing this assumption. In both residual plots we do not see any substantial deviation from what we


Figure 6.11: Residual plots against Lactic (left) and H2S (right)

expect under the assumption of linearity. Therefore, we conclude that the assumption of linearity holds in this example, i.e. the statistical model that we work with is appropriate.

Thus, so far we have only assessed the assumptions underlying the statistical model, and we have concluded that all assumptions hold. We may therefore proceed with interpreting the S-Plus output of the regression analysis. In particular, we conclude from the S-Plus output:

• The estimated regression model, which may be used for e.g. prediction, is

$\hat{Y}(X_1, X_2) = \hat{\mu} + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$    (6.57)
$= -27.5918 + 19.8872\, X_1 + 3.9463\, X_2$.    (6.58)

(Note: as with the simple linear regression model, prediction intervals may be calculated, but we will not discuss this here.)

• Often it is of interest to test whether the observed regression relation is indeed present in reality or not, i.e. we would like to test

$H_0: \beta_1 = \beta_2 = 0$.    (6.59)

This can be tested by means of the F-test. In the output we read F = 25.26 on 2 and 27 degrees of freedom, resulting in a p-value equal to $6.551 \times 10^{-7} \approx 0 < 0.05$. Hence we conclude, at the 5% level of significance, that there is a regression relationship with at least one of the two regressors.


• To see with which of the two regressors the mean taste score has a linear relation, we can test the effect of each regressor separately.

For lactic we test

$H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$.    (6.60)

In the output we find $p = 0.0188 < 0.05$. Thus we conclude, at the 5% level of significance, that there is a linear relation between the mean taste score and the lactic acid concentration of the cheese.

For H2S we test

$H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$.    (6.61)

In the output we find $p = 0.0017 < 0.05$. Thus we conclude, at the 5% level of significance, that there is also a linear relation between the mean taste score and the hydrogen sulphide concentration of the cheese.

• From $R^2 = 0.6517$ we learn that about 65% of the total variability in the taste score is explained by the regression relation. Although this is not a large value, there was sufficient evidence in the data that the regression relation is significant. Thus, we have shown that the mean taste score does depend linearly on Lactic and H2S, but since $R^2$ is not very large, a lot of the variability of the observed taste scores probably remains unexplained.

Finally, we return to the estimated regression model,

$\hat{Y}(X_1, X_2) = -27.5918 + 19.8872\, X_1 + 3.9463\, X_2$,    (6.62)

for some more details on its interpretation. When more than one predictor is included in the model, the interpretation of the $\beta$-parameters needs some careful consideration. In particular, in this example we have:

• for Lactic ($X_1$): two types of cheese with the same H2S, but with a Lactic of one unit difference, have an estimated difference in mean taste score of 19.8872 units;

• for H2S ($X_2$): two types of cheese with the same Lactic, but with an H2S of one unit difference, have an estimated difference in mean taste score of 3.9463 units.


We say that the parameters have a conditional interpretation: $\beta_1$ is the effect of $X_1$, controlled for $X_2$ (actually, controlled for the linear effect of $X_2$), and, vice versa, $\beta_2$ is the effect of $X_2$, controlled for $X_1$. In a simple linear regression model, where e.g. the mean of Y is modelled as a linear function of one predictor, say $X_1$, the interpretation of $\beta_1$ is in some way averaged over all other possible predictors. Thus, in such a simple linear regression model, $\beta_1$ would have the interpretation of the difference in mean taste score between two types of cheese with a difference in Lactic of one unit, but without any knowledge about the H2S of the two types of cheese.

Consider the (fitted) multiple linear regression model of Equation 6.62:

Y (X1, X2) = −27.5918 + 19.8872X1 + 3.9463X2. (6.63)

According to the discussion on the interpretation of this model, it is clearthat the effect of X1 does not depend on a particular value of X2, i.e. β1

(estimated as β1 = 19.8872) is the change in the mean of Y for an increasein X1 of one unit, for a given value of X2, but the particular value of X2

does not matter! For example, the estimated increase in the mean of Y for acheese with X1 = 3 as compared to a cheese with X1 = 4, and both cheeseshave X2 = 2, is

Y (4, 2)− Y (3, 2) = (−27.5918 + 19.8872× 4 + 3.9463× 2)

− (−27.5918 + 19.8872× 3 + 3.9463× 2)

= 19.8872(= β1).

And the estimated increase in the mean of Y for a cheese with X1 = 3 ascompared to a cheese with X1 = 4, but now both cheeses have X2 = 7, is

Y (4, 7) − Y (3, 7) = (−27.5918 + 19.8872 × 4 + 3.9463 × 7)
− (−27.5918 + 19.8872 × 3 + 3.9463 × 7)
= 19.8872 (= β1).

Thus, the effect of X1 does not depend on the value of X2 (and vice versa). This property characterizes an additive model, i.e. there is no interaction effect between X1 and X2 on the mean of Y.
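The additivity property can also be verified numerically from the fitted model. A minimal sketch, assuming the fitted object cheese.lm from the sketch above:

# predicted mean taste scores on a small grid of (lactic, H2S) values
new <- data.frame(lactic = c(3, 4, 3, 4), H2S = c(2, 2, 7, 7))
pred <- predict(cheese.lm, new)
# both differences equal the estimated coefficient of lactic (19.8872),
# whatever the common value of H2S
pred[2] - pred[1]
pred[4] - pred[3]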



Figure 6.12: The geometrical representation of the interaction regression model of Equation 6.64

6.8.5 Interaction

In the previous section we have worked with an additive multiple linear regression model, i.e. there was no interaction effect between X1 and X2 on the mean of Y. The model in Equation 6.56 can be extended to include an interaction term:

Yi = µ + β1X1i + β2X2i + β3X1iX2i + εi, (6.64)

where εi i.i.d. N(0, σ2) and where µ, β1, β2 and β3 are the unknown parameters. The parameters β1 and β2 refer to the main linear effects of the regressors X1 and X2. The parameter β3 refers to the interaction effect. Geometrically, the model in Equation 6.64 no longer represents a plane, but a curved surface (Figure 6.12). This figure (of an artificial example) illustrates e.g. that for small values of X2 there is a positive linear effect of X1 (positive slope), but for large values of X2 there is a negative linear effect of X1 (negative slope). Thus, in such a model, we cannot conclude anything about the effect of X1 without specifying the value of X2.

To make the interpretation of the interaction model clearer, we rewrite model 6.64 as

Yi = µ + (β1 + β3X2i)X1i + β2X2i + εi. (6.65)

This representation shows that the regression coefficient of X1 is equal to β1 + β3X2, i.e. the regression coefficient of X1 is a function of X2. Thus, when e.g. β3 < 0, then the effect of X1 decreases with increasing value of X2. Model 6.64 could also have been rewritten as

Yi = µ + β1X1i + (β2 + β3X1i)X2i + εi, (6.66)



which illustrates that the same reasoning applies to the effect of X2.
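For instance, with artificial values β1 = 2, β2 = 1 and β3 = −0.5, the coefficient of X1 equals 2 − 0.5X2: at X2 = 2 the slope is +1, while at X2 = 6 it is −1, so the direction of the X1 effect reverses as X2 increases, exactly as in Figure 6.12.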

Since basically we could define a third regressor X3 = X1X2, the interaction model is just a special case of a multiple linear model with 3 regressors (p = 4). This implies that all parameters can again be estimated by means of the least squares method. Again all parameter estimators are normally distributed, etc. Thus, we can simply test the hypothesis that there is no interaction, by fitting model 6.64 and applying a t-test to test

H0 : β3 = 0 against H1 : β3 ≠ 0. (6.67)
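In S-Plus the interaction model can be fitted directly from the model formula. A sketch, again assuming the hypothetical cheese data frame:

# lactic * H2S expands to both main effects plus the interaction term
cheese.int <- lm(taste ~ lactic * H2S, data = cheese)
# the line labelled lactic:H2S in the coefficient table gives the
# t-test of H0: beta3 = 0 (Equation 6.67)
summary(cheese.int)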

6.8.6 Model Selection

From the previous section we have learnt that in an interaction model the parameters β1 and β2 (cfr. main effects) are meaningless when we want to conclude anything about the effect of one of the regressors without specifying the other regressors. This way of reasoning seems very similar to what we have seen in Chapter 5 with ANOVA models. Therefore, it seems reasonable to adopt a similar way of data modelling (see the sketch after this list):

• start with a model including the interaction term β3X1X2

• test for interaction, i.e. H0 : β3 = 0 against H1 : β3 ≠ 0

– when H0 is accepted, then eliminate the interaction term from the model and fit the additive model in order to conclude anything about the main effects of the regressors

– when H0 is rejected, we cannot proceed as with an ANOVA model with interaction, i.e. it is now not possible to split the dataset according to one regressor (reason: a regressor is a continuous variable; splitting according to such a variable might result in sub-datasets containing only one observation each!). Thus, in this case we will have to try to interpret the interaction model.
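A sketch of this selection procedure in S-Plus code (the 0.05 cut-off and all object names are ours; component names of the summary object may differ slightly between versions):

# start from the model with the interaction term
fit <- lm(taste ~ lactic * H2S, data = cheese)
# p-value of the t-test for the interaction coefficient
# (column 4 of the coefficient table holds Pr(>|t|))
p.int <- summary(fit)$coef["lactic:H2S", 4]
if (p.int > 0.05) {
  # no evidence of interaction: refit the additive model and
  # interpret the main effects from that fit
  fit <- lm(taste ~ lactic + H2S, data = cheese)
}
summary(fit)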



Chapter 7

General Linear Model

In the two previous chapters we have seen two types of statistical models: ANOVA models and regression models. Both models have in common that they model the mean of a continuous response variable. The former uses one or more factors, whereas the latter uses one or more continuous regressors. In this chapter, for a very simple special case, we will see that both types of models are equivalent. In particular, an ANOVA model can be formulated as a regression model. Once this equivalence is known, both types of models can be combined to include simultaneously a factor and a continuous regressor. Such a model is called a General Linear Model (GLM).

In this chapter we will restrict the discussion to one factor with only 2 levels, in combination with 1 continuous regressor.

7.1 Regression Formulation of an ANOVA Model

Consider an ANOVA model with one factor with two levels. The ANOVA model is given by

Yij = µ+ τi + εij, (7.1)

where εij i.i.d. N(0, σ2) and τ1 + τ2 = 0 (i = 1, 2; j = 1, . . . , ni).

Next we will construct a regression model of which we will show that it is completely equivalent to the above ANOVA model.

Let the index k now number the observations. In particular, Y1 = Y1,1, Y2 = Y1,2, . . . , Yn1 = Y1,n1 , Yn1+1 = Y2,1, . . . , Yn1+n2 = Y2,n2 .



Further, let n = n1 + n2. Next, we define a regressor Xk as follows,

Xk = 1 if observation k is in group 1,
Xk = 0 if observation k is in group 2. (7.2)

A regressor that is defined as above is referred to as a dummy variable or an indicator variable. Consider the regression model

Yk = µ+ βXk + εk, (7.3)

where εk i.i.d. N(0, σ2) (k = 1, . . . , n). This model looks exactly like a simple linear regression model.

In order to show the equivalence of both models, we will write model 7.3 for the observations in the first group (X = 1),

Yk = µ+ β + εk, (7.4)

k = 1, . . . , n1. For the observations in the second group (X = 0), model 7.3 becomes

Yk = µ+ εk, (7.5)

k = n1 + 1, . . . , n. Thus, the mean difference in the response variable between the two groups is simply β. According to the ANOVA model (Equation 7.1), this mean difference equals τ1 − τ2 = 2τ1. Thus,

β = 2τ1, (7.6)

and also, the ANOVA null hypothesis H0 : τ1 = τ2 = 0 is exactly equivalent to H0 : β = 0. Moreover (without proof), the statistical test for testing H0 : β = 0 in the regression model 7.3 is exactly the same as the test for testing H0 : τ1 = τ2 = 0 in the original ANOVA model 7.1.
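This equivalence is easy to check numerically. A small sketch with simulated data (all numbers and names are ours):

# two groups of 10 observations with different population means
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 7))
x <- c(rep(1, 10), rep(0, 10))    # dummy variable: 1 = group 1, 0 = group 2
# pooled two-sample t-test comparing the group means
t.test(y[x == 1], y[x == 0], var.equal = T)
# regression on the dummy: the t-test of H0: beta = 0 in the
# coefficient table returns exactly the same p-value
summary(lm(y ~ x))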

7.2 General Linear Model: Example

We will introduce and discuss the GLM by means of an example.

Dataset: RBC.sdd
Two groups of people are compared w.r.t. their red blood cell count (rbc). One group consists of people that have a very busy life and sleep on average less than 7 hours per day. The other group contains people that sleep on average more than 7 hours per day.



Figure 7.1: Boxplot of RBC in both groups

In the dataset, the groups are coded as 0 and 1, respectively. From both groups a random sample of 10 persons is drawn. From each of these persons the RBC is measured and their age (age) is recorded as well.

Figure 7.1 shows the boxplots of the RBC in both groups. From this plot one may immediately see that there seems to be a clear difference in RBC between both groups. This is indeed formally confirmed by means of a t-test, of which the S-Plus output is given here:

Standard Two-Sample t-Test

data: x: rbc with group = 0 , and y: rbc with group = 1

t = -10.1674, df = 18, p-value = 0

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-98.29703 -64.63062

sample estimates:

mean of x mean of y

169.868 251.3319

From the t-test we would conclude that there is a very significant (p = 0) difference in mean RBC between both groups. The mean difference is estimated as 251.3319 − 169.868 = 81.4639, in the sense that people with more than 7 hours of sleep have on average 81.4639 RBC units more than people with less than 7 hours of sleep. A 95% confidence interval of the mean difference is [64.63062, 98.29703] (0 is clearly far away from the boundary of this interval).



Figure 7.2: Boxplot of age in both groups


Since we also have information on the age of the subjects, it may also be interesting to look at the distribution of the age over the two groups. This is shown in the boxplots in Figure 7.2. This plot suggests that the age is not distributed in a balanced way over the two groups, although the sampling was completely at random! A possible explanation is that the population of long sleeping people is to a large extent characterized by older persons, whereas the population of short sleeping people is characterized by younger persons. Furthermore, if there were a relation between the mean RBC and the age, then the observed difference in mean RBC might be partly due to the difference in age between both groups instead of the difference in hours of sleep! This should be further investigated. We will use a GLM that contains both the factor for group (as a dummy variable) and a regressor for age.

Before we continue with the GLM, we will present the data in a plot in which we can see both the effect of group and the effect of age (Figure 7.3). In this figure we see a very clear linear relation between mean RBC and age. As in the boxplot (Figure 7.1) we again see the large difference in the sample means of RBC.

First we define the dummy variable as (i = 1, . . . , 20)

X1i = 0 if observation i is in group 0 (< 7 hours of sleep),
X1i = 1 if observation i is in group 1 (> 7 hours of sleep). (7.7)

Thus, X1i is the first regressor, defined as a dummy variable to act as the two-level factor. The age is the second regressor X2i. The GLM model is

Yi = µ+ β1X1i + β2X2i + εi, (7.8)



Figure 7.3: A scatter plot of RBC against age. The circles and the triangles represent the short and the long sleepers, respectively. The two horizontal lines represent the sample means of RBC of both groups.

where εi i.i.d. N(0, σ2) (i = 1, . . . , 20). Since the model in Equation 7.8 is basically a multiple linear regression model, the regression parameters β1 and β2 have the same kind of interpretation as in a multiple linear regression model. Thus, the effect of group, measured by β1, is to be interpreted as the mean difference in RBC between a group of short sleeping people of age X2 and a group of long sleeping people of the same age X2 (it does not matter what particular age they have). This is easily seen by writing model 7.8 for people in the first group (X1 = 0),

Yi = µ+ β2X2i + εi. (7.9)

This model basically says that the RBC of a short sleeping person of age X2 is

Y ∼ N(µ + β2X2, σ2). (7.10)

For the subjects in the second group (X1 = 1), the model becomes

Yi = µ+ β1 + β2X2i + εi, (7.11)

which implies that the RBC of a long sleeping person of age X2 is

Y ∼ N(µ+ β1 + β2X2, σ2). (7.12)

Thus, β1 is indeed the difference in mean RBC between both groups of people of age X2 (whatever the value of X2). In other words, β1 measures the effect of hours of sleep, controlled for age (i.e. the effect of age is eliminated).
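In S-Plus this model is fitted with lm; since group is already coded as 0/1 in the dataset, it can serve directly as the dummy regressor X1. A minimal sketch, assuming the data frame is named RBC with columns rbc, group and age (as in the output below):

# GLM of Equation 7.8: dummy variable group plus continuous regressor age
rbc.glm <- lm(rbc ~ group + age, data = RBC)
# coefficient table with estimates, standard errors and t-tests
summary(rbc.glm)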

The S-Plus output of the GLM is given below.



*** Linear Model ***

Call: lm(formula = rbc ~ group + age, data = RBC, na.action = na.exclude)

Residuals:

Min 1Q Median 3Q Max

-1.857 -0.8047 0.3009 0.6348 1.623

Coefficients:

Value Std. Error t value Pr(>|t|)

(Intercept) 78.9711 1.2497 63.1904 0.0000

group 0.9030 1.1621 0.7770 0.4478

age 3.0400 0.0404 75.2259 0.0000

Residual standard error: 1.009 on 17 degrees of freedom

Multiple R-Squared: 0.9996

F-statistic: 19130 on 2 and 17 degrees of freedom, the p-value is 0

From this output we conclude the following:

• The fitted GLM is

Y (X1, X2) = 78.9711 + 0.9030X1 + 3.04X2. (7.13)

(we used µ = 78.9711, β1 = 0.9030 and β2 = 3.04). This fitted model may be used for prediction (a worked example follows this list). Note again that β1 = 0.903 is the estimated effect of group, i.e. the mean difference in RBC between both groups, controlled for age, is estimated as 0.903, in the sense that long sleeping persons have on average a larger RBC than short sleeping people.

• Thus, the effect of sleeping is estimated by β1 = 0.903. To take the sampling variability into account, it is better to give the 95% confidence interval:

[0.903− 1.1621t17,0.025, 0.903 + 1.1621t17,0.025] = [−1.545, 3.355].

From this interval estimator we may immediately conclude that H0 : β1 = 0 is not rejected at the 5% level of significance. The same conclusion is obtained by looking at the result of the t-test for this null hypothesis: p = 0.4478 > 0.05 (the interval computation is sketched after this list).



Hence, by controlling for age (i.e. eliminating the effect of age on the mean RBC) there is no significant difference between the two groups any more! Thus, the difference that we concluded from the t-test was mainly due to the effect of age on the mean RBC and the fact that the groups were different in their age distribution.

• We may also look at the effect of age: β2 = 3.04, which is highly significantly different from zero (p = 0). This shows that age is linearly related to the mean RBC.

If, however, we had concluded here that the effect of age was not significant, then the effect of age might still have been retained in the model (reason: a non-significant result is a weak conclusion, and if the researcher believes that a variable may have an effect, it may always be retained in the model; moreover, by keeping it in the model, the interpretation of the other parameters changes in such a way that the effect of age is guaranteed to be eliminated from them).
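The prediction and the confidence interval mentioned in the bullets above can be computed in a few lines. A sketch using the estimates from the output (qt gives the quantile of the t-distribution):

# predicted mean RBC for a long sleeper (group = 1) of age 40:
# 78.9711 + 0.9030 * 1 + 3.04 * 40 = 201.47
78.9711 + 0.9030 * 1 + 3.04 * 40
# 95% confidence interval for beta1 from its estimate and standard error;
# this reproduces [-1.545, 3.355] up to rounding
0.9030 + c(-1, 1) * qt(0.975, 17) * 1.1621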

Finally, Figure 7.4 shows a scatter plot of the residuals of the regression model

Yi = µ+ β2X2i + εi (7.14)

(i.e. a simple linear regression model with only age in it). In Figure 7.4 the residuals are plotted against age. Now the linear effect of age is of course eliminated. The difference in mean RBC between the two groups is now indeed much smaller. This difference is approximately the difference as measured by β1 in the GLM.
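Figure 7.4 can be reconstructed along the following lines (a sketch; the plotting-symbol conventions may differ between S-Plus versions):

# simple linear regression with only age in it (Equation 7.14)
fit.age <- lm(rbc ~ age, data = RBC)
# residuals against age; triangles (pch = 2) for the long sleepers,
# circles (pch = 1) for the short sleepers
plot(RBC$age, resid(fit.age), pch = ifelse(RBC$group == 1, 2, 1),
     xlab = "age", ylab = "residuals")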



Figure 7.4: A scatter plot of residuals against age. The circles and the triangles represent the short and the long sleepers, respectively. The two horizontal lines represent the sample means of the residuals of both groups.
