MT2004

Page 1: MT2004

MT2004

Olivier GIMENEZ

Telephone: 01334 461827

E-mail: [email protected]

Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html

Page 2: MT2004

So far: data-driven statistical methods, i.e. using data to answer questions in direct ways

The rest of the module - from Section 7 - deals with

CAPTURING PATTERNS IN THE DATA USING MODELS: Modelling Step

ANALYSING THESE MODELS TO ANSWER QUESTIONS: Estimation and Inference Step

Page 3: MT2004

7. Basic Normal theory & the Central Limit Theorem

7.1 Basic properties of normal distributions

See Section 2.2.2

[Figure: normal density curves f(x) for x from -4 to 6, with (µ = 0, σ = 1), (µ = 3, σ = 0.5), (µ = 3, σ = 1) and (µ = 1, σ = 2)]

Page 4: MT2004

7.1.1 Linear transformation of a normal r.v.

Let X be a random variable with mean µ and variance σ²

Let Y = a + b X, then

E(Y) = ?

V(Y) = ?

Page 5: MT2004

7.1.1 Linear transformation of a normal r.v.

Let X be a random variable with mean µ and variance σ²

Let Y = a + bX, then

E(Y) = a + b E(X) = a + bµ

V(Y) = b² V(X) = b²σ²

Page 6: MT2004

7.1.1 Linear transformation of a normal r.v.

Now, if X is normally distributed with mean µ and variance σ², then Y = a + bX is normally distributed too.

In other words, any linear transformation of a normal random variable is again normal (for b ≠ 0).

And more precisely, according to the previous slide,

X ~ N(µ, σ²) ⇒ Y = a + bX ~ N(a + bµ, b²σ²)

Demonstration: homework

Page 7: MT2004

7.1.1 Linear transformation of a normal r.v.

Now, suppose that X ~ N(µ, σ²), and consider Z = (X - µ)/σ

What is the distribution of Z?

Page 8: MT2004

7.1.1 Linear transformation of a normal r.v.

Write Z = (X - µ)/σ = -µ/σ + (1/σ) X

and remember that X ~ N(µ, σ²) ⇒ Y = a + bX ~ N(a + bµ, b²σ²)

so by identification, we obtain a = -µ/σ and b = 1/σ

Finally Z ~ N(-µ/σ + µ/σ, σ²/σ²) = N(0, 1)

Page 9: MT2004

7.1.1 Linear transformation of a normal r.v.

Result: if X ~ N(µ, σ²), then Z = (X - µ)/σ ~ N(0, 1)

Page 10: MT2004

7.1.1 Linear transformation of a normal r.v.

Very useful result for working out probabilities associated with any normal distribution.

Idea: transform to the standard normal distribution N(0,1) and use the published tables for probabilities associated with N(0,1).

E.g.:

z 0.0 0.5 1.0 2.0 2.5 3.0

Pr(Z>z) 0.5000 0.30854 0.15866 0.02275 0.00621 0.00135

(See Table 5 of the K & Y Tables)

Page 11: MT2004

7.1.1 Linear transformation of a normal r.v.

Example:

Calculate the probability that a random variable X ~ N(3, 4) takes a value between 4 and 5

We wish to compute Pr(4 ≤ X ≤ 5).

Using the result above, we have that Z = (X - 3)/2 ~ N(0, 1)

So Pr(4 ≤ X ≤ 5) = Pr((4 - 3)/2 ≤ (X - 3)/2 ≤ (5 - 3)/2) = Pr(1/2 ≤ Z ≤ 1)

Finally Pr(4 ≤ X ≤ 5) = Pr(Z ≤ 1) - Pr(Z ≤ 1/2)

= (1 - 0.15866) - (1 - 0.30854) = 0.14988
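The table-based calculation can be cross-checked numerically. This is a minimal sketch in R, assuming only the base function pnorm (the normal c.d.f.); the variable names are illustrative and not from the slides.

```r
# X ~ N(3, 4): mean 3, variance 4, so standard deviation 2
mu <- 3
sigma <- sqrt(4)

# Direct computation: Pr(4 <= X <= 5) = F(5) - F(4)
p_direct <- pnorm(5, mean = mu, sd = sigma) - pnorm(4, mean = mu, sd = sigma)

# Same probability after standardising: Pr(1/2 <= Z <= 1) with Z ~ N(0,1)
p_standard <- pnorm((5 - mu) / sigma) - pnorm((4 - mu) / sigma)

print(c(direct = p_direct, standardised = p_standard))
```

Both routes should agree with the value 0.14988 obtained from the tables, up to rounding.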

Page 13: MT2004

7.1.2 Sums of independent normal random variables

Sums of (normal) r.v.'s occur frequently in statistical theory (e.g. mean, variance...). What is their distribution?

If X1, X2 are independent with Xi ~ N(µi, σi²), i = 1, 2,

then X1 + X2 ~ N(µ1 + µ2, σ1² + σ2²)

Extension: if X1,...,Xn are independent r.v.'s with Xi ~ N(µi, σi²),

i = 1,...,n, and a1,...,an are constants, then Σ ai Xi ~ N(Σ ai µi, Σ ai² σi²)
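The rule for linear combinations of independent normal r.v.'s (means add with weights ai, variances add with weights ai²) can be sketched as follows. The helper lin_comb_params and the numerical values are hypothetical, chosen only for illustration; the sample mean corresponds to ai = 1/n.

```r
# Mean and variance of sum(a_i * X_i) for independent X_i ~ N(mu_i, sigma2_i)
lin_comb_params <- function(a, mu, sigma2) {
  c(mean = sum(a * mu), var = sum(a^2 * sigma2))
}

# Special case: the sample mean of n = 4 i.i.d. N(10, 9) variables (a_i = 1/n)
n <- 4
params <- lin_comb_params(rep(1 / n, n), rep(10, n), rep(9, n))
print(params)  # mean 10, variance 9/4

# Simulation check (arbitrary seed): empirical mean and variance of the sample mean
set.seed(1)
xbar <- replicate(10000, mean(rnorm(n, mean = 10, sd = 3)))
print(c(mean(xbar), var(xbar)))
```

The simulated values should be close to, but not exactly equal to, the theoretical mean µ and variance σ²/n.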

Page 17: MT2004

7.2 The Central Limit Theorem

We've just seen that the mean of n independent, identically normally distributed r.v.'s is itself normally distributed.

The Central Limit Theorem (CLT) states that the mean of n i.i.d. r.v.'s from any distribution with finite mean µ and variance σ² is approximately normally distributed, with mean µ and variance σ²/n, for large enough n.

Page 20: MT2004

7.2 The Central Limit Theorem

Example: A bridge can hold at most 400 vehicles if they are bumper-to-bumper and stationary. The mean weight of vehicles using the bridge is 2.5 tonnes with a standard deviation of 2.0 tonnes. What is the probability that the maximum design load of 1100 tonnes will be exceeded in a traffic jam?

Let Xi be the weight of vehicle i, i = 1,...,n. Here, we have that n = 400, µ = 2.5 t and σ = 2.0 t.

The probability that the maximum design load of 1100 tonnes will be exceeded in a traffic jam is given by Pr(ΣXi > 1100).

We'd like to use the CLT: for X1,...,Xn i.i.d. r.v.'s with mean µ and variance σ², the sum ΣXi is approximately N(nµ, nσ²) for large n.

Page 22: MT2004

7.2 The Central Limit Theorem

Solution: by the CLT, ΣXi is approximately N(400 × 2.5, 400 × 2.0²) = N(1000, 40²), so Pr(ΣXi > 1100) ≈ Pr(Z > (1100 - 1000)/40) = Pr(Z > 2.5) = 0.00621
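A minimal R sketch of the bridge calculation under the CLT approximation, where the total load of 400 vehicles is treated as approximately N(nµ, nσ²). Variable names are illustrative; pnorm with lower.tail = FALSE returns the upper-tail probability.

```r
n <- 400       # number of vehicles on the bridge
mu <- 2.5      # mean vehicle weight (tonnes)
sigma <- 2.0   # standard deviation of vehicle weight (tonnes)
limit <- 1100  # maximum design load (tonnes)

# CLT: the total weight is approximately N(n * mu, n * sigma^2)
total_mean <- n * mu
total_sd <- sqrt(n) * sigma

# Standardise the limit and take the upper-tail probability
z <- (limit - total_mean) / total_sd
p_exceed <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_exceed = p_exceed))
```

The standardised value is z = 2.5, which matches the entry Pr(Z > 2.5) = 0.00621 in the table of Section 7.1.1.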

Page 23: MT2004

How to use Tables?

Table of the Standard Normal Distribution

values inside the table = areas under the N(0,1) density between -∞ and z

i.e. Φ(z) = P(Z ≤ z)

Example 1: to determine Φ(1.96) = P(Z ≤ 1.96), i.e. the area under the curve between -∞ and 1.96, look in the intersecting cell for the row labelled 1.90 and the column labelled 0.06. The area under the curve is 0.975.

Example 2: Find z such that Φ(z) = 0.95. P(Z ≤ 1.64) = 0.9495 and P(Z ≤ 1.65) = 0.9505, so that z = 1.645 by interpolation.

Example 3: Φ(-1.23) = 1 - Φ(1.23) = 1 - 0.8907 = 0.1093
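The three table look-ups have direct counterparts in R: pnorm gives Φ(z) and qnorm gives its inverse. A short sketch (variable names are illustrative):

```r
# Example 1: Phi(1.96), area under N(0,1) to the left of 1.96
phi_196 <- pnorm(1.96)

# Example 2: z such that Phi(z) = 0.95 (no interpolation needed)
z_95 <- qnorm(0.95)

# Example 3: Phi(-1.23), directly and via the symmetry Phi(-z) = 1 - Phi(z)
phi_minus <- pnorm(-1.23)
phi_sym <- 1 - pnorm(1.23)

print(round(c(phi_196, z_95, phi_minus, phi_sym), 4))
```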

Page 24: MT2004

7.3 Approximating other distributions by normal distributions

The CLT also provides the justification for approximating several other distributions by a normal distribution.

We consider two examples, the Binomial and the Poisson distributions.

The Binomial probability distribution is: P(X = x) = [n!/(x!(n - x)!)] p^x (1 - p)^(n - x), x = 0, 1,..., n

It becomes hard to evaluate for large n as the factorials in the binomial coefficient 'explode'.

However, the CLT can be used to overcome this problem

Page 25: MT2004

7.3 Approximating other distributions by normal distributions

We first note that if X ~ Bin(n, p), then X can be written as a sum of n independent Bernoulli r.v.'s: X = X1 + ... + Xn, where Xi ~ Bin(1, p).

Each Xi has mean p and variance p(1 - p).

Thus the CLT implies that, for large n, X/n ≈ N(p, p(1 - p)/n)

Alternatively: X ≈ N(np, np(1 - p))

In real life, the approximation will be good enough when n is large and p is not too close to 0 or 1.

Page 26: MT2004

7.3 Approximating other distributions by normal distributions

X ~ Bin(n, p); X = X1 + ... + Xn, where Xi ~ Bin(1, p); each Xi has mean p and variance p(1 - p).

By the CLT: X/n ≈ N(p, p(1 - p)/n); alternatively: X ≈ N(np, np(1 - p))

Example: The probability of annual survival of a bird species is 0.4. Suppose we are studying a population of n = 200 individuals. What is the probability that less than 50% of the population survives the current year?

Let Xi be the random variable 'individual i survives the year'; we have that Xi ~ Bin(1, 0.4), and each Xi has mean 0.4 and variance 0.4(1 - 0.4). Then X = X1 + ... + Xn is the total number of surviving individuals, X ~ Bin(200, 0.4).

Via the CLT: X ≈ N(np, np(1 - p)) = N(80, 48)

As 50% of the population is 100 individuals, we need P(X < 100).

Page 28: MT2004

7.3 Approximating other distributions by normal distributions

Standardising with mean 200 × 0.4 = 80 and variance 200 × 0.4 × 0.6 = 48, we get P(X < 100) ≈ P(Z < (100 - 80)/√48) ≈ P(Z < 3)

with Z ~ N(0,1). Using tables for the standard normal distribution, we have that P(Z < 3) = 0.9987 (more precisely, (100 - 80)/√48 ≈ 2.89 and P(Z < 2.89) ≈ 0.998).

Without invoking the CLT, we would need to compute

P(X < 100) = P(X=0) + P(X=1) + ... + P(X=99)
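With software the exact binomial sum is no longer hard to evaluate, so the normal approximation can be compared with it. A minimal sketch, assuming base R's pnorm and pbinom; no continuity correction is applied, to stay close to the calculation in the example.

```r
n <- 200   # population size
p <- 0.4   # annual survival probability

# Normal approximation: X is approximately N(np, np(1 - p))
z <- (100 - n * p) / sqrt(n * p * (1 - p))
p_normal <- pnorm(z)

# Exact binomial probability: P(X < 100) = P(X <= 99)
p_exact <- pbinom(99, size = n, prob = p)

print(c(z = z, normal = p_normal, exact = p_exact))
```

Both probabilities are very close to 1: it is very unlikely that half or more of the population survives when the survival probability is 0.4.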

Page 29: MT2004

7.3 Approximating other distributions by normal distributions

The Poisson probability function is: P(X = x) = e^(-λ) λ^x / x!, x = 0, 1, 2,...

It becomes hard to evaluate for high values of λ as λ^x and x! get huge.

However, the CLT can be used to overcome this problem.

We note first that if X1, X2 are independent with Xi ~ Pois(λi), i = 1, 2, then X1 + X2 ~ Pois(λ1 + λ2) (to be proved in Honours)

Page 30: MT2004

7.3 Approximating other distributions by normal distributions

We note that if X ~ Pois(λ), then X can be written as a sum of n independent Poisson r.v.'s: X = X1 + ... + Xn, where Xi ~ Pois(λ/n).

Each Xi has mean λ/n and variance λ/n (mean = variance: homework)

Thus the CLT implies that, for large λ, X is approximately normal with the same mean and variance

So that X ≈ N(λ, λ)

Page 31: MT2004

7.3 Approximating other distributions by normal distributions

Example: Find the probability that a Poisson distributed r.v. with mean 25 takes a value in the range 26 to 30.

If X ~ Pois(25), we need to calculate P(26 ≤ X ≤ 30)

The CLT tells us that X ≈ N(25, 25), so that Z = (X - 25)/5 is approximately N(0, 1)

Using tables for the standard normal distribution, we have that

P(26 ≤ X ≤ 30) ≈ P(0.2 ≤ Z ≤ 1) = P(Z ≤ 1) - P(Z ≤ 0.2)

= 0.8413 - 0.5793 = 0.262
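A short R sketch comparing the normal approximation used in this example with the exact Poisson probability from ppois. Variable names are illustrative. No continuity correction is applied here, so the two values are not expected to agree closely.

```r
lambda <- 25

# Normal approximation: X is approximately N(lambda, lambda)
z_low <- (26 - lambda) / sqrt(lambda)
z_high <- (30 - lambda) / sqrt(lambda)
p_normal <- pnorm(z_high) - pnorm(z_low)

# Exact Poisson probability: P(26 <= X <= 30) = P(X <= 30) - P(X <= 25)
p_exact <- ppois(30, lambda) - ppois(25, lambda)

print(c(normal = p_normal, exact = p_exact))
```

Comparing the two numbers gives a feel for how much accuracy is lost by treating a discrete distribution as continuous.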

Page 33: MT2004

8. Practical Applications of Normal Distributions

Why are normal distributions so important?

1 – The CLT shows that sums of i.i.d r.v’s tend towards normality, even if the r.v’s are non-normal

2 – For many data sets a normal distribution provides a good model (it describes the data adequately): heights of people, IQ scores…

3 – Easy to work with mathematically (integrals, tables…)

4 – Statistical procedures based on the normality assumption are often insensitive to small violations of the assumption (e.g. ANOVA, see a later section)

5 – Non-normal distributions can be transformed to approximate normality

Page 34: MT2004

8.1 Testing for normality

Before using the normal distribution as a model of data (e.g. to perform a test about the mean of a population), we need to decide whether or not the random sample under investigation could have been drawn from a normal distribution.

There are analytical tests (Pearson, Kolmogorov…) but we will focus on a graphical method here.

We won’t be able to prove normality, but only fail to reject the hypothesis that the data come from a normal distribution (hypothesis testing philosophy, finite random sample)

Page 35: MT2004

8.1 Testing for normality

First idea: use a histogram, and compare with what we would expect for a normal distribution, i.e. bell-shaped, symmetric with a single peak (unimodal)

Page 36: MT2004

8.1 Testing for normality

Histograms of random samples (n=30) from N(3, var=25) vs density curve of N(x̄, s²), where x̄ = Σxi/n is the sample mean and s² the sample variance

Difficult to conclude about normality, because of sampling variability

Page 37: MT2004

8.1 Testing for normality

Second idea:

Remember that any normal r.v. is a linear transformation of a standard normal r.v.

So if y1,…,yn is a random sample from any normal r.v. (N(µ, σ²) say) and z1,…,zn a random sample from a N(0,1)

Then plot the sorted y values against the sorted z values

We would get something close to a straight line because

Y = µ + σZ

Page 38: MT2004

8.1 Testing for normality

Plots of random samples (n=30) from N(3,var=25) against N(0,1)

Difficult to conclude about normality, because of sampling variability

Page 39: MT2004

8.1 Testing for normality

Third idea: to overcome the problem of variability, use an ‘idealised’/theoretical average sample from N(0,1), the normal scores

Page 40: MT2004

8.1 Testing for normality

If Z ~ N(0,1), by definition of the cumulative distribution function/(lower) quantile, we have that:

P(Z ≤ z10%) = Φ(z10%) = 0.10, where z10% is the 10% quantile

P(Z ≤ z20%) = Φ(z20%) = 0.20

…

P(Z ≤ z100%) = Φ(z100%) = 1.00

Meaning that, on average, we expect 10% of the data points to lie below the 10% quantile of the distribution, 20% below the 20% quantile, …, and 100% below the 100% quantile.

Page 41: MT2004

8.1 Testing for normality

Consider a sample of 10 points e.g. from N(0,1)

We’ve got 10 probability intervals corresponding to the quantiles:

[0,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4], [0.4,0.5]

[0.5,0.6], [0.6,0.7], [0.7,0.8], [0.8,0.9], [0.9,1.0]

For convenience, consider the mid-point of each interval (i-0.5)/10, i=1,…,10

The normal scores are obtained by computing Φ⁻¹((i-0.5)/10), i=1,…,10, where Φ is the c.d.f. of the N(0,1)

Page 42: MT2004

8.1 Testing for normality

Consider a sample of 10 points e.g. from N(0,1)

Φ⁻¹((1-0.5)/10) = Φ⁻¹(0.05) = -1.645

[Figure: N(0,1) density with area 0.05 to the left of -1.645]

Page 43: MT2004

8.1 Testing for normality

Consider a sample of 10 points e.g. from N(0,1)

Φ⁻¹((2-0.5)/10) = Φ⁻¹(0.15) = -1.036

[Figure: N(0,1) density with area 0.15 to the left of -1.036]

Page 44: MT2004

8.1 Testing for normality

Consider a sample of 10 points e.g. from N(0,1)

Finally…
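The full set of 10 normal scores can be obtained in one step. A minimal sketch in R, using qnorm for Φ⁻¹ and the mid-point probabilities (i - 0.5)/10 described above; the variable names are illustrative.

```r
n <- 10

# Mid-points of the n probability intervals: (i - 0.5)/n, i = 1,...,n
probs <- (seq_len(n) - 0.5) / n

# Normal scores: inverse c.d.f. of N(0,1) at each mid-point
scores <- qnorm(probs)

print(round(scores, 3))
```

The scores are symmetric about 0, and the first two reproduce the values -1.645 and -1.036 worked out above.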

Page 45: MT2004

8.1 Testing for normality

Idea: Plot the sorted observed y values against the normal scores, then check visually for linearity

Example: Early in the 20th century, a Colonel L.A. Waddell collected 32 skulls from Tibet. He collected 2 groups: 17 from graves on Sikkim and 15 from a battlefield near Lhasa. Here are maximum skull length measurements (in mm) for the Lhasa group: 182, 180, 191, 184, 181, 173, 189, 175, 196, 200, 185, 174, 195, 197, 182. Before doing anything with these data (e.g. testing for a difference in the mean skull length between the 2 groups), we need to check for normality first.

Page 49: MT2004

Example

Sample      Sorted sample   i = 1,…,15   (i-0.5)/15   Φ⁻¹((i-0.5)/15)
quantiles   quantiles

182         173             1            0.03         -1.83
180         174             2            0.10         -1.28
191         175             3            0.17         -0.97
184         180             4            0.23         -0.73
181         181             5            0.30         -0.52
173         182             6            0.37         -0.34
189         182             7            0.43         -0.17
175         184             8            0.50          0.00
196         185             9            0.57          0.17
200         189             10           0.63          0.34
185         191             11           0.70          0.52
174         195             12           0.77          0.73
195         196             13           0.83          0.97
197         197             14           0.90          1.28
182         200             15           0.97          1.83

Page 50: MT2004

Example

x=seq(1,15)

y=qnorm((x-0.5)/15) # calculates Φ⁻¹((i-0.5)/15), the theoretical quantiles

o=c(173,174,175,180,181,182,182,184,185,189,191,195,196,197,200) # sorted sample

plot(y,o,xlab="Theoretical quantiles",ylab="Sample quantiles")

Page 51: MT2004

Example

Alternatively, use R command qqnorm

skull.lhasa = c(182,180,191,184,181,173,189,175,196,200,185,174,195,197,182)

qqnorm(skull.lhasa)

Page 52: MT2004

Example

To help check linearity of a normal scores plot, add a straight line.

‘qqline’ adds a line to a normal quantile-quantile plot which passes through the first and third quartiles.

qqnorm(skull.lhasa)

qqline(skull.lhasa)

Page 53: MT2004

Further examples

Sample from a distribution with more probability in the tails and centre of the distribution and less in the ‘shoulders’ than the normal distribution

Page 54: MT2004

Further examples

Sample from a positively skewed distribution

Page 55: MT2004

Further examples

Sample from a negatively skewed distribution

Page 56: MT2004

Further examples

Sample from a normal distribution

Page 57: MT2004

8.2 Using a normal distribution as a model when the variance σ² is known (z-test)

We wish to learn something about a population on the basis of a random sample from that population

Why? Because it is often impractical to work with the whole population, so we test hypotheses about the population on the basis of a sample drawn from it

We assume here that the population may be modelled using a normal distribution with unknown mean µ but known variance σ²

For example, suppose one wants to investigate the IQ of students at St Andrews. As you cannot study the whole population, you need to measure the IQs of a random sample of students.

In this example, we could ask what the average IQ of St Andrews students is, or test the hypothesis that it is greater than some value

Page 58: MT2004

8.2.1 Hypothesis testing: parametric approach

General approach for hypothesis testing:

1 – Define a null hypothesis H0 and an alternative hypothesis H1

2 – Choose a test statistic which will distinguish between H0 and H1 by taking ‘extreme’ values if H1 is true, and moderate values otherwise

3 – Find the distribution of the test statistic under H0

4 – Using this distribution, determine the probability of obtaining a test statistic at least as ‘extreme’ as the one observed under H0. This is the p-value of the data under the test

5 – Conclude: a very low p-value suggests that H0 is false, otherwise we cannot reject H0

Page 59: MT2004

8.2.1 Hypothesis testing: parametric approach

And in the particular case of a normally distributed population with known variance?

Define x1,…,xn a set of independent observations from a population which can be modelled by a normal distribution with unknown mean µ and known variance σ².

Using these observations, we are able to test hypotheses about the mean of the whole population, e.g.

1 – H0: µ = µ0 against H1: µ ≠ µ0

2 – A ‘good’ test statistic is the mean of the observations x̄ = Σxi/n

which will tend to be close to µ0 under H0 but further away under H1

Page 60: MT2004

8.2.1 Hypothesis testing: parametric approach

1 – H0: µ = µ0 against H1: µ ≠ µ0 - TWO-SIDED TEST

2 – A ‘good’ test statistic is the mean of the observations x̄ = Σxi/n

which will tend to be close to µ0 under H0 but further away under H1

3 – Distribution of the test statistic under H0?

Page 63: MT2004

8.2.1 Hypothesis testing: parametric approach

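3 – The distribution of the test statistic under H0 is X̄ ~ N(µ0, σ²/n), i.e. Z = (X̄ - µ0)/(σ/√n) ~ N(0, 1)

4 - We use a significance level of α = 5%, and extreme values on either side of the mean are of interest since H1 does not distinguish between them

The appropriate range of values is P(-z_{α/2} ≤ Z ≤ z_{α/2}) = 1 - α = 0.95, i.e. P(Z ≤ z_{α/2}) - P(Z ≤ -z_{α/2}) = 2P(Z ≤ z_{α/2}) - 1 = 0.95

P(Z ≤ z_{α/2}) = 0.975 so z_{α/2} = 1.96

[Figure: N(0,1) density of Z with the acceptance region between -z_{α/2} and z_{α/2}]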

Page 64: MT2004

8.2.1 Hypothesis testing: parametric approach

4 - At the significance level α = 5%, the region of acceptance of H0 is [-1.96, 1.96]

5 - So if the observed value zobs of the test statistic is outside this region, we reject H0, otherwise we cannot.

ALTERNATIVELY,

4 - We can calculate what proportion of values are at least as improbable as the observed value under H0, i.e. the p-value.

In other words, we want to compute the p-value (written here for zobs ≥ 0; use |zobs| otherwise)

P(Z ≤ -zobs or Z ≥ zobs) = 1 - P(Z ≤ zobs) + 1 - P(Z ≤ zobs) = 2(1 - Φ(zobs))

5 - Finally, if the p-value is < 0.05, we reject H0 (zobs is outside the acceptance region), otherwise we cannot.

Page 65: MT2004

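8.2.1 Hypothesis testing: parametric approach

H0: µ = µ0 against H1: µ > µ0 - ONE-SIDED TEST

P(Z ≤ z_α) = 1 - α = 0.95 so z_α = 1.645, and we accept H0 if zobs ≤ 1.645

ALTERNATIVELY

the p-value is P(Z ≥ zobs) = 1 - Φ(zobs), and we reject H0 if it is < 0.05

[Figure: N(0,1) density with the rejection region to the right of z_α]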

Page 66: MT2004

8.2.1 Hypothesis testing: parametric approach

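H0: µ = µ0 against H1: µ < µ0 - ONE-SIDED TEST

P(Z ≥ -z_α) = 1 - α = 0.95 so -z_α = -1.645, and we accept H0 if zobs ≥ -1.645

ALTERNATIVELY

the p-value is P(Z ≤ zobs) = Φ(zobs), and we reject H0 if it is < 0.05

[Figure: N(0,1) density with the rejection region to the left of -z_α]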

Page 67: MT2004

8.2.1 Hypothesis testing: parametric approach

Example - Tutorial 4, Question 5a

Test at the 5% level the hypothesis that the mean spinning time has not been affected by lubrication, against the alternative that it has been increased. Report the p-value.

Test H0: µ = 150 against H1: µ > 150. We have that n = 5 and x̄ = Σxi/n = 162. Hence:

zobs = (x̄ - µ0)/(σ/√n) = 2.68 > z_0.05 = 1.645

We thus reject H0 at the 5% significance level.

The p-value is 1 - Φ(2.68) = 1 - 0.9963 = 0.0037 << 0.05 so we reject H0
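The five-step z-test procedure can be wrapped in a small function. This is only a sketch: z_test is a hypothetical helper (base R has no built-in z-test), and the value sigma = 10 used below is an assumption, not given in this transcript, chosen because it reproduces the observed statistic 2.68.

```r
# z-test for the mean of a normal population with known standard deviation sigma
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))  # test statistic, N(0,1) under H0
  p <- switch(alternative,
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE),
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z)
  )
  list(z = z, p.value = p)
}

# Spinning-time example: H0: mu = 150 against H1: mu > 150
# sigma = 10 is an assumed value (see the note above)
res <- z_test(xbar = 162, mu0 = 150, sigma = 10, n = 5, alternative = "greater")
print(res)
```

Under that assumption the statistic and p-value agree with the slide's 2.68 and 0.0037 up to rounding.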

Page 68: MT2004

8.2.2 The power of a test

There are two types of error that can be made when hypothesis testing:

Type I error: Reject H0 when it is true; P(type I error) = α

Type II error: Accept H0 when it is false; P(type II error) = ?

We'd like to minimize the two types of error, but the problem is that when α decreases, P(type II error) increases and vice-versa.

So set α to a fixed value, and try to minimize P(type II error)

                      DECISION
REALITY       Accept H0         Reject H0
H0 true       OK                Type I error
H0 false      Type II error     OK

Page 69: MT2004

8.2.2 The power of a test

Definition: The power of a test is the probability of rejecting H0 when it is false.

P(type II error) = P(Accept H0 | H0 false) so power = 1 - P(type II error).

So once α is fixed, we'll do our best to increase the power of the test.

Example: Consider H0: µ = µ0 vs H1: µ > µ0

Page 70: MT2004

8.2.2 The power of a test

Page 71: MT2004

8.2.2 The power of a test

Note: If n increases, (...) decreases and the power increases; in other words, for a given α, the power increases as the sample size increases, and so does your ability to detect an alternative hypothesis
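The effect of the sample size on power can be illustrated numerically. For the one-sided test H0: µ = µ0 vs H1: µ > µ0 with known σ, H0 is rejected when Z ≥ z_α, so if the true mean is µ1 the power is 1 - Φ(z_α - (µ1 - µ0)√n/σ). The sketch below uses this standard expression; power_z is a hypothetical helper and the numerical values are illustrative, not taken from the slides.

```r
# Power of the one-sided z-test (H1: mu > mu0) when the true mean is mu1
power_z <- function(mu1, mu0, sigma, n, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha)               # critical value of the test
  shift <- (mu1 - mu0) * sqrt(n) / sigma    # mean of Z when mu = mu1
  pnorm(z_alpha - shift, lower.tail = FALSE)
}

# Illustrative values: true mean 155, null value 150, sigma 10
sizes <- c(5, 10, 20, 50)
powers <- sapply(sizes, function(n) power_z(mu1 = 155, mu0 = 150, sigma = 10, n = n))

print(data.frame(n = sizes, power = round(powers, 3)))
```

When µ1 = µ0 the function returns α, as it should, since rejecting a true H0 is a type I error.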

Page 72: MT2004

8.2.3 Confidence intervals

You have already calculated confidence intervals by computer-intensive methods. Here, we'll do it analytically.

An x% confidence interval for a parameter (µ here) is an interval having probability x% of including the true value of this parameter.

The parameter being estimated is a fixed quantity, but the interval is random.

In other words, it means that x% of intervals calculated in the same way for similar samples will include the true value.

Consider again the example of n observations from a normal population with known variance σ²

Page 73: MT2004

8.2.3 Confidence intervals

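Example of n observations from a normal population with known variance σ²: since P(-z_{α/2} ≤ (X̄ - µ)/(σ/√n) ≤ z_{α/2}) = 1 - α, a 100(1 - α)% confidence interval for µ is [x̄ - z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n]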

Page 74: MT2004

8.2.3 Confidence intervals

Example: The wavelengths of light pulses from a semiconductor laser are approximately normally distributed, with variance calculated by theory to be 100 nm². The mean wavelength for individual lasers varies. Measurements of 100 pulses from a laser give an average wavelength of 598 nm. Find a 95% confidence interval for the mean wavelength of the laser.

Page 75: MT2004

8.2.3 Confidence intervals

Solution: here n = 100, x̄ = 598 nm, σ = √100 = 10 nm and z_0.025 = 1.96, so the 95% confidence interval is 598 ± 1.96 × 10/√100 = 598 ± 1.96, i.e. [596.04, 599.96] nm
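The laser interval can be computed in a few lines of R. A minimal sketch, assuming the usual known-variance interval x̄ ± z_{α/2} σ/√n; variable names are illustrative.

```r
xbar <- 598          # average wavelength of the 100 pulses (nm)
sigma <- sqrt(100)   # known standard deviation (nm), from variance 100 nm^2
n <- 100             # number of pulses
level <- 0.95        # confidence level

# Upper alpha/2 quantile of N(0,1): about 1.96 for a 95% interval
z <- qnorm(1 - (1 - level) / 2)

# Confidence interval: xbar -/+ z * sigma / sqrt(n)
ci <- xbar + c(-1, 1) * z * sigma / sqrt(n)

print(round(ci, 2))
```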

Page 76: MT2004

9. Distributions derived from normal distributions

In the previous section, we assumed that the variance of the whole population was known

Unlikely to be the case…

So we need methods to deal with the case where both the mean and the variance of the whole population are unknown

To develop the theory underlying such methods, we first need to introduce some other distributions related to the normal distribution

Namely, the χ², t and F distributions

Page 77: MT2004

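9.1 χ² distributions

Definition: if Z1,…,Zn are independent N(0,1) r.v.'s, then Z1² + … + Zn² has a χ² distribution with n degrees of freedom, written χ²_n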

Page 79: MT2004

9.1 χ² distributions

Upper quantile = value above which some specified proportion of the area of a p.d.f. lies

The 5% upper quantile of a χ²_5 is x such that Pr(χ²_5 ≥ x) = 0.05, or alternatively Pr(χ²_5 ≤ x) = 0.95, i.e. the lower 95% quantile

Pr(χ²_5 ≤ x) = 0.95 (the lower 95% quantile) is obtained using the R command:

> qchisq(0.95,5) # inverse of the cumulative d.f.

[1] 11.07050

Page 83: MT2004

9.1 χ² distributions

Example: Suppose that X, Y, and Z are coordinates in 3-dimensional space which are independently distributed as N(0,1), with all measurements in cm. What is the probability that the point (X,Y,Z) lies more than 3 cm from the origin?

The point lies more than 3 cm from the origin when √(X² + Y² + Z²) > 3, i.e. when X² + Y² + Z² > 9. Since X, Y and Z are independent N(0,1) r.v.'s, X² + Y² + Z² ~ χ²_3, so the required probability is Pr(χ²_3 > 9) ≈ 0.029
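The distance example reduces to an upper-tail probability of a χ² distribution with 3 degrees of freedom, because the squared distance X² + Y² + Z² is a sum of three independent squared N(0,1) variables. A minimal R sketch, with a simulation as a sanity check (the seed and number of draws are arbitrary choices):

```r
# Exact answer: Pr(chi-square with 3 d.f. > 3^2)
p_far <- pchisq(9, df = 3, lower.tail = FALSE)

# Simulation check: draw points (X, Y, Z) with independent N(0,1) coordinates
set.seed(42)
pts <- matrix(rnorm(3 * 100000), ncol = 3)
dist <- sqrt(rowSums(pts^2))   # distance of each point from the origin
p_sim <- mean(dist > 3)

print(c(exact = p_far, simulated = p_sim))
```

The simulated proportion should be close to the exact value, with small differences due to Monte Carlo error.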

Page 89: MT2004

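9.2 The F distributions

Definition: if U ~ χ²_n and V ~ χ²_k are independent, then (U/n)/(V/k) has an F distribution with n and k degrees of freedom, written F_{n,k}

The 5% upper quantile of an F_{df1,df2} is x such that Pr(F_{df1,df2} ≥ x) = 0.05

Use Tables or R command qf(0.95,df1,df2) (lower 95% quantile)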

Page 91: MT2004

9.2 The F distributions

Remember that:

Upper quantile = value above which some specified proportion of the area of a p.d.f. lies

Lower quantile = value below which some specified proportion of the area of a p.d.f. lies

If F ~ F_{n,k}, then 1/F ~ F_{k,n}. So if we have a table with the upper quantiles, we can also get the lower quantiles as follows:

F_{n,k;1-α} = 1/F_{k,n;α}

i.e. the upper (1-α) quantile of F_{n,k}, or lower α quantile of F_{n,k}, is the inverse of the upper α quantile of F_{k,n}

Page 94: MT2004

9.2 The F distributions

Example: Given that F_{3,2;0.025} = 39.17, find F_{2,3;0.975} (i.e. the lower 0.025 = 1-0.975 quantile of the F_{2,3} distribution)

F_{2,3;0.975} = 1/F_{3,2;0.025} = 1/39.17 = 0.0255
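The reciprocal relation between F quantiles can be checked directly with qf. A short sketch (variable names are illustrative); note that qf works with lower-tail probabilities unless lower.tail = FALSE is given.

```r
# Upper 2.5% quantile of F(3,2), i.e. the tabulated value 39.17
upper_32 <- qf(0.025, df1 = 3, df2 = 2, lower.tail = FALSE)

# Lower 2.5% quantile of F(2,3), computed directly
lower_23 <- qf(0.025, df1 = 2, df2 = 3)

# The two are reciprocals of each other
print(c(upper_32 = upper_32, lower_23 = lower_23, reciprocal = 1 / upper_32))
```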

R commands

> x = seq(0.01,5,by=0.01) # grid of x values (not defined on the slide)

> par(mfrow=c(2,1))

> plot(x,df(x,2,3),xlab="",ylab="",type='l')

> title("pdf F(2,3)")

> plot(x,df(x,3,2),xlab="",ylab="",type='l')

> title("pdf F(3,2)")

Page 96: MT2004

9.3 The t distributions

Definition: if Z ~ N(0,1) and U ~ χ²_n are independent, then Z/√(U/n) has a t distribution with n degrees of freedom, written t_n

The shape of the p.d.f. of t_n depends on n

Looks like a normal distribution, but more of the probability is in the centre and the tails, see the graph for t_1 e.g. (top left)

t_{n;α} is the upper α quantile of the t distribution with n degrees of freedom

Use tables or R, e.g. qt(0.95,8) (=1.859548) gives the lower 95% quantile (upper 5% quantile) of the t distribution with 8 degrees of freedom. For large n the quantiles approach those of the N(0,1): qt(0.95,5000) = 1.645158…

Page 102: MT2004

10 Using t distributions

To derive the distribution of the statistic testing hypotheses about the mean of a normal population with unknown variance, we need a key result on the joint distribution of the sample mean and the sample variance

Remember that, for X1,…,Xn i.i.d. N(µ, σ²): X̄ = ΣXi/n ~ N(µ, σ²/n), and the sample variance is S² = Σ(Xi - X̄)²/(n-1)

Key result: X̄ and S² are independent, and (n-1)S²/σ² ~ χ²_{n-1}, so that

T = (X̄ - µ)/(S/√n) ~ t_{n-1}

The quantity T depends on the population mean µ but not on the unknown variance σ².

So this statistic will be useful to test hypotheses about the mean of normal populations with unknown variance

Page 105: MT2004

10.2 One-sample t-tests and confidence intervals

One sample t-tests:

39 observations on pulse rates (heart beats/minute) of Indigenous Peruvians had sample mean 70.31 and sample variance 90.219.

We assume normality.

Question: at the 1% significance level, could this data set be considered as a random sample from a population with mean 75?

In other words (Step 1 of hypothesis testing strategy):

H0: µ = 75 against H1: µ ≠ 75

Your turn. Perform step 2 (find a ‘good test statistic’) and step 3 (derive its distribution)

Page 106: MT2004

10.2 One-sample t-tests and confidence intervals

One sample t-tests:

39 observations on pulse rates (heart beats/minute) of Indigenous Peruvians had sample mean 70.31 and sample variance 90.219.

Step 1: H0: µ = 75 against H1: µ ≠ 75

Step 2: x̄ - µ0 (with x̄ = Σxi/n) is a good candidate since it takes ‘extreme’ values if H1 is true, and moderate values if H0 is true.

Step 3: under H0, T = (X̄ - µ0)/(S/√n) ~ t_{n-1}; here tobs = (70.31 - 75)/√(90.219/39) = -3.08

Step 4: it’s a 2-sided test, so we will reject H0 if

tobs ≤ -t_{n-1;α/2} or tobs ≥ t_{n-1;α/2} (graphical representation)

If one-sided test, H1: µ < µ0, we reject if tobs ≤ -t_{n-1;α}

If one-sided test, H1: µ > µ0, we reject if tobs ≥ t_{n-1;α}

So we will reject if tobs ≥ 2.7045 or if tobs ≤ -2.7045 (2.7045 is the table value for 40 degrees of freedom, the closest tabulated; R gives qt(0.995,38) = 2.7116)

Here tobs = -3.08 ≤ -2.7045, so we reject H0 at the 1% significance level

P-value using R:

> tobs = (70.31 - 75)/sqrt(90.219/39)

> 2*pt(tobs,38) # (tobs<0 so need to double the c.d.f. of tobs – 2-sided test)

[1] 0.003799049
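Because only summary statistics are available for the pulse-rate data, the built-in t.test function (which needs the raw observations) cannot be used. The sketch below computes the one-sample t-test by hand; t_test_summary is a hypothetical helper, not a base R function.

```r
# Two-sided one-sample t-test from summary statistics
t_test_summary <- function(xbar, s2, n, mu0) {
  tobs <- (xbar - mu0) / sqrt(s2 / n)                      # observed statistic
  p <- 2 * pt(abs(tobs), df = n - 1, lower.tail = FALSE)   # two-sided p-value
  list(t = tobs, df = n - 1, p.value = p)
}

# Pulse-rate example: n = 39, sample mean 70.31, sample variance 90.219, mu0 = 75
res <- t_test_summary(xbar = 70.31, s2 = 90.219, n = 39, mu0 = 75)

# Critical value for a two-sided test at the 1% level with 38 d.f.
crit <- qt(0.995, df = 38)

print(c(t = res$t, critical = crit, p.value = res$p.value))
```

Using abs(tobs) with the upper tail gives the same two-sided p-value whatever the sign of the statistic.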

Page 109: MT2004

10.2 One-sample t-tests and confidence intervals

Confidence interval:

39 observations on pulse rates (heart beats/minute) of Indigenous Peruvians had sample mean 70.31 and sample variance 90.219.

We’d like to build a 99% confidence interval for µ; we’re looking for the values of µ0 for which we would accept H0

We know that: P(-t_{n-1;α/2} ≤ (X̄ - µ)/(S/√n) ≤ t_{n-1;α/2}) = 1 - α

So we would accept any value of µ such that x̄ - t_{n-1;α/2} s/√n ≤ µ ≤ x̄ + t_{n-1;α/2} s/√n, i.e. 70.31 ± 2.7045 × √(90.219/39), which gives approximately [66.2, 74.4]

75 is outside the confidence interval, so we would reject H0 at the 1% significance level

With R, a 95% confidence interval is obtained as follows:

> cil = 70.31 - qt(0.975,38)*sqrt(90.219)/sqrt(39)

> ciu = 70.31 + qt(0.975,38)*sqrt(90.219)/sqrt(39)

> c(cil,ciu)

[1] 67.23099 73.38901

And the 99% confidence interval is obtained as

> c(70.31 - qt(0.995,38)*sqrt(90.219)/sqrt(39), 70.31 + qt(0.995,38)*sqrt(90.219)/sqrt(39))

Page 112: MT2004

10.3 Paired t-tests

Consider two samples of observations (Xi,Yi)

Consider the case where the two measurements (Xi,Yi) are made on the same unit i

We wish to test if the two population means are equal

Example: measurement of left and right wing length of birds

These should not be treated as independent!

Obviously, length of left wing and length of right wing both tend to be large for large birds: dependent measurements

Idea: work with the differences between the two measurements on each unit, i.e. Xi-Yi, in order to go back to a one-sample t-test

Page 113: MT2004

Example: corneal thickness in microns for both eyes of patients who have glaucoma in one eye

Glaucoma 488 478 480 426 440 410 458 460

Healthy 484 478 492 444 436 398 464 476

Obviously, the corneal thickness is likely to be similar in the two eyes of any patient – dependent observations

Consider di = glaucomai – healthyi. We will assume that this new random sample is drawn from a normal distribution N(μ_d, σ²), and we wish to test: H0: μ_d = 0 vs H1: μ_d ≠ 0

Σdi = -32; Σdi² = 936 and n = 8, so the mean difference is -32/8 = -4 and s² = (936 - 8 × (-4)²)/7 = 115.43

10.3 Paired t-tests

Page 114: MT2004

Example: corneal thickness in microns for both eyes of patients who have glaucoma in one eye

H0: μ_d = 0 vs H1: μ_d ≠ 0

Σdi = -32; Σdi² = 936, s² = 115.43 and t7;0.025 = 2.3646 (see Tables)

tobs = -4/√(115.43/8) = -1.053, so tobs > -t7;0.025 and tobs < t7;0.025, meaning that tobs is in the region of acceptance of H0

10.3 Paired t-tests

[Figure: t distribution, with -tα/2 and tα/2 marked on the horizontal axis (t)]

Page 115: MT2004


At the 5% significance level, we fail to reject H0, so there is no evidence of a difference in corneal thickness between the healthy eye and the diseased eye
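The paired analysis can be checked in R, both by hand on the differences and with the built-in `t.test`. This is a sketch using the data from the slide; the variable names are illustrative.

```r
# Corneal thickness (microns), one pair of eyes per patient
glaucoma <- c(488, 478, 480, 426, 440, 410, 458, 460)
healthy  <- c(484, 478, 492, 444, 436, 398, 464, 476)

# Work with the within-patient differences
d <- glaucoma - healthy
n <- length(d)

# One-sample t statistic on the differences, H0: mean difference = 0
tobs  <- mean(d) / sqrt(var(d) / n)
tcrit <- qt(0.975, df = n - 1)

# Same test with the built-in function
res <- t.test(glaucoma, healthy, paired = TRUE)

cat("sum(d) =", sum(d), " sum(d^2) =", sum(d^2), " var(d) =", var(d), "\n")
cat("tobs =", tobs, " critical value =", tcrit, " p-value =", res$p.value, "\n")
```

A paired test is simply a one-sample test on the differences, which is why `t.test(d)` would give the same statistic.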

10.3 Paired t-tests

Page 116: MT2004

10.4 Two-sample t-tests

Now, we want to deal with two sets of data and compare, e.g., their means

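We consider that the two random samples are drawn from normal distributions with unknown but equal variances.

More formally: X1,…,Xn is a random sample from N(μ_X, σ²) and Y1,…,Ym is a random sample from N(μ_Y, σ²), the two samples being independent.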

Page 117: MT2004

10.4 Two-sample t-tests


We know that the distributions of the sample means of the two samples are:

so that (using results on sums of normal r.v’s)

As usual, we’d like to relate this distribution to a standard normal random variable…
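The two formulas on this slide did not survive in the transcript. Assuming samples of sizes n and m with common variance σ², and independence between the two samples, they would presumably read:

```latex
\[
\bar{X} \sim N\!\left(\mu_X,\ \frac{\sigma^2}{n}\right),
\qquad
\bar{Y} \sim N\!\left(\mu_Y,\ \frac{\sigma^2}{m}\right),
\]
\[
\bar{X} - \bar{Y} \sim N\!\left(\mu_X - \mu_Y,\ \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)\right).
\]
```

The variance of the difference is the sum of the two variances because the samples are independent.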

Page 118: MT2004

10.4 Two-sample t-tests


We have that Z = (X̄ - Ȳ - (μ_X - μ_Y)) / (σ √(1/n + 1/m)) follows a standard normal distribution N(0,1).

Obviously, if we assume that σ² is known, we can test hypotheses about the difference in means between the two groups (see the one-sample case – z-test).

But we assume that σ² is unknown. So we need to repeat what we did for the t-test (one-sample test about the mean with unknown variance).

Page 119: MT2004

10.4 Two-sample t-tests

More precisely, first find the distribution of:

We note that:

where

Page 120: MT2004

10.4 Two-sample t-tests

Similarly, we have that:

where

Page 121: MT2004

10.4 Two-sample t-tests

Putting the two latter results together, we have that, using the additivity of independent χ² r.v.'s:

Note that the above quantity can be written as

where:

is called the pooled sample variance.
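The formulas of the last three slides are missing from the transcript. A reconstruction in standard notation, assuming S_X² and S_Y² denote the two sample variances and S_p² the pooled sample variance (written s² in the worked examples), would be:

```latex
\[
\frac{(n-1)S_X^2}{\sigma^2} \sim \chi^2_{n-1},
\qquad
S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2,
\]
\[
\frac{(m-1)S_Y^2}{\sigma^2} \sim \chi^2_{m-1},
\qquad
S_Y^2 = \frac{1}{m-1}\sum_{j=1}^{m}\left(Y_j - \bar{Y}\right)^2,
\]
\[
\frac{(n-1)S_X^2 + (m-1)S_Y^2}{\sigma^2}
= \frac{(n+m-2)\,S_p^2}{\sigma^2} \sim \chi^2_{n+m-2},
\qquad
S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}.
\]
```

The degrees of freedom add up, (n-1) + (m-1) = n+m-2, because the two samples are independent.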

Page 122: MT2004

10.4 Two-sample t-tests

Remember that we have:

Page 123: MT2004

10.4 Two-sample t-tests

So let the test statistic T be

which is actually the ratio of following distributions:

i.e. a t distribution with n+m-2 degrees of freedom!

Page 124: MT2004

10.4 Two-sample t-tests

Now we can see that T can be re-written as follows:

or:

The quantity T depends on the population means μ_X and μ_Y but not on the unknown variance σ².

This statistic is thus useful to test hypotheses about the difference in means between the 2 populations.
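The expressions for T are missing from the transcript. Assuming the standard construction of a t variable as the ratio of an N(0,1) variable Z to the square root of an independent χ² variable W divided by its degrees of freedom, with S_p² the pooled sample variance, they would presumably be:

```latex
\[
T = \frac{Z}{\sqrt{W/(n+m-2)}} \sim t_{n+m-2},
\qquad
Z = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} \sim N(0,1),
\qquad
W = \frac{(n+m-2)\,S_p^2}{\sigma^2} \sim \chi^2_{n+m-2},
\]
\[
T = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}},
\qquad
S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}.
\]
```

The unknown σ cancels between Z and W, which is why T can be computed from the data once a value for μ_X - μ_Y is specified under H0.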

Page 125: MT2004

10.4 Two-sample t-tests

Example: Consider two random samples from 2 normal distributions:

x = 11 10 14 12 13 and y = 8 3 4 9

Test the hypothesis that the two population means are equal against the alternative hypothesis that they are not.

Page 126: MT2004

10.4 Two-sample t-tests


We wish to test H0: μ_X = μ_Y against H1: μ_X ≠ μ_Y

s² = (10 + 26) / 7 = 36 / 7, and Σxi/n = 12, Σyj/m = 6

tobs = (12 - 6)/√((36/7)(1/5 + 1/4)) = 3.944 > t7;0.025 = 2.3646

There is evidence to reject H0 at the 5% significance level.

In other words, the data suggest that the two population means are different

Page 127: MT2004

10.4 Two-sample t-tests

Using R:

> x=c(11,10,14,12,13)

> y=c(8,3,4,9)

> # pooled standard deviation:

> pooledsd=sqrt(((5-1)*var(x)+(4-1)*var(y))/(5+4-2))

> # observed value of the test statistic:

> tobs=(mean(x)-mean(y))/(pooledsd*sqrt(1/5+1/4))

> tobs

[1] 3.944053

> # p-value of the 2-sided test

> 2*(1-pt(tobs,5+4-2))

[1] 0.005574311
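The same result can be obtained in one call with the built-in `t.test`, provided the equal-variance (pooled) version is requested explicitly, since R's default is the Welch test with unequal variances. A minimal sketch:

```r
x <- c(11, 10, 14, 12, 13)
y <- c(8, 3, 4, 9)

# var.equal = TRUE requests the pooled two-sample t-test of Section 10.4;
# without it, t.test() performs Welch's unequal-variance test instead
res <- t.test(x, y, var.equal = TRUE)

print(res$statistic)   # observed t statistic
print(res$parameter)   # degrees of freedom, n + m - 2
print(res$p.value)     # two-sided p-value
```

The statistic and p-value should agree with the hand computation above.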

Page 128: MT2004

10.5 Testing equality of variances

Motivation: to apply the two-sample t-test of Section 10.4, we need to check that the two samples come from normal distributions with the same variance

Consider X1,…,Xn and Y1,…,Ym two random samples drawn from normal distributions. We also assume independence.

Let σ_X² and σ_Y² be the population variances of the two random samples.

Remember the strategy of hypothesis testing:

Step 1: We wish to test H0: σ_X² = σ_Y² vs H1: σ_X² ≠ σ_Y²

Step 2: We need to find a ‘good’ test statistic, i.e. a function of the data that takes ‘extreme’ values if H1 is true, and moderate values if H0 is true.

Page 129: MT2004

10.5 Testing equality of variances

We’ve seen that:

So what about the ratio:

?????

Page 130: MT2004

10.5 Testing equality of variances

Under H0: σ_X² = σ_Y² = σ², a little algebra gives the ratio of the two sample variances, S_X²/S_Y², as test statistic.

Under the null hypothesis the terms involving σ² cancel.

If the alternative hypothesis is true, i.e. if σ_X² ≠ σ_Y², then the value of this test statistic will tend to be small or large depending on whether σ_X² < σ_Y² or σ_X² > σ_Y².

Page 131: MT2004

10.5 Testing equality of variances

Step 3: Now we need the distribution of this test statistic under H0.

By definition of an F distribution, we have that:

that is:

using the main property of F distributions.

or
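The formulas on this slide are missing from the transcript. Assuming the F distribution is defined as the ratio of two independent χ² variables, each divided by its degrees of freedom, and taking the "main property" to be that the reciprocal of an F variable with (a, b) degrees of freedom is F with (b, a), the slide would presumably read, under H0:

```latex
\[
\frac{\dfrac{(n-1)S_X^2}{\sigma^2}\Big/(n-1)}
     {\dfrac{(m-1)S_Y^2}{\sigma^2}\Big/(m-1)}
= \frac{S_X^2}{S_Y^2} \sim F_{n-1,\,m-1},
\qquad\text{or equivalently}\qquad
\frac{S_Y^2}{S_X^2} \sim F_{m-1,\,n-1}.
\]
```

Either ratio can serve as the test statistic, as long as the degrees of freedom are ordered to match (numerator first).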

Page 132: MT2004

10.5 Testing equality of variances

Step 4: We will reject the null hypothesis if the observed value of this test statistic is greater than the upper quantile of the appropriate F distribution (using the Tables or program R).

Note that it is enough to compare the larger of the two test statistics described on the previous slide with the upper quantile of the appropriate distribution.

Example: consider two samples, one of size 11 and the other of size 16, from two normal distributions. The sample variance of the first is 20 and the sample variance of the second is 30. At the 5% level, is there evidence to reject the hypothesis that the two populations have the same variance? Note that F15,10;0.025 = 3.522

Page 133: MT2004

10.5 Testing equality of variances


1) We wish to test H0: σ_X² = σ_Y² vs H1: σ_X² ≠ σ_Y², where X has sample size 11 and Y has sample size 16, with s_X² = 20 and s_Y² = 30 respectively. This is a test of equality of variances.

2) To perform it, we calculate the observed value of the test statistic (the larger of the two ratios): fobs = s_Y²/s_X² = 30/20 = 1.5

3) We need to compare this observed value to the 2.5% upper quantile of an F distribution with 15 and 10 degrees of freedom, i.e. F15,10;0.025 which is equal to 3.522

Page 134: MT2004

10.5 Testing equality of variances


4) fobs = 1.5 < F15,10;0.025 = 3.522

5) So there is no evidence to reject the null hypothesis. We fail to reject the equality of variances.
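The five steps above can be sketched in R from the summary statistics. The variable names are illustrative, and the p-value line (doubling the upper tail of the larger ratio) is one common convention for the two-sided F test rather than something stated on the slides.

```r
# Summary statistics from the example
n_x <- 11; s2_x <- 20
n_y <- 16; s2_y <- 30

# Larger sample variance in the numerator, so only the upper tail is needed
fobs <- s2_y / s2_x
df1  <- n_y - 1   # numerator degrees of freedom
df2  <- n_x - 1   # denominator degrees of freedom

# Upper 2.5% quantile for a two-sided test at the 5% level
crit <- qf(0.975, df1 = df1, df2 = df2)

# ASSUMED convention: two-sided p-value as twice the upper tail area
pval <- min(1, 2 * pf(fobs, df1, df2, lower.tail = FALSE))

reject <- fobs > crit
cat("fobs =", fobs, " critical value =", crit, " p-value =", pval,
    " reject H0:", reject, "\n")
```

Keeping the degrees of freedom in the same order as the ratio (numerator first) is the step most easily got wrong.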

Note: We might now consider testing whether the two population means are different.

Sections 10.4 and 10.5 might be useful for Prac3 (two random samples from the same normal distribution?)

Page 135: MT2004

Examples

Example 1:

The data are the number of moths caught during the night by 11 traps of one style and 8 traps of a second style.

Trap type 1: 41 34 33 36 40 25 31 37 34 30 38

Trap type 2: 52 57 62 55 64 57 56 55

We assume that the two samples of measurements are taken at random from normal populations, and we ask if the variances of the two populations are equal (5% significance level).

Page 136: MT2004

Examples

Solution - Example 1:

We wish to test H0: σ1² = σ2² vs H1: σ1² ≠ σ2² (test of equality of variances)

We have that n1 = 11, n2 = 8, so ν1 = 10 and ν2 = 7

s1² = 21.87 and s2² = 15.36

So Fobs = s1²/s2² = 21.87/15.36 = 1.42

F10,7;0.025 = 4.76

Fobs < F10,7;0.025, therefore we cannot reject H0 at the 5% level.
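As a check, the same test can be run on the raw counts with R's built-in `var.test`, which computes the ratio of the two sample variances and a two-sided p-value. A minimal sketch (variable names are illustrative):

```r
trap1 <- c(41, 34, 33, 36, 40, 25, 31, 37, 34, 30, 38)
trap2 <- c(52, 57, 62, 55, 64, 57, 56, 55)

# F test for equality of two variances (two-sided by default)
res <- var.test(trap1, trap2)

# Critical value used on the slide: upper 2.5% point of F with (10, 7) d.f.
crit <- qf(0.975, df1 = length(trap1) - 1, df2 = length(trap2) - 1)

cat("s1^2 =", var(trap1), " s2^2 =", var(trap2), "\n")
cat("Fobs =", res$statistic, " critical value =", crit,
    " p-value =", res$p.value, "\n")
```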

Page 137: MT2004

Examples

Example 2:

The data are human blood-clotting times (in minutes) of individuals given one of two different drugs.

Given drug B: 8.8 8.4 7.9 8.7 9.1 9.6

Given drug G: 9.9 9.0 11.1 9.6 8.7 10.4 9.5

We assume that the two samples of measurements are taken at random from normal populations with the same variance, and we ask if blood of persons treated with drug B has the same mean clotting time as does blood from persons treated with drug G (5% significance level).

Page 138: MT2004

Examples

Solution - Example 2:

We wish to test H0: μ1 = μ2 vs H1: μ1 ≠ μ2 (test of equality of means)

We have that n1 = 6, n2 = 7, (n1-1)s1² = 1.6950, (n2-1)s2² = 4.0171. So the pooled variance is s² = (1.6950+4.0171)/(6+7-2) = 0.5193

We get tobs = (8.75-9.74)/(0.40) = -2.475, where 0.40 = √(0.5193 × (1/6 + 1/7)) is the standard error of the difference in sample means

t11;0.025 = 2.201

tobs < -t11;0.025, therefore reject H0 at the 5% level.

The lower-tail probability is P(T < -2.475) = 1 - F(2.475) = P(T > 2.475), where F is the c.d.f. of the t distribution with 11 degrees of freedom; from the Tables it lies between 0.01 and 0.025. Doubling these bounds (2-sided test), the p-value lies between 0.02 and 0.05, so we reject H0 at the 5% significance level.
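The hand calculation can be checked on the raw clotting times with the pooled two-sample `t.test`. This is a sketch with illustrative variable names; because it works with unrounded means, the statistic may differ slightly in the third decimal from the value obtained with rounded intermediate results.

```r
drugB <- c(8.8, 8.4, 7.9, 8.7, 9.1, 9.6)
drugG <- c(9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5)

# Pooled variance, as in the hand calculation
n1 <- length(drugB); n2 <- length(drugG)
s2_pooled <- ((n1 - 1) * var(drugB) + (n2 - 1) * var(drugG)) / (n1 + n2 - 2)

# Two-sample t-test assuming equal variances (two-sided by default)
res <- t.test(drugB, drugG, var.equal = TRUE)

cat("pooled variance =", s2_pooled, "\n")
cat("tobs =", res$statistic, " df =", res$parameter,
    " p-value =", res$p.value, "\n")
```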

Page 139: MT2004

Examples

Example 3:

The data are weight changes of humans, tabulated after administration of a drug proposed to result in weight loss. Each weight change (in kg) is the weight after minus the weight before drug administration.

Data: 0.2 -0.5 -1.3 -1.6 -0.7 0.4 -0.1 0.0 -0.6 -1.1 -1.2 -0.8

We assume that the sample of measurements is taken at random from a normal population, and we ask whether a weight loss occurs after the drug is taken (5% significance level).

Page 140: MT2004

Examples

Solution - Example 3:

We wish to test H0: μ = 0 vs H1: μ < 0 (one-sided, one-sample t-test on the mean weight change)

We have that n = 12, s² = 0.4008, ΣXi/n = -0.61.

We get tobs = (-0.61-0)/√(0.4008/12) = -0.61/0.18 = -3.389

t11;0.05 = 1.7959

tobs < -t11;0.05, therefore reject H0 at the 5% level.

The p-value of this test is P(T < -3.389) = 1 - F(3.389) = P(T > 3.389), where F is the c.d.f. of the t distribution with 11 degrees of freedom; it lies between 0.001 and 0.005, so we reject H0 at the 5% significance level.
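A one-sided test of this kind can be run directly on the raw data with `t.test` and `alternative = "less"`. The sketch below uses illustrative variable names; since it avoids rounding the mean and the standard error, the statistic comes out near -3.33 rather than the -3.389 obtained above from rounded intermediate values, with the same conclusion.

```r
# Weight change (kg): weight after minus weight before
change <- c(0.2, -0.5, -1.3, -1.6, -0.7, 0.4, -0.1, 0.0,
            -0.6, -1.1, -1.2, -0.8)

# H0: mu = 0 against H1: mu < 0 (a mean weight loss)
res <- t.test(change, mu = 0, alternative = "less")

cat("mean =", mean(change), " variance =", var(change), "\n")
cat("tobs =", res$statistic, " df =", res$parameter,
    " one-sided p-value =", res$p.value, "\n")
```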

Page 141: MT2004

Examples

Example 4:

We consider an experiment designed to test whether a new fertilizer results in an increase of more than 250kg/ha in crop yield over the old fertilizer. 9 pairs of test plots were set up with same environmental conditions (paired samples…).

New fertilizer: 2250 2410 2260 2200 2360 2320 2240 2300 2090

Old fertilizer: 1920 2020 2060 1960 1960 2140 1980 1940 1790

Test the hypothesis at the 5% significance level.

Page 142: MT2004

Examples

Solution - Example 4:

We first consider the differences di = newi - oldi

di: 330 390 200 240 400 180 260 360 300

We wish to test H0: μ_d = 250 vs H1: μ_d > 250 (one-sided test on the mean difference for paired samples)

We have that n = 9, s = 80.6, Σdi/n = 295.6.

We get tobs = (295.6-250)/(80.6/√9) ≈ 1.695

t8;0.05 = 1.8595

tobs < t8;0.05, therefore do not reject H0 at the 5% level.

The p-value of this test is P(T > 1.695), which lies between 0.05 and 0.1, so we cannot reject H0 at the 5% significance level.
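This paired, one-sided test with a non-zero null value can be reproduced with `t.test` by combining `paired = TRUE`, `mu = 250` and `alternative = "greater"`. A minimal sketch with illustrative variable names:

```r
new <- c(2250, 2410, 2260, 2200, 2360, 2320, 2240, 2300, 2090)
old <- c(1920, 2020, 2060, 1960, 1960, 2140, 1980, 1940, 1790)

# Differences within each pair of plots
d <- new - old

# H0: mean difference = 250 against H1: mean difference > 250
res <- t.test(new, old, paired = TRUE, mu = 250, alternative = "greater")

cat("mean(d) =", mean(d), " sd(d) =", sd(d), "\n")
cat("tobs =", res$statistic, " df =", res$parameter,
    " one-sided p-value =", res$p.value, "\n")
```

The standard error uses √9 because there are 9 pairs of plots.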