
MT2004

Olivier GIMENEZ

Telephone: 01334 461827

E-mail: [email protected]

Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html

So far: data-driven statistical methods, i.e. using data to answer questions in direct ways.

The rest of the module - from Section 7 - deals with

CAPTURING PATTERNS IN THE DATA USING MODELS: Modelling Step

ANALYSING THESE MODELS TO ANSWER QUESTIONS: Estimation and Inference Step

7. Basic Normal theory & the Central Limit Theorem

7.1 Basic properties of normal distributions

See Section 2.2.2

[Figure: density curves f(x) of four normal distributions: µ = 0, σ = 1; µ = 3, σ = 0.5; µ = 3, σ = 1; µ = 1, σ = 2. Horizontal axis x from -4 to 6, vertical axis f(x) from 0 to 0.8.]

7.1.1 Linear transformation of a normal r.v.

Let X be a random variable with mean µ and variance σ².

Let Y = a + bX, then

E(Y) = a + b E(X) = a + bµ

V(Y) = b² V(X) = b²σ²


Now, if X is normally distributed with mean µ and variance σ², then Y = a + bX is normally distributed too.

In other words, any linear transformation of a normal random variable is itself normally distributed.

And more precisely, according to the previous slide,

X ~ N(µ, σ²) ⇒ Y ~ N(a + bµ, b²σ²)

Demonstration: homework


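Now, suppose that X ~ N(µ, σ²), and consider

Z = (X - µ)/σ

What is the distribution of Z?

Write

Z = -µ/σ + (1/σ) X

and remember that X ~ N(µ, σ²) ⇒ Y = a + bX ~ N(a + bµ, b²σ²)

so by identification, we obtain

a = -µ/σ and b = 1/σ

Finally

Z ~ N(-µ/σ + µ/σ, σ²/σ²) = N(0, 1)

Result:

X ~ N(µ, σ²) ⇒ Z = (X - µ)/σ ~ N(0, 1)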


Very useful result for working out probabilities associated with any normal distribution.

Idea: transform to the standard normal distribution N(0,1) and use the published tables for probabilities associated with N(0,1).

E.g.:

z          0.0      0.5      1.0      2.0      2.5      3.0
Pr(Z>z)    0.5000   0.30854  0.15866  0.02275  0.00621  0.00135

(See Table 5 of the K & Y Tables)




Example:

Calculate the probability that a random variable X N(3,4) takes a value between 4 and 5

We wish to compute Pr(4 ≤ X ≤ 5).

Using the result above with µ = 3 and σ = 2, we have that Z = (X-3)/2 ~ N(0,1)

So Pr(4 ≤ X ≤ 5) = Pr((4-3)/2 ≤ (X-3)/2 ≤ (5-3)/2) = Pr(1/2 ≤ Z ≤ 1)

Finally Pr(4 ≤ X ≤ 5) = Pr(Z ≤ 1) - Pr(Z ≤ 1/2)

= (1-0.15866) - (1-0.30854) = 0.14988
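The same probability can be checked in R rather than with tables. This is a minimal sketch: `pnorm` is the N(µ, σ²) c.d.f. in base R (parameterised by the standard deviation, not the variance), and the variable names are chosen here for illustration.

```r
# X ~ N(3, 4): mean 3, variance 4, so standard deviation 2
mu <- 3
sigma <- sqrt(4)

# Direct computation: Pr(4 <= X <= 5) = F(5) - F(4)
p_direct <- pnorm(5, mean = mu, sd = sigma) - pnorm(4, mean = mu, sd = sigma)

# Same computation after standardising: Z = (X - mu)/sigma ~ N(0,1)
z_lo <- (4 - mu) / sigma   # 0.5
z_hi <- (5 - mu) / sigma   # 1
p_std <- pnorm(z_hi) - pnorm(z_lo)

print(c(direct = p_direct, standardised = p_std))
```

Both routes give the same number, which is the point of the standardisation result: any normal probability reduces to a N(0,1) probability.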

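7.1.2 Sums of independent normal random variables

Sums of (normal) r.v's occur frequently in statistical theory (e.g. mean, variance...). What is their distribution?

If X1, X2 are independent with Xi ~ N(µi, σi²), i = 1,2,

then X1 + X2 ~ N(µ1 + µ2, σ1² + σ2²)

Extension: if X1,...,Xn are independent r.v's with Xi ~ N(µi, σi²), i = 1,...,n, and a1,...,an are constants, then

a1 X1 + ... + an Xn ~ N(a1 µ1 + ... + an µn, a1² σ1² + ... + an² σn²)

In particular, if X1,...,Xn are i.i.d. N(µ, σ²), taking ai = 1/n shows that the mean (X1 + ... + Xn)/n ~ N(µ, σ²/n)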

7.2 The Central Limit Theorem

We've just seen that the mean of n independent identically normally distributed r.v's is itself normally distributed.

The Central Limit Theorem (CLT) states that the mean of n i.i.d. r.v's from any distribution with finite mean µ and variance σ² is approximately normally distributed for large enough n: (X1 + ... + Xn)/n is approximately N(µ, σ²/n).


Example: A bridge can hold at most 400 vehicles if they are bumper-to-bumper and stationary. The mean weight of vehicles using the bridge is 2.5 tonnes with a standard deviation of 2.0 tonnes. What is the probability that the maximum design load of 1100 tonnes will be exceeded in a traffic jam?



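Let Xi be the weight of vehicle i, i = 1,...,n. Here, we have that n = 400, µ = 2.5t and σ = 2.0t.

The probability that the maximum design load of 1100 tonnes will be exceeded in a traffic jam is given by Pr(Σi Xi > 1100).

We'd like to use the CLT: for X1,...,Xn i.i.d. r.v's with mean µ and variance σ², the mean (Σi Xi)/n is approximately N(µ, σ²/n), so the total Σi Xi is approximately N(nµ, nσ²) = N(1000, 1600).

Hence Pr(Σi Xi > 1100) ≈ Pr(Z > (1100 - 1000)/40) = Pr(Z > 2.5) = 0.00621, using the table of Pr(Z > z) given earlier.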
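As a check on the bridge example, here is a short R sketch of the CLT calculation. It assumes, as the CLT suggests, that the total weight of the n vehicles is approximately N(nµ, nσ²); the variable names are illustrative only.

```r
n <- 400          # vehicles on the bridge
mu <- 2.5         # mean vehicle weight (tonnes)
sigma <- 2.0      # standard deviation of vehicle weight (tonnes)
load_max <- 1100  # maximum design load (tonnes)

# CLT: the total weight is approximately N(n*mu, n*sigma^2)
total_mean <- n * mu           # 1000 tonnes
total_sd <- sqrt(n) * sigma    # 40 tonnes

# Standardise the design load and take the upper tail of N(0,1)
z <- (load_max - total_mean) / total_sd
p_exceed <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_exceed = p_exceed))
```

The standardised load is z = 2.5, so the answer is the tabulated value Pr(Z > 2.5).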

How to use Tables?

Table of the Standard Normal Distribution

Values inside the table = areas under the density of Z ~ N(0,1) between -∞ and z

i.e. Φ(z) = P(Z ≤ z)

Example 1: to determine Φ(1.96) = P(Z ≤ 1.96), i.e. the area under the curve between -∞ and 1.96, look in the cell at the intersection of the row labelled 1.90 and the column labelled 0.06. The area under the curve is 0.975.

Example 2: Find z such that Φ(z) = 0.95. P(Z ≤ 1.64) = 0.9495 and P(Z ≤ 1.65) = 0.9505, so by interpolation z = 1.645.

Example 3: Φ(-1.23) = 1 - Φ(1.23) = 1 - 0.8907 = 0.1093
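The three table look-ups above have direct equivalents in R. A minimal sketch: `pnorm` returns Φ(z) and `qnorm` returns its inverse Φ-1(p); both are base R functions.

```r
# Example 1: Phi(1.96), area to the left of 1.96
phi_196 <- pnorm(1.96)

# Example 2: the z with Phi(z) = 0.95 (no interpolation needed)
z_95 <- qnorm(0.95)

# Example 3: Phi(-1.23), computed directly and via the symmetry 1 - Phi(1.23)
phi_neg <- pnorm(-1.23)
phi_sym <- 1 - pnorm(1.23)

print(round(c(phi_196 = phi_196, z_95 = z_95, phi_neg = phi_neg, phi_sym = phi_sym), 4))
```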

7.3 Approximating other distributions by normal distributions

The CLT also provides the justification for approximating several other distributions by a normal distribution.

We consider two examples, the Binomial and the Poisson distributions.

The Binomial probability distribution is:

P(X = x) = [n! / (x! (n-x)!)] p^x (1-p)^(n-x), x = 0, 1, ..., n

It becomes hard to evaluate for large n as the factorials in the binomial coefficient 'explode'.

However, the CLT can be used to overcome this problem.

We first note that if X ~ Bin(n,p), then X can be written as a sum of n independent Bernoulli r.v's: X = X1 + ... + Xn, where Xi ~ Bin(1,p).

Each Xi has mean p and variance p(1-p).

Thus the CLT implies that X/n = (X1 + ... + Xn)/n is approximately N(p, p(1-p)/n)

Alternatively: X is approximately N(np, np(1-p))

In real life, the approximation will be good enough when n is large and p is not too close to 0 or 1.

Example: The probability of annual survival of a bird species is 0.4. Suppose we are studying a population of n = 200 individuals. What is the probability that less than 50% of the population survives the current year?

Let Xi be the random variable 'individual i survives the year'; we have that Xi ~ Bin(1,0.4), so each Xi has mean 0.4 and variance 0.4(1-0.4) = 0.24. Then X = X1 + ... + Xn is the total number of surviving individuals, X ~ Bin(200,0.4).

Via the CLT: X is approximately N(200 × 0.4, 200 × 0.24) = N(80, 48)

So that P(X < 100) ≈ P(Z < (100 - 80)/√48) = P(Z < 2.89), with Z ~ N(0,1).

Using tables for the standard normal distribution, we have that P(Z < 2.89) = 0.9981; rounding the bound up to 3 gives P(Z<3) = 0.9987.

Without invoking the CLT, we would need to compute

P(X<100) = P(X=0) + P(X=1) + ... + P(X=99)
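For the bird survival example, R can evaluate both the exact binomial sum and the normal approximation, which makes the quality of the approximation visible. This is a sketch only: the continuity-corrected version is an extra refinement that the slides do not use, included here for comparison.

```r
n <- 200
p <- 0.4

# Exact: P(X < 100) = P(X <= 99) for X ~ Bin(200, 0.4)
exact <- pbinom(99, size = n, prob = p)

# CLT approximation: X approximately N(np, np(1-p)) = N(80, 48)
approx <- pnorm(100, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Not in the slides: continuity correction, evaluating at 99.5 instead of 100
approx_cc <- pnorm(99.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

print(c(exact = exact, normal = approx, normal_cc = approx_cc))
```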


The Poisson probability function is:

P(X = x) = e^(-λ) λ^x / x!, x = 0, 1, 2, ...

It becomes hard to evaluate for high values of λ as λ^x gets huge.

However, the CLT can be used to overcome this problem.

We note first that if X1, X2 are independent with Xi ~ Pois(λi), i = 1,2, then X1 + X2 ~ Pois(λ1 + λ2) (to be proved in Honours).

It follows that if X ~ Pois(λ), then X can be written as a sum of n independent Poisson r.v's: X = X1 + ... + Xn, where Xi ~ Pois(λ/n).

Each Xi has mean λ/n and variance λ/n (mean = variance: homework).

Thus the CLT implies that X/n = (X1 + ... + Xn)/n is approximately N(λ/n, λ/n²)

So that X is approximately N(λ, λ)



Example: Find the probability that a Poisson distributed r.v. with mean 25 takes a value in the range 26 to 30.

If X ~ Pois(25), we need to calculate P(26 ≤ X ≤ 30)

The CLT tells us that X is approximately N(25, 25), so Z = (X - 25)/5 is approximately N(0,1)

Using tables for the standard normal distribution, we have that

P(26 ≤ X ≤ 30) ≈ P(0.2 ≤ Z ≤ 1) = P(Z ≤ 1) - P(Z ≤ 0.2)

= 0.8413 - 0.5793 = 0.262
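The Poisson example can be reproduced in R, alongside the exact Poisson probability for comparison. A minimal sketch, assuming the approximation X ~ N(λ, λ) with λ = 25 and no continuity correction, as in the table-based calculation; variable names are illustrative.

```r
lambda <- 25

# Normal approximation: X approximately N(25, 25), so sd = 5
z_lo <- (26 - lambda) / sqrt(lambda)   # 0.2
z_hi <- (30 - lambda) / sqrt(lambda)   # 1
approx <- pnorm(z_hi) - pnorm(z_lo)

# Exact: P(26 <= X <= 30) = P(X <= 30) - P(X <= 25) for X ~ Pois(25)
exact <- ppois(30, lambda) - ppois(25, lambda)

print(c(normal = approx, exact = exact))
```

Printing both values shows how much is lost by approximating a discrete distribution with a continuous one at this value of λ.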

8. Practical Applications of Normal Distributions

Why are normal distributions so important?

1 – The CLT shows that sums of i.i.d r.v’s tend towards normality, even if the r.v’s are non-normal

2 – For many data sets, a normal distribution provides a good model (it describes the data adequately): heights of people, IQ scores…

3 – Easy to work with mathematically (integrals, tables…)

4 – Statistical procedures based on normality assumption are often insensitive to small violations of the assumption (ANOVA e.g., see future section)

5 – Non-normal distributions can be transformed to approximate normality

8.1 Testing for normality

Before using the normal distribution as a model of the data (e.g. to perform a test about the mean of a population), we need to decide whether or not the random sample under investigation could have been drawn from a normal distribution.

There are analytical tests (Pearson, Kolmogorov…) but we will focus on a graphical method here.

We won’t be able to prove normality, only fail to reject the hypothesis that the data come from a normal distribution (hypothesis testing philosophy, finite random sample).


First idea: use a histogram, and compare with what we would expect for a normal distribution, i.e. bell-shaped, symmetric with a single peak (unimodal).

[Figure: histograms of random samples (n=30) from N(3, var=25), each overlaid with the density curve of N(x̄, s²), where x̄ = Σxi/n is the sample mean and s² the sample variance.]

It is difficult to conclude in favour of normality because of sampling variability.


Second idea:

Remember that any normal r.v. is a linear transformation of a standard normal r.v.

So if y1,…,yn is a random sample from any normal r.v. (N(µ,σ²) say) and z1,…,zn is a random sample from a N(0,1),

then plot the sorted y values against the sorted z values.

We would get something close to a straight line because

Y = µ + σZ

[Figure: plots of sorted random samples (n=30) from N(3, var=25) against sorted random samples from N(0,1).]

It is difficult to conclude in favour of normality because of sampling variability.

Third idea: to overcome the problem of variability, use an ‘idealised’/theoretical average sample from N(0,1), the normal scores.

If Z ~ N(0,1), by definition of the cumulative distribution function/(lower) quantile, we have that:

P(Z ≤ z10% quantile) = Φ(z10% quantile) = 0.10

…

Meaning that, on average, we expect 10% of the data points to lie below the 10% quantile of the c.d.f., 20% below the 20% quantile, …, and 100% below the 100% quantile.


Consider a sample of 10 points e.g. from N(0,1)

We’ve got 10 probability intervals corresponding to the quantiles:

[0,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4], [0.4,0.5]

[0.5,0.6], [0.6,0.7], [0.7,0.8], [0.8,0.9], [0.9,1.0]

For convenience, consider the mid-point of each interval (i-0.5)/10, i=1,…,10

The normal scores are obtained by computing Φ-1((i-0.5)/10), i=1,…,10, where Φ is the c.d.f. of the N(0,1)

Φ-1((1-0.5)/10) = Φ-1(0.05) = -1.645

Φ-1((2-0.5)/10) = Φ-1(0.15) = -1.036

… and so on, up to Φ-1((10-0.5)/10) = Φ-1(0.95) = 1.645
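The full set of ten normal scores can be obtained in one line of R. A minimal sketch: `qnorm` is Φ-1 in base R, and the names `probs` and `scores` are chosen here for illustration.

```r
n <- 10
i <- 1:n

# Mid-points of the n probability intervals [0,0.1], ..., [0.9,1.0]
probs <- (i - 0.5) / n

# Normal scores: the N(0,1) quantiles at those mid-points
scores <- qnorm(probs)

print(round(data.frame(i = i, prob = probs, score = scores), 3))
```

The scores are symmetric about zero, since the mid-points are symmetric about 0.5.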


Idea: plot the sorted observed y values against the normal scores, then check visually for linearity.

Example: Early in the 20th century, a Colonel L.A. Waddell collected 32 skulls from Tibet. He collected 2 groups: 17 from graves on Sikkim and 15 from a battlefield near Lhasa. Here are maximum skull length measurements (in mm) for the Lhasa group: 182, 180, 191, 184, 181, 173, 189, 175, 196, 200, 185, 174, 195, 197, 182. Before doing anything with these data (e.g. testing for a difference in the mean skull length between the 2 groups), we need to check for normality first.

Example

Sample      Sorted sample   i = 1,…,15   (i-0.5)/15   Φ-1((i-0.5)/15)
quantiles   quantiles

182         173             1            0.03         -1.83
180         174             2            0.10         -1.28
191         175             3            0.17         -0.97
184         180             4            0.23         -0.73
181         181             5            0.30         -0.52
173         182             6            0.37         -0.34
189         182             7            0.43         -0.17
175         184             8            0.50          0.00
196         185             9            0.57          0.17
200         189             10           0.63          0.34
185         191             11           0.70          0.52
174         195             12           0.77          0.73
195         196             13           0.83          0.97
197         197             14           0.90          1.28
182         200             15           0.97          1.83

Example

x = seq(1, 15)

y = qnorm((x - 0.5)/15) # calculates Φ-1((i-0.5)/15), the theoretical quantiles (normal scores)

o = c(173,174,175,180,181,182,182,184,185,189,191,195,196,197,200) # sorted sample quantiles

plot(y, o, xlab="Theoretical quantiles", ylab="Sample quantiles")

Example

Alternatively, use R command qqnorm

skull.lhasa = c(182,180,191,184,181,173,189,175,196,200,185,174,195,197,182)

qqnorm(skull.lhasa)

Example

To help check linearity of a normal scores plot, add a straight line.

‘qqline’ adds a line to a normal quantile-quantile plot which passes through the first and third quartiles.

qqnorm(skull.lhasa)

qqline(skull.lhasa)

Further examples

[Figures: normal scores plots for samples from four distributions]

Sample from a distribution with more probability in the tails and centre of the distribution and less in the ‘shoulders’ than the normal distribution

Sample from a positively skewed distribution

Sample from a negatively skewed distribution

Sample from a normal distribution
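Plots of this kind can be reproduced by simulation. The sketch below is not the code behind the original figures: the specific distributions (a t distribution with 3 degrees of freedom for the heavy-tailed case, Exp(1) and its negative for the skewed cases), the sample size and the seed are all assumptions made here for illustration.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 200      # assumed sample size

# Stand-in distributions (assumptions, see the paragraph above)
samples <- list(
  "Heavy tails (t, 3 d.f.)"     = rt(n, df = 3),
  "Positively skewed (Exp(1))"  = rexp(n),
  "Negatively skewed (-Exp(1))" = -rexp(n),
  "Normal (N(0,1))"             = rnorm(n)
)

# One normal scores plot per sample, with the reference line added
par(mfrow = c(2, 2))
for (name in names(samples)) {
  qqnorm(samples[[name]], main = name)
  qqline(samples[[name]])
}
```

Heavy tails show up as an S-shape around the line, skewness as a curve bending away from it at one end, and only the normal sample should follow the line closely throughout.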

8.2 Using a normal distribution as a model when the variance σ² is known (z-test)

We wish to learn something about a population on the basis of a random sample from that population

Why? Because it is often impractical to work with the whole population, so we test hypotheses about the population on the basis of a sample drawn from it

We assume here that the population may be modelled using a normal distribution with unknown mean µ but known variance σ²

For example, suppose one wants to investigate the IQ of students at St Andrews. As you cannot study the whole population, you need to measure the IQs of a random sample of students.

In this example, we could ask what the average IQ of St Andrews students is, or test the hypothesis that it is greater than some value

8.2.1 Hypothesis testing: parametric approach

General approach for hypothesis testing:

1 – Define a null hypothesis H0 and an alternative hypothesis H1

2 – Choose a test statistic which will distinguish between H0 and H1 by taking ‘extreme’ values if H1 is true, and moderate values otherwise

3 – Find the distribution of the test statistic under H0

4 – Using this distribution, determine the probability of obtaining a test statistic at least as ‘extreme’ as the one observed under H0. This is the p-value of the data under the test

5 – Conclude: a very low p-value suggests that H0 is false, otherwise we cannot reject H0


And in the particular case of a normally distributed population with known variance?

Define x1,…,xn a set of independent observations from a population which can be modelled by a normal distribution with unknown mean µ and known variance σ².

Using these observations, we are able to test hypotheses about the mean of the whole population, e.g.

1 – H0: µ = µ0 against H1: µ ≠ µ0 - TWO-SIDED TEST

2 – A ‘good’ test statistic is the mean of the observations x̄ = Σxi/n, which will tend to be close to µ0 under H0 but further away under H1

3 – The distribution of the test statistic under H0 is X̄ ~ N(µ0, σ²/n), i.e. Z = (X̄ - µ0)/(σ/√n) ~ N(0,1)

4 – We use a significance level of α = 5%, and extreme values on either side of the mean are of interest since H1 does not distinguish between them

The appropriate range of values is P(-zα/2 ≤ Z ≤ zα/2) = 1-α = 0.95, i.e. P(Z ≤ zα/2) - P(Z ≤ -zα/2) = 2P(Z ≤ zα/2) - 1 = 0.95

P(Z ≤ zα/2) = 0.975 so zα/2 = 1.96

[Figure: standard normal density of Z, with the central region between -zα/2 and zα/2.]


4 – At the significance level α = 5%, the region of acceptance of H0 is [-1.96,1.96]

5 – So if the observed value zobs of the test statistic is outside this region, we reject H0, otherwise we cannot.

ALTERNATIVELY,

4 – We can calculate the probability, under H0, of values at least as extreme as the observed value, i.e. the p-value.

In other words, for zobs ≥ 0 (use |zobs| otherwise), we want to compute the p-value

P(Z ≤ -zobs or Z ≥ zobs) = 1-P(Z ≤ zobs)+1-P(Z ≤ zobs) = 2(1-Φ(zobs))

5 – Finally, if the p-value is < 0.05, we reject H0 (zobs is outside the acceptance region), otherwise we cannot.
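Both routes to the two-sided decision (critical value and p-value) are one-liners in R. A minimal sketch: the helper name `two_sided_p` is hypothetical, and taking the absolute value of zobs makes it valid for negative observed values too.

```r
alpha <- 0.05

# Critical value: z such that Phi(z) = 1 - alpha/2 = 0.975
z_crit <- qnorm(1 - alpha / 2)

# Two-sided p-value 2 * (1 - Phi(|z_obs|)); hypothetical helper name
two_sided_p <- function(z_obs) {
  2 * pnorm(abs(z_obs), lower.tail = FALSE)
}

print(z_crit)
print(two_sided_p(c(-2.5, 1.0, 1.96)))
```

The two routes agree: the p-value is below α exactly when |zobs| exceeds the critical value.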



H0: µ = µ0 against H1: µ > µ0 - ONE-SIDED TEST

P(Z ≤ zα) = 1-α = 0.95 so zα = 1.645, and we accept H0 if zobs ≤ 1.645

ALTERNATIVELY

compute the p-value P(Z ≥ zobs) = 1-Φ(zobs) and reject H0 if it is < 0.05

H0: µ = µ0 against H1: µ < µ0 - ONE-SIDED TEST

P(Z ≥ -zα) = 1-α = 0.95 so -zα = -1.645, and we accept H0 if zobs ≥ -1.645

ALTERNATIVELY

compute the p-value P(Z ≤ zobs) = Φ(zobs) and reject H0 if it is < 0.05


Example - Tutorial 4, Question 5a

Test at the 5% level the hypothesis that the mean spinning time has not been affected by lubrication, against the alternative that it has been increased. Report the p-value.

Test H0: µ = 150 against H1: µ > 150. We have that n = 5 and x̄ = Σxi/n = 162. Hence:

zobs = (x̄ - 150)/(σ/√5) = 2.68 > 1.645, with σ the known standard deviation given in the tutorial question.

We thus reject H0 at the 5% significance level.

The p-value is 1-Φ(2.68) = 1 - 0.9963 = 0.0037 << 0.05 so we reject H0
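The tutorial calculation can be sketched in R as a small one-sided z-test function. Two things here are assumptions rather than part of the notes: the function name `z_test_upper` is hypothetical, and the value σ = 10 is not stated in this section; it is back-computed from the reported zobs = 2.68 with n = 5 and x̄ - µ0 = 12, so check it against the tutorial sheet.

```r
# Upper one-sided z-test of H0: mu = mu0 against H1: mu > mu0, sigma known
# (hypothetical helper, written for this example)
z_test_upper <- function(xbar, mu0, sigma, n, alpha = 0.05) {
  z_obs <- (xbar - mu0) / (sigma / sqrt(n))
  p_value <- pnorm(z_obs, lower.tail = FALSE)   # 1 - Phi(z_obs)
  list(z_obs = z_obs, p_value = p_value, reject = p_value < alpha)
}

# sigma = 10 is an ASSUMPTION inferred from z_obs = 2.68 (see text above)
res <- z_test_upper(xbar = 162, mu0 = 150, sigma = 10, n = 5)
print(res)
```

The small difference from the tabulated p-value of 0.0037 comes from rounding zobs to two decimal places before using the table.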

8.2.2 The power of a test

There are two types of error that can be made when hypothesis testing:

Type I error: Reject H0 when it is true; P(type I error) = α

Type II error: Accept H0 when it is false; P(type II error) = ?

We'd like to minimize the two types of error, but the problem is that when α decreases, P(type II error) increases and vice-versa.

So set α to a fixed value, and try to minimize P(type II error)

                    DECISION
REALITY         Accept H0         Reject H0
H0 true         OK                Type I error
H0 false        Type II error     OK


Definition: The power of a test is the probability of rejecting H0 when it is false.

P(type II error) = P(Accept H0 | H0 false) so power = 1 - P(type II error).

So once α is fixed, we'll do our best to increase the power of the test.

Example: Consider H0: µ = µ0 vs H1: µ > µ0

Note: If n increases, Φ(...) decreases and the power increases; in other words, for a given α, the power increases as the sample size increases, and so does your ability to detect an alternative hypothesis.
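The effect of the sample size on power can be illustrated numerically. The sketch below uses the standard result for the upper one-sided z-test: H0 is rejected when Z > zα, and under a true mean µ1 the statistic Z is N((µ1 - µ0)√n/σ, 1), so power = 1 - Φ(zα - (µ1 - µ0)√n/σ). The function name `power_upper` and the numerical values (µ0 = 0, µ1 = 0.5, σ = 1) are hypothetical, chosen only for illustration.

```r
# Power of the upper one-sided z-test when the true mean is mu1
# (hypothetical helper name)
power_upper <- function(mu1, mu0, sigma, n, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha)               # critical value
  shift <- (mu1 - mu0) * sqrt(n) / sigma    # mean of Z under mu1
  pnorm(z_alpha - shift, lower.tail = FALSE)
}

# Illustrative values only: mu0 = 0, mu1 = 0.5, sigma = 1
ns <- c(5, 10, 20, 50, 100)
powers <- sapply(ns, function(n) power_upper(mu1 = 0.5, mu0 = 0, sigma = 1, n = n))

print(data.frame(n = ns, power = round(powers, 3)))
```

When µ1 = µ0 the function returns α, as it should: the probability of rejecting a true H0 is the significance level.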

8.2.3 Confidence intervals

You have already calculated confidence intervals by computer-intensive methods. Here, we'll do it analytically.

An x% confidence interval for a parameter (µ here) is an interval having probability x% of including the true value of this parameter.

The parameter being estimated is a fixed quantity, but the interval is random.

In other words, it means that x% of intervals calculated in the same way for similar samples will include the true value.

For the example of n observations from a normal population with known variance σ², a 100(1-α)% confidence interval for µ is

[x̄ - zα/2 σ/√n, x̄ + zα/2 σ/√n]

with zα/2 = 1.96 for a 95% confidence interval.


Example: The wavelengths of light pulses from a semiconductor laser are approximately normally distributed, with variance calculated by theory to be 100 nm². The mean wavelength for individual lasers varies. Measurements of 100 pulses from a laser give an average wavelength of 598 nm. Find a 95% confidence interval for the mean wavelength of the laser.
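A worked version of the laser example in R, as a sketch. It assumes the usual known-variance interval x̄ ± zα/2 σ/√n, with σ = √100 = 10 nm taken from the stated variance; the variable names are illustrative.

```r
xbar <- 598          # sample mean wavelength (nm)
sigma <- sqrt(100)   # known standard deviation (nm), from variance 100 nm^2
n <- 100             # number of pulses measured
alpha <- 0.05        # for a 95% confidence interval

# Half-width of the interval: z_{alpha/2} * sigma / sqrt(n)
z <- qnorm(1 - alpha / 2)
half_width <- z * sigma / sqrt(n)

ci <- c(lower = xbar - half_width, upper = xbar + half_width)
print(round(ci, 2))
```

With σ/√n = 1 nm, the half-width is simply zα/2, so the interval is 598 ± 1.96 nm.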
