
Stats 845

Applied Statistics

This Course will cover:

1. Regression: Non-Linear Regression, Multiple Regression

2. Analysis of Variance and Experimental Design

The Emphasis will be on:

1. Learning techniques through examples.

2. Use of common statistical packages:
• SPSS
• Minitab
• SAS
• SPlus

What is Statistics?

It is the major mathematical tool of scientific inference: the art of drawing conclusions from data that is, to some extent, corrupted by some component of random variation (random noise).

Data that is affected by random components of variation is analogous to signals that are corrupted by noise.

Quite often sounds that are heard or received by some radio receiver can be thought of as signals with superimposed noise.

The objective in signal theory is to extract the signal from the received sound (i.e. remove the noise to the greatest extent possible). The same is true in data analysis.

Example A:

Suppose we are comparing the effect of three different diets on weight loss.

An observation on weight loss can be thought of as being made up of two components:

1. A component due to the effect of the diet being applied to the subject (the signal)

2. A random component (the random noise) due to other factors affecting weight loss that are not considered: initial weight of the subject, sex of the subject, metabolic makeup of the subject.

Note: random assignment of subjects to diets will ensure that this second component is a random effect.

Example B:

In this example we are again comparing the effect of three diets on weight gain. Subjects are randomly divided into three groups. Diets are randomly distributed amongst the groups. Measurements on weight gain are taken at the following times after commencement of the diet:

- one month
- two months
- 6 months
- 1 year

In addition to the two factors, Time and Diet, affecting weight gain, there are two random sources of variation (noise):

- between subject variation and

- within subject variation

This can be illustrated in a schematic fashion as follows:

[Schematic: the deterministic factors (Diet, Time) and the random noise (within subject, between subject) both feed into the response (weight gain).]

The circle of Research

1. Questions arise about a phenomenon.

2. A decision is made to collect data.

3. A decision is made as to how to collect the data. (Statistics)

4. The data is collected.

5. The data is summarized and analyzed. (Statistics)

6. Conclusions are drawn from the analysis.

Notice the two points on the circle where statistics plays an important role:

1. The analysis of the collected data.

2. The design of a data collection procedure.

The analysis of the collected data.

• This of course is the traditional use of statistics.

• Note that if the data collection procedure is well thought out and well designed, the analysis step of the research project will be straightforward.

• Usually experimental designs are chosen with the statistical analysis already in mind.

• Thus the strategy for the analysis is usually decided upon when any study is designed.

• It is a dangerous practice to select the form of analysis after the data has been collected (the choice may favour certain pre-determined conclusions and therefore result in a considerable loss in objectivity).

• Sometimes, however, a decision to use a specific type of analysis has to be made after the data has been collected (because it was overlooked at the design stage).

The design of a data collection procedure

• The importance of statistics is quite often ignored at this stage.

• It is important that the data collection procedure will eventually result in answers to the research questions,

• and that it will result in the most accurate answers for the resources available to the research team.

• Note that the success of a research project should not depend on the answers that it comes up with but on the accuracy of the answers.

• Accurate answers are usually the mark of a valuable research project.

Some definitions important to Statistics

A population:

This is the complete collection of subjects (objects) that are of interest in the study.

There may be (and frequently is) more than one population, in which case a major objective is that of comparison.

A case (elementary sampling unit):

This is an individual unit (subject) of the population.

A variable:

a measurement or type of measurement that is made on each individual case in the population.

Types of variables

Some variables may be measured on a numerical scale while others are measured on a categorical scale.

The nature of the variables has a great influence on which analysis will be used.

For Variables measured on a numerical scale the measurements will be numbers.

Ex: Age, Weight, Systolic Blood Pressure

For Variables measured on a categorical scale the measurements will be categories.

Ex: Sex, Religion, Heart Disease

Types of variables

In addition some variables are labeled as dependent variables and some variables are labeled as independent variables.

This usually depends on the objectives of the analysis.

Dependent variables are output or response variables while the independent variables are the input variables or factors.

Usually one is interested in determining equations that describe how the dependent variables are affected by the independent variables

A sample:

This is a subset of the population.

Types of Samples

Different types of samples are determined by how the sample is selected.

Convenience Samples

In a convenience sample the subjects that are most convenient to the researcher are selected as objects in the sample.

This is not a very good procedure for inferential Statistical Analysis but is useful for exploratory preliminary work.

Quota samples

In quota samples subjects are chosen conveniently until quotas are met for different subgroups of the population.

This also is useful for exploratory preliminary work.

Random Samples

Random samples of a given size are selected in such a way that all possible samples of that size have the same probability of being selected.

Convenience samples and quota samples are useful for preliminary studies. It is, however, difficult to assess the accuracy of estimates based on these types of sampling scheme.

Sometimes, however, one has to be satisfied with a convenience sample and assume that it is equivalent to a random sample.

A population statistic (parameter):

Any quantity computed from the values of variables for the entire population.

A sample statistic:

Any quantity computed from the values of variables for the cases in the sample.

Statistical Decision Making

• Almost all problems in statistics can be formulated as a problem of making a decision.

• That is, given some data observed from some phenomenon, a decision will have to be made about the phenomenon.

Decisions are generally broken into two types:

• Estimation decisions

and

• Hypothesis Testing decisions.

Probability Theory plays a very important role in these decisions and the assessment of error made by these decisions

Definition:

A random variable X is a numerical quantity that is determined by the outcome of a random experiment

Example :

An individual is selected at random from a population

and

X = the weight of the individual

The probability distribution of a (continuous) random variable is described by its probability density curve f(x),

i.e. a curve which has the following properties:

• 1. f(x) is never negative.

• 2. The total area under the curve f(x) is one.

• 3. The area under the curve f(x) between a and b is the probability that X lies between the two values.

[Figure: a probability density curve f(x), plotted for x from 0 to 120.]

Examples of some important Univariate distributions

1. The Normal distribution

A common probability density curve is the “Normal” density curve, which is symmetric and bell shaped:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Comment: If $\mu = 0$ and $\sigma = 1$ the distribution is called the standard normal distribution.

[Figure: two Normal density curves plotted for x from 0 to 120, one with $\mu = 50$ and $\sigma = 15$, the other with $\mu = 70$ and $\sigma = 20$.]
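The third property of a density curve (area between a and b equals probability) can be checked numerically for the Normal curve with mean 50 and standard deviation 15 shown in the figure. The following is a minimal sketch using only Python's standard library; the helper name `prob_between` and the interval 35 to 65 are illustrative choices, not part of the course material.

```python
from math import pi, sqrt
from statistics import NormalDist

# Normal distribution with mean 50 and standard deviation 15
# (one of the two curves described in the figure).
X = NormalDist(mu=50, sigma=15)


def prob_between(dist, a, b):
    """P(a < X < b): the area under the density curve between a and b."""
    return dist.cdf(b) - dist.cdf(a)


# Area within one standard deviation of the mean (35 to 65).
within_one_sd = prob_between(X, 35, 65)
print("P(35 < X < 65) =", round(within_one_sd, 4))

# Height of the density curve at the mean: 1 / (sqrt(2*pi) * sigma).
print("f(50) =", round(X.pdf(50), 5))
```

The same helper works for any interval, and for the standard normal distribution by passing `NormalDist()`.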

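2. The Chi-squared distribution with $\nu$ degrees of freedom

$$f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu-2)/2}\, e^{-x/2} \quad \text{if } x \ge 0$$

[Figure: a chi-squared density curve, plotted for x from 2 to 14.]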

Comment: If $z_1, z_2, \ldots, z_\nu$ are independent random variables each having a standard normal distribution then

$$U = z_1^2 + z_2^2 + \cdots + z_\nu^2$$

has a chi-squared distribution with $\nu$ degrees of freedom.

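3. The F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator

$$f(x) = K\, x^{(\nu_1-2)/2} \left(1 + \frac{\nu_1}{\nu_2}\,x\right)^{-(\nu_1+\nu_2)/2} \quad \text{if } x \ge 0$$

where

$$K = \frac{\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}$$

[Figure: an F density curve, plotted for x from 0 to 6.]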

Comment: If $U_1$ and $U_2$ are independent random variables having Chi-squared distributions with $\nu_1$ and $\nu_2$ degrees of freedom respectively then

$$F = \frac{U_1/\nu_1}{U_2/\nu_2}$$

has an F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator.

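4. The t distribution with $\nu$ degrees of freedom

$$f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$

where

$$K = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)}$$

[Figure: a t density curve, plotted for x from -4 to 4.]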

Comment: If z and U are independent random variables, and z has a standard Normal distribution while U has a Chi-squared distribution with $\nu$ degrees of freedom then

$$t = \frac{z}{\sqrt{U/\nu}}$$

has a t distribution with $\nu$ degrees of freedom.

• An Applet showing critical values and tail probabilities for various distributions

1. Standard Normal

2. T distribution

3. Chi-square distribution

4. Gamma distribution

5. F distribution

The Sampling distribution of a statistic

A random sample from a probability distribution with density function f(x) is a collection of n independent random variables, x1, x2, ..., xn, each with the probability distribution described by f(x).

If, for example, we collect a random sample of n individuals from a population and

– measure some variable X for each of those individuals,

– then the n measurements x1, x2, ..., xn will form a set of n independent random variables with a probability distribution equivalent to the distribution of X across the population.

A statistic T is any quantity computed from the random observations x1, x2, ...,xn.

• Any statistic will necessarily be also a random variable and therefore will have a probability distribution described by some probability density function fT(t).

• This distribution is called the sampling distribution of the statistic T.

• This distribution is very important if one is using this statistic in a statistical analysis.

• It is used to assess the accuracy of a statistic if it is used as an estimator.

• It is used to determine thresholds for acceptance and rejection if it is used for Hypothesis testing.

Some examples of Sampling distributions of statistics

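Distribution of the sample mean for a sample from a Normal population

Let x1, x2, ..., xn be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$.

Let

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Then $\bar{x}$ has a normal sampling distribution with mean

$$\mu_{\bar{x}} = \mu$$

and standard deviation

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

[Figure: the sampling distribution of the sample mean, plotted for x from 20 to 100.]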
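The result for the sample mean can be illustrated by simulation: draw many samples of size n from a normal population and look at the mean and standard deviation of the resulting sample means. This is only a sketch; the population values (mean 50, standard deviation 15), the sample size 25, the number of repetitions and the seed are all arbitrary choices made for illustration.

```python
import random
import statistics
from math import sqrt


def simulate_sample_means(mu, sigma, n, reps, seed=845):
    """Draw `reps` samples of size n from N(mu, sigma) and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


mu, sigma, n = 50, 15, 25  # illustrative population and sample size
means = simulate_sample_means(mu, sigma, n, reps=5000)

# Theory: the sample mean has mean mu and standard deviation sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 2), "(theory:", mu, ")")
print("sd of sample means:  ", round(statistics.stdev(means), 2), "(theory:", sigma / sqrt(n), ")")
```

The simulated values should sit close to the theoretical ones, and the agreement tightens as the number of repetitions grows.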

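Distribution of the z statistic

Let x1, x2, ..., xn be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$.

Let

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}\,(\bar{x} - \mu)}{\sigma}$$

Then z has a standard normal distribution.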

Comment:

Many statistics T have a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$. Then

$$z = \frac{T - \mu_T}{\sigma_T}$$

will have a standard normal distribution.

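Distribution of the $\chi^2$ statistic for sample variance

Let x1, x2, ..., xn be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \text{sample variance}$$

and

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \text{sample standard deviation}$$

Let

$$\chi^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}$$

Then $\chi^2$ has a chi-squared distribution with $\nu = n-1$ degrees of freedom.

[Figure: the chi-squared distribution, plotted for x from 0 to 24.]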

Distribution of the t statistic

Let x1, x2, ..., xn be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Then t has Student’s t distribution with $\nu = n-1$ degrees of freedom.

Comment:

Suppose an estimator T has a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$, and $s_T$ is an estimator of $\sigma_T$ based on $\nu$ degrees of freedom. Then

$$t = \frac{T - \mu_T}{s_T}$$

will have Student’s t distribution with $\nu$ degrees of freedom.

[Figure: the t distribution and the standard normal distribution plotted together.]

Point estimation

• A statistic T is called an estimator of the parameter $\theta$ if its value is used as an estimate of the parameter $\theta$.

• The performance of an estimator T will be determined by how “close” the sampling distribution of T is to the parameter, $\theta$, being estimated.

• An estimator T is called an unbiased estimator of $\theta$ if $\mu_T$, the mean of the sampling distribution of T, satisfies $\mu_T = \theta$.

• This implies that in the long run the average value of T is $\theta$.

• An estimator T is called the Minimum Variance Unbiased estimator of $\theta$ if T is an unbiased estimator and it has the smallest standard error $\sigma_T$ amongst all unbiased estimators of $\theta$.

• If the sampling distribution of T is normal, the standard error of T is extremely important. It completely describes the variability of the estimator T.

Interval Estimation (confidence intervals)

• Point estimators give only single values as an estimate. There is no indication of the accuracy of the estimate.

• The accuracy can sometimes be measured and shown by displaying the standard error of the estimate.

• There is however a better way: using the idea of confidence interval estimates.

• The unknown parameter is estimated with a range of values that have a given probability of capturing the parameter being estimated.

• The interval $T_L$ to $T_U$ is called a $(1 - \alpha)100\%$ confidence interval for the parameter $\theta$ if the probability that $\theta$ lies in the range $T_L$ to $T_U$ is equal to $1 - \alpha$.

• Here $T_L$ and $T_U$ are

– statistics

– random numerical quantities calculated from the data.

Confidence Intervals

Examples

Confidence interval for the mean of a Normal population (based on the z statistic).

$$T_L = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

is a $(1 - \alpha)100\%$ confidence interval for $\mu$, the mean of a normal population.

Here $z_{\alpha/2}$ is the upper $(\alpha/2)\,100\%$ percentage point of the standard normal distribution.

More generally if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with known standard error $\sigma_T$ then

$$T_L = T - z_{\alpha/2}\,\sigma_T \quad \text{to} \quad T_U = T + z_{\alpha/2}\,\sigma_T$$

is a $(1 - \alpha)100\%$ confidence interval for $\theta$.
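As a small illustration of the z-based interval, the sketch below computes $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ with Python's standard library. The function name and the numbers plugged in (sample mean 100, known standard deviation 15, n = 25) are hypothetical and chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha)100% confidence limits for the mean when sigma is known."""
    # Upper alpha/2 percentage point of the standard normal distribution.
    z_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_half_alpha * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical inputs: sample mean 100, known sigma 15, sample size 25.
lower, upper = z_confidence_interval(xbar=100, sigma=15, n=25)
print("95% confidence limits:", round(lower, 2), "to", round(upper, 2))
```

Replacing `sigma / sqrt(n)` by a general standard error gives the more general interval $T \pm z_{\alpha/2}\,\sigma_T$.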

Confidence interval for the mean of a Normal population (based on the t statistic).

$$T_L = \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$

is a $(1 - \alpha)100\%$ confidence interval for $\mu$, the mean of a normal population.

Here $t_{\alpha/2}$ is the upper $(\alpha/2)\,100\%$ percentage point of the Student’s t distribution with $\nu = n-1$ degrees of freedom.

More generally if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with estimated standard error $s_T$, based on $\nu$ degrees of freedom, then

$$T_L = T - t_{\alpha/2}\,s_T \quad \text{to} \quad T_U = T + t_{\alpha/2}\,s_T$$

is a $(1 - \alpha)100\%$ confidence interval for $\theta$.

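Common Confidence intervals

| Situation | Confidence interval |
|---|---|
| Sample from the Normal distribution with unknown mean and known variance (Estimating $\mu$) (n large) | $\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma_0}{\sqrt{n}}$ |
| Sample from the Normal distribution with unknown mean and unknown variance (Estimating $\mu$) (n small) | $\bar{x} \pm t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$ |
| Estimation of a binomial probability p | $\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ |
| Two independent samples from the Normal distribution with unknown means and known variances (Estimating $\mu_1 - \mu_2$) (n, m large) | $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}$ |
| Two independent samples from the Normal distribution with unknown means and unknown but equal variances (Estimating $\mu_1 - \mu_2$) (n, m small) | $\bar{x} - \bar{y} \pm t_{\alpha/2}\, s_{Pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}$ |
| Estimation of the difference between two binomial probabilities, $p_1 - p_2$ | $\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ |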

Multiple Confidence intervals

In many situations one is interested in estimating not only a single parameter, $\theta$, but a collection of parameters, $\theta_1, \theta_2, \theta_3, \ldots$

A collection of intervals, $T_{L1}$ to $T_{U1}$, $T_{L2}$ to $T_{U2}$, $T_{L3}$ to $T_{U3}$, ... is called a set of $(1 - \alpha)100\%$ multiple confidence intervals if the probability that all the intervals capture their respective parameters is $1 - \alpha$.

Hypothesis Testing

• Another important area of statistical inference is that of Hypothesis Testing.

• In this situation one has a statement (Hypothesis) about the parameter(s) of the distributions being sampled and one is interested in deciding whether the statement is true or false.

• In fact there are two hypotheses:

– the Null Hypothesis (H0) and

– the Alternative Hypothesis (HA).

• A decision will be made either to

– Accept H0 (Reject HA) or to

– Reject H0 (Accept HA).

• The following table gives the different possibilities for the decision and the different possibilities for the correctness of the decision.

| | Accept H0 | Reject H0 |
|---|---|---|
| H0 is true | Correct Decision | Type I error |
| H0 is false | Type II error | Correct Decision |

• Type I error - The Null Hypothesis H0 is rejected when it is true.

• The probability that a decision procedure makes a type I error is denoted by $\alpha$, and is sometimes called the significance level of the test.

• Common significance levels that are used are $\alpha = .05$ and $\alpha = .01$.

• Type II error - The Null Hypothesis H0 is accepted when it is false.

• The probability that a decision procedure makes a type II error is denoted by $\beta$.

• The probability $1 - \beta$ is called the Power of the test and is the probability that the decision procedure correctly rejects a false Null Hypothesis.

A statistical test is defined by

• 1. Choosing a statistic for making the decision to Accept or Reject H0. This statistic is called the test statistic.

• 2. Dividing the set of possible values of the test statistic into two regions: an Acceptance Region and a Critical Region.

• If upon collection of the data and evaluation of the test statistic, its value lies in the Acceptance Region, a decision is made to accept the Null Hypothesis H0.

• If upon collection of the data and evaluation of the test statistic, its value lies in the Critical Region, a decision is made to reject the Null Hypothesis H0.

• The probability of a type I error, $\alpha$, is usually set at a predefined level by choosing the critical thresholds (boundaries between the Acceptance and Critical Regions) appropriately.

• The probability of a type II error, $\beta$, is decreased (and the power of the test, $1 - \beta$, is increased) by

1. Choosing the “best” test statistic.

2. Selecting the most efficient experimental design.

3. Increasing the amount of information (usually by increasing the sample sizes involved) on which the decision is based.

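Some common Tests

| Situation | Test Statistic | H0 | HA | Critical Region |
|---|---|---|---|---|
| Sample from the Normal distribution with unknown mean and known variance (Testing $\mu$) (n large) | $z = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma}$ | $\mu = \mu_0$ | $\mu \ne \mu_0$ | $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ |
| | | | $\mu > \mu_0$ | $z > z_{\alpha}$ |
| | | | $\mu < \mu_0$ | $z < -z_{\alpha}$ |
| Sample from the Normal distribution with unknown mean and unknown variance (Testing $\mu$) (n small) | $t = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{s}$ | $\mu = \mu_0$ | $\mu \ne \mu_0$ | $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$ |
| | | | $\mu > \mu_0$ | $t > t_{\alpha}$ |
| | | | $\mu < \mu_0$ | $t < -t_{\alpha}$ |
| Testing of a binomial probability p | $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$ | $p = p_0$ | $p \ne p_0$ | $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ |
| | | | $p > p_0$ | $z > z_{\alpha}$ |
| | | | $p < p_0$ | $z < -z_{\alpha}$ |
| Two independent samples from the Normal distribution with unknown means and known variances (Testing $\mu_1 - \mu_2$) (n, m large) | $z = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}}$ | $\mu_1 = \mu_2$ | $\mu_1 \ne \mu_2$ | $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ |
| | | | $\mu_1 > \mu_2$ | $z > z_{\alpha}$ |
| | | | $\mu_1 < \mu_2$ | $z < -z_{\alpha}$ |
| Two independent samples from the Normal distribution with unknown means and unknown but equal variances (Testing $\mu_1 - \mu_2$) | $t = \dfrac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$ | $\mu_1 = \mu_2$ | $\mu_1 \ne \mu_2$ | $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$ |
| | | | $\mu_1 > \mu_2$ | $t > t_{\alpha}$ |
| | | | $\mu_1 < \mu_2$ | $t < -t_{\alpha}$ |
| Testing the difference between two binomial probabilities, $p_1 - p_2$ | $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ | $p_1 = p_2$ | $p_1 \ne p_2$ | $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ |
| | | | $p_1 > p_2$ | $z > z_{\alpha}$ |
| | | | $p_1 < p_2$ | $z < -z_{\alpha}$ |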

The p-value approach to Hypothesis Testing

In hypothesis testing we need

1. A test statistic

2. A Critical and Acceptance region for the test statistic

The Critical Region is set up under the sampling distribution of the test statistic so that the area above the critical region is $\alpha$ (0.05 or 0.01). The critical region may be one tailed or two tailed.

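The Critical region:

$$P[\text{Accept } H_0 \text{ when true}] = P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$

$$P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$

[Figure: standard normal density centred at 0. The central region between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is labelled “Accept H0”; each tail, of area $\alpha/2$, is labelled “Reject H0”.]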

The test is carried out by

1. Computing the value of the test statistic

2. Making the decision

a. Reject if the value is in the Critical region and

b. Accept if the value is in the Acceptance region.

The value of the test statistic may be in the Acceptance region but close to being in the Critical region, or it may be in the Critical region but close to being in the Acceptance region.

To measure this we compute the p-value.

Definition – Once the test statistic has been computed from the data the p-value is defined to be:

p-value = P[the test statistic is as or more extreme than the observed value of the test statistic]

Here “more extreme” means giving stronger evidence for rejecting H0.

Example – Suppose we are using the z-test for the mean $\mu$ of a normal population and $\alpha = 0.05$.

$z_{0.025} = 1.960$

Thus the critical region is to reject H0 if

z < -1.960 or z > 1.960.

Suppose z = 2.3; then we reject H0.

p-value = P[z > 2.3] + P[z < -2.3]

= 0.0107 + 0.0107 = 0.0214

[Graph: standard normal density with the two tails beyond -2.3 and 2.3 shaded; the shaded area is the p-value.]

If the value of z = 1.2, then we accept H0.

p-value = P[z > 1.2] + P[z < -1.2]

= 0.1151 + 0.1151 = 0.2302

There is a 23.02% chance that the test statistic is as or more extreme than 1.2. This is fairly high, hence 1.2 is not very extreme.

[Graph: standard normal density with the two tails beyond -1.2 and 1.2 shaded; the shaded area is the p-value.]

Properties of the p-value

1. If the p-value is small (< 0.05 or 0.01) H0 should be rejected.

2. The p-value measures the plausibility of H0.

3. If the test is two tailed the p-value should be two tailed.

4. If the test is one tailed the p-value should be one tailed.

5. It is customary to report p-values when reporting the results. This gives the reader some idea of the strength of the evidence for rejecting H0
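The two p-values in the z-test example can be reproduced from the standard normal distribution function. The sketch below uses Python's standard library; the function name `z_p_value` and its `tails` argument are illustrative, and the one-tailed branch assumes the observed z lies in the direction named by HA.

```python
from statistics import NormalDist


def z_p_value(z, tails=2):
    """p-value for an observed z statistic.

    tails=2: P[z > |z_obs|] + P[z < -|z_obs|] (two-tailed test).
    tails=1: a single tail area, assuming z_obs points in the direction of HA.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return tails * upper_tail


for z_observed in (2.3, 1.2):
    p = z_p_value(z_observed)
    decision = "reject H0" if p < 0.05 else "accept H0"
    print("z =", z_observed, " p-value =", round(p, 4), " ->", decision, "at alpha = 0.05")
```

Comparing the p-value with $\alpha$ gives the same decision as checking whether z falls in the critical region, but the p-value also shows how close the call was.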

Multiple testing

Quite often one is interested in performing a collection (family) of tests of hypotheses.

1. H0,1 versus HA,1.

etc.

• Let $\alpha^*$ denote the probability that at least one type I error is made in the collection of tests that are performed.

• The value of $\alpha^*$, the family type I error rate, can be considerably larger than $\alpha$, the type I error rate of each individual test.

• The value of the family error rate, $\alpha^*$, can be controlled by altering the thresholds of each individual test appropriately.

• A testing procedure of this nature is called a Multiple testing procedure.
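To see how quickly the family error rate can grow, consider the special case where the m tests are independent, so that the probability of at least one type I error is $1 - (1 - \alpha)^m$. The sketch below computes this, and also shows one common way of altering the individual thresholds, the Bonferroni adjustment (testing each hypothesis at level $\alpha^*/m$). Both the independence assumption and the choice of the Bonferroni adjustment are illustrative assumptions, not something specified in these notes.

```python
def family_error_rate(alpha, m):
    """P(at least one type I error) for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha_star, m):
    """Per-test level that keeps the family error rate at or below alpha_star."""
    return alpha_star / m


m = 10  # illustrative number of tests in the family
print("family rate with alpha = 0.05 per test:", round(family_error_rate(0.05, m), 4))

adjusted = bonferroni_alpha(0.05, m)
print("Bonferroni per-test level:", adjusted)
print("family rate after adjustment:", round(family_error_rate(adjusted, m), 4))
```

With ten independent tests at the 5% level the chance of at least one false rejection is about 40%, which is why the individual thresholds need to be tightened.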

A chart illustrating Statistical Procedures

| Dependent variables | Independent variables: Categorical | Independent variables: Continuous | Independent variables: Continuous & Categorical |
|---|---|---|---|
| Categorical | Multiway frequency Analysis (Log Linear Model) | Discriminant Analysis | Discriminant Analysis |
| Continuous | ANOVA (single dep var), MANOVA (mult dep var) | MULTIPLE REGRESSION (single dependent variable), MULTIVARIATE MULTIPLE REGRESSION (multiple dependent variables) | ANACOVA (single dep var), MANACOVA (mult dep var) |
| Continuous & Categorical | ?? | ?? | ?? |

Comparing k Populations

Means – One way Analysis of Variance (ANOVA)

The F test

The F test – for comparing k means

Situation

• We have k normal populations.

• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of population i,

• i = 1, 2, 3, … k.

• Note: we assume that the standard deviation for each population is the same:

$\sigma_1 = \sigma_2 = \cdots = \sigma_k = \sigma$

We want to test

$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$$

against

$$H_A: \mu_i \ne \mu_j \text{ for at least one pair } i, j$$

To test $H_0$ against $H_A$, use the test statistic

$$F = \frac{s^2_{Between}}{s^2_{Error}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \Big/ (k-1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$

where

$\bar{x}_i$ = mean for the i-th sample,

$s_i$ = standard deviation for the i-th sample,

$$\bar{x} = \frac{n_1\bar{x}_1 + \cdots + n_k\bar{x}_k}{n_1 + \cdots + n_k} = \text{overall mean}$$

The statistic

$$\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$

is called the Between Sum of Squares and is denoted by SSBetween. It measures the variability between samples.

k - 1 is known as the Between degrees of freedom and

$$\frac{1}{k-1}\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$

is called the Between Mean Square and is denoted by MSBetween.

The statistic

$$\sum_{i=1}^{k} (n_i - 1)s_i^2$$

is called the Error Sum of Squares and is denoted by SSError.

$$\sum_{i=1}^{k} n_i - k = N - k$$

is known as the Error degrees of freedom and

$$\frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} n_i - k}$$

is called the Error Mean Square and is denoted by MSError.

Then

$$F = \frac{MS_{Between}}{MS_{Error}}$$

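The Computing formula for F:

Compute

1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i

2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total

3) $N = \sum_{i=1}^{k} n_i$ = Total sample size

4) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$

5) $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$

Then

1) $SS_{Between} = \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - \dfrac{G^2}{N}$

2) $SS_{Error} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$

3) $F = \dfrac{SS_{Between}/(k-1)}{SS_{Error}/(N-k)}$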

We reject

$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$$

if $F \ge F_{\alpha}$, where $F_{\alpha}$ is the critical point under the F distribution with $\nu_1 = k - 1$ degrees of freedom in the numerator and $\nu_2 = N - k$ degrees of freedom in the denominator.

[Figure: the critical region for the F test.]

Example

In the following example we are comparing weight gains resulting from the following six diets

1. Diet 1 - High Protein , Beef

2. Diet 2 - High Protein , Cereal

3. Diet 3 - High Protein , Pork

4. Diet 4 - Low protein , Beef

5. Diet 5 - Low protein , Cereal

6. Diet 6 - Low protein , Pork

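Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

| Diet | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| | 73 | 98 | 94 | 90 | 107 | 49 |
| | 102 | 74 | 79 | 76 | 95 | 82 |
| | 118 | 56 | 96 | 90 | 97 | 73 |
| | 104 | 111 | 98 | 64 | 80 | 86 |
| | 81 | 95 | 102 | 86 | 98 | 81 |
| | 107 | 88 | 102 | 51 | 74 | 97 |
| | 100 | 82 | 108 | 72 | 74 | 106 |
| | 87 | 77 | 91 | 90 | 67 | 70 |
| | 117 | 86 | 120 | 95 | 89 | 61 |
| | 111 | 92 | 105 | 78 | 58 | 82 |
| Mean | 100.0 | 85.9 | 99.5 | 79.2 | 83.9 | 78.7 |
| Std. Dev. | 15.14 | 15.02 | 10.92 | 13.89 | 15.71 | 16.55 |
| $\sum x$ | 1000 | 859 | 995 | 792 | 839 | 787 |
| $\sum x^2$ | 102062 | 75819 | 100075 | 64462 | 72613 | 64401 |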

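Hence

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 = 479432$$

$$N = \sum_{i=1}^{k} n_i = 60 = \text{Total sample size}$$

$$\sum_{i=1}^{k} \frac{T_i^2}{n_i} = 467846$$

| i | 1 | 2 | 3 | 4 | 5 | 6 | Total (G) |
|---|---|---|---|---|---|---|---|
| $T_i$ | 1000 | 859 | 995 | 792 | 839 | 787 | 5272 |

Thus

$$SS_{Error} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$

$$SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$

$$F = \frac{SS_{Between}/(k-1)}{SS_{Error}/(N-k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3$$

$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$.

Thus since F > 2.386 we reject H0.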
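The computing formula can be checked against the diet data with a few lines of code. The following is a minimal sketch in Python using only built-in functions; the function name `one_way_anova` and the dictionary keys are illustrative. The critical value 2.386 is taken from the text rather than computed, since the standard library has no F distribution.

```python
def one_way_anova(samples):
    """One-way ANOVA via the computing formula (totals and sums of squares)."""
    k = len(samples)                                   # number of populations
    N = sum(len(s) for s in samples)                   # total sample size
    totals = [sum(s) for s in samples]                 # T_i
    G = sum(totals)                                    # grand total
    sum_sq = sum(x * x for s in samples for x in s)    # sum of all x_ij^2
    sum_T2_over_n = sum(T * T / len(s) for T, s in zip(totals, samples))

    ss_between = sum_T2_over_n - G * G / N
    ss_error = sum_sq - sum_T2_over_n
    ms_between = ss_between / (k - 1)
    ms_error = ss_error / (N - k)
    return {
        "df_between": k - 1, "df_error": N - k,
        "ss_between": ss_between, "ss_error": ss_error,
        "ms_between": ms_between, "ms_error": ms_error,
        "F": ms_between / ms_error,
    }


# Weight gains from the table: each row holds one rat per diet (diets 1 to 6).
rows = [
    (73, 98, 94, 90, 107, 49), (102, 74, 79, 76, 95, 82),
    (118, 56, 96, 90, 97, 73), (104, 111, 98, 64, 80, 86),
    (81, 95, 102, 86, 98, 81), (107, 88, 102, 51, 74, 97),
    (100, 82, 108, 72, 74, 106), (87, 77, 91, 90, 67, 70),
    (117, 86, 120, 95, 89, 61), (111, 92, 105, 78, 58, 82),
]
diets = [list(column) for column in zip(*rows)]  # one list of 10 gains per diet

result = one_way_anova(diets)
F_CRITICAL = 2.386  # F(0.05) with 5 and 54 degrees of freedom, as quoted in the text
print("SSBetween =", round(result["ss_between"], 3))
print("SSError   =", round(result["ss_error"], 3))
print("F         =", round(result["F"], 3))
print("reject H0" if result["F"] >= F_CRITICAL else "accept H0")
```

The function accepts samples of unequal size as well, since each total is divided by its own $n_i$.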

The ANOVA Table

A convenient method for displaying the calculations for the F-test

| Source | d.f. | Sum of Squares | Mean Square | F-ratio |
|---|---|---|---|---|
| Between | k - 1 | SSBetween | MSBetween | MSB / MSE |
| Within | N - k | SSError | MSError | |
| Total | N - 1 | SSTotal | | |

Anova Table (the diet example)

| Source | d.f. | Sum of Squares | Mean Square | F-ratio |
|---|---|---|---|---|
| Between | 5 | 4612.933 | 922.587 | 4.3 (p = 0.0023) |
| Within | 54 | 11586.000 | 214.556 | |
| Total | 59 | 16198.933 | | |

The Diet Example

Using SPSS

Note: The use of another statistical package such as Minitab is similar to using SPSS

Assume the data is contained in an Excel file

Each variable is in a column

1. Weight gain (wtgn)

2. diet

3. Source of protein (Source)

4. Level of Protein (Level)

After starting the SPSS program the following dialogue box appears:

If you select Opening an existing file and press OK the following dialogue box appears

The following dialogue box appears:

If the variable names are in the file ask it to read the names. If you do not specify the Range the program will identify the Range:

Once you “click OK”, two windows will appear

One that will contain the output:

The other containing the data:

To perform ANOVA select Analyze->General Linear Model-> Univariate

The following dialog box appears

Select the dependent variable and the fixed factors

Press OK to perform the Analysis

The Output

Tests of Between-Subjects Effects

Dependent Variable: wtgn

| Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Corrected Model | 4612.933(a) | 5 | 922.587 | 4.300 | .002 |
| Intercept | 463233.067 | 1 | 463233.067 | 2159.036 | .000 |
| diet | 4612.933 | 5 | 922.587 | 4.300 | .002 |
| Error | 11586.000 | 54 | 214.556 | | |
| Total | 479432.000 | 60 | | | |
| Corrected Total | 16198.933 | 59 | | | |

a R Squared = .285 (Adjusted R Squared = .219)

Comments

• The F-test tests H0: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ against HA: at least one pair of means are different.

• If H0 is accepted we conclude that the means are not significantly different.

• If H0 is rejected we conclude that at least one pair of means is significantly different.

• The F-test gives no information as to which pairs of means are different.

• One can now use two sample t tests to determine which pairs of means are significantly different.

Fisher’s LSD (least significant difference) procedure:

1. Test H0: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ against HA: at least one pair of means are different, using the ANOVA F-test.

2. If H0 is accepted we conclude that the means are not significantly different. Stop in this case.

3. If H0 is rejected we conclude that at least one pair of means is significantly different. Follow this by using two sample t tests to determine which pairs of means are significantly different.

Linear Regression

Hypothesis testing and Estimation

Assume that we have collected data on two variables X and Y. Let

(x1, y1) (x2, y2) (x3, y3) … (xn, yn)

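denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

The Statistical Model

Each $y_i$ is assumed to be randomly generated from a normal distribution with mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$ ($\alpha$, $\beta$ and $\sigma$ are unknown).

[Figure: the line $Y = \alpha + \beta X$ with slope $\beta$; at $x_i$ the observation $y_i$ is scattered about the point $\alpha + \beta x_i$ on the line.]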

The Data and the Linear Regression Model

• The data falls roughly about a straight line.

[Figure: scatter plot of the data (X from 40 to 140, Y from 0 to 160) about the line $Y = \alpha + \beta X$, which is unseen.]

The Least Squares Line

Fitting the best straight line to “linear” data

Let

Y = a + b X

denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.

For example, if X = xi (as for the ith case) then the predicted value of Y is:

$$\hat{y}_i = a + b x_i$$

The residual

$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$

can be computed for each case in the sample,

$$r_1 = y_1 - \hat{y}_1,\; r_2 = y_2 - \hat{y}_2,\; \ldots,\; r_n = y_n - \hat{y}_n$$

The residual sum of squares (RSS)

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

is a measure of the “goodness of fit” of the line Y = a + bX to the data.

The optimal choice of a and b will result in the residual sum of squares

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

attaining a minimum.

If this is the case then the line:

Y = a + bX

is called the Least Squares Line.

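The equation for the least squares line

Let

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Computing Formulae:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$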

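Then the slope of the least squares line can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and the intercept of the least squares line can be shown to be:

$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$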

The residual sum of Squares

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

Computing formula:

$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

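Estimating $\sigma$, the standard deviation in the regression model:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - a - b x_i)^2}{n-2}}$$

Computing formula:

$$s = \sqrt{\frac{1}{n-2}\left[S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right]}$$

This estimate of $\sigma$ is said to be based on n - 2 degrees of freedom.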

Sampling distributions of the estimators

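The sampling distribution of the slope of the least squares line:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

It can be shown that b has a normal distribution with mean and standard deviation

$$\mu_b = \beta \quad \text{and} \quad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Thus

$$z = \frac{b - \mu_b}{\sigma_b} = \frac{b - \beta}{\sigma/\sqrt{S_{xx}}}$$

has a standard normal distribution, and

$$t = \frac{b - \mu_b}{s_b} = \frac{b - \beta}{s/\sqrt{S_{xx}}}$$

has a t distribution with df = n - 2.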

$(1 - \alpha)100\%$ Confidence Limits for slope $\beta$:

$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

where $t_{\alpha/2}$ is the critical value for the t-distribution with n - 2 degrees of freedom.

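Testing the slope

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_A: \beta \ne \beta_0$$

The test statistic is:

$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}$$

which has a t distribution with df = n - 2 if H0 is true.

The Critical Region

Reject $H_0: \beta = \beta_0$ in favour of $H_A: \beta \ne \beta_0$ if

$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}} < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}$$

with df = n - 2.

This is a two tailed test. One tailed tests are also possible.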

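The sampling distribution of the intercept of the least squares line:

$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$

It can be shown that a has a normal distribution with mean and standard deviation

$$\mu_a = \alpha \quad \text{and} \quad \sigma_a = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Thus

$$z = \frac{a - \mu_a}{\sigma_a} = \frac{a - \alpha}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

has a standard normal distribution and

$$t = \frac{a - \mu_a}{s_a} = \frac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

has a t distribution with df = n - 2.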

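$(1 - \alpha)100\%$ Confidence Limits for intercept $\alpha$:

$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

where $t_{\alpha/2}$ is the critical value for the t-distribution with n - 2 degrees of freedom.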

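Testing the intercept

$$H_0: \alpha = \alpha_0 \quad \text{vs} \quad H_A: \alpha \ne \alpha_0$$

The test statistic is:

$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

The Critical Region

Reject $H_0: \alpha = \alpha_0$ in favour of $H_A: \alpha \ne \alpha_0$ if

$$t = \frac{a - \alpha_0}{s_a} < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}$$

df = n – 2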
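The same calculation for the intercept differs only in the standard error, which involves both n and the mean of the x values. The sketch below uses hypothetical summary values and an assumed helper name `intercept_inference`; the critical value 2.306 is the tabled t.025 for 8 degrees of freedom.

```python
import math


def intercept_inference(a, s, n, xbar, sxx, t_crit, alpha0=0.0):
    """Confidence limits for the intercept and the two-tailed test of H0: alpha = alpha0."""
    se_a = s * math.sqrt(1.0 / n + xbar ** 2 / sxx)   # estimated standard deviation of a
    t_stat = (a - alpha0) / se_a
    limits = (a - t_crit * se_a, a + t_crit * se_a)
    reject = abs(t_stat) > t_crit
    return limits, t_stat, reject


# Hypothetical summary values for n = 10 observations (df = 8).
limits, t_stat, reject = intercept_inference(
    a=1.0, s=5.0, n=10, xbar=3.0, sxx=100.0, t_crit=2.306
)
print(f"95% limits: {limits[0]:.3f} to {limits[1]:.3f}, t = {t_stat:.3f}, reject H0: {reject}")
```

Here the limits include 0 and the test does not reject H0: α = 0, which illustrates that the two procedures always agree at the same level.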

Example

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi     Yi
Australia        48     18
Canada           50     15
Denmark          38     17
Finland         110     35
Great Britain   110     46
Holland          49     24
Iceland          23      6
Norway           25      9
Sweden           30     11
Switzerland      51     25
USA             130     20

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country.]

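$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$

$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018$$

Fitting the Least Squares Line

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$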

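Computing estimates of the slope (β), intercept (α) and standard deviation (σ):

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.2284$$

$$a = \bar{y} - b\,\bar{x} = \frac{226}{11} - b\left(\frac{664}{11}\right) = 6.756$$

$$s = \sqrt{\frac{1}{n-2}\left[S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right]} = 8.35$$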

95% Confidence Limits for slope β:

t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}} = 0.2284 \pm 2.262\,\frac{8.35}{\sqrt{14322.55}}$$

0.0706 to 0.3862

95% Confidence Limits for intercept α:

$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 6.756 \pm 2.262\,(8.35)\sqrt{\frac{1}{11} + \frac{(664/11)^2}{14322.55}}$$

-4.34 to 17.85

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country and the fitted least squares line drawn.]

Y = 6.756 + (0.228)X

95% confidence limits for slope: 0.0706 to 0.3862

95% confidence limits for intercept: -4.34 to 17.85

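Testing for a positive slope

$$H_0: \beta = 0 \quad \text{vs} \quad H_A: \beta > 0$$

The test statistic is:

$$t = \frac{b - 0}{s / \sqrt{S_{xx}}}$$

The Critical Region

Reject $H_0: \beta = 0$ in favour of $H_A: \beta > 0$ if

$$t = \frac{b - 0}{s / \sqrt{S_{xx}}} > t_{0.05} = 1.833$$

df = 11 – 2 = 9

A one-tailed test

Since

$$t = \frac{b - 0}{s / \sqrt{S_{xx}}} = \frac{0.2284}{8.35 / \sqrt{14322.55}} = 3.27 > 1.833$$

we reject $H_0: \beta = 0$ and conclude $H_A: \beta > 0$.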
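The calculations for the cigarette example can be reproduced with a short Python sketch. The data are the eleven (Xi, Yi) pairs from the table; the variable names are illustrative, and the critical values 2.262 and 1.833 are the tabled values quoted in the example rather than computed.

```python
import math

# Per capita cigarette consumption (1930) and lung cancer death rate (1950).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = sxy / sxx                                    # slope
a = sum(y) / n - b * sum(x) / n                  # intercept
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))  # residual standard deviation

T_025 = 2.262   # tabled t.025 with 9 degrees of freedom
T_05 = 1.833    # tabled t.05 with 9 degrees of freedom

se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + (sum(x) / n) ** 2 / sxx)
slope_limits = (b - T_025 * se_b, b + T_025 * se_b)
intercept_limits = (a - T_025 * se_a, a + T_025 * se_a)
t_stat = b / se_b                                # test of H0: beta = 0

print(f"b = {b:.4f}, a = {a:.3f}, s = {s:.2f}")
print(f"slope limits: {slope_limits[0]:.4f} to {slope_limits[1]:.4f}")
print(f"intercept limits: {intercept_limits[0]:.2f} to {intercept_limits[1]:.2f}")
print(f"t = {t_stat:.2f}, reject H0 (one-tailed): {t_stat > T_05}")
```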

Confidence Limits for Points on the Regression Line

• The intercept is a specific point on the regression line.

• It is the y-coordinate of the point on the regression line when x = 0.

• It is the predicted value of y when x = 0.

• We may also be interested in other points on the regression line, e.g. when x = x0.

• In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

[Figure: the regression line y = α + βx, with the point (x0, α + βx0) marked.]

(1 – α)100% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
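A small sketch of the confidence limits for a point on the regression line, written in terms of summary statistics. The function name `mean_response_limits` and all numerical inputs are hypothetical; the critical value is supplied from a t table (2.306 is t.025 for 8 degrees of freedom).

```python
import math


def mean_response_limits(x0, a, b, s, n, xbar, sxx, t_crit):
    """Confidence limits for alpha + beta * x0, the point on the line at x = x0."""
    fit = a + b * x0
    half_width = t_crit * s * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / sxx)
    return fit - half_width, fit + half_width


# Hypothetical summary values for n = 10 observations (df = 8).
summary = dict(a=1.0, b=2.0, s=5.0, n=10, xbar=3.0, sxx=100.0, t_crit=2.306)
at_mean = mean_response_limits(3.0, **summary)   # x0 equal to xbar
far_out = mean_response_limits(8.0, **summary)   # x0 far from xbar
at_zero = mean_response_limits(0.0, **summary)   # x0 = 0 gives the intercept limits
print(at_mean, far_out, at_zero)
```

The interval is narrowest at x0 = x̄ and widens as x0 moves away from x̄, and setting x0 = 0 recovers the confidence limits for the intercept.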

Prediction Limits for new values of the Dependent variable y

• An important application of the regression line is prediction.

• Knowing the value of x (x0) what is the value of y?

• The predicted value of y when x = x0 is:

$$\hat{y} = \alpha + \beta x_0$$

• This in turn can be estimated by:

$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$

The predictor

$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$

• Gives only a single value for y.

• A more appropriate piece of information would be a range of values.

• A range of values that has a fixed probability of capturing the value for y.

• A (1 - α)100% prediction interval for y.

(1 - α)100% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
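The prediction limits differ from the confidence limits for a point on the line only by the extra 1 under the square root, which accounts for the variation of a new observation about the line. A minimal sketch, with a hypothetical function name and made-up summary values (t.025 = 2.306 for 8 degrees of freedom taken from a t table):

```python
import math


def prediction_limits(x0, a, b, s, n, xbar, sxx, t_crit):
    """Prediction limits for a new observation y at x = x0."""
    fit = a + b * x0
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx)
    return fit - half_width, fit + half_width


# Hypothetical summary values for n = 10 observations (df = 8).
lower, upper = prediction_limits(
    x0=3.0, a=1.0, b=2.0, s=5.0, n=10, xbar=3.0, sxx=100.0, t_crit=2.306
)
print(f"95% prediction limits at x0 = 3: {lower:.3f} to {upper:.3f}")
```

Even at x0 = x̄ the half-width exceeds t·s, so prediction limits never shrink to zero however large n becomes.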


Example

In this example we are studying building fires in a city and are interested in the relationship between:

1. X = the distance (miles) between the closest fire hall and the building that puts out the alarm

and

2. Y = cost of the damage ($1000s)

The data were collected on n = 15 fires.

The Data

Fire   Distance   Damage
  1      3.4       26.2
  2      1.8       17.8
  3      4.6       31.3
  4      2.3       23.1
  5      3.1       27.5
  6      5.5       36.0
  7      0.7       14.1
  8      3.0       22.3
  9      2.6       19.6
 10      4.3       31.3
 11      2.1       24.0
 12      1.1       17.3
 13      6.1       43.2
 14      4.8       36.4
 15      3.8       26.1

[Figure: Scatter Plot of Damage ($1000s) against Distance (miles).]

Computations

$$\sum_{i=1}^{n} x_i = 49.2, \qquad \sum_{i=1}^{n} y_i = 396.2$$

$$\sum_{i=1}^{n} x_i^2 = 196.16, \qquad \sum_{i=1}^{n} y_i^2 = 11376.48, \qquad \sum_{i=1}^{n} x_i y_i = 1470.65$$

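Computations Continued

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{49.2}{15} = 3.28$$

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{396.2}{15} = 26.4133$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 196.16 - \frac{(49.2)^2}{15} = 34.784$$

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 11376.48 - \frac{(396.2)^2}{15} = 911.517$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 1470.65 - \frac{(49.2)(396.2)}{15} = 171.114$$

$$b = \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{171.114}{34.784} = 4.92$$

$$a = \hat{\alpha} = \bar{y} - b\,\bar{x} = 26.4133 - 4.919\,(3.28) = 10.28$$

$$s = \sqrt{\frac{S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}}{n-2}} = \sqrt{\frac{911.517 - \dfrac{(171.114)^2}{34.784}}{13}} = 2.316$$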

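95% Confidence Limits for slope β (t.025 = 2.160 with 13 degrees of freedom):

$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

4.07 to 5.77

95% Confidence Limits for intercept α:

$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

7.21 to 13.35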


[Figure: Least Squares Line y = 4.92x + 10.28, drawn on the scatter plot of Damage ($1000s) against Distance (miles).]


95% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

x0    lower    upper
1     12.87    17.52
2     18.43    21.80
3     23.72    26.35
4     28.53    31.38
5     32.93    36.82
6     37.15    42.44

[Figure: 95% Confidence Limits for α + βx0, drawn as confidence limit curves about the least squares line; Damage ($1000s) against Distance (miles).]


95% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

x0    lower    upper
1      9.68    20.71
2     14.84    25.40
3     19.86    30.21
4     24.75    35.16
5     29.51    40.24
6     34.13    45.45
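Both tables of limits for the fire data can be reproduced with the sketch below. The data are the 15 (Distance, Damage) pairs from the example; the helper name `limits` is illustrative, and the critical value 2.160 is an assumed table value for t.025 with 13 degrees of freedom.

```python
import math

distance = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
          43.2, 36.4, 26.1]

n = len(distance)
xbar = sum(distance) / n
ybar = sum(damage) / n
sxx = sum((x - xbar) ** 2 for x in distance)
syy = sum((y - ybar) ** 2 for y in damage)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(distance, damage))
b = sxy / sxx
a = ybar - b * xbar
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))

T_CRIT = 2.160  # assumed table value: t.025 with 13 degrees of freedom


def limits(x0, new_observation):
    """Confidence limits for the line at x0, or prediction limits for a new y."""
    extra = 1.0 if new_observation else 0.0
    half_width = T_CRIT * s * math.sqrt(extra + 1.0 / n + (x0 - xbar) ** 2 / sxx)
    fit = a + b * x0
    return fit - half_width, fit + half_width


confidence = {x0: limits(x0, new_observation=False) for x0 in range(1, 7)}
prediction = {x0: limits(x0, new_observation=True) for x0 in range(1, 7)}
for x0 in range(1, 7):
    c_lo, c_hi = confidence[x0]
    p_lo, p_hi = prediction[x0]
    print(f"{x0}   confidence {c_lo:6.2f} {c_hi:6.2f}   prediction {p_lo:6.2f} {p_hi:6.2f}")
```

At every x0 the prediction limits are wider than the confidence limits, because they allow for the scatter of an individual fire about the line as well as the uncertainty in the line itself.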

[Figure: 95% Prediction Limits for y when x = x0, drawn as prediction limit curves about the least squares line; Damage ($1000s) against Distance (miles).]

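Linear Regression Summary

Hypothesis testing and Estimation

(1 – α)100% Confidence Limits for slope β:

$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

Testing the slope

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_A: \beta \ne \beta_0$$

$$t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}}$$

(1 – α)100% Confidence Limits for intercept α:

$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

Testing the intercept

$$H_0: \alpha = \alpha_0 \quad \text{vs} \quad H_A: \alpha \ne \alpha_0$$

$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

(1 – α)100% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

(1 – α)100% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$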

Comparing k Populations

Proportions

The χ² test for independence

Situation

• We have two categorical variables R and C.

• The number of categories of R is r.

• The number of categories of C is c.

• We observe n subjects from the population and count xij = the number of subjects for which R = i and C = j.

• R = rows, C = columns

Example

Both Systolic Blood Pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.

The categories for Blood Pressure are:

<127 127-146 147-166 167+

The categories for Cholesterol are:

<200 200-219 220-259 260+

Table: two-way frequency

Serum            Systolic Blood pressure
Cholesterol      <127    127-146   147-166   167+    Total
<200             117     121       47        22      307
200-219          85      98        43        20      246
220-259          119     209       68        43      439
260+             67      99        46        33      245
Total            388     527       204       118     1237


Define

$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}$$

$$C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}$$

$$E_{ij} = \frac{R_i C_j}{n}$$

= Expected frequency in the (i,j)th cell in the case of independence.

Then to test

H0: R and C are independent

against

HA: R and C are not independent

use the test statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

where

xij = observed frequency in the (i,j)th cell

$$E_{ij} = \frac{R_i C_j}{n}$$ = expected frequency in the (i,j)th cell in the case of independence.

Sampling distribution of test statistic when H0 is true

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

has a χ² distribution with degrees of freedom = (r - 1)(c - 1)

Critical and Acceptance Region

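Reject H0 if: $\chi^2 > \chi^2_{\alpha}$

Accept H0 if: $\chi^2 \le \chi^2_{\alpha}$

where $\chi^2_{\alpha}$ is the upper α critical value of the χ² distribution with (r - 1)(c - 1) degrees of freedom.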
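The steps of the test (row and column totals, expected frequencies, the χ² statistic and its degrees of freedom) can be sketched as a single function. The name `chi_square_independence` and the 2 × 2 table are hypothetical; the critical value 3.841 is the tabled χ².05 for 1 degree of freedom.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in table]            # R_i
    col_totals = [sum(col) for col in zip(*table)]      # C_j
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]  # E_ij
    chi2 = sum(
        (x - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for x, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical 2 x 2 table of counts, made up for illustration only.
observed = [[10, 20],
            [30, 40]]
chi2, df, expected = chi_square_independence(observed)
CRITICAL_05_DF1 = 3.841   # tabled chi-square .05 critical value, 1 d.f.
print(f"chi-square = {chi2:.4f}, df = {df}, reject H0: {chi2 > CRITICAL_05_DF1}")
```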

Table: Expected frequencies, (Observed frequencies), Standardized Residuals

Serum            Systolic Blood pressure
Cholesterol      <127     127-146   147-166   167+     Total
<200             96.29    130.79    50.63     29.29    307
                 (117)    (121)     (47)      (22)
                 2.11     -0.86     -0.51     -1.35
200-219          77.16    104.80    40.57     23.47    246
                 (85)     (98)      (43)      (20)
                 0.89     -0.66     0.38      -0.72
220-259          137.70   187.03    72.40     41.88    439
                 (119)    (209)     (68)      (43)
                 -1.59    1.61      -0.52     0.17
260+             76.85    104.38    40.40     23.37    245
                 (67)     (99)      (46)      (33)
                 -1.12    -0.53     0.88      1.99
Total            388      527       204       118      1237

Standardized residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

Test statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 20.85$$

degrees of freedom = (r - 1)(c - 1) = 9

$$\chi^2_{0.05} = 16.919$$

Reject H0 using α = 0.05, since 20.85 > 16.919.
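The blood pressure and cholesterol calculation can be checked with a short sketch. The counts are taken from the two-way frequency table; the variable names are illustrative and the critical value 16.919 is the tabled value quoted in the example.

```python
import math

# Rows: cholesterol <200, 200-219, 220-259, 260+.
# Columns: systolic blood pressure <127, 127-146, 147-166, 167+.
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
residuals = [
    [(x - e) / math.sqrt(e) for x, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(r * r for row in residuals for r in row)   # sum of squared residuals
df = (len(row_totals) - 1) * (len(col_totals) - 1)

CRITICAL_05_DF9 = 16.919   # tabled chi-square .05 critical value, 9 d.f.
print(f"n = {n}, chi-square = {chi2:.2f}, df = {df}, reject H0: {chi2 > CRITICAL_05_DF9}")
for row in residuals:
    print("  ".join(f"{r:6.2f}" for r in row))
```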

Another Example

These data come from a Globe and Mail study examining the attitudes of the baby boomers. Data were collected on various age groups.

Age group                         Total
Echo (Age 20 – 29)                398
Gen X (Age 30 – 39)               342
Younger Boomers (Age 40 – 49)     378
Older Boomers (Age 50 – 59)       286
Pre Boomers (Age 60+)             445
Total                             1849

One question with responses

In an average week, how many times would you drink alcohol?

                                                          three or     five or
Age group                        never   once    twice    four times   more times   Total
Echo (Age 20 – 29)               115     135     64       48           36           398
Gen X (Age 30 – 39)              130     123     38       31           20           342
Younger Boomers (Age 40 – 49)    136     87      64       57           34           378
Older Boomers (Age 50 – 59)      109     74      40       43           20           286
Pre Boomers (Age 60+)            218     80      45       40           62           445
Total                            708     499     251      219          172          1849

Are there differences in weekly consumption of alcohol related to age?

Table: Expected frequencies

                                                            three or     five or
Age group                        never    once     twice    four times   more times   Total
Echo (Age 20 – 29)               152.40   107.41   54.03    47.14        37.02        398
Gen X (Age 30 – 39)              130.96   92.30    46.43    40.51        31.81        342
Younger Boomers (Age 40 – 49)    144.74   102.01   51.31    44.77        35.16        378
Older Boomers (Age 50 – 59)      109.51   77.18    38.82    33.87        26.60        286
Pre Boomers (Age 60+)            170.39   120.09   60.41    52.71        41.40        445
Total                            708      499      251      219          172          1849

Table: Residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

                                                            three or     five or
Age group                        never    once     twice    four times   more times
Echo (Age 20 – 29)               -3.029   2.662    1.357    0.125        -0.168
Gen X (Age 30 – 39)              -0.083   3.196    -1.237   -1.494       -2.095
Younger Boomers (Age 40 – 49)    -0.726   -1.486   1.771    1.828        -0.196
Older Boomers (Age 50 – 59)      -0.049   -0.362   0.189    1.568        -1.280
Pre Boomers (Age 60+)            3.647    -3.659   -1.982   -1.750       3.203

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 93.97$$

$$\chi^2_{0.05} = 26.296 \text{ for } 4 \times 4 = 16 \text{ d.f.}$$

Conclusion: There is a significant relationship between age group and weekly alcohol use.

Examining the residuals allows one to identify the cells that indicate a departure from independence.

• Large positive residuals indicate cells where the observed frequencies were larger than expected if independent.

• Large negative residuals indicate cells where the observed frequencies were smaller than expected if independent.
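One way to examine the residuals is to list the cells whose standardized residual is large in absolute value. The sketch below does this for the alcohol data; the cut-off of 2 is an assumption made for this illustration (the slides do not state a threshold), and the variable names are hypothetical.

```python
import math

age_groups = [
    "Echo (Age 20-29)",
    "Gen X (Age 30-39)",
    "Younger Boomers (Age 40-49)",
    "Older Boomers (Age 50-59)",
    "Pre Boomers (Age 60+)",
]
responses = ["never", "once", "twice", "three or four times", "five or more times"]
observed = [
    [115, 135, 64, 48, 36],
    [130, 123, 38, 31, 20],
    [136, 87, 64, 57, 34],
    [109, 74, 40, 43, 20],
    [218, 80, 45, 40, 62],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

residuals = [
    [(x - r * c / n) / math.sqrt(r * c / n) for x, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]
chi2 = sum(res * res for row in residuals for res in row)

THRESHOLD = 2.0   # assumed cut-off for a "large" residual, not from the slides
flagged = [
    (age_groups[i], responses[j], residuals[i][j])
    for i in range(len(age_groups))
    for j in range(len(responses))
    if abs(residuals[i][j]) > THRESHOLD
]

print(f"chi-square = {chi2:.2f}")
for group, response, res in flagged:
    direction = "more" if res > 0 else "fewer"
    print(f"{group:30s} {response:20s} {res:7.3f}  ({direction} than expected)")
```

With this cut-off the flagged cells are concentrated in the youngest and oldest age groups, which is where the table of residuals shows the clearest departure from independence.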

Another question with responses

In an average week, how many times would you surf the internet?

                                          1 to 4    5 to 9    10 or
Age group                        never    times     times     more times   Total
Echo (Age 20 – 29)               48       72        100       178          398
Gen X (Age 30 – 39)              51       82        92        117          342
Younger Boomers (Age 40 – 49)    79       128       76        95           378
Older Boomers (Age 50 – 59)      92       63        57        74           286
Pre Boomers (Age 60+)            276      71        67        31           445
Total                            546      416       392       495          1849

Are there differences in weekly internet use related to age?

Table: Expected frequencies

                                          1 to 4    5 to 9    10 or
Age group                        never    times     times     more times   Total
Echo (Age 20 – 29)               117.53   89.54     84.38     106.55       398
Gen X (Age 30 – 39)              100.99   76.95     72.51     91.56        342
Younger Boomers (Age 40 – 49)    111.62   85.04     80.14     101.20       378
Older Boomers (Age 50 – 59)      84.45    64.35     60.63     76.57        286
Pre Boomers (Age 60+)            131.41   100.12    94.34     119.13       445
Total                            546      416       392       495          1849

Table: Residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

                                          1 to 4    5 to 9    10 or
Age group                        never    times     times     more times
Echo (Age 20 – 29)               -6.41    -1.85     1.70      6.92
Gen X (Age 30 – 39)              -4.97    0.58      2.29      2.66
Younger Boomers (Age 40 – 49)    -3.09    4.66      -0.46     -0.62
Older Boomers (Age 50 – 59)      0.82     -0.17     -0.47     -0.29
Pre Boomers (Age 60+)            12.61    -2.91     -2.82     -8.07

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 406.29$$

$$\chi^2_{0.05} = 21.03 \text{ for } 4 \times 3 = 12 \text{ d.f.}$$

Conclusion: There is a significant relationship between age group and weekly internet use.

[Figure: bar charts of weekly internet use (never, 1 to 4 times, 5 to 9 times, 10 or more times), one panel for each age group: Echo (Age 20 – 29), Gen X (Age 30 – 39), Younger Boomers (Age 40 – 49), Older Boomers (Age 50 – 59), Pre Boomers (Age 60+).]

Next topic: Fitting equations to data
