Lectures 15/16 - York University


© Copyright 2004, Alan Marshall

Lectures 15/16

Analysis of Variance

ANOVA

>ANOVA stands for ANalysis Of VAriance

>ANOVA allows us to:
• Do multiple tests at one time
–more than two groups
• Test for multiple effects simultaneously
–more than one variable

ANOVA Tests

The types of ANOVA we will look at are:
>One-Way ANOVA
>Randomized block design ANOVA
>Two-Factor ANOVA
>We will also see ANOVA in regression analysis

One-Way ANOVA

>One-way ANOVA allows us to test simultaneously whether two or more population means are equal

HO: µ1 = µ2 = µ3

HA: At least two means differ

ANOVA assumptions

>All populations are normally distributed

>The population variances are equal
• ANOVA tests assume that variances can be pooled

>The observations are independent

Example

>We are interested in seeing if the advertising strategies employed in three cities made a difference

>We assume that the three cities have been shown to be similar in the past

>The sales results for 20 weeks in each of the three cities are displayed on the next slide

Example Data

City 1          City 2          City 3
Convenience     Quality         Price
529   498       804   492       672   691
658   663       630   719       531   733
793   604       774   787       443   698
514   495       717   699       596   776
663   485       679   572       602   561
719   557       604   523       502   572
711   353       620   584       659   469
606   557       697   634       689   581
461   542       706   580       675   679
529   614       615   624       512   532

Terminology

>We have a response variable, the level of weekly sales

>There is one factor, the advertising strategy, with three levels or treatments: the strategy used in each of the three cities

Means and Grand Mean

City 1          City 2          City 3
Convenience     Quality         Price
529   498       804   492       672   691
658   663       630   719       531   733
793   604       774   787       443   698
514   495       717   699       596   776
663   485       679   572       602   561
719   557       604   523       502   572
711   353       620   584       659   469
606   557       697   634       689   581
461   542       706   580       675   679
529   614       615   624       512   532

Mean 577.55     Mean 653.00     Mean 608.65

Grand Mean 613.067
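The treatment means and the grand mean can be reproduced from the raw data. The following is a minimal sketch in Python, assuming the layout of the Example Data slide (two columns of ten weeks per city); the variable names are illustrative and not part of the lecture.

```python
from statistics import mean

# Weekly sales, 20 weeks per city, transcribed from the Example Data slide.
# Assumption: each city occupies two adjacent columns of the slide's table.
sales = {
    "Convenience": [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
                    498, 663, 604, 495, 485, 557, 353, 557, 542, 614],
    "Quality": [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
                492, 719, 787, 699, 572, 523, 584, 634, 580, 624],
    "Price": [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
              691, 733, 698, 776, 561, 572, 469, 581, 679, 532],
}

# One mean per treatment, plus the mean of all 60 responses
group_means = {name: mean(values) for name, values in sales.items()}
all_values = [x for values in sales.values() for x in values]
grand_mean = mean(all_values)

for name, m in group_means.items():
    print(f"{name:12s} mean = {m:.2f}")
print(f"Grand mean   = {grand_mean:.3f}")
```

Because the three groups are the same size, the grand mean here is also the simple average of the three treatment means.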

Discussion

>There are differences between the means, but we are not sure if they are significant.

>We could also observe that there is an amount of variation about the grand mean
• Some of this variation is explained by the treatments (advertising strategies)
• Some remains unexplained

Sum of Squares

>In all forms of ANOVA, we analyze the SUMS OF SQUARES
• essentially, the numerator in the variance calculation

Sum of Squares Between (SSB)

>The difference between each of the treatment (or factor) means and the grand mean is squared, multiplied by the number of responses for that treatment, and summed across treatments

>If the treatment means all equaled the grand mean, the SSB would be 0

\[ SSB = \sum_{i=1}^{k} n_i \left( \bar{x}_i - \bar{\bar{x}} \right)^2, \qquad i = 1, 2, 3, \ldots, k \ \text{(group numbers)} \]
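As a quick check of the SSB formula, the sketch below applies it to the three treatment means of the advertising example (577.55, 653.00 and 608.65, with 20 weeks each). The dictionary names are illustrative only.

```python
# Treatment means and group sizes from the advertising example
means = {"Convenience": 577.55, "Quality": 653.00, "Price": 608.65}
sizes = {"Convenience": 20, "Quality": 20, "Price": 20}

# Grand mean as the size-weighted average of the treatment means
N = sum(sizes.values())
grand_mean = sum(sizes[g] * means[g] for g in means) / N

# One term per treatment: n_i * (treatment mean - grand mean)^2
terms = {g: sizes[g] * (means[g] - grand_mean) ** 2 for g in means}
ssb = sum(terms.values())

for g, value in terms.items():
    print(f"{g:12s} {value:10.1f}")
print(f"SSB          {ssb:10.1f}")
```

The Quality term is the largest because its mean is furthest from the grand mean; the Price term is small because its mean is close to it.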

Sum of Squares Within (SSW)

>The unexplained variation, SSW, is the sum of the residual variation around the treatment means

>Since for each treatment, s² = SS/(n-1), we can also get the SSW by summing (n-1)s² for each treatment

\[ SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2 = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 \]
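The shortcut form of the SSW can be illustrated with the sample variances reported for the advertising example (10775, 7238.105 and 8670.239, each from 20 weeks). This is a sketch using those rounded summary values, so the result agrees with the reported SSW only to rounding.

```python
# Sample variances and group sizes as reported for the advertising example
variances = {"Convenience": 10775.0, "Quality": 7238.105, "Price": 8670.239}
sizes = {"Convenience": 20, "Quality": 20, "Price": 20}

# SSW = sum over treatments of (n_i - 1) * s_i^2
ssw = sum((sizes[g] - 1) * variances[g] for g in variances)
print(f"SSW = {ssw:.1f}")
```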

Mean Squares

\[ MSB = \frac{SSB}{k - 1} \qquad MSW = \frac{SSW}{N - k} \]

>The Mean Square for Treatments (i.e., between groups) is the SSB divided by the number of treatments minus 1

>The Mean Square Within is the SSW divided by the total sample size minus the number of treatments

The Test Statistic

\[ MSB = \frac{SSB}{k - 1} \qquad MSW = \frac{SSW}{N - k} \qquad F = \frac{MSB}{MSW} \]

>When HO is true, the ratio of the MSB to the MSW is distributed according to an F distribution, with:
ν1 = df1 = (k - 1) and
ν2 = df2 = (N - k)
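Putting the pieces together, the whole one-way calculation can be sketched as a small Python function. The function name `one_way_anova` and its return format are hypothetical; the data are the advertising example, and the critical value 3.158846 is the one Excel reports for this example at α = 0.05.

```python
from statistics import mean

def one_way_anova(groups):
    """One-way ANOVA for a list of samples; returns the table entries."""
    k = len(groups)                      # number of treatments
    N = sum(len(g) for g in groups)      # total sample size
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [mean(g) for g in groups]

    # Between: n_i * (treatment mean - grand mean)^2, summed over treatments
    ssb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    # Within: squared deviation of each response from its own treatment mean
    ssw = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)
    msw = ssw / (N - k)
    return {"SSB": ssb, "SSW": ssw, "df1": k - 1, "df2": N - k,
            "MSB": msb, "MSW": msw, "F": msb / msw}

convenience = [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
               498, 663, 604, 495, 485, 557, 353, 557, 542, 614]
quality = [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
           492, 719, 787, 699, 572, 523, 584, 634, 580, 624]
price = [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
         691, 733, 698, 776, 561, 572, 469, 581, 679, 532]

result = one_way_anova([convenience, quality, price])
F_CRIT = 3.158846  # critical F for df = (2, 57) at alpha = 0.05, from Excel
reject_h0 = result["F"] > F_CRIT

print(f"F = {result['F']:.4f} with df = ({result['df1']}, {result['df2']})")
print("Reject HO" if reject_h0 else "Do not reject HO")
```

The sketch compares F with a tabulated critical value rather than computing a p-value, which keeps it within the standard library.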


Example

                   City 1        City 2     City 3
                   Convenience   Quality    Price      Grand Mean
Mean               577.55        653.00     608.65     613.067

                                                       Grand Total (SSB)
Between Samples    25228.7       31893.4    390.139    57512.2

                                                       Grand Total (SSW)
Within Samples     s1² = 10775   s2² = 7238.11   s3² = 8670.24   506984

MSB = 28756.1      F = 3.23304
MSW = 8894.45      p-value 0.04677

Example

The first Between Samples term (City 1, Convenience) is calculated as:

\[ 20\,(577.55 - 613.067)^2 = 25228.7 \]

Interpretation

>Since P(F > 3.23) = 0.0468 < α = 0.05, we reject HO: µ1 = µ2 = µ3

>There is enough evidence to infer that the mean weekly sales differ between the cities.

ANOVA Table

Standard ANOVA Table

Source of Variation   SS    df      Mean Square          F-Statistic
Between Samples       SSB   k - 1   MSB = SSB/(k - 1)    F = MSB/MSW
Within Samples        SSW   N - k   MSW = SSW/(N - k)
Total                 SST   N - 1

Example

Source of Variation   SS          df   Mean Square    F-Statistic
Between Samples       57,512.2    2    28756.11667    3.233041411
Within Samples        506,983.5   57   8894.447368
Total                 564,495.7   59
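The rows of the ANOVA table are linked: the sums of squares and the degrees of freedom each add up to the Total row, and the last two columns follow from the first two. A short sketch that rebuilds the example table from its SS column (variable names are illustrative):

```python
# SS column of the example table, with k = 3 treatments and N = 60 responses
ssb, ssw, sst = 57512.2, 506983.5, 564495.7
k, N = 3, 60

df_between, df_within, df_total = k - 1, N - k, N - 1

# Additivity: SSB + SSW = SST and (k - 1) + (N - k) = N - 1
ss_adds_up = abs((ssb + ssw) - sst) < 0.1
df_adds_up = df_between + df_within == df_total

msb = ssb / df_between
msw = ssw / df_within
f_stat = msb / msw

print(f"MSB = {msb:.2f}, MSW = {msw:.3f}, F = {f_stat:.4f}")
```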

Excel Output

Anova: Single Factor

SUMMARY
Groups        Count   Sum     Average   Variance
Convenience   20      11551   577.55    10775
Quality       20      13060   653       7238.105
Price         20      12173   608.65    8670.239

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        57512.23   2    28756.12   3.233041   0.046773   3.158846
Within Groups         506983.5   57   8894.447

Total                 564495.7   59

Required Conditions

>Each treatment (sub-sample) must be normal and the variances equal

>Our tests are crude: eyeball tests
• Looking at the histograms
–if not clearly non-normal, assume normal
–text uses box and whisker plots
• Looking at the variances
–if not very different, assume the same

Formulae: Single Factor ANOVA

Source of Variation   SS    df      MS                   F
Between Groups        SSB   k – 1   MSB = SSB/(k – 1)    F = MSB/MSW
Within Groups         SSW   N – k   MSW = SSW/(N – k)
Total                 SST   N – 1

Example L13, Slides 15-17 Revisited

>If we are simply looking at two samples, and want to see if their means are equal, we can perform the t-test we did in Lecture 14, or an F-test

>This question was examining if there were differences between Prof. Goodstat’s morning and afternoon classes.

Example

>Prof. Goodstat has two classes, one at 8:30 and one at 1:00. On the midterm, the morning class of 45 students had a mean of 70 and a standard deviation of 12, while the afternoon class of 40 had a mean of 75 and a standard deviation of 13. Is there evidence at α = 0.05 that the two classes are different?

Example

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(70 - 75) - 0}{\sqrt{\dfrac{12^2}{45} + \dfrac{13^2}{40}}} = \frac{-5}{2.725} = -1.835 > -1.96 \]

Do Not Reject
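The unpooled calculation can be checked numerically. The sketch below uses the summary statistics of the two classes and the critical value 1.96 that the slide uses for a two-tailed test at α = 0.05; the variable names are illustrative.

```python
from math import sqrt

# Summary statistics: morning class (1) and afternoon class (2)
n1, mean1, s1 = 45, 70, 12
n2, mean2, s2 = 40, 75, 13

# Standard error without pooling the variances
std_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Hypothesized difference under HO is 0
t_unpooled = ((mean1 - mean2) - 0) / std_error

CRITICAL = 1.96  # two-tailed critical value used on the slide
reject_h0 = abs(t_unpooled) > CRITICAL

print(f"standard error = {std_error:.3f}, t = {t_unpooled:.3f}")
print("Reject HO" if reject_h0 else "Do not reject HO")
```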

Example - If Pooled

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(44)(144) + (39)(169)}{45 + 40 - 2} = 155.747 \]

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \frac{(70 - 75) - 0}{\sqrt{155.747 \left( \dfrac{1}{45} + \dfrac{1}{40} \right)}} = \frac{-5}{2.712} = -1.8437 > -1.96 \]

Do Not Reject
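The pooled version differs only in how the standard error is built. A minimal sketch with the same summary statistics (names are illustrative; 1.96 is the critical value used on the slide):

```python
from math import sqrt

# Summary statistics: morning class (1) and afternoon class (2)
n1, mean1, s1 = 45, 70, 12
n2, mean2, s2 = 40, 75, 13

# Pooled variance: each sample variance weighted by its degrees of freedom
pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = ((mean1 - mean2) - 0) / std_error

CRITICAL = 1.96  # two-tailed critical value used on the slide
reject_h0 = abs(t_pooled) > CRITICAL

print(f"pooled variance = {pooled_var:.3f}, t = {t_pooled:.4f}")
print("Reject HO" if reject_h0 else "Do not reject HO")
```

Pooling changes the statistic only slightly here because the two sample variances (144 and 169) are similar.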

Example - Using ANOVA

            Morning   Afternoon   Overall
Means       70        75          72.35294
Variances   144       169

          Morning     Afternoon   SS         df   MS         F        p-value
Between   249.13495   280.2768    529.4118   1    529.4118   3.3992   0.068798
Within    144         169         12927      83   155.747

When we did the example using a t-test, t = -1.8437, with the same p-value of 0.068798, and (-1.8437)² = 3.3992 = F

\[ \left( t_{\alpha/2,\,df} \right)^2 = F_{\alpha,\,1,\,df} \]
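The equivalence between the pooled t-test and the two-group ANOVA can be verified from the summary statistics alone. This sketch rebuilds the SSB, SSW and F of the table above and compares F with the square of the pooled t; variable names are illustrative.

```python
from math import sqrt

sizes = [45, 40]        # morning, afternoon
means = [70, 75]
variances = [144, 169]  # 12^2 and 13^2

N = sum(sizes)
k = len(sizes)
overall_mean = sum(n * m for n, m in zip(sizes, means)) / N

# ANOVA quantities from summary statistics
ssb = sum(n * (m - overall_mean) ** 2 for n, m in zip(sizes, means))
ssw = sum((n - 1) * v for n, v in zip(sizes, variances))
msb = ssb / (k - 1)
msw = ssw / (N - k)   # equals the pooled variance when k = 2
f_stat = msb / msw

# Pooled t statistic for the same two samples
t_pooled = (means[0] - means[1]) / sqrt(msw * (1 / sizes[0] + 1 / sizes[1]))

print(f"F = {f_stat:.4f}, t^2 = {t_pooled ** 2:.4f}")
```

With only two groups, MSW is exactly the pooled variance, which is why the two tests must agree.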

Block Design

Terminology

>Randomized Complete Block ANOVA (Text’s terminology)

>Two-way ANOVA without replication (Excel’s terminology)

>Other:
• Randomized Block Design
• Block design

Block Design

>This is similar to the matched pairs experiment, but with more than two treatments per block
• We will have three or more treatments
• The matched pair can be viewed as a randomized block design with only two treatments

Example

>Three fertilizers have been tested in 20 plots

>The crop yields are shown in the table below

>We want to test for variation between fertilizers, but

>We could have variation between the plots

Plot   Fertilizer A   Fertilizer B   Fertilizer C
1      563            588            575
2      593            624            593
3      542            576            564
4      649            672            653
5      565            583            556
6      587            612            590
7      595            617            607
8      429            446            423
9      500            515            483
10     610            641            626
11     524            547            523
12     559            586            568
13     546            582            551
14     503            530            502
15     550            573            567
16     492            518            495
17     497            529            513
18     619            643            626
19     473            497            479
20     533            556            540

Example

>This is a two-way ANOVA without replication, or block design, since the researcher is controlling for differences that may exist between plots of land

>Thus the first row (block) represents the three different fertilizers in plot #1, the second, plot #2, etc.

>Notice the similarity to Matched Pairs Design

Example

>Columns (Fertilizers):
• HO: µ1 = µ2 = µ3
• HA: At least one is not equal
• α = 0.01
• Rejection region: F > F.01,2,38 = 5.21 (Excel)

>Rows (Plots):
• HO: All are equal
• HA: At least one is not equal
• α = 0.01
• Rejection region: F > F.01,19,38 = 2.42 (Excel)

Example

>Note that if there are no significant differences between the blocks (rows), then the single factor test would be more appropriate.

Example - Excel Output

Anova: Two-Factor Without Replication

SUMMARY   Count   Sum     Average    Variance
1         3       1726    575.3333   156.3333
2         3       1810    603.3333   320.3333
.         .       .       .          .
.         .       .       .          .
.         .       .       .          .
20        3       1629    543        139

Fert-A    20      10929   546.45     2953.945
Fert-B    20      11435   571.75     3104.197
Fert-C    20      11034   551.7      3339.905

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Rows                  177464.6   19   9340.242   323.1623   3.67E-36   2.42147
Columns               7131.033   2    3565.517   123.363    2.41E-17   5.21123
Error                 1098.3     38   28.90263

Total                 185693.9   59
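The Excel table can be reproduced by splitting the total variation into a treatment (column) part, a block (row) part and what is left over. The sketch below does this for the fertilizer data in plain Python; the variable names are illustrative, and the error sum of squares is obtained by subtraction.

```python
# Crop yields: one row per plot (block); columns are Fertilizer A, B, C
yields = [
    (563, 588, 575), (593, 624, 593), (542, 576, 564), (649, 672, 653),
    (565, 583, 556), (587, 612, 590), (595, 617, 607), (429, 446, 423),
    (500, 515, 483), (610, 641, 626), (524, 547, 523), (559, 586, 568),
    (546, 582, 551), (503, 530, 502), (550, 573, 567), (492, 518, 495),
    (497, 529, 513), (619, 643, 626), (473, 497, 479), (533, 556, 540),
]

b = len(yields)      # number of blocks (plots)
k = len(yields[0])   # number of treatments (fertilizers)
N = b * k

grand_mean = sum(sum(row) for row in yields) / N
treatment_means = [sum(row[j] for row in yields) / b for j in range(k)]
block_means = [sum(row) / k for row in yields]

# Total variation, then the parts attributable to columns and rows
sst = sum((x - grand_mean) ** 2 for row in yields for x in row)
ss_treat = b * sum((m - grand_mean) ** 2 for m in treatment_means)
ss_block = k * sum((m - grand_mean) ** 2 for m in block_means)
sse = sst - ss_treat - ss_block   # unexplained variation

df_treat, df_block, df_error = k - 1, b - 1, (k - 1) * (b - 1)
mse = sse / df_error
f_treat = (ss_treat / df_treat) / mse
f_block = (ss_block / df_block) / mse

print(f"Columns: SS = {ss_treat:.3f}, F = {f_treat:.3f}")
print(f"Rows:    SS = {ss_block:.1f}, F = {f_block:.3f}")
print(f"Error:   SS = {sse:.1f}, MS = {mse:.5f}")
```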

Explanation

Source of Variation   SS
Rows                  177464.6   Variation attributable to the plots
Columns               7131.033   Variation attributable to the fertilizer
Error                 1098.3     Unexplained variation

Total                 185693.9   The total amount of variation to be explained

Example

ANOVA
Source of Variation   SS         df   MS         F
Rows                  177464.6   19   9340.242   323.1623
Columns               7131.033   2    3565.517   123.363
Error                 1098.3     38   28.90263

Total                 185693.9   59

Example

>Since the F-Value (123.4) is greater than our critical F-Value (5.21), we reject the null hypothesis that the fertilizers are the same

>Likewise, the F-Value for the plots of land (323.2) exceeds the critical value of 2.42, indicating it was appropriate to use this design

>The same results can be inferred from the low p-values, which are below our significance level, α = 0.01

Discussion

>Had we used a one-way ANOVA, we would not have been able to see the differences between the fertilizers, since the differences would have been “lost” in the variability between the plots.

Using One-way ANOVA

Anova: Single Factor

SUMMARY
Groups         Count   Sum     Average   Variance
Fertilizer A   20      10929   546.45    2953.945
Fertilizer B   20      11435   571.75    3104.197
Fertilizer C   20      11034   551.7     3339.905

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        7131.033   2    3565.517   1.138167   0.327578   3.158846
Within Groups         178562.9   57   3132.682

Total                 185693.9   59
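The contrast with the block design can be seen by running the same fertilizer data through a one-way calculation that ignores the plots. In this sketch (illustrative names, plain Python) the plot-to-plot variation stays inside the within-groups sum of squares, so F drops below the critical value 3.158846 that Excel reports.

```python
# Same crop yields, but treated as three independent samples (plots ignored)
fert_a = [563, 593, 542, 649, 565, 587, 595, 429, 500, 610,
          524, 559, 546, 503, 550, 492, 497, 619, 473, 533]
fert_b = [588, 624, 576, 672, 583, 612, 617, 446, 515, 641,
          547, 586, 582, 530, 573, 518, 529, 643, 497, 556]
fert_c = [575, 593, 564, 653, 556, 590, 607, 423, 483, 626,
          523, 568, 551, 502, 567, 495, 513, 626, 479, 540]
groups = [fert_a, fert_b, fert_c]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N
group_means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand_mean) ** 2
                 for g, m in zip(groups, group_means))
# Within-groups SS now also contains the variation between plots
ss_within = sum((x - m) ** 2
                for g, m in zip(groups, group_means) for x in g)

f_one_way = (ss_between / (k - 1)) / (ss_within / (N - k))
F_CRIT = 3.158846  # critical F for df = (2, 57), from the Excel output
reject_h0 = f_one_way > F_CRIT

print(f"SS within = {ss_within:.1f}, F = {f_one_way:.4f}")
print("Reject HO" if reject_h0 else "Do not reject HO")
```

The between-groups sum of squares is identical to the Columns row of the block design; only the denominator of F has changed.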

The Formulae: Block Design

Source of Variation   SS     df                MS                              F
Between Groups        SSB    k – 1             MSB = SSB/(k – 1)               F = MSB/MSE
Between Blocks        SSBL   b – 1             MSBL = SSBL/(b – 1)             F = MSBL/MSE
Error (Within)        SSE    (k – 1)(b – 1)    MSE = SSE/((k – 1)(b – 1))
Total                 SST    N – 1

Two-Factor ANOVA

AKA Two-way ANOVA with replication

Two Factor ANOVA

>Example extends Single Factor ANOVA

>Suppose in the test market, we decide to investigate the impact of the type of media used: television and newspapers

>Now we have two factors:
• The advertising message (before)
• The advertising medium (added here)

Hypotheses

>For Message:
HO: µA1 = µA2 = µA3
HA: At least two means differ

>For Media:
HO: µB1 = µB2
HA: The two means differ

Example Data

Convenience       Quality           Price
City-1   City-2   City-3   City-4   City-5   City-6
TV       NP       TV       NP       TV       NP
491      464      677      689      575      803
712      559      627      650      614      584
558      759      590      704      706      525
447      557      632      652      484      498
479      528      683      576      478      812
624      670      760      836      650      565
546      534      690      628      583      708
444      657      548      798      536      546
582      557      579      497      579      616
672      474      644      841      795      587

Single Factor Test

>We can perform the single factor test to see if there are differences between the cities.

>On the next slide, we see that there are differences between the cities (p-value 0.045)

Single Factor Test

Anova: Single Factor

SUMMARY
Groups     Count   Sum    Average   Variance
Column 1   10      5555   555.5     8641.389
Column 2   10      5759   575.9     8545.878
Column 3   10      6430   643       3884.667
Column 4   10      6871   687.1     12558.54
Column 5   10      6000   600       9527.556
Column 6   10      6244   624.4     12523.82

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        113620.3   5    22724.06   2.448631   0.045165   2.386066
Within Groups         501136.7   54   9280.309

Total                 614757     59

Two Factor Test

>Knowing that there are differences between the cities, we want to see if both factors are responsible for the differences

Example - Data Rearranged

>We have to reorganize the data to reflect the two factors

>The responses are coloured
• Yellow: Convenience and Television
• Blue: Quality and Television
• etc.

             Convenience   Quality   Price
Television   491           677       575
             712           627       614
             558           590       706
             447           632       484
             479           683       478
             624           760       650
             546           690       583
             444           548       536
             582           579       579
             672           644       795

Newspaper    464           689       803
             559           650       584
             759           704       525
             557           652       498
             528           576       812
             670           836       565
             534           628       708
             657           798       546
             557           497       616
             474           841       587

Output - I

Anova: Two-Factor With Replication

SUMMARY      Convenience   Quality    Price      Total

Television
Count        10            10         10         30
Sum          5555          6430       6000       17985
Average      555.5         643        600        599.5
Variance     8641.388889   3884.667   9527.556   8164.397

Newspaper
Count        10            10         10         30
Sum          5759          6871       6244       18874
Average      575.9         687.1      624.4      629.1333
Variance     8545.877778   12558.54   12523.82   12579.91

Total
Count        20            20         20
Sum          11314         13301      12244
Average      565.7         665.05     612.2
Variance     8250.852632   8300.682   10602.06

Output - ANOVA Table

>There appears to be a difference between the messages (the Columns row)

>There is not enough evidence to suggest that the media (the Sample row) or the interaction is significant

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Sample                13172.02   1    13172.02   1.419351   0.23872    4.01954
Columns               98838.63   2    49419.32   5.32518    0.007748   3.168246
Interaction           1609.633   2    804.8167   0.086723   0.917058   3.168246
Within                501136.7   54   9280.309

Total                 614757     59
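The two-factor table can be rebuilt from the rearranged data. The following is a sketch in plain Python, assuming a balanced design with 10 replications per cell as in the example; the dictionary layout and variable names are illustrative. The interaction sum of squares is obtained as the between-cells variation minus the two main effects.

```python
from statistics import mean

# sales[medium][message] -> 10 weekly observations (replications)
sales = {
    "Television": {
        "Convenience": [491, 712, 558, 447, 479, 624, 546, 444, 582, 672],
        "Quality":     [677, 627, 590, 632, 683, 760, 690, 548, 579, 644],
        "Price":       [575, 614, 706, 484, 478, 650, 583, 536, 579, 795],
    },
    "Newspaper": {
        "Convenience": [464, 559, 759, 557, 528, 670, 534, 657, 557, 474],
        "Quality":     [689, 650, 704, 652, 576, 836, 628, 798, 497, 841],
        "Price":       [803, 584, 525, 498, 812, 565, 708, 546, 616, 587],
    },
}
media = list(sales)
messages = list(sales[media[0]])
a, b, r = len(messages), len(media), 10   # message levels, media levels, reps
N = a * b * r

grand = mean([x for m in media for g in messages for x in sales[m][g]])

# Main effects: marginal means against the grand mean
ss_media = a * r * sum(
    (mean([x for g in messages for x in sales[m][g]]) - grand) ** 2
    for m in media)
ss_message = b * r * sum(
    (mean([x for m in media for x in sales[m][g]]) - grand) ** 2
    for g in messages)

# Between-cells variation, then interaction by subtraction
ss_cells = r * sum((mean(sales[m][g]) - grand) ** 2
                   for m in media for g in messages)
ss_inter = ss_cells - ss_media - ss_message

# Error: variation of each response around its own cell mean
sse = sum((x - mean(sales[m][g])) ** 2
          for m in media for g in messages for x in sales[m][g])
mse = sse / (N - a * b)

f_media = (ss_media / (b - 1)) / mse
f_message = (ss_message / (a - 1)) / mse
f_inter = (ss_inter / ((a - 1) * (b - 1))) / mse

print(f"Sample (media):    F = {f_media:.4f}")
print(f"Columns (message): F = {f_message:.4f}")
print(f"Interaction:       F = {f_inter:.4f}")
```

Only the message F exceeds its critical value of 3.168246 from the Excel output, which matches the conclusion drawn above.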

The Formulae: Two Factor

Source of Variation   SS     df                MS                               F
Factor A              SSA    a – 1             MSA = SSA/(a – 1)                F = MSA/MSE
Factor B              SSB    b – 1             MSB = SSB/(b – 1)                F = MSB/MSE
Interaction           SSAB   (a – 1)(b – 1)    MSAB = SSAB/((a – 1)(b – 1))     F = MSAB/MSE
Error                 SSE    N – ab            MSE = SSE/(N – ab)
Total                 SST    N – 1

YOU LEARN STATISTICS BY DOING STATISTICS