ANOVA and Linear Models

Data

Data is from the University of York project on variation in British liquids. JK Local, Alan Wrench, Paul Carter

Correlation

When we have two variables we can measure the strength of the linear association by correlation.

Correlation in a strict technical statistical sense is the linear relationship between two variables.


Many times we are not interested in the differences between two groups, but instead the relationship between two variables on the same set of subjects.
Ex: Are post-graduate salary and GPA related?
Ex: Is the F1.0 measurement related to the F1.1 measurement?

Correlation is a measurement of LINEAR dependence. Non-linear dependencies have to be modeled in a separate manner.


There is a theoretical correlation, usually represented by $\rho_{X,Y}$.

We can calculate the sample correlation between two variables (x, y). The Pearson coefficient is:

$$r_{xy} = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$

This will vary between -1.0 and 1.0, with the sign indicating the direction of the relationship and the magnitude its strength.
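The Pearson coefficient can be computed directly from its definition. The following is a minimal Python sketch (the output shown in these slides comes from R, not from this code); the function name `pearson_r` and the sample values are made up for illustration and are not the York data.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: sum((x_i - x_bar)(y_i - y_bar)) / ((n - 1) s_x s_y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations use the n - 1 denominator.
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical formant-like values (NOT the York data).
f1_0 = [310.0, 420.0, 515.0, 640.0, 700.0]
f1_1 = [300.0, 450.0, 500.0, 660.0, 690.0]
print(round(pearson_r(f1_0, f1_1), 3))
```

A perfectly increasing straight line gives 1.0 and a perfectly decreasing one gives -1.0, which is a quick sanity check on the sign convention.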


    Pearson's product-moment correlation

    data:  york.data$F1.0 and york.data$F1.1
    t = 45.9262, df = 318, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.9161942 0.9452264
    sample estimates:
         cor
    0.932194
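The t statistic and the confidence interval in this output can be recovered from the reported correlation and degrees of freedom alone. The sketch below assumes the usual formulas: t = r·sqrt(df)/sqrt(1 − r²) with df = n − 2, and a confidence interval built on Fisher's z transform. The function names are made up for illustration.

```python
import math
from statistics import NormalDist

def correlation_t(r, n):
    """t statistic for H0: true correlation = 0, with n - 2 degrees of freedom."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z transform."""
    z = math.atanh(r)                   # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)         # approximate standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Values taken from the output above: cor = 0.932194, df = 318, so n = 320.
r, n = 0.932194, 320
t_stat = correlation_t(r, n)
low, high = fisher_ci(r, n)
print(round(t_stat, 2), round(low, 4), round(high, 4))
```

The result should agree with the printed t value and interval up to rounding of the reported correlation.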

Correlation Types

Pearson’s r: X, Y are continuous variables.

Kendall’s Tau: X, Y are continuous or ordinal. The measure is based on the ranks of X and the ranks of Y rather than on the raw values.
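Because Kendall's tau depends only on ordering, it can be computed by counting concordant and discordant pairs. The sketch below implements the simplest variant (often called tau-a), which makes no correction for ties. That choice is an assumption made here for brevity, and statistical software may report a tie-corrected variant instead.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1     # both variables order this pair the same way
        elif sign < 0:
            discordant += 1     # the variables order this pair oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# A monotone but non-linear relationship still gives tau = 1.
print(kendall_tau_a([1, 2, 3, 4], [1, 8, 27, 64]))
```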

One-Way ANOVA

If we want to test the equality of more than two means, we have to use an expanded test: One-Way ANOVA.

An Example

Vowels: a, i, O, u
Are the F1 measurements the same for each corresponding vowel in the segment?

Assumptions: Normality, each group (level of vowel) has the same variance, independent measurements.

The ANOVA Table

Results

    Analysis of Variance Table

    Response: york.data$F1.0
               Df       SS      MS      F    Pr(>F)
    Vowel       3 10830838 3610279 189.96 < 2.2e-16 ***
    Residuals 316  6005850   19006
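Each column of the table follows from the one before it: MS is SS divided by Df, and F is the ratio of the two mean squares (here 3610279 / 19006, roughly 190). A minimal sketch of the whole computation, assuming the standard between-group and within-group sums of squares, is below. The function name and the toy data are hypothetical.

```python
def one_way_anova(groups):
    """Return the pieces of a one-way ANOVA table for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group SS: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group (residual) SS: spread of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df": (df_between, df_within),
        "ss": (ss_between, ss_within),
        "ms": (ms_between, ms_within),
        "f": ms_between / ms_within,
    }

# Toy data, three groups of three observations (NOT the York data).
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(table)
```

The p-value additionally needs the F distribution with (Df between, Df within) degrees of freedom, which this sketch leaves out.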

What about the assumptions?

Can we test for equal variance? Yes.
If the variance is not equal, is there a solution that will still allow us to use ANOVA? Yes.

Post-hoc analysis

There is a difference between the mean of at least one vowel and the others, so what?

We can test where the difference is occurring through pairwise t-tests. This type of analysis is often referred to as a post-hoc analysis.

Bonferroni

    Pairwise comparisons using t tests with pooled SD

    data:  york.data$F1.0 and york.data$Vowel

      a       i       O
    i < 2e-16 -       -
    O < 2e-16 < 2e-16 -
    u < 2e-16 1       6.5e-14

    P value adjustment method: bonferroni
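The Bonferroni adjustment itself is simple: each raw p-value is multiplied by the number of comparisons and capped at 1. With four vowels there are six pairwise comparisons. The sketch below illustrates this; the raw p-values are invented for illustration and are not the unadjusted values behind the table above.

```python
from math import comb

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

n_vowels = 4
n_comparisons = comb(n_vowels, 2)   # number of distinct vowel pairs

# Hypothetical raw p-values, one per pair (made up for illustration).
raw = [0.0001, 0.004, 0.02, 0.2, 0.5, 0.9]
adjusted = bonferroni(raw)
for p, p_adj in zip(raw, adjusted):
    print(f"raw = {p:<7} adjusted = {p_adj:.4f}")
```

The cap explains the entry of exactly 1 for the u versus i comparison: once the multiplied value exceeds 1, it is reported as 1.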

Multi-Way ANOVA

Usually we are not interested in merely one factor, but in the effects of several factors on our dependent (response) variable.

Same principle [Except now we have several ‘between groups variables’]

Multi-Way ANOVA

               Df   Sum Sq Mean Sq F value    Pr(>F)
    Vowel       3   173482   57827  2.0353 0.1077197
    Liquid      1   216198  216198  7.6092 0.0059747 **
    Sex         1   340872  340872 11.9971 0.0005687 ***
    Residuals 634 18013735   28413

Testing Assumptions

Bartlett’s Test: H0: All variances for each of your cells are equal.

If your p-value is significant (< .05), then you should not be using an ANOVA, but some non-parametric test that relies on ranks.

With large samples we don’t have to worry as much about the normality assumption. The central limit theorem states that with enough data you will eventually get normality (of the mean).
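For reference, Bartlett's statistic compares the log of the pooled variance with the logs of the individual group variances, and under H0 it is approximately chi-square distributed with k − 1 degrees of freedom. The sketch below computes only the statistic, using the textbook formula, and compares it with a hard-coded critical value for three groups. The function name and data are hypothetical, and no p-value is computed because the Python standard library has no chi-square distribution.

```python
import math
from statistics import variance

def bartlett_statistic(groups):
    """Bartlett's test statistic for equality of variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]   # sample variances (n - 1 denominator)

    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1.0 + (sum(1.0 / (n - 1) for n in sizes) - 1.0 / (n_total - k)) / (
        3.0 * (k - 1)
    )
    return numerator / correction

# Hypothetical groups with very different spreads (NOT the York data).
groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
stat = bartlett_statistic(groups)

# 5.991 is the 0.95 quantile of the chi-square distribution with 2 df (k = 3 groups).
CHI2_CRIT_DF2 = 5.991
print(round(stat, 2), "reject H0" if stat > CHI2_CRIT_DF2 else "do not reject H0")
```

Bartlett's test is itself sensitive to non-normality, which is worth keeping in mind when reading a significant result.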

Higher Order Interactions

It often isn’t enough to test factors by themselves, but we want to model higher-order interactions.

We are looking at Sex, Liquid and Vowel: there are Sex x Liquid, Sex x Vowel, Vowel x Liquid and Sex x Liquid x Vowel as possible interaction effects.
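The list of possible interactions grows quickly with the number of factors: k factors give 2^k − k − 1 interaction terms. A small sketch that enumerates them (the function name is made up for illustration):

```python
from itertools import combinations

def interaction_terms(factors):
    """List every interaction (two-way and higher) among the given factor names."""
    terms = []
    for order in range(2, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(" x ".join(combo))
    return terms

terms = interaction_terms(["Sex", "Liquid", "Vowel"])
print(terms)
```

For the three factors here this yields the three two-way terms and the single three-way term listed above.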

An Alternative Approach: Linear Model

Linear Models allow for an easily expandable approach that allows us to answer questions more explicitly without having to add more machinery with each new factor or covariate.

The underlying form in an ANOVA is essentially a linear model.

What would it look like?

In a linear model, we estimate parameters (or coefficients) of the predictors on a response.

Ex: We want to model the effect of Vowels on F1.0

$$F1.0 = \alpha + \tau_i + \varepsilon$$

What are each of the pieces?

$\alpha$ represents the intercept term and the mean for F1.0 when the type of vowel is controlled for.

$\tau_i$ represents the treatment effect of the i-th vowel.

$\varepsilon$ represents the noise and is assumed to be $N(0, \sigma^2)$ (i.e. normally distributed with a mean of zero and constant variance).

$$F1.0 = \alpha + \tau_i + \varepsilon$$

Inestimability

We can’t really estimate all of the parameters in our model.

We don’t have a control group where there isn’t a vowel effect.

$$F1.0 = \alpha + \tau_i + \varepsilon$$
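One way to see the problem is to write out what the data can actually tell us. The group means $\mu_i$ and the constant $c$ below are notation introduced here for illustration:

```latex
% Mean of F1.0 for the i-th vowel under F1.0 = \alpha + \tau_i + \varepsilon
\mu_i = \alpha + \tau_i, \qquad i = 1, \dots, 4

% Shifting the intercept by any constant c and every vowel effect by -c
% leaves all four group means unchanged:
\mu_i = (\alpha + c) + (\tau_i - c)
```

Four group means cannot pin down five parameters ($\alpha$ and four $\tau_i$), so the individual parameters are not estimable, although differences such as $\tau_i - \tau_j = \mu_i - \mu_j$ are.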

Two Solutions

Stick with the model. You can only test functions of the parameters and only if they are estimable [The hard way and only if you know a fair amount of linear algebra.]

Pick a control group and allow that to be your baseline (or alpha).

The Simple Way

    Call:
    lm(formula = F1.0 ~ Vowel)

    Residuals:
        Min      1Q  Median      3Q     Max
    -322.62 -109.44  -31.20   67.48 1044.13

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   426.43      13.51  31.566   <2e-16 ***
    Voweli        -42.62      19.10  -2.231   0.0260 *
    VowelO        -33.94      19.10  -1.776   0.0761 .
    Vowelu        -35.16      19.10  -1.841   0.0662 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 170.9 on 636 degrees of freedom
    Multiple R-Squared: 0.009255,  Adjusted R-squared: 0.004582
    F-statistic: 1.98 on 3 and 636 DF,  p-value: 0.1157
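With a single categorical predictor and a baseline level, the least-squares fit reduces to group means: the intercept is the mean of the baseline group and each remaining coefficient is that group's mean minus the baseline mean. The sketch below shows this reading of the coefficients; the function name and toy data are hypothetical, and it computes the estimates only, not the standard errors or t values.

```python
def treatment_coding_fit(groups, baseline):
    """Least-squares estimates for response ~ factor using a baseline (treatment) coding.

    groups: dict mapping each factor level to its list of observations.
    Returns (intercept, effects), where effects[level] = mean(level) - mean(baseline).
    """
    means = {level: sum(values) / len(values) for level, values in groups.items()}
    intercept = means[baseline]
    effects = {
        level: mean - intercept for level, mean in means.items() if level != baseline
    }
    return intercept, effects

# Toy data (NOT the York data), with vowel "a" as the baseline level.
toy = {"a": [4.0, 6.0], "i": [1.0, 3.0], "u": [7.0, 9.0]}
intercept, effects = treatment_coding_fit(toy, baseline="a")
print(intercept, effects)
```

Read this way, the output above says the mean F1.0 for the baseline vowel a is 426.43, and the mean for i is 42.62 lower (about 383.81).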

Model Assessment

Standard F: Are any of the levels significant?

$R^2$: How much variation in the response is explained by the predictor(s)?
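The overall F statistic and the adjusted R-squared are both functions of R-squared, the sample size n, and the number of estimated slopes p. The sketch below applies the standard formulas to the values in the lm output (R-squared = 0.009255, three vowel coefficients, 636 residual degrees of freedom, hence n = 640). The function names are made up for illustration.

```python
def f_from_r_squared(r2, n, p):
    """Overall F statistic: explained variation per predictor over residual variation per df."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Values read off the lm summary: 3 coefficients besides the intercept, 636 residual df.
r2, p = 0.009255, 3
n = 636 + p + 1
f_stat = f_from_r_squared(r2, n, p)
adj = adjusted_r_squared(r2, n, p)
print(round(f_stat, 2), round(adj, 6))
```

Here Vowel explains under 1% of the variation in F1.0, which is why the overall F test is not significant.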

What’s Next?

How to handle repeated measures?
Generalized Linear Models (Counts, proportions)
Classification and Regression Trees (Decision Trees).
