Multiple regression models

Experimental design and data analysis for biologists (Quinn & Keough, 2002)

Environmental sampling and analysis

Multiple regression models

• One response (dependent) variable:
– Y

• More than one predictor (independent) variable:
– X1, X2, X3, …, Xp
– number of predictors = p (j = 1 to p)

• Number of observations = n (i = 1 to n)

Forest fragmentation

• 56 forest patches in SE Victoria (Loyn 1987)

• Response variable:
– bird abundance

• Predictor variables:
– patch area (ha)
– years isolated (years)
– distance to nearest patch (km)
– distance to nearest larger patch (km)
– stock grazing intensity (1 to 5 scale)
– altitude (m)

Biomonitoring with Vallisneria

• Indicators of sublethal effects of organochlorine contamination:
– leaf-to-shoot surface area ratio of Vallisneria americana
– response variable

• Predictors:
– sediment contamination, plant density, PAR, rivermile, water depth

• 225 sites in Great Lakes

• Potter & Lovett-Doust (2001)

Regression models

Linear model:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \varepsilon_i$

Sample equation:

$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots$
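A minimal sketch of how the sample coefficients b0, b1, b2 could be estimated by least squares, using only the Python standard library. The data are invented for illustration (they are not from Loyn 1987), and the helper names `solve` and `fit_ols` are hypothetical. The response is built without noise from y = 2 + 3x1 - x2, so the fit should recover those values up to floating-point error.

```python
# Sketch: least-squares estimates for a two-predictor linear model.
# Data and function names are illustrative assumptions, not from the slides.

def solve(a, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(x_rows, y):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

x_rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
y = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in x_rows]  # noise-free response

coefs = fit_ols(x_rows, y)
print([round(v, 6) for v in coefs])
```

In practice a statistics package would be used instead; the normal-equations approach here is chosen only because it keeps the example self-contained.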

Example

• Regression model:

$(\text{bird abundance})_i = \beta_0 + \beta_1(\text{patch area})_i + \beta_2(\text{years isolated})_i + \beta_3(\text{nearest patch distance})_i + \beta_4(\text{nearest large patch distance})_i + \beta_5(\text{stock grazing})_i + \beta_6(\text{altitude})_i + \varepsilon_i$

Multiple regression plane

[Figure: regression plane of bird abundance against altitude and log10 area]

Partial regression coefficients

• H0: $\beta_1 = 0$

• Partial population regression coefficient (slope) for Y on X1, holding all other Xs constant, equals zero

• Example:
– slope of regression of bird abundance against patch area, holding years isolated, distance to nearest patch, distance to nearest larger patch, stock grazing intensity and altitude constant, equals 0

Testing H0: $\beta_j = 0$

• Use partial t-tests:

• $t = b_j / SE(b_j)$

• Compare with t-distribution with n - (p + 1) df

• Separate t-test for each partial regression coefficient in model

• Usual logic of t-tests:
– reject H0 if P < 0.05
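As a rough illustration of the partial t-tests, the sketch below recomputes t = b / SE(b) from the coefficients and standard errors reported for the forest fragmentation model in this deck (n = 56 patches, p = 6 predictors). The variable names are hypothetical. P values are not computed because the standard library has no t-distribution function.

```python
# Coefficient and SE values copied from the forest fragmentation table in this deck.
n_obs = 56    # forest patches
n_pred = 6    # predictors in the full model

estimates = {
    "Log10 area": (7.470, 1.465),
    "Log10 distance": (-0.907, 2.676),
    "Log10 ldistance": (-0.648, 2.123),
    "Grazing": (-1.668, 0.930),
    "Altitude": (0.020, 0.024),
    "Years": (-0.074, 0.045),
}

# Residual degrees of freedom for each partial t-test: n - (p + 1).
resid_df = n_obs - (n_pred + 1)

# Partial t statistic for each slope: coefficient divided by its standard error.
t_stats = {name: b / se for name, (b, se) in estimates.items()}

print("residual df:", resid_df)
for name, t in t_stats.items():
    print(f"{name:16s} t = {t:7.3f}")
```

The residual df of 49 agrees with the denominator df of the overall F statistic reported for the same model (F6,49).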

Model comparison

• Test H0: $\beta_1 = 0$

• Fit full model:
– $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$

• Fit reduced model:
– $y = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \dots$

• Calculate SSExtra:
– SSRegression(full) - SSRegression(reduced)

• F = MSExtra / MSResidual(full)
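Written out in full, the comparison might be sketched as follows. Here q, the number of predictors dropped from the full model, is a symbol introduced for this sketch and does not appear in the slides; p is the number of predictors in the full model and n the number of observations.

$$
SS_{Extra} = SS_{Regression}(\text{full}) - SS_{Regression}(\text{reduced}) = SS_{Residual}(\text{reduced}) - SS_{Residual}(\text{full})
$$

$$
F = \frac{SS_{Extra} / q}{SS_{Residual}(\text{full}) / (n - p - 1)}
$$

When a single predictor is dropped (q = 1), this F equals the square of the partial t statistic for that predictor.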

Overall regression model

• H0: $\beta_1 = \beta_2 = \dots = \beta_p = 0$ (all population slopes equal zero)

• Test of whether overall regression equation is significant

• Use ANOVA F-test:
– variation explained by regression
– unexplained (residual) variation

Explained variance

$r^2 = \dfrac{SS_{Regression}}{SS_{Total}}$

proportion of variation in Y explained by linear relationship with X1, X2 etc.

Forest fragmentation

Parameter        Coefficient  SE     Stand coeff  P
Intercept        20.789       8.285  0            0.015
Log10 area       7.470        1.465  0.565        <0.001
Log10 distance   -0.907       2.676  -0.035       0.736
Log10 ldistance  -0.648       2.123  -0.035       0.761
Grazing          -1.668       0.930  -0.229       0.079
Altitude         0.020        0.024  0.079        0.419
Years            -0.074       0.045  -0.176       0.109

r2 = 0.685, F6,49 = 17.754, P < 0.001
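As a rough consistency check on the forest fragmentation results, the overall F can be recovered from r2 with p = 6 predictors and n = 56 patches. The small gap from the reported 17.754 is presumably due to rounding of r2.

$$
F = \frac{r^2 / p}{(1 - r^2) / (n - p - 1)} = \frac{0.685 / 6}{0.315 / 49} \approx 17.76
$$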

Biomonitoring with Vallisneria

Parameter               Coefficient   SE            P
Intercept               1.054         0.565         0.063
Sediment contamination  1.352         0.482         0.006
Plant density           0.028         0.007         <0.001
PAR                     -0.087        0.017         <0.001
Rivermile               1.00 x 10^-4  9.17 x 10^-5  0.277
Water depth             0.246         0.486         0.613

Assumptions

• Normality and homogeneity of variance for response variable

• Independence of observations

• Linearity

• No collinearity

Scatterplots

• Scatterplot matrix (SPLOM):
– pairwise plots for all variables

• Partial regression (added variable) plots:
– relationship between Y and Xj, holding other Xs constant
– residuals from Y against all Xs except Xj vs residuals from Xj against all other Xs
– graphs partial regression slope for Xj
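The residual-on-residual construction behind a partial regression plot can be checked numerically. The sketch below uses invented data with two predictors and hypothetical helper names. It regresses Y on X2 and X1 on X2, then regresses the first set of residuals on the second. The resulting slope should match the partial slope for X1 from the two-predictor model, computed here from the standard closed-form expression.

```python
from statistics import mean

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    slope = cross(x, y) / cross(x, x)
    return mean(y) - slope * mean(x), slope

def residuals(x, y):
    """Residuals of y after a simple regression on x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Invented data for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0]
y = [3.1, 4.9, 6.2, 8.8, 9.1, 12.5, 13.0, 16.2]

res_y = residuals(x2, y)     # Y with the linear effect of X2 removed
res_x1 = residuals(x2, x1)   # X1 with the linear effect of X2 removed
_, added_variable_slope = simple_fit(res_x1, res_y)

# Partial slope for X1 from the two-predictor least-squares solution.
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)
partial_slope = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(round(added_variable_slope, 6), round(partial_slope, 6))
```

Plotting `res_y` against `res_x1` would give the added variable plot itself; plotting is omitted to keep the sketch dependency-free.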

Partial regression plot (log10 area)

[Figure: partial regression plot of bird abundance (y-axis, -20 to 20) against log10 area (x-axis, -2 to 2)]

Regression diagnostics

• Residual:
– observed yi - predicted yi

• Residual plots:
– residual against predicted yi
– residual against each X

• Influence:
– Cook’s D statistic

Collinearity

• Collinearity:
– predictors correlated

• Assumption of no collinearity:
– predictor variables uncorrelated with (i.e. independent of) each other

• Effect of collinearity:
– estimates of $\beta_j$s and significance tests unreliable

Collinearity

Response (Y) and 2 predictors (X1 and X2)

1. X1 and X2 uncorrelated (r = -0.24)

           coeff  se    tol   t      P
intercept  -0.17  1.03        -0.16  0.873
X1         1.13   0.14  0.95  7.86   <0.001
X2         0.12   0.14  0.95  0.86   0.404

r2 = 0.787, F = 31.38, P < 0.001

Collinearity

2. Rearrange X2 so X1 and X2 highly correlated (r = 0.99)

           coeff  se    tol   t      P
intercept  0.49   0.72        0.69   0.503
X1         1.55   1.21  0.01  1.28   0.219
X2         -0.45  1.21  0.01  -0.37  0.714

r2 = 0.780, F = 30.05, P < 0.001

Checks for collinearity

• Correlation matrix and/or SPLOM between predictors

• Tolerance for each predictor:
– 1 - r2 for regression of that predictor on all others
– if tolerance is low (near 0.1) then collinearity is a problem
– VIF (variance inflation factor) = 1 / tolerance
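With only two predictors, the r2 from regressing one predictor on the other is just the squared correlation between them, so tolerance and VIF can be sketched directly. The data and function names below are illustrative assumptions. With more than two predictors, each tolerance would need a multiple regression of that predictor on all the others.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def tolerance_from_r(r):
    """Tolerance for the two-predictor case: 1 - r^2."""
    return 1.0 - r * r

def vif(tolerance):
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1.0 / tolerance

# Invented predictors: x2 is close to a linear function of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(x1, x2)
tol = tolerance_from_r(r)
print(f"r = {r:.4f}, tolerance = {tol:.4f}, VIF = {vif(tol):.1f}")
```

The r values quoted in the two-predictor collinearity example in this deck are rounded, so tolerances computed from them need not match the tabulated tol values exactly.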

Forest fragmentation

[Figure: scatterplot matrix (SPLOM) of the predictors L10DIST, L10LDIST, L10AREA, GRAZE, ALT and YRS]

Tolerances: 0.396 to 0.681

Solutions to collinearity

• Drop redundant (correlated) predictors

• Principal components regression:
– potentially useful
– replace predictors by independent components from PCA on predictor variables

• Ridge regression:
– controversial and complex

Predictor importance

• Tests on partial regression slopes

• Standardised partial regression slopes:

$b_j^* = b_j \dfrac{s_{X_j}}{s_Y}$
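The rescaling in the standardised slope formula can be illustrated with a short sketch. To keep it checkable, the example assumes a single predictor, where the standardised slope should equal the Pearson correlation. In a multiple regression the same rescaling would be applied to each partial slope in turn. The data and names are invented.

```python
from math import sqrt
from statistics import mean, stdev

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Invented data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [12.0, 15.5, 14.0, 19.0, 21.5, 20.0, 26.0]

b = slope(x, y)
# Standardised slope: raw slope rescaled by the ratio of standard deviations.
b_std = b * stdev(x) / stdev(y)

print(f"b = {b:.4f}, standardised b = {b_std:.4f}, r = {pearson_r(x, y):.4f}")
```

Because the standardised slope is unit-free, it allows slopes for predictors measured on different scales to be compared.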

Predictor importance

• Change in explained variation:
– compare fit of full model to reduced model omitting Xj

$r^2_{X_j} = \dfrac{SS_{Extra}}{\text{Reduced } SS_{Residual}}$

• Hierarchical partitioning:
– splits total r2 for each predictor into
  • independent contribution of each predictor
  • joint contribution of each predictor with other predictors

Forest fragmentation

Predictor        Independent r2  Joint r2  Total r2  Stand coeff
Log10 area       0.315           0.232     0.548     0.565
Log10 distance   0.007           0.009     0.016     -0.035
Log10 ldistance  0.014           <0.001    0.014     -0.035
Altitude         0.057           0.092     0.149     0.079
Grazing          0.190           0.275     0.466     -0.229
Years            0.101           0.152     0.253     -0.176

Interactions

• Interactive effect of X1 and X2 on Y

• Dependence of partial regression slope of Y against X1 on the value of X2

• Dependence of partial regression slope of Y against X2 on the value of X1

• $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i$

Forest fragmentation

• Does effect of grazing on bird abundance depend on area?
– log10 area x grazing interaction

• Does effect of grazing depend on years since isolation?
– grazing x years interaction

• Etc.

Interpreting interactions

• Interactions highly correlated with individual predictors:
– collinearity problem
– centring variables (subtracting mean) reduces collinearity

• Simple regression slopes:
– slope of Y on X1 for different values of X2
– slope of Y on X2 for different values of X1
– use if interaction is significant
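The effect of centring on the collinearity between a predictor and its interaction term can be sketched with invented data. The example below compares the correlation between X1 and the product X1 x X2 before and after subtracting each predictor's mean. The data were chosen to be symmetric, so the centred correlation is essentially zero here. With real data centring would typically reduce the correlation rather than eliminate it.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def centre(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]

# Invented predictors for illustration only.
x1 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
x2 = [5.0, 7.0, 6.0, 8.0, 5.0, 7.0, 6.0, 8.0]

# Interaction term from the raw predictors.
raw_product = [a * b for a, b in zip(x1, x2)]
r_raw = pearson_r(x1, raw_product)

# Interaction term from the centred predictors.
c1, c2 = centre(x1), centre(x2)
centred_product = [a * b for a, b in zip(c1, c2)]
r_centred = pearson_r(c1, centred_product)

print(f"raw: r = {r_raw:.3f}; centred: r = {r_centred:.3f}")
```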

Polynomial regression

• Modeling some curvilinear relationships

• Include quadratic (X^2) or cubic (X^3) etc. terms

• Quadratic model:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \varepsilon_i$

• Compare fit with:

$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$

• Does quadratic fit better than linear?

Local and regional species richness

• Relationship between local and regional species richness in North America:
– Caley & Schluter (1997)

• Two models compared:

$\text{local spp} = \beta_0 + \beta_1(\text{regional spp}) + \beta_2(\text{regional spp})^2 + \varepsilon$

$\text{local spp} = \beta_0 + \beta_1(\text{regional spp}) + \varepsilon$

[Figure: local species richness (y-axis, 0 to 200) against regional species richness (x-axis, 0 to 250), with linear and quadratic fits]

Model comparison

Full model: SSResidual = 376.620, df = 5

Reduced model: SSResidual = 1299.257, df = 6

Difference due to (regional spp)^2: SSExtra = 922.637, df = 1, MSExtra = 922.637

F = 12.249, P < 0.018

See Quinn & Keough Box 6.6
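The arithmetic of this comparison can be retraced from the residual sums of squares and degrees of freedom quoted on the slide. The sketch below is only a check of those figures; the variable names are hypothetical.

```python
# Residual SS and df as quoted for the local vs regional species richness models.
ss_resid_full, df_full = 376.620, 5         # quadratic (full) model
ss_resid_reduced, df_reduced = 1299.257, 6  # linear (reduced) model

# Extra SS explained by adding the quadratic term, and its df.
ss_extra = ss_resid_reduced - ss_resid_full
df_extra = df_reduced - df_full

# Mean squares and the F ratio for the added term.
ms_extra = ss_extra / df_extra
ms_resid_full = ss_resid_full / df_full
f_ratio = ms_extra / ms_resid_full

print(f"SSExtra = {ss_extra:.3f}, df = {df_extra}, F = {f_ratio:.3f}")
```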

Categorical predictors

• Convert categorical predictors into multiple continuous predictors:
– dummy (indicator) variables

• Each dummy variable coded as 0 or 1

• Usually no. of dummy variables = no. of groups minus 1

Forest fragmentation

Grazing intensity  Grazing1  Grazing2  Grazing3  Grazing4
Zero (1)           0         0         0         0
Low (2)            1         0         0         0
Medium (3)         0         1         0         0
High (4)           0         0         1         0
Intense (5)        0         0         0         1

Each dummy variable measures the effect of one of the low to intense categories compared to the “reference” category (zero grazing)
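The coding scheme in the table above can be generated programmatically. This is a minimal sketch assuming grazing is recorded as an integer score from 1 to 5 with score 1 (zero grazing) as the reference category. The function name `dummy_code` is hypothetical.

```python
LABELS = ["Zero", "Low", "Medium", "High", "Intense"]  # grazing scores 1 to 5

def dummy_code(score, n_levels=5):
    """Return (Grazing1, ..., Grazing4) for a grazing score, with score 1 as reference."""
    if not 1 <= score <= n_levels:
        raise ValueError(f"score must be between 1 and {n_levels}")
    # One indicator per non-reference level: 1 if the score matches that level, else 0.
    return tuple(1 if score == level else 0 for level in range(2, n_levels + 1))

for score, label in enumerate(LABELS, start=1):
    print(f"{label} ({score}):", dummy_code(score))
```

Five groups give four dummy variables, matching the "number of groups minus 1" rule.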

Forest fragmentation

Grazing as a single predictor:

Coefficient  Est     SE     t       P
Intercept    21.603  3.092  6.987   <0.001
Grazing      -2.854  0.713  -4.005  <0.001
Log10 area   6.890   1.290  5.341   <0.001

Grazing as four dummy variables:

Coefficient  Est      SE     t       P
Intercept    15.716   2.767  5.679   <0.001
Grazing1     0.383    2.912  0.131   0.896
Grazing2     -0.189   2.549  -0.074  0.941
Grazing3     -1.592   2.976  -0.535  0.595
Grazing4     -11.894  2.931  -4.058  <0.001
Log10 area   7.247    1.255  5.774   <0.001

Categorical predictors

• All linear models fit categorical predictors using dummy variables

• ANOVA models combine dummy variables into single factor effect:
– partition SS into factor and residual
– dummy variable effects often provided by software

• Models with both categorical (factor) and continuous (covariate) predictors:
– adjust factor effects based on covariate
– reduce residual based on strength of relationship between Y and covariate
– more powerful test of factor