Multiple, stepwise and non-linear regression models
Peter Shaw
RU
[Title-slide figure: an exponential decay curve, activity vs time (0–100).
Linear fit: Y = 426 – 4.6X, r = 0.956
OR exponential fit: Y = 487*exp(-0.022*X), r = 1.00]
Introduction
Last week we met and ran bivariate correlation/regression analyses, which were probably simple and familiar to most of you.
This week we extend the concept to several variants: multiple linear regression (MLR), simple and stepwise, and non-linear regressions.
The universals here are linear regression equations, with goodness of fit estimated as R². The bivariate r we met last week is not meaningful in many of these cases.
Multiple Linear Regression, MLR
MLR is just like bivariate regression, except that you have multiple independent (“explanatory”) variables.
Model: Y = A + B*X1 + C*X2 (+ D*X3…)
Very often the explanatory variables are co-correlated, hence the dislike of the term “independents”.
The goodness of fit is indicated by the R² value, where R² = the proportion of total variance in Y explained by the regression. This is just like bivariate regression, except that we don't use a signed (+/-) r value.
(Why no r? Because with 2 variables the line either goes up or down. With 3 variables the fitted surface can go +ve wrt X1 and -ve wrt X2.)
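As a concrete sketch of what an MLR fit does, here is a minimal pure-Python version solved by ordinary least squares via the normal equations. The data are made up so that Y = 2 + 3*X1 - 1*X2 holds exactly, giving R² = 1:

```python
# A minimal sketch of MLR (Y = A + B*X1 + C*X2) fitted by ordinary least
# squares via the normal equations. Data are made up for illustration.

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def mlr(ys, xcols):
    """Fit Y = b0 + b1*X1 + ...; return coefficients and R^2."""
    cols = [[1.0] * len(ys)] + xcols          # design columns, intercept first
    k = len(cols)
    XtX = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    Xty = [sum(a * y for a, y in zip(cols[i], ys)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(ys))]
    ybar = sum(ys) / len(ys)
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return beta, 1 - ss_res / ss_tot

# Hypothetical data following Y = 2 + 3*X1 - 1*X2 exactly:
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * a - b for a, b in zip(x1, x2)]
beta, r2 = mlr(y, [x1, x2])
```

With perfectly generated data the fit is exact; with real data R² < 1 and each coefficient carries a standard error, which this sketch does not compute.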
Height  1st leaf  No. flowers
(mm)    (mm)
108     12        18
111     11        10
147     23        22
218     21        14
240     37        12
223     30        25
242     28        45
480     77        40
290     40        50
263     55        30
[Figure: two 3-D scatterplots of height (mm), first-leaf length (mm) and flower number.]
A three-dimensional data space: three variables define a three-dimensional data space, which can be visualised using three-dimensional graphs. Here, the 3-D graph of flower number, leaf length and plant height is seen from two rotations.
Figure 3.19. The 3-D graph relating crake density to mowing date and vegetation type. The width of the circles indicates corncrake number, ranging from 0 to the maximum of 10.
[Figure: erosion depth (0–3.0 mm) against age of stone (0–100 years).]
Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone.
Basic regression equation: Y = -0.03 + 0.027*X [NB there's an argument for forcing A = 0, but I'll skip that]
[Figure: the same erosion data, with a key distinguishing the city centre, city edge and rural sites.]
Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone. A separate regression line has been fitted for each of the three churchyards.
erosion = 0.42 + 0.024*age - 0.13*distance, R² = 0.73 (97 df)
which can be compared with the bivariate equation:
erosion = -0.28 + 0.027*age, R² = 0.63 (98 df).
There are several philosophical catches here: a common intercept? Parallel lines? This leads to ANCOVA.
Notice how adding a second explanatory variable increased the % variance explained – this is the whole point of MLR.
In fact adding an explanatory variable is ALMOST CERTAIN to increase the R² – if it does not, it must be perfectly correlated with the first explanatory variable (or have zero variance). The real question is whether it is worth adding that new variable.
Source   df   SS       MS       F
Age      1    58.995   58.995   167**
Error    98   34.533   0.352
Total    99   93.529

How does this relate to R²? Recall that R² = proportion of variance explained = 58.995/93.529 = 0.6308
Source          df   SS       MS       F
Age & distance  2    68.330   34.166   131**
Error           97   25.196   0.260
Total           99   93.529

H0: MS(A&D - A) = MS error
F(1,97) = ((68.33 - 58.995)/1) / 0.26 = 35.9**
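As a quick check on the arithmetic of this extra-sum-of-squares F test, using the numbers from the two ANOVA tables:

```python
# Extra-sum-of-squares F test: does adding distance to the age-only model
# buy a significant improvement in fit? Numbers from the ANOVA tables above.

ss_age = 58.995        # SS explained by age alone (1 df)
ss_age_dist = 68.33    # SS explained by age + distance (2 df)
ms_error = 0.26        # error MS from the fuller model (97 df)

extra_df = 1
F = ((ss_age_dist - ss_age) / extra_df) / ms_error   # about 35.9
```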
This leads to Stepwise MLR
Given 1 dependent variable and lots of explanatory variables, start by finding the single highest correlation and, if significant, iterate:
1: Find the explanatory variable that causes the greatest reduction in error variance.
2: Is this reduction significant? If so, add it to the model and repeat.
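The loop above can be sketched in code. This is a minimal forward-selection sketch, not any package's exact algorithm; the data are made up and the F-to-enter threshold of 4.0 is a rule-of-thumb assumption:

```python
# A sketch of forward stepwise selection: at each step add the explanatory
# variable that most reduces the residual sum of squares, but only if the
# improvement passes an F-to-enter threshold. Everything here is
# illustrative: made-up data, rule-of-thumb threshold.

F_ENTER = 4.0

def fit_rss(ys, xcols):
    """Least-squares fit with intercept; return the residual sum of squares."""
    cols = [[1.0] * len(ys)] + xcols
    k = len(cols)
    # normal equations (X'X) b = X'y, solved by Gaussian elimination
    A = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * y for p, y in zip(cols[i], ys))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (A[r][k] - sum(A[r][j] * b[j] for j in range(r + 1, k))) / A[r][r]
    fitted = [sum(bj * cols[j][i] for j, bj in enumerate(b))
              for i in range(len(ys))]
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))

def forward_stepwise(ys, candidates):
    """candidates: dict of name -> data column. Returns names chosen, in order."""
    chosen, n = [], len(ys)
    ybar = sum(ys) / n
    rss = sum((y - ybar) ** 2 for y in ys)        # null model: intercept only
    while True:
        trials = [(fit_rss(ys, [candidates[m] for m in chosen + [name]]), name)
                  for name in candidates if name not in chosen]
        if not trials:
            return chosen
        new_rss, name = min(trials)               # greatest error reduction
        df_err = n - (len(chosen) + 2)            # intercept + chosen + new one
        F = (rss - new_rss) / (new_rss / df_err)
        if F < F_ENTER:
            return chosen                         # improvement not worth it
        chosen.append(name)
        rss = new_rss

# Made-up data: y depends on x1 and x2; x3 is irrelevant.
x1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
x2 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
x3 = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, 0.1, -0.15, 0.05]
y = [1 + 2 * a + 3 * b + e for a, b, e in zip(x1, x2, noise)]
chosen = forward_stepwise(y, {'x1': x1, 'x2': x2, 'x3': x3})  # x1 first, then x2
```

Real stepwise routines also test whether an already-entered variable should be removed at each step; this sketch only adds.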
Data dredging tip:
If you have (say) 50 species and want to find out quickly which one shows the sharpest response to a binary (1/0) factor (M/F, logged vs unlogged, etc.), try this:
Set the factor as the dependent variable, then run stepwise MLR using all species as predictors of that factor. Eccentric, but works like a charm!
MLR is widely used in research, especially for “data dredging”.
It gives accurate R² values, and valid least-squares estimates of the parameters – intercept, gradients wrt X1, X2 etc.
Sometimes its power can seem almost magical – I have a neat little dataset in which MLR can derive the parameters of the polynomial function
Y = A + B*X + C*X² + D*X³…
MLR is powerful and dangerous.
Empirically fitting polynomial functions by MLR: the triangular function.
We can find the general relation between the number of points and the number of rows – empirically! We solve the polynomial by MLR, then:
1 + 2 + 3 + 4 + … + 99 + 100 = ?
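A sketch of the triangular-function trick: generate the triangular numbers T(n) = 1 + 2 + … + n, hand MLR the columns n and n², and read the polynomial off the fitted coefficients. The code and data are illustrative; because the relationship is exactly quadratic, the fit is exact and gives T(n) = 0.5*n + 0.5*n².

```python
# Fit T(n) = 1 + 2 + ... + n by MLR on the columns n and n^2.
# The least-squares fit is exact: T = 0 + 0.5*n + 0.5*n^2.

ns = [1, 2, 3, 4, 5, 6, 7, 8]
ts = [sum(range(1, n + 1)) for n in ns]       # 1, 3, 6, 10, 15, 21, 28, 36

cols = [[1.0] * len(ns), [float(n) for n in ns], [float(n * n) for n in ns]]
k = len(cols)
A = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
     + [sum(p * t for p, t in zip(cols[i], ts))] for i in range(k)]
for c in range(k):                            # Gaussian elimination
    piv = max(range(c, k), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(c + 1, k):
        f = A[r][c] / A[c][c]
        A[r] = [a - f * b for a, b in zip(A[r], A[c])]
beta = [0.0] * k
for r in range(k - 1, -1, -1):
    beta[r] = (A[r][k] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]

intercept, b1, b2 = beta                      # about 0, 0.5, 0.5
total_100 = intercept + b1 * 100 + b2 * 100 ** 2   # 1 + 2 + ... + 100 = 5050
```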
The case of the mysterious tipburn on the CEGB’s Scots Pine
Tipburn on Pinus sylvestris in Liphook open air fumigation experiment 1985-1990
We had some homogeneously aged pines exposed to SO2 in a sophisticated open-air fumigation system, max c. 200 ppb. The books told us that pine only got visible injury above 100 ppm – so we were well into the invisible damage region. Except that it wasn't invisible on a few plants! About 5% were sensitive and could show damage in our conditions. Apart from DNA, what was going on?
Data structure
35 rows of data = 7 plots × 5 years.
Dependent variable: number of plants burnt. Then columns describing SO2:
annual mean; 12 monthly means; 52 weekly means; 365 daily means; rather more hourly means…
Stepwise MLR picked up the mean SO2 concentration during the week of budburst as the best predictor. Bingo!
There is a serious pitfall when the explanatory variables are co-correlated (which is frequent, almost routine).
In this case the gradients should be more-or-less ignored! They may be accurate, but are utterly misleading.
An illustration: Y correlates +vely with X1 and with X2. Enter Y, X1 and X2 into an MLR: Y = A + B*X1 + C*X2. But if B > 0, C may well be < 0! On the face of it this suggests that Y is -vely correlated with X2. In fact this correlation is with the residual information content in X2 after its correlation with X1 is removed.
At least one multivariate package (CANOCO) instructs that prior to analysis you must ensure all explanatory variables are uncorrelated, by excluding offenders from the analysis. This caveat is widely ignored, indeed widely unknown.
But beware!
Collinearity & variance inflation factors
To diagnose collinearity, compute the VIF for each variable:
VIF_j = 1/(1 - R²_j)
where R²_j is the multiple R² between variable j and all the others. If your maximum VIF > 10, worry! Exclude variables or run a PCA (coming up soon!)
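A minimal sketch of the VIF diagnostic, for the simplest case of two explanatory variables, where R²_j is just the squared bivariate r between them. The data are made up to be nearly collinear:

```python
# Variance inflation factor, sketched for the two-predictor case:
# VIF = 1/(1 - r^2) where r is the bivariate correlation between them.
# Made-up data: x2 is almost a copy of x1.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]   # nearly collinear with x1
r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)
worry = vif > 10            # the rule-of-thumb threshold from the slide
```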
Non-linear regressions
You should now be familiar with the notion of fitting a best-fit line to a scattergraph, and with the use of an R² value to assess the significance of a relationship.
Be aware that linear regression implicitly fits a straight line – it would be possible to have a perfect relationship between the dependent and independent variables, and yet a non-significant r!
[Two example scatterplots: one with R = 0.95**, one with R = 0 (NS).]
Sometimes straight lines just aren't enough!
Are these data linear?

X  Y
0  0
1  1
2  4
3  9
4  16
5  25
6  36

R = 0.96, P < 0.05. Y = 6X - 5 – but is it?!
There are many cases where the line you want to fit is not a straight one!
Population dynamics are often exponential (up or down), as are many physical processes such as heating/cooling, radioactive decay etc.
Body size relationships (allometry) are almost always curvy.
Almost always it is still possible to use familiar, unsophisticated linear regression on these. You just have to apply a simple mathematical trick first.
[Sketch graphs: temperature vs time; BMR vs body mass.]
To illustrate the basic idea, take these clearly non-linear data. Y correlates well with X, but by calculating a new variable the correlation can be made perfect.

X:  0  1  2  3  4   5   6
Y:  0  1  4  9  16  25  36

Calculate Y# = square root(Y):

Y#: 0  1  2  3  4   5   6

It is now evident that the correlation between X and Y# is perfect: Y# = 1*X + 0, r = 1.0, p < 0.01.
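The same trick in code, a sketch run on the data above: for these values Y = X², so the square root of Y is exactly X and the transformed correlation is exactly 1.

```python
# The square-root trick: Y = X^2 here, so sqrt(Y) = X exactly,
# and the transformed correlation is perfect.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0, 1, 4, 9, 16, 25, 36]
r_raw = pearson_r(xs, ys)                        # about 0.96: high, not perfect
r_trans = pearson_r(xs, [y ** 0.5 for y in ys])  # exactly 1.0
```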
Logarithms… For better or worse, the tricks we need to straighten lines are based on the mathematical functions called logarithms.
When I were a lad… we used “logs” at school every day for long multiplications/divisions. Calculators stopped all that, so I find that I have to explain the basics here.
Forgive me if you already know this – someone sitting next to you won't!
Let's start with a few powers:

N    2^N   3^N   10^N
0    1     1     1
1    2     3     10
2    4     9     100
3    8     27    1000
4    16    81    10000
5    32    243   100000
6    64    729   1000000

Given these we see that 2^6 = 64, 3^4 = 81 etc. An alternative way of expressing this is to say that the logarithm base 2 of 64 is 6, and the logarithm base 3 of 81 is 4. This is abbreviated: log2(64) = 6, log3(81) = 4; likewise log2(16) = 4, log3(9) = 2 etc.
In practice you are not offered log base 2 or 3 on calculators, though they are easy to work out. What you do get is log10 or loge, where e = 2.7182818… Loge is usually written ln.
The power of logs is that they translate multiplication into addition! This is easy to see, sticking for a moment with logs base 2:
8 * 4 = (2*2*2) * (2*2) [2^3 * 2^2] = (2*2*2*2*2) [2^5] = 32 [note 3 + 2 = 5]
Universally: logx(A) + logx(B) = logx(A*B), for virtually any values of x, A and B (excluding numbers <= 0).
Which log to use?
Answer: it generally makes no difference at all, PROVIDING you are consistent. Throughout this lecture I will use ln = loge, but I could equally have used log10.
In case you wanted to know: loge(X) = log10(X) * loge(10), etc.
The case where you need to be careful is fitting an explicitly exponential model, but if you calculate a half-life you will get the same answer in any log base.
Given this we can deal with 2 distinct types of non-linear relationship, which cover >95% of biological/environmental situations.
1: Exponential curves. These come out of population dynamics, where growth and decline typically involve proportionate changes (i.e. 10% per year) and have a doubling time or a half-life.
2: Power relationships, where doubling the independent variable changes the dependent by 2^X, where X is the power constant. This gives us the allometric relationships typical of size-scaling functions (body size/BMR, river diameter/discharge etc.).
Exponential models
The model can be recognised by having a half-life or doubling time, and is defined as follows:
Y = A*exp(B*X)
where A, B are constants and exp(X) is the natural antilog, e^X = 2.718^X.
Transform by taking logs of Y: ln(Y) = ln(A) + B*X
B < 0 = exponential decay. This never quite goes flat; it heads towards an asymptote.
B > 0 = exponential growth. This is explosive – it never persists long!
For exponential data:
Plot log(Y) vs X, and use regression to get the intercept and gradient.
Intercept = ln(A), gradient = B. If you want the half-life, use: T½ = ln(½)/B
Notice that the gradient will vary according to the log base, so the half-life is safer, as it doesn't.
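That claim can be checked numerically: fit the gradient in base e and in base 10, and confirm that the half-life comes out the same either way. Toy data generated from an exact exponential with assumed constants A = 487, B = -0.022:

```python
import math

# Checking: the fitted gradient depends on the log base, but the half-life
# computed in that same base does not. Toy data, exactly exponential.

A, B = 487.0, -0.022
times = list(range(0, 100, 10))
ys = [A * math.exp(B * t) for t in times]

def gradient(xs, ly):
    """Least-squares gradient of ly on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ly) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ly))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

b_e = gradient(times, [math.log(y) for y in ys])     # = B
b_10 = gradient(times, [math.log10(y) for y in ys])  # = B/ln(10): different!
t_half_e = math.log(0.5) / b_e                       # about 31.5
t_half_10 = math.log10(0.5) / b_10                   # the same value
```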
Count  Time (s)
500 0
400 10
320 20
240 30
200 40
160 50
120 60
100 70
85 80
70 90
Exponential decay curve
[Figure: activity (0–600) vs time (0–100): the decay data from the table above.]
A basic linear fit: Y = 426 – 4.6X, r = 0.956
Exponential decay after log-transformation
[Figure: log(Y) (0–7) vs time (0–100): the decay data now lie on a straight line.]
ln(Y) = 6.19 - 0.022*X, r = 1. Hence Y = 487*exp(-0.022*X)
From this we get: half-life = ln(0.5)/-0.022 = 31.5 sec
Log-transform the Y axis!
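The whole recipe, run on the count/time data above as a sketch in pure Python:

```python
import math

# The exponential-fit recipe: regress ln(count) on time.
# Intercept = ln(A), gradient = B, half-life = ln(1/2)/B.

times = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
counts = [500, 400, 320, 240, 200, 160, 120, 100, 85, 70]

lny = [math.log(c) for c in counts]
n = len(times)
mx, my = sum(times) / n, sum(lny) / n
sxy = sum((t - mx) * (y - my) for t, y in zip(times, lny))
sxx = sum((t - mx) ** 2 for t in times)
B = sxy / sxx                  # gradient, about -0.022
A = math.exp(my - B * mx)      # back-transformed intercept (slide rounds to 487)
half_life = math.log(0.5) / B  # close to the slide's 31.5 s
```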
Power relationships (allometric models)
Here the model is: Y = A*X^B
A, B are constants as before.
Linearise by log-transforming Y AND X:
ln(Y) = ln(A) + B*ln(X)
Plot ln(Y) vs ln(X): intercept = ln(A), gradient = B.
BMR vs body mass: a worked allometric example.
[Figure: scatterplots of YVAR (1–7) vs XVAR (0–100), and of logY (0.0–1.0) vs logX (0.4–2.0).]

Y      X     logY   logX
3.77   29    0.58   1.46
1.21   3     0.08   0.48
3.43   24    0.54   1.38
6.57   88    0.82   1.94
6.82   95    0.83   1.98
6.45   85    0.81   1.93
4.90   49    0.69   1.69

log10(Y) = -0.155 + 0.5*log10(X). Hence Y = 0.7*sqrt(X).
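The allometric recipe can be run on the seven (Y, X) points above (the X values back-calculated from the log10(X) column), as a sketch in pure Python:

```python
import math

# The allometric recipe: regress log10(Y) on log10(X).
# Gradient = power constant B; 10**intercept = A.

xs = [29, 3, 24, 88, 95, 85, 49]
ys = [3.77, 1.21, 3.43, 6.57, 6.82, 6.45, 4.90]

lx = [math.log10(x) for x in xs]
ly = [math.log10(y) for y in ys]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
sxx = sum((a - mx) ** 2 for a in lx)
B = sxy / sxx              # about 0.5
A = 10 ** (my - B * mx)    # about 0.7, hence Y = 0.7*sqrt(X)
```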
To summarise:
1: Plot and inspect the data.
2: Explore by getting the r value for:
   I: Y vs X [linear]
   II: ln(Y) vs X [exponential]
   III: ln(Y) vs ln(X) [allometric]
The best r value implies the best model, though the differences can be very minor.
This won't always work! Some models defy linear transformations. The commonest example is a quadratic relationship:
Y = A + B*X + C*X²
Here there is no alternative but to call up a very powerful, general procedure, the general linear model (GLM). SPSS offers this, as well as quadratic regression: you get an r value plus estimates of A, B and C. Handle this just as we solved the triangular function above.
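A quadratic can in fact be fitted with nothing more than the MLR machinery, by supplying X² as a second explanatory column, exactly as for the triangular function. A sketch with made-up data following Y = 2 + 3*X - 0.5*X² exactly:

```python
# Quadratic "by MLR": regress Y on the columns X and X^2.
# Hypothetical data following Y = 2 + 3*X - 0.5*X^2 exactly.

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2 + 3 * x - 0.5 * x * x for x in xs]

cols = [[1.0] * len(xs), [float(x) for x in xs], [float(x * x) for x in xs]]
k = len(cols)
A = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
     + [sum(p * y for p, y in zip(cols[i], ys))] for i in range(k)]
for c in range(k):                            # Gaussian elimination
    piv = max(range(c, k), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(c + 1, k):
        f = A[r][c] / A[c][c]
        A[r] = [a - f * b for a, b in zip(A[r], A[c])]
beta = [0.0] * k
for r in range(k - 1, -1, -1):
    beta[r] = (A[r][k] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]
# beta recovers A, B, C: about [2.0, 3.0, -0.5]
```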
[Sketch graph: N flux from soil vs insect density.]