Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia [email protected].
Start by loading your Lakedata_06 dataset:
diane<-read.table(file.choose(),sep=";",header=TRUE)
Dataframes
Two ways to identify a column (called "treatment") in
your dataframe (called "diane"):
diane$treatment
OR
attach(diane); treatment
At end of session, remember to: detach(diane)
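A minimal sketch of the two access styles, using a small made-up dataframe (the name `toy` and its values are hypothetical, not the Lakedata_06 data). `with()` is shown as a third option because it needs no clean-up afterwards.

```r
# Hypothetical toy dataframe standing in for the workshop data
toy <- data.frame(treatment = c("control", "herbicide", "control"),
                  biomass   = c(4.2, 1.3, 3.9))

# Style 1: dataframe$column
col_dollar <- toy$treatment

# Style 2: attach, use the bare column name, then detach
attach(toy)
col_attach <- treatment
detach(toy)

# Alternative: with() evaluates an expression inside the dataframe
mean_biomass <- with(toy, mean(biomass))
```

Both styles return the same column; forgetting detach() leaves the columns on the search path, which is the usual reason to prefer `$` or `with()`.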
Summary statistics
length (x)
mean (x)
var (x)
cor (x,y)
sum (x)
summary (x) minimum, maximum, mean, median, quartiles
What is the correlation between two variables in your dataset?
Factors
• A factor has several discrete levels (e.g. control,
herbicide)
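• When read.table() imports a text column, R versions before 4.0.0 automatically
treat it as a factor; from R 4.0.0 onward it stays as text unless you set stringsAsFactors=TRUE.
• To manually convert a numeric vector to a factor: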
x <- as.factor(x)
• To check if your vector is a factor, and what the
levels are:
is.factor(x) ; levels(x)
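A short sketch of the conversion and the checks, with made-up values (the vectors `x` and `trt` are hypothetical):

```r
# Numeric codes that really represent categories (hypothetical values)
x <- c(1, 2, 2, 3, 1)
is.factor(x)        # FALSE: still a numeric vector

x <- as.factor(x)
is.factor(x)        # TRUE
levels(x)           # "1" "2" "3"

# A plain character vector is not a factor until converted
trt <- c("control", "herbicide", "control")
is.factor(trt)      # FALSE
trt <- factor(trt)
levels(trt)         # "control" "herbicide"
```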
Exercise
Make lake area into a factor called AreaFactor:
Area 0 to 5 ha: small
Area 5.1 to 10: medium
Area > 10 ha: large
You will need to:
1. Tell R how long AreaFactor will be.
AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"
2. Assign cells in AreaFactor to each of the 3 levels
3. Make AreaFactor into a factor, then check that it is a factor
Exercise
Make lake area into a factor called AreaFactor:
Area 0 to 5 ha: small
Area 5.1 to 10: medium
Area > 10 ha: large
You will need to:
1. Tell R how long AreaFactor will be.
AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"
2. Assign cells in AreaFactor to each of the 3 levels
AreaFactor[Area<5.1]<-"small"; AreaFactor[Area>10]<-"large"
3. Make AreaFactor into a factor, then check that it is a factor
AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)
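The same binning can also be done in one step with cut(). This is an alternative sketch, not the approach used in the workshop, and the Area values below are made up.

```r
# Hypothetical lake areas in hectares
Area <- c(0.5, 3, 5, 7.2, 10, 12.5, 40)

# Breaks at 0, 5, 10 and Inf give the bins [0,5], (5,10], (10,Inf];
# include.lowest = TRUE keeps an area of exactly 0 in "small"
AreaFactor <- cut(Area,
                  breaks = c(0, 5, 10, Inf),
                  labels = c("small", "medium", "large"),
                  include.lowest = TRUE)

is.factor(AreaFactor)
table(AreaFactor)
```

Two differences from the step-by-step version are worth noting: an area between 5 and 5.1 lands in "medium" here, and the levels are ordered small, medium, large rather than alphabetically, so a later lm() would use "small" as its reference level.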
Linear regression
model <- lm (y ~ x, data = diane)
• model: invent a name for your model
• lm: linear model
• y: insert your y vector name here
• x: insert your x vector name here
• diane: insert your dataframe name here
• ~: type with ALT+126
Linear regression
model <- lm (Species ~ Elevation, data = diane)
summary (model)
Call:
lm(formula = Species ~ Elevation, data = diane)

Residuals:
     Min       1Q   Median       3Q      Max
-7.29467 -2.75041 -0.04947  1.83054 15.00270

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.421568   2.426983   3.882 0.000551 ***
Elevation   -0.002609   0.003663  -0.712 0.482070
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.811 on 29 degrees of freedom
Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167
F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821
Linear regression
model2 <- lm (Species ~ AreaFactor, data = diane)
summary (model2)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        11.833      1.441   8.210 6.17e-09 ***
AreaFactormedium   -2.405      1.723  -1.396    0.174
AreaFactorsmall    -8.288      1.792  -4.626 7.72e-05 ***
Large has mean species richness of 11.8
Medium has mean species richness of 11.8 - 2.4 = 9.4
Small has a mean species richness of 11.8 - 8.3 = 3.5
mean(Species[AreaFactor=="medium"])
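To see why the coefficients translate into group means, here is a sketch on simulated stand-in data (the numbers are made up, and `size` and `species` are hypothetical names). The intercept equals the mean of the reference level, and each other coefficient is a difference from it.

```r
# Simulated stand-in data; the numbers are made up
set.seed(1)
size    <- factor(rep(c("large", "medium", "small"), each = 10))
species <- rpois(30, lambda = c(12, 9, 4)[as.integer(size)])

fit   <- lm(species ~ size)
means <- tapply(species, size, mean)

# Intercept = mean of the reference level ("large", first alphabetically)
coef(fit)["(Intercept)"]
# Intercept + coefficient = mean of each other level
coef(fit)["(Intercept)"] + coef(fit)["sizemedium"]
coef(fit)["(Intercept)"] + coef(fit)["sizesmall"]
means
```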
ANOVA
model2 <- lm (Species ~ AreaFactor, data = diane)
anova (model2)
Analysis of Variance Table
Response: Species
           Df Sum Sq Mean Sq F value    Pr(>F)
AreaFactor  2 333.85  166.92  13.393 8.297e-05 ***
Residuals  28 348.99   12.46
F tests in regression
model3 <- lm (Species ~ Elevation + pH, data = diane)
anova (model, model3)
Model 1: Species ~ Elevation
Model 2: Species ~ Elevation + pH
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     29 671.10
2     28 502.06  1    169.05 9.4279 0.004715 **
F(1, 28) = 9.43
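The F value can be reproduced by hand from the residual sums of squares in the table. A small sketch, using the rounded values printed above (so the result matches only to about two decimals):

```r
# Values copied from the anova(model, model3) table
rss1 <- 671.10; df1 <- 29   # Species ~ Elevation
rss2 <- 502.06; df2 <- 28   # Species ~ Elevation + pH

# F = (drop in RSS / extra parameters) / (RSS of larger model / its residual df)
F_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(F_stat, df1 - df2, df2, lower.tail = FALSE)

round(F_stat, 2)   # about 9.43
p_val
```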
Exercise
• Fit the model: Species~pH
• Fit the model: Species~pH+pH2
("pH2" is just pH squared, e.g. pH2 <- pH^2)
• Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH
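One way the exercise could be set up, sketched on simulated data because the lake data are not reproduced here (the pH and Species values below are made up so that richness peaks at intermediate pH):

```r
# Simulated stand-in for the lake data
set.seed(42)
pH      <- runif(40, min = 4, max = 9)
Species <- 10 - 2 * (pH - 6.5)^2 + rnorm(40, sd = 1)

pH2 <- pH^2                          # the squared term
linear    <- lm(Species ~ pH)
quadratic <- lm(Species ~ pH + pH2)

# A significant F test means the quadratic term improves the fit
comparison <- anova(linear, quadratic)
comparison
```

With the real data, replace the simulated vectors with the columns of your dataframe and read the Pr(>F) column the same way.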
Distributions: not so normal!
• Review assumptions for parametric stats (ANOVA, regressions)
• Why don’t transformations always work?
• Introduce non-normal distributions
Tests before ANOVA, t-tests
• Normality
• Constant variances
Tests for normality: exercise
data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data
How many datapoints are there?
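rep(0:6, counts) repeats each value 0 to 6 the corresponding number of times, so the number of datapoints is the sum of the counts. A quick check:

```r
data <- c(rep(0:6, c(42, 30, 10, 5, 5, 4, 4)))

length(data)                      # 100
sum(c(42, 30, 10, 5, 5, 4, 4))    # same answer from the repeat counts
```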
Tests for normality: exercise
• Shapiro-Wilk (if significant, not normal)
shapiro.test (data)
If error message, make sure the stats package is loaded, then try again:
library(stats); shapiro.test (data)
Tests for normality: exercise
• Kolmogorov-Smirnov (if significant, not normal)
ks.test(data,"pnorm",mean(data),sd=sqrt(var(data)))
Tests for normality: exercise
• Quantile-quantile plot (if the points waver substantially off the reference line, not normal)
par(pty="s")   # square plotting region
qqnorm(data); qqline(data)
Tests for normality: exercise
[Figure: Normal Q-Q Plot of data; x-axis: Theoretical Quantiles (-2 to 2), y-axis: Sample Quantiles (0 to 6)]
Tests for normality: exercise
If the distribution isn't normal, what is it?
freq.data<-table(data); freq.data
barplot(freq.data)
[Figure: bar plot of freq.data; x-axis: values 0 to 6, y-axis: frequency 0 to 40]
Non-normal distributions
• Poisson (count data, right-skewed, variance = mean)
• Negative binomial (count data, right-skewed, variance >> mean)
• Binomial (binary or proportion data, skewed when the proportion is near 0 or 1, variance constrained by the 0,1 bounds)
• Gamma (variance increases as square of mean, often used for survival data)
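The mean-variance relationships in this list can be checked by simulation. A rough sketch; the parameter values are arbitrary choices for illustration:

```r
set.seed(123)
n <- 10000

pois  <- rpois(n, lambda = 3)              # variance = mean
nbin  <- rnbinom(n, mu = 3, size = 0.5)    # variance = mu + mu^2/size
gam_a <- rgamma(n, shape = 2, scale = 1)   # mean 2
gam_b <- rgamma(n, shape = 2, scale = 5)   # mean 10

c(mean = mean(pois), var = var(pois))
c(mean = mean(nbin), var = var(nbin))
# Gamma with fixed shape: variance / mean^2 stays near 1/shape
c(var(gam_a) / mean(gam_a)^2, var(gam_b) / mean(gam_b)^2)
```

The Poisson sample has variance close to its mean, the negative binomial sample has variance several times its mean, and both gamma samples give a variance-to-squared-mean ratio near 0.5 despite very different means.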
Exercise
model2 <- lm (Species ~ AreaFactor, data = diane)
anova (model2)
1. Test for normality of residuals
resid2<- resid (model2)
...you do the rest!
2. Test for homogeneity of variances
summary (lm (abs (resid2) ~ AreaFactor))
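One possible way to finish both steps, sketched on simulated stand-in data (the values are made up; with the real data, use your own model2):

```r
# Simulated stand-in data (made-up numbers)
set.seed(7)
AreaFactor <- factor(rep(c("large", "medium", "small"), each = 12))
Species    <- rnorm(36, mean = c(12, 9, 4)[as.integer(AreaFactor)], sd = 3)

model2 <- lm(Species ~ AreaFactor)
resid2 <- resid(model2)

# 1. Normality of residuals: a significant p-value suggests non-normality
sw <- shapiro.test(resid2)
sw

# 2. Homogeneity of variances: do absolute residuals differ among groups?
summary(lm(abs(resid2) ~ AreaFactor))
```

A visual check with qqnorm(resid2); qqline(resid2) complements the Shapiro-Wilk test in step 1.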
Regression diagnostics
1. Residuals are normally distributed
2. Absolute values of residuals do not change with predicted value (homoscedasticity)
3. Residuals show no pattern with predicted values (i.e. the function “fits”)
4. No datapoint has undue leverage on the model.
Regression diagnostics
model3 <- lm (Species ~ Elevation + pH, data = diane)
par(mfrow=c(2,2)); plot(model3)
1. Residuals are normally distributed
• Straight “Normal Q-Q plot”
[Figure: Normal Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Std. deviance resid.]
2. Absolute residuals do not change with predicted values
• No fan shape in Residuals vs fitted plot
• No upward (or downward) trend in Scale-location plot
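[Figure: Scale-Location plot; x-axis: Fitted values, y-axis: Sqrt(abs(SD resid.))]
[Figure: Residuals vs Fitted plot; x-axis: Fitted values, y-axis: Residuals; example panels labelled BAD and GOOD]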
Examples of neutral and fan shapes
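Both shapes are easy to generate by simulation. In this sketch (all values made up), the "neutral" response has constant spread and the "fan" response has spread that grows with x:

```r
set.seed(99)
x <- runif(200, 1, 10)

y_neutral <- 2 + 3 * x + rnorm(200, sd = 2)        # constant spread
y_fan     <- 2 + 3 * x + rnorm(200, sd = 0.6 * x)  # spread grows with x

fit_neutral <- lm(y_neutral ~ x)
fit_fan     <- lm(y_fan ~ x)

# Side-by-side residuals vs fitted plots
par(mfrow = c(1, 2))
plot(fitted(fit_neutral), resid(fit_neutral), main = "Neutral")
plot(fitted(fit_fan),     resid(fit_fan),     main = "Fan")

# Numeric check: correlation of |residuals| with fitted values
cor_neutral <- cor(abs(resid(fit_neutral)), fitted(fit_neutral))
cor_fan     <- cor(abs(resid(fit_fan)),     fitted(fit_fan))
```

The correlation is close to zero for the neutral case and clearly positive for the fan, which is the same pattern the Scale-Location plot shows as an upward trend.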
3. Residuals show no pattern
Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)
4. No unusual leverage
Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)
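Cook's distance can also be pulled out numerically with cooks.distance(). A sketch on made-up data where the last point is deliberately extreme in x and far off the trend:

```r
set.seed(5)
x <- c(runif(20, 0, 10), 30)                        # last point far out in x
y <- c(2 + 0.5 * x[1:20] + rnorm(20, sd = 1), 0)    # and far off the trend

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

which(cd > 1)    # indices of points flagged by the > 1 rule
```

Refitting without a flagged point and comparing coefficients shows how much that single observation was driving the model.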
Transformations
Try transforming your y-variable to improve the regression diagnostic plots
• replace Species with log(Species)
• replace Species with sqrt(Species)
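A sketch of how the transformed models could be fitted, using simulated count-like data (made-up values). Adding 1 before taking logs is an assumption introduced here to cope with zeros; it is not part of the original slide.

```r
# Simulated count-like response with variance increasing with the mean
set.seed(11)
Elevation <- runif(50, 100, 1500)
Species   <- rpois(50, lambda = exp(3 - 0.001 * Elevation))

fit_raw  <- lm(Species ~ Elevation)
fit_sqrt <- lm(sqrt(Species) ~ Elevation)
# log(0) is -Inf, so 1 is added before logging in case of zeros (assumption)
fit_log  <- lm(log(Species + 1) ~ Elevation)

# Rerun the four diagnostic plots on each fit and compare
par(mfrow = c(2, 2)); plot(fit_sqrt)
```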
Poisson distribution
• Frequency data
• Lots of zero (or minimum value) data
• Variance increases with the mean
What do you do with Poisson data?
1. Correct for correlation between mean and variance by log-transforming y (but log(0) is undefined!)
2. Use non-parametric statistics (but low power)
3. Use a “generalized linear model” specifying a Poisson distribution
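Option 3 might look like the following sketch, fitted to simulated counts (the data and the true coefficients -1 and 0.4 are made up for illustration):

```r
set.seed(2024)
pH      <- runif(60, 4, 9)
Species <- rpois(60, lambda = exp(-1 + 0.4 * pH))

# family = poisson uses a log link by default, so zeros in y are fine
pois_fit <- glm(Species ~ pH, family = poisson)
summary(pois_fit)

# Coefficients are on the log scale; exponentiate for multiplicative effects
exp(coef(pois_fit))
```

Because the log is applied to the expected count rather than to the data, the log(0) problem from option 1 does not arise.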
The problem: Hard to transform data to satisfy all requirements!
Homework: Janka example
Janka dataset: asks whether Janka hardness values are a good estimate of timber density. N = 36