Introduction to R for Data Science :: Session 7 [Multiple Linear Regression in R]
-
Upload
goran-s-milovanovic -
Category
Education
-
view
15.454 -
download
0
Transcript of Introduction to R for Data Science :: Session 7 [Multiple Linear Regression in R]
Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE/Data Science Mentor
at Springboard
Data Science Serbia
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science Serbia
Multiple Linear Regression in R
• Dummy coding of categorical predictors
• Multiple regression
• Nested models and Partial
F-test
• Partial and Part Correlation
• Multicolinearity
• {Lattice} plots
• Prediction, Confidence
Intervals, Residuals
• Influential Cases and
the Influence Plot
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
######################################################## # Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016 # Multiple Linear Regression in R # Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač :: ########################################################
#### read data library(datasets)
library(broom) library(ggplot2)
library(lattice) #### load
data(iris) str(iris)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
#### simple linear regression: Sepal Length vs Petal Lenth
# Predictor vs Criterion {ggplot2} ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(size = 2, colour = "black") + geom_point(size = 1, colour = "white") +
geom_smooth(aes(colour = "black"), method='lm') + ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") + theme(legend.position = "none")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# And now for something completelly different (but in R)...
#### Problems with linear regression in iris # Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris, aes(x = Sepal.Length,
y = Petal.Length, color = Species)) + geom_point(size = 2) +
ggtitle("Sepal Length vs Petal Length") + xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length,
colour=Species)) + geom_smooth(method=lm) +
geom_point(size = 2) + ggtitle("Sepal Length vs Petal Length") + xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
### better... {lattice} xyplot(Petal.Length ~ Sepal.Length | Species, #
{latice} xyplot data = iris, xlab = "Sepal Length", ylab = "Petal Length"
)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {latice} xyplot data = iris,
plot.points=FALSE, xlab = "Petal Length", ylab = "Density",
main = "P(Petal Length|Species)", col.line = 'red' )
densityplot(~ Sepal.Length | Species, # {latice} xyplot
data = iris, plot.points=FALSE, xlab = "Sepal Length", ylab = "Density",
main = "P(Sepal Length|Species)", col.line = 'blue'
)
Multiple Regression in R
• Problems with simple linear regression:
iris dataset
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# Linear regression in subgroups species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w1,]) tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w2,])
tidy(reg) w3 <- which(iris$Species == species[3]) # virginica reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w3,])
tidy(reg)
Multiple Regression in R
• Simple linear regressions in sub-groups
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
#### Dummy Coding: Species in the iris dataset is.factor(iris$Species)
levels(iris$Species) reg <- lm(Petal.Length ~ Species, data=iris) tidy(reg)
glance(reg) # Never forget what the regression coefficient for a dummy variable means:
# It tells us about the effect of moving from the baseline towards the respective reference level! # Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg)) # NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica iris$Species <- factor(iris$Species,
levels = c("versicolor", "virginica", "setosa"))
levels(iris$Species) # baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data=iris) tidy(reg) # The regression coefficents (!): figure out what has happened!
Multiple Regression in R
• Dummy coding of categorical predictors
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
### another way to do dummy coding rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species) contrasts(iris$Species) = contr.treatment(3, base = 1) contrasts(iris$Species) # this probably what you remember from your stats class...
iris$Species <- factor(iris$Species, levels = c ("virginica","versicolor","setosa"))
levels(iris$Species) contrasts(iris$Species) = contr.treatment(3, base = 1) # baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do
Multiple Regression in R
• Dummy coding of categorical predictors
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
### Petal.Length ~ Species (Dummy Coding) + Sepal.Length rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data=iris) # BTW: since is.factor(iris$Species)==T, R does the dummy coding in lm() for you regSum <- summary(reg)
regSum$r.squared regSum$coefficients
# compare w. Simple Linear Regression reg <- lm(Petal.Length ~ Sepal.Length, data=iris) regSum <- summary(reg)
regSum$r.squared regSum$coefficients
Multiple Regression in R
• Multiple regression with dummy-coded categorical predictors
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
### Comparing nested models reg1 <- lm(Petal.Length ~ Sepal.Length, data=iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data=iris) # reg1 is nested under reg2 # terminology: reg2 is a "full model" # this terminology will be used quite often in Logistic Regression
# NOTE: Nested models
# There is a set of coefficients for the nested model (reg1) such that it # can be expressed in terms of the full model (reg2); in our case it is simple # HOME: - figure it out.
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for partial F-test, see: # http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
Multiple Regression in R
• Comparison of nested models
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
#### Multiple Regression - by the book # Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...) data(stackloss) str(stackloss)
# Data set description # URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
stacklossModel = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
# let's see: summary(stacklossModel)
glance(stacklossModel) # {broom} tidy(stacklossModel) # {broom}
# predict new data obs = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
predict(stacklossModel, obs)
Multiple Regression in R
• By the book: two or three continuous predictors…
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# confidence intervals confint(stacklossModel, level=.95) #
95% CI confint(stacklossModel, level=.99) # 99% CI
# 95% CI for Acid.Conc. only confint(stacklossModel, "Acid.Conc.",
level=.95) # default regression plots in R
plot(stacklossModel)
Multiple Regression in R
• By the book: two or three continuous predictors…
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# multicolinearity
library(car) # John Fox's car package VIF <- vif(stacklossModel) VIF
sqrt(VIF) # Variance Inflation Factor (VIF)
# The increase in the ***variance*** of an regression ceoff. due to colinearity # NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg.coeff. vs. what it would be # if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE) Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance # NOTE: you can inspect multicolinearity in the multiple regression mode # by conducting a Principal Component Analysis over the predictors;
# when the time is right.
Multiple Regression in R
• Assumptions: multicolinearity
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
#### R for partial and part (semi-partial) correlations library(ppcor) # a good one; there are many ways to do this in R
#### partial correlation in R dataSet <- iris
str(dataSet) dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method="pearson") irisPCor$estimate # partial correlations irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom ; k = num. of variables conditioned # partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length, dataSet$Sepal.Width, method = "pearson")
partialCor$estimate partialCor$p.value
partialCor$statistic
Multiple Regression in R
• Partial Correlation in R
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
#### semi-partial correlation in R # NOTE: ... Semi-partial correlation is the correlation of two variables
# with variation from a third or more other variables removed only # from the ***second variable*** # NOTE: The first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015) # Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/ irisSPCor <- spcor(dataSet, method = "pearson") irisSPCor$estimate
irisSPCor$p.value irisSPCor$statistic
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length, dataSet$Sepal.Width,
method = "pearson") # NOTE: this is a correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width # is removed! partCor$estimate
partCor$p.value
Multiple Regression in R
• Part (semi-partial) Correlation in R
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
# NOTE: In multiple regression, this is the semi-partial (or part) correlation # that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion # You need a semi-partial of X1 and Y following the removal of X2 and X3 from Y # It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
# In Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate) # RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for.
Multiple Regression in R
• NOTE on semi-partial (part) correlation in multiple regression…
Intro to R for Data Science
Session 7: Multiple Linear Regression in R