Introduction to R for Data Science :: Session 7 [Multiple Linear Regression in R]


Transcript of Introduction to R for Data Science :: Session 7 [Multiple Linear Regression in R]

Page 1

Introduction to R for Data Science

Lecturers

dipl. ing Branko Kovač

Data Analyst at CUBE / Data Science Mentor at Springboard

Data Science Serbia

[email protected]

dr Goran S. Milovanović

Data Scientist at DiploFoundation

Data Science Serbia

[email protected]

[email protected]

Page 2

Multiple Linear Regression in R

• Dummy coding of categorical predictors

• Multiple regression

• Nested models and Partial F-test

• Partial and Part Correlation

• Multicollinearity

• {Lattice} plots

• Prediction, Confidence Intervals, Residuals

• Influential Cases and the Influence Plot


Page 3

########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

#### read data
library(datasets)
library(broom)
library(ggplot2)
library(lattice)

#### load
data(iris)
str(iris)

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 4

#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  geom_smooth(aes(colour = "black"), method = 'lm') +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 5

# And now for something completely different (but in R)...
#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 6

# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_smooth(method = lm) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 7

### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length", ylab = "Petal Length"
)

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 8

# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Petal Length", ylab = "Density",
            main = "P(Petal Length|Species)",
            col.line = 'red'
)

densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Sepal Length", ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = 'blue'
)

Multiple Regression in R

• Problems with simple linear regression: iris dataset


Page 9

# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w1, ])
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2, ])
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3, ])
tidy(reg)
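The same three fits can be produced more compactly; a minimal sketch using split() and lapply() (same formula as above, not part of the original deck):

# fit Petal.Length ~ Sepal.Length within each Species subgroup, in one pass
lapply(split(iris, iris$Species), function(d) {
  tidy(lm(Petal.Length ~ Sepal.Length, data = d))
})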

Multiple Regression in R

• Simple linear regressions in sub-groups


Page 10

#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# it tells us the effect of moving from the baseline level to the respective level!
# Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species,
                       levels = c("versicolor", "virginica", "setosa"))
levels(iris$Species) # baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg) # The regression coefficients (!): figure out what has happened!
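A quick numerical check of the note above (a sketch; groupMeans is an illustrative name): with versicolor as the baseline, each dummy coefficient should equal the difference between that group's mean Petal.Length and the versicolor mean, and the intercept should equal the versicolor mean itself.

groupMeans <- tapply(iris$Petal.Length, iris$Species, mean)
groupMeans # mean Petal.Length per Species
groupMeans["virginica"] - groupMeans["versicolor"] # = coefficient for Speciesvirginica
groupMeans["setosa"] - groupMeans["versicolor"] # = coefficient for Speciessetosa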

Multiple Regression in R

• Dummy coding of categorical predictors


Page 11

### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species,
                       levels = c("virginica", "versicolor", "setosa"))
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1) # baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do

Multiple Regression in R

• Dummy coding of categorical predictors


Page 12

### Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# BTW: since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients

# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
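To quantify the comparison, a sketch of the R^2 increment (deltaR2 is an illustrative name): the share of variance in Petal.Length that Species explains over and above Sepal.Length.

deltaR2 <- summary(lm(Petal.Length ~ Species + Sepal.Length, data = iris))$r.squared -
  summary(lm(Petal.Length ~ Sepal.Length, data = iris))$r.squared
deltaR2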

Multiple Regression in R

• Multiple regression with dummy-coded categorical predictors


Page 13

### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression

# NOTE: Nested models
# There is a set of coefficients for the nested model (reg1) such that it
# can be expressed in terms of the full model (reg2); in our case it is simple
# HOMEWORK: figure it out.

anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for the partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
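What anova(reg1, reg2) computes can be reproduced by hand; a minimal sketch of the partial F statistic from the two residual sums of squares (rss1, rss2, Fstat are illustrative names):

rss1 <- sum(residuals(reg1)^2) # RSS of the nested model
rss2 <- sum(residuals(reg2)^2) # RSS of the full model
Fstat <- ((rss1 - rss2) / (reg1$df.residual - reg2$df.residual)) /
  (rss2 / reg2$df.residual)
Fstat # should match the F value reported by anova(reg1, reg2)
pf(Fstat, reg1$df.residual - reg2$df.residual, reg2$df.residual,
   lower.tail = FALSE) # ...and its p-value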

Multiple Regression in R

• Comparison of nested models


Page 14

#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description:
# https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)

# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}

# predict new data
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(stacklossModel, obs)
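predict() can also return interval estimates for the same observation; a short sketch using the standard interval argument of predict.lm():

predict(stacklossModel, obs, interval = "confidence") # CI for the mean response
predict(stacklossModel, obs, interval = "prediction") # wider interval for a single new case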

Multiple Regression in R

• By the book: two or three continuous predictors…


Page 15

# confidence intervals
confint(stacklossModel, level = .95) # 95% CI
confint(stacklossModel, level = .99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level = .95)
# default regression plots in R
plot(stacklossModel)
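The coefficient CIs can be reproduced by hand as estimate ± t_crit * SE; a sketch for Acid.Conc. (est is an illustrative name):

est <- coef(summary(stacklossModel))["Acid.Conc.", ]
est["Estimate"] + c(-1, 1) * qt(.975, stacklossModel$df.residual) *
  est["Std. Error"] # should match confint(stacklossModel, "Acid.Conc.", level = .95)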

Multiple Regression in R

• By the book: two or three continuous predictors…


Page 16

# multicollinearity
library(car) # John Fox's car package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF):
# the increase in the ***variance*** of a regression coeff. due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg. coeff. is vs. what it
# would be if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can also inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.
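The VIF itself comes from regressing each predictor on all the others: VIF_j = 1 / (1 - R^2_j). A quick sketch for Air.Flow (r2j is an illustrative name):

r2j <- summary(lm(Air.Flow ~ Water.Temp + Acid.Conc., data = stackloss))$r.squared
1 / (1 - r2j) # should match VIF["Air.Flow"]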

Multiple Regression in R

• Assumptions: multicollinearity


Page 17

#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R

#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = num. of variables conditioned on

# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                        dataSet$Sepal.Width,
                        method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic
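The partial correlation is equivalent to correlating two sets of residuals: residualize both variables on the control variable, then correlate. A minimal sketch (rx, ry are illustrative names):

rx <- residuals(lm(Sepal.Length ~ Sepal.Width, data = dataSet))
ry <- residuals(lm(Petal.Length ~ Sepal.Width, data = dataSet))
cor(rx, ry) # should match partialCor$estimate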

Multiple Regression in R

• Partial Correlation in R


Page 18

#### semi-partial correlation in R
# NOTE: semi-partial correlation is the correlation of two variables
# with variation from a third (or more) other variables removed only
# from the ***second variable***
# NOTE: the first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation
# Coefficients (2015), Seongho Kim, Biostatistics Core, Karmanos Cancer Institute,
# Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic

partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
# NOTE: this is the correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (the 2nd variable) due to
# dataSet$Sepal.Width is removed!
partCor$estimate
partCor$p.value

Multiple Regression in R

• Part (semi-partial) Correlation in R


Page 19

# NOTE: in multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion.
# You need the semi-partial of Y and X1 following the removal of X2 and X3 from X1.
# It goes like this: in Step 1, you perform a multiple regression X1 ~ X2 + X3;
# in Step 2, you take the residuals of X1, call them RX1; in Step 3, you correlate
# Y with RX1: the correlation coefficient that you get from Step 3 is the part
# correlation that you're looking for.
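The three steps translate directly into R; a sketch with iris columns standing in for Y and X1, and Sepal.Width playing the role of the predictors being removed (RX1 is an illustrative name):

RX1 <- residuals(lm(Sepal.Length ~ Sepal.Width, data = dataSet)) # Steps 1-2
cor(dataSet$Petal.Length, RX1) # Step 3: the part correlation
# compare with:
# spcor.test(dataSet$Petal.Length, dataSet$Sepal.Length, dataSet$Sepal.Width,
#            method = "pearson")$estimate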

Multiple Regression in R

• NOTE on semi-partial (part) correlation in multiple regression…

