Melissa A. Johnson
STAT 515 - Fall 2016
December 8, 2016
A STATISTICAL MODEL OF WINE QUALITY
Data Set Review:
What are the basic factors that make a wine preferable or highly rated? Are these factors imaginary or real? These are age-old conundrums about the fermented grape juice the world has come to know and love. I have been skeptical of professional wine ratings since reading about a blind taste test in which oenologists could not accurately pick out the most expensive wines. The wine quality data set appealed to my fascination with this problem.
My objective is to fit a multiple linear regression (MLR) model that predicts wine quality from physicochemical attributes. My questions are as follows:
1. Is there a linear relationship between wine ratings (quality) and at least one of the predictor variables?
2. Which physicochemical attribute(s) are the strongest predictors of highly-rated wine and poorly-rated wine?
3. Which model has the best performance in fitting the data?
Wine ratings, or “quality” scores, are sensory data, meaning that humans assess the quality, whereas the physicochemical values are objective data collected from lab tests. There may be benefits in measuring the impact of physicochemical properties on wine quality: it could help wine producers improve the production process and identify target markets to increase profitability. In wine industry practice, certification and quality assessments to prevent adulteration of wine are also based on these types of data collection and descriptive analysis.
I transformed the raw data set, obtained from http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/, using the text-to-columns feature in Excel to make it suitable for analysis in R.
The data set covers red variants of the Portuguese "Vinho Verde" wine. Only physicochemical (input) and sensory (output) variables are available (e.g., private information such as grape types, wine brand, and wine selling price is not included). There are 1599 observations with 11 input variables and 1 output variable. Several of the attributes may be correlated, so it makes sense to apply some form of variable selection.
Input Variables (based on physicochemical tests):
1 - fixed acidity (min: 4.60, max: 15.90)
2 - volatile acidity (min: 0.12, max: 1.58)
3 - citric acid (min: 0.00, max: 1.00)
4 - residual sugar (min: 0.90, max: 15.50)
5 - chlorides (min: 0.01, max: 0.61)
6 - free sulfur dioxide (min: 1.00, max: 72.00)
7 - total sulfur dioxide (min: 6.00, max: 289.00)
8 - density (min: 0.99, max: 1.00)
9 - pH (min: 2.74, max: 4.01)
10 - sulphates (min: 0.33, max: 2.00)
11 - alcohol (min: 8.40, max: 14.90)
Output Variable (based on sensory data):
12 - quality (min: 3.00, max: 8.00), score between 0 and 10
Red Wine Data Set:
I applied a simple Validation Set Approach, randomly dividing the available observations into two parts: a 60% training set and a 40% validation set. The model in Figure 1 is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error, measured by the Mean Squared Error (MSE), provides an estimate of the test error.
> mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set
[1] 0.4374297
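The split-fit-score sequence described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `sim` is simulated and stands in for the wine data, and the names `train_idx` and `val_mse` are hypothetical, not taken from the project code.

```r
# Hypothetical data standing in for the wine data set (not the real measurements)
set.seed(1)
n <- 500
sim <- data.frame(alcohol = rnorm(n, 10.4, 1), sulphates = rnorm(n, 0.66, 0.17))
sim$quality <- 2 + 0.3 * sim$alcohol + 1.0 * sim$sulphates + rnorm(n, sd = 0.65)

# 60/40 split: sample row indices for the training set
train_idx  <- sample(n, size = floor(0.6 * n))
training   <- sim[train_idx, ]
validation <- sim[-train_idx, ]

# Fit on the training rows only, then score the held-out rows
fit  <- lm(quality ~ ., data = training)
pred <- predict(fit, newdata = validation)

# Validation MSE: average squared prediction error on rows the model never saw
val_mse <- mean((validation$quality - pred)^2)
val_mse
```

Because the split is random, a different seed gives a different `val_mse`; that variability is the main weakness of the approach.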
To see whether there is a relationship between quality (the response variable) and the 11 physicochemical attributes (the predictor variables), we test the null hypothesis that all regression coefficients are zero. The test is performed by computing the F-statistic for the full model.
Null Hypothesis = H0: β1 = β2 = β3 = … = β11 = 0
Alternative Hypothesis = Ha: at least one βj is nonzero
1. Multiple Linear Regression Model Using All 11 Predictor Variables
trainfit<- lm(quality~., data=training)
summary(trainfit) #Output of Regression coefficients for all Predictors
Figure 1
The F-statistic for this full model (Figure 1) is 50.26, far larger than the value of about 1 expected when H0 is true, so we reject H0 and conclude that quality (Y) is related to at least one of the predictors (Xj). Because the sample is large relative to the 11 predictors (the full data set has n = 1599 observations, 60% of which form the training set), an F-statistic of this size provides strong evidence against H0.
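For reference, the overall F-statistic is F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), where TSS is the total sum of squares, RSS the residual sum of squares, p the number of predictors, and n the number of observations used in the fit. The sketch below recomputes it by hand on simulated data (the variables `x1`, `x2`, `x3`, and `y` are hypothetical) and compares it with the value reported by `summary()`.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)   # x3 has no true effect
fit <- lm(y ~ x1 + x2 + x3)

p   <- length(coef(fit)) - 1               # number of predictors (intercept excluded)
rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((y - mean(y))^2)                # total sum of squares
f_manual <- ((tss - rss) / p) / (rss / (n - p - 1))

f_r     <- summary(fit)$fstatistic         # named vector: value, numdf, dendf
p_value <- pf(f_manual, p, n - p - 1, lower.tail = FALSE)
c(manual = f_manual, from_summary = unname(f_r["value"]), p_value = p_value)
```

The p-value, rather than the raw size of F, is what formally justifies rejecting H0.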
2. Diagnostics Plots to Verify Regression Assumptions
Figure 2a
Figure 2b
The “Residuals vs. Fitted” plot (Figure 2a) checks for equal variance. The red line is a smooth fit to the residuals, intended to make any patterns easier to identify.
Because of the parallel-line pattern in the plot, which most likely arises because quality takes only a few integer values, it is hard to conclude that the equal-variance assumption holds.
The “Normal Q-Q” plot (Figure 2b), which compares quantiles of the residual distribution against quantiles of the standard normal distribution, shows no systematic deviation from linearity, since the bulk of the points lie on a straight line.
3. Correlation Plots
We can explore the numerical predictors and the response (quality) by creating a correlation table (Figures 3a and 3b) between quality and all physicochemical predictors.
Figure 3a Correlation Plots (numerical values)
Figure 3b Correlation Plot (coded by size/color of circles instead of numerical values)
Alcohol appears to be the best single predictor of quality, with a correlation of 0.49 (so alcohol alone would explain about 0.49^2 ≈ 24% of the variance in quality).
4. Variance Inflation Factor (VIF)
In addition to inspecting the correlation matrix, I computed the VIF (Figure 4) for all predictors. A small amount of collinearity among predictors is typical, so I am concerned only with VIF values that exceed 5 to 10, which could indicate a problematic amount of collinearity.
Figure 4
Both the correlation matrix and the VIF calculation confirm a multicollinearity problem between fixed.acidity and two other predictors: citric.acid and density.
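The VIF for predictor j is 1 / (1 - R²_j), where R²_j is the R² from regressing predictor j on all the other predictors. The project uses `vif()` from the car package; the sketch below is a base R equivalent on simulated data, where `x2` is deliberately built to be collinear with `x1`. The data and the name `vif_manual` are assumptions for illustration only.

```r
set.seed(3)
n <- 300
a <- rnorm(n)
X <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.2),   # built to be strongly collinear with x1
  x3 = rnorm(n)                  # independent of the others
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

Here `x1` and `x2` should show large VIF values while `x3` stays close to 1, mirroring how fixed.acidity stands out in Figure 4.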
5. Multiple Linear Regression Model Without fixed.acidity Predictor Variable
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
I updated the MLR model (Figure 5) to remove fixed.acidity, and the reduced model fits the training data slightly better (larger F-statistic and lower Residual Standard Error).
Figure 5
Although the MLR model without fixed.acidity has a larger F-statistic, some predictors still have relatively large individual p-values, such as residual.sugar, density, and citric.acid.
6. Principal Component Analysis
The response variable, quality, appears to be related to only a subset of the predictors.
We can use principal component analysis (PCA), a dimensionality reduction method, for exploratory analysis and to produce derived variables (principal components) for use in a supervised learning method. PCA reduces the full set of numerical variables to a smaller set that removes the overlap of information between them. Each new variable is a linear combination (a weighted sum) of the original variables. The linear combinations are uncorrelated, which can correct the problem of multicollinearity.
#Center variables to have mean = 0 and scale them to have standard deviation = 1
> pr.out = prcomp(mydata.pca, scale = TRUE)
The rotation element holds the principal component loadings. Each column of the rotation matrix contains one principal component loading vector, which gives the weight of each original variable in that component.
> pr.out$rotation
Figure 6a
Figure 6b Plot of First Two Principal Components
We infer that the first principal component corresponds to a measure of total.sulfur.dioxide and alcohol. Similarly, the second component corresponds to a measure of pH and fixed.acidity.
Figure 6a shows that the first principal component explains 28.2% of the variance, the second 17.5%, the third 14.1%, and so on. To choose the number of components for a principal component regression model, we look at a scree plot, which shows the proportion of variance explained (PVE) by each component in descending order and so helps assess which components explain most of the variability in the data.
Figure 6c Scree Plot of PVE
Figure 6d Scree Plot of Cumulative PVE (Sum = 1)
It appears that nine principal components are needed to reach a cumulative PVE close to 98%. Going from 11 predictors to 9 components is not much of a reduction, so PCA could not shrink the problem without giving up explained variance. After checking both the scree plot of PVE (Figure 6c) and the cumulative PVE (Figure 6d), we see that PCA did not yield a small enough number of principal components to give a good understanding of the data.
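The PVE and cumulative PVE behind the scree plots can be computed directly from the component standard deviations returned by `prcomp()`. The sketch below uses a small simulated matrix rather than the wine data; the 90% threshold and the names `X`, `cum_pve`, and `n_comp` are illustrative assumptions.

```r
set.seed(4)
n <- 200
z <- rnorm(n)
X <- cbind(v1 = z + rnorm(n, sd = 0.3),   # v1 and v2 share most of their variance
           v2 = z + rnorm(n, sd = 0.3),
           v3 = rnorm(n),
           v4 = rnorm(n))

pr  <- prcomp(X, scale. = TRUE)           # center and scale each column
pve <- pr$sdev^2 / sum(pr$sdev^2)         # proportion of variance explained
cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative PVE reaches 90%
n_comp <- which(cum_pve >= 0.90)[1]
round(rbind(pve, cum_pve), 3)
n_comp
```

When `n_comp` is close to the number of original variables, as with 9 of 11 for the wine data, PCA offers little dimensionality reduction.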
7. Forward and Backward Selection MLR Model
We know from the previous exploration of the data set that the response variable, quality, is related to only a subset of the predictors. To determine which predictors are associated with the response, we want to fit a single model involving only those predictors. Considering every possible subset is impractical: with p = 11 there are 2^p = 2^11 = 2048 candidate models. Instead, we can choose a model using variable selection. The two classical approaches, forward selection and backward selection, offer an efficient, automated way to search a much smaller set of models.
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
Figure 7a
#Backward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
Figure 7b
Forward selection begins with the null model and adds, one at a time, the predictor that gives the lowest RSS, whereas backward selection starts with all the predictors and removes, one at a time, the variable with the largest p-value. R's step() function decides when to stop by comparing AIC values. Although these techniques search in opposite directions, it is interesting that in this instance both methods arrived at the same MLR model (Figure 7a and Figure 7b).
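A minimal sketch of both searches with `step()`, run on simulated data in which only `x1` and `x2` truly affect the response. The data frame `d` and its columns are hypothetical; `trace = 0` only silences the step-by-step log.

```r
set.seed(5)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)   # x3..x5 are pure noise

full <- lm(y ~ ., data = d)                 # all predictors
null <- lm(y ~ 1, data = d)                 # intercept only

# Forward: start from the null model, add terms up to the full model's formula
forward <- step(null,
                scope = list(lower = formula(null), upper = formula(full)),
                direction = "forward", trace = 0)

# Backward: start from the full model and drop terms
backward <- step(full, direction = "backward", trace = 0)

sort(names(coef(forward)))
sort(names(coef(backward)))
```

The two searches are not guaranteed to agree in general; comparing the printed coefficient names shows whether they did for a given data set.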
8. Diagnostics Plots to Verify Regression Assumptions
I plotted the “Residuals vs. Fitted” and “Normal Q-Q plot” to search for any patterns that might violate the regression assumptions (Figure 8a and Figure 8b).
Figure 8a
Figure 8b
The “Residuals vs. Fitted” plot for the forward/backward model (Figure 8a) is very similar to that of the full model. It shows the same odd pattern in the residuals, so we cannot rule out a violation of the equal (constant) variance assumption. The “Normal Q-Q plot” (Figure 8b) shows no systematic deviation from linearity, since the bulk of the points lie on a relatively straight line. Another important residual plot is the histogram of residuals (Figure 8c), which shows a roughly bell-shaped, approximately normal distribution.
Figure 8c
Figure 8d
We can also visualize the regression results with the coefficient plot (Figure 8d), where each coefficient is plotted as a point with a thick blue line representing the one-standard-error confidence interval and a vertical gray dotted line indicating 0. As a rule of thumb, a coefficient is statistically significant if its two-standard-error confidence interval (the thinner line) does not contain 0. Here we can see that alcohol and sulphates have the largest positive effects on the quality of the wine.
9. Fitting the Backward Selection Model on the Validation Data Set
> PredBack$residual.scale
[1] 0.6540796
Figure 9
> AIC(backward)
[1] 1917.249
> AIC(trainfit)
[1] 1923.308
> AIC(trainfit1)
[1] 1921.749
Another metric for comparing the models (trainfit, the full model; trainfit1, the full model minus fixed.acidity; and backward, the model obtained from backward selection) is the Akaike Information Criterion (AIC). The AIC of backward (1917.249) is smaller than that of trainfit (1923.308) and trainfit1 (1921.749), so by this criterion the backward selection model fits the data best.
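AIC is defined as -2 × log-likelihood + 2k, where k is the number of estimated parameters (for a linear model, the coefficients plus the error variance). The sketch below recomputes it by hand for two nested models on simulated data; the helper `aic_manual` and the variables are hypothetical and serve only to show what `AIC()` reports.

```r
set.seed(6)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n)           # x2 has no true effect

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

# AIC = -2 * logLik + 2 * k, with k read from the logLik object
aic_manual <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")                  # coefficients plus the error variance
  -2 * as.numeric(ll) + 2 * k
}

c(small = aic_manual(small), large = aic_manual(large))
c(small = AIC(small), large = AIC(large))
```

The 2k term penalizes each extra parameter, which is why a model with fewer predictors can have a lower AIC even though its RSS is slightly higher.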
10. Leave-One-Out-Cross-Validation Method (LOOCV)
The Validation Set Approach used earlier has its drawbacks. LOOCV instead fits the model on n - 1 training observations and predicts the one excluded observation. The resulting squared error is an approximately unbiased estimate of the test error, but it is highly variable because it is based on a single excluded observation. LOOCV therefore repeats the procedure n times, leaving out each observation in turn, and averages the n squared errors.
A major advantage of LOOCV over the validation set approach is that it has far less bias. The model is repeatedly fit on training sets that contain n - 1 observations, almost the entire data set, in contrast to the validation set approach, which here uses 60% of the original data set to train the model.
LOOCV therefore tends not to overestimate the test error rate as much as the validation set approach. In addition, whereas the validation set approach yields different results depending on the random training/validation split, performing LOOCV multiple times always yields the same result because there is no randomness in the splits.
#LOOCV estimate of test MSE is 0.4247
> cv.err$delta
[1] 0.4247753 0.4247728
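The project computes LOOCV with `cv.glm()` from the boot package. As a base R illustration of what that estimate is, the sketch below runs the leave-one-out loop explicitly on simulated data and checks it against the well-known least-squares shortcut, which needs only one fit and the leverage values h_i: CV = mean((e_i / (1 - h_i))^2). The data frame `d` and variable names are assumptions.

```r
set.seed(7)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 - 0.7 * d$x2 + rnorm(n)

# Explicit LOOCV: refit n times, each time leaving one row out
loo_errors <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = d[-i, ])
  (d$y[i] - predict(fit_i, newdata = d[i, ]))^2
})
loocv_loop <- mean(loo_errors)

# Shortcut for least squares: one fit, residuals rescaled by leverage
fit <- lm(y ~ x1 + x2, data = d)
loocv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

c(loop = loocv_loop, shortcut = loocv_shortcut)
```

Both numbers agree, and neither depends on a random seed once the data are fixed, which is the determinism noted above.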
11. k-Fold Cross-Validation
I used the k-fold cross validation method as an alternative to LOOCV. The approach randomly divides the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as the validation set and the model is fitted on the remaining k – 1 folds. The MSE is computed on observations in the held-out fold that is treated as the validation set. The procedure is repeated k times, each time with a different group of observations treated as the validation set.
Although there is some variability in the test error estimates, it is typically lower than the variability that results from the validation set approach.
#k-Fold CV estimate of test MSE is 0.4241
> cv.error.10
[1] 0.42409
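The fold-by-fold procedure can also be written out by hand. The sketch below is a base R illustration on simulated data with k = 10; it is not the `cv.glm()` call used in the project, and the names `fold`, `fold_mse`, and `cv_k` are hypothetical.

```r
set.seed(17)
n <- 200
k <- 10
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 - 0.7 * d$x2 + rnorm(n)

# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, compute MSE on the held-out fold
fold_mse <- sapply(seq_len(k), function(j) {
  fit_j <- lm(y ~ x1 + x2, data = d[fold != j, ])
  held  <- d[fold == j, ]
  mean((held$y - predict(fit_j, newdata = held))^2)
})

# k-fold CV estimate: average of the k held-out MSEs
cv_k <- mean(fold_mse)
cv_k
```

Unlike LOOCV, the result depends on the random fold assignment, so a different seed gives a slightly different `cv_k`.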
12. Bias-Variance Trade-off for k-Fold Cross Validation
On the basis of bias reduction, LOOCV is preferred over k-fold CV: k-fold CV trains each model on (k - 1) n / k observations, whereas LOOCV trains each model on n - 1 observations, which is almost the entire data set. However, LOOCV has a higher variance than k-fold CV with k < n. LOOCV averages the outputs of n fitted models that are trained on almost identical sets of observations and are therefore highly correlated, whereas the k fitted models in k-fold CV overlap less and are somewhat less correlated. Because the mean of highly correlated quantities has a higher variance, test error estimates from LOOCV tend to have a higher variance than those from k-fold CV.
13. Conclusion
Backward Selection Multiple Linear Regression Model:
Quality = 4.1747748 - 0.8826331*volatile.acidity - 1.9781343*chlorides + 0.0041786*free.sulfur.dioxide - 0.0037963*total.sulfur.dioxide - 0.5251147*pH + 0.9535727*sulphates + 0.3182497*alcohol
It is interesting to note that backward selection retained only seven of the original eleven predictors. This model shows that sulphates and alcohol have the largest positive effects on predicted quality ratings of red wine in this data set, whereas higher values of chlorides and volatile.acidity have the largest negative effects. With respect to the adjusted R2, Residual Standard Error, and AIC measures, the backward selection model fits the data best among the models considered. After examining the diagnostic plots, we can also conclude that linear regression is a compelling method for analyzing the effects of physicochemical tests on the quality rating of Vinho Verde red wine.
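To make the fitted equation concrete, the sketch below evaluates it for a single wine. The coefficients are copied from the equation above; the helper `predict_quality` and the input values are hypothetical and do not correspond to a row of the data set.

```r
# Coefficients copied from the backward-selection equation above
coefs <- c("(Intercept)"        =  4.1747748,
           volatile.acidity     = -0.8826331,
           chlorides            = -1.9781343,
           free.sulfur.dioxide  =  0.0041786,
           total.sulfur.dioxide = -0.0037963,
           pH                   = -0.5251147,
           sulphates            =  0.9535727,
           alcohol              =  0.3182497)

# Intercept plus the sum of coefficient * value over the named predictors
predict_quality <- function(wine) {
  unname(coefs["(Intercept)"] + sum(coefs[names(wine)] * wine))
}

# Hypothetical wine, not a row from the data set
wine <- c(volatile.acidity = 0.5, chlorides = 0.08, free.sulfur.dioxide = 15,
          total.sulfur.dioxide = 45, pH = 3.3, sulphates = 0.65, alcohol = 10.5)
predict_quality(wine)                     # about 5.70

# Raising alcohol by one percentage point adds the alcohol coefficient
stronger <- wine
stronger["alcohol"] <- 11.5
predict_quality(stronger) - predict_quality(wine)
```

Each coefficient is the change in predicted quality per one-unit change in that variable with the others held fixed, so coefficients of variables measured on very different scales are not directly comparable in size.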
Appendix of R Codes:
mydata = read.csv("winequality-red.csv",header=TRUE)
mydata.pca = read.csv("winequality-red-pca.csv",header=TRUE)
row <- nrow(mydata)
set.seed(12345)
trainindex <- sample(row, 0.6*row, replace=FALSE)
training <- mydata[trainindex,]
validation <- mydata[-trainindex,]
#ValidationSetApproach
require(ISLR)
set.seed(1)
train=sample(1599,960)
lm.fit=lm(quality~.,data=mydata,subset=train)
predict(lm.fit)
rating.predict<-predict(lm.fit)
plot(rating.predict)
attach(mydata)
mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set
#MLR with all 11 predictors (full model) used in forward selection method later
trainfit<- lm(quality~., data=training)
#Output of Regression coefficients for all Predictors
summary(trainfit)
#Model fit
plot(trainfit)
#fitsummary = summary(trainfit)
#fitsummary$r.squared
#fitsummary$adj.r.squared
#fitsummary$sigma #RSE
AIC(trainfit)
BIC(trainfit)
#Principal Component Analysis (PCA)
physiochemattributes=row.names(mydata.pca)
physiochemattributes
names(mydata.pca)
#Means/Variances of variables
apply(mydata.pca, 2, mean)
apply(mydata.pca, 2, var)
#Centers variables to have mean zero and scales them to standard deviation one
pr.out = prcomp(mydata.pca, scale = TRUE)
pr.out
summary(pr.out)
dim(pr.out$x)
pr.out$center
pr.out$scale
pr.out$rotation
#Plot first two principal component
biplot(pr.out, scale=0)
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale=0)
#Check standard deviation and variance explained
pr.out$sdev
pr.var = pr.out$sdev^2
pr.var
#Compute proportion of variance explained PVE
pve= pr.var/sum(pr.var)
pve
#Plot PVE and Cumulative PVE
plot(pve, xlab= "Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
plot(cumsum(pve), xlab= "Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
#PCA Regression
require(pls)
set.seed(1000)
#Split rows without overlap; keep every column so predict() can find all predictors
train <- mydata[1:1279,]
test <- mydata[1280:1599,]
y_test <- test$quality
pcr_model <- pcr(quality~., data = train,scale =TRUE, validation = "CV")
pcr_model
summary(pcr_model)
# Plot the cross-validated root mean squared error
validationplot(pcr_model)
predplot(pcr_model)
coefplot(pcr_model)
# Plot the cross validation MSE
validationplot(pcr_model, val.type="MSEP")
#predict() takes newdata=, not data=; predictions are returned for every number of components
pcr_pred <- predict(pcr_model, newdata=test)
summary(pcr_model)
summary(pcr_pred)
mean(pcr_pred)
require(car)
vif(trainfit)
#remove fixed acidity from model because of multicollinearity
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
plot(trainfit1)
#Run this line only after the backward selection step further down has created backward
anova(trainfit,trainfit1, backward)
#coefficients(trainfit)
#fitted(trainfit)
#anova(trainfit)
#residuals(trainfit)
#vcov(trainfit)
#histogram with 20 bars
Residual<-residuals(trainfit1)
hist(Residual,breaks=20)
#Q-Q plot of residuals (R is case-sensitive, so use Residual as defined above)
qqnorm(Residual, ylab="Residuals", xlab="Normal Scores", main="Residual")
qqline(Residual)
layout(matrix(c(1,2,3,4),2,2))
plot(trainfit)
#Correlation and Scatter Plots for full model (trainfit)
require(corrplot)
M = cor(training)
corrplot(M, method = "number")
corrplot(M, method = "circle")
corrplot.mixed(M)
#Scatterplot
plot(training)
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
plot(forward)
#histogram with 20 bars
Residual1<-residuals(forward)
hist(Residual1,breaks=20)
#Backward Selection Method, actually yields the same model as Forward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
AIC(backward)
AIC(trainfit)
AIC(trainfit1)
#lm.fit= lm(quality~volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data=training)
#attach(training)
#Compute MSE on validation set
#########backward$delta
#Predicting validation data set with model with all predictors
PredTrainfit<-predict(trainfit, validation, se.fit=TRUE)
PredTrainfit
PredTrainfit$residual.scale
#Predicting validation data set with model that does not have fixed.acidity
PredTrainfit1<-predict(trainfit1, validation, se.fit=TRUE)
PredTrainfit1
PredTrainfit1$residual.scale
#Predicting the validation data set based on the backward selection model
PredBack<-predict(backward, validation, se.fit=TRUE)
PredBack
#residual.scale is the residual standard error of the fitted (training) model
PredBack$residual.scale
#Comparing the residual standard deviations, the backward selection model is better than the base model
#Coefficient Plot of Backwards Selection Model
require(coefplot)
coefplot(backward)
#LOOCV
require(boot)
glm.fit=glm(quality~.,data=mydata)
cv.err=cv.glm(mydata,glm.fit)
cv.err$delta
cv.error=rep(0,5)
for (i in 1:5){
glm.fit=glm(quality~.,data=mydata)
cv.error[i]=cv.glm(mydata,glm.fit)$delta[1]
}
cv.error
#cross validation estimate of test MSE is 0.4247
coef(glm.fit)
#K-fold Cross Validation
set.seed(17)
cv.error.10=rep(0,10)
#Each pass overwrites cv.error.10 with a single value, so only the last of the 10 repetitions is kept
for (i in 1:10) {
glm.fit=glm(quality~., data=mydata)
cv.error.10=cv.glm(mydata,glm.fit,K=10)$delta[2]
}
cv.error.10
#k-fold CV estimate of test MSE is 0.4241
References:
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. New York: Springer.
Lander, J. P. (2014). R for Everyone: Advanced Analytics and Graphics. Upper Saddle River, NJ: Addison-Wesley.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. ISSN: 0167-9236. http://dx.doi.org/10.1016/j.dss.2009.05.016 (pre-press PDF: http://www3.dsi.uminho.pt/pcortez/winequality09.pdf; BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib)