
Melissa A. Johnson

STAT 515 - Fall 2016

December 8, 2016

A STATISTICAL MODEL OF WINE QUALITY


Data Set Review:

What are the basic factors that make a wine more preferable or highly rated? Are these factors imaginary or real? These are age-old conundrums about the fermented grape juice the world has come to know and love. I have been skeptical of professional wine ratings ever since reading about a blind taste test in which oenologists could not reliably pick out the most expensive wines. The wine quality data set captured my fascination with this problem.

My objective is to build a multiple linear regression (MLR) model that predicts wine quality from physicochemical attributes. My questions are as follows:

1. Is there a linear relationship between wine ratings (quality) and at least one of the predictor variables?

2. Which physicochemical attribute(s) are the strongest predictors of highly rated wine and poorly rated wine?

3. Which model has the best performance in fitting the data?

Wine ratings or “quality” scores are sensory data, meaning humans assess the quality, whereas the physicochemical values are objective data collected from lab tests. There may be real benefits to measuring the impact of physicochemical attributes on wine quality: it could help wine producers improve the production process and identify target markets to increase profitability. In industry practice, certification and quality assessments to prevent adulteration of wine are also based on these types of data collection and descriptive analysis.

I transformed the raw data set, obtained from http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/, using the text-to-columns feature in Excel to make it suitable for analysis in R.

The data set is related to red variants of the Portuguese "Vinho Verde" wine. Only physicochemical (input) and sensory (output) variables are available (e.g., there is no private data about grape types, wine brand, wine selling price, etc.). There are 1,599 observations with 11 input variables and 1 output variable. Several of the attributes may be correlated, so it makes sense to apply some form of feature selection.


Input Variables (based on physicochemical tests):

1 - fixed acidity (min: 4.60, max: 15.90)

2 - volatile acidity (min: 0.12, max: 1.58)

3 - citric acid (min: 0.00, max: 1.00)

4 - residual sugar (min: 0.90, max: 15.50)

5 - chlorides (min: 0.01, max: 0.61)

6 - free sulfur dioxide (min: 1.00, max: 72.00)

7 - total sulfur dioxide (min: 6.00, max: 289.00)

8 - density (min: 0.99, max: 1.00)

9 - pH (min: 2.74, max: 4.01)

10 - sulphates (min: 0.33, max: 2.00)

11 - alcohol (min: 8.40, max: 14.90)

Output Variable (based on sensory data):

12 - quality (min: 3.00, max: 8.00), score between 0 and 10
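These ranges can be reproduced in R (a minimal sketch, assuming the cleaned winequality-red.csv loaded in the appendix):

mydata = read.csv("winequality-red.csv", header=TRUE)
sapply(mydata, range)  #one column per variable: row 1 = min, row 2 = max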

Red Wine Data Set:

I applied a simple validation set approach to randomly divide the available observations into two parts: a 60% training set and a 40% validation set. The model in Figure 1 is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error, assessed using the mean squared error (MSE), provides an estimate of the test error rate.

> mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set
[1] 0.4374297


In order to see if there is a relationship between quality (the response variable) and the 11 physicochemical attributes (the predictor variables), we need to test the null hypothesis that no predictor has an effect. The hypothesis test can be performed by computing the F-statistic for the full model.

Null hypothesis H0: β1 = β2 = β3 = … = β11 = 0

Alternative hypothesis Ha: at least one βj is non-zero
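The F-statistic used here is the standard overall F-test for an MLR model with p = 11 predictors fit on n training observations:

F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]

where TSS is the total sum of squares and RSS is the residual sum of squares. Under H0 the statistic should be close to 1; values much larger than 1 are evidence against H0.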

1. Multiple Linear Regression Model Using All 11 Predictor Variables

trainfit<- lm(quality~., data=training)

summary(trainfit) #Output of Regression coefficients for all Predictors

Figure 1


The F-statistic for the full model (Figure 1) is 50.26, which is much larger than 1, so we can reject the null hypothesis H0 and conclude that there is a relationship between quality (Y) and at least one of the predictors. Since n is relatively large (n = 1,599), even an F-statistic only modestly greater than 1 could justify rejecting H0; a value of 50.26 is decisive evidence.

2. Diagnostics Plots to Verify Regression Assumptions

Figure 2a

Figure 2b

The “Residuals vs. Fitted” plot (Figure 2a) checks for equal variance. The red line is a smooth fit to the residuals, intended to make any patterns easier to identify.


Due to the strange parallel-lines pattern observed in the plot (an artifact of the quality scores taking only integer values), it is difficult to confirm that the equal variance assumption holds.

The “Normal Q-Q” plot (Figure 2b), which plots the quantiles of the standardized residuals against the quantiles of the standard normal distribution, does not show systematic deviations from normality, since the bulk of the points lie on a straight line.
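Both panels can be reproduced with the built-in diagnostics for lm objects (a minimal sketch):

plot(trainfit, which = 1)  #Residuals vs. Fitted (Figure 2a)
plot(trainfit, which = 2)  #Normal Q-Q (Figure 2b)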

3. Correlation Plots

We can explore the numerical predictors and the response (quality) by creating a correlation table (Figure 3a/3b) between quality and all physicochemical predictors.

Figure 3a Correlation Plots (numerical values)

Figure 3b Correlation Plot (coded by size/color of circles instead of numerical values)
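A sketch of how these plots can be generated with the corrplot package (consistent with the appendix code):

require(corrplot)
M = cor(training)               #correlation matrix of all 12 variables
corrplot(M, method = "number")  #Figure 3a: numerical values
corrplot(M, method = "circle")  #Figure 3b: circles coded by size/color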


Alcohol seems to be the best single predictor of quality, with a correlation of 0.49.

4. Variance Inflation Factor (VIF)

In addition to inspecting the correlation matrix, I computed the VIF (Figure 4) for all predictors. There is typically a small amount of collinearity among predictors, so I am concerned with values of roughly 5 to 10 or above, which could indicate a problematic amount of collinearity.

Figure 4

Both the correlation matrix and the VIF calculation confirm there is a multicollinearity problem between fixed.acidity and two (2) other predictors: citric.acid and density.
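The VIF values in Figure 4 come from the car package (as in the appendix):

require(car)
vif(trainfit)  #values well above 5 flag predictors worth investigating for collinearity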

5. Multiple Linear Regression Model Without fixed.acidity Predictor Variable

trainfit1= update(trainfit, ~.-fixed.acidity, data = training)

summary(trainfit1)

I updated the MLR model (Figure 5) to remove fixed.acidity, and the new model fits the data better (a larger F-statistic and a lower residual standard error).


Figure 5

Although the MLR model without fixed.acidity fits the data better, we can also see relatively large individual p-values on some of the predictor variables, such as residual.sugar, density, and citric.acid.

6. Principal Component Analysis

The response variable, quality, appears to be related to only a subset of the predictors.

We can use a dimensionality reduction method, PCA, for exploratory analysis and to produce derived variables (principal components) for use in a supervised learning method. PCA reduces the total set of numerical variables by removing the overlap of information between them. The new variables are linear combinations (weighted averages) of the original variables. These linear combinations are uncorrelated, thus potentially correcting the problem of multicollinearity.

#Center variables to have mean = 0 and normalize to standard deviation = 1
> pr.out = prcomp(mydata.pca, scale = TRUE)

The rotation matrix provides the principal component loadings: each of its columns contains one principal component loading vector, which is the measure of interest.

> pr.out$rotation

Figure 6

Figure 6a


Figure 6b Plot of First Two Principal Components

We infer that the first principal component corresponds to a measure of total.sulfur.dioxide and alcohol. Similarly, it can be said that the second component corresponds to a measure of pH and fixed.acidity.

Figure 6a shows that the first principal component explains 28.2% of the variance, the second component explains 17.5%, the third component explains 14.1%, and so on. To choose the number of components for a principal component regression model, we need to look at a scree plot. A scree plot is used to assess which components explain the most variability in the data by displaying their variance shares in descending order.
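The PVE values behind Figures 6c and 6d can be computed from the component standard deviations (as in the appendix):

pr.var = pr.out$sdev^2       #variance explained by each component
pve = pr.var/sum(pr.var)     #proportion of variance explained (PVE)
plot(pve, type="b")          #scree plot of PVE (Figure 6c)
plot(cumsum(pve), type="b")  #cumulative PVE (Figure 6d)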


Figure 6c Scree Plot of PVE

Figure 6d Scree Plot of Cumulative PVE (Sum = 1)

It appears that nine principal components are needed to reach a cumulative variance of close to 98%. In this case, PCA could do no better than reducing the number of predictors from 11 to 9 without compromising the proportion of explained variance. After checking both the scree plot of PVE (Figure 6c) and the cumulative PVE (Figure 6d), we see that PCA did not yield a small enough number of principal components to give a compact understanding of the data.


7. Forward and Backward Selection MLR Model

We know from the previous exploration of the data set that the response variable, quality, is related to only a subset of the predictors. In order to determine which predictors are associated with the response, we need to fit a single model involving only those predictors. We cannot consider all 2^p models with p = 11; that is 2,048 models! Therefore, we can determine the best model using variable selection. The two classical approaches, forward selection and backward selection, yield a more efficient and automated way to choose a smaller set of models to consider.

#Forward Selection Method
null<-lm(quality~1, data=training)

forward<-step(null, scope=list(upper=trainfit), direction='forward')

summary(forward)

Figure 7a


#Backward Selection Method
backward<-step(trainfit, direction='backward')

summary(backward)

Figure 7b

The forward selection method begins with the null model and then adds, one at a time, the predictor that results in the lowest RSS, whereas the backward selection method performs the opposite optimization, starting with all the predictors and then eliminating the variables with the largest p-values. Although these techniques differ in their optimization strategy, it is interesting to point out that in this instance both methods fit the same MLR model (Figure 7a and Figure 7b).


8. Diagnostics Plots to Verify Regression Assumptions

I plotted the “Residuals vs. Fitted” and “Normal Q-Q plot” to search for any patterns that might violate the regression assumptions (Figure 8a and Figure 8b).

Figure 8a

Figure 8b

This is very similar to the “Residuals vs. Fitted” plot for the full model. The plot for the forward/backward model (Figure 8a) shows the same odd pattern in the residuals, so we again cannot confirm that the equal (constant) variance assumption is intact. The “Normal Q-Q plot” (Figure 8b) does not show systematic deviations from normality, since the bulk of the points lie on a relatively straight line. Another important residual plot is the histogram of the residuals (Figure 8c), which shows a normally distributed bell-shaped curve.


Figure 8c

Figure 8d

We can also visualize the regression results with the coefficient plot (Figure 8d), where each coefficient is plotted as a point, with a thick blue line representing a one-standard-error confidence interval and a vertical gray dotted line indicating 0. If the confidence interval does not contain 0, the coefficient is statistically significant. Here we can see that alcohol and sulphates have the largest effects on the quality of the wine.
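Figure 8d can be generated with the coefplot package (as in the appendix):

require(coefplot)
coefplot(backward)  #points are estimates; bars show the standard-error intervals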


9. Fitting the Backward Selection Model on the Validation Data Set

> PredBack$residual.scale
[1] 0.6540796

Figure 9

> AIC(backward)

[1] 1917.249

> AIC(trainfit)

[1] 1923.308

> AIC(trainfit1)

[1] 1921.749

Another metric we can use to compare the performance of the models (trainfit, the full model; trainfit1, the full model minus fixed.acidity; and the model obtained from the backward selection method) is the Akaike Information Criterion (AIC). The AIC value of backward, 1917.249, is smaller than those of trainfit and trainfit1, at 1923.308 and 1921.749, so the model obtained with the backward selection method fits the data best.
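For reference (the standard definition, not specific to this project), for a model with k estimated parameters and maximized likelihood L, AIC = 2k - 2 ln(L); lower values indicate a better trade-off between goodness of fit and model complexity.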


10. Leave-One-Out-Cross-Validation Method (LOOCV)

The validation set approach from part 1 has its drawbacks. Instead, LOOCV can be used: the model is fit on n - 1 training observations and a prediction is made for the single excluded observation, repeating this for every observation. The resulting average MSE provides an approximately unbiased estimate of the test error; however, each individual estimate is highly variable since it is based on a single excluded observation.

A major advantage of the LOOCV method over the validation set approach is that it has far less bias. The statistical learning model is repeatedly trained on sets containing n - 1 observations, almost the entire data set, in contrast to the validation set approach, which here uses only 60% of the original data set to train the model.

The LOOCV approach also tends not to overestimate the test error rate as much as the validation set approach. And instead of yielding different results due to randomness in the training/validation split, performing LOOCV multiple times will always yield the same results, because there is no randomness in how the observations are held out.
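Formally (the standard definition), the LOOCV estimate averages the n single-observation squared errors:

CV(n) = (1/n) * sum over i = 1, ..., n of (y_i - yhat_i)^2

where yhat_i is the prediction for observation i from the model fit on the other n - 1 observations.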

#LOOCV estimate for the test MSE is approximately 0.4248
> cv.err$delta
[1] 0.4247753 0.4247728

11. k-Fold Cross-Validation

I used the k-fold cross-validation method as an alternative to LOOCV. The approach randomly divides the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as the validation set, the model is fit on the remaining k - 1 folds, and the MSE is computed on the observations in the held-out fold. The procedure is repeated k times, each time with a different fold treated as the validation set.

Although there is variability in the k-fold test error estimates, it is often lower than the variability that results from the validation set approach.
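The k-fold estimate averages the k per-fold errors (standard definition): CV(k) = (1/k) * (MSE_1 + ... + MSE_k), where MSE_i is the mean squared error computed on the i-th held-out fold.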

#k-Fold CV estimate for the test MSE is approximately 0.4241
cv.error.10
[1] 0.42409

12. Bias-Variance Trade-off for k-Fold Cross Validation

On the basis of bias reduction, LOOCV is preferred over the k-fold CV method, since k-fold CV trains the model on (k - 1)n/k observations, in contrast to the LOOCV method, which trains the model on n - 1 observations, almost the entire data set. However, LOOCV has higher variance than k-fold CV with k < n: LOOCV averages n fitted models that are trained on nearly identical data and are therefore highly correlated with one another, whereas k-fold CV averages k fitted models that are somewhat less correlated. Because the mean of highly correlated quantities has higher variance, test error estimates from LOOCV tend to have higher variance than those from k-fold CV.

13. Conclusion

Backward Selection Multiple Linear Regression Model:

quality = 4.1747748 - 0.8826331*volatile.acidity - 1.9781343*chlorides + 0.0041786*free.sulfur.dioxide - 0.0037963*total.sulfur.dioxide - 0.5251147*pH + 0.9535727*sulphates + 0.3182497*alcohol

It is interesting to note that the backward selection method kept only seven of the original eleven predictors. The model shows that sulphates and alcohol have the largest effects in predicting higher quality ratings of the red wine in this data set, whereas higher physicochemical measures of chlorides and volatile.acidity have the largest effects in predicting lower quality ratings. With respect to the adjusted R2, residual standard error, and AIC measures, we can determine that the backward selection model best fits the data. Having examined many diagnostic graphs and plots, we can also conclude that linear regression is a compelling method for analyzing the effects of physicochemical tests on the quality rating of Vinho Verde red wine.
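As a usage sketch, the fitted backward model can score a new wine directly in R (the input values below are hypothetical, chosen only for illustration):

#Hypothetical wine, with values inside the observed ranges
newwine = data.frame(volatile.acidity=0.5, chlorides=0.08, free.sulfur.dioxide=15, total.sulfur.dioxide=45, pH=3.3, sulphates=0.65, alcohol=10.5)
predict(backward, newdata=newwine)  #predicted quality score, roughly 5.7 here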


Appendix of R Code:

mydata = read.csv("winequality-red.csv",header=TRUE)

mydata.pca = read.csv("winequality-red-pca.csv",header=TRUE)

row <- nrow(mydata)

set.seed(12345)

trainindex <- sample(row, 0.6*row, replace=FALSE)

training <- mydata[trainindex,]

validation <- mydata[-trainindex,]

#ValidationSetApproach

require(ISLR)

set.seed(1)

train=sample(1599,960)

lm.fit=lm(quality~.,data=mydata,subset=train)

predict(lm.fit)

rating.predict<-predict(lm.fit)

plot(rating.predict)

attach(mydata)

mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set

#MLR with all 11 predictors (full model), used in forward selection method later

trainfit<- lm(quality~., data=training)

#Output of Regression coefficients for all Predictors

summary(trainfit)

#Model fit

plot(trainfit)

#fitsummary = summary(trainfit)

#fitsummary$r.squared

#fitsummary$adj.r.squared

#fitsummary$sigma #RSE

AIC(trainfit)

BIC(trainfit)

#Principal Component Analysis (PCA)

physiochemattributes=row.names(mydata.pca)

physiochemattributes

names(mydata.pca)

#Means/Variances of variables

apply(mydata.pca, 2, mean)

apply(mydata.pca, 2, var)

#Center variables to have mean zero and set variables to standard deviation one

pr.out = prcomp(mydata.pca, scale = TRUE)

pr.out

summary(pr.out)

dim(pr.out$x)

pr.out$center

pr.out$scale


pr.out$rotation

#Plot first two principal component

biplot(pr.out, scale=0)

pr.out$rotation = -pr.out$rotation

pr.out$x = -pr.out$x

biplot(pr.out, scale=0)

#Check standard deviation and variance explained

pr.out$sdev

pr.var = pr.out$sdev^2

pr.var

#Compute proportion of variance explained PVE

pve= pr.var/sum(pr.var)

pve

#Plot PVE and Cumulative PVE

plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")

#PCA Regression

require(pls)

set.seed(1000)

train <- mydata[1:1279,]
y_test <- mydata[1280:1599, "quality"]  #quality is the 12th column, not the 1st
test <- mydata[1280:1599, ]             #hold out the remaining rows for testing

pcr_model <- pcr(quality~., data = train,scale =TRUE, validation = "CV")

pcr_model

summary(pcr_model)

# Plot the root mean squared error

validationplot(pcr_model)

predplot(pcr_model)

coefplot(pcr_model)

# Plot the cross validation MSE

validationplot(pcr_model, val.type="MSEP")

#Predict on the held-out rows (newdata=, not data=), using 9 components per the scree plots
pcr_pred <- predict(pcr_model, newdata = test, ncomp = 9)
mean((pcr_pred - y_test)^2)  #test MSE for the PCR model

require(car)

vif(trainfit)

#remove fixed acidity from model because of multicollinearity

trainfit1= update(trainfit, ~.-fixed.acidity, data = training)

summary(trainfit1)


plot(trainfit1)

anova(trainfit,trainfit1, backward)

#coefficients(trainfit)

#fitted(trainfit)

#anova(trainfit)

#residuals(trainfit)

#vcov(trainfit)

#histogram with 20 bars

Residual<-residuals(trainfit1)

hist(Residual,breaks=20)

#Q-Q plot with residuals
qqnorm(Residual, ylab="Standardized Residuals", xlab="Normal Scores", main="Residual")
qqline(Residual)

layout(matrix(c(1,2,3,4),2,2))

plot(trainfit)

#Correlation and Scatter Plots for full model (trainfit)

require(corrplot)

M = cor(training)

corrplot(M, method = "number")

corrplot(M, method = "circle")

corrplot.mixed(M)

#Scatterplot

plot(training)

#Forward Selection Method

null<-lm(quality~1, data=training)

forward<-step(null, scope=list(upper=trainfit), direction='forward')

summary(forward)

plot(forward)

#histogram with 20 bars

Residual1<-residuals(forward)

hist(Residual1,breaks=20)

#Backward Selection Method; actually yields the same model as Forward Selection Method

backward<-step(trainfit, direction='backward')

summary(backward)

AIC(backward)

AIC(trainfit)

AIC(trainfit1)

#lm.fit= lm(quality~volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data=training)

#attach(training)

#Compute MSE on validation set

#########backward$delta

#Predicting validation data set with model with all predictors


PredTrainfit<-predict(trainfit, validation, se.fit=TRUE)

PredTrainfit

PredTrainfit$residual.scale

#Predicting validation data set with the model that does not have fixed.acidity

PredTrainfit1<-predict(trainfit1, validation, se.fit=TRUE)

PredTrainfit1

PredTrainfit1$residual.scale

#Predicting the validation data set based on the backward selection model

PredBack<-predict(backward, validation, se.fit=TRUE)

PredBack

PredBack$residual.scale

#Comparing the validation residual standard deviations, the backward selection model is better than the base model

#Coefficient Plot of Backwards Selection Model

require(coefplot)

coefplot(backward)

#LOOCV

require(boot)

glm.fit=glm(quality~.,data=mydata)

cv.err=cv.glm(mydata,glm.fit)

cv.err$delta

#Note: each iteration refits the same model; since LOOCV is deterministic, all five values are identical
cv.error=rep(0,5)
for (i in 1:5){
glm.fit=glm(quality~.,data=mydata)
cv.error[i]=cv.glm(mydata,glm.fit)$delta[1]
}
cv.error

#cross-validation estimate for the test MSE is approximately 0.4248

coef(glm.fit)

#K-fold Cross Validation

set.seed(17)

cv.error.10=rep(0,10)
#Store each run's estimate; values differ only through the random fold assignments
for (i in 1:10) {
glm.fit=glm(quality~., data=mydata)
cv.error.10[i]=cv.glm(mydata,glm.fit,K=10)$delta[2]
}

cv.error.10

#k-fold CV estimate for the test MSE is approximately 0.4241


References:

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. New York: Springer.

Lander, J. P. (2014). R for Everyone: Advanced Analytics and Graphics. Upper Saddle River, NJ: Addison-Wesley.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. ISSN: 0167-9236. http://dx.doi.org/10.1016/j.dss.2009.05.016. Pre-press: http://www3.dsi.uminho.pt/pcortez/winequality09.pdf. BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib