Predicting Diamond Price: 2 Step Method

Sarajit Poddar
31 July 2015

    Contents

    1 Objective
    2 Algorithm development & Testing
      2.1 Initial setup
      2.2 Exploring the data
      2.3 Model development
      2.4 Developing Predictive Models (Decision Tree)
      2.5 Developing Predictive Models (Random Forest)
      2.6 Predicting the price in the Test dataset
      2.7 Analysing the Residuals

    1 Objective

    The objective of this article is to explore machine learning algorithms for classifying diamonds into various cost buckets based on their characteristics.

    2 Algorithm development & Testing

    2.1 Initial setup

    2.1.1 Load libraries

    Libraries for data cleansing, tidying and transformation, plus plotting libraries:

    # Load required libraries
    library(dplyr); library(tidyr); library(ggplot2)

    Specialised libraries for machine learning

    # Load required libraries
    library(caret); library(randomForest)
    library(rattle); library(rpart.plot)
    # Set seed so that the results are reproducible
    set.seed(1000)


    2.1.2 Subsetting the dataset

    Subsetting step 1: Price Range

    # Load the diamonds dataset
    data(diamonds)
    # Price range
    price.low
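    The subsetting code is truncated here. Based on the factor levels that appear later (fprice levels 2–10, fcarat levels 3–16, and a caret-based training/test split), it was likely along these lines; the exact price bounds, bucket widths, and split proportion are assumptions:

    ```r
    # Assumed reconstruction -- bucket boundaries and split ratio are guesses
    library(dplyr); library(ggplot2); library(caret)
    data(diamonds)

    # Keep a low-to-mid price range (assumed bounds)
    price.low  <- 500
    price.high <- 5000
    dataset <- diamonds %>% filter(price > price.low, price <= price.high)

    # Bucket price and carat into factors (assumed bucket widths)
    dataset <- dataset %>%
      mutate(fprice = factor(ceiling(price / 500)),
             fcarat = factor(ceiling(carat / 0.1)))

    # Split into training and test sets with caret (assumed 75/25 split)
    inTrain  <- createDataPartition(dataset$fprice, p = 0.75, list = FALSE)
    training <- dataset[inTrain, ]
    testing  <- dataset[-inTrain, ]
    ```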

    ## $ clarity: Ord.factor w/ 8 levels "I1"

    2.2 Exploring the data

    2.2.1 Price distribution in the dataset

    # Histogram of price distribution
    qplot(fprice, data=dataset, geom="histogram")

    [Figure: histogram of fprice — count (0–900) against fprice levels 2–10]

    # Histogram of carat distribution
    qplot(fcarat, data=dataset, geom="histogram")

    [Figure: histogram of fcarat — count (0–900) against fcarat levels 3–16]

    # Association of price with carat and clarity
    g

    [Figure: fprice (2–10) against fcarat (3–16), coloured by clarity: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF]
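    The plotting code for `g` is truncated above; a plausible sketch of the association plot, inferred from the figure (fprice against fcarat, coloured by clarity). The aesthetics and the bucketing step are assumptions:

    ```r
    # Assumed sketch of the association plot; exact aesthetics are guesses
    library(dplyr); library(ggplot2)
    data(diamonds)
    dataset <- diamonds %>%
      filter(price > 500, price <= 5000) %>%            # assumed price window
      mutate(fprice = factor(ceiling(price / 500)),     # assumed bucket widths
             fcarat = factor(ceiling(carat / 0.1)))
    g <- ggplot(dataset, aes(x = fcarat, y = fprice, colour = clarity)) +
      geom_jitter(alpha = 0.3)
    print(g)
    ```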

    2.3 Model development

    # Tidy up the dataset used for development of the model
    dataset.pr

    2.4 Developing Predictive Models (Decision Tree)

    modFit
    ## 5 0 0 0 0 0 0 0 0 0
    ## 6 0 11 97 306 323 200 87 34 14
    ## 7 0 0 0 0 0 0 0 0 0
    ## 8 0 0 0 0 0 0 0 0 0
    ## 9 0 0 2 13 29 104 196 324 310
    ## 10 0 0 0 0 0 0 0 0 0
    ##
    ## Overall Statistics
    ##
    ## Accuracy : 0.4024
    ## 95% CI : (0.3861, 0.4189)
    ## No Information Rate : 0.2209
    ## P-Value [Acc > NIR] : < 2.2e-16
    ##
    ## Kappa : 0.2939
    ## Mcnemar's Test P-Value : NA
    ##
    ## Statistics by Class:
    ##
    ## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
    ## Sensitivity 0.000000 0.9858 0.0000 0.0000 0.84334 0.00000
    ## Specificity 1.000000 0.7469 1.0000 1.0000 0.76001 1.00000
    ## Pos Pred Value NaN 0.5248 NaN NaN 0.30131 NaN
    ## Neg Pred Value 0.998573 0.9946 0.8291 0.8687 0.97533 0.91067
    ## Prevalence 0.001427 0.2209 0.1709 0.1313 0.10930 0.08933
    ## Detection Rate 0.000000 0.2178 0.0000 0.0000 0.09218 0.00000
    ## Detection Prevalence 0.000000 0.4150 0.0000 0.0000 0.30594 0.00000
    ## Balanced Accuracy 0.500000 0.8663 0.5000 0.5000 0.80168 0.50000
    ## Class: 8 Class: 9 Class: 10
    ## Sensitivity 0.00000 0.90251 0.00000
    ## Specificity 1.00000 0.79205 1.00000
    ## Pos Pred Value NaN 0.33129 NaN
    ## Neg Pred Value 0.91838 0.98614 0.90725
    ## Prevalence 0.08162 0.10245 0.09275
    ## Detection Rate 0.00000 0.09247 0.00000
    ## Detection Prevalence 0.00000 0.27911 0.00000
    ## Balanced Accuracy 0.50000 0.84728 0.50000
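    The definition of `modFit` is truncated above. Given that caret and rpart.plot are loaded and a confusionMatrix is printed, the decision-tree step was likely along these lines; the predictor set, the bucketing, and the absence of tuning are assumptions:

    ```r
    # Assumed sketch: classify the price bucket with a decision tree via caret
    library(dplyr); library(ggplot2); library(caret)
    data(diamonds)
    dataset <- diamonds %>%
      filter(price > 500, price <= 5000) %>%           # assumed price window
      mutate(fprice = factor(ceiling(price / 500)),    # assumed bucket width
             fcarat = factor(ceiling(carat / 0.1)))
    inTrain  <- createDataPartition(dataset$fprice, p = 0.75, list = FALSE)
    training <- dataset[inTrain, ]

    modFit <- train(fprice ~ fcarat + cut + clarity + color,
                    data = training, method = "rpart")

    # In-sample accuracy on the training set
    pred.train <- predict(modFit, newdata = training)
    confusionMatrix(pred.train, training$fprice)
    ```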

    pred.test

    ## 6 0 8 41 125 134 80 28 12 6
    ## 7 0 0 0 0 0 0 0 0 0
    ## 8 0 0 0 0 0 0 0 0 0
    ## 9 0 0 0 5 15 46 94 140 132
    ## 10 0 0 0 0 0 0 0 0 0
    ##
    ## Overall Statistics
    ##
    ## Accuracy : 0.3991
    ## 95% CI : (0.3741, 0.4244)
    ## No Information Rate : 0.2213
    ## P-Value [Acc > NIR] : < 2.2e-16
    ##
    ## Kappa : 0.2892
    ## Mcnemar's Test P-Value : NA
    ##
    ## Statistics by Class:
    ##
    ## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
    ## Sensitivity 0.000000 0.9758 0.0000 0.000 0.81707 0.00000
    ## Specificity 1.000000 0.7365 1.0000 1.000 0.77477 1.00000
    ## Pos Pred Value NaN 0.5127 NaN NaN 0.30876 NaN
    ## Neg Pred Value 0.998663 0.9908 0.8289 0.869 0.97175 0.91043
    ## Prevalence 0.001337 0.2213 0.1711 0.131 0.10963 0.08957
    ## Detection Rate 0.000000 0.2159 0.0000 0.000 0.08957 0.00000
    ## Detection Prevalence 0.000000 0.4211 0.0000 0.000 0.29011 0.00000
    ## Balanced Accuracy 0.500000 0.8562 0.5000 0.500 0.79592 0.50000
    ## Class: 8 Class: 9 Class: 10
    ## Sensitivity 0.00000 0.91503 0.00000
    ## Specificity 1.00000 0.78258 1.00000
    ## Pos Pred Value NaN 0.32407 NaN
    ## Neg Pred Value 0.91845 0.98778 0.90775
    ## Prevalence 0.08155 0.10227 0.09225
    ## Detection Rate 0.00000 0.09358 0.00000
    ## Detection Prevalence 0.00000 0.28877 0.00000
    ## Balanced Accuracy 0.50000 0.84880 0.50000

    2.5 Developing Predictive Models (Random Forest)

    2.5.1 Model definition

    modFit

    2.5.2.1 Training set accuracy (In-Sample)

    ## Confusion Matrix and Statistics
    ##
    ## Reference
    ## Prediction 2 3 4 5 6 7 8 9 10
    ## 2 5 0 0 0 0 0 0 0 0
    ## 3 0 774 1 0 0 0 0 0 0
    ## 4 0 0 598 0 0 0 0 0 0
    ## 5 0 0 0 460 0 0 0 0 0
    ## 6 0 0 0 0 382 0 0 0 0
    ## 7 0 0 0 0 1 313 1 0 0
    ## 8 0 0 0 0 0 0 285 0 0
    ## 9 0 0 0 0 0 0 0 359 0
    ## 10 0 0 0 0 0 0 0 0 325
    ##
    ## Overall Statistics
    ##
    ## Accuracy : 0.9991
    ## 95% CI : (0.9975, 0.9998)
    ## No Information Rate : 0.2209
    ## P-Value [Acc > NIR] : < 2.2e-16
    ##
    ## Kappa : 0.999
    ## Mcnemar's Test P-Value : NA
    ##
    ## Statistics by Class:
    ##
    ## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
    ## Sensitivity 1.000000 1.0000 0.9983 1.0000 0.9974 1.00000
    ## Specificity 1.000000 0.9996 1.0000 1.0000 1.0000 0.99937
    ## Pos Pred Value 1.000000 0.9987 1.0000 1.0000 1.0000 0.99365
    ## Neg Pred Value 1.000000 1.0000 0.9997 1.0000 0.9997 1.00000
    ## Prevalence 0.001427 0.2209 0.1709 0.1313 0.1093 0.08933
    ## Detection Rate 0.001427 0.2209 0.1707 0.1313 0.1090 0.08933
    ## Detection Prevalence 0.001427 0.2212 0.1707 0.1313 0.1090 0.08990
    ## Balanced Accuracy 1.000000 0.9998 0.9992 1.0000 0.9987 0.99969
    ## Class: 8 Class: 9 Class: 10
    ## Sensitivity 0.99650 1.0000 1.00000
    ## Specificity 1.00000 1.0000 1.00000
    ## Pos Pred Value 1.00000 1.0000 1.00000
    ## Neg Pred Value 0.99969 1.0000 1.00000
    ## Prevalence 0.08162 0.1025 0.09275
    ## Detection Rate 0.08134 0.1025 0.09275
    ## Detection Prevalence 0.08134 0.1025 0.09275
    ## Balanced Accuracy 0.99825 1.0000 1.00000

    pred.test

    ## Confusion Matrix and Statistics
    ##
    ## Reference
    ## Prediction 2 3 4 5 6 7 8 9 10
    ## 2 0 0 0 0 0 0 0 0 0
    ## 3 2 311 20 1 0 0 0 0 0
    ## 4 0 16 208 35 0 0 0 0 0
    ## 5 0 4 26 120 40 7 0 1 0
    ## 6 0 0 2 34 85 36 6 1 1
    ## 7 0 0 0 3 34 57 19 8 0
    ## 8 0 0 0 3 4 22 46 15 7
    ## 9 0 0 0 0 0 10 45 80 42
    ## 10 0 0 0 0 1 2 6 48 88
    ##
    ## Overall Statistics
    ##
    ## Accuracy : 0.6651
    ## 95% CI : (0.6406, 0.689)
    ## No Information Rate : 0.2213
    ## P-Value [Acc > NIR] : < 2.2e-16
    ##
    ## Kappa : 0.6097
    ## Mcnemar's Test P-Value : NA
    ##
    ## Statistics by Class:
    ##
    ## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
    ## Sensitivity 0.000000 0.9396 0.8125 0.61224 0.51829 0.42537
    ## Specificity 1.000000 0.9803 0.9589 0.94000 0.93994 0.95301
    ## Pos Pred Value NaN 0.9311 0.8031 0.60606 0.51515 0.47107
    ## Neg Pred Value 0.998663 0.9828 0.9612 0.94145 0.94065 0.94400
    ## Prevalence 0.001337 0.2213 0.1711 0.13102 0.10963 0.08957
    ## Detection Rate 0.000000 0.2079 0.1390 0.08021 0.05682 0.03810
    ## Detection Prevalence 0.000000 0.2233 0.1731 0.13235 0.11029 0.08088
    ## Balanced Accuracy 0.500000 0.9599 0.8857 0.77612 0.72912 0.68919
    ## Class: 8 Class: 9 Class: 10
    ## Sensitivity 0.37705 0.52288 0.63768
    ## Specificity 0.96288 0.92777 0.95803
    ## Pos Pred Value 0.47423 0.45198 0.60690
    ## Neg Pred Value 0.94568 0.94466 0.96299
    ## Prevalence 0.08155 0.10227 0.09225
    ## Detection Rate 0.03075 0.05348 0.05882
    ## Detection Prevalence 0.06484 0.11832 0.09693
    ## Balanced Accuracy 0.66997 0.72532 0.79785
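    As with the decision tree, the random-forest definition of `modFit` is truncated. A hedged sketch using the loaded randomForest package; the predictor set, tree count, and the names `training`/`testing` (the splits from section 2.1.2) are assumptions:

    ```r
    # Assumed sketch: random forest on the same bucket-classification task.
    # `training` and `testing` are assumed to be the train/test splits
    # constructed in section 2.1.2.
    library(randomForest); library(caret)
    set.seed(1000)
    modFit <- randomForest(fprice ~ fcarat + cut + clarity + color,
                           data = training, ntree = 500)

    # Out-of-sample accuracy on the test set
    pred.test <- predict(modFit, newdata = testing)
    confusionMatrix(pred.test, testing$fprice)
    ```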

    2.6 Predicting the price in the Test dataset

    2.6.1 Determine the fitted model using the price range as one of the predictors

    # Fitted model
    fitted.model <- lm(price ~ fcarat + cut + clarity + color + table +
                       y + z + fprice, data = dataset.pr)

    # Summary of the fitted model
    summary(fitted.model)

    ##
    ## Call:
    ## lm(formula = price ~ fcarat + cut + clarity + color + table +
    ## y + z + fprice, data = dataset.pr)
    ##
    ## Residuals:
    ## Min 1Q Median 3Q Max
    ## -478.00 -91.32 -0.86 90.29 653.91
    ##
    ## Coefficients:
    ## Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 692.7184 121.8809 5.684 1.39e-08 ***
    ## fcarat.L 63.6436 83.7364 0.760 0.447263
    ## fcarat.Q -207.6135 64.4114 -3.223 0.001276 **
    ## fcarat.C -59.1854 59.6963 -0.991 0.321518
    ## fcarat^4 22.1916 53.7768 0.413 0.679872
    ## fcarat^5 -35.8988 45.3522 -0.792 0.428657
    ## fcarat^6 -2.3563 36.5826 -0.064 0.948646
    ## fcarat^7 -25.9610 31.6190 -0.821 0.411654
    ## fcarat^8 -1.1780 29.5570 -0.040 0.968211
    ## fcarat^9 -13.4611 25.6953 -0.524 0.600390
    ## fcarat^10 -15.2370 19.4604 -0.783 0.433680
    ## fcarat^11 -29.6520 13.2595 -2.236 0.025378 *
    ## fcarat^12 -17.7874 8.5098 -2.090 0.036649 *
    ## fcarat^13 -10.1020 6.2293 -1.622 0.104931
    ## cut.L 24.6540 7.0539 3.495 0.000478 ***
    ## cut.Q 2.7526 5.7337 0.480 0.631200
    ## cut.C 5.7071 5.1496 1.108 0.267803
    ## cut^4 -5.7487 4.3535 -1.320 0.186742
    ## clarity.L 370.0914 14.4251 25.656 < 2e-16 ***
    ## clarity.Q -79.9797 10.2724 -7.786 8.37e-15 ***
    ## clarity.C 57.5250 8.2458 6.976 3.43e-12 ***
    ## clarity^4 -36.7786 6.6591 -5.523 3.50e-08 ***
    ## clarity^5 1.7895 5.5359 0.323 0.746521
    ## clarity^6 17.2801 5.0240 3.439 0.000588 ***
    ## clarity^7 14.7822 4.6156 3.203 0.001370 **
    ## color.L -172.2569 8.0063 -21.515 < 2e-16 ***
    ## color.Q -27.3245 5.8839 -4.644 3.51e-06 ***
    ## color.C -13.3488 5.3907 -2.476 0.013309 *
    ## color^4 4.9552 5.0367 0.984 0.325251
    ## color^5 -3.7985 4.6906 -0.810 0.418089
    ## color^6 6.2334 4.2783 1.457 0.145190
    ## table -0.9325 0.9375 -0.995 0.319938
    ## y 294.9346 16.8682 17.485 < 2e-16 ***
    ## z 119.8985 17.1153 7.005 2.79e-12 ***
    ## fprice.L 3103.1795 31.1688 99.560 < 2e-16 ***
    ## fprice.Q 238.9273 26.0004 9.189 < 2e-16 ***
    ## fprice.C -126.1668 21.6234 -5.835 5.73e-09 ***
    ## fprice^4 45.2983 15.5962 2.904 0.003695 **
    ## fprice^5 -53.8780 10.0269 -5.373 8.08e-08 ***


    ## fprice^6 40.1292 6.7176 5.974 2.48e-09 ***
    ## fprice^7 -14.4412 5.6136 -2.573 0.010125 *
    ## fprice^8 3.7922 5.3327 0.711 0.477048
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 122.7 on 4958 degrees of freedom
    ## Multiple R-squared: 0.9897, Adjusted R-squared: 0.9897
    ## F-statistic: 1.167e+04 on 41 and 4958 DF, p-value: < 2.2e-16

    2.6.2 Determine the price range

    # Predicting the price range
    dataset.predict
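    This is the "two-step" part of the method: first predict the price bucket with the classifier, then feed the predicted bucket into the linear model as `fprice`. The code above is truncated; a hedged sketch, where `modFit` is the bucket classifier, `fitted.model` the linear model above, and `testing` the test split from section 2.1.2 (the object names not shown are assumptions):

    ```r
    # Assumed sketch of the two-step prediction on the test set
    dataset.predict <- testing
    # Step 1: predict the price bucket with the classifier
    dataset.predict$fprice <- predict(modFit, newdata = testing)
    # Step 2: predict the price, using the predicted bucket as a predictor
    dataset.predict$pred.price <- predict(fitted.model, newdata = dataset.predict)
    ```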

    2.7 Analysing the Residuals

    geom_point(size=2, colour="salmon", alpha = 0.2) +
      xlab("Fitted value") +
      ylab("Residual") +
      geom_smooth(method="loess", colour="red", lwd=1)

    [Figure: residuals (-500 to 500) against fitted values (1000–5000), with a loess smooth]

    2.7.2 Plotting the predicted data with actual data

    g

    [Figure: Actual Price against Predicted Price (1000–5000 on both axes)]

    2.7.3 Plotting the difference between actuals and the prediction

    # Determine the difference between prediction and actuals
    x

    lines(xfit, yfit, col="red", lwd=2)
    # Add legend
    legend('topright', c("Mean", "Density Curve", "Normal Curve"),
           lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))

    [Figure: histogram "Difference between actuals and prediction" — Frequency against Price difference (-600 to 400), with mean line, density curve and normal curve]

    2.7.4 Plotting the difference between actuals and the prediction (in percentage)

    # Determine the difference between prediction and actuals
    x

    lines(mydensity, col="blue", lwd=2)
    # Plotting the normal curve with the same mean and standard deviation
    xfit
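    The plotting code in sections 2.7.3 and 2.7.4 is truncated; the surviving fragments (a histogram, `mydensity`, `xfit`/`yfit`, and the legend call) suggest the classic base-R pattern below. The difference vector is simulated here as a stand-in, and any variable not present in the fragments is an assumption:

    ```r
    # Assumed reconstruction of the difference histogram with overlaid curves.
    # `x` stands in for the actual-minus-predicted price differences.
    set.seed(1000)
    x <- rnorm(1000, mean = 0, sd = 150)
    h <- hist(x, breaks = 40, col = "grey",
              main = "Difference between actuals and prediction",
              xlab = "Price difference")
    # Mean line
    abline(v = mean(x), col = "darkgreen", lwd = 2)
    # Kernel density curve, scaled to the histogram counts
    mydensity <- density(x)
    mydensity$y <- mydensity$y * length(x) * diff(h$breaks)[1]
    lines(mydensity, col = "blue", lwd = 2)
    # Normal curve with the same mean and standard deviation
    xfit <- seq(min(x), max(x), length = 100)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) *
            length(x) * diff(h$breaks)[1]
    lines(xfit, yfit, col = "red", lwd = 2)
    # Add legend
    legend('topright', c("Mean", "Density Curve", "Normal Curve"),
           lty = c(1,1,1), lwd = c(2,2,2),
           col = c("darkgreen", "blue", "red"))
    ```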