Predicting Diamond Price: 2 Step method


Description

Application of a machine learning algorithm to determine diamond price. The first step determines the price class into which the diamond falls; the second step determines the price with a linear model that includes the price class as a predictor.
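The two-step flow described above can be sketched in Python (the article itself uses R; the bucket levels, the toy classification rule, and the coefficients below are invented purely for illustration):

```python
# Illustrative sketch of the two-step method, not the author's R code.
# Step 1: assign a price class (stand-in for the decision tree / random forest).
# Step 2: predict the price with a linear formula that includes that class.

BUCKET_BASE = [1000, 2000, 3000, 4000, 5000]  # hypothetical class price levels

def price_class(features):
    """Step 1: a toy rule standing in for the classifier."""
    return min(int(features["carat"] * 4), len(BUCKET_BASE) - 1)

def predict_price(features):
    """Step 2: linear formula with the predicted class as a predictor."""
    cls = price_class(features)
    # The class term dominates; the remaining terms fine-tune the price.
    return BUCKET_BASE[cls] + 500 * features["carat"] - 10 * features["table"]
```

The point of the design is that the classifier absorbs most of the variance (which bucket), leaving the linear model a much easier within-bucket fit.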


Prediction of Diamond Price: 2 Step Method
Sarajit Poddar
31 July 2015

Contents

1 Objective
2 Algorithm development & Testing
2.1 Initial setup
2.2 Exploring the data
2.3 Model development
2.4 Developing Predictive Models (Decision Tree)
2.5 Developing Predictive Models (Random Forest)
2.6 Predicting the price in the Test dataset
2.7 Analysing the Residuals

    1 Objective

The objective of this article is to explore a machine learning algorithm for classifying diamonds into various cost buckets depending on their characteristics.

    2 Algorithm development & Testing

    2.1 Initial setup

    2.1.1 Load libraries

Libraries for data cleansing, tidying and transformation, and for plotting.

# Load required libraries
library(dplyr); library(tidyr); library(ggplot2)

    Specialised libraries for machine learning

# Load required libraries
library(caret); library(randomForest)
library(rattle); library(rpart.plot)
# Set seed so that the results are reproducible
set.seed(1000)


2.1.2 Subsetting the dataset

    Subsetting step 1: Price Range

# Load the diamonds dataset
data(diamonds)
# Price range
price.low
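The price classes used throughout (fprice levels 2–10) come from binning the continuous price into ordered buckets. A stdlib-Python equivalent of that binning is sketched below; the bucket edges are hypothetical, since the original price.low/price.high values are truncated in the source:

```python
from bisect import bisect_right

# Hypothetical bucket edges; R's cut() performs this binning in the original.
edges = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]

def fprice(price):
    """Ordered price class for a price; classes here run from 2 to 10."""
    return bisect_right(edges, price) + 2
```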

## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..
2.2 Exploring the data

    2.2.1 Price distribution in the dataset

# Histogram of price distribution
qplot(fprice, data=dataset, geom="histogram")

[Figure: histogram of fprice — count (0–900) by price class (2–10)]

# Histogram of carat distribution
qplot(fcarat, data=dataset, geom="histogram")


[Figure: histogram of fcarat — count (0–900) by carat class (3–16)]

# Association of price with carat and clarity
g

[Figure: fprice (2–10) vs fcarat (3–16), coloured by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]

    2.3 Model development

# Tidy up the dataset used for development of the model
dataset.pr

2.4 Developing Predictive Models (Decision Tree)

modFit
## 5 0 0 0 0 0 0 0 0 0
## 6 0 11 97 306 323 200 87 34 14
## 7 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 2 13 29 104 196 324 310
## 10 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4024
## 95% CI : (0.3861, 0.4189)
## No Information Rate : 0.2209
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2939
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.9858 0.0000 0.0000 0.84334 0.00000
## Specificity 1.000000 0.7469 1.0000 1.0000 0.76001 1.00000
## Pos Pred Value NaN 0.5248 NaN NaN 0.30131 NaN
## Neg Pred Value 0.998573 0.9946 0.8291 0.8687 0.97533 0.91067
## Prevalence 0.001427 0.2209 0.1709 0.1313 0.10930 0.08933
## Detection Rate 0.000000 0.2178 0.0000 0.0000 0.09218 0.00000
## Detection Prevalence 0.000000 0.4150 0.0000 0.0000 0.30594 0.00000
## Balanced Accuracy 0.500000 0.8663 0.5000 0.5000 0.80168 0.50000
## Class: 8 Class: 9 Class: 10
## Sensitivity 0.00000 0.90251 0.00000
## Specificity 1.00000 0.79205 1.00000
## Pos Pred Value NaN 0.33129 NaN
## Neg Pred Value 0.91838 0.98614 0.90725
## Prevalence 0.08162 0.10245 0.09275
## Detection Rate 0.00000 0.09247 0.00000
## Detection Prevalence 0.00000 0.27911 0.00000
## Balanced Accuracy 0.50000 0.84728 0.50000

    pred.test

## 6 0 8 41 125 134 80 28 12 6
## 7 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 5 15 46 94 140 132
## 10 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.3991
## 95% CI : (0.3741, 0.4244)
## No Information Rate : 0.2213
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2892
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.9758 0.0000 0.000 0.81707 0.00000
## Specificity 1.000000 0.7365 1.0000 1.000 0.77477 1.00000
## Pos Pred Value NaN 0.5127 NaN NaN 0.30876 NaN
## Neg Pred Value 0.998663 0.9908 0.8289 0.869 0.97175 0.91043
## Prevalence 0.001337 0.2213 0.1711 0.131 0.10963 0.08957
## Detection Rate 0.000000 0.2159 0.0000 0.000 0.08957 0.00000
## Detection Prevalence 0.000000 0.4211 0.0000 0.000 0.29011 0.00000
## Balanced Accuracy 0.500000 0.8562 0.5000 0.500 0.79592 0.50000
## Class: 8 Class: 9 Class: 10
## Sensitivity 0.00000 0.91503 0.00000
## Specificity 1.00000 0.78258 1.00000
## Pos Pred Value NaN 0.32407 NaN
## Neg Pred Value 0.91845 0.98778 0.90775
## Prevalence 0.08155 0.10227 0.09225
## Detection Rate 0.00000 0.09358 0.00000
## Detection Prevalence 0.00000 0.28877 0.00000
## Balanced Accuracy 0.50000 0.84880 0.50000
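The overall accuracy figures reported above (0.4024 in-sample, 0.3991 out-of-sample) are simply the diagonal of the confusion matrix divided by the total count. A small Python check on a toy matrix (the matrix below is made up, not the one above):

```python
def accuracy(cm):
    """Overall accuracy: correctly classified / total observations."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Toy 3x3 confusion matrix (rows = prediction, columns = reference)
toy = [[50, 5, 0],
       [10, 30, 5],
       [0, 5, 45]]
```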

2.5 Developing Predictive Models (Random Forest)

    2.5.1 Model definition

    modFit

2.5.2.1 Training set accuracy (In-Sample)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 3 4 5 6 7 8 9 10
## 2 5 0 0 0 0 0 0 0 0
## 3 0 774 1 0 0 0 0 0 0
## 4 0 0 598 0 0 0 0 0 0
## 5 0 0 0 460 0 0 0 0 0
## 6 0 0 0 0 382 0 0 0 0
## 7 0 0 0 0 1 313 1 0 0
## 8 0 0 0 0 0 0 285 0 0
## 9 0 0 0 0 0 0 0 359 0
## 10 0 0 0 0 0 0 0 0 325
##
## Overall Statistics
##
## Accuracy : 0.9991
## 95% CI : (0.9975, 0.9998)
## No Information Rate : 0.2209
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.999
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 1.000000 1.0000 0.9983 1.0000 0.9974 1.00000
## Specificity 1.000000 0.9996 1.0000 1.0000 1.0000 0.99937
## Pos Pred Value 1.000000 0.9987 1.0000 1.0000 1.0000 0.99365
## Neg Pred Value 1.000000 1.0000 0.9997 1.0000 0.9997 1.00000
## Prevalence 0.001427 0.2209 0.1709 0.1313 0.1093 0.08933
## Detection Rate 0.001427 0.2209 0.1707 0.1313 0.1090 0.08933
## Detection Prevalence 0.001427 0.2212 0.1707 0.1313 0.1090 0.08990
## Balanced Accuracy 1.000000 0.9998 0.9992 1.0000 0.9987 0.99969
## Class: 8 Class: 9 Class: 10
## Sensitivity 0.99650 1.0000 1.00000
## Specificity 1.00000 1.0000 1.00000
## Pos Pred Value 1.00000 1.0000 1.00000
## Neg Pred Value 0.99969 1.0000 1.00000
## Prevalence 0.08162 0.1025 0.09275
## Detection Rate 0.08134 0.1025 0.09275
## Detection Prevalence 0.08134 0.1025 0.09275
## Balanced Accuracy 0.99825 1.0000 1.00000

    pred.test

## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 3 4 5 6 7 8 9 10
## 2 0 0 0 0 0 0 0 0 0
## 3 2 311 20 1 0 0 0 0 0
## 4 0 16 208 35 0 0 0 0 0
## 5 0 4 26 120 40 7 0 1 0
## 6 0 0 2 34 85 36 6 1 1
## 7 0 0 0 3 34 57 19 8 0
## 8 0 0 0 3 4 22 46 15 7
## 9 0 0 0 0 0 10 45 80 42
## 10 0 0 0 0 1 2 6 48 88
##
## Overall Statistics
##
## Accuracy : 0.6651
## 95% CI : (0.6406, 0.689)
## No Information Rate : 0.2213
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6097
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.9396 0.8125 0.61224 0.51829 0.42537
## Specificity 1.000000 0.9803 0.9589 0.94000 0.93994 0.95301
## Pos Pred Value NaN 0.9311 0.8031 0.60606 0.51515 0.47107
## Neg Pred Value 0.998663 0.9828 0.9612 0.94145 0.94065 0.94400
## Prevalence 0.001337 0.2213 0.1711 0.13102 0.10963 0.08957
## Detection Rate 0.000000 0.2079 0.1390 0.08021 0.05682 0.03810
## Detection Prevalence 0.000000 0.2233 0.1731 0.13235 0.11029 0.08088
## Balanced Accuracy 0.500000 0.9599 0.8857 0.77612 0.72912 0.68919
## Class: 8 Class: 9 Class: 10
## Sensitivity 0.37705 0.52288 0.63768
## Specificity 0.96288 0.92777 0.95803
## Pos Pred Value 0.47423 0.45198 0.60690
## Neg Pred Value 0.94568 0.94466 0.96299
## Prevalence 0.08155 0.10227 0.09225
## Detection Rate 0.03075 0.05348 0.05882
## Detection Prevalence 0.06484 0.11832 0.09693
## Balanced Accuracy 0.66997 0.72532 0.79785
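Balanced accuracy in caret's output is the mean of sensitivity and specificity, which is why a class that is never predicted (here Class: 2) still scores 0.5. A one-line check, using the Class: 3 values from the table above:

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy as reported by caret's confusionMatrix()."""
    return (sensitivity + specificity) / 2
```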

    2.6 Predicting the price in the Test dataset

2.6.1 Determine the fitted model using price range as one of the predictors

# Fitted model
fitted.model

# Summary of the fitted model
summary(fitted.model)

##
## Call:
## lm(formula = price ~ fcarat + cut + clarity + color + table +
## y + z + fprice, data = dataset.pr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -478.00 -91.32 -0.86 90.29 653.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 692.7184 121.8809 5.684 1.39e-08 ***
## fcarat.L 63.6436 83.7364 0.760 0.447263
## fcarat.Q -207.6135 64.4114 -3.223 0.001276 **
## fcarat.C -59.1854 59.6963 -0.991 0.321518
## fcarat^4 22.1916 53.7768 0.413 0.679872
## fcarat^5 -35.8988 45.3522 -0.792 0.428657
## fcarat^6 -2.3563 36.5826 -0.064 0.948646
## fcarat^7 -25.9610 31.6190 -0.821 0.411654
## fcarat^8 -1.1780 29.5570 -0.040 0.968211
## fcarat^9 -13.4611 25.6953 -0.524 0.600390
## fcarat^10 -15.2370 19.4604 -0.783 0.433680
## fcarat^11 -29.6520 13.2595 -2.236 0.025378 *
## fcarat^12 -17.7874 8.5098 -2.090 0.036649 *
## fcarat^13 -10.1020 6.2293 -1.622 0.104931
## cut.L 24.6540 7.0539 3.495 0.000478 ***
## cut.Q 2.7526 5.7337 0.480 0.631200
## cut.C 5.7071 5.1496 1.108 0.267803
## cut^4 -5.7487 4.3535 -1.320 0.186742
## clarity.L 370.0914 14.4251 25.656 < 2e-16 ***
## clarity.Q -79.9797 10.2724 -7.786 8.37e-15 ***
## clarity.C 57.5250 8.2458 6.976 3.43e-12 ***
## clarity^4 -36.7786 6.6591 -5.523 3.50e-08 ***
## clarity^5 1.7895 5.5359 0.323 0.746521
## clarity^6 17.2801 5.0240 3.439 0.000588 ***
## clarity^7 14.7822 4.6156 3.203 0.001370 **
## color.L -172.2569 8.0063 -21.515 < 2e-16 ***
## color.Q -27.3245 5.8839 -4.644 3.51e-06 ***
## color.C -13.3488 5.3907 -2.476 0.013309 *
## color^4 4.9552 5.0367 0.984 0.325251
## color^5 -3.7985 4.6906 -0.810 0.418089
## color^6 6.2334 4.2783 1.457 0.145190
## table -0.9325 0.9375 -0.995 0.319938
## y 294.9346 16.8682 17.485 < 2e-16 ***
## z 119.8985 17.1153 7.005 2.79e-12 ***
## fprice.L 3103.1795 31.1688 99.560 < 2e-16 ***
## fprice.Q 238.9273 26.0004 9.189 < 2e-16 ***
## fprice.C -126.1668 21.6234 -5.835 5.73e-09 ***
## fprice^4 45.2983 15.5962 2.904 0.003695 **
## fprice^5 -53.8780 10.0269 -5.373 8.08e-08 ***


## fprice^6 40.1292 6.7176 5.974 2.48e-09 ***
## fprice^7 -14.4412 5.6136 -2.573 0.010125 *
## fprice^8 3.7922 5.3327 0.711 0.477048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.7 on 4958 degrees of freedom
## Multiple R-squared: 0.9897, Adjusted R-squared: 0.9897
## F-statistic: 1.167e+04 on 41 and 4958 DF, p-value: < 2.2e-16
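Evaluating the fitted lm() is just the intercept plus a weighted sum of predictors. A minimal Python sketch using only the plain numeric coefficients from the summary above (y, z, table) and ignoring the ordered-factor terms, so it is illustrative rather than the full model:

```python
def predict_lm(intercept, coefs, features):
    """Linear-model prediction: intercept + sum of coefficient * value."""
    return intercept + sum(coefs[k] * v for k, v in features.items())

# Numeric coefficients lifted from the lm() summary above
coefs = {"y": 294.9346, "z": 119.8985, "table": -0.9325}
```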

    2.6.2 Determine the price range

# Predicting the price range
dataset.predict

2.7 Analysing the Residuals

geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Fitted value") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)

[Figure: residuals (−500 to 500) vs fitted values (1000–5000), with a loess smooth]

    2.7.2 Plotting the predicted data with actual data

    g

[Figure: Actual Price vs Predicted Price, both axes 1000–5000]

    2.7.3 Plotting the difference between actuals and the prediction

# Determine the difference between prediction and actuals
x

lines(xfit, yfit, col="red", lwd=2)
# Add legend
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
       lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))

[Figure: histogram "Difference between actuals and prediction" — Price difference (−600 to 400) vs Frequency, with mean line, density curve and normal curve]
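The mean line and the overlaid normal curve in the histogram are driven by the mean and standard deviation of the prediction errors; a stdlib-Python sketch of that computation:

```python
from statistics import mean, stdev

def residual_stats(actual, predicted):
    """Mean and standard deviation of prediction errors (actual - predicted)."""
    diffs = [a - p for a, p in zip(actual, predicted)]
    return mean(diffs), stdev(diffs)
```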

2.7.4 Plotting the difference between actuals and the prediction (in percentage)

# Determine the difference between prediction and actuals
x

lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit
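Section 2.7.4 repeats the residual analysis on a relative scale; the per-diamond percentage difference it plots can be sketched as:

```python
def pct_errors(actual, predicted):
    """Prediction error as a percentage of the actual price."""
    return [100 * (a - p) / a for a, p in zip(actual, predicted)]
```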