Fast or furious? - User analysis of SF Express...

CS 229 PROJECT, DEC. 2017 1

Fast or furious? - User analysis of SF Express IncGege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz

I. MOTIVATION

The motivation of this project is to predict thelikelihood for a S.F. Express user to file complaintfor a certain delivery using machine learning algo-rithms. With the prediction, we will also determinethe top triggers that make a user furious, thereforeimprove the users experience during a delivery.

The S.F. express company is the second largestcarrier in China and we will focus on the company’sdomestic delivery business. This project is proposedbased on the company’s current business needs.

II. RELATED WORK

Through literature review, we found several pa-pers related to our topic. The most relevant paperbuilds a system that helps automatically predict apatient’s chief compliant based on vitals and nurses’description of states at arrival[1]. A linear supportvector machine (SVM) on their selected features andachieved a F1 score of 0.913 on their perceptronalgorithm. Their work inspires us on the choicesof learning algorithms and feature simplificationprocess. However, there is some differences in ourproblem since we need to select variables on basesof limited pre-existing features and we expect highrecall rate due to industry needs.

We applied different models that were provedto be highly efficient in past literature: Wiener’spaper[2] gives us a guideline to perform RandomForest classification. Chen[3] summarized XGBoost,which can help us get better results on a tree boost-ing system. Friedman[4] experimented on StochasticGradient Boosted Decision Trees(GBDT) and wealso referred to their work on boosting process.For Feature selection, we applied several typesof Least Absolute Shrinkage and Selection Opera-tor(LASSO) mentioned in Tibshirani’s[5], Chen’s[6]and Zhang’s[7] papers.

III. DATASET AND FEATURES

A. DatasetThe dataset comes from S.F.’s real delivery his-

tory, containing the industrial bar gun information

including descriptions on the packages, senders, andreceivers. The dataset represented in CSV formatwith 40 columns as attributes and 22000 rows asinput entries. After cleaning null and irrelevantattributes, 27 attributes are remained as potentialfeatures.

Since delivery entries with complaint rarely occurcomparing to all delivery histories, the dataset ishighly unbalanced with 2000 complaint entries and20000 non-complaint entries. To solve this issue, wecreated three balanced datasets each containing all2000 complaint entries and 2000 randomly selectednon-complaint entries. With in each balanced set,we split 75% to be the training set and 25% tobe the test set. Note that a development set is alsocreated within each training set, which contains 20%of the training set entries. The following featureand model selection process are performed on allthree balanced datasets. Therefore, we can crossvalidate using the three datasets to achieve a optimalgeneralization during parameter tuning. To test theperformance of the algorithms under real worldcondition, we also created a unbalanced test setwhere the complaint and non-complaint entries arein the ratio of 1:10.

B. Data Pre-processingThe features in our dataset exist in numerical,

binary, and categorical formats. To combine allvariables, we normalized the features using dummyvariables into the form of a feature matrix. Anexample of this normalization is shown as follows:

x1 → numerical ∈ Rx2 → categorical ∈ {A,B,C}x3 → binary ∈ 0, 1

If we have x1 = 5, x2 = A, x3 = 1, our featurematrix will be:

x1x2Bx2Cx3

→

5001


Note that to avoid dummy variable trap, a categor-ical variable with n categories is projected to n− 1columns. The dimension of the resulting featurematrix is 87.

C. Feature SelectionWe performed feature selection since our mapped

feature matrix is very high dimensional. Using the87 features, we feed our set 1,2 and 3 into a simplelogistic regression and yield an average AUC of0.53. Logistic regression with lasso has even worseperformance with an average AUC of 0.52. There-fore, we used Z-test, lasso, recursive and stabilityselection to reduce the dimension of feature.

1) Z-Test Selection: We assumed that a featureis considered relevant to the predication when wereject the hypothesis that a coefficient equals to zero.When Z > 2, we have 95% confidence to reject nullhypothesis and claim the coefficient is statisticallysignificant.

2) Lasso method: We performed a cross valida-tion to estimate the expected generalization errorfor each λ. The value of λ was chosen basedon CVMSE to be 0.02. An example of λ versesCVMSE on set 1 is presented as Figure 1.

Fig. 1: λ vs. CVMSE

3) Recursive Selection: we eliminated least im-portant features using the recursive method andcompared the remaining feature in each dataset.

4) Stability Selection: using a randomized-lassoas our selection algorithm, we performed stabilityselection on three sets and eliminated the featuresthat scored lower than 0.85 in all three sets.

Comparing the results from the four feature selec-tion methods, we eliminated 40 features that were

selected as insignificant by all methods. Also, 16features were considered significant as shown inFigure 2 after combining the results from the fourselection methods. The industrial meaning of thesefeatures can be further explored by the company inorder to improve costumer experience.

Fig. 2: Significant features

IV. MODEL SELECTION

A. Baseline Model

We choose logistic regression as our baselinemodel. With the selected features and balanced dataset, logistic regression model achieved an averageAUC score of 0.64. Logistic regression with Lassoreach an average AUC of 0.65.

B. Tree-structured Model

In this real life problem, most features do nothave a linear property. A tree structured model per-forms better at capturing the non-linearity. There-fore, to achieve better performance, we attemptedRandom Forest (RF), Gradient Boosting DecisionTree (GBDT) and XGBoost.

The RF algorithm take average on a number ofparallel decision trees each has low bias and highvariance. This algorithm will effectively controlover-fitting while improve accuracy especially forhigh dimensional classification problem.

The GBDT algorithm builds a function whichevaluates the loss of our method. Then we optimizeit by making repetitive changes in the gradientdirection of our loss function. Iterative improvementis made to reduce the cost in this algorithm.

The XGBoost is also a boosting algorithm butwith L1 and L2 regularization as well as usesboth first and second order gradient[3]. This methodprune on user defined full trees to achieve betteraccuracy.

C. Experiments

For each of the three models, we performed aparameter selection process by using all the three


balanced datasets. For each set, we first apply ourmodeling method with default parameters and ob-tain our ’baseline’ AUC score and accuracy on itscorresponding development set. Then we partiallytune up parameters to see if we are on the right trackby checking the AUC score. After we finish tuningall the parameters, we achieve the ’best parameters’for this set.

Because our datasets are randomly sampled, thethree sets perform differently. All algorithms givesrelatively low AUC score on Set 2 and better scoreon Set 3. Therefore, to generalize the parametersand determine the best parameters for all dataset,we combine results from these three sets and builda combined parameter pool. Finally we selectedparameters that can provide the best average perfor-mance on three sets and further check their resultson test sets. A example of this process applied onRF is shown as follows.

Fig. 3: Parameter selection process

Within each process, a parameter optimizationtechnique is applied where we defined parametersto be either ’fixed’ or ’inter-correlated’. A fixedparameters (e.g. max feature in RF) affects the AUCscore independently. Therefore, the optimized valuefor this parameter is searched by by simply adjust-ing this parameter until reaching maximum AUC.However, inter-correlate parameters (e.g. depth, minsample split, and min sample leaf RF) affects theAUC result as a group. For these parameters, wemake adjustment in a recursive manner where aportion of this parameters group is adjusted whilethe rest are fixed. We repeat this adjustment untilmaximum AUC is reached.

V. RESULTS

A. ParametersUsing the above model selection methods, our

final parameters for RF, GBDT and XGBoost arelisted in Figure 4. Each method contain two setsof parameters, which are applied to balanced andunbalanced dataset respectively.

Fig. 4: Final parameters for the three models withbalanced and unbalanced dataset

B. AUCThe AUC results from a new balanced set and

a new unbalanced set are summarized in Figure 5.This results suggests that the GBDT and XGBoostmodel both give a relatively good AUC performancebefore tunning on unbalanced and balanced test set.After parameter tuning, GBDT performs the best onbalanced set while XGBoost performs the best onunbalanced set.

Fig. 5: Parameter selection process

VI. DISCUSSION

A. Balanced Test setFigure 6 shows the ROC and PR curve com-

parison when the algorithms are applied to thebalanced test set. Comparing the ROC curve beforeand after parameter tuning, we can see a increase inAUC, which were represented by the area under theROC curve due to tuning. RF, GBDT and XGBoost


benefited from the parameter tuning while the mostsignificant improvement is observed on RF.

On the PR curve, we observed that the rate ofprecision quickly decreases when we attempt toimprove the recall percentage of class-1, especiallyfor the RF algorithm. After parameter tuning, wewere able to smooth this decrease and improvethe precision for a certain recall rate. The optimalprecision is achieved by different algorithms underdifferent desired recall rate.

In conclusion, for a balanced test set, we thinkthat RF, GBDT and XGBoost all provide a goodrecall and precision. Since the test set is balanced,we can just use a threshold of 0.5 for all algorithms.

Fig. 6: ROC and PR Curve for class-1 before andafter tuning on balanced dataset

We have also summarized the performance met-rics of precision, recall and F1-score for both class-1and class-0 predications as shown Figure 7. Sincethe dataset is balanced, we use F1 score here, whichis the harmonic average of precision and recall.The performance metric indicates that GBDT andXGBoost out perform RF in terms of F1 scorebecause these two algorithm gives better predictionon recall.

Fig. 7: Performance Metrics on balanced dataset

B. Unbalanced Test SetFigure 8 shows the ROC and PR curve com-

parison when the algorithms are applied to theunbalanced test set. From the ROC curves, we alsoobserved that AUC is improved after we tune up allthe parameters.

Comparing to the balanced dataset, precision ratedecreases dramatically when we slightly improveon the percentage of recall from 0 to 0.2. Thiscan be accounted to the class bias problem inthe unbalanced data set. Due the overwhelmingproportion of class-0 data (people who don’t file acomplaint is dominating in real-world dataset), ourmodel tends to overfit to class-0. Therefore, whenwe make predictions on class-1, our fitted modelwill mistakenly classify a large portion of ones tozeros.

Fig. 8: ROC and PR Curve before and after tuningon unbalanced dataset

Companies’ real needs is to identify the poten-tial complaint customers. So we mainly focus oncorrectly predicting ones. To achieve this goal, weneed to adjust β in F1-score function to increasestatistical weight on recall rate. Here follow theprinciple that F-score is the harmonic average ofprecision and recall, and β is the weight on recallrate according to Van Rijsbergen’s effectivenessmeasure. Considering all the ideas above and thefact that the component ratio for ones and zeros inour unbalanced dataset is 1:10, we eventually setβ = 3.

Figure 9 illustrates how the recall rate and Fβ

score change as we modify the threshold of ourmodel. Recall rate decrease dramatically if weraise the threshold value from 0 to 0.2. When the


Fig. 9: Performance Metrics on unbalanced dataset

threshold is greater than 0.2, then the curve willdecreases slowly. F1-score increases with thresholdand reaches its peak when threshold at around 0.1,then decreases with a similar pattern as recall rate.Recall rate is sensitive to low threshold (0 to 0.1),while the precision rate is even more sensitive inthis range. After passing the peak point, recall ratebecomes more dominative than precision rate andleads F1-score decreasing on higher threshold value.We tend to believe that there is a large portionof ’mixed’ data at low threshold region that can’tbe suitably fitted into our model. So if we adjustthreshold, a trade-off between precision and recallhas to be made. In our project we are perusing ahigher recall rate so we chose a threshold value of0.1.

Fig. 10: Recall and F3 score verses threshold=0.1

In the performance metrics on Figure 10, weincluded precision, recall, F1 score and F3 score.These results are generated with the same set ofparameters, however, we altered the threshold asdescribed above. With a smaller threshold, the pre-cision of predicting class-1 decreased while therecall rate increased. For all three models, we sac-rificed precision to get a higher recall rate. In thiscondition, the F1 score is no longer suitable andrepresentative of the model performance. The F3score is shown in figure 10 in comparison withF1, which indicate that the RF algorithm has thebest overall performance on predicting class 1 onunbalanced dataset. This is due to the fact that RF

gives higher recall rate at a small threshold. For theclass-0 prediction, both GBDT and XGboost canachieve a relatively good performance.

VII. CONCLUSION AND FUTURE WORK

In this project, we investigated a real life classi-fication problem to help the S.F. company identifycomplaint customer. We also identify the featuresthat are strongly related with complaint to helpS.F. company improve the customer experience.The dataset is high dimensional and has class-unbalanced issues. Tree-structured models (RF,GBDT and XGBoost) were utilized for this prob-lem. After parameter tuning, we achieved an AUCscore of 0.77 on a balanced set by using GBDT.For the unbalanced set, we further investigated theuse of a smaller threshold to achieve a better recall.At our selected threshold of 0.1, RF has the bestoverall performance.

In the next step of this project, we could furtheroptimize the value of threshold to meet the com-pany’s desired precision and recall rate.

VIII. CONTRIBUTION

Gege Wen: Baseline Model, Feature Selection,Analysis

Yiyuan Zhang: Dataset Pre-process, Feature Se-lection, Analysis

Kezhen Zhao: Dataset Pre-process, Feature Se-lection, Tree-structured Models, Anlysis

REFERENCES

[1] Jernite Y, Halpern Y, Horng S, et al. Predicting chief complaintsat triage time in the emergency department[C]//NIPS Workshopon Machine Learning for Clinical Data Analysis and Healthcare.2013.

[2] Liaw A, Wiener M. Classification and regression by RandomForest[J]. R news, 2002, 2(3): 18-22.

[3] Chen T, Guestrin C. Xgboost: A scalable tree boosting sys-tem[C]//Proceedings of the 22nd acm sigkdd international con-ference on knowledge discovery and data mining. ACM, 2016:785-794.

[4] Friedman J H. Stochastic gradient boosting[J]. ComputationalStatistics & Data Analysis, 2002, 38(4): 367-378.

[5] Tibshirani R. Regression shrinkage and selection via the lasso[J].Journal of the Royal Statistical Society. Series B (Methodologi-cal), 1996: 267-288.

[6] Chen T, Guestrin C. Xgboost: A scalable tree boosting sys-tem[C]//Proceedings of the 22nd acm sigkdd international con-ference on knowledge discovery and data mining. ACM, 2016:785-794.

[7] Zhang X, Lu X, Shi Q, et al. Recursive SVM feature selectionand sample classification for mass-spectrometry and microarraydata[J]. BMC bioinformatics, 2006, 7(1): 197.

Fast or furious? - User analysis of SF Express...

Documents

Transcript of Fast or furious? - User analysis of SF Express...