KDD Cup Research Paper


CS4642 - Data Mining & Information Retrieval

KDD Cup 2014 Submission Paper

100112V - E.A.S.D.Edirisinghe

100132G - W.V.D.Fernando

100440A - R.H.T.D.Ranasinghe

100444N - M.C.S.Ranathunghe


1. Introduction

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners [1]. Some examples from past years:

KDD Cup 2010 - Student performance evaluation

KDD Cup 2009 - Customer relationship prediction

The organizers of KDD Cup 2014 used the Kaggle platform, which hosts one of the world's largest communities of data scientists who compete with each other to solve complex data science problems.

For KDD Cup 2014 the task is to predict exciting projects on DonorsChoose.org, a United States-based nonprofit organization that allows individuals to donate directly to public school classroom projects. Public school teachers can post their project requests on the DonorsChoose.org site so that any interested party can donate any amount of money (minimum $1) to a project. When a project reaches its funding goal, DonorsChoose.org admins ship the materials to the school. If a partially funded project expires (at most four months from the date posted, unless the teacher has set an earlier deadline), donors get their donations returned as account credits.

The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to people at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, DonorsChoose.org can improve funding outcomes, improve the user experience, and help more students receive the materials they need to learn.

2. Description of available data

● donations.csv - Information about the donations to each project, including details about the donor and the payment. This is only provided for projects in the training set.

● essays.csv - Project text posted by the teachers: a title, need statement, short description and essay for each project. This is provided for both the training and test sets.

● projects.csv - Information about each project: school information, teacher information, focus areas/subjects of the project, etc. This is provided for both the training and test sets.

● resources.csv - Information about the resources requested for each project: how many items were requested, the unit price of an item, the vendor name, the resource type, etc. This is provided for both the training and test sets.

● outcomes.csv - Information about the outcomes of projects in the training set. The field is_exciting is given here for each project in the training set.

● sampleSubmission.csv - Contains the project ids of the test set and shows the submission format for the competition.


3. Data Analysis and Pre-processing

First we collected all the data and looked at basic statistics for each feature in the dataset. Through this analysis we found that the data demanded some pre-processing before it could be mined.

One of the first issues we identified was the incompleteness of the data: some features in the data set had missing values. The most significant were the following:

1. fulfilment_labor_materials - Data not available for 35082 projects

2. students_reached - Data not available for 146 projects

3. secondary_focus_subject - Data not available for 207893 projects

4. secondary_focus_area - Data not available for 207893 projects

So we used several methods to estimate the missing data. For the secondary_focus_subject and

secondary_focus_area features, the missing data was filled with the values from primary_focus_subject

and primary_focus_area features of the same project.

For the other two features we used multiple imputation [2] to generate the missing values, using the 'mice' [3] package available in R. After the imputation we were able to use the completed data set for analysis.
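As a minimal sketch of this step (column names as spelled above; the mice settings shown are illustrative rather than our exact configuration):

    library(mice)

    projects <- read.csv("projects.csv")

    # Impute the two incomplete numeric features using predictive mean
    # matching, mice's default method for numeric data.
    cols <- c("fulfilment_labor_materials", "students_reached")
    imp <- mice(projects[, cols], m = 5, method = "pmm", seed = 42)

    # Take one completed copy of the imputed columns.
    projects[, cols] <- complete(imp, 1)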

3.1 Analysis of data

Upon closer inspection of the data we found some interesting patterns.

3.1.1. Relationship of exciting projects with reference to the project's accepted date

As we can see from the figure, which shows the number of 'exciting' projects for each year, there are no exciting projects before the year 2010. According to the discussions in the forums, the consensus is that this is because DonorsChoose.org did not keep track of several factors that determine the "excitingness" of a project before 2010. So we decided to consider only the projects submitted after 2010 as our training data.

We can also see that projects submitted at the end of summer and the beginning of fall have a higher chance of becoming exciting projects. We conclude that this is due to people donating more towards the start of the new school session that begins in the fall.

The overall trend for the number of exciting projects in each year can be seen in the graph below.


3.1.2. Exciting projects by the state where the school is situated

We can see that projects proposed in some states have a higher chance of becoming exciting projects in comparison to projects from other states. Specifically, a project proposed in California has the highest probability of becoming an exciting project, so there appears to be a significant relationship between the state where the school is situated and a project becoming exciting.

3.1.3. Exciting projects based on the required resource type

Some resource requirements can be seen to generate more exciting projects than others. Projects

with technological and supply requirements have a higher probability of being an exciting project.


3.1.4. Exciting projects based on the poverty level of the school

Projects that contribute to schools which have a higher poverty level have a higher chance of

becoming an exciting project. As we can see from the chart above, more than half of the exciting projects

come from the schools that are in the highest poverty level.

3.1.5. Relationship between exciting projects and the target grade level of the project

Projects that cater to younger students generate more exciting projects. As we can see from the

chart above, projects aimed at PreK-2 students and Grades 3-9 have a higher percentage of exciting

projects.

3.1.6 Relationship between exciting projects and the essay word count

We needed a simple way of using the essay data, which amounted to the largest data set among the available data, so we decided to check for a relationship between essay length and a project becoming exciting. Our analysis, as shown in the graph below, showed that there appears to be a relationship between them.


Based on the above-mentioned relationships and further analysis of the data, we decided on a set of features to be used in our training models, listed in Appendix I.

Some categorical features that contained a large number of categories could not be used directly with the libraries we used for data mining, so those features were encoded using one-hot encoding, which encodes a categorical feature as a set of dummy numerical features (one per category).
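For illustration, one-hot encoding a categorical column in R can be done with model.matrix (using school_state as an example column):

    # Expand the categorical school_state column into one dummy
    # numerical column per category (no intercept column).
    dummies <- model.matrix(~ school_state - 1, data = projects)
    projects <- cbind(projects, dummies)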

Another step we took before applying the dataset to the models was to break up the available data into a training set and a testing set for cross-validation purposes. We used the 'split' functionality available in R to split the data so that both the training set and the testing set have a similar distribution of exciting projects.
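A minimal sketch of that split, assuming the sample.split function from the caTools package and the post-2010 training data in a data frame called train:

    library(caTools)

    set.seed(42)
    # TRUE/FALSE mask that keeps the proportion of exciting projects
    # roughly equal in the two subsets.
    mask <- sample.split(train$is_exciting, SplitRatio = 0.7)
    train_set <- train[mask, ]
    test_set  <- train[!mask, ]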

4. Main Methods used

4.1 Random Forests

Random Forests [4] is an ensemble learning method widely used in data mining tasks. It builds a set of decision trees at training time and lets them vote on the classification of new data.

In the Random Forest method we generate a set of classification trees using the following steps:

1. For each tree we sample a set of cases from the original training set with replacement (a bootstrap sample), and this sample is used to grow the tree.

2. At each split a random subset of the available features is considered, and the best split is chosen only from among them.

3. Each classification tree is grown to its full extent; no pruning is used.

These generated trees are then used to classify new items. For each new item, every tree in the random forest votes on its classification, and the majority decision is used. If we are instead trying to estimate the probability of belonging to a class, the mean of the probability outputs of the individual trees is used.

Our decision to use Random Forest models was based on the following advantages of the method, among others:

1. High accuracy - Random forests are known to be highly accurate.

2. Ease of use - Random forests are easy to generate even with very large datasets.

In our project we used the 'randomForest' [5] package available in R to generate a Random Forest model. To overcome the class imbalance caused by exciting projects making up only about 5% of the data set, we used stratified sampling, so that when samples are drawn from the training set the exciting projects make up around 25% of each sample.
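A minimal sketch of this stratified setup (ntree and the per-class sample counts are illustrative; the counts are chosen so exciting projects are about 25% of each bootstrap sample):

    library(randomForest)

    set.seed(42)
    # is_exciting is assumed to be a factor with levels "0" and "1".
    # strata + sampsize draw a fixed number of cases per class for each
    # tree: 2500 exciting ("1") out of 10000 total, i.e. about 25%.
    rf <- randomForest(is_exciting ~ ., data = train_set,
                       ntree = 500,
                       strata = train_set$is_exciting,
                       sampsize = c("0" = 7500, "1" = 2500))

    # Predicted probability of being exciting.
    pred_rf <- predict(rf, newdata = test_set, type = "prob")[, "1"]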

We were able to generate some good results with this model, achieving a score of 0.61986 on the public leaderboard with this model alone. Our result could have been improved further if we had used a proper procedure to tune the model parameters; we tried tuning with the caret package in R but failed due to performance issues.

4.2 Gradient Boosting Regression Trees

Gradient Tree Boosting[6] or Gradient Boosted Regression Trees (GBRT) is a generalization of

boosting to arbitrary differentiable loss functions. It uses decision trees as the weak model and a loss

function to calculate the residual. GBRT is an accurate and effective procedure that can be used for both

regression and classification problems. Gradient Tree Boosting models are used in a variety of areas

including Web search ranking and ecology.

The advantages of GBRT are:

● Natural handling of data of mixed type (heterogeneous features)

● Predictive power

● Robustness to outliers in output space (via robust loss functions)

The disadvantages of GBRT are:

● Scalability - Due to the sequential nature of boosting, it is hard to parallelize.

We used the gradient boosting regression trees algorithm from the Scikit-learn [7] Python library to build our model. In this method there are a few hyper-parameters which should be specified carefully in order to yield optimal results. Below is the list of those hyper-parameters:

● number of regression trees (n_estimators)

Gradient boosting is fairly robust to overfitting, hence a large number usually results in better

performance.

● depth of each individual tree (max_depth)

The maximum depth limits the number of nodes in the tree.

● loss function (loss)

The loss function to be optimized. There are four functions available off the shelf: 'ls', 'lad', 'huber' and 'quantile'. 'ls' refers to least squares regression; 'lad' (least absolute deviation) is a highly robust loss function based solely on the order information of the input variables; 'huber' is a combination of 'ls' and 'lad'; 'quantile' allows quantile regression.

● learning rate (learning_rate)

The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

4.2.1 Hyper-parameter tuning

One of the pitfalls of machine learning in general is over-fitting the model to the training data, and GBRT is highly vulnerable to over-fitting if the hyper-parameters are not tuned properly.

In order to tune the hyper-parameters we followed the steps below:

1. Chose 'lad' (least absolute deviation) as the loss function, since it is highly robust and the running time of the search was limited.

2. Picked n_estimators as large as (computationally) possible (1000 estimators).

3. Tuned max_depth, learning_rate, min_samples_leaf, and max_features via grid search[8].

4. Increased n_estimators even more and tuned learning_rate again holding the other

parameters fixed.

The result of hyper-parameter tuning was max_features=1.0, learning_rate=0.1, max_depth=6, min_samples_leaf=9.
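An analogous configuration can be sketched in R with the gbm package (our model itself was built with scikit-learn; in gbm, the 'laplace' distribution plays the role of the 'lad' loss, interaction.depth of max_depth, shrinkage of learning_rate, and n.minobsinnode of min_samples_leaf):

    library(gbm)

    set.seed(42)
    # is_exciting is assumed coded as numeric 0/1 here.
    boost <- gbm(is_exciting ~ ., data = train_set,
                 distribution = "laplace",   # ~ 'lad' loss
                 n.trees = 1000,             # ~ n_estimators
                 interaction.depth = 6,      # ~ max_depth
                 shrinkage = 0.1,            # ~ learning_rate
                 n.minobsinnode = 9)         # ~ min_samples_leaf

    pred_gbrt <- predict(boost, newdata = test_set, n.trees = 1000)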

We were able to achieve a score of 0.61875 on the public leaderboard using this model alone.

4.3 Logistic Regression

Logistic Regression is used to predict a binary response, i.e. the outcome of a categorical dependent variable (a class label), based on one or more predictor variables (features). For example, in this contest we can use logistic regression to predict whether a project is exciting or not based on the features we identified in the data pre-processing section.

Logistic regression uses the function below, with features $x_1$ to $x_k$, each feature having its own weight $w_i$:

$$P(\mathrm{is\_exciting} = 1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_k x_k)}}$$

The output of this function always lies between 0 and 1, regardless of the weights or feature values, so we can use it to predict a value for is_exciting between 0 and 1.

4.3.1 Implementation

We considered two R packages when implementing logistic regression: glm and glmnet. Both fit generalized linear models, which generalize linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the variance of each measurement to be a function of its predicted value.

Comparing the two packages, we found that glm fails to handle large numbers of columns due to memory limits, whereas glmnet handles big data efficiently and can handle R factor values without modification. So we chose the glmnet package to create our model. The glmnet package can deal with all shapes of data, including very large sparse data matrices, and it fits linear, logistic, multinomial, Poisson, and Cox regression models [9].

We considered the following parameters when using the glmnet package.

x - The input matrix, of dimension nobs x nvars; each row is an observation vector. It can be in sparse matrix format. We used the important columns identified in the data pre-processing and analysis section.

y - The response variable; in this case, whether the project was exciting or not.

alpha - The elastic-net mixing parameter, with 0 ≤ α ≤ 1. alpha=1 is the lasso penalty, and alpha=0 the ridge penalty.

lambda - A user-supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio; supplying a value of lambda overrides this.

In addition, the glmnet package requires the 'Matrix' package in order to create a sparse matrix from the important columns we identified.

4.3.2 Hyper-parameter tuning

Using the test set we created by splitting the training data in the pre-processing step, we tuned alpha and lambda to give the best results. We observed that when alpha is very close to the ridge penalty (alpha near 0) the model tends to give good results on the test set, so we used alpha = 0.001. We also used lambda = 0.2, which gave good results on the test set we prepared.
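A minimal sketch of this setup with the tuned values (the feature-matrix construction is simplified here):

    library(glmnet)
    library(Matrix)

    # Sparse design matrix built from the selected feature columns.
    x <- sparse.model.matrix(is_exciting ~ . - 1, data = train_set)
    y <- train_set$is_exciting

    # alpha near 0 is almost a pure ridge penalty; lambda is fixed to
    # the tuned value instead of letting glmnet compute a sequence.
    fit <- glmnet(x, y, family = "binomial", alpha = 0.001, lambda = 0.2)

    x_test <- sparse.model.matrix(is_exciting ~ . - 1, data = test_set)
    pred_logit <- predict(fit, newx = x_test, type = "response")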

4.3.3 Results

The model gave fairly good results in the competition, scoring 0.61 on the public leaderboard and 0.60 on the private leaderboard.

4.4 Ensembling

Our final submission was an ensemble of the three methods described above. We used a simple weighted ensemble: we assigned a weight to each method and computed the final result as the weighted combination of the three methods' outputs.
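A sketch of the weighted combination (the weights shown are placeholders, not the ones we actually submitted):

    # Weighted average of the three models' predicted probabilities;
    # pred_rf, pred_gbrt and pred_logit come from the models above.
    w <- c(rf = 0.4, gbrt = 0.4, logit = 0.2)  # placeholder weights
    final <- w["rf"] * pred_rf + w["gbrt"] * pred_gbrt + w["logit"] * pred_logit

    submission <- data.frame(projectid = test_set$projectid,
                             is_exciting = as.numeric(final))
    write.csv(submission, "submission.csv", row.names = FALSE)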

5. Other methods tested

5.1 Neural Networks

Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued,

and vector-valued functions from examples. Algorithms such as Backpropagation use gradient descent to

tune network parameters to best fit a training set of input-output pairs. Neural Network learning is robust

to errors in the training data and has been successfully applied to problems such as interpreting visual

scenes, speech recognition, and learning robot control strategies.


We used the PyBrain [10] Python library to build a neural network trained with the backpropagation algorithm. While training the neural network, we faced a number of problems, such as:

1. Number of hidden layers to be used

Number of Hidden Layers | Result
none | Only capable of representing linearly separable functions or decisions.
1    | Can approximate any function that contains a continuous mapping from one finite space to another.
2    | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy.

The table above summarizes the knowledge we acquired by going through various research papers.

Unfortunately we were unable to find a specific method for determining the number of hidden layers, and hence we tested various numbers of hidden layers, ranging from 2 to 50. We were unable to increase the number of hidden layers further due to the huge amount of time taken by the network training phase.

2. Number of neurons in each hidden layer

We were unable to find a specific formula to calculate the number of neurons in a particular hidden layer, although we found many rule-of-thumb methods for choosing the number of neurons in the hidden layers, such as the following:

● The number of hidden neurons should be between the size of the input layer and the size

of the output layer.

● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of

the output layer.

● The number of hidden neurons should be less than twice the size of the input layer.

We applied the above rules, and we further tried choosing the number of neurons in a hidden layer based on the Fibonacci number series.

3. Neural network training time

Training the neural network took a lot of time, and hence we did not have time to research pruning algorithms [11] or to tune the neural network using genetic algorithms [12].
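For reference, a minimal R analogue of such a network using the neuralnet package (our actual implementation used PyBrain in Python; the feature names and layer sizes here are purely illustrative):

    library(neuralnet)

    set.seed(42)
    # Two hidden layers of 10 and 5 neurons; is_exciting coded as 0/1
    # and the predictor names are placeholders, not our real features.
    nn <- neuralnet(is_exciting ~ students_reached + essay_length,
                    data = train_set,
                    hidden = c(10, 5),
                    linear.output = FALSE)  # logistic output unit

    pred_nn <- compute(nn, test_set[, c("students_reached", "essay_length")])$net.result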

However, all the prediction results obtained through the neural network model were poor when compared to the other models during cross-validation.


5.2 Classification and Regression Trees (CART) and Conditional Inference Trees

The idea of these models is to build a tree by splitting on variables. To predict the outcome for an observation, we follow the splits and, at the end (the leaves), predict the most frequent outcome. We used the packages 'rpart' and 'party' to implement these tree models in R: rpart (CART) from the 'rpart' package and ctree from the 'party' package. The motivation for this approach is the interpretability of how a decision was reached, which is not possible with logistic regression methods. The problem with this method was that it did not provide good results, because the data set was too imbalanced in favour of "non-excitingness".

5.2.1 CART

CART tries to split the data into subsets so that each subset is as homogeneous as possible; the tree is generated based on how these splits are made. The number of splits can be controlled by setting a lower bound on the split size via the 'minbucket' parameter. If the splits are too small, overfitting will occur; if they are too large, accuracy will be poor.
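A minimal sketch of this CART model (the minbucket value shown is illustrative):

    library(rpart)

    # minbucket bounds the minimum number of observations in any leaf,
    # which limits how fine-grained the splits can become.
    cart <- rpart(is_exciting ~ ., data = train_set,
                  method = "class",
                  control = rpart.control(minbucket = 50))

    pred_cart <- predict(cart, newdata = test_set, type = "prob")[, "1"]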

[Figure: scatterplot of the actual data]

As the figure shows, generating good splits is difficult because of the imbalanced nature of the data. To mitigate this effect, sampling techniques were used.


5.2.2 CTREE

The ctree function from the 'party' package differs from rpart in the way variables are selected to generate splits: ctree uses a significance test procedure to select variables, instead of selecting the variable that maximizes an information measure (e.g. the Gini coefficient) as rpart does. ctree reportedly handles data imbalance better than rpart, so the CTREE method was favoured over rpart [13].

5.2.3 Managing data imbalance

To reduce the effects of data imbalance, the SMOTE function from the R package 'DMwR' was used. This function handles unbalanced classification problems using the SMOTE [14] method.

SMOTE depends on the following parameters:

perc.over - A number that drives how many extra cases from the minority class are generated (over-sampling).

perc.under - A number that drives how many extra cases from the majority classes are selected for each case generated from the minority class (under-sampling).

k - The number of nearest neighbours used to generate the new examples of the minority class.
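A sketch of this resampling step combined with ctree, using the parameter values reported below:

    library(DMwR)
    library(party)

    set.seed(42)
    # Over-sample the minority (exciting) class and under-sample the
    # majority class to balance the training data.
    balanced <- SMOTE(is_exciting ~ ., data = train_set,
                      perc.over = 200, perc.under = 200, k = 5)

    tree <- ctree(is_exciting ~ ., data = balanced)
    pred_ctree <- predict(tree, newdata = test_set)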

With SMOTE we were able to obtain reasonable accuracy and specificity, but the other methods still outperformed CART and CTREE.

Results (k = 5, perc.under = 200, perc.over = 200; predictions made on 40% of the training data set using the ctree function):

                 Predicted 0      Predicted 1
Actual 0         TN = 135198      FP = 18394
Actual 1         FN = 11951       TP = 2733


6. Conclusion

After completing this project we gained considerable knowledge of various data mining and machine learning techniques. Further, we were able to obtain an AUC score of 0.59216 from our ensemble model.

Some of the lessons learned include:

● Importance of cleaning the data - By identifying issues in the data set and cleaning them up, for example by filling in missing data and removing unnecessary data (projects prior to 2010), we were able to improve our results greatly.

● Using an ensemble of different models - Using the ensemble method, we were able to outperform the results from the individual models.

● Importance of generating new features from the available ones - Rather than using only the available features with different modelling methods, we were able to increase performance by creating new features from the available ones.

● Importance of cross-validation - Even though we got a score of 0.62488 on the public leaderboard, we only got 0.59216 on the private leaderboard. We believe this could have been avoided with better cross-validation rather than allowing the models to overfit to the public leaderboard test set.

● Tuning hyper-parameters - We were able to obtain better results by tuning the hyper-parameters. But since we used grid search, we had to limit the parameters to small value ranges. In the future it would be worthwhile to use a greedy approach to tune the hyper-parameters in order to improve model performance.


References

[1] KDD Cup Official Site. http://www.kdd.org/

[2] Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.

[3] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. URL http://www.jstatsoft.org/v45/i03/

[4] Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

[5] A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18--22.

[6] Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)

[7] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[8] Bergstra, J. and Bengio, Y., Random search for hyper-parameter optimization, The Journal of

Machine Learning Research (2012)

[9] Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized

Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL

http://www.jstatsoft.org/v33/i01/.

[10] Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., ... & Schmidhuber, J. (2010).

PyBrain. The Journal of Machine Learning Research, 11, 743-746.

[11] Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks.

Neural Networks, IEEE Transactions on, 1(2), 239-242.

[12] Leung, F. H. F., Lam, H. K., Ling, S. H., & Tam, P. K. S. (2003). Tuning of the structure and

parameters of a neural network using an improved genetic algorithm. Neural Networks, IEEE

Transactions on, 14(1), 79-88.

[13] R. Pearson. (2013-04-13). Classification Tree Models [Online]. Available:

http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html

[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.