DataRobot Entry Documentation
KDD Cup 2014: Predicting Excitement at DonorsChoose.org
Winning Entry Documentation

Name: Jeremy Achin, Location: Boston, MA, United States, Email: [email protected]
Name: Xavier Conort, Location: Singapore, Email: [email protected]
Name: Lucas Eustáquio Gomes da Silva, Location: Belo Horizonte, MG, Brazil, Email: [email protected]

Summary

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, DonorsChoose.org ships the materials to the school. The 2014 KDD Cup asked participants to help DonorsChoose.org identify projects that are exceptionally exciting to donors at the time of posting. To predict how exciting a project is, data was provided in a relational format and split by date: any project posted prior to January 1, 2014 was in the training set (along with its funding outcomes), and any project posted after January 1, 2014 was in the test set. The test set used known outcomes from January 2014 to mid-May 2014. Kaggle ignored “live” projects in the test set and did not disclose which projects were still live, to avoid leakage of the funding status. A data dictionary of the data provided is available here: https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

Our approach extracts the best features from the data and uses them in 2 Gradient Boosting Machine models:
● one based on the scikit-learn GradientBoostingRegressor (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
● and one based on the R gradient boosting machine (gbm) package (http://cran.r-project.org/web/packages/gbm/)
Both models used 2013 outcomes only as the response. This gave us a significant gain in computation time without much loss in predictive accuracy. We assumed that the mid-May cutoff in the test set produces a censoring effect on the response. To reproduce this assumed time effect in the training set, we censored the response before training our models and created 2 types of censored outcomes:
● “random cutoff outcomes”: exciting outcomes censored at 3 cutoffs drawn randomly from the first 131 days after the project was posted
● “20 weeks outcomes”: exciting outcomes censored every week during the first 20 weeks
Feature Extraction

Our feature extraction consists of:

1. Raw features from projects.csv. This file contains information about each project and was provided for both the training and test sets.
2. Lapses between projects posted by teachers.
3. Proxies of the text posted by teachers, such as number of characters, number of words, statistics on word length, number of sentences, number of words per sentence, statistics on punctuation usage, misspellings, etc.
4. Stacked predictions of the exciting outcome and of the criteria required to qualify as exciting, based on the project title and essay posted by the teacher.
5. Deviations from an “expected” project cost, estimated by a stacked prediction of the cost. The model used to predict the cost was a Gradient Boosting Machine with "primary_focus_subject", "grade_level" and "students_reached" as predictors.
6. Vendor id of the most expensive item in the project.
7. Stacked predictions of the final exciting outcome based on the name of the most expensive item.
8. History features.

To build history features, we sliced time into chunks of 4 months, computed statistics for each chunk, and used as features the stats of the last 3 chunks prior to the time chunk of the project.
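The 4-month slicing above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact implementation (which lives in the FG1_*.R scripts): the chunk indexing scheme and the zero-fill for missing history are assumptions.

```python
from datetime import date

CHUNK_MONTHS = 4  # the write-up slices time into 4-month chunks (3 per year)

def chunk_index(d: date) -> int:
    """Map a date to its 4-month chunk number."""
    return d.year * 3 + (d.month - 1) // CHUNK_MONTHS

def history_features(posted: date, chunk_stats: dict, n_prev: int = 3) -> list:
    """Stats of the last `n_prev` chunks strictly before the chunk containing
    `posted`, most recent first; chunks with no history default to 0.0."""
    c = chunk_index(posted)
    return [chunk_stats.get(c - k, 0.0) for k in range(1, n_prev + 1)]

# Toy example: mean project cost per chunk for one teacher
stats = {chunk_index(date(2013, 1, 15)): 310.0,
         chunk_index(date(2013, 5, 15)): 280.0,
         chunk_index(date(2013, 9, 15)): 400.0}
feats = history_features(date(2014, 1, 2), stats)
```

In this sketch a project posted in early January 2014 picks up the stats of the Sep–Dec, May–Aug and Jan–Apr 2013 chunks, in that order.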
The stats that we computed for each chunk include:

1. Projects posted by teachers:
   a. number of projects
   b. for each criterion, sum of projects that met the criterion
   c. criteria met by the last project
   d. mean project cost
2. Donations received by teachers:
   a. number of donations received
   b. sum and last amounts received
3. Donations made by teachers:
   a. number of donations made
   b. sum of amounts donated
   c. number of exciting projects to which the teachers donated money
   d. sum of distances between the teacher's location and the locations of the projects they sponsored
4. Donations made by the zip, city and state of the project:
   a. sum and mean amount donated
   b. sum and mean of exciting outcomes of the projects sponsored
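The write-up does not say how the teacher-to-project distance was computed. One plausible choice, assuming latitude/longitude coordinates are available for both locations, is the great-circle (haversine) distance:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points,
    using the haversine formula with a mean Earth radius of 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Boston to New York City is roughly 300 km
d = haversine_km(42.36, -71.06, 40.71, -74.01)
```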
To build stacked predictions of the exciting outcome and of the criteria met by a project, we trained regularized regressions (from the R package glmnet) on word 2-gram document-term matrices generated from the project title and the essay posted by the teacher. Regressions were trained by primary focus area, one model for each area. For each area, we built:
● one logistic regression (L1 penalty) to predict “is_exciting” using the title document-term matrix
● one logistic regression (L2 penalty) to predict “is_exciting” using the essay document-term matrix
● regressions (L2 penalty) to predict each criterion required to qualify (fully funded, at_least_1_teacher_referred_donor, ...) using the essay document-term matrix only
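The authors used glmnet in R. As a rough Python analogue, out-of-fold stacked predictions from a penalized logistic regression on an n-gram document-term matrix could look like this; the scikit-learn stand-in, the vectorizer settings and the regularization strength C are all illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def stacked_text_predictions(texts, y, n_splits=5, seed=0):
    """Out-of-fold probabilities from an L2 logistic regression trained on a
    word 1/2-gram document-term matrix. Each row is scored only by a model
    that never saw it, so the feature can be used downstream without leakage."""
    X = CountVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(texts)
    y = np.asarray(y)
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in skf.split(X, y):
        model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        model.fit(X[tr], y[tr])
        oof[va] = model.predict_proba(X[va])[:, 1]
    return oof

# Toy usage (hypothetical essays and labels)
texts = ["great science project"] * 5 + ["pencils needed"] * 5
y = [1] * 5 + [0] * 5
p = stacked_text_predictions(texts, y)
```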
Stacked predictions of the final exciting outcome based on the name of the most expensive item used an elastic net logistic regression (from glmnet) trained on a word 2-gram document-term matrix. All text stacked predictions were generated via 5-fold cross-validation. From this, we created 2 sets of features:
● FG1: mostly describes the project to predict, the teacher's project history and the teacher's donation history. The feature set includes:
○ Raw features from projects.csv
○ Lapses between projects posted by teachers
○ Stats on past projects posted by teachers
○ Stats on past donations received by teachers
○ Stats on past donations made by teachers
○ Text proxies of the text posted by teachers
● FG2: includes more features on the project itself and uses only the history of donations made by the teacher or at the project's locations (zip, city and state). This feature set can be seen as designed for teachers with little or no project history, while the first set relies more on the past performance of the teacher's projects. It includes:
○ Raw features from projects.csv
○ Lapses between projects posted by teachers
○ Stats on past donations made by teachers
○ Stats on past donations made by the zip, city and state of the project
○ Stacked predictions of the exciting outcome and of the criteria required for a project to be qualified as “exciting”, based on the project titles and essays posted by teachers
○ Deviations from the project's “expected” cost
○ Vendor id of the project's most expensive item
○ Stacked predictions of the exciting outcome based on the name of the most expensive item

Modeling Techniques and Training

We assumed that the mid-May cutoff in the test set produces a censoring effect on the response, and we expected this effect to be much stronger for the most recent months. To reproduce this assumed time effect in the training set, we censored the response before training our models and created 2 types of censored outcomes:
● “random cutoff outcomes”: exciting outcomes censored at 3 cutoffs drawn randomly from the first 131 days after the project was posted
● “20 weeks outcomes”: exciting outcomes censored every week during the first 20 weeks
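The exact censoring mechanics are not spelled out in the write-up. A minimal sketch of the “random cutoff outcomes” idea, assuming we know the number of days until each project became exciting (None if it never did), could be:

```python
import random

def random_cutoff_outcomes(days_to_exciting, n_cutoffs=3, horizon=131, seed=0):
    """For each project, draw `n_cutoffs` cutoffs uniformly from the first
    `horizon` days after posting and censor the label: the project counts as
    exciting only if it reached 'exciting' on or before the cutoff.
    Returns (cutoff_days, censored_label) pairs, tripling the data set size
    for the default n_cutoffs=3."""
    rng = random.Random(seed)
    rows = []
    for d in days_to_exciting:
        for _ in range(n_cutoffs):
            cut = rng.randint(1, horizon)
            label = int(d is not None and d <= cut)
            rows.append((cut, label))
    return rows

# Toy usage: exciting after 10 days, never exciting, exciting after 200 days
rows = random_cutoff_outcomes([10, None, 200])
```

A project that became exciting after the 131-day horizon, like the last one above, is always censored to 0, mirroring the assumed effect of the mid-May test cutoff.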
We trained:
● one sklearn gradient boosted trees model to predict the random cutoff outcomes. The model used the FG1 feature set plus the random cutoff as predictors. The training set size was tripled, since each record had 3 censored responses.
● twenty R gradient boosting machine (gbm) models, one for each week of the “20 weeks outcomes”. All models used the FG2 feature set as predictors.
All models were trained with 2013 outcomes only.

The sklearn gradient boosted trees model used the following hyperparameters:
● n_estimators: 2000
● learning_rate: 0.01
● max_features: 12
● max_depth: 7
● subsample: 1

The 20 R gradient boosting machine models used the following hyperparameters:
● distribution = "bernoulli"
● n.trees = 2500 + week_n * 100, where week_n is the number of weeks used to censor the response
● n.minobsinnode = 10
● interaction.depth = 5
● shrinkage = 0.01
● bag.fraction = 0.75
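The sklearn hyperparameters above map directly onto GradientBoostingRegressor; `random_state` is an addition for reproducibility and is not from the write-up:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hyperparameters as listed in the write-up; random_state is an assumption.
model = GradientBoostingRegressor(
    n_estimators=2000,   # number of boosting stages
    learning_rate=0.01,  # shrinkage per stage
    max_features=12,     # features considered per split
    max_depth=7,         # depth of each tree
    subsample=1.0,       # no row subsampling
    random_state=0,
)
```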
To predict outcomes in the test set, we:
● first computed the number of days “n” between the project posted date and the test set cutoff (May 12, 2014)
● used “n” as a predictor when scoring with the sklearn “random cutoff” model
● when using the R “20 weeks” gbms, selected the gbms trained with a number of weeks close to n/7
● averaged the 2 solutions
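The test-time procedure can be sketched as follows; rounding to the nearest week and clamping to the 20 available models are assumptions about what “close to n/7” means:

```python
from datetime import date

CUTOFF = date(2014, 5, 12)  # test-set cutoff from the write-up

def pick_week_model(posted: date, n_weeks_available: int = 20) -> int:
    """Choose the '20 weeks' model whose censoring week is closest to the
    number of weeks between the posting date and the test-set cutoff."""
    n_days = (CUTOFF - posted).days
    week = round(n_days / 7)
    return max(1, min(n_weeks_available, week))

def blend(pred_random_cutoff: float, pred_20w: float) -> float:
    """Simple average of the two solutions, as described above."""
    return (pred_random_cutoff + pred_20w) / 2.0

w = pick_week_model(date(2014, 3, 1))  # 72 days before the cutoff
p = blend(0.30, 0.40)
```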
Code Description

Code to generate FG1 features
| Script | Folder | Description |
| --- | --- | --- |
| FG1_functions.R | main folder | Support functions |
| RUN_FG1.R | main folder | Run solution for FG1 features |
| FG1_read_files.R | main folder | Read competition files and do simple feature transformations |
| FG1_cost.R | main folder | Build history of the cost of teachers' past projects |
| FG1_outcomes.R | main folder | Build history of outcomes of teachers' past projects |
| FG1_received.R | main folder | Build history of donations received by teachers |
| FG1_donated.R | main folder | Build history of donations made by teachers |
| FG1_txt_proxies.R, FG1_vocab.R, FG1_proxies.R | main folder | Build text proxies of the text posted by teachers |
| FG1_lapse.R | main folder | Compute lapses between projects of the same teacher |
| FG1_subset.R | main folder | List of FG1 features |
| FG1_Conso.R | main folder | Consolidate features and save them to disk |
Code to generate stacked text features

| Script | Folder | Description |
| --- | --- | --- |
| FG2_essay_NLP.R | main folder | Save to disk the text posted by teachers |
| FG2_resources.R | main folder | Extract the item name and vendor id of the most expensive item of a project and save to disk |
| RUN NLP.R | NLP | Run stacked predictions solution for text posted by teachers and item name |
| GLMNETs FITS.R | NLP | Train stacked predictions for text posted by teachers; save model and stacked predictions to disk |
| GLMNET FITS item.R | NLP | Train stacked predictions for item name; save model and stacked predictions to disk |
| _DTM_WORDS.R | NLP | Convert text into a word n-gram document-term matrix |
| _NUMBERS.R | NLP | Convert numbers into text |
| _KFolds.R | NLP | Partition data into folds |
| _METRICS.R | NLP | Functions to compute evaluation metrics |
| CV_GLMNET.R | NLP | Function to train glmnet on k folds |
| GLMNETs PREDICT.R | NLP | Predict from text posted by teachers and save predictions to disk |
| GLMNET PREDICT item.R | NLP | Predict from item name and save predictions to disk |
Code to generate FG2 features

| Script | Folder | Description |
| --- | --- | --- |
| FG2_functions.R | main folder | Support functions |
| RUN_FG2.R | main folder | Run solution for FG2 features |
| FG2_essay_NLP.R | main folder | Save to disk the text posted by teachers |
| FG2_donations_distance.R | main folder | Build features on the relative location of donations (received and made by teachers) |
| FG2_cost_deviation.R | main folder | Estimate a “normal” cost for a project |
| FG2_donation_history_per_location.R | main folder | Build history of donations made by the zip, city and state of the project |
| FG2_subset.R | main folder | List of FG2 features |
| FG2_Conso.R | main folder | Consolidate features and save them to disk |
Code to generate censored outcomes

| Script | Folder | Description |
| --- | --- | --- |
| fn.base.R | kddcup2014r | Support functions |
| data.build.R | kddcup2014r | Build the features and save them to disk |
Code to predict

| Script | Folder | Description |
| --- | --- | --- |
| sci_learn_train.py | kddcup2014py | Python script to train the gradient boosted trees model |
| train.FG1.rc.R | kddcup2014r | Train the random cutoff outcomes model |
| train.FG2.20W.R | kddcup2014r | Train the 20 weeks outcomes models |
| train.ens.R | kddcup2014r | Average the 2 solutions and save the submission file in data/submission |
How to Run the Code

1. Unzip KDD2014_DATAROBOT.zip.
2. Put the competition files into the "data/input" folder.
3. Open an R session with the folder "KDD2014_DATAROBOT" set as the working dir.
4. Run “RUN_FG1.R”.
5. Open an R session with the folder "KDD2014_DATAROBOT/NLP" set as the working dir.
6. Run “RUN NLP.R”.
7. Open an R session with the folder "KDD2014_DATAROBOT" set as the working dir.
8. Run “RUN_FG2.R”.
9. Open an R session with the folder "KDD2014_DATAROBOT/kddcup2014r" set as the working dir.
10. Run “data.build.R”.
11. Run “train.FG1.rc.R”.
12. Run “train.FG2.20W.R”.
13. Run “train.ens.R”.

The predictions will be saved in KDD2014_DATAROBOT/data/submission/ens.csv.
Dependencies

To build the solution, R and Python were used. The R version was 3.0.2 and the Python version was 2.7.3. Packages:
● R: SOAR 0.99-11, doSNOW 1.0.9, foreach 1.4.1, cvTools 0.3.2, data.table 1.8.10, Matrix 1.1-4, tau 0.0-15, RTextTools 1.4.1, glmnet 1.9-5, gbm 2.1
● Python: pandas 0.13.1, numpy 1.8.1, scikit-learn 0.15.0

All listed versions are the ones we used. The code will probably work with newer versions, but this was not tested.

Additional Comments

The time bias present in the test set made predictions very challenging. We chose to trust our solutions based on censored outcomes rather than solutions using the raw response and a linear decay to adjust the submission. Based on other competitors' feedback and our own (unselected) submissions, models with a more aggressive time decay performed better on the private leaderboard (our highest-scoring unselected submission reached 0.685). This could be explained either by a seasonality that we did not capture in our models or by additional censoring done by Kaggle in the test set.

References

J. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine”, The Annals of Statistics, Vol. 29, No. 5, 2001.
J. Friedman, “Stochastic Gradient Boosting”, 1999.
T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning”, 2nd ed., Springer, 2009.