Competition II: Springleaf
Sha Li (Team leader), Xiaoyan Chong, Minglu Ma, Yue Wang
CAMCOS Fall 2015, San Jose State University

Agenda
• Kaggle Competition: Springleaf dataset introduction
• Data Preprocessing
• Classification Methodologies & Results
  – Logistic Regression
  – Random Forest
  – XGBoost
  – Stacking
• Summary & Conclusion

Kaggle Competition: Springleaf
Objective: Predict whether customers will respond to a direct mail loan offer
• Customers: 145,231
• Independent variables: 1,932
• “Anonymous” features
• Dependent variable:
  – target = 0: DID NOT RESPOND
  – target = 1: RESPONDED
• Training set: 96,820 obs.
• Testing set: 48,411 obs.

Dataset facts
• R package used to read file: data.table::fread
• Target=0 obs.: 111,458
• Target=1 obs.: 33,773
• Numerical variables: 1,876
• Character variables: 51
• Constant variables: 5
• Variable level counts:
  – 67.0% of columns have levels <= 100
[Figures: count of levels for each column; class 0 and 1 count (76.7% vs. 23.3%); variables count]
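
The class counts above can be checked against the 76.7% / 23.3% split shown in the chart. The short Python sketch below (the deck itself used R; Python is used here only for illustration) redoes that arithmetic and also prints the accuracy that a trivial "always predict target = 0" rule would reach, which is a useful reference point when reading overall accuracy on imbalanced data. The variable names are illustrative.

```python
# Class counts taken from the slide above.
n_target0 = 111_458
n_target1 = 33_773
n_total = n_target0 + n_target1

share0 = n_target0 / n_total
share1 = n_target1 / n_total

# Accuracy of a rule that always predicts the majority class (target = 0).
majority_baseline = max(share0, share1)

print(f"total = {n_total}")
print(f"target=0: {share0:.1%}, target=1: {share1:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```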

Missing values
• “”, “NA”: 0.6%
• “[]”, -1: 2.0%
• -99999, 96, …, 999, …, 99999999: 24.9%
• 25.3% of columns have missing values
[Figures: count of NAs in each column; count of NAs in each row; chart annotation: 61.7%]
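
A minimal sketch of how such placeholder codes might be mapped to a single missing marker, assuming a column is a plain Python list. The slide elides part of the numeric list ("…"), so only the codes it names explicitly are included; `to_missing` and `missing_share` are hypothetical helper names, not functions from the project.

```python
# Values the slide lists as missing-value codes. The slide elides part of the
# numeric list, so only the codes it names explicitly appear here.
STRING_CODES = {"", "NA", "[]"}
NUMERIC_CODES = {-1, -99999, 96, 999, 99999999}

def to_missing(value, string_codes=STRING_CODES, numeric_codes=NUMERIC_CODES):
    """Return None when value is a missing-value code, else the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str):
        return None if value.strip() in string_codes else value
    return None if value in numeric_codes else value

def missing_share(column):
    """Fraction of entries in a column that are missing-value codes."""
    cleaned = [to_missing(v) for v in column]
    return sum(v is None for v in cleaned) / len(cleaned)

column = [12, -1, 999, 40, "NA", 7, "[]", 3]
print([to_missing(v) for v in column])
print(missing_share(column))
```

Codes such as 96 or 999 can be legitimate values in some columns, so in practice the set of codes would likely be chosen per column; both sets are parameters for that reason.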

Challenges for classification
• Huge dataset (145,231 × 1,932)
• “Anonymous” features
• Uneven distribution of response variable
• 27.6% missing values
• Must deal with both numerical and categorical variables
• Undetermined portion of categorical variables
• Data pre-processing complexity

Data preprocessing
• Remove ID and target
• Replace [] and -1 as NA
• Remove duplicate cols
• Replace character cols
• Remove low variance cols
• Handle NA: replace NA by median, replace NA randomly, or regard NA as a new group
• Normalize: Log(1+|x|)
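
Two of the missing-value options and the log transform can be sketched in a few lines of Python. This is an illustration under simple assumptions (a numeric column held as a list, with `None` marking NA), not the team's actual pipeline; the function names are hypothetical.

```python
import math
import random
import statistics

def impute_median(column):
    """Replace None with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

def impute_random(column, seed=0):
    """Replace None with a value drawn at random from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]

def log_transform(column):
    """Apply log(1 + |x|) to every value, as on the slide."""
    return [math.log1p(abs(v)) for v in column]

column = [3, None, 10, -7, None, 1]
filled = impute_median(column)   # the median of [3, 10, -7, 1] is 2.0
print(filled)
print(log_transform(filled))
```

Because the transform takes the absolute value, it compresses large magnitudes but discards the sign of x.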

Principal Component Analysis
When the number of PCs is close to 400, they explain 90% of the variance.
[Figure: plot with axis label pc1]
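
The "how many components for 90% of the variance" question reduces to a cumulative sum over the component variances. The sketch below assumes those variances (eigenvalues) are already available as a list; the toy numbers are hypothetical and are not the Springleaf spectrum.

```python
def components_for_variance(variances, target=0.90):
    """Smallest number of leading components whose variance share reaches target.

    variances: per-component variances (eigenvalues), in any order.
    """
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, v in enumerate(ordered, start=1):
        cumulative += v
        if cumulative / total >= target:
            return k
    return len(ordered)

# Hypothetical eigenvalues, for illustration only.
toy_variances = [4.0, 3.0, 2.0, 1.0]
print(components_for_variance(toy_variances, target=0.85))
```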

LDA: Linear discriminant analysis
• We are interested in the most discriminatory direction, not the maximum variance.
• Find the direction that best separates the two classes.
[Figure: two classes projected onto one direction; Var1 and Var2 are large, µ1 and µ2 are close, so there is significant overlap]
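
The picture above corresponds to the usual two-class Fisher criterion for a one-dimensional projection, (µ1 − µ2)² / (Var1 + Var2): a direction is good when the projected means are far apart relative to the projected variances. A small sketch with hypothetical projected values:

```python
import statistics

def fisher_score(class1, class2):
    """Fisher criterion for 1-D projected data: (mu1 - mu2)^2 / (var1 + var2)."""
    mu1, mu2 = statistics.mean(class1), statistics.mean(class2)
    var1, var2 = statistics.pvariance(class1), statistics.pvariance(class2)
    return (mu1 - mu2) ** 2 / (var1 + var2)

# Hypothetical projections of two classes onto two different directions.
overlapping = fisher_score([0, 4, 8], [1, 5, 9])     # close means, large variances
separated = fisher_score([0, 1, 2], [10, 11, 12])    # distant means, small variances
print(overlapping, separated)
```

LDA looks for the direction that maximizes this score, whereas PCA would pick the direction of largest variance regardless of the class labels.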

Methodology
• K Nearest Neighbor (KNN)
• Support Vector Machine (SVM)
• Logistic Regression
• Random Forest
• XGBoost (eXtreme Gradient Boosting)
• Stacking

K Nearest Neighbor (KNN)
[Figure: scatter plot of the two classes; legend: Target = 0, Target = 1]
[Figure: KNN accuracy (%) vs. K, overall and target = 1]

| K | Overall accuracy (%) | Target = 1 accuracy (%) |
|---|---|---|
| 3 | 72.1 | 22.8 |
| 5 | 73.9 | 18.3 |
| 7 | 75.0 | 15.3 |
| 11 | 76.1 | 12.1 |
| 15 | 76.5 | 10.5 |
| 21 | 76.8 | 9.4 |
| 39 | 77.0 | 7.5 |
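
A minimal KNN classifier, sketched in pure Python with Euclidean distance (an assumption; the deck does not state the metric). The toy points are hypothetical and only illustrate one mechanism that can push minority-class accuracy down as K grows: with a large K, the few target = 1 neighbors get outvoted.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical imbalanced toy data: five points of class 0, two of class 1.
train_rows = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (5, 5), (5, 6)]
train_labels = [0, 0, 0, 0, 0, 1, 1]
query = (5, 5.5)   # sits right between the two class-1 points

print(knn_predict(train_rows, train_labels, query, k=3))
print(knn_predict(train_rows, train_labels, query, k=7))
```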

Support Vector Machine (SVM)
• Expensive; takes a long time for each run
• Good results for numerical data

Accuracy
• Overall: 78.1%
• Target = 1: 13.3%
• Target = 0: 97.6%

Confusion matrix (rows: truth, columns: prediction)

| Truth | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 19609 | 483 |
| 1 | 5247 | 803 |
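
The three accuracy figures follow directly from the confusion matrix. The sketch below recomputes them, assuming rows are the true class and columns the predicted class, which is the reading consistent with the reported percentages; `accuracy_summary` is a hypothetical helper name.

```python
def accuracy_summary(confusion):
    """confusion[truth][prediction] holds counts for classes 0 and 1."""
    tn, fp = confusion[0]
    fn, tp = confusion[1]
    total = tn + fp + fn + tp
    return {
        "overall": (tn + tp) / total,
        "target=0": tn / (tn + fp),
        "target=1": tp / (fn + tp),
    }

svm_confusion = [[19609, 483], [5247, 803]]   # rows: truth, columns: prediction
summary = accuracy_summary(svm_confusion)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

The "target = 1" accuracy here is the share of actual responders that were predicted as responders, which is why it can be low even when overall accuracy looks high.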

Logistic Regression
• Logistic regression is a regression model where the dependent variable is categorical.
• It measures the relationship between the dependent variable and the independent variables by estimating probabilities.
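
"Estimating probabilities" means passing a linear score through the logistic (sigmoid) function. A small sketch with made-up coefficients; the weights, bias and feature values are hypothetical and not fitted to the Springleaf data.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """P(target = 1 | features) under a logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical coefficients, for illustration only.
p = predict_probability(weights=[0.8, -0.5], bias=-1.0, features=[2.0, 1.0])
print(round(p, 3), "-> predict", 1 if p >= 0.5 else 0)
```

The 0.5 cut-off is the conventional default; with an imbalanced target, a different threshold trades overall accuracy against target = 1 accuracy.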

Logistic Regression

Accuracy
• Overall: 79.2%
• Target = 1: 28.1%
• Target = 0: 94.5%

Confusion matrix (rows: truth, columns: prediction)

| Truth | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 53921 | 3159 |
| 1 | 12450 | 4853 |

[Figures: overall accuracy (axis 75.00 to 80.00) and target = 1 accuracy (axis 0.00 to 100.00) vs. number of PCs, from 2 to 320]

Random Forest
• Machine learning ensemble algorithm
  – Combining multiple predictors
• Based on tree model
• For both regression and classification
• Automatic variable selection
• Handles missing values
• Robust, improving model stability and accuracy

Random Forest
[Diagram: Train dataset → Draw bootstrap samples → Build random tree → Predict based on each tree → Majority vote]
[Figure: A Random Tree]
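
The bootstrap / random tree / majority vote loop can be sketched as follows. To stay short, each "tree" here is a one-split stump on one randomly chosen feature with the threshold at that feature's mean, which is a deliberate simplification of a real random tree; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def most_common(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """One-split tree on a single feature, thresholded at the feature's mean."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = most_common(labels)
    return (feature, threshold,
            most_common(left) if left else fallback,
            most_common(right) if right else fallback)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit n_trees stumps, each on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        feature = rng.randrange(len(rows[0]))        # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return most_common([predict_stump(stump, row) for stump in forest])

# Hypothetical, well-separated toy data.
rows = [[0, 0], [0, 1], [1, 0], [1, 1], [9, 9], [9, 10], [10, 9], [10, 10]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [9.5, 9.5]))
```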

Random Forest

Accuracy
• Overall: 79.3%
• Target = 1: 20.1%
• Target = 0: 96.8%

Confusion matrix (rows: truth, columns: prediction)

| Truth | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 36157 | 1181 |
| 1 | 8850 | 2223 |

[Figure: Tree number (500) vs. misclassification error, for target = 1, overall, and target = 0]

XGBoost
• Additive tree model: add new trees that complement the already-built ones
• Response is the optimal linear combination of all decision trees
• Popular in Kaggle competitions for efficiency and accuracy
[Figures: additive tree model built by a greedy algorithm; error vs. number of trees]
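
The additive idea (each new tree is fitted to what the current ensemble still gets wrong) can be illustrated with plain gradient boosting on squared error, using one-split regression stumps on a single feature. This is a simplified stand-in: XGBoost itself adds regularization and second-order gradient information, none of which appears here. Function names, the learning rate and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump on 1-D inputs (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:   # skip the maximum so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def predict_stump(stump, x):
    t, lmean, rmean = stump
    return lmean if x <= t else rmean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Greedily add stumps, each fitted to the residuals of the current model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, errors = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, errors

def predict(base, trees, x, learning_rate=0.5):
    """Base value plus the scaled sum of all stump outputs."""
    return base + learning_rate * sum(predict_stump(s, x) for s in trees)

# Hypothetical toy data with at least two distinct x values.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, trees, errors = boost(xs, ys)
print(errors[0], errors[-1])   # training error shrinks as trees are added
```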

XGBoost

Accuracy
• Overall: 80.0%
• Target = 1: 26.8%
• Target = 0: 96.1%

Confusion matrix (rows: truth, columns: prediction)

| Truth | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 35744 | 1467 |
| 1 | 8201 | 2999 |

[Figure: train error and test error]

Methods Comparison
[Figure: bar chart of overall and target = 1 accuracy (%) for six methods, values listed in chart order]
• Overall: 77.0, 78.1, 77.8, 79.0, 79.2, 80.0
• Target = 1: 6.6, 13.3, 19.0, 20.1, 28.1, 26.8

Winner or Combination?

Stacking
• Main Idea: Learn and combine multiple classifiers
[Diagram: labeled data (train / test) → base learners C1, C2, …, Cn → meta features → meta learner → final prediction]

Generating Base and Meta Learners
• Base model: efficiency, accuracy and diversity
  – Sampling training examples
  – Sampling features
  – Using different learning models
• Meta learner
  – Unsupervised: majority voting, weighted averaging, Kmeans
  – Supervised: higher level classifier (XGBoost)
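
The two simplest meta learners listed above, majority voting and weighted averaging, can be sketched as plain functions over the base learners' outputs. The weights and probabilities below are hypothetical; how the team weighted its base models is not stated.

```python
from collections import Counter

def majority_vote(predictions):
    """Final label = most frequent label among the base learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(probabilities, weights, threshold=0.5):
    """Combine each base learner's P(target = 1) into one label and score."""
    score = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
    return (1 if score >= threshold else 0), score

# Hypothetical outputs of three base learners for one customer.
print(majority_vote([1, 0, 1]))
label, score = weighted_average([0.9, 0.2, 0.4], weights=[2, 1, 1])
print(label, round(score, 3))
```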

Stacking model
[Diagram: data sets (total data, sparse, condense, low level, PCA, …) → base learners (XGBoost, Logistic Regression, Random Forest) → predictions (meta features) + total data = combined data → meta learner (XGBoost) → final prediction]
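
One common way to build meta features for a supervised meta learner is out-of-fold prediction: each training row receives predictions from base models that never saw that row, and those predictions are appended to the original features to form the combined data. Whether the team generated its meta features this way is not stated in the deck, so the sketch below is an assumption; the two toy base learners and all function names are hypothetical stand-ins for XGBoost, logistic regression and random forest.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_meta_features(rows, labels, base_fitters, k=3):
    """One meta-feature row per training row, predicted by models fit without it."""
    meta = [[None] * len(base_fitters) for _ in rows]
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        for j, fit in enumerate(base_fitters):
            model = fit(train_rows, train_labels)
            for i in fold:
                meta[i][j] = model(rows[i])
    return meta

# Two toy base learners (hypothetical stand-ins for the real models).
def fit_mean_threshold(rows, labels):
    """Predict 1 when the first feature exceeds its training mean."""
    mean = sum(r[0] for r in rows) / len(rows)
    return lambda row: 1 if row[0] > mean else 0

def fit_prior(rows, labels):
    """Always predict the training share of target = 1."""
    share = sum(labels) / len(labels)
    return lambda row: share

rows = [[1], [2], [3], [10], [11], [12]]
labels = [0, 0, 0, 1, 1, 1]
meta = out_of_fold_meta_features(rows, labels, [fit_mean_threshold, fit_prior])

# Combined data: original features plus meta features, ready for a meta learner.
combined = [r + m for r, m in zip(rows, meta)]
print(combined)
```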

Stacking Results

| Base Model | Accuracy | Accuracy (target=1) |
|---|---|---|
| XGB + total data | 80.0% | 28.5% |
| XGB + condense data | 79.5% | 27.9% |
| XGB + low level data | 79.5% | 27.7% |
| Logistic regression + sparse data | 78.2% | 26.8% |
| Logistic regression + condense data | 79.1% | 28.1% |
| Random forest + PCA | 77.6% | 20.9% |

| Meta Model | Accuracy | Accuracy (target=1) |
|---|---|---|
| XGB | 81.11% | 29.21% |
| Averaging | 79.44% | 27.31% |
| Kmeans | 77.45% | 23.91% |

[Figures: accuracy of base model; accuracy of XGB (accuracy and accuracy for target=1, axis 0.00% to 100.00%)]

Summary and Conclusion
• Data mining project in the real world
  – Huge and noisy data
• Data preprocessing
  – Feature encoding
  – Different missing value processes: new level, median / mean, or random assignment
• Classification techniques
  – Classifiers based on distance are not suitable
  – Classifiers handling mixed types of variables are preferred
  – Categorical variables are dominant
  – Stacking brings further improvement
• Biggest improvement came from model selection, parameter tuning, stacking
• Result comparison:
  – Winner result: 80.4%
  – Our result: 79.5%

Acknowledgements
We would like to express our deep gratitude to the following people / organizations:
• Profs. Bremer and Simic for their proposal that made this project possible
• Woodward Foundation for funding
• Prof. Simic and CAMCOS for all the support
• Prof. Chen for his guidance, valuable comments and suggestions
QUESTIONS?