Kaggle Talk Series: Top 0.2% Kaggler on the Amazon Employee Access Challenge
Amazon Employee Access Challenge
Predict an employee's access needs, given his/her job role
Yibo Chen
Data Scientist @ Supstat Inc
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
1 of 65 6/13/14, 2:01 PM
Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Introduction to the Challenge
the story
http://www.kaggle.com/c/amazon-employee-access-challenge
It is all about the access we need to fulfill our daily work.
Introduction to the Challenge
the mission
Build an auto-access model based on the historical data to determine the access privilege according to the employee's job role and the resource he/she applied for.
Introduction to the Challenge
the data
The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.
the files
· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
Introduction to the Challenge
the variables
COLUMN NAME        DESCRIPTION
ACTION             ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE           An ID for each resource
MGR_ID             The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1      Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2      Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME      Company role department description (e.g. Retail)
ROLE_TITLE         Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC   Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY        Company role family description (e.g. Retail Manager)
ROLE_CODE          Company role code; this code is unique to each role (e.g. Manager)
Introduction to the Challenge
the metric
AUC (area under the ROC curve)
· is a metric used to judge predictions in binary response (0/1) problems
· is sensitive only to the order determined by the predictions, not to their magnitudes
· can be computed in R with the packages verification or ROCR
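The ordering property can be checked directly. The sketch below uses base R only; auc_rank is a hypothetical helper (not from the talk) that computes AUC through the rank-sum (Mann-Whitney) identity instead of ROCR or verification, and the label and score vectors are toy values.

```r
# AUC via the rank-sum (Mann-Whitney) identity; auc_rank is a hypothetical helper
auc_rank <- function(label, score) {
  r <- rank(score)                      # average ranks, so ties count as 1/2
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(1, 2, 3, 6, 5, 4, 7, 8)

auc_raw <- auc_rank(label, score)        # AUC on the raw scores
auc_exp <- auc_rank(label, exp(score))   # strictly increasing transform, same order
auc_rev <- auc_rank(label, -score)       # reversed order
print(c(raw = auc_raw, exp = auc_exp, reversed = auc_rev))
```

A strictly increasing transform of the scores leaves the AUC unchanged, while reversing the order turns an AUC of a into 1 - a.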
Introduction to the Challenge
the metric
(t <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                 predict_1  = c(1, 2, 3, 4, 5, 6, 7, 8),
                 predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8),
                 predict_3  = c(1, 7, 6, 4, 5, 3, 2, 8)))
##   true_label predict_1 predict_2 predict_3
## 1          0         1         1         1
## 2          0         2         2         7
## 3          0         3         3         6
## 4          0         4         6         4
## 5          1         5         5         5
## 6          1         6         4         3
## 7          1         7         7         2
## 8          1         8         8         8
Introduction to the Challenge
the metric
P: 4, N: 4, TP: 2, FP: 1, TPR = TP/P = 0.5, FPR = FP/N = 0.25
table(t$predict_2 >= 6, t$true_label)
##
##         0 1
##   FALSE 3 2
##   TRUE  1 2
Introduction to the Challenge
the metric
P: 4, N: 4, TP: 3, FP: 1, TPR = TP/P = 0.75, FPR = FP/N = 0.25
table(t$predict_2 >= 5, t$true_label)
##
##         0 1
##   FALSE 3 1
##   TRUE  1 3
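The two confusion tables above are two points of the ROC curve. As a minimal sketch, the same arithmetic can be repeated for every cut-off; roc_point is a hypothetical helper name, and the vectors repeat the true_label and predict_2 columns of the toy data.

```r
# True/false positive rates at a given cut-off; roc_point is a hypothetical helper
roc_point <- function(label, score, threshold) {
  pred_pos <- score >= threshold
  tp <- sum(pred_pos & label == 1)
  fp <- sum(pred_pos & label == 0)
  c(threshold = threshold,
    tpr = tp / sum(label == 1),
    fpr = fp / sum(label == 0))
}

true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

# one ROC point per distinct threshold, from the strictest to the loosest
thresholds <- sort(unique(predict_2), decreasing = TRUE)
roc_points <- t(sapply(thresholds, function(th) roc_point(true_label, predict_2, th)))
print(roc_points)
```

Plotting the tpr column against the fpr column traces the ROC curve, and the area under it is the AUC.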
Introduction to the Challenge
the metric
require(ROCR, quietly = TRUE)
pred <- prediction(t$predict_1, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 1
require(verification, quietly = TRUE)
roc.area(t$true_label, t$predict_1)$A
## [1] 1
pred <- prediction(t$predict_1, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
Introduction to the Challenge
the metric
pred <- prediction(t$predict_2, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.875
roc.area(t$true_label, t$predict_2)$A
## [1] 0.875
pred <- prediction(t$predict_2, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
Introduction to the Challenge
the metric
pred <- prediction(t$predict_3, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.5
roc.area(t$true_label, t$predict_3)$A
## [1] 0.5
pred <- prediction(t$predict_3, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
Look into the Data
load data from files
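The code for this slide is not part of the transcript. The sketch below is one plausible way to do the loading and is an assumption throughout: it stacks the predictors of train.csv and test.csv into x and sets the target y to NA for the test rows, which would be consistent with the NA count in the target table that follows. Tiny stand-in files are written to a temporary directory so that the block runs on its own; the helper toy and all values in it are hypothetical.

```r
# Hypothetical sketch: stand-in files are written to a temp directory so the block
# is runnable; with the real data, point data_dir at the folder that holds
# train.csv and test.csv and skip the two write.csv() calls.
data_dir <- tempdir()
cols <- c("RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2", "ROLE_DEPTNAME",
          "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE")
toy <- function(n) {
  m <- matrix(seq_len(n * length(cols)), nrow = n, dimnames = list(NULL, cols))
  as.data.frame(m)
}
write.csv(cbind(ACTION = c(1, 0, 1), toy(3)),
          file.path(data_dir, "train.csv"), row.names = FALSE)
write.csv(toy(2), file.path(data_dir, "test.csv"), row.names = FALSE)

train <- read.csv(file.path(data_dir, "train.csv"))
test  <- read.csv(file.path(data_dir, "test.csv"))
names(train) <- tolower(names(train))
names(test)  <- tolower(names(test))

# stack the predictors of both sets; the target is NA for the test rows
predictors <- setdiff(names(train), "action")
x <- rbind(train[, predictors], test[, predictors])
y <- c(train$action, rep(NA, nrow(test)))
table(y, useNA = "ifany")
```

Keeping train and test in one data frame means every later recoding (level reduction, dummies, counts) is applied to both sets in the same way.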
Look into the Data
the target
table(y, useNA = "ifany")
## y
##     0     1  <NA>
##  1897 30872 58921
Look into the Data
the predictor
Look into the Data
treat the features as Categorical or Numerical?
sapply(x, function(z) {
  length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68
##        role_code
##              361
Look into the Data
par(mar = c(5, 4, 0, 2))
plot(x$role_title, x$role_code)
Look into the Data
length(unique(x$role_title))
## [1] 361
length(unique(x$role_code))
## [1] 361
length(unique(paste(x$role_code, x$role_title)))
## [1] 361
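The three equal counts above say that role_title and role_code carry the same information: each title pairs with exactly one code and the reverse. A minimal sketch of the same check as a reusable function follows; is_one_to_one is a hypothetical helper, and the vectors are toy values, not the competition data. It counts distinct pairs with a data frame instead of paste(), which avoids accidental collisions when values are glued together.

```r
# is_one_to_one is a hypothetical helper: two columns are interchangeable as
# categorical features when every value of one maps to exactly one value of the other
is_one_to_one <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a = a, b = b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

role_title  <- c(101, 102, 101, 103, 102)        # toy values
role_code   <- c("A", "B", "A", "C", "B")        # a pure relabelling of role_title
role_family <- c("F1", "F1", "F1", "F2", "F1")   # a coarser grouping

same_info   <- is_one_to_one(role_title, role_code)
coarser     <- is_one_to_one(role_title, role_family)
c(title_vs_code = same_info, title_vs_family = coarser)
```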
Look into the Data
x <- x[, names(x) != "role_code"]
sapply(x, function(z) {
  length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68
Look into the Data
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
Look into the Data
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
Look into the Data
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
Look into the Data
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
Look into the Data
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
Model Building
workflow
· Feature Extraction
· Base Learners
· Ensemble
Model Building
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
Model Building
1. the raw features (as numerical)
Model Building
2. the raw features (as categorical) with level reduction
2.1 choose the top frequency categories
VAR_RAW  FREQUENCY  VAR_WITH_LEVEL_REDUCTION
a        3          a
a        3          a
a        3          a
b        2          b
b        2          b
c        1          other
d        1          other
# keep the two most frequent levels of each column and recode the rest as "other"
for (i in seq_len(ncol(x))) {
  x[, i] <- as.character(x[, i])  # a factor column would turn "other" into NA
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}
Model Building
2. the raw features (as categorical) with level reduction
2.2 use Pearson's Chi-squared Test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
##     mgr_770 mgr_not_770
##   0       5        1892
##   1     147       30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
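The slide shows the test for a single level. As a sketch of how it could drive level reduction, the block below rebuilds the 2 x 2 table from its four counts and applies a cut-off to the p-value. The keep-or-merge rule and the 0.05 threshold are assumptions for illustration; the talk does not state which rule was used.

```r
# Rebuild the 2 x 2 table shown above from its four counts (filled column by column)
tab <- matrix(c(5, 147, 1892, 30725), nrow = 2,
              dimnames = list(action = c("0", "1"),
                              mgr = c("mgr_770", "mgr_not_770")))
p_770 <- chisq.test(tab)$p.value   # Yates-corrected by default for 2 x 2 tables
round(p_770, 4)

# Hypothetical rule (an assumption, not stated in the talk): keep a level as its
# own category only when its indicator is significantly associated with the target
alpha <- 0.05
keep_level <- p_770 < alpha
keep_level
```

Under this rule mgr_id 770 would be merged into a pooled "other" level, since its approval rate is not distinguishable from that of the remaining managers.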
Model Building
3. the dummies (in sparse Matrix)
ID  VAR  VAR_A  VAR_B  VAR_C
1   a    1      0      0
2   a    1      0      0
3   a    1      0      0
4   b    0      1      0
5   c    0      0      1
Model Building
3. the dummies (in sparse Matrix)
use package Matrix to create the dummies
require(Matrix)
set.seed(114)
Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . 1 . . . 1
## [2,] . 1 . . . . 1 .
## [3,] 1 . . . . . . .
## [4,] . . . . . 1 . .
## [5,] . . . . . . . .
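The random matrix above only shows what a sparse matrix looks like. A minimal sketch of building actual dummies for one categorical variable follows, using the toy VAR column (a, a, a, b, c) from the dummy table; the object names are illustrative.

```r
library(Matrix)

# toy categorical variable: the VAR column of the small dummy table
var <- factor(c("a", "a", "a", "b", "c"))

# one non-zero entry per row: row i, column = level index of var[i]
dummies <- sparseMatrix(i = seq_along(var),
                        j = as.integer(var),
                        x = 1,
                        dims = c(length(var), nlevels(var)),
                        dimnames = list(NULL, paste0("VAR_", toupper(levels(var)))))
dummies
```

Only the non-zero cells are stored, which is what makes thousands of levels (for example the 7518 resources) manageable. Matrix also offers sparse.model.matrix() for building such matrices from a formula.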
Model Building
4. the dummies including the interaction
ID  M  N  MN_AP  MN_AQ  MN_BP  MN_BQ
1   a  p  1      0      0      0
2   a  p  1      0      0      0
3   a  q  0      1      0      0
4   b  p  0      0      1      0
5   b  q  0      0      0      1
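The table above can be reproduced in a few lines. This is a sketch with base R on the same toy values; it uses a dense model.matrix() because the example is tiny, and the variable names are illustrative.

```r
# toy data: the M and N columns of the table above
m <- c("a", "a", "a", "b", "b")
n <- c("p", "p", "q", "p", "q")

# combine the two variables into a single factor, one level per observed pair
mn <- factor(paste0(toupper(m), toupper(n)))

# expand the combined factor into 0/1 columns (dense here; the toy data is tiny)
mn_dummies <- model.matrix(~ mn - 1)
colnames(mn_dummies) <- paste0("MN_", levels(mn))
mn_dummies
```

Each row has exactly one 1, in the column of its (M, N) pair, so a linear model gets a separate coefficient for every combination.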
Model Building
5. some derived variables (count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion
Model Building
5. some derived variables (count & ratio)
tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
                         'c2_resource_role_deptname_ratio_i',
                         'c2_resource_role_deptname_ratio_j')]
cbind(tmp1, tmp2)
##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
## 114           1             1645                                1
## 115          36             1312                                4
## 116          45              465                               24
## 117         374             2377                              169
##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
## 114                            1.0000                         0.0006079
## 115                            0.1111                         0.0030488
## 116                            0.5333                         0.0516129
## 117                            0.4519                         0.0710980
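The printed rows suggest how the ratios are defined: in row 115, 4 / 36 = 0.1111 and 4 / 1312 = 0.0030488, so ratio_i looks like the pair count divided by the resource count and ratio_j like the pair count divided by the department count. The sketch below computes the same kind of variables on toy data under that reading; the shortened column names and the data are assumptions, not the talk's code.

```r
# toy data, not the competition data
d <- data.frame(resource      = c("r1", "r1", "r1", "r2", "r2", "r3"),
                role_deptname = c("d1", "d1", "d2", "d1", "d2", "d2"))
ones <- rep(1, nrow(d))

# frequency of each category: how many rows share this row's value
d$c1_resource      <- ave(ones, d$resource, FUN = sum)
d$c1_role_deptname <- ave(ones, d$role_deptname, FUN = sum)

# frequency of the interaction: how many rows share this (resource, deptname) pair
d$cnt_ij <- ave(ones, d$resource, d$role_deptname, FUN = sum)

# proportions: share of the pair within each of its two parent categories
d$ratio_i <- d$cnt_ij / d$c1_resource
d$ratio_j <- d$cnt_ij / d$c1_role_deptname
d
```

These columns turn high-cardinality IDs into a handful of numeric features that tree-based learners can split on directly.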
Model Building
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold cv holdout predictions
· algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
· algorithms in level-2 (Regularized Generalized Linear Model)
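The mechanics of two-stage stacking can be sketched on simulated data. To stay in base R, plain glm() logistic regressions stand in for both glmnet and gbm, so this illustrates the 5-fold holdout scheme only, not the models used in the talk; the data, formulas and object names are all hypothetical.

```r
set.seed(114)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))

# assign every row to one of 5 folds
fold <- sample(rep(1:5, length.out = n))

# level 1: each base learner predicts a fold only from a model fit on the other 4
level1 <- list(m_x1 = y ~ x1, m_x2 = y ~ x2)   # stand-ins for glmnet and gbm
holdout <- matrix(NA_real_, nrow = n, ncol = length(level1),
                  dimnames = list(NULL, names(level1)))
for (k in 1:5) {
  in_fold <- fold == k
  for (m in names(level1)) {
    fit <- glm(level1[[m]], data = d[!in_fold, ], family = binomial)
    holdout[in_fold, m] <- predict(fit, newdata = d[in_fold, ], type = "response")
  }
}

# level 2: a logistic regression on the holdout predictions
stack_data <- data.frame(y = d$y, holdout)
level2 <- glm(y ~ m_x1 + m_x2, data = stack_data, family = binomial)
stacked <- predict(level2, type = "response")
summary(stacked)
```

Because every level-1 prediction comes from a model that never saw that row, the level-2 model learns how much to trust each base learner without being misled by its training fit.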
Model Building
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
Model Building
1. Regularized Generalized Linear Model
· logistic regression
set.seed(114)  # set the seed first so that x is reproducible as well
x <- sort(rnorm(100))
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                     # linear fit, not bounded to [0, 1]
m2 <- glm(y ~ x, family = binomial(link = logit))   # logistic fit
y2 <- predict(m2, type = "response")                # fitted probabilities for the training x
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)
points(x, y2, type = "l", lwd = 3, col = 3)
Model Building
1. Regularized Generalized Linear Model
· logistic regression
· convex penalties
Model Building
1. Regularized Generalized Linear Model
· convex penalties
- L1 (lasso)
- L2 (ridge regression)
- mixture of L1 & L2 (elastic net)
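The three penalties differ only in how the coefficient vector is measured. The sketch below writes the elastic net penalty in the parameterisation used by the glmnet package, where alpha = 1 gives the lasso and alpha = 0 gives ridge; enet_penalty is a hypothetical helper and the coefficients are made-up values.

```r
# Elastic net penalty in the parameterisation used by glmnet:
#   lambda * ( (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)) )
# enet_penalty is a hypothetical helper for illustration only.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)   # made-up coefficients
lasso <- enet_penalty(beta, lambda = 0.1, alpha = 1)     # pure L1
ridge <- enet_penalty(beta, lambda = 0.1, alpha = 0)     # pure L2
mixed <- enet_penalty(beta, lambda = 0.1, alpha = 0.5)   # elastic net
c(lasso = lasso, ridge = ridge, mixed = mixed)
```

The penalty is added to the negative log-likelihood of the glm. Because both terms are convex in beta, the penalised problem stays convex for any alpha between 0 and 1.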
Model Building
1. Regularized Generalized Linear Model
· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet
Model Building
2. Support Vector Machine (just for Diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071
Model Building
decision tree
Model Building
3. Random Forest
decision trees + bagging
Model Building
3. Random Forest
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
Model Building
4. Gradient Boosting Machine
decision trees + boosting
Model Building
4. Gradient Boosting Machine
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
Summary
some insights
VARIABLE NAME                                          REL.INF
cnt2_resource_role_deptname_cnt_ij                     2.542974017
cnt2_resource_role_rollup_2_ratio_i                    2.107624216
cnt2_resource_role_deptname_ratio_j                    2.017153645
cnt2_resource_role_rollup_2_ratio_j                    1.910465811
cnt2_resource_role_family_ratio_i                      1.770737494
...                                                    ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc    0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title   0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc    0.002106958
Summary
some insights
summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
              'cnt2_resource_role_deptname_ratio_j')])
##  cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
##  Min.   :  1.0                      Min.   :0.0003
##  1st Qu.:  2.0                      1st Qu.:0.0061
##  Median :  7.0                      Median :0.0172
##  Mean   : 15.6                      Mean   :0.0315
##  3rd Qu.: 17.0                      3rd Qu.:0.0368
##  Max.   :201.0                      Max.   :1.0000
Summary
some insights
xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
tt <- t.test(xx ~ y)
list(estimate = tt$estimate,
     conf.int = tt$conf.int, p.value = tt$p.value)
## $estimate
## mean in group 0 mean in group 1
##           10.04           13.82
##
## $conf.int
## [1] -4.851 -2.710
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 5.838e-12
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
Summary
some insights
xxx <- cut(xx, include.lowest = TRUE,
           breaks = c(0, 1, 3, 7, 14, 30, 300))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
Summary
some insights
xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
tt <- t.test(xx ~ y)
list(estimate = tt$estimate,
     conf.int = tt$conf.int, p.value = tt$p.value)
## $estimate
## mean in group 0 mean in group 1
##         0.01955         0.02902
##
## $conf.int
## [1] -0.011732 -0.007205
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 3.93e-16
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
Summary
some insights
xxx <- cut(xx, include.lowest = TRUE,
           breaks = quantile(xx, seq(0, 1, 0.2)))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
Summary
overfitting
MODEL                             AUC_CV      AUC_PUBLIC   AUC_PRIVATE
num_glmnet_0                      0.8985069   0.87737      0.87385
stacking_gbm_with_the_glmnet      0.9277316   0.90695      0.90478
stacking_gbm_without_the_glmnet   0.9182303   0.91529      0.91130
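The overfitting shows up as the gap between the cross-validated AUC and the leaderboard AUC. A short sketch that computes the gap from the three rows of the table above follows; the column name gap_cv_private is introduced here for illustration.

```r
scores <- data.frame(
  model = c("num_glmnet_0",
            "stacking_gbm_with_the_glmnet",
            "stacking_gbm_without_the_glmnet"),
  auc_cv      = c(0.8985069, 0.9277316, 0.9182303),
  auc_public  = c(0.87737, 0.90695, 0.91529),
  auc_private = c(0.87385, 0.90478, 0.91130)
)

# optimism of the cross-validation estimate relative to the private leaderboard
scores$gap_cv_private <- scores$auc_cv - scores$auc_private
scores[order(scores$gap_cv_private), c("model", "gap_cv_private")]
```

The stack that includes the glmnet model has the best cross-validated AUC, yet the stack without it has both the smallest gap (about 0.007 versus about 0.023) and the best private score.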
Summary
overfitting
Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
Summary
useful discussions
Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
thank you