Kaggle talk series top 0.2% kaggler on amazon employee access challenge


Keywords: NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, Machine Learning, Kaggle, Amazon Employee Access Challenge


Amazon Employee Access Challenge

Predict an employee's access needs, given his/her job role

Yibo Chen, Data Scientist @ Supstat Inc

Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html

1 of 65 6/13/14, 2:01 PM


Agenda

1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary


Introduction to the Challenge

the story

http://www.kaggle.com/c/amazon-employee-access-challenge

It is all about the access we need to fulfill our daily work.


the mission

Build an auto-access model based on the historical data, to determine the access privilege according to the employee's job role and the resource he or she applied for.


the data

The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.

the files

· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.


the variables

COLUMN NAME       DESCRIPTION
ACTION            ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE          An ID for each resource
MGR_ID            The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1     Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2     Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME     Company role department description (e.g. Retail)
ROLE_TITLE        Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC  Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY       Company role family description (e.g. Retail Manager)
ROLE_CODE         Company role code; this code is unique to each role (e.g. Manager)


the metric

AUC (area under the ROC curve)

· is a metric used to judge predictions in binary response (0/1) problems
· is sensitive only to the order determined by the predictions, not to their magnitudes
· can be computed in R with the package verification or ROCR
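The order-only property can be checked with a minimal base-R sketch. The helper `auc_by_rank` is hypothetical (it is not part of ROCR or verification); it computes AUC as the fraction of (positive, negative) pairs that the predictions rank correctly, counting ties as one half. The labels and scores are toy values.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs
# ranked in the right order; ties count as 0.5.
auc_by_rank <- function(label, score) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(1, 2, 3, 6, 5, 4, 7, 8)

auc_raw <- auc_by_rank(label, score)
# A monotone transform changes the magnitudes but not the order.
auc_log <- auc_by_rank(label, log(score))
print(c(auc_raw, auc_log))
```

Both values agree, because only the ranking of the scores enters the computation.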


(t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),

predict_1=c(1,2,3,4,5,6,7,8),

predict_2=c(1,2,3,6,5,4,7,8),

predict_3=c(1,7,6,4,5,3,2,8)))

## true_label predict_1 predict_2 predict_3

## 1 0 1 1 1

## 2 0 2 2 7

## 3 0 3 3 6

## 4 0 4 6 4

## 5 1 5 5 5

## 6 1 6 4 3

## 7 1 7 7 2

## 8 1 8 8 8


P: 4, N: 4, TP: 2, FP: 1, TPR = TP/P = 0.5, FPR = FP/N = 0.25

table(t$predict_2 >= 6, t$true_label)

##

## 0 1

## FALSE 3 2

## TRUE 1 2


P: 4, N: 4, TP: 3, FP: 1, TPR = TP/P = 0.75, FPR = FP/N = 0.25

table(t$predict_2 >= 5, t$true_label)

##

## 0 1

## FALSE 3 1

## TRUE 1 3
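Each threshold gives one (FPR, TPR) point, and the ROC curve is the set of all such points. A minimal base-R sketch, assuming the predict_2 scores from the toy table, sweeps every threshold and integrates the curve with the trapezoidal rule. The variable names are illustrative.

```r
label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(1, 2, 3, 6, 5, 4, 7, 8)   # predict_2 from the toy table

# One (FPR, TPR) point per distinct threshold, from strictest to loosest.
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thresholds, function(h) sum(score >= h & label == 1) / sum(label == 1))
fpr <- sapply(thresholds, function(h) sum(score >= h & label == 0) / sum(label == 0))

# Start the curve at (0, 0), then integrate with the trapezoidal rule.
fpr <- c(0, fpr)
tpr <- c(0, tpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

roc <- data.frame(threshold = c(Inf, thresholds), fpr = fpr, tpr = tpr)
print(roc)
print(auc)
```

The rows for thresholds 6 and 5 reproduce the two confusion tables, and the area equals the AUC that ROCR reports for predict_2.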


require(ROCR, quietly = T)

pred <- prediction(t$predict_1, t$true_label)

performance(pred, "auc")@y.values[[1]]

## [1] 1

require(verification, quietly = T)

roc.area(t$true_label, t$predict_1)$A

## [1] 1

pred <- prediction(t$predict_1, t$true_label)

perf <- performance(pred, "tpr", "fpr")

plot(perf, col = 2, lwd = 3)


pred <- prediction(t$predict_2, t$true_label)

performance(pred, "auc")@y.values[[1]]

## [1] 0.875

roc.area(t$true_label, t$predict_2)$A

## [1] 0.875

pred <- prediction(t$predict_2, t$true_label)

perf <- performance(pred, "tpr", "fpr")

plot(perf, col = 2, lwd = 3)


pred <- prediction(t$predict_3, t$true_label)

performance(pred, "auc")@y.values[[1]]

## [1] 0.5

roc.area(t$true_label, t$predict_3)$A

## [1] 0.5

pred <- prediction(t$predict_3, t$true_label)

perf <- performance(pred, "tpr", "fpr")

plot(perf, col = 2, lwd = 3)


Look into the Data

load data from files
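The loading code itself did not survive in the transcript. The sketch below is one possible version, under these assumptions: train.csv carries ACTION, test.csv carries an id column instead, column names are lower-cased, and the two sets are stacked so that the target is NA for test rows. Tiny inline CSV text with made-up values stands in for the real files.

```r
# Tiny stand-ins for train.csv / test.csv (made-up values, only three of the
# predictor columns). With the real files you would call
# read.csv("train.csv") and read.csv("test.csv") instead.
train_txt <- "ACTION,RESOURCE,MGR_ID,ROLE_TITLE
1,39353,85475,117885
0,17183,1540,118536
1,36724,14457,118321"
test_txt <- "id,RESOURCE,MGR_ID,ROLE_TITLE
1,78766,72734,118079
2,40867,4378,118422"

train <- read.csv(text = train_txt)
test  <- read.csv(text = test_txt)
names(train) <- tolower(names(train))
names(test)  <- tolower(names(test))

# Stack the predictors of both sets; the target is NA for the test rows.
features <- setdiff(names(train), "action")
x <- rbind(train[, features], test[, features])
y <- c(train$action, rep(NA, nrow(test)))

print(table(y, useNA = "ifany"))
```

Stacking train and test keeps the category levels and counts consistent across both sets when features are derived later.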


the target

table(y, useNA = "ifany")

## y

## 0 1 <NA>

## 1897 30872 58921


the predictor


treat the features as Categorical or Numerical?

sapply(x, function(z) {

length(unique(z))

})

## resource mgr_id role_rollup_1 role_rollup_2

## 7518 4913 130 183

## role_deptname role_title role_family_desc role_family

## 476 361 2951 68

## role_code

## 361


par(mar = c(5, 4, 0, 2))

plot(x$role_title, x$role_code)


length(unique(x$role_title))

## [1] 361

length(unique(x$role_code))

## [1] 361

length(unique(paste(x$role_code, x$role_title)))

## [1] 361


# role_code maps one-to-one onto role_title (361 unique pairs), so drop it
x <- x[, names(x) != "role_code"]

sapply(x, function(z) {

length(unique(z))

})

## resource mgr_id role_rollup_1 role_rollup_2

## 7518 4913 130 183

## role_deptname role_title role_family_desc role_family

## 476 361 2951 68


check the distribution - role_family_desc

hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)


check the distribution - resource

hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)


check the distribution - mgr_id

hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)


treat the features as Categorical or Numerical?

YetiMan shared his findings in the forum:

1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.

2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.


our approach

1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
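The three treatments can be sketched in base R as follows. The data frame holds made-up IDs and only three of the columns; the names `x_cat`, `x_num` and `x_mix` are illustrative, not taken from the talk.

```r
# Made-up IDs; the real columns are integer codes of the same kind.
x <- data.frame(resource   = c(101, 205, 101, 330),
                mgr_id     = c(770, 12, 770, 45),
                role_title = c(9001, 9002, 9001, 9003))

# 1. all features as categorical
x_cat <- as.data.frame(lapply(x, factor))
# 2. all features as numerical
x_num <- as.data.frame(lapply(x, as.numeric))
# 3. mgr_id numerical, the others categorical
x_mix <- x_cat
x_mix$mgr_id <- x$mgr_id

str(x_mix)
```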


Model Building

workflow

· Feature Extraction
· Base Learners
· Ensemble


Feature Extraction

1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)


1. the raw features (as numerical)


2. the raw features (as categorical) with level reduction

2.1 choose the top frequency categories

VAR_RAW  FREQUENCY  VAR_WITH_LEVEL_REDUCTION
a        3          a
a        3          a
a        3          a
b        2          b
b        2          b
c        1          other
d        1          other

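# keep the 2 most frequent levels of each column; recode the rest as "other"
for (i in 1:ncol(x)) {
  x[, i] <- as.character(x[, i])  # a factor column would turn "other" into NA
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}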


2.2 use Pearson's Chi-squared Test

table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))

##

## mgr_770 mgr_not_770

## 0 5 1892

## 1 147 30725

chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value

## [1] 0.2507
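The slide shows a single level-versus-rest test; how the p-values were turned into a reduced set of levels is not stated. One possible reading, sketched under that assumption: keep a level only when its test against the target is significant, and merge everything else into "other". The helper `reduce_levels_chisq`, the cutoff `alpha = 0.05` and the toy data are all hypothetical.

```r
# Hypothetical helper: keep a level only if a level-vs-rest chi-squared test
# against the target is significant at `alpha`; merge the rest into "other".
reduce_levels_chisq <- function(v, y, alpha = 0.05) {
  v <- as.character(v)
  p <- sapply(unique(v), function(l) {
    suppressWarnings(chisq.test(table(y, v == l))$p.value)
  })
  keep <- names(p)[!is.na(p) & p < alpha]
  ifelse(v %in% keep, v, "other")
}

# Toy data: levels a and b are strongly tied to y, level c is not.
v <- c(rep("a", 40), rep("b", 40), rep("c", 20))
y <- c(rep(1, 40), rep(0, 40), rep(c(0, 1), 10))
reduced <- reduce_levels_chisq(v, y)
print(table(v, reduced))
```

Under this rule mgr_id 770, with a p-value of 0.2507, would be merged into "other".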


3. the dummies (in sparse Matrix)

ID  VAR  VAR_A  VAR_B  VAR_C
1   a    1      0      0
2   a    1      0      0
3   a    1      0      0
4   b    0      1      0
5   c    0      0      1


use package Matrix to create the dummies

require(Matrix)

set.seed(114)

Matrix(sample(c(0, 1), 40, re = T, prob = c(0.6, 0.1)), nrow = 5)

## 5 x 8 sparse Matrix of class "dgCMatrix"

##

## [1,] . . . 1 . . . 1

## [2,] . 1 . . . . 1 .

## [3,] 1 . . . . . . .

## [4,] . . . . . 1 . .

## [5,] . . . . . . . .
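The random matrix above only illustrates the sparse storage format. To build dummies from an actual categorical column, one option is `sparse.model.matrix` from the same Matrix package; whether the talk used this function is an assumption. The sketch reuses the five-row VAR example.

```r
library(Matrix)

d <- data.frame(id = 1:5, var = factor(c("a", "a", "a", "b", "c")))
# "- 1" drops the intercept so every level gets its own indicator column.
dummies <- sparse.model.matrix(~ var - 1, data = d)
print(dummies)
```

Only the non-zero cells are stored, which matters when a column such as resource has thousands of levels.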


4. the dummies including the interaction

ID  M  N  MN_AP  MN_AQ  MN_BP  MN_BQ
1   a  p  1      0      0      0
2   a  p  1      0      0      0
3   a  q  0      1      0      0
4   b  p  0      0      1      0
5   b  q  0      0      0      1
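A minimal base-R sketch of the table above: paste the two variables into one combined category, then dummy-code it. The column naming scheme `MN_` is copied from the table; the rest is illustrative.

```r
d <- data.frame(id = 1:5,
                m = c("a", "a", "a", "b", "b"),
                n = c("p", "p", "q", "p", "q"))

# Combine the two variables into one categorical variable, then dummy-code it.
d$mn <- factor(toupper(paste0(d$m, d$n)))
dummies <- model.matrix(~ mn - 1, data = d)
colnames(dummies) <- paste0("MN_", levels(d$mn))

print(cbind(d[, c("id", "m", "n")], dummies))
```

For the real data the dense `model.matrix` would be replaced by a sparse representation, since the number of combined levels grows quickly.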


5. some derived variables (count & ratio)

· the frequency of every category
· the frequency of the interactions
· the proportion


tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]

tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',

'c2_resource_role_deptname_ratio_i',

'c2_resource_role_deptname_ratio_j')]

cbind(tmp1, tmp2)

## c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij

## 114 1 1645 1

## 115 36 1312 4

## 116 45 465 24

## 117 374 2377 169

## c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j

## 114 1.0000 0.0006079

## 115 0.1111 0.0030488

## 116 0.5333 0.0516129

## 117 0.4519 0.0710980
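The output is consistent with ratio_i = cnt_ij / c1_resource and ratio_j = cnt_ij / c1_role_deptname (row 115: 4/36 = 0.1111 and 4/1312 = 0.0030488). The code that built cnt_1 and cnt_2 is not in the transcript, so the following base-R sketch is an assumption about how such columns could be derived, shown on toy data.

```r
x <- data.frame(resource      = c(1, 1, 1, 2, 2, 3),
                role_deptname = c(10, 10, 20, 10, 20, 20))
ones <- rep(1, nrow(x))

# one-way counts: how often each category occurs
c1_resource      <- ave(ones, x$resource, FUN = sum)
c1_role_deptname <- ave(ones, x$role_deptname, FUN = sum)
# two-way count: how often each (resource, role_deptname) pair occurs
cnt_ij <- ave(ones, x$resource, x$role_deptname, FUN = sum)
# ratios: the pair count as a share of each one-way count
ratio_i <- cnt_ij / c1_resource
ratio_j <- cnt_ij / c1_role_deptname

print(cbind(x, c1_resource, c1_role_deptname, cnt_ij, ratio_i, ratio_j))
```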


Base Learners

1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine


Ensemble

1. mean prediction of all models
2. two-stage stacking
   · based on 5-fold cv holdout predictions
   · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
   · algorithms in level-2 (Regularized Generalized Linear Model)
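The mechanics of two-stage stacking can be sketched in base R. This is not the talk's code: plain `glm` on two feature subsets stands in for the glmnet and gbm level-1 learners, plain `glm` stands in for the level-2 glmnet, and the data are simulated.

```r
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2 + 0.5 * d$x3))

# Level-1 learners: plain glm on two feature subsets (stand-ins for glmnet / gbm).
level1 <- list(m_a = y ~ x1 + x2, m_b = y ~ x2 + x3)

k <- 5
fold <- sample(rep(1:k, length.out = n))
holdout <- matrix(NA_real_, n, length(level1),
                  dimnames = list(NULL, names(level1)))

for (f in 1:k) {
  tr <- fold != f
  for (m in names(level1)) {
    fit <- glm(level1[[m]], family = binomial, data = d[tr, ])
    # predict only on the fold that this fit has not seen
    holdout[!tr, m] <- predict(fit, newdata = d[!tr, ], type = "response")
  }
}

# Level-2 learner: trained on the holdout predictions, not on the raw features.
stack <- data.frame(holdout, y = d$y)
level2 <- glm(y ~ m_a + m_b, family = binomial, data = stack)
stacked_pred <- predict(level2, type = "response")

# Baseline ensemble: the mean prediction of all level-1 models.
mean_pred <- rowMeans(holdout)
```

To score new data, each level-1 learner would be refit on all training rows and its predictions passed through `level2`.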


1. Regularized Generalized Linear Model

· generalized linear model (glm)
· convex penalties


· logistic regression

set.seed(114)  # seed before the first random draw so both x and y are reproducible
x <- sort(rnorm(100))
y <- c(sample(x=c(0,1),size=30,prob=c(0.9,0.1),replace=TRUE),
       sample(x=c(0,1),size=20,prob=c(0.7,0.3),replace=TRUE),
       sample(x=c(0,1),size=20,prob=c(0.3,0.7),replace=TRUE),
       sample(x=c(0,1),size=30,prob=c(0.1,0.9),replace=TRUE))
m1 <- lm(y~x)                               # linear fit, for comparison
m2 <- glm(y~x,family=binomial(link=logit))  # logistic regression
y2 <- predict(m2,type='response')           # fitted probabilities at the training x
par(mar=c(5,4,0,0))
plot(y~x);abline(m1,lwd=3,col=2)
points(x,y2,type='l',lwd=3,col=3)


· convex penalties
   - L1 (lasso)
   - L2 (ridge regression)
   - mixture of L1 & L2 (elastic net)
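For reference, the three penalties can be written as one objective. The form below follows the parameterization documented for the glmnet package (mixing parameter \alpha, strength \lambda); the symbols N, \beta_0, \beta and \ell (the log-likelihood contribution of one observation) are notation introduced here, not taken from the slides.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N} \ell\bigl(y_i,\ \beta_0 + x_i^{\top}\beta\bigr)
  \;+\; \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting \alpha = 1 gives the lasso, \alpha = 0 gives ridge regression, and 0 < \alpha < 1 gives the elastic net.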


· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet


2. Support Vector Machine (just for Diversity)

· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071


decision tree


3. Random Forest

decision trees + bagging


· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest


4. Gradient Boosting Machine

decision trees + boosting


· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm


Summary

some insights

VARIABLE NAME                                          REL.INF
cnt2_resource_role_deptname_cnt_ij                     2.542974017
cnt2_resource_role_rollup_2_ratio_i                    2.107624216
cnt2_resource_role_deptname_ratio_j                    2.017153645
cnt2_resource_role_rollup_2_ratio_j                    1.910465811
cnt2_resource_role_family_ratio_i                      1.770737494
...                                                    ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc    0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title   0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc    0.002106958


summary(x[, c('cnt2_resource_role_deptname_cnt_ij',

'cnt2_resource_role_deptname_ratio_j')])

## cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j

## Min. : 1.0 Min. :0.0003

## 1st Qu.: 2.0 1st Qu.:0.0061

## Median : 7.0 Median :0.0172

## Mean : 15.6 Mean :0.0315

## 3rd Qu.: 17.0 3rd Qu.:0.0368

## Max. :201.0 Max. :1.0000


xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']

tt <- t.test(xx ~ y)

list(estimate=tt$estimate,

conf.int=tt$conf.int, p.value=tt$p.value)

## $estimate

## mean in group 0 mean in group 1

## 10.04 13.82

##

## $conf.int

## [1] -4.851 -2.710

## attr(,"conf.level")

## [1] 0.95

##

## $p.value

## [1] 5.838e-12

par(mar=c(5,4,2,2))

boxplot(xx ~ y)


xxx <- cut(xx, include.lowest=T,

breaks=c(0,1,3,7,14,30,300))

par(mar=c(5,2,0,0))

barplot(table(xxx))

tb <- table(y, xxx)

r_0 <- tb[1, ] / colSums(tb)

par(mar=c(5,2,0,0))

plot(r_0, type='l', lwd=3)


xx <- x[, 'cnt2_resource_role_deptname_ratio_j']

tt <- t.test(xx ~ y)

list(estimate=tt$estimate,

conf.int=tt$conf.int, p.value=tt$p.value)

## $estimate

## mean in group 0 mean in group 1

## 0.01955 0.02902

##

## $conf.int

## [1] -0.011732 -0.007205

## attr(,"conf.level")

## [1] 0.95

##

## $p.value

## [1] 3.93e-16

par(mar=c(5,4,2,2))

boxplot(xx ~ y)


xxx <- cut(xx, include.lowest=T,

breaks=quantile(xx, seq(0,1,0.2)))

par(mar=c(5,2,0,0))

barplot(table(xxx))

tb <- table(y, xxx)

r_0 <- tb[1, ] / colSums(tb)

par(mar=c(5,2,0,0))

plot(r_0, type='l', lwd=3)


overfitting

MODEL                            AUC_CV     AUC_PUBLIC  AUC_PRIVATE
num_glmnet_0                     0.8985069  0.87737     0.87385
stacking_gbm_with_the_glmnet     0.9277316  0.90695     0.90478
stacking_gbm_without_the_glmnet  0.9182303  0.91529     0.91130
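The overfitting shows up as the gap between the cross-validated AUC and the private leaderboard AUC. A short sketch of that arithmetic, using only the numbers from the table (the column name `cv_minus_private` is introduced here):

```r
auc <- data.frame(
  model   = c("num_glmnet_0",
              "stacking_gbm_with_the_glmnet",
              "stacking_gbm_without_the_glmnet"),
  cv      = c(0.8985069, 0.9277316, 0.9182303),
  public  = c(0.87737, 0.90695, 0.91529),
  private = c(0.87385, 0.90478, 0.91130))

# How much of the cross-validated score fails to carry over to unseen data.
auc$cv_minus_private <- auc$cv - auc$private
print(auc[order(auc$cv_minus_private), c("model", "cv_minus_private")])
```

The stack that includes the glmnet has the best CV score but loses about 0.023 on the private leaderboard, while the stack without it loses only about 0.007 and ends up ahead.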


Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology


useful discussions

· Python code to achieve 0.90 AUC with Logistic Regression
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
· Starter code in python with scikit-learn (AUC .885)
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
· Patterns in Training data set
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set


thank you
