KDD CUP 2015 - 9th solution


Team: NCCU

A Linear Ensemble of Classification Models with Novel Backward Cumulative Features for MOOC Dropout Prediction

Chih-Ming Chen, Man-Kwan Shan, Ming-Feng Tsai, Yi-Hsuan Yang, Hsin-Ping Chen, Pei-Wen Yeh, and Sin-Ya Peng

Research Center for Information Technology Innovation, Academia Sinica

Department of Computer Science, National Chengchi University


a linear combination of several models

the proposed data engineering method generates a number of distinct feature sets

Key Point Summary

Latent Space Representation — Clustering Model — Skip-Gram Model
→ alleviates the feature sparsity problem

Backward Cumulative Features — generate 30 distinct sets of features
→ alleviate the bias problem of statistical features

Linear Model + Tree-based Model
→ a good match: the tree-based model's weakness with sparse features is covered by the linear model

Workflow

Train Data: split into Train Data (75%) and Validate Data (25%)

0"

50000"

100000"

150000"

200000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Training'Date'Distribu.on�

0"20000"40000"60000"80000"

100000"120000"140000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Tes.ng'Date'Distribu.on�

Split the training data based on the time distribution, which gives stable results.

— 2 settings
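One plausible realization of this time-based split, as a minimal sketch; it assumes pandas and the KDD Cup 2015 log format (a log_train.csv with enrollment_id and time columns), and splitting by last activity is an illustrative choice, not necessarily the team's exact rule:

```python
import pandas as pd

# Enrollment logs; each row is one event with a timestamp.
logs = pd.read_csv("log_train.csv", parse_dates=["time"])

# Date below which roughly 75% of the log records fall.
cutoff = logs["time"].quantile(0.75)

# Each enrollment's last activity time decides its split, so the
# validation set mimics the "later" part of the timeline.
last_seen = logs.groupby("enrollment_id")["time"].max()
train_ids = last_seen[last_seen < cutoff].index
valid_ids = last_seen[last_seen >= cutoff].index
```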

Workflow

Train Data: split into Train Data (75%) and Validate Data (25%)

Train Data (75%) → Learned Model → Offline Evaluation on the Validate Data

Submission method 1: predict on the Test Data with the model learned from the 75% split (cross-validation)

Submission method 2: re-train the Learned Model on the full Train Data, then predict on the Test Data

— 2 settings: check which one leads to better performance


Prediction Model Overview

Raw Data (Student, Course, Time) → Features → Logistic Regression / Gradient Boosting Classifier / Support Vector Classifier

A classical approach to a general prediction task.

— 2 solutions
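The three scikit-learn classifiers named above can be wired up as follows; a minimal sketch with synthetic data standing in for the real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the enrollment features and dropout labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
    "svc": SVC(probability=True),  # probability=True enables predict_proba
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_valid)[:, 1]  # per-enrollment P(dropout)
    print(name, proba[:3])
```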

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → Logistic Regression / Gradient Boosting Classifier / Support Vector Classifier / Gradient Boosting Decision Trees

the feature engineering tailored to the MOOC dataset

— 2 solutions

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier, Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

— 2 solutions

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier; Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

solution 1 and solution 2, implemented with scikit-learn (http://scikit-learn.org/stable/) and xgboost (https://github.com/dmlc/xgboost)

— 2 solutions
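For the xgboost side, a minimal sketch; the data is synthetic and the parameter values are illustrative, not the team's tuned settings:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # stand-in feature matrix
y = rng.integers(0, 2, size=500)         # stand-in dropout labels

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",      # predict P(dropout)
    "eval_metric": "auc",                # the competition's metric
    "max_depth": 6,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
proba = booster.predict(xgb.DMatrix(X))  # dropout probabilities
```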

Feature Extraction / Feature Engineering

Feature Extraction

Raw Data (Student, Course, Time) → Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)
• Probability Value — Ratio — Naive Bayes
• Latent Space Feature — Clustering — DeepWalk

Feature Extraction — Bag-of-words Feature

Describes the status of an enrolment, e.g. the month of the course, the number of registrations, or per-event counts:

video      5
problem    10
wiki       0
discussion 2
navigation 0
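The event-count features above can be produced with one pandas crosstab; a minimal sketch on a toy log:

```python
import pandas as pd

# Toy event log: one row per (enrollment, event) occurrence.
logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 2],
    "event": ["video", "video", "problem", "discussion", "wiki"],
})

# Term Frequency (TF): one column per event type, counts per enrollment.
tf = pd.crosstab(logs["enrollment_id"], logs["event"])

# Boolean (0/1) variant of the same bag-of-words feature.
boolean = (tf > 0).astype(int)
print(tf)
```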

Feature Extraction — Probability Value

Dropout probability estimated from the observed data, e.g. the dropout ratio of the course.

For the Naive Bayes feature, with contained objects O = {O1, O2, …, Od}:

P(dropout | O) ∝ P(dropout) · P(O1 | dropout) · … · P(Od | dropout)

where each probability is estimated from the observed data.
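A minimal sketch of these probability features, assuming an enrollment table with hypothetical columns course_id and dropout:

```python
import pandas as pd

enroll = pd.DataFrame({
    "course_id": ["A", "A", "A", "B", "B"],
    "dropout":   [1, 1, 0, 0, 1],
})

# Ratio feature: the dropout ratio of each course, estimated from data.
course_dropout_ratio = enroll.groupby("course_id")["dropout"].mean()

# Naive Bayes term: P(course | dropout), i.e. how often each course
# appears among dropout enrollments; multiplying such terms over all
# contained objects approximates P(dropout | O1 ... Od).
p_course_given_dropout = (
    enroll.loc[enroll["dropout"] == 1, "course_id"]
          .value_counts(normalize=True)
)
print(course_dropout_ratio, p_course_given_dropout, sep="\n")
```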

Feature Extraction — Latent Space Feature

Latent Topic: K-means Clustering on (1) registered courses and (2) contained objects.

Some features are sparse, so DeepWalk / Skip-Gram is used to obtain a dense feature representation.

DeepWalk

https://github.com/phanein/deepwalk

The Goal — Find the representation of each node of a graph.

It is an extension of word2vec's Skip-Gram model.


The core is to model context information (in practice, a node's neighbours).

Similar objects are mapped into similar space.


From DeepWalk to the MOOC Problem

[Figure: heterogeneous user–course graph — users U1, U2, U3 and Courses A–E — with example random walks (e.g. starting at U1 through Course B) drawn over it]

https://github.com/phanein/deepwalk

Treat random walks on the heterogeneous graph as sentences.


U1 → [0.3, 0.2, -0.1, 0.5, -0.8]   Course B → [0.1, 0.3, -0.5, 1.2, -0.3]
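A minimal DeepWalk-style sketch; the team used https://github.com/phanein/deepwalk, so here gensim's Skip-Gram stands in, on a toy user–course graph:

```python
import random
from gensim.models import Word2Vec

# Toy heterogeneous user-course graph as adjacency lists.
graph = {
    "U1": ["Course B", "Course C"],
    "U2": ["Course B", "Course C"],
    "U3": ["Course D", "Course E"],
    "Course B": ["U1", "U2"],
    "Course C": ["U1", "U2"],
    "Course D": ["U3"],
    "Course E": ["U3"],
}

def random_walk(start, length=10):
    """One truncated random walk, read as a 'sentence' of node names."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Many walks per node, then Skip-Gram (sg=1) over the walk corpus.
walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=5, window=3, min_count=0, sg=1)
print(model.wv["U1"])  # a dense vector like the one shown above
```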

Performance

Feature set (each row adds to the previous one)            AUC
Bag-of-words                                               > 0.890
+ Probability                                              > 0.901
+ Naive Bayes                                              > 0.902
+ Latent Space                                             > 0.903

Backward Cumulation

Models Combination

Backward Cumulative Features — Motivation

Logs Table, one column per day from 10/13 to 10/27 (O = day with logs, X = no logs):

U1: O X O X O O X O X X X X X X O
U2: X X X X O X X X X X O O O X X
U3: X X X X X X X X X X X O X O X

Different courses span different periods, and users leave different numbers of logs, which biases plain statistical features.

Backward Cumulative Features

Consider only the logs in the last N days (illustrated for N = 2, 3, 4, 5 on the logs table above).
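A minimal sketch of backward cumulation, assuming the log format from earlier and, for simplicity, one global end date (the real features would use each course's own end date):

```python
import pandas as pd

logs = pd.read_csv("log_train.csv", parse_dates=["time"])
end_date = logs["time"].max()  # simplification: per-course in practice

feature_sets = {}
for n in range(1, 31):
    # Keep only the logs in the last N days, then recount the events.
    recent = logs[logs["time"] >= end_date - pd.Timedelta(days=n)]
    feature_sets[n] = pd.crosstab(recent["enrollment_id"], recent["event"])
```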

Backward Cumulative Features

Raw Data (Student, Course, Time) → Backward Cumulation:

N=1  → Feature Set 1
N=2  → Feature Set 2
N=3  → Feature Set 3
…
N=29 → Feature Set 29
N=30 → Feature Set 30

— 2 strategies

Strategy 1. Concatenate all 30 feature sets and feed them to a single classifier.

Strategy 2. Build 30 distinct models, one classifier per feature set, and average their predictions.

— 2 strategies
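A minimal sketch of Strategy 2; synthetic nested column prefixes stand in for the 30 backward cumulative feature sets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=60, random_state=0)
# Stand-ins for the 30 feature sets: growing prefixes of the columns,
# mimicking how a larger N accumulates more log information.
feature_sets = [X[:, : 2 * (n + 1)] for n in range(30)]

probas = []
for X_n in feature_sets:                    # one distinct model per set
    clf = LogisticRegression(max_iter=1000).fit(X_n, y)
    probas.append(clf.predict_proba(X_n)[:, 1])

final = np.mean(probas, axis=0)             # average the 30 predictions
```

Strategy 1 would instead be np.hstack(feature_sets) fed to a single classifier.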

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier; Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

Final Prediction = 0.5 × solution 1 + 0.5 × solution 2

Tools: scikit-learn, xgboost
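The final blend itself is one line; a sketch assuming p1 and p2 hold the two solutions' per-enrollment dropout probabilities:

```python
import numpy as np

p1 = np.array([0.91, 0.23, 0.67])  # solution 1 predictions
p2 = np.array([0.85, 0.31, 0.58])  # solution 2 predictions

final = 0.5 * p1 + 0.5 * p2        # the equal-weight linear combination
```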


What We Learned from the Competition

• Teamwork is important — share ideas — share solutions
• Model diversity & feature diversity — diverse models and features capture different characteristics of the data
• Understand the data — the goal — the evaluation metric — the data structure

Start earlier … several things remained to be discussed, e.g. feature format, data partition, feature scale.

changecandy [at] gmail.com

Any Questions?