KDD CUP 2015 - 9th solution


Transcript of KDD CUP 2015 - 9th solution

Page 1: KDD CUP 2015 - 9th solution

Team: NCCU

A Linear Ensemble of Classification Models with Novel Backward Cumulative Features for MOOC Dropout Prediction

Chih-Ming Chen, Man-Kwan Shan, Ming-Feng Tsai, Yi-Hsuan Yang, Hsin-Ping Chen, Pei-Wen Yeh, and Sin-Ya Peng

Research Center for Information Technology Innovation, Academia Sinica

Department of Computer Science, National Chengchi University

Page 2: KDD CUP 2015 - 9th solution

Team: NCCU

A Linear Ensemble of Classification Models with Novel Backward Cumulative Features for MOOC Dropout Prediction

Chih-Ming Chen, Man-Kwan Shan, Ming-Feng Tsai, Yi-Hsuan Yang, Hsin-Ping Chen, Pei-Wen Yeh, and Sin-Ya Peng

Research Center for Information Technology Innovation, Academia Sinica

Department of Computer Science, National Chengchi University

a linear combination of several models

the proposed data engineering method

it is able to generate many distinct feature sets

Page 3: KDD CUP 2015 - 9th solution

Key Point Summary

Latent Space Representation — Clustering Model — Skip-Gram Model

Backward Cumulative Features — Generate 30 distinct sets of features

Linear Model + Tree-based Model

alleviate the feature sparsity problem

alleviate the bias problem of statistical features

good match

weakness when using sparse features

Page 4: KDD CUP 2015 - 9th solution

Workflow

Train Data (75%)

Validation Data (25%)

Train Data

0"

50000"

100000"

150000"

200000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Training'Date'Distribu.on�

0"20000"40000"60000"80000"

100000"120000"140000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Tes.ng'Date'Distribu.on�

Split the training data based on the time distribution.

stable results

— 2 settings
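As a rough illustration of this time-based split, here is a minimal sketch in Python; the file name, column names, and the exact cut-off rule are assumptions, since the slide only says the split follows the time distribution.

```python
# Hypothetical sketch of a 75/25 split that follows the time distribution.
import pandas as pd

logs = pd.read_csv("train_logs.csv", parse_dates=["date"])  # assumed schema

# Order enrolments by their last activity date and cut at the 75% quantile,
# so the validation set resembles the later part of the timeline.
last_seen = logs.groupby("enrolment_id")["date"].max().sort_values()
cutoff = last_seen.quantile(0.75)

train_ids = last_seen[last_seen <= cutoff].index
valid_ids = last_seen[last_seen > cutoff].index
```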

Page 5: KDD CUP 2015 - 9th solution

Workflow

Train Data (75%)

Validation Data (25%)

Learned Model

Offline Evaluation

Test Data

Train Data

Submission (method 1)

cross-validation

— 2 settings

check if it leads to better performance

Page 6: KDD CUP 2015 - 9th solution

Workflow

Train Data (75%)

Validation Data (25%)

Learned Model

Offline Evaluation

Test Data

Train Data

LearnedModel

Submission (method 2)

— 2 settings

check if it leads to better performance

Page 7: KDD CUP 2015 - 9th solution

Workflow

Train Data (75%)

Validation Data (25%)

Learned Model

Offline Evaluation

Test Data

Train Data

LearnedModel

Submission

— 2 settings

Page 8: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Raw Data

Support Vector Classifier

Student

Course

Time

A classical approach to a general prediction task.

— 2 solutions

Features

Page 9: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Gradient Boosting Decision Trees

Raw Data

Support Vector Classifier

Student

Course

Time

Backward Cumulation

the feature engineering tailored to the MOOC dataset.

— 2 solutions

Features

Page 10: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Gradient Boosting Decision Trees

Linear Combination

Final Prediction

Raw Data

Support Vector Classifier

Student

Course

Time

Backward Cumulation

— 2 solutions

Features

Page 11: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Gradient Boosting Decision Trees

Linear Combination

Final Prediction

Raw Data

Support Vector Classifier

Student

Course

Time

Backward Cumulation

solution 1

solution 2

xgboost

scikit-learn

— 2 solutions

http://scikit-learn.org/stable/
https://github.com/dmlc/xgboost
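A minimal sketch of the four base models named above, using the two libraries cited on this slide; the hyperparameters are illustrative assumptions, not the team's tuned settings.

```python
# Base classifiers from scikit-learn plus GBDT from xgboost (as cited above).
# Hyperparameters are placeholder assumptions, not the team's values.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
    "svc": SVC(probability=True),  # probability=True enables predict_proba
    "xgboost_gbdt": XGBClassifier(n_estimators=300, learning_rate=0.1),
}

# X, y: the extracted feature matrix and dropout labels (built later).
# for model in models.values():
#     model.fit(X, y)
```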

Page 12: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Gradient Boosting Decision Trees

Linear Combination

Final Prediction

Raw Data

Support Vector Classifier

Student

Course

Time

Backward Cumulation

Feature Extraction / Feature Engineering

— 2 solutions

Page 13: KDD CUP 2015 - 9th solution

Feature Extraction

Student

Course

Time

Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)

• Probability Value — Ratio — Naive Bayes

• Latent Space Feature — Clustering — DeepWalk

Raw Data

Page 14: KDD CUP 2015 - 9th solution

Feature Extraction

Student

Course

Time

Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)

• Probability Value — Ratio — Naive Bayes

• Latent Space Feature — Clustering — DeepWalk

Raw Data

describing the status

e.g. the month of the course, the number of registrations

video      5
problem    10
wiki       0
discussion 2
navigation 0
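A minimal sketch of how such TF and Boolean bag-of-words vectors could be computed from an enrolment's event log; the toy event list and the log format are assumptions for illustration.

```python
# Count event types for one enrolment (term frequency), then binarize.
from collections import Counter

events = ["video", "video", "problem", "discussion", "problem", "video"]  # toy log

tf = Counter(events)
vocab = ["video", "problem", "wiki", "discussion", "navigation"]

tf_vector = [tf.get(e, 0) for e in vocab]       # TF feature, e.g. [3, 2, 0, 1, 0]
bool_vector = [int(v > 0) for v in tf_vector]   # Boolean (0/1) feature
```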

Page 15: KDD CUP 2015 - 9th solution

Feature Extraction

Student

Course

Time

Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)

• Probability Value — Ratio — Naive Bayes

• Latent Space Feature — Clustering — DeepWalk

Raw Data

dropout probability

objects: O = {O1, O2, …, Od}

P(dropout | O) ∝ P(dropout) · P(O1 | dropout) · … · P(Od | dropout)

e.g. the dropout ratio of the course

estimate the probabilities from the observed data
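A minimal sketch of this Naive Bayes style feature under the usual conditional-independence assumption; the record format, smoothing constant, and toy data are assumptions for illustration.

```python
# Estimate P(object | dropout) from observed enrolments, then score new ones.
from collections import defaultdict

def fit_object_given_dropout(records):
    """records: iterable of (objects, dropped) pairs from the training data."""
    counts, n_dropout = defaultdict(int), 0
    for objects, dropped in records:
        if dropped:
            n_dropout += 1
            for o in set(objects):
                counts[o] += 1
    return {o: c / n_dropout for o, c in counts.items()}

def naive_bayes_score(objects, p_obj, prior, eps=1e-6):
    """P(dropout) * prod_i P(O_i | dropout); eps smooths unseen objects."""
    score = prior
    for o in objects:
        score *= p_obj.get(o, eps)
    return score

records = [(["video", "wiki"], True), (["video"], False), (["wiki"], True)]
p_obj = fit_object_given_dropout(records)
feature = naive_bayes_score(["video", "wiki"], p_obj, prior=2 / 3)
```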

Page 16: KDD CUP 2015 - 9th solution

Feature Extraction

Student

Course

Time

Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)

• Probability Value — Ratio — Naive Bayes

• Latent Space Feature — Clustering — DeepWalk

Raw Data

Latent Topic

K-means clustering on (1) registered courses and (2) containing objects

some features are sparse

DeepWalk / Skip-Gram for obtaining a dense feature representation
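A minimal sketch of the clustering side, assuming a sparse 0/1 enrolment matrix; the toy data and number of clusters are illustrative.

```python
# Cluster students by their registered courses; the cluster id then serves
# as a compact latent feature in place of the sparse indicator vector.
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy 0/1 matrix: rows = students, columns = courses they registered for.
X = csr_matrix([[1, 0, 1, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)  # e.g. array([0, 0, 1])
```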

Page 17: KDD CUP 2015 - 9th solution

DeepWalk

https://github.com/phanein/deepwalk

The Goal — Find the representation of each node of a graph.

It is an extension of word2vec's Skip-Gram model.

Page 18: KDD CUP 2015 - 9th solution

DeepWalk

https://github.com/phanein/deepwalk

The Goal — Find the representation of each node of a graph.

It is an extension of word2vec's Skip-Gram model.

The core is to model the context information (in practice, the node's neighbours).

Similar objects are mapped into similar space.

Page 19: KDD CUP 2015 - 9th solution

From DeepWalk to the MOOC Problem

U1

U2

U3

Course A

Course B

Course C

Course D

Course E

https://github.com/phanein/deepwalk
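A minimal sketch of how this user-course graph could be assembled, assuming networkx; the enrolment pairs here are illustrative, not the figure's exact edges.

```python
# Build the heterogeneous graph: users and courses are nodes,
# enrolments are the edges connecting them.
import networkx as nx

enrolments = [("U1", "Course A"), ("U1", "Course B"),
              ("U2", "Course B"), ("U2", "Course C"),
              ("U3", "Course D"), ("U3", "Course E")]

G = nx.Graph()
G.add_edges_from(enrolments)  # one node per user and per course
```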

Page 20: KDD CUP 2015 - 9th solution

From DeepWalk to the MOOC Problem

U1

U2

U3

Course A

Course B

Course C

Course D

Course E

Random Walk example: U1 → Course B → U2 → Course C → U1

https://github.com/phanein/deepwalk

Treat random walks on the heterogeneous graph as sentences.

Page 21: KDD CUP 2015 - 9th solution

From DeepWalk to the MOOC Problem

U1

U2

U3

Course A

Course B

Course C

Course D

Course E

Random Walk example: U1 → Course B → U2 → Course C → U1

https://github.com/phanein/deepwalk

Treat random walks on the heterogeneous graph as sentences.

U1:       0.3  0.2  -0.1  0.5  -0.8
Course B: 0.1  0.3  -0.5  1.2  -0.3
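Putting the two previous sketches together, a minimal DeepWalk-style pipeline with gensim; the walk count, walk length, and embedding size are illustrative choices, not the team's settings.

```python
# Random walks over the graph become "sentences"; Skip-Gram (sg=1) then learns
# a dense vector per node, so similar users/courses land close together.
import random
from gensim.models import Word2Vec

def random_walk(G, start, length):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(G.neighbors(walk[-1]))))
    return walk

# G is the user-course graph from the earlier sketch.
walks = [random_walk(G, node, length=10)
         for node in G.nodes() for _ in range(5)]

model = Word2Vec(walks, vector_size=5, window=3, min_count=1, sg=1)
u1_vector = model.wv["U1"]  # dense latent feature for user U1
```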

Page 22: KDD CUP 2015 - 9th solution

Performance

Performance as feature groups are added:

Bag-of-words                  > 0.890
+ Probability                 > 0.901
+ Naive Bayes                 > 0.902
+ Latent Space                > 0.903

Backward Cumulation

Models Combination

Page 23: KDD CUP 2015 - 9th solution

Backward Cumulative Features — Motivation

Logs Table (O = has logs, X = no logs), 10/13 to 10/27:

U1: O X O X O O X O X X X X X X O
U2: X X X X O X X X X X O O O X X
U3: X X X X X X X X X X X O X O X

Page 24: KDD CUP 2015 - 9th solution

Backward Cumulative Features — Motivation

Logs Table, 10/13 to 10/27:

U1: O X O X O O X O O X X X X X O
U2: X X X X O X X X X X O O O X X
U3: X X X X X X X X X X X O X O X

Users are active in different periods and have different numbers of logs.

Page 25: KDD CUP 2015 - 9th solution

Backward Cumulative Features

[Same logs table, 10/13 to 10/27] Consider only the logs in the last N days. N=2

Page 26: KDD CUP 2015 - 9th solution

Backward Cumulative Features

[Same logs table, 10/13 to 10/27] Consider only the logs in the last N days. N=3

Page 27: KDD CUP 2015 - 9th solution

Backward Cumulative Features

[Same logs table, 10/13 to 10/27] Consider only the logs in the last N days. N=4

Page 28: KDD CUP 2015 - 9th solution

Backward Cumulative Features

[Same logs table, 10/13 to 10/27] Consider only the logs in the last N days. N=5

Page 29: KDD CUP 2015 - 9th solution

Backward Cumulative Features

Raw Data

Student

Course

Time

Backward Cumulation

N=1  → Feature Set 1
N=2  → Feature Set 2
N=3  → Feature Set 3
…
N=29 → Feature Set 29
N=30 → Feature Set 30

— 2 strategies
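Before the two strategies, a minimal sketch of how the 30 backward cumulative feature sets could be generated; `extract_features` stands in for the extraction pipeline above, and the column names are assumptions.

```python
# For each N in 1..30, keep only the logs from the last N days before the
# cut-off date and re-run the same feature extraction on that window.
import pandas as pd

def backward_feature_sets(logs, cutoff, extract_features, max_n=30):
    feature_sets = {}
    for n in range(1, max_n + 1):
        window = logs[logs["date"] > cutoff - pd.Timedelta(days=n)]
        feature_sets[n] = extract_features(window)  # Feature Set N
    return feature_sets
```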

Page 30: KDD CUP 2015 - 9th solution

Backward Cumulative Features

Raw Data

Student

Course

Time

Backward Cumulation

N=1 … N=30 → Feature Set 1 … Feature Set 30

Classifier

Strategy 1. Concatenate all 30 feature sets and train a single classifier.

— 2 strategies

Page 31: KDD CUP 2015 - 9th solution

Backward Cumulative Features

RawData

Student

Course

Time

Backward Cumulation

N=1 … N=30 → Feature Set 1 … Feature Set 30

Classifier × 30 (one per feature set)

Strategy 2. Build 30 distinct models and average their predictions.

— 2 strategies
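A minimal sketch of Strategy 2, assuming each model is a binary classifier exposing `predict_proba`; `make_classifier` is a hypothetical factory for any of the base models above.

```python
# Train one model per backward cumulative feature set, then average the 30
# predicted dropout probabilities into a single score per enrolment.
import numpy as np

def strategy2_predict(train_sets, y, test_sets, make_classifier):
    predictions = []
    for n in sorted(train_sets):
        clf = make_classifier()
        clf.fit(train_sets[n], y)
        predictions.append(clf.predict_proba(test_sets[n])[:, 1])
    return np.mean(predictions, axis=0)  # simple average over the 30 models
```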

Page 32: KDD CUP 2015 - 9th solution

Prediction Model Overview

Logistic Regression

Gradient Boosting Classifier

Gradient Boosting Decision Trees

Linear Combination

Final Prediction

Raw Data

Support Vector Classifier

Student

Course

Time

Backward Cumulation

solution 1 * 0.5

solution 2 * 0.5

xgboost

scikit-learn
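The final blend is then just the equal-weight linear combination shown on this slide; the variable names and toy values below are illustrative.

```python
import numpy as np

# Predicted dropout probabilities from solution 1 (concatenation) and
# solution 2 (30-model average); toy values for illustration.
solution1_proba = np.array([0.80, 0.10, 0.55])
solution2_proba = np.array([0.70, 0.20, 0.45])

# Equal-weight linear combination, as shown on the slide (0.5 / 0.5).
final_prediction = 0.5 * solution1_proba + 0.5 * solution2_proba
```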

Page 33: KDD CUP 2015 - 9th solution

What We Learned from the Competition

• Teamwork is important — share ideas — share solutions

• Model diversity & feature diversity — diverse models / features can capture different characteristics of the data

• Understand the data — the goal — the evaluation metric — the data structure

Page 34: KDD CUP 2015 - 9th solution

What We Learned from the Competition

• Teamwork is important — share ideas — share solutions

• Model diversity & feature diversity — diverse models / features can capture different characteristics of the data

• Understand the data — the goal — the evaluation metric — the data structure

Start earlier …

Page 35: KDD CUP 2015 - 9th solution

What We Learned from the Competition

• Teamwork is important — share ideas — share solutions

• Model diversity & feature diversity — diverse models / features can capture different characteristics of the data

• Understand the data — the goal — the evaluation metric — the data structure

Start earlier …

several things to be discussed, e.g.

Feature Format

Data Partition

Feature Scale

Page 36: KDD CUP 2015 - 9th solution

changecandy [at] gmail.com

Any Questions?