KDD CUP 2015 - 9th solution


Team: NCCU

A Linear Ensemble of Classification Models with Novel Backward Cumulative Features for MOOC Dropout Prediction

Chih-Ming Chen, Man-Kwan Shan, Ming-Feng Tsai, Yi-Hsuan Yang, Hsin-Ping Chen, Pei-Wen Yeh, and Sin-Ya Peng

Research Center for Information Technology Innovation, Academia Sinica

Department of Computer Science, National Chengchi University


a linear combination of several models

the proposed data engineering method generates a number of distinct feature sets

Key Point Summary

Latent Space Representation — Clustering Model — Skip-Gram Model
→ alleviates the feature sparsity problem

Backward Cumulative Features — generate 30 distinct sets of features
→ alleviate the bias problem of statistical features

Linear Model + Tree-based Model
→ a good match: the tree-based model's weakness with sparse features is covered by the linear model

Workflow

Train Data: split into Train Data (75%) and Validate Data (25%)

0"

50000"

100000"

150000"

200000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Training'Date'Distribu.on�

0"20000"40000"60000"80000"

100000"120000"140000"

10/27/2013"

11/27/2013"

12/27/2013"

1/27/2014"

2/27/2014"

3/27/2014"

4/27/2014"

5/27/2014"

6/27/2014"

7/27/2014"

Tes.ng'Date'Distribu.on�

Split the training data based on the time distribution, which gives stable results.

— 2 settings
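One plausible realization of this time-based split, as a minimal sketch; it assumes pandas and the KDD Cup 2015 log format (a log_train.csv with enrollment_id and time columns), and splitting by last activity is an illustrative choice, not necessarily the team's exact rule:

```python
import pandas as pd

# Enrollment logs; each row is one event with a timestamp.
logs = pd.read_csv("log_train.csv", parse_dates=["time"])

# Date below which roughly 75% of the log records fall.
cutoff = logs["time"].quantile(0.75)

# Each enrollment's last activity time decides its split, so the
# validation set mimics the "later" part of the timeline.
last_seen = logs.groupby("enrollment_id")["time"].max()
train_ids = last_seen[last_seen < cutoff].index
valid_ids = last_seen[last_seen >= cutoff].index
```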

Workflow

Train Data: split into Train Data (75%) and Validate Data (25%)

Train Data (75%) → Learned Model → Offline Evaluation on the Validate Data

Submission method 1: predict on the Test Data with the model learned from the 75% split (cross-validation)

Submission method 2: re-train the Learned Model on the full Train Data, then predict on the Test Data

— 2 settings: check which one leads to better performance


Prediction Model Overview

Raw Data (Student, Course, Time) → Features → Logistic Regression / Gradient Boosting Classifier / Support Vector Classifier

A classical approach to a general prediction task.

— 2 solutions
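The three scikit-learn classifiers named above can be wired up as follows; a minimal sketch with synthetic data standing in for the real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the enrollment features and dropout labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
    "svc": SVC(probability=True),  # probability=True enables predict_proba
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_valid)[:, 1]  # per-enrollment P(dropout)
    print(name, proba[:3])
```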

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → Logistic Regression / Gradient Boosting Classifier / Support Vector Classifier / Gradient Boosting Decision Trees

the feature engineering tailored to the MOOC dataset

— 2 solutions

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier, Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

— 2 solutions

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier; Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

solution 1 and solution 2, implemented with scikit-learn (http://scikit-learn.org/stable/) and xgboost (https://github.com/dmlc/xgboost)

— 2 solutions
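For the xgboost side, a minimal sketch; the data is synthetic and the parameter values are illustrative, not the team's tuned settings:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # stand-in feature matrix
y = rng.integers(0, 2, size=500)         # stand-in dropout labels

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",      # predict P(dropout)
    "eval_metric": "auc",                # the competition's metric
    "max_depth": 6,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
proba = booster.predict(xgb.DMatrix(X))  # dropout probabilities
```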

Feature Extraction / Feature Engineering

Feature Extraction

Raw Data (Student, Course, Time) → Enrolment ID

• Bag-of-words Feature — Boolean (0/1) — Term Frequency (TF)
• Probability Value — Ratio — Naive Bayes
• Latent Space Feature — Clustering — DeepWalk

Feature Extraction — Bag-of-words Feature

Describes the status of an enrolment, e.g. the month of the course, the number of registrations, or per-event counts:

video      5
problem    10
wiki       0
discussion 2
navigation 0
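The event-count features above can be produced with one pandas crosstab; a minimal sketch on a toy log:

```python
import pandas as pd

# Toy event log: one row per (enrollment, event) occurrence.
logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 2],
    "event": ["video", "video", "problem", "discussion", "wiki"],
})

# Term Frequency (TF): one column per event type, counts per enrollment.
tf = pd.crosstab(logs["enrollment_id"], logs["event"])

# Boolean (0/1) variant of the same bag-of-words feature.
boolean = (tf > 0).astype(int)
print(tf)
```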

Feature Extraction — Probability Value

Dropout probability estimated from the observed data, e.g. the dropout ratio of the course.

For the Naive Bayes feature, with contained objects O = {O1, O2, …, Od}:

P(dropout | O) ∝ P(dropout) · P(O1 | dropout) · … · P(Od | dropout)

where each probability is estimated from the observed data.
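A minimal sketch of these probability features, assuming an enrollment table with hypothetical columns course_id and dropout:

```python
import pandas as pd

enroll = pd.DataFrame({
    "course_id": ["A", "A", "A", "B", "B"],
    "dropout":   [1, 1, 0, 0, 1],
})

# Ratio feature: the dropout ratio of each course, estimated from data.
course_dropout_ratio = enroll.groupby("course_id")["dropout"].mean()

# Naive Bayes term: P(course | dropout), i.e. how often each course
# appears among dropout enrollments; multiplying such terms over all
# contained objects approximates P(dropout | O1 ... Od).
p_course_given_dropout = (
    enroll.loc[enroll["dropout"] == 1, "course_id"]
          .value_counts(normalize=True)
)
print(course_dropout_ratio, p_course_given_dropout, sep="\n")
```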

Feature Extraction — Latent Space Feature

Latent Topic: K-means Clustering on (1) registered courses and (2) contained objects.

Some features are sparse, so DeepWalk / Skip-Gram is used to obtain a dense feature representation.

DeepWalk

https://github.com/phanein/deepwalk

The Goal — Find the representation of each node of a graph.

It is an extension of word2vec's Skip-Gram model.


The core is to model context information (in practice, a node's neighbours).

Similar objects are mapped into similar space.


From DeepWalk to the MOOC Problem

[Figure: heterogeneous user–course graph — users U1, U2, U3 and Courses A–E — with example random walks (e.g. starting at U1 through Course B) drawn over it]

https://github.com/phanein/deepwalk

Treat random walks on the heterogeneous graph as sentences.


U1 → [0.3, 0.2, -0.1, 0.5, -0.8]   Course B → [0.1, 0.3, -0.5, 1.2, -0.3]
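A minimal DeepWalk-style sketch; the team used https://github.com/phanein/deepwalk, so here gensim's Skip-Gram stands in, on a toy user–course graph:

```python
import random
from gensim.models import Word2Vec

# Toy heterogeneous user-course graph as adjacency lists.
graph = {
    "U1": ["Course B", "Course C"],
    "U2": ["Course B", "Course C"],
    "U3": ["Course D", "Course E"],
    "Course B": ["U1", "U2"],
    "Course C": ["U1", "U2"],
    "Course D": ["U3"],
    "Course E": ["U3"],
}

def random_walk(start, length=10):
    """One truncated random walk, read as a 'sentence' of node names."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Many walks per node, then Skip-Gram (sg=1) over the walk corpus.
walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=5, window=3, min_count=0, sg=1)
print(model.wv["U1"])  # a dense vector like the one shown above
```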

Performance

Feature set (each row adds to the previous one)            AUC
Bag-of-words                                               > 0.890
+ Probability                                              > 0.901
+ Naive Bayes                                              > 0.902
+ Latent Space                                             > 0.903

Backward Cumulation

Models Combination

Backward Cumulative Features — Motivation

Logs Table, one column per day from 10/13 to 10/27 (O = day with logs, X = no logs):

U1: O X O X O O X O X X X X X X O
U2: X X X X O X X X X X O O O X X
U3: X X X X X X X X X X X O X O X

Different courses span different periods, and users leave different numbers of logs, which biases plain statistical features.

Backward Cumulative Features

Consider only the logs in the last N days (illustrated for N = 2, 3, 4, 5 on the logs table above).
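A minimal sketch of backward cumulation, assuming the log format from earlier and, for simplicity, one global end date (the real features would use each course's own end date):

```python
import pandas as pd

logs = pd.read_csv("log_train.csv", parse_dates=["time"])
end_date = logs["time"].max()  # simplification: per-course in practice

feature_sets = {}
for n in range(1, 31):
    # Keep only the logs in the last N days, then recount the events.
    recent = logs[logs["time"] >= end_date - pd.Timedelta(days=n)]
    feature_sets[n] = pd.crosstab(recent["enrollment_id"], recent["event"])
```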

Backward Cumulative Features

Raw Data (Student, Course, Time) → Backward Cumulation:

N=1  → Feature Set 1
N=2  → Feature Set 2
N=3  → Feature Set 3
…
N=29 → Feature Set 29
N=30 → Feature Set 30

— 2 strategies

Strategy 1. Concatenate all 30 feature sets and feed them to a single classifier.

Strategy 2. Build 30 distinct models, one classifier per feature set, and average their predictions.

— 2 strategies
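A minimal sketch of Strategy 2; synthetic nested column prefixes stand in for the 30 backward cumulative feature sets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=60, random_state=0)
# Stand-ins for the 30 feature sets: growing prefixes of the columns,
# mimicking how a larger N accumulates more log information.
feature_sets = [X[:, : 2 * (n + 1)] for n in range(30)]

probas = []
for X_n in feature_sets:                    # one distinct model per set
    clf = LogisticRegression(max_iter=1000).fit(X_n, y)
    probas.append(clf.predict_proba(X_n)[:, 1])

final = np.mean(probas, axis=0)             # average the 30 predictions
```

Strategy 1 would instead be np.hstack(feature_sets) fed to a single classifier.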

Prediction Model Overview

Raw Data (Student, Course, Time) → Backward Cumulation → Features → {Logistic Regression, Gradient Boosting Classifier, Support Vector Classifier; Gradient Boosting Decision Trees} → Linear Combination → Final Prediction

Final Prediction = 0.5 × solution 1 + 0.5 × solution 2

Tools: scikit-learn, xgboost
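The final blend itself is one line; a sketch assuming p1 and p2 hold the two solutions' per-enrollment dropout probabilities:

```python
import numpy as np

p1 = np.array([0.91, 0.23, 0.67])  # solution 1 predictions
p2 = np.array([0.85, 0.31, 0.58])  # solution 2 predictions

final = 0.5 * p1 + 0.5 * p2        # the equal-weight linear combination
```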


What We Learned from the Competition

• Teamwork is important — share ideas — share solutions
• Model diversity & feature diversity — diverse models and features capture different characteristics of the data
• Understand the data — the goal — the evaluation metric — the data structure

Start earlier … several things remained to be discussed, e.g. feature format, data partition, feature scale.

changecandy [at] gmail.com

Any Questions?