KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

KDD Cup 2013 Author-Paper Identification Challenge (2nd place team). Dmitry Efimov, Lucas Silva, Benjamin Solecki

description

We describe our approach to the Author-Paper Identification Challenge: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

Transcript of KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Page 1: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

KDD Cup 2013 Author – Paper Identification Challenge (2nd place team)

Dmitry Efimov, Lucas Silva, Benjamin Solecki

Page 2: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Approach summary

Goal: find incorrectly assigned author-paper pairs

Supervised machine learning problem with binary response

Deep feature engineering (> 300 features)

Gradient Boosting Machine (package gbm in R)

Page 3: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author – Paper graph

Page 4: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author features

Count features: count of journals

NLP features: tf-idf measure

Multiple source features: author's duplicates
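The two author-level feature types named on this slide can be illustrated with a small sketch. The deck does not say which text fields the tf-idf measure was computed over, so the version below assumes paper titles and treats all titles of one author as a single document; the toy records and the function names `count_journals` and `author_tfidf` are hypothetical, and the team's actual features were built in R.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (author_id, journal_id, paper_title)
records = [
    (1, "J1", "gradient boosting for ranking"),
    (1, "J2", "boosting trees"),
    (2, "J1", "graph topology of citation networks"),
]

def count_journals(records):
    """Count feature: number of distinct journals each author has published in."""
    journals = defaultdict(set)
    for author, journal, _ in records:
        journals[author].add(journal)
    return {author: len(js) for author, js in journals.items()}

def author_tfidf(records):
    """NLP feature: tf-idf weight of each title word, one 'document' per author.

    Assumption: titles are the text source; the deck does not specify this.
    """
    tf = defaultdict(Counter)
    for author, _, title in records:
        tf[author].update(title.lower().split())
    n_docs = len(tf)
    df = Counter()
    for words in tf.values():
        df.update(words.keys())  # each author counts once per word
    weights = {}
    for author, words in tf.items():
        total = sum(words.values())
        weights[author] = {
            w: (c / total) * math.log(n_docs / df[w]) for w, c in words.items()
        }
    return weights

journal_counts = count_journals(records)
tfidf_weights = author_tfidf(records)
```

In this toy data author 1 appears in two journals and author 2 in one; a word that occurs in every author's titles would get weight zero under this unsmoothed idf.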

Page 5: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Paper features

Count features: count of keywords

NLP features: tf-idf measure

Multiple source features: paper's duplicates

Additional features: reverse feature engineering

Page 6: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author – paper features (1 of 4)

Count features

Multiple source features

Additional features

Likelihood features

Page 7: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author – paper features (2 of 4)

Count features: count of coauthors with the same affiliation

Additional features: reverse feature engineering (year ranking feature)
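As an illustration of the count feature on this slide, the sketch below counts, for one author-paper pair, how many coauthors on that paper share the author's affiliation. The data layout and the names `paper_authors`, `affiliation`, and `same_affiliation_coauthors` are assumptions made for the example, not the team's code; real affiliation strings would likely need normalization before an exact-match comparison like this one is useful.

```python
# Hypothetical toy data
paper_authors = {  # paper_id -> list of author_ids
    10: [1, 2, 3],
    11: [1, 4],
}
affiliation = {1: "Univ A", 2: "Univ A", 3: "Univ B", 4: "Univ A"}

def same_affiliation_coauthors(author, paper):
    """Count coauthors of `author` on `paper` whose affiliation matches the author's."""
    own = affiliation.get(author)
    if not own:
        # Unknown or empty affiliation: no match can be established
        return 0
    return sum(
        1
        for other in paper_authors.get(paper, [])
        if other != author and affiliation.get(other) == own
    )
```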

Page 8: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author – paper features (3 of 4)

Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
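A minimal sketch of this multiple-source feature, assuming the database can be read as a list of (author, paper) rows in which the same pair may occur more than once. The rows and the helper name `pair_frequency` are hypothetical.

```python
from collections import Counter

# Hypothetical rows (author_id, paper_id); duplicates are kept on purpose,
# since the feature is the number of times a pair occurs.
rows = [(1, 10), (1, 10), (2, 10), (1, 11), (1, 10)]

pair_count = Counter(rows)

def pair_frequency(author, paper):
    """Number of times the (author, paper) pair occurs in the rows (0 if never)."""
    return pair_count[(author, paper)]
```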

Page 9: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Author – paper features (4 of 4)

Likelihood features

Lj – likelihood by journal

Lja – likelihood by journal and author

Using Lj and Lja directly as features: BAD IDEA

Instead: 1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood); 2) use mixed-effects models (package lme4 in R) to find α
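The shrunken likelihood can be sketched as follows. This assumes Lj and Lja are the empirical rates of confirmed pairs in the training data, grouped by journal and by (journal, author); the deck does not define them more precisely. Here α is passed in as a plain number, whereas the slide says it was estimated with mixed-effects models in lme4, which this sketch does not attempt. One plausible reading of "BAD IDEA" is that the raw per-author rate is noisy when an author has few papers in a journal, which is what the blend is meant to damp. All names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical training rows: (author_id, journal_id, confirmed), confirmed in {0, 1}
train = [
    (1, "J1", 1), (1, "J1", 1), (1, "J1", 0),
    (2, "J1", 0), (2, "J1", 0),
    (3, "J2", 1),
]

def mean_by(rows, key):
    """Mean of the `confirmed` flag within each group defined by `key`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        k = key(row)
        sums[k] += row[2]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

L_j = mean_by(train, lambda r: r[1])           # likelihood by journal
L_ja = mean_by(train, lambda r: (r[1], r[0]))  # likelihood by journal and author

def shrunken_likelihood(author, journal, alpha):
    """alpha * Lj + (1 - alpha) * Lja for a journal seen in training.

    Assumption: for an unseen (journal, author) group, fall back to Lj.
    """
    lj = L_j[journal]
    lja = L_ja.get((journal, author), lj)
    return alpha * lj + (1 - alpha) * lja
```

With α = 1 the feature reduces to Lj, and with α = 0 it reduces to Lja, so α controls how strongly the per-author estimate is pulled toward the journal-level rate.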

Page 10: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Model

Gradient Boosting Machine (package gbm in R)

Grid search over the set of parameters

83 features in the final model (out of 300 calculated features)
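The grid search step can be illustrated in outline. The deck does not list the parameters or values that were searched, so the grid below is purely illustrative, and `validation_score` is a toy placeholder standing in for "fit the model with these parameters and return a validation score"; it is not a real model.

```python
from itertools import product

# Hypothetical grid; names and values are illustrative only
grid = {
    "n_trees": [100, 500],
    "shrinkage": [0.01, 0.1],
    "depth": [3, 5],
}

def validation_score(params):
    """Toy surrogate so the sketch runs without a model.

    In practice this would train the model and return a validation metric.
    """
    return (
        -abs(params["shrinkage"] - 0.1)
        - abs(params["depth"] - 5) / 10
        + params["n_trees"] / 10000
    )

def grid_search(grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = grid_search(grid, validation_score)
```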

Page 11: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Result and conclusion

• Our MAP score is 0.98144 (the winning submission scored 0.98259).

• Many algorithms based on direct MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with a Bernoulli distribution.

• The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can suggest ideas for new features.
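For reference, mean average precision (MAP) can be computed as in the sketch below, which ranks each author's papers by predicted score and averages the per-author average precision. This is the standard textbook definition; details of the competition's scoring (for example tie handling or duplicates) may differ, and the function names and input layout are assumptions.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of 0/1 labels (1 = confirmed paper)."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(predictions):
    """predictions: author_id -> list of (score, label) pairs.

    Papers are ranked by descending score within each author.
    """
    aps = []
    for pairs in predictions.values():
        ranked = [label for _, label in sorted(pairs, key=lambda p: p[0], reverse=True)]
        aps.append(average_precision(ranked))
    return sum(aps) / len(aps) if aps else 0.0
```

A binary classifier such as GBM with a Bernoulli distribution fits this metric naturally: its predicted probabilities serve as the ranking scores.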

Page 12: KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Thank you!