Post on 05-Dec-2014
KDD Cup 2013: Author – Paper Identification Challenge (2nd place team)
Dmitry Efimov, Lucas Silva, Benjamin Solecki
Approach summary
Goal: find incorrectly assigned author-paper pairs
Supervised machine learning problem with binary response
Deep feature engineering (> 300 features)
Gradient Boosting Machine (package gbm in R)
Author – Paper graph
Author features
Count features: count of journals
NLP features: tf-idf measure
Multiple source features: author's duplicates
Paper features
Count features: count of keywords
NLP features: tf-idf measure
Multiple source features: paper's duplicates
Additional features: reverse feature engineering
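The count features on both sides of the graph can be illustrated with a small sketch. This is a Python illustration only (the team worked in R), and the edge list, the field layout and the feature names `n_papers`, `n_journals` and `n_authors` are hypothetical, not taken from the competition data.

```python
from collections import defaultdict

# Hypothetical toy edge list: (author_id, paper_id, journal_id).
edges = [
    (1, 10, "J1"),
    (1, 11, "J1"),
    (1, 12, "J2"),
    (2, 10, "J1"),
    (2, 13, "J3"),
]

papers_by_author = defaultdict(set)
authors_by_paper = defaultdict(set)
journals_by_author = defaultdict(set)

# Walk the bipartite graph once and collect the neighbours of every node.
for author_id, paper_id, journal_id in edges:
    papers_by_author[author_id].add(paper_id)
    authors_by_paper[paper_id].add(author_id)
    journals_by_author[author_id].add(journal_id)

# Author-side count features: distinct papers and distinct journals.
author_features = {
    a: {"n_papers": len(papers_by_author[a]),
        "n_journals": len(journals_by_author[a])}
    for a in papers_by_author
}

# Paper-side count feature: distinct authors on the paper.
paper_features = {
    p: {"n_authors": len(authors_by_paper[p])}
    for p in authors_by_paper
}
```

Each author-paper pair can then inherit the features of both of its endpoints, which is one way to read the author / paper / author-paper split used in the slides.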
Author – paper features (1 of 4)
Count features
Multiple source features
Additional features
Likelihood features
Author – paper features (2 of 4)
Count features: count of coauthors with the same affiliation
Additional features: reverse feature engineering (year ranking feature)
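A minimal Python sketch of the coauthor count feature follows. It assumes two hypothetical lookup tables, `paper_authors` (paper to author list) and `affiliation` (author to affiliation string), and it assumes affiliations match after trimming and lowercasing; the slides do not say how the team matched affiliation strings.

```python
def same_affiliation_coauthors(author_id, paper_id, paper_authors, affiliation):
    """Count coauthors on paper_id who share author_id's affiliation."""
    def norm(value):
        # Assumed normalization: trim whitespace and ignore case.
        return value.strip().lower() if value else ""

    own = norm(affiliation.get(author_id))
    if not own:
        return 0  # unknown affiliation: nothing to compare against
    return sum(
        1
        for other in paper_authors.get(paper_id, ())
        if other != author_id and norm(affiliation.get(other)) == own
    )


# Hypothetical toy data.
paper_authors = {10: [1, 2, 3], 11: [1, 4]}
affiliation = {1: "Univ A", 2: "univ a ", 3: "Univ B", 4: None}

count_1_10 = same_affiliation_coauthors(1, 10, paper_authors, affiliation)
count_1_11 = same_affiliation_coauthors(1, 11, paper_authors, affiliation)
count_3_10 = same_affiliation_coauthors(3, 10, paper_authors, affiliation)
```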
Author – paper features (3 of 4)
Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
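Counting repeated occurrences of a pair is a simple tally. The Python sketch below assumes the raw records arrive as a list of `(author_id, paper_id)` tuples with repeats preserved; the `records` list is made up for illustration.

```python
from collections import Counter

# Hypothetical raw records; the same pair may be listed several times.
records = [(1, 10), (1, 10), (1, 11), (2, 10), (1, 10)]

pair_counts = Counter(records)


def pair_count(author_id, paper_id):
    """Number of times the author-paper pair occurs in the raw records."""
    # Counter returns 0 for unseen keys without inserting them.
    return pair_counts[(author_id, paper_id)]
```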
Author – paper features (4 of 4)
Likelihood features
Use Lj and Lja as features: BAD IDEA
1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood)
2) mixed-effects models (package lme4 in R) to find α
Lj: likelihood by journal
Lja: likelihood by journal and author
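The shrunken likelihood can be sketched in a few lines of Python. This sketch makes two assumptions that the slides do not confirm: that a "likelihood" is the fraction of confirmed pairs within a group of training rows, and that α is a fixed number passed in by the caller (the team estimated α with mixed-effects models in lme4, which is not reproduced here).

```python
from collections import defaultdict


def shrunken_likelihood(rows, alpha):
    """rows: iterable of (author_id, journal_id, label), label 1 = confirmed.

    Returns {(author_id, journal_id): alpha * Lj + (1 - alpha) * Lja}.
    """
    j_sum, j_n = defaultdict(int), defaultdict(int)
    ja_sum, ja_n = defaultdict(int), defaultdict(int)
    for author_id, journal_id, label in rows:
        j_sum[journal_id] += label
        j_n[journal_id] += 1
        ja_sum[(author_id, journal_id)] += label
        ja_n[(author_id, journal_id)] += 1

    result = {}
    for (author_id, journal_id), n in ja_n.items():
        l_j = j_sum[journal_id] / j_n[journal_id]    # likelihood by journal
        l_ja = ja_sum[(author_id, journal_id)] / n   # by journal and author
        result[(author_id, journal_id)] = alpha * l_j + (1 - alpha) * l_ja
    return result


# Hypothetical toy rows and an arbitrary alpha.
rows = [(1, "J1", 1), (1, "J1", 1), (2, "J1", 0), (2, "J1", 1)]
shrunk = shrunken_likelihood(rows, alpha=0.5)
```

With α near 1 the feature falls back to the journal-level rate, and with α near 0 it trusts the author-journal rate, which is typically estimated from very few rows.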
Model
Gradient Boosting Machine (package gbm in R)
Grid search for the set of parameters
83 features in the final model (out of 300 calculated features)
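A grid search of this kind can be sketched generically in Python. The slides do not list the parameters or values that were searched, so the grid below is invented: the names loosely mirror gbm arguments (`n.trees`, `interaction.depth`, `shrinkage`), and `toy_score` is a placeholder for what would in practice be a cross-validated score of a fitted model.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values are illustrative only.
param_grid = {
    "n_trees": [500, 1000, 2000],
    "interaction_depth": [3, 5, 7],
    "shrinkage": [0.01, 0.05],
}


def toy_score(params):
    # Placeholder objective standing in for a validation score.
    return (-abs(params["n_trees"] - 1000) / 1000
            - abs(params["interaction_depth"] - 5)
            - abs(params["shrinkage"] - 0.05))


best_params, best_score = grid_search(param_grid, toy_score)
```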
Result and conclusion
• Our MAP score is 0.98144 (the winning submission score is 0.98259).
• Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with a Bernoulli distribution.
• The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can suggest ideas for new features.
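For reference, the MAP metric can be computed as in the Python sketch below. It assumes that MAP means mean average precision over authors, with each author's candidate papers already sorted by predicted probability (highest first), and that an author with no confirmed paper contributes 0; the example rankings are made up.

```python
def average_precision(ranked_labels):
    """ranked_labels: 1/0 labels in ranked order, best-scored item first."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Assumed convention: no confirmed item gives an average precision of 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_author_rankings):
    """Mean of average_precision over all authors."""
    if not per_author_rankings:
        return 0.0
    total = sum(average_precision(r) for r in per_author_rankings)
    return total / len(per_author_rankings)


# Hypothetical rankings for two authors.
rankings = [[1, 0, 1], [0, 1]]
map_score = mean_average_precision(rankings)
```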
Thank you!