KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

KDD Cup 2013 Author – Paper Identification Challenge (2nd place team)

Dmitry Efimov, Lucas Silva, Benjamin Solecki

Approach summary

Goal: find incorrectly assigned author-paper pairs

Supervised machine learning problem with binary response

Deep feature engineering (> 300 features)

Gradient Boosting Machine (package gbm in R)

Author – Paper graph

Author features

• Count features: count of journals
• NLP features: tf-idf measure
• Multiple source features: author's duplicates

Paper features

• Count features: count of keywords
• NLP features: tf-idf measure
• Multiple source features: paper's duplicates
• Additional features: reverse feature engineering
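The slides name a tf-idf measure among the NLP features but do not say which variant or which text fields (titles, keywords, journal names) were used. As a rough illustration only, here is a minimal sketch of one common tf-idf definition (relative term frequency times log inverse document frequency); the function name `tf_idf` and the toy documents are assumptions, not the team's code.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one textbook variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

# Toy keyword lists for two papers (hypothetical data).
docs = [["graph", "mining", "graph"], ["graph", "boosting"]]
weights = tf_idf(docs)
```

A token that occurs in every document ("graph" here) gets weight zero, while tokens specific to one paper keep a positive weight.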

Author – paper features (1 of 4)

Feature groups: count features, additional features, likelihood features

Examples:

• count of coauthors with the same affiliation
• reverse feature engineering: year ranking feature
• how many times did the author-paper pair appear in the Microsoft database?
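To make the "count of coauthors with the same affiliation" feature concrete, here is a minimal sketch. The data layout (a paper-to-authors mapping and an author-to-affiliation mapping) and the function name are hypothetical; the slides do not describe how affiliations were matched or normalized.

```python
def coauthors_same_affiliation(author, paper, paper_authors, affiliation):
    """Count coauthors of `author` on `paper` who share the author's affiliation.

    paper_authors: dict paper_id -> list of author_ids (assumed layout)
    affiliation:   dict author_id -> affiliation string (assumed layout)
    """
    own = affiliation.get(author)
    if not own:
        # Unknown or empty affiliation: nothing to compare against.
        return 0
    return sum(
        1
        for coauthor in paper_authors.get(paper, ())
        if coauthor != author and affiliation.get(coauthor) == own
    )

# Toy data (hypothetical).
paper_authors = {"p1": ["a1", "a2", "a3"]}
affiliation = {"a1": "MIT", "a2": "MIT", "a3": "CMU"}
count = coauthors_same_affiliation("a1", "p1", paper_authors, affiliation)
```

With real data, affiliation strings would likely need cleaning before an exact comparison like this is meaningful.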

Author – paper features (4 of 4): Likelihood features

Lj – likelihood by journal

Lja – likelihood by journal and author

Using Lj and Lja directly as features: BAD IDEA

Instead:

1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood);

2) use mixed-effects models (package lme4 in R) to find α
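The shrunken likelihood α∙Lj + (1−α)∙Lja can be sketched as follows. Two things here are assumptions: that each likelihood is the fraction of confirmed pairs within its group, and that α is supplied as a plain number. The slides say α was found with mixed-effects models in lme4, which this sketch does not reproduce.

```python
from collections import defaultdict

def likelihoods(rows):
    """rows: iterable of (author_id, journal_id, confirmed), confirmed in {0, 1}.

    Returns (L_j, L_ja): the confirmed fraction per journal and per
    (journal, author) pair. This definition is an assumption.
    """
    j_total, j_pos = defaultdict(int), defaultdict(int)
    ja_total, ja_pos = defaultdict(int), defaultdict(int)
    for author, journal, confirmed in rows:
        j_total[journal] += 1
        j_pos[journal] += confirmed
        ja_total[(journal, author)] += 1
        ja_pos[(journal, author)] += confirmed
    L_j = {j: j_pos[j] / n for j, n in j_total.items()}
    L_ja = {k: ja_pos[k] / n for k, n in ja_total.items()}
    return L_j, L_ja

def shrunken_likelihood(L_j, L_ja, author, journal, alpha):
    """alpha * L_j + (1 - alpha) * L_ja, as on the slide."""
    lj = L_j[journal]
    # Fall back to the journal-level value when the pair was never seen.
    lja = L_ja.get((journal, author), lj)
    return alpha * lj + (1 - alpha) * lja

# Toy training rows (hypothetical).
rows = [("a1", "j1", 1), ("a1", "j1", 1), ("a2", "j1", 0), ("a3", "j1", 0)]
L_j, L_ja = likelihoods(rows)
feature = shrunken_likelihood(L_j, L_ja, "a1", "j1", alpha=0.5)
```

Here Lj for "j1" is 0.5 and Lja for ("j1", "a1") is 1.0, so with α = 0.5 the feature is 0.75. The sparse per-author estimate is pulled toward the more stable per-journal one.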

Model

Gradient Boosting Machine (package gbm in R)

Grid search for the set of parameters

83 features in the final model (out of 300 calculated features)
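The slides do not list the parameters or values that were searched. As a generic illustration of grid search, the sketch below enumerates every combination in a small grid and keeps the best-scoring one. The parameter names loosely mirror gbm arguments (`n.trees`, `interaction.depth`, `shrinkage`), but the values and the `toy_score` function are placeholders; in practice the score would be a validation MAP from a trained model.

```python
from itertools import product

# Hypothetical grid; names and values are illustrative only.
grid = {
    "n_trees": [500, 1000],
    "interaction_depth": [3, 5],
    "shrinkage": [0.01, 0.05],
}

def grid_search(grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Stand-in for "fit a model with these parameters and return validation MAP".
    return (
        -abs(params["n_trees"] - 1000) / 1000
        - abs(params["interaction_depth"] - 5)
        - abs(params["shrinkage"] - 0.05)
    )

best_params, best_score = grid_search(grid, toy_score)
```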

Result and conclusion

• Our MAP score is 0.98144 (the winning submission score is 0.98259).

• Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave lower MAP scores than GBM with Bernoulli distribution.

• The idea of feature classification based on the bipartite author-paper graph is very promising. Analyzing the graph topology can give ideas for new features.
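For reference, the MAP (mean average precision) metric quoted above can be sketched as follows. This is the standard definition applied per author; any competition-specific details (for example, how duplicate papers were treated) are not stated in the slides, so they are not modeled here. Function and variable names are illustrative.

```python
def average_precision(ranked_papers, confirmed):
    """Average precision for one author's ranked paper list.

    ranked_papers: paper ids, most likely confirmed first
    confirmed:     set of paper ids that truly belong to the author
    """
    hits, precision_sum = 0, 0.0
    for rank, paper in enumerate(ranked_papers, start=1):
        if paper in confirmed:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(confirmed) if confirmed else 0.0

def mean_average_precision(rankings, truth):
    """Mean of per-author average precision."""
    scores = [average_precision(rankings[a], truth[a]) for a in rankings]
    return sum(scores) / len(scores)

# Toy example (hypothetical data).
rankings = {"a1": ["p1", "p2", "p3"], "a2": ["p4", "p5"]}
truth = {"a1": {"p1", "p3"}, "a2": {"p5"}}
map_score = mean_average_precision(rankings, truth)
```

In this toy case author a1 scores 5/6 and author a2 scores 1/2, giving a MAP of 2/3.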

Thank you!
