Kdd Cup 2013 Author Paper Identification Final Report


Page 1: Kdd  Cup 2013 Author Paper Identification Final Report

KDD Cup 2013 Author Paper Identification Final Report

Ben Deng – M10112006

Page 2

Outline

Problem Description

Database Analysis

Research Issue

Proposed Ideas

Results

Page 3

Problem Description

The research community comprises more than 50 million publications and 19 million authors.

However, every journal, letter, and conference has its own format, and this includes the format of author names. These differing formats can lead to author-name ambiguity, for instance through abbreviations, identical names, misspelled names, and pseudonyms.

All these problems can cause publications to be assigned to the wrong authors, which is a serious problem when searching for a specific author. The main goal is to recognize each author and correctly assign publications to them.

Page 4

Database Analysis

Author.csv: Id, Name, Affiliation (missing data, noise)

PaperAuthor.csv: PaperID, AuthorID, Name, Affiliation (missing data, noise)

Paper.csv: ID, Title, Year, ConferenceId, JournalId, Keywords (missing data)

Journal.csv: ID, ShortName, FullName, HomePage

Conference.csv: ID, ShortName, FullName, HomePage
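
To put numbers on the "missing data" notes, one could count the empty cells in each column. The following is a minimal sketch using only the Python standard library. The inline sample rows are made up for illustration, and the helper name missing_counts is hypothetical; with the real data, an open file handle for Author.csv would replace the in-memory sample.

```python
import csv
import io

# Made-up sample in the shape of Author.csv (Id, Name, Affiliation).
SAMPLE_AUTHOR_CSV = """Id,Name,Affiliation
1,Jane Doe,Example University
2,J. Doe,
3,John Roe,
"""

def missing_counts(handle):
    """Return {column: number of empty cells} for a CSV file handle."""
    reader = csv.DictReader(handle)
    counts = {}
    for row in reader:
        for column, value in row.items():
            if column is None:
                continue  # extra unnamed cells in a malformed row
            if value is None or not value.strip():
                counts[column] = counts.get(column, 0) + 1
    # Report every column, including those with no missing cells.
    return {column: counts.get(column, 0) for column in reader.fieldnames}

print(missing_counts(io.StringIO(SAMPLE_AUTHOR_CSV)))
```

The same function could be applied to PaperAuthor.csv and Paper.csv to compare how much of the Affiliation and Keywords columns is missing.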

Page 5

Research Issue

A lot of data is missing.

Noise in the affiliation column (especially for foreign affiliations).

Name ambiguity (especially for names of Chinese origin).

Authors' names are abbreviated differently across journals and conferences.
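
The abbreviation issue can be illustrated with a small sketch. Reducing each name to its first initial plus last name makes different spellings of one author comparable, but it also merges genuinely different authors, which is exactly the ambiguity described here. The function name_key and the sample names are hypothetical and are not part of the original work; the sketch also assumes the given name comes first.

```python
import unicodedata

def name_key(name):
    """Reduce a name to 'first initial + last name', lowercase, no accents."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat periods and hyphens as separators, then split on whitespace.
    parts = plain.replace(".", " ").replace("-", " ").lower().split()
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    # Assumption: given name first, family name last.
    return f"{parts[0][0]} {parts[-1]}"

# Two spellings of the same (made-up) author map to one key...
print(name_key("Wei Zhang"), "|", name_key("W. Zhang"))
# ...but so does a different (made-up) author with the same initial.
print(name_key("Wen Zhang"))
```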

Page 6

Proposed Ideas

Filling in missing data.

Counting how many different affiliations the same author has.

Using keywords: counting how many times the same keyword was used.

The class weight is fixed to 'auto'.
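
The two counting ideas could be turned into features roughly as follows. This is a sketch under assumptions: the rows are made-up stand-ins for PaperAuthor.csv and Paper.csv, keywords are assumed to be separated by whitespace, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up (PaperID, AuthorID, Affiliation) rows in the shape of PaperAuthor.csv.
paper_author_rows = [
    (10, 1, "Example University"),
    (11, 1, "Example Univ."),
    (12, 1, ""),  # missing affiliation, ignored when counting
    (10, 2, "Sample Institute"),
]

# Made-up {PaperID: Keywords} mapping in the shape of Paper.csv.
paper_keywords = {10: "graph mining", 11: "graph clustering", 12: ""}

def affiliation_counts(rows):
    """Number of distinct non-empty affiliations per author."""
    seen = defaultdict(set)
    for _paper_id, author_id, affiliation in rows:
        if affiliation.strip():
            seen[author_id].add(affiliation.strip().lower())
    return {author_id: len(values) for author_id, values in seen.items()}

def keyword_counts(rows, keywords_by_paper):
    """How often each keyword appears across one author's papers."""
    counts = defaultdict(Counter)
    for paper_id, author_id, _affiliation in rows:
        words = keywords_by_paper.get(paper_id, "").lower().split()
        counts[author_id].update(words)
    return counts

print(affiliation_counts(paper_author_rows))
print(keyword_counts(paper_author_rows, paper_keywords)[1]["graph"])
```

Note that the two spellings of the same affiliation are counted as different values here, so the noise in the affiliation column feeds directly into this feature.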

Page 7

Filling missing data

The goal is to normalize the tables so that each column1 value joins to exactly one column2 value, assuming there should indeed be exactly one column2 per column1. Rows where column2 is missing are filled from other rows that share the same column1.

SQL Code

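-- UPDATE ... FROM form as accepted by PostgreSQL; table_name, column1 and column2 are placeholders.
UPDATE table_name t
SET column2 = c.column2
FROM (SELECT column1, MAX(column2) AS column2 FROM table_name WHERE column2 IS NOT NULL GROUP BY column1) c
WHERE t.column2 IS NULL AND t.column1 = c.column1;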
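
The same fill can be sketched outside the database. The snippet below mirrors the SQL logic in plain Python: for each column1 value it takes the largest non-missing column2 value and copies it into the rows where column2 is missing. The sample rows and the function name fill_missing are hypothetical.

```python
def fill_missing(rows):
    """Fill None in column2 with the max non-missing column2 of the same column1."""
    # Equivalent of: SELECT column1, MAX(column2) ... GROUP BY column1
    best = {}
    for column1, column2 in rows:
        if column2 is not None:
            if column1 not in best or column2 > best[column1]:
                best[column1] = column2
    # Equivalent of the UPDATE: only rows with a missing column2 change.
    return [
        (column1, best.get(column1) if column2 is None else column2)
        for column1, column2 in rows
    ]

# Made-up (author id, affiliation) pairs; None marks a missing value.
rows = [(1, "Example University"), (1, None), (2, None), (3, "Sample Institute")]
print(fill_missing(rows))
```

An author with no known value at all (id 2 in the sample) stays missing, just as in the SQL version, where such rows find no match in the subquery.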

Page 8

Simulation and Results

Page 9

Simulation and Results

Random Forest (Classifier)

Gradient Boosting (Classifier)

Decision Tree (Classifier)

K Nearest (Classifier)

Page 10

Random Forest

The result is 0.51341; however, I was expecting 0.80217.

This run used the same code from GitHub (with the same parameters).
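
The slides do not name the metric behind these scores. Assuming they are mean average precision (MAP), the measure used to rank submissions in this competition track, a score could be computed as sketched below. The function names and the sample rankings are hypothetical, and the sketch assumes each ranked list contains no duplicate paper ids.

```python
def average_precision(ranked_papers, confirmed):
    """Average precision of one author's ranked paper list."""
    hits = 0
    precision_sum = 0.0
    for rank, paper_id in enumerate(ranked_papers, start=1):
        if paper_id in confirmed:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(confirmed) if confirmed else 0.0

def mean_average_precision(rankings, truth):
    """Mean of the per-author average precision values."""
    scores = [average_precision(rankings[author], truth[author]) for author in truth]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up data: author id -> ranked paper ids, and author id -> confirmed papers.
rankings = {1: [10, 11, 12], 2: [20, 21]}
truth = {1: {10, 12}, 2: {21}}
print(mean_average_precision(rankings, truth))
```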

Page 11

Random Forest

Result is 0.52469

Parameters of Python Code

RandomForestClassifier(n_estimators=200, criterion='gini', max_depth=None, min_samples_split=15, min_samples_leaf=1, min_density=0.10000000000000001, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=2, random_state=None, verbose=0)

Page 12

Decision Tree

Result is 0.47386

Parameters of Python Code

DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=15, min_samples_leaf=1, min_density=0.10000000000000001, max_features='auto', compute_importances=False, random_state=1)

Page 13

Gradient Boosting

Result is 0.53506

Parameters of Python Code

GradientBoostingClassifier(loss='deviance', learning_rate=0.00001, n_estimators=250, subsample=0.5, min_samples_split=2, min_samples_leaf=1, max_depth=10, init=None, random_state=None, max_features=None, verbose=0)

Page 14

K Nearest

Result is 0.48297

Parameters of Python Code

KNeighborsClassifier(n_neighbors=50, weights='distance', algorithm='auto', leaf_size=30, p=2)

Page 15

SVM: SVC, Nu-SVC, LinearSVC

Support Vector Machine classifiers (SVC, Nu-SVC, and LinearSVC) were tested.

However, training has taken more than 3 days and the classifiers are still training, so I was not able to finish the training and submit results using SVM.

Page 16

Thank You