Matching References to Headers in PDF Papers
Tan Yee Fan
2007 December 19
WING Group Meeting
Task
Corpus: the ACL Anthology, a collection of PDF papers
Task: for each paper P, which papers are cited by P?
Gold standard data obtained from Dragomir
e.g., P00-1002 ==> P98-2144
Header and References
Header of paper (HeaderParse): paper title, author names, etc.
Reference section (ParsCit): paper title, author names, publication venue, etc.
Each header and each reference is treated as a record: title, authors, venue
System Overview
Pipeline: header records are indexed in Lucene; each reference record is issued as a query; the returned headers are passed to the matching algorithm
Indexing: all fields concatenated into a single string, perform token matching
Querying: OR matching (default in Lucene)
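The indexing and querying steps above can be sketched in plain Python. This is a toy stand-in for Lucene, not the actual system: the field layout and the shared-token scoring are illustrative assumptions.

```python
from collections import defaultdict

def make_doc(title, authors, venue):
    # Indexing: all fields concatenated into a single string, then tokenized
    return f"{title} {authors} {venue}".lower().split()

def build_index(header_records):
    # Inverted index: token -> set of header ids containing that token
    index = defaultdict(set)
    for doc_id, tokens in header_records.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index

def query(index, reference_tokens, top_k=5):
    # Querying: OR matching -- any shared token contributes to the score,
    # so a header need not contain every query token to be returned
    scores = defaultdict(int)
    for tok in reference_tokens:
        for doc_id in index.get(tok, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The returned top-k headers would then go to the matching algorithm for the final match/mismatch decision.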
Record Matching
Each header record and each reference record has TITLE, AUTHOR, VENUE fields
Features per header-reference pair (instance):
TITLE_MIN_LEN, TITLE_MAX_LEN
AUTHOR_MIN_LEN, AUTHOR_MAX_LEN
VENUE_MIN_LEN, VENUE_MAX_LEN
TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Classifier output: MATCH / MISMATCH
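Extracting those nine features for one header-reference pair can be sketched as follows; the Jaccard token overlap here is an illustrative stand-in, since the talk does not specify the similarity function:

```python
def length_features(h_val, r_val):
    # MIN_LEN / MAX_LEN of the two field strings
    a, b = len(h_val), len(r_val)
    return min(a, b), max(a, b)

def token_sim(h_val, r_val):
    # Stand-in similarity: Jaccard overlap of token sets
    a, b = set(h_val.lower().split()), set(r_val.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def instance_features(header, reference):
    # One instance per header-reference pair:
    # 6 length features + 3 similarity features
    feats = {}
    for field in ("title", "author", "venue"):
        lo, hi = length_features(header[field], reference[field])
        feats[f"{field.upper()}_MIN_LEN"] = lo
        feats[f"{field.upper()}_MAX_LEN"] = hi
        feats[f"{field.upper()}_SIM"] = token_sim(header[field], reference[field])
    return feats
```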
Experiment Setup
Data
Reference records: papers divided into training set and test set (50% each)
Header records: same set of papers used for training and testing
Learning algorithm: SMO in Weka (an SVM implementation)
Bootstrapping the Training Data
Problem: the gold standard data specifies mappings at the paper-to-paper level, but not which reference matches which header
Solution
Hand-labeled a small set of reference-header pairs from 6 papers
Train an SVM on this small bootstrap set
On the training set, if the gold standard specifies P1 -> P2, then use the SVM to classify the reference-header pairs of P1 and P2
Retrain the SVM using reference-header pairs combined from the training and bootstrap sets
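The pair-labeling step of this bootstrap can be sketched as below. A trivial threshold scorer stands in for the bootstrap SVM, and the data layout is an assumption for illustration:

```python
def bootstrap_labels(train_pairs, gold, classify):
    # train_pairs: list of (paper1, paper2, pair_features)
    # gold: set of (citing, cited) paper-level mappings from the gold standard
    # classify: model trained on the hand-labeled bootstrap set
    labeled = []
    for p1, p2, feats in train_pairs:
        if (p1, p2) in gold:
            # Paper-level mapping known; let the bootstrap model decide which
            # reference-header pair inside it actually matches
            labeled.append((feats, classify(feats)))
        else:
            # No gold mapping between these papers: label as mismatch
            labeled.append((feats, False))
    return labeled
```

The resulting labels, combined with the hand-labeled bootstrap pairs, would feed the final retraining step.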
Experimental Result
Used the ACL subset (2176 PDF papers) Skipped: 142 reference sections, 202 paper
headers If classifier considers a reference in P1
matches header of P2, then P1 -> P2 Results (on paper to paper mappings)
P = 0.901, R = 0.696, F = 0.785 P = 0.898, R = 0.767, F = 0.827 (with manually
cleaned header records)
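The F scores above are the standard F-measure, the harmonic mean of precision and recall, which can be checked directly from the reported P and R:

```python
def f_measure(p, r):
    # F-measure: harmonic mean of precision p and recall r
    return 2 * p * r / (p + r)
```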
Cost-utility Framework
[Diagram: records r1..r6 form a matrix against features f1..f5; some feature-values are known, others can be acquired; each feature fi has a cost ci of acquiring it and a utility ui of acquiring it]
Record Matching
Features per header-reference pair (instance):
[1] Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
[2] Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Classifier output: MATCH / MISMATCH
Training data: assume all feature-values and their acquisition costs known
Testing data: assume [1] known, but feature-values and their acquisition costs in [2] unknown
Costs: set to MIN_LEN * MAX_LEN
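The MIN_LEN * MAX_LEN cost is a natural proxy for the work of computing a similarity feature: pairwise string comparisons such as edit distance on strings of lengths m and n take O(mn) time. A minimal sketch, assuming the cost of a SIM feature is the product of the two field lengths:

```python
def acquisition_cost(header_field, reference_field):
    # Cost of acquiring a SIM feature, set to MIN_LEN * MAX_LEN --
    # roughly the O(mn) work of an edit-distance computation
    m, n = len(header_field), len(reference_field)
    return min(m, n) * max(m, n)
```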
Costs and Utilities
Costs: trained 3 models (using M5'), treated as regression
Utilities: trained 2^3 = 8 classifiers, each predicting match/mismatch using only the known feature-values
For a test instance with a missing feature-value F:
Get the confidence of the appropriate classifier without F
Get the expected confidence of the appropriate classifier with F
Utility is the difference between the two confidence scores
Note: similar to Saar-Tsechansky et al.
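The utility computation above can be sketched as follows. The classifiers are indexed by which SIM features they use, and the expectation over the missing feature's value is taken over an assumed discrete distribution; both the representation and the distribution are illustrative assumptions, not the talk's implementation:

```python
def utility(instance, missing, classifiers, candidate_values):
    # instance: dict of known feature-values
    # classifiers: dict mapping frozenset of known features -> confidence fn
    # candidate_values: assumed distribution {value: probability} over the
    #   missing feature (stand-in for the expectation in the talk)
    known = frozenset(instance)

    # Confidence of the appropriate classifier without F
    conf_without = classifiers[known](instance)

    # Expected confidence of the appropriate classifier with F
    conf_with = 0.0
    for value, prob in candidate_values.items():
        extended = dict(instance, **{missing: value})
        conf_with += prob * classifiers[known | {missing}](extended)

    # Utility = difference between the two confidence scores
    return conf_with - conf_without
```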
Results
[Two plots: precision, recall, and F-measure (y-axis, 0 to 1) against normalized cost (x-axis, 0 to 1), with the proportion of feature-values acquired increasing along the cost axis; left: without cleaning of header records, right: with manual cleaning of header records]
Thank You