Matching References to Headers in PDF Papers
Tan Yee Fan
2007 December 19
WING Group Meeting
Task
Corpus: the ACL Anthology, a collection of PDF papers
Task: for each paper P, which papers are cited by P?
Gold standard data obtained from Dragomir
e.g., P00-1002 ==> P98-2144
Header and References
Header of paper (HeaderParse): paper title, author names, etc.
Reference section (ParsCit): paper title, author names, publication venue, etc.
Each header and each reference is treated as a record: title, authors, venue
System Overview
Pipeline: header records are indexed in Lucene; each reference record is issued as a query; the returned headers are passed to the matching algorithm
Indexing: all fields concatenated into a single string, perform token matching
Querying: OR matching (default in Lucene)
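The indexing and querying steps above can be sketched in plain Python. This is a toy stand-in for Lucene, not the actual system: the field layout and the shared-token scoring are illustrative assumptions.

```python
from collections import defaultdict

def make_doc(title, authors, venue):
    # Indexing: all fields concatenated into a single string, then tokenized
    return f"{title} {authors} {venue}".lower().split()

def build_index(header_records):
    # Inverted index: token -> set of header ids containing that token
    index = defaultdict(set)
    for doc_id, tokens in header_records.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index

def query(index, reference_tokens, top_k=5):
    # Querying: OR matching -- any shared token contributes to the score,
    # so a header need not contain every query token to be returned
    scores = defaultdict(int)
    for tok in reference_tokens:
        for doc_id in index.get(tok, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The returned top-k headers would then go to the matching algorithm for the final match/mismatch decision.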
Record Matching
Each header record and each reference record has TITLE, AUTHOR, VENUE fields
Features per header-reference pair (instance):
TITLE_MIN_LEN, TITLE_MAX_LEN
AUTHOR_MIN_LEN, AUTHOR_MAX_LEN
VENUE_MIN_LEN, VENUE_MAX_LEN
TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Classifier output: MATCH / MISMATCH
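Extracting those nine features for one header-reference pair can be sketched as follows; the Jaccard token overlap here is an illustrative stand-in, since the talk does not specify the similarity function:

```python
def length_features(h_val, r_val):
    # MIN_LEN / MAX_LEN of the two field strings
    a, b = len(h_val), len(r_val)
    return min(a, b), max(a, b)

def token_sim(h_val, r_val):
    # Stand-in similarity: Jaccard overlap of token sets
    a, b = set(h_val.lower().split()), set(r_val.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def instance_features(header, reference):
    # One instance per header-reference pair:
    # 6 length features + 3 similarity features
    feats = {}
    for field in ("title", "author", "venue"):
        lo, hi = length_features(header[field], reference[field])
        feats[f"{field.upper()}_MIN_LEN"] = lo
        feats[f"{field.upper()}_MAX_LEN"] = hi
        feats[f"{field.upper()}_SIM"] = token_sim(header[field], reference[field])
    return feats
```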
Experiment Setup
Data
Reference records: papers divided into training set and test set (50% each)
Header records: same set of papers used for training and testing
Learning algorithm: SMO in Weka (an SVM implementation)
Bootstrapping the Training Data
Problem: the gold standard data specifies mappings at the paper-to-paper level, but not which reference matches which header
Solution
Hand-labeled a small set of reference-header pairs from 6 papers
Train an SVM on this small bootstrap set
On the training set, if the gold standard specifies P1 -> P2, then use the SVM to classify the reference-header pairs of P1 and P2
Retrain the SVM using reference-header pairs combined from the training and bootstrap sets
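The pair-labeling step of this bootstrap can be sketched as below. A trivial threshold scorer stands in for the bootstrap SVM, and the data layout is an assumption for illustration:

```python
def bootstrap_labels(train_pairs, gold, classify):
    # train_pairs: list of (paper1, paper2, pair_features)
    # gold: set of (citing, cited) paper-level mappings from the gold standard
    # classify: model trained on the hand-labeled bootstrap set
    labeled = []
    for p1, p2, feats in train_pairs:
        if (p1, p2) in gold:
            # Paper-level mapping known; let the bootstrap model decide which
            # reference-header pair inside it actually matches
            labeled.append((feats, classify(feats)))
        else:
            # No gold mapping between these papers: label as mismatch
            labeled.append((feats, False))
    return labeled
```

The resulting labels, combined with the hand-labeled bootstrap pairs, would feed the final retraining step.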
Experimental Result
Used the ACL subset (2176 PDF papers) Skipped: 142 reference sections, 202 paper
headers If classifier considers a reference in P1
matches header of P2, then P1 -> P2 Results (on paper to paper mappings)
P = 0.901, R = 0.696, F = 0.785 P = 0.898, R = 0.767, F = 0.827 (with manually
cleaned header records)
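The F scores above are the standard F-measure, the harmonic mean of precision and recall, which can be checked directly from the reported P and R:

```python
def f_measure(p, r):
    # F-measure: harmonic mean of precision p and recall r
    return 2 * p * r / (p + r)
```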
Cost-utility Framework
[Diagram: records r1..r6 form a matrix against features f1..f5; some feature-values are known, others can be acquired; each feature fi has a cost ci of acquiring it and a utility ui of acquiring it]
Record Matching
Features per header-reference pair (instance):
[1] Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
[2] Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Classifier output: MATCH / MISMATCH
Training data: assume all feature-values and their acquisition costs known
Testing data: assume [1] known, but feature-values and their acquisition costs in [2] unknown
Costs: set to MIN_LEN * MAX_LEN
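The MIN_LEN * MAX_LEN cost is a natural proxy for the work of computing a similarity feature: pairwise string comparisons such as edit distance on strings of lengths m and n take O(mn) time. A minimal sketch, assuming the cost of a SIM feature is the product of the two field lengths:

```python
def acquisition_cost(header_field, reference_field):
    # Cost of acquiring a SIM feature, set to MIN_LEN * MAX_LEN --
    # roughly the O(mn) work of an edit-distance computation
    m, n = len(header_field), len(reference_field)
    return min(m, n) * max(m, n)
```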
Costs and Utilities
Costs: trained 3 models (using M5'), treated as regression
Utilities: trained 2^3 = 8 classifiers, each predicting match/mismatch using only the known feature-values
For a test instance with a missing feature-value F:
Get the confidence of the appropriate classifier without F
Get the expected confidence of the appropriate classifier with F
Utility is the difference between the two confidence scores
Note: similar to Saar-Tsechansky et al.
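The utility computation above can be sketched as follows. The classifiers are indexed by which SIM features they use, and the expectation over the missing feature's value is taken over an assumed discrete distribution; both the representation and the distribution are illustrative assumptions, not the talk's implementation:

```python
def utility(instance, missing, classifiers, candidate_values):
    # instance: dict of known feature-values
    # classifiers: dict mapping frozenset of known features -> confidence fn
    # candidate_values: assumed distribution {value: probability} over the
    #   missing feature (stand-in for the expectation in the talk)
    known = frozenset(instance)

    # Confidence of the appropriate classifier without F
    conf_without = classifiers[known](instance)

    # Expected confidence of the appropriate classifier with F
    conf_with = 0.0
    for value, prob in candidate_values.items():
        extended = dict(instance, **{missing: value})
        conf_with += prob * classifiers[known | {missing}](extended)

    # Utility = difference between the two confidence scores
    return conf_with - conf_without
```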
Results
[Two plots: precision, recall, and F-measure (y-axis, 0 to 1) against normalized cost (x-axis, 0 to 1), with the proportion of feature-values acquired increasing along the cost axis; left: without cleaning of header records, right: with manual cleaning of header records]
Thank You