Logical Structure Recovery in Scholarly Articles with Rich Document Features
Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan
• Logical structure annotation in ForeCiteReader.
• The view shows the object navigation interface, currently focusing on the list of figure captions.
04/21/23 2
• Section navigation in the ForeCiteReader environment with generic sections
Overview
• Methodology
  – Problem Formulation
  – Learning Model - CRF
  – Approach overview
  – Classification categories
• Raw-text features
• Rich document representation
• Experiments
• Further analysis
Problem Formulation
Two related subtasks:
• Logical structure (LS) classification
  – scholarly document as an ordered collection of text lines
  – label each text line with a semantic category, e.g. title, author, address, etc.
• Generic section (GS) classification
  – take the headers of each section of text in a paper
  – deduce a generic logical purpose of the section.
Sequence labeling tasks - CRF
Learning Model - CRF
CRF in simplified form
f: both state & transition functions
[Equation not recovered; its annotations: binary feature, state function, transition function]
• Utilize CRF++ package http://crfpp.sourceforge.net/
• Input for line l_i to CRF++ is of the form "value_1 … value_m category_i"
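For reference, a linear-chain CRF in simplified form is commonly written as below. This is the textbook formulation, given here as an assumption about what the slide showed; the symbols \lambda_j, F_j, and Z(\mathbf{x}) are conventional notation and are not taken from the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{j} \lambda_j \, F_j(\mathbf{y}, \mathbf{x}) \Big),
\qquad
F_j(\mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \mathbf{x}, i),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{j} \lambda_j \, F_j(\mathbf{y}', \mathbf{x}) \Big)
```

Here each f_j is a binary feature that is either a state function s(y_i, \mathbf{x}, i), which looks only at the current label, or a transition function t(y_{i-1}, y_i, \mathbf{x}, i), which also looks at the previous label.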
Approach overview
Classification categories - example
Classification categories – full sets
• Logical structure subtask, 23 categories: address, affiliation, author, bodyText, categories, construct, copyright, email, equation, figure, figureCaption, footnote, keywords, listItem, note, page, reference, sectionHeader, subsectionHeader, subsubsectionHeader, table, tableCaption, and title.
• Generic section subtask, 13 categories: abstract, categories, general terms, keywords, introduction, background, relatedWork, methodology, evaluation, discussions, conclusions, acknowledgments, and references.
Raw-text features - LS
• ParsCit token-level features +
• Our line-level features:
  – Location: relative position within document
  – Number: patterns of subsections, subsubsections, categories, footnotes
  – Punctuation: patterns of emails & web links, bracket numbering of equations
  – Length: 1token, 2token, 3token, 4token, 5+token
identify majority of lines as bodyText
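The four line-level feature groups can be illustrated with a minimal sketch. The function name, the feature value strings (`Loc_`, `Num_depth`, `Punct_`), the number of location bins, and the regular expressions are all assumptions made for illustration; the slides do not give the actual patterns used.

```python
import re


def line_features(line: str, index: int, total_lines: int, bins: int = 8) -> list[str]:
    """Hypothetical line-level features for one text line of a document."""
    feats = []

    # Location: relative position within the document, quantized into `bins` buckets.
    bucket = min(index * bins // max(total_lines, 1), bins - 1)
    feats.append(f"Loc_{bucket}")

    # Number: leading numbering such as "3", "3.1" or "3.1.2" (depth 1, 2, 3).
    match = re.match(r"^(\d+(?:\.\d+)*)\.?\s+\S", line)
    depth = match.group(1).count(".") + 1 if match else 0
    feats.append(f"Num_depth{depth}")

    # Punctuation: email, web link, or a trailing bracketed equation number.
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", line):
        feats.append("Punct_email")
    elif re.search(r"https?://|www\.", line):
        feats.append("Punct_web")
    elif re.search(r"\(\d+\)\s*$", line):
        feats.append("Punct_eqnum")
    else:
        feats.append("Punct_none")

    # Length: 0token .. 4token for short lines, 5+token otherwise.
    n_tokens = len(line.split())
    feats.append(f"{n_tokens}token" if n_tokens < 5 else "5+token")
    return feats
```

Since most lines in a paper are long, the `5+token` value fires on the bulk of lines, which is consistent with the slide's remark that length helps identify bodyText.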
Raw-text features - GS
• Naïve, yet effective features:
  – Positions
  – First and Second Words
  – Whole Header
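A minimal sketch of these three header features, assuming headers are lowercased and the position is quantized into a fixed number of bins. The function name, the `bins` parameter, and the `Pos_`/`First_`/`Second_`/`Whole_` prefixes are hypothetical.

```python
def header_features(headers: list[str], bins: int = 4) -> list[list[str]]:
    """Hypothetical features for each section header of one paper, in order."""
    rows = []
    n = len(headers)
    for i, header in enumerate(headers):
        words = header.lower().split()
        first = words[0] if words else "NONE"
        second = words[1] if len(words) > 1 else "NONE"
        whole = "-".join(words) if words else "NONE"
        # Position: where the header falls among all headers, quantized.
        position = i * bins // n
        rows.append([
            f"Pos_{position}",
            f"First_{first}",
            f"Second_{second}",
            f"Whole_{whole}",
        ])
    return rows
```

For example, `header_features(["Abstract", "Introduction", "Related Work", "Conclusions"])` gives the third header the row `["Pos_2", "First_related", "Second_work", "Whole_related-work"]`.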
Rich document representation – OCR output
• Linearize XML output into CRF features: "Don't-Look-Now,-But-We've-Created-a-Bureaucracy. Loc_0 Align_left FontSize_largest Bold_yes Italic_no Picture_no Table_no Bullet_no".
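The linearization step can be sketched as follows, assuming the OCR attributes of a line have already been extracted into a dictionary. The function name `linearize` and the optional `label` argument are hypothetical; only the output format is taken from the slide.

```python
from typing import Optional


def linearize(text: str, features: dict[str, str], label: Optional[str] = None) -> str:
    """Turn one text line and its attributes into a space-separated feature row.

    Spaces inside the line are replaced by hyphens so the text stays a single
    column; each attribute becomes a "Name_value" column, in insertion order.
    """
    columns = ["-".join(text.split())]
    columns += [f"{name}_{value}" for name, value in features.items()]
    if label is not None:
        # For training data the category is the last column.
        columns.append(label)
    return " ".join(columns)


row = linearize(
    "Don't Look Now, But We've Created a Bureaucracy.",
    {
        "Loc": "0", "Align": "left", "FontSize": "largest", "Bold": "yes",
        "Italic": "no", "Picture": "no", "Table": "no", "Bullet": "no",
    },
)
print(row)
```

Passing `label="title"` would append the category as the final column, matching the "value … category" row shape that CRF++ expects for training data.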
Rich document representation – OCR features
• Position
  – Alignment: left, center, right & justified
  – Location: within-page location
• Format
  – FontSize: quantized based on frequency, e.g. smaller, smaller, base, -2, -1, 0
  – Bold
  – Italic
• Object
  – Bullet
  – Picture
  – Table
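One plausible reading of frequency-based font-size quantization is sketched below: the most frequent size in a document is treated as the base (0), and every other size is ranked relative to it. This scheme is an assumption for illustration; the slide does not spell out the exact rule, and the function name is hypothetical.

```python
from collections import Counter


def quantize_font_sizes(sizes: list[float]) -> list[int]:
    """Map raw per-line font sizes to ranks relative to the most frequent size."""
    if not sizes:
        return []
    # Base size: the most frequent one, assumed to be the body-text font.
    base = Counter(sizes).most_common(1)[0][0]
    distinct = sorted(set(sizes))
    base_rank = distinct.index(base)
    # Smaller sizes get -1, -2, ...; larger sizes get 1, 2, ...
    rank = {size: distinct.index(size) - base_rank for size in distinct}
    return [rank[size] for size in sizes]
```

Ranking relative to the base makes the feature comparable across documents that use different absolute font sizes.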
Experiments - datasets
• LS: 20 ACM, 10 CHI 2008, 10 ACL 2009 – fully labeled
• GS: 211 ACM papers – headers labeled
Skewed data
Experiments – metrics
TP: # correctly classified text lines (true positives). FN, FP, and TN are defined similarly for false negatives, false positives, and true negatives.
• Category-specific performance:
  – F1 measure = 2 × P × R / (P + R)
  – Precision P = TP / (TP + FP), Recall R = TP / (TP + FN)
• Overall performance:
  – Macro average: average of all category-specific F1
  – Micro average: percentage of correctly labeled lines
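These metrics can be computed from two parallel label sequences as in the sketch below. The function name `evaluate` is hypothetical, and averaging the macro score over every category that appears in either the gold or the predicted labels is an assumption.

```python
from collections import Counter


def evaluate(gold: list[str], pred: list[str]) -> tuple[dict[str, float], float, float]:
    """Return per-category F1, macro average, and micro average (line accuracy)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category p, but the line was not p
            fn[g] += 1  # the line was g, but g was not predicted

    f1 = {}
    for c in sorted(set(gold) | set(pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        total = precision + recall
        f1[c] = 2 * precision * recall / total if total else 0.0

    macro = sum(f1.values()) / len(f1)
    micro = sum(tp.values()) / len(gold)
    return f1, macro, micro


gold = ["title", "author", "bodyText", "bodyText", "bodyText", "footnote"]
pred = ["title", "bodyText", "bodyText", "bodyText", "bodyText", "footnote"]
f1, macro, micro = evaluate(gold, pred)
```

In this toy example the single author line is mislabeled as bodyText, so the macro average falls well below the micro average: on skewed data a missed rare category costs far more in macro than in micro terms.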
Experiments – LS results
LSPC - baseline using only ParsCit features
LSPC+RT: LSPC + raw text features
LSPC+RT+RD: LSPC+RT + rich document features (OCR)
• LSPC+RT+RD and LSPC+RT outperform LSPC by more than 10 F1 points
• LSPC+RT+RD < LSPC+RT: minor degradation for four categories
• LSPC+RT+RD > LSPC+RT: all other categories (many by more than 4 F1 points)
Large improvements for footnote and subsubsectionHeader
Experiments – GS results
• GSmaxent: maximum entropy based system (Nguyen and Kan, 2007)
• GSCRF: our system
• GSCRF > GSmaxent in all categories except background
Large improvements for discussions
Further analysis – Text features
• All contribute to the final composite performance
• Most influential: position
Further analysis – rich doc features
• Format contributes most to the macro average, while object influences the micro average most
• Format features help a wider spectrum of categories: paper metadata & section headers
• Object features enhance fewer categories, but ones with a large amount of training data, e.g. list item, table
Further analysis – rich doc features
• Most features improve both metrics, except align & table: trade-off between macro and micro
• Location, Font, and Bullet are the most effective features in the position, format, and object groups respectively
Error analysis - LS
Error analysis - GS
• whole header: non-overlapping tokens with any of the memoized training data instances
  Needs to use body text instead (Future work)
• Similar relative positions of consecutive headers: background vs. method, method vs. discussions, & discussions vs. conclusions
• The dataset skew also has an impact: there is a large number of method headers and far fewer for the background and discussions categories, so many headers are mislabelled as method
Q & A
Thank you!