Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft -...
![Page 1: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/1.jpg)
Table Extraction Using Conditional Random Fields
D. Pinto, A. McCallum, X. Wei and W. Bruce Croft
- on SIGIR03 -
Presented by Vitor R. Carvalho
March 15th 2004
![Page 2: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/2.jpg)
Warm up
• Why table extraction?
– Applications: question answering, data mining and IR
– Tables: “textual tokens laid out in tabular form”
– Tables: “databases designed for human eyes”
• Related work:
– Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table
– Ng et al., 1999: machine learning to identify row and column positions; no extraction of content
– Hurst, 2000: combines layout and language perspectives; text is broken into blocks using spatial and linguistic evidence
– Pinto et al., 2002: CAG-based heuristic method to extract table cells for a QA system
![Page 3: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/3.jpg)
Objectives
• In this paper:
– Only text tables are studied, not HTML tables
– Table extraction can be broken down into 6 subproblems:
» Locate the table (*)
» Identify the row positions and types (*)
» Identify the column positions and types
» Segment tables into cells
» Tag cells as data or headers
» Associate data cells with their corresponding headers
– Only the (*) tasks are addressed in the paper
– CRFs are compared to Maximum Entropy models and to HMMs
![Page 4: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/4.jpg)
Example
• From www.FedStats.com, July 2001
![Page 5: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/5.jpg)
12 Line Labels
• Non-extraction labels
– { NONTABLE, BLANKLINE, SEPARATOR }
• Header labels
– { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }
• Data row labels
– { DATAROW, SECTIONDATAROW }
• Caption labels
– { TABLEFOOTNOTE, TABLECAPTION }
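The label inventory above can be written down as a small lookup table. This is only a sketch of the grouping as the slide presents it; the names `LineGroup`, `LABEL_GROUPS` and `group_of` are hypothetical and not taken from the paper.

```python
from enum import Enum


class LineGroup(Enum):
    """The four label families listed on the slide (enum names are illustrative)."""
    NON_EXTRACTION = "non-extraction"
    HEADER = "header"
    DATA_ROW = "data row"
    CAPTION = "caption"


# Labels copied from the slide, grouped exactly as listed there.
LABEL_GROUPS = {
    LineGroup.NON_EXTRACTION: ("NONTABLE", "BLANKLINE", "SEPARATOR"),
    LineGroup.HEADER: ("TITLE", "SUPERHEADER", "TABLEHEADER",
                       "SUBHEADER", "SECTIONHEADER"),
    LineGroup.DATA_ROW: ("DATAROW", "SECTIONDATAROW"),
    LineGroup.CAPTION: ("TABLEFOOTNOTE", "TABLECAPTION"),
}

# Reverse index: line label -> family.
GROUP_OF = {label: group
            for group, labels in LABEL_GROUPS.items()
            for label in labels}

ALL_LABELS = tuple(GROUP_OF)


def group_of(label: str) -> LineGroup:
    """Return the family of a line label; raises KeyError for unknown labels."""
    return GROUP_OF[label]
```

The reverse index makes it easy to check that the four families together contain the 12 labels named in the slide title.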
![Page 6: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/6.jpg)
Feature Set
• White space features
– Presence of: 4 consecutive white spaces, 4-space indents, 2 consecutive white spaces between non-space characters, a complete white space line, single-space indent, etc.
– Percentage of: white space from the first non-white-space character on
• Text features
– Presence of: 3 cells on a line, etc.
– Percentage of: digits (0-9) on a line, alphabetic characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line
• Separator features
– Presence of: 4 consecutive periods
– Percentage of: separator characters (-, +, !, =, :, *) on a line
• Conjunctions of features
– Conjunctions: current & previous line, current & next line, next & next-next line
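A rough sketch of how a few of these per-line features might be computed is given below. The slide does not give exact definitions, so several choices here are assumptions: a "cell" is taken to be a run of text separated by two or more spaces, percentages are taken over all characters of the line, letters are counted case-insensitively, and a conjunction pairs the same boolean feature at two line offsets. The function names `line_features` and `with_conjunctions` are hypothetical, and the header features (year strings, month abbreviations) are left out.

```python
import re

SEPARATOR_CHARS = frozenset("-+!=:*")    # separator characters as listed on the slide
OFFSET_PAIRS = ((0, -1), (0, 1), (1, 2))  # current&previous, current&next, next&next-next


def _pct(count: int, total: int) -> float:
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Compute a subset of the slide's features for one line of text."""
    text = line.rstrip("\n")
    body = text.lstrip()  # from the first non-white-space character on
    # Assumption: cells are separated by runs of two or more spaces.
    cells = [c for c in re.split(r" {2,}", text.strip()) if c]
    n = len(text)
    return {
        "four_spaces": "    " in text,
        "four_space_indent": text.startswith("    "),
        "single_space_indent": re.match(r" \S", text) is not None,
        "two_space_gap": re.search(r"\S {2,}\S", text) is not None,
        "blank_line": text.strip() == "",
        "three_cells": len(cells) >= 3,
        "four_periods": "...." in text,
        "pct_space": _pct(body.count(" "), len(body)),
        "pct_digit": _pct(sum(ch in "0123456789" for ch in text), n),
        "pct_alpha": _pct(sum(ch.isascii() and ch.isalpha() for ch in text), n),
        "pct_separator": _pct(sum(ch in SEPARATOR_CHARS for ch in text), n),
    }


def with_conjunctions(lines: list) -> list:
    """Add conjunctions of each boolean feature across neighbouring lines."""
    base = [line_features(l) for l in lines]
    out = []
    for t in range(len(base)):
        feats = dict(base[t])
        for a, b in OFFSET_PAIRS:
            if 0 <= t + a < len(base) and 0 <= t + b < len(base):
                for name, value in base[t + a].items():
                    if isinstance(value, bool):
                        key = f"{name}@{a}&{name}@{b}"
                        feats[key] = value and base[t + b][name]
        out.append(feats)
    return out
```

Conjunctions that would reach past the start or end of the document are simply omitted in this sketch, so the first and last lines carry fewer features than interior lines.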
![Page 7: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/7.jpg)
Task 1: Table Line Location
• A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR
• F-Measure = (2 * Precision * Recall) / (Precision + Recall)
• Both CRFs used a Gaussian prior and were trained using L-BFGS
• Training set (52 documents), development set (6 documents), test set (62 documents)
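As an illustration of the table-line definition and the F-measure formula above, the sketch below scores a predicted label sequence against a gold one. It assumes that precision and recall are computed per line over the binary "table line or not" decision, which the slide does not spell out; `is_table_line` and `table_line_scores` are hypothetical names.

```python
NON_TABLE_LABELS = frozenset({"NONTABLE", "BLANKLINE", "SEPARATOR"})


def is_table_line(label: str) -> bool:
    """A table line is any line whose label is not a non-extraction label."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # True positives: lines that are table lines in both sequences,
    # regardless of which table label each one carries.
    tp = sum(is_table_line(g) and is_table_line(p)
             for g, p in zip(gold, predicted))
    predicted_pos = sum(is_table_line(p) for p in predicted)
    gold_pos = sum(is_table_line(g) for g in gold)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Under this reading, a line predicted as TITLE whose gold label is DATAROW still counts as correctly located, since both labels denote table lines; telling the specific labels apart is the subject of the line identification task.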
![Page 8: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/8.jpg)
Task 2: Line Identification
• How many of these lines were actually table lines?
![Page 9: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/9.jpg)
Task 2: Line Identification
![Page 10: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/10.jpg)
Additional Results
• Pinto et al. heuristic method
• 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE
![Page 11: Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15.](https://reader036.fdocuments.us/reader036/viewer/2022083009/5697bfba1a28abf838ca07ac/html5/thumbnails/11.jpg)
Conclusions
• Table extraction involves complex linguistic and formatting characteristics, so a combination of textual and spatial features was used.
• CRFs handle arbitrary, overlapping features well and combine the benefits of conditionally trained models with those of Markov finite-state context models.
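For reference, the standard linear-chain CRF formulation behind the second point can be written as below. The notation is the usual textbook one and is an assumption here, not copied from the slides: x is the sequence of document lines, y the sequence of line labels, f_k the feature functions, λ_k their weights, and σ² the variance of the Gaussian prior.

```latex
% Conditional probability of a label sequence y given the lines x
p_\Lambda(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

% Training objective: conditional log-likelihood with a Gaussian prior on the weights
\mathcal{L}(\Lambda)
  = \sum_{i} \log p_\Lambda\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big)
    - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```

Each f_k may inspect the whole input x, which is what allows overlapping features such as conjunctions across neighbouring lines, while the dependence on y_{t-1} and y_t supplies the Markov context over labels.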