Towards Domain-Independent Information Extraction from Web Tables Wolfgang Gatterbauer, Paul...


Page 1: Towards Domain-Independent Information Extraction from Web Tables Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak.

Towards Domain-Independent Information Extraction from Web Tables

Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak

Presented by Aaron Stewart
BYU CS 652

Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model

Database and Artificial Intelligence Group
Vienna University of Technology, Austria

Wolfgang Gatterbauer and Paul Bohunsky

Page 2

Contributions

1. Classify visually structured data
2. Non-tree IE formalism
3. Argue to defer semantic interpretation of output
4. Ground truthing method
5. Web table test set
6. Visual results

Page 3

Introduction

Source: Gatterbauer et al. 2007

Page 4

Visually Structured Data on the Web

• Tables
• Lists
• Aligned Graphs

Page 5

Visually Structured Data on the Web

Source: Gatterbauer et al. 2007

Page 6

Formal Setup

• DOM Tree Representation
• Visual Box Representation
  – Visualized Element Nodes (VENs)
    • DOM nodes with bounding boxes
  – Visualized Words
    • Text words with bounding boxes
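The two visual box types above can be pictured as small records that pair content with a bounding box. This is a minimal sketch only: the class names, field names, and the (left, top, right, bottom) pixel convention are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels (assumed convention)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


@dataclass(frozen=True)
class VisualizedElementNode:
    """A DOM element together with the box the browser rendered it in."""
    tag: str
    box: Box


@dataclass(frozen=True)
class VisualizedWord:
    """A single text word with its own box and the element that contains it."""
    text: str
    box: Box
    parent: VisualizedElementNode


# Hypothetical example: one table cell containing one word.
cell = VisualizedElementNode("TD", Box(10, 20, 110, 40))
word = VisualizedWord("Price", Box(12, 22, 50, 38), cell)
```

Keeping the box on both elements and words lets later steps reason about position alone, without walking the DOM tree.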

Page 7

Formal Setup

Source: Gatterbauer et al. 2007

Page 8

Information Extraction

• Visualized Element Nodes Table extraction (VENTex)

• Steps:
  – Table location
  – Table recognition
  – Table interpretation

Page 9

Information Extraction

Source: Gatterbauer et al. 2007

Page 10

Table Extraction

Source: Gatterbauer et al. 2007

Page 11

Table Extraction

1. Gather 8 HTML node attributes
2. For text, add link
3. Only accept TH, TD, DIV HTML nodes
4. Tables must form frames
5. Remove duplicate bounding boxes

Page 12

Table Extraction

6. Adjacency: 3 pixels
7. LOCATEFRAMES algorithm
8. No overlapping cells
9. Minimum 3 rows, 2 columns
10. Remove empty rows/columns (spacers)
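Several of the filtering steps listed above (accepted tags, duplicate boxes, the 3-pixel adjacency, the minimum table size, and spacer removal) are simple enough to sketch. The code below is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, boxes are assumed to be (left, top, right, bottom) tuples, the 3 pixels are read as a tolerance on the gap between neighboring cells, and the first node seen is the one kept when two boxes coincide. The LOCATEFRAMES step itself is not covered.

```python
ACCEPTED_TAGS = {"TH", "TD", "DIV"}   # step 3
ADJACENCY_PX = 3                      # step 6, read here as a gap tolerance
MIN_ROWS, MIN_COLS = 3, 2             # step 9


def candidate_cells(nodes):
    """Keep accepted tags and drop nodes whose box was already seen (steps 3, 5).

    nodes: iterable of (tag, (left, top, right, bottom)) tuples.
    """
    seen = set()
    kept = []
    for tag, box in nodes:
        if tag.upper() not in ACCEPTED_TAGS or box in seen:
            continue
        seen.add(box)
        kept.append((tag, box))
    return kept


def adjacent_in_row(a, b, tol=ADJACENCY_PX):
    """True if b begins within tol pixels of a's right edge and they share rows."""
    gap = b[0] - a[2]
    vertical_overlap = min(a[3], b[3]) - max(a[1], b[1])
    return abs(gap) <= tol and vertical_overlap > 0


def drop_spacers(grid):
    """Remove rows and columns whose cells are all blank (step 10).

    grid: rectangular list of rows, each a list of strings.
    """
    rows = [r for r in grid if any(c.strip() for c in r)]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if any(r[j].strip() for r in rows)]
    return [[r[j] for j in keep] for r in rows]


def large_enough(grid):
    """Apply the minimum size of 3 rows and 2 columns (step 9)."""
    return len(grid) >= MIN_ROWS and all(len(r) >= MIN_COLS for r in grid)


# Hypothetical input: a DIV duplicating a TD box, plus a SPAN that is not accepted.
nodes = [
    ("td", (0, 0, 50, 20)),
    ("div", (0, 0, 50, 20)),
    ("span", (5, 5, 20, 15)),
    ("th", (52, 0, 100, 20)),
]
cells = candidate_cells(nodes)
touching = adjacent_in_row(cells[0][1], cells[1][1])

raw = [["Item", "", "Price"], ["", "", ""], ["Tea", "", "2"], ["Milk", "", "1"]]
table = drop_spacers(raw)
accepted = large_enough(table)
```

Removing spacers before the size check matters in this sketch: a layout with only two real rows padded by an empty one would otherwise pass the 3-row minimum.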

Page 13

LOCATE FRAMES Algorithm (earlier paper)

• Visual table model
• Expansion algorithm

Page 14

Visual Table Model

Source: Gatterbauer et al. 2007

Page 15

Double Topographical Grid???

• Two origins
  – Upper left corner
  – Lower right corner
• Sorted lists of pixel positions
  – The numbers are indices
  – But pixels remain in regular coordinates
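One possible reading of these bullets is that the distinct edge positions of all boxes are sorted, and each position then gets two indices: one counted from the upper-left origin and one counted from the lower-right origin, while the pixel values themselves stay unchanged. The sketch below follows that reading for the x axis only. It is an assumption about what the slide means, not the paper's definition, and `grid_indices` is a hypothetical name.

```python
def grid_indices(positions):
    """Index distinct pixel positions from both ends of the sorted list.

    Returns (ordered, from_start, from_end), where ordered holds the original
    pixel values and the two dicts map a pixel value to its grid index.
    """
    ordered = sorted(set(positions))
    last = len(ordered) - 1
    from_start = {p: i for i, p in enumerate(ordered)}
    from_end = {p: last - i for i, p in enumerate(ordered)}
    return ordered, from_start, from_end


# Hypothetical boxes as (left, top, right, bottom) in page pixels.
boxes = [(0, 0, 50, 20), (50, 0, 120, 20), (0, 20, 50, 45)]

# Collect every vertical edge (left and right) of every box.
xs = [x for b in boxes for x in (b[0], b[2])]
ordered_x, x_from_left, x_from_right = grid_indices(xs)
```

Working with indices rather than raw pixels means two cells count as neighbors when their grid indices line up, regardless of how wide the columns are on screen.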

Page 16

Neighbor Relations

Source: Gatterbauer et al. 2007

Page 17

Neighbor Relations

• Expand to include neighbors 1, 2, 3, 4
  – Within or equal
  – Not bigger
  – Not outside
  – Not stepped

Page 18

Expansion Algorithm

Source: Gatterbauer et al. 2007

Page 20

Table Interpretation

• Argument
  – Few details about the method actually used
  – Take data as it comes
  – Pass it on to a later semantic processing stage

Page 21

Table Interpretation

Source: Gatterbauer et al. 2007

Page 22

Performance

• Load + render: O(n)
• Double topographical grid: O(n sqrt(n))
• About 5 seconds per page

Page 23

Web Table Ground Truthing

• Tool to copy web pages
  – (not easy!)
  – http://www.dbai.tuwien.ac.at/user/pollak/webpagedump
• Students selected and submitted pages
  – 493 web tables
  – 269 web pages
  – 63 students
  – http://www.dbai.tuwien.ac.at/staff/gatter/ventex/

Page 24

Experimental Results

Source: Gatterbauer et al. 2007

Page 25

Future Work

• Table extraction
• Table interpretation
• Nested substructures
• Other visually structured data
• Information integration

Source: Gatterbauer et al. 2007

Page 26

My Conclusions

• Useful table-building algorithm
  – For electronic data only
  – Requires strict alignment
• Could be expanded
  – Other electronic formats (PDF, even ASCII text)
  – Probabilistic model for jitter