AMBER presentation
Little Knowledge Rules The Web: Domain-Centric Result Page Extraction
Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang
Department of Computer Science, University of Oxford
Result Page Understanding
Outline
Adaptable Model-Based Extraction of Result Pages (AMBER)
• System Overview
• Experiments
• Current Work
Part of DIADEM | Domain-centric Intelligent Automated Data Extraction Methodology
AMBER: System Overview
Needs only one clue
Implemented in rules
Very high precision & recall
Domain-parameterized tool, currently aimed at UK real estate
Fact Generation & Annotation
• Live browser (Mozilla XUL-Runner)
• Extract DOM tree
• CSS box information
• Textual annotation with GATE (domain-dependent)
– Gazetteers
– Regular-expression-like rules
• All represented as facts in the Page Model
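
As a rough illustration of the annotation step above, the following Python sketch turns text nodes into facts using a gazetteer and a regular expression. It does not use GATE or the actual AMBER page model; the `Fact` tuple, the tiny location gazetteer, and the price pattern are all assumptions made for the example.

```python
import re
from typing import Dict, List, NamedTuple


class Fact(NamedTuple):
    """One annotation fact: predicate(node_id, value). Hypothetical shape."""
    predicate: str
    node_id: int
    value: str


# Assumed, illustrative gazetteer of UK locations (not AMBER's real one).
LOCATION_GAZETTEER = {"oxford", "london", "cambridge"}

# Assumed price pattern: a pound sign followed by digits with thousands separators.
POUND = "\u00a3"
PRICE_RE = re.compile(POUND + r"\s?\d{1,3}(?:,\d{3})*")


def annotate(text_nodes: Dict[int, str]) -> List[Fact]:
    """Emit price and location facts for each text node, in node-id order."""
    facts: List[Fact] = []
    for node_id, text in sorted(text_nodes.items()):
        # Regular-expression-like rule for prices.
        for match in PRICE_RE.finditer(text):
            facts.append(Fact("price", node_id, match.group()))
        # Gazetteer lookup for locations, word by word.
        for word in re.findall(r"[A-Za-z]+", text):
            if word.lower() in LOCATION_GAZETTEER:
                facts.append(Fact("location", node_id, word))
    return facts
```

Calling `annotate` on a mapping from DOM node ids to their text yields a flat list of facts, which is the kind of input a rule-based layer could then reason over.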
Phenomenological Mapping
Fact → Attribute
• Attribute Model:
– Types & constraints
• DOM node and attribute
• Attribute Creation Constraints:
– Required Annotations
– Disallowed Annotations
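
The attribute creation constraints above can be sketched in Python as a check over each node's annotation set: an attribute is created only if all required annotations are present and no disallowed annotation is. The `AttributeRule` type and the annotation names in the usage note are hypothetical; AMBER itself expresses this in rules, not Python.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Set


class AttributeRule(NamedTuple):
    """Hypothetical attribute model entry with creation constraints."""
    name: str
    required: FrozenSet[str]
    disallowed: FrozenSet[str]


def create_attributes(
    node_annotations: Dict[int, Set[str]],
    rules: Iterable[AttributeRule],
) -> Dict[int, List[str]]:
    """Map each DOM node id to the attributes whose constraints it satisfies."""
    rules = list(rules)
    result: Dict[int, List[str]] = {}
    for node_id, annotations in node_annotations.items():
        names = [
            rule.name
            for rule in rules
            # All required annotations present, no disallowed annotation present.
            if rule.required <= annotations and not (rule.disallowed & annotations)
        ]
        if names:
            result[node_id] = names
    return result
```

For example, a rule requiring a `money` annotation and disallowing a `monthly_fee` annotation would create a price attribute on a node annotated only with `money`, but not on one annotated with both.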
Segmentation Mapping: Identification
Attribute → Data area
• From bottom phenomena to data area
• Little knowledge rules the web
• Only one domain concept (mandatory attribute)
– Price
– Location
– Title
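
One simple way to picture going from mandatory attributes to a data area is to take the lowest common ancestor of the DOM nodes that carry the mandatory attribute. The Python sketch below does only that, for a single data area; it is a simplification assumed for illustration, not the identification rules AMBER actually uses, and the `parent` map is a hypothetical stand-in for the DOM tree.

```python
from typing import Dict, Iterable, List, Optional


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from a node up to the root, the node itself first."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def data_area(
    mandatory_nodes: Iterable[str],
    parent: Dict[str, Optional[str]],
) -> Optional[str]:
    """Lowest common ancestor of all nodes carrying the mandatory attribute."""
    common: Optional[List[str]] = None
    for node in mandatory_nodes:
        path = ancestors(node, parent)
        if common is None:
            common = path
        else:
            keep = set(path)
            # Preserve bottom-up order so the first element stays the lowest.
            common = [a for a in common if a in keep]
    return common[0] if common else None
```

With two price nodes sitting in different list items of the same list, this returns the list element, which is then a candidate data area. Note that with a single mandatory node the function returns that node itself.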
Segmentation Mapping: Identification
• Multi data area identification
Segmentation Mapping: Understanding
• Data area → Record
• Domain independent
• Identify leading nodes
• Two problems
– Superfluous nodes
– Correct shift
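
The two problems above (superfluous nodes and the correct shift) can be illustrated with a small Python sketch. It assumes a data area is given as a flat list of children, each flagged if it contains the mandatory attribute (a leading node), and it picks the shift that yields the most windows with the same tag sequence and exactly one leading node. This regularity criterion is an assumption chosen for the example, not necessarily the one AMBER applies.

```python
from collections import Counter
from typing import List, Tuple

Child = Tuple[str, bool]  # (tag name, contains the mandatory attribute?)


def segment(children: List[Child]) -> List[List[Child]]:
    """Split a data area's children into records, trying every shift."""
    leading = [i for i, (_, has) in enumerate(children) if has]
    if len(leading) < 2:
        # Zero or one leading node: nothing to align against.
        return [children] if leading else []
    # Assumed record size: smallest distance between consecutive leading nodes.
    size = min(b - a for a, b in zip(leading, leading[1:]))
    best: List[List[Child]] = []
    for shift in range(size):
        starts = range(shift, len(children) - size + 1, size)
        windows = [children[i:i + size] for i in starts]
        # Keep windows holding exactly one leading node.
        valid = [w for w in windows if sum(has for _, has in w) == 1]
        if not valid:
            continue
        signatures = Counter(tuple(tag for tag, _ in w) for w in valid)
        signature, _ = signatures.most_common(1)[0]
        regular = [w for w in valid if tuple(tag for tag, _ in w) == signature]
        # Nodes outside the regular windows are treated as superfluous.
        if len(regular) > len(best):
            best = regular
    return best
```

For a data area whose children are a heading, three image/paragraph pairs with the price in each paragraph, and a footer, the shift of one child gives three identical records and drops the heading and footer as superfluous.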
Experiments
[Chart: precision, recall, and F-measure, on an axis from 95.0% to 100.0%, for data areas, records, attributes, price, and location]
Summary
• AMBER - Adaptable Model-based Extraction of Result Pages
– Domain knowledge plus simple heuristics
– Using DLV: compact & easy implementation
– Understanding phase: only one domain clue, hence quickly adaptable to new domains
– Very high precision (99.4%) and recall (99.0%)
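
For reference, the F-measure that combines precision and recall is their weighted harmonic mean. The short Python sketch below computes it for the precision and recall quoted above; the resulting value is derived here and is not a figure reported on the slides.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted in the summary.
f1 = f_measure(0.994, 0.990)
print(f"F1 = {f1:.4f}")
```

With equal weighting this gives an F1 of roughly 99.2%, sitting between the two quoted values as a harmonic mean must.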
Current Work
• Testing AMBER on another domain
• Integrate visual information in understanding phase
• Use probabilistic logic programming to improve the whole system
Thanks!