DEByE─Data Extraction By Example

DEByE-Data Extraction By Example. Alberto H. F. Laender, Berthier Ribeiro-Neto and Altigran S. Da Silva. Data & Knowledge Engineering, 2002

Alberto H. F. Laender, Berthier Ribeiro-Neto and Altigran S. Da Silva. 2008/12/30 regular meeting

Transcript of DEByE─Data Extraction By Example

Page 1: DEByE─Data Extraction By Example

DEByE-Data Extraction By Example

Alberto H. F. Laender, Berthier Ribeiro-Neto and Altigran S. Da Silva

Data & Knowledge Engineering, 2002

Page 2: DEByE─Data Extraction By Example

Abstract

Data Extraction By Example (DEByE): an approach to extracting data from Web sources, based on a small set of examples specified by the user.

The examples provided by the user are then used to generate patterns which allow extracting data from new documents.

Page 3: DEByE─Data Extraction By Example

Outline

Introduction
The DEByE approach
Data Extraction
The DEByE tool
Experimental results
Conclusion

Page 4: DEByE─Data Extraction By Example

Introduction(1/3)

The spreading of modern digital libraries and the popularization of the web have made a huge volume of textual information available to a very large audience.

Finding the desired information in the large text databases present on the Web is not a trivial task. One main reason: users might be interested in semi-structured data that is not recognized by traditional Web interfaces and search engines.

Problem: how to gain access to the semi-structured data present in Web pages.

Page 5: DEByE─Data Extraction By Example

Introduction(2/3)

The structure is said to be implicit because it has not been declared explicitly as done when we specify the schema of a database.

The structure might vary from one object to another, so the actual data is said to be semi-structured.

The problem of extracting data from Web sources can be stated as follows: given a Web page S containing a set of implicit objects, determine a mapping W that populates a data repository R with the objects in S.
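The problem statement above can be read as a function signature. The sketch below is only an illustration of that reading: the names `Page`, `Wrapper`, `populate` and `toy_wrapper` are hypothetical and do not come from the paper, and the toy wrapper is a stand-in, not a DEByE wrapper.

```python
from typing import Callable

# Hypothetical types, for illustration only.
Page = str                        # S: the raw text of a Web page
Wrapper = Callable[[str], list]   # W: maps a page to a list of objects


def populate(repository: list, wrapper: Wrapper, page: Page) -> int:
    """Apply the mapping W to page S and store the resulting objects in R."""
    objects = wrapper(page)
    repository.extend(objects)
    return len(objects)


def toy_wrapper(page: Page) -> list:
    # Trivial stand-in for W: one object per non-empty line of the page.
    return [{"text": line.strip()} for line in page.splitlines() if line.strip()]


repository = []
added = populate(repository, toy_wrapper, "first item\n\nsecond item\n")
print(added, repository)
```

Everything that makes the problem hard sits inside the wrapper: deciding what counts as an object and where each one starts and ends.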

An important point concerning the generation of such a mapping is defining what an object is.

The authors present a new approach for generating wrappers that implement the mapping W, which is called DEByE.

Motivation: to let the user or database designer specify a target structure for the data to be extracted.

Page 6: DEByE─Data Extraction By Example

Introduction(3/3)

Advantages:
1. Shields the user from many of the details.
2. The user can map the data into a structure of his preference.
3. The steps of the extraction procedure are simpler and more intuitive.

The DEByE tool represents the structure of the data through nested tables.

The examples provided by the user are used to generate extraction patterns, which allow extracting new data from new Web pages.

Page 7: DEByE─Data Extraction By Example

The DEByE approach(1/4)

Text or pages which present such a type of inherent structure and whose data is restricted to a specific domain are said to be data rich and narrow in ontological breadth.

In the DEByE approach, the structure of the objects to be extracted is described indirectly, through the specification of an example object.

This is accomplished by cutting pieces of data from a sample page and inserting these pieces into the nested table.
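As a rough illustration of an example object assembled in a nested table, the sketch below uses a Python dictionary in which one column holds an inner table. The attribute names and values are made up for the example, and `table_schema` is a hypothetical helper, not part of the DEByE tool.

```python
# Hypothetical example object: the values stand for pieces of text the user
# would cut from a sample page and paste into the nested table.
example_object = {
    "Author": "J. Smith",
    "Book": [  # inner table: one row per book
        {"Title": "Databases", "Price": "10.95"},
        {"Title": "Web Data", "Price": "25.00"},
    ],
}


def table_schema(obj):
    """Derive the nested-table structure implied by an example object."""
    schema = {}
    for name, value in obj.items():
        if isinstance(value, list):
            # Inner table: collect the columns seen in any of its rows.
            inner = {}
            for row in value:
                for column, kind in table_schema(row).items():
                    inner.setdefault(column, kind)
            schema[name] = inner
        else:
            schema[name] = "atom"
    return schema


schema = table_schema(example_object)
print(schema)
```

The point of the sketch is that the user never writes the schema: it falls out of the shape of the example.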

Page 8: DEByE─Data Extraction By Example

The DEByE approach(2/4)

The nested tables are not as powerful as XML or OEM.

A critical issue: how to automatically extract new data from new Web pages to populate a given nested table.

Page 9: DEByE─Data Extraction By Example

The DEByE approach(3/4)

Two modules: Graphical User Interface (GUI) and Extractor.
1. The GUI provides the user with a Java interface that he uses to assemble the example objects.
2. The Extractor takes the patterns generated from these examples and applies them to new pages from the target Web source. (The set of Extracted Objects is coded in an XML-based format.)
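The slides do not show the XML-based format itself, so the following is only a guess at what coding extracted objects in XML could look like. The element names (`objects`, `Book`, `Title`, `Price`) and the helper `to_xml` are assumptions; the sketch uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Turn a nested dict/list/str object into an XML element (assumed layout)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for name, child in value.items():
            # A list holds repeated components; emit one element per item.
            children = child if isinstance(child, list) else [child]
            for item in children:
                elem.append(to_xml(name, item))
    else:
        elem.text = str(value)
    return elem


# Hypothetical extracted object; not real output of the Extractor.
extracted = [{"Title": "Databases", "Price": "10.95"}]

root = ET.Element("objects")
for obj in extracted:
    root.append(to_xml("Book", obj))

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```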

Page 10: DEByE─Data Extraction By Example

The DEByE approach(4/4)

Page 11: DEByE─Data Extraction By Example

Data Extraction─ Notation and terminology

In DEByE, the examples of objects are used to generate Object Extraction Patterns (oe-patterns). These oe-patterns combine structural and textual information that is used to recognize and extract new objects.

The notion of an object type is used to represent complex objects. An object of a specific object type is called an instance of that type.

Page 12: DEByE─Data Extraction By Example

Data Extraction─ Notation and terminology

Instances of a variant type (v-type) are objects of any type from a list of types called the alternatives of the variant type.
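One way to make the terminology concrete is a small type model with an instance check. This is a sketch under assumptions: the slides only mention object types, instances and v-types, so the split into `AtomType` and `TupleType`, and all class and function names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomType:
    name: str


@dataclass(frozen=True)
class TupleType:
    name: str
    components: tuple  # component types, each with a unique name


@dataclass(frozen=True)
class VariantType:
    name: str
    alternatives: tuple  # the types an instance is allowed to take


def is_instance(value, typ):
    """Return True if `value` is an instance of the object type `typ`."""
    if isinstance(typ, AtomType):
        return isinstance(value, str)
    if isinstance(typ, TupleType):
        names = {c.name for c in typ.components}
        return (isinstance(value, dict) and set(value) == names
                and all(is_instance(value[c.name], c) for c in typ.components))
    if isinstance(typ, VariantType):
        # A v-type instance is an instance of any one of its alternatives.
        return any(is_instance(value, alt) for alt in typ.alternatives)
    raise TypeError(f"unknown object type: {typ!r}")


title, price = AtomType("Title"), AtomType("Price")
book = TupleType("Book", (title, price))
article = TupleType("Article", (title,))
publication = VariantType("Publication", (book, article))

print(is_instance({"Title": "Databases", "Price": "10.95"}, publication))
print(is_instance({"Title": "A survey"}, publication))
print(is_instance({"Author": "J. Smith"}, publication))
```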

Page 13: DEByE─Data Extraction By Example

Data Extraction─ Object Extraction Patterns

The oe-patterns describe the hierarchical structure of the example objects.

The nodes at the bottom of the hierarchy are used to match attribute-value pairs (AVPs) and are called Attribute-Value Pair patterns (avp-patterns).

To each such AVP, we associate a local syntactic context that can be derived from the strings surrounding the AVP value in the text.

This relies on the concept of a passage (or window) and on techniques from information retrieval.
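A minimal sketch of the idea of a local syntactic context follows, assuming the context is simply the few characters immediately before and after the example value. The function `make_avp_pattern`, its `window` parameter and the sample HTML are hypothetical; the paper's actual pattern generation may differ.

```python
import re


def make_avp_pattern(text, value, value_regex, window=3):
    """Build a regex from the `window` characters surrounding `value` in `text`."""
    start = text.index(value)          # first occurrence of the example value
    end = start + len(value)
    prefix = text[max(0, start - window):start]
    suffix = text[end:end + window]
    # The surrounding strings are matched literally; the value is generalized.
    return re.compile(re.escape(prefix) + "(" + value_regex + ")" + re.escape(suffix))


# Made-up sample page and example value.
sample = "<b>Price:</b> $10.95<br>"
pattern = make_avp_pattern(sample, "10.95", r"\d+\.\d+", window=3)

new_page = "<i>Price:</i> $7.50<br><i>Price:</i> $12.00<br>"
prices = pattern.findall(new_page)
print(prices)
```

With `window=3` the pattern keeps only `> $` before the value and `<br` after it, so it still matches on a page whose markup differs further away from the value.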

Page 14: DEByE─Data Extraction By Example

Data Extraction ─Object Extraction Patterns

The key problem is that this pattern includes too much information about the local context in which the value 10.95 appeared.
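The pattern this slide refers to is not reproduced in the transcript. As a stand-in, the two hypothetical regular expressions below show the general effect: a pattern that keeps too much of the local context matches only the example it was built from, while a pattern with a narrower context also matches other values.

```python
import re

# Both patterns and the page are made up for illustration.
too_specific = re.compile(r"<b>Databases</b> by J\. Smith, \$(\d+\.\d+)<br>")
relaxed = re.compile(r"\$(\d+\.\d+)<br>")

page = ("<b>Databases</b> by J. Smith, $10.95<br>"
        "<b>Web Data</b> by A. Jones, $25.00<br>")

specific_hits = too_specific.findall(page)
relaxed_hits = relaxed.findall(page)
print(specific_hits)
print(relaxed_hits)
```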

Page 15: DEByE─Data Extraction By Example

Data Extraction ─Object Extraction Patterns

Oe-patterns are essentially trees containing information on the structure of the objects and on their associated AVPs.

The sub-trees of an oe-pattern are themselves oe-patterns, modeling the structure of component objects.
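The tree shape described above can be sketched with a small recursive data structure. The class `OEPattern`, its fields and the regular expressions at the leaves are assumptions made for illustration; they are not the representation used by the DEByE tool.

```python
from dataclasses import dataclass, field


@dataclass
class OEPattern:
    name: str
    avp_regex: str = ""                            # only set on leaves (avp-patterns)
    children: list = field(default_factory=list)   # sub-trees are OEPatterns too

    def is_leaf(self):
        return not self.children

    def leaves(self):
        """Return the avp-patterns at the bottom of the hierarchy, left to right."""
        if self.is_leaf():
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# Hypothetical oe-pattern for an author with a nested book component.
book = OEPattern("Book", children=[
    OEPattern("Title", r"<b>(.+?)</b>"),
    OEPattern("Price", r"\$(\d+\.\d+)"),
])
author = OEPattern("Author", children=[OEPattern("Name", r"<h1>(.+?)</h1>"), book])

leaf_names = [p.name for p in author.leaves()]
print(leaf_names)
```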

Page 16: DEByE─Data Extraction By Example

Data Extraction─ Extraction strategies

Top-down extraction strategy:

The algorithm assembles an oe-pattern and uses this pattern to identify new objects in new pages.

This procedure is repeated if more than one example object is provided.

This top-down extraction strategy works well with pages that are well structured.
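To illustrate why a top-down strategy depends on well-structured pages, the sketch below approximates an oe-pattern by a single regular expression that matches a whole object at once. This is a simplification assumed for the example, not the algorithm from the paper; `top_down_pattern` and the sample pages are hypothetical.

```python
import re


def top_down_pattern(components, gap=r".*?"):
    """Join per-attribute patterns into one regex matching a complete object."""
    parts = [re.escape(prefix) + f"(?P<{name}>{value})" + re.escape(suffix)
             for name, prefix, value, suffix in components]
    return re.compile(gap.join(parts), re.DOTALL)


book_pattern = top_down_pattern([
    ("Title", "<b>", r"[^<]+", "</b>"),
    ("Price", "$", r"\d+\.\d+", "<br>"),
])

# Well-structured page: every book has a title followed by a price.
regular_page = "<b>Databases</b> $10.95<br><b>Web Data</b> $25.00<br>"
regular = [m.groupdict() for m in book_pattern.finditer(regular_page)]
print(regular)

# Irregular page: the first book has no price.
irregular_page = "<b>Databases</b><br><b>Web Data</b> $25.00<br>"
irregular = [m.groupdict() for m in book_pattern.finditer(irregular_page)]
print(irregular)
```

On the irregular page the whole-object pattern pairs the first title with the second book's price and loses an object, which is the kind of failure a component-first strategy is meant to avoid.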

Page 17: DEByE─Data Extraction By Example

Data Extraction ─Extraction strategies

Bottom-up extraction strategy

The main feature of the bottom-up extraction strategy is that it recognizes and extracts atomic components prior to the recognition of the object itself.

The extracted AVPs are then used to assemble the object through a bottom-up composition operation.

Page 18: DEByE─Data Extraction By Example

Data Extraction─ Extraction strategies

The bottom-up extraction strategy is considerably more complex than the top-down strategy.

Contiguous pairs of Title and Price values are combined to form Book instances. Each of these instances is labeled with the smaller position value (also called lowest component).
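The pairing step described above can be sketched as follows, assuming each extracted AVP is a `(position, attribute, value)` triple. The function `assemble_books` and the sample positions are hypothetical; the paper's composition operation is more general than this pair-only version.

```python
def assemble_books(avps):
    """Combine contiguous Title/Price pairs into Book instances."""
    ordered = sorted(avps)             # order AVPs by their position in the page
    books = []
    i = 0
    while i < len(ordered) - 1:
        (pos1, attr1, val1), (pos2, attr2, val2) = ordered[i], ordered[i + 1]
        if {attr1, attr2} == {"Title", "Price"}:
            # Label the instance with the smaller position (lowest component).
            books.append({"position": min(pos1, pos2), attr1: val1, attr2: val2})
            i += 2
        else:
            i += 1
    return books


# Made-up AVPs, as they might come out of the avp-pattern matching phase.
avps = [
    (3, "Title", "Databases"), (17, "Price", "10.95"),
    (40, "Title", "Web Data"), (55, "Price", "25.00"),
]
books = assemble_books(avps)
print(books)
```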

Page 19: DEByE─Data Extraction By Example

Data Extraction─ Extraction strategies

The assembling phase is based on two fundamental assumptions:

1. AVPs can be correctly identified and extracted from a text (page).
2. The presence of any component of an instance indicates the existence of such an instance.

In DEByE, many of the problems caused by imperfect avp-patterns can be alleviated by the features of the interface.

Advantages: ease of use, quick prototyping and coverage of a variety of data sources with variations in structure.

The bottom-up strategy is more flexible than the top-down strategy because it assembles complex objects through a composition of simpler object components.

It is suitable for cases where missing components or components out of order are expected.
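The sketch below shows one simple heuristic that tolerates missing and out-of-order components: AVPs are read in page order and a new instance is started whenever an attribute repeats. This heuristic is an assumption made for illustration and is not claimed to be DEByE's assembly procedure; `assemble` and the sample data are hypothetical.

```python
def assemble(avps, attributes):
    """Group AVPs into instances; a repeated attribute starts a new instance."""
    instances, current = [], {}
    for position, attr, value in sorted(avps):
        if attr not in attributes:
            continue
        if attr in current:
            instances.append(current)
            current = {}
        # The first component seen gives the instance its position label.
        current.setdefault("position", position)
        current[attr] = value
    if current:
        instances.append(current)
    return instances


# The second book has no Price component.
missing = assemble([
    (3, "Title", "Databases"), (17, "Price", "10.95"),
    (40, "Title", "Web Data"),
    (60, "Title", "XML"), (75, "Price", "7.50"),
], {"Title", "Price"})
print(missing)

# Price appears before Title in every record.
reordered = assemble([
    (3, "Price", "10.95"), (17, "Title", "Databases"),
    (40, "Price", "25.00"), (55, "Title", "Web Data"),
], {"Title", "Price"})
print(reordered)
```

Consistent with the second assumption of the assembling phase, a lone Title is enough to produce a Book instance here. The heuristic is ambiguous when a missing component is followed by an out-of-order one, which a real implementation would need to resolve.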

Page 20: DEByE─Data Extraction By Example

The DEByE tool

Page 21: DEByE─Data Extraction By Example

Column operation

Page 22: DEByE─Data Extraction By Example

Experimental results(1/3)

Table 1: the percentage figures for the number of objects retrieved are relative to the total number of objects identified manually in the source pages.

The number of examples used in the extraction is determined by trial and error.
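The percentage reported in Table 1 is a simple ratio, shown here as a small helper. The function name and the counts in the usage line are made up for illustration; they are not results from the paper.

```python
def retrieved_percentage(retrieved, identified_manually):
    """Percentage of manually identified objects that the extractor retrieved."""
    if identified_manually <= 0:
        raise ValueError("the manual count must be positive")
    return 100.0 * retrieved / identified_manually


# Illustrative counts only.
pct = retrieved_percentage(47, 50)
print(pct)
```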

Page 23: DEByE─Data Extraction By Example

Experimental results(2/3)

RISE (Repository of Online Information Sources Used in Information Extraction Tasks)

Purpose: provide a preliminary comparison of DEByE with three other systems.

Page 24: DEByE─Data Extraction By Example

Experimental results(3/3)

The results do not allow a full and direct comparison between DEByE and the three other systems.

In DEByE, there is no knowledge base or set of heuristics to guide the extraction procedure.

These preliminary results with the RISE repository suggest that DEByE is as effective as the known alternatives based on wrapper induction.

Page 25: DEByE─Data Extraction By Example

Comparison and related work

Data model
1. OEM
2. UnQL query language

Wrapper induction (Machine Learning)
1. Does not deal with missing or out-of-order components of the objects.
2. The solution provided is computationally intractable and has not been implemented in the WIEN system.

NLP techniques
1. Rapier
2. SRV

Ontology-based extraction approaches
Rely mainly on the expected contents of the pages, according to what was anticipated by the pre-specified ontology.

NoDoSE: the way examples are provided by the user to generate the extraction patterns.

XWRAP: the explicit use of the HTML syntax and structure.

Page 26: DEByE─Data Extraction By Example

Conclusions

An example-based approach to automatically extracting semi-structured data.

1. User specifies examples according to a structure of his liking.

2. The bottom-up procedure can take advantage of the user-provided examples to recognize and extract new data with great efficacy.

Convenience in the specification of the examples can be achieved through a user interface that adopts nested tables as its fundamental paradigm.

The bottom-up extraction procedure is quite effective. Few examples are enough to allow the recognition and extraction of almost the totality of the objects present in the Web sources we considered.