Interactive Wrapper Generation with Minimal User Effort

Utku Irmak and Torsten Suel

CIS Department

Polytechnic University

Brooklyn, NY 11201

[email protected] and [email protected]

Introduction

Information on WWW is usually unstructured in nature, and presented via HTML

Not appropriate for (certain types of) automatic processing

Significant amount of embedded structured data

Stock data, product/price data, various statistics, …

Expressed through layout, HTML structure

Wrapper: a software tool and set of rules for extracting such structured data from web pages

Challenge: different sites, variations within sites

An Example: Meta Search Engine


Rank | Title                                           | URL                              | Snippet
1    | Parallel and Distributed Databases              | www.csse.monash...               | ... Introduction …
2    | distributed and parallel databases              | springerlink.com/app...          |
3    | Shared Cache – The Future of Parallel Databases | csdl2.computer.org…              | … Shared Cache – The future …
4    | Distributed and Parallel Databases              | www.informatik.uni-trier.edu/... | … Distributed and Parallel…

Introduction

Extracting the relevant data embedded in web pages and storing it in a relational structure for further processing

Specialized software programs called wrappers

Manual wrappers: e.g., Perl scripts …

Due to the shortcomings of manually developing wrappers, many tools have been proposed for generating wrappers

Semi-automatic (interactive and non-interactive)

Fully-automatic

Our Goal in this Work

Design a complete interactive system for generating wrappers

Developed for industrial application

Overcome common obstacles such as missing (multiple) attributes and visual variations

Minimize user effort

Create robust and reliable wrappers on future pages

Related Work

Semi-automatic approaches: WIEN, SoftMealy, STALKER

Active learning techniques are employed by Muslea et al.

Semi-automatic interactive approaches: W4F, XWrap, Lixto

Fully-automatic approaches: IEPAD, RoadRunner, work by Zhai et al.

Our Contributions

We describe a new system for semi-automatic wrapper generation based on an interactive interface, a powerful extraction language, and ranking of likely candidate sets

To implement the interface, we describe a framework based on active learning

We propose the use of a category utility function for ranking the tuple sets

We perform a detailed experimental evaluation

Framework

[Diagram: the User, a Training Webpage, a Verification Set, and the Wrapper Generation System]

Input: a training webpage and a number of verification pages

(1) User highlights a tuple on training webpage

(2) Selected tuple submitted to our system, which generates several wrappers

(3a) System presents user with a candidate tuple set

(3b) System presents user with another candidate tuple set

(3c) System presents user with another candidate tuple set

(4) User selects one of the proposed candidate tuple sets

(5) System refines wrapper and tests it on verification set

(6) System finds one page where the wrapper “disagrees”

(7a) System presents user with a candidate tuple set on this page in verification set

(7b) System presents user with another candidate tuple set on page in verification set

(8) User selects one of the proposed candidate tuple sets

(9) System outputs final wrapper

Definition: Wrapper

A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)

The extraction rules within a wrapper may disagree on not yet encountered web pages

In this case, a wrapper can be refined by removing some of the extraction rules
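The definition above can be made concrete with a minimal sketch. The representation here is an assumption made for illustration only: a page is a list of text rows, a rule is any function from a page to a set of tuples, and the names `rules_agree`, `refine`, `all_rows`, and `rows_with_digit` are hypothetical, not part of the system.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical types for illustration: a page is a list of text rows,
# a tuple is a tuple of strings, and a rule maps a page to a set of tuples.
Page = Sequence[str]
TupleSet = FrozenSet[Tuple[str, ...]]
Rule = Callable[[Page], TupleSet]


def rules_agree(rules: List[Rule], page: Page) -> bool:
    """True if every rule extracts exactly the same tuple set on the page."""
    return len({rule(page) for rule in rules}) <= 1


def refine(rules: List[Rule], page: Page, correct: TupleSet) -> List[Rule]:
    """Keep only the rules whose output on the page equals the confirmed set."""
    return [rule for rule in rules if rule(page) == correct]


# Two toy rules: one takes every row, the other only rows containing a digit.
def all_rows(page: Page) -> TupleSet:
    return frozenset((row,) for row in page)


def rows_with_digit(page: Page) -> TupleSet:
    return frozenset((row,) for row in page if any(ch.isdigit() for ch in row))


wrapper = [all_rows, rows_with_digit]
seen_page = ["item 1", "item 2"]
new_page = ["item 3", "advertisement"]

agree_seen = rules_agree(wrapper, seen_page)  # rules agree on the page seen so far
agree_new = rules_agree(wrapper, new_page)    # rules disagree on the new page
refined = refine(wrapper, new_page, frozenset({("item 3",)}))
```

Both toy rules form one wrapper as long as only `seen_page` has been considered; the new page exposes the disagreement, and refinement drops the rule that extracts the advertisement row.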

Summary of Interaction Steps:

User highlights a tuple on training page

This allows system to generate a number of wrappers that capture different candidate tuple sets

System presents candidate tuple sets on the training page to user, in order of “plausibility”

User selects the correct tuple set

System tests resulting wrapper on verification set to find any “disagreements”

For any disagreement, user selects the correct set from a ranked list of choices
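The interaction steps can be sketched as a simple loop. This is a simplified sketch under stated assumptions, not the system's implementation: pages are lists of text rows, the user is simulated by a callback, the plausibility score is a placeholder (set size), and every name below is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Page = Sequence[str]
TupleSet = FrozenSet[Tuple[str, ...]]
Rule = Callable[[Page], TupleSet]
# Hypothetical stand-in for the user: picks one set from a ranked list.
Chooser = Callable[[Page, List[TupleSet]], TupleSet]
Scorer = Callable[[TupleSet], float]


def group_by_output(rules: List[Rule], page: Page) -> Dict[TupleSet, List[Rule]]:
    """Group rules by the tuple set they extract on the page."""
    groups: Dict[TupleSet, List[Rule]] = {}
    for rule in rules:
        groups.setdefault(rule(page), []).append(rule)
    return groups


def resolve(rules: List[Rule], page: Page, choose: Chooser, score: Scorer) -> List[Rule]:
    """Ask the user only if the rules disagree on the page."""
    groups = group_by_output(rules, page)
    if len(groups) <= 1:
        return rules
    ranked = sorted(groups, key=score, reverse=True)  # most plausible first
    return groups[choose(page, ranked)]


def generate_wrapper(rules: List[Rule], training_page: Page,
                     verification_pages: List[Page],
                     choose: Chooser, score: Scorer) -> List[Rule]:
    wrapper = resolve(rules, training_page, choose, score)
    for page in verification_pages:
        wrapper = resolve(wrapper, page, choose, score)
    return wrapper


# Toy rules and a simulated user who wants every row starting with "item".
def all_rows(page: Page) -> TupleSet:
    return frozenset((row,) for row in page)


def rows_with_digit(page: Page) -> TupleSet:
    return frozenset((row,) for row in page if any(ch.isdigit() for ch in row))


def first_row_only(page: Page) -> TupleSet:
    return frozenset((row,) for row in page[:1])


questions: List[int] = []  # number of choices shown at each question


def simulated_user(page: Page, ranked: List[TupleSet]) -> TupleSet:
    questions.append(len(ranked))
    return frozenset((row,) for row in page if row.startswith("item"))


final = generate_wrapper(
    [all_rows, rows_with_digit, first_row_only],
    training_page=["item 1", "item 2"],
    verification_pages=[["item 3", "advertisement"], ["item 5"]],
    choose=simulated_user,
    score=len,  # placeholder for a real plausibility ranking
)
```

In this toy run the user answers one question on the training page and one on the first verification page; the second verification page raises no disagreement, so no further input is needed.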

A Real Example: half.ebay.com

Extract tuples with attributes: Price, Total Price, Shipping, Seller

Only extract those tuples that are listed in “Like New Items” and whose sellers are awarded a Red Star


Training page:

Observations:

There can be a lot of unexpected cases and variations on real websites

A powerful language is needed to specify extraction rules

Simple extraction followed by SQL filtering conditions will often not work

The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future

User Effort:

(0) Cost of defining the table structure: number of attributes, their names, maybe types

(1) Cost of highlighting one (or maybe two) tuples on training pages

(2) Cost of one or more selections from a ranked list of candidate tuple sets

To Implement We Need:

(0) User interface based on browser extensions

(1) Powerful extraction language

(2) Algorithms for generating extraction rules and grouping them into wrappers

(3) Techniques for ranking wrappers in terms of plausibility

(4) Heuristics for throwing away bizarro rules

System Architecture Overview

Document Representation

Extraction Language Overview

Based on DOM-tree with auxiliary properties

Extraction patterns consist of a sequence of expressions on the path from root to a tuple attribute

Each expression consists of conjunctions and disjunctions of predicates

If a node at depth i satisfies its expression: Accept; otherwise: Reject

Only children of accepted nodes are checked further for the expression defined at depth i+1
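The depth-by-depth evaluation can be sketched as follows. This is an illustration under assumptions: the `Node` class is a stand-in for a DOM node, the predicate names `tag_name` and `tag_attr` only echo the predicate names listed in this deck (their exact semantics are assumed), and `all_of` / `any_of` model conjunction and disjunction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


Expression = Callable[[Node], bool]


def tag_name(name: str) -> Expression:
    return lambda node: node.tag == name


def tag_attr(key: str, value: str) -> Expression:
    return lambda node: node.attrs.get(key) == value


def all_of(*preds: Expression) -> Expression:
    """Conjunction of predicates."""
    return lambda node: all(p(node) for p in preds)


def any_of(*preds: Expression) -> Expression:
    """Disjunction of predicates."""
    return lambda node: any(p(node) for p in preds)


def extract(root: Node, pattern: List[Expression]) -> List[Node]:
    """pattern[i] is the expression for depth i; the root is at depth 0."""
    frontier = [root]
    for depth, expr in enumerate(pattern):
        accepted = [node for node in frontier if expr(node)]
        if depth == len(pattern) - 1:
            return accepted
        # Only children of accepted nodes are checked at the next depth.
        frontier = [child for node in accepted for child in node.children]
    return []


root = Node("table", children=[
    Node("tr", {"class": "header"}, children=[Node("td", text="Price")]),
    Node("tr", {"class": "row"}, children=[Node("td", text="$5.00"),
                                           Node("td", text="alice")]),
    Node("tr", {"class": "row"}, children=[Node("td", text="$7.50"),
                                           Node("th", text="bob")]),
])

pattern = [
    tag_name("table"),
    all_of(tag_name("tr"), tag_attr("class", "row")),
    any_of(tag_name("td"), tag_name("th")),
]
texts = [node.text for node in extract(root, pattern)]
```

The header row is rejected at depth 1, so its cell is never examined at depth 2, which is the pruning behavior the slide describes.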

Predicates in the Extraction Language

Element Nodes: tagName, tagAttr, tagAttrArray, elementSiblingPosition, tagPstn, …

Text Nodes: textNode, textSiblingPosition, syntax, leftTextNode, leftElementNode, …

The Wrapper Structure

Wrapper Generation Algorithm

Creating dom_path and LCA objects

Creating patterns that extract tuple attributes

Creating initial wrappers

Generating the tuple validation rules and new wrappers

Combining the wrappers

Ranking the tuple sets

Getting confirmation from the user

Testing the wrapper on the verification set
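As a small illustration of the first step, the sketch below assumes that a dom_path is the list of child indices from the root to a highlighted attribute node, and that LCA stands for the lowest common ancestor of the attributes of one tuple. Both readings are assumptions; the deck does not define these objects, and the function name is hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(paths: Sequence[Sequence[int]]) -> List[int]:
    """Return the longest common prefix of root-to-node child-index paths."""
    if not paths:
        return []
    prefix: List[int] = []
    for steps in zip(*paths):  # walk all paths one depth at a time
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return prefix


# Hypothetical paths of two highlighted attributes of the same tuple.
price_path = [0, 2, 1, 0]
seller_path = [0, 2, 1, 3, 0]
lca = lowest_common_ancestor([price_path, seller_path])
```

Under these assumptions the common prefix identifies the subtree that contains the whole tuple, which is a natural place to anchor the per-attribute patterns.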

Ranking the Tuple Sets

We adopt the concept of category utility:

Maximize inter-cluster dissimilarity

Maximize intra-cluster similarity

Dom-Path, specific value, missing attributes, indexing, content specification

1) The weight of attribute A

2) The probability that an item has value v for attribute A, given it belongs to cluster C

3) The probability that an item belongs to cluster C, given it has value v for attribute A
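The formula that these three annotations label did not survive in this transcript. A plausible reconstruction, assuming the standard category utility form used in Fisher's COBWEB with the value-probability factor replaced by the attribute weight from item 1, is shown below; the symbol names CU and w_A and the summation indices are assumptions.

```latex
\[
\mathrm{CU} \;=\; \sum_{C} \sum_{A} \sum_{v} \; w_A \; P(A = v \mid C) \; P(C \mid A = v)
\]
```

Here the outer sum ranges over the clusters C (tuples and non-tuples), the middle sum over the attributes A, and the inner sum over the values v of A.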


Ranking: Discussion

Note: we are ranking tuple sets and wrappers

A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples
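The intuition above can be checked with a small sketch that scores a split into tuples and non-tuples. It assumes a weighted sum over clusters, attributes, and values of P(value | cluster) times P(cluster | value); the feature names `path` and `syntax`, the unit weights, and the function name are hypothetical, and the system's actual features and weights may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence

Item = Dict[str, str]


def category_utility(clusters: Sequence[List[Item]],
                     weights: Dict[str, float]) -> float:
    """Sum of weight * P(value | cluster) * P(cluster | value)."""
    total = 0.0
    for attr, weight in weights.items():
        # How often each value occurs over all clusters together.
        overall = Counter(item.get(attr) for cluster in clusters for item in cluster)
        for cluster in clusters:
            if not cluster:
                continue
            local = Counter(item.get(attr) for item in cluster)
            for value, count in local.items():
                p_value_given_cluster = count / len(cluster)
                p_cluster_given_value = count / overall[value]
                total += weight * p_value_given_cluster * p_cluster_given_value
    return total


# Four page items described by two hypothetical features.
a = {"path": "tr/td", "syntax": "price"}
b = {"path": "tr/td", "syntax": "price"}
c = {"path": "tr/td", "syntax": "price"}
d = {"path": "div/span", "syntax": "text"}
weights = {"path": 1.0, "syntax": 1.0}

# Candidate tuple sets, each given as (tuples, non-tuples).
candidates = {
    "three rows": ([a, b, c], [d]),
    "two rows": ([a, b], [c, d]),
}
scores = {name: category_utility(split, weights) for name, split in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The candidate that groups all three similar rows as tuples scores higher than the one that leaves a similar row among the non-tuples, so it would be presented to the user first.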

One could also try to rank extraction patterns, say using MDL

Experimental Evaluations

Number of training tuples required by our system and previous works

Results on four previously used data sets from RISE: Okra, BigBook, Internet Address Finder, Quote Server

Experimental Evaluations

We chose ten well-known web sites and collected fifty web pages from each:

AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)

Experimental Evaluation

Updating Term Weights (effect of adaptive approach):

The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites

Summary

An approach to interactive wrapper generation that combines

Powerful extraction language

Techniques for deriving extraction patterns from user input

A framework using active learning

A ranking technique using a category utility function
