Interactive Wrapper Generation with Minimal User Effort Utku Irmak and Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 [email protected] and [email protected]


Page 1: Interactive Wrapper Generation with Minimal User Effort Utku Irmak and Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 uirmak@cis.poly.eduuirmak@cis.poly.edu.

Interactive Wrapper Generation with Minimal User Effort

Utku Irmak and Torsten Suel

CIS Department

Polytechnic University

Brooklyn, NY 11201

[email protected] and [email protected]

Page 2:

Introduction

Information on the WWW is usually unstructured in nature and presented via HTML, which is not appropriate for (certain types of) automatic processing

Web pages contain a significant amount of embedded structured data (stock data, product/price data, various statistics, …), expressed through layout and HTML structure

Wrapper: a software tool and set of rules for extracting such structured data from web pages

Challenge: different sites, variations within sites

Page 3:

An Example: Meta Search Engine

Page 4:

An Example: Meta Search Engine

Rank | Title | URL | Snippet
1 | Parallel and Distributed Databases | www.csse.monash... | ... Introduction …
2 | distributed and parallel databases | springerlink.com/app... |
3 | Shared Cache – The Future of Parallel Databases | csdl2.computer.org… | … Shared Cache – The future …
4 | Distributed and Parallel Databases | www.informatik.uni-trier.edu/... | … Distributed and Parallel…

Page 5:

Introduction

Goal: extract the relevant data embedded in web pages and store it in a relational structure for further processing

This is done by specialized software programs called wrappers

Manual wrappers: e.g., Perl scripts …

Due to the shortcomings of developing wrappers manually, many tools have been proposed for generating wrappers: semi-automatic (interactive and non-interactive) and fully automatic

Page 6:

An Example: Meta Search Engine

Page 7:

Our Goal in this Work

Design a complete interactive system for generating wrappers, developed for industrial application

Overcome common obstacles such as missing (multiple) attributes and visual variations

Minimize user effort

Create wrappers that are robust and reliable on future pages

Page 8:

Related Work

Semi-automatic approaches: WIEN, SoftMealy, STALKER; active learning techniques are employed by Muslea et al.

Semi-automatic interactive approaches: W4F, XWrap, Lixto

Fully-automatic approaches: IEPAD, RoadRunner, work by Zhai et al.

Page 9:

Our Contributions

We describe a new system for semi-automatic wrapper generation based on an interactive interface, a powerful extraction language, and ranking of likely candidate sets

To implement the interface, we describe a framework based on active learning

We propose the use of a category utility function for ranking the tuple sets

We perform a detailed experimental evaluation

Page 10:

Framework

[Diagram: the User, the Training Webpage, the Verification Set, and the Wrapper Generation System]

Input: a training webpage and a number of verification pages

Page 11:

Framework

(1) User highlights a tuple on the training webpage

Page 12:

Framework

(2) Selected tuple is submitted to our system, which generates several wrappers

Page 13:

Framework

(3a) System presents user with a candidate tuple set

Page 14:

Framework

(3b) System presents user with another candidate tuple set

Page 15:

Framework

(3c) System presents user with another candidate tuple set

Page 16:

Framework

(4) User selects one of the proposed candidate tuple sets

Page 17:

Framework

(5) System refines wrapper and tests it on the verification set

Page 18:

Framework

(6) System finds one page where the wrapper “disagrees”

Page 19:

Framework

(7a) System presents user with a candidate tuple set on this page in the verification set

Page 20:

Framework

(7b) System presents user with another candidate tuple set on this page in the verification set

Page 21:

Framework

(8) User selects one of the proposed candidate tuple sets

Page 22:

Framework

(9) System outputs final wrapper

Page 23:

Definition: Wrapper

A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)

The extraction rules within a wrapper may disagree on not yet encountered web pages

In this case, a wrapper can be refined by removing some of the extraction rules
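
The definition above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: a page is modeled as plain text, an extraction rule as a function from a page to a set of tuples, and the names `agrees`, `refine`, `by_line`, and `by_comma_line` are hypothetical rather than part of the described system.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical types: a page is raw text, a row is a tuple of strings,
# and an extraction rule maps a page to the set of rows it extracts.
Row = Tuple[str, ...]
Rule = Callable[[str], FrozenSet[Row]]


def agrees(rules: List[Rule], page: str) -> bool:
    """True if every rule extracts exactly the same tuple set from the page."""
    return len({rule(page) for rule in rules}) <= 1


def refine(rules: List[Rule], page: str, correct: FrozenSet[Row]) -> List[Rule]:
    """Keep only the rules whose output on the page equals the confirmed set."""
    return [rule for rule in rules if rule(page) == correct]


# Two toy rules that agree as long as every non-empty line contains a comma.
def by_line(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.splitlines() if line)


def by_comma_line(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.splitlines() if "," in line)


wrapper = [by_line, by_comma_line]
seen_page = "a,1\nb,2"      # both rules extract the same rows here
new_page = "a,1\nheader"    # the rules disagree on this unseen page

agree_on_seen = agrees(wrapper, seen_page)
agree_on_new = agrees(wrapper, new_page)
refined = refine(wrapper, new_page, frozenset({("a,1",)}))
```

In this toy setting the wrapper is consistent on the page seen so far, disagrees on the new page, and is refined to the single rule that matches the confirmed tuple set.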

Page 24:

Summary of Interaction Steps:

User highlights a tuple on the training page; this allows the system to generate a number of wrappers that capture different candidate tuple sets

System presents candidate tuple sets on the training page to user, in order of “plausibility”

User selects the correct tuple set

System tests resulting wrapper on verification set to find any “disagreements”

For any disagreement, user selects the correct set from a ranked list of choices
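
The interaction steps above can be sketched as a loop over the training page and the verification pages. This is only a sketch under stated assumptions: rules are modeled as functions from page text to tuple sets, the user is replaced by a hypothetical `choose` callback, and candidates are ordered by the number of supporting rules as a placeholder, not by the plausibility ranking the system actually uses.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Row = Tuple[str, ...]
TupleSet = FrozenSet[Row]
Rule = Callable[[str], TupleSet]
Chooser = Callable[[str, List[TupleSet]], TupleSet]


def group_by_output(rules: Sequence[Rule], page: str) -> Dict[TupleSet, List[Rule]]:
    """Group rules by the tuple set they extract from the page."""
    groups: Dict[TupleSet, List[Rule]] = {}
    for rule in rules:
        groups.setdefault(rule(page), []).append(rule)
    return groups


def run_session(rules: Sequence[Rule], training_page: str,
                verification_pages: Sequence[str], choose: Chooser):
    """Return the surviving rules and the number of user selections needed."""
    rules = list(rules)
    selections = 0
    for page in [training_page, *verification_pages]:
        groups = group_by_output(rules, page)
        if len(groups) > 1:  # the rules disagree on this page
            # Placeholder ordering (most supporting rules first).
            candidates = sorted(groups, key=lambda s: len(groups[s]), reverse=True)
            rules = groups[choose(page, candidates)]
            selections += 1
    return rules, selections


def lines(page: str, keep: Callable[[str], bool]) -> TupleSet:
    return frozenset((line,) for line in page.splitlines() if keep(line))


def all_lines(page: str) -> TupleSet:
    return lines(page, bool)


def comma_lines(page: str) -> TupleSet:
    return lines(page, lambda line: "," in line)


def digit_lines(page: str) -> TupleSet:
    return lines(page, lambda line: line[-1:].isdigit())


def oracle(page: str, candidates: List[TupleSet]) -> TupleSet:
    """Simulated user who wants exactly the comma-separated lines."""
    target = comma_lines(page)
    return next(c for c in candidates if c == target)


final_rules, selections = run_session(
    [all_lines, comma_lines, digit_lines],
    training_page="title\na,1\nb,2",
    verification_pages=["c,3\nd,4", "e,5\nf,x\nad 7"],
    choose=oracle,
)
```

Here the simulated user makes one selection on the training page and one more on the single verification page where the remaining rules disagree.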

Page 25:

A Real Example: half.ebay.com

Extract tuples with attributes: Price, Total Price, Shipping, Seller

Only extract those tuples that are listed in “Like New Items” and whose sellers are awarded a Red Star

Page 26:

A Real Example: half.ebay.com

Page 27:

A Real Example: half.ebay.com

Training page:

Page 28:

Observations:

There can be a lot of unexpected cases and variations on real websites

A powerful language is needed to specify extraction rules

Simple extraction followed by SQL filtering conditions will often not work

The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future

Page 29:

User Effort:

(0) Cost of defining the table structure: number of attributes, their names, maybe types

(1) Cost of highlighting one (or maybe two) tuples on training pages

(2) Cost of one or more selections from a ranked list of candidate tuple sets

Page 30:

To Implement We Need:

(0) User interface based on browser extensions

(1) Powerful extraction language

(2) Algorithms for generating extraction rules and grouping them into wrappers

(3) Techniques for ranking wrappers in terms of plausibility

(4) Heuristics for throwing away bizarro rules

Page 31:

System Architecture Overview

Page 32:

Document Representation

Page 33:

Extraction Language Overview

Based on the DOM tree with auxiliary properties

Extraction patterns consist of a sequence of expressions on the path from the root to a tuple attribute

Each expression consists of conjunctions and disjunctions of predicates

If a node at depth i satisfies its expression, it is accepted; otherwise it is rejected

Only children of accepted nodes are checked further against the expression defined at depth i+1
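
The depth-by-depth evaluation described above can be sketched as follows. The `Node` class, the helpers `tag_name`, `tag_attr`, `all_of`, and `any_of`, and the example tree are all hypothetical; they only loosely mirror the idea of predicates combined by conjunction and disjunction and are not the system's actual extraction language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical, not the system's type)."""
    tag: str                                   # element name, or "#text"
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


Predicate = Callable[[Node], bool]


def tag_name(name: str) -> Predicate:
    return lambda node: node.tag == name


def tag_attr(key: str, value: str) -> Predicate:
    return lambda node: node.attrs.get(key) == value


def all_of(*preds: Predicate) -> Predicate:    # conjunction of predicates
    return lambda node: all(p(node) for p in preds)


def any_of(*preds: Predicate) -> Predicate:    # disjunction of predicates
    return lambda node: any(p(node) for p in preds)


def evaluate(root: Node, pattern: List[Predicate]) -> List[Node]:
    """pattern[i] is the expression for depth i; return nodes accepted at the last depth."""
    accepted = [root] if pattern and pattern[0](root) else []
    for expression in pattern[1:]:
        # Only children of accepted nodes are checked at the next depth.
        accepted = [c for n in accepted for c in n.children if expression(c)]
    return accepted


def cell(css_class: str, text: str) -> Node:
    return Node("td", attrs={"class": css_class}, children=[Node("#text", text)])


table = Node("table", children=[
    Node("tr", children=[cell("title", "Book A"), cell("price", "$5")]),
    Node("tr", children=[cell("title", "Book B"), cell("price", "$7")]),
    Node("div", children=[cell("price", "$99")]),   # rejected at depth 1
])

price_pattern = [
    tag_name("table"),
    tag_name("tr"),
    all_of(tag_name("td"),
           any_of(tag_attr("class", "price"), tag_attr("class", "cost"))),
    tag_name("#text"),
]
prices = [node.text for node in evaluate(table, price_pattern)]
```

Because the `div` is rejected at depth 1, its price cell is never examined, which is the pruning behavior the last bullet describes.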

Page 34:

Predicates in the Extraction Language

Element nodes: tagName, tagAttr, tagAttrArray, elementSiblingPosition, tagPstn, …

Text nodes: textNode, textSiblingPosition, syntax, leftTextNode, leftElementNode, …

Page 35:

The Wrapper Structure

Page 36:

Wrapper Generation Algorithm

Creating dom_path and LCA objects

Creating patterns that extract tuple attributes

Creating initial wrappers

Generating the tuple validation rules and new wrappers

Combining the wrappers

Ranking the tuple sets

Getting confirmation from the user

Testing the wrapper on the verification set
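
As a rough illustration of the first step, the sketch below computes a root-to-node path and a common ancestor for the attribute nodes of a highlighted tuple. It assumes that LCA stands for lowest common ancestor and that a dom_path is the list of nodes from the root down to an attribute; the slides do not spell out what these objects contain, so the `Node` class and both functions are hypothetical.

```python
from typing import List, Optional


class Node:
    """Toy DOM node with a parent pointer (hypothetical)."""

    def __init__(self, tag: str, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.parent = parent


def dom_path(node: Node) -> List[Node]:
    """Nodes on the path from the root down to the given node."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    return path[::-1]


def lowest_common_ancestor(nodes: List[Node]) -> Optional[Node]:
    """Deepest node that lies on the root path of every given node."""
    paths = [dom_path(n) for n in nodes]
    lca = None
    for level in zip(*paths):           # walk all paths from the root in lockstep
        if all(n is level[0] for n in level):
            lca = level[0]
        else:
            break
    return lca


html = Node("html")
body = Node("body", html)
table = Node("table", body)
row = Node("tr", table)
title_cell = Node("td", row)
price_cell = Node("td", row)
other_row = Node("tr", table)
other_cell = Node("td", other_row)

path_tags = [n.tag for n in dom_path(price_cell)]
tuple_lca = lowest_common_ancestor([title_cell, price_cell])
cross_row_lca = lowest_common_ancestor([title_cell, other_cell])
```

For attributes highlighted within one table row, the common ancestor is that row, which gives a natural anchor for the tuple; attributes taken from different rows meet only at the table.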

Page 37:

Ranking the Tuple Sets

We adopt the concept of category utility:

Maximize intra-cluster similarity

Maximize inter-cluster dissimilarity

Dom-Path, specific value, missing attributes, indexing, content specification

Quantities in the category utility formula:

1) The weight of attribute A

2) The probability that an item has value v for attribute A, given it belongs to cluster C

3) The probability that an item belongs to cluster C, given it has value v for attribute A

[Figure with labels S0 and T]
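
The three quantities listed on this slide can be combined into a score as in the sketch below. The exact formula is not reproduced in the transcript, so the combination used here, the sum over clusters C, attributes A, and values v of weight(A) · P(A=v | C) · P(C | A=v), is an assumption, as are the function name `category_utility` and the toy data.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]   # attribute name -> value (hypothetical representation)


def category_utility(clusters: List[List[Item]], weights: Dict[str, float]) -> float:
    """Sum of weight(A) * P(A=v | C) * P(C | A=v) over clusters, attributes, values."""
    all_items = [item for cluster in clusters for item in cluster]
    score = 0.0
    for attr, weight in weights.items():
        overall = Counter(item[attr] for item in all_items)
        for cluster in clusters:
            if not cluster:
                continue
            local = Counter(item[attr] for item in cluster)
            for value, count in local.items():
                p_value_given_cluster = count / len(cluster)
                p_cluster_given_value = count / overall[value]
                score += weight * p_value_given_cluster * p_cluster_given_value
    return score


weights = {"dom_path": 1.0}

# Candidate 1: extracted tuples and non-tuples are cleanly separated.
clean = category_utility(
    [[{"dom_path": "tr/td"}, {"dom_path": "tr/td"}],
     [{"dom_path": "div/p"}, {"dom_path": "div/p"}]],
    weights,
)

# Candidate 2: both clusters mix the two kinds of nodes.
mixed = category_utility(
    [[{"dom_path": "tr/td"}, {"dom_path": "div/p"}],
     [{"dom_path": "tr/td"}, {"dom_path": "div/p"}]],
    weights,
)
```

Under this assumed form, a candidate tuple set whose tuples resemble each other and differ from the non-tuples receives a higher score than one that mixes them, so it would be shown to the user first.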

Page 38:

Ranking: Discussion

Note: we are ranking tuple sets and wrappers

A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples

One could also try to rank extraction patterns, say using MDL

Page 39:

Experimental Evaluations

Number of training tuples required by our system and by previous work

Results on four previously used data sets from RISE: Okra, BigBook, Internet Address Finder, Quote Server

Page 40:

Experimental Evaluations

We chose ten well-known web sites and collected fifty web pages from each:

AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)

Page 41:

Experimental Evaluation

Updating Term Weights (effect of adaptive approach):

The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites

Page 42:

Summary

An approach to interactive wrapper generation that combines:

Powerful extraction language

Techniques for deriving extraction patterns from user input

A framework using active learning

A ranking technique using a category utility function