Joint Optimization of Wrapper Generation and Template Detection

Joint Optimization of Wrapper Generation and Template Detection Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen Microsoft Research Asia SIGKDD-2007, San Jose, California, USA

Transcript of Joint Optimization of Wrapper Generation and Template Detection

Page 1: Joint Optimization of  Wrapper  Generation  and Template Detection

Joint Optimization of Wrapper Generation and Template Detection

Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen
Microsoft Research Asia

SIGKDD-2007, San Jose, California, USA

Page 2: Joint Optimization of  Wrapper  Generation  and Template Detection

Outline: Introduction, Our approach, Experiments, Demo, Conclusion

Page 3: Joint Optimization of  Wrapper  Generation  and Template Detection

Motivations

[Diagram: a page generation script (e.g., ASP, PHP, JSP) encodes database content into web pages; a wrapper decodes the pages back into data.]

Page 4: Joint Optimization of  Wrapper  Generation  and Template Detection

Related Work

Some automatic or semi-automatic wrapper learning methods have been proposed, e.g., WIEN [12], SoftMealy [11], Stalker [17], RoadRunner [6], EXALG [2], TTAG [4], the work in [18], ViNTs [21], etc.

Page clustering for wrapper induction has been treated as a trivial task:
Manual: most previous work
Automatic but isolated from wrapper generation: RoadRunner [6, 7] and [18]

Page 5: Joint Optimization of  Wrapper  Generation  and Template Detection

Problems (cont.)

Dynamic URLs
With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be.

Page 6: Joint Optimization of  Wrapper  Generation  and Template Detection

(a): www.amazon.com/gp/product/B000BNLGJA/
(b): www.amazon.com/gp/product/B00007J8SC/
(c): www.amazon.com/gp/product/B0000DD95R/
(d): www.amazon.com/gp/product/B0000A1AT9/

Page 7: Joint Optimization of  Wrapper  Generation  and Template Detection

Problems

Dynamic URLs
With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be.

Complex Templates
Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal.

Page 8: Joint Optimization of  Wrapper  Generation  and Template Detection

(c): www.amazon.com/gp/product/B0000DD95R/
(d): www.amazon.com/gp/product/B0000A1AT9/

Page 9: Joint Optimization of  Wrapper  Generation  and Template Detection

Our Proposed Approach

Main ideas
Similarity-based templates, instead of ground-truth templates

Advantages
More stable
Optimizes the number of wrappers

Page 10: Joint Optimization of  Wrapper  Generation  and Template Detection

Outline: Introduction, Our approach, Experiments, Demo, Conclusion

Page 11: Joint Optimization of  Wrapper  Generation  and Template Detection

Problem Definition

Page 12: Joint Optimization of  Wrapper  Generation  and Template Detection

System Overview

[Diagram. Training: Training Pages → Web Page Parsing → Joint Template Detection and Wrapper Generation Module → Wrapper Set. Extraction: New Pages → Web Page Parsing → Wrapper Selection and Data Extraction → table of records with columns Name, Image, Price.]

Page 13: Joint Optimization of  Wrapper  Generation  and Template Detection

Wrapper Generation [6, 4, 18]

[Figure: three trees rooted at HTML with children HEAD and BODY. One tree has nodes TITLE, TABLE, DIV, IMG and carries "+" and "?" annotations; the other two have nodes TITLE, TABLE, DIV, IMG, TABLE and TITLE, TABLE, IMG, TABLE, TABLE.]

Page 14: Joint Optimization of  Wrapper  Generation  and Template Detection

Wrapper-DOM Distance

Distance between a wrapper and a DOM tree:
Tree alignment
Cost calculation

Page 15: Joint Optimization of  Wrapper  Generation  and Template Detection

Wrapper-Oriented Page Clustering (WPC)

[Figure: four panels, (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper. Each panel shows a wrapper W among "+" and "-" marks; the number of "+" marks decreases from panel (a) to panel (d).]
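The transcript does not spell out the WPC procedure, so the following is not the authors' algorithm. It is a minimal sketch of distance-threshold clustering in general, meant to show why both the threshold and the order of the training pages can change the number of resulting groups. The function name `cluster_by_threshold`, the use of each cluster's first page as a stand-in for its wrapper, and the toy distance over tag sets are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Page = Tuple[str, ...]  # toy page: just a tuple of tag names (assumption)


def cluster_by_threshold(
    pages: Sequence[Page],
    dist: Callable[[Page, Page], float],
    threshold: float,
) -> List[List[Page]]:
    """Greedy clustering: join the nearest cluster within `threshold`, else start one."""
    clusters: List[List[Page]] = []
    for page in pages:
        best: Optional[List[Page]] = None
        best_d: Optional[float] = None
        for cluster in clusters:
            # The first page of a cluster stands in for its wrapper here.
            d = dist(cluster[0], page)
            if d <= threshold and (best_d is None or d < best_d):
                best, best_d = cluster, d
        if best is None:
            clusters.append([page])
        else:
            best.append(page)
    return clusters


def tag_set_distance(a: Page, b: Page) -> float:
    """Toy distance: number of tags present in exactly one of the two pages."""
    return float(len(set(a) ^ set(b)))


pages = [
    ("TABLE", "DIV", "IMG"),
    ("TABLE", "IMG"),
    ("UL", "LI", "SPAN"),
    ("UL", "LI"),
]
print(len(cluster_by_threshold(pages, tag_set_distance, threshold=1)))  # 2
print(len(cluster_by_threshold(pages, tag_set_distance, threshold=0)))  # 4
```

With a looser threshold fewer groups (and hence fewer wrappers) are produced; with a stricter one every page ends up alone.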

Page 16: Joint Optimization of  Wrapper  Generation  and Template Detection

Outline: Introduction, Our approach, Experiments, Demo, Conclusion

Page 17: Joint Optimization of  Wrapper  Generation  and Template Detection

Experiments

Data
1700 product pages from Amazon.com (Amazon)
1000 mixed pages from 10 shopping sites (M10)
Target product records: (name, image, price)

Settings
2-fold cross-validation
Evaluation measures: Precision, Recall, and F1
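For reference, the three evaluation measures can be computed as in the sketch below. How the authors decide that an extracted record is correct is not stated in the slides; the sketch assumes a record counts only when the whole (name, image, price) tuple matches exactly, and the sample records are made up for illustration.

```python
from typing import Set, Tuple

Record = Tuple[str, str, str]  # (name, image, price)


def precision_recall_f1(
    extracted: Set[Record], truth: Set[Record]
) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 over exact-match records (assumed criterion)."""
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up records: 3 extracted, 2 of them correct, 4 in the ground truth.
truth = {
    ("Camera", "cam.jpg", "$199"),
    ("Phone", "phone.jpg", "$299"),
    ("Laptop", "laptop.jpg", "$899"),
    ("Tablet", "tablet.jpg", "$399"),
}
extracted = {
    ("Camera", "cam.jpg", "$199"),
    ("Phone", "phone.jpg", "$299"),
    ("Laptop", "laptop.jpg", "$999"),  # wrong price, so not an exact match
}
p, r, f1 = precision_recall_f1(extracted, truth)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```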

Page 18: Joint Optimization of  Wrapper  Generation  and Template Detection

Effectiveness Test

Amazon: 44 wrappers, F1: 94.88% vs. 78%
M10:

Page 19: Joint Optimization of  Wrapper  Generation  and Template Detection

WPC with Different Thresholds

Page 20: Joint Optimization of  Wrapper  Generation  and Template Detection

Stability Test

Objective
Evaluate how the choice of the initial training page affects the performance of WPC

Page 21: Joint Optimization of  Wrapper  Generation  and Template Detection

Outline: Introduction, Our approach, Experiments, Demo, Conclusion

Page 22: Joint Optimization of  Wrapper  Generation  and Template Detection

Demo!

Microsoft Office Excel 2007 Web Data Add-In is coming soon!

Please give it a try in two weeks! http://blogs.msdn.com/xaw

Page 23: Joint Optimization of  Wrapper  Generation  and Template Detection

Outline: Introduction, Our approach, Experiments, Demo, Conclusion

Page 24: Joint Optimization of  Wrapper  Generation  and Template Detection

Conclusion

Our system
Takes a miscellaneous training set as input
Conducts template detection and wrapper generation in a single step
Can achieve a joint optimization under the criterion of extraction accuracy

In the near future
We will extend the approach to handle templates containing content strings

Page 25: Joint Optimization of  Wrapper  Generation  and Template Detection

Thanks!

Contacts:
Ruihua Song ([email protected])
Shuyi Zheng ([email protected])

Page 26: Joint Optimization of  Wrapper  Generation  and Template Detection

Poster No. 11

Looking forward to talking with you at Poster Reception II this evening!

Page 27: Joint Optimization of  Wrapper  Generation  and Template Detection

Backup Slides

Page 28: Joint Optimization of  Wrapper  Generation  and Template Detection

Labeling Cost

Objective
Show how many training pages are required to learn wrappers that achieve an F1 higher than 95%.