Joint Optimization of Wrapper Generation and Template Detection
Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen
Microsoft Research Asia
SIGKDD-2007, San Jose, California, USA
Outline
  Introduction
  Our approach
  Experiments
  Demo
  Conclusion
Motivations
[Figure: a page generation script (e.g., ASP, PHP, JSP) encodes database records into web pages; a wrapper decodes the pages back into structured data.]
Related Work
Some automatic or semi-automatic wrapper learning methods have been proposed
  e.g., WIEN [12], SoftMealy [11], Stalker [17], RoadRunner [6], EXALG [2], TTAG [4], the work in [18], ViNTs [21], etc.
Page clustering for wrapper induction is considered a trivial task
  Manual: most previous work
  Automatic but isolated from wrapper generation: RoadRunner [6, 7] and [18]
Problems
Dynamic URLs
  With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as before
(a): www.amazon.com/gp/product/B000BNLGJA/
(b): www.amazon.com/gp/product/B00007J8SC/
(c): www.amazon.com/gp/product/B0000DD95R/
(d): www.amazon.com/gp/product/B0000A1AT9/
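The four URLs above differ only in the product identifier. A minimal sketch of URL-based grouping, assuming a hypothetical `url_pattern` helper (not from the talk) that masks any path segment containing a digit, shows why this signal cannot separate them: every URL maps to the same pattern, whatever template the page actually uses.

```python
import re
from collections import defaultdict

# The four example URLs from the slide.
URLS = [
    "www.amazon.com/gp/product/B000BNLGJA/",
    "www.amazon.com/gp/product/B00007J8SC/",
    "www.amazon.com/gp/product/B0000DD95R/",
    "www.amazon.com/gp/product/B0000A1AT9/",
]


def url_pattern(url: str) -> str:
    """Hypothetical rule (an assumption): mask path segments containing a digit."""
    segments = url.strip("/").split("/")
    return "/".join("*" if re.search(r"\d", s) else s for s in segments)


def group_by_pattern(urls):
    """Group URLs by their masked pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


groups = group_by_pattern(URLS)
```

Under this assumed masking rule, `groups` holds a single key, `www.amazon.com/gp/product/*`, with all four pages in it.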
Problems (cont.)
Complex Templates
  Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal
(c): www.amazon.com/gp/product/B0000DD95R/
(d): www.amazon.com/gp/product/B0000A1AT9/
Our Proposed Approach
Main ideas
  Similarity-based templates, instead of ground-truth templates
Advantages
  Be more stable
  Optimize the number of wrappers
Problem Definition
System Overview
[Figure: system overview. Training pages pass through web page parsing into the joint template detection and wrapper generation module, which outputs a wrapper set. New pages pass through web page parsing, wrapper selection, and data extraction, yielding records with fields such as name, image, and price.]
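Wrapper Generation [6, 4, 18]
[Figure: three trees sharing the structure HTML; HEAD, BODY. One tree has leaves TITLE, TABLE, DIV, IMG and is annotated with the wildcards "+" and "?"; the other two are DOM trees with leaves (TITLE, TABLE, DIV, IMG, TABLE) and (TITLE, TABLE, IMG, TABLE, TABLE).]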
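One way to picture how a wrapper with optional nodes could arise from two pages is to align their child sequences and mark the unmatched nodes as optional. The sketch below is an illustration only, not the method of [6, 4, 18]: it uses Python's `difflib.SequenceMatcher` for the alignment, handles only the "?" wildcard (not "+"), and the `generalize` function and the two BODY child sequences (as read from the slide's example trees) are assumptions.

```python
from difflib import SequenceMatcher


def generalize(a, b):
    """Merge two child-label sequences; labels present in only one become optional."""
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Labels aligned in both pages are kept as required nodes.
            merged.extend(a[i1:i2])
        else:
            # Labels seen in only one page are marked with the "?" wildcard.
            merged.extend(f"{label}?" for label in a[i1:i2])
            merged.extend(f"{label}?" for label in b[j1:j2])
    return merged


# BODY children of the two example DOM trees (an interpretation of the slide).
page_1 = ["TABLE", "DIV", "IMG", "TABLE"]
page_2 = ["TABLE", "IMG", "TABLE", "TABLE"]
wrapper_children = generalize(page_1, page_2)
```

A fuller treatment would also collapse repeated siblings into a "+" node and recurse into subtrees; both are omitted here to keep the sketch short.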
Wrapper-DOM Distance
Distance between a wrapper and a DOM tree
  Tree alignment
  Cost calculation
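The two steps named here, tree alignment and cost calculation, can be sketched as a top-down alignment of two labelled trees. This is a simplified stand-in, not the cost function from the talk: the `Node` class, `align_cost`, and `distance` are hypothetical names, wildcards are ignored (the wrapper is treated as a plain tree), an unmatched subtree costs its node count, and the result is normalized by the combined size of both trees.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def align_cost(a, b):
    """Top-down alignment cost: edit distance over children, recursing on matches."""
    if a.label != b.label:
        return size(a) + size(b)  # roots differ: nothing in either subtree aligns
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),  # child of a unmatched
                dp[i][j - 1] + size(b.children[j - 1]),  # child of b unmatched
                dp[i - 1][j - 1] + align_cost(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n]


def distance(a, b):
    """Alignment cost normalized to [0, 1] by the total number of nodes."""
    return align_cost(a, b) / (size(a) + size(b))


def page(body_labels):
    """Build HTML(HEAD(TITLE), BODY(...)) with the given BODY children."""
    body = Node("BODY", [Node(label) for label in body_labels])
    return Node("HTML", [Node("HEAD", [Node("TITLE")]), body])


tree_1 = page(["TABLE", "DIV", "IMG", "TABLE"])
tree_2 = page(["TABLE", "IMG", "TABLE", "TABLE"])
d = distance(tree_1, tree_2)
```

With these assumed costs, identical trees have distance 0 and the two example trees differ by two unmatched nodes out of sixteen.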
Wrapper-Oriented Page Clustering (WPC)
[Figure: four panels, (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper, each showing a wrapper W with surrounding pages marked + or -.]
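The role a distance threshold plays in grouping pages can be illustrated with a greedy threshold-based clustering. This is not the WPC algorithm from the talk, which the slide suggests refines wrappers level by level; it is a simplified sketch in which `cluster_pages` and `jaccard_distance` are hypothetical names, each page is reduced to a set of tag names, and the first page of a cluster stands in for its wrapper.

```python
def jaccard_distance(a, b):
    """Toy page distance (an assumption): 1 minus the Jaccard similarity of tag sets."""
    return 1.0 - len(a & b) / len(a | b)


def cluster_pages(pages, distance, threshold):
    """Assign each page to the closest cluster seed within `threshold`, else start a new cluster."""
    clusters = []  # each cluster: {"seed": page, "members": [pages]}
    for page in pages:
        best = None
        for cluster in clusters:
            d = distance(cluster["seed"], page)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            clusters.append({"seed": page, "members": [page]})
        else:
            best[1]["members"].append(page)
    return clusters


# Three toy pages, each reduced to the set of tags it contains.
pages = [
    frozenset({"HTML", "TABLE", "IMG"}),
    frozenset({"HTML", "TABLE", "IMG", "DIV"}),
    frozenset({"HTML", "UL", "LI"}),
]

strict = cluster_pages(pages, jaccard_distance, threshold=0.1)
medium = cluster_pages(pages, jaccard_distance, threshold=0.3)
loose = cluster_pages(pages, jaccard_distance, threshold=0.9)
```

In this toy setting a tighter threshold yields more clusters (and so more wrappers), while a looser one merges dissimilar pages. The result also depends on which page is seen first, which is the kind of sensitivity a stability test would probe.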
Experiments
Data
  1700 product pages from Amazon.com (Amazon)
  1000 pages mixed from 10 shopping sites (M10)
  Target product records: (name, image, price)
Settings
  2-fold cross-validation
  Evaluation measures: Precision, Recall and F1
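The three evaluation measures follow their standard definitions. A minimal sketch, assuming that an extracted (name, image, price) record counts as correct only when it exactly matches a labelled record; the matching rule, the function name `precision_recall_f1`, and the toy records are assumptions, not details from the talk.

```python
def precision_recall_f1(extracted, gold):
    """Standard precision, recall and F1 over sets of records (exact match assumed)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: 5 labelled records, 4 extracted, 3 of them correct.
gold = [(f"item{i}", f"img{i}.jpg", f"${i}.99") for i in range(5)]
extracted = gold[:3] + [("item9", "img9.jpg", "$0.00")]
precision, recall, f1 = precision_recall_f1(extracted, gold)
```

Here precision is 3/4, recall is 3/5, and F1 is their harmonic mean, 2/3.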
Effectiveness Test
Amazon: 44 wrappers, F1: 94.88% vs. 78%
M10:
WPC with Different Thresholds
Stability Test
Objective
  Evaluate how the choice of the initial training page affects the performance of WPC
Demo!
Microsoft Office Excel 2007 Web Data Add-In is coming soon!
Please give it a try in two weeks! http://blogs.msdn.com/xaw
Conclusion
Our system
  Takes a miscellaneous training set as input
  Conducts template detection and wrapper generation in a single step
  Can achieve a joint optimization under the criterion of extraction accuracy
In the near future,
  We will extend the approach to handle templates containing content strings
Thanks!
Contacts:
  Ruihua Song ([email protected])
  Shuyi Zheng ([email protected])
Poster No. 11
Looking forward to talking with you at Poster Reception II this evening!
Backup Slides
Labeling Cost
To show how many training pages are required for learning wrappers to achieve an accuracy higher than 95% in terms of F1.