Joint Optimization of Wrapper Generation and Template Detection
Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen
Microsoft Research Asia
SIGKDD-2007, San Jose, California, USA
Outline
  Introduction
  Our approach
  Experiments
  Demo
  Conclusion
Motivations
[Figure: a page generation script (e.g., ASP, PHP, JSP) encodes database content into web pages; a wrapper decodes the pages back into data.]
Related Work
Several automatic or semi-automatic wrapper learning methods have been proposed, e.g., WIEN [12], SoftMealy [11], Stalker [17], RoadRunner [6], EXALG [2], TTAG [4], the work in [18], and ViNTs [21].
Page clustering for wrapper induction is considered a trivial task
  Manual: most previous work
  Automatic but isolated from wrapper generation: RoadRunner [6, 7] and [18]
Problems
Dynamic URLs
  Now that dynamic URLs are widespread, detecting templates by URL is no longer as effective as it used to be.
  (a): www.amazon.com/gp/product/B000BNLGJA/
  (b): www.amazon.com/gp/product/B00007J8SC/
  (c): www.amazon.com/gp/product/B0000DD95R/
  (d): www.amazon.com/gp/product/B0000A1AT9/
Problems (cont.)
Complex Templates
  Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal.
  (c): www.amazon.com/gp/product/B0000DD95R/
  (d): www.amazon.com/gp/product/B0000A1AT9/
Our Proposed Approach
Main ideas
  Similarity-based templates instead of ground-truth templates
Advantages
  More stable
  Optimizes the number of wrappers
Problem Definition
System Overview
[Figure: training pages pass through web page parsing into the joint template detection and wrapper generation module, which produces a wrapper set. New pages pass through web page parsing, wrapper selection, and data extraction, producing a table of records (Name, Image, Price).]
Wrapper Generation [6, 4, 18]
[Figure: a wrapper tree (nodes HTML, HEAD, BODY, TITLE, TABLE, DIV, IMG, with "+" and "?" wildcard marks) shown with two example DOM trees that share the HTML, HEAD, BODY structure.]
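As a rough illustration of how wildcard marks such as "?" and "+" can arise, the sketch below generalizes two sibling tag sequences into one pattern. It is not the method of [6, 4, 18]: the function names `collapse` and `generalize`, the use of `difflib` for alignment, and the two example child lists are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import groupby


def collapse(tags):
    """Collapse runs of one tag: ['A', 'A', 'B'] -> [('A', True), ('B', False)]."""
    return [(tag, len(list(run)) > 1) for tag, run in groupby(tags)]


def generalize(children_a, children_b):
    """Merge two child-tag sequences into a pattern with ?, + and * marks.

    Hypothetical helper: tags aligned in both sequences are required
    ("+" if repeated in either one); tags found in only one sequence
    become optional ("?", or "*" if they are also repeated).
    """
    a, b = collapse(children_a), collapse(children_b)
    names_a = [tag for tag, _ in a]
    names_b = [tag for tag, _ in b]
    matcher = SequenceMatcher(None, names_a, names_b, autojunk=False)
    pattern = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            for k in range(i2 - i1):
                tag, repeated_a = a[i1 + k]
                _, repeated_b = b[j1 + k]
                pattern.append(tag + ("+" if repeated_a or repeated_b else ""))
        else:
            # Present on one side only, so the pattern must allow its absence.
            for tag, repeated in a[i1:i2] + b[j1:j2]:
                pattern.append(tag + ("*" if repeated else "?"))
    return pattern


# Assumed BODY children of two pages; the real trees are not recoverable here.
page_a = ["TABLE", "DIV", "IMG", "TABLE"]
page_b = ["TABLE", "IMG", "TABLE", "TABLE"]
pattern = generalize(page_a, page_b)
print(pattern)
```

A full wrapper generator would apply this kind of merge recursively over the whole DOM tree rather than over one flat sibling list.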
Wrapper-DOM Distance
Distance between a wrapper and a DOM tree
Tree alignment
Cost calculation
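The two steps above can be sketched as a top-down tree alignment whose cost is the total size of the unmatched subtrees. The slides do not give the actual cost function, so everything here is an assumption: the `Node` class, the `optional` flag standing in for a "?" wildcard, and the unit cost per unmatched node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    optional: bool = False  # hypothetical stand-in for a "?" wildcard


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def distance(wrapper, dom):
    """Alignment cost between a wrapper node and a DOM node (0 = exact fit)."""
    if wrapper.tag != dom.tag:
        return size(wrapper) + size(dom)
    w, d = wrapper.children, dom.children
    # cost[i][j]: cheapest alignment of the first i wrapper children
    # with the first j DOM children (edit-distance style recurrence).
    cost = [[0] * (len(d) + 1) for _ in range(len(w) + 1)]
    for i in range(1, len(w) + 1):
        skip = 0 if w[i - 1].optional else size(w[i - 1])
        cost[i][0] = cost[i - 1][0] + skip
    for j in range(1, len(d) + 1):
        cost[0][j] = cost[0][j - 1] + size(d[j - 1])
    for i in range(1, len(w) + 1):
        skip = 0 if w[i - 1].optional else size(w[i - 1])
        for j in range(1, len(d) + 1):
            cost[i][j] = min(
                cost[i - 1][j] + skip,                # wrapper child unmatched
                cost[i][j - 1] + size(d[j - 1]),      # DOM child unmatched
                cost[i - 1][j - 1] + distance(w[i - 1], d[j - 1]),
            )
    return cost[-1][-1]


def page(*body_children):
    """Build HTML(HEAD(TITLE), BODY(...)) for the examples."""
    return Node("HTML", [Node("HEAD", [Node("TITLE")]),
                         Node("BODY", list(body_children))])


wrapper = page(Node("TABLE"), Node("DIV", optional=True), Node("IMG"))
dom_fit = page(Node("TABLE"), Node("IMG"))
dom_extra = page(Node("TABLE"), Node("IMG"), Node("TABLE"))
print(distance(wrapper, dom_fit), distance(wrapper, dom_extra))
```

In this sketch a page that the wrapper already covers has distance 0, and each DOM node the wrapper cannot account for adds 1.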
Wrapper-Oriented Page Clustering (WPC)
[Figure: four panels, (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper, each showing a wrapper W together with pages marked "+" and "-".]
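One way to picture wrapper-oriented clustering is a greedy loop in which each page either refines its nearest wrapper or, if that wrapper is too far away, starts a new one. This is a minimal sketch under that assumption and not the WPC algorithm itself: `cluster_pages`, the Jaccard distance over tag sets, and set union as the "refine" step are hypothetical stand-ins for the wrapper-DOM distance and wrapper generation.

```python
def cluster_pages(pages, distance, merge, threshold):
    """Greedy threshold clustering; returns a list of [wrapper, member_pages]."""
    clusters = []
    for page in pages:
        # Nearest existing wrapper, or None when no cluster exists yet.
        best = min(clusters, key=lambda c: distance(c[0], page), default=None)
        if best is not None and distance(best[0], page) <= threshold:
            best[0] = merge(best[0], page)  # refine the wrapper with this page
            best[1].append(page)
        else:
            clusters.append([page, [page]])  # the page seeds a new wrapper
    return clusters


def jaccard_distance(a, b):
    """Toy distance between two non-empty tag sets (assumption for the demo)."""
    return 1 - len(a & b) / len(a | b)


pages = [
    frozenset({"TABLE", "IMG"}),
    frozenset({"TABLE", "IMG", "DIV"}),
    frozenset({"UL", "LI"}),
]
loose = cluster_pages(pages, jaccard_distance, frozenset.union, threshold=0.5)
strict = cluster_pages(pages, jaccard_distance, frozenset.union, threshold=0.2)
print(len(loose), len(strict))
```

In this sketch the threshold controls how many wrappers are produced, and the result can depend on which page is processed first.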
Experiments
Data
  1700 product pages from Amazon.com (Amazon)
  1000 mixed pages from 10 shopping sites (M10)
  Target product records: (name, image, price)
Settings
  2-fold cross-validation
  Evaluation measures: Precision, Recall, and F1
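For reference, the three measures can be computed over extracted fields as follows. The function name, the (page, field, value) tuple layout, and the sample data are assumptions for illustration only; they are not the evaluation code or data used in the experiments.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F1 of extracted items against gold items."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up (page, field, value) tuples, not results from the slides.
gold = {
    ("p1", "name", "A"), ("p1", "price", "9"),
    ("p2", "name", "B"), ("p2", "price", "5"),
}
extracted = {("p1", "name", "A"), ("p1", "price", "9"), ("p2", "name", "X")}
precision, recall, f1 = precision_recall_f1(extracted, gold)
print(precision, recall, f1)
```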
Effectiveness Test
Amazon: 44 wrappers, F1: 94.88% vs. 78%
M10:
WPC with Different Thresholds
Stability Test
Objective
Evaluate how the choice of initial training page impacts the performance of WPC
Demo!
Microsoft Office Excel 2007 Web Data Add-In is coming soon!
Please give it a try in two weeks! http://blogs.msdn.com/xaw
Conclusion
Our system
  Takes a miscellaneous training set as input
  Conducts template detection and wrapper generation in a single step
  Can achieve a joint optimization under the criterion of extraction accuracy
In the near future
  We will extend the approach to handle templates containing content strings
Thanks!
Contacts:
  Ruihua Song ([email protected])
  Shuyi Zheng ([email protected])
Poster No. 11
Looking forward to talking with you at Poster Reception II this evening!
Backup Slides
Labeling Cost
Shows how many training pages are required for the learned wrappers to achieve an F1 higher than 95%.