A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction
Julian Alarte, David Insa, Josep Silva, Salvador Tamarit
MiST Research Group, Universitat Politècnica de València
and
Babel Research Group, Universidad Politécnica de Madrid
XV Jornadas Sobre Programación y Lenguajes (PROLE15)
September 15th, 2015
-
Context and Motivation
Template Detection
Identifies the template of a webpage.
Essential for indexing tasks:
  - Templates represent between 40% and 50% of data on the Web
  - Usually contain irrelevant information (e.g. advertisements, menus and banners)
Avoids waste of resources (storage space, bandwidth, etc.)
Important tool for website developers and analyzers.

Content Extraction
Identifies the main content of the webpage.
Essential for many information retrieval and processing tasks.
Alarte, Insa, Silva, Tamarit (UPV & UPM) TECO Benchmark Suite September 15th, 2015 (PROLE15) 1 / 12
-
Context and Motivation
Template Detection & Content Extraction
Two of the main areas of information retrieval applied to the Web.
Complementary: main content is not part of the template.
-
Benchmark Suite for Template Detection and Content Extraction
Testing, Comparing and Tuning
Collections of heterogeneous benchmarks: ensure the generality of the techniques.
Gold standard: ensures the same evaluation criteria.

Using a benchmark suite
Training phase: to optimize the techniques by adjusting parameters.
Evaluation phase: to measure the performance with objective criteria.
The two phases need disjoint sets of webpages.
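
A minimal sketch of how the two phases might be kept apart and scored. It assumes that benchmarks are identified by name and that both a technique's output and the gold standard can be represented as sets of node identifiers. The function names, the 50/50 split, and the choice of precision, recall and F1 as the objective criteria are illustrative assumptions, not something the suite prescribes.

```python
import random


def split_benchmarks(names, train_fraction=0.5, seed=0):
    """Split benchmark names into disjoint training and evaluation sets."""
    shuffled = sorted(names)               # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, independent of global state
    cut = int(len(shuffled) * train_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])


def precision_recall_f1(retrieved, gold):
    """Score a set of retrieved node ids against the gold-standard set."""
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Parameters would be tuned only on the first set returned by `split_benchmarks`, and the reported scores would come only from the second, so no webpage influences both phases.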
-
Benchmark Suite for Template Detection and Content Extraction
Previous Situation
Lack of a public and neutral benchmark suite.
Evaluations:
  - with different benchmarks
  - with different kinds of templates
  - using different criteria
Results hardly comparable with other techniques.
-
Benchmark Suite for Template Detection and Content Extraction
Our Experience
Initial intention: use a public benchmark suite, CleanEval
  - Widely used in the literature
  - Not prepared for template detection
Second option: contacted the authors of other techniques
  - We could not use their benchmarks due to:
      - privacy
      - copyright
      - unavailability
Final choice: build our own free and publicly available benchmark suite.
-
The TECO Benchmark Suite
Our approach
TECO (TEmplate detection and COntent extraction benchmarks suite)
Consists of:
  - Benchmark suite
  - Gold standard
  - Scripts to automate the benchmarking process
Scope:
  - Template detection
  - Content extraction
Goal:
  - Test
  - Compare
  - Tune
Uses:
  - Training
  - Evaluation
-
The TECO Benchmark Suite
Features
Result of a research project:
  - A new technique for content extraction
  - Later adapted for template detection.
40 real heterogeneous websites downloaded from the Internet.
Open, extensible, publicly available and free.
Webpages in different languages: to test language-independent features.
Downloading of the webpages:
  - All elements needed for correct visualization: HTML, images, scripts, CSS...
  - SiteSucker (OS X) and wget (Linux).
Each benchmark is composed of:
  1. Key page: the target webpage.
  2. All the webpages (from the same website) that are linked by the key page, as well as the webpages linked by them.
Gold standard (for each key page) using labels:
  - HTML classes notTemplate and mainContent.
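
As a rough illustration of how such labels could be read back from a key page, the sketch below uses Python's standard `html.parser` to record which elements carry the `notTemplate` or `mainContent` class. The `LabelCollector` class and the sample page are invented for this example; how the real gold standard nests the two labels, or propagates them to descendant nodes, is not stated here and is not assumed.

```python
from html.parser import HTMLParser


class LabelCollector(HTMLParser):
    """Collect the tag names of elements that carry a gold-standard class."""

    def __init__(self, labels=("notTemplate", "mainContent")):
        super().__init__()
        self.labels = set(labels)
        self.found = {label: [] for label in labels}

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        for label in self.labels.intersection(classes):
            self.found[label].append(tag)


# Hypothetical labelled page fragment, written only for this example.
page = """
<html><body>
  <div id="menu"><a href="/">Home</a></div>
  <div class="notTemplate">
    <article class="mainContent"><p>Story text</p></article>
  </div>
</body></html>
"""

collector = LabelCollector()
collector.feed(page)
print(collector.found)
```

A benchmarking script could compare the elements found this way with the nodes returned by a technique under test.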
-
Producing the Gold Standard
Four different engineers
Independently:
  - Manually explored the key page and the webpages accessible from it
  - Chose which part of the webpage is the template and which part is the main content.
Together:
  - Repeated the same actions, sharing their individual opinions.
-
Benchmark Classification
Classification 1: All benchmarks have been classified into five groups: Companies / Shops, Forums / Social, Personal websites / Blogs, Media / Communication, Institutions / Associations.
www.bbc.co.uk/news/index.html (Media / Communication)

Classification 2: All benchmarks have been classified according to their size and the proportion of their template / main content.

Id  Benchmark                      Nodes  T. Nodes  M.C. Nodes
24  www.bbc.co.uk/news/index.html  2991   364       1360

Classification 3: The benchmarks were also classified according to the number of webpages that implement the template.

Id  VL  TT  PT  DT  Notes (peculiarities)
24  5   0   5   0   Several templates (but very similar).
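
The proportions used in Classification 2 can be worked out directly from the node counts. The sketch below does so for benchmark 24, reading "T. Nodes" and "M.C. Nodes" as the numbers of template and main-content nodes (an interpretation of the column headers); the function name is hypothetical.

```python
def node_proportions(total_nodes, template_nodes, main_content_nodes):
    """Return the fractions of nodes in the template and in the main content."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return template_nodes / total_nodes, main_content_nodes / total_nodes


# Figures for benchmark 24 (www.bbc.co.uk/news/index.html) from Classification 2.
template_share, content_share = node_proportions(2991, 364, 1360)
print(f"template: {template_share:.1%}, main content: {content_share:.1%}")
```

For this benchmark the template accounts for roughly 12% of the nodes and the main content for roughly 45%.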
-
Downloading and using the suite
http://users.dsic.upv.es/~jsilva/retrieval/teco/
Download
Directory with 40 folders.
Scripts to automate the benchmarking process.

Rules for using the suite
  1. Publish the results so that they are publicly available.
  2. Provide enough information so that anyone can easily duplicate the experiments.
-
Rules for extending the suite
1. Websites included in TECO must be real and online websites not created by the people who submit the benchmark.
2. All benchmarks must be localized, so all resources are accessible offline.
3. Each benchmark must be composed of a webpage and at least all webpages accessible from it with two clicks.
4. All benchmarks must be manually reviewed by at least two people before being submitted.
5. All benchmarks submitted must be signed.
6. Researchers must follow the labeling guidelines of TECO.
7. All benchmarks submitted should not have a direct relation with a particular technique or tool.
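
The "two clicks" requirement in rule 3 amounts to a breadth-first traversal of the site's link graph, stopped at depth 2. The sketch below assumes the links have already been extracted into a dictionary mapping each page to the pages it links to; the function name and the sample graph are made up for illustration.

```python
from collections import deque


def pages_within_clicks(links, key_page, max_clicks=2):
    """Return every page reachable from key_page in at most max_clicks links."""
    distance = {key_page: 0}
    queue = deque([key_page])
    while queue:
        page = queue.popleft()
        if distance[page] == max_clicks:
            continue  # do not follow links beyond the click limit
        for target in links.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return set(distance)


# Hypothetical link graph: "d" is three clicks away from "index".
site_links = {
    "index": ["a", "b"],
    "a": ["c"],
    "b": ["index"],
    "c": ["d"],
}
benchmark_pages = pages_within_clicks(site_links, "index")
```

Here `benchmark_pages` contains the key page together with `a`, `b` and `c`, while `d` is left out because reaching it takes a third click.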
-
Conclusions & Future Work
Ongoing Extension (TECO 2.0)
Includes 90 benchmarks (50 more than TECO 1.0).
Contains explicit information about subtemplates.