A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

description

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyse the structure and content of webpages to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. The two are therefore complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Identifying templates is thus essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus and banners, and processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared.

Transcript of A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

  • A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

    Julian Alarte, David Insa, Josep Silva, Salvador Tamarit

    MiST Research Group, Universitat Politècnica de València
    and
    Babel Research Group, Universidad Politécnica de Madrid

    XV Jornadas sobre Programación y Lenguajes (PROLE15), September 15th, 2015

  • Context and Motivation

    Template Detection

    Identifies the template of a webpage.

    Essential for indexing tasks:
      - Templates represent between 40% and 50% of data on the Web
      - Usually contain irrelevant information (e.g. advertisements, menus and banners)

    Avoids waste of resources (storage space, bandwidth, etc.)

    Important tool for website developers and analyzers.

    Content Extraction

    Identifies the main content of the webpage.

    Essential for many information retrieval and processing tasks.

    Alarte, Insa, Silva, Tamarit (UPV & UPM) TECO Benchmark Suite September 15th, 2015 (PROLE15) 1 / 12

  • Context and Motivation

    Template Detection & Content Extraction

    Two of the main areas of information retrieval applied to the Web.

    Complementary: main content is not part of the template.

  • Benchmark Suite for Template Detection and Content Extraction

    Testing, Comparing and Tuning

    Collections of heterogeneous benchmarks: ensure generality of the techniques

    Gold standard: ensures the same evaluation criteria.
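
    One way a gold standard can give every technique the same evaluation criteria is to score the nodes a technique returns against the labelled nodes. The sketch below assumes node-level precision, recall and F1, which the slides do not name as the chosen metrics; the function name and the node identifiers are hypothetical.

```python
from typing import Set, Tuple


def score_against_gold(retrieved: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of retrieved node ids against gold node ids."""
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical node ids: the gold standard marks four template nodes,
# and the technique under test reports four nodes, three of them correct.
gold_template = {"n1", "n2", "n3", "n4"}
detected_template = {"n2", "n3", "n4", "n9"}

precision, recall, f1 = score_against_gold(detected_template, gold_template)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

    Because every technique is scored against the same labelled sets, the resulting numbers can be compared directly.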

    Using a benchmark suite

    Training phase: to optimize the techniques by adjusting parameters

    Evaluation phase: to measure the performance with objective criteria.

    They need disjoint sets of webpages.
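
    The requirement that training and evaluation use disjoint sets of webpages can be sketched as a seeded split of benchmark identifiers. This is only an illustration: the 50/50 fraction, the seed, and the numbering of the 40 benchmarks as 1 to 40 are assumptions, not something the slides prescribe.

```python
import random
from typing import List, Sequence, Tuple


def split_benchmarks(
    ids: Sequence[int], training_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Shuffle the ids reproducibly and cut them into two disjoint lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


# Assumed numbering: one id per benchmark folder.
benchmark_ids = list(range(1, 41))
training_ids, evaluation_ids = split_benchmarks(
    benchmark_ids, training_fraction=0.5, seed=0
)

# No benchmark may appear in both phases.
assert not set(training_ids) & set(evaluation_ids)
```

    Fixing the seed keeps the split reproducible, so that others can repeat the experiment on the same partition.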

  • Benchmark Suite for Template Detection and Content Extraction

    Previous Situation

    Lack of a public and neutral benchmark suite

    Evaluations:
      - with different benchmarks
      - with different kinds of templates
      - using different criteria

    Results hardly comparable with other techniques.

  • Benchmark Suite for Template Detection and Content Extraction

    Our Experience

    Initial intention: use a public benchmark suite, CleanEval
      - Widely used in the literature
      - Not prepared for template detection

    Second option: contacted the authors of other techniques
      - We could not use their benchmarks due to:
          - privacy
          - copyright
          - unavailability

    Final choice: Build our own free and publicly available benchmark suite.

  • The TECO Benchmark Suite

    Our approach

    TECO (TEmplate detection and COntent extraction benchmarks suite)

    Consists of:
      - Benchmark suite
      - Gold standard
      - Scripts to automate the benchmarking process

    Scope:
      - Template detection
      - Content extraction

    Goal:
      - Test
      - Compare
      - Tune

    Uses:
      - Training
      - Evaluation

  • The TECO Benchmark Suite

    Features

    Result of a research project:
      - A new technique for content extraction
      - Later adapted for template detection.

    40 real heterogeneous websites downloaded from the Internet.

    Open, extensible, publicly available and free.

    Webpages in different languages: to test language-independent features.

    Downloading of the webpages:
      - All needed elements for correct visualization: HTML, images, scripts, CSS...
      - SiteSucker (OS X) and wget (Linux).

    Each benchmark is composed of:
      1. Key page. Target webpage.
      2. All those webpages (from the same website) that are linked by the key page as well as the webpages linked by them.

    Gold standard (for each key page) using labels:
      - HTML classes notTemplate and mainContent.
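
    Since the gold standard is encoded as HTML classes, it can be read back with a standard HTML parser. The sketch below uses Python's `html.parser` and collects the tag name of each element that directly carries `notTemplate` or `mainContent`. The sample page and the `LabelCollector` class are hypothetical, and whether a label is meant to cover an element's descendants is not stated in the slides, so this version records only the labelled elements themselves.

```python
from html.parser import HTMLParser
from typing import Dict, List

LABELS = ("notTemplate", "mainContent")


class LabelCollector(HTMLParser):
    """Record the tag name of every element carrying a gold-standard class."""

    def __init__(self) -> None:
        super().__init__()
        self.labelled: Dict[str, List[str]] = {label: [] for label in LABELS}

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless, hence the "or ''".
        classes = (dict(attrs).get("class") or "").split()
        for label in LABELS:
            if label in classes:
                self.labelled[label].append(tag)


# Hypothetical key page: the menu is template, the article is main content,
# and the related-links box is neither template nor main content.
sample_page = """
<html><body>
  <div class="menu">Home | News</div>
  <div class="article notTemplate mainContent"><p>Story text</p></div>
  <div class="related notTemplate">Related links</div>
</body></html>
"""

collector = LabelCollector()
collector.feed(sample_page)
labelled = collector.labelled
print(labelled)
```

    Under this reading, the template is whatever lacks the `notTemplate` label, which is why a single labelled page can serve both template detection and content extraction.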

  • Producing the Gold Standard

    Four different engineers

    Independently:
      - Manually explored the key page and the webpages accessible from it
      - Chose what part of the webpage is the template and what part is the main content.

    Together:
      - Same actions, sharing their individual opinions.

  • Benchmark Classification

    Classification 1: All benchmarks have been classified into five groups: Companies / Shops, Forums / Social, Personal websites / Blogs, Media / Communication, Institutions / Associations.

    www.bbc.co.uk/news/index.html (Media / Communication)

    Classification 2: All benchmarks have been classified according to their size and the proportion of their template / main content.

    Id  Benchmark                      Nodes  T. Nodes  M.C. Nodes
    24  www.bbc.co.uk/news/index.html  2991   364       1360

    Classification 3: The benchmarks were also classified according to the number of webpages that implement the template.

    Id  VL  TT  PT  DT  Notes (peculiarities)
    24  5   0   5   0   Several templates (but very similar).
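
    The proportions used in Classification 2 follow directly from the node counts. The sketch below works them out for benchmark 24, assuming that "Nodes" is the total node count of the key page and that "T. Nodes" and "M.C. Nodes" count template and main content nodes respectively; the slides abbreviate these headers without expanding them.

```python
from typing import Tuple


def proportions(
    total_nodes: int, template_nodes: int, main_content_nodes: int
) -> Tuple[float, float]:
    """Return the template and main-content shares of a page's nodes."""
    return template_nodes / total_nodes, main_content_nodes / total_nodes


# Row for benchmark 24 (www.bbc.co.uk/news/index.html).
template_share, main_content_share = proportions(2991, 364, 1360)
print(f"template: {template_share:.1%}, main content: {main_content_share:.1%}")
```

    For this benchmark the template accounts for roughly 12% of the nodes and the main content for roughly 45%.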

  • Downloading and using the suite

    http://users.dsic.upv.es/~jsilva/retrieval/teco/

    Download

    Directory with 40 folders.

    Scripts to automate the benchmarking process

    Rules for using the suite
      1. Publish the results so that they are publicly available.
      2. Provide enough information so that anyone can easily duplicate the experiments.

  • Rules for extending the suite

      1. Websites included in TECO must be real and online websites not created by the people who submit the benchmark.
      2. All benchmarks must be localized, so all resources are accessible offline.
      3. Each benchmark must be composed of a webpage and at least all webpages accessible from it with two clicks.
      4. All benchmarks must be manually reviewed by at least two people before being submitted.
      5. All benchmarks submitted must be signed.
      6. Researchers must follow the labeling guidelines of TECO.
      7. Submitted benchmarks should not have a direct relation with a particular technique or tool.
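
    The "two clicks" requirement amounts to a breadth-first traversal of the site's link graph, limited to depth two. The sketch below shows this on a hypothetical in-memory link map; the page names and the `pages_within_clicks` helper are illustrative and not part of the TECO scripts.

```python
from collections import deque
from typing import Dict, List, Set


def pages_within_clicks(
    links: Dict[str, List[str]], key_page: str, max_clicks: int = 2
) -> Set[str]:
    """Return the key page plus every page reachable in at most max_clicks."""
    seen = {key_page}
    queue = deque([(key_page, 0)])
    while queue:
        page, clicks = queue.popleft()
        if clicks == max_clicks:
            continue  # do not follow links beyond the click limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, clicks + 1))
    return seen


# Hypothetical site: story2.html is three clicks away from the key page.
site_links = {
    "index.html": ["news.html", "about.html"],
    "news.html": ["story1.html", "index.html"],
    "story1.html": ["story2.html"],
}

required = pages_within_clicks(site_links, "index.html")
print(sorted(required))
```

    In this example `story2.html` falls outside the limit, so a benchmark built around `index.html` would not need to include it.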

  • Conclusions & Future Work

    Recap of earlier slides: Context and Motivation; The TECO Benchmark Suite; Benchmark Suite for Template Detection and Content Extraction (Our Experience); Downloading and using the suite.

    Ongoing Extension (TECO 2.0)

    Includes 90 benchmarks (50 more than TECO 1.0).

    Contains explicit information about subtemplates.
