Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages

Hongkun Zhao, Weiyi Meng, Clement Yu*

Department of Computer Science

State University of New York at Binghamton

* Department of Computer Science

University of Illinois at Chicago

September 15, 2006

VLDB 2006 Seoul

Presentation Outline

• Background

• Dynamic section extraction
– Problem statement
– The solution

• Experiments

• Related work

Background: Search Result Record (SRR)

Background: SRR Extraction - Motivations

• SRRs are frequently needed as input to other Web applications:
– Metasearch engines need to collect the SRRs from different search engines and merge them.
– Comparison shopping services need to compare SRRs from different search engines to find the best deal.

Background: SRRs within Multiple Sections

Background: Main Research Issues

• Three levels of search result extraction
– Section identification
– Record extraction
– Data unit identification and annotation

• Automatic wrapper generation

Background: SRR Extraction – ViNTs

• Most current work on automatic search result extraction focuses on record extraction, including
– ViNTs (WWW 2005)

• ViNTs can extract records from sections containing at least three records, including non-result (static) records

Problem Definition: Dynamic Sections

• A typical search engine result page contains static, semi-dynamic and dynamic contents.
– Static: query independent
– Semi-dynamic: basic structure is query independent
– Dynamic: query dependent

• A dynamic section is a set of all SRRs that appear consecutively and have certain common features such as a common header and a common display format.

Example: SRRs within Multiple Sections

Problem Definition: Dynamic Section Extraction

Problem statement: automatically extract all dynamic sections, as well as the SRRs within each dynamic section, from the result pages of any search engine.

Why dynamic section extraction:
• Dynamic sections correspond to search results, and many applications need them.
• Different applications may need SRRs from different sections.

Problem Definition: Challenges in Dynamic Section Extraction

• Non-uniform section format problem

• Section-record granularity problem
– Records versus sections

• Hidden section extraction problem
– Some sections may not appear in the sample result pages used for training

Background: SRRs within Multiple Sections

Result Page Layout Model

[Figure: result page layout model, with the labels "sections", "records" and "template"]

MSE: Multiple Section Extraction

[Diagram of the MSE pipeline. Components shown: Web Pages; MRE; DSE; Refining MRs and DSs; Checking Granularity for MRs; Record Mining From DSs; Refined Section Instances; Section Instances Clustering; Wrapper Building; Wrapper Family Building; Section Wrappers]

MRE: Multi-Record section Extraction

• MRE is revised from ViNTs (WWW 2005)

• Using MRE to extract MRs has four potential problems:
1. boundary problem, i.e., some records near the two boundaries of an MR may be incorrectly extracted
2. sections with fewer than three records may not be extracted
3. some extracted sections may contain static contents with repeating patterns
4. some extracted MRs may mistakenly take consecutive sections with the same format as records, and some large records may be incorrectly extracted as sections.

DSE: Dynamic Section Extraction

Step 1: Identify candidate section boundary markers (CSBM)

– Use a pair of result pages at a time

– CSBMs are usually static or semi-dynamic content lines that appear in both result pages and have compatible tag paths

Step 2: Identify dynamic sections (DS) based on the CSBMs

– Each (candidate) DS has a left boundary marker (LBM) and a right boundary marker (RBM), which are CSBMs and are not part of the DS

– Note: some DSs may be incorrect due to incorrect CSBMs
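The two DSE steps can be sketched as follows. This is a minimal illustration, not the authors' algorithm. It assumes each content line is a hypothetical `(tag_path, text)` pair, simplifies "compatible tag paths" to exact equality, and treats every run of lines between two consecutive markers as a candidate DS. All names are made up for the example.

```python
from typing import List, Tuple

Line = Tuple[str, str]  # (tag_path, text); a hypothetical line representation


def candidate_markers(page_a: List[Line], page_b: List[Line]) -> List[int]:
    """Step 1: indices of lines in page_a that also occur in page_b.

    Assumption: 'compatible tag paths' is simplified to identical paths.
    """
    seen = set(page_b)
    return [i for i, line in enumerate(page_a) if line in seen]


def candidate_sections(markers: List[int]) -> List[Tuple[int, int]]:
    """Step 2: (LBM index, RBM index) pairs with at least one line in between.

    The candidate DS is the run of lines strictly between the two markers,
    so the markers themselves are not part of the DS.
    """
    return [(lbm, rbm) for lbm, rbm in zip(markers, markers[1:]) if rbm - lbm > 1]


page_a = [
    ("html/body/h1", "Search"),
    ("html/body/h2", "Web results"),
    ("html/body/ul/li", "apple pie"),
    ("html/body/ul/li", "apple tart"),
    ("html/body/h2", "News results"),
    ("html/body/ul/li", "apple news"),
    ("html/body/p", "Copyright"),
]
page_b = [
    ("html/body/h1", "Search"),
    ("html/body/h2", "Web results"),
    ("html/body/ul/li", "banana bread"),
    ("html/body/h2", "News results"),
    ("html/body/ul/li", "banana news"),
    ("html/body/p", "Copyright"),
]

markers = candidate_markers(page_a, page_b)
sections = candidate_sections(markers)
```

Here the query-dependent lines differ between the two pages, so only the headers and the footer survive as markers, and the lines between them become the candidate sections.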

MRs and DSs Refining

• Idea: Use MRs and DSs to refine each other to
– identify and discard static sections
– correct the boundaries of some MRs and DSs

• Note: To deal with the non-uniform section format problem, neither of the two algorithms, MRE and DSE, assumes there is a common format/pattern among different sections when performing section extraction

MRs and DSs Refining

1. MR = DS
2. MR ⊂ DS
3. MR ⊃ DS
4. MR ∩ DS ≠ ∅
5. MR ∩ DS = ∅

[Figure: an MR and a DS drawn as overlapping regions, with the labels "Overlapping part", "Extra MR part" and "Extra DS part"]
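One way to read the five cases is as the possible relationships between the content lines covered by an MR and by a DS. The sketch below takes that reading as an assumption, since the slide does not spell it out. It represents each section as a hypothetical half-open range of line indices and also computes the three parts named in the figure.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of line indices (assumed)


def relation(mr: Span, ds: Span) -> str:
    """Classify how an MR and a DS relate, viewed as sets of line indices."""
    mr_lines, ds_lines = set(range(*mr)), set(range(*ds))
    if mr_lines == ds_lines:
        return "equal"
    if mr_lines < ds_lines:        # proper subset
        return "MR inside DS"
    if mr_lines > ds_lines:        # proper superset
        return "DS inside MR"
    if mr_lines & ds_lines:
        return "partial overlap"
    return "disjoint"


def parts(mr: Span, ds: Span) -> Dict[str, List[int]]:
    """Split the covered lines into overlapping, extra-MR and extra-DS parts."""
    mr_lines, ds_lines = set(range(*mr)), set(range(*ds))
    return {
        "overlap": sorted(mr_lines & ds_lines),
        "extra_mr": sorted(mr_lines - ds_lines),
        "extra_ds": sorted(ds_lines - mr_lines),
    }
```

For example, `relation((2, 8), (4, 10))` is a partial overlap whose extra MR part is lines 2 and 3 and whose extra DS part is lines 8 and 9.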

Mining Records from DSs

Goal: Identify records from dynamic sections that do not match any MRs such as those with fewer than three records.

Method: Consider dynamic section DS

1. Identify repeating tags within the tag forest for DS as candidate separators

2. Use each candidate separator to partition DS into records and select the partition with the highest section cohesion.
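A rough sketch of the two steps above. It assumes the tag forest of a DS is flattened to the sequence of its top-level tag names, which is a simplification, and it starts a new record at every occurrence of the chosen separator. Scoring the resulting partitions by cohesion is left out here. Function names are hypothetical.

```python
from collections import Counter
from typing import List


def candidate_separators(tags: List[str]) -> List[str]:
    """Step 1: tags that repeat within the section, in first-seen order."""
    counts = Counter(tags)
    return [t for t in dict.fromkeys(tags) if counts[t] > 1]


def partition(tags: List[str], sep: str) -> List[List[str]]:
    """Step 2: split the tag sequence into records, each starting at `sep`.

    Tags that precede the first separator form a leading record of their own.
    """
    records: List[List[str]] = []
    for t in tags:
        if t == sep or not records:
            records.append([])
        records[-1].append(t)
    return records


tags = ["a", "br", "span", "a", "br", "span"]
by_a = partition(tags, "a")    # two records with the same shape
by_br = partition(tags, "br")  # three records with uneven shapes
```

Partitioning by `a` gives two records with identical structure, while partitioning by `br` gives three uneven records, which is the kind of difference the cohesion measure is meant to detect.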

Mining Records from DSs

Observations about section cohesion: records in a section tend to be similar to each other, while the lines within a record tend to be dissimilar to each other.

The cohesion of a section S with records r1, r2, …, rk:

cohesion(S) = (average distance of the lines within each record) / (average distance among the records)

[Figure: a partition with high cohesion next to a partition with low cohesion]
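The cohesion ratio can be illustrated with a small sketch. The slides do not define the two distance functions, so the ones below are stand-ins chosen for the example: two lines are at distance 0 if their tag signatures are equal and 1 otherwise, and the distance between two records is one minus the `difflib.SequenceMatcher` similarity of their line sequences. The small `eps` in the denominator is also an assumption, added to avoid dividing by zero when all records are identical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import List

Record = List[str]  # a record as a list of line signatures (assumed representation)


def line_distance(a: str, b: str) -> float:
    """Stand-in line distance: 0 for identical signatures, 1 otherwise."""
    return 0.0 if a == b else 1.0


def record_distance(r1: Record, r2: Record) -> float:
    """Stand-in record distance based on sequence similarity."""
    return 1.0 - SequenceMatcher(None, r1, r2).ratio()


def cohesion(records: List[Record], eps: float = 1e-6) -> float:
    """Average within-record line distance over average between-record distance."""
    within = [
        mean(line_distance(a, b) for a, b in combinations(r, 2))
        for r in records
        if len(r) > 1  # a single-line record has no line pairs
    ]
    among = [record_distance(a, b) for a, b in combinations(records, 2)]
    avg_within = mean(within) if within else 0.0
    avg_among = mean(among) if among else 0.0
    return avg_within / (avg_among + eps)


good = [["a", "br", "span"], ["a", "br", "span"]]
bad = [["a"], ["br", "span", "a"], ["br", "span"]]
```

With these stand-ins, `good` scores far higher than `bad`: its records are identical to each other while the lines inside each record all differ.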

Solving Section-Record Granularity Problem

Two subproblems:
• Oversized record problem: some consecutive sections are recognized as records, or multiple small records are recognized as a single large record
• Splitting record problem: large records are recognized as sections, or large records are split into smaller records

Solving Oversized Record Problem

• Use record mining technique to try to find smaller records from a candidate oversized record R.

– If no smaller records can be found, R is not an oversized record

– If smaller records can be found, R is recognized as an oversized record

– If small records can be found and they are similar to the records mined from another (adjacent) candidate oversized record R1, then R and R1 are recognized as consecutive sections.

Solving Splitting Record Problem

• Let R be an MR with records (r1, …, rk), which is a partition of R.

– We generate new partitions by merging these records in different ways and calculate the cohesion of each partition.

– The partition with the highest cohesion is selected, which may yield larger records.

• If there exists a set of consecutive MRs that are siblings under the same sub-tree of the DOM tree, and all MRs in the set consist of only one record, then we form a new section with each original section in the set as a record and remove the original sections.
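The merging step in the first bullet could look like the sketch below. The slides say only that records are merged "in different ways". Merging every g consecutive records is one simple scheme, picked here as an assumption. The scoring function is passed in as a parameter and stands in for the cohesion measure. All names are hypothetical.

```python
from typing import Callable, List

Record = List[str]
Partition = List[Record]


def merge_every(records: Partition, g: int) -> Partition:
    """Merge each run of g consecutive records into one larger record."""
    return [sum(records[i:i + g], []) for i in range(0, len(records), g)]


def best_merge(records: Partition, score: Callable[[Partition], float]) -> Partition:
    """Try group sizes 1..k-1 and keep the partition with the highest score.

    Group size k (everything merged into one record) is skipped because a
    single record cannot be compared with any other record.
    """
    sizes = range(1, max(2, len(records)))
    return max((merge_every(records, g) for g in sizes), key=score)


# Records that were split too finely: each real record is an "a" line
# followed by a "br", "span" pair.
split = [["a"], ["br", "span"], ["a"], ["br", "span"]]


def toy_score(p: Partition) -> float:
    """Toy stand-in for cohesion: rewards long records that repeat the first one."""
    return sum(len(r) for r in p if r == p[0]) / len(p)


merged = best_merge(split, toy_score)
```

In this toy case the best partition merges the records pairwise, which recovers the two larger records.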

Certifying DSs Based on Multiple Result Pages

• Multiple result pages are used

• If an MR on one result page matches an MR on at least one other result page, both MRs are certified as section instances of the same section schema.

• More than two result pages can be used to generate section instance groups for different section schemas.

• A matching score is computed between two MRs from two pages based on their tag path similarity, SBM similarity and tag forest similarity.
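As an illustration only, a matching score could combine the three similarities as a weighted average. The equal weights, the use of `difflib.SequenceMatcher` for each similarity, and the `SectionInstance` fields are all assumptions of this sketch. The slides name the three ingredients but not how they are computed or combined.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


@dataclass
class SectionInstance:
    tag_path: List[str]  # path from the root to the section's sub-tree
    markers: List[str]   # texts of the section boundary markers (SBMs)
    forest: List[str]    # flattened tag sequence of the section's tag forest


def sim(a: Sequence[str], b: Sequence[str]) -> float:
    """Sequence similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def matching_score(
    x: SectionInstance,
    y: SectionInstance,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted average of tag path, SBM and tag forest similarity."""
    parts = (
        sim(x.tag_path, y.tag_path),
        sim(x.markers, y.markers),
        sim(x.forest, y.forest),
    )
    return sum(w * p for w, p in zip(weights, parts))


web_1 = SectionInstance(["html", "body", "table"], ["Web results"],
                        ["tr", "td", "a", "tr", "td", "a"])
web_2 = SectionInstance(["html", "body", "table"], ["Web results"],
                        ["tr", "td", "a"])
ads = SectionInstance(["html", "body", "div"], ["Sponsored links"], ["p", "a"])
```

Two instances of the same section on different pages (`web_1`, `web_2`) score higher than instances of different sections (`web_1`, `ads`), even though the number of records differs.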

Wrapper Generation

• Section wrapper format: <pref, seps, LBMs, RBMs>
– pref is the tag path that leads to the minimum sub-tree t that contains all records in this section
– seps is the separator set used to partition the sub-forest of t into records
– LBMs and RBMs are the sets of left and right boundary markers of the section

• Page wrapper: a sequence of section wrappers
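The wrapper format maps naturally onto a small data structure. The sketch below is an assumption-laden illustration: it treats the page as a flat list of hypothetical `(tag, text)` lines that are already the children of the sub-tree reached by `pref`, so `pref` is stored but not used for navigation.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Line = Tuple[str, str]  # (tag, text); a hypothetical flat view of the sub-forest


@dataclass
class SectionWrapper:
    pref: str       # tag path to the minimum sub-tree holding all records
    seps: Set[str]  # separator tags that start a new record
    lbms: Set[str]  # texts of left boundary markers
    rbms: Set[str]  # texts of right boundary markers


PageWrapper = List[SectionWrapper]  # a page wrapper is a sequence of section wrappers


def apply_wrapper(wrapper: SectionWrapper, lines: List[Line]) -> List[List[Line]]:
    """Extract the records that lie between an LBM and the next RBM."""
    records: List[List[Line]] = []
    inside = False
    for tag, text in lines:
        if not inside:
            inside = text in wrapper.lbms  # the marker itself is not a record line
            continue
        if text in wrapper.rbms:
            break
        if tag in wrapper.seps or not records:
            records.append([])
        records[-1].append((tag, text))
    return records


lines = [
    ("h2", "Web results"),
    ("a", "Apple pie"), ("span", "snippet 1"),
    ("a", "Apple tart"), ("span", "snippet 2"),
    ("h2", "News results"),
    ("a", "Apple news"),
]
web = SectionWrapper(pref="html/body/div", seps={"a"},
                     lbms={"Web results"}, rbms={"News results"})
records = apply_wrapper(web, lines)
```

Applying the wrapper returns the two records between the "Web results" and "News results" markers and ignores everything after the right boundary marker.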

Solving Hidden Section Extraction Problem

• For sections with zero or only one instance on the sample result pages, no wrapper will be generated.

• Use section family to solve this problem: A section family represents a class of section schemas that share some common features.

• Basic idea: assume that the schema of a hidden section is similar to that of an existing section.

Solving Hidden Section Extraction Problem

An example of a section family: All member section schemas have the same pref and seps, and their LBMs (RBMs) share the same line text attribute.

[Figure: tag tree with the nodes <HTML>, <HEAD>, <BODY>, <TBODY> and a row of <TR> nodes, labeled in order: LBM of Section 1, Section 1, RBM of Section 1, LBM of Section 2, Section 2, RBM of Section 2]

Experimental Results

• Dataset
– 100 search engines from the ViNTs dataset, 19 with multiple DSs
– 19 additional search engines that produce multiple DSs
– In total, 38 search engines produce multiple DSs
– Collect 10 result pages for each search engine; 5 are used for wrapper generation and 5 are used to test the wrappers

• Performance measures: Recall and Precision
– Perfect
– Partially correct (> 60% of records are extracted)


Section extraction results on all 119 search engines:

                                             #Partially   Recall %         Precision %
              #Actual  #Extracted  #Perfect  correct      Perfect  Total   Perfect  Total
Sample Pages     1057        1106       899         136      85.0   97.9      81.3   93.6
Test Pages        981        1028       820         134      83.6   97.2      79.8   92.8
Total            2038        2134      1719         270      84.3   97.6      80.6   93.2
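The reported percentages for all 119 search engines are consistent with the definitions sketched below: "perfect" counts only perfectly extracted sections, and "total" counts perfect plus partially correct ones. These definitions are inferred from the numbers rather than stated on the slide, and the function name is hypothetical. The example uses the Total row.

```python
from typing import Dict


def section_metrics(actual: int, extracted: int, perfect: int, partial: int) -> Dict[str, float]:
    """Recall and precision (in percent) for perfect and perfect-or-partial sections."""
    correct = perfect + partial
    return {
        "recall_perfect": 100 * perfect / actual,
        "recall_total": 100 * correct / actual,
        "precision_perfect": 100 * perfect / extracted,
        "precision_total": 100 * correct / extracted,
    }


# Total row: 2038 actual, 2134 extracted, 1719 perfect, 270 partially correct.
total = {k: round(v, 1) for k, v in section_metrics(2038, 2134, 1719, 270).items()}
```

Rounded to one decimal place, this gives 84.3 and 97.6 for recall and 80.6 and 93.2 for precision, matching the Total row.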


Section extraction results on the 38 search engines whose result pages have multiple dynamic sections:

                                             #Partially   Recall %         Precision %
              #Actual  #Extracted  #Perfect  correct      Perfect  Total   Perfect  Total
Sample Pages      652         670       538          92      82.5   96.6      80.2   94.0
Test Pages        590         611       468          95      79.3   95.4      76.6   92.1
Total            1242        1281      1006         187      81.0   96.1      78.5   93.1


Record extraction results on all extracted sections:

              #Actual  #Extracted  #Correct  Recall %  Precision %
Sample Pages     9615        9597      9490      98.7         98.9
Test Pages       8248        8245      8139      98.7         98.7
Total           17863       17842     17628      98.7         98.8

Related Work

• Many existing works on record extraction from web pages: RoadRunner, EXALG, IEPAD, DeLa, Omni, MDR, ViPER …

• Only MDR (Liu, Grossman, Zhai, SIGKDD, 2003) has the ability to output multiple sections but

– it does not differentiate dynamic sections from static contents

– it does not address the non-uniform format problem and the section-record granularity problem.

– the hidden section extraction problem does not arise for MDR because it does not generate wrappers, but working without wrappers can lead to other problems such as lower efficiency

Conclusions and Future Work

Conclusions:
– Studied the automatic section extraction problem
– Identified several interesting issues: non-uniform format problem, section-record granularity problem and hidden section extraction problem
– Provided solutions to the new problems

Future work:
– Still room to improve: increase the accuracy of identifying boundary markers of dynamic sections
– Section classification
– ……
