
Block-based Web Search

Deng Cai*1, Shipeng Yu*2, Ji-Rong Wen* and Wei-Ying Ma*

* Microsoft Research Asia
1 Tsinghua University
2 University of Munich

Presented by Hong Cheng

Problems in Traditional IR

• Term-Document Irrelevance Problem
– Noisy terms
– Multiple topics

• Variant Document Length Problem
– Length normalization is important

• Passage Retrieval in traditional IR
– Partition the document into several passages
– Solves the problems to some extent
– Three types of passages: discourse, semantic, window
– Fixed-window passages are shown to be robust

Problems in Web IR

• Noisy information
– Navigation
– Decoration
– Interaction
– …

• Multiple topics
– May contain text as well as images or links

[Figure: example web page with regions labeled "Noisy Information" and "Multiple Topics"]

Problems in Web IR (Cont.)

• Variant Document Length Problem

                      TREC-2&4   TREC-4&5   WT10g       .GOV
Number of docs        524,929    556,077    1,692,096   1,247,753
Text size (MB)        2,059      2,134      10,190      18,100
Median length (KB)    2.5        2.5        3.3         7.5
Average length (KB)   4.0        3.9        6.3         15.2

Conclusion: in web IR all the problems of traditional IR remain and are more severe!

Challenges in Web IR

• New characteristics of web pages
– Two-dimensional logical structure
– Visual layout presentation (space, color, font style, font size, separators)

• Page segmentation is achievable
– Obtain blocks from web pages
– Block-based web search is possible

Outline

• Motivation

• Page segmentation approaches

• Web search using page segmentation
– Block Retrieval
– Block-level Query Expansion

• Experiments and Discussions

• Conclusion

Web Page Segmentation Approaches

• Fixed-length approach (FixedPS)
– Traditional window-based passage retrieval

• DOM-based approach (DomPS)
– Like the natural paragraph in traditional passage retrieval

• Vision-based Web Page Segmentation (VIPS)
– Achieves a semantic partition to some extent

• Combined Approach (CombPS)
– Combines VIPS and fixed-length

Web Page Segmentation   Passage Retrieval
FixedPS                 Window
DomPS                   Discourse
VIPS                    Semantic
CombPS                  Semantic Window

Fixed-length Page Segmentation (FixedPS)

• A block contains a fixed number of words
• Traditional window-based methods can be applied
• Approaches
– Overlapped windows (e.g. Callan, SIGIR’94)
– Arbitrary passages of varying length (e.g. Kaszkiel et al, SIGIR’97)
• Results
– A simple but robust approach
– Does not consider semantic information
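The overlapped-window idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function name `fixed_windows` is hypothetical, the 200-word size follows the experimental setup reported later in the slides, and the half-window overlap is an assumption.

```python
def fixed_windows(words, size=200, overlap=100):
    """Split a list of words into overlapping fixed-length windows.

    `size` is the window length in words; `overlap` (assumed here to be
    half a window) is how many words consecutive windows share.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not words:
        return []
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
        start += step
    return windows


if __name__ == "__main__":
    text = "a b c d e f g h i j".split()
    for window in fixed_windows(text, size=4, overlap=2):
        print(" ".join(window))
```

Every word falls in at least one window, and the last window may be shorter than `size`.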

DOM-based Page Segmentation (DomPS)

• Relies on the DOM structure to partition the page
– DOM: Document Object Model
• Current approaches
– Based only on tags (e.g. Crivellari et al, TREC 9)
– Combine tags with contents and links (e.g. Chakrabarti et al, SIGIR’01)
• Results
– Similar to discourse passages in passage retrieval
– DOM represents only part of the semantic structure
– Imprecise content structure
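A tag-only DOM segmentation can be approximated with Python's standard `html.parser`. This is a rough sketch under stated assumptions: the set `BLOCK_TAGS` is hypothetical (the slides do not list which structural tags are used), and the names `DomSegmenter` and `dom_blocks` are made up for illustration. Text lying between two structural tags becomes a block of its own.

```python
from html.parser import HTMLParser

# Hypothetical choice of structural tags; not taken from the slides.
BLOCK_TAGS = {"p", "div", "table", "tr", "td", "ul", "ol", "li",
              "h1", "h2", "h3", "hr"}


class DomSegmenter(HTMLParser):
    """Collect text blocks delimited by structural tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Normalize whitespace and emit the buffered text as one block.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def dom_blocks(html):
    parser = DomSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing free text
    return parser.blocks


if __name__ == "__main__":
    page = "<div><p>first para</p>free text<p>second</p></div>"
    print(dom_blocks(page))
```

Because every structural tag closes a block, the resulting blocks tend to be very small, which matches the concern that tag-based segmentation gives an imprecise content structure.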

VIPS Algorithm

• Motivation
– Topics can be distinguished with visual cues in many cases
– Utilize the two-dimensional structure of web pages

• Goal
– Extract the semantic structure of a web page to some extent, based on its visual presentation

• Procedure
– Partition the web page top-down based on the separators

• Result
– A tree structure, in which each node corresponds to a block in the page
– Each node is assigned a value (Degree of Coherence) indicating how coherent the content of the block is, based on visual perception
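The resulting tree and the role of the Degree of Coherence (DoC) can be illustrated with a small Python sketch. It does not perform any visual analysis: the DoC values are assumed to be already computed, the class `VisualBlock` and function `extract_blocks` are hypothetical names, and the stopping rule (stop descending once a node's DoC reaches a permitted threshold, or the node is a leaf) is a simplified reading of how a coherence threshold selects blocks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    name: str
    doc: float  # Degree of Coherence, assumed to lie in [0, 1]
    children: List["VisualBlock"] = field(default_factory=list)


def extract_blocks(node, pdoc=0.6):
    """Return the blocks obtained by a top-down walk of the tree.

    A node is kept as a block if it is a leaf or if it is already
    coherent enough (doc >= pdoc); otherwise its children are examined.
    """
    if not node.children or node.doc >= pdoc:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, pdoc))
    return blocks


if __name__ == "__main__":
    page = VisualBlock("page", 0.2, [
        VisualBlock("VB1", 0.8, [VisualBlock("VB1_1", 0.9)]),
        VisualBlock("VB2", 0.3, [
            VisualBlock("VB2_1", 0.7),
            VisualBlock("VB2_2", 0.65),
        ]),
    ])
    print([b.name for b in extract_blocks(page, pdoc=0.6)])
```

Raising the threshold yields finer blocks, lowering it yields coarser ones.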

VIPS: An Example

[Figure: VIPS block tree for an example page. Web Page → VB1, VB2, …; VB2 → VB2_1, VB2_2, …; VB2_2 → VB2_2_1, VB2_2_2, VB2_2_3, VB2_2_4, …]

Microsoft Technical Report MSR-TR-2003-79

Combined Approach (CombPS)

• VIPS solves the problems of noisy information and multiple topics
• FixedPS can deal with the variant document length problem
• Combine these two:
– Partition the web page using VIPS
– Divide the blocks containing more words than the pre-defined window length

[Figure: pie chart of block length after segmenting 50,000 pages chosen from the WT10g data set using VIPS. Length bins: 0~10, 10~50, 50~200, 200~500, 500~. Slice values: 127,019 (21%), 100,386 (17%), 55,638 (9%), 57,532 (10%), 261,454 (43%)]
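The two-stage combination can be sketched as follows. The input is assumed to be a list of block texts already produced by a vision-based segmenter; the function name `comb_ps` is hypothetical, and splitting long blocks into non-overlapping windows is an assumption made to keep the example short (overlapping windows would work equally well).

```python
def comb_ps(vips_blocks, window=200):
    """Keep short visual blocks whole; cut long ones into windows.

    `vips_blocks` is a list of strings (one per visual block).
    Returns a list of word lists, none longer than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for block in vips_blocks:
        words = block.split()
        if not words:
            continue  # skip empty blocks
        if len(words) <= window:
            result.append(words)
        else:
            # Long block: fall back to fixed-length segmentation.
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result


if __name__ == "__main__":
    blocks = ["a b c", "w1 w2 w3 w4 w5 w6 w7"]
    for piece in comb_ps(blocks, window=3):
        print(piece)
```

Short blocks keep their semantic boundaries from the visual segmentation, while the length cap addresses the variant-length problem for the long ones.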

Web Page Segmentation Summary

• Fixed-length approach (FixedPS)
– Traditional passage retrieval

• DOM-based approach (DomPS)
– Like the natural paragraph in traditional passage retrieval

• Vision-based Web Page Segmentation (VIPS)
– Achieves a semantic partition to some extent

• Combined Approach (CombPS)
– Combines VIPS and fixed-length

Web Page Segmentation   Passage Retrieval
FixedPS                 Window
DomPS                   Discourse
VIPS                    Semantic
CombPS                  Semantic Window


Block Retrieval

• Similar to traditional passage retrieval
• Retrieve blocks instead of full documents
• Combine the relevance of blocks with the relevance of documents

• Goal:
– Verify whether page segmentation can deal with both the length normalization and multiple-topic problems

Block-level Query Expansion

• Similar to passage-level pseudo-relevance feedback
• Expansion terms are selected from top blocks instead of top documents

• Goal:
– Test whether page segmentation can benefit the selection of query terms by increasing term correlations within a block, and thus improve the final performance


Experiments

• Methodology
– Fixed-length window approach (FixedPS)
  • Overlapped windows with a size of 200 words
– DOM-based approach (DomPS)
  • Traverse the DOM tree looking for certain structural tags
  • A block is constructed from, and identified by, each such leaf tag
  • Free text between two tags is treated as a special block
– Vision-based approach (VIPS)
  • The permitted degree of coherence is set to 0.6
  • All the leaf nodes are extracted as visual blocks
– The combined approach (CombPS)
  • VIPS, then FixedPS
– Full document approach (FullDoc)
  • No segmentation is performed

Experiments (Cont.)

• Dataset
– TREC 2001 Web Track
  • WT10g corpus (1.69 million pages), crawled in 1997
  • 50 queries (topics 501-550)
– TREC 2002 Web Track
  • .GOV corpus (1.25 million pages), crawled in 2002
  • 49 queries (topics 551-600)

• Retrieval System
– Okapi, with weighting function BM2500

• Preprocessing
– Standard stop-word list
– No stemming or phrase information

• Tune parameters in BM2500 to achieve the best baselines
• Evaluation criterion: P@10
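P@10 is the fraction of the ten top-ranked results that are relevant. The sketch below shows that computation, together with the relative-improvement figure that result tables typically report; the function names and the toy document identifiers are illustrative only.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def relative_improvement(score, baseline):
    """Improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100


if __name__ == "__main__":
    ranking = [f"d{i}" for i in range(1, 21)]
    relevant = {"d1", "d3", "d5", "d15"}
    print(precision_at_k(ranking, relevant))  # d15 is outside the top 10
    print(round(relative_improvement(0.366, 0.312), 1))
```

Dividing by `k` rather than by the number of returned documents means a run that returns fewer than ten results is penalized, which is the usual convention.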

Experiments on Block Retrieval

• Steps:
1. Do original document retrieval
– Obtain a document rank DR
2. Analyze top N (1000 here) documents to get a block set
3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
– Obtain a block rank BR
– Documents are re-ranked by the single best block in each document
4. Combine BR and DR to get a new document rank
– $\alpha \cdot rank_{DR}(d) + (1 - \alpha) \cdot rank_{BR}(d)$
– $\alpha$ is the tuning parameter
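Steps 3 and 4 can be sketched in Python, assuming the combination is an α-weighted sum of the two rank positions (1 is best), so that α = 1 reproduces the document ranking and α = 0 the block-based ranking. The function names are hypothetical, and restricting the result to documents present in both rankings is an assumption of this sketch.

```python
def doc_rank_from_blocks(ranked_blocks):
    """Rank documents by their single best block.

    `ranked_blocks` is a best-first list of (doc_id, block_id) pairs;
    a document's rank is decided by the first of its blocks to appear.
    """
    rank = {}
    for doc_id, _block_id in ranked_blocks:
        if doc_id not in rank:
            rank[doc_id] = len(rank) + 1
    return rank


def combine_ranks(rank_dr, rank_br, alpha):
    """Re-rank documents by alpha*rank_DR + (1-alpha)*rank_BR (lower is better)."""
    docs = rank_dr.keys() & rank_br.keys()
    combined = {d: alpha * rank_dr[d] + (1 - alpha) * rank_br[d] for d in docs}
    # Ties are broken by document id to keep the output deterministic.
    return sorted(combined, key=lambda d: (combined[d], d))


if __name__ == "__main__":
    rank_dr = {"a": 1, "b": 2, "c": 3}
    blocks = [("c", 0), ("a", 1), ("c", 2), ("b", 0)]
    rank_br = doc_rank_from_blocks(blocks)
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, combine_ranks(rank_dr, rank_br, alpha))
```

Sweeping `alpha` over [0, 1] corresponds to the combining-parameter axis used when tuning.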

Block Retrieval on TREC 2001 and TREC 2002 (P@10)

Result on TREC 2001 (P@10)

Page Segmentation   Baseline   BR only   BR + DR best
DomPS               0.312      0.252     0.322
FixedPS             0.312      0.304     0.326
VIPS                0.312      0.316     0.328
CombPS              0.312      0.326     0.338

[Figure: P@10 versus combining parameter (0 to 1) on TREC 2001 for CombPS, VIPS, FixedPS and DomPS]

Result on TREC 2002 (P@10)

Page Segmentation   Baseline   BR only   BR + DR best
DomPS               0.2286     0.1571    0.2286
FixedPS             0.2286     0.1776    0.2317
VIPS                0.2286     0.2163    0.2408
CombPS              0.2286     0.1939    0.2379

[Figure: P@10 versus combining parameter (0 to 1) on TREC 2002]

Experiments on Block-level Query Expansion

• Steps:
1. Same steps as block retrieval
– Do original document retrieval to get DR
– Analyze top N (1000 here) documents to get a block set
– Do block retrieval on the block set to get BR
2. Select some expansion terms based on top blocks
– 10 expansion terms in our experiments
– Number of top blocks is a tuning parameter
3. Document retrieval with the expanded query
– Modify the term weights before final retrieval
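Step 2 can be illustrated with a deliberately simple term selector. The scoring used here (the number of top blocks in which a term occurs) is a stand-in chosen for brevity, not the term-selection function of Okapi or of the talk; the function name, parameters, and sample blocks are all hypothetical.

```python
from collections import Counter


def select_expansion_terms(query_terms, ranked_blocks, num_blocks=10,
                           num_terms=10, stop_words=frozenset()):
    """Pick expansion terms from the top-ranked blocks.

    `ranked_blocks` is a best-first list of block texts. Terms are scored
    by how many of the top `num_blocks` blocks contain them (an assumed,
    simplistic score); query terms and stop words are excluded.
    """
    query = {t.lower() for t in query_terms}
    block_freq = Counter()
    for block in ranked_blocks[:num_blocks]:
        terms = {w.lower() for w in block.split()}
        block_freq.update(terms - query - stop_words)
    # Highest block frequency first; alphabetical order breaks ties.
    ranked = sorted(block_freq, key=lambda t: (-block_freq[t], t))
    return ranked[:num_terms]


if __name__ == "__main__":
    blocks = [
        "web page segmentation vision",
        "vision based segmentation",
        "navigation bar links",
    ]
    print(select_expansion_terms(["web", "search"], blocks,
                                 num_blocks=2, num_terms=2))
```

Because only the top blocks are scanned, noisy regions such as the navigation block in the example contribute no terms unless they are themselves ranked highly.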

Query Expansion on TREC 2001 and TREC 2002 (P@10)

Result on TREC 2001 (P@10)

Page Segmentation   Baseline   Query Expansion (best) P@10   Improvement
FullDoc             0.312      0.326                         4.5%
DomPS               0.312      0.324                         3.8%
FixedPS             0.312      0.36                          15.4%
VIPS                0.312      0.362                         16.0%
CombPS              0.312      0.366                         17.3%

[Figure: P@10 versus number of blocks (documents in FullDoc), from 0 to 60, on TREC 2001 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]

Result on TREC 2002 (P@10)

Page Segmentation   Baseline   Query Expansion (best) P@10   Improvement
FullDoc             0.2286     0.2082                        -8.9%
DomPS               0.2286     0.2224                        -2.7%
FixedPS             0.2286     0.2327                        1.8%
VIPS                0.2286     0.2327                        1.8%
CombPS              0.2286     0.2388                        4.5%

[Figure: P@10 versus number of blocks, from 0 to 60, on TREC 2002]

Discussions

• FullDoc obtains only a low and insignificant result
– The baseline is low, so many top-ranked documents are actually irrelevant

• DomPS performs poorly and is very unstable
– The segmentation is too fine-grained
– Semantic blocks can hardly be detected, so expansion terms are poor

• FixedPS is stable and good
– Similar to the results in traditional IR
– A window may miss the real semantic blocks

• VIPS is very good
– Top blocks usually have very good quality
– Length normalization is still a problem

• CombPS is the best method in almost all experiments
– More than just a tradeoff


Conclusion

• Page segmentation is effective for improving web search
– Block Retrieval
– Block-level Query Expansion

• Plain-text retrieval → fixed-window partition; web information retrieval → semantic partition (VIPS)

• Integrating both semantic and fixed-length properties (CombPS) could deal with all problems and achieve the best performance

• We believe that block-based web search can be very useful in real search engines, and can also be very easily combined with block-level link analysis

Thanks!

Block Retrieval on TREC 2001 (Average Precision)

[Figure: average precision versus combining parameter (0 to 1) on TREC 2001]

Result on TREC 2001 (Average Precision)

Page Segmentation   Baseline   BR only   BR + DR best
DomPS               0.1703     0.1344    0.1752
FixedPS             0.1703     0.1743    0.1896
VIPS                0.1703     0.1605    0.1770
CombPS              0.1703     0.1673    0.1871

Query Expansion on TREC 2001 (Average Precision)

[Figure: average precision versus number of blocks, from 0 to 60, on TREC 2001]

Result on TREC 2001 (Average Precision)

Page Segmentation   Baseline   Query Expansion (best) AP   Improvement
FullDoc             0.1703     0.1953                      14.5%
DomPS               0.1703     0.202                       18.6%
FixedPS             0.1703     0.216                       26.8%
VIPS                0.1703     0.2199                      29.1%
CombPS              0.1703     0.2188                      28.5%

Summary of Block Retrieval

• DomPS seems to be the worst and most unstable method
– The produced blocks are too fine-grained
– Blocks cannot be mapped to a single semantic part within pages

• FixedPS is stable but not very good
– Similar to the results in traditional IR
– It lacks semantic partitioning and fails to find the best semantic blocks

• VIPS is very good and stable
– Semantic partitioning is important in the web context, especially for newly crawled web pages (e.g., TREC 2002)
– Its inability to deal with the varying-length problem results in poor performance on the somewhat older data set

• CombPS is a very good tradeoff between VIPS and FixedPS

Summary of Query Expansion

• FullDoc obtains only a relatively low and insignificant result
– The baseline is low, so many top-ranked documents are actually irrelevant

• DomPS fails to obtain a significant improvement over the baseline
– The segmentation is too fine-grained, so expansion terms are not very good

• VIPS is very good when using a small number of blocks
– Top blocks usually have very good quality
– VIPS can provide a semantic partition and good expansion terms

• FixedPS is very stable and good
– Very stable as the number of blocks increases
– A window may cover content from different semantic regions, so noisy terms are likely to be introduced

• CombPS is the best method on both data sets
– More than just a tradeoff
