Block-based Web Search: Deng Cai*1, Shipeng Yu*2, Ji-Rong Wen* and Wei-Ying Ma* (* Microsoft Research Asia, 1 Tsinghua University, 2 University of Munich). Presented by Hong Cheng


Block-based Web Search

Deng Cai*1, Shipeng Yu*2, Ji-Rong Wen* and Wei-Ying Ma*

* Microsoft Research Asia
1 Tsinghua University
2 University of Munich

Presented by Hong Cheng


Problems in Traditional IR

• Term-Document Irrelevance Problem
  – Noisy terms
  – Multiple topics

• Variant Document Length Problem
  – Length normalization is important

• Passage Retrieval in traditional IR
  – Partition the document into several passages
  – Solves these problems to some extent
  – Three types of passages: discourse, semantic, window
  – Fixed-window passages are shown to be robust


Problems in Web IR

• Noisy information
  – Navigation
  – Decoration
  – Interaction
  – …

• Multiple topics
  – May contain text as well as images or links

[Figure labels: Noisy Information; Multiple Topics]


Problems in Web IR (Cont.)

• Variant Document Length Problem

                      TREC-2&4   TREC-4&5   WT10g       .GOV
Number of docs        524,929    556,077    1,692,096   1,247,753
Text size (MB)        2,059      2,134      10,190      18,100
Median length (KB)    2.5        2.5        3.3         7.5
Average length (KB)   4.0        3.9        6.3         15.2

Conclusion: in web IR, all the problems of traditional IR remain and are more severe!


Challenges in Web IR

• New characteristics of web pages
  – Two-dimensional logical structure
  – Visual layout presentation

• Page segmentation is achievable
  – Obtain blocks from web pages
  – Block-based web search is possible

[Figure labels: Space, Color, Font Style, Font Size, Separator]


Outline

• Motivation

• Page segmentation approaches

• Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion

• Experiments and Discussions

• Conclusion


Web Page Segmentation Approaches

• Fixed-length approach (FixedPS)
  – Traditional window-based passage retrieval

• DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval

• Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent

• Combined approach (CombPS)
  – Combines VIPS and fixed-length segmentation

Web page segmentation   FixedPS   DomPS       VIPS       CombPS
Passage retrieval       Window    Discourse   Semantic   Semantic Window


Fixed-length Page Segmentation (FixedPS)

• A block contains a fixed number of words
• Traditional window-based methods can be applied
• Approaches
  – Overlapped windows (e.g. Callan, SIGIR’94)
  – Arbitrary passages of varying length (e.g. Kaszkiel et al., SIGIR’97)
• Results
  – A simple but robust approach
  – Does not consider semantic information
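The overlapped-window idea can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `fixed_windows` and the 50% overlap are assumptions (the slides give a window size of 200 words but do not state the overlap).

```python
def fixed_windows(words, size=200, overlap=0.5):
    """Split a list of words into fixed-length, overlapping windows.

    `overlap` is the fraction shared by consecutive windows; 0.5 is an
    assumed value chosen for illustration.
    """
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and 0 <= overlap < 1")
    if not words:
        return []
    step = max(1, int(size * (1 - overlap)))  # distance between window starts
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + size])
        if start + size >= len(words):  # this window reached the end of the text
            break
        start += step
    return windows
```

A page shorter than the window yields a single block, and the last window may be shorter than `size`.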


DOM-based Page Segmentation (DomPS)

• Relies on the DOM structure to partition the page
  – DOM: Document Object Model
• Current approaches
  – Based only on tags (e.g. Crivellari et al., TREC 9)
  – Combine tags with contents and links (e.g. Chakrabarti et al., SIGIR’01)
• Results
  – Similar to discourse passages in passage retrieval
  – DOM represents only part of the semantic structure
  – Imprecise content structure


VIPS Algorithm

• Motivation
  – Topics can be distinguished with visual cues in many cases
  – Utilize the two-dimensional structure of web pages
• Goal
  – Extract the semantic structure of a web page to some extent, based on its visual presentation
• Procedure
  – Partition the web page top-down based on the separators
• Result
  – A tree structure; each node in the tree corresponds to a block in the page
  – Each node is assigned a value (Degree of Coherence) indicating how coherent the content of the block is, based on visual perception


VIPS: An Example

Web Page
  VB1
  VB2
    VB2_1
    VB2_2
      VB2_2_1
      VB2_2_2
      VB2_2_3
      VB2_2_4
      . . .
    . . .
  . . .

Microsoft Technical Report MSR-TR-2003-79
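The output of VIPS, a tree of visual blocks each carrying a Degree of Coherence (DoC), can be modelled roughly as below. This sketch does not implement VIPS itself (separator detection needs a rendered page); it only shows the shape of the result and how leaf nodes could be collected as blocks. The class `VisualBlock`, the function `leaf_blocks`, the block texts and all DoC values are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    name: str
    doc: float                      # Degree of Coherence (illustrative values only)
    text: str = ""
    children: List["VisualBlock"] = field(default_factory=list)


def leaf_blocks(node):
    """Return the leaf nodes of the block tree in left-to-right order."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(leaf_blocks(child))
    return leaves


# A small tree using the block labels from the example (contents invented).
page = VisualBlock("Web Page", doc=0.2, children=[
    VisualBlock("VB1", doc=0.8, text="navigation bar"),
    VisualBlock("VB2", doc=0.4, children=[
        VisualBlock("VB2_1", doc=0.7, text="headline"),
        VisualBlock("VB2_2", doc=0.5, children=[
            VisualBlock("VB2_2_1", doc=0.9, text="first story"),
            VisualBlock("VB2_2_2", doc=0.9, text="second story"),
        ]),
    ]),
])

print([block.name for block in leaf_blocks(page)])
```

In this toy tree only nodes with a low DoC have children, mirroring the idea that less coherent blocks are the ones partitioned further.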


Combined Approach (CombPS)

• VIPS solves the problems of noisy information and multiple topics
• FixedPS can deal with the variant document length problem
• Combine these two:
  – Partition the web page using VIPS
  – Divide the blocks containing more words than the pre-defined window length

[Figure: pie chart of block lengths after segmenting 50,000 pages chosen from the WT10g data set using VIPS. Slice values: 127,019 (21%); 100,386 (17%); 55,638 (9%); 57,532 (10%); 261,454 (43%). Legend (block length ranges): 0~10, 10~50, 50~200, 200~500, 500~]
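The combined approach can be sketched as a second pass over the VIPS blocks. The function name `comb_ps` is hypothetical, blocks are represented as plain word lists, and the use of non-overlapping windows for the split is an assumption made to keep the example short.

```python
def comb_ps(vips_blocks, window=200):
    """Split every block longer than `window` words into fixed-length pieces.

    `vips_blocks` is a list of word lists, one per visual block. Blocks that
    already fit in the window are kept unchanged.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for words in vips_blocks:
        if len(words) <= window:
            result.append(words)
        else:
            # Assumed here: plain, non-overlapping windows.
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result
```

Short blocks keep their visual boundaries, and only long blocks fall back to window-based splitting.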


Web Page Segmentation Summary

• Fixed-length approach (FixedPS)
  – Traditional passage retrieval

• DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval

• Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent

• Combined approach (CombPS)
  – Combines VIPS and fixed-length segmentation

Web page segmentation   FixedPS   DomPS       VIPS       CombPS
Passage retrieval       Window    Discourse   Semantic   Semantic Window



Block Retrieval

• Similar to traditional passage retrieval
• Retrieve blocks instead of full documents
• Combine the relevance of blocks with the relevance of documents
• Goal:
  – Verify whether page segmentation can deal with both the length normalization and multiple-topic problems


Block-level Query Expansion

• Similar to passage-level pseudo-relevance feedback
• Expansion terms are selected from top blocks instead of top documents
• Goal:
  – Test whether page segmentation can benefit the selection of query terms by increasing term correlations within a block, and thus improve the final performance



Experiments

• Methodology
  – Fixed-length window approach (FixedPS)
    • Overlapped windows with a size of 200 words
  – DOM-based approach (DomPS)
    • Traverse the DOM tree for some structural tags
    • A block is constructed and identified by each such leaf tag
    • Free text between two tags is treated as a special block
  – Vision-based approach (VIPS)
    • The permitted degree of coherence is set to 0.6
    • All the leaf nodes are extracted as visual blocks
  – The combined approach (CombPS)
    • VIPS, then FixedPS
  – Full document approach (FullDoc)
    • No segmentation is performed
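The DomPS procedure in the methodology above can be approximated with Python's standard `html.parser`. This is a rough sketch under stated assumptions: the slides do not list which structural tags were used, so `STRUCTURAL_TAGS` is an invented set, and the class `DomSegmenter` and function `dom_blocks` are hypothetical names. Text is cut into a new block at every structural tag boundary, so free text between two tags becomes its own block.

```python
from html.parser import HTMLParser

# Assumed set of structural tags; the slides do not enumerate them.
STRUCTURAL_TAGS = {"table", "tr", "td", "th", "p", "ul", "ol", "li",
                   "div", "hr", "h1", "h2", "h3"}


class DomSegmenter(HTMLParser):
    """Collect text into blocks, closing a block at each structural tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and emit the buffered text as one block.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def dom_blocks(html):
    parser = DomSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing free text
    return parser.blocks
```

Inline tags such as `<b>` do not end a block, so their text stays with the surrounding structural element.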


Experiments (Cont.)

• Dataset
  – TREC 2001 Web Track
    • WT10g corpus (1.69 million pages), crawled in 1997
    • 50 queries (topics 501-550)
  – TREC 2002 Web Track
    • .GOV corpus (1.25 million pages), crawled in 2002
    • 49 queries (topics 551-600)
• Retrieval system
  – Okapi, with weighting function BM2500
• Preprocessing
  – Standard stop-word list
  – No stemming or phrase information is used
• Tune parameters in BM2500 to achieve the best baselines
• Evaluation criterion: P@10
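P@10 is the fraction of relevant documents among the ten highest-ranked results. A minimal sketch follows; the function name `precision_at_k` and the document identifiers are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


ranking = [f"d{i}" for i in range(1, 13)]   # d1 ... d12, best first
relevant = {"d1", "d4", "d11"}              # d11 falls outside the top 10
print(precision_at_k(ranking, relevant))    # two hits in the top 10
```

Dividing by `k` rather than by the number of results returned means a query with fewer than ten results is penalized, which is the usual convention for this measure.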


Experiments on Block Retrieval

• Steps:
  1. Do original document retrieval
     – Obtain a document rank DR
  2. Analyze the top N (1000 here) documents to get a block set
  3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
     – Obtain a block rank BR
     – Documents are re-ranked by the single best block in each document
  4. Combine BR and DR to get a new document rank
     – $\alpha \cdot \mathrm{rank}_{DR}(d) + (1 - \alpha) \cdot \mathrm{rank}_{BR}(d)$
     – $\alpha$ is the tuning parameter
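Steps 3 and 4 can be sketched as follows. Two assumptions are made here: a rank is treated as a 1-based position (lower is better), and the tuning parameter is written `alpha`, weighting the document rank by `alpha` and the block rank by `1 - alpha`. The function names are hypothetical.

```python
def single_best_block_rank(block_scores):
    """Rank documents by the score of their single best block.

    `block_scores` maps a document id to the retrieval scores of its blocks.
    Returns a dict of document id -> rank position (1 = best).
    """
    best = {doc: max(scores) for doc, scores in block_scores.items()}
    ordered = sorted(best, key=lambda doc: (-best[doc], doc))
    return {doc: position + 1 for position, doc in enumerate(ordered)}


def combine_ranks(rank_dr, rank_br, alpha):
    """Order documents by alpha * rank_DR(d) + (1 - alpha) * rank_BR(d)."""
    combined = {doc: alpha * rank_dr[doc] + (1 - alpha) * rank_br[doc]
                for doc in rank_dr}
    return sorted(combined, key=lambda doc: (combined[doc], doc))


rank_dr = {"a": 1, "b": 2, "c": 3}                      # document rank DR
rank_br = single_best_block_rank(
    {"a": [0.1, 0.2], "b": [0.9, 0.1], "c": [0.5]})     # block rank BR
print(combine_ranks(rank_dr, rank_br, alpha=0.5))
```

Under this reading, `alpha = 1` reproduces the document-only baseline and `alpha = 0` corresponds to ranking by blocks only.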


Block Retrieval on TREC 2001 and TREC 2002 (P@10)

Result on TREC 2001 (P@10)

Page segmentation   Baseline   BR only   BR + DR (best)
DomPS               0.312      0.252     0.322
FixedPS             0.312      0.304     0.326
VIPS                0.312      0.316     0.328
CombPS              0.312      0.326     0.338

[Figure: P@10 versus combining parameter (0 to 1) on TREC 2001 for CombPS, VIPS, FixedPS and DomPS]

Result on TREC 2002 (P@10)

Page segmentation   Baseline   BR only   BR + DR (best)
DomPS               0.2286     0.1571    0.2286
FixedPS             0.2286     0.1776    0.2317
VIPS                0.2286     0.2163    0.2408
CombPS              0.2286     0.1939    0.2379

[Figure: P@10 versus combining parameter (0 to 1) on TREC 2002 for CombPS, VIPS, FixedPS and DomPS]


Experiments on Block-level Query Expansion

• Steps:
  1. Same steps as block retrieval
     – Do original document retrieval to get DR
     – Analyze the top N (1000 here) documents to get a block set
     – Do block retrieval on the block set to get BR
  2. Select expansion terms based on the top blocks
     – 10 expansion terms in our experiments
     – The number of top blocks is a tuning parameter
  3. Document retrieval with the expanded query
     – Modify the term weights before final retrieval
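Step 2 can be illustrated with a deliberately simple term selector. The slides do not say how candidate terms are scored, so the criterion used here (the number of top blocks containing a term) is a stand-in rather than the authors' method; the function name `expansion_terms` and the sample blocks are invented. Re-weighting the terms (step 3) is not shown.

```python
from collections import Counter


def expansion_terms(top_blocks, query_terms, stop_words=frozenset(), n_terms=10):
    """Pick expansion terms from the top-ranked blocks.

    Scores each candidate by how many top blocks contain it (an assumed,
    simplified criterion) and returns the `n_terms` best candidates.
    """
    counts = Counter()
    for words in top_blocks:
        for term in set(words):              # count each block at most once
            if term not in query_terms and term not in stop_words:
                counts[term] += 1
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:n_terms]


top_blocks = [
    ["web", "search", "block", "vips"],
    ["block", "vips", "page"],
    ["vips", "layout"],
]
print(expansion_terms(top_blocks, query_terms={"web", "search"}, n_terms=3))
```

Because candidates come from blocks rather than whole pages, terms from unrelated regions of the same page never enter the count, which is the effect the block-level approach relies on.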


Query Expansion on TREC 2001 and TREC 2002 (P@10)

Result on TREC 2001 (P@10)

Page segmentation   Baseline   Query expansion (best) P@10   Improvement
FullDoc             0.312      0.326                         4.5%
DomPS               0.312      0.324                         3.8%
FixedPS             0.312      0.36                          15.4%
VIPS                0.312      0.362                         16.0%
CombPS              0.312      0.366                         17.3%

[Figure: P@10 versus number of blocks (documents in FullDoc; 0 to 60) on TREC 2001 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]

Result on TREC 2002 (P@10)

Page segmentation   Baseline   Query expansion (best) P@10   Improvement
FullDoc             0.2286     0.2082                        -8.9%
DomPS               0.2286     0.2224                        -2.7%
FixedPS             0.2286     0.2327                        1.8%
VIPS                0.2286     0.2327                        1.8%
CombPS              0.2286     0.2388                        4.5%

[Figure: P@10 versus number of blocks (documents in FullDoc; 0 to 60) on TREC 2002 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]


Discussions

• FullDoc obtains only a low and insignificant result
  – The baseline is low, so many top-ranked documents are actually irrelevant
• DomPS is not good and very unstable
  – The segmentation is too detailed
  – Semantic blocks can hardly be detected, and expansion terms are not good
• FixedPS is stable and good
  – Similar result to the case in traditional IR
  – A window may miss the real semantic blocks
• VIPS is very good
  – Top blocks usually have very good quality
  – Length normalization is still a problem
• CombPS is almost the best method in all experiments
  – More than just a tradeoff



Conclusion

• Page segmentation is effective for improving web search
  – Block Retrieval
  – Block-level Query Expansion

• Plain-text retrieval → fixed-window partition
  Web information retrieval → semantic partition (VIPS)

• Integrating both semantic and fixed-length properties (CombPS) can deal with all of these problems and achieves the best performance

• We believe that block-based web search can be very useful in real search engines, and can also be easily combined with block-level link analysis


Thanks!


Block Retrieval on TREC 2001 (Average Precision)

[Figure: average precision versus combining parameter (0 to 1) on TREC 2001 for CombPS, VIPS, FixedPS and DomPS]

Result on TREC 2001 (Average Precision)

Page segmentation   Baseline   BR only   BR + DR (best)
DomPS               0.1703     0.1344    0.1752
FixedPS             0.1703     0.1743    0.1896
VIPS                0.1703     0.1605    0.1770
CombPS              0.1703     0.1673    0.1871


Query Expansion on TREC 2001 (Average Precision)

[Figure: average precision versus number of blocks (documents in FullDoc; 0 to 60) on TREC 2001 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]

Result on TREC 2001 (Average Precision)

Page segmentation   Baseline   Query expansion (best) average precision   Improvement
FullDoc             0.1703     0.1953                                     14.5%
DomPS               0.1703     0.202                                      18.6%
FixedPS             0.1703     0.216                                      26.8%
VIPS                0.1703     0.2199                                     29.1%
CombPS              0.1703     0.2188                                     28.5%


Summary of Block Retrieval

• DomPS seems to be the worst and most unstable method
  – The produced blocks are too detailed
  – Blocks cannot be mapped to a single semantic part within pages
• FixedPS is stable but not very good
  – Similar result to the case in traditional IR
  – It lacks semantic partition and fails to find the best semantic blocks
• VIPS is very good and stable
  – Semantic partition is important in the web context, especially for newly crawled web pages (e.g., TREC 2002)
  – The inability to deal with the varying length problem results in poor performance for VIPS on the somewhat older data set
• CombPS is a very good tradeoff between VIPS and FixedPS


Summary of Query Expansion

• FullDoc obtains only a relatively low and insignificant result
  – The baseline is low, so many top-ranked documents are actually irrelevant
• DomPS fails to obtain a significant improvement over the baseline
  – The segmentation is too detailed, so expansion terms are not very good
• VIPS is very good when using a small number of blocks
  – Top blocks usually have very good quality
  – VIPS can provide semantic partition and good expansion terms
• FixedPS is very stable and good
  – Very stable as the number of blocks increases
  – A window may cover content from different semantic regions, so noisy terms are likely to be introduced
• CombPS is the best method on both data sets
  – More than just a tradeoff