Relevance Propagation for Web Search

Dr. Tie-Yan Liu
Web Search and Mining Group
Microsoft Research Asia

Joint work with Tao Qin, Tsinghua University.

2006-3-13, DCWC 2006

Outline

• Introduction
• Generic framework for relevance propagation
• Evaluations
  - Effectiveness analysis
  - Complexity analysis
• Conclusions


Introduction

• Web Search ≠ Information Retrieval
  - Besides content relevance, various kinds of structure information also play an important role in Web search
    • Hyperlink graph
    • Local sitemap
    • Webpage layout

[Figure: diagram with nodes A1, A21, A22, A31, A32, A33 and pages p1 to p5]


Introduction

• Three ways of utilizing structure information for Web search
  - Linear combination of content relevance and importance scores computed from the hyperlink graph
    • β∙Relevance + (1-β)∙PageRank
  - Enhance link analysis with the help of content relevance
    • Query-dependent link graph in HITS
    • Topic-sensitive PageRank
  - Propagate content relevance along the Web structure
    • The use of anchor text in search engines
    • Hyperlink-based relevance score propagation (TREC 2003)
    • Sitemap-based feature propagation (TREC 2004)
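The first approach, β∙Relevance + (1-β)∙PageRank, can be illustrated with a minimal Python sketch. The function name `combine_scores`, the example scores and the value β = 0.7 are illustrative assumptions, not values from the talk; both score sets are assumed to be normalized to a comparable range beforehand.

```python
def combine_scores(relevance, pagerank, beta):
    """Return beta * relevance + (1 - beta) * pagerank for every page."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return {
        page: beta * relevance[page] + (1.0 - beta) * pagerank.get(page, 0.0)
        for page in relevance
    }


# Hypothetical, already-normalized scores for three pages.
relevance = {"p1": 0.9, "p2": 0.4, "p3": 0.1}
pagerank = {"p1": 0.2, "p2": 0.8, "p3": 0.5}

combined = combine_scores(relevance, pagerank, beta=0.7)
ranking = sorted(combined, key=combined.get, reverse=True)
```

With β = 1 the ranking depends on content relevance only, and with β = 0 on PageRank only.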


Hyperlink-based Relevance Score Propagation (Zhai et al., TREC 2003)

• Assumption
  - Hyperlinked pages have correlated content

[Figure: a page with its inlinks and outlinks]


Hyperlink-based Relevance Score Propagation (Zhai et al., TREC 2003)

• Assumption
  - Hyperlinked pages have correlated content
• Propagation model

$$h^{k+1}(p) = \alpha\, s(p) + \beta \sum_{p_i \to p} h^{k}(p_i)\,\omega_I(p_i, p) + \gamma \sum_{p \to p_j} h^{k}(p_j)\,\omega_O(p, p_j), \qquad h^{0}(p) = s(p)$$

  The three terms are, in order, the original relevance score, the propagation from the inlinks, and the propagation from the outlinks.

  - Weighted inlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p_i \to p} h^{k}(p_i)\,\omega_I(p_i, p), \qquad \omega_I(p_i, p) \propto s(p)$$

  - Weighted outlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p \to p_j} h^{k}(p_j)\,\omega_O(p, p_j), \qquad \omega_O(p, p_j) \propto s(p_j)$$

  - Uniform outlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p \to p_j} h^{k}(p_j)$$
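As a rough illustration of the weighted inlink model, the following Python sketch iterates the update until the scores stop changing. The slide only states that the inlink weight is proportional to s(p); normalizing it over the pages that each source links to is an assumption made here, as are the function name, the stopping rule and the toy graph.

```python
def propagate_weighted_inlink(s, links, alpha, max_iter=100, tol=1e-8):
    """Iterate h(p) <- alpha*s(p) + (1-alpha) * sum over inlinks of h(p_i)*w(p_i, p).

    s     : dict page -> original relevance score s(p)
    links : list of (source, target) hyperlinks inside the working set
    Assumption: w(p_i, p) = s(p) / (sum of s over all targets of p_i).
    """
    out_mass = {}  # total s over the targets of each source page
    for src, dst in links:
        out_mass[src] = out_mass.get(src, 0.0) + s[dst]

    h = dict(s)  # h^0(p) = s(p)
    for _ in range(max_iter):
        new_h = {p: alpha * s[p] for p in s}
        for src, dst in links:
            if out_mass[src] > 0.0:
                weight = s[dst] / out_mass[src]
                new_h[dst] += (1.0 - alpha) * h[src] * weight
        delta = max((abs(new_h[p] - h[p]) for p in s), default=0.0)
        h = new_h
        if delta < tol:  # scores have stabilized
            break
    return h


# Hypothetical three-page working set.
s = {"a": 1.0, "b": 0.5, "c": 0.0}
links = [("a", "b"), ("a", "c"), ("b", "a")]
h = propagate_weighted_inlink(s, links, alpha=0.5)
```

In this toy graph page c has no original relevance, so it receives zero weight and its score stays at zero, while a and b reinforce each other.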


Sitemap-based Feature Propagation (Liu and Qin, TREC 2004)

• Assumption
  - Child pages are extensions of their parent page
  - One should consider the contribution of the child pages while computing the relevance of the parent page to a query.
• Propagation model

$$f'_t(p) = \alpha\, f_t(p) + \frac{(1-\alpha)}{|Child(p)|} \sum_{q \in Child(p)} f_t(q)$$

[Figure: sitemap tree with nodes A1, A21, A22, A31, A32, A33]
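A single feature-level propagation step over a sitemap could look like the Python sketch below. It assumes the update f'_t(p) = α·f_t(p) + (1-α)·(mean of f_t(q) over the children q of p), with term frequency as the feature; the function name, the handling of pages without children (left unchanged) and the toy data are assumptions for illustration only.

```python
def propagate_features(features, children, alpha):
    """One propagation step from child pages to their parent.

    features : dict page -> dict term -> feature value f_t(p), e.g. term frequency
    children : dict page -> list of child pages in the sitemap
    Pages without children keep their features (an assumption).
    """
    propagated = {}
    for page, own in features.items():
        kids = children.get(page, [])
        if not kids:
            propagated[page] = dict(own)
            continue
        # A term may occur only in the children, so collect terms from both.
        terms = set(own)
        for kid in kids:
            terms.update(features[kid])
        propagated[page] = {
            t: alpha * own.get(t, 0.0)
            + (1.0 - alpha) * sum(features[k].get(t, 0.0) for k in kids) / len(kids)
            for t in terms
        }
    return propagated


# Hypothetical term frequencies for a parent page and its two children.
features = {
    "root": {"search": 2.0},
    "c1": {"search": 4.0, "rank": 2.0},
    "c2": {"rank": 6.0},
}
children = {"root": ["c1", "c2"]}
result = propagate_features(features, children, alpha=0.5)
```

After the step, the parent page carries a non-zero value for the term "rank" even though the term never occurs on the parent itself, which is the effect the assumption on this slide aims for.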


Generic Relevance Propagation Framework

• Modification of the sitemap-based feature propagation model

$$f'_t(p) = \alpha\, f_t(p) + \frac{(1-\alpha)}{|Child(p)|} \sum_{q \in Child(p)} f_t(q)$$

  becomes

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha)\, \frac{1}{|Child(p)|} \sum_{q \in Child(p)} f_t^{k}(q)$$

• Reminder of the hyperlink-based propagation model

$$h^{k+1}(p) = \alpha\, s(p) + \beta \sum_{p_i \to p} h^{k}(p_i)\,\omega_I(p_i, p) + \gamma \sum_{p \to p_j} h^{k}(p_j)\,\omega_O(p, p_j), \qquad h^{0}(p) = s(p)$$

• A generic framework to cover both hyperlink-based and sitemap-based propagations

$$c^{k+1}(p) = g\big(c^{0}(p),\; c^{k}(N_p)\big)$$
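One way to read the generic framework is as an iteration with a pluggable combination function g. The Python sketch below takes that view; it interprets c(p) as the score or feature of page p and N_p as its neighbors (children in a sitemap, linked pages in a hyperlink graph), which is an interpretation rather than something the slide spells out. The names `propagate` and `mean_of_neighbors` are hypothetical.

```python
def propagate(c0, neighbors, g, iterations):
    """Iterate c^{k+1}(p) = g(c^0(p), [c^k(n) for n in N_p])."""
    c = dict(c0)
    for _ in range(iterations):
        c = {p: g(c0[p], [c[n] for n in neighbors.get(p, [])]) for p in c0}
    return c


def mean_of_neighbors(alpha):
    """Build a g that mixes the original value with the mean over the neighbors."""
    def g(own, neighbor_values):
        if not neighbor_values:  # no neighbors: keep the original value
            return own
        mean = sum(neighbor_values) / len(neighbor_values)
        return alpha * own + (1.0 - alpha) * mean
    return g


# Sitemap-style instance: the neighbors of a page are its child pages.
c0 = {"root": 0.2, "a": 0.8, "b": 0.4}
neighbors = {"root": ["a", "b"]}
result = propagate(c0, neighbors, mean_of_neighbors(0.5), iterations=1)
```

Choosing a different neighbor relation (inlinks or outlinks) and a different g (weighted or uniform sums) yields the hyperlink-based variants under the same loop.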


More Derived Propagation Models

$$c^{k+1}(p) = g\big(c^{0}(p),\; c^{k}(N_p)\big)$$

            Score level                                Feature level
Hyperlink   Hyperlink-based score propagation model    Hyperlink-based feature propagation model
Sitemap     Sitemap-based score propagation model      Sitemap-based feature propagation model

Hyperlink-based Feature Propagation Model

• Weighted inlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p_i \to p} f_t^{k}(p_i)\,\omega_{I,t}(p_i, p)$$

• Weighted outlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p \to p_j} f_t^{k}(p_j)\,\omega_{O,t}(p, p_j)$$

• Uniform outlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p \to p_j} f_t^{k}(p_j)$$

Sitemap-based Score Propagation Model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha)\, \frac{1}{|Child(p)|} \sum_{q \in Child(p)} h^{k}(q)$$


Summary: All Models Covered by the Generic Framework

Algorithm                                                             Abbreviation
Weighted in-link case of hyperlink-based score propagation model      HS-WI
Weighted out-link case of hyperlink-based score propagation model     HS-WO
Uniform out-link case of hyperlink-based score propagation model      HS-UO
Weighted in-link case of hyperlink-based feature propagation model    HF-WI
Weighted out-link case of hyperlink-based feature propagation model   HF-WO
Uniform out-link case of hyperlink-based feature propagation model    HF-UO
Sitemap-based score propagation model                                 SS
Sitemap-based feature propagation model                               SF


Benchmark Datasets

• Corpora
  - .GOV
    • 1M pages
    • Queries: TD 2003, 2004
  - MSN
    • 2M pages
    • Queries: the 100 most popular queries from the MSN query log
• Base ranking function
  - BM2500

$$\sum_{T \in Q} \frac{(k_1 + 1)\,tf\,(k_3 + 1)\,qtf}{(K + tf)\,(k_3 + qtf)}$$


Experimental Results (1)

[Figure: P@10 (left) and MAP (right) on TREC 2003, plotted against the parameter α from 0 to 1, with one curve each for SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO and HF-UO]
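For readers unfamiliar with the two measures reported in these plots, the following Python sketch computes precision at k and mean average precision in their standard form. The function names and the toy ranking are hypothetical, and nothing here reproduces the numbers in the figures.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical result list and relevance judgments for one query.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```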


Experimental Results (2)

[Figure: P@10 (left) and MAP (right) on TREC 2004, plotted against the parameter α from 0 to 1, with one curve each for SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO and HF-UO]


Experimental Results (3)

[Figure: P@10 (left) and MAP (right) on the MSN dataset, plotted against the parameter α from 0 to 1, with one curve each for SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO and HF-UO]


Conclusions on Effectiveness

• In general, relevance propagation can boost search performance with proper parameter settings.
• The sitemap-based models are more effective than the hyperlink-based models.
  - Hyperlinks ≠ content correlation, while the pages in the same sub-site usually talk about correlated topics.
• Detailed comparisons
  - The two sitemap-based models have similar performance.
  - Among the hyperlink-based models, the HF-WI model performs best.


Online Complexity

• w is the size of the working set, q is the number of query terms, l is the average number of inlinks / outlinks, and t is the number of iterations.
• For the SS model, the complexity is O(w).
  - The SS model needs to propagate the relevance score of a page to its parent only once if the propagation is conducted bottom-up from the leaf nodes.
• For the SF model, the complexity is O(qw).
• For the HS models, the complexity is O(twl).
  - In each of the t iterations of the HS models, the relevance score of a page has to be propagated along its in-links or out-links in the subgraph of the working set.
• For the HF models, the complexity is O(tqwl).
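The O(w) bound for the SS model follows from visiting every page exactly once. A minimal Python sketch of such a bottom-up pass is given below, assuming the update h(p) = α·s(p) + (1-α)·(mean of h(q) over the children q of p), a sitemap that is a tree, and leaves that keep their original score. The function name and the tree shape are hypothetical; the node labels are borrowed from the sitemap figure.

```python
def propagate_scores_bottom_up(scores, children, root, alpha):
    """Single post-order pass over the sitemap tree; each page is handled once."""
    h = {}
    stack = [(root, False)]
    while stack:
        page, expanded = stack.pop()
        kids = children.get(page, [])
        if not expanded and kids:
            stack.append((page, True))  # revisit once the children are done
            stack.extend((kid, False) for kid in kids)
        elif kids:
            mean_child = sum(h[kid] for kid in kids) / len(kids)
            h[page] = alpha * scores[page] + (1.0 - alpha) * mean_child
        else:
            h[page] = scores[page]  # leaf: nothing to propagate from below
    return h


# Hypothetical scores on a small sitemap tree.
scores = {"A1": 0.1, "A21": 0.2, "A22": 0.0, "A31": 0.6, "A32": 0.2, "A33": 1.0}
children = {"A1": ["A21", "A22"], "A21": ["A31", "A32"], "A22": ["A33"]}
h = propagate_scores_bottom_up(scores, children, root="A1", alpha=0.5)
```

Because the scores of the children are final before their parent is processed, no further iterations are needed, in contrast to the hyperlink-based models, which repeat a pass over all links t times.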


Online Complexity

Algorithm  Complexity  Average w  Average l  Average t  Average q  CPU time
HS-WI      O(twl)      6796.5     11.0       7.4        -          47.9
HS-WO      O(twl)      6796.5     11.0       6.5        -          36.5
HS-UO      O(twl)      6796.5     11.0       6.6        -          39.8
HF-WI      O(tqwl)     6796.5     11.0       9.1        1.5        54.0
HF-WO      O(tqwl)     6796.5     11.0       11.1       1.5        63.3
HF-UO      O(tqwl)     6796.5     11.0       8.9        1.5        51.6
SS         O(w)        10000.0    -          1          -          1.9
SF         O(qw)       10000.0    -          1          3          8.3

• The sitemap-based models are more efficient than the hyperlink-based models.
• The score-level propagation models are faster than the feature-level models.


Offline Complexity

• Score-level propagation is very difficult to implement offline
  - The score can only be computed online w.r.t. the query.
• For feature-level propagation
  - The time complexity of the SF model for offline implementation is acceptable
    • 62.2 hours, or 2.6 days, to re-index 8 billion pages
  - The time complexity of the HF model is out of tolerance
    • 1083 hours, or 45 days, to re-index 8 billion pages
  - The SF model is easy to parallelize, while a parallel implementation of the HF model is non-trivial
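The two re-indexing figures can be put side by side with a few lines of arithmetic. The Python snippet below only converts the hours quoted on the slide into days and computes their ratio; the variable names are hypothetical, and the hour counts themselves are taken from the slide.

```python
HOURS_PER_DAY = 24

sf_hours = 62.2    # SF model, re-indexing 8 billion pages (from the slide)
hf_hours = 1083.0  # HF model, same corpus size (from the slide)

sf_days = sf_hours / HOURS_PER_DAY
hf_days = hf_hours / HOURS_PER_DAY
slowdown = hf_hours / sf_hours  # how many times longer the HF model takes

print(f"SF: {sf_days:.1f} days, HF: {hf_days:.1f} days, HF/SF: {slowdown:.1f}x")
```

The ratio comes out at roughly 17, which is the gap behind calling the SF cost acceptable and the HF cost out of tolerance.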


Conclusions of this Study

• Generally speaking, relevance propagation can boost the performance of Web information retrieval.

• Sitemap-based propagation models outperform hyperlink-based propagation models in terms of both effectiveness and efficiency. Notably, sitemap-based propagation can be implemented in parallel.

• Score-level propagation and feature-level propagation have similar effectiveness. Although the former is more efficient in online implementations, it is not practical for real-world search engines because it cannot be implemented offline.

• Overall, the sitemap-based feature propagation model is the best choice for real search engines.

Thanks!

http://research.microsoft.com/users/tyliu/