Relevance Propagation for Web Search
Dr. Tie-Yan Liu
Web Search and Mining Group
Microsoft Research Asia
Joint Work with Tao Qin, Tsinghua University.
2006-3-13 DCWC 2006
Outline
• Introduction
• Generic framework for relevance propagation
• Evaluations
  - Effectiveness analysis
  - Complexity analysis
• Conclusions
Introduction
• Web Search ≠ Information Retrieval
  - Besides content relevance, various kinds of structure information also play an important role in Web search
    • Hyperlink graph
    • Local sitemap
    • Webpage layout
[Figure: a sitemap tree (nodes A1, A21, A22, A31, A32, A33) and a hyperlink graph (pages p1 to p5)]
Introduction
• Three ways of utilizing the structure information for Web search
  - Linear combination of content relevance and importance scores computed from the hyperlink graph
    • β∙Relevance + (1-β)∙PageRank
  - Enhance link analysis with the help of content relevance
    • Query-dependent link graph in HITS
    • Topic-sensitive PageRank
  - Propagate content relevance along the Web structure
    • The use of anchor text in search engines
    • Hyperlink-based relevance score propagation (TREC 2003)
    • Sitemap-based feature propagation (TREC 2004)
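The first option above, the linear combination β∙Relevance + (1-β)∙PageRank, can be sketched in a few lines of Python. The function name `combine`, the dictionary inputs, and the assumption that both scores are already normalized to a comparable range are illustrative choices, not details given in the talk.

```python
def combine(relevance, pagerank, beta=0.8):
    """Return beta * relevance + (1 - beta) * pagerank for every page.

    Assumption: both score dictionaries are already normalized to [0, 1];
    pages missing from `pagerank` are treated as having importance 0.
    """
    return {
        page: beta * rel + (1.0 - beta) * pagerank.get(page, 0.0)
        for page, rel in relevance.items()
    }


# Toy scores (hypothetical values, for illustration only).
relevance = {"p1": 0.9, "p2": 0.4}
pagerank = {"p1": 0.1, "p2": 0.8}

scores = combine(relevance, pagerank, beta=0.5)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With β close to 1 the ranking follows content relevance; with β close to 0 it follows link-based importance.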
Hyperlink-based Relevance Score Propagation (Zhai et al., TREC 2003)
• Assumption
  - Hyperlinked pages have correlated content
[Figure: a page with its inlinks and outlinks]
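• Propagation model

$$h^{k+1}(p) = \alpha\, s(p) + \beta \sum_{p_i \to p} h^{k}(p_i)\, \omega_I(p_i, p) + \gamma \sum_{p \to p_j} h^{k}(p_j)\, \omega_O(p, p_j), \qquad h^{0}(p) = s(p)$$

    • The three terms are the original relevance score, the propagation from the inlinks, and the propagation from the outlinks
    • $\omega_I(p_i, p) \propto s(p)$, $\omega_O(p, p_j) \propto s(p_j)$
  - Weighted inlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p_i \to p} h^{k}(p_i)\, \omega_I(p_i, p)$$

  - Weighted outlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p \to p_j} h^{k}(p_j)\, \omega_O(p, p_j)$$

  - Uniform outlink model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha) \sum_{p \to p_j} h^{k}(p_j)$$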
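A minimal Python sketch of the three simplified models, read here as h^{k+1}(p) = α·s(p) + (1-α)·Σ(neighbour scores × weight) with h^0(p) = s(p). The proportional weights are taken with a proportionality constant of 1, and the function name, the dictionary-based link graph, and the fixed iteration count are all assumptions made for illustration; this is not the authors' implementation.

```python
def propagate_scores(s, outlinks, alpha=0.5, model="WI", iters=10):
    """Iterate the score propagation update, starting from h^0 = s.

    s        : dict page -> original relevance score s(p)
    outlinks : dict page -> list of pages it links to (all must be keys of s)
    model    : "WI" (weighted inlink), "WO" (weighted outlink), "UO" (uniform outlink)
    """
    # Build the reverse graph so that inlinks[p] lists every p_i with p_i -> p.
    inlinks = {p: [] for p in s}
    for p, targets in outlinks.items():
        for q in targets:
            inlinks[q].append(p)

    h = dict(s)
    for _ in range(iters):
        new_h = {}
        for p in s:
            if model == "WI":
                # Weight of p_i -> p assumed equal to s(p).
                total = sum(h[pi] * s[p] for pi in inlinks[p])
            elif model == "WO":
                # Weight of p -> p_j assumed equal to s(p_j).
                total = sum(h[pj] * s[pj] for pj in outlinks.get(p, []))
            else:
                # "UO": every outlink contributes with weight 1.
                total = sum(h[pj] for pj in outlinks.get(p, []))
            new_h[p] = alpha * s[p] + (1 - alpha) * total
        h = new_h
    return h


# Toy chain a -> b -> c with hypothetical relevance scores.
s = {"a": 0.2, "b": 0.5, "c": 0.1}
outlinks = {"a": ["b"], "b": ["c"], "c": []}

h_wi = propagate_scores(s, outlinks, alpha=0.5, model="WI", iters=1)
h_wo = propagate_scores(s, outlinks, alpha=0.5, model="WO", iters=1)
h_uo = propagate_scores(s, outlinks, alpha=0.5, model="UO", iters=1)
```

With α = 1 every model returns the original scores unchanged, which corresponds to no propagation.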
Sitemap-based Feature Propagation (Liu and Qin, TREC 2004)
• Assumption
  - Child pages are extensions of their parent page
  - One should consider the contribution of the child pages when computing the relevance of the parent page to a query
• Propagation model

$$f'_t(p) = \alpha\, f_t(p) + \frac{1-\alpha}{|\mathrm{Child}(p)|} \sum_{q \in \mathrm{Child}(p)} f_t(q)$$

[Figure: sitemap tree with nodes A1, A21, A22, A31, A32, A33]
Generic Relevance Propagation Framework
• Modification of the sitemap-based feature propagation model

$$f'_t(p) = \alpha\, f_t(p) + \frac{1-\alpha}{|\mathrm{Child}(p)|} \sum_{q \in \mathrm{Child}(p)} f_t(q)$$

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha)\, \frac{1}{|\mathrm{Child}(p)|} \sum_{q \in \mathrm{Child}(p)} f_t^{k}(q)$$

• Reminder of the hyperlink-based propagation model

$$h^{k+1}(p) = \alpha\, s(p) + \beta \sum_{p_i \to p} h^{k}(p_i)\, \omega_I(p_i, p) + \gamma \sum_{p \to p_j} h^{k}(p_j)\, \omega_O(p, p_j), \qquad h^{0}(p) = s(p)$$

• A generic framework to cover both hyperlink-based and sitemap-based propagations

$$c^{k+1}(p) = g\big(c^{0}(p),\, c^{k}(N_p)\big)$$
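The generic update can be sketched as one loop that is parameterized by the combining function g. In this sketch c(p) is read as the quantity being propagated for page p (a score or a feature value) and N_p as the pages that propagate to p; both readings, the names `propagate`, `sitemap_g`, and `uniform_outlink_g`, and the rule that pages without children stay unchanged are assumptions made for illustration.

```python
def propagate(c0, neighbours, g, iters=1):
    """Apply c^{k+1}(p) = g(c^0(p), [c^k(q) for q in N_p]) for `iters` rounds."""
    c = dict(c0)
    for _ in range(iters):
        # The comprehension reads the previous round's values before `c` is rebound.
        c = {p: g(c0[p], [c[q] for q in neighbours.get(p, [])]) for p in c0}
    return c


def sitemap_g(alpha):
    """Sitemap-style g: mix the page's own value with the mean over its children."""
    def g(own, kids):
        if not kids:
            return own  # assumption: pages without children are left unchanged
        return alpha * own + (1 - alpha) * sum(kids) / len(kids)
    return g


def uniform_outlink_g(alpha):
    """Hyperlink-style g (uniform outlink case): mix own value with the sum over targets."""
    return lambda own, targets: alpha * own + (1 - alpha) * sum(targets)


# Toy structure: "root" has two children (or, read as links, two outlinks).
c0 = {"root": 1.0, "a": 4.0, "b": 2.0}
structure = {"root": ["a", "b"]}

after_sitemap = propagate(c0, structure, sitemap_g(0.5))
after_links = propagate(c0, structure, uniform_outlink_g(0.5))
```

Swapping g and the neighbour map is the only change needed to move between the sitemap-based and hyperlink-based variants, which is the point of the generic formulation.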
More Derived Propagation Models

$$c^{k+1}(p) = g\big(c^{0}(p),\, c^{k}(N_p)\big)$$

            Score level                                Feature level
Hyperlink   Hyperlink-based score propagation model    Hyperlink-based feature propagation model
Sitemap     Sitemap-based score propagation model      Sitemap-based feature propagation model

Hyperlink-based Feature Propagation Model
• Weighted inlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p_i \to p} f_t^{k}(p_i)\, \omega_{It}(p_i, p)$$

• Weighted outlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p \to p_j} f_t^{k}(p_j)\, \omega_{Ot}(p, p_j)$$

• Uniform outlink model

$$f_t^{k+1}(p) = \alpha\, f_t^{0}(p) + (1-\alpha) \sum_{p \to p_j} f_t^{k}(p_j)$$

Sitemap-based Score Propagation Model

$$h^{k+1}(p) = \alpha\, s(p) + (1-\alpha)\, \frac{1}{|\mathrm{Child}(p)|} \sum_{q \in \mathrm{Child}(p)} h^{k}(q)$$
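Feature-level propagation differs from score-level propagation in that the update runs once per query term on term frequencies, and the ranking function is applied afterwards. The sketch below illustrates this for the uniform outlink case, read as f_t^{k+1}(p) = α·f_t^0(p) + (1-α)·Σ_{p→p_j} f_t^k(p_j); the function name, the nested-dictionary layout of term frequencies, and the toy data are assumptions for illustration.

```python
def propagate_term_frequencies(tf, outlinks, query_terms, alpha=0.5, iters=1):
    """Uniform-outlink feature propagation, run separately for each query term.

    tf          : dict page -> dict term -> raw term frequency
    outlinks    : dict page -> list of pages it links to
    query_terms : terms whose frequencies should be propagated
    Returns dict term -> dict page -> propagated frequency.
    """
    result = {}
    for term in query_terms:
        f0 = {page: float(counts.get(term, 0)) for page, counts in tf.items()}
        f = dict(f0)
        for _ in range(iters):
            f = {
                page: alpha * f0[page]
                + (1 - alpha) * sum(f[q] for q in outlinks.get(page, []))
                for page in f0
            }
        result[term] = f
    return result


# Toy data: "home" links to "doc"; frequencies are hypothetical.
tf = {"home": {"search": 1}, "doc": {"search": 3, "rank": 2}}
outlinks = {"home": ["doc"]}

propagated = propagate_term_frequencies(tf, outlinks, ["search", "rank"], alpha=0.5)
```

The propagated frequencies would then replace the raw ones inside the ranking function. The outer loop over query terms is where the extra factor q in the feature-level complexity comes from.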
Summary: All Models Covered by the Generic Framework
Algorithm                                                             Abbreviation
Weighted in-link case of hyperlink-based score propagation model      HS-WI
Weighted out-link case of hyperlink-based score propagation model     HS-WO
Uniform out-link case of hyperlink-based score propagation model      HS-UO
Weighted in-link case of hyperlink-based feature propagation model    HF-WI
Weighted out-link case of hyperlink-based feature propagation model   HF-WO
Uniform out-link case of hyperlink-based feature propagation model    HF-UO
Sitemap-based score propagation model                                 SS
Sitemap-based feature propagation model                               SF
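Benchmark Datasets
• Corpora
  - .GOV
    • 1M pages
    • Queries: TD 2003, 2004
  - MSN
    • 2M pages
    • Queries: 100 most popular queries from the MSN query log
• Base ranking function
  - BM2500

$$\sum_{T \in Q} w^{(1)}\, \frac{(k_1 + 1)\, tf\, (k_3 + 1)\, qtf}{(K + tf)(k_3 + qtf)}$$

    • $w^{(1)}$ is the Robertson/Sparck Jones weight of term T; tf and qtf are the frequencies of T in the page and in the query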
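For reference, a minimal Python sketch of the Okapi BM2500 weighting in its standard form: a sum over query terms T of w·(k1+1)·tf/(K+tf)·(k3+1)·qtf/(k3+qtf), with K = k1·((1-b) + b·dl/avdl) and w the Robertson/Sparck Jones weight without relevance information. The parameter defaults (k1 = 1.2, b = 0.75, k3 = 1000) and the argument names are assumptions; the talk does not state the settings that were used.

```python
import math


def bm2500(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
           k1=1.2, k3=1000.0, b=0.75):
    """Score one document against one query with the Okapi BM2500 formula.

    query_tf : dict term -> frequency of the term in the query (qtf)
    doc_tf   : dict term -> frequency of the term in the document (tf)
    doc_freq : dict term -> number of documents containing the term (n)
    """
    # Document length normalization factor K.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq[term]
        # Robertson/Sparck Jones weight with no relevance information.
        w = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


# Toy example: one query term, document of average length (hypothetical counts).
example = bm2500(
    query_tf={"search": 1},
    doc_tf={"search": 1},
    doc_len=100,
    avg_doc_len=100,
    doc_freq={"search": 10},
    num_docs=100,
)
```

In feature-level propagation the `doc_tf` values would be the propagated term frequencies; in score-level propagation the returned score plays the role of s(p).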
Experimental Results (1)
TREC 2003
[Figure: P@10 and MAP as functions of α (0 to 1) for the SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO, and HF-UO models]
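The two measures reported in these results, P@10 and MAP, follow their usual definitions: precision over the top 10 results, and the mean over queries of average precision. A small Python sketch, with function names and toy data chosen only for illustration:

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy query: relevant documents appear at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}

p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```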
Experimental Results (2)
TREC 2004
[Figure: P@10 and MAP as functions of α (0 to 1) for the SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO, and HF-UO models]
Experimental Results (3)
MSN
[Figure: P@10 and MAP as functions of α (0 to 1) for the SF, SS, HS-WI, HS-WO, HS-UO, HF-WI, HF-WO, and HF-UO models]
Conclusions on Effectiveness
• In general, relevance propagation can boost search performance with proper parameter settings.
• The sitemap-based models are more effective than the hyperlink-based models.
  - Hyperlinks ≠ content correlation, while the pages in the same sub-site usually talk about correlated topics.
• Detailed comparisons
  - The two sitemap-based models have similar performance.
  - Among the hyperlink-based models, the HF-WI model performs best.
Online Complexity
• w is the size of the working set, q is the number of query terms, l is the average number of inlinks / outlinks, and t is the number of iterations.
• For the SS model, the complexity is O(w).
  - The SS model needs to propagate the relevance score of a page to its parent only once if we conduct the propagation from the leaf nodes in a bottom-up manner.
• For the SF model, the complexity is O(qw).
• For the HS models, the complexity is O(twl).
  - In each of the t iterations of the HS models, we need to propagate the relevance score of a page along its in-links or out-links in the subgraph of the working set.
• For the HF models, the complexity is O(tqwl).
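The O(w) bound for the SS model rests on visiting each page once, children before parents. A minimal sketch of such a bottom-up pass, assuming the working set forms a single tree under one root, that the update is h(p) = α·s(p) + (1-α)·(mean of the children's propagated scores), and that leaf pages keep their original score; the function name and these conventions are illustrative assumptions.

```python
def sitemap_score_bottom_up(s, children, root, alpha=0.5):
    """Propagate scores from the leaves to the root in one post-order pass.

    s        : dict page -> original relevance score
    children : dict page -> list of child pages in the sitemap tree
    Every page is pushed and finalized once, so the work is linear in the
    number of pages in the tree.
    """
    h = {}
    stack = [(root, False)]
    while stack:
        page, expanded = stack.pop()
        kids = children.get(page, [])
        if not expanded:
            # Revisit the page after all of its children have been finalized.
            stack.append((page, True))
            stack.extend((kid, False) for kid in kids)
        elif kids:
            h[page] = alpha * s[page] + (1 - alpha) * sum(h[k] for k in kids) / len(kids)
        else:
            h[page] = s[page]  # assumption: leaves keep their original score
    return h


# Toy sitemap: root -> {a, b}, a -> {a1}; scores are hypothetical.
s = {"root": 0.2, "a": 0.6, "a1": 1.0, "b": 0.4}
children = {"root": ["a", "b"], "a": ["a1"]}

h = sitemap_score_bottom_up(s, children, "root", alpha=0.5)
```

The hyperlink-based models cannot use this shortcut because a general link graph has cycles, which is why they iterate t times over the links.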
Online Complexity
Algorithm  Complexity  Average w  Average l  Average t  Average q  CPU time
HS-WI      O(twl)      6796.5     11.0       7.4        -          47.9
HS-WO      O(twl)      6796.5     11.0       6.5        -          36.5
HS-UO      O(twl)      6796.5     11.0       6.6        -          39.8
HF-WI      O(tqwl)     6796.5     11.0       9.1        1.5        54.0
HF-WO      O(tqwl)     6796.5     11.0       11.1       1.5        63.3
HF-UO      O(tqwl)     6796.5     11.0       8.9        1.5        51.6
SS         O(w)        10000.0    -          1          -          1.9
SF         O(qw)       10000.0    -          1          3          8.3
The sitemap-based models are more efficient than the hyperlink-based models
The score-level propagation models are faster than feature-level models
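As a rough sanity check, the complexity expressions can be evaluated with the averages from the table. This is only a back-of-the-envelope count of propagation steps, assuming each factor enters as a plain product; constant factors differ between models, so the counts are not expected to be proportional to the measured CPU times. The helper name is hypothetical.

```python
def estimated_operations(w, l=1.0, t=1.0, q=1.0):
    """Evaluate t * q * w * l; factors a model does not use default to 1."""
    return t * q * w * l


# Averages copied from the table above.
hs_wi = estimated_operations(w=6796.5, l=11.0, t=7.4)         # O(twl)
hf_wi = estimated_operations(w=6796.5, l=11.0, t=9.1, q=1.5)  # O(tqwl)
ss = estimated_operations(w=10000.0)                          # O(w)
sf = estimated_operations(w=10000.0, q=3)                     # O(qw)

# Order the four models by estimated work, cheapest first.
order = sorted(
    {"HS-WI": hs_wi, "HF-WI": hf_wi, "SS": ss, "SF": sf}.items(),
    key=lambda item: item[1],
)
```

The resulting order (SS, SF, HS-WI, HF-WI) matches the order of the CPU times in the table.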
Offline Complexity
• Score-level propagation is very difficult to implement offline.
  - The score can only be computed online w.r.t. the query.
• For feature-level propagation,
  - The time complexity of the SF model for offline implementation is acceptable.
    • 62.2 hours, or 2.6 days, to re-index 8 billion pages
  - The time complexity of the HF model is beyond what can be tolerated.
    • 1083 hours, or 45 days, to re-index 8 billion pages
  - The SF model is easy to implement in parallel, while a parallel implementation of the HF model is non-trivial.
Conclusions of this Study
• Generally speaking, relevance propagation can boost the performance of web information retrieval.
• Sitemap-based propagation models outperform hyperlink-based propagation models in terms of both effectiveness and efficiency. Notably, sitemap-based propagation can be implemented in parallel.
• Score-level propagation and feature-level propagation have similar effectiveness. Although the former is more efficient in online implementations, it is not practical for real-world search engines because it cannot be implemented offline.
• Overall, the sitemap-based feature propagation model is the best choice for real search engines.
Thanks!
[email protected]
http://research.microsoft.com/users/tyliu/