Searching the Web Better
Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology
Outline
- Introduction
- Main Techniques (RSCF)
  - Clickthrough Data
  - Ranking Support Vector Machine Algorithm
  - Ranking SVM in a Co-training Framework
- The RSCF-based Metasearch Engine
  - Search Engine Components
  - Feature Extraction
  - Experiments
- Current Development
Search Engine Adaptation
- Underlying engines: Google, MSNsearch, Wisenut, Overture, …
- Target domains: Computer Science, Finance, Social Science
- Adapt the search engine by learning from implicit feedback: clickthrough data
[Slide figure: a search engine adapted to different query domains, e.g. CS terms, Product, News]
Clickthrough Data
- Clickthrough data: data indicating which links in the returned ranking results were clicked by the user
- Formally, a triplet (q, r, c):
  - q – the input query
  - r – the ranking result presented to the user
  - c – the set of links the user clicked on
- Benefits:
  - Can be obtained in a timely way
  - No intervention in the search activity
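The (q, r, c) triplet above maps naturally onto a small record type. A minimal sketch; the class and field names are my own, not part of the talk's system:

```python
# Hypothetical container for one clickthrough record (q, r, c).
from dataclasses import dataclass, field

@dataclass
class ClickthroughRecord:
    query: str                                  # q: the input query
    ranking: list                               # r: links in presented order
    clicked: set = field(default_factory=set)   # c: links the user clicked

rec = ClickthroughRecord(
    query="support vector machine",
    ranking=["l1", "l2", "l3", "l4", "l5"],
    clicked={"l1", "l3"},
)
# The clicked set is always a subset of the presented ranking.
assert rec.clicked <= set(rec.ranking)
```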
Target Ranking (Preference Pairs Set)

Arising from l1: (empty set)
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
An Example of Clickthrough Data
[Slide figure: a results page for the user's input query; links l1–l10 form the labelled data set, with l1, l7 and l10 clicked by the user; links l11 onward form the unlabelled data set]
Target Ranking (Preference Pairs Set)

Arising from l1: (empty set)
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9

Labelled data set: l1, l2, …, l10
Unlabelled data set: l11, l12, …
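The pairs in the table follow Joachims' rule: a clicked link is preferred over every unclicked link ranked above it. A sketch of that extraction (function name is my own):

```python
# Derive preference pairs from one clickthrough record, following the rule
# used on this slide: a clicked link is preferred over each unclicked link
# that was presented above it.
def preference_pairs(ranking, clicked):
    """Return pairs (a, b) meaning a <r b, i.e. a is preferred over b."""
    pairs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            pairs.extend((link, above) for above in ranking[:i]
                         if above not in clicked)
    return pairs

ranking = [f"l{i}" for i in range(1, 11)]   # l1 .. l10
clicked = {"l1", "l7", "l10"}
pairs = preference_pairs(ranking, clicked)
```

Running this on the slide's example reproduces the table: no pairs arise from l1, five from l7, and seven from l10.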
The Ranking SVM Algorithm
- Three links l1, l2, l3, each described by a feature vector
- Target ranking: l1 <r' l2 <r' l3
- Weight vector: the ranker; links are ordered by their projections onto it
- The margin is the distance between the two closest projected links
- Con: it needs a large set of labelled data
[Slide figure: two candidate weight vectors with the projections l1', l2', l3' of the three links; the vector giving the wider margin between the closest projections is preferred]
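Ranking SVM reduces ranking to binary classification: each preference a <r b becomes the constraint w·(φ(a) − φ(b)) ≥ 1. The talk's system would use a real SVM solver (e.g. SVMlight); as a hedged stand-in, here is a plain subgradient descent on the pairwise hinge loss:

```python
# Minimal sketch of the Ranking SVM reduction.  A production system would use
# an SVM solver; this subgradient loop on the hinge loss is my own stand-in.
def train_ranker(pairs, features, epochs=200, lr=0.01, C=1.0):
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b in pairs:                      # a is preferred over b
            diff = [fa - fb for fa, fb in zip(features[a], features[b])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:                    # hinge-loss violation
                w = [wi + lr * C * di for wi, di in zip(w, diff)]
        w = [wi * (1.0 - lr) for wi in w]       # regularize: shrink ||w||
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data matching the slide: target ranking l1 <r' l2 <r' l3 (l1 best).
features = {"l1": [3.0, 0.1], "l2": [2.0, 0.9], "l3": [1.0, 0.5]}
pairs = [("l1", "l2"), ("l1", "l3"), ("l2", "l3")]
w = train_ranker(pairs, features)
scores = {l: score(w, x) for l, x in features.items()}
```

After training, ranking the links by their projected scores recovers the target order.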
The Ranking SVM in a Co-training Framework
- Divide the feature vector into two subvectors
- Build two rankers over the two feature subvectors
- Each ranker chooses several unlabelled preference pairs it is confident about and adds them to the labelled data set
- Rebuild each ranker from the augmented labelled data set
[Slide figure: labelled preference feedback pairs P_l and unlabelled preference pairs P_u; rankers a_A and a_B are trained on P_l, each selects confident pairs from P_u, and both are retrained on the augmented pairs]
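The loop above can be sketched as follows. The "ranker" here is deliberately trivial (the mean of the preference-difference vectors) so the example stays self-contained; the talk uses Ranking SVM in this role, and all names are my own:

```python
# Compact sketch of the co-training loop over two feature views.
def fit_view(pairs, feats):
    """Toy 'ranker' for one view: the mean of phi(a) - phi(b) over pairs."""
    dim = len(next(iter(feats.values())))
    w = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            w[i] += feats[a][i] - feats[b][i]
    return [wi / max(len(pairs), 1) for wi in w]

def margin(w, feats, a, b):
    return sum(wi * (fa - fb) for wi, fa, fb in zip(w, feats[a], feats[b]))

def co_train(labelled, unlabelled, view_a, view_b, rounds=3):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(rounds):
        w_a = fit_view(labelled, view_a)
        w_b = fit_view(labelled, view_b)
        # Each ranker takes the unlabelled pair it is most confident about
        # (largest |margin|), orients it, and adds it to the labelled set.
        for w, feats in ((w_a, view_a), (w_b, view_b)):
            if not unlabelled:
                break
            a, b = max(unlabelled, key=lambda p: abs(margin(w, feats, *p)))
            unlabelled.remove((a, b))
            labelled.append((a, b) if margin(w, feats, a, b) >= 0 else (b, a))
    # Final rankers are rebuilt from the augmented labelled set.
    return labelled, fit_view(labelled, view_a), fit_view(labelled, view_b)

# Toy demo: both views agree that l1 > l2 > l3 > l4.
view_a = {"l1": [4.0], "l2": [3.0], "l3": [2.0], "l4": [1.0]}
view_b = {"l1": [8.0], "l2": [6.0], "l3": [4.0], "l4": [2.0]}
labelled, w_a, w_b = co_train([("l1", "l2")], [("l3", "l4"), ("l4", "l2")],
                              view_a, view_b)
```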
Some Issues
- Guideline for partitioning the feature vector: after the partition, each subvector must be sufficient for the later ranking
- Number of rankers: depends on the number of features
- When to terminate the procedure? Prediction difference: indicates the ranking difference between the two rankers
- After termination, train a final ranker on the augmented labelled data set
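One natural way to realize the prediction-difference signal is the fraction of candidate pairs on which the two rankers disagree about the order. A sketch under that assumption (the talk does not spell out the exact formula):

```python
# Prediction difference between two rankers: the fraction of pairs whose
# order they disagree on.  score_a / score_b are assumed per-link scorers.
def prediction_difference(pairs, score_a, score_b):
    disagreements = sum(
        1 for a, b in pairs
        if (score_a(a) - score_a(b)) * (score_b(a) - score_b(b)) < 0
    )
    return disagreements / len(pairs)

score_a = {"l1": 3.0, "l2": 2.0, "l3": 1.0}.get
score_b = {"l1": 3.0, "l2": 1.0, "l3": 2.0}.get
pairs = [("l1", "l2"), ("l1", "l3"), ("l2", "l3")]
diff = prediction_difference(pairs, score_a, score_b)
```

Here the rankers disagree only on the order of l2 and l3, giving a difference of 1/3.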
Metasearch Engine
- Receives a query from the user
- Sends the query to multiple search engines
- Combines the results retrieved from the underlying search engines
- Presents a unified ranking result to the user
[Slide figure: the user's query flows through the metasearch engine to Search Engines 1…n; Retrieved Results 1…n are combined into the unified ranking result]
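The four steps above can be sketched as a fan-out-and-merge. The engine stubs below are placeholders (a real system would issue HTTP requests), and the round-robin merge is my own simplification; the talk's engine instead learns the merged order from clickthrough data:

```python
# Sketch of the metasearch flow: query all engines in parallel, then
# interleave their result lists into one unified ranking.
from concurrent.futures import ThreadPoolExecutor

def engine_1(q): return [f"{q}-e1-{i}" for i in range(3)]   # stub engine
def engine_2(q): return [f"{q}-e2-{i}" for i in range(3)]   # stub engine

def metasearch(query, engines):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    unified, seen = [], set()
    # Round-robin merge with de-duplication across engines.
    for rank in range(max(map(len, result_lists))):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                unified.append(results[rank])
    return unified

unified = metasearch("svm", [engine_1, engine_2])
```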
Search Engine Components
- MSNsearch: powered by Inktomi, relatively mature; one of the most powerful search engines nowadays
- Wisenut: a new but growing search engine
- Overture: ranks links based on the prices paid by the sponsors of the links
Feature Extraction
- Ranking features (12 binary features): Rank(E,T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10} (M: MSNsearch, W: Wisenut, O: Overture); indicate the ranking of the link in each underlying search engine
- Similarity features (4 features): Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a) – URL, Title, Abstract Cover, Abstract Group; indicate the similarity between the query and the link
Experiments
- Experiment data: within the same domain – computer science
- Objectives:
  - Offline experiments – compared with RSVM
  - Online experiments – compared with Google
Prediction Error
- Prediction error: the difference between the ranker's ranking and the target ranking
- Target ranking: l1 <r' l2, l1 <r' l3, l2 <r' l3
- Projected ranking: l2 <r' l1, l1 <r' l3, l2 <r' l3
- Prediction error = 1/3 ≈ 33% (one of the three pairs is reversed)
[Slide figure: links l1, l2, l3 and their projections l1', l2', l3' onto the ranker]
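The measure above is just the fraction of target preference pairs that the ranker's projected ordering gets wrong; a short sketch reproducing the slide's 33% example:

```python
# Prediction error: fraction of target pairs the projected ranking reverses.
def prediction_error(target_pairs, projected_rank):
    """target_pairs: (a, b) meaning a should rank above b.
    projected_rank: dict link -> position under the learned ranker."""
    wrong = sum(1 for a, b in target_pairs
                if projected_rank[a] > projected_rank[b])
    return wrong / len(target_pairs)

# The slide's example: target l1 < l2 < l3, but the projection swaps l1, l2.
target = [("l1", "l2"), ("l1", "l3"), ("l2", "l3")]
projected = {"l2": 1, "l1": 2, "l3": 3}
err = prediction_error(target, projected)   # one of three pairs is wrong
```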
Offline Experiment (Compared with RSVM)
- R: the ranker trained by the RSVM algorithm on the whole feature vector
- A, B: the rankers trained by the RSCF algorithm on the two feature subvectors
[Slide figure: prediction error of R, A and B over iterations, for 10, 30 and 60 training queries]
- The prediction error eventually rises again: the useful number of iterations in the RSCF algorithm is about four to five
Offline Experiment (Compared with RSVM)
- R: the ranker trained by the RSVM algorithm
- C: the final ranker trained by the RSCF algorithm
[Slide figure: overall comparison of prediction error between R and C]
Online Experiment (Compared with Google)
- Experiment data: CS terms, e.g. radix sort, TREC collection, …
- Experiment setup:
  - Combine the results returned by RSCF and those returned by Google into one shuffled list
  - Present it to the users in a unified way and record the users' clicks

Cases:    More clicks on RSCF | More clicks on Google | Tie | No clicks | Total
Queries:           26         |          17           | 13  |     2     |  58
Experimental Analysis

Feature      Weight     Feature      Weight
Rank(M,1)    0.1914     Rank(W,1)    0.0184
Rank(M,3)    0.2498     Rank(W,3)    0.1014
Rank(M,5)    0.1152     Rank(W,5)   -0.3021
Rank(M,10)   0.2498     Rank(W,10)  -0.4367
Rank(O,1)   -0.1673     Sim_U(q,l)   0.5382
Rank(O,3)   -0.1229     Sim_T(q,t)   0.4928
Rank(O,5)   -0.4976     Sim_C(q,a)   0.4136
Rank(O,10)   0.4441     Sim_G(q,a)   0.5010
Conclusion on RSCF
- Search engine adaptation
- The RSCF algorithm:
  - Trains on clickthrough data
  - Applies RSVM in the co-training framework
- The RSCF-based metasearch engine:
  - Offline experiments – better than RSVM
  - Online experiments – better than Google
Current Development
- Feature extraction and division
- Application in different domains
- Search engine personalization
- SpyNoby Project: a personalized search engine with clickthrough analysis
Modified Target Ranking for Metasearch Engines
- If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
- Advantages:
  - Alleviates the penalty on high-ranked links
  - Gives more credit to the ranking ability of the underlying search engines
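Under the modified rule quoted above, a clicked link is additionally preferred over the unclicked links between it and the next clicked link. A sketch of just that extra pair generation; checking the same-engine condition itself is omitted and assumed already verified:

```python
# Extra preference pairs under the modified rule: the clicked link is
# preferred over the unclicked links between it and the next clicked link.
def pairs_below_until_next_click(ranking, clicked_link, next_clicked, clicked):
    i = ranking.index(clicked_link)
    j = ranking.index(next_clicked)
    return [(clicked_link, mid) for mid in ranking[i + 1:j]
            if mid not in clicked]

ranking = [f"l{i}" for i in range(1, 11)]
clicked = {"l1", "l7", "l10"}
extra = pairs_below_until_next_click(ranking, "l1", "l7", clicked)
```

For the slide's example this yields exactly l1 <r l2, …, l1 <r l6.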
Modified Target Ranking

Arising from l1: l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9

Labelled data set: l1, l2, …, l10
Unlabelled data set: l11, l12, …
RSCF-based Metasearch Engine – MEA
[Slide figure: the user's query q is sent to MEA's underlying engines; their top-30 result lists are combined into the unified ranking result]
RSCF-based Metasearch Engine – MEB
[Slide figure: the user's query q is sent to MEB's underlying engines; their top-30 result lists are combined into the unified ranking result]
Generating Clickthrough Data
- Probability of being clicked on:

  Pr(k) = (1 / k^ν) / H_n(ν),   where H_n(ν) = Σ_{i=1}^{n} 1 / i^ν

  - k: the ranking of the link in the metasearch engine
  - n: the number of all the links in the metasearch engine
  - ν: the skewness parameter in Zipf's law
  - H_n(ν): the generalized harmonic number
- Judge the link's relevance manually:
  - If the link is irrelevant, it will not be clicked on
  - If the link is relevant, it has probability Pr(k) of being clicked on
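The click-probability model is a Zipf distribution over ranks; in code:

```python
# Pr(k) = k^(-nu) / H_n(nu): Zipf-distributed click probability over ranks,
# with H_n(nu) the generalized harmonic number.
def harmonic(n, nu):
    return sum(1.0 / i ** nu for i in range(1, n + 1))

def click_probability(k, n, nu):
    return (1.0 / k ** nu) / harmonic(n, nu)

n, nu = 30, 1.0
probs = [click_probability(k, n, nu) for k in range(1, n + 1)]
# Probabilities over all n ranks sum to 1 and fall off with rank.
```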
Feature Extraction
- Ranking features (binary features): Rank(E,T): whether the link is ranked within S_T in E, where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}; S1 = {1}, S3 = {2, 3}, S5 = {4, 5}, S10 = {6, 7, 8, 9, 10}, … (G: Google, M: MSNsearch, W: Wisenut, O: Overture); indicate the ranking of the link in each underlying search engine
- Similarity features (4 features): Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a); measure the similarity between the query and the link
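A sketch of the binary ranking features: Rank(E,T) fires when the link's position in engine E falls inside the bucket S_T. The bucket boundaries up to S10 follow the slide; extending the pattern to S15–S30 is my assumption:

```python
# Binary Rank(E,T) features from the link's per-engine positions.
# Buckets past S10 follow the slide's "..." pattern and are an assumption.
BUCKETS = {1: {1}, 3: {2, 3}, 5: {4, 5}, 10: set(range(6, 11)),
           15: set(range(11, 16)), 20: set(range(16, 21)),
           25: set(range(21, 26)), 30: set(range(26, 31))}

def rank_features(positions, engines=("G", "M", "W", "O")):
    """positions: dict engine -> rank of the link in that engine (or absent)."""
    feats = {}
    for e in engines:
        for t, bucket in BUCKETS.items():
            feats[f"Rank({e},{t})"] = int(positions.get(e) in bucket)
    return feats

feats = rank_features({"G": 1, "M": 7, "W": 2})   # link absent from Overture
```

Exactly one bucket fires per engine the link appears in; all Overture features stay 0 here.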
Experiments
- Experiment data: three different domains – CS terms, News, E-shopping
- Objectives:
  - Prediction error – better than RSVM
  - Top-k precision – adaptation ability
Top-k Precision
- Advantages:
  - Precision is easier to obtain than recall
  - Users care only about the top-k links (k = 10)
- Evaluation data: 30 queries in each domain
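Top-k precision as described above is the fraction of the k highest-ranked links judged relevant (k = 10 in the experiments); a short sketch:

```python
# Top-k precision: relevant links among the top k, divided by k.
def top_k_precision(ranking, relevant, k=10):
    top = ranking[:k]
    return sum(1 for link in top if link in relevant) / len(top)

ranking = [f"l{i}" for i in range(1, 31)]
relevant = {"l1", "l2", "l4", "l9", "l12"}
p = top_k_precision(ranking, relevant, k=10)   # 4 of the top 10 are relevant
```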
Statistical Analysis
- Hypothesis testing (two-sample hypothesis testing about means): used to analyze whether there is a statistically significant difference between the means of two samples

Comparison between engines:

Comparison           News  E-Shopping  CS terms  Combined
MEA vs Google          ≈        >         ≈         >
MEA vs MSNsearch       ≈        >         ≈         >
MEA vs Overture        >        ≈         >         >
MEA vs Wisenut         ≈        >         >         >
MEB vs Google          ≈        >         ≈         >
MEB vs MSNsearch       ≈        ≈         >         >
MEB vs Overture        >        ≈         >         >
MEB vs Wisenut         ≈        >         >         >
MEA vs MEB             ≈        ≈         ≈         ≈
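A sketch of the two-sample test behind the table, using Welch's t statistic against the large-sample 5% critical value 1.96. The slide does not specify the exact test variant, so this is an assumption; a real analysis would use the t distribution's exact critical value for the estimated degrees of freedom:

```python
# Two-sample comparison of mean precisions: Welch's t statistic, with a
# normal-approximation critical value (an assumption of this sketch).
def welch_t(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return (mx - my) / (vx / len(xs) + vy / len(ys)) ** 0.5

def compare(xs, ys, critical=1.96):
    t = welch_t(xs, ys)
    if t > critical:  return ">"   # first engine significantly better
    if t < -critical: return "<"   # second engine significantly better
    return "≈"                     # no significant difference

a = [0.62, 0.65, 0.61, 0.66, 0.63, 0.64]   # illustrative precision samples
b = [0.48, 0.50, 0.47, 0.52, 0.49, 0.51]
```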
Comparison Results
- MEA can produce better search quality than Google
- Google does not excel in every query category
- MEA and MEB are able to adapt and bring out the strengths of each underlying search engine
- MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category
- The RSCF-based metasearch engine:
  - Comparison of prediction error – better than RSVM
  - Comparison of top-k precision – adaptation ability
Spy Naïve Bayes – Motivation
- The problem of Joachims' method: strong assumptions, and it excessively penalizes high-ranked links (l1, l2, l3 are apt to appear on the right-hand side of preference pairs, while l7 and l10 appear on the left)
- New interpretation of clickthrough data:
  - Clicked – positive (P)
  - Unclicked – unlabeled (U), containing both positive and negative samples
- Goal: identify reliable negatives (RN) from U, yielding pairs lp <r ln

Arising from l1: (empty set)
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
Spy Naïve Bayes: Ideas
- Standard naïve Bayes: classify positive and negative samples
- One-step spy naïve Bayes: spying out RN from U
  - Put a small number of positive samples into U to act as "spies" (to scout the behavior of real positive samples in U)
  - Take U as negative samples to train a naïve Bayes classifier
  - Samples with lower probabilities of being positive are assigned to RN
- Voting procedure: makes spying more robust
  - Run one-step SpyNB n times to get n sets RNi
  - A sample that appears in at least m (m ≲ n) of the sets RNi appears in the final RN
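The one-step procedure above can be sketched on binary feature vectors, with a hand-rolled Bernoulli naïve Bayes standing in for a full implementation. Thresholding at the minimum spy posterior is one common choice; the exact rule in the talk's system may differ, and the voting step would simply repeat this with different spy draws:

```python
# One-step SpyNB sketch: plant spies from P into U, train NB with U (plus
# spies) as "negative", then mark samples scoring below every spy as RN.
import random

def bernoulli_nb(pos, neg):
    """Return a function giving P(positive | x), with Laplace smoothing."""
    dim = len(pos[0])
    p_f_pos = [(sum(x[i] for x in pos) + 1) / (len(pos) + 2) for i in range(dim)]
    p_f_neg = [(sum(x[i] for x in neg) + 1) / (len(neg) + 2) for i in range(dim)]
    prior = len(pos) / (len(pos) + len(neg))
    def posterior(x):
        lp, ln = prior, 1 - prior
        for i, v in enumerate(x):
            lp *= p_f_pos[i] if v else 1 - p_f_pos[i]
            ln *= p_f_neg[i] if v else 1 - p_f_neg[i]
        return lp / (lp + ln)
    return posterior

def spy_nb_once(P, U, spy_ratio=0.2, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(P) * spy_ratio))
    spy_idx = set(rng.sample(range(len(P)), k))
    spies = [P[i] for i in spy_idx]
    rest = [P[i] for i in range(len(P)) if i not in spy_idx]
    posterior = bernoulli_nb(rest, U + spies)        # U + spies as "negative"
    threshold = min(posterior(s) for s in spies)     # spies set the bar
    return [u for u in U if posterior(u) < threshold]  # reliable negatives

# Positives mostly fire feature 0; true negatives mostly fire feature 1.
P = [(1, 0), (1, 0), (1, 0), (1, 1), (1, 0)]
U = [(1, 0), (0, 1), (0, 1), (0, 1)]   # one hidden positive, three negatives
RN = spy_nb_once(P, U)
```

The hidden positive in U scores like the spies and so escapes RN, while the three true negatives fall below the spy threshold.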
My Publications
1. Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. International Journal of Information Processing & Management, 43(1), pp. 290-292, (2007).
2. Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
3. Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal – Special Issue on Web Dynamics, (2005).
4. Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language for Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pp. 567-581, (2004).
5. Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
6. Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pp. 519-532, (2004).
7. Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pp. 236-241, (2003).
8. Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A book chapter in "Semantic Issues in E-Commerce Systems", edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pp. 155-170, (2003).