Clustering Personalized Web Search Results

Xuehua Shen and Hong Cheng

Introduction

• Search engine’s objectives
  – Rank most relevant search results at top
    • Effectiveness
    • PageRank / HITS
  – Group and present different categories of search results
    • Global view
    • Clustering

Clustering Personalized Search Results

• Study the clustering problem in the UCAIR framework

• Personalized search ranks or reranks the search results based on user implicit feedback

• This raises interesting problems
  – Efficient and effective clustering/presentation
  – Dynamically update the clustering results based on personalization

Goal

• Effective
  – Cluster user search results into meaningful groups
  – Present in a clear format
  – Provide users with main themes of search results
• Efficient
  – Implement efficient clustering algorithms
• Dynamic
  – Dynamically maintain the clustering results based on personalized ranking and reranking

Progress

• Implemented two clustering algorithms
  – K-Medoids
  – Hierarchical clustering
• Presentation
  – Replace Google ads with clustering results
  – Present ranked results together with clustering results
  – Two presentation strategies
    • Most centrally located document in each cluster
    • Most frequent terms in each cluster

Partial Results

• K-Medoids
  – Select the most centrally located documents as cluster centers
  – Present the center (medoid) documents as each cluster’s representative
  – Efficiency is poor: clustering takes longer than all other processing combined
    • Other processing time: 490 + 100 + 1562 = 2152 ms
    • Cluster search results time: 2844 ms
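
The K-Medoids steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are bags of words compared by cosine distance, seeds the medoids with the first k results, and uses the simple alternating "assign, then re-pick the medoid" update rather than a full swap search. All function names are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def assign(dist, medoids):
    """Label each document with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(docs, k, max_iter=20):
    """Return (medoid document indices, cluster label per document)."""
    n = len(docs)
    dist = [[cosine_distance(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))  # assumption: seed with the top-k ranked results
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster emptied out
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance to the rest.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Toy search results for the ambiguous query "jaguar" (made-up snippets).
docs = [Counter(s.split()) for s in (
    "jaguar car engine speed",
    "jaguar cat jungle prey",
    "car engine dealer price",
    "cat jungle habitat wild",
)]
medoids, labels = k_medoids(docs, k=2)
print(labels)  # [0, 1, 0, 1]
```

The document at each returned medoid index is what would be shown as the cluster's representative.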

Partial Results (II)

• Hierarchical clustering
  – Merge similar documents in a pair-wise manner
  – Use weighted average term vectors to represent cluster centers
  – Present centroid term vectors as virtual documents (output Top-K terms)
  – Efficiency better than K-Medoids
    • Other processing time: 200 + 110 + 831 = 1141 ms
    • Cluster search results time: 661 ms
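
The pair-wise merging and Top-K term output described above might look like the sketch below. It is illustrative only: cosine similarity, the size-weighted centroid average, and every function name here are assumptions, since the slides do not give the exact similarity measure or weighting.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def merge_centroids(ca, na, cb, nb):
    """Size-weighted average of two centroid term vectors."""
    total = na + nb
    return {t: (ca.get(t, 0.0) * na + cb.get(t, 0.0) * nb) / total
            for t in set(ca) | set(cb)}

def agglomerate(docs, k):
    """Repeatedly merge the most similar pair of clusters until k remain."""
    clusters = [([i], dict(d)) for i, d in enumerate(docs)]  # (member ids, centroid)
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Quadratic scan over all cluster pairs for the best merge.
        a, b = max(pairs, key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        merged = (ma + mb, merge_centroids(ca, len(ma), cb, len(mb)))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters

def top_terms(centroid, top_k=3):
    """The "virtual document": highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_k]]

# Toy search results (made-up snippets).
docs = [Counter(s.split()) for s in (
    "jaguar car engine speed",
    "jaguar cat jungle prey",
    "car engine dealer price",
    "cat jungle habitat wild",
)]
for members, centroid in agglomerate(docs, k=2):
    print(sorted(members), top_terms(centroid, top_k=2))
```

Each cluster is labeled by its highest-weighted centroid terms instead of by an actual document.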

Efficiency Analysis

• K-Medoids
  – O(k(n-k)^2) for each iteration, where n is # of documents, k is # of clusters
  – Need multiple iterations for convergence
• Hierarchical clustering
  – O(n^2) for each iteration
  – Need n-k iterations
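
To get a feel for these bounds, the short calculation below plugs in illustrative numbers. The values n = 100 and k = 5 are assumptions chosen for the example, not figures from the experiments, and the formulas ignore constant factors, so the counts are not predictions of wall-clock time.

```python
def k_medoids_cost(n, k, iterations):
    """Operation count from the k(n-k)^2 per-iteration bound."""
    return iterations * k * (n - k) ** 2

def hierarchical_cost(n, k):
    """Operation count from n^2 per iteration over n-k merge iterations."""
    return (n - k) * n ** 2

n, k = 100, 5  # hypothetical: 100 accumulated results, 5 clusters
print(k_medoids_cost(n, k, iterations=1))  # 45125
print(hierarchical_cost(n, k))             # 950000
```

Under these bounds alone, which algorithm does less work depends on how many iterations K-Medoids needs to converge, so measured timings are the more reliable guide.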

Lessons Learned

• Clustering takes longer as more search results accumulate (when we click “Next”)
• Top-K frequent terms in each cluster sometimes do not make sense
  – Combine additional information besides term frequency
• Currently re-cluster each time search results are reranked
  – Incremental update of clustering results is desired!
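
One possible shape for such an incremental update is sketched below. It is not part of the implemented system: the single-pass "attach to the nearest centroid or open a new cluster" rule, the similarity threshold of 0.3, and the cluster record layout are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def add_document(clusters, doc, threshold=0.3):
    """Attach doc to the most similar cluster, or start a new one.

    clusters is a list of {"size": int, "centroid": dict} records (hypothetical
    layout). Returns the index of the cluster that received the document.
    """
    best, best_sim = None, threshold
    for idx, c in enumerate(clusters):
        sim = cosine(c["centroid"], doc)
        if sim >= best_sim:
            best, best_sim = idx, sim
    if best is None:  # nothing similar enough: open a new cluster
        clusters.append({"size": 1, "centroid": dict(doc)})
        return len(clusters) - 1
    c = clusters[best]
    n = c["size"]
    terms = set(c["centroid"]) | set(doc)
    # Running mean: fold the new document into the existing centroid.
    c["centroid"] = {t: (c["centroid"].get(t, 0.0) * n + doc.get(t, 0.0)) / (n + 1)
                     for t in terms}
    c["size"] = n + 1
    return best

clusters = []
for snippet in ("jaguar car engine speed", "jaguar cat jungle prey",
                "car engine dealer price", "cat jungle habitat wild"):
    print(add_document(clusters, Counter(snippet.split())))
```

Each new result costs one pass over the existing centroids, so a new page of results would not trigger a full re-clustering. The trade-off is that the outcome depends on the order in which results arrive.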

Remaining

• Implementation
  – KMeans
  – MMR
  – Frequent word sets
• Effective presentation study
  – Based on user feedback
  – Literature survey
• Dynamic maintenance of clustering based on search result ranking and reranking
  – Drill down in a particular cluster
  – Update overall clustering organization

Feedback

• Which way to present clustering results is more meaningful?
  – Based on central documents
  – Based on term vectors
  – More options?

• Any other clustering algorithms to achieve effectiveness and efficiency?

• Any other presentation strategy besides “rank list + cluster center”?
