Transcript
Page 1: Behavior-driven clustering of queries into topics

CIKM 2011, Glasgow

Behavior-driven clustering of queries into topics

Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer

Page 2: Behavior-driven clustering of queries into topics

USER PROFILING IN SEARCH ENGINES

Granularity levels

Aggregation

27/10/2011

Concise representation

Meaningful semantics

Query

Session

Goal

Mission

Topic

Page 3: Behavior-driven clustering of queries into topics

MISSIONS AND TOPICS

A topic is a mental object or cognitive content, i.e., the sum of what can be perceived, discovered or learned about any real or abstract entity.

A search mission can be identified as a set of queries that express a complex search need, possibly articulated in smaller goals

Page 4: Behavior-driven clustering of queries into topics

QUERY STREAM DECOMPOSITION

Queries in the same mission

Same topic

Queries in consecutive missions

Different topic

Donato et al.: Do you want to take notes? Identifying research missions in Y! search pad. WWW’10

Taxonomies User behavior and intent

Page 5: Behavior-driven clustering of queries into topics

MERGING MISSIONS

Page 6: Behavior-driven clustering of queries into topics

TOPIC DETECTOR STATS

• Gradient Boosted Decision Tree (GBDT)
• Aggregation (min, max, avg, std) of 62 query pair features

AUC 0.95

10-fold cross-validation on 500K pairs

Lexical features          Behavioral features
Trigrams/terms cosine     Probability fwd
Common prefix/suffix      Session total click avg
Length difference         Session total time avg
…                         …
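To make the feature table concrete, here is a minimal sketch of three of the lexical features and of the min/max/avg/std aggregation over all cross pairs of two query sets. The function names, the character-trigram padding, and the use of the population standard deviation are assumptions made for illustration; the slides do not specify them, and the behavioral features (which need log data) are left out.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def char_trigrams(query):
    """Count character trigrams of a query, padded with one space on each side."""
    padded = f" {query.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def common_prefix_len(q1, q2):
    """Length of the longest common prefix of two strings."""
    n = 0
    for c1, c2 in zip(q1, q2):
        if c1 != c2:
            break
        n += 1
    return n


def pair_features(q1, q2):
    """Three illustrative lexical features for one query pair."""
    return {
        "trigram_cosine": cosine(char_trigrams(q1), char_trigrams(q2)),
        "common_prefix": common_prefix_len(q1, q2),
        "length_diff": abs(len(q1) - len(q2)),
    }


def aggregate_features(set_a, set_b):
    """Aggregate each pair feature over all cross pairs (both sets non-empty)."""
    rows = [pair_features(a, b) for a in set_a for b in set_b]
    out = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        out[f"{name}_min"] = min(values)
        out[f"{name}_max"] = max(values)
        out[f"{name}_avg"] = mean(values)
        out[f"{name}_std"] = pstdev(values)  # population std: an assumption
    return out


if __name__ == "__main__":
    feats = aggregate_features(["cheap flights rome"],
                               ["cheap flights paris", "rome hotels"])
    for name, value in sorted(feats.items()):
        print(name, value)
```

The aggregated vector is what a classifier over pairs of query sets would consume; with 62 pair features and four aggregates each, the real detector input would be correspondingly wider.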

Page 7: Behavior-driven clustering of queries into topics

GREEDY AGGLOMERATIVE TOPIC EXTRACTION (GATE)

• Topic detector applied to pairs of query sets
• O(log|M|·|M|^2) (heavily parallelizable)

1. Missions of the same user → supermissions
2. Query sets of different users → higher-level topics
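The greedy merging idea can be sketched as follows. This is not the authors' implementation: the scorer `jaccard_terms` is a toy stand-in for the trained topic detector, the `threshold` value is hypothetical, and the priority-queue design is only one assumed way of arriving at a cost of roughly |M|^2 log|M| detector-and-heap operations.

```python
import heapq
from itertools import combinations


def jaccard_terms(a, b):
    """Toy stand-in scorer: Jaccard overlap of the terms used in two query sets."""
    terms_a = {t for q in a for t in q.split()}
    terms_b = {t for q in b for t in q.split()}
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def greedy_merge(query_sets, score=jaccard_terms, threshold=0.5):
    """Merge the best-scoring pair of sets until no pair reaches the threshold."""
    clusters = {i: frozenset(s) for i, s in enumerate(query_sets)}
    next_id = len(clusters)
    # Max-heap through negated scores; entries are (-score, id_a, id_b).
    heap = [(-score(clusters[a], clusters[b]), a, b)
            for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        neg_score, a, b = heapq.heappop(heap)
        if -neg_score < threshold:
            break  # every remaining pair scores even lower
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(a) | clusters.pop(b)
        # Score the new set against every surviving set.
        for other_id, other in clusters.items():
            heapq.heappush(heap, (-score(merged, other), other_id, next_id))
        clusters[next_id] = merged
        next_id += 1
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    missions = [
        {"rome hotels", "hotels rome center"},
        {"cheap hotels rome"},
        {"python heapq"},
        {"heapq python tutorial"},
    ]
    for topic in greedy_merge(missions):
        print(sorted(topic))
```

Running the same routine first on the missions of a single user and then on the resulting sets across users mirrors the two steps listed on the slide. The initial pairwise scoring is independent per pair, which is the part that parallelizes most easily.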

Page 8: Behavior-driven clustering of queries into topics

EVALUATION

40K users

3 months Y! log

Page 9: Behavior-driven clustering of queries into topics

EVALUATION: BASELINE

• OSLOM community detection algorithm
  – Weighted undirected graph
  – Maximizing local fitness function of clusters
  – Automatic hierarchy detection

Lancichinetti et al.: Finding statistically significant communities in networks. PLoS ONE, 2011.

2URL cover graph

Page 10: Behavior-driven clustering of queries into topics

EVALUATION: QUERY SET COVERAGE

Fraction of queries considered in the clustering phase

URL cover graph connected components size distribution

GATE: 1    OSLOM: 0.2
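As a rough illustration of the two quantities on this slide, the sketch below builds connected components from (query, clicked URL) pairs and computes a coverage fraction. It assumes that two queries are linked when they share a clicked URL, which is one plausible reading of "URL cover graph" and not a definition taken from the slides; the function names and the toy click data are hypothetical.

```python
from collections import Counter, defaultdict


def connected_components(clicks):
    """Group queries linked through shared clicked URLs, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    by_url = defaultdict(list)
    for query, url in clicks:
        find(query)  # register the query even if it shares no URL
        by_url[url].append(query)
    for queries in by_url.values():
        root = find(queries[0])
        for q in queries[1:]:
            parent[find(q)] = root
    groups = defaultdict(set)
    for q in parent:
        groups[find(q)].add(q)
    return list(groups.values())


def size_distribution(components):
    """Map component size to the number of components of that size."""
    return Counter(len(c) for c in components)


def coverage(considered, all_queries):
    """Fraction of all queries that the clustering phase actually receives."""
    return len(set(considered) & set(all_queries)) / len(all_queries) if all_queries else 0.0


if __name__ == "__main__":
    clicks = [
        ("rome hotels", "a.com"),
        ("hotels in rome", "a.com"),
        ("hotels in rome", "b.com"),
        ("rome b&b", "b.com"),
        ("python heapq", "c.org"),
    ]
    comps = connected_components(clicks)
    print(size_distribution(comps))
    largest = max(comps, key=len)
    print(coverage(largest, {q for q, _ in clicks}))
```

Restricting a graph-based method to the largest component, as the usage example does, is only an illustration of how coverage can drop below 1; the slides do not state which queries the baseline discards.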

Page 11: Behavior-driven clustering of queries into topics

EVALUATION: SINGLETON RATIO

Fraction of queries that remain isolated in singleton clusters

GATE: 0.55-0.27    OSLOM: 0.88
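A minimal sketch of the singleton ratio, assuming the clustering is given as a list of disjoint sets of queries (the function name and input shape are illustrative, not from the slides):

```python
def singleton_ratio(clusters):
    """Fraction of queries that sit alone in a cluster of size one."""
    total = sum(len(c) for c in clusters)
    singletons = sum(1 for c in clusters if len(c) == 1)
    return singletons / total if total else 0.0


if __name__ == "__main__":
    clusters = [{"rome hotels", "hotels in rome"}, {"python heapq"}, {"weather"}]
    print(singleton_ratio(clusters))
```

A lower value means more queries were grouped with at least one other query.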

Page 12: Behavior-driven clustering of queries into topics

EVALUATION: AGGREGATION ABILITY

Topics aggregated in two consecutive steps or levels

GATE: 500K    OSLOM: 100K

Page 13: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

• Coverage
  – Number of unique clicked URLs for the query
• Purity
  – Average pointwise mutual information of pairs of query-related relevant terms
• Relevant terms are extracted from top clicked results using a predefined dictionary
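The purity measure can be sketched as an average of pairwise PMI values, PMI(x, y) = log(p(x, y) / (p(x) p(y))). The estimation below is an assumption made for illustration: probabilities come from document-level co-occurrence counts over a small hypothetical corpus `docs`, and term pairs that never co-occur are skipped instead of being smoothed. The slides do not say how either case is handled.

```python
import math
from itertools import combinations


def pmi(x, y, docs):
    """Pointwise mutual information of two terms from document co-occurrence."""
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return None  # undefined (minus infinity); skipped by the caller
    return math.log((n_xy * n) / (n_x * n_y))


def purity(terms, docs):
    """Average PMI over all pairs of distinct terms that co-occur at least once."""
    scores = [pmi(x, y, docs) for x, y in combinations(sorted(set(terms)), 2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    docs = [
        {"rome", "hotel"},
        {"rome", "hotel"},
        {"python", "heap"},
        {"python", "code"},
    ]
    print(purity(["rome", "hotel"], docs))
    print(purity(["rome", "python"], docs))
```

Terms that tend to appear together score above zero, so a topic whose relevant terms are mutually related gets a higher purity than one mixing unrelated terms.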

Page 14: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

Page 15: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

Page 16: Behavior-driven clustering of queries into topics

USER PROFILING

Page 17: Behavior-driven clustering of queries into topics

USER PROFILING FROM TOPICS

[Figure: Missions and Topics feed the Topic Detector, which outputs the user topical profile. Values shown: 0.0 0.0 0.0 0.7 2.9 3.2 1.9 0.35 0.41 0.24]
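One way to read this slide is that each topic receives a weight from the detector scores of the user's missions, and the weights are then normalized into a profile. The sketch below follows that reading only as an assumption: `term_overlap` is a toy stand-in for the topic detector, `min_weight` is a hypothetical cutoff, and summing scores per topic is an illustrative choice that is not claimed to reproduce the numbers in the figure.

```python
def term_overlap(mission, topic_terms):
    """Toy stand-in scorer: share of the mission's terms found in the topic."""
    terms = {t for q in mission for t in q.split()}
    return len(terms & topic_terms) / len(terms) if terms else 0.0


def topical_profile(missions, topics, score=term_overlap, min_weight=0.0):
    """Sum per-topic scores over missions, drop weak topics, normalize to sum 1."""
    raw = [sum(score(m, t) for m in missions) for t in topics]
    kept = [w if w > min_weight else 0.0 for w in raw]
    total = sum(kept)
    return [w / total for w in kept] if total else kept


if __name__ == "__main__":
    missions = [{"rome hotels"}, {"cheap flights rome"}, {"python heapq tutorial"}]
    topics = [{"rome", "hotels", "flights"}, {"python", "heapq"}]
    print(topical_profile(missions, topics))
```

The resulting vector has one entry per topic and sums to 1 whenever at least one topic survives the cutoff.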

Page 18: Behavior-driven clustering of queries into topics

PROFILES FOR “PREDICTION”

• Sequence of missions of the profiled user vs. sequence of a random one

• Sequence-profile match using topic detector

• Success: 0.65 (0.72 less frequent, 0.55 most frequent)
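The evaluation protocol described above can be sketched as a pairwise comparison: for each profiled user, the profile is matched against that user's own mission sequence and against the sequence of a random user, and a trial succeeds when the user's own sequence matches better. Everything below is hypothetical scaffolding: `mean_weight_match` replaces the detector-based match with a simple average of profile weights, and sequences are reduced to lists of topic labels.

```python
def mean_weight_match(profile, sequence):
    """Toy match: average profile weight of the topic labels in a sequence."""
    if not sequence:
        return 0.0
    return sum(profile.get(label, 0.0) for label in sequence) / len(sequence)


def success_rate(trials, match=mean_weight_match):
    """Share of trials where the user's own sequence beats the random one."""
    wins = sum(1 for profile, own, other in trials
               if match(profile, own) > match(profile, other))
    return wins / len(trials) if trials else 0.0


if __name__ == "__main__":
    trials = [
        ({"travel": 0.7, "code": 0.3}, ["travel", "travel", "code"], ["code", "code"]),
        ({"travel": 0.2, "code": 0.8}, ["travel"], ["code"]),
    ]
    print(success_rate(trials))
```

Under this setup a random guess would succeed about half the time, which gives a reference point for reading the reported success figures.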

Page 19: Behavior-driven clustering of queries into topics

CONCLUSIONS

• New behavior-driven notion of topics
• Bottom-up topic extraction algorithm
• Favorable comparison with graph-based clustering
• Effective user profiling

• Other baselines
• More accurate predictions

Page 20: Behavior-driven clustering of queries into topics

ACKNOWLEDGMENTS

Fil Menczer
Prof. Informatics @ IU
Director CNetS @ IU

Umut Ozertem
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Emre Velisapaoglu
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Debora Donato
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Page 21: Behavior-driven clustering of queries into topics
Page 22: Behavior-driven clustering of queries into topics

Taxonomies User behavior and intent