Nikos Sarkas, Univ. of Toronto
Nilesh Bansal, Univ. of Toronto
Gautam Das, Univ. of Texas at Arlington
Nick Koudas, Univ. of Toronto
Measure-driven Keyword-query Expansion

Web evolution
  Web constantly evolving and expanding
  New content with unique characteristics
  Most recently: active user participation in content generation
    Blogs
    Tweets
    Online social networks
    Wikis
    …

User (re)views and opinions
  “Explicit” user reviews: star ratings
  “Implicit” user reviews: sentiment analysis

Limitations of web search (based on a true story)
  Query: delta airlines
  700 ordered results

Query expansion and result refinement
  Query: delta airlines
  Services discussed in reviews:
    delta airlines customer service
    delta airlines airmiles
    delta airlines safety
  Refinement: delta airlines → delta airlines customer service → Reviews

Query expansion and result refinement
  Query: delta airlines
  Services discussed in reviews with high average ratings:
    delta airlines food
    delta airlines legroom
    delta airlines customer service
  Services discussed in reviews with low average ratings:
    delta airlines delays atlanta
    delta airlines connections jfk
    delta airlines fees luggage

Faceted search?
  Faceted search: refine query result using predefined, static hierarchies (facets)
  Search-engine query expansion: use query logs to suggest frequent expansions
  Our approach
    Dynamically compute “facets” for each query
    Based on data characteristics (text + ratings)

Outline
  Problem definition
  Basic framework
  Improved framework
  Experimental results

Formal problem statement
  Query: q1,…,ql
  k query expansions, r words each:
    q1,…,ql, xl+1,…,xr
    q1,…,ql, yl+1,…,yr
    q1,…,ql, zl+1,…,zr
    ●●●
  Efficiently compute the top-k query expansions (r words each) that max or min F(q1,…,ql,xl+1,…,xr)
  e.g. delta airlines → delta airlines connections jfk

Scoring functions
  Surprise(q1,…,ql,xl+1,…,xr) = (# of docs containing all words) / (expected # of docs containing all words, assuming independence)
  Avg(q1,…,ql,xl+1,…,xr) = average rating of documents containing all words
  F(q1,…,ql,xl+1,…,xr) is a function of the # of documents of rating b containing all words

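The two scoring functions can be illustrated on a toy corpus. The sketch below is only an illustration: the four documents, their ratings, and the helper names (`count_docs`, `surprise`, `avg_rating`) are made up for this example and do not come from the slides. It counts co-occurrences by brute force, which is exactly what the rest of the talk tries to avoid.

```python
def count_docs(docs, words):
    """Number of documents whose word set contains every word in `words`."""
    wanted = set(words)
    return sum(1 for text, _ in docs if wanted <= text)


def surprise(docs, words):
    """Observed co-occurrence count divided by the count expected under independence."""
    n = len(docs)
    expected = n
    for w in words:
        expected *= count_docs(docs, [w]) / n  # multiply by p(w) = c(w) / N
    return count_docs(docs, words) / expected if expected else 0.0


def avg_rating(docs, words):
    """Average rating of the documents containing all words (None if there are none)."""
    wanted = set(words)
    ratings = [rating for text, rating in docs if wanted <= text]
    return sum(ratings) / len(ratings) if ratings else None


# Hypothetical toy corpus: (set of words in the review, rating)
docs = [
    ({"delta", "airlines", "delays"}, 1),
    ({"delta", "airlines", "food"}, 4),
    ({"delta", "airlines", "delays", "atlanta"}, 2),
    ({"united", "airlines", "food"}, 3),
]

print(surprise(docs, ["delta", "delays"]))    # 2 observed vs 1.5 expected
print(avg_rating(docs, ["delta", "delays"]))  # mean of ratings 1 and 2
```
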
Outline
  Problem definition
  Basic framework
  Improved framework
  Experimental results

Computing top-k expansions
  Query Q = q1,…,ql
  Compute top-k expansions q1,…,ql,xl+1,…,xr
  Enumerate all candidate expansions, compute score
  Challenge: compute c(q1,…,xr) (word co-occurrence) for all candidates
  F(q1,…,ql,xl+1,…,xr) is a function of the # of documents containing all words

Computing word co-occurrences
  Pre-compute and store all possible word co-occurrences
    Assume 4-word co-occurrences
    A document with 50 distinct words has 230K 4-word sets
    Information from 1M documents: 230B 4-word sets
    Infeasible
  Compute co-occurrences on the fly
    Inefficient

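The figures on this slide follow from a binomial coefficient. The short check below reproduces them; the variable names are chosen for this example only, and the total counts 4-word sets per document (not distinct sets across documents).

```python
from math import comb

words_per_doc = 50        # distinct words in one document
set_size = 4              # size of the co-occurrence sets
num_docs = 1_000_000      # documents in the collection

sets_per_doc = comb(words_per_doc, set_size)   # "50 choose 4"
total_sets = sets_per_doc * num_docs           # summed over all documents

print(sets_per_doc)   # about 230K
print(total_sets)     # about 230B
```
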
Estimating word co-occurrences
  Word occurrences (low-order):
    delta      20000
    airlines   45000
    delays     30000
  Two-word co-occurrences (low-order):
    delta airlines    10000
    delta delays       3000
    airlines delays    5000
  High-order co-occurrence (estimated from the low-order ones):
    delta airlines delays   2000

Query-expansion framework
  Query q1,…,ql
  For each candidate expansion q1,…,ql,xl+1,…,xr
    Use c(wi), c(wi,wj) to estimate c(q1,…,ql,xl+1,…,xr)
    Compute expansion score
    Update top-k heap
  End For

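The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `top_k_expansions`, `estimate_count` and `score` are hypothetical names, the estimator is passed in as a black box, and the estimated counts in the usage example are made-up numbers.

```python
import heapq


def top_k_expansions(candidates, estimate_count, score, k):
    """Return the k highest-scoring expansions as (score, expansion) pairs."""
    heap = []  # min-heap; heap[0] holds the current k-th best score
    for expansion in candidates:
        count = estimate_count(expansion)   # estimated c(q1,...,ql,xl+1,...,xr)
        s = score(expansion, count)
        if len(heap) < k:
            heapq.heappush(heap, (s, expansion))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, expansion))  # evict the current k-th best
    return sorted(heap, reverse=True)


# Hypothetical estimated co-occurrence counts for three candidate expansions
estimated = {
    ("delta", "airlines", "delays"): 2000,
    ("delta", "airlines", "food"): 1500,
    ("delta", "airlines", "legroom"): 800,
}

best = top_k_expansions(
    candidates=estimated.keys(),
    estimate_count=estimated.get,
    score=lambda expansion, count: count,  # toy score: the raw estimated count
    k=2,
)
print(best)
```

To minimize instead of maximize, the same loop can be run on negated scores.
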
Outline
  Problem definition
  Basic framework
    Maximum entropy estimation
  Improved framework
  Experimental results

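Maximum entropy estimation
  Goal: c(w1,w2,w3) = p(w1,w2,w3) c(●), where c(●) is the total number of documents
  Joint distribution over the presence or absence of each word (¬w: word absent):
    p1 = p(¬w1,¬w2,¬w3)
    p2 = p(¬w1,¬w2,w3)
    p3 = p(¬w1,w2,¬w3)
    p4 = p(¬w1,w2,w3)
    p5 = p(w1,¬w2,¬w3)
    p6 = p(w1,¬w2,w3)
    p7 = p(w1,w2,¬w3)
    p8 = p(w1,w2,w3)
  Constraints from the known low-order counts, e.g.:
    p7+p8 = p(w1,w2) = c(w1,w2)/c(●)
    p5+p6+p7+p8 = p(w1) = c(w1)/c(●)
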
Maximum entropy estimation
  p=[p1 p2 p3 p4 p5 p6 p7 p8]T
  max H(p) = -∑ pi log pi subject to Ap=c, p≥0
  (Unique) maximum entropy distribution
  Computed using the Iterative Proportional Fitting (IPF) algorithm

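For three words, the estimation can be sketched with textbook Iterative Proportional Fitting over the eight presence/absence cells. The code below is an illustrative sketch rather than the authors' implementation: the function name `ipf_three_words` is hypothetical, and the collection size `total = 1_000_000` is an assumption, since the slides do not state c(●). The word and pair counts are the ones from the "delta airlines delays" example. Each sweep rescales the cells so that one pairwise 2x2 marginal table at a time matches its target.

```python
from itertools import product


def ipf_three_words(single, pair, total, sweeps=500):
    """Maximum entropy joint over 3 binary word indicators via IPF.

    single[i] = c(wi), pair[(i, j)] = c(wi, wj) for i < j, total = c(●).
    Returns a dict mapping (b1, b2, b3) presence bits to probabilities.
    """
    cells = list(product((0, 1), repeat=3))
    p = {cell: 1.0 / len(cells) for cell in cells}  # uniform starting point

    # Target 2x2 marginal table for every word pair, derived from the counts
    targets = {}
    for (i, j), cij in pair.items():
        targets[(i, j)] = {
            (1, 1): cij / total,
            (1, 0): (single[i] - cij) / total,
            (0, 1): (single[j] - cij) / total,
            (0, 0): (total - single[i] - single[j] + cij) / total,
        }

    for _ in range(sweeps):
        for (i, j), table in targets.items():
            # Current marginal of the pair (wi, wj) under p
            current = {key: 0.0 for key in table}
            for cell in cells:
                current[(cell[i], cell[j])] += p[cell]
            # Rescale every cell so this marginal matches its target
            for cell in cells:
                key = (cell[i], cell[j])
                if current[key] > 0.0:
                    p[cell] *= table[key] / current[key]
    return p


single = {0: 20000, 1: 45000, 2: 30000}                 # delta, airlines, delays
pair = {(0, 1): 10000, (0, 2): 3000, (1, 2): 5000}
total = 1_000_000                                       # assumed value of c(●)

p = ipf_three_words(single, pair, total)
estimate = p[(1, 1, 1)] * total   # estimated c(delta, airlines, delays)
print(round(estimate))
```

The estimate depends on the assumed collection size, so it should not be expected to reproduce the number shown on the slide.
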
Query-expansion framework using the IPF algorithm
  [Diagram: candidate expansion → constraints Ap=c → IPF → ME distribution (p1, p2, …, pn) → score, compared against the top-k score threshold of the already considered expansions. Snapshots at IPF iterations 10, 20 and 30.]

Outline
  Problem definition
  Basic framework
  Improved framework
  Experimental results

Entropy maximization
  p=[p1 p2 p3 … pn-1 pn]T
  max H(p) subject to Ap=c, p≥0
  Can we save work?
    We only require a single probability (pn)
    We need to compute top-k expansions: a bound around pn could help us prune the expansions considered
    Not by using IPF
  Introduce ElliMax
    Determine pn by progressively bounding it

Improved query-expansion framework using the ElliMax algorithm
  [Diagram: candidate expansion → constraints Ap=c → ElliMax → pn → score, compared against the top-k score threshold of the already considered expansions. Snapshots at ElliMax iterations 5, 10, 15 and 20.]

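The benefit of bounding pn instead of computing it exactly can be sketched independently of ElliMax itself. In the hypothetical helper below, `bound_steps` stands in for any procedure that yields successively tighter (lower, upper) bounds on a candidate's score; the bound sequences in the usage example are made up. A candidate is dropped as soon as its upper bound falls below the current top-k threshold.

```python
def evaluate_with_bounds(bound_steps, threshold, tol=1e-6):
    """Consume tightening (lower, upper) score bounds for one candidate.

    Returns (score, steps_used); score is None if the candidate was pruned.
    """
    steps = 0
    for lower, upper in bound_steps:
        steps += 1
        if upper < threshold:
            return None, steps                 # cannot enter the top-k: stop early
        if upper - lower <= tol:
            return (lower + upper) / 2, steps  # bounds are tight enough
    raise ValueError("bounds did not converge")


# Hypothetical bound sequences for two candidates
pruned = evaluate_with_bounds([(0.0, 1.0), (0.2, 0.6), (0.3, 0.4)], threshold=0.5)
scored = evaluate_with_bounds([(0.0, 1.0), (0.2, 0.6), (0.3, 0.3)], threshold=0.1)
print(pruned)
print(scored)
```
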
Outline
  Problem definition
  Basic framework
  Improved framework
    ElliMax algorithm
  Experimental results

ElliMax algorithm: Ellipsoid method principles
  max F(x) subject to Qx≥r
  [Figure: ellipsoids enclosing the optimum x*, shown at iterations 0, 5 and 10.]

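The principle can be illustrated with the textbook central-cut ellipsoid method. The sketch below is not ElliMax: it minimizes an unconstrained convex function of two variables (equivalently, maximizes a concave one), and the function `ellipsoid_minimize` and the quadratic test objective are assumptions made for this example. At each iteration the gradient at the ellipsoid's center defines a half-space that cannot contain the optimum, and the ellipsoid is replaced by the smallest one covering the remaining half.

```python
import math


def ellipsoid_minimize(f, grad, center, radius, iterations=200):
    """Central-cut ellipsoid method for a convex function of two variables.

    The ellipsoid is {x : (x - c)^T P^{-1} (x - c) <= 1}; it starts as a ball
    that must contain the minimizer.
    """
    n = 2
    c = list(center)
    P = [[radius ** 2, 0.0], [0.0, radius ** 2]]
    best_x, best_f = list(c), f(c)
    for _ in range(iterations):
        g = grad(c)
        Pg = [P[0][0] * g[0] + P[0][1] * g[1],
              P[1][0] * g[0] + P[1][1] * g[1]]
        quad = g[0] * Pg[0] + g[1] * Pg[1]      # g^T P g
        if quad <= 0.0:                         # zero gradient or numerical breakdown
            break
        d = [v / math.sqrt(quad) for v in Pg]   # P g / sqrt(g^T P g)
        # Move the center into the half-space that still contains the minimizer
        c = [c[i] - d[i] / (n + 1) for i in range(n)]
        # Smallest ellipsoid covering the kept half of the old one
        scale = n * n / (n * n - 1.0)
        P = [[scale * (P[i][j] - 2.0 / (n + 1) * d[i] * d[j]) for j in range(n)]
             for i in range(n)]
        fc = f(c)
        if fc < best_f:
            best_x, best_f = list(c), fc
    return best_x, best_f


# Hypothetical objective with its minimum at (1, -2)
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]

x_best, f_best = ellipsoid_minimize(f, grad, center=(0.0, 0.0), radius=10.0)
print(x_best, f_best)
```

Because the optimum always stays inside the current ellipsoid, the shrinking ellipsoids also give progressively tighter bounds on its location, which is the property the slides exploit.
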
ElliMax algorithm
  p-space: max H(p) subject to Ap=c, p≥0
  λ-space: max H’(λ) subject to Uλ≥-q
  1) Transform problem
  2) Starting ellipsoid
  3) Back to the p-space
  [Figure: the p-space (labels p1, p2, pn) and the λ-space (labels λ1, λ2) with the optimum λ*.]

Outline
  Problem definition
  Basic solution
  Improved solution
  Experimental results

Experimental Results (Performance)
  Time spent in Entropy Maximization
  Basic framework (Algorithm Direct)
    IPF algorithm
  Improved framework (Algorithm Bound)
    ElliMax algorithm
  Synthetic and real data

Direct vs Bound (Surprise)
  Top-10 expansions, 100k synthetic candidates
  [Plots: expansion size 3 and expansion size 4]

Direct vs Bound (Avg. Rating)
  Top-10 expansions, 100k synthetic candidates, ratings 0, 1 and 2
  [Plots: expansion size 3 and expansion size 4]

Experimental Results (Quality)
Thank you!