Nikos Sarkas, Univ. of Toronto Nilesh Bansal, Univ. of Toronto Gautam Das, Univ. of Texas at...

Post on 17-Dec-2015


Nikos Sarkas, Univ. of Toronto
Nilesh Bansal, Univ. of Toronto
Gautam Das, Univ. of Texas at Arlington
Nick Koudas, Univ. of Toronto

Measure-driven Keyword-query Expansion

Web evolution
Web constantly evolving and expanding
New content with unique characteristics
Most recently: active user participation in content generation
  Blogs
  Tweets
  Online social networks
  Wikis
  …

User (re)views and opinions

“Explicit” user reviews: star ratings

“Implicit” user reviews: sentiment analysis

Limitations of web search (based on a true story)

delta airlines

700 ordered results


Query expansion and result refinement

Query: delta airlines

Services discussed in reviews:
delta airlines customer service
delta airlines airmiles
delta airlines safety

Expanded query: delta airlines customer service
Reviews

Query expansion and result refinement

Query: delta airlines

Services discussed in reviews with high average ratings:
delta airlines food
delta airlines legroom
delta airlines customer service

Services discussed in reviews with low average ratings:
delta airlines delays atlanta
delta airlines connections jfk
delta airlines fees luggage

Faceted search?
Faceted search: refine query results using predefined, static hierarchies (facets)
Search-engine query expansion: use query logs to suggest frequent expansions
Our approach
  Dynamically compute “facets” for each query
  Based on data characteristics (text + ratings)

Outline
Problem definition
Basic framework
Improved framework
Experimental results

Formal problem statement

Query: $q_1,\dots,q_l$

$k$ query expansions, $r$ words each:
$q_1,\dots,q_l,x_{l+1},\dots,x_r$
$q_1,\dots,q_l,y_{l+1},\dots,y_r$
$q_1,\dots,q_l,z_{l+1},\dots,z_r$
…

Efficiently compute the top-$k$ query expansions, $r$ words each, that maximize or minimize $F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$

e.g. delta airlines, expanded to delta airlines connections jfk

Scoring functions

$\mathrm{Surprise}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \dfrac{\text{\# of docs containing all words}}{\text{expected \# of docs containing all words assuming independence}}$

$\mathrm{Avg}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \text{average rating of documents containing all words}$

$F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \mathcal{F}(\text{\# of documents of rating } b \text{ containing all words})$
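The two named scoring functions can be sketched directly from co-occurrence counts. This is a minimal illustration, not the authors' implementation: the function names, the `total_docs` parameter and the per-rating count dictionary are assumptions introduced here.

```python
from math import prod

def surprise(joint_count, word_counts, total_docs):
    """Observed co-occurrence count divided by the count expected
    if the words occurred independently of each other."""
    # Under independence, P(all words) is the product of the single-word probabilities.
    expected = total_docs * prod(c / total_docs for c in word_counts)
    return joint_count / expected

def average_rating(count_by_rating):
    """Average rating of the documents containing all words.
    count_by_rating maps a rating b to the number of documents
    of rating b that contain all words (hypothetical input format)."""
    total = sum(count_by_rating.values())
    return sum(rating * count for rating, count in count_by_rating.items()) / total
```

Both functions need only counts of documents containing all words of the expansion, which is why the rest of the deck concentrates on obtaining those counts.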


Computing top-k expansions
Query $Q = q_1,\dots,q_l$
Compute top-$k$ expansions $q_1,\dots,q_l,x_{l+1},\dots,x_r$
Enumerate all candidate expansions, compute score

$F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \mathcal{F}(\text{\# of documents containing all words})$

Challenge: compute $c(q_1,\dots,x_r)$ (word co-occurrence) for all candidates

Computing word co-occurrences
Pre-compute and store all possible word co-occurrences
  Assume 4-word co-occurrences
  A document with 50 distinct words has 230K 4-word sets
  Information from 1M documents: 230B 4-word sets
  Infeasible
Compute co-occurrences on the fly
  Inefficient
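The storage estimate above is a binomial coefficient. A quick check of the arithmetic follows; the variable names are illustrative, and the total assumes every document contributes 50 distinct words and ignores sets shared between documents.

```python
from math import comb

# Number of 4-word sets in one document with 50 distinct words: C(50, 4).
sets_per_document = comb(50, 4)

# Upper bound on the 4-word sets contributed by 1M such documents.
sets_in_collection = sets_per_document * 1_000_000
```

`sets_per_document` is 230,300 (the slide's 230K) and `sets_in_collection` is about 2.3 × 10^11 (the slide's 230B).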

Estimating word co-occurrences

High-order co-occurrence
delta airlines delays: 2000

Low-order co-occurrences

Two-word co-occurrences
delta airlines: 10000
delta delays: 3000
airlines delays: 5000

Word occurrences
delta: 20000
airlines: 45000
delays: 30000

Query-expansion framework

Query $q_1,\dots,q_l$

For each candidate expansion $q_1,\dots,q_l,x_{l+1},\dots,x_r$
  Use $c(w_i)$, $c(w_i,w_j)$ to estimate $c(q_1,\dots,q_l,x_{l+1},\dots,x_r)$
  Compute expansion score
  Update top-k heap
End For
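The loop above can be sketched as follows. This is a minimal illustration under stated assumptions: `estimate_count` and `score` are hypothetical callables supplied by the caller, candidates are formed by adding vocabulary words to the query, and higher scores are treated as better.

```python
import heapq
from itertools import combinations

def top_k_expansions(query, vocabulary, r, k, estimate_count, score):
    """Return the k best r-word expansions of `query` as (score, expansion) pairs."""
    extra_words = r - len(query)
    candidates = sorted(set(vocabulary) - set(query))
    heap = []  # min-heap: heap[0] is the weakest expansion currently in the top k
    for added in combinations(candidates, extra_words):
        expansion = tuple(query) + added
        count = estimate_count(expansion)   # estimated from low-order counts
        s = score(expansion, count)
        if len(heap) < k:
            heapq.heappush(heap, (s, expansion))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, expansion))  # evict the current weakest
    return sorted(heap, reverse=True)
```

The score at the top of the min-heap is the top-k threshold: a candidate enters the result only if it beats that value.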

Outline
Problem definition
Basic framework
  Maximum entropy estimation
Improved framework
Experimental results

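Maximum entropy estimation

$c(w_1,w_2,w_3) = p(w_1,w_2,w_3)\,c(\bullet)$

$p_1 = p(\bar{w}_1,\bar{w}_2,\bar{w}_3)$
$p_2 = p(\bar{w}_1,\bar{w}_2,w_3)$
$p_3 = p(\bar{w}_1,w_2,\bar{w}_3)$
$p_4 = p(\bar{w}_1,w_2,w_3)$
$p_5 = p(w_1,\bar{w}_2,\bar{w}_3)$
$p_6 = p(w_1,\bar{w}_2,w_3)$
$p_7 = p(w_1,w_2,\bar{w}_3)$
$p_8 = p(w_1,w_2,w_3)$

$p_7 + p_8 = p(w_1,w_2) = c(w_1,w_2)/c(\bullet)$

$p_5 + p_6 + p_7 + p_8 = p(w_1) = c(w_1)/c(\bullet)$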
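Each known count contributes one linear constraint over the eight probabilities: a 0/1 row that selects the cells in which the constrained words are all present. The sketch below builds such rows; the function name and the cell ordering (word 1 as the most significant position, so the last cell is "all three words present") are assumptions chosen to match the two constraints shown above.

```python
from itertools import product

def constraint_row(cells, words):
    """0/1 row selecting the cells in which every word in `words` is present."""
    return [1 if all(cell[w] for w in words) else 0 for cell in cells]

# Cells in the order p1..p8: (0,0,0), (0,0,1), ..., (1,1,1); 1 means the word is present.
cells = list(product([0, 1], repeat=3))

row_w1_w2 = constraint_row(cells, (0, 1))  # selects p7 and p8
row_w1 = constraint_row(cells, (0,))       # selects p5, p6, p7 and p8
```

Stacking one such row per known count, plus a row of ones for normalization, gives a linear system over the eight cell probabilities.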

Maximum entropy estimation

$p = [p_1\ p_2\ p_3\ p_4\ p_5\ p_6\ p_7\ p_8]^T$

$\max H(p) = -\sum_i p_i \log p_i$ subject to $Ap = c$, $p \ge 0$

(Unique) maximum entropy distribution
Computed using the Iterative Proportional Fitting (IPF) algorithm
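A compact sketch of Iterative Proportional Fitting for this setting follows. It is an illustration rather than the authors' code: the function name, the constraint format, the fixed number of sweeps and the collection size `TOTAL_DOCS` are all assumptions. Starting from the uniform distribution, each step rescales the cells inside and outside one constraint so that the constraint holds exactly, and cycling over the constraints approaches the maximum entropy distribution.

```python
from itertools import product

def ipf_max_entropy(n_words, constraints, sweeps=1000):
    """constraints maps a tuple of word indices to the target probability
    that all of those words occur together in a document."""
    cells = list(product([0, 1], repeat=n_words))
    p = {cell: 1.0 / len(cells) for cell in cells}  # uniform start
    for _ in range(sweeps):
        for words, target in constraints.items():
            inside = [c for c in cells if all(c[w] for w in words)]
            mass = sum(p[c] for c in inside)
            for c in cells:
                if c in inside:
                    p[c] *= target / mass                   # match the constraint
                else:
                    p[c] *= (1.0 - target) / (1.0 - mass)   # keep the total at 1
    return p

TOTAL_DOCS = 100_000  # assumed collection size, not given in the slides
constraints = {
    (0,): 20000 / TOTAL_DOCS, (1,): 45000 / TOTAL_DOCS, (2,): 30000 / TOTAL_DOCS,
    (0, 1): 10000 / TOTAL_DOCS, (0, 2): 3000 / TOTAL_DOCS, (1, 2): 5000 / TOTAL_DOCS,
}
p = ipf_max_entropy(3, constraints)
estimated_three_word_count = p[(1, 1, 1)] * TOTAL_DOCS
```

Only the last cell, `p[(1, 1, 1)]`, is needed to estimate the three-word co-occurrence, yet IPF computes the whole distribution to get it.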

Query-expansion framework using the IPF algorithm

[Figure: each candidate expansion yields constraints $Ap = c$; IPF computes the ME distribution $p_1, p_2, \dots, p_n$ (shown at iterations 10, 20 and 30), which gives the expansion's score, shown next to the top-k score threshold of the already considered expansions.]


Entropy maximization

Can we save work?
We only require a single probability ($p_n$)
We need to compute top-k expansions: a bound around $p_n$ could help us prune the expansions considered
Not by using IPF
Introduce ElliMax
  Determine $p_n$ by progressively bounding it

$p = [p_1\ p_2\ p_3\ \dots\ p_{n-1}\ p_n]^T$

$\max H(p)$ subject to $Ap = c$, $p \ge 0$
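The pruning idea can be sketched independently of how the bounds are produced. In the illustration below, `bounds` is a hypothetical iterable of progressively tighter (lower, upper) intervals on a candidate's score; it stands in for the output of a bounding procedure such as ElliMax and is not the algorithm itself. Higher scores are assumed to be better.

```python
def refine_or_prune(bounds, threshold, tolerance=1e-6):
    """Consume (lower, upper) score bounds until the candidate can be discarded
    or the interval is tight. Returns (status, number of bounds consumed)."""
    steps = 0
    for lower, upper in bounds:
        steps += 1
        if upper < threshold:
            return "pruned", steps   # cannot reach the top-k: stop early
        if upper - lower <= tolerance:
            break                    # score known precisely enough
    return "kept", steps

# Hypothetical sequence of shrinking intervals around a score of about 0.35.
example_bounds = [(0.0, 1.0), (0.2, 0.6), (0.3, 0.4), (0.34, 0.36)]
status, steps_used = refine_or_prune(example_bounds, threshold=0.5)
```

With a top-k threshold of 0.5, the candidate is discarded after the third interval, before its score is known exactly.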

Improved query-expansion framework using the ElliMax algorithm

[Figure: each candidate expansion yields constraints $Ap = c$; ElliMax progressively bounds $p_n$, and with it the score (shown at iterations 5, 10, 15 and 20), next to the top-k score threshold of the already considered expansions.]

Outline
Problem definition
Basic framework
Improved framework
  ElliMax algorithm
Experimental results

ElliMax algorithm: Ellipsoid method principles

$\max F(x)$ subject to $Qx \ge r$

[Figure: ellipsoids shrinking around the optimum $x^*$ at iterations 0, 5 and 10.]
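The cut-and-shrink step of the ellipsoid method can be sketched on a simpler task: finding any point with $Qx \ge r$ in two dimensions. This is the textbook central-cut update, not ElliMax; the function name, starting ball and iteration cap are assumptions. Each iteration cuts the current ellipsoid through its center along a violated constraint and replaces it with the smallest ellipsoid containing the half that can still hold feasible points.

```python
import math

def ellipsoid_feasible_point(Q, r, center=(0.0, 0.0), radius=10.0, max_iter=200):
    """Search for x in R^2 with Q x >= r. Returns (point or None, iterations used)."""
    n = 2
    c = list(center)
    # The ellipsoid is {x : (x - c)^T P^{-1} (x - c) <= 1}; start from a ball.
    P = [[radius ** 2, 0.0], [0.0, radius ** 2]]
    for iteration in range(max_iter):
        violated = next((i for i, (row, ri) in enumerate(zip(Q, r))
                         if row[0] * c[0] + row[1] * c[1] < ri), None)
        if violated is None:
            return tuple(c), iteration          # the center is feasible
        a = Q[violated]                         # feasible points have a.x > a.c
        Pa = [P[0][0] * a[0] + P[0][1] * a[1], P[1][0] * a[0] + P[1][1] * a[1]]
        norm = math.sqrt(a[0] * Pa[0] + a[1] * Pa[1])
        g = [Pa[0] / norm, Pa[1] / norm]
        # Move the center into the kept half and shrink the ellipsoid around it.
        c = [c[0] + g[0] / (n + 1), c[1] + g[1] / (n + 1)]
        scale = n * n / (n * n - 1.0)
        P = [[scale * (P[i][j] - 2.0 / (n + 1) * g[i] * g[j]) for j in range(2)]
             for i in range(2)]
    return None, max_iter

# Hypothetical constraints: x >= 1, y >= 1, x + y <= 3.
Q = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
r = [1.0, 1.0, -3.0]
point, iterations = ellipsoid_feasible_point(Q, r)
```

Because the feasible region always stays inside the current ellipsoid while the ellipsoid's area shrinks by a constant factor per step, the region is progressively bounded, which is the property the slides rely on.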

ElliMax algorithm

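1) Transform problem
p-space: $\max H(p)$ subject to $Ap = c$, $p \ge 0$
λ-space: $\max H'(\lambda)$ subject to $U\lambda \ge -q$

2) Starting ellipsoid

3) Back to the p-space

[Figure: the λ-space (axes $\lambda_1$, $\lambda_2$, optimum $\lambda^*$) and the p-space ($p_1$, $p_2$, $p_n$).]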

Outline
Problem definition
Basic solution
Improved solution
Experimental results

Experimental Results (Performance)
Time spent in entropy maximization
Basic framework (Algorithm Direct)
  IPF algorithm
Improved framework (Algorithm Bound)
  ElliMax algorithm
Synthetic and real data

Direct vs Bound (Surprise)
Top-10 expansions, 100k synthetic candidates

[Figure panels: Expansion size 3; Expansion size 4]

Direct vs Bound (Avg. Rating)
Top-10 expansions, 100k synthetic candidates, ratings 0, 1 and 2

[Figure panels: Expansion size 3; Expansion size 4]

Experimental Results (Quality)


Thank you!
