Nikos Sarkas, Univ. of Toronto Nilesh Bansal, Univ. of Toronto Gautam Das, Univ. of Texas at Arlington Nick Koudas, Univ. of Toronto Measure-driven Keyword-query Expansion

Page 1:

Nikos Sarkas, Univ. of Toronto
Nilesh Bansal, Univ. of Toronto
Gautam Das, Univ. of Texas at Arlington
Nick Koudas, Univ. of Toronto

Measure-driven Keyword-query Expansion

Page 2:

Web evolution

Web constantly evolving and expanding
New content with unique characteristics
Most recently: active user participation in content generation
  Blogs
  Tweets
  Online social networks
  Wikis
  …

Page 3:

User (re)views and opinions

“Explicit” user reviews: star ratings

“Implicit” user reviews: sentiment analysis

Page 4:

Limitations of web search (based on a true story)

Query: delta airlines

700 ordered results

Page 5:

Query expansion and result refinement

Query: delta airlines (results: reviews)

Expansions (services discussed in reviews):
  delta airlines customer service
  delta airlines airmiles
  delta airlines safety

Page 6:

Query expansion and result refinement

Query: delta airlines

Services discussed in reviews with high ratings on average:
  delta airlines food
  delta airlines legroom
  delta airlines customer service

Services discussed in reviews with low ratings on average:
  delta airlines delays atlanta
  delta airlines connections jfk
  delta airlines fees luggage

Page 7:

Faceted search?

Faceted search: refine query result using predefined, static hierarchies (facets)
Search-engine query expansion: use query logs to suggest frequent expansions
Our approach
  Dynamically compute “facets” for each query
  Based on data characteristics (text + ratings)

Page 8:

Outline

Problem definition
Basic framework
Improved framework
Experimental results

Page 9:

Formal problem statement

Query: $q_1,\dots,q_l$ (e.g. delta airlines)

Efficiently compute the top-k query expansions, $r$ words each, that max or min $F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$:

$q_1,\dots,q_l,\;x_{l+1},\dots,x_r$
$q_1,\dots,q_l,\;y_{l+1},\dots,y_r$
$q_1,\dots,q_l,\;z_{l+1},\dots,z_r$
…

(e.g. delta airlines connections jfk)

Page 10:

Scoring functions

$\mathrm{Surprise}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \dfrac{\text{\# of docs containing all words}}{\text{expected \# of docs containing all words assuming independence}}$

$\mathrm{Avg}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \text{average rating of documents containing all words}$

In general, $F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$ is a function of the # of documents of rating $b$ containing all words.
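As a rough illustration of the two scoring functions, the following Python sketch computes them from raw counts. The word and co-occurrence counts are the ones that appear on the "Estimating word co-occurrences" slide, read in slide order (that pairing is an assumption); the collection size `n_docs` and the per-rating counts are hypothetical values chosen only so the example runs.

```python
from math import prod

def surprise(count_all, word_counts, n_docs):
    """Observed co-occurrence count divided by the count expected
    if the words occurred independently of each other."""
    expected = n_docs * prod(c / n_docs for c in word_counts)
    return count_all / expected

def average_rating(count_by_rating):
    """Average rating, given {rating b: # of docs of rating b containing all words}."""
    total = sum(count_by_rating.values())
    return sum(rating * count for rating, count in count_by_rating.items()) / total

n_docs = 100_000  # hypothetical collection size, not given in the slides

# c(delta)=20000, c(airlines)=45000, c(delays)=30000, c(delta, airlines, delays)=2000
s = surprise(2000, [20000, 45000, 30000], n_docs)

# Hypothetical split of those 2000 documents over ratings 0, 1 and 2
avg = average_rating({0: 500, 1: 500, 2: 1000})

print(s, avg)
```

A Surprise value below 1 means the words co-occur less often than independence would predict; a value above 1 means they co-occur more often.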

Page 11:

Outline

Problem definition
Basic framework
Improved framework
Experimental results

Page 12:

Computing top-k expansions

Query $Q = q_1,\dots,q_l$
Compute top-k expansions $q_1,\dots,q_l,x_{l+1},\dots,x_r$
Enumerate all candidate expansions, compute score
Challenge: compute $c(q_1,\dots,x_r)$ (word co-occurrence) for all candidates

$F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$ is a function of the # of documents containing all words.

Page 13:

Computing word co-occurrences

Pre-compute and store all possible word co-occurrences
  Assume 4-word co-occurrences
  A 50 distinct-word document has 230K 4-word sets
  Information from 1M documents: 230B 4-word sets
  Infeasible
Compute co-occurrences on the fly
  Inefficient
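The figures on this slide follow from counting 4-element subsets of 50 distinct words. A short Python check, assuming (as the slide appears to) that every one of the 1M documents has 50 distinct words and that sets are counted once per document:

```python
from math import comb

words_per_doc = 50
set_size = 4
n_docs = 1_000_000

sets_per_doc = comb(words_per_doc, set_size)  # 50 choose 4
total_sets = sets_per_doc * n_docs            # counted per document, so an upper bound on distinct sets

print(sets_per_doc)  # 230300
print(total_sets)    # 230300000000
```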

Page 14:

Estimating word co-occurrences

Low-order co-occurrences
  Word occurrences: delta 20000, airlines 45000, delays 30000
  Two-word co-occurrences: delta airlines 10000, delta delays 3000, airlines delays 5000

High-order co-occurrence
  delta airlines delays 2000

Page 15:

Query-expansion framework

Query $q_1,\dots,q_l$
For each candidate expansion $q_1,\dots,q_l,x_{l+1},\dots,x_r$
  Use $c(w_i)$, $c(w_i,w_j)$ to estimate $c(q_1,\dots,q_l,x_{l+1},\dots,x_r)$
  Compute expansion score
  Update top-k heap
End For
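The loop above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the estimator and the scoring function are passed in as callables, and the toy estimator, the vocabulary, and all counts except 2000 are hypothetical.

```python
import heapq
from itertools import combinations

def top_k_expansions(query, vocabulary, r, k, estimate_count, score):
    """Return the k highest-scoring r-word expansions of `query`, best first."""
    heap = []  # min-heap of (score, expansion); heap[0] holds the current k-th best
    candidates = [w for w in vocabulary if w not in query]
    for added in combinations(candidates, r - len(query)):
        expansion = tuple(query) + added
        c_hat = estimate_count(expansion)   # estimated co-occurrence count
        s = score(expansion, c_hat)         # expansion score
        if len(heap) < k:
            heapq.heappush(heap, (s, expansion))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, expansion))
    return sorted(heap, reverse=True)

# Toy stand-ins: a lookup table instead of a real estimator, and the
# estimated count itself as the score.
toy_counts = {"delays": 2000, "food": 500, "legroom": 800}
result = top_k_expansions(
    query=("delta", "airlines"),
    vocabulary=["delta", "airlines", "delays", "food", "legroom"],
    r=3,
    k=2,
    estimate_count=lambda expansion: toy_counts[expansion[-1]],
    score=lambda expansion, c_hat: c_hat,
)
print(result)
```

Keeping a min-heap of size k means the current top-k score threshold is always available as `heap[0][0]`, which is the quantity a pruning strategy would compare against.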

Page 16:

Outline

Problem definition
Basic framework
  Maximum entropy estimation
Improved framework
Experimental results

Page 17:

Maximum entropy estimation

$c(w_1,w_2,w_3) = p(w_1,w_2,w_3)\,c(\bullet)$

$p_1 = p(\bar w_1,\bar w_2,\bar w_3)$
$p_2 = p(\bar w_1,\bar w_2,w_3)$
$p_3 = p(\bar w_1,w_2,\bar w_3)$
$p_4 = p(\bar w_1,w_2,w_3)$
$p_5 = p(w_1,\bar w_2,\bar w_3)$
$p_6 = p(w_1,\bar w_2,w_3)$
$p_7 = p(w_1,w_2,\bar w_3)$
$p_8 = p(w_1,w_2,w_3)$

where $\bar w_i$ denotes that word $w_i$ is absent.

$p_7 + p_8 = p(w_1,w_2) = c(w_1,w_2)/c(\bullet)$

$p_5 + p_6 + p_7 + p_8 = p(w_1) = c(w_1)/c(\bullet)$
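The two constraints shown on this slide generalize to one constraint per known single-word and two-word count. Collected in matrix form they might look as follows. This is a sketch under two assumptions: the cells $p_1,\dots,p_8$ are ordered as binary presence patterns of $(w_1,w_2,w_3)$ from all-absent to all-present, which is the ordering consistent with the two constraints given, and a normalization row (first row) is added so that the probabilities sum to one.

```latex
\underbrace{\begin{bmatrix}
1&1&1&1&1&1&1&1\\
0&0&0&0&1&1&1&1\\
0&0&1&1&0&0&1&1\\
0&1&0&1&0&1&0&1\\
0&0&0&0&0&0&1&1\\
0&0&0&0&0&1&0&1\\
0&0&0&1&0&0&0&1
\end{bmatrix}}_{A}
\begin{bmatrix} p_1\\ \vdots\\ p_8 \end{bmatrix}
=
\frac{1}{c(\bullet)}
\begin{bmatrix}
c(\bullet)\\ c(w_1)\\ c(w_2)\\ c(w_3)\\ c(w_1,w_2)\\ c(w_1,w_3)\\ c(w_2,w_3)
\end{bmatrix}
```

With 8 unknowns and 7 equations the system is underdetermined, which is why a selection principle such as maximum entropy is needed to pick a single distribution.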

Page 18:

Maximum entropy estimation

$p = [p_1\ p_2\ p_3\ p_4\ p_5\ p_6\ p_7\ p_8]^T$

$\max H(p) = -\sum_i p_i \log p_i$ subject to $Ap = c$, $p \ge 0$

(Unique) maximum entropy distribution
Computed using the Iterative Proportional Fitting algorithm
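A minimal Python sketch of Iterative Proportional Fitting for this kind of problem is given below. It is one textbook form of IPF, not necessarily the variant used by the authors: starting from the uniform distribution, each constraint "the cells where all listed words are present sum to a target" is enforced in turn by rescaling the cells inside and outside that set. The counts are the ones from the "Estimating word co-occurrences" slide, read in slide order (an assumption), and the collection size `n_docs` is hypothetical, so the resulting estimate is not expected to reproduce the slide's figure.

```python
from itertools import product

def ipf_max_entropy(constraints, n_words, sweeps=1000):
    """constraints: list of (word_indices, target_probability).
    Returns {presence pattern: probability} over all 2**n_words cells."""
    cells = list(product((0, 1), repeat=n_words))
    p = {cell: 1.0 / len(cells) for cell in cells}  # uniform start
    for _ in range(sweeps):
        for words, target in constraints:
            inside = {cell for cell in cells if all(cell[w] for w in words)}
            mass = sum(p[cell] for cell in inside)
            for cell in cells:
                if cell in inside:
                    p[cell] *= target / mass                  # match the target mass
                else:
                    p[cell] *= (1.0 - target) / (1.0 - mass)  # keep the total at 1
    return p

n_docs = 100_000  # hypothetical collection size c(.)
constraints = [
    ((0,), 20000 / n_docs),    # c(delta)
    ((1,), 45000 / n_docs),    # c(airlines)
    ((2,), 30000 / n_docs),    # c(delays)
    ((0, 1), 10000 / n_docs),  # c(delta, airlines)
    ((0, 2), 3000 / n_docs),   # c(delta, delays)
    ((1, 2), 5000 / n_docs),   # c(airlines, delays)
]
p = ipf_max_entropy(constraints, n_words=3)
estimated_count = p[(1, 1, 1)] * n_docs  # estimate of c(delta, airlines, delays)
print(estimated_count)
```

Note that the whole distribution is computed even though only the all-present cell `p[(1, 1, 1)]` is used for the co-occurrence estimate.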

Page 19:

Query-expansion framework using the IPF algorithm

[Figure: for a candidate expansion, the constraints $Ap = c$ are passed to IPF, which iterates (iterations 10, 20, 30) to the ME distribution $p_1, p_2, \dots, p_n$; the resulting score is compared with the top-k score threshold of the already considered expansions.]

Page 20:

Outline

Problem definition
Basic framework
Improved framework
Experimental results

Page 21:

Entropy maximization

$p = [p_1\ p_2\ p_3\ \dots\ p_{n-1}\ p_n]^T$

$\max H(p)$ subject to $Ap = c$, $p \ge 0$

Can we save work?
  We only require a single probability ($p_n$)
  We need to compute top-k expansions: a bound around $p_n$ could help us prune the expansions considered
  Not by using IPF
Introduce ElliMax
  Determine $p_n$ by progressively bounding it

Page 22:

Improved query-expansion framework using the ElliMax algorithm

[Figure: for a candidate expansion, the constraints $Ap = c$ are passed to ElliMax, which progressively tightens a bound on $p_n$ and hence on the score (iterations 5, 10, 15, 20); the bound is compared with the top-k score threshold of the already considered expansions.]
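The pruning idea can be sketched independently of how the bounds are produced. In the Python sketch below, `shrinking_bounds` is a purely hypothetical stand-in for a bounding procedure such as ElliMax: it simply halves an interval around a known value. The function `evaluate_with_pruning` stops as soon as the upper bound on the score falls below the current top-k threshold.

```python
def shrinking_bounds(value, max_iter=60):
    """Hypothetical bound generator: an interval around `value` that halves each iteration."""
    width = 1.0
    for _ in range(max_iter):
        width /= 2
        yield value - width, value + width

def evaluate_with_pruning(bounds, threshold, tol=1e-6):
    """Consume (lower, upper) score bounds until the candidate is pruned or resolved.
    Returns (score or None if pruned, iterations used)."""
    iterations = 0
    lower = upper = None
    for lower, upper in bounds:
        iterations += 1
        if upper < threshold:
            return None, iterations        # cannot enter the top-k: stop early
        if upper - lower <= tol:
            break                          # bound is tight enough to score the candidate
    if lower is None:
        return None, 0
    return 0.5 * (lower + upper), iterations

pruned_score, pruned_iters = evaluate_with_pruning(shrinking_bounds(0.2), threshold=0.5)
kept_score, kept_iters = evaluate_with_pruning(shrinking_bounds(0.8), threshold=0.5)
print(pruned_score, pruned_iters)
print(kept_score, kept_iters)
```

In this toy run the weak candidate is discarded after two iterations, while the strong candidate has to be bounded tightly before it is scored.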

Page 23:

Outline

Problem definition
Basic framework
Improved framework
  ElliMax algorithm
Experimental results

Page 24:

ElliMax algorithm: Ellipsoid method principles

$\max F(x)$ subject to $Qx \ge r$

[Figure: ellipsoids enclosing the optimum $x^*$ at iterations 0, 5 and 10.]
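For readers unfamiliar with the ellipsoid method, here is a minimal Python sketch of the standard central-cut variant for $\max F(x)$ subject to $Qx \ge r$ with concave $F$. It illustrates the general principle only and is not the ElliMax algorithm itself; the objective, constraint, starting center and radius in the example are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ellipsoid_maximize(F, grad_F, Q, r, center, radius, iterations=300):
    """Central-cut ellipsoid method; needs dimension n >= 2 and a starting
    ball (center, radius) that contains the optimum."""
    n = len(center)
    c = list(center)
    P = [[radius ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    best_x, best_val = None, -math.inf
    for _ in range(iterations):
        violated = next((q for q, ri in zip(Q, r) if dot(q, c) < ri), None)
        if violated is not None:
            g = [-v for v in violated]    # feasibility cut: keep q.(x - c) > 0
        else:
            val = F(c)
            if val > best_val:
                best_x, best_val = list(c), val
            g = [-v for v in grad_F(c)]   # objective cut: keep grad_F.(x - c) >= 0
        Pg = [dot(row, g) for row in P]
        gPg = dot(g, Pg)
        if gPg <= 0.0:                    # zero gradient or numerical breakdown
            break
        b = [v / math.sqrt(gPg) for v in Pg]
        c = [ci - bi / (n + 1) for ci, bi in zip(c, b)]
        scale = n * n / (n * n - 1.0)
        P = [[scale * (P[i][j] - 2.0 / (n + 1) * b[i] * b[j]) for j in range(n)]
             for i in range(n)]
    return best_x, best_val

# Hypothetical example: maximize -(x0-1)^2 - (x1-2)^2 subject to -x0 - x1 >= -2.
F = lambda x: -((x[0] - 1) ** 2) - (x[1] - 2) ** 2
grad_F = lambda x: [-2 * (x[0] - 1), -2 * (x[1] - 2)]
best_x, best_val = ellipsoid_maximize(F, grad_F, Q=[[-1.0, -1.0]], r=[-2.0],
                                      center=[0.0, 0.0], radius=10.0)
print(best_x, best_val)
```

Each iteration discards the half of the current ellipsoid that cannot contain the optimum and encloses the remaining half in a smaller ellipsoid, so the region known to contain $x^*$ shrinks geometrically. For this example the constrained optimum is the projection of $(1, 2)$ onto the line $x_0 + x_1 = 2$, namely $(0.5, 1.5)$.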

Page 25:

ElliMax algorithm

1) Transform problem: from the p-space ($\max H(p)$ subject to $Ap = c$, $p \ge 0$) to the λ-space ($\max H'(\lambda)$ subject to $U\lambda \ge -q$)

2) Starting ellipsoid

3) Back to the p-space

[Figure: the p-space (axes $p_1$, $p_2$, $p_n$) and the λ-space (axes $\lambda_1$, $\lambda_2$, optimum $\lambda^*$).]

Page 26:

Outline

Problem definition
Basic solution
Improved solution
Experimental results

Page 27:

Experimental Results (Performance)

Time spent in Entropy Maximization
Basic framework (Algorithm Direct)
  IPF algorithm
Improved framework (Algorithm Bound)
  ElliMax algorithm
Synthetic and real data

Page 28:

Direct vs Bound (Surprise)

Top-10 expansions, 100k synthetic candidates

[Figure panels: Expansion size 3; Expansion size 4]

Page 29:

Direct vs Bound (Avg. Rating)

Top-10 expansions, 100k synthetic candidates, ratings 0, 1 and 2

[Figure panels: Expansion size 3; Expansion size 4]

Page 30:

Experimental Results (Quality)

Page 31:

Thank you!