Page 1:

Online Search Evaluation with Interleaving

Filip Radlinski, Microsoft

Page 2: Acknowledgments

• This talk involves joint work with:
  – Olivier Chapelle
  – Nick Craswell
  – Katja Hofmann
  – Thorsten Joachims
  – Madhu Kurup
  – Anne Schuth
  – Yisong Yue

Page 3: Motivation

Baseline Ranking Algorithm vs. Proposed Ranking Algorithm

Which is better?

Page 4: Retrieval evaluation

Two types of retrieval evaluation:

• Offline evaluation: Ask experts or users to explicitly evaluate your retrieval system. This dominates evaluation research today.

• Online evaluation: See how normal users interact with your retrieval system when just using it.

Most well-known type: A/B tests

Page 5: A/B testing

• Each user is assigned to one of two conditions
• They might see the left or the right ranking
• Measure user interaction with theirs (e.g. clicks)
• Look for differences between the populations

[Figure: Ranking A and Ranking B shown side by side]
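To make the between-population comparison concrete, here is a minimal sketch of testing the two conditions on click-through rate with a two-proportion z-test; the counts and the choice of CTR as the metric are illustrative assumptions, not part of the talk.

```python
# A minimal sketch of an A/B comparison on click-through rate (CTR).
# Inputs are, per condition, the number of users who clicked at least
# once and the total number of users; the numbers below are made up.
from math import sqrt
from scipy.stats import norm

def ab_ctr_ztest(clicks_a, users_a, clicks_b, users_b):
    """Two-proportion z-test on CTR between conditions A and B."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_a - p_b) / se
    return p_a, p_b, z, 2 * norm.sf(abs(z))   # two-sided p-value

# Small between-population differences need many users to detect.
print(ab_ctr_ztest(clicks_a=5150, users_a=10000, clicks_b=5000, users_b=10000))
```

Because each user sees only one condition, small differences need large populations to reach significance, which is the motivation for the interleaved design introduced next.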

Page 6: Online evaluation with interleaving

• A within-user online ranker comparison
  – Presents results from both rankings to every user
• The ranking that gets more of the clicks wins
  – Designed to be unbiased, and much more sensitive than A/B

[Figure: Ranking A and Ranking B are combined into the ranking shown to users (randomized)]
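As a sketch of how "the ranking that gets more of the clicks wins" turns into a statistic, each interleaved impression can be reduced to a per-query winner and the winners aggregated with a sign test. The +1/-1/0 outcome encoding and the binomial test are illustrative choices, not the talk's exact analysis.

```python
# Aggregate per-impression interleaving outcomes with a sign test.
# Encoding (an assumption): +1 means ranker A got more credited clicks
# on that impression, -1 means ranker B did, 0 means a tie.
from scipy.stats import binomtest

def interleaving_preference(outcomes):
    wins_a = sum(1 for o in outcomes if o > 0)
    wins_b = sum(1 for o in outcomes if o < 0)
    decided = wins_a + wins_b            # ties carry no preference signal
    p_value = binomtest(wins_a, decided, p=0.5).pvalue
    return wins_a, wins_b, p_value

print(interleaving_preference([+1, +1, -1, 0, +1, +1, -1, +1]))
```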

Page 7: Team draft interleaving

Ranking A
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
3. Napa Valley College
   www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
   www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
   www.napavintners.com
6. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
   www.napavalley.com
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
5. NapaValley.org
   www.napavalley.org
6. The Napa Valley Marathon
   www.napavalleymarathon.org

Presented Ranking (A or B marks which ranker contributed each result)
1. Napa Valley – The authority for lodging... (A)
   www.napavalley.com
2. Napa Country, California – Wikipedia (B)
   en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... (B)
   books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... (A)
   www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... (B)
   www.napalinks.com
6. Napa Valley College (A)
   www.napavalley.edu/homex.asp
7. NapaValley.org (B)
   www.napavalley.org

[Radlinski et al. 2008]

Page 8: Team draft interleaving

The same Ranking A, Ranking B, and Presented Ranking as on the previous page, now with the user's clicks marked. Each click is credited to the ranker whose team contributed the clicked result; here the credited clicks split evenly between the two teams.

Tie!

[Radlinski et al. 2008]
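Below is a compact sketch of the team draft procedure that the two example slides illustrate: in each round a coin flip decides which ranker drafts first, each ranker contributes its highest-ranked result not yet shown, and clicks are credited to the team that contributed the clicked result. This is an illustrative implementation in the spirit of Radlinski et al. (2008), not their code.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Return the interleaved list and, per position, the team that added it."""
    interleaved, teams, shown = [], [], set()
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # A coin flip decides which ranker drafts first in this round.
        order = ['A', 'B'] if rng.random() < 0.5 else ['B', 'A']
        for team in order:
            ranking = ranking_a if team == 'A' else ranking_b
            idx = ia if team == 'A' else ib
            while idx < len(ranking) and ranking[idx] in shown:
                idx += 1                     # skip results already shown
            if idx < len(ranking):
                interleaved.append(ranking[idx])
                teams.append(team)
                shown.add(ranking[idx])
                idx += 1
            if team == 'A':
                ia = idx
            else:
                ib = idx
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Credit each click to the team that contributed the clicked result."""
    credit = {'A': 0, 'B': 0}
    for pos in clicked_positions:
        credit[teams[pos]] += 1
    return credit    # more credited clicks wins; equal credit is a tie
```

Run on the Ranking A and Ranking B above, the coin flips produce presented rankings like the one shown; with one credited click per team, credit_clicks returns equal counts and the impression is a tie.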

Page 9: Why might mixing rankings help?

• Suppose results are worth money. For some query:
  – Ranker A: [three result icons omitted]   User clicks
  – Ranker B: [three result icons omitted]   User also clicks
• Users of A may not know what they're missing
  – Difference in behaviour is small
• But if we can mix up results from A & B: strong preference for B

Page 10: Comparison with A/B metrics

• Experiments with real Yahoo! rankers (very small differences in relevance)

[Figure: disagreement probability and p-value vs. query set size, for Yahoo! Pair 1 and Yahoo! Pair 2]

[Chapelle et al. 2012]

Page 11: The interleaving click model

• Click == Good
• Interleaving corrects for position bias
• Yet there are other sources of bias, such as bolding

[Figure: two result snippets compared, one with bolded query terms vs. one without]

[Yue et al. 2010a]

Page 12: The interleaving click model

• Bars should be equal if there was no effect of bolding

[Figure: click frequency on the bottom result vs. rank of results]

[Yue et al. 2010a]

Page 13: Sometimes clicks aren't even good

• Satisfaction of a click can be estimated
  – Time spent on URLs is informative
  – More sophisticated models also consider the query and document (some documents require more effort)
• Time before clicking is another efficiency metric

[Figure: example result list annotated "Click", "Click", "No…"]

[Kim et al. WSDM 2014]
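For illustration only, a very simple stand-in for such a satisfaction estimate is a dwell-time threshold; the 30-second cutoff below is a common heuristic and not the query- and document-aware model of Kim et al. (2014).

```python
# A toy satisfaction estimate from dwell time (illustrative heuristic only;
# the model cited above also conditions on the query and the document).
def is_satisfied_click(dwell_seconds, threshold=30.0):
    """Treat a click as satisfied if the user stayed on the URL long enough."""
    return dwell_seconds >= threshold

def time_to_first_satisfied_click(clicks):
    """clicks: list of (seconds_from_query_to_click, dwell_seconds) tuples."""
    for time_to_click, dwell in clicks:
        if is_satisfied_click(dwell):
            return time_to_click
    return None   # no satisfied click was observed for this query
```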

Page 14: Newer A/B metrics

• Newer A/B metrics can incorporate these signals
  – Time before clicking
  – Time spent on result documents
  – Estimated user satisfaction
  – Bias in the click signal, e.g. position
  – Anything else the domain expert cares about
• Suppose I've picked an A/B metric and assume it to be my target
  – I just want to measure it more quickly
  – Can I use interleaving?

Page 15: An A/B metric as a gold standard

• Does interleaving agree with these A/B metrics?

  AB Metric                       Team Draft Agreement
  Is Page Clicked?                63 %
  Clicked @ 1?                    71 %
  Satisfied Clicked?              71 %
  Satisfied Clicked @ 1?          76 %
  Time-to-click                   53 %
  Time-to-click @ 1               45 %
  Time-to-satisfied-click         47 %
  Time-to-satisfied-click @ 1     42 %

[Schuth et al. SIGIR 2015]

Page 16: An A/B metric as a gold standard

• Suppose we parameterize the clicks
  – Optimize to maximize agreement with our A/B metric
• In particular:
  – Only include clicks where the predicted probability of satisfaction is above a threshold t
  – Score clicks based on the time to satisfied click
  – Learn a linear weighted combination of these

[Schuth et al. SIGIR 2015]
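A sketch of that parameterization: drop clicks whose predicted satisfaction is below a threshold t, turn time-to-click into a score, and combine the pieces with linear weights. The feature names and the particular time transform are assumptions for illustration, not the exact formulation of Schuth et al. (2015).

```python
import math

def click_credit(click, t=0.5, weights=(1.0, 1.0)):
    """click: dict with 'p_sat' (predicted satisfaction probability) and
    'time_to_click' (seconds from query issue to this click)."""
    if click['p_sat'] <= t:
        return 0.0                         # filtered: likely not satisfied
    w_sat, w_time = weights
    # Faster satisfied clicks score higher (one possible transform).
    time_score = 1.0 / math.log2(2.0 + click['time_to_click'])
    return w_sat * click['p_sat'] + w_time * time_score

def interleaving_outcome(clicks_a, clicks_b, **params):
    """Sum per-team credit; a positive margin favours A, negative favours B."""
    credit_a = sum(click_credit(c, **params) for c in clicks_a)
    credit_b = sum(click_credit(c, **params) for c in clicks_b)
    return credit_a - credit_b
```

The threshold t and the weights would then be tuned (for example by grid search) to maximize agreement with the chosen A/B metric.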

Page 17: An A/B metric as a gold standard

  AB Metric                     Team Draft Agreement   Learned            AB Self-Agreement on
                                (1/80th size)          (to each metric)   Subset (1/80th size)
  Is Page Clicked?              63 %                   84 % +             63 %
  Clicked @ 1?                  71 % *                 75 % +             62 %
  Satisfied Clicked?            71 % *                 85 % +             61 %
  Satisfied Clicked @ 1?        76 % *                 82 % +             60 %
  Time-to-click                 53 %                   68 % +             58 %
  Time-to-click @ 1             45 %                   56 % +             59 %
  Time-to-satisfied-click       47 %                   63 % +             59 %
  Time-to-satisfied-click @ 1   42 %                   50 % +             60 %

Page 18: The right parameters

  AB Metric            Team Draft   Learned      Learned         Learned
                       Agreement    Combined     (P(Sat) only)   (Time to click * P(Sat))
  Satisfied Clicked?   71 %         85 % +       84 % +          48 % –

  (Learned thresholds annotated on the slide: P(Sat) > 0.5, P(Sat) > 0.76, P(Sat) > 0.26)

• The optimal filtering parameter need not match the metric definition
• But having the right feature is essential

Page 19: Does this cost sensitivity?

[Figure: statistical power of Team Draft interleaving compared with the "Is Sat clicked" A/B metric]

Page 20: What if you instead know how you value user actions?

• Suppose we don't have an A/B metric in mind
• Instead, suppose we know how to value users' behavior on changed documents:
  – If a user clicks on a document that moved up k positions, how much is it worth?
  – If a user spends time t before clicking, how much is it worth?
  – If a user spends time t' on a document, how much is it worth?

[Radlinski & Craswell, WSDM 2013]

Page 21: Example credit function

• The value of a click is proportional to how far the clicked document moved between A and B
• Example (the document icons are omitted from the transcript):
  – A: three documents in positions 1, 2, 3
  – B: the same documents, reordered (one moved up two positions, two moved down one)
  – Any click on the document that moved up two positions gives credit +2
  – Any click on either of the documents that moved down one position gives credit -1
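A sketch of this credit function: the credit of a clicked document is its rank in A minus its rank in B, so credit is positive when B promotes the document. Assuming B promotes A's third result to the top reproduces the slide's +2 / -1 / -1 credits; the sign convention and the constant of proportionality (here 1) are assumptions.

```python
# An illustrative rank-movement credit function: credit is proportional to
# how far the clicked document moved between rankings A and B (here the
# constant of proportionality is 1, and positive credit favours B).
def movement_credit(doc, ranking_a, ranking_b):
    rank_a = ranking_a.index(doc)   # 0-based rank of the document in A
    rank_b = ranking_b.index(doc)   # 0-based rank of the document in B
    return rank_a - rank_b

# Example consistent with the slide's credits, assuming B = [3rd, 1st, 2nd]:
ranking_a = ['doc1', 'doc2', 'doc3']
ranking_b = ['doc3', 'doc1', 'doc2']
print([movement_credit(d, ranking_a, ranking_b) for d in ranking_a])
# -> [-1, -1, 2]: clicking doc3 earns +2, clicking doc1 or doc2 earns -1
```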

Page 22: Interleaving (making the rankings)

We generate a set of rankings that are similar to those returned by A and B in an A/B test.

[Figure: Ranker A and Ranker B at the top; below, candidate interleaved rankings, with Team Draft showing two of them at 50% each]
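One way to generate such a candidate set, sketched below, is to build every ranking whose next result always comes from the top of A's or B's not-yet-shown results, so that each prefix combines prefixes of A and B. Treat this particular construction as an assumption about what "similar to A and B" means; it is only an illustration.

```python
def allowed_rankings(ranking_a, ranking_b, depth):
    """Enumerate candidate rankings of the requested depth whose every
    prefix combines a prefix of A with a prefix of B."""
    def extend(prefix, ia, ib):
        if len(prefix) == depth:
            yield tuple(prefix)
            return
        for ranking, which in ((ranking_a, 'a'), (ranking_b, 'b')):
            idx = ia if which == 'a' else ib
            while idx < len(ranking) and ranking[idx] in prefix:
                idx += 1                 # skip documents already placed
            if idx < len(ranking):
                nxt = ranking[idx]
                next_ia = idx + 1 if which == 'a' else ia
                next_ib = ib if which == 'a' else idx + 1
                yield from extend(prefix + [nxt], next_ia, next_ib)
    # A and B may propose the same document at a position, so deduplicate.
    return sorted(set(extend([], 0, 0)))

print(allowed_rankings(['d1', 'd2', 'd3'], ['d3', 'd1', 'd2'], depth=3))
```

Team Draft's possible outputs are contained in this candidate set; the optimized approach instead chooses how often to show each candidate, as the next slides describe.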

Page 23: We have an optimization problem!

• We have a set of allowed rankings
• We specified how clicks translate to credit
• We solve for the probabilities of showing each ranking:
  – The probabilities of showing the rankings add up to 1
  – The expected credit given random clicking is zero
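A sketch of these constraints as a linear program in SciPy: the variables are the probabilities of showing each allowed ranking, constrained to sum to one and to give zero expected credit under uniform random clicking. The sensitivity objective from the next slide is omitted, and the example credit values are made up.

```python
import numpy as np
from scipy.optimize import linprog

def solve_unbiased_probabilities(expected_credits):
    """expected_credits[i]: expected credit of allowed ranking i when the
    user clicks uniformly at random (i.e. has no real preference)."""
    n = len(expected_credits)
    # Equality constraints: probabilities sum to 1; expected credit is 0.
    a_eq = np.vstack([np.ones(n), np.asarray(expected_credits, dtype=float)])
    b_eq = np.array([1.0, 0.0])
    # No objective beyond feasibility in this sketch, so c = 0.
    res = linprog(c=np.zeros(n), A_eq=a_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x if res.success else None

# Four allowed rankings with made-up random-click credits.
print(solve_unbiased_probabilities([0.5, -0.25, 1.0, -1.0]))
```

In the full method, the remaining freedom is used to pick the most sensitive distribution rather than an arbitrary feasible one.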

Page 24: Sensitivity

• The optimization problem so far is usually under-constrained (lots of possible rankings)
• What else do we want? Sensitivity!
• Intuition:
  – When we show a particular ranking (i.e. something combining results from A and B), it is always biased (interleaving says that we should be unbiased on average)
  – The more biased, the less informative the outcome
  – We want to show individual rankings that are least biased

I'll skip the maths here...

Page 25: Allowed interleaved rankings for different interleaving algorithms

[Figure: the allowed interleaved rankings of A and B, each with a score and the probability with which each interleaving algorithm shows it, alongside an illustrative optimized solution]

Page 26: Summary

• Interleaving is a sensitive online metric for evaluating rankings
  – Very high agreement when reliable offline relevance metrics are available
  – Agreement of simple interleaving algorithms with A/B metrics can be poor when relevance differences are small or ambiguous
• Solutions:
  – Can de-bias user behaviour (e.g. presentation effects)
  – Can optimize to a known A/B metric (if one is trusted)
  – Can optimize to a known user model

Page 27: Thanks!

Questions?