Page 1:

Online Search Evaluation with Interleaving

Filip Radlinski, Microsoft

Page 2: Acknowledgments

• This talk involves joint work with:
  – Olivier Chapelle
  – Nick Craswell
  – Katja Hofmann
  – Thorsten Joachims
  – Madhu Kurup
  – Anne Schuth
  – Yisong Yue

Page 3: Motivation

Baseline Ranking Algorithm vs. Proposed Ranking Algorithm

Which is better?

Page 4: Retrieval evaluation

Two types of retrieval evaluation:

• Offline evaluation: Ask experts or users to explicitly evaluate your retrieval system. This dominates evaluation research today.

• Online evaluation: See how normal users interact with your retrieval system when just using it.

Most well-known type: A/B tests

Page 5: A/B testing

• Each user is assigned to one of two conditions
• They might see the left or the right ranking
• Measure user interaction with theirs (e.g. clicks)
• Look for differences between the populations

[Figure: Ranking A and Ranking B shown side by side]
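To make the between-population comparison concrete, here is a minimal sketch of testing the two conditions on click-through rate with a two-proportion z-test; the counts and the choice of CTR as the metric are illustrative assumptions, not part of the talk.

```python
# A minimal sketch of an A/B comparison on click-through rate (CTR).
# Inputs are, per condition, the number of users who clicked at least
# once and the total number of users; the numbers below are made up.
from math import sqrt
from scipy.stats import norm

def ab_ctr_ztest(clicks_a, users_a, clicks_b, users_b):
    """Two-proportion z-test on CTR between conditions A and B."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_a - p_b) / se
    return p_a, p_b, z, 2 * norm.sf(abs(z))   # two-sided p-value

# Small between-population differences need many users to detect.
print(ab_ctr_ztest(clicks_a=5150, users_a=10000, clicks_b=5000, users_b=10000))
```

Because each user sees only one condition, small differences need large populations to reach significance, which is the motivation for the interleaved design introduced next.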

Page 6: Online evaluation with interleaving

• A within-user online ranker comparison
  – Presents results from both rankings to every user
• The ranking that gets more of the clicks wins
  – Designed to be unbiased, and much more sensitive than A/B

[Figure: Ranking A and Ranking B are combined into the ranking shown to users (randomized)]
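As a sketch of how "the ranking that gets more of the clicks wins" turns into a statistic, each interleaved impression can be reduced to a per-query winner and the winners aggregated with a sign test. The +1/-1/0 outcome encoding and the binomial test are illustrative choices, not the talk's exact analysis.

```python
# Aggregate per-impression interleaving outcomes with a sign test.
# Encoding (an assumption): +1 means ranker A got more credited clicks
# on that impression, -1 means ranker B did, 0 means a tie.
from scipy.stats import binomtest

def interleaving_preference(outcomes):
    wins_a = sum(1 for o in outcomes if o > 0)
    wins_b = sum(1 for o in outcomes if o < 0)
    decided = wins_a + wins_b            # ties carry no preference signal
    p_value = binomtest(wins_a, decided, p=0.5).pvalue
    return wins_a, wins_b, p_value

print(interleaving_preference([+1, +1, -1, 0, +1, +1, -1, +1]))
```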

Page 7: Team draft interleaving

Ranking A
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
3. Napa Valley College
   www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
   www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
   www.napavintners.com
6. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
   www.napavalley.com
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
5. NapaValley.org
   www.napavalley.org
6. The Napa Valley Marathon
   www.napavalleymarathon.org

Presented Ranking (A or B marks which ranker contributed each result)
1. Napa Valley – The authority for lodging... (A)
   www.napavalley.com
2. Napa Country, California – Wikipedia (B)
   en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... (B)
   books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... (A)
   www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... (B)
   www.napalinks.com
6. Napa Valley College (A)
   www.napavalley.edu/homex.asp
7. NapaValley.org (B)
   www.napavalley.org

[Radlinski et al. 2008]

Page 8: Team draft interleaving

The same Ranking A, Ranking B, and Presented Ranking as on the previous page, now with the user's clicks marked. Each click is credited to the ranker whose team contributed the clicked result; here the credited clicks split evenly between the two teams.

Tie!

[Radlinski et al. 2008]
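Below is a compact sketch of the team draft procedure that the two example slides illustrate: in each round a coin flip decides which ranker drafts first, each ranker contributes its highest-ranked result not yet shown, and clicks are credited to the team that contributed the clicked result. This is an illustrative implementation in the spirit of Radlinski et al. (2008), not their code.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Return the interleaved list and, per position, the team that added it."""
    interleaved, teams, shown = [], [], set()
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # A coin flip decides which ranker drafts first in this round.
        order = ['A', 'B'] if rng.random() < 0.5 else ['B', 'A']
        for team in order:
            ranking = ranking_a if team == 'A' else ranking_b
            idx = ia if team == 'A' else ib
            while idx < len(ranking) and ranking[idx] in shown:
                idx += 1                     # skip results already shown
            if idx < len(ranking):
                interleaved.append(ranking[idx])
                teams.append(team)
                shown.add(ranking[idx])
                idx += 1
            if team == 'A':
                ia = idx
            else:
                ib = idx
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Credit each click to the team that contributed the clicked result."""
    credit = {'A': 0, 'B': 0}
    for pos in clicked_positions:
        credit[teams[pos]] += 1
    return credit    # more credited clicks wins; equal credit is a tie
```

Run on the Ranking A and Ranking B above, the coin flips produce presented rankings like the one shown; with one credited click per team, credit_clicks returns equal counts and the impression is a tie.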

Page 9: Why might mixing rankings help?

• Suppose results are worth money. For some query:
  – Ranker A: [three result icons omitted]   User clicks
  – Ranker B: [three result icons omitted]   User also clicks
• Users of A may not know what they're missing
  – Difference in behaviour is small
• But if we can mix up results from A & B: strong preference for B

Page 10: Comparison with A/B metrics

• Experiments with real Yahoo! rankers (very small differences in relevance)

[Figure: disagreement probability and p-value vs. query set size, for Yahoo! Pair 1 and Yahoo! Pair 2]

[Chapelle et al. 2012]

Page 11: The interleaving click model

• Click == Good
• Interleaving corrects for position bias
• Yet there are other sources of bias, such as bolding

[Figure: two result snippets compared, one with bolded query terms vs. one without]

[Yue et al. 2010a]

Page 12: The interleaving click model

• Bars should be equal if there was no effect of bolding

[Figure: click frequency on the bottom result vs. rank of results]

[Yue et al. 2010a]

Page 13: Sometimes clicks aren't even good

• Satisfaction of a click can be estimated
  – Time spent on URLs is informative
  – More sophisticated models also consider the query and document (some documents require more effort)
• Time before clicking is another efficiency metric

[Figure: example result list annotated "Click", "Click", "No…"]

[Kim et al. WSDM 2014]
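For illustration only, a very simple stand-in for such a satisfaction estimate is a dwell-time threshold; the 30-second cutoff below is a common heuristic and not the query- and document-aware model of Kim et al. (2014).

```python
# A toy satisfaction estimate from dwell time (illustrative heuristic only;
# the model cited above also conditions on the query and the document).
def is_satisfied_click(dwell_seconds, threshold=30.0):
    """Treat a click as satisfied if the user stayed on the URL long enough."""
    return dwell_seconds >= threshold

def time_to_first_satisfied_click(clicks):
    """clicks: list of (seconds_from_query_to_click, dwell_seconds) tuples."""
    for time_to_click, dwell in clicks:
        if is_satisfied_click(dwell):
            return time_to_click
    return None   # no satisfied click was observed for this query
```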

Page 14: Newer A/B metrics

• Newer A/B metrics can incorporate these signals
  – Time before clicking
  – Time spent on result documents
  – Estimated user satisfaction
  – Bias in the click signal, e.g. position
  – Anything else the domain expert cares about
• Suppose I've picked an A/B metric and assume it to be my target
  – I just want to measure it more quickly
  – Can I use interleaving?

Page 15: An A/B metric as a gold standard

• Does interleaving agree with these A/B metrics?

  AB Metric                       Team Draft Agreement
  Is Page Clicked?                63 %
  Clicked @ 1?                    71 %
  Satisfied Clicked?              71 %
  Satisfied Clicked @ 1?          76 %
  Time-to-click                   53 %
  Time-to-click @ 1               45 %
  Time-to-satisfied-click         47 %
  Time-to-satisfied-click @ 1     42 %

[Schuth et al. SIGIR 2015]

Page 16: An A/B metric as a gold standard

• Suppose we parameterize the clicks
  – Optimize to maximize agreement with our A/B metric
• In particular:
  – Only include clicks where the predicted probability of satisfaction is above a threshold t
  – Score clicks based on the time to satisfied click
  – Learn a linear weighted combination of these

[Schuth et al. SIGIR 2015]
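A sketch of that parameterization: drop clicks whose predicted satisfaction is below a threshold t, turn time-to-click into a score, and combine the pieces with linear weights. The feature names and the particular time transform are assumptions for illustration, not the exact formulation of Schuth et al. (2015).

```python
import math

def click_credit(click, t=0.5, weights=(1.0, 1.0)):
    """click: dict with 'p_sat' (predicted satisfaction probability) and
    'time_to_click' (seconds from query issue to this click)."""
    if click['p_sat'] <= t:
        return 0.0                         # filtered: likely not satisfied
    w_sat, w_time = weights
    # Faster satisfied clicks score higher (one possible transform).
    time_score = 1.0 / math.log2(2.0 + click['time_to_click'])
    return w_sat * click['p_sat'] + w_time * time_score

def interleaving_outcome(clicks_a, clicks_b, **params):
    """Sum per-team credit; a positive margin favours A, negative favours B."""
    credit_a = sum(click_credit(c, **params) for c in clicks_a)
    credit_b = sum(click_credit(c, **params) for c in clicks_b)
    return credit_a - credit_b
```

The threshold t and the weights would then be tuned (for example by grid search) to maximize agreement with the chosen A/B metric.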

Page 17: An A/B metric as a gold standard

  AB Metric                     Team Draft Agreement   Learned            AB Self-Agreement on
                                (1/80th size)          (to each metric)   Subset (1/80th size)
  Is Page Clicked?              63 %                   84 % +             63 %
  Clicked @ 1?                  71 % *                 75 % +             62 %
  Satisfied Clicked?            71 % *                 85 % +             61 %
  Satisfied Clicked @ 1?        76 % *                 82 % +             60 %
  Time-to-click                 53 %                   68 % +             58 %
  Time-to-click @ 1             45 %                   56 % +             59 %
  Time-to-satisfied-click       47 %                   63 % +             59 %
  Time-to-satisfied-click @ 1   42 %                   50 % +             60 %

Page 18: The right parameters

  AB Metric            Team Draft   Learned      Learned         Learned
                       Agreement    Combined     (P(Sat) only)   (Time to click * P(Sat))
  Satisfied Clicked?   71 %         85 % +       84 % +          48 % –

  (Learned thresholds annotated on the slide: P(Sat) > 0.5, P(Sat) > 0.76, P(Sat) > 0.26)

• The optimal filtering parameter need not match the metric definition
• But having the right feature is essential

Page 19: Does this cost sensitivity?

[Figure: statistical power of Team Draft interleaving compared with the "Is Sat clicked" A/B metric]

Page 20: What if you instead know how you value user actions?

• Suppose we don't have an A/B metric in mind
• Instead, suppose we know how to value users' behavior on changed documents:
  – If a user clicks on a document that moved up k positions, how much is it worth?
  – If a user spends time t before clicking, how much is it worth?
  – If a user spends time t' on a document, how much is it worth?

[Radlinski & Craswell, WSDM 2013]

Page 21: Example credit function

• The value of a click is proportional to how far the clicked document moved between A and B
• Example (the document icons are omitted from the transcript):
  – A: three documents in positions 1, 2, 3
  – B: the same documents, reordered (one moved up two positions, two moved down one)
  – Any click on the document that moved up two positions gives credit +2
  – Any click on either of the documents that moved down one position gives credit -1
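A sketch of this credit function: the credit of a clicked document is its rank in A minus its rank in B, so credit is positive when B promotes the document. Assuming B promotes A's third result to the top reproduces the slide's +2 / -1 / -1 credits; the sign convention and the constant of proportionality (here 1) are assumptions.

```python
# An illustrative rank-movement credit function: credit is proportional to
# how far the clicked document moved between rankings A and B (here the
# constant of proportionality is 1, and positive credit favours B).
def movement_credit(doc, ranking_a, ranking_b):
    rank_a = ranking_a.index(doc)   # 0-based rank of the document in A
    rank_b = ranking_b.index(doc)   # 0-based rank of the document in B
    return rank_a - rank_b

# Example consistent with the slide's credits, assuming B = [3rd, 1st, 2nd]:
ranking_a = ['doc1', 'doc2', 'doc3']
ranking_b = ['doc3', 'doc1', 'doc2']
print([movement_credit(d, ranking_a, ranking_b) for d in ranking_a])
# -> [-1, -1, 2]: clicking doc3 earns +2, clicking doc1 or doc2 earns -1
```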

Page 22: Interleaving (making the rankings)

We generate a set of rankings that are similar to those returned by A and B in an A/B test.

[Figure: Ranker A and Ranker B at the top; below, candidate interleaved rankings, with Team Draft showing two of them at 50% each]
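One way to generate such a candidate set, sketched below, is to build every ranking whose next result always comes from the top of A's or B's not-yet-shown results, so that each prefix combines prefixes of A and B. Treat this particular construction as an assumption about what "similar to A and B" means; it is only an illustration.

```python
def allowed_rankings(ranking_a, ranking_b, depth):
    """Enumerate candidate rankings of the requested depth whose every
    prefix combines a prefix of A with a prefix of B."""
    def extend(prefix, ia, ib):
        if len(prefix) == depth:
            yield tuple(prefix)
            return
        for ranking, which in ((ranking_a, 'a'), (ranking_b, 'b')):
            idx = ia if which == 'a' else ib
            while idx < len(ranking) and ranking[idx] in prefix:
                idx += 1                 # skip documents already placed
            if idx < len(ranking):
                nxt = ranking[idx]
                next_ia = idx + 1 if which == 'a' else ia
                next_ib = ib if which == 'a' else idx + 1
                yield from extend(prefix + [nxt], next_ia, next_ib)
    # A and B may propose the same document at a position, so deduplicate.
    return sorted(set(extend([], 0, 0)))

print(allowed_rankings(['d1', 'd2', 'd3'], ['d3', 'd1', 'd2'], depth=3))
```

Team Draft's possible outputs are contained in this candidate set; the optimized approach instead chooses how often to show each candidate, as the next slides describe.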

Page 23: We have an optimization problem!

• We have a set of allowed rankings
• We specified how clicks translate to credit
• We solve for the probabilities of showing each ranking:
  – The probabilities of showing the rankings add up to 1
  – The expected credit given random clicking is zero
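A sketch of these constraints as a linear program in SciPy: the variables are the probabilities of showing each allowed ranking, constrained to sum to one and to give zero expected credit under uniform random clicking. The sensitivity objective from the next slide is omitted, and the example credit values are made up.

```python
import numpy as np
from scipy.optimize import linprog

def solve_unbiased_probabilities(expected_credits):
    """expected_credits[i]: expected credit of allowed ranking i when the
    user clicks uniformly at random (i.e. has no real preference)."""
    n = len(expected_credits)
    # Equality constraints: probabilities sum to 1; expected credit is 0.
    a_eq = np.vstack([np.ones(n), np.asarray(expected_credits, dtype=float)])
    b_eq = np.array([1.0, 0.0])
    # No objective beyond feasibility in this sketch, so c = 0.
    res = linprog(c=np.zeros(n), A_eq=a_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x if res.success else None

# Four allowed rankings with made-up random-click credits.
print(solve_unbiased_probabilities([0.5, -0.25, 1.0, -1.0]))
```

In the full method, the remaining freedom is used to pick the most sensitive distribution rather than an arbitrary feasible one.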

Page 24: Sensitivity

• The optimization problem so far is usually under-constrained (lots of possible rankings)
• What else do we want? Sensitivity!
• Intuition:
  – When we show a particular ranking (i.e. something combining results from A and B), it is always biased (interleaving says that we should be unbiased on average)
  – The more biased, the less informative the outcome
  – We want to show individual rankings that are least biased

I'll skip the maths here...

Page 25: Allowed interleaved rankings for different interleaving algorithms

[Figure: the allowed interleaved rankings of A and B, each with a score and the probability with which each interleaving algorithm shows it, alongside an illustrative optimized solution]

Page 26: Summary

• Interleaving is a sensitive online metric for evaluating rankings
  – Very high agreement when reliable offline relevance metrics are available
  – Agreement of simple interleaving algorithms with A/B metrics can be poor when relevance differences are small or ambiguous
• Solutions:
  – Can de-bias user behaviour (e.g. presentation effects)
  – Can optimize to a known A/B metric (if one is trusted)
  – Can optimize to a known user model

Page 27: Thanks!

Questions?