Tutorial 12 (click models)


Description

Part of the Search Engine course given at the Technion (2011)

Transcript of Tutorial 12 (click models)

Page 1: Tutorial 12 (click models)

Click Models

Kira Radinsky

Slides based on material from:

Filip Radlinski, Madhu Kurup, and Thorsten Joachims

Page 2: Tutorial 12 (click models)

Motivation

• How can we evaluate search engine quality?

Option 1: Ask experts to judge queries & result sets.

For a sample of queries, judges are paid to examine a sample of documents and mark their relevance. This standard process gives a reusable dataset.

Option 2: Watch how users act and hope it tells us something about quality.

For all queries, record how users act and infer the quality of the search results based on the logs of user actions.

Page 3: Tutorial 12 (click models)

Motivation

• How can we evaluate search engine quality?

Option 1: Ask experts to judge queries & result sets.

For a sample of queries, judges are paid to examine a sample of documents and determine relevance. This standard process gives a reusable dataset.

Option 2: Watch how users act and hope it tells us something about quality.

For all queries, record how users act and infer the quality of the search results based on the logs of user actions.

• The key question: What is the relationship between user behaviour and ranking quality?

Page 4: Tutorial 12 (click models)

Outline

• Describe a study of evaluating search with clicks

– Control ranking quality, and measure the effect on user behaviour.

• Evaluation with Absolute Metrics

– Users were shown results from different functions.

– Measure statistics about user responses.

• Evaluation using Paired Comparisons

– Show a combination of results from 2 rankings.

– Infer relative preferences.

• Discuss limitations and open questions

Page 5: Tutorial 12 (click models)

Experiment Design

• Start with search ranking function f.

• Intentionally degrade performance in two steps, making f1 and f2.

• Measure how user behaviour differs between the ranking functions.

• Interleave results from two rankings and measure responses.

Setup: f better than f1 better than f2

Page 6: Tutorial 12 (click models)

User Study on arXiv.org

– Real users and queries

– Users in natural context

– Degradation types:

ORIG → FLAT → RAND

• ORIG: hand-tuned function

• FLAT: ignore meta-data

• RAND: randomize top-10

ORIG → SWAP2 → SWAP4

• ORIG: hand-tuned function

• SWAP2: swap 2 pairs

• SWAP4: swap 4 pairs

– How does user behaviour change?

Page 7: Tutorial 12 (click models)

Experiment Setup

Phase 1: ORIG-FLAT-RAND

• Each user who comes to the search engine is assigned one of 6 experimental conditions:

– Results generated by ORIG

– Results generated by FLAT

– Results generated by RAND

– Results generated by interleaving ORIG & FLAT

– Results generated by interleaving ORIG & RAND

– Results generated by interleaving FLAT & RAND

Phase 2: ORIG-SWAP2-SWAP4

• Each user who comes to the search engine is assigned one of 6 experimental conditions (a sketch of one possible assignment scheme follows this list):

– Results generated by ORIG

– Results generated by SWAP2

– Results generated by SWAP4

– Results generated by interleaving ORIG & SWAP2

– Results generated by interleaving ORIG & SWAP4

– Results generated by interleaving SWAP2 & SWAP4
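The slides do not say how a user is mapped to a condition; a minimal sketch of one plausible scheme (hashing a stable identifier such as an IP address or session cookie, so the same user stays in the same condition across queries) is shown below. All function and variable names here are hypothetical.

```python
import hashlib

# Condition names follow the slides; the list itself is illustrative.
PHASE1_CONDITIONS = [
    "ORIG", "FLAT", "RAND",
    "ORIG+FLAT interleaved", "ORIG+RAND interleaved", "FLAT+RAND interleaved",
]

def assign_condition(user_id: str, conditions=PHASE1_CONDITIONS) -> str:
    """Deterministically map a user identifier to one experimental condition."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]

# Example (hypothetical identifier):
# assign_condition("192.0.2.17")  -> one of the six Phase 1 conditions
```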

Page 8: Tutorial 12 (click models)

Experiment 1: Absolute Metrics

• Measured eight easily recorded statistics

• As the ranking quality decreases, we can hypothesize (a sketch of how a few of these statistics could be computed from a click log follows the list):

Metric: expected change as ranking gets worse

• Abandonment Rate: Increase (more bad result sets)

• Reformulation Rate: Increase (more need to reformulate)

• Queries per Session: Increase (more need to reformulate)

• Clicks per Query: Decrease (fewer relevant results)

• Max Reciprocal Rank*: Decrease (top results are worse)

• Mean Reciprocal Rank*: Decrease (more need for many clicks)

• Time to First Click*: Increase (good results are lower)

• Time to Last Click*: Decrease (fewer relevant results)

(*) Only queries with at least one click count.
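To make the hypothesized metrics concrete, here is a minimal sketch of how a few of them could be computed from a click log. The log format (one record per query with click ranks and timestamps) and the exact metric definitions are assumptions for illustration; the slides only name the statistics.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryRecord:
    """Hypothetical log record: one query and its clicks."""
    issue_time: float                                        # seconds since epoch
    click_ranks: List[int] = field(default_factory=list)     # 1-based result ranks
    click_times: List[float] = field(default_factory=list)   # aligned with click_ranks

def abandonment_rate(log: List[QueryRecord]) -> float:
    """Fraction of queries with no click at all."""
    return sum(1 for q in log if not q.click_ranks) / len(log)

def clicks_per_query(log: List[QueryRecord]) -> float:
    return sum(len(q.click_ranks) for q in log) / len(log)

def max_reciprocal_rank(log: List[QueryRecord]) -> float:
    """1 / smallest clicked rank, averaged over queries with at least one click."""
    clicked = [q for q in log if q.click_ranks]
    return sum(1.0 / min(q.click_ranks) for q in clicked) / len(clicked)

def time_to_first_click(log: List[QueryRecord]) -> float:
    """Seconds from query issue to first click, averaged over clicked queries."""
    clicked = [q for q in log if q.click_times]
    return sum(min(q.click_times) - q.issue_time for q in clicked) / len(clicked)
```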

Page 9: Tutorial 12 (click models)

Experiment Statistics

• On average:

– About 700 queries a day

– About 300 distinct IPs

– About 600 clicks on results

• Each experiment phase was run for one month.

• Each experimental condition observed:

– About 3,000 queries

– About 1,000 queries with clicks

– About 600 distinct IPs.

Page 10: Tutorial 12 (click models)

Absolute Metrics: Results

[Chart: absolute metric values (y-axis 0 to 2) for ORIG, FLAT, RAND and for ORIG, SWAP2, SWAP4.]

Page 11: Tutorial 12 (click models)

Absolute Metrics: Results

• Summarizing the results, out of 6 pairs:

Summary

• Statistical fluctuations after one month of data make conclusions hard to draw.

• None of the absolute metrics reliably identify the better ranking.

Page 12: Tutorial 12 (click models)

Experiment 2: Interleaved Metrics

• Paired comparisons in sensory analysis:

– Perceptual qualities are hard to test on an absolute scale (e.g. taste, sound).

– Subjects usually presented with 2+ alternatives.

– Asked to specify which they prefer.

• Can do the same thing with ranking functions:

– Present two rankings, ask which is better.

– But we’d also like evaluation to be transparent.

• So we can do an interleaving experiment.

Page 13: Tutorial 12 (click models)

Team Draft Interleaving

• Think of making high school sports teams:

– We start with two captains.

– Each has a preference order over players.

– They take turns picking their next player.

• Interleaving Algorithm (see the sketch after this list):

– Flip a coin to see which ranking goes first.

– That ranking picks its highest-ranked available document. Any clicks on it will be assigned to that ranking.

– The other team picks its highest-ranked available document.

– Flip a coin again and continue.
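A minimal Python sketch of the team-draft procedure described above; function and variable names are mine, not from the slides. The "team with fewer picks goes next, coin flip on ties" formulation used here is equivalent to flipping a coin at the start of each round.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Interleave two rankings with team-draft.

    Returns the presented list plus the documents credited to each team,
    which is later used to assign clicks to ranking A or ranking B.
    """
    interleaved = []
    team_a, team_b = set(), set()
    while len(interleaved) < k:
        remaining_a = [d for d in ranking_a if d not in interleaved]
        remaining_b = [d for d in ranking_b if d not in interleaved]
        if not remaining_a and not remaining_b:
            break
        # The team with fewer picks chooses next; flip a coin on ties.
        a_picks = (len(team_a) < len(team_b)
                   or (len(team_a) == len(team_b) and random.random() < 0.5))
        if a_picks and remaining_a:
            doc = remaining_a[0]          # A's highest-ranked available document
            team_a.add(doc)
        elif remaining_b:
            doc = remaining_b[0]          # B's highest-ranked available document
            team_b.add(doc)
        else:                             # B is exhausted; A picks anyway
            doc = remaining_a[0]
            team_a.add(doc)
        interleaved.append(doc)
    return interleaved, team_a, team_b
```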

Page 14: Tutorial 12 (click models)

Team Draft Interleaving Phase

Phase 3: ORIG-FLAT-RAND and ORIG-SWAP2-SWAP4

• Each user who comes to the search engine is assigned one of 6 experimental conditions:

– Results generated by team-draft: ORIG & FLAT

– Results generated by team-draft: ORIG & RAND

– Results generated by team-draft: FLAT & RAND

– Results generated by team-draft: ORIG & SWAP2

– Results generated by team-draft: ORIG & SWAP4

– Results generated by team-draft: SWAP2 & SWAP4

Page 15: Tutorial 12 (click models)

Team Draft Interleaving

Ranking A
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)

Page 16: Tutorial 12 (click models)

Team Draft Interleaving

(Same Ranking A, Ranking B, and presented ranking as on the previous slide.)

Tie!
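Clicks on the presented list are credited to whichever team contributed the clicked document; the ranking whose team collects more clicks wins the comparison for that query, and equal counts give a tie. A minimal sketch, continuing the hypothetical team-draft snippet from earlier:

```python
def credit_clicks(clicked_docs, team_a, team_b):
    """Return 'A', 'B', or 'Tie' for one query, given the set of clicked documents."""
    clicks_a = len(set(clicked_docs) & team_a)
    clicks_b = len(set(clicked_docs) & team_b)
    if clicks_a > clicks_b:
        return "A"
    if clicks_b > clicks_a:
        return "B"
    return "Tie"

# Hypothetical usage: one click on a team-A document and one click on a
# team-B document yields "Tie".
```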

Page 17: Tutorial 12 (click models)

Interleaving Results

[Chart: percentage of comparisons (0 to 60%) won by the better ranking vs. the worse ranking for each pair: ORIG > FLAT, FLAT > RAND, ORIG > RAND, ORIG > SWAP2, SWAP2 > SWAP4, ORIG > SWAP4.]

Page 18: Tutorial 12 (click models)

Interleaving Results

• The conclusion is consistent and stronger than with the absolute metrics:

Summary

• Paired comparison tests always correctly identified the better ranking.

• Most of the differences are statistically significant (a sketch of one possible significance test follows).
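The slides do not say which statistical test was used; one common choice for paired-comparison outcomes is a two-sided binomial sign test on the per-query winners, ignoring ties. A minimal sketch under that assumption:

```python
from math import comb

def sign_test_p_value(wins_better: int, wins_worse: int) -> float:
    """Two-sided sign test: probability of a split at least this extreme
    if each non-tied query were equally likely to favour either ranking."""
    n = wins_better + wins_worse
    k = max(wins_better, wins_worse)
    # Upper tail P(X >= k) for X ~ Binomial(n, 1/2); doubled by symmetry.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example with made-up counts: 60 wins for the better ranking vs. 35 for
# the worse one gives a p-value below 0.05.
```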

Page 19: Tutorial 12 (click models)

Summary of Experiment

• Constructed two triplets of ranking functions.

• Tested on real users.

• Absolute metrics didn’t change as we expected.

– Changes weren’t always monotonic.

• Interleaving gave more significant results, and was more reliable.

– But cannot be run “after the fact” from logs.

• But there are many caveats to think about...

Page 20: Tutorial 12 (click models)

Discussion: Users & Queries

• We were only able to explore a few aspects of the problem:

– The users are not “typical” web users.

– The type of queries is not typical.

– Results could be different in other settings: enterprise search, general web search, personalized search, desktop search, mobile search...

– It would be interesting to conduct similar experiments in some of these other settings.

Page 21: Tutorial 12 (click models)

Discussion: User Interactions

• All click evaluations rely on clicks being useful.

• Presentation should not bias toward either ranking function.

– If we naively interleave two rankings with different snippet engines, we could bias users.

– But what if, say, URL length just differs?

• The answer may be in the snippet (“instant answers”).

– In that case there may be no click.

– Other effects (e.g. temporal, mouse movement, browser buttons) may give more information, but are harder to log.

Page 22: Tutorial 12 (click models)

Discussion: Click Metrics

• The metrics we used were fairly simple

– What if “clicked followed by back within 5 seconds” didn’t count?

– If we got much more data, absolute metrics could also become reliable.

– More sophisticated absolute metrics may be more powerful or reliable.

– More sophisticated interleaved metrics may also be.

Page 23: Tutorial 12 (click models)

Discussion: Log Reusability

• Say somebody else comes up with a new ranking function. Are our logs useful to them?

– For absolute metrics

• Would provide baseline performance numbers.

• But temporal effects, etc, may affect evaluation.

– For paired comparison test:

• Hard to know what the user would have clicked given a different input, so probably not.

Page 24: Tutorial 12 (click models)

Conclusions

• We’d like to evaluate rankings by observing real users: it reflects real needs, and it is cheaper and faster.

• This can be done using absolute measures, or designing a paired comparison experiment.

• In this particular setting, the paired comparison was more reliable and sensitive.

• There are many open questions about when paired comparison is indeed better.