An Eye-Tracking Study of Query Reformulation

An Eye-tracking Study of Query Reformulation

Carsten Eickhoff

Sebastian Dungs, Vu Tran (University of Duisburg-Essen)

Contents

1 Motivation & Background

2 Experimental Setup

3 Observations

4 Practical Application

5 Conclusion

Carsten Eickhoff SIGIR 2015 Santiago, Chile August 10th, 2015 2 / 30

Domain Expertise

Wildemuth 2004: Domain experts use different search strategies

White et al. 2009: Domain experts are more successful

Liu et al. 2012: Domain expertise increases with exposure

Eickhoff et al. 2014:

Expertise gains are hard to measure within a sessionBut there is evidence of term acquisition

Domain Expertise

Expertise gains are hard to measure within a session

But there is evidence of term acquisition

Domain Expertise

User Attention Models

Buscher et al. 2008: Eye gaze can serve as relevance feedback

Guo et al. 2010: Cursor position and movement are related to gaze

Kunze et al. 2013: Gaze patterns indicate language proficiency

In this paper...

Eye gaze study on the term level

Model user attention based on gaze patterns

Use the attention model to re-rank query suggestion candidates

In this paper...

Contents

3 Observations

5 Conclusion

Hardware

Windows 7 client (22” screen)

Firefox 25.0.1 Web browser

SMI RED remote eye tracker

60Hz update frequency, 0.4° accuracy

Hardware

Participants

10 participants (5M/5F)

Aged between 19 and 27 years

University students

7 additional participants in pilot study

Participants

University students

Participants

University students

Participants

University students

Participants

University students

10 hand-picked TREC QA complex interactive topics (’06/’07)

Focus on technical and medical topicsE.g.: “What effect does aspirin have on coronary heart diseases?”

3 search tasks per participant

Choice among 2 tasksUp to 20 minutes per task

30 sessions of query (re-)formulations

Pre- and post task questionnaires

Focus on technical and medical topics

E.g.: “What effect does aspirin have on coronary heart diseases?”

Choice among 2 tasks

Up to 20 minutes per task

Tasks contd.

Task Id Frequency familiar (pre) interesting (pre) easy (pre) easy (post) understandable (post) found good results (post)

31 4 2.25 3.75 2.50 1.25 3.50 1.75

38 4 2.75 3.50 3.00 2.75 4.50 2.75

41 1 4.00 4.00 2.00 5.00 5.00 5.00

42 4 1.00 2.75 2.50 4.00 4.50 3.25

53 3 2.33 3.67 2.67 3.33 4.00 3.33

55 5 1.20 2.80 3.40 3.20 4.60 3.40

69 1 1.00 4.00 4.00 5.00 5.00 4.00

70 6 1.17 2.5 2.5 2.33 3.5 2.67

72 0 - - - - - -

73 2 1.50 4.50 3.50 4.00 4.50 4.50

Overall 30 1.73 3.23 2.83 3.00 4.17 3.06

High topic frequency variance

Low a-priori familiarity

On average documents were understandable and relevant

Tasks contd.

31 4 2.25 3.75 2.50 1.25 3.50 1.75

38 4 2.75 3.50 3.00 2.75 4.50 2.75

41 1 4.00 4.00 2.00 5.00 5.00 5.00

42 4 1.00 2.75 2.50 4.00 4.50 3.25

53 3 2.33 3.67 2.67 3.33 4.00 3.33

55 5 1.20 2.80 3.40 3.20 4.60 3.40

69 1 1.00 4.00 4.00 5.00 5.00 4.00

70 6 1.17 2.5 2.5 2.33 3.5 2.67

72 0 - - - - - -

73 2 1.50 4.50 3.50 4.00 4.50 4.50

Overall 30 1.73 3.23 2.83 3.00 4.17 3.06

Tasks contd.

31 4 2.25 3.75 2.50 1.25 3.50 1.75

38 4 2.75 3.50 3.00 2.75 4.50 2.75

41 1 4.00 4.00 2.00 5.00 5.00 5.00

42 4 1.00 2.75 2.50 4.00 4.50 3.25

53 3 2.33 3.67 2.67 3.33 4.00 3.33

55 5 1.20 2.80 3.40 3.20 4.60 3.40

69 1 1.00 4.00 4.00 5.00 5.00 4.00

70 6 1.17 2.5 2.5 2.33 3.5 2.67

72 0 - - - - - -

73 2 1.50 4.50 3.50 4.00 4.50 4.50

Overall 30 1.73 3.23 2.83 3.00 4.17 3.06

Tasks contd.

31 4 2.25 3.75 2.50 1.25 3.50 1.75

38 4 2.75 3.50 3.00 2.75 4.50 2.75

41 1 4.00 4.00 2.00 5.00 5.00 5.00

42 4 1.00 2.75 2.50 4.00 4.50 3.25

53 3 2.33 3.67 2.67 3.33 4.00 3.33

55 5 1.20 2.80 3.40 3.20 4.60 3.40

69 1 1.00 4.00 4.00 5.00 5.00 4.00

70 6 1.17 2.5 2.5 2.33 3.5 2.67

72 0 - - - - - -

73 2 1.50 4.50 3.50 4.00 4.50 4.50

Overall 30 1.73 3.23 2.83 3.00 4.17 3.06

Contents

3 Observations

5 Conclusion

Term Acquisition

Log-based study at MSR (2014)

45% of new query terms previously seen on visited pages & SERPsHow much of it did the user actually see?

We replicate the experiment on our data (43% according to logs)

Only 21% received eye gaze focus

Term Acquisition

45% of new query terms previously seen on visited pages & SERPs

How much of it did the user actually see?

Term Acquisition

Literal Term Acquisition

User Attention

Non-query terms Query terms

Relative frequency on page 0.83 0.17

Share of overall fixations 0.68 0.32

Rel. number of fixations per token 0.25 0.57*Share of overall fixation duration 0.61 0.39

Avg. fixation duration per token 58ms 218ms*

Non-query terms make up most of the page

Normalised by class frequency, query terms receive twice as manyfixations

The same holds for fixation duration

User Attention

Semantic Proximity

Literal query term matches suffer from low coverage

Instead: Semantic proximity between fixations and query terms

E.g., using WordNet and Leacock-Chodorow (LCH) similarity

LCH = −log( length2×Dt)

Semantic Proximity

Proximal Term Acquisition

Semantic Proximity

fixation duration fixation likelihood

LCH < 0.25 30ms 0.09

0.25 ≤ LCH < 0.5 36ms 0.15

0.5 ≤ LCH < 0.75 39ms 0.23

0.75 ≤ LCH < 1.0 34ms 0.18

1.0 ≤ LCH < 1.25 32ms 0.13

1.25 ≤ LCH < 1.5 38ms 0.27

1.5 ≤ LCH < 1.75 45ms 0.35

1.75 ≤ LCH < 2.0 53ms 0.43

LCH ≥ 2.0 86ms 0.49

All literal matches excluded

Significant increase in fixation duration and likelihood between termsof LCH ≥ 1.5 and all levels below

Semantic Proximity

LCH < 0.25 30ms 0.09

0.25 ≤ LCH < 0.5 36ms 0.15

0.5 ≤ LCH < 0.75 39ms 0.23

0.75 ≤ LCH < 1.0 34ms 0.18

1.0 ≤ LCH < 1.25 32ms 0.13

1.25 ≤ LCH < 1.5 38ms 0.27

1.5 ≤ LCH < 1.75 45ms 0.35

1.75 ≤ LCH < 2.0 53ms 0.43

LCH ≥ 2.0 86ms 0.49

Semantic Proximity

LCH < 0.25 30ms 0.09

0.25 ≤ LCH < 0.5 36ms 0.15

0.5 ≤ LCH < 0.75 39ms 0.23

0.75 ≤ LCH < 1.0 34ms 0.18

1.0 ≤ LCH < 1.25 32ms 0.13

1.25 ≤ LCH < 1.5 38ms 0.27

1.5 ≤ LCH < 1.75 45ms* 0.35*1.75 ≤ LCH < 2.0 53ms* 0.43*

LCH ≥ 2.0 86ms* 0.49*

Term Length

all terms stop words removed

Length query terms non-query terms query terms non-query terms

< 4 30ms 29ms 32ms 30ms

4-7 38ms* 30ms 39ms* 32ms

8-11 49ms* 33ms 49ms* 33ms

12-15 55ms* 38ms 55ms* 38ms

16-19 69ms* 41ms 69ms* 41ms

≥ 20 82ms* 51ms 82ms* 51ms

Stop words (avg. length 3.9) influence the early categories

Term length has a significant impact on user attention

Within each length class, query terms receive more of it

Term Length

< 4 30ms 29ms 32ms 30ms

4-7 38ms* 30ms 39ms* 32ms

8-11 49ms* 33ms 49ms* 33ms

12-15 55ms* 38ms 55ms* 38ms

16-19 69ms* 41ms 69ms* 41ms

≥ 20 82ms* 51ms 82ms* 51ms

Term Length

< 4 30ms 29ms 32ms 30ms

4-7 38ms* 30ms 39ms* 32ms

8-11 49ms* 33ms 49ms* 33ms

12-15 55ms* 38ms 55ms* 38ms

16-19 69ms* 41ms 69ms* 41ms

≥ 20 82ms* 51ms 82ms* 51ms

Term Length

< 4 30ms 29ms 32ms 30ms

4-7 38ms* 30ms 39ms* 32ms

8-11 49ms* 33ms 49ms* 33ms

12-15 55ms* 38ms 55ms* 38ms

16-19 69ms* 41ms 69ms* 41ms

≥ 20 82ms* 51ms 82ms* 51ms

Term Complexity

AoA query terms non-query terms query terms non-query terms

< 4 30ms 31ms 31ms 32ms

4-6 35ms 32ms 35ms 33ms

6-8 49ms* 32ms 49ms* 34ms

8-10 72ms* 36ms 72ms* 36ms

10-12 112ms* 41ms 112ms* 41ms

12-14 173ms* 48ms 173ms* 48ms

14-16 228ms* 59ms 228ms* 59ms

16-18 287ms* 64ms 287ms* 64ms

≥ 18 384ms* 89ms 384ms* 89ms

Age-of-acquisition (AoA) statistics by Kuperman et al. (2012)

More complex terms receive more user attention

Within each complexity class, query terms receive most of it

Term Complexity

< 4 30ms 31ms 31ms 32ms

4-6 35ms 32ms 35ms 33ms

6-8 49ms* 32ms 49ms* 34ms

8-10 72ms* 36ms 72ms* 36ms

10-12 112ms* 41ms 112ms* 41ms

12-14 173ms* 48ms 173ms* 48ms

14-16 228ms* 59ms 228ms* 59ms

16-18 287ms* 64ms 287ms* 64ms

≥ 18 384ms* 89ms 384ms* 89ms

Term Complexity

< 4 30ms 31ms 31ms 32ms

4-6 35ms 32ms 35ms 33ms

6-8 49ms* 32ms 49ms* 34ms

8-10 72ms* 36ms 72ms* 36ms

10-12 112ms* 41ms 112ms* 41ms

12-14 173ms* 48ms 173ms* 48ms

14-16 228ms* 59ms 228ms* 59ms

16-18 287ms* 64ms 287ms* 64ms

≥ 18 384ms* 89ms 384ms* 89ms

Term Complexity

< 4 30ms 31ms 31ms 32ms

4-6 35ms 32ms 35ms 33ms

6-8 49ms* 32ms 49ms* 34ms

8-10 72ms* 36ms 72ms* 36ms

10-12 112ms* 41ms 112ms* 41ms

12-14 173ms* 48ms 173ms* 48ms

14-16 228ms* 59ms 228ms* 59ms

16-18 287ms* 64ms 287ms* 64ms

≥ 18 384ms* 89ms 384ms* 89ms

Within each complexity class, query terms receive most of itCarsten Eickhoff SIGIR 2015 Santiago, Chile August 10th, 2015 19 / 30

Try this at home...

Previously: Gaze-based attention models are informative

Eye trackers not always available

Guo et al. 2010: Gaze and cursor movement are correlated

Google: 25% of people read with their cursor as a gaze pointer

Host our tasks on AMT

137 unique users500 search sessionsReplace gaze information by cursor hover

Try this at home...

137 unique users

500 search sessionsReplace gaze information by cursor hover

Try this at home...

137 unique users500 search sessions

Replace gaze information by cursor hover

Try this at home...

Cursor-based User Attention Models

Relative frequency 0.79 0.21

Share of overall hovers 0.74 0.26

Rel. number of hovers per token 0.045 0.059*Share of overall hover duration 0.78 0.22

Avg. hover duration per token 130ms 135ms*

Cursor-based User Attention Models contd.

hover duration hover likelihood

LCH < 0.25 116ms 0.041

0.25 ≤ LCH < 0.5 121ms 0.041

0.5 ≤ LCH < 0.75 119ms 0.039

0.75 ≤ LCH < 1.0 123ms 0.043

1.0 ≤ LCH < 1.25 127ms 0.047

1.25 ≤ LCH < 1.5 126ms 0.048

1.5 ≤ LCH < 1.75 129ms 0.050

1.75 ≤ LCH < 2.0 133ms* 0.053*LCH ≥ 2.0 135ms* 0.055*

Contents

3 Observations

5 Conclusion

Query Suggestion

Practical application of attention models

Replicate experiments with querysuggestions

Use attention models to re-rankcandidates

Measure MRR of selected suggestion

Query Suggestion

Methods

3 attention models to score query suggestion candidates

(Buscher et al.): GLF (w) = tf (w , cA) × idf (w ,C ) × LA(w)LA(w)+SA(w)

TAMλ(w) = λF (w) + (1 − λ)∑

dur(tw )

TAM-RΛ(w) = λlLCH(w) + λf F (w) + λd∑

dur(tw )

Methods

TAMλ(w) = λF (w) + (1 − λ)∑

dur(tw )

Methods

TAMλ(w) = λF (w) + (1 − λ)∑

dur(tw )

Methods

TAMλ(w) = λF (w) + (1 − λ)∑

dur(tw )

Results

Method MRR σRR

API Output 0.76 0.28

Gaze-Length-Filter (GLF) 0.79* 0.26

Term-Attention-Model (TAM) 0.80* 0.24

Term-Attention-Model + Relatedness (TAM-R) 0.86H* 0.29

All attention-based models outperform the original API output

GLF and TAM perform similarly

TAM-R’s relatedness component makes it the most effective method

Results

Method MRR σRR

Results

Method MRR σRR

Results

Method MRR σRR

Contents

3 Observations

5 Conclusion

Conclusion

Literal query term acquisition is often heralded by increased userattention in the form of fixation frequency and duration

Besides literal query term acquisition, attention to semanticallyrelated terms indicates interest in topical aspects

Our findings generalize well from high quality eye gaze traces to moreaccessible cursor movement information

User attention models can significantly improve query suggestionperformance

Conclusion

Future Directions

Apply attention model to other tasks such as ranking, SERPdiversification, or personalized readability estimation

Use long-term user models to better represent domain expertise andliteracy skills

Draw from alternative signals such as pupil dilation or saccades.

Future Directions

Thank You!

An Eye-Tracking Study of Query Reformulation

Technology

Transcript of An Eye-Tracking Study of Query Reformulation