an eye-tracking study of user interactions with query auto completion – katja hofmann, microsoft...

Katja Hofmann

An Eye-tracking Study of User Interactions with Query Auto Completion

Joint work with Bhaskar Mitra, Milad Shokouhi, and Filip Radlinski

@katjahofmann

How can we train and evaluate contextual QAC?

Example: context-dependent queries [Shokouhi ‘13].

sacu salt lake tribune

How do searchers examine and interact with QAC?

Click distributions for QAC on PC and iPhone [Li et al. ‘14]. [Figure: panels PC and iPhone; axes: prefix length, suggestion rank]

Click distributions for QAC over ranks [Mitra et al. ‘14]. [Figure: click ratio by suggestion rank]

From log data:

Can infer examination from data + model (given modelling assumptions)

Model for inferring QAC examination from observed clicks [Kharitonov et al. ‘13].

Goal of this study

Conduct controlled experiments to understand:

How do searchers examine QAC rankings?

How does the quality of QAC rankings affect examination and usage?

Are QAC examination and usage affected by position bias?

Outline

Experiment

Analysis

Results

Discussion

Focus on QAC ranking

Main question: how does ranking quality affect examination and interaction?

Two experimental conditions

Original condition (production), prefix "massachu|":

massachusetts

massachusetts state lottery

massachusetts unemployment

massachusetts registry of motor vehicles

massachusetts secretary of state

massachusetts department of revenue

massachusetts department of education

massachusetts general hospital

Random condition, same prefix "massachu|":

massachusetts unemployment

massachusetts department of education

massachusetts secretary of state

massachusetts registry of motor vehicles

massachusetts

massachusetts general hospital

massachusetts department of revenue

massachusetts state lottery

Conditions are counterbalanced in blocks so that at most 2 consecutive tasks are in the same condition.
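The two conditions and the block constraint can be illustrated with a minimal sketch. The slides do not say how the blocks were built, so the scheme here (balanced blocks of two tasks, one per condition, in random order) is an assumption, and the function names `shuffle_suggestions` and `assign_conditions` are hypothetical.

```python
import random


def shuffle_suggestions(suggestions, rng):
    """Random condition: the same suggestions as production, in shuffled order."""
    shuffled = list(suggestions)
    rng.shuffle(shuffled)
    return shuffled


def assign_conditions(n_tasks, rng):
    """Assign one condition per task using balanced blocks of two (assumed scheme).

    Every block holds one 'original' and one 'random' task in random order,
    so a run of identical conditions can only straddle a block boundary and
    is never longer than 2 tasks.
    """
    conditions = []
    while len(conditions) < n_tasks:
        block = ["original", "random"]
        rng.shuffle(block)
        conditions.extend(block)
    return conditions[:n_tasks]


def longest_run(conditions):
    """Length of the longest run of consecutive identical conditions."""
    best = current = 0
    previous = None
    for condition in conditions:
        current = current + 1 if condition == previous else 1
        best = max(best, current)
        previous = condition
    return best


rng = random.Random(42)  # fixed seed so the sketch is repeatable
plan = assign_conditions(14, rng)
print(plan, longest_run(plan))
```

With 14 tasks this gives 7 tasks per condition for each participant, which keeps the comparison between conditions balanced.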

Search tasks

Designed 14 tasks (+2 practice tasks)

Same tasks for all participants (counterbalanced order), required to control variance.

Included navigational and closed informational tasks (easy and complex).

Included difficult-to-spell names (schwarzenegger), terms that can be abbreviated (wsj).

Example search tasks:

Find the homepage of the Massachusetts General Hospital in Boston, USA. What is their physical address? (navigational)

Japan is the 10th most populated country in the world. How many people live there? (easy informational)

How many matches did Roger Federer win against Rafael Nadal in 2007? (complex informational)

Eye-tracking

Minimize user impact, maximize accuracy

Tobii TX300

unobtrusive

tracks natural head movement

300 Hz temporal resolution

accuracy up to 0.4° visual angle

size of each QAC suggestion on screen: 0.67°

23" monitor

integrated eye-tracker

http://www.tobii.com/Global/Analysis/Downloads/Product_Descriptions/Tobii_TX300_EyeTracker_Product_Description.pdf
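To relate the angular figures above to sizes on screen, the standard visual-angle formula can be applied. The sketch below is only illustrative: the slides do not state the viewing distance, so the 65 cm used here is an assumption, and the function names are hypothetical.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an object of size_cm seen from distance_cm."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


def size_for_angle_cm(angle_deg, distance_cm):
    """Inverse of visual_angle_deg: on-screen size (cm) that subtends angle_deg."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)


VIEWING_DISTANCE_CM = 65.0  # assumption, not taken from the slides

suggestion_height = size_for_angle_cm(0.67, VIEWING_DISTANCE_CM)  # one QAC suggestion
tracker_accuracy = size_for_angle_cm(0.4, VIEWING_DISTANCE_CM)    # best-case tracker accuracy
print(f"suggestion: {suggestion_height:.2f} cm, accuracy: {tracker_accuracy:.2f} cm")
```

Under this assumed distance, a suggestion is a little under 1 cm tall and the best-case tracker error is smaller than that, which is why fixations can be attributed to individual suggestions.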





Studying natural query formulation?

Make searchers type: Provide instructions and search task descriptions on screen (avoid copy-paste).

Participants: 25, with diverse backgrounds, levels of education, and computer experience.

Instruction: Participate in a study of search quality; start search from bing.com, then search any way you like.

Data

Collected

eye fixations + saccades (on QAC and other parts of the screen)

mouse clicks, keystrokes

visited URLs

screen capture videos

browser events

Processed

excluded 19 episodes where users did not search using Bing

result: 331 valid search episodes

extracted 10 measurements to characterize QAC examination, query formulation, and task completion

Measurements

[Figure: timeline of an example search episode (labels Q1-Q3, S1-S4, R1-R5) while the query is typed. Legend: fixation (anywhere on the screen), saccade (anywhere on the screen), mouse click, typed character, control characters, QAC suggestions shown, fixations on QAC suggestions]

TCT task completion time

TFC time to first result click

QFT query formulation time

TFF time to first fixation

CFT cumulative fixation time (A + B, the fixation intervals marked in the figure)

QU QAC suggestion used

QR QAC rank

QL query length

CS characters saved

UQ unique queries submitted

UR unique result pages

+ query and task characteristics
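A minimal sketch of how three of these measurements could be derived from logged events. The event layout (`Fixation` records with a flag for QAC hits) is hypothetical, and two definitions are assumptions rather than the study's own: TFF is measured from the start of typing, and CS is the submitted query length minus the number of typed characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fixation:
    start_ms: float  # fixation onset, relative to the start of the episode
    end_ms: float    # fixation offset
    on_qac: bool     # True if the fixation landed on a QAC suggestion


def time_to_first_fixation(fixations: List[Fixation], typing_start_ms: float) -> Optional[float]:
    """TFF: delay from the start of typing (assumed reference) to the first QAC fixation."""
    starts = [f.start_ms for f in fixations if f.on_qac and f.start_ms >= typing_start_ms]
    return min(starts) - typing_start_ms if starts else None


def cumulative_fixation_time(fixations: List[Fixation]) -> float:
    """CFT: summed duration of all fixations on QAC suggestions (A + B in the figure)."""
    return sum(f.end_ms - f.start_ms for f in fixations if f.on_qac)


def characters_saved(typed_prefix: str, submitted_query: str) -> int:
    """CS: characters the searcher did not have to type (assumed definition)."""
    return max(len(submitted_query) - len(typed_prefix), 0)


# Made-up episode: two fixations on QAC suggestions, two elsewhere on the screen.
episode = [
    Fixation(500, 700, False),
    Fixation(900, 1300, True),
    Fixation(1500, 1600, False),
    Fixation(2000, 2250, True),
]
tff = time_to_first_fixation(episode, typing_start_ms=100)
cft = cumulative_fixation_time(episode)
cs = characters_saved("massachu", "massachusetts general hospital")
print(tff, cft, cs)
```

Episodes without any QAC fixation return `None` for TFF, which mirrors the way TFF and CFT are analysed only for episodes with CFT > 0.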

Analysis using mixed effects model

Model random effects of participant and task, and fixed effect of condition on each response variable:

$g(y_{ij}) = \beta_0 + \beta_1 x_{ij} + p_i u_i + t_j v_j + \varepsilon_{ij}$

$g$: link function (e.g. logit for binary response)

$y_{ij}$: response for participant i and task j

$\beta_0$: condition effect (base level)

$\beta_1$: condition effect (random)

$x_{ij}$: condition indicator

$p_i$: participant indicator

$u_i$: effect of participant

$t_j$: task indicator

$v_j$: effect of task

$\varepsilon_{ij}$: residual noise

Analysis: QAC examination

[Figure: episode timeline highlighting time to first fixation (TFF) and the fixation intervals A and B on QAC suggestions]

response        type    n    β0 estimate  g⁻¹(β0)  β1 estimate  g⁻¹(β0+β1)
CFT > 0         binary  331  3.468*       0.97     -0.220       0.96
CFT | CFT > 0   log     284  7.124*       1241 ms  -0.043       1189 ms
TFF | CFT > 0   log     284  6.503*       667 ms   -0.094       607 ms

* marks coefficients that are estimated to differ significantly from zero.
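The value printed next to each coefficient is that estimate mapped back to the response scale through the inverse link function: β0 alone gives the base level, and β0 + β1 the other condition. The sketch below reproduces the numbers for QAC examination; it assumes the usual inverse links (logistic for binary responses, exponential for log and Poisson responses), and the helper names are hypothetical.

```python
import math


def inv_logit(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumed inverse links per response type.
INVERSE_LINK = {"binary": inv_logit, "log": math.exp, "Poisson": math.exp}


def back_transform(response_type, beta0, beta1):
    """Return (base-level value, other-condition value) on the response scale."""
    inverse = INVERSE_LINK[response_type]
    return inverse(beta0), inverse(beta0 + beta1)


# (response, type, beta0, beta1) as reported for QAC examination.
rows = [
    ("CFT > 0", "binary", 3.468, -0.220),
    ("CFT | CFT > 0", "log", 7.124, -0.043),
    ("TFF | CFT > 0", "log", 6.503, -0.094),
]
for name, response_type, beta0, beta1 in rows:
    base, other = back_transform(response_type, beta0, beta1)
    print(f"{name}: {base:.2f} -> {other:.2f}")
```

Rounded, this gives 0.97 and 0.96 for the probability of any QAC fixation, 1241 ms and 1189 ms for CFT, and 667 ms and 607 ms for TFF, matching the reported values.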

Differences between conditions are much smaller than differences between ranks.

Fixations and use of AS by rank and condition. Condition has little effect, suggesting a strong position bias.

[Figure: mean fixation time (milliseconds) and AS usage (percent) by AS suggestion rank; series: Fixations (original), Fixations (random), AS usage (original), AS usage (random)]

Analysis: query formulation

[Figure: episode timeline highlighting query formulation time (QFT); legend: mouse click, typed character, control characters]

response  type     n    β0 estimate  g⁻¹(β0)  β1 estimate  g⁻¹(β0+β1)
QFT       log      331  8.680*       5884 ms  0.058        6235 ms
QL        Poisson  331  3.224*       25       -0.007       25
QU        binary   331  -0.915*      0.29     -0.508       0.19
CS | QU   Poisson  99   2.192*       9        0.223*       11
QR | QU   Poisson  99   0.344*       1.4      0.044        1.5

* marks coefficients that are estimated to differ significantly from zero.

Analysis: task completion

[Figure: episode timeline; legend: mouse click]

response        type     n    β0 estimate  g⁻¹(β0)  β1 estimate  g⁻¹(β0+β1)
UQ              Poisson  331  0.357*       1.4      0.044        1.5
UR = 0          binary   331  -3.654*      0.03     -0.022       0.02
UR | UR > 0     Poisson  282  0.703*       2.0      0.161*       2.4
TFC | UR > 0    log      282  8.625*       5569 ms  -0.036       5372 ms
TCT ≥ ts        binary   331  -3.217*      0.04     0.764        0.08
TCT | TCT < ts  log      297  11.096*      65.9 s   -0.021       64.5 s

* marks coefficients that are estimated to differ significantly from zero.

How do users interact with AS?

a) touch typing, aware of suggestions


b + c) spelling support vs. expressing an information need


d) seeking suggestions

Discussion

How to measure QAC ranking quality?

Rank-based (e.g., MRR, extracted from logs)

e.g., [Shokouhi ‘13]

QAC usage [Kharitonov et al. ‘13]

Manual judgment of suggestions [Bhatia et al. ‘11]

Result page quality [Liu et al. ‘12]

Effort-based (e.g., MKS) [Duan & Hsu ‘11]

AB-tests [Kohavi et al. ‘13]

Interleaving [Hofmann et al. ‘13]
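As an illustration of the rank-based option in the list above, a minimal sketch of mean reciprocal rank (MRR) over logged suggestion lists. It assumes that the finally submitted query is treated as the relevant suggestion and that a query missing from the list contributes 0; the function name and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, submitted_queries):
    """MRR: average of 1/rank of the submitted query in each suggestion list.

    A submitted query that does not appear in its list contributes 0 (assumption).
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for suggestions, target in zip(ranked_lists, submitted_queries):
        if target in suggestions:
            total += 1.0 / (suggestions.index(target) + 1)  # ranks start at 1
    return total / len(ranked_lists)


# Toy log: three impressions of the same three suggestions.
lists = [
    ["massachusetts", "massachusetts state lottery", "massachusetts unemployment"],
    ["massachusetts", "massachusetts state lottery", "massachusetts unemployment"],
    ["massachusetts", "massachusetts state lottery", "massachusetts unemployment"],
]
submitted = ["massachusetts", "massachusetts unemployment", "massachusetts general hospital"]
print(mean_reciprocal_rank(lists, submitted))
```

Because MRR rewards whatever sits at the top of the list, a strong position bias in examination matters when interpreting it: a suggestion may go unused simply because its rank was never looked at.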

Summary

To learn from user interactions, we need to understand how to interpret them.

Here: focus on effects of ranking changes on user interactions with AS.

Found evidence of strong position bias (no differences in examination / positional AS use), but strong effect on query effectiveness (e.g., # unique pages).

Next: incorporate findings into metrics for evaluation and learning, e.g., can we detect examinations from typing behavior?

References

[Bhatia et al. ‘11] S. Bhatia, D. Majumdar, P. Mitra: Query suggestions in the absence of query logs (SIGIR 2011).

[Duan & Hsu ‘11] H. Duan, B.-J. P. Hsu: Online spelling correction for query completion (WWW 2011).

[Hofmann et al. ‘13] K. Hofmann, S. Whiteson, M. de Rijke: Fidelity, soundness, and efficiency of interleaved comparison methods (ACM TOIS 31(4) 2013).

[Hofmann et al. ‘14] K. Hofmann, B. Mitra, M. Shokouhi, F. Radlinski: An Eye-tracking Study of User Interactions with Query Auto Completion (CIKM 2014).

[Kharitonov et al. ‘13] E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis: User Model-based Metrics for Offline Query Suggestion Evaluation (CIKM 2013).

[Kohavi et al. ‘13] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, N. Pohlmann: Online controlled experiments at large scale (KDD 2013).

[Li et al. ‘14] Y. Li, A. Dong, H. Wang, H. Deng, Y. Chang, C. Zhai: A Two-Dimensional Click Model for Query Auto-Completion (SIGIR 2014).

[Liu et al. ‘12] Y. Liu, R. Song, Y. Chen, J.-Y. Nie, J.-R. Wen: Adaptive query suggestion for difficult queries (SIGIR 2012).

[Mitra et al. ‘14] B. Mitra, M. Shokouhi, F. Radlinski, K. Hofmann: On User Interactions with Query Auto-Completion (SIGIR 2014).

[Shokouhi ‘13] M. Shokouhi: Learning to Personalize Query Auto-Completion (SIGIR 2013).

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
