an eye-tracking study of user interactions with query auto completion – katja hofmann, microsoft...
DESCRIPTION
Query Auto Completion (QAC) suggests possible queries to web search users from the moment they start entering a query. This popular feature of web search engines is thought to reduce physical and cognitive effort when formulating a query. Perhaps surprisingly, despite QAC being widely used, users' interactions with it are poorly understood. This paper begins to address this gap. We present the results of an in-depth user study of user interactions with QAC in web search. While study participants completed web search tasks, we recorded their interactions using eye-tracking and client-side logging. This allows us to provide a first look at how users interact with QAC. We specifically focus on the effects of QAC ranking, by controlling the quality of the ranking in a within-subject design. We identify a strong position bias that is consistent across ranking conditions. Due to this strong position bias, ranking quality affects QAC usage. We also find an effect on task completion, in particular on the number of result pages visited. We show how these effects can be explained by a combination of searchers' behavior patterns, namely monitoring or ignoring QAC, and searching for spelling support or complete queries to express a search intent. We conclude the paper with a discussion of the important implications of our findings for QAC evaluation.
TRANSCRIPT
Katja Hofmann
An Eye-tracking Study of User Interactions with Query Auto Completion
Joint work with Bhaskar Mitra, Milad Shokouhi, and Filip Radlinski
@katjahofmann
How can we train and evaluate contextual QAC?
Example: context-dependent queries [Shokouhi ‘13].
sacu salt lake tribune
How do searchers examine and interact with QAC?
From log data:
Click distributions for QAC on PC and iPhone [Li et al. ‘14]. (Figure: panels PC and iPhone; axes: prefix length, suggestion rank.)
Click distributions for QAC over ranks [Mitra et al. ‘14]. (Figure: axes: suggestion rank, click ratio.)
Can infer examination from data + model (given modelling assumptions)
Model for inferring QAC examination from observed clicks [Kharitonov et al. ‘13].
Goal of this study
Conduct controlled experiments to understand:
How do searchers examine QAC rankings?
How does the quality of QAC rankings affect examination and usage?
Are QAC examination and usage affected by position bias?
Outline: Experiment, Analysis, Results, Discussion
Focus on QAC ranking
Main question: how does ranking quality affect examination and interaction?
Two experimental conditions:
Original condition (production):
massachu|
massachusetts
massachusetts state lottery
massachusetts unemployment
massachusetts registry of motor vehicles
massachusetts secretary of state
massachusetts department of revenue
massachusetts department of education
massachusetts general hospital
Random condition:
massachu|
massachusetts unemployment
massachusetts department of education
massachusetts secretary of state
massachusetts registry of motor vehicles
massachusetts
massachusetts general hospital
massachusetts department of revenue
massachusetts state lottery
Conditions were counterbalanced in blocks so that at most 2 consecutive tasks are in the same condition.
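One way to obtain such a schedule is sketched below. The slides do not say how the blocks were built, so the block size of two and the function names are assumptions made for illustration: each block contains both conditions in random order, which limits any run of the same condition to two tasks.

```python
import random


def counterbalanced_conditions(n_tasks, rng):
    """Assign conditions in blocks of two (hypothetical scheme).

    Each block holds both conditions in random order, so the same
    condition can repeat at most twice (across a block boundary).
    Assumes n_tasks is even.
    """
    conditions = []
    for _ in range(n_tasks // 2):
        block = ["original", "random"]
        rng.shuffle(block)
        conditions.extend(block)
    return conditions


def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    longest = current = 0
    previous = None
    for item in seq:
        current = current + 1 if item == previous else 1
        previous = item
        longest = max(longest, current)
    return longest


# 14 tasks, as in the study; the seed is arbitrary.
schedule = counterbalanced_conditions(14, random.Random(0))
print(schedule, longest_run(schedule))
```

With this scheme every participant also sees each condition equally often, here 7 tasks per condition.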
Search tasks
Designed 14 tasks (+2 practice tasks).
Same tasks for all participants (counterbalanced order), required to control variance.
Included navigational and closed informational tasks (easy and complex).
Included difficult-to-spell names (schwarzenegger), terms that can be abbreviated (wsj).
Example search tasks:
Find the homepage of the Massachusetts General Hospital in Boston, USA. What is their physical address? (navigational)
Japan is the 10th most populated country in the world. How many people live there? (easy informational)
How many matches did Roger Federer win against Rafael Nadal in 2007? (complex informational)
Eye-tracking
Minimize user impact, maximize accuracy
Tobii TX300:
unobtrusive
tracks natural head movement
300 Hz temporal resolution
accuracy up to 0.4° visual angle
size of each QAC suggestion on screen: 0.67°
23" monitor
integrated eye-tracker
http://www.tobii.com/Global/Analysis/Downloads/Product_Descriptions/Tobii_TX300_EyeTracker_Product_Description.pdf
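To relate the two angles on this slide to physical sizes, the standard visual-angle formula can be applied. This is only a sketch: the viewing distance of 65 cm is an assumption and is not stated in the slides, and the helper names are made up for the example.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle subtended by an object of size_cm seen from distance_cm."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


def size_for_angle_cm(angle_deg, distance_cm):
    """On-screen size that subtends angle_deg at distance_cm."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)


VIEWING_DISTANCE_CM = 65.0  # assumed; not given in the slides

suggestion_size_cm = size_for_angle_cm(0.67, VIEWING_DISTANCE_CM)
accuracy_cm = size_for_angle_cm(0.4, VIEWING_DISTANCE_CM)
print(f"suggestion size: {suggestion_size_cm:.2f} cm")
print(f"tracker accuracy: {accuracy_cm:.2f} cm")
```

Whatever the actual distance, the ratio is what matters here: the stated accuracy (0.4°) is smaller than the size of one suggestion (0.67°).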
Studying natural query formulation?
Make searchers type: Provide instructions and search task descriptions on screen (avoid copy-paste).
Participants: 25, with diverse backgrounds, levels of education, and computer experience.
Instruction: Participate in a study of search quality; start search from bing.com, then search any way you like.
Outline: Experiment, Analysis, Results, Discussion
Data
Collected:
eye fixations + saccades (on QAC and other parts of the screen)
mouse clicks, keystrokes
visited URLs
screen capture videos
browser events
Processed:
excluded 19 episodes where users did not search using Bing
result: 331 valid search episodes
extracted 10 measurements to characterize QAC examination, query formulation, and task completion
Video
Measurements
(Figure: timeline of one search episode with markers Q1 to Q3, R1 to R5, and S1 to S4, and the typed example "TEST_QUE". Legend: fixation (anywhere on the screen), saccade (anywhere on the screen), mouse click, typed character, control characters, QAC suggestions shown, fixations on QAC suggestions.)
Intervals marked on the timeline:
task completion time (TCT)
time to first result click (TFC)
query formulation time (QFT)
time to first fixation (TFF)
A + B = cumulative fixation time (CFT)
+ query and task characteristics:
QU QAC suggestion used
QR QAC rank
QL query length
CS characters saved
UQ unique queries submitted
UR unique result pages
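The two fixation-based measurements, TFF and CFT, could be derived from a fixation log roughly as follows. This is a minimal sketch under stated assumptions: the `Fixation` record, its field names, and the choice of reference time for TFF (for example, the moment suggestions are first shown) are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    start_ms: int      # fixation onset, relative to episode start
    duration_ms: int   # fixation duration
    on_qac: bool       # True if the fixation landed on a QAC suggestion


def cumulative_fixation_time(fixations):
    """CFT: total duration of all fixations on QAC suggestions."""
    return sum(f.duration_ms for f in fixations if f.on_qac)


def time_to_first_fixation(fixations, reference_ms):
    """TFF: delay from reference_ms to the first QAC fixation, or None.

    reference_ms is an assumed anchor, e.g. when suggestions appeared.
    """
    starts = [f.start_ms for f in fixations
              if f.on_qac and f.start_ms >= reference_ms]
    return min(starts) - reference_ms if starts else None


# Made-up episode with two QAC fixations (A = 300 ms, B = 250 ms).
episode = [
    Fixation(0, 200, False),
    Fixation(700, 300, True),
    Fixation(1100, 150, False),
    Fixation(1400, 250, True),
]
print(cumulative_fixation_time(episode))       # A + B
print(time_to_first_fixation(episode, 100))
```

Episodes without any QAC fixation yield a CFT of 0 and no TFF, which matches the way the analysis conditions TFF on CFT > 0.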
Analysis using mixed effects model
Model random effects of participant and task, and fixed effect of condition on each response variable:
$$g(y_{ij}) = \beta_0 + \beta_1 x_{ij} + p_i u_i + t_j v_j + \varepsilon_{ij}$$
$g$: link function (e.g. logit for binary response)
$y_{ij}$: response for participant $i$ and task $j$
$\beta_0$: condition effect (base level)
$\beta_1$: condition effect (random)
$x_{ij}$: condition indicator
$p_i$: participant indicator
$u_i$: effect of participant
$t_j$: task indicator
$v_j$: effect of task
$\varepsilon_{ij}$: residual noise
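To make the structure of this model concrete, the sketch below generates data from it for a log-link response such as a time measurement. It illustrates the generative reading of the equation only and is not the fitting procedure used in the study; the standard deviations and function names are hypothetical. The participant and task indicators from the equation become list lookups `u[i]` and `v[j]`.

```python
import math
import random


def simulate_response(beta0, beta1, x, u_i, v_j, eps):
    """Log link: log(y) = beta0 + beta1 * x + u_i + v_j + eps."""
    return math.exp(beta0 + beta1 * x + u_i + v_j + eps)


def simulate_study(n_participants, n_tasks, beta0, beta1,
                   sd_participant, sd_task, sd_noise, rng):
    """Draw one response per (participant, task) pair."""
    u = [rng.gauss(0, sd_participant) for _ in range(n_participants)]
    v = [rng.gauss(0, sd_task) for _ in range(n_tasks)]
    rows = []
    for i in range(n_participants):
        for j in range(n_tasks):
            x = rng.randint(0, 1)  # 0 = original, 1 = random condition
            eps = rng.gauss(0, sd_noise)
            y = simulate_response(beta0, beta1, x, u[i], v[j], eps)
            rows.append((i, j, x, y))
    return rows


# 25 participants and 14 tasks as in the study; variances are made up.
rows = simulate_study(25, 14, beta0=7.0, beta1=-0.05,
                      sd_participant=0.3, sd_task=0.4, sd_noise=0.5,
                      rng=random.Random(1))
print(len(rows), rows[0])
```

Fitting the model reverses this process: given the observed rows, it estimates the fixed coefficients and the variances of the participant and task effects.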
Outline: Experiment, Analysis, Results, Discussion
Analysis: QAC examination
(Figure: search episode timeline highlighting fixations on QAC suggestions, time to first fixation (TFF), and A + B = cumulative fixation time (CFT).)
response        type     n    β0      estimate  β1      estimate
CFT > 0         binary   331  3.468*  0.97      -0.220  0.96
CFT | CFT > 0   log      284  7.124*  1241 ms   -0.043  1189 ms
TFF | CFT > 0   log      284  6.503*  667 ms    -0.094  607 ms
* marks coefficients that are estimated to differ significantly from zero.
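The "estimate" columns appear to be the coefficients mapped back to the response scale through the inverse link: β0 for the original condition and β0 + β1 for the random condition. That reading is an assumption, but it reproduces the values in the table, as this short check shows (the function names are illustrative):

```python
import math


def inverse_link(value, response_type):
    """Map a linear predictor back to the response scale."""
    if response_type == "binary":            # logit link
        return 1.0 / (1.0 + math.exp(-value))
    if response_type in ("log", "Poisson"):  # log link
        return math.exp(value)
    raise ValueError(f"unknown response type: {response_type}")


def condition_estimates(beta0, beta1, response_type):
    """Return (original, random) estimates on the response scale."""
    return (inverse_link(beta0, response_type),
            inverse_link(beta0 + beta1, response_type))


# Coefficients copied from the QAC examination table.
table = [
    ("CFT > 0", "binary", 3.468, -0.220),
    ("CFT | CFT > 0", "log", 7.124, -0.043),
    ("TFF | CFT > 0", "log", 6.503, -0.094),
]
for name, kind, b0, b1 in table:
    original, randomized = condition_estimates(b0, b1, kind)
    print(f"{name}: original={original:.2f}, random={randomized:.2f}")
```

For example, exp(7.124) is about 1241 ms and exp(7.124 - 0.043) is about 1189 ms, matching the CFT row.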
Differences between conditions are much smaller than differences between ranks.
Fixations and use of AS by rank and condition. Condition has little effect, suggesting a strong position bias.
(Figure: x-axis: AS suggestion rank; y-axes: mean fixation time (milliseconds) and AS usage (percent); series: Fixations (original), Fixations (random), AS usage (original), AS usage (random).)
Analysis: query formulation
(Figure: search episode timeline highlighting query formulation time (QFT), mouse clicks, typed characters, and control characters.)
response   type     n    β0       estimate  β1      estimate
QFT        log      331  8.680*   5884 ms   0.058   6235 ms
QL         Poisson  331  3.224*   25        -0.007  25
QU         binary   331  -0.915*  0.29      -0.508  0.19
CS | QU    Poisson  99   2.192*   9         0.223*  11
QR | QU    Poisson  99   0.344*   1.4       0.044   1.5
* marks coefficients that are estimated to differ significantly from zero.
Analysis: task completion
(Figure: search episode timeline highlighting task completion time (TCT), time to first result click (TFC), and mouse clicks.)
response         type     n    β0       estimate  β1      estimate
UQ               Poisson  331  0.357*   1.4       0.044   1.5
UR = 0           binary   331  -3.654*  0.03      -0.022  0.02
UR | UR > 0      Poisson  282  0.703*   2.0       0.161*  2.4
TFC | UR > 0     log      282  8.625*   5569 ms   -0.036  5372 ms
TCT ≥ ts         binary   331  -3.217*  0.04      0.764   0.08
TCT | TCT < ts   log      297  11.096*  65.9 s    -0.021  64.5 s
* marks coefficients that are estimated to differ significantly from zero.
Outline: Experiment, Analysis, Results, Discussion
How do users interact with AS?
a) touch typing, aware of suggestions
b + c) spelling support vs. expressing an information need
d) seeking suggestions
Discussion
How to measure QAC ranking quality?
Rank-based (e.g., MRR, extracted from logs), e.g., [Shokouhi ‘13]
QAC usage [Kharitonov et al. ‘13]
Manual judgment of suggestions [Bhatia et al. ‘11]
Result page quality [Liu et al. ‘12]
Effort-based (e.g., MKS) [Duan & Hsu ‘11]
AB-tests [Kohavi et al. ‘13]
Interleaving [Hofmann et al. ‘13]
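As an illustration of the rank-based option, mean reciprocal rank (MRR) can be computed from logged suggestion lists and the query that was finally submitted. This is a generic sketch of MRR, not the exact metric definition of any cited paper; the function names and the idea of pairing each shown list with a submitted query are assumptions. The two lists are the "massachu" examples from the experimental conditions slide.

```python
def reciprocal_rank(suggestions, submitted_query):
    """1 / rank of the submitted query in the list, or 0.0 if absent."""
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == submitted_query:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(impressions):
    """Average reciprocal rank over (suggestions, submitted_query) pairs."""
    if not impressions:
        return 0.0
    total = sum(reciprocal_rank(s, q) for s, q in impressions)
    return total / len(impressions)


original = [
    "massachusetts",
    "massachusetts state lottery",
    "massachusetts unemployment",
    "massachusetts registry of motor vehicles",
    "massachusetts secretary of state",
    "massachusetts department of revenue",
    "massachusetts department of education",
    "massachusetts general hospital",
]
randomized = [
    "massachusetts unemployment",
    "massachusetts department of education",
    "massachusetts secretary of state",
    "massachusetts registry of motor vehicles",
    "massachusetts",
    "massachusetts general hospital",
    "massachusetts department of revenue",
    "massachusetts state lottery",
]
target = "massachusetts general hospital"
print(reciprocal_rank(original, target))    # rank 8
print(reciprocal_rank(randomized, target))  # rank 6
```

A metric like this scores only the position of the target query. Given the strong position bias reported in the results, it does not by itself say whether the searcher examined or used the suggestion.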
Summary
To learn from user interactions, we need to understand how to interpret them.
Here: focus on effects of ranking changes on user interactions with AS.
Found evidence of strong position bias (no differences in examination / positional AS use), but strong effect on query effectiveness (e.g., # unique pages).
Next: incorporate findings into metrics for evaluation and learning, e.g., can we detect examinations from typing behavior?
References
[Bhatia et al. ‘11] S. Bhatia, D. Majumdar, P. Mitra: Query suggestions in the absence of query logs (SIGIR 2011).
[Duan & Hsu ‘11] H. Duan, B.-J. P. Hsu: Online spelling correction for query completion (WWW 2011).
[Hofmann et al. ‘13] K. Hofmann, S. Whiteson, M. de Rijke: Fidelity, soundness, and efficiency of interleaved comparison methods (ACM TOIS 31(4) 2013).
[Hofmann et al. ‘14] K. Hofmann, B. Mitra, M. Shokouhi, F. Radlinski: An Eye-tracking Study of User Interactions with Query Auto Completion (CIKM 2014).
[Kharitonov et al. ‘13] E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis: User Model-based Metrics for Offline Query Suggestion Evaluation (CIKM 2013).
[Kohavi et al. ‘13] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, N. Pohlmann: Online controlled experiments at large scale (KDD 2013).
[Li et al. ‘14] Y. Li, A. Dong, H. Wang, H. Deng, Y. Chang, C. Zhai: A Two-Dimensional Click Model for Query Auto-Completion (SIGIR 2014).
[Liu et al. ‘12] Y. Liu, R. Song, Y. Chen, J.-Y. Nie, J.-R. Wen: Adaptive query suggestion for difficult queries (SIGIR 2012).
[Mitra et al. ‘14] B. Mitra, M. Shokouhi, F. Radlinski, K. Hofmann: On User Interactions with Query Auto-Completion (SIGIR 2014).
[Shokouhi ‘13] M. Shokouhi: Learning to Personalize Query Auto-Completion (SIGIR 2013).