
Learning rational temporal eye movement strategies

David Hoppe a,b and Constantin A. Rothkopf a,b,c,1

a Cognitive Science Centre, Technical University Darmstadt, 64283 Darmstadt, Germany; b Institute of Psychology, Technical University Darmstadt, 64283 Darmstadt, Germany; and c Frankfurt Institute for Advanced Studies, Goethe University, 60438 Frankfurt, Germany

Edited by Michael E. Goldberg, Columbia University College of Physicians, New York, NY, and approved June 1, 2016 (received for review January 24, 2016)

During active behavior humans redirect their gaze several times every second within the visual environment. Where we look within static images is highly efficient, as quantified by computational models of human gaze shifts in visual search and face recognition tasks. However, when we shift gaze is mostly unknown despite its fundamental importance for survival in a dynamic world. It has been suggested that during naturalistic visuomotor behavior gaze deployment is coordinated with task-relevant events, often predictive of future events, and studies in sportsmen suggest that timing of eye movements is learned. Here we establish that humans efficiently learn to adjust the timing of eye movements in response to environmental regularities when monitoring locations in the visual scene to detect probabilistically occurring events. To detect the events humans adopt strategies that can be understood through a computational model that includes perceptual and acting uncertainties, a minimal processing time, and, crucially, the intrinsic costs of gaze behavior. Thus, subjects traded off event detection rate with behavioral costs of carrying out eye movements. Remarkably, based on this rational bounded actor model, the time course of learning the gaze strategies is fully explained by an optimal Bayesian learner with humans’ characteristic uncertainty in time estimation, the well-known scalar law of biological timing. Taken together, these findings establish that the human visual system is highly efficient in learning temporal regularities in the environment and that it can use these regularities to control the timing of eye movements to detect behaviorally relevant events.

eye movements | computational modeling | decision making | learning | visual attention

We live in a dynamic and ever-changing environment. To avoid missing crucial events, we need to constantly use new sensory information to monitor our surroundings. However, the fraction of the visual environment that can be perceived at a given moment is limited by the placement of the eyes and the arrangement of the receptor cells within the eyes (1). Thus, continuously monitoring environmental locations, even when we know which regions in space contain relevant information, is unfeasible. Instead, we actively explore by targeting the visual apparatus toward regions of interest using proper movements of the eyes, head, and body (2–5). This constitutes a fundamental computational problem requiring humans to decide sequentially when to look where. Solving this problem arguably has been crucial to our survival, from the early human hunters pursuing a herd of prey and avoiding predators to the modern human navigating a crowded sidewalk and crossing a busy road.

So far, the emphasis of most studies on eye movements has been on spatial gaze selection. Its exquisite adaptability can be appreciated by considering the different factors influencing gaze, including low-level image features (6), scene gist (7) and scene semantics (8), task constraints (9), and extrinsic rewards (10); also, combinations of factors have been investigated (11). Studies have cast doubt on the optimality of spatial gaze selection (12–14), but computational models of spatial selection have in part established that humans are close to optimal in targeting locations in the visual scene, as in visual search for visible (15) and invisible targets (16) as well as face recognition (17) and pattern classification (18). Although these studies provide insights into where humans look in a visual scene, the ability to manage complex tasks cannot be explained solely by optimal spatial gaze selection. Even if relevant spatial locations in the scene have been determined, these regions must be monitored over time to detect behaviorally relevant changes and thus keep information updated for action selection in accordance with the current state of the environment. Successfully dealing with the dynamic environment therefore requires intelligently distributing the limited visual resources over time. For example, where should we look if the task is to detect an event occurring at one of several possible locations? In the beginning, if the temporal regularities in a new environment are unknown, it may be best to look quickly at all relevant locations equally long. However, with more and more observations of events and their durations, we may be able to use the learned temporal regularities. To keep the overall uncertainty as low as possible, fast-changing regions (e.g., streets with cars) should be attended more frequently than slow-changing regions (e.g., walkways with pedestrians).

Although little is known about how the dynamics in the environment influence human gaze behavior, empirical evidence suggests sophisticated temporal control of eye movements, as in object manipulation (19, 20) and natural behavior (4, 5). A common observation is that gaze is often predictive both of physical events (21) and events caused by other people (22), and studies comparing novices’ and experienced sportsmen’s gaze coordination suggest that temporal gaze selection is learned (23). However, the temporal course of these eye movements and how they are influenced by the event durations in the environment has not been investigated so far. We argue that understanding human eye movements in natural dynamic settings can only be achieved by incorporating the temporal control of the visual apparatus. Also, modeling of temporal eye movement allocation is scarce, with notable exceptions (24), but whereas ideal observers (15–17) model behavior after learning has completed and usually exclude intrinsic costs, reinforcement learning models based on Markov decision processes (16, 24) can be applied to situations involving perceptual uncertainty at best only approximately.

Significance

In a dynamic world humans not only have to decide where to look but also when to direct their gaze to potentially informative locations in the visual scene. Little is known about how timing of eye movements is related to environmental regularities and how gaze strategies are learned. Here we present behavioral data establishing that humans learn to adjust their temporal eye movements efficiently. Our computational model shows how established properties of the visual system determine the timing of gaze. Surprisingly, a Bayesian learner only incorporating the scalar law of biological timing can fully explain the course of learning these strategies. Thus, humans use temporal regularities learned from observations to adjust the scheduling of eye movements in a nearly optimal way.

Author contributions: D.H. and C.A.R. designed research; D.H. performed research; D.H. and C.A.R. analyzed data; and D.H. and C.A.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1601305113/-/DCSupplemental.


Here we present behavioral data and results from a computational model establishing several properties of human active visual strategies specific to temporally varying stimuli. In particular, we address the question of what factors specifically need to be accounted for to predict human gaze timing behavior in response to environmental events, and how learning proceeds. To this end, we created a temporal event detection (TED) task to investigate how humans adjust their eye movement strategies when learning temporal regularities. Our task enables investigating the unknown temporal aspects of eye movement strategies by isolating them from spatial selection and the complexity that arises from natural images. Our results are the first to our knowledge to show that human eye movements are consistently related to temporal regularities in the environment. We further provide a computational model explaining the observed gaze behavior as an interaction of perceptual and action uncertainties as well as intrinsic costs and biological constraints. Surprisingly, based on this model alone, the learning progress is accounted for without any additional assumptions, only based on the scalar law of biological timing, allowing crucial insights into how observers alter their behavior due to task demands and environmental regularities.

Results

Behavioral Results. The TED task required participants to complete 30 experimental blocks, each consisting of 20 consecutive trials. Subjects were instructed to detect a single event on each trial and press a button to indicate the detection of the event, which could occur at one of two possible locations with equal probability (Fig. 1A). This event consisted of a little dot at the center of the two locations that disappeared for a certain duration. The visual separation of the two locations was chosen such that subjects could not detect an event extrafoveally (i.e., when fixating the other location). In each block, event durations were constant but unknown to the subjects in the beginning. They were generated probabilistically from three distributions over event durations, termed short (S), medium (M), and long (L). The rationale of this experimental design was that there is no single optimal strategy for all blocks in the experiment; rather, subjects have to adapt their strategies to the event durations in the respective block (Fig. 1B). Thus, over the course of 20 consecutive trials per block, subjects could learn the event durations by observing events.

We investigated how participants changed their switching behavior over trials within individual blocks as a consequence of learning the constant event durations within a single block. Fig. 2A shows two example trials for participant MK, both from the same block (block LS: long event duration at the left region and short event duration at the right region). In the first trial, when event durations were unknown (Fig. 2A, Upper), MK switched evenly between the regions to detect the event (t7 = 0.48, P = 0.64). However, at the end of this block, in trial 20, we observed a clear change in the switching pattern (SP) (Fig. 2A, Lower). There was a shift in the mean fixation duration over the course of the block (Fig. 2A, Right). The region associated with the short event durations was fixated longer in the last trial of the block (t4 = 2.75, P = 0.04). Hence, the subject changed the gaze SP over trials in this particular block in response to the event durations at the two locations.

Mean fixation durations aggregated over all participants are shown in Fig. 2B. Steady-state behavior across subjects was reached within the first 5 to 10 trials within a block, suggesting that participants were able to quickly learn the event durations. Behavioral changes were consistent across conditions. In conditions with different event durations (SM, SL, and ML), regions with shorter event durations were fixated longer in later trials of the block (all differences were highly significant with P < 10⁻⁵).


Fig. 1. Experimental procedure for the TED task and implications for eye movement strategies. (A) Single-trial structure and stimulus generation. Participants switched their gaze between two regions marked by gray circles to detect a single event. The event consisted of the little dot within one of the circles disappearing (depicted as occurring on the left side in this trial). Participants confirmed the detection of an event by indicating the region (left or right) through a button press. The event duration (how long the dot was invisible) was drawn from one of three probability distributions (short, medium, and long). Overall, six different conditions resulting from the combination of three event durations and two spatial locations were used in the experiment (SS, MM, LL, SM, SL, and ML). A single block consisted of 20 consecutive trials with event durations drawn from the same distributions. For example, in a block LS all events occurring at the left region were long events with a mean duration of 1.5 s, and all events occurring at the right region were short events with a mean duration of 0.15 s. A single event for the condition LS was generated as follows (Lower Left). First, the start time for the event was drawn from a uniform distribution between 2 and 8 s. Next, the region where the event occurred was either left or right with equal probability. Finally, the duration of the event was drawn dependent on where the event occurred. Given that in the example the condition was LS and the event was determined to occur at the left region, the event duration was drawn from a Gaussian distribution with mean 1.5 s (long event). (B) Eye movement strategies for different conditions. The relationship between SPs and the probability of missing the event in a single trial is shown for two conditions. For condition MM (Upper) the probability of missing an event is lower for faster switching (short fixation durations at both regions). For condition LS (long events at the left region and short events at the right region) the probability of missing an event can be decreased by fixating longer at the region with short events (i.e., the right region).
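
The generative process described in the caption can be stated compactly. The Python sketch below samples a single trial under stated assumptions: the mean durations (0.15 s, 0.75 s, 1.5 s) are taken from the caption, while the width of the Gaussian duration distribution is not reported in this excerpt and is set to an illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean event durations (s) from Fig. 1; the SD of each Gaussian is not given
# in this excerpt, so 20% of the mean is an illustrative assumption.
MEAN_DURATION = {"S": 0.15, "M": 0.75, "L": 1.5}

def generate_trial(condition=("L", "S"), rel_sd=0.2):
    """Sample one TED trial, e.g. condition LS: long events at the left
    region, short events at the right region."""
    start = rng.uniform(2.0, 8.0)           # event onset, uniform on [2, 8] s
    region = rng.integers(2)                # 0 = left, 1 = right, equiprobable
    mu = MEAN_DURATION[condition[region]]   # duration depends on the region
    duration = max(rng.normal(mu, rel_sd * mu), 0.0)
    return start, region, duration
```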


Participants’ gaze behavior was not only influenced by whether the event durations were different but was also guided by the size of the difference (Fig. 2C, Left). Greater differences in event durations led to greater differences in the fixation durations (linear regression, r² = 0.93, P < 0.002). Moreover, the overall mean fixation duration aggregated over both regions was affected by the overall mean event durations (Fig. 2C, Right). Participants, in general, switched faster between the regions if the mean event duration was small (linear regression, r² = 0.88, P < 0.005). Blocks with long overall event durations (e.g., LL) yielded longer mean fixation durations compared with blocks with short event durations (e.g., SS). Taken together, these results establish that participants quickly adapted their eye movement strategies to the task requirements given by the timing of events.

However, further inspection of the behavioral data suggests nontrivial aspects of the gaze-switching strategies. The conditions in our TED task differed in task difficulty depending on the event durations (Fig. 2D): the shorter the event durations in a specific condition, the more difficult the task.


Fig. 2. Behavioral results. (A) Sample SP for participant MK in a single block (Left) and mean fixation duration for each region (Right). SPs are shown for two trials: before learning the event durations (trial 1) and after learning the event durations (trial 20). (B) Mean fixation duration aggregated for all participants and all blocks, grouped by condition (error bars correspond to SEM). Mean fixation duration is shown in seconds (y axis) on the same scale for all conditions (see also Fig. S1). Trial number within the block is shown on the x axis. Colors of the lines refer to the event duration used in that condition. (C) Mean fixation duration difference as a function of event duration difference (Left). Each data point corresponds to a specific condition. Mean fixation duration as a function of mean event duration (Right). Again, each data point is associated with one of the conditions. (D) Percentage of detected events for all conditions. Values in square brackets correspond to the 95% credibility interval for the differences in proportion. Different conditions are shown on the x axis. (E) Difference in detection performance between the first and last trial in each block. Different conditions are shown on the x axis. (F) Behavioral changes from trial 1 to trial 20 (arrows) for each condition. Color density depicts the probability of missing an event for every possible SP (see also Fig. S2). Each plot corresponds to a single condition. Conditions are arranged as in B.


Improvement over the course of a block differed between conditions (Fig. 2E), and participants only improved in detection accuracy in conditions associated with a high task difficulty. In contrast, participants showed a slight decrease in performance in the other conditions. This can be represented as arrows pointing from the observed switching behavior in the first trial to the behavior in the last trial within blocks in the action space (Fig. 2F). Overall, several aspects remain unclear: the different rates of change in switching frequencies across trials in a block as well as the different directions and end points of behavioral changes observed in the performance space.

Bounded Actor Model. We hypothesized that multiple processes are involved in generating the observed behavior, based on the behavioral results in the TED task and taking into account known facts about human time constraints in visual processing, time perception, eye movement variability, and intrinsic costs. Starting from an ideal observer (25), we developed a bounded actor model by augmenting it with computational components modeling these processes (Fig. 3A; see Supporting Information for a derivation of the ideal observer and further model components).

The ideal observer chooses SPs that maximize performance in the task. The gaze behavior suggested by the ideal observer corresponds to the SP with the lowest probability of missing the event (Fig. 2F). We extend this model by including perceptual uncertainty, which limits the accuracy of estimating event durations from observations. Extensive literature has reported how time intervals are perceived and reproduced (26–28). The central finding is the so-called scalar law of biological timing (26, 29), a linear relationship between the mean and the SD of estimates in the interval timing task. We included perceptual uncertainty in the ideal observer model based on the values found in the literature.
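
In code, the scalar law amounts to duration estimates whose noise scales with the interval itself. This minimal sketch assumes a Weber fraction of w = 0.2 purely for illustration; the excerpt states only the linear relationship between mean and SD, not its slope.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar law of biological timing: the SD of a duration estimate grows
# linearly with the duration's mean. The Weber fraction w = 0.2 is an
# assumed illustrative value.
def noisy_duration_estimate(true_duration, w=0.2):
    return max(rng.normal(true_duration, w * true_duration), 0.0)

# A 1.5 s (long) event is estimated with ten times the absolute noise of a
# 0.15 s (short) event, but with the same relative precision.
```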

The second extension of the model accounts for eye movement variability, which limits the ability of humans to accurately execute planned eye movements and has been reported frequently in the literature. Although many factors may contribute to this variability in action outcomes, humans have been shown to take this variability into account (30). Further evidence for this comes from experiments showing that in a reproduction task the impact of prior knowledge on behavior increased with the uncertainty in time estimation (31), and from studies that extended an ideal observer to an ideal actor by including behavioral timing variability (32). To describe the variability of fixation durations we used an exponential Gaussian (ex-Gaussian) distribution (33), because it is a common choice for describing fixation durations (34–37). The parameters were estimated from our data using maximum likelihood.
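
The ex-Gaussian is the sum of a Gaussian and an independent exponential random variable, which gives fixation-duration distributions their characteristic right skew. The sketch below samples from such a distribution and recovers the parameters by maximum likelihood with SciPy; the parameter values are assumptions for illustration, not the values fitted in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Ex-Gaussian: Gaussian (mu, sigma) plus exponential (mean tau).
mu, sigma, tau = 0.25, 0.05, 0.10   # seconds; illustrative assumptions

samples = rng.normal(mu, sigma, 10_000) + rng.exponential(tau, 10_000)

# Maximum-likelihood fit; SciPy's exponnorm uses the shape K = tau / sigma.
K, loc, scale = stats.exponnorm.fit(samples)
mu_hat, sigma_hat, tau_hat = loc, scale, K * scale
```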

Although behavioral costs in biological systems are a fundamental fact, very little is known about such costs for eye movements, because research has so far focused on extrinsic costs (10, 11, 24). We included costs in our model by hypothesizing that switching more frequently is associated with increased effort. In a similar task that did not reward faster switching, participants showed fixation durations of about 1 s (38). To our knowledge a concrete shape of these costs has not been investigated yet, and in the present study we used a sigmoid function (Fig. 3B, Right) for which the parameters were estimated using least squares from the measured data.
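
A sigmoid cost over fixation duration can be written in one line; shorter fixations (faster switching) incur higher cost. The shape parameters below are illustrative assumptions standing in for the least-squares fits reported in Fig. 3B.

```python
import numpy as np

# Hypothesized intrinsic cost of gaze switching: cost decreases sigmoidally
# as fixation duration (s) increases, so fast switching is effortful.
# Parameter values are assumed for illustration, not the fitted values.
def switching_cost(fixation_duration, scale=1.0, steepness=10.0, midpoint=0.3):
    return scale / (1.0 + np.exp(steepness * (fixation_duration - midpoint)))
```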

Finally, the model has so far neglected that all neural and motor processes involved take time. The time required for visual processing is highly dependent on the task at hand. It can be very fast (39); however, in the context of an alternative choice task processing is done within the first 150 ms (40) and does not improve with familiarity of the stimulus (41). In our study, the demands for the decision were much lower, because discriminating between the two states of a region (ongoing event or no event) is rather simple. Still, there are some additional constraints that limit the velocity of switching: the eye–brain lag and the time to plan and initiate a saccade (42). In addition, prior research suggests that in general saccades are triggered after processing of the current fixation has reached some degree of completeness (43). We estimated the mean processing time (Fig. 3B, Left) from our behavioral data using least squares.

To investigate which model should be favored in explaining the observed gaze-switching behavior, we fitted the model leaving out one component at a time and computed the Akaike information criterion (AIC) for each of these models.

[Fig. 3 graphic: (A) preferred SPs in the action space for the Ideal Observer, Ideal Actor, and Bounded Actor models; (B) fitted model components, processing time (295 ms) and sigmoid switching costs (σ² = 285 ms); (C) ΔAIC model comparison of reduced models with the full model (M).]

Fig. 3. Computational model. (A) Schematic visualization describing how the different model components influence the preferred SPs for a single condition (MM). Images show which regions in the action space are preferred by the respective models. Each location in the 2D space corresponds to a single SP (see also Fig. 1B). White dots are the mean fixation durations of our subjects aggregated over individual blocks. The optimal strategy is marked with a black cross for each model. The Ideal Observer only takes the probability of missing an event into account. The Ideal Actor also accounts for variability in the execution of eye movement plans, thus suppressing long fixations (because they increase the risk of missing the event); therefore, strategies using shorter fixation durations are favored. Next, additive costs for faster switching yield a preference for longer fixation durations; hence, less value is given to very fast SPs. Finally, the Bounded Actor model excludes very short fixation durations, which are biologically impossible because they do not allow sufficient time for processing visual information. (B) Model components for processing time and switching costs after parameter estimation. Parameters were estimated through least squares using the last 10 trials of each block. (C) Model comparison. For the absence of each component the model was fitted by least squares. Differences in AIC with respect to the full model are shown.


The results are shown in Fig. 3C. The conditional probability for each model was calculated by computing the Akaike weights from the AIC values, as suggested in ref. 44. We found that gaze switches were more than 10,000 times more probable under the full model compared with the second-best model (Table S1 and Fig. S3). In addition, models lacking a single component deviate severely from the observed behavior (Fig. S4). This strongly suggests that the temporal course of eye movements cannot be explained solely on the basis of an ideal observer. Instead, multiple characteristics of perceptual uncertainty, behavioral variability, intrinsic costs, and biological constraints have to be included.
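
Akaike weights turn a set of AIC values into relative model probabilities. The sketch below follows the standard recipe from ref. 44; the AIC values are placeholders, chosen so that the evidence ratio lands near the reported factor of 10,000 (the ratio equals exp of half the AIC difference).

```python
import numpy as np

# Akaike weights (ref. 44): relative probabilities of candidate models.
def akaike_weights(aics):
    d = np.asarray(aics) - np.min(aics)   # AIC differences to the best model
    w = np.exp(-0.5 * d)
    return w / w.sum()

# Placeholder AICs for the full model vs. the second-best reduced model.
weights = akaike_weights([1000.0, 1018.4])
evidence_ratio = weights[0] / weights[1]  # ~1e4 for an AIC difference of 18.4
```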

Bayesian Learner Model. All models so far have assumed full knowledge of the event durations, as is common in ideal observer models. This neglects the fact that event durations were a priori unknown to the participants at the beginning of each block of trials and needed to be inferred from observations. In our TED task information about the event durations cannot be accessed directly but must be inferred from different types of censored and uncensored observations (Fig. 4A). Using a Bayesian learner we simulated the subjects’ mean estimate and uncertainty of the event duration for each trial in a block using Markov chain Monte Carlo techniques (Fig. 4B). We used a single Gaussian distribution fitted to the true event durations for the three conditions to describe subjects’ initial belief before the first observation within a block. Our Bayesian learner enables us to simulate the bounded actor model for every trial of a block using the mean estimates (Fig. 4C). The results show that our bounded actor together with the ideal Bayesian learner is sufficient to recover the main characteristics of the behavioral data. Crucially, the proposed learner does not introduce any further assumptions or parameters. This means that the behavioral changes we observed can be explained solely by changes in the estimates of the event durations. As with the steady-state case, all included factors contributed uniquely to the model. Omitting a single component led to severe deviation from the human data (Fig. S4).
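
The core of such a learner is a trial-by-trial belief update over the event duration. The paper's inference handles censored observations (missed and partially observed events) with MCMC; the sketch below is a deliberately simplified conjugate version that assumes fully observed durations and scalar-law observation noise, just to illustrate how the estimate sharpens over a block.

```python
import numpy as np

# Simplified Bayesian update of the mean event duration. Assumes fully
# observed durations and Gaussian noise with SD = w * duration (scalar law);
# the paper instead uses MCMC over censored and uncensored observations.
def update_belief(prior_mu, prior_var, observation, w=0.2):
    obs_var = (w * observation) ** 2           # scalar-law measurement noise
    gain = prior_var / (prior_var + obs_var)   # Kalman-style gain
    post_mu = prior_mu + gain * (observation - prior_mu)
    post_var = (1.0 - gain) * prior_var
    return post_mu, post_var

mu, var = 0.8, 0.5 ** 2                        # broad prior over durations (s)
for obs in [0.16, 0.14, 0.15, 0.17]:           # short events within a block
    mu, var = update_belief(mu, var, obs)      # belief converges toward 0.15 s
```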

Discussion

The current study investigated how humans learn to adjust their eye movement strategies to detect relevant events in their environment. The minimal computational model matching key properties of observed gaze behavior includes perceptual uncertainty about duration, action variability in generated gaze switches, a minimal visual processing time, and, finally, intrinsic costs for switching gaze. The bounded actor model explains seemingly suboptimal behavior, such as the decrease in performance in easier task conditions, as a trade-off between detection probability and faster but more costly eye movement switching. With respect to this bounded actor model, the experimentally measured switching times at the end of the blocks were close to optimal. Whereas ideal observer models (15–17) assume that the behaviorally relevant uncertainties are all known to the observer, here we also modeled the learning progress. The speed at which learning proceeded was remarkable; subjects were close to their final gaze-switching frequencies after only 5 to 10 trials on average, suggesting that the observed gaze behavior was not idiosyncratic to the TED task. Based on the bounded actor model we proposed a Bayesian optimal learner, which only incorporates the inherent perceptual uncertainty about time passage on the basis of the scalar law of biological timing (26, 28, 29). Remarkably, this source of uncertainty was sufficient to fully account for the learning progress. This suggests that the human visual system is highly efficient in learning temporal regularities of events in the environment and that it can use these to direct gaze to locations in which behaviorally relevant events will occur.

Taken together, this study provides further evidence for the importance of the behavioral goal in perceptual strategies (2, 9, 24), because low-level visual feature saliency cannot account for the gaze switches (45). Instead, cognitive inferences on the basis of successive observations and detections of visual events lead to changes in gaze strategies. At the beginning of a block, participants spent equal observation times between targets to learn about the two event durations and progressively looked longer at the location with the shorter event duration. This implies that internal representations of event durations were likely used to adjust the perceptual strategy of looking for future events. A further important feature of the TED task and the presented model is that, contrary to previous studies (15–17), the informativeness of spatial locations changes moment by moment and not fixation by fixation, which corresponds more to naturalistic settings. Under such circumstances, optimality of behavior is not exclusively evaluated with respect to a single, next gaze shift but instead is driven by a complex sequence of behavioral decisions based on uncertain and noisy sensory measurements.

[Fig. 4 graphic: (A) the three observation types (fully observed, missed, partially observed); (B) prior belief and evolving estimates of the mean event duration (s) over trials 1–20 for short (0.15 s), medium (0.75 s), and long (1.5 s) events; (C) mean fixation duration over trials, human data vs. ideal learner, for all conditions.]

Fig. 4. Bayesian learner model. (A) Different types of observations in the TED task. Dashed lines depict SPs and rectangles denote events. (B) Mean estimates and uncertainty (SD) for the event durations over the course of a block. Prior belief over event durations before the first observation is shown on the left side. (C) Human data and model predictions for the Bayesian learner model. Fitted parameters from the full model were used. Dashed lines show the mean fixation duration of the bounded actor model over the course of the blocks. For each trial the mean estimate suggested by the ideal learner was used.


In this respect, our TED task is much closer to natural vision than experiments with static displays. However, the spatial locations in our task were fixed and the visual stimuli were simple geometric shapes. Naturalistic vision involves stimuli of rich spatiotemporal statistics (46) and semantic content across the entire visual field. How the results observed in this study transfer to subjects learning to attend in more naturalistic scenarios will require further research. The challenge under such circumstances is to quantify relative uncertainties as well as costs and benefits of all contributing factors. Because of its generality, there is nothing that in principle precludes applying the framework developed here to such situations. Clearly, several factors indicate that participants did not depart dramatically from eye movement strategies in more normal viewing environments, and thus that natural behavior transferred well to the TED task. First, subjects did not receive extensive training (Materials and Methods). Second, they showed only minimal changes in learning behavior across blocks over the duration of the entire experiment (Fig. S5). Third, intersaccadic interval statistics were very close to those observed in naturalistic settings according to previous literature (34–37). Overall, the concepts identified here contribute uniquely to our understanding of human visual behavior and should inform our ideas about the neuronal mechanisms underlying attentional shifts (47), conceptualized as sequential decisions. The results of this study may similarly be relevant for understanding how optimal sensory strategies are learned in other modalities, and the methodologies can fruitfully be applied there, too.

Materials and Methods

Subjects. Ten undergraduate subjects (four females) took part in the experiment in exchange for course credit. The participants’ ages ranged from 18 to 30 y. All subjects had normal or corrected-to-normal vision. Data from two subjects were not included in the analysis due to insufficient quality of the eye movement data. Participants completed two to four shorter training blocks (10 trials per block) before the experiment to get used to the setting and the task (on average 5 min). The number of training blocks, however, had no influence on task performance. All experimental procedures were approved by the ethics committee of the Darmstadt University of Technology. Informed consent was obtained from all participants.

Stimuli and Analysis. The two regions (distance 35°) were presented on a Flatron W2242TE monitor (1,680 × 1,050 resolution and 60-Hz refresh rate). Each region was marked with a gray circle (0.52° in diameter). Contrast of the target dot was adjusted to prevent extrafoveal vision. Eye movement data were collected using SMI Eye Tracking Glasses (60-Hz sample rate). Calibration was done using a three-point procedure. For detection of saccades, fixations, and blinks, dispersion-threshold identification was used.
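
Dispersion-threshold identification classifies a run of gaze samples as a fixation while its spatial spread stays below a threshold. A minimal sketch of this idea follows; the threshold and minimum-duration values are illustrative assumptions, as the excerpt does not report the settings used.

```python
import numpy as np

# Simplified dispersion-threshold identification (I-DT) of fixations.
# x, y: gaze positions in degrees, sampled at 60 Hz. Thresholds are assumed.
def idt_fixations(x, y, max_dispersion=1.0, min_samples=6):  # 6 samples ~ 100 ms
    fixations, start = [], 0
    for end in range(len(x)):
        win = slice(start, end + 1)
        # Dispersion: horizontal extent plus vertical extent of the window.
        dispersion = (x[win].max() - x[win].min()) + (y[win].max() - y[win].min())
        if dispersion > max_dispersion:
            if end - start >= min_samples:
                fixations.append((start, end))   # fixation as sample indices
            start = end                          # restart the window here
    if len(x) - start >= min_samples:
        fixations.append((start, len(x)))
    return fixations
```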

ACKNOWLEDGMENTS. We thank R. Fleming, M. Hayhoe, F. Jäkel, and M. Lengyel for useful comments on the manuscript.

1. Land MF, Nilsson D (2002) Animal Eyes (Oxford Univ Press, Oxford).
2. Yarbus A (1967) Eye Movements and Vision (Plenum, New York).
3. Findlay JM, Gilchrist ID (2003) Active Vision: The Psychology of Looking and Seeing (Oxford Univ Press, Oxford).
4. Hayhoe M, Ballard D (2005) Eye movements in natural behavior. Trends Cogn Sci 9(4):188–194.
5. Land M, Tatler B (2009) Looking and Acting: Vision and Eye Movements in Natural Behaviour (Oxford Univ Press, Oxford).
6. Borji A, Sihite DN, Itti L (2013) What stands out in a scene? A study of human explicit saliency judgment. Vision Res 91(0):62–77.
7. Torralba A, Oliva A, Castelhano MS, Henderson JM (2006) Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychol Rev 113(4):766–786.
8. Henderson JM (2003) Human gaze control during real-world scene perception. Trends Cogn Sci 7(11):498–504.
9. Rothkopf CA, Ballard DH, Hayhoe MM (2007) Task and context determine where you look. J Vis 7(14):1–20.
10. Navalpakkam V, Koch C, Rangel A, Perona P (2010) Optimal reward harvesting in complex perceptual environments. Proc Natl Acad Sci USA 107(11):5232–5237.
11. Schütz AC, Trommershäuser J, Gegenfurtner KR (2012) Dynamic integration of information about salience and value for saccadic eye movements. Proc Natl Acad Sci USA 109(19):7547–7552.
12. Morvan C, Maloney LT (2012) Human visual search does not maximize the post-saccadic probability of identifying targets. PLOS Comput Biol 8(2):e1002342.
13. Morvan C, Maloney L (2009) Suboptimal selection of initial saccade in a visual search task. J Vis 9(8):444.
14. Clarke AD, Hunt AR (2016) Failure of intuition when choosing whether to invest in a single goal or split resources between two goals. Psychol Sci 27(1):64–74.
15. Najemnik J, Geisler WS (2005) Optimal eye movement strategies in visual search. Nature 434(7031):387–391.
16. Chukoskie L, Snider J, Mozer MC, Krauzlis RJ, Sejnowski TJ (2013) Learning where to look for a hidden target. Proc Natl Acad Sci USA 110(Suppl 2):10438–10445.
17. Peterson MF, Eckstein MP (2012) Looking just below the eyes is optimal across face recognition tasks. Proc Natl Acad Sci USA 109(48):E3314–E3323.
18. Yang SC, Lengyel M, Wolpert DM (2016) Active sensing in the categorization of visual patterns. eLife 5:e12215, 10.7554/eLife.12215.
19. Ballard DH, Hayhoe MM, Pelz JB (1995) Memory representations in natural tasks. J Cogn Neurosci 7(1):66–80.
20. Johansson RS, Westling G, Bäckström A, Flanagan JR (2001) Eye-hand coordination in object manipulation. J Neurosci 21(17):6917–6932.
21. Diaz G, Cooper J, Rothkopf C, Hayhoe M (2013) Saccades to future ball location reveal memory-based prediction in a virtual-reality interception task. J Vis 13(1):20.
22. Flanagan JR, Johansson RS (2003) Action plans used in action observation. Nature 424(6950):769–771.
23. Land MF, McLeod P (2000) From eye movements to actions: How batsmen hit the ball. Nat Neurosci 3(12):1340–1345.
24. Hayhoe M, Ballard D (2014) Modeling task control of eye movements. Curr Biol 24(13):R622–R628.
25. Geisler WS (1989) Sequential ideal-observer analysis of visual discriminations. Psychol Rev 96(2):267–314.
26. Merchant H, Harrington DL, Meck WH (2013) Neural basis of the perception and estimation of time. Annu Rev Neurosci 36:313–336.
27. Ivry RB, Schlerf JE (2008) Dedicated and intrinsic models of time perception. Trends Cogn Sci 12(7):273–280.
28. Wittmann M (2013) The inner sense of time: How the brain creates a representation of duration. Nat Rev Neurosci 14(3):217–223.
29. Gibbon J (1977) Scalar expectancy theory and Weber's law in animal timing. Psychol Rev 84(3):279.
30. Hudson TE, Maloney LT, Landy MS (2008) Optimal compensation for temporal uncertainty in movement planning. PLOS Comput Biol 4(7):e1000130.
31. Jazayeri M, Shadlen MN (2010) Temporal context calibrates interval timing. Nat Neurosci 13(8):1020–1026.
32. Sims CR, Jacobs RA, Knill DC (2011) Adaptive allocation of vision under competing task demands. J Neurosci 31(3):928–943.
33. Ratcliff R (1979) Group reaction time distributions and an analysis of distribution statistics. Psychol Bull 86(3):446–461.
34. Laubrock J, Cajar A, Engbert R (2013) Control of fixation duration during scene viewing by interaction of foveal and peripheral processing. J Vis 13(12):11.
35. Staub A (2011) The effect of lexical predictability on distributions of eye fixation durations. Psychon Bull Rev 18(2):371–376.
36. Luke SG, Nuthmann A, Henderson JM (2013) Eye movement control in scene viewing and reading: Evidence from the stimulus onset delay paradigm. J Exp Psychol Hum Percept Perform 39(1):10–15.
37. Palmer EM, Horowitz TS, Torralba A, Wolfe JM (2011) What are the shapes of response time distributions in visual search? J Exp Psychol Hum Percept Perform 37(1):58–71.
38. Schütz AC, Lossin F, Kerzel D (2013) Temporal stimulus properties that attract gaze to the periphery and repel gaze from fixation. J Vis 13(5):6.
39. Stanford TR, Shankar S, Massoglia DP, Costello MG, Salinas E (2010) Perceptual decision making in less than 30 milliseconds. Nat Neurosci 13(3):379–385.
40. Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381(6582):520–522.
41. Fabre-Thorpe M, Delorme A, Marlot C, Thorpe S (2001) A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. J Cogn Neurosci 13(2):171–180.
42. Trukenbrod HA, Engbert R (2014) ICAT: A computational model for the adaptive control of fixation durations. Psychon Bull Rev 21(4):907–934.
43. Remington RW, Wu SC, Pashler H (2011) What determines saccade timing in sequences of coordinated eye and hand movements? Psychon Bull Rev 18(3):538–543.
44. Wagenmakers EJ, Farrell S (2004) AIC model selection using Akaike weights. Psychon Bull Rev 11(1):192–196.
45. Itti L, Koch C (2001) Computational modelling of visual attention. Nat Rev Neurosci 2(3):194–203.
46. Simoncelli EP, Olshausen BA (2001) Natural image statistics and neural representation. Annu Rev Neurosci 24(1):1193–1216.
47. Gottlieb J (2012) Attention, learning, and the value of information. Neuron 76(2):281–295.
48. Burnham KP, Anderson DR (2004) Multimodel inference. Sociol Methods Res 33(2):261–304.
