Control of Attention and Gaze in Natural Environments

Selecting information from visual scenes
What controls the selection process?
Natural scenes contain much more information than we can perceive in a brief exposure. If we view this scene for a second or two, we'll move our gaze around the image, perhaps looking at the bicycle in the center, or at large objects like the building. This process of selecting particular information in the scene isn't random, but we really don't know what determines where we look and what we attend to.
Humans must select a limited subset of the available information in the environment. 
Fundamental Constraints:
- Acuity is restricted.
- Attention is limited. 
- Visual Working Memory is limited.
- Only a limited amount of information can be retained.
What controls these processes?
Demonstration: you see a sequence of two brief images of simple shapes, and one object is changed in the second view. Your job is to identify the changed item.
When people do experiments like this, they find that observers can remember about 4 items. This gives us a visceral sense of these limitations.

Image properties, e.g. brightness, edges, and color, can account for some fixations when viewing images of scenes.
Saliency
Can we rely on image properties to guide where we look?
Important information may not be salient, e.g. Stop signs in a cluttered environment.

Salient information may not be important, e.g. retinal image transients from eye/body movements.

Saliency doesn't account for many observed fixations, especially in natural behavior (e.g. Land etc.).

Behavioral goals determine what information is needed.

Need to Study Natural Behavior
We are inclined to think of vision as viewing a picture, but more often we're acting in the environment. Also, when someone views a 2D image, we don't really know what the observer is doing: maybe remembering objects, maybe judging image quality. With a task requiring overt actions, we have a good idea of what the observer is doing from moment to moment. The need for action means different information is required. Not only is the stimulus different (2D vs 3D, field of view, etc.), the information you need is different.

Viewing pictures of scenes is different from acting within scenes.
The other problem with trying to explain fixation patterns or the distribution of attention by looking at fixations on images is that real vision really is different from looking at an image of a scene. If you're in a scene, you need different kinds of information from when you're looking at an image. For example, we can think of natural vision as being composed of a set of mini-tasks like these, and gaze needs to be doled out in the service of each of them. When looking at an image, it's not clear what the observer is doing: recognition? memory?

Foot placement
Obstacle avoidance
Heading
Top-down factors
Top-down versus bottom-up
To what extent is the selection of information from scenes determined by cognitive goals (i.e. top-down), and how much by the stimulus itself (i.e. salient regions, bottom-up effects)?

Modeling Top-Down Control
Virtual Humanoid has a small set of simple visual behaviors:
- Sidewalk Following
- Picking Up Blocks
- Avoiding Obstacles

Each behavior uses a limited, task-relevant selection of visual information from the scene. This is computationally efficient.
Walter the Virtual Humanoid
Sprague & Ballard (2003)

Could a purely top-down system work? This idea is behind the work of Sprague & Ballard, who developed a model of gaze behavior in a walking context. This is Walter, a virtual agent. Walter's task is to walk through this virtual environment, and he uses vision to do 3 things. The agent has a small library of behavioral routines that need visual input. Through reinforcement learning, the humanoid learns the appropriate policy by which to schedule extraction of visual information. In this model, a top-down scheduler for acquiring visual information is adequate for obstacle avoidance and Walter's other tasks.

Walter learns where/when to direct gaze using a reinforcement learning algorithm.
[Figure: Walter's sequence of fixations on obstacles, sidewalk, and litter]
The model suggests that such a system is feasible: the subject has a set of sub-tasks to perform, and gaze reflects performance of the sub-tasks.
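To make the scheduling idea concrete, here is a minimal sketch of a top-down gaze scheduler. It is not the Sprague & Ballard model: the learned reinforcement-learning policy is replaced by hand-set numbers, and the names `Behavior`, `cost_weight`, `growth` and `schedule_gaze` are hypothetical. The only idea it illustrates is that each behavior's information gets staler while gaze is elsewhere, and that gaze goes to the behavior where staleness is currently most costly.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    name: str
    cost_weight: float         # assumed cost per unit of uncertainty (hand-set, not learned)
    growth: float              # assumed uncertainty added per step without a look
    uncertainty: float = 0.0


def schedule_gaze(behaviors, n_steps):
    """Return the names of the behaviors that receive gaze, one per time step."""
    sequence = []
    for _ in range(n_steps):
        # Every behavior's estimate of the world gets staler each step.
        for b in behaviors:
            b.uncertainty += b.growth
        # Look where stale information is currently most costly.
        target = max(behaviors, key=lambda b: b.cost_weight * b.uncertainty)
        target.uncertainty = 0.0  # a fixation refreshes that behavior's estimate
        sequence.append(target.name)
    return sequence


def make_behaviors():
    # Illustrative weights only; they are not taken from the model or the data.
    return [
        Behavior("obstacles", cost_weight=3.0, growth=1.0),
        Behavior("sidewalk", cost_weight=2.0, growth=1.0),
        Behavior("litter", cost_weight=1.0, growth=1.0),
    ]


fixations = schedule_gaze(make_behaviors(), n_steps=12)
print(fixations)
```

With these weights, gaze cycles among all three behaviors but returns most often to obstacles. Anything that is not in the list of behaviors never receives a fixation, which is the limitation of a purely top-down scheduler.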

Walter the Virtual Humanoid
Sprague & Ballard (VSS 2004)

What about unexpected events?

However, what Walter would not be able to handle is an unexpected salient event, such as the appearance of another pedestrian in the field of view. Walter would be in trouble because he doesn't have looking for other pedestrians in his behavioral repertoire.

Dynamic Environments

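Bottom-up versus top-down: computational load and unexpected events
- Bottom-up. Computational load: expensive. Unexpected events: can handle unexpected salient events.
- Top-down. Computational load: efficient. Unexpected events: how to deal with them?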
Top down systems are more efficient because they select limited, task-specific information from the image, but will miss things not on the agenda. Bottom up systems that do a bunch of pre-processing of the image can catch a wider variety of information, but are computationally expensive. How would a top down system deal with unexpected events? Through learning or frequent checking?
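The "frequent checking" option can be given a rough number. The sketch below assumes an event that stays visible for a fixed duration, checks that are instantaneous and evenly spaced, and an event onset that is random relative to the checking rhythm; under those assumptions the chance that at least one check falls inside the event is min(1, duration / interval). The function name and the example intervals are illustrative and not from the talk.

```python
def detection_probability(event_duration_s, check_interval_s):
    """Chance that an evenly spaced check lands inside an event of fixed duration.

    Assumes instantaneous checks and an event onset that is uniformly random
    relative to the checking rhythm.
    """
    if check_interval_s <= 0:
        raise ValueError("check_interval_s must be positive")
    return min(1.0, event_duration_s / check_interval_s)


# Example: an event lasting 1 second, checked at different (hypothetical) intervals.
for interval_s in (0.5, 1.0, 2.0, 4.0):
    print(interval_s, detection_probability(1.0, interval_s))
```

The trade-off is direct: checking twice as often doubles the detection rate (up to certainty) but also doubles the gaze time taken away from the current task.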

Driving Simulator

Gaze distribution depends on tasks

[Figure: time fixating the intersection; conditions: Follow, Obey Traffic Rules]
The Problem: Any selective perceptual system must choose the right visual computations, and when to carry them out. How do we deal with the unpredictability of the natural world?
Answer - it's not all that unpredictable and we're really good at learning it.
So this is the essential problem for top-down systems - How do you know what to look for, and when to look for it?

This tight link between vision and task demands brings up the problem of scheduling behaviors. The visual system has limited capacity and computational ability. How does the visual system balance the current task goals against new stimuli that may change task demands? How does this selection occur? Learning? Frequent checking?

Human Gaze Distribution when Walking
Experimental Question: How sensitive are subjects to unexpected salient events?

General Design: Subjects walked along a footpath in a virtual environment while avoiding pedestrians. Do subjects detect unexpected potential collisions?

To examine these tradeoffs we designed a walking experiment in virtual reality in which we could manipulate the bottom-up signal. What happens if a pedestrian suddenly starts to come at you, i.e. a looming stimulus?
Virtual Walking Environment
Virtual Research V8 Head Mounted Display with 3rd Tech HiBall Wide Area motion tracker
V8 optics with ASL501 Video Based Eye Tracker (Left) and ASL 210 Limbus Tracker (Right)

Video Based Tracker
Limbus Tracker
Our lab has several systems integrated to allow such a virtual reality experiment. We have a head-mounted display with two eye trackers installed. In the picture on the right-hand side you can see the video-based tracker for point-of-gaze (POG) recording, which is complemented by a limbus tracker used for saccade-contingent updates. To allow the subjects to walk a sufficient length, a wide-area motion tracking system is used to update the view inside the display while the subject walks the ~27 meter perimeter of a rectangular path in the lab.

Virtual Environment
Bird's-eye view of the virtual walking environment.

Bird's-eye view of the footpath that the subjects walked. 6 subjects each performed 6 trials of walking: 3 in the no-following condition and 3 in the following condition. Each trial consisted of walking around the path six times, which took about 3-4 minutes.

[Video: short clip at normal speed]
1 - Normal Walking: Avoid the pedestrians while walking at a normal pace and staying on the sidewalk.

2 - Added Task: Identical to condition 1, with the additional instruction to follow a yellow pedestrian.

Normal walking
Follow leader
Experimental Protocol
Side-by-side pictures of the two conditions. 3 blocks of 6 circuits of each.
Pedestrians paths

Colliding pedestrian path
What Happens to Gaze in Response to an Unexpected Salient Event?
The Unexpected Event: Pedestrians on a non-colliding path changed onto a collision course for 1 second (10% frequency). Change occurs during a saccade.
Does a potential collision evoke a fixation?
Pedestrian must be 3-5 meters away and the angular delta could be no greater than 30 degrees. Contingent on saccade.
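The trigger rule can be sketched as a single predicate. This is an illustration under assumptions, not the experiment's code: "angular delta" is interpreted here as the heading change the pedestrian would need in order to walk straight at the subject, positions are 2D coordinates in meters, and the function and parameter names are hypothetical.

```python
import math


def can_trigger_collision(subject_xy, pedestrian_xy, pedestrian_heading_deg,
                          saccade_in_progress,
                          min_dist_m=3.0, max_dist_m=5.0, max_turn_deg=30.0):
    """Return True if a pedestrian may be switched onto a collision course now."""
    dx = subject_xy[0] - pedestrian_xy[0]
    dy = subject_xy[1] - pedestrian_xy[1]
    distance_m = math.hypot(dx, dy)

    # Direction from the pedestrian to the subject, in degrees.
    bearing_deg = math.degrees(math.atan2(dy, dx))
    # Smallest absolute angle between current heading and that direction (0..180).
    turn_deg = abs((bearing_deg - pedestrian_heading_deg + 180.0) % 360.0 - 180.0)

    return (saccade_in_progress
            and min_dist_m <= distance_m <= max_dist_m
            and turn_deg <= max_turn_deg)


# Pedestrian 4 m away, heading 20 degrees off the line to the subject, mid-saccade.
print(can_trigger_collision((0.0, 0.0), (4.0, 0.0), 160.0, True))
```

Gating the change on a saccade means the path change itself happens while vision is suppressed, so any fixation that follows is a response to the collision course and not to the moment of the switch.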

Fixation on Collider
The purple pedestrian turns the corner; the subject fixates the pedestrian, looks to the path, then maintains fixation during the collision period and as the pedestrian passes.

No Fixation During Collider Period
In this clip a purple pedestrian appears in the visual field shortly after which the pedestrian starts on a collision path. The subject does not fixate the collider pedestrian during its collision course.

Probability of Fixation During Collision Period
More fixations on colliders in normal walking.
No effect in Leader condition.
[Figure: probability of fixation, Controls vs Colliders, in the Normal Walking and Follow Leader conditions; legend: pedestrians' paths, colliding pedestrian path]
So the collision event does seem to attract gaze, but only to a limited extent, and not if you have the added task of following a leader. There is a small increase in the probability of fixating the collider.

Why are colliders fixated?
Failure of collider to attract attention with an added task (following) suggests that detections result from top-down monitoring.

Detecting a Collider Changes Fixation Strategy
Longer fixation on pedestrians following a detection of a collider.
[Figure: time fixating normal pedestrians following detection of a collider, Miss vs Hit, in the Normal Walking and Follow Leader conditions]

Top-down systems rely on estimating the likelihood of environmental events, so detection of an unlikely or significant event like a potential collision might lead subjects to spend more time monitoring pedestrians. This indicates that subjects can quickly modify their fixation strategy in response to information that indicates a need to change policy.
To make a top-down system work, subjects need to learn the statistics of environmental events and distribute gaze/attention based on these expectations.
Subjects rely on active search to detect potentially hazardous events like collisions, rather than reacting to bottom-up, looming signals.
Possible reservations: Perhaps the looming robots are not similar enough to real pedestrians to evoke a bottom-up response.

Walking - Real World
Experimental question:

Do subjects learn to deploy gaze in response to the probability of environmental events?

General design: Subjects walked on an oval path and avoided pedestrians

To examine these tradeoffs we designed a walking experiment in virtual reality in which we could manipulate the general task demands as well as a salient bottom up signal, used to probe the questions we have framed.

Experimental Setup

System components: Head mounted optics (76g), Color scene camera, Modified DVCR recorder, Eye Vision Software, PC Pentium 4, 2.8GHz processor
A subject wearing the ASL Mobile Eye
Occasionally some pedestrians veered onto a collision course with the subject (for approx. 1 sec).

3 types of pedestrians:

Trial 1:
- Rogue pedestrian - always collides
- Safe pedestrian - never collides
- Unpredictable pedestrian - collides 50% of time

Trial 2:
- Rogue becomes Safe
- Safe becomes Rogue
- Unpredictable - remains same

Experimental Design (ctd)

Fixation on Collider

Effect of Collision Probability
Probability of fixating increased with higher collision probability.


Detecting Collisions: pro-active or reactive?
Probability of fixating the risky pedestrian is similar, whether or not he/she actually collides on that trial.
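The logic of this comparison can be written out as a short sketch. If gaze were reactive, fixation probability should be clearly higher on encounters where the pedestrian actually collides; if it is pro-active, the two conditional probabilities should be similar. The record layout and the few placeholder records below are made up for illustration and are not data from the experiment.

```python
def fixation_probability(encounters, collided):
    """P(fixated) among encounters whose 'collided' flag matches the argument."""
    subset = [e for e in encounters if e["collided"] == collided]
    if not subset:
        return float("nan")
    return sum(1 for e in subset if e["fixated"]) / len(subset)


# Placeholder records for one risky pedestrian (hypothetical, for illustration only).
toy_encounters = [
    {"collided": True, "fixated": True},
    {"collided": True, "fixated": False},
    {"collided": False, "fixated": True},
    {"collided": False, "fixated": False},
]

p_when_colliding = fixation_probability(toy_encounters, collided=True)
p_when_not_colliding = fixation_probability(toy_encounters, collided=False)
print(p_when_colliding, p_when_not_colliding)
```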

Note this may seem obvious, but in contrast, there is a lot of work trying to predict fixation locations by analyzing properties of the image. It is not clear what the role of saliency might be in normal vision. Body motion generates image motion over the whole retina.

Learning to Adjust Gaze
Changes in fixation behavior are fairly fast and happen over 4-5 encounters (fixations on Rogue get longer, on Safe shorter).
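One simple way to see how expectations could shift within 4-5 encounters is a running estimate of each pedestrian's collision probability that is nudged toward the outcome of every encounter. The talk does not specify a learning rule; the delta-rule update, the prior of 0.5 and the learning rate of 0.5 below are assumptions chosen only for illustration.

```python
def update_estimate(p_hat, collided, learning_rate=0.5):
    """Move the estimated collision probability toward the observed outcome."""
    outcome = 1.0 if collided else 0.0
    return p_hat + learning_rate * (outcome - p_hat)


def estimates_over_encounters(outcomes, prior=0.5, learning_rate=0.5):
    """Return the estimate after each encounter, starting from an assumed prior."""
    p_hat = prior
    history = []
    for collided in outcomes:
        p_hat = update_estimate(p_hat, collided, learning_rate)
        history.append(p_hat)
    return history


rogue = estimates_over_encounters([True] * 5)   # always collides
safe = estimates_over_encounters([False] * 5)   # never collides
print(rogue)
print(safe)
```

Under these assumed settings the two estimates separate after the first encounter and are close to 1 and 0 by the fourth. A gaze policy tied to such estimates would show the kind of rapid divergence between Rogue and Safe described here, although the actual rate depends on parameters the source does not give.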


Shorter Latencies for Rogue Fixations
Rogues are fixated earlier after they appear in the field of view. This change is also rapid.


Effect of Behavioral Relevance
Fixations on all pedestrians go down when pedestrians STOP instead of COLLIDING. STOPPING and COLLIDING should have comparable salience. Note that the Safe pedestrians behave identically in both conditions; only the Rogue changes behavior.

- Fixation probability increases with probability of a collision.
- Fixation probability is similar whether or not the pedestrian collides on that encounter.
- Changes in fixation behavior are fairly rapid (fixations on Rogue get longer and earlier, and on Safe shorter and later).

Our Experiment:

Allocation of gaze when driving.

Effect of task on gaze allocation. Does task affect ability to detect unexpected events?

Drive along street with other cars and pedestrians. 2 instructions - drive normally or follow a lead car.

Measure fixation patterns in the two conditions.


Reward weights estimated from human behavior using Inverse Reinforcement Learning (Rothkopf 2008).
[Figure: Human path, Avatar path]

Conclusions
Subjects must learn the probabilistic structure of the world and allocate gaze accordingly. That is, gaze control is model-based.

Subjects behave very similarly despite unconstrained environment and absence of instructions.

Control of gaze is proactive, not reactive, and thus is model-based.

Anticipatory use of gaze is probably necessary for much visually guided behavior.

Behaviors Compete for Gaze/Attentional Resources
The probability of fixation is lower for both Safe and Rogue pedestrians in both the Leader conditions than in the baseline condition. Note that all pedestrians are allocated fewer fixations, even the Safe ones.

The added task competes for gaze resources, and we are inferring that it competes for attentional resources as well.

Conclusions
Data consistent with task-driven sampling of visual information rather than bottom-up capture of attention:
- No effect of increased salience of collision event.
- Colliders fail to attract gaze in the leader condition, suggesting the extra task interferes with detection.
Observers rapidly learn to deploy visual attention based on environmental probabilities.

Such learning is necessary in order to deploy gaze and attention effectively.

Competing task
Certain stimuli are thought to capture attention bottom-up (e.g. Theeuwes et al., 2001 etc.).

Looming stimuli seem like good candidates for bottom-up attentional capture (Regan & Gray, 2000; Franconeri & Simons, 2003).
All have the intuition that attention is attracted by certain stimuli - e.g. something about to hit you. Extensive literature on what does and doesn't capture attention exogenously - considerable debate.

[Figure: conditions No Leader (Normal Walking) and Follow Leader]
Greater saliency of the unexpected event does not increase fixations. No effect of increased collider speed.
To get more evidence on this issue we increased the saliency of the colliding pedestrian by increasing speed at same time as pedestrian turns onto a collision course.

Other evidence for detection of colliders?
Do subjects slow down during collider period?
- Subjects slow down, but only when they fixate the collider. Implies fixation measures detection.
- Slowing is greater if not previously fixated. Consistent with peripheral monitoring of previously fixated pedestrians.

Conclusions
- Subjects learn the probabilities of events in the environment and distribute gaze accordingly.
- The findings from the Leader manipulation support the claim that different tasks compete for attention.

Effect of Context

Presumably, with such a highly salient stimulus, one would expect a high detection rate of these colliders. Our preliminary results show only a marginal increase in fixations on colliders in real environment (0-20% depending on the condition) when compared to those from the experiment described in Chapter 2. This result would favor active search as source of information (colliders missed if they don't coincide with an active search episode), rather than bottom-up interpretation.
Compare fixations on Risky (goes from 62-70%) with fixations on colliders in the virtual environment (40-60%). In this experiment there are many more collisions, so there's the overall context effect. Fixations on the Safe are higher with collisions than with the stops.

[Figure: time fixating Car, Roadside, Road and Intersection under two instructions, "Follow the car." or "Follow the car and obey traffic rules." Shinoda et al. (2001)]
Detection of signs at intersection results from frequent looks.

What do we know?
Previous work on distribution of attention in natural environments:
- Intersection: P = 1.0
- Mid-block: P = 0.3
Greater probability of detection in probable locations.
Suggests subjects learn where to attend/look.

How well do human subjects detect unexpected events?
Shinoda et al. (2001): Detection of briefly presented Stop signs.

What do Humans Do?
Shinoda et al. (2001) found better detection of unexpected stop signs in a virtual driving task.

To try and answer this question it's worth looking at human behavioral data. In Shinoda et al.'s driving simulation, subjects strategically deployed fixations at key moments during the task, based on learning.
What are the capabilities and limitations of a top-down scheduler? We would like to examine a more demanding situation: does Shinoda's result generalize?




[Figure: Fixations of Safe pedestrian in different contexts. Y axis: probability of fixation. Conditions: Safe ped./Safe environment; Safe ped./No prior experience; Safe ped./Conflicting experience]

