TRIALS AND TRIBULATIONS
Architectural Constraints on Modeling a Visuomotor Task within the Reinforcement Learning Paradigm
SUBJECT OF INVESTIGATION
How humans integrate visual object properties into their action policy when learning a novel visuomotor task.
• BubblePop!
Problem: Too many possible questions…
Solution: Motivate behavioral research by looking at modeling difficulties.
• Nonobvious crossroads
APPROACH
Since the task provides only a scalar performance signal, the model must use reinforcement learning.
• Temporal Difference Back Propagation
Start with an extremely simplified version of the task and add back the complexity once you have a successful model.
Analyze the representational and architectural constraints necessary for each model.
5x5 grid-world
4 possible actions
• Up, down, left, right
1 unmoving target
Starting locations of target and agent randomly assigned
Fixed reward upon reaching target and a new target generated
Epoch ends after fixed number of steps
FIRST STEPS: DUMMY WORLD
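The dummy world just described can be sketched as a tiny environment. This is a minimal sketch, not the authors' implementation; the step cap (50) and reward value (1.0) are illustrative assumptions, since the slides fix a reward but not its magnitude.

```python
import random

GRID = 5
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

class DummyWorld:
    def __init__(self, max_steps=50):  # step cap is an assumption
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # starting locations of target and agent randomly assigned
        self.agent = (random.randrange(GRID), random.randrange(GRID))
        self._new_target()
        self.steps = 0

    def _new_target(self):
        self.target = (random.randrange(GRID), random.randrange(GRID))

    def step(self, action):
        dx, dy = ACTIONS[action]
        x, y = self.agent
        # clamp movement to the 5x5 grid
        self.agent = (min(max(x + dx, 0), GRID - 1),
                      min(max(y + dy, 0), GRID - 1))
        self.steps += 1
        reward = 0.0
        if self.agent == self.target:
            reward = 1.0        # fixed reward upon reaching the target
            self._new_target()  # ...and a new target is generated
        done = self.steps >= self.max_steps  # epoch ends after fixed steps
        return reward, done
```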
DUMMY WORLD ARCHITECTURES
[Architecture diagram: 25 grid input units → 8 hidden-layer units → 1 expected-reward output, evaluated for each of the 4 actions; context units (egocentric only)]
The whole grid (allocentric), or agent-centered (egocentric)
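The two input codings can be sketched as follows. This is an assumption about the encodings, not the slides' exact scheme: the allocentric code marks agent and target on a fixed grid, while the egocentric code re-centers the grid on the agent so only the target's offset matters (note the egocentric grid must be (2n−1)×(2n−1) to cover every offset).

```python
import numpy as np

def allocentric(agent, target, n=5):
    """Fixed 5x5 grid; agent and target marked with distinct values."""
    grid = np.zeros((n, n))
    grid[agent] = 1.0
    grid[target] = -1.0  # illustrative coding to distinguish the two
    return grid.flatten()

def egocentric(agent, target, n=5):
    """Grid centered on the agent; only the target's offset is encoded."""
    grid = np.zeros((2 * n - 1, 2 * n - 1))
    dy, dx = target[0] - agent[0], target[1] - agent[1]
    grid[n - 1 + dy, n - 1 + dx] = 1.0
    return grid.flatten()
```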
Current architectures learn each action independently.
‘Up’ is like ‘Down’, but different.
• It shifts the world
1 action, 4 different inputs
• “In which rotation of the world would you rather go ‘up’ in?”
BUILDING IN SYMMETRY
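The symmetry trick can be sketched like this: instead of four independently learned action networks, a single ‘up’ evaluator is queried on four rotations of the egocentric input. Here `up_value` is a hypothetical stand-in for the learned network, not the actual model.

```python
import numpy as np

def up_value(grid):
    # toy value function: prefer states whose target lies above the agent
    center = grid.shape[0] // 2
    return grid[:center].sum()

def best_action(ego_grid):
    # rotating the world 90 degrees counterclockwise k times maps each
    # action onto 'up': k=1 brings a rightward target above the agent, etc.
    actions = ["up", "right", "down", "left"]
    values = [up_value(np.rot90(ego_grid, k)) for k in range(4)]
    return actions[int(np.argmax(values))]
```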
Scaled grid size up to 10x10
• Not as unrealistic as one might think… (tile coding)
Scaled number of targets
• Difference from 1 to 2, but not from 2 to many.
Confirmed ‘winning-est’ representation
Added memory
WORLD SCALING
Added a ‘ripeness’ dimension to the target, and changed the reward function:

if target.ripeness > 0.60:
    reward = 1
else:
    reward = -0.66667
NO LOW HANGING FRUIT: THE RIPENESS PROBLEM
How the problem occurs:
1. At a high temperature you move randomly.
2. The random pops net zero reward.
3. The temperature lowers and you ignore the target entirely.
ANNEALING AWAY THE CURSE OF PICKINESS
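The failure mode above can be sketched with Boltzmann (softmax) action selection under an annealed temperature: at high temperature choices are near-random, and as the temperature falls the policy commits to whatever the random phase happened to learn. The schedule and constants here are illustrative assumptions.

```python
import math
import random

def softmax_probs(values, temperature):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_choice(values, temperature):
    probs = softmax_probs(values, temperature)
    return random.choices(range(len(values)), weights=probs)[0]

def anneal(t0=10.0, decay=0.99, step=0):
    return t0 * decay ** step  # exponential cooling schedule (assumed form)
```

At high temperature the two actions below are chosen almost uniformly; at low temperature the higher-valued one dominates, which is exactly how a zero-net-reward target gets frozen out.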
No feedback for ‘almost ripe’ pops
So how could we anneal our ripeness criterion?
Anneal the amount you care about unripe pops.
Differentiate internal and external reward functions
A PSYCHOLOGICALLY PLAUSIBLE SOLUTION
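The internal-vs-external reward split can be sketched as follows. The annealing schedule is an illustrative assumption: early in training the internal penalty for unripe pops is small, so popping anything stays informative, and the penalty is gradually raised until it matches the task's true external reward.

```python
def external_reward(ripeness):
    # the task's actual reward function, as defined above
    return 1.0 if ripeness > 0.60 else -0.66667

def internal_reward(ripeness, progress):
    """progress runs from 0.0 (start of training) to 1.0 (fully annealed)."""
    if ripeness > 0.60:
        return 1.0
    # penalty for unripe pops grows until it equals the external one
    return -0.66667 * progress
```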
FUTURE DIRECTIONS
Investigate how the type of ripeness difficulty impacts computational demands.
• Difficulty due to reward schedule vs. perceptual acuity vs. redundancy vs. conjunctiveness vs. ease of prediction
How to handle the ‘Feature Binding Problem’ in this context
• Emergent binding through deep learning?
Just keep increasing complexity and see what problems crop up.
• If the model gets to human-level performance without a hitch, then that’d be pretty good too.
SUMMARY & DISCUSSION
Egocentric representations pay off in this domain, even with the added memory cost.
• In any domain with a single agent?
Symmetries in the action space can be exploited to greatly expedite learning.
• Could there be a general mechanism for detecting such symmetries?
Difficult reward functions might be learnt via annealing internal reward signals.
• How could we have this annealing emerge from the model?
QUESTIONS?