CS 182/CogSci110/Ling109, Spring 2008
Reinforcement Learning: Details and Biology
4/3/2008
Srini Narayanan – ICSI and UC Berkeley
Lecture Outline
- Reinforcement Learning: Temporal Difference (TD-Learning, Q-Learning); Demos (MDP, Q-Learning)
- Animal Learning and Biology: neuro-modulators and temporal difference; discounting; exploration and exploitation
- Neuroeconomics: intro
Demo of MDP solution
Example: Bellman Updates
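The worked example itself is in the slide figure; for reference, the Bellman update it steps through (standard form, with transition model T, reward R, and discount γ):

```latex
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\,V_{k}(s')\,\bigr]
```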
Example: Value Iteration
Information propagates outward from terminal states and eventually all states have correct value estimates
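A minimal value-iteration sketch of this propagation, assuming a generic tabular MDP; `states`, `actions`, `T`, and `R` are illustrative names, not the API of the demo shown in class:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeat Bellman updates until the value function converges.

    states: list of states
    actions(s): available actions in s (empty for terminal states)
    T(s, a): list of (next_state, probability) pairs
    R(s, a, s2): immediate reward for the transition
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                  # terminal state: nothing to back up
                continue
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                for a in acts
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                   # values near terminal states settle
            return V                      # first, then the rest follow
```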
Full Estimation (Dynamic Programming)
$V(s_t) \leftarrow E_\pi\bigl[\,r_{t+1} + \gamma\,V(s_{t+1})\,\bigr]$

(Backup diagram: a full expected backup from $s_t$ through $r_{t+1}$ to every possible successor $s_{t+1}$, expanded down to terminal states T.)
Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[\,R_t - V(s_t)\,\bigr]$, where $R_t$ is the actual return following state $s_t$.

(Backup diagram: a single complete sampled trajectory from $s_t$ to a terminal state T; no bootstrapping.)
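A first-visit Monte Carlo sketch of this update, assuming episodes arrive as lists of (state, reward) pairs; `mc_update` and its interface are illustrative:

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """First-visit Monte Carlo: nudge V(s) toward the sampled return R_t.

    episode: [(state, reward), ...] in visit order, where reward is the
    reward received on leaving that state; the episode ends at a terminal.
    """
    # Compute the return following each step, working backwards.
    G = 0.0
    returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    returns.reverse()                     # back to forward order
    seen = set()
    for state, G in returns:
        if state in seen:                 # first visit only
            continue
        seen.add(state)
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)    # V(s) <- V(s) + alpha [R_t - V(s)]
    return V
```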
Combining DP and MC
$V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[\,r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t)\,\bigr]$

The bracketed term $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the PREDICTION ERROR.

(Backup diagram: one sampled step from $s_t$ to $s_{t+1}$, then bootstrap from $V(s_{t+1})$.)
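The same update as a minimal Python sketch; the dictionary-based `V` and the argument names are illustrative:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward r + gamma * V(s_next)."""
    target = r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)        # the prediction error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```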
Model-Free Learning
Big idea: why bother learning T?
- Update each time we experience a transition.
- Frequent outcomes will contribute more updates (over time).

Temporal difference learning (TD):
- Policy still fixed!
- Move values toward the value of whatever successor occurs.

(Diagram: transition s → (s, a) → s′, labelled with the sampled action a and successor s′.)
Q-Learning
- Learn Q*(s, a) values.
- Receive a sample (s, a, s′, r).
- Consider your old estimate: $Q(s,a)$.
- Consider your new sample estimate: $\mathrm{sample} = r + \gamma\,\max_{a'} Q(s',a')$.
- Nudge the old estimate towards the new sample (code sketch below): $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\cdot\mathrm{sample}$.
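A minimal tabular sketch of that nudge; the `(s, a)`-keyed dictionary `Q` and the `actions_in_s2` argument are illustrative assumptions:

```python
def q_update(Q, s, a, r, s2, actions_in_s2, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update from one sample (s, a, s2, r)."""
    # Value of the best successor action; 0 if s2 is terminal (no actions).
    future = max((Q.get((s2, a2), 0.0) for a2 in actions_in_s2), default=0.0)
    sample = r + gamma * future
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * sample
    return Q
```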
Any problems with this?
- No guarantee you will explore the state space: the value of unexplored states is never computed.
- A fundamental problem in RL and in biology. How do we address this problem?
- AI solutions include ε-greedy and softmax.
- Evidence from neuroscience (next lecture).
Exploration / Exploitation
Several schemes for forcing exploration. Simplest: random actions (ε-greedy; a sketch follows below):
- Every time step, flip a coin.
- With probability ε, act randomly.
- With probability 1 − ε, act according to the current policy (best Q-value, for instance).

Problems with random actions? You do explore the space, but keep thrashing around once learning is done.
- One solution: lower ε over time.
- Another solution: exploration functions.
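A minimal sketch of ε-greedy selection plus a decaying ε, assuming a tabular `Q` keyed by `(state, action)` and a list of actions (all names illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_epsilon(t, eps0=1.0, eps_min=0.05, decay=0.999):
    """One way to 'lower epsilon over time' and stop late-stage thrashing."""
    return max(eps_min, eps0 * decay ** t)
```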
Q-Learning
Q-Learning
Q-learning produces tables of Q-values, one per state–action pair.
Q-Learning features

- On-line, incremental.
- Bootstrapping (like DP, unlike MC).
- Model-free.
- Converges to an optimal policy (Watkins 1989): on average when α is small; with probability 1 when α is high in the beginning and low at the end (say, 1/k; see the condition below).
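The "high in the beginning, low at the end" requirement is the standard stochastic-approximation condition on the learning-rate schedule $\alpha_k$, which $\alpha_k = 1/k$ satisfies:

```latex
\sum_{k=1}^{\infty} \alpha_k = \infty
\qquad\text{and}\qquad
\sum_{k=1}^{\infty} \alpha_k^{2} < \infty
```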
Reinforcement Learning
Basic idea:
- Receive feedback in the form of rewards.
- The agent's utility is defined by the reward function.
- Must learn to act so as to maximize expected utility.
- Change the rewards, change the behavior.
DEMO
Demo of Q Learning
Demo: arm control. Parameters:
- learning rate
- discounted reward (high for future rewards)
- exploration (should decrease with time)

MDP:
- Reward = number of pixels moved to the right per iteration
- Actions: arm up and down (yellow line), hand up and down (red line)
Helicopter Control (Andrew Ng)
Lecture Outline
- Reinforcement Learning: Temporal Difference (TD-Learning, Q-Learning); Demos (MDP, Q-Learning)
- Animal Learning and Biology: neuro-modulators and temporal difference; discounting; exploration and exploitation
- Neuroeconomics: intro
Example: Animal Learning
- RL has been studied experimentally for more than 60 years in psychology: rewards (food, pain, hunger, drugs, etc.), conditioning; mechanisms and sophistication are debated.
- More recently, neuroscience has provided data on: the biological reality of the prediction error in TD- (and Q-) learning; utility structure and reward discounting; exploration vs. exploitation.
Dopamine levels track prediction error (see the mapping below):
- Unpredicted reward (unlearned / no stimulus)
- Predicted reward (learned task)
- Omitted reward (probe trial)

(Montague et al. 1996; Wolfram Schultz Lab, 1990–1996)
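The three conditions correspond to the sign of the TD prediction error $\delta_t$, which is the standard reading of the Schultz recordings:

```latex
\delta_t = r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t)
\qquad
\begin{cases}
\delta_t > 0 & \text{unpredicted reward: burst of dopamine firing}\\
\delta_t \approx 0 & \text{fully predicted reward: baseline firing}\\
\delta_t < 0 & \text{omitted reward: dip below baseline}
\end{cases}
```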
Dopamine and prediction error
Rats were trained on a maze; a sucrose solution at the end was the reward.

(Figure: photosensors; dopamine antagonists.)
RL Model Behavior
Human learning
Reward prediction in humans
- Dopamine neurons in the VTA
- fMRI study: changes in BOLD signal

(Decision Lab, Stanford University)
Reward prediction in humans
- Explicit losses (punishment) seem to involve a different circuit than the positive signal.
- Changes are modulated by the probability of reward.

(Decision Lab, Stanford University)
Dopamine neurons and their role
Hyperbolic discounting
(Ainslie 1992)

- Short-term rewards are treated differently from long-term rewards.
- Used in many animal discounting models.
- Has been used to explain procrastination and addiction.
- Behavior changes as rewards become imminent (see the formulas below).
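For reference, the standard hyperbolic discount curve, contrasted with the exponential discounting of the MDP formulation (A = reward amount, D = delay, k = discount rate):

```latex
V_{\text{hyperbolic}} = \frac{A}{1 + kD}
\qquad\text{vs.}\qquad
V_{\text{exponential}} = A\,\gamma^{D}
```

Hyperbolic curves cross as D shrinks, so a smaller-sooner reward can overtake a larger-later one as it becomes imminent: the preference reversal the slide describes.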
McClure & Cohen fMRI experiments
Different circuits for immediate and delayed rewards?
Immediate and Long-term rewards
Basic Conclusion of the McClure, Cohen experiments
Two critical predictions:
- Choices that include a reward today will preferentially engage limbic structures, relative to choices that do not include a reward today.
- Trials in which the later reward is selected will be associated with relatively higher levels of lateral prefrontal activation, reflecting the ability of this system to value greater rewards even when they are delayed.

Hyperbolic discounting may reflect a tension between limbic and more prefrontal structures, as in the grasshopper and the ant (from Aesop). Lots of implications for marketing, education, ...

Twist: more recent results suggest that the systems may be involved at different activity levels for immediate and delayed rewards (Kable 2007, Nat. Neuroscience).

Either case provides unambiguous evidence that subjective value is explicitly represented in neural activity.
Exploration vs. Exploitation
- Fundamental issue in adapting to a complex (changing) world.
- A complex biological issue; multiple factors may play a role. Evidence consistently implicates neuro-modulatory systems thought to be involved in assessing reward and uncertainty (DA, NE, ACh).
- The midbrain dopamine system has been linked to reward prediction errors.
- The locus coeruleus (LC) noradrenergic system has been proposed to govern the balance between exploration and exploitation in response to reward history (Aston-Jones & Cohen 2005).
- The basal forebrain cholinergic system, together with the adrenergic system, has been proposed to monitor uncertainty, signalling expected and unexpected forms respectively, which in turn might be used to promote exploitation or exploration (Yu & Dayan 2005).
Discounting and exploration
Aston-Jones, G. & Cohen, J. D. (2005). An integrative theory of locus coeruleus–norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci. 28, 403–450.
Toward a biological model
McClure et al., Phil. Trans. of the Royal Society, 2007.
The Ultimatum Game: Human utility
Sanfey, A.G. et al. (2003) The neural basis of economic decision making in the Ultimatum Game. Science
Summary
Biological evidence for:
- Prediction error and TD-learning
- Discounting: hyperbolic; two systems?
- Exploitation and exploration: LC and NE, phasic and tonic
- Social features cue the relationship between discounting, utility, and explore/exploit
Areas that are probably directly involved in RL
- Basal ganglia: striatum (ventral/dorsal), putamen, substantia nigra
- Midbrain (VTA) and brainstem/hypothalamus (NC)
- Amygdala
- Orbito-frontal cortex
- Cingulate circuit (ACC)
- Cerebellum
- PFC
- Insula
Neuroeconomics: Current topics

- How (and where) are value and probability combined in the brain to provide a utility signal? What are the dynamics of this computation?
- What neural systems track classically defined forms of expected and discounted utility? Under what conditions do these computations break down?
- How is negative utility signaled? Is there a negative utility prediction signal comparable to the one for positive utility?
- How are rewards of different types mapped onto a common neural currency like utility?
- How do systems that seem to be focused on immediate decisions and actions interact with systems involved in longer-term planning (e.g., making a career decision)? For example, does an unmet need generate a tonic and progressively increasing signal (i.e., a mounting 'drive'), or does it manifest as a recurring episodic/phasic signal with increasing amplitude?
- What are the connections between utility and ethics? Social issues.
Reinforcement Learning: What you should know
Basics:
- Utilities, preferences, conditioning

Algorithms:
- MDP formulation, Bellman's equation
- Basic learning formulation, temporal difference, Q-learning

Biology:
- Role of neuromodulators; the role of dopamine
- Short- vs. long-term rewards; hyperbolic discounting
- Exploration vs. exploitation
- Neuroeconomics: the basic idea and questions

What you might wonder:
- The role of reinforcement learning in language learning
- The role of rewards and utility maximization in ethics, boredom, ...
- The role of neuro-modulation in cognition and behavior