Learning in an uncertain and changing world - OECD
Learning in an uncertain and changing world
Insight from neuroscience to decision science
Florent MEYNIEL
Neurospin, CEA, France
Starting with an example
Example of monthly results of your stock. Should you sell or keep it?
$ $ X $ $ X $ $ $ X X $ $ $ $ X $ $ $
Time
X X X X X
Questions addressed in this presentation
● Can we assume a separation between inference and choice processes?
● What is being inferred when subjects predict future observations?
● How do people weight sequential observations in the face of change points?
● Do people entertain a hierarchical model when learning in an uncertain and changing world?
Separating inference and decision processes
Bayesian Decision Theory (Maloney & Zhang 2010, Vision Research)
● Estimation (inference): estimate the state of the world (θ), given the observations received (o) and my model of the world (m). E.g. what are the rewards associated with the different options?
p(θ | o, m) ∝ p(o | θ, m) p(θ | m)
● Selection (optimization): given an objective function (F, a.k.a. cost function or utility function) and the estimated state of the world, select the best option.
argmax_a F(a, θ)
“Actor-critic” model (Sutton & Barto 1998, MIT Press)
[Diagram: Critic, Actor and World, linked by feedback (twice), actions and a reinforcement signal]
Neuro-anatomy of “Actor-critic” (O’Doherty et al., Science 2004)
● Ventral striatum: compares expected reward and actual reward. Result: prediction error (dopamine system).
● Dorsal striatum: selects the action.
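The two stages of Bayesian Decision Theory (estimation, then selection) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model of the cited papers: two options with binary rewards, a uniform Beta(1, 1) prior on each reward rate θ, and an objective function F equal to the expected reward. The function names and the outcome history are hypothetical.

```python
def posterior_mean(outcomes, prior_a=1.0, prior_b=1.0):
    """Estimation: mean of the Beta posterior over a reward rate theta.

    Assumes Bernoulli outcomes (1 = reward, 0 = no reward) and a
    Beta(prior_a, prior_b) prior; both are illustrative assumptions.
    """
    wins = sum(outcomes)
    losses = len(outcomes) - wins
    return (prior_a + wins) / (prior_a + prior_b + wins + losses)


def select_action(observations):
    """Selection: argmax over actions a of F(a, theta).

    Here F is simply the expected reward, i.e. the posterior mean.
    """
    estimates = {a: posterior_mean(o) for a, o in observations.items()}
    best = max(estimates, key=estimates.get)
    return best, estimates


# Hypothetical outcome history for two options.
history = {"A": [1, 1, 0, 1], "B": [0, 1, 0, 0]}
best, estimates = select_action(history)
print(best, estimates)
```

Keeping the two functions separate mirrors the separation on the slide: the objective function could be swapped (for instance to penalize risk) without touching the estimation step.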
Bayesian inference provides a principled framework for inference and prediction
Learn the statistics θ of observations, given assumptions (M) about the generative process: p(θ | o_1, …, o_N, M)
Turn this estimate into a prediction: p(o_{N+1} | θ, M)
● Bayesian inference computes the posterior probability of statistics.
● Bayesian inference provides optimal predictions about future observations, given the previous observations and assumptions about the generative process. This is called an ideal observer.
● The inference can be iterated (Gelman, Bishop, Sutton & Barto). Tracking improbable (i.e. surprising) events allows the ideal observer to revise the estimates of statistics (Friston 2005).
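The iterated inference can be sketched as follows. This is a minimal illustration that assumes the simplest generative process (a binary sequence with a fixed, unknown probability θ) and a flat prior over a grid of θ values; the function name and the grid approximation are hypothetical choices. The prediction averages p(o | θ, M) over the current posterior on θ, and surprise is reported in bits.

```python
import math


def ideal_observer(sequence, n_grid=101):
    """Iterated Bayesian inference for a Bernoulli process with fixed theta.

    Returns, for each observation, the predictive probability it had
    before being observed, and the corresponding surprise in bits.
    """
    grid = [i / (n_grid - 1) for i in range(n_grid)]  # candidate values of theta
    posterior = [1.0 / n_grid] * n_grid               # flat prior p(theta | M)
    predictions, surprises = [], []
    for o in sequence:
        # Prediction: p(o_{N+1} = 1 | o_1..o_N, M), averaging over theta.
        p_one = sum(t * p for t, p in zip(grid, posterior))
        p_obs = p_one if o == 1 else 1.0 - p_one
        predictions.append(p_obs)
        surprises.append(-math.log2(p_obs))
        # Update: new posterior ∝ likelihood × previous posterior.
        unnorm = [(t if o == 1 else 1.0 - t) * p for t, p in zip(grid, posterior)]
        total = sum(unnorm)
        posterior = [u / total for u in unnorm]
    return predictions, surprises


predictions, surprises = ideal_observer([1, 1, 1, 1, 0])
print(predictions)
print(surprises)
```

After four identical observations, the deviant fifth observation has a low predictive probability and therefore a high surprise, which is the signal that drives the revision of the estimate.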
Reverse-engineering surprise signals unravels the statistical model used by the brain
Compute statistics
Make predictions
Compare predictions to new observations
Oops!
Computational neuroscience
Evidence for an automatic tracking of statistical regularities in electro-encephalography (EEG) signals
EEG recorded during passive listening.
[Figure: sequence of sounds X and Y; axes: pitch vs. time]
The P300 amplitude relates to the improbability of the current sound given the previous ones, suggesting a tracking of:
→ The global item frequency in the entire sequence, e.g. p(X) = 0.7 vs. p(X) = 0.3.
→ The local item frequency in the recent history, e.g. XXXXX vs. YYYYX.
→ The local alternation frequency, i.e. whether items were repeated in the recent history, e.g. YYXXX vs. YXYXX.
Replication and further computational refinements: Mars 2008, Kolossa 2013, Lieder 2013, Maheu 2017...
Squires et al., Science 1976
Meyniel, Maheu & Dehaene, PLoS Computational Biology 2016
What is the simplest model that the brain must entertain to account for these effects?
What is being inferred: local transition probabilities
[Figure: EEG data compared with the local transition probability model, which tracks two transition probabilities, P( · | · ) and P( · | · )]
Surprise = -log(p(observation))
Implication: tracking transition probabilities may make the brain fit to detect serial correlations, and even causality.
Meyniel, Maheu & Dehaene, PLoS Computational Biology 2016
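A learner that tracks local transition probabilities and outputs surprise = -log(p(observation)) can be sketched as below. This is a simplified stand-in rather than the published model: it assumes leaky transition counts that decay by exp(-1/ω) at each step, a pseudo-count of 1 per transition as a flat prior, and an arbitrary default ω; the function name is hypothetical.

```python
import math


def transition_surprise(sequence, omega=16.0):
    """Surprise of each item under a leaky transition-probability learner.

    Counts of each transition (previous item -> current item) decay by
    exp(-1/omega) at every step; a pseudo-count of 1 per transition acts
    as a flat prior. These modeling choices are illustrative assumptions.
    """
    items = sorted(set(sequence))
    counts = {(a, b): 0.0 for a in items for b in items}
    decay = math.exp(-1.0 / omega)
    surprises = []
    for prev, cur in zip(sequence, sequence[1:]):
        # Predictive probability p(cur | prev) from the leaky counts.
        total = sum(counts[(prev, b)] + 1.0 for b in items)
        p = (counts[(prev, cur)] + 1.0) / total
        surprises.append(-math.log(p))
        # Forget a little, then count the transition just observed.
        for key in counts:
            counts[key] *= decay
        counts[(prev, cur)] += 1.0
    return surprises


repeating = transition_surprise(list("XXXXXXXY"))
alternating = transition_surprise(list("XYXYXYXX"))
print(repeating[-1], alternating[-1])
```

In both example sequences the last item violates the locally learned transition (an alternation after a run of repetitions, or a repetition after a run of alternations), so its surprise is larger than that of the first transition, which is predicted at chance.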
Similar sequential effects are found in very simple decisions
A simple reaction time task
[Figure: reaction times (ms), axis from 310 to 400 ms. Redrawn from Cogn Affect Behav Neurosci. 2002; 2: 283–299]
Since the same sequential effects are observed in brain responses that signal surprising events, the sequential effects observed in behavior derive from the subject’s expectations regarding the statistical properties underlying their observations.
The local transition probability model accounts for the asymmetric perception of randomness
Rating of the perceived randomness of binary sequences (Falk, 1975).
O O X O X X O X O X O O O O X X O X O O O
→ here, p(alternate) = 12/20
→ Studies of perceived randomness show a bias for alternations, with a maximum around p(alternate) = 0.6 (Falk, 1975; Falk & Konold, 1997; Bakan, 1960; Budescu, 1987; Rapoport & Budescu, 1992; Kareev, 1992).
→ The perceived randomness can be formalized as a posterior entropy.
→ Our model predicts an asymmetry of the perceived entropy (all the stronger when the integration is more local).
→ The asymmetry is specific to our model.
Meyniel, Maheu & Dehaene, PLoS Computational Biology 2016
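The value p(alternate) = 12/20 can be checked directly on the example sequence by counting the transitions in which the item changes. The variable names below are only illustrative.

```python
# Example sequence from the slide (21 items, hence 20 transitions).
sequence = "OOXOXXOXOXOOOOXXOXOOO"

transitions = list(zip(sequence, sequence[1:]))
n_alternations = sum(a != b for a, b in transitions)
p_alternate = n_alternations / len(transitions)

print(n_alternations, len(transitions), p_alternate)  # 12 20 0.6
```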
It is necessary to forget in the face of change points
[Figure: (A) Observed sequence. (B) Learning processes with different reliance on past observations: estimates of p(o) across 250 observations, using all previous observations or only the previous 15, compared with the true value.]
● When faced with change points, one should rely more on recent observations in order to quickly update one's knowledge.
● Estimates are less accurate when one erroneously assumes that there is no change point than when one erroneously assumes that there are change points.
● Implication: in general, it is optimal to assume that there are change points, and hence it is rational to rely more on recent observations (Yu and Cohen NIPS 2009).
● Implication: probability-matching behavior lawfully emerges from a very local inference (Yu and Huang Decision 2014).
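The cost of ignoring a change point can be illustrated with a small comparison between an estimate based on all previous observations and one based on the previous 15 only. This is a sketch under stated assumptions: the sequence is hypothetical and deterministic (so that the result is reproducible), with p(o = 1) = 0.8 before a single change point and 0.2 after it. With random draws, the short-window estimate would also be noisier than it appears here.

```python
def running_estimates(sequence, window=None):
    """Estimate p(o = 1) before each observation from past observations.

    window=None uses all previous observations; an integer uses only the
    most recent `window` observations. A default guess of 0.5 is used
    when no observation is available yet.
    """
    estimates = []
    for t in range(len(sequence)):
        past = sequence[:t] if window is None else sequence[max(0, t - window):t]
        estimates.append(sum(past) / len(past) if past else 0.5)
    return estimates


def mean_squared_error(estimates, truth):
    """Average squared distance between the estimates and the true values."""
    return sum((e - p) ** 2 for e, p in zip(estimates, truth)) / len(truth)


# Hypothetical sequence: p(o = 1) = 0.8 for 100 observations,
# then a change point, then p(o = 1) = 0.2 for 100 observations.
sequence = [1, 1, 1, 1, 0] * 20 + [0, 0, 0, 0, 1] * 20
truth = [0.8] * 100 + [0.2] * 100

error_all = mean_squared_error(running_estimates(sequence), truth)
error_recent = mean_squared_error(running_estimates(sequence, window=15), truth)
print(error_all, error_recent)
```

The estimate that uses all previous observations only drifts toward 0.5 after the change, whereas the 15-observation window settles on the new value within 15 observations.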
EEG surprise signals indicate that the inference is local
The amplitude of the P300, an EEG signature of surprise, indicates that subjects spontaneously predict the next observation using a local inference.
The best fit is obtained with a leak factor ω=16, i.e. a given observation has half its weight after 16*ln(2)≈11 new observations.
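The half-life quoted above follows from an exponential forgetting kernel. The sketch below assumes that an observation made k observations ago is weighted by exp(-k/ω), which is what a half-life of ω·ln(2) implies; the function name is hypothetical.

```python
import math

omega = 16.0  # leak factor reported as the best fit


def weight(k, omega=omega):
    """Relative weight of an observation made k observations ago.

    Assumes an exponential leak, exp(-k / omega).
    """
    return math.exp(-k / omega)


half_life = omega * math.log(2)
print(round(half_life, 2))          # 11.09
print(round(weight(half_life), 3))  # 0.5
```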
Multiple time scales of integration co-exist in the brain
The local transition probability model can be fit on each time point of the evoked response recorded with MEEG.
● Later brain responses are best explained by increasingly shorter integration windows.
● Late brain responses (>300 ms) typically correspond to conscious brain processes.
● The very short integration windows of late brain responses may correspond to a conscious search for “patterns” in the observed sequence of stimuli.
Maheu, Dehaene & Meyniel, in prep
Do subjects entertain a hierarchical inference when learning in an uncertain and changing world?
Some neuroscientists propose that the brain computes a hierarchical model (Friston, Klaas, Mathy, Nassar, Gallistel, Meyniel, …) while others propose that a leaky integration suffices (Yu, Fusi, Soltani, …).
Behavioral evidence in favor of hierarchical inference in the human brain
● Learning rate increases when volatility increases (Behrens et al Nat Neuro 2007).
Task: choose the best cue; the reward rate associated with each cue changes occasionally at “change points”.
● Learning rate increases after a change point (Nassar et al Nat Neuro 2012).
Task: predict the mean of a Gaussian distribution, whose mean (and SD) changes occasionally at “change points”.
[Figure: learning rate as a function of # trial relative to change point]
● Subjects detect change points (Meyniel et al PCB 2015).
Task: estimate the (volatile) probabilities that generate a binary sequence, and detect the moment they change.
See also: Gallistel et al Psych Rev 2014
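One way to see why a learning rate should rise after a change point: a learner that averages all observations since the last change point can be written as a delta rule whose learning rate is 1/n, where n is the number of observations since the change. The sketch below makes the strong, unrealistic assumption that the learner knows where the change point is; it is not the model of the cited studies, and the function name and data are hypothetical.

```python
def delta_rule_with_reset(observations, change_points):
    """Running mean written as a delta rule, reset at known change points.

    estimate <- estimate + learning_rate * (observation - estimate),
    with learning_rate = 1 / n, where n counts observations since the
    last change point. Knowing the change points is an assumption made
    for illustration only.
    """
    estimate, n = 0.0, 0
    learning_rates, estimates = [], []
    for t, obs in enumerate(observations):
        if t in change_points:
            n = 0  # discard what was learned before the change
        n += 1
        rate = 1.0 / n
        estimate += rate * (obs - estimate)
        learning_rates.append(rate)
        estimates.append(estimate)
    return learning_rates, estimates


# Hypothetical data: the mean jumps from about 10 to about 20 at trial 4.
observations = [10, 11, 9, 10, 20, 21, 19, 20]
rates, estimates = delta_rule_with_reset(observations, change_points={4})
print(rates)
print(estimates)
```

The learning rate decays as evidence accumulates and jumps back to 1 at the change point, so the estimate moves to the new mean in a single trial instead of averaging across the two regimes.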
Neural evidence in favor of hierarchical inference in the human brain
Meyniel and Dehaene PNAS 2017
[Figure: brain regions labeled IPS, OFC, TO]
● While subjects covertly estimated the probabilities underlying a sequence of stimuli, the (optimal) confidence level accompanying probability estimates correlated with the activity of brain-scale networks.
● Activity in this region predicted the accuracy of subjects' confidence reports.
● These results indicate that the brain indeed tracks probabilities and even the reliability of its estimates, and computes a hierarchical probabilistic inference.
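The idea of a confidence level accompanying a probability estimate can be made concrete with a Beta posterior. The sketch below assumes a stationary binary process, a uniform Beta(1, 1) prior, and a definition of confidence as log precision, -log(variance); that definition and the function names are assumptions made for illustration, not taken from the slides.

```python
import math


def beta_posterior(n_ones, n_zeros, prior_a=1.0, prior_b=1.0):
    """Mean and variance of the Beta posterior over a probability."""
    a, b = prior_a + n_ones, prior_b + n_zeros
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


def confidence(variance):
    """Confidence taken as log precision, i.e. -log(variance).

    This definition is an assumption made for illustration.
    """
    return -math.log(variance)


mean_few, var_few = beta_posterior(3, 1)      # 4 observations
mean_many, var_many = beta_posterior(30, 10)  # 40 observations
print(mean_few, confidence(var_few))
print(mean_many, confidence(var_many))
```

Both cases have the same proportion of ones, but the estimate based on 40 observations comes with a narrower posterior and hence a higher confidence.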
Summary
● Our brain is equipped with a powerful machinery for computing statistics from sequences of observations that may be used to guide decisions.
● The brain infers, at a minimum, the transition probabilities between successive event types.
– This can account for the tendency to perceive serial correlations and causal relations, and for a biased perception of randomness.
● This machinery is tuned to non-stationarity and favors recent observations to estimate current statistics.
– This can account for various behavioral effects: probability matching, sequential effects in choices and reaction times, a conscious search for patterns, recency effects, instability…
● The human learning algorithm is hierarchical and Bayesian: it takes into account the occurrence of change points and the confidence in its own inference.
– This can account for the flexibility of human learning.
– Unfit priors (e.g. regarding the probability of observations or volatility) may account for sub-optimal behavior and apparent irrationalities.
Expectations emerge more rapidly from repetitions than alternations
A qualitative agreement with the P300 data by Squires et al. 1976
● Global item frequency effect
● Local frequency effect
● Alternation frequency effect
● Local effects even when there is no global bias
● The order matters (order reversed)
● Stronger expectations after repetitions than alternations
Meyniel, Maheu & Dehaene, PLoS Computational Biology 2016
A modified task to further test the computation of transition probabilities
A modified version of the original task by Squires et al.: additional blocks with biased transition probabilities.
MEG recorded during passive listening.
[Figure: sequence of sounds A and B; axes: pitch vs. time]
Conditions: no bias, global frequency bias, global alternation bias, global repetition bias.
Quantities labeled: p(A|B), p(B|A), p(Alt.), p(A).
Maheu, Dehaene & Meyniel, in prep
The local transition probability model accounts for both local and global effects of statistics in all conditions
We collapsed data across time (by averaging between 500 and 730 ms) and space (by filtering). The spatial filters were estimated and applied using cross-validation on half of the data.
Maheu, Dehaene & Meyniel, in prep
Can human subjects explicitly track time-varying transition probabilities?
Meyniel, Schlunegger & Dehaene, PLoS Computational Biology 2015
Bayesian inversion by the Ideal Observer (infer probabilities given the observations)