reinforcement learning china summer school
TRANSCRIPT
Reinforcement Learning China Summer School
RLChina 2021
Bayesian Brain
Prof. Jun Wang
UCL
August 17, 2021
Table of Contents
1 Learning in biological and computerised systems
2 Perception-control loop: the problem
3 Active inference: perception
4 Active inference: perception-control loop
5 An illustrative example
6 Control as inference
7 Multi-agent Variational Bayes
8 Conclusions
What is “learning”?
in biology, learning means a change of behaviour as a result of experience
in classical conditioning [1], animals can learn to identify a useful pattern in the environment by associating one stimulus with another: repeatedly given {ring-a-bell, food}, a dog will start to salivate (anticipating the arrival of the food) when the bell rings again
learned behaviours are adaptive, and thus are essential for animals to survive in a changing environment
e.g., they may learn not to eat certain foods if they have ever become ill after eating them
more learned behaviours =⇒ more intelligent
[1] Ivan Petrovitch Pavlov and William Gantt. “Lectures on conditioned reflexes: Twenty-five years of objective study of the higher nervous activity (behaviour) of animals.” In: (1928).
Learning is not limited to biological systems
machine learning answers the question of how computer programmes can improve their performance through experience
past experience exists in various forms, typically including (1) human-labelled or unlabelled data sets, (2) past interactions with the environment, or (3) data collected from realistic simulators
with experience, computers can be trained to identify associations, spot underlying patterns, and make accurate predictions and forecasts of the future
as a field, machine learning studies fundamental theory and algorithms and provides principled solutions for the computational methods of learning
Three levels in any information processing system [2]
computational theory
  what is the problem in a generic manner?
  what is the computing goal?
  what is the logic of the strategy behind it?
representation and algorithm
  how can the identified computational problems be solved?
  what are the inputs/outputs and the algorithm for the transformation?
hardware implementation
  how can the representation and algorithm be realised physically?
  human: biological
  AI: silicon using transistors
[2] David Marr. “Vision: A computational investigation into the human representation and processing of visual information”. In: Inc., New York, NY 2.4.2 (1982).
Human intelligence vs. current AI [3]
[3] Eliezer Sternberg. NeuroLogic: The Brain’s Hidden Rationale Behind Our Irrational Behavior. Vintage, 2016.
Typical narrow machine learning
Figure: Data X → (predict) → Label Y
mathematically, ML loosely boils down to the question of finding an unknown function mapping f : X → Y, where
  X is the input feature space representing data points,
  Y is the label space representing the knowledge outputs
thus the mapping f represents a knowledge discovery process from a given data point x ∈ X to the specific label y ∈ Y associated with the data point: y = f(x)
however, as f is not known a priori, the goal of learning is to identify a hypothesis h ∈ H from a predefined set H to approximate the unknown f, so that the learned function h can predict the output variable y_new = h(x_new) for a given new data point
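The search for a hypothesis h ∈ H can be sketched in a few lines. This is a minimal illustration only, assuming H is the set of polynomials of a fixed degree and that "approximate" means least squares; the target `f`, the noise level and the helper name `fit_hypothesis` are hypothetical choices, not part of the slides.

```python
import numpy as np

def fit_hypothesis(x, y, degree=1):
    """Pick h from H = {polynomials of the given degree} by least squares."""
    coeffs = np.polyfit(x, y, degree)   # minimises the squared error over H
    return np.poly1d(coeffs)            # callable hypothesis h

# Hypothetical "unknown" target f, used only to generate training data.
def f(x):
    return 2.0 * x + 1.0

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 50)
y_train = f(x_train) + 0.01 * rng.standard_normal(50)

h = fit_hypothesis(x_train, y_train, degree=1)
x_new = 0.5
y_new = h(x_new)   # prediction for an unseen data point
```

Here the learner never sees `f` itself, only noisy samples of it, which mirrors the statement that f is not known a priori.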
View narrow machine learning as function mapping
f : X → Y
Supervised learning
  Regression: X: independent variables, Y: dependent variables
  Classification: X: example features, Y: class labels
Unsupervised learning
  Clustering: X: example features, Y: cluster index
  Anomaly detection: X: example features, Y: whether it is an outlier or not
Generative models: X: class labels, Y: sample features
Reinforcement learning: X: state features, Y: actions to take
machine learning tasks differ in the experience available, the objective function, and the specific learning algorithms
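Machine learning: reasoning under uncertainty
pattern recognition:
Figure: Data X → (predict) → Label Y
decision making (reinforcement learning):
Figure: single agent: Agent 1 sends act to the Environment and receives obs from it
Figure: multi-agent: Agent 1 and Agent 2 each send act to a shared Environment and receive obs from it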
A living system and its environment
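Figure: a living system and its environment, split into a generative process (environment) and a generative model (agent). Node labels: true but hidden states s*_{t−1}, s*_t, s*_{t+1}; observations o_{t−1}, o_t, o_{t+1}; sensation/internal states s_{t−1}, s_t, s_{t+1}; actions u_{t−1}, u_t.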
Entropy
Entropy: a measure of the average surprise (or uncertainty or disorder) of a random event sampled from a probability distribution or density
definition: entropy of a random event X with possible outcomes x_1, ..., x_n:
$$H(p) = -\sum_{i} p(X = x_i) \log p(X = x_i)$$
low entropy means that, on average, the outcome is relatively predictable
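The definition can be checked on two coins. This is a small sketch using natural logarithms (so the unit is nats); the probabilities are made-up examples.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.99, 0.01])  # outcome is almost always the same
```

The biased coin has the lower entropy, matching the remark that low entropy means a relatively predictable outcome.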
Variational Bayes [4] / Free energy principle [5]
an agent builds a world model by forming an internal representation s of the world from observation o:
$$p(s \mid o) = \frac{p(o \mid s)\, p(s)}{\int p(o \mid s)\, p(s)\, ds}$$
typically the denominator is intractable, and one can approximate the posterior using a tractable function family q(s) ∈ Q by minimising a dissimilarity measure, e.g., the KL divergence:
$$\mathrm{KL}(q(s) \,\|\, p(s \mid o)) = \mathbb{E}_{q(s)}\left[\log \frac{q(s)}{p(s \mid o)}\right]$$
[4] Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.
[5] Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.
Variational Bayes [6] / Variational free energy [7]
we then derive:
$$\begin{aligned}
\mathrm{KL}(q(s) \,\|\, p(s \mid o)) &= \mathbb{E}_{q(s)}\left[\log \frac{q(s)}{p(s \mid o)}\right] \\
&= \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o) + \log p(o)] \\
&= \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o)] + \log p(o)
\end{aligned}$$
reorganising gives the measure of surprise (the negative log model evidence, i.e., the negative log marginal likelihood):
$$\begin{aligned}
-\log p(o) &= \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o)] - D_{KL}[q(s) \,\|\, p(s \mid o)] \\
&\le \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o)] \equiv F(q)
\end{aligned}$$
where the RHS is the variational free energy (the negative ELBO, Evidence Lower BOund)
[6] Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.
[7] Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.
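The bound −log p(o) ≤ F(q) can be verified numerically on a tiny discrete model. The sketch below assumes two hidden states and a single fixed observation; the prior, likelihood and the belief `q_guess` are illustrative numbers, and `free_energy` is a hypothetical helper name.

```python
import numpy as np

def free_energy(q, prior, lik):
    """F(q) = E_q[log q(s) - log p(s, o)] for one fixed observation o."""
    joint = prior * lik                      # p(s, o) for each state s
    return float(np.sum(q * (np.log(q) - np.log(joint))))

# Hypothetical two-state model; the numbers are illustrative only.
prior = np.array([0.5, 0.5])                 # p(s)
lik = np.array([0.9, 0.2])                   # p(o | s) at the observed o

evidence = float(np.sum(prior * lik))        # p(o)
surprise = -np.log(evidence)                 # -log p(o)
posterior = prior * lik / evidence           # exact p(s | o)

q_guess = np.array([0.6, 0.4])               # some other belief q(s)
gap = free_energy(q_guess, prior, lik) - surprise   # equals KL(q || p(s|o))
```

The gap is the KL divergence to the exact posterior, so it is positive for `q_guess` and vanishes when q is set to `posterior`.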
Free energy principle [9]
the VFE can be decomposed in three principal ways:
$$\begin{aligned}
F(q) &= \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o)] \\
&= \underbrace{\mathbb{E}_{q(s)}[\log q(s)]}_{\text{Negative Entropy}} \; \underbrace{- \mathbb{E}_{q(s)}[\log p(s, o)]}_{\text{Energy}} \\
&= -\underbrace{\mathbb{E}_{q(s)}[\log p(o \mid s)]}_{\text{Accuracy}} + \underbrace{D_{KL}[q(s) \,\|\, p(s)]}_{\text{Complexity}} \\
&= \underbrace{D_{KL}[q(s) \,\|\, p(s \mid o)]}_{\text{Posterior Divergence}} \; \underbrace{- \log p(o)}_{\text{Negative Log Model Evidence}}
\end{aligned}$$
free energy principle: the goal of a living system is to minimise the free energy F in order to avoid surprising observations (states) → maintain homeostasis [8] (and thus remain alive)
[8] The process whereby an open or closed system regulates its internal environment to maintain its states within bounds.
[9] Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.
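That the three decompositions agree can be checked numerically. A minimal sketch, assuming a two-state discrete model with one fixed observation; all numbers are illustrative, and the accuracy term enters with a minus sign.

```python
import numpy as np

# Hypothetical two-state model; values are illustrative assumptions.
prior = np.array([0.7, 0.3])       # p(s)
lik = np.array([0.8, 0.1])         # p(o | s) at the observed o
q = np.array([0.6, 0.4])           # approximate posterior q(s)

joint = prior * lik                # p(s, o)
evidence = joint.sum()             # p(o)
posterior = joint / evidence       # p(s | o)

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

neg_entropy = float(np.sum(q * np.log(q)))
energy = -float(np.sum(q * np.log(joint)))
accuracy = float(np.sum(q * np.log(lik)))

f_entropy_energy = neg_entropy + energy
f_accuracy_complexity = -accuracy + kl(q, prior)
f_divergence_evidence = kl(q, posterior) - float(np.log(evidence))
```

All three values coincide up to floating-point error, which is a quick sanity check on the signs of each term.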
Extension to handle dynamics and control
the real world is dynamic, so we extend the model to have observations o = {o_1, ..., o_T} and a hidden state at each point in time, s = {s_1, ..., s_T}
we also add an action policy π = {u_1, ..., u_T}, which interacts with the environment by altering the next hidden state
the generative model: 1) p(s_{t+1} | s_t, u_t) for t ≥ 1, with initial state p(s_1); 2) p(o_t | s_t)
the Expected Free Energy (EFE), from time τ until the time horizon T:
$$G = \mathbb{E}_{q(o_{\tau:T}, s_{\tau:T}, \pi)}\left[\ln q(s_{\tau:T}, \pi) - \ln p(o_{\tau:T}, s_{\tau:T})\right]$$
where an agent’s goals are encoded as a (subjective) desired distribution over observations p(o_{τ:T}) [10]; thus we have p(o_τ, s_τ) ≈ p(o_τ) q(s_τ | o_τ)
[10] in ML, an optimality variable is added to encode the desires
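Expected free energy [12]
a temporal mean-field factorisation [11] is assumed:
  the variational function: $q(s_{\tau:T}, \pi) \approx q(\pi) \prod_{\tau}^{T} q(s_\tau \mid \pi)$
  the gen. model: $p(o_{\tau:T}, s_{\tau:T}) \approx \prod_{\tau}^{T} p(o_\tau)\, q(s_\tau \mid o_\tau)$
as a result, the terms are independent across time steps:
$$\begin{aligned}
G &= \mathbb{E}_{q(o_{\tau:T}, s_{\tau:T}, \pi)}\left[\ln q(s_{\tau:T}, \pi) - \ln p(o_{\tau:T}, s_{\tau:T})\right] \\
&= \mathbb{E}_{q(o_{\tau:T}, s_{\tau:T} \mid \pi)\, q(\pi)}\left[\ln q(s_{\tau:T} \mid \pi) + \ln q(\pi) - \ln p(o_{\tau:T}, s_{\tau:T})\right] \\
&= \mathbb{E}_{q(\pi)}\left[\ln q(\pi) + \mathbb{E}_{q(o_{\tau:T}, s_{\tau:T} \mid \pi)}\left[\sum_{\tau}^{T} \left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\right]\right]\right] \\
&= \mathbb{E}_{q(\pi)}\left[\ln q(\pi) - \left(-\sum_{\tau}^{T} G_\tau(\pi)\right)\right] \\
&= D_{KL}\left[q(\pi) \,\Big\|\, e^{-\sum_{\tau}^{T} G_\tau(\pi)}\right]
\end{aligned}$$
where $G_\tau(\pi) = \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\right]$
[11] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
[12] Beren Millidge, Alexander Tschantz, and Christopher L Buckley. “Whence the expected free energy?” In: Neural Computation 33.2 (2021), pp. 447–482.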
Expected free energy
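thus optimising G results in
$$q^*(\pi) = \mathrm{SoftMax}\left(-\sum_{\tau}^{T} G_\tau(\pi)\right)$$
EFE at time τ, G_τ, can be further decomposed as:
$$\begin{aligned}
G_\tau(\pi) &= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\right] \\
&\approx \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau) - \ln q(s_\tau \mid o_\tau)\right] \\
&\approx \underbrace{-\mathbb{E}_{q(o_\tau \mid \pi)}\left[\ln p(o_\tau)\right]}_{\text{Extrinsic Value}} \; \underbrace{- \mathbb{E}_{q(o_\tau \mid \pi)} D_{KL}\left[q(s_\tau \mid o_\tau) \,\|\, q(s_\tau \mid \pi)\right]}_{\text{Epistemic Value}}
\end{aligned}$$
thus, optimal policies are obtained by minimising the sum of the expected free energies
EFE is estimated by using the generative model to roll out predicted futures and computing the EFE of those futures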
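The softmax step that turns summed expected free energies into a distribution over policies can be sketched as follows. The three EFE values are hypothetical, and `policy_posterior` is an illustrative helper, not code from the lecture.

```python
import numpy as np

def policy_posterior(total_efe):
    """q*(pi) = softmax(-sum_tau G_tau(pi)) over a finite set of policies."""
    logits = -np.asarray(total_efe, dtype=float)
    logits -= logits.max()            # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical summed EFE values for three candidate policies.
G = [2.0, 0.5, 3.0]
q_pi = policy_posterior(G)
best = int(np.argmax(q_pi))           # index of the lowest-EFE policy
```

Policies with lower expected free energy receive more probability mass, so the most probable policy is the one that minimises the summed EFE.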
Expected free energy
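similar to the variational free energy, the EFE can also be decomposed as:
$$\begin{aligned}
G_\tau(\pi) &= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\right] \\
&\approx \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau) - \ln q(s_\tau \mid o_\tau)\right] \\
&\approx \underbrace{-\mathbb{E}_{q(o_\tau \mid \pi)}\left[\ln p(o_\tau)\right]}_{\text{Extrinsic Value}} \; \underbrace{- \mathbb{E}_{q(o_\tau \mid \pi)} D_{KL}\left[q(s_\tau \mid o_\tau) \,\|\, q(s_\tau \mid \pi)\right]}_{\text{Epistemic Value}}
\end{aligned}$$
note that epistemic uncertainty refers to deficiencies caused by a lack of knowledge or information; it is reducible with more data or a better model
the second term measures the reduction in the entropy of s_τ once o_τ is observed → maximise its value → we intend to choose the policy such that H[q(s | π)] is high and there is a strong dependency between states and observations (i.e., H[q(s | o)] is low)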
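The entropy-reduction reading of the epistemic value can be made explicit. A short worked step, assuming q(o_τ, s_τ | π) = q(o_τ | π) q(s_τ | o_τ) and writing H for the Shannon entropy:

```latex
\begin{aligned}
\mathbb{E}_{q(o_\tau \mid \pi)} D_{KL}\left[q(s_\tau \mid o_\tau) \,\|\, q(s_\tau \mid \pi)\right]
&= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid o_\tau) - \ln q(s_\tau \mid \pi)\right] \\
&= H\left[q(s_\tau \mid \pi)\right] - \mathbb{E}_{q(o_\tau \mid \pi)} H\left[q(s_\tau \mid o_\tau)\right]
\end{aligned}
```

Under that assumption the term is the mutual information between states and observations, which is large exactly when the prior state entropy is high and the posterior state entropy is low.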
Expected free energy
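one can also decompose it into the following:
$$\begin{aligned}
G_\tau(\pi) &= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\right] \\
&= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau \mid s_\tau) - \ln p(s_\tau)\right] \\
&= -\underbrace{\mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln p(o_\tau \mid s_\tau)\right]}_{\text{Accuracy}} + \underbrace{\mathbb{E}_{q(o_\tau \mid s_\tau)}\left[D_{KL}\left[q(s_\tau \mid \pi) \,\|\, p(s_\tau)\right]\right]}_{\text{Complexity}}
\end{aligned}$$
or this (typically used for computation):
$$\begin{aligned}
G_\tau(\pi) &= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\ln q(s_\tau \mid \pi) - \ln p(o_\tau) - \ln q(s_\tau \mid o_\tau)\right] \\
&= \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\left[\cancel{\ln q(s_\tau \mid \pi)} - \ln p(o_\tau) - \ln p(o_\tau \mid s_\tau) - \cancel{\ln q(s_\tau \mid \pi)} + \ln q(o_\tau \mid \pi)\right] \\
&= \underbrace{D_{KL}\left[q(o_\tau \mid \pi) \,\|\, p(o_\tau)\right]}_{\text{Expected Cost}} + \underbrace{\mathbb{E}_{q(s_\tau \mid \pi)}\left[H\left[p(o_\tau \mid s_\tau)\right]\right]}_{\text{Expected Ambiguity}}
\end{aligned}$$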
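The cost-plus-ambiguity form and the extrinsic-plus-epistemic form describe the same quantity when q(s | o) is the exact Bayesian posterior. The sketch below checks this numerically for one time step; the matrix `A`, the belief `q_s` and the preferences `p_o` are illustrative assumptions.

```python
import numpy as np

# Hypothetical discrete quantities for one time step under one policy.
q_s = np.array([0.3, 0.7])            # q(s | pi)
A = np.array([[0.9, 0.2],             # A[o, s] = p(o | s); columns sum to 1
              [0.1, 0.8]])
p_o = np.array([0.8, 0.2])            # preferred observations p(o)

q_o = A @ q_s                         # q(o | pi)
q_s_given_o = (A * q_s) / q_o[:, None]   # Bayes: row o holds q(s | o)

# Form 1: expected cost + expected ambiguity.
cost = float(np.sum(q_o * (np.log(q_o) - np.log(p_o))))
ambiguity = float(np.sum(q_s * -np.sum(A * np.log(A), axis=0)))
g_cost_ambiguity = cost + ambiguity

# Form 2: G = -E[ln p(o)] - E_o KL[q(s | o) || q(s | pi)].
expected_log_pref = float(np.sum(q_o * np.log(p_o)))
info_gain = float(np.sum(q_o[:, None] * q_s_given_o
                         * (np.log(q_s_given_o) - np.log(q_s))))
g_extrinsic_epistemic = -expected_log_pref - info_gain
```

The first form needs only q(s | π), the likelihood and the preferences, with no posterior q(s | o), which is one reason it is convenient for computation.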
A simple example [13]
suppose an agent interacts with an environment
perception:
  the environment has two hidden states s ∈ {1, 2}, e.g., there is food in your stomach (1) or not (2)
  while s is NOT measurable, there is an observation o ∈ {1, 2}, e.g., you feel fed (1) or hungry (2)
  assume p(o|s) is given as a 2x2 likelihood matrix (called the "A" matrix) which maps states to observations, e.g., if you have food, you feel fed, and vice versa
control:
  the transition probability p(s_t | s_{t−1}, u) maps the previous state to the next one, depending on the action u ∈ {u1, u2}
  it is parameterised by a separate 2x2 transition matrix ("B") for each action, e.g., either go get food (u1) or do nothing (u2); if u1, we will have food in the next state regardless of whether we have it now, and vice versa
[13] https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc
A simple example [14]
preference (as a form of objective or reward)
we also have prior preferences p(o), e.g., we like to be fed and not hungry, so we assign a higher probability to the observation o = 1 (fed). In other words, we express preferences over observations as a probability p(o)
[14] https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc
A simple example [15]
action selection:
  option 1) q(s|π) → q(o|π) → G(π) → q(π)
  option 2) $u^*_{t+1} = \arg\max_u D_{KL}\left[A S_{t+1} \,\|\, A\,(B(u))\,S_t\right]$, the Bayesian model average method
[15] https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc
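Option 1 can be traced end to end on the fed/hungry example. This is a sketch under stated assumptions: one-step policies (so π is a single action), the expected cost plus expected ambiguity form of G, and made-up numbers for the A matrix, the B matrices, the preferences `C` and the current belief `s_now`; none of these values come from the cited tutorial.

```python
import numpy as np

# All numbers below are illustrative assumptions, not taken from the tutorial.
A = np.array([[0.9, 0.1],     # A[o, s] = p(o | s): fed/hungry given food/empty
              [0.1, 0.9]])
B = {                          # B[u][s_next, s_prev] = p(s_next | s_prev, u)
    "get_food":   np.array([[1.0, 1.0], [0.0, 0.0]]),
    "do_nothing": np.array([[0.0, 0.0], [1.0, 1.0]]),
}
C = np.array([0.9, 0.1])      # prior preferences p(o): we like to feel fed
s_now = np.array([0.5, 0.5])  # current belief over hidden states

def expected_free_energy(u):
    q_s = B[u] @ s_now                        # q(s | pi)
    q_o = A @ q_s                             # q(o | pi)
    cost = np.sum(q_o * (np.log(q_o) - np.log(C)))
    ambiguity = np.sum(q_s * -np.sum(A * np.log(A), axis=0))
    return float(cost + ambiguity)

actions = list(B)
G = np.array([expected_free_energy(u) for u in actions])
q_pi = np.exp(-G) / np.exp(-G).sum()          # q(pi) = softmax(-G)
```

With these numbers, going to get food yields predicted observations that match the preferences, so it has the lower G and the higher posterior probability.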
Unifying perception and control (Bayesian brain) [16]
Figure: graphical models over time steps 0, 1, ..., τ with latent states s, observations x, optimality variables o, actions a and prior knowledge θ; the recognition model q(s) ∼ p_θ(s | x, o) is shown alongside the generative model p_θ(x, o | s) p_θ(s)
the latent state s: the true world configuration, such as pixel assignment; the optimality o is a binary variable
the perception model includes a bottom-up recognition model q(s) and a top-down generative model p(x, o, s) (decomposed into the likelihood p(x, o | s) and the prior belief p(s))
the prior knowledge θ represents the physical laws of the environment (the properties of each object)
control is performed by taking an action a to change the environment state.
[16] Minne Li et al. “Joint Perception and Control as Inference with an Object-based Implementation”. In: (2020).
Bayesian decision principle
making decisions under uncertainty can be interpreted as maximising expected utility in the face of uncertainty [17]
  the uncertainty is captured by the (hidden) state of nature: z ∈ Z
  decisions (aka actions): a ∈ A
  reward: R(z, a)
  the posterior distribution p(z | o), where we can perform a statistical investigation to obtain information (denoted as o = (o_1, o_2, ..., o_n) ∈ O) about the state of nature z
(Conditional Bayes Decision Principle)
$$a^* \mid o = \arg\max_{a} \mathbb{E}_{z \sim p(z \mid o)}\left[R(z, a)\right], \quad (1)$$
where the principle is the only fundamentally correct analysis [18]. Yet, the posterior can be difficult to obtain in practice
[17] John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[18] James O Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
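For a finite set of states and actions, the conditional Bayes decision is a matrix-vector product followed by an argmax. The umbrella problem below is a hypothetical example with made-up numbers, and `bayes_decision` is an illustrative helper.

```python
import numpy as np

def bayes_decision(posterior, reward):
    """Return argmax_a E_{z ~ p(z|o)}[R(z, a)] and the expected rewards."""
    expected = posterior @ reward     # one expected reward per action
    return int(np.argmax(expected)), expected

# Hypothetical problem: z in {rain, dry}, a in {umbrella, no umbrella}.
posterior = np.array([0.3, 0.7])      # p(z | o) after seeing the forecast
reward = np.array([[ 0.0, -10.0],     # R[z, a]: rows are states of nature
                   [-1.0,   0.0]])
best_action, expected = bayes_decision(posterior, reward)
```

Even though rain is the less likely state here, carrying the umbrella has the higher expected reward because the loss from getting wet is large.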
The duality between inference and control
the filtering problem is shown to be the dual of the noise-free regulator (control) problem [19]
Figures: Optimal filter; Optimal controller
[19] Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”. In: (1960).
Control as inference [20]
(a) states s are hidden
Figure panels: (a) graphical model with states s_1, ..., s_4 and actions a_1, ..., a_4; (b) graphical model with optimality variables O_1, ..., O_4 added
Figure 1: The graphical model for control as inference. We begin by laying out the states and actions, which form the backbone of the model (a). In order to embed a control problem into this model, we need to add nodes that depend on the reward (b). These “optimality variables” correspond to observations in a HMM-style framework: we condition on the optimality variables being true, and then infer the most probable action sequence or most probable action distributions.
A task in this framework can be defined by a reward function r(s_t, a_t). Solving a task typically involves recovering a policy p(a_t | s_t, θ), which specifies a distribution over actions conditioned on the state parameterized by some parameter vector θ. A standard reinforcement learning policy search problem is then given by the following maximization:
$$\theta^\star = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p(s_t, a_t \mid \theta)}\left[r(s_t, a_t)\right]. \quad (1)$$
This optimization problem aims to find a vector of policy parameters θ that maximize the total expected reward $\sum_t r(s_t, a_t)$ of the policy. The expectation is taken under the policy’s trajectory distribution p(τ), given by
$$p(\tau) = p(s_1, a_1, \ldots, s_T, a_T \mid \theta) = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\, p(s_{t+1} \mid s_t, a_t). \quad (2)$$
For conciseness, it is common to denote the action conditional p(a_t | s_t, θ) as π_θ(a_t | s_t), to emphasize that it is given by a parameterized policy with parameters θ. These parameters might correspond, for example, to the weights in a neural network. However, we could just as well embed a standard planning problem in this formulation, by letting θ denote a sequence of actions in an open-loop plan.
Having formulated the decision making problem in this way, the next question we have to ask to derive the control as inference framework is: how can we formulate a probabilistic graphical model such that the most probable trajectory corresponds to the trajectory from the optimal policy? Or, equivalently, how can we formulate a probabilistic graphical model such that inferring the posterior action conditional p(a_t | s_t, θ) gives us the optimal policy?
2.2 The Graphical Model
To embed the control problem into a graphical model, we can begin simply by modeling the relationship between states, actions, and next states. This relationship is simple, and corresponds to a graphical model with factors of the form p(s_{t+1} | s_t, a_t), as shown in Figure 1 (a). However, this graphical model is insufficient for solving control problems, because it has no notion of rewards or costs. We therefore have to introduce an additional variable into this model, which we will denote O_t. This additional variable is a binary random variable, where O_t = 1 denotes that time step t is optimal, and O_t = 0 denotes that it is not optimal. We will choose the distribution over this variable to be given by the following equation:
$$p(O_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t)). \quad (3)$$
The graphical model with these additional variables is summarized in Figure 1 (b). While this might at first seem like a peculiar and arbitrary choice, it leads to a very natural posterior distribution over
(b) states s are observable
the idea: add an optimality variable O_t = 1 and assume a biased probability with reward r: $p(O_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$
[20] Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018); Beren Millidge et al. “On the relationship between active inference and control as inference”. In: International Workshop on Active Inference. Springer. 2020, pp. 3–11.
Control as inference [21]
let us look at an MDP model (states are observable)
the evidence: O_t = 1 for all t ∈ {1, ..., T}; thus the ELBO is
$$\begin{aligned}
\log p(O_{1:T}) &= \log \iint p(O_{1:T}, s_{1:T}, a_{1:T})\, ds_{1:T}\, da_{1:T} \\
&= \log \iint p(O_{1:T}, s_{1:T}, a_{1:T}) \frac{q(s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})}\, ds_{1:T}\, da_{1:T} \\
&= \log \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\left[\frac{p(O_{1:T}, s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})}\right] \\
&\ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\left[\log p(O_{1:T}, s_{1:T}, a_{1:T}) - \log q(s_{1:T}, a_{1:T})\right]
\end{aligned}$$
[21] Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).
Control as inference [22]
we make use of the joint distribution
$$p(O_{1:T}, s_{1:T}, a_{1:T}) = \left[p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\right] \exp\left(\sum_{t=1}^{T} r(s_t, a_t)\right)$$
and factorise the variational distribution as
$$q(\tau) \equiv q(s_{1:T}, a_{1:T}) = q(s_1) \prod_{t=1}^{T} q(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t)$$
this leads to the final ELBO:
$$\log p(O_{1:T}) \ge \mathbb{E}_{q(s_{1:T}, a_{1:T})}\left[\sum_{t=1}^{T} r(s_t, a_t) - \log q(a_t \mid s_t)\right]$$
where we define $q(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$
[22] Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).
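The step from the generic ELBO to this reward-minus-log-policy form is a cancellation of the dynamics terms. A short worked version, assuming in addition that q(s_1) = p(s_1) (an assumption not stated on the slide, although it is the usual choice):

```latex
\begin{aligned}
\log p(O_{1:T}, s_{1:T}, a_{1:T}) - \log q(s_{1:T}, a_{1:T})
&= \log p(s_1) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t=1}^{T} r(s_t, a_t) \\
&\quad - \log q(s_1) - \sum_{t=1}^{T} \log q(s_{t+1} \mid s_t, a_t) - \sum_{t=1}^{T} \log q(a_t \mid s_t) \\
&= \sum_{t=1}^{T} \left[ r(s_t, a_t) - \log q(a_t \mid s_t) \right]
\end{aligned}
```

Taking the expectation under q then gives the bound: the expected total reward plus the entropy of the action distribution q(a_t | s_t).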
Control as inference [23]
without using the temporal mean-field factorisation of Active Inference, one can derive Value Iteration, similar to a standard RL solution
the ELBO can be decomposed recursively as:
$$\begin{aligned}
&\mathbb{E}_{q(s_t, a_t)}\left[r(s_t, a_t) - \log \pi(a_t \mid s_t)\right] + \mathbb{E}_{q(s_t, a_t)}\left[\mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V(s_{t+1})\right]\right] \\
&\quad = \mathbb{E}_{q(s_t)}\left[-D_{KL}\left(\pi(a_t \mid s_t) \,\Big\|\, \frac{1}{\exp(V(s_t))} \exp(Q(s_t, a_t))\right) + V(s_t)\right]
\end{aligned}$$
where we define:
$$Q(s_t, a_t) \equiv r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V(s_{t+1})\right]$$
$$V(s_t) \equiv \log \int_{A} \exp(Q(s_t, a_t))\, da_t$$
$$\rightarrow \pi(a_t \mid s_t) = \exp(Q(s_t, a_t) - V(s_t))$$
[23] Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).
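The recursion for Q, V and π can be run backwards in time on a small tabular problem. This is a sketch under stated assumptions: a finite action set (so the integral over A becomes a sum), a finite horizon with V set to zero beyond it, and a made-up 2-state, 2-action MDP; `soft_value_iteration` is an illustrative name.

```python
import numpy as np

def soft_value_iteration(r, P, horizon):
    """Backward recursion: Q = r + E[V'], V = log sum_a exp(Q), pi = exp(Q - V).

    r[s, a] is the reward, P[s, a, s'] the transition probability.
    """
    n_states, _ = r.shape
    V = np.zeros(n_states)                     # V beyond the horizon is 0
    for _ in range(horizon):
        Q = r + P @ V                          # Q(s, a) = r + E_{s'}[V(s')]
        m = Q.max(axis=1, keepdims=True)       # stabilise the log-sum-exp
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True)))[:, 0]
    pi = np.exp(Q - V[:, None])                # pi(a | s) = exp(Q - V)
    return Q, V, pi

# Hypothetical 2-state, 2-action MDP with non-positive rewards.
r = np.array([[-1.0, -0.1],
              [-0.5, -2.0]])
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0    # action 0 always leads to state 0
P[:, 1, 1] = 1.0    # action 1 always leads to state 1
Q, V, pi = soft_value_iteration(r, P, horizon=5)
```

Because V is the log-sum-exp of Q, the resulting policy is automatically normalised: it is a softmax over Q rather than the hard max of standard value iteration.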
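Agent modelling with maximum entropy objective [24]
each agent pursues the maximal cumulative reward
$$\max\ \eta^i(\pi_\theta) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t R^i(s_t, a^i_t, a^{-i}_t)\right], \quad (2)$$
with actions $(a^i_t, a^{-i}_t)$ sampled from the policy $(\pi^i_{\theta^i}, \pi^{-i}_{\theta^{-i}})$
a strategy profile $(\pi^{1*}, \ldots, \pi^{n*})$ reaches the optimum when:
$$\mathbb{E}_{s \sim p_s,\, a^{i*}_t \sim \pi^{i*},\, a^{-i*}_t \sim \pi^{-i*}}\left[\sum_{t=1}^{\infty} \gamma^t R^i(s_t, a^{i*}_t, a^{-i*}_t)\right] \ge \mathbb{E}_{s \sim p_s,\, a^i_t \sim \pi^i,\, a^{-i}_t \sim \pi^{-i}}\left[\sum_{t=1}^{\infty} \gamma^t R^i(s_t, a^i_t, a^{-i}_t)\right] \quad (3)$$
$\forall \pi \in \Pi,\ i \in (1 \ldots n)$,
where $\pi = (\pi^i, \pi^{-i})$ and agent i’s optimal policy is $\pi^{i*}$
[24] Zheng Tian et al. “A regularized opponent model with maximum entropy objective”. In: IJCAI (2019).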
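Agent modelling with maximum entropy objective
a lower bound on the likelihood of optimality of agent i:
$$\begin{aligned}
\log P(O^i_{1:T} = 1 \mid O^{-i}_{1:T} = 1) &\ge \sum_t \mathbb{E}_{(s_t, a^i_t, a^{-i}_t) \sim q}\left[R^i(s_t, a^i_t, a^{-i}_t) + H(\pi(a^i_t \mid s_t, a^{-i}_t)) - D_{KL}(\rho(a^{-i}_t \mid s_t) \,\|\, P(a^{-i}_t \mid s_t))\right] \quad (4) \\
&= \sum_t \mathbb{E}_{s_t}\Big[\mathbb{E}_{a^i_t \sim \pi,\, a^{-i}_t \sim \rho}\big[\underbrace{R^i(s_t, a^i_t, a^{-i}_t) + H(\pi(a^i_t \mid s_t, a^{-i}_t))}_{\text{MEO}}\big] - \mathbb{E}_{a^{-i}_t \sim \rho}\big[\underbrace{D_{KL}(\rho(a^{-i}_t \mid s_t) \,\|\, P(a^{-i}_t \mid s_t))}_{\text{Regulariser of } \rho}\big]\Big]. \quad (5)
\end{aligned}$$
  $\rho(a^{-i}_t \mid s_t, o^{-i}_t = 1)$ is agent i’s opponent model
  $\pi(a^i_t \mid s_t, a^{-i}_t, o^i_t = 1, o^{-i}_t = 1)$ is agent i’s conditional policy at optimum ($o^i_t = o^{-i}_t = 1$), and $P(a^{-i}_t \mid s_t, o^{-i}_t = 1)$ is the prior of the opponent model
  the prior $P(a^{-i}_t \mid s_t, o^{-i}_t = 1)$ is estimated empirically
  we drop $(o^i_t = 1, o^{-i}_t = 1)$ in π, ρ and $P(a^{-i}_t \mid s_t)$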
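For a single state with finite action sets, the three terms inside the bound can be evaluated directly. The sketch below is an illustration only: the reward table `R`, the conditional policy `pi`, the opponent model `rho` and the prior are made-up numbers, and `regularised_objective` is a hypothetical helper, not the training procedure of the cited paper.

```python
import numpy as np

def regularised_objective(R, pi, rho, prior):
    """E_{a_opp ~ rho, a_i ~ pi}[R] + E_rho[H(pi(. | a_opp))] - KL(rho || prior).

    R[a_i, a_opp] is agent i's reward, pi[a_opp, a_i] its conditional policy,
    rho[a_opp] the opponent model, prior[a_opp] the opponent prior.
    """
    joint = rho[:, None] * pi                       # q(a_opp, a_i)
    expected_reward = float(np.sum(joint * R.T))
    entropy = float(-np.sum(joint * np.log(pi)))    # E_rho[H(pi(. | a_opp))]
    kl = float(np.sum(rho * (np.log(rho) - np.log(prior))))
    return expected_reward + entropy - kl, kl

# Hypothetical single-state game with two actions per agent.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                          # coordination reward
pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])                         # rows: opponent action
rho = np.array([0.6, 0.4])
prior = np.array([0.5, 0.5])

value, kl = regularised_objective(R, pi, rho, prior)
_, kl_at_prior = regularised_objective(R, pi, prior, prior)
```

The KL term penalises an opponent model that drifts away from the empirical prior, and it vanishes when ρ equals the prior.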
Recursive reasoning [25]
Example: in the “beauty contest” game, players are asked to pick numbers from 0 to 100, and the player whose number is closest to 2/3 of the average wins a prize
[25] Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. “A cognitive hierarchy model of games”. In: The Quarterly Journal of Economics 119.3 (2004), pp. 861–898.
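Recursive reasoning in this game can be sketched with simple level-k thinking. The assumptions here are illustrative: a level-0 player picks uniformly at random (mean 50), each level-k player believes everyone else is level k−1, and a single player's effect on the average is ignored. This is simpler than the cognitive hierarchy model of the cited paper.

```python
def level_k_guess(k, level0_guess=50.0, factor=2.0 / 3.0):
    """Guess of a level-k player who best-responds to level-(k-1) players."""
    guess = level0_guess
    for _ in range(k):
        guess *= factor      # best response to everyone guessing `guess`
    return guess

guesses = [level_k_guess(k) for k in range(4)]
```

Each extra level of reasoning shrinks the guess by a factor of 2/3, so deeper reasoners move towards 0.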
Multi-agent generalized recursive reasoning [26]
Figure: graphical models of lower-level recursive reasoning, with panels Level 1, Level 2 and Level k. Each panel contains the state s, the opponent actions a^{-i} and the optimality variable O^i. Labels in Level 1: a^i_1, π^i_1(a^i | s, a^{-i}), ρ^{-i}_0(a^{-i} | s). Labels in Level 2: a^i_0, a^{-i}_1, a^i_2, π^i_2(a^i | s, a^{-i}), π^i_0(a^i | s), ρ^{-i}_1(a^{-i} | s, a^i). Labels in Level k: a^i_k, a^{-i}_{k−1}, a^i_{k−2}, ..., a_0, π^i_k(a^i | s, a^{-i}), ρ^{-i}_{k−1}(a^{-i} | s, a^i), π^i_{k−2}(a^i | s, a^{-i}), with the recursion ending in π^i_0(a^i | s) or ρ^{-i}_0(a^{-i} | s).
unobserved opponent policies are approximated by ρ^{-i}
agent i recursively reasons about opponents (grey area)
in the recursion, agents with higher-level beliefs take the best response to the lower-level thinkers’ actions.
[26] Ying Wen et al. “Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning”. In: IJCAI (2020).
Remarks
AI: reasoning under uncertainty
a learning system is an information system
pattern recognition and decision-making problems (including multi-agent learning) can be unified and modelled by probabilistic inference such as Variational Bayes (e.g., active inference or control as inference)
as such, information theory can be directly utilised
however, we call for new research on information theory that deals with "intrinsic information exchange"
References
James O Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
David M Blei, Alp Kucukelbir, and Jon D McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. “A cognitive hierarchy model of games”. In: The Quarterly Journal of Economics 119.3 (2004), pp. 861–898.
Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.
Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.
Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”. In: (1960).
Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).
Minne Li et al. “Joint Perception and Control as Inference with an Object-based Implementation”. In: (2020).
David Marr. “Vision: A computational investigation into the human representation and processing of visual information”. In: Inc., New York, NY 2.4.2 (1982).
Beren Millidge et al. “On the relationship between active inference and control as inference”. In: International Workshop on Active Inference. Springer. 2020, pp. 3–11.
Beren Millidge, Alexander Tschantz, and Christopher L Buckley. “Whence the expected free energy?” In: Neural Computation 33.2 (2021), pp. 447–482.
Ivan Petrovitch Pavlov and William Gantt. “Lectures on conditioned reflexes: Twenty-five years of objective study of the higher nervous activity (behaviour) of animals.” In: (1928).
Eliezer Sternberg. NeuroLogic: The Brain’s Hidden Rationale Behind Our Irrational Behavior. Vintage, 2016.
Zheng Tian et al. “A regularized opponent model with maximum entropy objective”. In: IJCAI (2019).