
Page 1: Probabilistic Robotics: A Tutorial

Probabilistic Robotics: A Tutorial

Juan Antonio Fernández Madrigal, October 2004

Systems Engineering and Automation Dept.

University of Málaga (Spain)

Page 2: Probabilistic Robotics: A Tutorial


Contents

1. Introduction
1.1 Probabilistic Robotics?
1.2 The Omnipresent Core: Bayes' Rule
1.3 Let's Filter! (Bayes Filter)

2. You'll Find Them Everywhere: Basic Mathematical Tools
2.1 A Visit to the Casino: Monte Carlo Methods
2.2 Partially Unknown Uncertainty: The EM Algorithm
2.3 Approximating Uncertainty Efficiently: Particle Filters

3. The Foundations: The Common Bayesian Framework
3.1 Graphs plus Uncertainty: Graphical Models
3.2 Arrows on Graphs: Bayesian Networks
3.3 Let it Move: Dynamic Bayesian Networks (DBNs)

4. Forgetting the Past: Markovian Models
4.1 It is Easy if it is Gaussian: Kalman Filters
4.2 On the Line: Markov Chains
4.3 What to Do?: Markov Decision Processes (MDPs)
4.4 For Non-Omniscient People: Hidden Markov Models (HMMs)
4.5 For People that Do Things: POMDPs

Page 3: Probabilistic Robotics: A Tutorial


1. Introduction

1.1 Probabilistic Robotics?

What's Probabilistic Robotics?

- Robotics that uses probability calculus for modeling and/or reasoning about robot actions and perceptions.

- State-of-the-art Robotics.

Why Probabilistic Robotics?

- For coping with uncertainty in the robot's environment.

- For coping with uncertainty/noise in the robot's perceptions.

- For coping with uncertainty/noise in the robot's actions.

Page 4: Probabilistic Robotics: A Tutorial


1. Introduction

1.2 The Omnipresent Core: Bayes' Rule (~1750)

- Rule for updating your existing belief (the probability of some variable) given new evidence (the occurrence of some event).

- In spite of its simplicity, it is the basis for most probabilistic approaches in robotics and other sciences.

P(R=r | e) = P(e | R=r) P(R=r) / P(e)

- P(R=r | e): posterior probability (the probability that a random variable R takes the value r, given that the event e has occurred).

- P(e | R=r): conditional probability (the probability that the event e occurs if the random variable R takes the value r).

- P(R=r): prior probability, or belief (the probability that the random variable R takes the value r anyway).

- P(e) = Σ_(all r) P(e | R=r) P(R=r): normalizing factor, so that the posterior adds up to 1.
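As a concrete illustration (not from the original slides), a minimal Python sketch of the rule for a two-valued variable R, with made-up prior and likelihood numbers:

import numpy as np  # not strictly needed here; plain Python suffices

# P(R=r): prior belief over the two values of R (invented numbers)
prior = {"r1": 0.7, "r2": 0.3}
# P(e | R=r): likelihood of the observed event e under each value of R
likelihood = {"r1": 0.1, "r2": 0.8}

# Normalizing factor: P(e) = sum over all r of P(e | R=r) P(R=r)
p_e = sum(likelihood[r] * prior[r] for r in prior)

# Posterior: P(R=r | e) = P(e | R=r) P(R=r) / P(e)
posterior = {r: likelihood[r] * prior[r] / p_e for r in prior}
print(posterior)  # {'r1': 0.2258..., 'r2': 0.7741...}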

Page 5: Probabilistic Robotics: A Tutorial



1. Introduction

1.3 Let’s Filter! (Bayes Filter)

- Bayes' Rule can be iterated to improve the estimate over time:

P(R=r at t2 | e from t1 to t2) = P(e at t2 | R=r at t2) P(R=r at t2) / Σ_(all r) P(e at t2 | R=r at t2) P(R=r at t2)

- In general, depending on where the estimation instant lies with respect to the interval [t1, t2] of known evidence, there are the following possibilities (the original slide shows three timelines):

- Estimating at an instant inside [t1, t2]: this is called "fixed-lag smoothing".

- Estimating at an instant beyond t2: this is called "prediction".

- Estimating at t2 itself: this is called "filtering" = Bayes Filter.
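A minimal sketch of the filtering case for a discrete state, assuming a two-state system with an invented transition matrix and invented observation likelihoods (none of these numbers come from the tutorial):

import numpy as np

T = np.array([[0.9, 0.1],    # T[i, j] = P(X_t = j | X_{t-1} = i), assumed known
              [0.2, 0.8]])

def bayes_filter_step(belief, z_likelihood):
    """One filtering step: propagate the belief, then weight it by the evidence."""
    predicted = belief @ T                 # prediction: sum_i P(x_t | x_{t-1}=i) bel(i)
    updated = z_likelihood * predicted     # update: multiply by P(z_t | x_t)
    return updated / updated.sum()         # normalize so the posterior adds up to 1

belief = np.array([0.5, 0.5])              # initial belief over the two states
for z_lik in [np.array([0.7, 0.3]), np.array([0.6, 0.4])]:  # P(z_t | x_t) per step
    belief = bayes_filter_step(belief, z_lik)
print(belief)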

Page 6: Probabilistic Robotics: A Tutorial


2. You’ll Find Them Everywhere: Basic Mathematical Tools

2.1 A Visit to the Casino: Monte Carlo Methods (1946)

-A set of methods based on statistical sampling for approximating some value(s) (any quantitative data) when analytical methods are not available or computationally unsuitable.

- In its general form, it is a way to approximate integrals. Given the difficult integral (the function f is known):

∫₀¹ ∫₀¹ ... ∫₀¹ f(u1, u2, ..., un) du1 du2 ... dun

It can be approximated by the following steps:

1) Take a uniform distribution over the region of integration: U ~ Uniform([0,1]^n).

2) Calculate the expectation of f(U) by statistical sampling (m samples):

E[f(U)] ≈ (1/m) Σ_(k=1..m) f(U_k)

3) It follows from probability calculus that:

E[f(U)] = ∫_([0,1]^n) f(u) p(u) du

where p(u) is the probability density function of u.

4) Since U is uniform, p(u) = 1, so:

E[f(U)] = ∫_([0,1]^n) f(u) du

5) The standard error is:

error ≈ σ / √m

where σ is the standard deviation of each sample (unknown).

6) The error diminishes with many samples, but maybe slowly... There are techniques of "variance reduction" to also reduce σ:

- Antithetic variates
- Control variates
- Importance sampling
- Stratified sampling
- ...

- The error of the approximation does not depend on the dimensionality of the data.
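A minimal Python sketch of steps 1) to 5), assuming an example integrand f on [0,1]^2 chosen only for illustration:

import numpy as np

def f(u):                                   # example integrand on [0,1]^2
    return np.sin(u[:, 0]) * u[:, 1] ** 2

rng = np.random.default_rng(0)
m = 100_000
U = rng.uniform(size=(m, 2))                # step 1: m samples from Uniform([0,1]^2)
samples = f(U)

estimate = samples.mean()                   # step 2: E[f(U)] ~ (1/m) sum_k f(U_k)
std_error = samples.std(ddof=1) / np.sqrt(m)   # step 5: sigma / sqrt(m)
print(estimate, "+/-", std_error)           # exact value: (1 - cos 1)/3 ~ 0.1532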

Page 7: Probabilistic Robotics: A Tutorial


2. You’ll Find Them Everywhere: Basic Mathematical Tools

2.2 Partially Unknown Uncertainty: The EM Algorithm

- The Expectation-Maximization algorithm (1977) can be used, in general, for estimating any probability distribution from real measurements that may be incomplete.

-The algorithm works in two steps (E and M) that are iterated, improving the likelihood of the estimate over time. It can be demonstrated that the algorithm converges to a local optimum.

1. E-step (Expectation). Given the current estimate of the distribution, calculate the expected (complete-data) log-likelihood of the measurements with respect to that estimate.

2. M-step (Maximization). Produce a new estimate that improves (maximizes) that expectation.

Page 8: Probabilistic Robotics: A Tutorial


2. You’ll Find Them Everywhere: Basic Mathematical Tools

2.2 Partially Unknown Uncertainty: The EM Algorithm

- Mathematical formulation (in the general case):

Z = (X, Y): all the data (both missing and measured)
X: measured data
Y: missing or hidden data (not measured)

p(Z | M) = p(X, Y | M): complete-data likelihood given a model M (we will maximize the expectation of this; M is unknown, to be optimized).

- E-step:

In general, E[h(W) | R=r] = ∫_(all w from W) h(w) p(w | r) dw

Thus, E[log p(X, Y | M) | X, M(i-1)] = ∫_(all y from Y) log p(X, y | M) p(y | X, M(i-1)) dy

- M-step:

M(i) = argmax (on M) E[log p(X, Y | M) | X, M(i-1)]

Variation: M(i) = any M that makes the expectation greater than M(i-1) does. This is Generalized EM (GEM), which also converges.
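A minimal sketch of the E and M steps for one concrete case, a 1-D mixture of two Gaussians; the synthetic data, the starting values, and the update formulas for this particular model are assumptions of the example, not part of the slides:

import numpy as np

rng = np.random.default_rng(1)
# Synthetic measured data X: two Gaussian clusters; which cluster produced each
# point is the hidden data Y; the model M is (weights, means, variances).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

w = np.array([0.5, 0.5])        # initial mixture weights
mu = np.array([-1.0, 1.0])      # initial means
var = np.array([1.0, 1.0])      # initial variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: expected component memberships under the current estimate M(i-1)
    resp = w * gauss(x[:, None], mu, var)        # shape (n, 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate M(i) to maximize the expected complete-data log-likelihood
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)   # should approach (0.3, 0.7), (-2, 3), (1, 1)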

Page 9: Probabilistic Robotics: A Tutorial


2. You’ll Find Them Everywhere: Basic Mathematical Tools

2.3 Approximating Uncertainty Efficiently: Particle Filters

- A Monte Carlo filter (i.e., Monte Carlo sampling iterated over time).

- It is useful due to its efficiency.

- Probability distributions are represented by samples ("particles") with associated weights, and information is obtained from the distributions by computing on those samples.

- As the number of samples increases, the accuracy of the estimate increases.

- There is a diversity of particle filter algorithms, depending on how the samples are selected; one common variant is sketched below.
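A minimal sketch of the bootstrap particle filter for a 1-D random-walk state observed in Gaussian noise; the motion model, observation model, and constants are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
n = 1000                                   # number of particles
particles = rng.normal(0.0, 1.0, n)        # samples representing the initial belief
weights = np.full(n, 1.0 / n)              # associated weights

def pf_step(particles, weights, z, motion_std=0.5, obs_std=1.0):
    # 1) propagate each sample through the (assumed) motion model
    particles = particles + rng.normal(0.0, motion_std, n)
    # 2) re-weight the samples by the observation likelihood p(z | x)
    weights = weights * np.exp(-0.5 * (z - particles) ** 2 / obs_std ** 2)
    weights /= weights.sum()
    # 3) resample to concentrate samples where the probability mass is
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

for z in [0.4, 0.9, 1.3]:                  # a short sequence of observations
    particles, weights = pf_step(particles, weights, z)
print(particles.mean())                    # point estimate of the state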

Page 10: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.1 Graphs plus Uncertainty: Graphical Models

-A common formalism that copes with both uncertainty and complexity, two problems commonly found in applied mathematics and engineering.

- A graphical model is a graph with associated probabilities. Nodes represent random variables. An arc between two nodes indicates a statistical dependence between the two variables.

-Many specific derivations: mixture models, factor analysis, hidden Markov models, Kalman filters, etc.

-Three basic types:

A) Undirected graphs (=Markov Random Fields): in Physics, Computer Vision, ...

B) Directed (=Bayesian Networks): in Artificial Intelligence, Statistics, ...

C) Mixed (=Chain Graphs).

Page 11: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.2 Arrows on Graphs: Bayesian Networks

Nodes (variables) can hold discrete or continuous values. Arcs represent causality (and conditional probability).

The model is completely defined by its graph structure, the values of its nodes (variables), and the conditional probabilities of the arcs.

(Network structure: Cloudy (C) is a parent of Sprinkler (S) and Rain (R); S and R are parents of Wet grass (W).)

P(C=true) = 0.5    P(C=false) = 0.5

P(S=true | C=true) = 0.1    P(S=false | C=true) = 0.9
P(S=true | C=false) = 0.5   P(S=false | C=false) = 0.5

P(R=true | C=true) = 0.8    P(R=false | C=true) = 0.2
P(R=true | C=false) = 0.2   P(R=false | C=false) = 0.8

P(W=true | S=true, R=true) = 0.99    P(W=false | S=true, R=true) = 0.01
P(W=true | S=true, R=false) = 0.9    P(W=false | S=true, R=false) = 0.1
P(W=true | S=false, R=true) = 0.9    P(W=false | S=false, R=true) = 0.1
P(W=true | S=false, R=false) = 0     P(W=false | S=false, R=false) = 1

Page 12: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.2 Arrows on Graphs: Bayesian Networks - Inference

(Same network as before: Cloudy (C) → Sprinkler (S), Rain (R); S, R → Wet grass (W).)

1) Bottom-up Reasoning, or Diagnostic: from effects to causes.

- For example: given that the grass is wet (W=true), which is more likely, the sprinkler being on (S=true) or the rain (R=true)? W is the effect; S and R are the causes.

- We seek P(S=true | W=true) and P(R=true | W=true).

- Using the definition of conditional probability:

P(S=true | W=true) = P(S=true, W=true) / P(W=true)

- In general, using the chain rule:

P(C,S,R,W) = P(C) P(S|C) P(R|S,C) P(W|S,R,C)

- But by the graph structure: P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)

Page 13: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.2 Arrows on Graphs: Bayesian Networks - Inference

(Continuing the diagnostic example from the previous slide: we have P(S=true | W=true) = P(S=true, W=true) / P(W=true), with the joint factorized by the graph structure; P(W=true) is still needed.)

- By marginalization:

P(W=true) = Σ_(all c,s,r) P(C=c, S=s, R=r, W=true)

and similarly for the numerators, as in the sketch below.
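A brute-force sketch of this diagnostic inference, enumerating the joint distribution with the CPTs from the sprinkler slide (enumeration is just the naive approach, not an algorithm prescribed by the tutorial):

from itertools import product

P_C = 0.5                                  # P(C=true)
P_S = {True: 0.1, False: 0.5}              # P(S=true | C)
P_R = {True: 0.8, False: 0.2}              # P(R=true | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W=true | S, R)

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R), per the graph structure."""
    p = P_C if c else 1 - P_C
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

# Marginalization: P(W=true) = sum over all c, s, r of the joint
p_w = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_sw = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2))
p_rw = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
print(p_sw / p_w)   # P(S=true | W=true) ~ 0.430
print(p_rw / p_w)   # P(R=true | W=true) ~ 0.708: the rain is the more likely cause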

Page 14: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.2 Arrows on Graphs: Bayesian Networks - Inference

2) Top-down Reasoning, or Causal/Generative Reasoning: from causes to effects.

- For example: given that it is cloudy (C=true), what is the probability that the grass is wet (W=true)?

- We seek P(W=true | C=true).

- The inference is similar.

Page 15: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.2 Arrows on Graphs: Bayesian Networks - Causality

- It is possible to formalise whether a variable (node) is a cause of another or whether they are merely correlated.

- This would be useful, for example, for a robot to learn the effects of its actions...

Page 16: Probabilistic Robotics: A Tutorial


3. The Foundations: The Common Bayesian Framework

3.3 Let it Move: Dynamic Bayesian Networks (DBNs)

- Bayesian Networks extended with time; "dynamic" does not mean that the graph structure or the parameters vary.

- (Very) simplified taxonomy of Bayesian Networks:

Graphical Models
  - Undirected = Markov Random Fields
  - Directed = Bayesian Networks
    - Non-temporal
    - Temporal = Dynamic Bayesian Networks = Markov Processes (independence of the future w.r.t. all the past)
      - Totally observable
        - No actions: Markov Chains
        - Actions: Markov Decision Processes (MDPs)
      - Partially observable
        - No actions: Hidden Markov Models (HMMs); with Gaussian models: Kalman Filters
        - Actions: Partially Observable Markov Decision Processes (POMDPs)

Page 17: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.1 It is Easy if it is Gaussian: Kalman Filters (1960)

- They model dynamic systems with partial observability and Gaussian probability distributions.

- It is of interest: to estimate the current state. (Thus, the EM algorithm can be thought of as an alternative that is not restricted to Gaussians.)

- Applications: any in which the state of a known dynamical system must be estimated under Gaussian uncertainty/noise: computer vision (tracking), robot SLAM (if the map is considered part of the state), ...

- Extensions: to reduce computational cost (e.g., when the state has a large description), to cope with more than one hypothesis (e.g., when two indistinguishable landmarks yield a bimodal distribution for the pose of a robot), to cope with non-linear systems (through linearization: the EKF), ...

Page 18: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.1 It is Easy if it is Gaussian: Kalman Filters (1960)

- Mathematical formulation:

State transition model, p(x | u, x'):   x = A x' + B u + ε_c

- x: current state of the system
- x': last state
- u: actions performed by the system
- A, B: known linear model of the system
- ε_c: Gaussian noise in the system actions (white noise): mean μ_c = 0, covariance matrix Σ_c

Observation model, p(z | x):   z = C x + ε_m

- z: current observations of the system
- x: current state of the system
- C: known linear model of the observations
- ε_m: Gaussian noise in the observations (white noise): mean μ_m = 0, covariance matrix Σ_m

Page 19: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.1 It is Easy if it is Gaussian: Kalman Filters (1960)

- Through a Bayes Filter (prediction from the motion model, then correction with the observation):

μ̄_t = A μ_(t-1) + B u_t
Σ̄_t = A Σ_(t-1) Aᵀ + Σ_c

K_t = Σ̄_t Cᵀ (C Σ̄_t Cᵀ + Σ_m)⁻¹
μ_t = μ̄_t + K_t (z_t − C μ̄_t)    ← state estimate
Σ_t = (I − K_t C) Σ̄_t
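A minimal sketch of these equations for a 1-D state; A, B, C and the noise covariances are illustrative choices, not values from the tutorial:

import numpy as np

A = np.array([[1.0]])          # state transition model
B = np.array([[1.0]])          # control model
C = np.array([[1.0]])          # observation model
Sigma_c = np.array([[0.1]])    # control (process) noise covariance
Sigma_m = np.array([[0.5]])    # observation noise covariance

mu = np.array([[0.0]])         # initial state estimate
Sigma = np.array([[1.0]])      # initial covariance

def kalman_step(mu, Sigma, u, z):
    # Prediction
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + Sigma_c
    # Correction
    K = Sigma_bar @ C.T @ np.linalg.inv(C @ Sigma_bar @ C.T + Sigma_m)
    mu_new = mu_bar + K @ (z - C @ mu_bar)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_bar
    return mu_new, Sigma_new

for u, z in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:   # invented controls/observations
    mu, Sigma = kalman_step(mu, Sigma, np.array([[u]]), np.array([[z]]))
print(mu.ravel(), Sigma.ravel())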

Page 20: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.2 On the Line: Markov Chains

- Nodes of the network represent the random variable X at a given instant of time (unknown except for the first node).

- The arc from node X_n to node X_(n+1) represents the conditional probability of each value of X at instant n+1 given its value at instant n (no other past instant is considered, since the model is Markovian). Instants of time are discrete. All conditionals are known.

- It is of interest: a) causal reasoning: to obtain X_n from all its past; and b) whether the probability distribution converges over time, which is assured if the chain is ergodic (e.g., if any node is reachable in one step from any other node), as the sketch below illustrates.

- Applications:
  - Direct: physics, computer networks, ...
  - Indirect: as part of more sophisticated models.
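The convergence sketch referenced above, with an invented two-state transition matrix whose entries are all positive:

import numpy as np

T = np.array([[0.7, 0.3],     # T[i, j] = P(X_{n+1} = j | X_n = i); all entries > 0,
              [0.4, 0.6]])    # so every state is reachable in one step

dist = np.array([1.0, 0.0])   # distribution of the first node (known)
for _ in range(50):
    dist = dist @ T           # one step of causal reasoning: X_n -> X_{n+1}
print(dist)                   # converges to the stationary distribution (4/7, 3/7)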

Page 21: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.3 What to Do?: Markov Decision Processes

- Markov Processes with actions (output arcs) that can be carried out at each node (state), with some reward as the result of a given action in a given state (a case of Reinforcement Learning, RL).

- It is of interest: to obtain the best sequence of actions (a Markov chain) that optimizes the reward. Any such sequence of actions (chain) is called a policy.

- It can be demonstrated that in every MDP there always exists an optimal policy (the one that optimizes the reward).

- Obtaining the optimal policy is expensive (polynomial). There are several algorithms for solving it, some of them reducing that cost (by hierarchies, etc.). The most classical one, value iteration, is sketched below.

- Applications: decision making in general, robot path planning, travel route planning, elevator scheduling, bank customer retention, autonomous aircraft navigation, manufacturing processes, network switching and routing, ...

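The value-iteration sketch referenced above, on a toy 3-state, 2-action MDP whose transition probabilities and rewards are invented for the example:

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([  # P[a, s, s'] = transition probability of action a from s to s'
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]],   # action 1
])
R = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 0.0]])          # R[s, a] = reward

V = np.zeros(n_states)
for _ in range(200):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                 # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop when the values have converged
        break
    V = V_new

policy = Q.argmax(axis=1)                 # the optimal policy: best action per state
print(V, policy)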

Page 22: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- (~1960-70) Markov Processes without actions and with partial observability.

- The states of the network are not directly accessible, except through some stochastic measurement. That is, observations are a probabilistic function of the state.

- It is of interest:

a) What is the probability of the sequence of observations, given the network? (How good is a given model? Not really interesting for us.)

b) Which states have we most likely visited, given the observations and the network parameters? (Where are we in the model? Of little use here, since the model must be known.)

c) Which network parameters maximize the probability of having obtained those observations? (Which is the model? Robot mapping/localisation.)

- Applications: speech processing, robot SLAM, bioinformatics (gene finding, protein modeling, etc.), image processing, finance, traffic, ...

Page 23: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Elements in an HMM:

N = number of states in the network (i-th state = s_i).
M = number of different possible observations (k-th observation = o_k).
A = matrix (N × N) of state transition probabilities: a_xy = P(q_(t+1) = s_y | q_t = s_x).
B = matrix (N × M) of observation probabilities: b_x(o_k) = P(o_t = o_k | q_t = s_x).
π = matrix (1 × N) of initial state probabilities: π_x = P(q_0 = s_x).
λ = HMM model = (A, B, π).

Page 24: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Solution to Problem a): what is the probability of a sequence of observations (of length T), given the network?

- Direct approach: enumerate all the possible sequences of states (paths) of length T in the network; for each one, calculate the probability that the given sequence of observations is obtained if the path is followed, calculate the probability of the path itself, and thus the joint probability of the path and of the observations along it; finally, sum over all the possible paths in the network.

- Its cost depends on the number of paths of length T: O(2T N^T) operations, unfeasible for T long enough.

Page 25: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Efficient approach: the forward-backward procedure.

- A forward variable is defined as α_t(i) = P(O_1, O_2, ..., O_t, q_t = s_i | λ): the probability of observing the sequence O_1, O_2, ..., O_t and ending up in state s_i at time t.

- It is calculated recursively:

1. α_1(i) = π_i b_i(O_1), for all states i from 1 to N.
2. α_(t+1)(j) = [ Σ_(i=1..N) α_t(i) a_ij ] b_j(O_(t+1)), for all states j from 1 to N.
3. P(O | λ) = Σ_(i=1..N) α_T(i).

- This calculation is O(N²T); see the sketch below.
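The sketch referenced above: the forward recursion on a toy two-state model (A, B and π are invented numbers):

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])    # a_xy = P(q_{t+1}=s_y | q_t=s_x)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # b_x(o_k) = P(o_t=o_k | q_t=s_x)
pi = np.array([0.5, 0.5])                 # initial state probabilities
O = [0, 1, 1, 0]                          # an observation sequence (indices into B)

alpha = pi * B[:, O[0]]                   # step 1: alpha_1(i) = pi_i b_i(O_1)
for o in O[1:]:
    # step 2: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
    alpha = (alpha @ A) * B[:, o]
print(alpha.sum())                        # step 3: P(O | lambda) = sum_i alpha_T(i)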

Page 26: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Efficient approach: the forward-backward procedure (continued).

- Alternatively, a backward variable can be defined as β_t(i) = P(O_(t+1), O_(t+2), ..., O_T | q_t = s_i, λ): the probability of the remaining observation sequence O_(t+1), ..., O_T given that state s_i has been reached at time t.

- It is calculated recursively (backwards in time):

1. β_T(i) = 1, for all states i from 1 to N.
2. β_t(i) = Σ_(j=1..N) a_ij b_j(O_(t+1)) β_(t+1)(j), for all states i from 1 to N.
3. P(O | λ) = Σ_(i=1..N) π_i b_i(O_1) β_1(i).
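The backward counterpart, on the same toy model as the forward sketch; its termination value matches the forward result:

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
O = [0, 1, 1, 0]

beta = np.ones(2)                         # step 1: beta_T(i) = 1
for o in reversed(O[1:]):
    # step 2: beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    beta = A @ (B[:, o] * beta)
print((pi * B[:, O[0]] * beta).sum())     # step 3: P(O | lambda), same as forward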

Page 27: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Solution to Problem b): which states have we most likely visited, given a sequence of observations of length T and the network?

- There is not a unique solution (unlike problem a)): it depends on the optimality criterion chosen. But once one is chosen, a solution can be found analytically.

- The Viterbi Algorithm: it finds the single best sequence of states, the one that maximizes the joint probability of the whole state sequence together with the observations.

- The following two variables are defined:

δ_t(i) = max over q_1, q_2, ..., q_(t-1) of P(q_1, q_2, ..., q_t = s_i, O_1, O_2, ..., O_t | λ)

(It considers all the sequences of states that reach state s_i at time t and produce the given observations, and returns the maximum probability found.)

Recursively: δ_t(j) = [ max_(i=1..N) δ_(t-1)(i) a_ij ] b_j(O_t)

ψ_t(j) = argmax_(i=1..N) (δ_(t-1)(i) a_ij)

(It traces the state that maximizes the expression, i.e., the best predecessor of state s_j at time t.)

Page 28: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- The algorithm works as follows:

1. δ_1(i) = π_i b_i(O_1) and ψ_1(i) = 0, for all states i from 1 to N.
2. δ_t(j) = [ max_(i=1..N) δ_(t-1)(i) a_ij ] b_j(O_t) and ψ_t(j) = argmax_(i=1..N) (δ_(t-1)(i) a_ij), for all states j from 1 to N, for t from 2 to T.
3. P* = max_(i=1..N) δ_T(i); q_T* = argmax_(i=1..N) δ_T(i) (the ending state).
4. q_t* = ψ_(t+1)(q_(t+1)*), for t from T-1 down to 1 (for retrieving all the other states in the sequence).
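A minimal sketch of these four steps, reusing the toy two-state model from the forward-procedure sketch:

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
O = [0, 1, 1, 0]

delta = pi * B[:, O[0]]                       # step 1
psi = [np.zeros(2, dtype=int)]
for o in O[1:]:                               # step 2
    trans = delta[:, None] * A                # trans[i, j] = delta_{t-1}(i) a_ij
    psi.append(trans.argmax(axis=0))          # psi_t(j): best predecessor of j
    delta = trans.max(axis=0) * B[:, o]

path = [int(delta.argmax())]                  # step 3: the ending state q_T*
for psi_t in reversed(psi[1:]):               # step 4: backtracking
    path.append(int(psi_t[path[-1]]))
path.reverse()
print(delta.max(), path)                      # P* and the best state sequence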

Page 29: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.4 For Non-Omniscient People: Hidden Markov Models

- Solution to Problem c): which network parameters maximize the probability of having obtained the observations?

- Not only is there no unique solution (as in problem b)), but there is no analytical procedure to obtain one: only approximations are available.

- Approximation algorithms that obtain locally optimal models exist. The most popular is EM (Expectation-Maximization), which is called Baum-Welch when adapted to HMMs (see the sketch below):

- The sequence of observations is considered the measured data.

- The sequence of states that yields those observations is the missing or hidden data.

- The matrices A, B, and π are the parameters to approximate.

- The number of states is known a priori.
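The Baum-Welch sketch referenced above: EM re-estimation of (A, B, π) from a single observation sequence, with the number of states fixed a priori; the initial model and the observations are invented for the example:

import numpy as np

rng = np.random.default_rng(3)
N, M = 2, 2
O = rng.integers(0, M, size=50)              # "measured data": the observations

A = np.full((N, N), 1.0 / N)                 # initial guesses for the parameters
B = rng.dirichlet(np.ones(M), size=N)
pi = np.full(N, 1.0 / N)

for _ in range(100):
    T = len(O)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]               # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[-1] = 1.0                           # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()
    # E-step: expected state occupancies and transitions (the hidden data)
    gamma = alpha * beta / pO                # gamma[t, i] = P(q_t = s_i | O, model)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / pO
    # M-step: re-estimate the parameters from those expectations
    pi = gamma[0]
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B = np.array([[gamma[O == k, i].sum() for k in range(M)] for i in range(N)])
    B /= gamma.sum(axis=0)[:, None]

print(A, B, pi, sep="\n")                    # a locally optimal model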

Page 30: Probabilistic Robotics: A Tutorial


4. Forgetting the Past: Markovian Models

4.5 For People that Do Things: POMDPs

- "Partially Observable Markov Decision Processes": Markov Processes with both actions and partial observability.

- It is of interest:

- The three problems of HMMs: likelihood of the model, localisation in a given model, and calculation of the model itself (modeling and interpreting perceptions).

- The basic problem of MDPs: the best policy to follow (through actions) for obtaining the greatest benefit (acting optimally).

- Applications: any in which it is needed both to model some real process (or environment) and to act optimally with that model. Only recently applied to robotics (1996).

Page 31: Probabilistic Robotics: A Tutorial


References

- Thrun S. (2002), "Robotic Mapping: A Survey", Technical Report CMU-CS-02-111.
- Murphy K. (1998), "A Brief Introduction to Graphical Models and Bayesian Networks", http://www.ai.mit.edu/~murphyk/Bayes/bayes.html.
- Murphy K. (2000), "A Brief Introduction to Bayes' Rule", http://www.ai.mit.edu/~murphyk/Bayes/bayesrule.html.
- Contingency Analysis (2004), "Monte Carlo Method", http://www.riskglossary.com/articles/monte_carlo_method.htm.
- Bilmes J.A. (1998), "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", Technical Report, International Computer Science Institute, CA (USA).
- Arulampalam S., Maskell S., Gordon N., Clapp T. (2001), "A Tutorial on Particle Filters for On-Line Non-Linear/Non-Gaussian Bayesian Tracking", IEEE Transactions on Signal Processing, vol. 50, no. 2.
- West M. (2004), "Elements of Markov Chain Structure and Convergence", notes of a Fall 2004 course, http://www.stat.duke.edu/courses/Fall04/sta214/Notes/214.5.pdf.
- Moore A. (2002), "Markov Systems, Markov Decision Processes, and Dynamic Programming", teaching slides at CMU.
- Harmon M.E., Harmon S.S. (2000), "Reinforcement Learning: A Tutorial", reading at New Bulgarian University, http://www.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf.
- Cassandra T. (1999), "POMDPs for Dummies", http://www.cs.brown.edu/research/ai/pomdp/tutorial/index.html.
- Rabiner L. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2.
- Cassandra A.R., Kaelbling L.P., Kurien J.A. (1996), "Acting under Uncertainty: Discrete Bayesian Models for Mobile-Robot Navigation", Proceedings of IROS'96.