
Department of Science and Engineering

Kari Heine

A Survey of Sequential Monte Carlo Methods

Licentiate Thesis

Examiners: prof. Robert Piché (TUT), Dr. Dan Crisan (Imperial College London)


Preface

The writing of this thesis has been extremely educational and I hope other people will find this work helpful as well. In my opinion, it is best suited to readers who are interested in a moderately theoretical discussion of sequential Monte Carlo methods.

Many people have offered their help in the course of my research and I wish to thank them all. In particular, I am grateful to the following people: prof. Robert Piché, Dr. Dan Crisan, M.A. (Mathematics) Jari Niemi, Lic.Tech. Matti Vihola, Duane Petrovich, and M.Sc. Simo Ali-Löytty.

Tampere, 4th September 2005

Kari Heine
Lindforsinkatu 16 A 7
33720 Tampere
Tel: +358 50 4143565


Contents

Preface
Contents
Abstract
Tiivistelmä
List of Symbols
List of Abbreviations

1 Introduction
  1.1 Contribution
  1.2 Organisation

2 Probability Theory
  2.1 Probability Space and Random Variables
  2.2 Independence
  2.3 Probability Density Function
  2.4 Expectation, Variance and Covariance
  2.5 Conditional Probabilities and Expectations
  2.6 Markov Chains
  2.7 Projective Product

3 Bayesian Filtering
  3.1 Bayes' Rule
  3.2 Problem Statement
  3.3 Bayesian Filter
  3.4 Another Formulation of the Bayesian Filter
  3.5 Concluding Remarks

4 Monte Carlo Methods
  4.1 Convergence of Sequences of Random Variables
  4.2 Classical Monte Carlo Integration
  4.3 Importance Sampling
    4.3.1 Approximate Importance Sampling
    4.3.2 Minimum Variance Importance Distribution
    4.3.3 Optimal Importance Distribution for Arbitrary Integrand
    4.3.4 Effective Sample Size
  4.4 Stratified Sampling
  4.5 Rejection Method

5 Sequential Monte Carlo
  5.1 Bayesian Filter Approximation
  5.2 Bootstrap Filter
  5.3 Sampling/Importance Resampling
    5.3.1 Generalisation of the Importance Distribution
    5.3.2 Alternative Formulation of SIR
  5.4 Proposals for Importance Distributions
    5.4.1 Auxiliary Particle Filter
    5.4.2 Monte Carlo Weighting
    5.4.3 Kalman Filter Importance Distributions
  5.5 Resampling
    5.5.1 Multinomial Resampling
    5.5.2 Stratified Resampling
    5.5.3 Other Stratification Methods
    5.5.4 Sequential Importance Sampling
    5.5.5 Systematic Resampling
  5.6 Regularised Particle Filters
    5.6.1 Post-Regularised Particle Filter
    5.6.2 Pre-Regularised Particle Filter
    5.6.3 Kernel Filter
    5.6.4 Local Rejection Regularised Particle Filter
    5.6.5 Remarks on the LRRPF
    5.6.6 Conclusions on Regularised Particle Filters
  5.7 Other Sequential Monte Carlo Methods

6 Summary
  6.1 Conclusions
  6.2 Future Work

A Analysis
  A.1 Functions and Set Theory
    A.1.1 Functions
    A.1.2 Topology
    A.1.3 Metric Spaces
    A.1.4 Borel Sets
    A.1.5 Extended Real Numbers
  A.2 Measure Theory
    A.2.1 Lebesgue Measure
    A.2.2 Measurable Functions
    A.2.3 Simple Functions
  A.3 Integration

B Kernel Density Estimation
  B.1 Kernel Density Estimators
  B.2 Choice of Regularisation Kernel
  B.3 Choice of Bandwidth Matrix
  B.4 Estimation of General Densities


Abstract

TAMPERE UNIVERSITY OF TECHNOLOGY
Department of Science and Engineering
Institute of Mathematics

HEINE, KARI: A Survey of Sequential Monte Carlo Methods
Licentiate Thesis, 86 pages, 23 appendix pages
Examiners: prof. Robert Piché, Dr. Dan Crisan
Funding: Tampere Graduate School in Information Science and Engineering
Evaluated by the Department Council on 14.9.2005.
Keywords: sequential Monte Carlo, Bayesian filtering, particle filter, importance sampling, sampling/importance resampling, regularised particle filter

Sequential Monte Carlo (SMC) methods form a relatively novel class of algorithms for approximating the generally intractable Bayesian filter recursion. Recently, several different modifications have been proposed to improve the original SMC algorithm, the bootstrap filter. This thesis combines the works of various authors under the same framework and provides a survey of some of the best known SMC methods proposed in the literature. In order to describe the SMC methods in a unified manner and to give a rigorous description of the theoretical principles on which the SMC methods are based, the thesis also provides detailed descriptions of the basics of probability theory, Bayesian filtering, and the Monte Carlo method.

The survey consists of three major topics: the sampling/importance resampling (SIR) algorithm, resampling, and regularised particle filters (RPF). The SIR algorithm is formulated in a general manner and the discussion is accompanied by a few examples of choosing the importance distribution. The role of resampling as a separate step of the algorithm is suppressed, and it is instead considered an integral part of the simulation of a random sample from the importance distribution. Resampling is also considered in terms of stratified sampling, which enables the formulation of the sequential importance sampling algorithm as a special case of SIR. The description of the RPF algorithms includes discussion of the fundamental similarities and differences between the SIR and RPF methods. A few theoretical remarks regarding the essential differences between the different RPF algorithms are made as well. Although many questions remain unanswered, the thesis provides a detailed theoretical background for SMC methods, which enables further research on the challenging theory of SMC methods.


Tiivistelmä

TAMPERE UNIVERSITY OF TECHNOLOGY
Department of Science and Engineering
Institute of Mathematics

HEINE, KARI: A Survey of Sequential Monte Carlo Methods
Licentiate Thesis, 86 pages, 23 appendix pages
Examiners: prof. Robert Piché, Dr. Dan Crisan
Funding: Tampere Graduate School in Information Science and Engineering
To be considered by the Department Council on 14.9.2005.
Keywords: sequential Monte Carlo, Bayesian filtering, particle filter, importance sampling, SIR algorithm, regularised particle filters

Sequential Monte Carlo (SMC) methods form a collection of relatively new algorithms for approximating the generally intractable Bayesian recursion. Recently, several modifications to the original particle filter algorithm (the bootstrap filter) have been proposed in the literature. This licentiate thesis presents the results of several different researchers in a unified form and thereby provides a survey of the best known SMC methods. To allow a precise and unified presentation of the theoretical foundations of the methods described in the work, detailed descriptions of the basics of probability theory, Bayesian filtering, and the Monte Carlo method are also included.

The survey of SMC methods consists of three parts: the SIR algorithm, resampling, and regularised particle filters. The SIR algorithm is formulated in a general manner, and the description of the method includes some examples of choosing the importance distribution. Resampling is presented not so much as a separate part of the SIR algorithm but as an integral part of the generation of the random samples. Resampling is also considered from the viewpoint of stratified sampling, which makes it possible to present the so-called SIS algorithm as a special case of the SIR algorithm. The description of the regularised particle filters includes remarks on the fundamental differences and similarities between these methods and the SIR algorithm. Some theoretical remarks are also given on the essential similarities between the different regularisation methods.

Although the work leaves many essential questions unanswered, it nevertheless contains a detailed description of the theoretical background of SMC methods, which enables further research on the challenging theory of SMC methods.


List of Symbols

General

$a, b, c, \ldots$: Elements of sets.
$A, B, C, \ldots$: Sets.
$\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$: Systems of sets.
$\mathbf{A}, \mathbf{B}, \mathbf{C}, \ldots$: Matrices.
$a \in A$, $a \notin A$: $a$ is an element of $A$; $a$ is not an element of $A$.
$A \subset B$, $B \supset A$: $A$ is a subset of $B$, including the case $A = B$.
$A - B$: Set difference of the sets $A$ and $B$.
$\complement A$: The complement of $A$.
$\chi_A$: The characteristic function of the set $A$.
$\mathcal{P}(A)$: The system of all subsets of $A$.
$\emptyset$: Empty set.
$a \triangleq b$: $a$ is defined to be equal to $b$.
$\nu \ll \mu$: The measure $\nu$ is absolutely continuous w.r.t. $\mu$.
$d\nu/d\mu$: Radon-Nikodym derivative.
$\mathbb{R}$, $\overline{\mathbb{R}}$, $\mathbb{N}$, $\mathbb{Z}_+$, $\mathbb{Q}$: Real numbers, extended real numbers, natural numbers, nonnegative integers, and rational numbers.
$\mathbb{R}^{m \times n}$: The set of $m$ by $n$ real matrices.
$\mathbb{R}^k$: The $k$-dimensional Euclidean space.
$(a, b)$, $[a, b]$: Open and closed interval.
$(a, b]$, $[a, b)$: Left half-open and right half-open interval.
$\cdot^{\mathrm{T}}$, $\cdot^{-1}$, $\cdot^{1/2}$, $\cdot^{-1/2}$: Transpose, inverse, square root, and inverse square root of the matrix $\cdot$.
$\times$, $\times_{i=1}^{k}$: The Cartesian product of sets, or the product of measures.
$\mathcal{A} \otimes \mathcal{B}$: The smallest $\sigma$-algebra containing the sets $A \times B$, $A \in \mathcal{A}$, $B \in \mathcal{B}$, where $\mathcal{A}$ and $\mathcal{B}$ are $\sigma$-algebras.
$*$: The projective product.
$[\cdot]_{ij}$: The element on the $i$th row and the $j$th column of the matrix $\cdot$.
$|\cdot|$: Absolute value of $\cdot$.
$\lfloor\cdot\rfloor$: The integer part of a real number.
$\lceil\cdots\rfloor$: A diagonal matrix with diagonal elements $\cdots$.
$\|\cdot\|$: Euclidean norm.
$\det(\cdot)$: Determinant of a matrix.
$\nabla$, $\nabla^2$: The Jacobian matrix and the Hessian matrix.
$\mathrm{tr}(\cdot)$: The sum of the diagonal elements of the matrix $\cdot$.
$0_k$, $I_k$: A $k$-dimensional vector of zeros and the $k \times k$ identity matrix.
$\lambda_k$, $\lambda_k$: The $k$-dimensional Lebesgue premeasure and measure.
$\lambda_B$: The counting measure on the countable set $B$.
$\rho$, $\rho_E$: General metric and the Euclidean metric.
$\sigma(\cdot)$: The $\sigma$-algebra generated by a random variable(s) or a system of sets.
$B_\rho(x, r)$: An open ball of radius $r$, centered at $x$, w.r.t. the metric $\rho$.
$\mathcal{B}(\overline{\mathbb{R}})$: The Borel sets of $\overline{\mathbb{R}}$ w.r.t. the order topology.
$\mathcal{B}(A)$: The Borel sets of $A \subset \mathbb{R}^k$ w.r.t. the Euclidean topology.
$f : X \to Y$: A function from the domain $X$ to the image space $Y$.
$f(\cdot)$, $f^{-1}(\cdot)$: The image and preimage of the set or element $\cdot$.
$f_i \to f$: Pointwise convergence of functions.
$f_i \uparrow f$: Pointwise convergence from below.
$f^+$, $f^-$: Decomposition of $f$ such that $f = f^+ - f^-$.
$G^+_f$, $G^-_f$: The sets of measurable, c.s. functions $g$ such that $g \geq f$ and $g \leq f$, $\mu$-a.e., respectively.
$\int f\,d\mu$: The Lebesgue integral of $f$ with respect to the measure $\mu$.
$\int^* f\,d\mu$, $\int_* f\,d\mu$: Upper and lower integral of $f$ with respect to $\mu$.

Kernel Density Estimation

$a$, $a_n$: Bandwidth in a radially symmetric regularisation kernel.
$a^{\mathrm{E}}_n$, $a^{\mathrm{N}}_n$: Optimal bandwidths for $K_{\mathrm{E}}$ and $K_{\mathrm{N}}$.
$\varepsilon_{\mathrm{bias}}(\cdot)$, $\varepsilon^*_{\mathrm{bias}}(\cdot)$: The bias and an asymptotic bias of a KDE at $\cdot$.
$\varepsilon^n_{\mathrm{AMISE}}$: Asymptotic mean integrated squared error of a KDE using a sample of size $n$.
$D_{i \cdots j}$: Differentiation operator w.r.t. the arguments $i, \ldots, j$.
$f_n$: A KDE of the density $f$.
$f_n(\cdot \mid x_1, \ldots, x_n)$: A KDE of $f$ based on the sample $x_1, x_2, \ldots, x_n$.
$H_n$, $H_n$: Bandwidth matrix.
$K$, $K_{\mathrm{U}}$, $K_{\mathrm{N}}$, $K_{\mathrm{E}}$: Arbitrary, uniform, standard normal, and Epanechnikov regularisation kernels.
$K^i_s$: Scaled and shifted regularisation kernel centered at $x^i_t$.
$m_f$: The expectation of $f_n$.
$v_f$, $v^*_f$: The variance and the asymptotic variance of $f_n$.
$V^k$: The volume of a $k$-dimensional unit hypersphere.
$T_{\mathrm{M}}$: Mahalanobis transformation.

Probability Theory

$b_i$: The bridging function in the LRRPF.
$c_{i,j}$: The weight of the set $\{x^j_t\} \times \mathbb{R}^k$ in the $i$th stratum.
$\mathrm{cov}(x, y)$: Covariance of the random variables $x$ and $y$.
$\mathrm{cov}_f(h(x), g(x))$: Covariance of the random variables $h(x)$ and $g(x)$, when $x$ has the density $f$ with respect to the Lebesgue measure.
$E[x]$: Expectation of a random variable $x$.
$E_f[x]$: Expectation of a random variable $x$ having the density $f$.
$E[x \mid y]$: Conditional expectation of $x$ given the $\sigma$-algebra $\sigma(y)$.
$E[x \mid G]$: Conditional expectation of $x$ given the $\sigma$-algebra $G$.
$E[x \mid y = y']$: Conditional expectation of $x$ given the event $y = y'$.
$f_t$: Process model function $f_t : \mathbb{R}^k \times \mathbb{R}^n \to \mathbb{R}^k$.
$f_{\mathrm{N}}(\cdot\,; a, C)$: The density of $\mathrm{N}(a, C)$.
$F^i_t$: The Jacobian of $f_t$ evaluated at $x^i_t$.
$F$: The set of events.
$g^*$: The essential supremum of $g_t$.
$g^*_i$: The essential supremum of $g_t$ on the support of $K^i_s$.
$g_y(x)$: The likelihood of $x$ for a given observation $y$.
$g_t(x_t)$: The likelihood of $x_t$ for a given observation $y_t$.
$g^i_t(x_t)$: The likelihood resulting from the measurement model where $m_t$ is replaced by its first order Taylor series expansion at $f_{t-1}(x^i_{t-1})$.
$g_{y_t}(x_t)$: Equivalent to $g'_t(x_t, y_t)$.
$g(x, y)$: The conditional density of $y$ w.r.t. a $\sigma$-finite measure, given $x$.
$g'_t(x_t, y_t)$: The conditional density of $y_t \in \mathbb{R}^m$ w.r.t. a $\sigma$-finite measure, given $x_t$.
$k_t(x_t, \cdot)$, $k_t(x_t, \cdot)$: The densities of $K_t(x_t, \cdot)$ and $K_t(x_t, \cdot)$.
$K_t$, $K_t$: Transition kernels that define a Markov chain and importance distributions.
$m_t$: A measurement model function.
$M^i_t$: The Jacobian matrix of $m_t$ evaluated at $m_t(f_{t-1}(x^i_{t-1}))$.
$n_s$: The number of strata.
$\mathrm{N}(\mu, C)$: A normal distribution with mean $\mu$ and covariance matrix $C$.
$p^n_{t|t}$: The density of $\pi^n_{t|t}$.
$P_x$: The distribution of the random element $x$.
$P^n_x$: An unweighted discrete approximation of $P_x$, based on an IID sample of size $n$.
$P(A)$: The probability of the event $A \in F$.
$P(\cdot\,; \cdot)$: A regular conditional probability.
$P_t(\cdot\,; \cdot)$: A regular conditional distribution of the observation $y_t$ with respect to the $\sigma$-algebra generated by $x_t$.
$P(A \mid x)$: Conditional probability of $A$ given the $\sigma$-algebra $\sigma(x)$.
$P(A \mid G)$: Conditional probability of $A$ given the $\sigma$-algebra $G$.
$P(x \in A)$: The probability of the event $x^{-1}(A)$.
$P(x \in A \mid G)$: The conditional probability $P(\{\omega \in \Omega \mid x(\omega) \in A\} \mid G)$.
$P(A \mid x = x')$: Conditional probability of $A$ given the event $x = x'$.
$q_{t+1:t|t}$: The density of $\gamma_{t+1:t|t}$.
$\mathrm{U}(A)$: A uniform distribution on the set $A$.
$V[x]$: The variance or the covariance matrix of $x$.
$V_f[x]$: The variance or the covariance matrix of $x$ having the density $f$.
$w_t$: A measurable function such that $w_t(x^i_t) = w^i_t$.
$w_t$: A measurable function such that $w_t(x^i_t) = w^i_t$.
$w^i_t$: The weight of $x^i_t$.
$w^i_t$: An unnormalised weight of $x^i_t$.
$w^i_t$: The weight of $x^i_t$ in the importance distribution.
$x^i$: Indexing of IID random variables, i.e. particles.
$x^i_t$: The $i$th particle representing the Markov chain $\{x_t\}_{t=0}^{\infty}$ at time instant $t$.
$x_{0:t}$: Ordered set $x_0, x_1, \ldots, x_t$.
$x \sim \cdot$: The random variable is distributed according to $\cdot$.
$x_i \xrightarrow{\text{a.e.}} x$: Almost sure convergence.
$x_i \xrightarrow{d} x$: Convergence in distribution.
$y_{1:t}$: Ordered set $y_1, y_2, \ldots, y_t$.
$\Omega$, $\omega$: The set of elementary events and an elementary event.
$\gamma_i$: Sampling distribution on the $i$th stratum.
$\gamma$, $\gamma_{t+1:t|t}$: Importance or instrumental distribution.
$\gamma^*_{t+1:t|t}$: The optimal importance distribution.
$\pi$: A $\sigma$-finite measure on $\mathcal{B}(\mathbb{R}^k)$.
$\pi_i$: A finite measure on the $i$th stratum.
$\pi^n$: A discrete measure on $\mathcal{B}(\mathbb{R}^k)$ based on $n$ elements of $\mathbb{R}^k$.
$\pi^n_\gamma$: A weighted discrete approximation of $\pi$, using the importance distribution $\gamma$.
$\pi_{i|j}$: The conditional distribution of $x_i$ given the observations $y_{1:j}$.
$\pi'_{i|j}$: A nondiscrete approximation of $\pi_{i|j}$.
$\pi_{i|j}$: An unnormalised measure proportional to $\pi'_{i|j}$.
$\pi^n_{i|j}$: A discrete approximation of $\pi_{i|j}$ based on $n$ samples.
$\pi^n_{t|t}$: A regularised approximation of $\pi_{i|j}$ based on $n$ samples.
$\pi_{t+1:t|t}$: A measure satisfying $\pi_{t+1:t|t}(A \times \mathbb{R}^k) = \pi_{t+1|t}(A)$, $A \in \mathcal{B}(\mathbb{R}^k)$.
$\pi'_{t+1:t|t}$: A nondiscrete approximation of $\pi_{t+1:t|t}$.
$\pi_{t+1:t|t+1}$: A measure of the form $\int_C g_{t+1}\,d\pi'_{t+1:t|t}$, $C \in \mathcal{B}(\mathbb{R}^{2k})$.
$\Upsilon$: Integral of the form $\int h\,d\mu$, where $\mu$ is $\sigma$-finite.
$\Upsilon^n_{\mathrm{MC}}$: Classical Monte Carlo approximation of the integral $\Upsilon$ using a sample of size $n$.
$\Upsilon(A)$: Integral of the form $\int_{A \times \mathbb{R}^k} g_t\,d\pi_{t+1:t|t}$.
$\Upsilon_{\mathrm{BS}}(A)$: The approximation of $\Upsilon(A)$ used in the bootstrap filter.
$\Upsilon_{\mathrm{SIR}}(A)$: The SIR approximation of $\Upsilon(A)$.
$\Upsilon^*_{\mathrm{SIR}}(A)$: The alternative SIR approximation of $\Upsilon(A)$.
$\Upsilon_{\mathrm{ST}}(A)$: A stratified sampling approximation of $\Upsilon(A)$.
$\Upsilon_{\mathrm{SIS}}(A)$: The approximation of $\Upsilon(A)$ used in the SIS algorithm.
$\Upsilon_{\mathrm{RPF1}}(A)$: The approximation of $\Upsilon(A)$ used in the post-RPF.
$\Upsilon^n_\gamma$, $\Upsilon^n_\gamma$: An importance sampling and an approximate importance sampling approximation of $\Upsilon$ using a sample of size $n$ from the importance distribution $\gamma$.
$\Upsilon_{\mathrm{ST}}$, $\Upsilon_{\mathrm{PST}}$, $\Upsilon^{\mathrm{N}}_{\mathrm{ST}}$: Stratified sampling approximations of the integral $\Upsilon$ using arbitrary, proportional, and Neyman allocation.


List of Abbreviations

a.e.: Almost everywhere.
AMISE: Asymptotic mean integrated square error.
c.s.: Countably simple.
CDF: Cumulative distribution function.
CLT: The central limit theorem.
EKF: Extended Kalman filter.
IID: Independently and identically distributed.
KDE: Kernel density estimator.
LRRPF: Local rejection regularised particle filter.
MCT: Monotone convergence theorem.
MISE: Mean integrated square error.
pre-RPF: Pre-regularised particle filter.
post-RPF: Post-regularised particle filter.
RPF: Regularised particle filter.
RND: Radon-Nikodym derivative.
RNE: Relative numerical efficiency.
SLLN: The strong law of large numbers.
SMC: Sequential Monte Carlo.
SIR: Sampling/importance resampling.
SIS: Sequential importance sampling.
UKF: Unscented Kalman filter.
w.r.t.: With respect to.


Chapter 1

Introduction

There are numerous engineering applications where an estimation problem of some kind is encountered. That is to say, the unknown value of a quantity is to be recovered using the observed values of some other quantities. The observed quantities are assumed to be dependent on the unknown value. A variety of different estimation methods exists [see, e.g. Kay, 1993]. If something is known about the unknown value prior to making any observations, the Bayesian approach to estimation can be adopted. This means that the prior knowledge is used for constructing a probabilistic model of the application of interest. The unknown value, as well as the observations, are considered to be realisations of random variables defined by this model. Because of the subjective nature of the prior knowledge, the merits of the Bayesian approach have traditionally been the topic of a heated debate. Instead of participating in this debate, we content ourselves with the fact that, in practice, Bayesian methods have established their place among modern estimation methods and therefore deserve to be studied.

At this point, the Bayesian approach is divided into two distinct problems:

i) the construction of the model

ii) the estimation

This thesis will focus solely on the latter problem. Traditionally, especially in the case of recursive time series estimation, these two problems could not be considered separately, because the tractability of the estimation problem imposed severe restrictions on the construction of the model. It was the pioneering work by Gordon et al. [1993] that showed how the Monte Carlo method could be applied to recursive Bayesian time series estimation in the case of a general model. Ever since, several different variants of this approach have been proposed [see, e.g. Doucet et al., 2001b]. These so-called sequential Monte Carlo methods constitute the essence of this thesis.


1.1 Contribution

Perhaps it is because of the quite involved theory that the Monte Carlo methods are considered to form a branch of experimental mathematics [Hammersley and Handscomb, 1964, page 2]. The same seems to apply to sequential Monte Carlo as well: numerous methods are described in a practical, algorithmic manner, and they are interpreted as methods for propagating particle systems in which individual particles are either removed or assigned to produce offspring. Indeed, from the practical point of view, this representation is unparalleled and should not be undervalued. However, there is a risk in such a representation that the theoretical foundations of the methods may remain unclear to the reader and that the fundamental similarities of different methods remain unacknowledged. Therefore, the aim of this thesis is twofold. One goal is to provide a survey of sequential Monte Carlo methods and the other is to serve as an introduction to the theory of sequential Monte Carlo.

Regarding the first goal, it should be pointed out that the survey is by no means exhaustive and several methods have unfortunately been excluded. Perhaps the most significant class of excluded methods is that of Markov chain Monte Carlo based methods. However, because the intention has been to formulate the included methods in a general manner, the majority of the existing methods is considered to be covered, although some specific details may have been excluded. Moreover, some remarks are made on modifications to the included methods that would yield algorithms that, to the author's knowledge, have not previously been proposed in the literature. In these cases it is, however, considered more important to note that the modification is possible than to argue in favor of the proposed modification.

The consequence of the second goal is that algorithmic or pseudo-code descriptions of the methods are omitted and left to the given references, although some readers may find them useful. The intention has been to describe the chosen methods in a unified manner that exposes the underlying use of Monte Carlo integration theory as explicitly as possible. On the other hand, the detailed theoretical analysis of the methods is known to be quite involved and is therefore excluded. For this reason, the thesis can only be considered an introduction to the theory of sequential Monte Carlo.

The methods described in this thesis are based on principles of Monte Carlo integration that date back to the 1940s. Although several monographs on the topic are available [see, e.g. Hammersley and Handscomb, 1964, Robert and Casella, 1999, Rubinstein, 1981], the descriptions of the methods often assume continuous probability distributions that have densities with respect to the Lebesgue measure. This is not sufficient in the context of sequential Monte Carlo, where we frequently encounter distributions that can appropriately be called neither continuous nor discrete. For this reason, a significant part of this thesis is also devoted to a general measure theoretic discussion of the principles of Monte Carlo.


1.2 Organisation

This thesis is organised in the following manner. Chapter 2 describes the general probability theoretic foundations on which the remainder of the thesis is based. Chapter 3 describes the Bayesian approach to recursive time series estimation, i.e., Bayesian filtering. The chapter includes a detailed problem statement and the proof of the well known Bayes' recursion formulas. Although the specific construction of the model is excluded, an example of a general class of possible models is given. The principles of the Monte Carlo method are described in Chapter 4. This chapter covers the concepts of classical Monte Carlo integration, importance sampling, stratified sampling, and the rejection method. All of these methods are employed by the sequential Monte Carlo methods described in Chapter 5, which consists of three main topics: sampling/importance resampling, resampling, and regularised particle filters. Some concluding remarks are given in Chapter 6.

In order to make the thesis self-contained, some relevant theoretical background is given in the two included appendices. Appendix A describes the basics of measure and integration theory, and Appendix B describes some principles of kernel density estimation. These principles are considered to be essential for the comprehension of the regularised particle filters described in Chapter 5.


Chapter 2

Probability Theory

This chapter gives some elementary probability theoretic definitions and results that are considered to be important for the comprehension of the methods described in Chapter 5. Because probability theory is closely related to real analysis and measure theory, some elementary results on these topics have been collected in Appendix A. For further details on probability theory, or as an introduction to probability theory, the reader is encouraged to consult, e.g. [Shiryayev, 1984] or [Williams, 1991].

Section 2.1 defines the fundamental concepts of probability space and random variables. The statistical independence of random variables is defined in Section 2.2, and in Section 2.3 we define the concept of probability density in a general manner. Section 2.4 gives the definition of the important statistical quantity known as the expectation. Also, other related quantities such as variance and covariance will be defined. Conditional expectations and conditional probabilities, which are important especially from the Bayesian point of view, are defined in Section 2.5. In Section 2.6, we define a class of stochastic processes known as Markov chains. The chapter is concluded in Section 2.7, where we describe the projective product, which offers a convenient shorthand notation for Bayes' rule.

2.1 Probability Space and Random Variables

A probability space defines all the possible events and their probabilities of occurrence. The exact definition of the probability space is the following [Shiryayev, 1984, page 136].

Definition 1 A probability space is an ordered triple (Ω, F, P), where
i) Ω is a set;
ii) F is a σ-algebra of subsets of Ω;
iii) P is a measure on F such that P(Ω) = 1.

The sets A ∈ F are called events and P(A) is the probability of the event A ∈ F. The elements ω ∈ Ω are called elementary events.


For an arbitrary measurable space (Ω, F), the measure µ defined on the σ-algebra F is called a probability measure if µ(Ω) = 1. Obviously, P is a probability measure, and a probability space (Ω, F, P) is a measure space (see Definition 39 in Section A.2). The elements of F, i.e. the events, form the system of P-measurable subsets of Ω.

Throughout this work, B(Rk) denotes the system of Borel sets of Rk with respect to the Euclidean topology. Moreover, R̄ is used for denoting the set of extended real numbers and B(R̄) denotes the Borel sets of the extended real line with respect to the order topology.

Definition 2 Suppose that (Ω, F, P) is a probability space and (E, E) is a measurable space. Then an F/E-measurable mapping x : Ω → E is called a random element. In particular, if (E, E) = (R, B(R)), then x is a random variable, and if (E, E) = (R̄, B(R̄)), then x is an extended random variable.

Naturally, we would like to extend the definition of one dimensional random variables to the multidimensional case. This is done by defining a k-dimensional random vector x : Ω → Rk as

$x(\omega) = [x_1(\omega), x_2(\omega), \ldots, x_k(\omega)]^{\mathrm{T}},$

where $x_i$, i = 1, . . . , k, are random variables. It can be shown that with this definition, x is an F/B(Rk)-measurable mapping, i.e. a random element. The converse is also true, meaning that any F/B(Rk)-measurable random element x : Ω → Rk is a k-dimensional random vector [Shiryayev, 1984, pages 174-175]. In the remainder of this thesis, the terms random variable and random vector will be used interchangeably. In situations where the dimension of the random variable is of importance, the dimension will be explicitly stated. Moreover, no notational distinction whatsoever is made between random variables and other functions.

If there is a random element x taking values in a measurable space (E, E), then (E, E) can be completed into a probability space by defining a probability measure Px on E as

$P_x(A) \triangleq P(x^{-1}(A)) = P(\{\omega \in \Omega \mid x(\omega) \in A\}), \quad A \in E, \qquad (2.1)$

where x⁻¹(A) is the preimage of A ∈ E. The F/E-measurability of x guarantees that x⁻¹(A) ∈ F and therefore Px(A) exists for all A ∈ E. The resulting probability measure Px is referred to as the probability distribution of x [Shiryayev, 1984, page 168] or the law of x [Williams, 1991, page 33]. The intuitive notation P(x ∈ A) ≜ Px(A) will be used for denoting the probability Px(A) for all A ∈ E.
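As a simple illustration of Equation (2.1), let Ω = (0, 1), F = B((0, 1)), let P be the restriction of the Lebesgue measure to F, and define x(ω) = −ln ω. Then, for A = [0, a] with a > 0,

$P_x([0, a]) = P(\{\omega \in (0,1) \mid -\ln \omega \leq a\}) = P([e^{-a}, 1)) = 1 - e^{-a},$

so that Px is the exponential distribution with unit rate parameter.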

The function x can be used for constructing a sub-σ-algebra σ(x) ⊂ F according to the following theorem [Shiryayev, 1984, page 172]. The proof of the theorem is elementary but is included here for completeness.

Theorem 1 Let x be a random element taking values in the measurable space (E, E). Define the system of sets

$\sigma(x) \triangleq \{A \in F \mid A = x^{-1}(B),\ B \in E\}.$


Then σ(x) ⊂ F is a σ-algebra, and it is called the σ-algebra generated by the random element x.

Proof: To show that σ(x) is a σ-algebra, one must show that it is closed under complement and countable unions. Let B ∈ E. Then

$\Omega = x^{-1}(E) = x^{-1}(B \cup \complement B) = x^{-1}(B) \cup x^{-1}(\complement B),$
$\emptyset = x^{-1}(B) \cap x^{-1}(\complement B),$

which implies x⁻¹(∁B) = ∁x⁻¹(B); thus σ(x) is closed under complement. Let $\mathcal{B}$ be a countable collection of elements of E. Closedness under countable unions then follows from the elementary result

$\bigcup_{B \in \mathcal{B}} x^{-1}(B) = x^{-1}\Big( \bigcup_{B \in \mathcal{B}} B \Big).$

The σ-algebra σ(x1, x2, . . . , xm) generated by m random variables taking values in Rk is defined as σ(x), where x is an Rmk-valued random variable such that

$x = [x_1^{\mathrm{T}}, x_2^{\mathrm{T}}, \ldots, x_m^{\mathrm{T}}]^{\mathrm{T}}.$

2.2 Independence

For a set of random variables, the observed realisation of a single random variable might give us a hint of the realised values of the remaining, unobserved, random variables. This will not, however, be the case if the random variables are independent. Independence of random variables is of great importance, especially in the construction of probabilistic models, and we take the following definition of independence from [Williams, 1991, page 38].

Definition 3 Sub-σ-algebras F1, F2, . . . ⊂ F are called independent if for any collection {F_i ∈ F | F_i ∈ F_i, i ∈ I}, where I ⊂ N,

$P\Big( \bigcap_{i \in I} F_i \Big) = \prod_{i \in I} P(F_i).$

The random variables $x_i$, i = 1, 2, . . . are independent if the σ-algebras σ(x_i), i = 1, 2, . . . are independent. The events $E_i$ ∈ F, i = 1, 2, . . . are independent if their characteristic functions $\chi_{E_i}$ are independent random variables.


2.3 Probability Density Function

Especially with real valued random elements, it is convenient to work with probability density functions instead of probability distributions. In general, however, random elements are not required to be real valued in order to admit density functions [Shiryayev, 1984, page 194].

Definition 4 Suppose that (Ω, F, P) is a probability space, (E, E) is a measurable space, and x : Ω → E is a random element with probability distribution Px. Moreover, let µ be a σ-finite measure on E such that Px is absolutely continuous with respect to µ. Then any E/B([0,∞))-measurable function f : E → [0,∞) satisfying

$P_x(A) = \int_A f \, d\mu, \quad A \in E, \qquad (2.2)$

is called a probability density function of Px with respect to µ, and it is denoted by

$f = \frac{dP_x}{d\mu}.$

A detailed definition of the integral is given in Section A.3. The integral in Equation (2.2) could be written more explicitly as $P_x(A) = \int_A f(x)\,\mu(dx)$, where the "integration variable" x is used. The shorthand notation Px ≪ µ for absolute continuity will be used throughout the remainder of this work. The definition of absolute continuity is given in Section A.2. The probability density function f is a Radon-Nikodym derivative (RND) of Px with respect to µ; by the Radon-Nikodym Theorem (see Theorem 21 in Section A.3) it exists and is µ-a.e. unique. It should be pointed out that probability densities are RNDs, but not all RNDs are probability densities. Note that the measure µ is defined on E instead of the σ-algebra F. Moreover, µ does not have to be a probability measure. This means that the µ-a.e. above cannot necessarily be interpreted to mean "with probability one" or "almost surely".

Next we give two important results that will be used frequently. Note that the results are general and apply to any Radon-Nikodym derivatives, not only to probability density functions. The proofs of the following lemmas can be found in [Shiryayev, 1984, page 229].

Lemma 1 Let (E, E) be a measurable space. Suppose that ν and µ are σ-finite measures defined on E such that µ ≪ ν, and that h is an E/B(R)-measurable function. Then

$\int h \, d\mu = \int h \, \frac{d\mu}{d\nu} \, d\nu$

in the sense that if one of the integrals exists, then the other one exists and they are equal.

Lemma 2 Let ν be a signed measure and let µ and λ be σ-finite measures such that ν ≪ µ and µ ≪ λ. Then

$\frac{d\nu}{d\lambda} \overset{\lambda\text{-a.e.}}{=} \frac{d\nu}{d\mu}\,\frac{d\mu}{d\lambda} \qquad \text{and} \qquad \frac{d\nu}{d\mu} \overset{\mu\text{-a.e.}}{=} \frac{d\nu/d\lambda}{d\mu/d\lambda}.$


2.4 Expectation, Variance and Covariance

The expected value, or the expectation, of a random variable x is not only an important quantity in statistical analysis, but it is also an important tool for defining various theoretical structures in probability theory.

Definition 5 Let x be an extended random variable defined on the probability space (Ω, F, P). The expectation of x is denoted by E[x] and defined as

$E[x] \triangleq \int x \, dP.$

The expectation of a random vector x = [x1, x2, . . . , xk]ᵀ, where $x_i$, i = 1, 2, . . . , k, are extended random variables, is defined elementwise as

$E[x] = [E[x_1], E[x_2], \ldots, E[x_k]]^{\mathrm{T}}.$

It is important to note that, by the definition of the integral, the expectation may be finite, −∞, ∞, or undefined. The conditions under which these values are obtained are given in the definition of the integral in Section A.3. Occasionally, the integration variable may be included in the notation for clarity, that is, the notation $E[x] = \int x(\omega) \, P(d\omega)$ is used.

Expectation is also used for defining other quantities of interest that appear often in statistical analysis. Examples of such quantities are the covariance, the variance, and the covariance matrix, for which we have the following definitions [Shiryayev, 1984, pages 232-233].

Definition 6 Let x and y be random variables taking values in R such that their expectations are defined. Then the covariance of x and y is

$\mathrm{cov}(x, y) \triangleq E[(x - E[x])(y - E[y])]. \qquad (2.3)$

For the random vector x = [x1, x2, . . . , xk]ᵀ, where $x_i$, i = 1, 2, . . . , k, are random variables taking values in R such that E[x_i] is defined, the covariance matrix is denoted by V[x] and defined elementwise as

$[V[x]]_{ij} \triangleq \mathrm{cov}(x_i, x_j). \qquad (2.4)$

The diagonal element $[V[x]]_{ii}$, i = 1, 2, . . . , k, is the variance of $x_i$.

Also in these definitions, the expectations may be undefined or infinite. Note that the covariance matrix of a k-dimensional random vector can also be defined as

$V[x] \triangleq E\big[(x - E[x])(x - E[x])^{\mathrm{T}}\big],$

where the expectation of the k × k matrix is taken elementwise.

Above, the covariance was defined only for random variables taking values in R. For vector valued random variables x and y taking values in Rk and Rp, respectively, the notation cov(x, y) will be used for denoting the upper right k × p block of the covariance matrix $V\big[[x^{\mathrm{T}}, y^{\mathrm{T}}]^{\mathrm{T}}\big]$. Note that, consequently,

$\mathrm{cov}(x, y) = E\big[(x - E[x])(y - E[y])^{\mathrm{T}}\big].$

Let x be a random variable taking values in Rk with the probability distribution Px such that Px ≪ λk, where λk is the Lebesgue measure in Rk (see Section A.2.1). Suppose that h : Rk → R is a B(Rk)/B(R)-measurable function. Note that, in this case, h is a random variable defined on the probability space (Rk, B(Rk), Px). Because λk is σ-finite, the Radon-Nikodym theorem ensures that the probability density f = dPx/dλk exists. Also, because Px is a probability measure, and hence σ-finite, Lemma 1 can be applied to h, Px, and λk, yielding the following, perhaps more familiar, formulation of the expectation with respect to a probability density:

$E[h] = \int h \, dP_x = \int h f \, d\lambda_k. \qquad (2.5)$

Expectations of this kind will be frequently used in this work. Therefore, the following convenient definition is given.

Definition 7 Let x be a random variable taking values in Rk with the probability distribution Px and a probability density function f = dPx/dλk. Moreover, let h : Rk → R be B(Rk)/B(R)-measurable. Then

$E_f[h(x)] \triangleq E[h].$

Similar definitions for covariances and variances can also be given as follows.

Definition 8 Let x be a random variable taking values in Rk with the probability distribution Px and a probability density function f = dPx/dλk. Moreover, suppose that $h_i : \mathbb{R}^k \to \mathbb{R}$ is B(Rk)/B(R)-measurable for all i = 1, 2, . . . , m. Then, assuming that the expectations are defined,

$\mathrm{cov}_f(h_i(x), h_j(x)) \triangleq E_f[(h_i(x) - E_f[h_i(x)])(h_j(x) - E_f[h_j(x)])],$
$V_f[h_i(x)] \triangleq E_f[(h_i(x) - E_f[h_i(x)])(h_i(x) - E_f[h_i(x)])].$

The covariance matrix $V_f[h(x)]$ is defined elementwise as

$[V_f[h(x)]]_{ij} \triangleq \mathrm{cov}_f(h_i(x), h_j(x)),$

where h : Rk → Rm is defined as $h \triangleq [h_1(x), h_2(x), \ldots, h_m(x)]^{\mathrm{T}}$.
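For computations it is often convenient to expand these definitions; for instance, when the second moments $E_f[h_i(x)^2]$ and $E_f[h_j(x)^2]$ are finite,

$V_f[h_i(x)] = E_f[h_i(x)^2] - (E_f[h_i(x)])^2, \qquad \mathrm{cov}_f(h_i(x), h_j(x)) = E_f[h_i(x) h_j(x)] - E_f[h_i(x)]\, E_f[h_j(x)],$

which follow directly from the linearity of the expectation.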

2.5 Conditional Probabilities and Expectations

Especially in Bayesian inference, conditional probabilities play an important role. As the expectation was defined in terms of probability, it may appear somewhat counterintuitive that the conditional probability is defined in terms of the conditional expectation, which, in turn, is defined as follows [Shiryayev, 1984, page 211].

Definition 9 Let x : Ω → [0,∞) be a random variable defined on the probability space (Ω, F, P). Then the conditional expectation of x with respect to the σ-algebra G ⊂ F is any G/B(R̄)-measurable extended random variable, denoted by E[x | G], such that

$\int_A x \, dP = \int_A E[x \mid G] \, dP, \quad A \in G. \qquad (2.6)$

Let x⁺ ≜ max(0, x) and x⁻ ≜ −min(0, x). Then the conditional expectation of the random variable x : Ω → R is

$E[x \mid G] \triangleq E[x^+ \mid G] - E[x^- \mid G],$

if at least one of the conditional expectations is finite P-a.e. In the case that E[x⁺ | G] = E[x⁻ | G] = ∞, an arbitrary value is assigned to E[x | G].

To ensure the existence of the conditional expectation for a nonnegative random variable x, it is observed that, if E[x] is defined, then $Q(A) = \int_A x \, dP$ is a measure defined for all A ∈ G [Shiryayev, 1984, page 193]. Because clearly Q ≪ P and P is σ-finite, a RND dQ/dP = E[x | G] exists by the Radon-Nikodym theorem and is P-a.e. unique.

Because E[x | G] is unique except for sets of P-measure equal to zero, the conditional expectation E[x | G], in fact, defines a set of G/B(R̄)-measurable functions that differ from each other only on a set of P-measure zero. These functions are called the variants of E[x | G]. The conditional probability is defined in terms of the conditional expectation as follows [Shiryayev, 1984].

Definition 10 The conditional probability of an event A ∈ F with respect to a σ-algebra G ⊂ F is denoted by P(A | G) and it is defined as

$P(A \mid G) \triangleq E[\chi_A \mid G].$
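As a concrete special case, suppose that G = σ({B₁, B₂, . . . , Bₙ}), where the sets $B_i$ ∈ F form a partition of Ω with $P(B_i) > 0$ for all i, and that E[x] is defined. Then one variant of the conditional expectation is

$E[x \mid G](\omega) = \sum_{i=1}^{n} \frac{1}{P(B_i)} \int_{B_i} x \, dP \; \chi_{B_i}(\omega),$

and, correspondingly, $P(A \mid G)(\omega) = \sum_{i=1}^{n} \frac{P(A \cap B_i)}{P(B_i)} \chi_{B_i}(\omega)$. The function on the right hand side is G/B(R̄)-measurable because it is constant on each $B_i$, and it is straightforward to verify that it satisfies Equation (2.6).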

Some notational remarks related to conditional expectations and probabilities are in order. By definition, E[x | G] and P(A | G) are random variables and, hence, they are functions of ω. Therefore, it would be more appropriate to use the notations E[x | G](ω) and P(A | G)(ω). In order to shorten the notation, the dependency on ω is often omitted. Also, the following shorthand notations will be used for conditioning with respect to a σ-algebra generated by a random variable x:

$E[\,\cdot \mid x] \triangleq E[\,\cdot \mid \sigma(x)], \qquad P(\,\cdot \mid x) \triangleq P(\,\cdot \mid \sigma(x)).$

Moreover, similarly as for the probability, the notation P(x ∈ A | G) is used as a shorthand for $P(\{\omega \in \Omega \mid x(\omega) \in A\} \mid G)$.

Although the conditional expectation with respect to a σ-algebra is theoretically practical, the intuition of conditioning on a σ-algebra may appear somewhat unclear. Therefore, we introduce the conditional expectation with respect to an event as follows [Shiryayev, 1984, page 218].


Definition 11 Let x and y be extended random variables and assume that E[x] is defined. The conditional expectation of x given an event {ω ∈ Ω | y(ω) = y′}, y′ ∈ R, is any B(R)/B(R)-measurable function m : R → R such that

$\int_{y^{-1}(A)} x \, dP = \int_A m(y) \, P_y(dy), \quad A \in B(\mathbb{R}).$

Similarly as for Definition 9, the existence and the Py-a.e. uniqueness are ensured by the Radon-Nikodym theorem. Note that, for A ∈ B(R),

$\int_{y^{-1}(A)} x(\omega) \, P(d\omega) = \int_A m(y) \, P_y(dy) = \int_{y^{-1}(A)} m(y(\omega)) \, P(d\omega),$

where the first equality is due to the definition of m, and the second equality follows from the change of variables in the Lebesgue integral [see, e.g. Shiryayev, 1984, page 194]. Because m ∘ y is a σ(y)/B(R)-measurable function, and by Theorem 1 the sets y⁻¹(A), A ∈ B(R), form the σ-algebra σ(y), it follows by Definition 9 that m(y(ω)) is a variant of the conditional expectation E[x | y], i.e. E[x | y](ω) = m(y(ω)), P-a.e. The notation E[x | y = y′] ≜ m(y′) will be used for the conditional expectation given the event {ω ∈ Ω | y(ω) = y′}. The conditional probability with respect to an event is defined naturally as follows [Shiryayev, 1984, page 216].

Definition 12 The conditional probability of an event A ∈ F given an event {ω ∈ Ω | y(ω) = y′}, y′ ∈ R, is denoted by P(A | y = y′), and it is defined as

$P(A \mid y = y') \triangleq E[\chi_A \mid y = y'].$

It is important to note that P(A | G)(ω) is not necessarily a probability measure on F for given ω. However, often it would be practical to treat conditional probabilities as probability measures. To this end, more stringent conditions need to be imposed, resulting in the following definition [Shiryayev, 1984, page 224].

Definition 13 A function P(·; ·) : Ω × F → [0, 1] is a regular conditional probability with respect to the σ-algebra G ⊂ F if
i) P(ω; ·) is a probability measure on F for every ω ∈ Ω;
ii) for all A ∈ F, P(·; A) is a variant of P(A | G).

The significance of regular conditional probabilities is illustrated by the following theorem [Shiryayev, 1984].

Theorem 2 Suppose that P(·; ·) is a regular conditional probability with respect to the σ-algebra G ⊂ F. Then

$E[x \mid G](\omega) \overset{P\text{-a.e.}}{=} \int x(\tilde{\omega}) \, P(\omega; d\tilde{\omega}),$

if E[x] is defined.


Proof: The proof is given in [Shiryayev, 1984, pages 224-225].

Similar to the way that the probability distribution Px was defined as a probability measure on the image space of the random element x, we define the regular conditional distribution as follows [Shiryayev, 1984, page 225].

Definition 14 Suppose that (Ω, F, P) is a probability space, (E, E) is a measurable space, x : Ω → E is a random element, and G ⊂ F. The function Q : Ω × E → [0, 1] is a regular conditional distribution of x with respect to G if
i) for all ω ∈ Ω, Q(ω, ·) is a probability measure on E;
ii) for all A ∈ E, Q(·, A) is a variant of P(x ∈ A | G).

A natural question arising from the definition of regular conditional distributions is whether they exist or not. This section is concluded by the following reassuring theorem [Shiryayev, 1984, page 227].

Theorem 3 Suppose that (E, E) is a Borel space and let x : Ω → E be a random element. Then a regular conditional distribution with respect to G ⊂ F exists.

Proof: The proof is given in [Shiryayev, 1984, page 227].

In particular, this theorem states that regular conditional distributions exist for random variables [Shiryayev, 1984, page 228].

2.6 Markov Chains

Markov chains form an important class of probabilistic models that play a significant role, e.g. in Bayesian filtering. The following definition of Markov chains can be found in [Shiryayev, 1984, page 523].

Definition 15 Let F0 ⊂ F1 ⊂ · · · ⊂ F be a nondecreasing sequence of σ-algebras in a probability space (Ω, F, P), and let $x_t$, t ∈ Z₊, be $F_t$/B(Rk)-measurable. Then the sequence $\{x_t, F_t\}_{t=0}^{\infty}$ is a Markov chain if

$P(x_n \in A \mid F_m) \overset{P\text{-a.e.}}{=} P(x_n \in A \mid x_m), \qquad (2.7)$

for all 0 ≤ m ≤ n and A ∈ B(Rk). If, in particular, $F_m = \sigma(x_0, x_1, \ldots, x_m)$ and $\{x_t, F_t\}_{t=0}^{\infty}$ is a Markov chain, then the sequence $\{x_t\}_{t=0}^{\infty}$ is called a Markov chain.

Although Markov chains do not necessarily have any relation to the physical concept of time, it is intuitively convenient to regard the indices t as time instants. With this terminology, Markov chains are often informally described as random sequences where "the future" is independent of "the past" given "the present" [Shiryayev, 1984, page 524].

The probabilistic properties of a Markov chain are completely defined by the one step transition probabilities $P(x_{t+1} \in A \mid x_t)$, A ∈ B(Rk), and the initial distribution $P_{x_0}$. Therefore, a Markov chain can be conveniently constructed by defining an initial distribution π0 and transition kernels $K_t$. Transition kernels are defined as follows [see, e.g. Robert and Casella, 1999, Doucet et al., 2001b].

Definition 16 If a function K : Rk × B(Rk) → [0, 1] has the properties:
i) K(x, ·) is a probability measure on B(Rk) for all x ∈ Rk;
ii) K(·, A) is B(Rk)/B([0, 1])-measurable for all A ∈ B(Rk),
then K is said to be a transition kernel.

Note that any initial distribution π0 together with any sequence $\{K_t\}_{t=0}^{\infty}$ of transition kernels defines a Markov chain. In particular, if $K_t = K$ for all t = 0, 1, . . ., the chain is said to be time homogeneous. Otherwise, the chain is time inhomogeneous. The existence of Markov chains is ensured by the following theorem.

Theorem 4 Suppose that there is an initial distribution π0 on B(Rk) and a sequence $\{K_i\}_{i=0}^{\infty}$ of transition kernels. Then there is a probability space (Ω, F, P) and a Markov chain $\{x_i, F_i\}_{i=0}^{\infty}$ such that

$K_i(x_i(\omega), A) \overset{P\text{-a.e.}}{=} P(x_{i+1} \in A \mid x_i)(\omega), \quad A \in B(\mathbb{R}^k),\ i \in \mathbb{Z}_+.$

Proof: In this proof, it will be shown how a Markov chain and an appropriate probability space can be constructed for the given initial distribution and transition kernels. The detailed proof is lengthy and can be found for homogeneous Markov chains in [Shiryayev, 1984, page 525]. Here, only the outline of the proof is given.

Let us define a probability measure $P_{(m+1)k}$ on $B(\mathbb{R}^{(m+1)k})$ as

$P_{(m+1)k}(A) \triangleq \int \cdots \int\!\!\int \chi_A(y) \, K_{m-1}(y_{m-1}, dy_m) \, K_{m-2}(y_{m-2}, dy_{m-1}) \cdots \pi_0(dy_0),$

where $A \in B(\mathbb{R}^{(m+1)k})$, $y \triangleq [y_0^{\mathrm{T}}, y_1^{\mathrm{T}}, \ldots, y_m^{\mathrm{T}}]^{\mathrm{T}}$, $y_i \triangleq [y_{i,1}, y_{i,2}, \ldots, y_{i,k}]^{\mathrm{T}}$, and $y_{i,j} \in \mathbb{R}$ (see Theorem 2 on page 247 in [Shiryayev, 1984]). Then (see Theorem 3 on page 161 in [Shiryayev, 1984]) there is a unique probability measure P on B(R∞) such that

$P(A') = P_{(m+1)k}(A),$

where $A' = \{[y_0^{\mathrm{T}}, y_1^{\mathrm{T}}, \ldots]^{\mathrm{T}} \in \mathbb{R}^{\infty} \mid [y_0^{\mathrm{T}}, y_1^{\mathrm{T}}, \ldots, y_m^{\mathrm{T}}]^{\mathrm{T}} \in A\}$. Thus, we have constructed a probability space (Ω, F, P) = (R∞, B(R∞), P) using the transition kernels $K_i$ and the initial distribution π0. Here, $\omega = [y_{0,1}, y_{0,2}, \ldots, y_{0,k}, y_{1,1}, y_{1,2}, \ldots]^{\mathrm{T}}$ and Ω = R∞. It remains to show that by defining

$x_i(\omega) \triangleq [y_{i,1}, y_{i,2}, \ldots, y_{i,k}]^{\mathrm{T}}, \qquad F_i \triangleq \sigma(x_0, x_1, \ldots, x_i),$

the sequence $\{x_i, F_i\}_{i=0}^{\infty}$ is a Markov chain on (Ω, F, P).

Let us take B ∈ B(Rk) and $C \in B(\mathbb{R}^{(i+1)k})$; then

$P\big(\{x_{i+1} \in B\} \cap \{[x_0^{\mathrm{T}}, x_1^{\mathrm{T}}, \ldots, x_i^{\mathrm{T}}]^{\mathrm{T}} \in C\}\big)$
$\quad = \int \cdots \int \chi_B(y_{i+1}) \, \chi_C(y_0, y_1, \ldots, y_i) \, K_i(y_i, dy_{i+1}) \, K_{i-1}(y_{i-1}, dy_i) \cdots \pi_0(dy_0)$
$\quad = \int \cdots \int K_i(y_i, B) \, \chi_C(y_0, y_1, \ldots, y_i) \, K_{i-1}(y_{i-1}, dy_i) \, K_{i-2}(y_{i-2}, dy_{i-1}) \cdots \pi_0(dy_0)$
$\quad = \int_{C'} K_i(y_i, B) \, P(d\omega),$

where $C' = \{\omega \in \Omega \mid [y_0^{\mathrm{T}}, y_1^{\mathrm{T}}, \ldots, y_i^{\mathrm{T}}]^{\mathrm{T}} \in C\}$. Because C′ ∈ σ(x_0, x_1, . . . , x_i) can be chosen arbitrarily, it follows that

$K_i(x_i, B) \overset{P\text{-a.e.}}{=} P(x_{i+1} \in B \mid x_0, x_1, x_2, \ldots, x_i). \qquad (2.8)$

Similarly, by taking B ∈ B(Rk) and C ∈ B(Rk), it follows that

$K_i(x_i, B) \overset{P\text{-a.e.}}{=} P(x_{i+1} \in B \mid x_i). \qquad (2.9)$

By combining Equation (2.8) and Equation (2.9), $\{x_i, F_i\}_{i=0}^{\infty}$ is observed to be a Markov chain.
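As a simple concrete instance of this construction, consider a Gaussian random walk: let π0 = N(0, σ0²) and let the time homogeneous kernel be

$K(x, A) = \int_A f_{\mathrm{N}}(y; x, \sigma^2) \, \lambda_1(dy), \quad x \in \mathbb{R},\ A \in B(\mathbb{R}).$

Constructing the chain as in the proof, that is, sampling $x_0 \sim \pi_0$ and $x_{i+1} \sim K(x_i, \cdot)$, yields $x_t = x_0 + \sum_{i=1}^{t} v_i$ with independent increments $v_i \sim \mathrm{N}(0, \sigma^2)$, so that the marginal distribution of $x_t$ is N(0, σ0² + tσ²).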

For future purposes, let us give the following convenient definition that is often used in the literature [see, e.g. Crisan, 2001, Crisan and Doucet, 2002, LeGland and Oudjane, 2004].

Definition 17 Let K be a transition kernel, and let µ be a probability measure on B(Rk). Then

$K\mu(A) \triangleq \int K(x, A) \, \mu(dx), \quad A \in B(\mathbb{R}^k).$

Proposition 1 Kµ is a probability measure on B(Rk).

Proof: To prove the countable additivity, let A1, A2, . . . ∈ B(Rk) be a countable sequence of disjoint subsets of Rk, and let $A = \bigcup_{i=1}^{\infty} A_i$. Then

$K\mu(A) = \int K(x, A) \, \mu(dx) = \int \sum_{i=1}^{\infty} K(x, A_i) \, \mu(dx) = \sum_{i=1}^{\infty} \int K(x, A_i) \, \mu(dx) = \sum_{i=1}^{\infty} K\mu(A_i).$

By the monotone convergence theorem (see Theorem 19 in Section A.3), the order of the integration and the countable summation can be interchanged because K(·, A_i) is a nonnegative random variable for given A_i. Clearly, $K\mu(\mathbb{R}^k) = \int K(x, \mathbb{R}^k) \, \mu(dx) = \int \mu(dx) = 1$. Thus, Kµ is a probability measure.


2.7 Projective Product

If there is a finite measure µ defined on the σ-algebra F, then a probability measure ν on F can always be defined by

$\nu(A) = \frac{\mu(A)}{\mu(\mathbb{R}^k)}, \quad A \in F.$

Therefore all finite measures can be regarded as unnormalised probability measures. This observation enables us to give the following definition for the projective product [LeGland and Oudjane, 2004].

Definition 18 Suppose that µ is a finite measure and ν is an arbitrary probability measure on B(Rk). Moreover, let ϕ : Rk → [0,∞) be a bounded and B(Rk)/B([0,∞))-measurable mapping. Then, for all A ∈ B(Rk), the projective product ϕ ∗ µ is defined as

$(\varphi * \mu)(A) \triangleq \begin{cases} \dfrac{\int_A \varphi \, d\mu}{\int \varphi \, d\mu}, & \text{if } \int \varphi \, d\mu > 0, \\[1ex] \nu(A), & \text{otherwise.} \end{cases}$

The outcome of the projective product is a probability measure as stated by thefollowing simple proposition.

Proposition 2 The projective product ϕ ∗ µ is a probability measure on B(Rk).

Proof: Clearly, (ϕ ∗ µ) is a probability measure if ∫ ϕ dµ = 0. So let us consider the case when ∫ ϕ dµ ≠ 0. Clearly, (ϕ ∗ µ)(∅) = 0 and (ϕ ∗ µ)(R^k) = 1. So, it suffices to show that the countable additivity property holds. Let A_1, A_2, … ∈ B(R^k) be a sequence of disjoint sets and A = ⋃_{i=1}^∞ A_i. Then, by the properties of the integral,

$$\int_A \varphi \, d\mu = \int \chi_A \varphi \, d\mu = \int \sum_{i=1}^{\infty} \chi_{A_i} \varphi \, d\mu = \sum_{i=1}^{\infty} \int \chi_{A_i} \varphi \, d\mu = \sum_{i=1}^{\infty} \int_{A_i} \varphi \, d\mu.$$

Because the functions χ_{A_i}ϕ are nonnegative random variables, the order of countable summation and integration can be interchanged by the monotone convergence theorem. Multiplication by (∫ ϕ dµ)^{-1} yields the countable additivity property.
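For a discrete (for instance, sample-based) measure µ with atoms x^i and masses w^i, the projective product of Definition 18 amounts to multiplying the masses by ϕ(x^i) and normalising, with the fallback measure ν used when ∫ ϕ dµ = 0. The sketch below is a minimal Python illustration under these assumptions; the atoms, weights and ϕ values are invented numbers, not taken from the thesis.

    import numpy as np

    def projective_product(phi_vals, weights, nu_weights):
        """Return the weights of (phi * mu) for a discrete measure mu.

        phi_vals[i]  : phi evaluated at the i-th atom of mu
        weights[i]   : mass mu({x_i}) >= 0
        nu_weights   : weights of the fallback probability measure nu
        """
        unnorm = phi_vals * weights
        total = unnorm.sum()
        if total > 0:
            return unnorm / total        # (phi * mu)({x_i})
        return nu_weights                # fall back to nu if the integral of phi vanishes

    phi = np.array([0.2, 1.5, 0.0, 0.8])
    w   = np.array([0.1, 0.4, 0.3, 0.2])
    nu  = np.full(4, 0.25)
    print(projective_product(phi, w, nu))    # sums to 1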


Chapter 3

Bayesian Filtering

In time series estimation, the goal is to recover some unknown time series of interest from observations that have been made at given time instants. Time series estimation can be divided into three classes: prediction, smoothing, and filtering [see, e.g. Anderson and Moore, 1979]. In prediction, the observations received before time instant t1 are used for recovering the value of the time series at a time instant t2 > t1. In smoothing, observations are available up to time instant t1, but one is interested only in recovering the value of the time series at a time instant t2 < t1. In filtering, one has observations available up to time instant t1 and the goal is to recover the value of the time series at that same time instant, t2 = t1. Although all three problems can be approached from the Bayesian point of view, this thesis will focus mostly on the filtering problem. To some extent, prediction is a by-product of filtering because it is used as an intermediate result in the filter algorithm. For more on smoothing, see, e.g. [Anderson and Moore, 1979, Kitagawa, 1996].

Section 3.1 introduces the well known Bayes' rule, on which all Bayesian inference is based. Section 3.2 lists the assumptions about the application that enable the recursive Bayesian filter and formulates the time series estimation problem as a Bayesian filtering problem. Section 3.3 introduces the recursive algorithm known as the Bayesian filter and also provides the proof that the given algorithm, in fact, solves the Bayesian filtering problem.

3.1 Bayes’ Rule

All Bayesian inference, e.g. classification and filtering, is an application of the following fundamental theorem [Shiryayev, 1984, page 229].

Theorem 5 (Bayes' rule) Let (Ω, F, P) be a probability space, x : Ω → R^k a random variable and h : R^k → R a B(R^k)/B(R)-measurable mapping such that E[|h(x)|] < ∞. Moreover, it is assumed that there is a sub-σ-algebra G ⊂ F, a regular conditional probability P(·; ·) : Ω × G → [0, 1] with respect to the σ-algebra σ(x) ⊂ F, a σ-finite measure µ on G, and an F ⊗ G/B([0,∞))-measurable function ρ : Ω × Ω → [0,∞) such that

$$P(\omega; A) = \int_A \rho(\omega, \bar\omega)\,\mu(d\bar\omega), \qquad \omega \in \Omega,\ A \in \mathcal{G}. \tag{3.1}$$

Then

$$\mathrm{E}[h(x) \mid \mathcal{G}](\omega) \overset{P\text{-a.e.}}{=} \frac{\int h(x(\bar\omega))\,\rho(\bar\omega, \omega)\,P(d\bar\omega)}{\int \rho(\bar\omega, \omega)\,P(d\bar\omega)}. \tag{3.2}$$

This is known as the Bayes' rule.

Proof: For all A ∈ G, we define a set function Q : G → R as

$$Q(A) \triangleq \int_A h(x(\bar\omega))\,P(d\bar\omega) = \int h(x)\chi_A \, dP = \int \mathrm{E}[h(x)\chi_A \mid x] \, dP = \int h(x) P(A \mid x) \, dP = \int h(x(\bar\omega))\, P(\bar\omega; A)\, P(d\bar\omega), \tag{3.3}$$

where the third equality follows from the definition of the conditional expectation and the fact that Ω ∈ σ(x). The fourth equality is due to the σ(x)-measurability of h(x), and the last equality follows from property (ii) in Definition 13. Because E[|h(x)|] < ∞, Q is a signed measure on G [Shiryayev, 1984, page 193]. By the definition of the conditional expectation, it also follows that

$$P(A) = \int P(\bar\omega; A)\, P(d\bar\omega), \qquad A \in \mathcal{G}.$$

By combining Equation (3.1) with Equation (3.3) and by using Fubini's theorem (see Theorem 20 in Section A.3), we have

$$Q(A) = \int h(x(\bar\omega)) \left[ \int_A \rho(\bar\omega, \omega)\,\mu(d\omega) \right] P(d\bar\omega) = \int_A \left[ \int h(x(\bar\omega))\,\rho(\bar\omega, \omega)\, P(d\bar\omega) \right] \mu(d\omega), \qquad A \in \mathcal{G},$$

which implies that $(dQ/d\mu)(\omega) = \int h(x(\bar\omega))\,\rho(\bar\omega, \omega)\,P(d\bar\omega)$. Similarly for P, one has

$$P(A) = \int_A \left[ \int \rho(\bar\omega, \omega)\, P(d\bar\omega) \right] \mu(d\omega), \qquad A \in \mathcal{G},$$

and consequently $(dP/d\mu)(\omega) = \int \rho(\bar\omega, \omega)\,P(d\bar\omega)$. Because E[h(x) | G] = dQ/dP, according to Lemma 2, we have

$$\mathrm{E}[h(x) \mid \mathcal{G}](\omega) = \frac{dQ}{dP}(\omega) \overset{P\text{-a.e.}}{=} \frac{(dQ/d\mu)(\omega)}{(dP/d\mu)(\omega)} = \frac{\int h(x(\bar\omega))\,\rho(\bar\omega, \omega)\,P(d\bar\omega)}{\int \rho(\bar\omega, \omega)\,P(d\bar\omega)}.$$

A few remarks regarding the proof are in order. It is not evident that Fubini's theorem can be applied as proposed in the proof. To this end ρ was assumed to be F ⊗ G/B([0,∞))-measurable. The symbol '⊗' denotes the direct product of σ-algebras, i.e. the smallest σ-algebra that contains all sets of the form A × B, where A ∈ F and B ∈ G. Another condition which must be satisfied is that [Shiryayev, 1984, page 200]

$$\int \left[ \int |h(x(\bar\omega))\,\rho(\bar\omega, \omega)| \,\mu(d\omega) \right] P(d\bar\omega) < \infty,$$

which is easily found to be satisfied by the assumption E[|h(x)|] < ∞, because the left side of the inequality is equal to E[|h(x)|].

In practice, the conditioning σ-algebra G is often generated by a random variable y : Ω → R^m. Then Theorem 3 ensures that there is also a regular conditional distribution Q(·; ·) : Ω × B(R^m) → [0, 1] with respect to the σ-algebra σ(x). If, in addition, there is a σ-finite measure µ_y on B(R^m) and an F ⊗ B(R^m)/B([0,∞))-measurable function ρ : Ω × R^m → [0,∞) such that

$$Q(\omega; A) = \int_A \rho(\omega, y)\,\mu_y(dy), \qquad A \in \mathcal{B}(\mathbb{R}^m),$$

then the proof of Bayes' rule can be straightforwardly modified to show that

$$\mathrm{E}[h(x) \mid y] \overset{P\text{-a.e.}}{=} \frac{\int h(x(\bar\omega))\,\rho(\bar\omega, y)\,P(d\bar\omega)}{\int \rho(\bar\omega, y)\,P(d\bar\omega)}. \tag{3.4}$$

If, in addition, it is assumed that ρ is σ(x) ⊗ B(R^m)/B([0,∞))-measurable, in which case it remains F ⊗ B(R^m)/B([0,∞))-measurable as well, then the following lemma allows us to give a perhaps more familiar formulation for the Bayes' rule. The proof of the lemma is given in [Shiryayev, 1984, page 172].

Lemma 3 Let x and y be random variables taking values in R^k and R^p, respectively, and let x be σ(y)-measurable. Then there is a B(R^p)/B(R^k)-measurable function ϕ such that x(ω) = ϕ(y(ω)) for all ω ∈ Ω.

By the assumed measurability of ρ, it is known that ρ(·, y) is σ(x)/B([0,∞))-measurable for all y ∈ R^m. Therefore, by Lemma 3, there is a function g : R^k × R^m → [0,∞) such that g(x(ω), y) = ρ(ω, y) for all ω ∈ Ω, y ∈ R^m. In this case, the direct substitution of g into Equation (3.4) yields

$$\mathrm{E}[h(x) \mid y] \overset{P\text{-a.e.}}{=} \frac{\int h(x(\bar\omega))\,g(x(\bar\omega), y)\,P(d\bar\omega)}{\int g(x(\bar\omega), y)\,P(d\bar\omega)} = \frac{\int h(x)\,g(x, y)\,P_x(dx)}{\int g(x, y)\,P_x(dx)}, \tag{3.5}$$

where the second equality follows from the change of variables in the Lebesgue integral [Halmos, 1950, page 163].

The substitution of a characteristic function of a set A ∈ B(R^k) into the place of the function h in Equation (3.4) yields a conditional probability distribution which is commonly called the Bayesian posterior distribution. According to Definition 18, this can be written as

$$\mathrm{E}[\chi_A(x) \mid y] = P(x \in A \mid y) \overset{P\text{-a.e.}}{=} (g_y * P_x)(A),$$

where the shorthand notation g_y ≜ g(·, y) is used.


In the foregoing, the random variable y represents the random observation on which the Bayesian inference about the unknown quantity x is based. Moreover, for given x, g(x, ·) is the density of the conditional distribution of the observation y with respect to the σ-finite measure µ_y in the observation space R^m. For a given realised observation y′, the function g_{y′}, as a function of x, is often called the likelihood.

3.2 Problem Statement

In principle, Bayes' rule could be applied rather straightforwardly to time series estimation. However, significant improvements in the feasibility of the algorithms are obtained with certain assumptions that make recursive filtering algorithms possible. A list of these assumptions is given below. Despite the multiplicity of the assumptions, they are not very restrictive and are often found to be satisfied in practice.

i) There is a Markov chain {x_t, F_t}_{t=0}^∞ taking values in R^k. The Markov chain has the initial distribution P_{x_0} and transition kernels K_t.

ii) There is a sequence {y_t}_{t=1}^∞ of F/B(R^m)-measurable mappings y_t : Ω → R^m such that

$$P(y_k \in A \mid \mathcal{G}) \overset{P\text{-a.e.}}{=} P(y_k \in A \mid x_k), \qquad A \in \mathcal{B}(\mathbb{R}^m),$$

where G is the σ-algebra generated by any finite collection of the random variables x_t and y_t, excluding y_k and including x_k. The sequence {y_t}_{t=1}^∞ is called the observation sequence and the random variable y_t is called an observation.

iii) For all t ∈ N, there is a regular conditional distribution P_t(·; ·) : R^k × B(R^m) → [0, 1] which is a variant of P(y_t ∈ A | x_t) for all A ∈ B(R^m), i.e.

$$P_t(x_t; A) \overset{P\text{-a.e.}}{=} P(y_t \in A \mid x_t), \qquad A \in \mathcal{B}(\mathbb{R}^m).$$

iv) For all t ∈ N, there is a σ-finite measure µ_{y_t} defined on B(R^m) and a B(R^k) ⊗ B(R^m)/B([0,∞))-measurable function g′_t such that for all x_t ∈ R^k and A ∈ B(R^m)

$$P_t(x_t; A) = \int_A g'_t(x_t, y)\,\mu_{y_t}(dy).$$

Similarly as in the previous section, the function g′_t, as a function of x_t, is called the likelihood, and the shorthand notation g_{y_t} = g′_t(·, y_t) is used.

In Bayesian inference, one always needs a model of the application of interest. Any model that satisfies the assumptions listed above is theoretically valid. However, because the assumptions are not very restrictive, they leave a lot of alternative models to choose from. The choice of the model is always application specific, and finding a good model is often difficult. To give some intuition about what the models might look like, the following example covers a multitude of models commonly used in different applications. For convenience, the shorthand notations x_{0:t} and y_{1:t} will be used for denoting the ordered sets of random variables (x_0, x_1, …, x_t) and (y_1, y_2, …, y_t), respectively. This notation will be used throughout the remainder of this thesis.

Example 1: Let x_0 be an R^k-valued random variable with distribution P_{x_0}. For t ∈ Z_+, we define the process model as

$$x_{t+1} = f_t(x_t, v_t),$$

where the mappings v_t : Ω → R^n form a sequence of independent random variables and the mappings f_t : R^k × R^n → R^k are assumed to be B(R^k) ⊗ B(R^n)/B(R^k)-measurable. Also, we define for t ∈ N the observation or measurement model as

$$y_t = h_t(x_t, w_t),$$

where the mappings w_t : Ω → R^p form a sequence of independent random variables and the mappings h_t : R^k × R^p → R^m are assumed to be B(R^k) ⊗ B(R^p)/B(R^m)-measurable. It is noted that, because of the measurability assumptions above, y_t and x_{t+1} are random variables.

Let us define

$$P_t(x_t; A) \triangleq \int_{\mathbb{R}^p} \chi_{\bar A}(x_t, w)\, P_{w_t}(dw), \qquad A \in \mathcal{B}(\mathbb{R}^m),$$

where the notation Ā ≜ h_t^{-1}(A) ∈ B(R^{k+p}) is used for clarity. To see that assumption (iii) is satisfied, P_t(x_t; A) must be shown to be a regular conditional distribution with respect to the σ-algebra σ(x_t). Because of the assumed measurability of h_t, χ_Ā(x_t, ·) is B(R^p)/B(R)-measurable for all x_t ∈ R^k and P_t(x_t; A) is a well defined probability measure for all x_t ∈ R^k. It then remains to show that P_t(x_t; A) is a variant of P(y_t ∈ A | x_t). By Fubini's theorem, P_t(·; A) is B(R^k)/B([0, 1])-measurable for all A ∈ B(R^m). Let C ∈ B(R^k). Then by Fubini's theorem

$$\int_C P_t(x; A)\, P_{x_t}(dx) = \int_C \left[ \int_{\mathbb{R}^p} \chi_{\bar A}(x, w)\, P_{w_t}(dw) \right] P_{x_t}(dx) = \int_{C \times \mathbb{R}^p} \chi_{\bar A}\, d(P_{x_t} \times P_{w_t}) = P(\{y_t \in A\} \cap \{x_t \in C\}). \tag{3.6}$$

Because C is arbitrary, Equation (3.6) implies that P_t(x_t; A) is a variant of P(y_t ∈ A | x_t). Similarly as above, we can define

$$K_t(x_t, A) \triangleq \int_{\mathbb{R}^n} \chi_{\tilde A}(x_t, v)\, P_{v_t}(dv), \qquad A \in \mathcal{B}(\mathbb{R}^k),$$

where Ã ≜ f_t^{-1}(A) ∈ B(R^{k+n}), and show that this is a regular conditional distribution, which ensures that assumption (i) is satisfied.

By taking the set C above from the σ-algebra generated by any finite collection of the random variables x_t and y_t, excluding y_k and including x_k, and by replacing the distribution P_{x_t} by the joint distribution of all the random variables in the above mentioned collection, the equalities above will still hold because of the independence of w_k. This establishes assumption (ii).

Finally, to address assumption (iv), it is noted that if (P_{x_t} × P_{w_t})(Ā) = 0 whenever λ^m(A) = 0, then P_t(x_t; ·) has a density with respect to the Lebesgue measure λ^m on B(R^m) and also assumption (iv) is satisfied.
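As a purely illustrative instance of Example 1, the following Python sketch simulates a scalar model of the stated form with additive Gaussian noise. The particular choices f_t(x, v) = 0.9x + v and h_t(x, w) = x + w, as well as the noise variances, are assumptions made only for this sketch and are not taken from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x, v):                 # process model, x_{t+1} = f(x_t, v_t)
        return 0.9 * x + v

    def h(x, w):                 # measurement model, y_t = h(x_t, w_t)
        return x + w

    T = 50
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(0.0, 1.0)              # draw x_0 from P_{x_0}
    for t in range(T):
        v = rng.normal(0.0, 0.5)             # independent process noise v_t
        w = rng.normal(0.0, 1.0)             # independent measurement noise w_t
        x[t + 1] = f(x[t], v)
        y[t + 1] = h(x[t + 1], w)            # observation of the new state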

The task at the t-th time instant is to find the conditional probability P(x_t ∈ A | y_{1:t}) for all A ∈ B(R^k). This conditional probability distribution is also known as the Bayesian posterior distribution, or as the filtering distribution. It should be noted that, often in practice, one is not in fact interested in the posterior distribution itself. The interest is usually in some point estimates obtained, for example, by taking various expectations with respect to the posterior distribution. If the posterior distribution is available, the construction of point estimates is usually theoretically straightforward. Therefore, the main objective is to obtain the posterior distribution.

For the t-th time instant, the direct application of Bayes' rule would yield the posterior distribution P(x_{0:t} ∈ A | y_{1:t}) for all A ∈ B(R^{k(t+1)}). The filtering distribution would then simply be

$$P(x_t \in A \mid y_{1:t}) \overset{P\text{-a.e.}}{=} P\left(x_0 \in \mathbb{R}^k, x_1 \in \mathbb{R}^k, \ldots, x_t \in A \mid y_{1:t}\right), \qquad A \in \mathcal{B}(\mathbb{R}^k).$$

In many practical applications, the observations are received at consecutive time instants one by one. Because the posterior distributions for every time instant are to be computed, we make our objective slightly more ambitious by saying that we want to compute P(x_t ∈ A | y_{1:t}) recursively, using the previously computed posterior distribution P(x_{t-1} ∈ A | y_{1:t-1}). It is not obvious how this could be done, or if it is even possible. The following section will show that, with the assumptions listed above, this indeed is possible.

3.3 Bayesian Filter

Before the solution to the Bayesian filtering problem is introduced, we give an important result which is required in order to prove the validity of the solution. The following proposition is intuitive and the proof is elementary, although rather long.

Proposition 3 Let (Ω_1, F_1, P) and (Ω_1 × Ω_2, F_1 ⊗ F_2, P̄) be probability spaces, where P(A) = P̄(A × Ω_2) for all A ∈ F_1. Moreover, suppose that there are extended real valued, integrable functions f : Ω_1 → R̄ and f̄ : Ω_1 × Ω_2 → R̄ such that f(ω_1) = f̄(ω_1, ω_2) for all ω_1 ∈ Ω_1, ω_2 ∈ Ω_2. Then, for all A ∈ F_1,

$$\int_A f \, dP = \int_{A \times \Omega_2} \bar{f} \, d\bar{P}.$$

Proof: Define f_A(ω_1) ≜ χ_A(ω_1) f(ω_1), and f̄_A(ω_1, ω_2) ≜ f_A(ω_1). Then we have

$$\int_A f \, dP = \int f_A \, dP = \int f_A^+ \, dP - \int f_A^- \, dP.$$

Since f_A^+ and f_A^- are both nonnegative, then by Theorem 16 in Section A.2.3 there are sequences {h_i}_{i=1}^∞ and {g_i}_{i=1}^∞ of F_1/B(R)-measurable, countably simple functions, such that h_i ↑ f_A^+ and g_i ↑ f_A^-. Thus

$$\int_A f \, dP = \int \lim_{i \to \infty} h_i \, dP - \int \lim_{i \to \infty} g_i \, dP = \lim_{i \to \infty} \int h_i \, dP - \lim_{i \to \infty} \int g_i \, dP,$$

where the latter equality follows from the monotone convergence theorem (MCT) (see Theorem 19, Section A.3). Let {a_i^j}_{i=1}^∞ and {b_i^j}_{i=1}^∞ denote the ranges of the functions h_j and g_j, respectively. Then, by the definition of integrals of countably simple functions, one has

$$\begin{aligned}
\int_A f \, dP &= \lim_{j \to \infty} \sum_{i=1}^{\infty} a_i^j \, P(h_j^{-1}(a_i^j)) - \lim_{j \to \infty} \sum_{i=1}^{\infty} b_i^j \, P(g_j^{-1}(b_i^j)) \\
&= \lim_{j \to \infty} \sum_{i=1}^{\infty} a_i^j \, \bar{P}(h_j^{-1}(a_i^j) \times \Omega_2) - \lim_{j \to \infty} \sum_{i=1}^{\infty} b_i^j \, \bar{P}(g_j^{-1}(b_i^j) \times \Omega_2).
\end{aligned}$$

Let us then define h̄_i(ω_1, ω_2) = h_i(ω_1) and ḡ_i(ω_1, ω_2) = g_i(ω_1). Thus h̄_j^{-1}(a_i^j) = h_j^{-1}(a_i^j) × Ω_2 and ḡ_j^{-1}(b_i^j) = g_j^{-1}(b_i^j) × Ω_2 for all i and j. Thus

$$\begin{aligned}
\int_A f \, dP &= \lim_{j \to \infty} \sum_{i=1}^{\infty} a_i^j \, \bar{P}(\bar{h}_j^{-1}(a_i^j)) - \lim_{j \to \infty} \sum_{i=1}^{\infty} b_i^j \, \bar{P}(\bar{g}_j^{-1}(b_i^j)) \\
&= \lim_{j \to \infty} \int \bar{h}_j \, d\bar{P} - \lim_{j \to \infty} \int \bar{g}_j \, d\bar{P} \\
&= \int \lim_{j \to \infty} \bar{h}_j \, d\bar{P} - \int \lim_{j \to \infty} \bar{g}_j \, d\bar{P} \\
&= \int \bar{f}_A^+ \, d\bar{P} - \int \bar{f}_A^- \, d\bar{P},
\end{aligned}$$

where the third equality uses the MCT. In the last equality, the definition f̄_A(ω_1, ω_2) = f_A(ω_1) has been used: clearly, lim_{i→∞} h̄_i(ω_1, ω_2) = lim_{i→∞} h_i(ω_1) = f_A^+(ω_1) = f̄_A^+(ω_1, ω_2), and similarly for f̄_A^-. Therefore, we have

$$\int_A f \, dP = \int \bar{f}_A \, d\bar{P} = \int_{A \times \Omega_2} \bar{f} \, d\bar{P}.$$

Let us also introduce another important result which will be used later. Only the outline of the proof is given; the required details can be found in [Royden, 1968].


Proposition 4 Suppose that $(M_i, \mathcal{M}_i)$, i = 1, 2, …, k, are measurable spaces and $\mathcal{C}$ is the collection of the sets of the form $C = \times_{i=1}^{k} A_i$, where $A_i \in \mathcal{M}_i$, i = 1, 2, …, k. If there is a set function µ* : $\mathcal{C}$ → [0,∞) satisfying

i) $\times_{i=1}^{k} M_i = \bigcup_{i=1}^{\infty} C_i$, where $C_i \in \mathcal{C}$ are disjoint and µ*(C_i) < ∞, i ∈ N;

ii) $\mu^*(C) = \sum_{i=1}^{n} \mu^*(C_i)$, whenever $C_i \in \mathcal{C}$ are disjoint and $C = \bigcup_{i=1}^{n} C_i \in \mathcal{C}$ for some finite n;

iii) $\mu^*(C) \leq \sum_{i=1}^{\infty} \mu^*(C_i)$, whenever $C_i \in \mathcal{C}$ are disjoint and $C = \bigcup_{i=1}^{\infty} C_i \in \mathcal{C}$,

then there is a unique measure $\mu : \mathcal{M}_1 \otimes \mathcal{M}_2 \otimes \cdots \otimes \mathcal{M}_k \to [0, \infty)$ such that µ(C) = µ*(C) for all $C \in \mathcal{C}$.

Proof: It is elementary to show that $\mathcal{C}$ is a semialgebra (see Definition 36 in Section A.1.4). Then the collection $\mathcal{A}$ of all finite disjoint unions of the elements in $\mathcal{C}$ is an algebra (see Definition 35 in Section A.1) [Royden, 1968, page 259]. In order to construct a premeasure µ** (see Definition 38 in Section A.2) on $\mathcal{A}$, we define for all $A \in \mathcal{A}$

$$\mu^{**}(A) \triangleq \sum_{i=1}^{m} \mu^{*}(C_i),$$

where $A = \bigcup_{i=1}^{m} C_i$ is an arbitrary representation of A as a finite disjoint union of sets $C_i \in \mathcal{C}$. This defines the premeasure µ** uniquely, since for any other disjoint union $A = \bigcup_{j=1}^{n} D_j$, we observe that

$$\begin{aligned}
\mu^{**}(A) = \sum_{i=1}^{m} \mu^{*}(C_i) &= \sum_{i=1}^{m} \mu^{*}\!\left(\bigcup_{j=1}^{n} C_i \cap D_j\right) = \sum_{i=1}^{m} \sum_{j=1}^{n} \mu^{*}(C_i \cap D_j) \\
&= \sum_{j=1}^{n} \sum_{i=1}^{m} \mu^{*}(C_i \cap D_j) = \sum_{j=1}^{n} \mu^{*}\!\left(\bigcup_{i=1}^{m} C_i \cap D_j\right) = \sum_{j=1}^{n} \mu^{*}(D_j),
\end{aligned}$$

where the third and the fifth equalities follow from condition (ii) and from the fact that a semialgebra is closed under finite intersections. In order to ensure that µ** is a premeasure, it must be shown to be countably additive. Let $A = \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, where $A_i \in \mathcal{A}$ are disjoint. Because the elements of $\mathcal{A}$ are finite disjoint unions of sets in $\mathcal{C}$, we have $\mu^{**}(\bigcup_{i=1}^{k} A_i) = \sum_{i=1}^{k} \mu^{**}(A_i)$ for all k < ∞. Therefore, because $\bigcup_{i=1}^{k} A_i \subset A$,

$$\mu^{**}(A) \geq \mu^{**}\!\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} \mu^{**}(A_i).$$

Thus,

$$\sum_{i=1}^{\infty} \mu^{**}(A_i) = \lim_{k \to \infty} \sum_{i=1}^{k} \mu^{**}(A_i) \leq \mu^{**}(A). \tag{3.7}$$

On the other hand, because every $A_i \in \mathcal{A}$ is a finite disjoint union of sets in $\mathcal{C}$, $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} C_i$, where $C_i \in \mathcal{C}$ are disjoint. Therefore, condition (iii) implies that

$$\mu^{**}(A) = \mu^{**}\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \mu^{**}\!\left(\bigcup_{i=1}^{\infty} C_i\right) \leq \sum_{i=1}^{\infty} \mu^{**}(C_i) = \sum_{i=1}^{\infty} \mu^{**}(A_i),$$

and thus, according to Equation (3.7), we must have $\mu^{**}(A) = \sum_{i=1}^{\infty} \mu^{**}(A_i)$.

Condition (i) ensures that µ** is a σ-finite premeasure on the algebra $\mathcal{A}$. By the well known Carathéodory extension theorem there is a unique measure µ on $\mathcal{M}_1 \otimes \mathcal{M}_2 \otimes \cdots \otimes \mathcal{M}_k$ such that µ(C) = µ**(C) = µ*(C) for all $C \in \mathcal{C}$ [Shiryayev, 1984, page 150].

Now we are ready to introduce the well known recursion for solving the Bayesian filtering problem. Often the recursion is known as the optimal filter [Crisan and Doucet, 2002, LeGland and Oudjane, 2004] or the Bayesian filter [Gordon et al., 1993]. Because of the absence of any specific optimality criterion, and because of the direct relation to Bayes' rule, the latter name will be used throughout the remainder of this thesis.

Definition 19 (Bayesian filter) Define π_{0|0} = P_{x_0}. Then for all t ∈ Z_+, the Bayesian filter for a model satisfying the assumptions (i)–(iv) in Section 3.2 is the recursion

$$\pi_{t+1|t} = K_t \pi_{t|t}, \tag{3.8}$$
$$\pi_{t+1|t+1} = g_{y_{t+1}} * \pi_{t+1|t}. \tag{3.9}$$

Often in the literature, Equation (3.8) is referred to as the prediction equation and Equation (3.9) is referred to as the update equation. The probability measures π_{t+1|t} and π_{t|t} will be called the prediction distribution and the posterior distribution, respectively. The following result ensures that the Bayesian filter indeed provides a solution to the filtering problem. For another account on the topic, see [Crisan, 2001].

Proposition 5 Let π_{t|t} and π_{t+1|t} be the posterior distribution and the prediction distribution of the Bayesian filter for a model satisfying the assumptions (i)–(iv) in Section 3.2. Then for all A ∈ B(R^k), t ∈ Z_+,

$$\pi_{t|t}(A) \overset{P\text{-a.e.}}{=} P(x_t \in A \mid y_{1:t}),$$
$$\pi_{t+1|t}(A) \overset{P\text{-a.e.}}{=} P(x_{t+1} \in A \mid y_{1:t}).$$

Proof: Because the proof is lengthy, it is divided into parts 1, 2, and 3.

1) Let us show by induction that for all A = ×_{i=1}^{s} A_i, A_i ∈ B(R^m), s ≤ t,

$$P(y_{1:s} \in A \mid x_{0:t}) \overset{P\text{-a.e.}}{=} \prod_{i=1}^{s} P(y_i \in A_i \mid x_i). \tag{3.10}$$

Clearly, by assumption (ii), we have $P(y_{1:1} \in A \mid x_{0:t}) \overset{P\text{-a.e.}}{=} P(y_1 \in A_1 \mid x_1)$. Because A is defined as a Cartesian product, it follows that

$$P(y_{1:s} \in A \mid x_{0:t}) = P\!\left( \bigcap_{i=1}^{s} \bar A_i \,\middle|\, x_{0:t} \right) = \mathrm{E}\!\left[ \prod_{i=1}^{s} \chi_{\bar A_i} \,\middle|\, x_{0:t} \right], \tag{3.11}$$

where Ā_i ≜ y_i^{-1}(A_i). Clearly, σ(x_{0:t}) ⊂ σ(x_{0:t}, y_{1:s−1}), implying that

$$\mathrm{E}\!\left[ \prod_{i=1}^{s} \chi_{\bar A_i} \,\middle|\, x_{0:t} \right] \overset{P\text{-a.e.}}{=} \mathrm{E}\!\left[ \mathrm{E}\!\left[ \prod_{i=1}^{s} \chi_{\bar A_i} \,\middle|\, x_{0:t}, y_{1:s-1} \right] \,\middle|\, x_{0:t} \right]. \tag{3.12}$$

For all i = 1, 2, …, s − 1, χ_{Ā_i} is σ(x_{0:t}, y_{1:s−1})-measurable and therefore

$$\mathrm{E}\!\left[ \mathrm{E}\!\left[ \prod_{i=1}^{s} \chi_{\bar A_i} \,\middle|\, x_{0:t}, y_{1:s-1} \right] \,\middle|\, x_{0:t} \right] \overset{P\text{-a.e.}}{=} \mathrm{E}\!\left[ \mathrm{E}[\chi_{\bar A_s} \mid x_{0:t}, y_{1:s-1}] \prod_{i=1}^{s-1} \chi_{\bar A_i} \,\middle|\, x_{0:t} \right]. \tag{3.13}$$

By assumption (ii),

$$\mathrm{E}[\chi_{\bar A_s} \mid x_{0:t}, y_{1:s-1}] = P(y_s \in A_s \mid x_{0:t}, y_{1:s-1}) \overset{P\text{-a.e.}}{=} P(y_s \in A_s \mid x_s). \tag{3.14}$$

Because P(y_s ∈ A_s | x_s) is σ(x_{0:t})-measurable, it can be brought outside the conditional expectation in Equation (3.13). By combining Equations (3.11), (3.12), (3.13) and (3.14), one has

$$P(y_{1:s} \in A \mid x_{0:t}) \overset{P\text{-a.e.}}{=} P(y_s \in A_s \mid x_s)\, P\!\left( \bigcap_{i=1}^{s-1} \{y_i \in A_i\} \,\middle|\, x_{0:t} \right).$$

By induction, Equation (3.10) is found to be satisfied.

2) Let us then derive a nonrecursive formulation of Bayes' rule for x_{0:t} given y_{1:s}, where s ≤ t. Let h : R^{k(t+1)} → R be B(R^{k(t+1)})/B(R)-measurable. Define a set function Q : B(R^{ms}) → R as

$$Q(A) \triangleq \int_{\bar A} h(x_{0:t}) \, dP = \int_{\bar A} h^+(x_{0:t}) \, dP - \int_{\bar A} h^-(x_{0:t}) \, dP, \qquad A \in \mathcal{B}(\mathbb{R}^{ms}),$$

where Ā ≜ y_{1:s}^{-1}(A). By assuming that E[|h(x_{0:t})|] < ∞, both of the integrals in the above decomposition of Q are found to be finite, and hence σ-finite, measures on B(R^{ms}) [Gariepy and Ziemer, 1995, pages 151-152]. Let these integrals be denoted by Q^+(A) and Q^-(A), respectively. Then by Equation (3.10), for all A = ×_{i=1}^{s} A_i, A_i ∈ B(R^m),

$$Q^+(A) = \int h^+(x_{0:t})\, P(y_{1:s} \in A \mid x_{0:t}) \, dP = \int h^+(x_{0:t}) \prod_{i=1}^{s} P_i(x_i; A_i)\, P_{x_{0:t}}(dx_{0:t}) = \int h^+(x_{0:t}) \left[ \prod_{i=1}^{s} \int_{A_i} g'_i(x_i, y_i)\, \mu_{y_i}(dy_i) \right] P_{x_{0:t}}(dx_{0:t}),$$

where the first equality follows similarly as in the proof of Theorem 5, and the second and the third equalities are due to the assumptions (iii) and (iv), respectively. Because µ_{y_i} was assumed to be σ-finite and ∫ g′_i(x_i, y_i) µ_{y_i}(dy_i) = 1 for all i = 1, 2, …, s, Fubini's theorem can be applied, yielding

$$\prod_{i=1}^{s} \int_{A_i} g'_i(x_i, y_i)\, \mu_{y_i}(dy_i) = \int_{A} \prod_{i=1}^{s} g'_i(x_i, y_i)\, \mu_{y_{1:s}}(dy_{1:s}),$$

where µ_{y_{1:s}} = (((µ_{y_s} × µ_{y_{s−1}}) × µ_{y_{s−2}}) × ⋯ × µ_{y_1}). Thus,

$$Q^+(A) = \int \left[ \int_{A} h^+(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i)\, \mu_{y_{1:s}}(dy_{1:s}) \right] P_{x_{0:t}}(dx_{0:t}). \tag{3.15}$$

Because B(R^{k(t+1)}) and B(R^{k+m}) can be regarded as sub-σ-algebras of B(R^{k(t+1)+ms}), it follows that h^+ and g′_i, i = 1, 2, …, s, are B(R^{k(t+1)+ms})/B(R)-measurable functions. Therefore, their product is B(R^{k(t+1)+ms})/B(R)-measurable as well [Gariepy and Ziemer, 1995, page 115]. In addition, µ_{y_{1:s}} and P_{x_{0:t}} are σ-finite and

$$\int h^+(x_{0:t}) \int \prod_{i=1}^{s} g'_i(x_i, y_i)\, \mu_{y_{1:s}}(dy_{1:s})\, P_{x_{0:t}}(dx_{0:t}) < \infty,$$

implying that Fubini's theorem can be applied to Equation (3.15), yielding

$$Q^+(A) = \int_{A} \left[ \int h^+(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t}) \right] \mu_{y_{1:s}}(dy_{1:s}). \tag{3.16}$$

The right hand side of Equation (3.16) defines a finite measure on B(R^{ms}). Obviously, this measure agrees with Q^+ on the semialgebra of sets of the form ×_{i=1}^{s} A_i, A_i ∈ B(R^m). Therefore, according to Proposition 4, the equality must hold for all sets A ∈ B(R^{ms}). After repeating the same inference for Q^-, we have

$$Q(A) = \int_{A} \left[ \int h(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t}) \right] \mu_{y_{1:s}}(dy_{1:s}). \tag{3.17}$$

Because the inner integrand in Equation (3.17) is B(R^{k(t+1)+ms})/B(R)-measurable, by Fubini's theorem, the inner integral in Equation (3.17) is B(R^{ms})/B(R)-measurable. Therefore, by the change of variables, a signed measure Q̄ on σ(y_{1:s}) can be constructed as

$$\bar Q(\bar A) = \int_{\bar A} \left[ \int h(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i(\omega))\, P_{x_{0:t}}(dx_{0:t}) \right] \bar\mu(d\omega), \qquad \bar A \in \sigma(y_{1:s}),$$

where µ̄ is the measure on σ(y_{1:s}) determined by µ̄(y_{1:s}^{-1}(A)) = µ_{y_{1:s}}(A) [Halmos, 1950, page 163]. Thus

$$\frac{d\bar Q}{d\bar\mu}(\omega) = \int h(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i(\omega))\, P_{x_{0:t}}(dx_{0:t}).$$

Similarly as in the proof of Bayes' rule, it then follows that

$$\mathrm{E}[h(x_{0:t}) \mid y_{1:s}] \overset{P\text{-a.e.}}{=} \frac{\int h(x_{0:t}) \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t})}{\int \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t})}. \tag{3.18}$$

3) Let us then prove the recursion. Define

$$N_s \triangleq \left\{ \omega \in \Omega \,\middle|\, \int \prod_{i=1}^{s} g'_i(x_i, y_i(\omega))\, P_{x_{0:t}}(dx_{0:t}) = 0 \right\},$$

i.e. the set of those ω ∈ Ω for which Equation (3.18) is not defined. According to Equation (3.18), P(N_s) = 0. It is noted that if s ≤ t, then N_s ⊂ N_t. Throughout the remainder of this proof, it is assumed that ω ∉ N_t. This allows us to define

$$\pi_{0:t|1:s}(\omega, A) \triangleq \frac{\int_A \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t})}{\int \prod_{i=1}^{s} g'_i(x_i, y_i)\, P_{x_{0:t}}(dx_{0:t})} \overset{P\text{-a.e.}}{=} P(x_{0:t} \in A \mid y_{1:s})(\omega), \tag{3.19}$$

where A ∈ B(R^{k(t+1)}) and the last equality follows from Equation (3.18). By using the shorthand notation g_{y_t(ω)} = g′_t(·, y_t(ω)), Equation (3.19) yields

$$\frac{d\pi_{0:t|1:s}}{dP_{x_{0:t}}}(\omega, x_{0:t}) = c_s(\omega) \prod_{i=1}^{s} g_{y_i(\omega)}(x_i),$$

where (dπ_{0:t|1:s}/dP_{x_{0:t}})(ω, ·) is the RND of π_{0:t|1:s}(ω, ·) with respect to P_{x_{0:t}} and $c_s(\omega) \triangleq \left( \int \prod_{i=1}^{s} g_{y_i(\omega)}\, dP_{x_{0:t}} \right)^{-1}$. Let us then consider only the cases s = t and s = t − 1. It is straightforward to show that for all ω ∉ N_t, π_{0:t|1:t}(ω, ·) ≪ π_{0:t|1:t−1}(ω, ·). This allows us to use Lemma 2, yielding

$$\frac{d\pi_{0:t|1:t}}{d\pi_{0:t|1:t-1}}(\omega, x_{0:t}) = \frac{(d\pi_{0:t|1:t}/dP_{x_{0:t}})(\omega, \cdot)}{(d\pi_{0:t|1:t-1}/dP_{x_{0:t}})(\omega, \cdot)} = g'_t(x_t, y_t(\omega))\, \frac{c_t(\omega)}{c_{t-1}(\omega)}, \tag{3.20}$$

where the first equality holds π_{0:t|1:t−1}(ω, ·)-a.e. Note that the resulting RND is independent of x_{0:t−1}. Let us then define for A ∈ B(R^k)

$$\pi_{t|t-1}(\omega, A) \triangleq \pi_{0:t|1:t-1}(\omega, \mathbb{R}^k \times \mathbb{R}^k \times \cdots \times A) \overset{P\text{-a.e.}}{=} P(x_t \in A \mid y_{1:t-1})(\omega),$$
$$\pi_{t|t}(\omega, A) \triangleq \pi_{0:t|1:t}(\omega, \mathbb{R}^k \times \mathbb{R}^k \times \cdots \times A) \overset{P\text{-a.e.}}{=} P(x_t \in A \mid y_{1:t})(\omega).$$

Consequently,

$$\pi_{t|t}(\omega, A) = \int_{\mathbb{R}^k \times \mathbb{R}^k \times \cdots \times A} \frac{d\pi_{0:t|1:t}}{d\pi_{0:t|1:t-1}}(\omega, x_{0:t})\, \pi_{0:t|1:t-1}(\omega, dx_{0:t}) = \frac{\int_{\mathbb{R}^k \times \mathbb{R}^k \times \cdots \times A} g_{y_t(\omega)}(x_t)\, \pi_{0:t|1:t-1}(\omega, dx_{0:t})}{\int g_{y_t(\omega)}(x_t)\, \pi_{0:t|1:t-1}(\omega, dx_{0:t})} \tag{3.21}$$

and by Proposition 3

$$\pi_{t|t}(\omega, A) = \frac{\int_A g_{y_t(\omega)}(x_t)\, \pi_{t|t-1}(\omega, dx_t)}{\int g_{y_t(\omega)}(x_t)\, \pi_{t|t-1}(\omega, dx_t)} = \left( g_{y_t} * \pi_{t|t-1} \right)(\omega, A).$$

This proves Equation (3.9) in the Bayesian filter. On the other hand, by using assumption (i) and the shorthand notation K^A_{t−1} ≜ K_{t−1}(·, A), we have

$$\pi_{t|t-1}(\omega, A) = \frac{\int_{\mathbb{R}^k \times \mathbb{R}^k \times \cdots \times A} \prod_{i=1}^{t-1} g_{y_i(\omega)}\, dP_{x_{0:t}}}{\int \prod_{i=1}^{t-1} g_{y_i(\omega)}\, dP_{x_{0:t}}} = \frac{\int K^A_{t-1} \prod_{i=1}^{t-1} g_{y_i(\omega)}\, dP_{x_{0:t-1}}}{\int \prod_{i=1}^{t-1} g_{y_i(\omega)}\, dP_{x_{0:t-1}}} = \int K_{t-1}(x_{t-1}, A)\, \pi_{0:t-1|1:t-1}(\omega, dx_{0:t-1}),$$

and by Proposition 3

$$\pi_{t|t-1}(\omega, A) = \int K_{t-1}(x_{t-1}, A)\, \pi_{t-1|t-1}(\omega, dx_{t-1}) = K_{t-1}\pi_{t-1|t-1}(\omega, A). \tag{3.22}$$

This proves Equation (3.8) in the Bayesian filter.

In the remainder of this work, it is assumed that a specific realisation of the observations y_{1:t} is available. This means that the Bayesian posterior distribution π_{t|t} is no longer random. Therefore the dependency on ω is omitted from the notation, and the shorthand notations

$$\pi_{t|t-1}(A) \triangleq \pi_{t|t-1}(\omega, A), \qquad \pi_{t|t}(A) \triangleq \pi_{t|t}(\omega, A)$$

will be used. Also the dependency on ω is omitted from the notation for the likelihood function; the shorthand notation g_t(·) ≜ g′_t(·, y_t(ω)) will be used.
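When the state x_t takes only finitely many values, the recursion of Definition 19 can be evaluated exactly: K_t becomes a row-stochastic matrix, π_{t|t} a probability vector, the prediction (3.8) a matrix-vector product, and the update (3.9) a pointwise multiplication by the likelihood followed by normalisation. The Python sketch below illustrates one such step under these assumptions; the transition matrix and the likelihood values are invented for the example and are not part of the thesis.

    import numpy as np

    def bayes_filter_step(pi_filt, K, g):
        """One step of the Bayesian filter on a finite state space.

        pi_filt : pi_{t|t}, probability vector over the states
        K       : transition matrix, K[i, j] = K_t(x_i, {x_j})
        g       : likelihood vector, g[j] = g_{t+1}(x_j) for the new observation
        """
        pi_pred = pi_filt @ K                    # prediction, Eq. (3.8)
        unnorm = g * pi_pred                     # update, Eq. (3.9): g_{y_{t+1}} * pi_{t+1|t}
        return pi_pred, unnorm / unnorm.sum()

    K = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
    pi = np.array([0.5, 0.5])
    for g in ([0.9, 0.2], [0.1, 0.7]):           # likelihoods of two successive observations
        pi_pred, pi = bayes_filter_step(pi, K, np.array(g))
    print(pi)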

3.4 Another Formulation of the Bayesian Filter

For future purposes, we derive another formulation for the Bayesian filter. Let $\mathcal{C}$ be the collection of sets C of the form A × B, where A, B ∈ B(R^k). For these sets, we define a set function µ : $\mathcal{C}$ → [0, 1] as

$$\mu(C) \triangleq \int \left[ \int \chi_C(x_t, x_{t+1})\, K_t(x_t, dx_{t+1}) \right] \pi_{t|t}(dx_t), \qquad C \in \mathcal{C}.$$

By the definition of C, χ_C(x_t, x_{t+1}) = χ_A(x_{t+1}) χ_B(x_t) and, equivalently, one can write

$$\mu(C) = \int_B K_t(x_t, A)\, \pi_{t|t}(dx_t).$$

Because for all A ∈ B(R^k), K_t(·, A) is B(R^k)/B([0, 1])-measurable, bounded, and nonnegative, it is also π_{t|t}-integrable. Thus, µ(C) is well defined.

Let us then show that µ satisfies the conditions of Proposition 4. Clearly, (i) holds because µ is finite. Let $C \in \mathcal{C}$ be a countable union of disjoint sets $\{C_i \in \mathcal{C}\}_{i=1}^{\infty}$. Then,

$$\mu(C) = \int \left[ \int \sum_{i=1}^{\infty} \chi_{C_i}(x_t, x_{t+1})\, K_t(x_t, dx_{t+1}) \right] \pi_{t|t}(dx_t) = \sum_{i=1}^{\infty} \int \left[ \int \chi_{C_i}(x_t, x_{t+1})\, K_t(x_t, dx_{t+1}) \right] \pi_{t|t}(dx_t) = \sum_{i=1}^{\infty} \mu(C_i).$$

The same applies for finite unions. Here, the order of integration and countable summation can be interchanged by the monotone convergence theorem. Thus, according to Proposition 4, there is a unique extension of µ to B(R^{2k}). Let this extension be denoted by π_{t+1:t|t}. By the definitions of π_{t+1|t} and π_{t+1:t|t},

$$\pi_{t+1|t}(A) = \int K_t(x_t, A)\, \pi_{t|t}(dx_t) = \pi_{t+1:t|t}(A \times \mathbb{R}^k),$$

which, according to Proposition 3, implies that

$$\int_A g_t \, d\pi_{t|t-1} = \int_{A \times \mathbb{R}^k} g_t \, d\pi_{t:t-1|t-1}.$$

Consequently, the update equation of the Bayesian filter can be equivalently written as

$$\pi_{t|t}(A) = \frac{\int_A g_t \, d\pi_{t|t-1}}{\int g_t \, d\pi_{t|t-1}} = \frac{\int_{A \times \mathbb{R}^k} g_t \, d\pi_{t:t-1|t-1}}{\int_{\mathbb{R}^k \times \mathbb{R}^k} g_t \, d\pi_{t:t-1|t-1}}.$$

This formulation plays an important role in the description of the approximate, Monte Carlo based Bayesian filtering algorithms in Chapter 5.

3.5 Concluding Remarks

The Bayesian filter is a theoretical result which is difficult to realise in practice. This is because both of the equations in Definition 19, as well as integrals with respect to π_{t|t}, are, in general, impossible to evaluate exactly. This intractability has been the driving force behind the development of the Monte Carlo methods described in Chapter 5.

It is worth mentioning that there are exceptions that allow the exact evaluation of the recursion formulas. However, these exceptions often require intolerably inaccurate models for the application at hand. One example is the case of a Markov chain {x_t, F_t}_{t=0}^∞ where x_t can only have a finite number of values, i.e. x_t is a discrete random variable. This implies that the integrations in the recursion, as well as posterior expectations, become finite sums and are thus possible to evaluate.

Another example which allows exact evaluation is a model of the form described in Example 1 in which v_t and w_t are zero mean, normally distributed, additive random variables and the mappings f_t and h_t are both linear in x_t. In this case, if π_{0|0} is a normal distribution, the prediction distribution π_{1|0} in the Bayesian filter will be a normal distribution as well. The normality of π_{1|0} also implies that π_{1|1} will be a normal distribution and, subsequently, all prediction and posterior distributions in the Bayesian filter will be normal distributions, and the evaluation of the recursion reduces to forming a sequence of means and covariance matrices that completely define the normal distributions.

Although the linearity and normality assumptions are often found to be too restrictive, the significance of this class of models should not be undervalued. In fact, it was shown by Ho and Lee [1964] that under these conditions the mean and covariance matrix of the Bayesian posterior distribution are obtained by the well known Kalman filter algorithm, which has been successfully applied ever since its introduction in 1960 [see, e.g. Anderson and Moore, 1979].
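For the linear-Gaussian model described above, all prediction and posterior distributions are normal and the recursion reduces to propagating means and covariance matrices; this is the Kalman filter. The Python sketch below is a minimal, textbook-style illustration of one prediction/update step; the system matrices are placeholders chosen for the example and are not taken from this thesis.

    import numpy as np

    def kalman_step(m, C, F, Q, H, R, y):
        """One step for x_{t+1} = F x_t + v_t, y_{t+1} = H x_{t+1} + w_{t+1}."""
        # Prediction: pi_{t+1|t} = N(m_pred, C_pred)
        m_pred = F @ m
        C_pred = F @ C @ F.T + Q
        # Update: pi_{t+1|t+1} = N(m_new, C_new)
        S = H @ C_pred @ H.T + R                 # innovation covariance
        K = C_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        m_new = m_pred + K @ (y - H @ m_pred)
        C_new = C_pred - K @ H @ C_pred
        return m_new, C_new

    F = np.array([[1.0, 1.0], [0.0, 1.0]])       # placeholder system matrices
    Q = 0.01 * np.eye(2)
    H = np.array([[1.0, 0.0]])
    R = np.array([[0.5]])
    m, C = np.zeros(2), np.eye(2)
    m, C = kalman_step(m, C, F, Q, H, R, y=np.array([1.2]))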


Chapter 4

Monte Carlo Methods

The research and development of Monte Carlo methods for approximating integrals started around the year 1950. Because Monte Carlo methods involve intensive computation, it is not an accident that the first Monte Carlo methods date back to the early days of electronic computing. In this chapter, some fundamental results and definitions related to Monte Carlo methods will be introduced in a measure theoretic form. For an extensive introduction to various Monte Carlo methods, the reader should consult [Robert and Casella, 1999, Hammersley and Handscomb, 1964]. More details can be found, e.g. in [Rubinstein, 1981, Gamerman, 2002, Liu, 2001]. Although the generation of random variates is a significant aspect of Monte Carlo methods, both implementationally and theoretically, the topic is mostly omitted here. For an introduction to random variate generation, see [Robert and Casella, 1999, Rubinstein, 1981] and for a more thorough discussion, see [Devroye, 1986].

Section 4.1 defines some important forms of convergence of sequences of random variables and also describes some well known limit theorems. In Section 4.2, the definition of classical Monte Carlo integration is given with some remarks on its convergence. Section 4.3 describes the concept of importance sampling and gives some additional remarks regarding the choice of the importance distribution and its assessment. The chapter is concluded in Section 4.5 by the description of a random variate generation method known as the rejection method.

4.1 Convergence of Sequences of Random Variables

The use of Monte Carlo methods is based on various types of convergence of sequences of random variables. Before moving into the details of Monte Carlo, this section describes some of the most essential forms of convergence and limit theorems.

Definition 20 Let {x_i}_{i=1}^∞ be a sequence of random variables, and x a random variable. If $\lim_{i\to\infty} x_i \overset{P\text{-a.e.}}{=} x$, then we write $x_i \overset{\text{a.e.}}{\to} x$, and the sequence {x_i}_{i=1}^∞ is said to converge almost surely to the random variable x.


Almost sure convergence is also often referred to as convergence with probability one [see, e.g. Jazwinski, 1970].

One of the best known applications of almost sure convergence is the following limit theorem [see, e.g. Shiryayev, 1984, page 366].

Theorem 6 (Strong law of large numbers, SLLN) Suppose that {x_i}_{i=1}^∞ is a sequence of independent and identically distributed (IID) random variables such that E[|x_1|] < ∞ and E[x_1] = m. Define $s_n(\omega) \triangleq \frac{1}{n}\sum_{i=1}^{n} x_i(\omega)$. Then

$$\lim_{n \to \infty} s_n \overset{P\text{-a.e.}}{=} m.$$

The strong law of large numbers (SLLN) is the most fundamental theorem on which the Monte Carlo methods are based. However, almost sure convergence does not say much about the statistical behaviour of the sequence of random variables, which is illustrated by the following example.

Example 2: Define $s_n \triangleq n^{-1}\sum_{i=1}^{n} \sqrt{2/\pi}\,|x_i|^{-1}$, where {x_i}_{i=1}^n are IID random variables whose distribution has the density $q(x) = \frac{1}{2}|x|\, e^{-\frac{1}{2}x^2}$. Then, according to the SLLN, $s_n \overset{\text{a.e.}}{\to} 1$, but V[s_n] = ∞ for all n ∈ N.

From the practical point of view, more useful results on the convergence of a sequence of random variables can be obtained by using the following weaker form of convergence.

Definition 21 Let {x_i}_{i=1}^∞ be a sequence of random variables and let x be a random variable. If lim_{i→∞} E[f(x_i)] = E[f(x)] for all bounded continuous functions f, then we write

$$x_i \overset{d}{\to} x,$$

and the sequence {x_i}_{i=1}^∞ is said to converge in distribution to the random variable x.

The convergence in distribution is weaker than the almost sure convergence in the sense that $x_n \overset{\text{a.e.}}{\to} x \implies x_n \overset{d}{\to} x$ [Shiryayev, 1984, page 254].

For some distributions that are frequently used in the remainder of this thesis, it is conventional to introduce specific notations. A normal distribution with the mean m ∈ R^k and the covariance matrix C ∈ R^{k×k} will be denoted by N(m, C) and its density with respect to λ^k is denoted by f_N(·; m, C). Another frequently appearing distribution is the uniform distribution on some set A. This distribution will be denoted by U(A). If a sequence {x_i}_{i=1}^∞ of random variables converges in distribution to a normal distribution or a uniform distribution, the notations $x_i \overset{d}{\to} \mathrm{N}(m, C)$ and $x_i \overset{d}{\to} \mathrm{U}(A)$ are used, respectively.

One of the best known applications of convergence in distribution is the following well known theorem [see, e.g. Williams, 2001].

following well known theorem [see, e.g. Williams, 2001].


Theorem 7 (Central Limit Theorem, CLT) Suppose that {x_i}_{i=1}^∞ is a sequence of IID random variables taking values in R^k such that E[x_1] = m and V[x_1] = C ∈ R^{k×k}. Let $s_n \triangleq n^{-1}\sum_{i=1}^{n} x_i$. Then

$$\sqrt{n}(s_n - m) \overset{d}{\to} \mathrm{N}(0_k, C).$$

Example 3: Define $s_n \triangleq n^{-1}\sum_{i=1}^{n} \sqrt{\pi/2}\,|x_i|$, where {x_i}_{i=1}^n are IID random variables whose distribution is N(0, 1). Because $\mathrm{E}[\sqrt{\pi/2}\,|x_i|] = 1$ and $\mathrm{E}[(\pi/2)\,x_i^2] = \pi/2$, then according to the CLT, $\sqrt{n}(s_n - 1) \overset{d}{\to} \mathrm{N}(0, \pi/2 - 1)$, implying that for large n, s_n has approximately the distribution N(1, n^{-1}(π/2 − 1)).

4.2 Classical Monte Carlo Integration

Suppose that there is a probability space (Ω, F, P), an F/B(R^k)-measurable function x : Ω → R^k, and a B(R^k)/B(R)-measurable mapping h : R^k → R, and that the expectation

$$\Upsilon = \mathrm{E}[h(x)] = \int h(x)\, P_x(dx) \tag{4.1}$$

is to be evaluated. Often, the evaluation of the integral is intractable and it has to be approximated in some way. Monte Carlo integration methods are based on approximating the probability measure P_x in Equation (4.1) by a discrete measure, for which we have the following proposition.

Proposition 6 Suppose that w^i are nonnegative real numbers and x^i ∈ R^k for all i = 1, 2, …, n. Let us define a set function π^n : B(R^k) → [0, ∞) as

$$\pi^n(A) \triangleq \sum_{i=1}^{n} w^i \chi_A(x^i), \qquad A \in \mathcal{B}(\mathbb{R}^k).$$

Then π^n is a measure on B(R^k) and it is called a discrete measure. If, in addition, $\sum_{i=1}^{n} w^i = 1$, then π^n is a discrete probability measure.

Proof: Clearly, π^n(∅) = 0. Let A = ⋃_{i=1}^∞ A_i, where A_i ∈ B(R^k) are disjoint. Then

$$\pi^n(A) = \sum_{j=1}^{n} w^j \chi_A(x^j) = \sum_{j=1}^{n} w^j \sum_{i=1}^{\infty} \chi_{A_i}(x^j) = \sum_{i=1}^{\infty} \sum_{j=1}^{n} w^j \chi_{A_i}(x^j) = \sum_{i=1}^{\infty} \pi^n(A_i),$$

i.e. the countable additivity holds. Clearly, if $\sum_{i=1}^{n} w^i = 1$, then π^n(R^k) = 1.

There are different ways of constructing discrete probability measures to approximate P_x. The most straightforward approximation is given by the following definition.


Definition 22 Let {x^i}_{i=1}^n be a sequence of independent random variables taking values in R^k with distribution P_x. Then an unweighted discrete approximation of P_x is defined as

$$P^n_x(A) \triangleq \frac{1}{n} \sum_{i=1}^{n} \chi_A(x^i), \qquad A \in \mathcal{B}(\mathbb{R}^k). \tag{4.2}$$

To see that P^n_x indeed approximates P_x, it is noted that according to the SLLN

$$P^n_x(A) = \frac{1}{n} \sum_{i=1}^{n} \chi_A(x^i) \overset{\text{a.e.}}{\to} \int_A dP_x = P_x(A).$$

Replacing the measure P_x by its unweighted discrete approximation P^n_x in Equation (4.1) yields the classical Monte Carlo approximation

$$\Upsilon^n_{\mathrm{MC}} \triangleq \int h(x)\, dP^n_x = \frac{1}{n} \sum_{i=1}^{n} h(x^i), \tag{4.3}$$

where x^i are independent random variables with distribution P_x. In the remainder of this thesis, the notation x ∼ π, where π is a probability measure, will be used for denoting that x has the distribution π. The notation x^i ∼ π, i = 1, 2, …, n is used for denoting that x^i, i = 1, 2, …, n are IID random variables with a common distribution π.

Because of the measurability assumptions on h, it follows that {h(x^i)}_{i=1}^n is a sequence of IID random variables. Then, by the SLLN, it follows that $\Upsilon^n_{\mathrm{MC}} \overset{\text{a.e.}}{\to} \Upsilon$ if E[|h(x)|] < ∞. Note that, by the properties of the integral, E[|h(x)|] < ∞ ⟺ |Υ| < ∞. Therefore, a sufficient condition for the SLLN to apply is |Υ| < ∞. If, in addition, E[h²(x)] < ∞, then the CLT ensures that $\Upsilon^n_{\mathrm{MC}} \overset{d}{\to} \mathrm{N}(\Upsilon, \mathrm{V}[h(x)]/n)$. Although the CLT only gives an approximation of the distribution, it can straightforwardly be seen that

$$\mathrm{E}[\Upsilon^n_{\mathrm{MC}}] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}[h(x^i)] = \mathrm{E}[h(x^i)] = \Upsilon,$$
$$\mathrm{V}[\Upsilon^n_{\mathrm{MC}}] = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{V}[h(x^i)] = n^{-1}\, \mathrm{V}[h(x^i)],$$

for all n ∈ N. The distribution, however, is not in general normal.

At this point, a few remarks are in order. First, it is clear by the definition that the unweighted discrete approximation of a probability measure is nondeterministic, i.e. random. The theory of the probabilistic properties of these random measures is beyond the scope of this thesis; it is discussed in more detail, e.g. in [Kallenberg, 1983]. The main interest will be on the probabilistic properties of the integrals. These properties can be accessed through the sum representation, such as the one in Equation (4.3).

The second remark is that in order to obtain a numerical value for the classical Monte Carlo integral approximation, one should have a realisation of the random measure. This is equivalent to having a realisation of all the n independent random variables x^i that define the measure. In other words, one should be able to simulate or generate n independent random variates according to a given distribution. For this reason, a sequence of random variables is often identified with its realisation. When referring to the realisation of the random variables, the sequence {x^i}_{i=1}^n is called an IID sample of size n and the elements of the sample are called particles. The discussion related to random variate generation has been mostly omitted in this thesis. More details on the topic can be found, e.g. in [Robert and Casella, 1999, Rubinstein, 1981, Devroye, 1986].
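As a minimal illustration of Equation (4.3), the following Python sketch approximates Υ = E[h(x)] for an invented test case, x ∼ N(0, 1) and h(x) = x², whose exact value is 1. The test function and distribution are assumptions made only for this sketch.

    import numpy as np

    rng = np.random.default_rng(2)

    def classical_mc(h, sample, n):
        x = sample(n)                    # IID sample x^1, ..., x^n from P_x
        return h(x).mean()               # Eq. (4.3)

    h = lambda x: x ** 2
    approx = classical_mc(h, rng.standard_normal, 10_000)
    print(approx)                        # close to E[x^2] = 1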

4.3 Importance Sampling

Let us extend our considerations of random integral approximations to integrals of the form

$$\Upsilon = \int h(x)\, \pi(dx), \tag{4.4}$$

where π is a σ-finite measure on B(R^k) but not necessarily a probability measure. Suppose that there is a probability measure γ on B(R^k) such that π ≪ γ. The absolute continuity ensures that the RND dπ/dγ exists¹. Then, according to Lemma 1, the integral in Equation (4.4) can be written as

$$\Upsilon = \int h(x)\, \frac{d\pi}{d\gamma}(x)\, \gamma(dx). \tag{4.5}$$

Since γ is a probability measure, one can construct a classical Monte Carlo approximation of Υ by simulating an IID sample according to γ. This corresponds to approximating the measure in Equation (4.4) by the discrete measure given by the following definition.

Definition 23 Suppose that π is a σ-finite measure on B(R^k), and γ is a probability measure on B(R^k) such that π ≪ γ. Let {x^i}_{i=1}^n be a sequence of IID random variables such that x^i ∼ γ. Then a weighted discrete approximation of π using γ is defined as

$$\pi^n_\gamma(A) \triangleq \frac{1}{n} \sum_{i=1}^{n} \frac{d\pi}{d\gamma}(x^i)\, \chi_A(x^i), \qquad A \in \mathcal{B}(\mathbb{R}^k).$$

Similarly as in Section 4.2, the SLLN ensures that

$$\pi^n_\gamma(A) = \frac{1}{n} \sum_{i=1}^{n} \frac{d\pi}{d\gamma}(x^i)\, \chi_A(x^i) \overset{\text{a.e.}}{\to} \int_A \frac{d\pi}{d\gamma}\, d\gamma = \int_A d\pi = \pi(A),$$

¹Because µ(A) = ∫_A h dπ, A ∈ B(R^k), defines a signed measure µ on B(R^k), it in fact suffices to choose γ such that µ ≪ γ, which does not require π ≪ γ. Therefore, π ≪ γ is not a necessary condition. This only applies to a fixed function h. For an arbitrary integrand h, π ≪ γ is required.


i.e. that π^n_γ indeed approximates π. The substitution of π by its weighted discrete approximation using γ yields the importance sampling approximation

$$\Upsilon^n_\gamma = \int h\, d\pi^n_\gamma = \frac{1}{n} \sum_{i=1}^{n} h(x^i)\, \frac{d\pi}{d\gamma}(x^i). \tag{4.6}$$

The use of the term 'importance' in the name of the approximation is said to have been proposed by A. Marshall in 1956 and originates from the idea that, in the approximation of the measure π, more samples should be simulated from the important parts of the integration region [Liu, 2001]. What is meant by important in this case will be discussed in more detail in Section 4.3.2. For now, it is only noted that the tool for defining the important regions is the probability distribution γ: in the design of γ, high probability should be assigned to the important regions. Hence, it is common practice to call γ the importance distribution.

Similarly as for the classical Monte Carlo, the almost sure convergence of importance sampling is also ensured by the SLLN if |Υ| < ∞. This follows from

$$\Upsilon^n_\gamma = \frac{1}{n} \sum_{i=1}^{n} h(x^i)\, \frac{d\pi}{d\gamma}(x^i) \overset{\text{a.e.}}{\to} \mathrm{E}\!\left[ h(x^i)\, \frac{d\pi}{d\gamma}(x^i) \right] = \int h\, \frac{d\pi}{d\gamma}\, d\gamma = \int h\, d\pi = \Upsilon.$$

Straightforwardly, it is seen that E[Υ^n_γ] = Υ and

$$\mathrm{V}[\Upsilon^n_\gamma] = \frac{1}{n}\left( \mathrm{E}\!\left[ \left( h(x^i)\, \frac{d\pi}{d\gamma}(x^i) \right)^2 \right] - \Upsilon^2 \right). \tag{4.7}$$

Unlike in classical Monte Carlo, ∫ h² dπ < ∞ is not a sufficient condition for ensuring that the variance of the approximation is finite. According to Equation (4.7), the boundedness of E[h²(x^i)(dπ/dγ)²(x^i)] is a sufficient condition. Moreover, this is a sufficient condition for the CLT to ensure that

$$\sqrt{n}(\Upsilon^n_\gamma - \Upsilon) \overset{d}{\to} \mathrm{N}\!\left(0,\ \mathrm{V}\!\left[ h(x)\, \frac{d\pi}{d\gamma}(x) \right]\right).$$

A more easily verifiable sufficient condition is given by the following proposition [Geweke, 1989].

Proposition 7 Let Υ^n_γ be an importance sampling approximation of Υ = ∫ h dπ using the importance distribution γ. If there are a, b ∈ [0,∞) such that dπ/dγ ≤ a and ∫ h² dπ ≤ b, then $\sqrt{n}(\Upsilon^n_\gamma - \Upsilon) \overset{d}{\to} \mathrm{N}(0, \mathrm{V}[h(x)(d\pi/d\gamma)(x)])$, where x ∼ γ.

Proof:

$$\mathrm{E}\!\left[ \left( h(x)\, \frac{d\pi}{d\gamma}(x) \right)^2 \right] = \int \left( h\, \frac{d\pi}{d\gamma} \right)^2 d\gamma = \int h^2\, \frac{d\pi}{d\gamma}\, d\pi \leq a \int h^2\, d\pi \leq ab < \infty.$$

The CLT and the definition of Υ^n_γ ensure the convergence in distribution.


Often, it is the case that π and γ are probability measures on B(R^k) with densities f and q, respectively, with respect to λ^k. If, in addition, π ≪ γ, then Lemma 2 can be applied, yielding

$$\frac{d\pi}{d\gamma}(x) \overset{\gamma\text{-a.e.}}{=} \frac{d\pi/d\lambda^k}{d\gamma/d\lambda^k}(x) = \frac{f(x)}{q(x)}.$$

This implies that the importance sampling approximation of Υ = ∫ h dπ is

$$\Upsilon^n_\gamma = \frac{1}{n} \sum_{i=1}^{n} h(x^i)\, \frac{f(x^i)}{q(x^i)}, \qquad x^i \sim \gamma.$$

This is perhaps the most commonly given formulation of the importance sampling approximation of Υ [see, e.g. Robert and Casella, 1999, Rubinstein, 1981, Gamerman, 2002]. This section is concluded by two illustrative examples of importance sampling.

Example 4: Define h(x) ≜ 1, $f(x) \triangleq e^{-\frac{1}{2}x^2}/\sqrt{2\pi}$, and $q(x) \triangleq \frac{1}{2}|x|\, e^{-\frac{1}{2}x^2}$, where f and q are the densities of π and γ, respectively. In this case, Υ = ∫ h dπ = 1, and s_n defined in Example 2 is the importance sampling approximation of Υ. The variance of the approximation remains infinite, because sup_x f(x)/q(x) = ∞ and the conditions in Proposition 7 are not satisfied.

Example 5: Define h(x) ≜ 1, $f(x) \triangleq \frac{1}{2}|x|\, e^{-\frac{1}{2}x^2}$ and $q(x) \triangleq e^{-\frac{1}{2}x^2}/\sqrt{2\pi}$, where f and q are the densities of π and γ, respectively. Then Υ = ∫ h dπ = 1, and s_n defined in Example 3 is the importance sampling approximation of Υ. The conditions of Proposition 7 are satisfied, the variance of the approximation is finite, and the CLT ensures convergence in distribution.

4.3.1 Approximate Importance Sampling

Let us make the additional assumption that π in Equation (4.4) is a probability measure and that f is its density with respect to the Lebesgue measure. Often, especially in Bayesian inference, a situation is encountered where f is known only up to proportionality. This happens, e.g. when f represents the density of a Bayesian posterior distribution. The posterior density is easy to obtain up to proportionality as a product of the prior density and the likelihood function. However, if the density were to be evaluated exactly, one would need the proportionality coefficient, whose evaluation involves an intractable integral.

Also, it is possible that the density of the importance distribution q is known only up to proportionality. An example of such a situation will be discussed in more detail in Section 4.5. Even if both of the densities f and q are known only up to proportionality, the integral in Equation (4.5) can still be approximated by the following approximate importance sampling method. It should be pointed out that often in the literature no clear distinction is made between importance sampling and approximate importance sampling, although their probabilistic properties are different, as will be shown below.

Let f* = cf and q* = dq, where c and d are the unknown proportionality coefficients. Then

$$\int \frac{f^*}{q^*}\, d\gamma = \frac{c}{d} \int \frac{f}{q}\, d\gamma = \frac{c}{d} \int \frac{d\pi}{d\gamma}\, d\gamma = \frac{c}{d}, \tag{4.8}$$

where we use the assumption that π is a probability measure. Thus Υ can be equivalently written as

$$\Upsilon = \int h\, \frac{f}{q}\, d\gamma = \frac{d}{c} \int h\, \frac{f^*}{q^*}\, d\gamma = \frac{\int h f^*/q^*\, d\gamma}{\int f^*/q^*\, d\gamma}. \tag{4.9}$$

Now classical Monte Carlo approximations for the numerator and the denominator can be constructed, yielding

$$\tilde\Upsilon^n_\gamma = \frac{\frac{1}{n}\sum_{i=1}^{n} h(x^i) f^*(x^i)/q^*(x^i)}{\frac{1}{n}\sum_{i=1}^{n} f^*(x^i)/q^*(x^i)}, \qquad x^i \sim \gamma. \tag{4.10}$$

Note that the same random samples are used for approximating the numerator and the denominator. Although this approximation also converges almost surely to Υ, it should not be confused with the one in Equation (4.6).
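A minimal Python sketch of the approximate importance sampling estimator (4.10), assuming for illustration that the target is π = N(1, 1), the importance distribution is γ = N(0, 2), h(x) = x, and both densities are available only as the unnormalised functions f* and q*. These choices are assumptions made for the example, not taken from the thesis.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    x = rng.normal(0.0, np.sqrt(2.0), size=n)           # x^i ~ gamma = N(0, 2)

    h      = lambda x: x
    f_star = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)    # unnormalised target density, pi = N(1, 1)
    q_star = lambda x: np.exp(-0.25 * x ** 2)           # unnormalised importance density

    r = f_star(x) / q_star(x)
    upsilon_tilde = np.sum(h(x) * r) / np.sum(r)        # Eq. (4.10); close to E_f[h] = 1
    print(upsilon_tilde)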

To address the convergence of $\tilde\Upsilon^n_\gamma$ as n → ∞, let n̄ and d̄ denote the numerator and the denominator of $\tilde\Upsilon^n_\gamma$ in Equation (4.10), respectively. According to the CLT, E_q[h²f²/q²] = E_f[h²f/q] < ∞ and E_q[f²/q²] = E_f[f/q] < ∞ are sufficient conditions to ensure that [Monahan, 2001, page 330]

$$\begin{bmatrix} \bar n \\ \bar d \end{bmatrix} \overset{d}{\to} \mathrm{N}\!\left( \frac{c}{d} \begin{bmatrix} \mathrm{E}_f[h] \\ 1 \end{bmatrix},\ \frac{c^2}{n d^2} \begin{bmatrix} \mathrm{V}_q[hf/q] & \mathrm{cov}_q(hf/q, f/q) \\ \mathrm{cov}_q(hf/q, f/q) & \mathrm{V}_q[f/q] \end{bmatrix} \right). \tag{4.11}$$

By assuming that the relative errors of the approximations of the numerator and denominator are small, i.e.

$$\frac{\bar d - c/d}{c/d} \approx 0, \qquad \frac{\bar n - \frac{c}{d}\mathrm{E}_f[h]}{\frac{c}{d}\mathrm{E}_f[h]} \approx 0,$$

one obtains the following approximation [Monahan, 2001, page 330]

$$\frac{\bar n}{\bar d} \approx \mathrm{E}_f[h] + \frac{\bar n}{c/d} - \frac{\mathrm{E}_f[h]}{c/d}\,\bar d.$$

It should be noted that n̄ and d̄ are random. According to this approximation and the limiting normal distribution in Equation (4.11), it straightforwardly follows that, for large n, $\tilde\Upsilon^n_\gamma$ is approximately normally distributed with [Geweke, 1989, Monahan, 2001]

$$\mathrm{E}[\tilde\Upsilon^n_\gamma] \approx \mathrm{E}_f[h], \tag{4.12}$$
$$\mathrm{V}[\tilde\Upsilon^n_\gamma] \approx \frac{1}{n}\, \mathrm{V}_q\!\left[ h\frac{f}{q} - \mathrm{E}_f[h]\,\frac{f}{q} \right]. \tag{4.13}$$


To conclude, it is observed that the price one must pay for knowing the densities f and q only up to proportionality is that the analysis of the approximation error becomes complicated and $\tilde\Upsilon^n_\gamma$ is found to be only asymptotically unbiased. Therefore, it is obvious that $\tilde\Upsilon^n_\gamma$ and $\Upsilon^n_\gamma$ are different approximations of Υ. It is also noted that the asymptotic variance of $\tilde\Upsilon^n_\gamma$ is not necessarily equal to $\mathrm{V}[\Upsilon^n_\gamma]$ in Equation (4.7). Therefore, $\tilde\Upsilon^n_\gamma$ and $\Upsilon^n_\gamma$ cannot be said to be even asymptotically equal. If, however, Υ = E_f[h] = 0, then the asymptotic variance of $\tilde\Upsilon^n_\gamma$ and $\mathrm{V}[\Upsilon^n_\gamma]$ can be straightforwardly shown to be equal.

Although $\tilde\Upsilon^n_\gamma$ is regarded as an approximation of $\Upsilon^n_\gamma$, it has been pointed out, e.g. in [Robert and Casella, 1999, page 85], that in some situations $\tilde\Upsilon^n_\gamma$ can actually perform better than $\Upsilon^n_\gamma$ in terms of variance. An example of such a situation is given by the following example.

Example 6: Let us define Υ ≜ ∫ h dπ, h(x) ≜ x, f(x) ≜ f_N(x; m, 1), q(x) ≜ f_N(x; m, 2), $f^*(x) \triangleq e^{-\frac{1}{2}(x-m)^2}$, and $q^*(x) \triangleq e^{-\frac{1}{4}(x-m)^2}$, where f and q are the densities of π and γ, respectively. In this case, it can be shown that

$$\mathrm{V}[\Upsilon^n_\gamma] = \frac{1}{n}\left( \left( \frac{2}{\sqrt{3}} - 1 \right) m^2 + \frac{4\sqrt{3}}{9} \right),$$

and the asymptotic variance of $\tilde\Upsilon^n_\gamma$ is $v = 4\sqrt{3}/(9n)$. Clearly $\mathrm{V}[\Upsilon^n_\gamma] \geq v$, and equality occurs if m = 0.
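The variance comparison in Example 6 can be checked empirically by repeating both estimators many times. The Python sketch below uses m = 2 and modest sample sizes, so the empirical variances only roughly match the analytical expressions; it is a sanity check, not part of the thesis.

    import numpy as np

    rng = np.random.default_rng(5)
    m, n, reps = 2.0, 1_000, 2_000

    f_over_q = lambda x: np.sqrt(2.0) * np.exp(-0.25 * (x - m) ** 2)   # f/q for Example 6

    plain, selfnorm = np.empty(reps), np.empty(reps)
    for r in range(reps):
        x = rng.normal(m, np.sqrt(2.0), size=n)        # x^i ~ gamma = N(m, 2)
        w = f_over_q(x)
        plain[r] = np.mean(x * w)                      # Eq. (4.6)
        selfnorm[r] = np.sum(x * w) / np.sum(w)        # Eq. (4.10); constants cancel
    print(plain.var(), ((2 / np.sqrt(3) - 1) * m ** 2 + 4 * np.sqrt(3) / 9) / n)
    print(selfnorm.var(), 4 * np.sqrt(3) / (9 * n))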

4.3.2 Minimum Variance Importance Distribution

When approximating a single integral Υ = ∫ h dπ for a given π-integrable function h, the choice of importance distribution is theoretically straightforward. This simplicity arises from the following proposition. The proposition and its proof in a slightly less general setting can be found, e.g. in [Robert and Casella, 1999, Rubinstein, 1981]. The proof given below follows closely the proof given in [Rubinstein, 1981, pages 122-123].

Proposition 8 Suppose that π is a σ-finite measure on B(R^k) and h : R^k → R is a π-integrable function. Let γ be a probability measure on B(R^k) such that µ ≪ γ, where µ is the signed measure defined as µ(A) ≜ ∫_A h dπ, A ∈ B(R^k). Let Υ^n_γ denote an importance sampling approximation of Υ = ∫ h dπ, using a sample of size n from γ. Then,

$$\mathrm{V}[\Upsilon^n_\gamma] \geq \mathrm{V}[\Upsilon^n_{\gamma^*}],$$

where $\gamma^*(A) \triangleq \int_A |h|\, d\pi \Big/ \int |h|\, d\pi$, A ∈ B(R^k).

Proof: By the elementary properties of variance, it follows that

$$\mathrm{V}[\Upsilon^n_\gamma] = \mathrm{V}\!\left[ \frac{1}{n} \sum_{i=1}^{n} \frac{d\mu}{d\gamma}(x^i) \right] = \frac{1}{n}\, \mathrm{V}\!\left[ \frac{d\mu}{d\gamma}(x^i) \right] = \frac{1}{n}\, \mathrm{E}\!\left[ \left( \frac{d\mu}{d\gamma}(x^i) \right)^2 \right] - \frac{\Upsilon^2}{n}.$$

By noting that µ ≪ γ* ≪ π, Lemma 2 implies that

$$\frac{d\mu}{d\gamma^*} = \frac{d\mu/d\pi}{d\gamma^*/d\pi} = \frac{h}{|h| / \int |h|\, d\pi}, \qquad \gamma^*\text{-a.e.}$$

Therefore, for the claimed optimal importance distribution γ* we have

$$\mathrm{E}\!\left[ \left( \frac{d\mu}{d\gamma^*} \right)^2 \right] = \mathrm{E}\!\left[ \frac{h^2}{h^2 / (\int |h|\, d\pi)^2} \right] = \left( \int |h|\, d\pi \right)^2.$$

On the other hand, by the Hölder inequality [see, e.g. Gariepy and Ziemer, 1995, page 146],

$$\left( \int |h|\, d\pi \right)^2 = \left( \int \frac{|h|\, (d\gamma/d\pi)^{1/2}}{(d\gamma/d\pi)^{1/2}}\, d\pi \right)^2 \leq \int \frac{h^2}{d\gamma/d\pi}\, d\pi \int \frac{d\gamma}{d\pi}\, d\pi = \int \frac{(d\mu/d\pi)^2}{d\gamma/d\pi}\, d\pi = \mathrm{E}\!\left[ \left( \frac{d\mu}{d\gamma}(x) \right)^2 \right], \qquad x \sim \gamma.$$

An implication of this theorem is that if π is a probability measure, an importance sampling approximation of Υ with an appropriate choice of γ may perform better than classical Monte Carlo integration in terms of smaller variance. This follows from the observation that the classical Monte Carlo approximation is a special case of importance sampling where γ = π. However, the result is only theoretical, since in general the evaluation of dµ/dγ* requires the evaluation of ∫ |h| dπ, which, except for the absolute value, is exactly the original problem and, therefore, intractable. At least Proposition 8 gives the theoretically optimal choice for the importance distribution, which can be approximated, e.g. as described in Section 4.3.1.
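As a numerical sanity check of Proposition 8 (an illustrative sketch, not part of the thesis): for a nonnegative integrand h, the ratio h·dπ/dγ* is constant, so the optimal importance sampler has zero variance. Below π = N(0, 1) and h(x) = x² are assumed, in which case γ* has density x²ϕ(x) and can be sampled as a chi(3)-distributed radius with a random sign.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5_000

    # Target: Upsilon = E_pi[x^2] = 1 with pi = N(0, 1); here h >= 0.
    x_cmc = rng.standard_normal(n)
    est_cmc = x_cmc ** 2                          # classical MC terms (gamma = pi)

    # gamma*(dx) proportional to |h| dpi = x^2 phi(x) dx: |x| ~ chi(3), sign uniform.
    r = np.sqrt(rng.chisquare(3, size=n)) * rng.choice([-1.0, 1.0], size=n)
    phi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    gstar = lambda x: x ** 2 * phi(x)             # density of gamma*; normalising constant is 1 here
    est_opt = (r ** 2) * phi(r) / gstar(r)        # h * dpi/dgamma*, constant and equal to Upsilon

    print(est_cmc.mean(), est_cmc.var() / n)      # classical MC: variance about 2/n
    print(est_opt.mean(), est_opt.var() / n)      # optimal importance sampling: variance 0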

4.3.3 Optimal Importance Distribution for Arbitrary Integrand

In Section 4.3.2, the optimal choice of importance distribution for approximating the integral Υ = ∫ h dπ was found to depend on the integrand h. In Bayesian filtering, however, we are facing a problem where the integrand cannot be fixed to any specific function. If, for example, we want to approximate the posterior measure π_{t|t}, we are interested in the functions h = χ_A for all A ∈ B(R^k). Therefore, it is not evident how the optimal importance distribution should be chosen in this case. Intuitively, it might seem reasonable to choose γ = π. Let us attempt to justify the optimality of this choice.

Because h is allowed to be arbitrary, one must choose γ such that π ≪ γ. Let µ be a measure as defined in Proposition 8. Then one has µ ≪ π ≪ γ. These absolute continuities allow us to write

$$\mu(A) = \int_A h\, d\pi = \int_A h\, \frac{d\pi}{d\gamma}\, d\gamma, \qquad A \in \mathcal{B}(\mathbb{R}^k),$$


implying that dµ/dγ = h dπ/dγ. In the proof of Proposition 8, the term to be minimised was

$$\mathrm{V}\!\left[ \frac{d\mu}{d\gamma} \right] = \mathrm{V}\!\left[ h\, \frac{d\pi}{d\gamma} \right].$$

Because the function h is arbitrary, a single RND dπ/dγ cannot be found to minimise this variance for all h. Therefore, it seems reasonable to minimise the variance of dπ/dγ, which is a common source of variance for any choice of h. Clearly, V[dπ/dγ] = 0 when dπ/dγ is constant. This happens when γ is proportional to π. Because γ is required to be a probability measure, it follows that the optimal importance measure is

$$\gamma(A) = \frac{\pi(A)}{\pi(\mathbb{R}^k)}, \qquad A \in \mathcal{B}(\mathbb{R}^k).$$

Here, it is assumed that π is a finite measure. If, in addition, π is a probability measure, then the optimal γ clearly is equal to π.

4.3.4 Effective Sample Size

Consider an approximate importance sampling approximation Υ^n_q of the integral ∫ h(x)f(x)λ^k(dx), as described in Equation (4.10). Here, f is assumed to be a probability density function and q is the probability density of the importance distribution, i.e. the importance density. For an arbitrary integrand h, it was shown in Section 4.3.3 that the optimal choice of importance density is q = f, i.e. ideally one should use classical Monte Carlo integration.

In order to assess the efficiency of a given importance density q, Geweke [1989] proposed to evaluate the relative numerical efficiency (RNE)

RNE_q ≜ (1/n) V_f[h] / ( (1/n) V_q[ h f/q − E_f[h] f/q ] ).    (4.14)

The numerator in Equation (4.14) is the variance of an importance sampling approximation using the optimal importance density q = f, had it been possible to simulate samples from it. The denominator is the asymptotic variance of the importance sampling approximation Υ^n_q. The interpretation of RNE_q is that, in terms of variance, an importance sampler with a sample size n is as accurate as a classical Monte Carlo integration with a sample size n·RNE_q.

The asymptotic variance of Υ^n_q, i.e. the denominator in Equation (4.14), can be equivalently written as

V[Υ^n_q] ≈ (1/n) ( V_q[h f/q] − 2 E_f[h] cov_q(h f/q, f/q) + E_f[h]² V_q[f/q] ).    (4.15)

It is noted that the evaluation of this variance requires the knowledge of the expectation E_f[h], which is the original problem and, therefore, unknown. Liu [1996] proposed to approximate the asymptotic variance of Υ^n_q as follows. With some simple algebraic operations, it can be shown that

V_q[h f/q] = E_f[f/q] E_f[h]² + V_f[h] E_f[f/q] + 2 E_f[h] cov_f(h, f/q) − E_f[h]² + ε,    (4.16)

where the remainder term ε is

ε = E_f[ (f/q − E_f[f/q]) (h − E_f[h])² ].    (4.17)

There is no guarantee that this remainder is negligible but, for a moment, let us assume so [Kong et al., 1994]. Also, after some simple operations, the covariance term in Equation (4.15) can be put into the following form:

cov_q(h f/q, f/q) = cov_f(h, f/q) + E_f[h] E_f[f/q] − E_f[h].    (4.18)

Substitution of Equation (4.16) and Equation (4.18) into Equation (4.15) eventually yields

V[Υ^n_q] ≈ (V_f[h]/n) ( V_q[f/q] + 1 ).    (4.19)

By replacing the exact asymptotic variance in Equation (4.14) by this approximation, one obtains an approximate RNE

RNE′_q ≜ 1 / ( V_q[f/q] + 1 ),    (4.20)

which is conveniently found to be independent of the integrand h. Because RNE′_q has roughly the same interpretation as RNE_q, it is convenient to define the effective sample size as

N_eff ≜ n / ( V_q[f/q] + 1 )    (4.21)

to represent the number of samples from the optimal importance distribution that would give the same accuracy as n samples from the importance density q [Liu, 1996, Kong et al., 1994].

Because, in general, the variance V_q[f/q] = E_q[f²/q²] − 1 cannot be evaluated exactly, it has been proposed in the literature that E_q[f²/q²] is approximated by importance sampling as described in Section 4.3.1 [see, e.g. Arulampalam et al., 2002]. Then, according to Equation (4.9), one has

E_q[f²/q²] = ∫ (f*/q*)² dγ / ( ∫ f*/q* dγ )² ≈ [ (1/n) Σ_{i=1}^n (f*(x_i)/q*(x_i))² ] / [ (1/n²) ( Σ_{i=1}^n f*(x_i)/q*(x_i) )² ] = n Σ_{i=1}^n (w_i)²,    (4.22)

where

w_i = ( f*(x_i)/q*(x_i) ) / Σ_{j=1}^n ( f*(x_j)/q*(x_j) ).


The substitution of the resulting approximation of V_q[f/q] into Equation (4.21) yields a convenient approximation

N_eff ≜ 1 / Σ_{i=1}^n (w_i)².    (4.23)

The primary benefit of using the effective sample size is its independence of the choice of h. Roughly speaking, the effective sample size is a quantity which “summarises” the performance of an importance sampler for an arbitrary choice of h. Moreover, it is noted that if the samples are simulated according to the density f, then all weights w_i in Equation (4.23) will be equal, implying that N_eff = n. Therefore, at least for nearly optimal importance distributions, N_eff seems to work reasonably.
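As a concrete illustration (added here; not part of the original text), the following minimal Python sketch computes the normalised weights of Equation (4.22) and the effective sample size of Equation (4.23) for a standard normal target density f and a wider normal importance density q; the particular densities are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Importance density q = N(0, 2^2); target density f = N(0, 1).
x = rng.normal(0.0, 2.0, size=n)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# Unnormalised importance ratios f(x_i)/q(x_i), then normalised weights w_i.
log_w = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, 2.0)
w = np.exp(log_w - log_w.max())
w /= w.sum()

n_eff = 1.0 / np.sum(w**2)        # effective sample size, Equation (4.23)
print(f"N_eff = {n_eff:.1f} out of n = {n}")
```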

It was mentioned that there is no guarantee that the remainder term ε in Equation (4.17) is negligible. Consequently, if using RNE′_q as an approximation of RNE_q, extreme caution should be exercised. For some choices of the functions h, f and q, the approximations may be remarkably erroneous, as illustrated by the following example.

Example 7: Let the integral Υ = ∫ h(x)f(x)λ^k(dx), where f(x) = fN(x; 0, 1) and h(x) = x, be approximated by Υ^n_q, where q(x) = fN(x; 0, 2). In this case,

V[Υ^n_q] ≈ (1/n) V_q[ x f(x)/q(x) − E_f[x] f(x)/q(x) ] = (1/n) V_q[ x f(x)/q(x) ] = 4√3/(9n).

Because clearly V_f[x] = 1, it follows that RNE_q = 9/(4√3) ≈ 1.3, implying that the importance sampler will perform better than classical Monte Carlo integration. Because V_q[f(x)/q(x)] = (2/3)√3 − 1, it follows that the approximate RNE is RNE′_q = 3/(2√3) ≈ 0.87, which, in turn, suggests that the importance sampler will perform worse than classical Monte Carlo. The contradiction is apparent.

An extreme contradiction is obtained by defining q(x) = |x| e^{−x²/2}/2, which can be shown to be the optimal importance distribution for the given choice of h. In this case, RNE_q = π/2 ≈ 1.57 but, because V_q[f(x)/q(x)] = ∞, it follows that RNE′_q = 0.
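The closed-form values in Example 7 are easy to check by simulation. The following short Python sketch (an illustration added here, not part of the original example) estimates V_q[x f(x)/q(x)] and V_q[f(x)/q(x)] for q = N(0, 2) and compares them with 4√3/9 and (2/3)√3 − 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Importance density q = N(0, 2) (variance 2), target density f = N(0, 1).
x = rng.normal(0.0, np.sqrt(2.0), size=n)

def pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

ratio = pdf(x, 1.0) / pdf(x, 2.0)                    # f(x)/q(x)
print(np.var(x * ratio), 4 * np.sqrt(3) / 9)         # both close to 0.770
print(np.var(ratio), 2 * np.sqrt(3) / 3 - 1)         # both close to 0.155
```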

4.4 Stratified Sampling

Several different methods for reducing the variance of random integral approximations have been proposed in the literature [see, e.g. Hammersley and Handscomb, 1964, Robert and Casella, 1999]. In this section, we describe a method called stratified sampling which, in some cases, can be shown to give more accurate approximations of integrals than classical Monte Carlo. Accuracy, in this case, means smaller variance of the approximations. Occasionally, stratified sampling is also referred to as partitioned sampling [Robert and Casella, 1999].


Most of the results given below can be found in [Cochran, 1963], [Hammersley and Handscomb, 1964], and [Robert and Casella, 1999].

Let us again consider the integral Υ = ∫ h dπ, where h : R^k → R is π-integrable and π is a probability measure on B(R^k). Let the space of interest, in this case R^k, be partitioned into n_s disjoint sets S_1, S_2, ..., S_{n_s} such that ⋃_{i=1}^{n_s} S_i = R^k. The collection S = {S_1, S_2, ..., S_{n_s}} is called a partition and the elements of S are called strata. This partition allows us to write the integral Υ equivalently as (see Theorem 18, property (iii) in Section A.3)

Υ = Σ_{i=1}^{n_s} ∫_{S_i} h(x) π(dx).    (4.24)

In principle, the integrals in Equation (4.24) could be approximated by classical Monte Carlo integration. However, this would most likely be very inefficient, because n − nπ(S_i) samples are expected to fall outside S_i and therefore contribute 0 to the resulting Monte Carlo approximation. Therefore, it seems more reasonable to use importance sampling.

It is observed that π_i(A) = ∫_A χ_{S_i} dπ, where A ∈ B(R^k), defines a measure π_i such that π_i ≪ π and dπ_i/dπ = χ_{S_i}, π-a.e. [Shiryayev, 1984, page 193]. Thus, by Lemma 1,

∫_{S_i} h(x) π(dx) = ∫ h(x) π_i(dx).

Because h is allowed to be arbitrary, the yet undefined importance distribution γ_i must be chosen to satisfy π_i ≪ γ_i. Because π_i(R^k) = π(S_i), π_i is clearly not a probability measure. Therefore, by following the conclusions of Section 4.3.3, the best choice of importance distribution would be

γ_i(A) ≜ π_i(A)/π_i(R^k) = π_i(A)/π(S_i), A ∈ B(R^k).

Clearly, dπ_i/dγ_i = π(S_i), γ_i-a.e., and, therefore,

∫_{S_i} h dπ = ∫ h dπ_i = ∫ h (dπ_i/dγ_i) dγ_i = π(S_i) ∫ h dγ_i.

By substituting this into Equation (4.24), the original integral Υ can be equivalently written as

Υ = Σ_{i=1}^{n_s} π(S_i) ∫ h(x) γ_i(dx).    (4.25)

Based on the discussion above, we are now ready to define exactly what is meant by a stratified Monte Carlo approximation.

Definition 24 Suppose that π is a probability measure on B(R^k), h : R^k → R is π-integrable, Υ = ∫ h dπ, and S = {S_1, S_2, ..., S_{n_s}} is a partition of R^k. Moreover, define γ_i(A) ≜ π(A ∩ S_i)/π(S_i) for all A ∈ B(R^k) and for all i = 1, 2, ..., n_s. Then a stratified Monte Carlo approximation Υ_ST of Υ is defined as

Υ_ST ≜ Σ_{i=1}^{n_s} π(S_i) Υ^i_ST,

where

Υ^i_ST ≜ (1/n_i) Σ_{j=1}^{n_i} h(x_j), x_j ∼ γ_i.

Clearly, the terms Υ^i_ST are classical Monte Carlo approximations of the integrals ∫ h dγ_i.

Let us take a look at the probabilistic properties of Υ_ST. An alternative version of the following fundamental theorem, and of its proof, can be found in [Cochran, 1963, pages 89–91].

Theorem 8 If Υ_ST is a stratified Monte Carlo approximation of Υ = ∫ h dπ, then

E[Υ_ST] = Υ,    (4.26)
V[Υ_ST] = Σ_{i=1}^{n_s} π²(S_i) σ²_i / n_i,    (4.27)

where σ²_i = V[h(x)], x ∼ γ_i.

Proof: Equation (4.26) can be proved as follows by the elementary properties of expectation:

E[Υ_ST] = E[ Σ_{i=1}^{n_s} (π(S_i)/n_i) Σ_{j=1}^{n_i} h(x_j) ] = Σ_{i=1}^{n_s} (π(S_i)/n_i) Σ_{j=1}^{n_i} E[h(x_j)] = Σ_{i=1}^{n_s} π(S_i) ∫ h(x) γ_i(dx) = Υ,    (4.28)

where the last equality follows from (4.25). Similarly, Equation (4.27) is proved to hold as follows:

where the last equality follows from (4.25). Similarly, Equation (4.27) is provedto hold as follows.

V[Υ_ST] = V[ Σ_{i=1}^{n_s} (π(S_i)/n_i) Σ_{j=1}^{n_i} h(x_j) ] = Σ_{i=1}^{n_s} (π²(S_i)/n²_i) Σ_{j=1}^{n_i} V[h(x_j)] = Σ_{i=1}^{n_s} π²(S_i) σ²_i / n_i,    (4.29)

where σ²_i = V[h(x)], x ∼ γ_i.

Equation (4.27) seems to imply that V[Υ_ST] depends on the choice of the numbers n_i. The collection {n_1, n_2, ..., n_{n_s}} is referred to as the sample allocation. The following proposition gives a sample allocation which minimises the variance V[Υ_ST]. A proof of the following theorem can be found, e.g. in [Cochran, 1963, pages 95–97], but the proof given below follows the one given by Stuart [1954].


Theorem 9 Let Υ_ST and Υ*_ST be stratified Monte Carlo approximations of Υ. If Υ*_ST is obtained with the sample allocation

n_i = n π(S_i) σ_i / Σ_{i=1}^{n_s} π(S_i) σ_i,    (4.30)

then

V[Υ_ST] ≥ V[Υ*_ST].

Proof: In order to minimise V[Υ_ST] it suffices to minimise nV[Υ_ST], for which we have

n V[Υ_ST] = ( Σ_{j=1}^{n_s} n_j )( Σ_{i=1}^{n_s} π²(S_i) σ²_i / n_i ) ≥ ( Σ_{i=1}^{n_s} π(S_i) σ_i )²,

where the Cauchy–Schwarz inequality has been used [see, e.g. Rudin, 1976, page 15]. Substitution of the n_i defined in Equation (4.30) straightforwardly yields equality.

The sample allocation given in Equation (4.30) is occasionally called the Neyman allocation [Cochran, 1963, page 97]. Often, it may be the case that σ²_i cannot be evaluated and therefore the Neyman allocation is infeasible. Another strategy is then to use

n_i = n π(S_i),    (4.31)

which is known as the proportional allocation [Cochran, 1963, page 91]. The feasibility of this allocation depends on the ability to evaluate π(S_i). This, too, is in general infeasible.
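To make the construction concrete, the following Python sketch (an illustration added here, not taken from the original text) forms a stratified Monte Carlo approximation of Υ = ∫ h dπ for a standard normal π partitioned by quantiles, using the proportional allocation of Equation (4.31); sampling from each γ_i is done by inverting the normal CDF on the corresponding probability interval. The integrand h(x) = x² is an assumption made only for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
h = lambda x: x**2                     # integrand; exact value of the integral is 1
n, n_s = 10_000, 10                    # total sample size and number of strata

# Strata S_i defined by consecutive quantiles of pi = N(0, 1),
# so pi(S_i) = 1/n_s and the proportional allocation is n_i = n/n_s.
edges = np.linspace(0.0, 1.0, n_s + 1)
estimate = 0.0
for a, b in zip(edges[:-1], edges[1:]):
    u = rng.uniform(a, b, size=n // n_s)     # uniform on the probability interval
    x = norm.ppf(u)                          # sample from gamma_i = pi restricted to S_i
    estimate += (b - a) * np.mean(h(x))      # pi(S_i) * Upsilon_ST^i

print(estimate)   # close to 1, typically with smaller variance than plain MC
```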

It was mentioned earlier that in some cases stratified Monte Carlo is more accurate, in terms of the variance of the approximation, than classical Monte Carlo. The following proposition proves this statement for the Neyman allocation and the proportional allocation. The theorem, as well as its proof, can also be found in [Cochran, 1963, page 98].

Theorem 10 Let Υ_MC, Υ^P_ST and Υ^N_ST be a classical Monte Carlo approximation, a stratified approximation using proportional allocation and a stratified approximation using Neyman allocation, respectively. If all approximations use the sample size n and the same strata, then

V[Υ_MC] ≥ V[Υ^P_ST] ≥ V[Υ^N_ST].

Proof: First, it is observed that according to Theorem 8 and Equation (4.31),

V[Υ^P_ST] = Σ_{i=1}^{n_s} π²(S_i) σ²_i / n_i = Σ_{i=1}^{n_s} π²(S_i) σ²_i / (n π(S_i)) = Σ_{i=1}^{n_s} π(S_i) σ²_i / n.    (4.32)

According to Equation (4.25), we have

V[h(x)] = E[(h(x) − Υ)²] = Σ_{i=1}^{n_s} π(S_i) ∫ (h(x) − Υ)² γ_i(dx).    (4.33)


By defining η_i ≜ ∫ h dγ_i, the integral in Equation (4.33) satisfies

∫ (h − Υ)² dγ_i = ∫ h² dγ_i − η²_i + η²_i − 2η_iΥ + Υ² = σ²_i + (η_i − Υ)².    (4.34)

According to Equation (4.32), Equation (4.33) and Equation (4.34), the variance of a classical Monte Carlo approximation can be written as

V[Υ_MC] = (1/n) V[h(x)] = Σ_{i=1}^{n_s} π(S_i) σ²_i / n + (1/n) Σ_{i=1}^{n_s} π(S_i)(η_i − Υ)² ≥ V[Υ^P_ST],

implying that V[Υ_MC] ≥ V[Υ^P_ST]. Theorem 9, in turn, implies V[Υ^N_ST] ≤ V[Υ^P_ST].

The idea behind stratified sampling was based on the possibility to decompose Υ as

Υ = Σ_{i=1}^{n_s} ∫ h(x) π_i(dx).

The existence of this decomposition does not, however, require a partition S of R^k. It is sufficient that there are n_s measures π_1, π_2, ..., π_{n_s} such that π_i ≪ π for all i = 1, 2, ..., n_s, and for all A ∈ B(R^k), π(A) = Σ_{i=1}^{n_s} π_i(A). The absolute continuities are required for ensuring the existence of the RNDs dπ_i/dπ. In this case,

π(A) = Σ_{i=1}^{n_s} π_i(A) = Σ_{i=1}^{n_s} ∫_A (dπ_i/dπ) dπ = ∫_A Σ_{i=1}^{n_s} (dπ_i/dπ) dπ,

implying that Σ_{i=1}^{n_s} dπ_i/dπ = dπ/dπ, π-a.e. This, in turn, allows us to write

Σ_{i=1}^{n_s} ∫ h dπ_i = Σ_{i=1}^{n_s} ∫ h (dπ_i/dπ) dπ = ∫ h Σ_{i=1}^{n_s} (dπ_i/dπ) dπ = ∫ h (dπ/dπ) dπ = ∫ h dπ,

which is a decomposition of Υ of the required form. Because π_i is not necessarily a probability measure, we define γ_i(A) = π_i(A)/π_i(R^k) for all A ∈ B(R^k). Clearly, dπ_i/dγ_i = π_i(R^k), γ_i-a.e. Therefore,

Υ = Σ_{i=1}^{n_s} π_i(R^k) ∫ h dγ_i.    (4.35)

To conclude, we define the stratified sampling approximation more generally as follows.

Definition 25 Suppose that π is a probability measure on B(R^k), h : R^k → R is π-integrable, and Υ ≜ ∫ h dπ. Let {π_i | i = 1, 2, ..., n_s} be measures satisfying π(A) = Σ_{i=1}^{n_s} π_i(A) for all A ∈ B(R^k) and π_i ≪ π for all i = 1, 2, ..., n_s. Moreover, define γ_i(A) ≜ π_i(A)/π_i(R^k) for all A ∈ B(R^k). Then a general stratified sampling approximation Υ_ST of Υ is defined as

Υ_ST ≜ Σ_{i=1}^{n_s} π_i(R^k) Υ^i_ST,

where

Υ^i_ST ≜ (1/n_i) Σ_{j=1}^{n_i} h(x_j), x_j ∼ γ_i.

It is important to note that the proofs of Theorem 8, Theorem 9, and Theorem 10 relied solely on the decomposition given in Equation (4.35), and not on the existence of the disjoint partition S. Therefore, all of these theorems remain valid as such even for the general stratified sampling approximation.

4.5 Rejection Method

One motivation for using importance sampling is the situation where one wants to approximate an expectation with respect to a probability measure from which it is difficult to simulate samples. Although random sample generation is not the main topic of this thesis, this section will introduce a well-known method for generating random samples according to a given, possibly nonstandard, distribution. The method is known as the rejection method [Devroye, 1986, page 40] or accept-reject sampling [Robert and Casella, 1999, page 49]. The rejection method is based on the following two theorems, which can also be found in [Devroye, 1986, pages 40-41]. Here, the theorems are given for general distributions on B(R^k) instead of only those admitting a density with respect to the Lebesgue measure.

Theorem 11 Suppose that x_i : Ω → R, i = 1, 2, ..., k, are random variables, x_{k+1} : Ω → [0, 1] is a uniformly distributed random variable independent of x_1, ..., x_k, and the distribution of [x_1, ..., x_k]^T has a finite density f with respect to a σ-finite measure µ on B(R^k). Moreover, let c > 0 be an arbitrary real number and ν ≜ µ × λ1. Then the distribution P_y of the random variable y ≜ [x_1, ..., x_k, c x_{k+1} f(x_1, ..., x_k)]^T satisfies

P_y(B) = ν(A ∩ B)/ν(A), B ∈ B(R^{k+1}),

where

A = {x ∈ R^{k+1} | 0 ≤ x_{k+1} ≤ c f(x_1, x_2, ..., x_k)}.

Conversely, if z = [z_1, z_2, ..., z_{k+1}]^T has the distribution P_y, then the distribution of [z_1, z_2, ..., z_k]^T has the density f with respect to µ.

Proof: Let y′ = [y′_1, y′_2, ..., y′_{k+1}]^T be a given element of R^{k+1} and let us define the sets A′ = {y ∈ R^{k+1} | y_{k+1} ≤ y′_{k+1}} and B′ = {y ∈ R^{k+1} | y_1 ≤ y′_1, y_2 ≤ y′_2, ..., y_k ≤ y′_k}. Now

P(y ≤ y′) = P( {x_1 ≤ y′_1} ∩ ··· ∩ {x_k ≤ y′_k} ∩ {c x_{k+1} f(x_1, ..., x_k) ≤ y′_{k+1}} ).


For vector arguments, ‘≤’ is interpreted elementwise. Let us define the notations v = [x_1, ..., x_k]^T, u = x_{k+1}, and D′ = {x ∈ R^{k+1} | x_{k+1} ≤ y′_{k+1}/(c f(x_1, ..., x_k))}. Because the density of the distribution of x with respect to the product measure µ × λ1 is f(v)χ_{[0,1]}(u), we have

P(y ≤ y′) = P_x(B′ ∩ D′) = ∫ χ_{B′} χ_{D′} dP_x
= ∫ χ_{B′}(v) [ ∫ χ_{D′}(v, u) χ_{[0,1]}(u) λ1(du) ] f(v) µ(dv)
= ∫_{B′} min(1, y′_{k+1}/(c f(v))) f(v) µ(dv) = (1/c) ∫_{B′} min(c f(v), y′_{k+1}) µ(dv)
= (1/c) ∫ [ ∫ χ_{B′}(v) χ_{[0,cf(v)]}(u) χ_{(−∞,y′_{k+1}]}(u) λ1(du) ] µ(dv)
= (1/c) ∫ χ_{B′} χ_A χ_{A′} dν = ν(A ∩ (A′ ∩ B′)) / ν(A),

because c = ν(A). This proves the first part of the theorem. For the converse part, assume that a random variable z has the distribution P_y. For all B ∈ B(R^k),

P([z_1, ..., z_k]^T ∈ B) = P_y(B × R) = (1/c) ∫ χ_{[0,cf(v)]}(u) χ_{B×R}(v, u) ν(dv, du) = (1/c) ∫_B [ ∫_{[0,cf(v)]} dλ1 ] µ(dv) = ∫_B f dµ.

Although this completes the proof, it should be pointed out that the set A can be equivalently characterised as the union of the sets {x ∈ R^{k+1} | 0 ≤ x_{k+1}/(c f(x_1, ..., x_k)) ≤ 1, f(x_1, ..., x_k) ≠ 0} and {x ∈ R^{k+1} | x_{k+1} = 0, f(x_1, ..., x_k) = 0}. Because c^{−1} x_{k+1}/f(x_1, ..., x_k) is a B(R^{k+1})/B(R)-measurable function [see, e.g. Kolmogorov and Fomin, 1975a, page 288], the set A is found to be an element of B(R^{k+1}).

In practice, the theorem implies that if x can be simulated according to some distribution with density f with respect to µ, and u ∼ U([0, 1]) independently of x, then [x^T, u f(x)]^T is uniformly distributed on the area limited by the R^k-plane and the density f.

Example 8: Define q(x) = (π + πx²)^{−1}, i.e. the density of the Cauchy distribution. Let x and u be independent random variables distributed according to the Cauchy distribution and U([0, 1]), respectively. Then [x, 2.5 u q(x)]^T is distributed uniformly on A, which is the subset of R² limited by the x-axis and the function 2.5 q(x). An illustration of the set A and a uniformly distributed sample on A is given in Figure 4.1(a).

In order to justify the rejection method, we also need the following theorem, for which a slightly different version can be found in [Devroye, 1986, page 41].


Theorem 12 Let µ be a σ-finite measure on B(R^k), and let A ∈ B(R^k) be such that µ(A) ∈ (0, ∞). Suppose that {x_i}_{i=1}^∞ is a sequence of IID, R^k-valued random variables with common distribution P_x such that P_x(B) = µ(A ∩ B)/µ(A) for all B ∈ B(R^k). Define y : Ω → R^k as

y(ω) ≜ x_{i*}(ω), i* = min{i | i ∈ N, x_i(ω) ∈ A′},    (4.36)

where A′ ∈ B(R^k) is such that P_x(A′) > 0 and µ(A′ ∩ A) = µ(A′). Then y is a random variable with distribution

P_y(B) = µ(A′ ∩ B)/µ(A′), B ∈ B(R^k).

Proof: Let B ∈ B(R^k). Then, by the definition of y, the following equivalences hold:

ω ∈ y^{−1}(B) ⟺ y(ω) ∈ A′ ∩ B ⟺ there is i : x_i(ω) ∈ A′ ∩ B ⟺ there is i : ω ∈ x_i^{−1}(A′ ∩ B) ⟺ ω ∈ ⋃_{i=1}^∞ x_i^{−1}(A′ ∩ B),

which implies that y^{−1}(B) = ⋃_{i=1}^∞ x_i^{−1}(A′ ∩ B), and as a countable union of measurable sets, y^{−1}(B) is measurable and hence y is a random variable. Because the x_i are independent,

P_y(B) = Σ_{i=1}^∞ P( {x_1 ∉ A′} ∩ ··· ∩ {x_{i−1} ∉ A′} ∩ {x_i ∈ A′ ∩ B} ) = P_x(A′ ∩ B) Σ_{i=1}^∞ (1 − P_x(A′))^{i−1} = P_x(A′ ∩ B)/P_x(A′) = µ(A ∩ A′ ∩ B) µ(A) / ( µ(A) µ(A′ ∩ A) ).

The fact that µ(A′) = µ((A′ ∩ A) ∪ (A′ ∩ ∁A)) = µ(A′ ∩ A) + µ(A′ ∩ ∁A) implies µ(A′ ∩ ∁A) = 0. Because A′ ∩ ∁A ∩ B ⊂ A′ ∩ ∁A, also µ(A′ ∩ ∁A ∩ B) = 0. Therefore µ(A′ ∩ A ∩ B) = µ(A′ ∩ B) − µ(A′ ∩ ∁A ∩ B) = µ(A′ ∩ B), and P_y(B) = µ(A′ ∩ B)/µ(A′).

Roughly speaking, the theorem states that if the random variable x is uniformly distributed on some set A, then the random variable y, defined to be the first realisation of x hitting the subset A′ of A, is uniformly distributed on A′. Theorem 11 and Theorem 12 can be applied in random variate generation as follows.

Suppose that the goal is to simulate samples from the distribution π defined on B(R^k). This distribution is called the target distribution. Moreover, suppose that there is a distribution γ on B(R^k) whose density with respect to a σ-finite measure µ on B(R^k) is q, and from which it is easy to simulate samples. This distribution is called the instrumental distribution. According to Theorem 11, if [z_1, ..., z_k] ∼ γ and z_{k+1} ∼ U([0, 1]) independently, then z = [z_1, ..., z_k, c z_{k+1} q(z_1, ..., z_k)]^T has the distribution

P_z(B) = ν(A ∩ B)/ν(A), B ∈ B(R^{k+1}),

where ν = µ × λ1 and

A = {x ∈ R^{k+1} | 0 ≤ x_{k+1} ≤ c q(x_1, ..., x_k)} ∈ B(R^{k+1}).

Let us then define the set A′ ∈ B(R^{k+1}) as

A′ = {x ∈ R^{k+1} | 0 ≤ x_{k+1} ≤ d f(x_1, ..., x_k)},

where d > 0 is an arbitrary real number and f is the density of π with respect to µ. According to the converse part of Theorem 11, [x_1, ..., x_k]^T ∼ π if y = [x_1, ..., x_{k+1}]^T ∼ P_y, where P_y(B) = ν(A′ ∩ B)/ν(A′) for all B ∈ B(R^{k+1}). According to Theorem 12, this happens if y is defined to be the first z hitting A′, and if ν(A′) = ν(A′ ∩ A). To ensure that the condition ν(A′) = ν(A′ ∩ A) is satisfied, it is observed that

ν(A′) = ∫ [ ∫_{[0,df(x)]} dλ1 ] dµ = d,
ν(A′ ∩ A) = ∫ [ ∫_{[0,min(cq(x),df(x))]} dλ1 ] dµ = c ∫_{A_1} q dµ + d ∫_{A_2} f dµ,

where A_1, A_2 ∈ B(R^k) form a partition of R^k such that A_1 = {x ∈ R^k | c/d < f(x)/q(x)} and A_2 = {x ∈ R^k | c/d ≥ f(x)/q(x)}. Clearly, the choice

c/d ≥ sup_{x : q(x)≠0} f(x)/q(x)

yields A_1 = ∅, A_2 = R^k, and the condition is satisfied. However, a less stringent condition is possible in terms of the essential supremum (see Definition 45 in Section A.2.2). For the choice

c/d ≥ ess sup_{x : q(x)≠0} f(x)/q(x),    (4.37)

one has µ(A_1) = 0, implying that ∫_{A_1} q dµ = ∫_{A_1} f dµ = 0. Therefore ν(A′ ∩ A) can be written as

ν(A′ ∩ A) = c ∫_{A_1} q dµ + d ∫_{A_2} f dµ = d ∫_{A_2} f dµ + d ∫_{A_1} f dµ = d ∫_{R^k} f dµ = d,

and the required condition is satisfied.

In practice, to implement the rejection method one simply simulates pairs (x_i, u_i) until u_i ≤ d f(x_i)/(c q(x_i)), in which case the value x_i is returned. Otherwise the sample is rejected, hence the name rejection method. Here, c/d is chosen to satisfy the condition in Equation (4.37).


Figure 4.1: (a) All samples are uniformly distributed on the set A, which is the area between cq(x) and the x-axis. The set A′ is the area between f(x) and the x-axis. Samples in A′ are accepted, and samples not in A′ are rejected. (b) Histogram of 2·10⁴ accepted samples and the target density f.

Eventually, the returned samples are IID with density f, as illustrated by the following example.

Example 9: Suppose that one wants to simulate an IID sample from the target density f(x) = |x| e^{−x²/2}/2 using the Cauchy distribution as the instrumental distribution. In this case, the density of the instrumental distribution is q(x) = (π + πx²)^{−1}. It can be shown that, when d = 1, the choice c = 2.5 satisfies the condition in Equation (4.37). Figure 4.1(a) illustrates the sets A and A′ as well as the rejected and accepted samples. Figure 4.1(b) illustrates the density f and the scaled histogram of 2·10⁴ accepted samples.
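For concreteness, the following Python sketch (added here; not part of the original example) implements the rejection method for the target and instrumental densities of Example 9, with d = 1 and c = 2.5.

```python
import numpy as np

rng = np.random.default_rng(7)

f = lambda x: np.abs(x) * np.exp(-0.5 * x**2) / 2        # target density of Example 9
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))               # Cauchy instrumental density
c, d = 2.5, 1.0

def rejection_sample(n, rng):
    out = []
    while len(out) < n:
        x = rng.standard_cauchy(n)                       # proposals from q
        u = rng.uniform(size=n)
        out.extend(x[u <= d * f(x) / (c * q(x))])        # accept when u <= d f/(c q)
        # the acceptance rate is d/c = 0.4, cf. Equation (4.38)
    return np.array(out[:n])

samples = rejection_sample(20_000, rng)
```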

In addition to the cost caused by the simulation of a sample from the instrumental distribution, there is also the cost of evaluating the densities f and q, and this cost is common to all samples, rejected or not. Therefore, it is obvious that in order to simulate a sample of size n from the target distribution, one should choose the instrumental distribution and the ratio c/d to maximise the ratio between the number of accepted samples and the number of all samples, i.e. the acceptance rate. For a given instrumental distribution, the acceptance rate is simply the probability of z ∼ U(A) hitting the set A′, that is,

P_z(A′) = ν(A ∩ A′)/ν(A) = ν(A′)/ν(A) = d/c,    (4.38)

implying that c/d should be chosen to be as small as possible. Under the constraint given in Equation (4.37), the highest rate of acceptance is then obtained by choosing

c/d = ess sup_{x : q(x)≠0} f(x)/q(x).

It is practical to note that, because the validity of the rejection method only depends on the ratio c/d, neither of the densities, f or q, is required to be known exactly, but only up to proportionality. Moreover, it is noted that the condition in Equation (4.37) resembles the condition on the boundedness of the RND in Proposition 7. This is to say that if one has an instrumental distribution γ and a target distribution π for which the RND dπ/dγ is not bounded, then the rejection method may not be possible to implement. In this case, one could use importance sampling, but the variance of the resulting approximation might be infinite. Thus, in the case that f has “heavier tails” than q, neither of the two methods, the rejection method and importance sampling, works well in the approximation of expectations with respect to f.


Chapter 5

Sequential Monte Carlo

Let us return to the Bayesian filtering problem that was introduced in Chapter 3. In this chapter, we will introduce several Monte Carlo based methods for approximating the generally intractable Bayesian posterior distributions. Especially in the filtering context, these methods are known as sequential Monte Carlo (SMC) methods or particle filters. The research on SMC methods started in the early 1990s and only a few books have been published on the topic. An introduction to SMC methods can be found in the pioneering book by Doucet et al. [2001b], and a more application oriented discussion on SMC methods can be found in [Ristic et al., 2004].

Most of the methods described in this chapter have been introduced earlier in the literature and the original proposals can be found in the provided references. The purpose of this chapter is to collect the important contributions on SMC methods from various publications and describe them in a unified, measure theoretic manner.

Section 5.1 describes how the Bayesian filtering distributions are approximated in SMC methods. Section 5.2 describes the most fundamental SMC method, known as the bootstrap filter, and Section 5.3 describes the generalisation of the bootstrap filter known as the sampling/importance resampling algorithm. A few proposals for choosing the importance distribution are given in Section 5.4. Section 5.5 addresses the simulation of the samples from the importance distribution and describes the connection between the sampling/importance resampling and sequential importance sampling algorithms in terms of stratified sampling. Section 5.6 introduces another class of SMC methods known as regularised particle filters and their connection to importance sampling and the rejection method. The chapter is concluded in Section 5.7, which gives references to the descriptions of some SMC methods whose detailed descriptions were excluded from this thesis.


5.1 Bayesian Filter Approximation

Recall from Section 3.4 the following formulation of the Bayesian posterior distribution:

π_{t|t}(A) = ∫_{A×R^k} g_t dπ_{t:t−1|t−1} / ∫_{R^k×R^k} g_t dπ_{t:t−1|t−1}.    (5.1)

The exact evaluation of this measure is, in general, intractable. Therefore, in SMC methods, π_{t|t} is approximated by a discrete probability measure π^n_{t|t} of the form

π^n_{t|t}(A) = Σ_{i=1}^n w^i_t χ_A(x^i_t), A ∈ B(R^k),    (5.2)

where Σ_{i=1}^n w^i_t = 1, and {x^i | i = 1, 2, ..., n} is an IID sample from some known distribution. From Equation (5.1), it is noted that π_{t+1|t+1} is defined using π_{t+1:t|t} which, in turn, was defined using π_{t|t} in Section 3.4. Instead of the exact measure, we have only the approximation π^n_{t|t} of π_{t|t}, implying that also π_{t+1:t|t} is unavailable and must be approximated. Let this approximation be denoted by π′_{t+1:t|t} and defined, similarly as π_{t+1:t|t}, to be the unique measure on B(R^{2k}) such that

π′_{t+1:t|t}(A × B) ≜ ∫_B K_t(x_t, A) π^n_{t|t}(dx_t), A, B ∈ B(R^k).

To derive a convenient formulation for π′_{t+1:t|t}, we let C ∈ B(R^{2k}) and define

C̄ ≜ {(x_t, x_{t+1}) ∈ C | x_t ≠ x^i_t for all i = 1, 2, ..., n} ∈ B(R^{2k}),    (5.3)
A_i ≜ {x_{t+1} ∈ R^k | χ_C(x^i_t, x_{t+1}) = 1} ∈ B(R^k),    (5.4)
C_i ≜ {x^i_t} × A_i ∈ B(R^{2k}).    (5.5)

According to these definitions, C is the finite disjoint union C̄ ∪ (⋃_{i=1}^n C_i) and thus π′_{t+1:t|t}(C) = π′_{t+1:t|t}(C̄) + π′_{t+1:t|t}(⋃_{i=1}^n C_i). Because π′_{t+1:t|t}(C̄) = 0,

π′_{t+1:t|t}(C) = ∫ [ ∫ Σ_{i=1}^n χ_{C_i} K_t(x_t, dx_{t+1}) ] π^n_{t|t}(dx_t) = Σ_{i=1}^n ∫_{{x^i_t}} K_t(x_t, A_i) π^n_{t|t}(dx_t) = Σ_{i=1}^n w^i_t K_t(x^i_t, A_i).    (5.6)

By substituting the approximate measure π′_{t+1:t|t} into Equation (5.1), we obtain an approximate formulation of the Bayesian posterior distribution π_{t|t} as

π′_{t|t}(A) = ∫_{A×R^k} g_t dπ′_{t:t−1|t−1} / ∫_{R^k×R^k} g_t dπ′_{t:t−1|t−1},    (5.7)

which, in spite of the use of the approximate measures π′_{t+1:t|t}, is still generally intractable. Because the only difference between the numerator and the denominator in Equation (5.7) is the integration region, we find ourselves ultimately interested in approximating integrals of the form

Υ(A) ≜ ∫_{A×R^k} g_t dπ′_{t:t−1|t−1}, A ∈ B(R^k),    (5.8)

such that the resulting approximation of π′_{t|t}(A) = Υ(A)/Υ(R^k) is a discrete probability measure π^n_{t|t}.

Before moving into the detailed description of the SMC methods for approximating Υ(A), one important remark should be made. At every time step, one approximates the measure π′_{t|t}, which already is an approximation of π_{t|t}. Thus, it is not evident that the approximation error does not accumulate and result in a diverging approximation of π_{t|t}. The analysis of convergence of the SMC methods is complicated and beyond the scope of this thesis. Some convergence results have been given, e.g. in [Crisan and Doucet, 2002].

5.2 Bootstrap Filter

The bootstrap filter, originally proposed by Gordon et al. [1993], is the simplest SMC algorithm, and it is commonly acknowledged to be the method that initiated the surge of interest in SMC methods. In the bootstrap filter, the integral Υ(A) is approximated by classical Monte Carlo integration. This is to say that an IID sample {(x^i_{t−1}, x^i_t) | i = 1, 2, ..., n} of size n is simulated according to π′_{t:t−1|t−1} and Υ(A) is approximated by

Υ(A) ≈ Υ_BS(A) = (1/n) Σ_{i=1}^n χ_{A×R^k}(x^i_t, x^i_{t−1}) g_t(x^i_t) = (1/n) Σ_{i=1}^n χ_A(x^i_t) g_t(x^i_t),

where (x^i_{t−1}, x^i_t) ∼ π′_{t:t−1|t−1}. More precisely, the notation (x^i_{t−1}, x^i_t) means a realisation of a random vector [(x^i_{t−1})^T, (x^i_t)^T]^T taking values in R^{2k}. For clarity, the ordered pair notation (x^i_{t−1}, x^i_t) will be used instead.

According to Equation (5.7), an approximation of π′_{t|t}(A) is obtained by taking the ratio

π′_{t|t}(A) ≈ π^n_{t|t}(A) = Υ_BS(A)/Υ_BS(R^k) = Σ_{i=1}^n χ_A(x^i_t) g_t(x^i_t) / Σ_{i=1}^n g_t(x^i_t).

This is a discrete probability measure of the form

π^n_{t|t}(A) = Σ_{i=1}^n w^i_t χ_A(x^i_t) / Σ_{i=1}^n w^i_t,    (5.9)

where the unnormalised weights w^i_t are

w^i_t = g_t(x^i_t).
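To illustrate one iteration of the bootstrap filter, the following Python sketch (added here; not part of the original text) assumes a hypothetical scalar model x_{t+1} = 0.5 x_t + N(0, q_std²), y_t = x_t + N(0, r_std²); the model is purely an example, not one used in the thesis. Particles are drawn from π′_{t:t−1|t−1} by resampling the previous particles and propagating them through K_t, and the unnormalised weights are the likelihood values g_t(x^i_t).

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_step(particles, weights, y, rng, q_std=1.0, r_std=0.5):
    """One bootstrap-filter step for the hypothetical scalar model
    x_{t+1} = 0.5 x_t + N(0, q_std^2),  y_t = x_t + N(0, r_std^2)."""
    n = particles.size
    # Draw x^i_t from pi^n_{t-1|t-1}, then propagate through K_{t-1}.
    idx = rng.choice(n, size=n, p=weights)
    new_particles = 0.5 * particles[idx] + q_std * rng.standard_normal(n)
    # Unnormalised weights w^i_t = g_t(x^i_t), cf. Equation (5.9).
    log_g = -0.5 * ((y - new_particles) / r_std) ** 2
    w = np.exp(log_g - log_g.max())
    return new_particles, w / w.sum()

particles = rng.standard_normal(1000)
weights = np.full(1000, 1.0 / 1000)
particles, weights = bootstrap_step(particles, weights, y=0.3, rng=rng)
print(np.sum(weights * particles))   # posterior mean estimate
```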


5.3 Sampling/Importance Resampling

According to the discussion in Section 4.3, one does not necessarily have to simulate samples from π′_{t+1:t|t} in order to approximate Υ(A). To see how this can be done in detail, let us define a set function π_{t+1:t|t+1} : B(R^{2k}) → [0, ∞) as

π_{t+1:t|t+1}(C) ≜ ∫_C g_{t+1} dπ′_{t+1:t|t}, C ∈ B(R^{2k}).    (5.10)

Because g_{t+1} ≥ 0, π′_{t+1:t|t}-a.e., π_{t+1:t|t+1} is a measure on B(R^{2k}) [Shiryayev, 1984, page 193]. This measure allows us to write Υ(A) equivalently as

Υ(A) = ∫_{A×R^k} dπ_{t+1:t|t+1} = π_{t+1:t|t+1}(A × R^k), A ∈ B(R^k).

According to this notation, it is obvious that Υ(A) is an unnormalised posterior distribution, i.e. a measure which is not a probability measure but proportional to π′_{t+1|t+1}. If there is a probability measure γ_{t+1:t|t} such that π_{t+1:t|t+1} ≪ γ_{t+1:t|t}, then Υ(A) can be equivalently written as

Υ(A) = ∫_{A×R^k} (dπ_{t+1:t|t+1}/dγ_{t+1:t|t}) dγ_{t+1:t|t},    (5.11)

and approximated by importance sampling using an IID sample from γ_{t+1:t|t}. The importance distribution γ_{t+1:t|t} can be chosen quite arbitrarily as long as the absolute continuity is preserved. Often, γ_{t+1:t|t} is defined to be of the form

γ_{t+1:t|t}(C) = Σ_{i=1}^n w^i_t K̄_t(x^i_t, A_i), C ∈ B(R^{2k}),    (5.12)

where A_i = {x_{t+1} ∈ R^k | χ_C(x^i_t, x_{t+1}) = 1}. In other words, γ_{t+1:t|t} is identical to π′_{t+1:t|t} except for the different transition kernel K̄_t.

In order to give a convenient formulation for the RND in Equation (5.11), let the transition kernels K_t and K̄_t have probability densities k_t(x_t, x_{t+1}) and k̄_t(x_t, x_{t+1}), respectively. Moreover, define a weight function w_t as

w_t(x_t) = w^i_t, if x_t = x^i_t; 1, otherwise.    (5.13)

Let us show that, in this case, the function

f(x_t, x_{t+1}) = g_{t+1}(x_{t+1}) w_t(x_t) k_t(x_t, x_{t+1}) / ( w_t(x_t) k̄_t(x_t, x_{t+1}) )    (5.14)

is a variant of the RND dπ_{t+1:t|t+1}/dγ_{t+1:t|t}. Because this RND can be defined arbitrarily on sets of γ_{t+1:t|t}-measure zero, f can be defined arbitrarily for pairs (x_t, x_{t+1}) such that the denominator in Equation (5.14) goes to zero.


Let C ∈ B(R^{2k}) be arbitrary, let C̄, C_i, and A_i be as defined in Equations (5.3)–(5.5), and let h : R^k × R^k → R be a B(R^{2k})/B(R)-measurable function. Moreover, let U_i be the set of countably simple functions u_i : R^k × R^k → R (see Definition 48 in Section A.2.3) such that u_i ≤ hχ_{C_i}, γ_{t+1:t|t}-a.e., and let V_i be the set of countably simple functions v_i : R^k → R such that v_i ≤ h(x^i_t, ·)χ_{A_i}, K̄_t(x^i_t, ·)-a.e. For all u_i ∈ U_i, one can define v_i = u_i(x^i_t, ·) ∈ V_i. The definition of the integral for countably simple functions then implies that

∫ u_i dγ_{t+1:t|t} = w^i_t ∫ v_i(x) K̄_t(x^i_t, dx).    (5.15)

Conversely, for all v_i ∈ V_i one can define u_i ∈ U_i as u_i(x_t, ·) = v_i when x_t = x^i_t and u_i(x_t, ·) = 0 otherwise. Again, Equation (5.15) holds. Consequently,

sup_{u_i∈U_i} ∫ u_i dγ_{t+1:t|t} = sup_{v_i∈V_i} w^i_t ∫ v_i(x) K̄_t(x^i_t, dx) = w^i_t ∫_{A_i} h(x^i_t, x) K̄_t(x^i_t, dx).

The same conclusion is obtained by replacing ‘≤’ above by ‘≥’, and the supremum by the infimum. The definition of the integral then implies that

∫_{C_i} h dγ_{t+1:t|t} = w^i_t ∫_{A_i} h(x^i_t, x) K̄_t(x^i_t, dx).    (5.16)

By replacing h and K̄_t in Equation (5.16) by g_{t+1} and K_t, we have

π_{t+1:t|t+1}(C_i) = ∫_{C_i} g_{t+1} dπ′_{t+1:t|t} = w^i_t ∫_{A_i} g_{t+1}(x) K_t(x^i_t, dx).    (5.17)

Finally, Equation (5.16) and Equation (5.17) yield

∫_C f dγ_{t+1:t|t} = Σ_{i=1}^n ∫_{C_i} f dγ_{t+1:t|t} = Σ_{i=1}^n w^i_t ∫_{A_i} f(x^i_t, x) K̄_t(x^i_t, dx)
= Σ_{i=1}^n w^i_t ∫_{A_i} g_{t+1}(x) ( k_t(x^i_t, x)/k̄_t(x^i_t, x) ) k̄_t(x^i_t, x) λ^k(dx)
= Σ_{i=1}^n w^i_t ∫_{A_i} g_{t+1}(x) K_t(x^i_t, dx) = π_{t+1:t|t+1}(C),    (5.18)

where we have replaced h in Equation (5.16) by f. The first equality follows from the fact that fχ_{C̄} = 0, γ_{t+1:t|t}-a.e. Because Equation (5.18) holds for all C ∈ B(R^{2k}), f must be a variant of the RND dπ_{t+1:t|t+1}/dγ_{t+1:t|t}.

Similarly as above, it can be shown that the numerator and the denominator in Equation (5.14) are the densities of π_{t+1:t|t+1} and γ_{t+1:t|t}, respectively, with respect to the product measure λ_B × λ^k, where λ_B is the counting measure on the set {x^i_t | i = 1, 2, ..., n}. The weight function w_t, in turn, is a variant of the density of π^n_{t|t} with respect to λ_B. Note that the choice of the value 1 in Equation (5.13) is arbitrary, since this value is assigned to sets of γ_{t+1:t|t}-measure, or λ_B × λ^k-measure, equal to zero. A similar proof can also be used to show that the density of π′_{t+1:t|t} with respect to λ_B × λ^k is w_t(x_t)k_t(x_t, x_{t+1}), and by dropping the likelihood out of Equation (5.14) one obtains the RND dπ′_{t+1:t|t}/dγ_{t+1:t|t}, if it exists.

It is straightforward to show that a sufficient condition for the existence of the RND dπ_{t+1:t|t+1}/dγ_{t+1:t|t} is that

K̄_t(x^i_t, A) = 0 ⟹ ∫_A g_{t+1}(x) K_t(x^i_t, dx) = 0, A ∈ B(R^k),    (5.19)

for all i such that w^i_t > 0. For the existence of dπ′_{t+1:t|t}/dγ_{t+1:t|t}, it is required that K_t(x^i_t, ·) ≪ K̄_t(x^i_t, ·) for all i such that w^i_t > 0. Obviously, this condition is more stringent than the one in Equation (5.19).

Example 10: Let Ga(a, b) denote a gamma distribution with parameters a and b. The density of Ga(a, b) is

q(x) = b^{−a} x^{a−1} e^{−x/b} / Γ(a), if x ∈ [0, ∞), a, b > 0, and q(x) = 0 otherwise,

where Γ is the gamma function. Suppose that x_t ∼ Ga(2, 1) is a random variable taking values in R and k_t(x_t, x_{t+1}) = fN(x_{t+1}; 2x_t, 4). Moreover, let there be a weighted discrete approximation π^n_{t|t} of Ga(2, 1) obtained by simulating an IID sample B = {x^i_t | i = 1, 2, ..., n} from N(2, 1). Figure 5.1(a) illustrates the densities of Ga(2, 1) and N(2, 1). The density of π^n_{t|t} with respect to λ_B is illustrated in Figure 5.1(b).

Let w_t(x) = w^i_t when x = x^i_t, and otherwise w_t(x) = 0. Then the resulting density of π′_{t+1:t|t} with respect to λ_B × λ1 appears as depicted in Figure 5.1(c). For comparison, the density of the exact measure π_{t+1:t|t} with respect to λ2 is illustrated in Figure 5.1(d).

From Figure 5.1(c), it is not evident that π′_{t+1:t|t} approximates π_{t+1:t|t}. Therefore, Figure 5.2(a) illustrates the π′_{t+1:t|t}-measures of rectangular Borel sets [a_i, a_{i+1}] × [b_i, b_{i+1}] forming an evenly spaced 40-by-40 grid on [0, 4] × [−5, 15]. For comparison, the density of the exact measure π_{t+1:t|t} is illustrated in Figure 5.2(b).

After establishing that dπ_{t+1:t|t+1}/dγ_{t+1:t|t} = f for the importance distribution γ_{t+1:t|t} defined in Equation (5.12), the importance sampling approximation of Υ(A) is straightforwardly seen to be

Υ_SIR(A) = (1/n) Σ_{i=1}^n χ_A(x^i_{t+1}) g_{t+1}(x^i_{t+1}) k_t(x^{j_i}_t, x^i_{t+1}) / k̄_t(x^{j_i}_t, x^i_{t+1}),

where (x^{j_i}_t, x^i_{t+1}) ∼ γ_{t+1:t|t}. More explicitly, the pair (x^{j_i}_t, x^i_{t+1}) could be written as (x̃^i_t, x^i_{t+1}), where x̃^i_t is a random variable with the discrete distribution on the set {x^i_t | i = 1, 2, ..., n} such that P(x̃^i_t = x^i_t) = w^i_t. Here we have, however, used the notation x̃^i_t = x^{j_i}_t, where j_i is a random variable with the discrete distribution P(j_i = k) = w^k_t. This topic will be discussed in more detail in Section 5.5.


Figure 5.1: (a) Densities of Ga(2, 1) and N(2, 1) with respect to the Lebesgue measure λ1. (b) The density of the weighted empirical approximation π^n_{t|t} of Ga(2, 1) with respect to the counting measure on {x^i_t}_{i=1}^n. (c) The density of π′_{t+1:t|t} with respect to λ_B × λ1. (d) The density of π_{t+1:t|t} with respect to the Lebesgue measure.


Figure 5.2: (a) The π′_{t+1:t|t}-measures of rectangular sets on an evenly spaced grid. (b) The density of π_{t+1:t|t} with respect to λ2.


According to Equation (5.7), the approximation of the measure π′_{t+1|t+1} is obtained as the ratio Υ_SIR(A)/Υ_SIR(R^k), yielding a discrete probability measure π^n_{t+1|t+1} of the form given in Equation (5.9), where the unnormalised weights are

w^i_{t+1} = g_{t+1}(x^i_{t+1}) k_t(x^{j_i}_t, x^i_{t+1}) / k̄_t(x^{j_i}_t, x^i_{t+1}).    (5.20)

This is perhaps the most common weight update formula given in the literature, and it is often called the sampling/importance resampling (SIR) algorithm [Arulampalam et al., 2002, Doucet et al., 2001b, Doucet, 1998].
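As a rough illustration of the weight update in Equation (5.20), the following Python sketch (added here; not from the original text) performs one SIR step for the same hypothetical scalar model used earlier, with a proposal kernel K̄_t whose standard deviation differs from that of K_t; the model and its parameters are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def sir_step(particles, weights, y, rng, q_std=1.0, r_std=0.5, prop_std=1.5):
    """One SIR step for the hypothetical model x_{t+1} = 0.5 x_t + N(0, q_std^2),
    y_t = x_t + N(0, r_std^2); the proposal kernel is N(0.5 x_t, prop_std^2)."""
    n = particles.size
    j = rng.choice(n, size=n, p=weights)             # indices j_i, P(j_i = k) = w^k_t
    mean = 0.5 * particles[j]
    new = mean + prop_std * rng.standard_normal(n)   # x^i_{t+1} from the proposal kernel
    # Unnormalised weights of Equation (5.20): g_{t+1} * k_t / k-bar_t.
    log_w = (log_normal_pdf(y, new, r_std)
             + log_normal_pdf(new, mean, q_std)
             - log_normal_pdf(new, mean, prop_std))
    w = np.exp(log_w - log_w.max())
    return new, w / w.sum()

particles, weights = rng.standard_normal(1000), np.full(1000, 1e-3)
particles, weights = sir_step(particles, weights, y=0.3, rng=rng)
```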

5.3.1 Generalisation of the Importance Distribution

Equation (5.12) does not represent the only possible way of choosing the importance distribution γ_{t+1:t|t}. A more general class of importance distributions is obtained by setting

γ_{t+1:t|t}(C) = Σ_{i=1}^n w̄^i_t K̄_t(x^i_t, A_i),    (5.21)

where C ∈ B(R^{2k}) and A_i = {x_{t+1} ∈ R^k | χ_C(x^i_t, x_{t+1}) = 1}. To ensure the absolute continuity π_{t+1:t|t+1} ≪ γ_{t+1:t|t}, it is required, in addition to the condition in Equation (5.19), that w̄^i_t = 0 ⟹ w^i_t = 0 for all i = 1, 2, ..., n.

Similarly as in the previous section, a variant of the RND dπ_{t+1:t|t+1}/dγ_{t+1:t|t} is obtained as

dπ_{t+1:t|t+1}/dγ_{t+1:t|t}(x_t, x_{t+1}) = g_{t+1}(x_{t+1}) w_t(x_t) k_t(x_t, x_{t+1}) / ( w̄_t(x_t) k̄_t(x_t, x_{t+1}) ),

where the weight function w_t is as defined in Equation (5.13) and w̄_t is defined as

w̄_t(x_t) = w̄^i_t, if x_t = x^i_t; 1, otherwise.    (5.22)

Again, the choice of the value 1 is arbitrary. It then follows that the approximation of π′_{t+1|t+1} will be a discrete probability measure of the form given in Equation (5.9), where the unnormalised weights are

w^i_{t+1} = g_{t+1}(x^i_{t+1}) w^{j_i}_t k_t(x^{j_i}_t, x^i_{t+1}) / ( w̄^{j_i}_t k̄_t(x^{j_i}_t, x^i_{t+1}) ).    (5.23)

It should be pointed out that the idea of assigning alternative weights w̄^i_t was already proposed by Carpenter et al. [1999], and the approach was referred to as the improved particle filter. The same approach was also proposed by Pitt and Shepard [1999], who called the algorithm the auxiliary particle filter. Supposedly, the purpose of the name auxiliary particle filter was to emphasise that the simulation of random samples is done in a higher dimensional space than the actual state space. An illustration of this higher dimensional space is given in Figure 5.1(c), where the dimension is R² while the state space is R. However, following the formulation of SIR given in Section 5.3, it seems more reasonable to regard the auxiliary particle filter as another formulation of the general SIR algorithm rather than a separate method.

5.3.2 Alternative Formulation of SIR

There is an interesting theoretical curiosity, related to SIR, which has rarely been addressed in the literature [see, e.g. Pitt and Shepard, 1999]. Recall from Section 3.4 that, according to Proposition 3, Υ(A) can be equivalently expressed as

Υ(A) = ∫_A g_{t+1} dπ′_{t+1|t}, A ∈ B(R^k),    (5.24)

where, by the definition of the Bayesian filter in Equation (3.8),

π′_{t+1|t}(A) = ∫ K_t(x_t, A) π^n_{t|t}(dx_t) = Σ_{i=1}^n w^i_t K_t(x^i_t, A), A ∈ B(R^k).

Similarly as in Section 5.3, we can define a measure π_{t+1|t+1} as

π_{t+1|t+1}(A) = ∫_A g_{t+1} dπ′_{t+1|t}, A ∈ B(R^k),    (5.25)

and write

Υ(A) = π_{t+1|t+1}(A) = ∫_A (dπ_{t+1|t+1}/dγ_{t+1|t}) dγ_{t+1|t},

where γ_{t+1|t} ≫ π_{t+1|t+1} is defined to be of the form

γ_{t+1|t}(A) = Σ_{i=1}^n w̄^i_t K̄_t(x^i_t, A), A ∈ B(R^k).

Again, by assuming that K_t and K̄_t have densities k_t(x_t, ·) and k̄_t(x_t, ·), respectively, a variant of the required RND is obtained as

dπ_{t+1|t+1}/dγ_{t+1|t}(x_{t+1}) = g_{t+1}(x_{t+1}) Σ_{i=1}^n w^i_t k_t(x^i_t, x_{t+1}) / Σ_{i=1}^n w̄^i_t k̄_t(x^i_t, x_{t+1}),

and the resulting approximation of Υ(A) is

Υ*_SIR(A) = (1/n) Σ_{i=1}^n χ_A(x^i_{t+1}) g_{t+1}(x^i_{t+1}) Σ_{j=1}^n w^j_t k_t(x^j_t, x^i_{t+1}) / Σ_{j=1}^n w̄^j_t k̄_t(x^j_t, x^i_{t+1}).

By taking the ratio Υ*_SIR(A)/Υ*_SIR(R^k), one obtains a discrete probability measure approximation of π′_{t+1|t+1} of the form given in Equation (5.9), where the unnormalised weights are

w^i_{t+1} = g_{t+1}(x^i_{t+1}) Σ_{j=1}^n w^j_t k_t(x^j_t, x^i_{t+1}) / Σ_{j=1}^n w̄^j_t k̄_t(x^j_t, x^i_{t+1}).    (5.26)


It is noted that, in the general SIR algorithm, one must generate a sample {(x̃^i_t, x^i_{t+1}) | i = 1, 2, ..., n} according to γ_{t+1:t|t}, where γ_{t+1:t|t} is as described in Equation (5.21). In this case, the latter components of the pairs (x̃^i_t, x^i_{t+1}) constitute an IID sample from γ_{t+1|t}. Consequently, the alternative approximation Υ*_SIR(A) is straightforwardly obtained at any time in the general SIR algorithm. Because of the following proposition, this alternative formulation is considered to be worth mentioning.

Proposition 9 Suppose that there is a σ-finite product measure µ = µ1 × µ2 on B(R^{2k}), where µ1 and µ2 are σ-finite measures on B(R^k). Let there be a finite measure π and a probability measure γ on B(R^{2k}) such that π ≪ γ ≪ µ. Also, define the measures π_M(A) ≜ π(A × R^k) and γ_M(A) ≜ γ(A × R^k) for all A ∈ B(R^k). Then

V_q[ h(x) f(x, y)/q(x, y) ] ≥ V_{q_M}[ h(x) f_M(x)/q_M(x) ],

where f = dπ/dµ, q = dγ/dµ, f_M = dπ_M/dµ1 and q_M = dγ_M/dµ1.

Proof: It can straightforwardly be shown that f_M(x) = ∫ f(x, y) µ2(dy), π_M-a.e., and q_M(x) = ∫ q(x, y) µ2(dy), γ_M-a.e. Therefore, it follows from the Hölder inequality that, π_M-a.e.,

f_M²(x) ≤ ∫ ( f²(x, y)/q(x, y) ) µ2(dy) q_M(x),

implying that, π_M-a.e.,

( f_M²(x)/q_M²(x) ) q_M(x) ≤ ∫ ( f²(x, y)/q²(x, y) ) q(x, y) µ2(dy).

Then, by the properties of the integral, it follows that

∫ h² (f_M²/q_M²) q_M dµ1 ≤ ∫ h² [ ∫ (f²/q²) q dµ2 ] dµ1,

or equivalently E_{q_M}[h² f_M²/q_M²] ≤ E_q[h² f²/q²]. Because E_{q_M}[h f_M/q_M] = E_q[h f/q], the proposition follows.

To see the significance of Proposition 9 for the alternative SIR formulation, let π_{t+1:t|t+1}, γ_{t+1:t|t} and λ_B × λ^k, where λ_B is the counting measure on B = {x^i_t | i = 1, 2, ..., n}, be the measures π, γ, and µ of Proposition 9, respectively. Consequently, π_M = π_{t+1|t+1} and γ_M = γ_{t+1|t}. The required densities are

f(x_t, x_{t+1}) = g_{t+1}(x_{t+1}) w_t(x_t) k_t(x_t, x_{t+1}),
f_M(x_{t+1}) = g_{t+1}(x_{t+1}) Σ_{i=1}^n w_t(x^i_t) k_t(x^i_t, x_{t+1}),
q(x_t, x_{t+1}) = w̄_t(x_t) k̄_t(x_t, x_{t+1}),
q_M(x_{t+1}) = Σ_{i=1}^n w̄_t(x^i_t) k̄_t(x^i_t, x_{t+1}).


Figure 5.3: (a) Points where the density of π′_{t+1:t|t} is evaluated for sample size 5. (b) Points where the density of π′_{t+1:t|t} is evaluated using the alternative SIR formulation and sample size 5.

Moreover, by defining h(x_{t+1}) ≜ χ_A(x_{t+1}), Proposition 9 implies that

V[Υ*_SIR(A)] = n^{−1} V_{q_M}[h f_M/q_M] ≤ n^{−1} V_q[h f/q] = V[Υ_SIR(A)],

i.e. in terms of variance, Υ*_SIR(A) is a more accurate approximation of Υ(A) than Υ_SIR(A) using the same importance distribution. It should be noted that this result applies to general importance sampling and not only to SMC methods.

There is, however, a drawback, which most likely is the reason why this formulation has not been mentioned in the literature. In the evaluation of Υ*_SIR(A), one must evaluate the densities of π′_{t+1|t} and γ_{t+1|t} at n different points. Because these densities are by construction sums of n densities k_t(x^i_t, ·), each of the n evaluations itself requires n density evaluations. This makes the total number of density evaluations 2n². The multiplication by two is due to the fact that one must evaluate the densities of both π′_{t+1|t} and γ_{t+1|t}.

Figure 5.3 illustrates the difference between the two SIR formulations for the distribution π′_{t+1:t|t} as defined in Example 10. The points where the density of π′_{t+1:t|t} is evaluated when computing Υ_SIR(A) are illustrated in Figure 5.3(a) for a sample of size 5. Figure 5.3(b) illustrates the corresponding evaluation points for the alternative approximation Υ*_SIR(A). Given the significantly larger number of evaluation points, it is not surprising that Υ*_SIR(A) is expected to be more accurate.

5.4 Proposals for Importance Distributions

So far, nothing has been said about the choice of the importance distribution γ_{t+1:t|t}. Let C ∈ B(R^{2k}) and let C̄, C_i, and A_i be as defined in Equations (5.3)–(5.5). Then, according to the discussion in Section 4.3.3, the optimal importance distribution would be

γ*_{t+1:t|t}(C) ≜ Σ_{i=1}^n w^i_t ∫_{A_i} g_{t+1}(x) K_t(x^i_t, dx) / Σ_{i=1}^n w^i_t ∫ g_{t+1}(x) K_t(x^i_t, dx).    (5.27)

Because of the generally intractable integrals, the exact optimal importance distribution is unavailable and, therefore, it must be approximated. Several different proposals can be found in the literature for choosing the importance distribution to approximate γ*_{t+1:t|t}. A comprehensive study of different importance distributions is omitted from this thesis, but more details can be found, e.g. in [Doucet, 1998, Doucet et al., 2001b]. Because a good choice of importance distribution tends to be application specific, only general guidelines will be described in the following.

5.4.1 Auxiliary Particle Filter

The auxiliary particle filter, which was already briefly mentioned in Section 5.3.1, addresses the choice of the weights w̄^i_t in the importance distribution. According to Equation (5.27), the weights in the optimal importance distribution would be

w̄^i_t = w^i_t ∫ g_{t+1}(x) K_t(x^i_t, dx) / Σ_{i=1}^n w^i_t ∫ g_{t+1}(x) K_t(x^i_t, dx).    (5.28)

These are, in general, impossible to evaluate exactly. Therefore, Pitt and Shepard [1999] proposed approximating the likelihood function g_{t+1}, separately for all i = 1, 2, ..., n, by a constant function taking the value g_{t+1}(ξ^i_{t+1}), where ξ^i_{t+1} is the mean, the mode, a draw, or some other likely value associated with the distribution K_t(x^i_t, ·). In this case,

w̄^i_t = w^i_t g_{t+1}(ξ^i_{t+1}) / Σ_{i=1}^n w^i_t g_{t+1}(ξ^i_{t+1}).

By substituting these weights into Equation (5.23), the unnormalised weights in the resulting approximation of π′_{t|t} become

w^i_t = g_t(x^i_t) w^{j_i}_{t−1} k_{t−1}(x^{j_i}_{t−1}, x^i_t) / ( g_t(ξ^{j_i}_t) w^{j_i}_{t−1} k_{t−1}(x^{j_i}_{t−1}, x^i_t) ) = g_t(x^i_t) / g_t(ξ^{j_i}_t).

Indeed, in terms of approximating the integral ∫ g_{t+1}(x) K_t(x^i_t, dx), the approximation of g_{t+1} by a constant function seems quite coarse. However, the power of this approximation is its low additional computational cost. In the bootstrap filter, one must evaluate the likelihood function n times at each iteration. In the proposed constant approximation of the likelihood, one must evaluate the likelihood 2n times. Of course, additional computational complexity may arise from the choice of the point ξ^i_t. Since ξ^i_t could be defined, for example, as a random sample from K_t(x^i_t, ·), the choice of ξ^i_t can be done at rather small cost.
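To make the two-stage weighting concrete, the following Python sketch (added here; not from the original text) performs one auxiliary-particle-filter style step for the same hypothetical scalar model assumed in the earlier sketches, choosing ξ^i as the mean of K_t(x^i_t, ·).

```python
import numpy as np

rng = np.random.default_rng(5)

def apf_step(particles, weights, y, rng, q_std=1.0, r_std=0.5):
    """One auxiliary-particle-filter step for the hypothetical scalar model
    x_{t+1} = 0.5 x_t + N(0, q_std^2), y_t = x_t + N(0, r_std^2);
    xi^i is chosen as the mean of K_t(x^i_t, .)."""
    xi = 0.5 * particles                                   # likely values xi^i_{t+1}
    log_g_xi = -0.5 * ((y - xi) / r_std) ** 2
    first_stage = weights * np.exp(log_g_xi - log_g_xi.max())
    first_stage /= first_stage.sum()                       # weights w-bar^i_t
    j = rng.choice(particles.size, size=particles.size, p=first_stage)
    new = 0.5 * particles[j] + q_std * rng.standard_normal(particles.size)
    # Second-stage weights g_t(x^i_t) / g_t(xi^{j_i}_t).
    log_w = -0.5 * ((y - new) / r_std) ** 2 - log_g_xi[j]
    w = np.exp(log_w - log_w.max())
    return new, w / w.sum()
```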


5.4.2 Monte Carlo Weighting

A natural extension to the approach of the auxiliary particle filter is to approximate the integrals in Equation (5.28) by classical Monte Carlo. For this purpose, one can simulate n* samples from K_t(x^i_t, ·) for all i = 1, 2, ..., n, and approximate

∫ g_{t+1}(x) K_t(x^i_t, dx) ≈ (1/n*) Σ_{j=1}^{n*} g_{t+1}(x^j), x^j ∼ K_t(x^i_t, ·).    (5.29)

The total number of required likelihood evaluations is then n·n*. Clearly, by taking only one sample from each K_t(x^i_t, ·), i.e. n* = 1, one obtains the auxiliary particle filter, where ξ^i_{t+1} is a random sample.

An evident drawback of this approach is its computational cost for large sample sizes n*. Some computation can be avoided by noting that the samples and the evaluated likelihoods that were used for approximating the integral in Equation (5.29) can be reused to some extent when approximating π′_{t|t} or any integrals with respect to it. However, when moving to the next iteration, not all samples should be used, in order to ensure that the number of samples remains constant and is not multiplied by n* at each iteration.

5.4.3 Kalman Filter Importance Distributions

Under certain assumptions, it is possible to design more detailed constructions of importance distributions than those proposed in Section 5.4.1 and Section 5.4.2. Let us make the following assumptions:

i) For all x_t ∈ R^k, K_t(x_t, ·) has the density k_t(x_t, x_{t+1}) = fN(x_{t+1}; f_t(x_t), Q_{t+1}).

ii) Observations y_t ∈ R^m are defined as y_t = m_t(x_t) + w_t, where m_t is a B(R^k)/B(R^m)-measurable function and the w_t ∼ N(0, R_t) are independent.

It is observed that y_t is defined to be of the form y_t = h_t(x_t, w_t), as given in Example 1, so the assumptions made in Section 3.2 hold. Similarly as in Example 1, let us define

P_t(x_t; A) = ∫ χ_{h_t^{−1}(A)}(x_t, w) P_{w_t}(dw) = P_{w_t}( {w ∈ R^m | h_t(x_t, w) ∈ A} ),

where A ∈ B(R^m) and h_t(x, y) = m_t(x) + y. Since now w_t is known to be normally distributed, it follows that

P_t(x_t; A) = ∫ χ_{h_t^{−1}(A)}(x_t, w) fN(w; 0, R_t) λ^m(dw) = ∫_A fN(y_t; m_t(x_t), R_t) λ^m(dy_t),

where the second equality follows from the change of variables. This implies that g_t(x_t) = fN(y_t; m_t(x_t), R_t).

Also, in this model, it is impossible to obtain the optimal importance distribution because of the intractable integrals ∫_{A_i} g_{t+1}(x) k_t(x^i_t, x) λ^k(dx). A feasible approximation is obtained by replacing g_{t+1} in the integral by g^i_{t+1}, which is the likelihood function obtained by replacing m_{t+1} in the definition of y_{t+1} by its first order Taylor series expansion at f_t(x^i_t). In other words, m_{t+1} is approximated by

m_{t+1}(x) ≈ m_{t+1}(f_t(x^i_t)) + M^i_{t+1}(x − f_t(x^i_t)),

where M^i_{t+1} is the Jacobian matrix of m_{t+1} evaluated at f_t(x^i_t). Then it can be shown that

g^i_{t+1}(x) k_t(x^i_t, x) = c^i_{t+1} fN(x; µ^i_{t+1}, C^i_{t+1}),    (5.30)

where

c^i_t = fN(y_t; m_t(f_{t−1}(x^i_{t−1})), M^i_t Q_t (M^i_t)^T + R_t),    (5.31)
µ^i_t = f_{t−1}(x^i_{t−1}) + C^i_t (M^i_t)^T R_t^{−1} (y_t − m_t(f_{t−1}(x^i_{t−1}))),    (5.32)
C^i_t = Q_t − Q_t (M^i_t)^T (M^i_t Q_t (M^i_t)^T + R_t)^{−1} M^i_t Q_t.    (5.33)

The proof of Equations (5.30)–(5.33) is straightforward but tedious and therefore omitted. For more details see, e.g. [Maybeck, 1979]. Clearly,

c^i_{t+1} = ∫ g^i_{t+1}(x) k_t(x^i_t, x) λ^k(dx),

and therefore the substitution of the approximation in Equation (5.30) into Equation (5.27) yields

γ_{t+1:t|t}(C) ≜ Σ_{i=1}^n w̄^i_t ∫_{A_i} fN(x; µ^i_{t+1}, C^i_{t+1}) λ^k(dx), C ∈ B(R^{2k}),    (5.34)

where

w̄^i_t = w^i_t c^i_{t+1} / Σ_{i=1}^n w^i_t c^i_{t+1}.

It is observed that Equation (5.32) and Equation (5.33) are exactly the update formulas of the well known extended Kalman filter (EKF) algorithm, when K_t(x^i_t, ·) is used as the prior distribution for iteration t + 1 [see, e.g. Anderson and Moore, 1979, page 195]. It should be pointed out that the use of EKF update equations in the construction of an importance distribution has been proposed in the literature [Doucet, 1998, van der Merwe et al., 2000]. Doucet [1998] proposed using an importance distribution which was otherwise similar to the one in Equation (5.34), except that the weights were set to w̃^i_t = w^i_t. Van der Merwe et al. [2000] also proposed choosing w̃^i_t = w^i_t, but instead of using K_t(x^i_t, ·) as the prior distribution for iteration t + 1, it was proposed to use a normal distribution

N(f_t(x^i_t), F^i_t C^i_t (F^i_t)^T + Q_{t+1}),

where F^i_t is the Jacobian matrix of f_t evaluated at x^i_t.
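
A compact way to read Equations (5.31)–(5.33) is as a per-particle EKF measurement update. The sketch below evaluates c^i_t, µ^i_t and C^i_t for a single particle; it is only illustrative, and f, m, jac_m, Q, R and y are hypothetical stand-ins for f_{t−1}, m_t, its Jacobian, Q_t, R_t and the observation y_t.

```python
import numpy as np

def ekf_importance_params(x_prev, y, f, m, jac_m, Q, R):
    """Evaluate c^i_t, mu^i_t and C^i_t of Equations (5.31)-(5.33) for one particle."""
    x_pred = f(x_prev)                      # f_{t-1}(x^i_{t-1})
    M = jac_m(x_pred)                       # Jacobian M^i_t of m_t at the prediction
    S = M @ Q @ M.T + R                     # innovation covariance M Q M^T + R
    C = Q - Q @ M.T @ np.linalg.solve(S, M @ Q)                 # Equation (5.33)
    mu = x_pred + C @ M.T @ np.linalg.solve(R, y - m(x_pred))   # Equation (5.32)
    r = y - m(x_pred)                       # innovation used in the normal density (5.31)
    d = len(y)
    c = np.exp(-0.5 * r @ np.linalg.solve(S, r)) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return c, mu, C
```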

Instead of using the EKF update formulas, it has also been proposed that one should use the unscented Kalman filter (UKF) update equations in the construction of the importance distribution. The UKF is a relatively recent method for approximating the Bayesian filter, and in several references it is reported to outperform the EKF in approximating the Bayesian posterior measure. Therefore, van der Merwe et al. [2000] also prefer the use of the UKF instead of the EKF in the construction of the importance distribution.

Moreover, it has been reported in the literature that the Kalman filter based importance distribution construction is superior, in particular, in situations where the likelihood function g_{t+1} is highly peaked [van der Merwe et al., 2000], in other words, where the measurement noise is small compared to the uncertainty in the process model.

5.5 Resampling

The SIR algorithm, as described in Section 5.3, is based on an IID sample {(x^{j_i}_t, x^i_{t+1}) | i = 1, 2, . . . , n}, where each pair (x^{j_i}_t, x^i_{t+1}) has the distribution γ_{t+1:t|t}. Because, for given x^{j_i}_t, the latter component of the pair (x^{j_i}_t, x^i_{t+1}) has the distribution K_t(x^{j_i}_t, ·), the simulation of the sample {(x^{j_i}_t, x^i_{t+1}) | i = 1, 2, . . . , n} according to γ_{t+1:t|t} can be done in two steps. First, an IID sample {j_i | i = 1, 2, . . . , n} is simulated according to a discrete distribution such that P(j_i = k) = w^k_t. Second, a sample {x^i_{t+1} | i = 1, 2, . . . , n} is simulated such that x^i_{t+1} ∼ K_t(x^{j_i}_t, ·), i = 1, 2, . . . , n. Note that the random variables x^i_{t+1} should be independent of each other.

The simulation of the sample {j_i | i = 1, 2, . . . , n} is often called resampling in the SMC related literature. Supposedly, the name originates from the fact that if w̃^i_t = w^i_t, then the sample {x^{j_i}_t | i = 1, 2, . . . , n} can be considered to constitute a sample based approximation of a discrete distribution which itself is a sample based approximation of some other, possibly nondiscrete, distribution. This is also the etymology of the yet unaccounted-for term 'resampling' in the name sampling/importance resampling. There are different resampling methods, and in the following some of these methods will be described.

5.5.1 Multinomial Resampling

Let us first consider the simulation of an IID sample of size n from γ_{t+1:t|t}. This is done as described above. Let u_i denote the number of indices equal to i in the sample {j_i | i = 1, 2, . . . , n}. In this case, u ≜ [u_1, u_2, . . . , u_n]^T is a random variable such that u_i ∈ {0, 1, . . . , n} for all i = 1, 2, . . . , n and ∑_{i=1}^n u_i = n with probability one. The random variable u is then said to be multinomially distributed with parameters w^1_t, w^2_t, . . . , w^n_t and n. In other words, the simulation of an IID sample {j_i | i = 1, 2, . . . , n} according to the distribution P(j_i = k) = w^k_t corresponds to simulating a single sample from a multinomial distribution. Therefore, the SIR algorithm that is based on an IID sample of size n from γ_{t+1:t|t} is often said to employ multinomial resampling. Occasionally, multinomial resampling is also called sampling with replacement, referring to the interpretation that an index i is picked with probability w^i_t and, after picking i, it is put back into the set of possible indices so that it can be chosen again.
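
A minimal sketch of this step, not taken from the thesis: the offspring counts u are drawn from a multinomial distribution and expanded into the index sample {j_i}; the NumPy generator is an implementation convenience.

```python
import numpy as np

def multinomial_resample(weights, rng):
    """Simulate indices j_1, ..., j_n with P(j_i = k) = w^k_t by drawing a single
    multinomial sample of the offspring counts u_1, ..., u_n."""
    n = len(weights)
    counts = rng.multinomial(n, weights)    # u ~ Multinomial(n; w^1_t, ..., w^n_t)
    return np.repeat(np.arange(n), counts)  # 'sampling with replacement' view of u

# example: indices = multinomial_resample(w, np.random.default_rng(0))
```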

5.5.2 Stratified Resampling

According to the discussion in Section 4.4, stratified sampling with proportional sample allocation yields a more accurate approximation of an integral than classical Monte Carlo, regardless of the integrand. In the context of the SIR algorithm, the use of proportional allocation would imply that the sample size n_i in the ith stratum is nγ_{t+1:t|t}(S_i), where S_i ∈ B(R^{2k}), i = 1, 2, . . . , n_s, is the ith stratum. In general, nγ_{t+1:t|t}(S_i) will not be an integer, and thus exact proportional allocation is, in general, impossible. To avoid this problem, it has been proposed in the literature to divide the space of interest¹, i.e. {x^1_t, x^2_t, . . . , x^n_t} × R^k, into n strata so that the probability of each stratum is equal to n^{−1} [Kitagawa, 1996]. For such strata, exact proportional sample allocation is possible by simply simulating one sample from each stratum.

¹ Here the space of interest is considered to be {x^1_t, x^2_t, . . . , x^n_t} × R^k instead of R^{2k}. This can be done because the π′_{t+1:t|t}-measure of the set R^{2k} \ ({x^1_t, x^2_t, . . . , x^n_t} × R^k) is zero.

Equally probable strata can be defined in many different ways. An efficient algorithm is, however, obtained by letting the strata overlap, as proposed by Kitagawa [1996]. Let C ∈ B(R^{2k}), and let C_i and A_i be as defined in Equation (5.4) and Equation (5.5). Moreover, let the measures {π^i | i = 1, 2, . . . , n} be defined to be of the form

π^i(C) ≜ ∑_{j=1}^n c_{i,j} K_t(x^j_t, A_j),    C ∈ B(R^{2k}),

where, according to the discussion in Section 4.4, the coefficients c_{i,j} must be chosen such that

γ_{t+1:t|t}(C) = ∑_{i=1}^n π^i(C),    C ∈ B(R^{2k}).    (5.35)

Let us assume that ∑_{i=1}^n c_{i,j} = w^j_t. Then, by the definition of π^i,

∑_{i=1}^n π^i(C) = ∑_{i=1}^n ∑_{j=1}^n π^i(C_j) = ∑_{j=1}^n ∑_{i=1}^n c_{i,j} K_t(x^j_t, A_j),

and therefore ∑_{i=1}^n π^i(C) = ∑_{j=1}^n w^j_t K_t(x^j_t, A_j) = γ_{t+1:t|t}(C). On the other hand, if we assume that γ_{t+1:t|t}(C) = ∑_{i=1}^n π^i(C) for all C ∈ B(R^{2k}), then for C = {x^k_t} × B,

∑_{i=1}^n π^i({x^k_t} × B) = ∑_{i=1}^n c_{i,k} K_t(x^k_t, B) = w^k_t K_t(x^k_t, B),    B ∈ B(R^k).    (5.36)

The last equality follows from the fact that A_i = ∅ if i ≠ k and A_i = B if i = k. Therefore ∑_{i=1}^n c_{i,k} = w^k_t. Thus, it has been established that a necessary and sufficient condition for Equation (5.35) to hold for all C ∈ B(R^{2k}) is that

∑_{i=1}^n c_{i,j} = w^j_t,    j = 1, 2, . . . , n.    (5.37)

This condition does not, however, ensure that the probabilities of the strata would be equal. A necessary and sufficient condition for this is obtained straightforwardly by noting that

π^i(R^{2k}) = ∑_{j=1}^n c_{i,j} K_t(x^j_t, R^k) = ∑_{j=1}^n c_{i,j} = n^{−1}.    (5.38)

Once the decomposition {π^i}_{i=1}^n of γ_{t+1:t|t} has been defined, it remains to simulate the desired number of samples, in this case one, from each stratum. According to the discussion in Section 4.4, the distribution of the random variable in the ith stratum is γ^i(C) = π^i(C)/π^i(R^{2k}) = nπ^i(C), for all C ∈ B(R^{2k}).

The conditions on c_{i,j} given in Equation (5.37) and Equation (5.38) do not uniquely define the stratification. A stratification that allows efficient implementation and satisfies the given conditions was proposed by Kitagawa [1996]. Let v be the cumulative distribution function (CDF) of j_i, defined as

v(x) ≜ ∑_{i=1}^{⌊x⌋} w^i_t,

where ⌊·⌋ denotes the integer part of the given real number. Then, using the notation v_i = v(i), the coefficients c_{i,j} are defined as

c_{i,j} =
    0,                if i/n < v_{j−1},
    i/n − v_{j−1},    if v_{j−1} ≤ i/n < v_j and (i−1)/n < v_{j−1},
    1/n,              if v_{j−1} ≤ i/n < v_j and (i−1)/n ≥ v_{j−1},
    w^j_t,            if v_j ≤ i/n and (i−1)/n ≤ v_{j−1},
    v_j − (i−1)/n,    if v_j ≤ i/n and v_{j−1} < (i−1)/n < v_j,
    0,                if v_j ≤ i/n and (i−1)/n ≥ v_j.    (5.39)

In other words, c_{i,j} is the length of the intersection of the intervals [(i−1)/n, i/n] and [v_{j−1}, v_j]. Because the essence of the coefficients c_{i,j} is not obvious from the complicated definition in Equation (5.39), this is illustrated by the following example.

Example 11: Suppose that w^1_t = 4/10, w^2_t = 3/10, w^3_t = 2/10, and w^4_t = 1/10. Let k_t(x_t, x_{t+1}) = f_N(x_{t+1}; 4x_t, 1), and x^1_t = 1, x^2_t = 2, x^3_t = 3, x^4_t = 4. Then the density of the distribution γ_{t+1:t|t} with respect to λ_B × λ_k, where B = {x^1_t, x^2_t, x^3_t, x^4_t}, appears as depicted in Figure 5.4(a). Figure 5.4(b) illustrates the CDF v(x) and also all those coefficients c_{i,j} that are nonzero. All coefficients c_{i,j} can be given conveniently as a matrix

C =
    [ 1/4    0      0      0    ]
    [ 3/20   1/10   0      0    ]
    [ 0      1/5    1/20   0    ]
    [ 0      0      3/20   1/10 ],

where c_{i,j} = [C]_{ij}. It is observed that all row sums are equal to 1/4 = n^{−1} and the column sums are 4/10, 3/10, 2/10, and 1/10, as expected. The densities of the distributions γ^1, γ^2, γ^3, and γ^4 with respect to λ_B × λ_k are illustrated in Figures 5.4(c)–5.4(f), respectively. Obviously, the strata overlap.

Figure 5.4: (a) The density of the importance distribution γ_{t+1:t|t} with respect to λ_B × λ_k. (b) The CDF of j_i and a division of [0, 1] into four equally long intervals. (c)–(f) The densities of the distributions γ^1, γ^2, γ^3, and γ^4 with respect to λ_B × λ_k.
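
The interval-overlap reading of Equation (5.39) makes the coefficients easy to compute; the short check below, which is not part of the thesis, reproduces the matrix of Example 11.

```python
import numpy as np

def stratification_coefficients(weights):
    """c_{i,j} of Equation (5.39): the length of the overlap of the intervals
    [(i-1)/n, i/n] and [v_{j-1}, v_j], where v is the CDF of the weights."""
    n = len(weights)
    v = np.concatenate(([0.0], np.cumsum(weights)))
    C = np.zeros((n, n))
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            lo = max((i - 1) / n, v[j - 1])
            hi = min(i / n, v[j])
            C[i - 1, j - 1] = max(0.0, hi - lo)
    return C

# Example 11: rows sum to 1/4, columns sum to the weights.
print(stratification_coefficients([0.4, 0.3, 0.2, 0.1]))
```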

If the stratification is done as proposed above, then the stratified sampling approximation of Υ(A) is

Υ_ST(A) = (1/n) ∑_{i=1}^n χ_A(x^i_t) g_t(x^i_t) · w^{j_i}_{t−1} k_{t−1}(x^{j_i}_{t−1}, x^i_t) / (w̃^{j_i}_{t−1} k̃_{t−1}(x^{j_i}_{t−1}, x^i_t)),

where (x^{j_i}_{t−1}, x^i_t) ∼ γ^i. Consequently, the approximation of π′_{t|t} is a discrete probability measure of the form given in Equation (5.9), where the unnormalised weights are identical to those given in Equation (5.23). Therefore, it is observed that the only difference between the SIR algorithm with multinomial resampling and the SIR algorithm with stratified resampling is the random sample simulation. This similarity remains for arbitrary stratifications as long as the probabilities of the strata are n^{−1} and the sample allocation is proportional.

It should be noted that, regardless of the inconvenient formulation of the probabilities in Equation (5.39), the simulation of random samples from γ^i can be done efficiently. Detailed pseudo-code descriptions of efficient stratified sampling algorithms can be found in the literature [see, e.g. Arulampalam et al., 2002, Carpenter et al., 1999, Kitagawa, 1996]. In principle, the stratified sampling could be done as follows. First, an IID sample {x_i}_{i=1}^n is simulated from U([0, 1/n]). Then, for all i = 1, 2, . . . , n,

j_i = inf{x | v(x) ≥ x_i + (i − 1)/n}.

In this case, the sample {x^{j_i}_t | i = 1, 2, . . . , n} constitutes the first components of the pairs (x^{j_i}_t, x^i_{t+1}). It then remains to simulate the samples x^i_{t+1} similarly as in the case of multinomial resampling, which yields a stratified sample from γ_{t+1:t|t}.
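
The procedure just described is easy to implement; the sketch below is a minimal, non-authoritative version in which the CDF inversion is done with a sorted search.

```python
import numpy as np

def stratified_resample(weights, rng):
    """Kitagawa-style stratified resampling: draw u_i ~ U([0, 1/n]) and invert the
    CDF v at the points u_i + (i-1)/n, giving one index from each stratum."""
    n = len(weights)
    points = rng.uniform(0.0, 1.0 / n, size=n) + np.arange(n) / n
    cdf = np.cumsum(weights)
    return np.searchsorted(cdf, points)   # smallest j with v(j) >= u_i + (i-1)/n

# example: indices = stratified_resample(w, np.random.default_rng(0))
```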

5.5.3 Other Stratification Methods

It should be noted that the conditions on the coefficients c_{i,j} given in Equation (5.37) and Equation (5.38) are necessary and sufficient for enabling exact proportional allocation. However, the method described in the previous section is not the only possible way to perform exact proportional allocation. Indeed, multinomial resampling can be regarded as the stratification where c_{i,j} = w^j_t/n. In this case, the stratification does not, however, introduce any improvement. Because it is possible to have a stratification which does not introduce any improvement, we draw the conclusion that, in order to ensure an improvement in the accuracy of the algorithm, the choice of stratification is not irrelevant. In the literature, however, not much has been said about the choice of the strata.

The preceding discussion regarded only equally probable strata. More flexibility is obtained when this condition is omitted. Moreover, one can alternatively draw more than one sample from each stratum. In this case, however, the number of strata should be less than n in order to ensure that the sample size remains constant. This is because one must ensure that at least one sample is simulated from each stratum. Otherwise the resulting approximation would be biased and the results for stratification given in Section 4.4 would not necessarily apply.

According to the preceding discussion, stratified sampling can be done in a number of different ways. It is straightforward to construct an integrand for which the stratification described in the previous section is by no means optimal. However, for an arbitrary integrand, it is more complicated to give general results. Exact proportional allocation with equally probable strata is convenient in the sense that its accuracy gain does not depend on the integrand. Therefore it is always guaranteed to perform at least as well as multinomial resampling.


5.5.4 Sequential Importance Sampling

Often in the literature, the SIR algorithm has been described as a so-called sequential importance sampling (SIS) algorithm augmented by a resampling step [Doucet et al., 2001b, Liu and Chen, 1998]. In this section, the connection between SIS and SIR is described somewhat differently: SIS is considered as a SIR algorithm with a certain stratification.

In the SIR context, a natural partition of {x^1_t, x^2_t, . . . , x^n_t} × R^k is obtained by defining S = {S_1, S_2, . . . , S_n}, where

S_i = {(x_t, x_{t+1}) ∈ R^{2k} | x_t = x^i_t}.

Again, letting C ∈ B(R^{2k}) and C_i and A_i be as defined in Equation (5.4) and Equation (5.5), the measures π^i(C) = γ_{t+1:t|t}(C_i), for all i = 1, 2, . . . , n, define a decomposition of γ_{t+1:t|t} as required by stratification. Clearly,

π^i(R^{2k}) = γ_{t+1:t|t}({x^i_t} × R^k) = w̃^i_t.

Thus, the approximation of Υ(A) is

Υ_SIS(A) = ∑_{i=1}^{n_s} w̃^i_t Υ^i_SIS(A),    (5.40)

where

Υ^i_SIS(A) = (1/n_i) ∑_{j=1}^{n_i} χ_A(x^j_{t+1}) g_{t+1}(x^j_{t+1}) · w^i_t k_t(x^i_t, x^j_{t+1}) / (w̃^i_t k̃_t(x^i_t, x^j_{t+1})).    (5.41)

A brief calculation shows that the resulting approximation of π′_{t+1|t+1} is a discrete probability measure of the form given in Equation (5.9). The unnormalised weights are

w^i_{t+1} = g_{t+1}(x^i_{t+1}) · w^{j_i}_t k_t(x^{j_i}_t, x^i_{t+1}) / (n_{j_i} k̃_t(x^{j_i}_t, x^i_{t+1})),

where n_{j_i} denotes the number of samples simulated from the same stratum as x^i_{t+1}.

It is observed that, regardless of the choice of the weights w̃^i_t in the importance distribution, they are cancelled when substituting Υ^i_SIS(A) into Equation (5.40). Therefore, the choice of these weights can be arbitrary. Also, one should note that, in order to ensure the validity of the stratified approximation, at least one sample should be simulated from each stratum. On the other hand, in order to ensure that the sample size does not increase over time, only n samples should be simulated. Consequently, one must choose n_i = 1 for all i = 1, 2, . . . , n. Clearly, this choice of the values n_i yields the well known update formula for the SIS algorithm [see, e.g. Arulampalam et al., 2002, Doucet et al., 2001a, Doucet, 1998].
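
As a concrete reading of the case n_i = 1, the sketch below performs one SIS step: every particle is propagated through an importance kernel and reweighted. It is only illustrative; sample_q, q_density, k_density and g are hypothetical stand-ins for a sampler and density of the importance kernel k̃_t, the transition density k_t, and the likelihood g_{t+1}.

```python
import numpy as np

def sis_step(particles, weights, sample_q, q_density, k_density, g, rng):
    """One SIS update with n_i = 1: w^i_{t+1} is proportional to w^i_t * g_{t+1} * k_t / k~_t."""
    new_particles = np.array([sample_q(x, rng) for x in particles])
    incr = (g(new_particles)
            * k_density(particles, new_particles)    # true transition density k_t
            / q_density(particles, new_particles))   # importance kernel density k~_t
    w = weights * incr                               # no resampling step is performed
    return new_particles, w / w.sum()
```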


5.5.5 Systematic Resampling

There is one more well known resampling method, known as systematic resampling, that deserves to be mentioned, although its theoretical treatment is omitted in this thesis. The significance of this method is that it enables efficient implementations, and it has been proved to be the optimal resampling method in the sense that it gives the smallest variance V[n_i w^i_t] of all resampling methods. It is not, however, obvious from this property that a SIR with systematic resampling would have superior performance compared to SIRs with other resampling methods. In addition to the minimum variance property, systematic resampling has been proved to minimise the relative entropy between the approximation π^n_{t|t} and the exact π_{t|t} in the case of a discrete state space. Here, however, π_{t|t} is assumed to be a discrete distribution, which is quite restrictive. The proofs of the minimum variance and minimum relative entropy properties of systematic resampling can be found in [Crisan and Lyons, 2002].

Systematic resampling can be implemented, e.g. by using the tree based branching algorithm, as proposed in [Crisan and Lyons, 2002, Crisan, 2001]. An alternative and efficient algorithm for systematic resampling can also be found, e.g. in [Arulampalam et al., 2002, Carpenter et al., 1999]. Let it also be pointed out that although systematic resampling was proposed for use with SMC methods by Kitagawa [1996], it was already a known method in the field of genetic algorithms, where it was called stochastic universal sampling [Baker, 1987].
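
For comparison with the stratified scheme above, the following sketch shows the common inverse-CDF form of systematic resampling (a single uniform offset shared by all strata), in the spirit of the algorithms cited in Arulampalam et al. [2002] and Carpenter et al. [1999]; it is not the tree based branching algorithm of Crisan and Lyons [2002].

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: one uniform offset u is shared by the n equally
    spaced points (u + i)/n, which are then inverted through the CDF."""
    n = len(weights)
    points = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), points)
```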

5.6 Regularised Particle Filters

So far, the approximations of the Bayesian filter have been based on various Monte Carlo approximations. In these methods, the approximation of the Bayesian posterior distribution is inherently discrete, although the true Bayesian posterior π_{t|t} is in many situations known to be continuous. One of the consequences of this approximation is that in the sample {(x^{j_i}_t, x^i_{t+1}) | i = 1, 2, . . . , n}, the components x^{j_i}_t may be equal for more than one particle². For such pairs, the components x^i_{t+1} are identically distributed. If the pairs (x^i_t, x^i_{t+1}) were simulated from a continuous distribution, the probability of two or more of the components x^i_t having the same value would be zero. Loosely speaking, this means that the sample {(x^{j_i}_t, x^i_{t+1}) | i = 1, 2, . . . , n} is less diverse than desired. The loss of diversity of the samples may lead to undesired clustering of the samples and eventually to inaccurate results. This is a significant problem, especially in situations where the uncertainty of the process model is small compared to the uncertainty of the measurements, i.e. the exact opposite of the situation where the EKF or UKF importance distributions are reported to have superior performance. In the literature, this problem is often referred to as sample impoverishment [Arulampalam et al., 2002]. To overcome this problem, so called regularised particle filters (RPF) have been proposed [Musso et al., 2001, Oudjane and Musso, 2000].

² SIS is, of course, an exception.

Regularised particle filters are based on the methodology of kernel density estimation, which dates back to the 1950s and is itself a large branch of statistical mathematics. The terms regularisation and kernel density estimation will be used interchangeably. In this work, kernel density estimation as such is not the topic of interest. However, in order to cover the basics of kernel density estimation that are required by the description of the regularised particle filter, some elementary results and definitions are included in Appendix B. For more details on density estimation, the reader is advised to consult the references given in Appendix B.

5.6.1 Post-Regularised Particle Filter

Perhaps the most straightforward use of kernel density estimation in the SIR algorithm is the post-regularised particle filter, or post-RPF, as proposed in Musso et al. [2001]. In this method, the discrete Bayesian posterior distribution approximation π^n_{t|t} is replaced by a continuous distribution π̄^n_{t|t} obtained by regularisation. The density of π̄^n_{t|t} is

p̄^n_{t|t}(x) = (1/det(H_n)) ∑_{i=1}^n w^i_t K_E(H_n^{−1}(x − x^i_t)).    (5.42)

Here {x^i_t | i = 1, 2, . . . , n} is the sample on which π^n_{t|t} is based, {w^i_t | i = 1, 2, . . . , n} are the weights of the samples, and the mapping K_E is the Epanechnikov regularisation kernel (see Definition 52 in Section B.2). The matrix H_n is defined as

H_n = C^{1/2} ( 8(k + 4)(2√π)^k / (n V_k) )^{1/(k+4)},

where V_k is the volume of the k-dimensional unit hypersphere and C^{1/2} is the matrix square root³ of the sample covariance of the samples {x^i_t | i = 1, 2, . . . , n}. A detailed derivation of H_n for an equally weighted sample is given in Section B.3 and Section B.4.

³ The sample covariance C is assumed to be positive definite. It can then be decomposed as C = QΛQ^T, where Q is orthogonal and the diagonal matrix Λ = ⌈λ_1, λ_2, . . . , λ_k⌋ consists of the eigenvalues of C. We then define C^{1/2} ≜ QΛ^{1/2}Q^T, where Λ^{1/2} = ⌈λ_1^{1/2}, λ_2^{1/2}, . . . , λ_k^{1/2}⌋.
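
A small sketch of this construction is given below; it evaluates H_n and the regularised density (5.42) at a point. It is not from the thesis: the normalising constant used for the multivariate Epanechnikov kernel and the use of a weighted sample covariance are assumptions, since Definition 52 and Appendix B are not reproduced here.

```python
import numpy as np
from math import gamma, pi, sqrt

def unit_ball_volume(k):
    """Volume V_k of the k-dimensional unit hypersphere."""
    return pi ** (k / 2) / gamma(k / 2 + 1)

def epanechnikov(u):
    """Assumed multivariate Epanechnikov kernel K_E(u) = c_k (1 - ||u||^2) on the unit ball."""
    k = u.shape[-1]
    c_k = (k + 2) / (2 * unit_ball_volume(k))
    sq = np.sum(u * u, axis=-1)
    return np.where(sq <= 1.0, c_k * (1.0 - sq), 0.0)

def post_rpf_density(x, samples, weights):
    """Regularised posterior density (5.42) with the bandwidth matrix H_n of this section."""
    n, k = samples.shape
    C = np.atleast_2d(np.cov(samples, rowvar=False, aweights=weights))
    evals, evecs = np.linalg.eigh(C)
    C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T        # matrix square root of C
    H = C_half * (8 * (k + 4) * (2 * sqrt(pi)) ** k / (n * unit_ball_volume(k))) ** (1 / (k + 4))
    u = (x[None, :] - samples) @ np.linalg.inv(H).T           # H_n^{-1}(x - x^i_t)
    return np.sum(weights * epanechnikov(u)) / np.linalg.det(H)
```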

Similarly as in the SIR, the exact Bayesian prediction distribution π_{t+1:t|t} is naturally approximated in the post-RPF by the unique measure π′_{t+1:t|t} satisfying

π′_{t+1:t|t}(A × B) = ∫_B K_t(x_t, A) π̄^n_{t|t}(dx_t),    A, B ∈ B(R^k).

Because of the uniqueness of π′_{t+1:t|t}, it follows that π′_{t+1:t|t}(C) = ∫_C k_t p̄^n_{t|t} dλ_{2k} for all C ∈ B(R^{2k}). Therefore k_t p̄^n_{t|t} is a density of π′_{t+1:t|t} with respect to λ_{2k}. Similarly, k̃_t p̄^n_{t|t} is shown to be a density of γ_{t+1:t|t} with respect to λ_{2k}. According to Lemma 1 and Equation (5.10),

π_{t+1:t|t+1}(C) = ∫_C g_{t+1} dπ′_{t+1:t|t} = ∫_C g_{t+1} k_t p̄^n_{t|t} dλ_{2k},    C ∈ B(R^{2k}),

implying that g_{t+1} k_t p̄^n_{t|t} is the density of π_{t+1:t|t+1} with respect to λ_{2k}. Consequently, according to Lemma 2,

dπ_{t+1:t|t+1}/dγ_{t+1:t|t} = g_{t+1}(x_{t+1}) k_t(x_t, x_{t+1}) / k̃_t(x_t, x_{t+1}),

γ_{t+1:t|t}-a.e. In the same way as in the SIR, Υ(A) can be approximated by

Υ_RPF1(A) ≜ (1/n) ∑_{i=1}^n χ_A(x^i_{t+1}) g_{t+1}(x^i_{t+1}) k_t(x^i_t, x^i_{t+1}) / k̃_t(x^i_t, x^i_{t+1}),

where (x^i_t, x^i_{t+1}) ∼ γ_{t+1:t|t}, i = 1, 2, . . . , n, independently. By taking the ratio Υ_RPF1(A)/Υ_RPF1(R^k), a discrete approximation of π′_{t+1|t+1} is obtained. This approximation is of the form given in Equation (5.9), and the unnormalised weights w^i_{t+1} are

w^i_{t+1} = g_{t+1}(x^i_{t+1}) k_t(x^i_t, x^i_{t+1}) / k̃_t(x^i_t, x^i_{t+1}).

The resulting approximation appears nearly identical to that of the SIR algorithm. Indeed, the only practical difference between the post-RPF and the SIR is the resampling. Instead of simulating an IID sample of size n from the discrete distribution P(j_i = k) = w^k_t, the sample is simulated from the continuous distribution π̄^n_{t|t}. In theory, however, the difference appears to be more fundamental. Here the definition of π′_{t+1:t|t} departs from the definition given in Section 5.1 and, therefore, Υ(A) and π′_{t|t} have different interpretations as well. This implies that an altogether different distribution π′_{t|t} is being approximated at a given iteration than in the SIR algorithm. To illustrate this fundamental difference we give the following example.

Example 12: Consider the joint distribution π_{t+1:t|t} of x_t and x_{t+1} described in Example 10. Figure 5.5(a) shows a contour plot of the density of π_{t+1:t|t} with respect to λ_2. Figure 5.5(b), in turn, illustrates the density of the approximate measure π′_{t+1:t|t} in the post-RPF. For comparison, see the illustration of the approximation used in the SIR in Figure 5.1(c).

It is not straightforward to say anything about the superiority of the given approximations of π_{t+1:t|t} in the post-RPF and in the SIR. However, it has been reported in the literature that the post-RPF performs significantly better than the SIR when the uncertainty in the process model is small compared to the uncertainty of the measurement model [Musso et al., 2001].

Originally, the post-RPF was proposed only in the bootstrap filter context, implying that k̃_t = k_t [Musso et al., 2001]. The method described above is a slightly extended version of the same method, allowing k̃_t ≠ k_t, and it was proposed by Arulampalam et al. [2002]. There seems to be no reason why the same generalisation could not be made for the post-RPF as for the SIR algorithm. This is to say that, instead of defining γ_{t+1:t|t} to have the density k̃_t p̄^n_{t|t} with respect to λ_{2k}, the importance density could be defined to be of the form k̃_t w_t, where w_t is some appropriate density satisfying the required absolute continuity in order to ensure the existence of dπ_{t+1:t|t+1}/dγ_{t+1:t|t}. In this case, the unnormalised weights in the resulting approximation of π′_{t+1|t+1} would be

w^i_{t+1} = g_{t+1}(x^i_{t+1}) k_t(x^i_t, x^i_{t+1}) p̄^n_{t|t}(x^i_t) / (k̃_t(x^i_t, x^i_{t+1}) w_t(x^i_t)).

Figure 5.5: (a) A contour plot of the density of π_{t+1:t|t} with respect to λ_2. (b) A contour plot of the density of π′_{t+1:t|t} with respect to λ_2.

5.6.2 Pre-Regularised Particle Filter

Another SMC method that utilises regularisation was also proposed by Musso et al. [2001]. This pre-regularised particle filter, or pre-RPF, is fundamentally different from the methods described so far in the sense that, while all the previous methods were based on importance sampling, the pre-RPF is based on the rejection method described in Section 4.5. One of the effects of using the rejection method instead of importance sampling is that the distributions are always approximated by unweighted discrete probability measures instead of weighted discrete measures.

Let B = {x^i_t | i = 1, 2, . . . , n} be the set of particles that forms the discrete approximation π^n_{t|t} of the Bayesian posterior distribution π_{t|t} at time instant t. In the pre-RPF, the Bayesian prediction distribution π_{t+1:t|t} is approximated by a regularised measure π^n_{t+1:t|t} whose density with respect to the product measure λ_B × λ_k is

p^n_{t+1:t|t}(x_t, x_{t+1}) =
    (1/(n det(H_n))) K_E(H_n^{−1}(x_{t+1} − x^i_{t+1})),    if x_t = x^i_t,
    1,    otherwise,

where x^i_{t+1} ∼ K_t(x^i_t, ·) for all i = 1, 2, . . . , n, and the x^i_{t+1} are independent of each other. In this case, the normalised version of π_{t+1:t|t+1} has the density

p′_{t+1:t|t+1}(x_t, x_{t+1}) =
    (g_{t+1}(x_{t+1})/(z n det(H_n))) K_E(H_n^{−1}(x_{t+1} − x^i_{t+1})),    if x_t = x^i_t,
    1,    otherwise,    (5.43)

with respect to λ_B × λ_k. Here z is the normalisation coefficient of π_{t+1:t|t+1},

z = π_{t+1:t|t+1}(R^{2k}) = ∑_{i=1}^n ∫ g_{t+1}(x) p^n_{t+1:t|t}(x^i_t, x) λ_k(dx),    (5.44)

making p′_{t+1:t|t+1} a proper probability density. Consequently, π_{t+1|t+1} is approximated by π′_{t+1|t+1}(A) = π′_{t+1:t|t+1}(A × R^k). This is to say that p′_{t+1:t|t+1} is chosen to be the density of the target distribution, and a discrete approximation π^n_{t+1|t+1} of π′_{t+1|t+1} is obtained by taking the components x^i_{t+1} of the pairs (x^{j_i}_t, x^i_{t+1}) ∼ p′_{t+1:t|t+1}.

Let the instrumental distribution be denoted by γ_{t+1:t|t} and its density with respect to λ_B × λ_k by q_{t+1:t|t}. In the pre-RPF, we choose γ_{t+1:t|t} = π^n_{t+1:t|t}. Following the discussion in Section 4.5, it remains to choose the coefficients c and d for the instrumental and the target distribution such that

c/d ≥ ess sup_{(x,y)∈S} p′_{t+1:t|t+1}(x, y) / q_{t+1:t|t}(x, y),

where S = {(x, y) | x, y ∈ R^k, q_{t+1:t|t}(x, y) > 0}. By setting d = z, it remains to choose c such that

c/z ≥ ess sup_{(x,y)∈S} p′_{t+1:t|t+1}(x, y) / q_{t+1:t|t}(x, y) = max_{i=1,2,...,n} ess sup_{x∈S_i} g_{t+1}(x)/z,

where S_i = {x ∈ R^k | q_{t+1:t|t}(x^i_t, x) > 0}. Let us define

g*_i ≜ ess sup_{x∈S_i} g_{t+1}(x),    g* ≜ max_{i=1,2,...,n} g*_i.

The essential suprema are, of course, assumed to exist. In terms of acceptance rate, the optimal choice for c would be c = g*.

Note that a particle x ∼ γ_{t+1:t|t} is accepted with probability

d p′_{t+1:t|t+1}(x) / (c q_{t+1:t|t}(x)) = g_{t+1}(x)/c,

meaning that, in practice, the normalisation coefficient z does not have to be known. It should also be noted that if a kernel whose support is all of R^k is used instead of the Epanechnikov kernel, then S_i = R^k and g*_i = g* for all i = 1, 2, . . . , n. An example of such a kernel is the standard normal kernel (see Appendix B).
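
The following sketch implements one pre-RPF sampling step. To keep the proposal easy to simulate, it uses the standard normal kernel mentioned above rather than the Epanechnikov kernel; pred_particles (the one-step-ahead samples x^i_{t+1}), the bandwidth matrix H, the likelihood g and the constant c (which must dominate the essential supremum of g_{t+1}) are all assumed inputs.

```python
import numpy as np

def pre_rpf_step(pred_particles, H, g, c, n_out, rng):
    """Rejection sampling from the regularised prediction mixture: propose a
    stratum with probability 1/n, draw x from the kernel at x^j_{t+1}, and
    accept with probability g_{t+1}(x)/c; the coefficient z is never needed."""
    n, k = pred_particles.shape
    accepted = []
    while len(accepted) < n_out:
        j = rng.integers(n)
        x = pred_particles[j] + H @ rng.standard_normal(k)   # normal kernel draw
        if rng.uniform() < g(x) / c:
            accepted.append(x)
    return np.array(accepted)
```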

A method quite similar to the pre-RPF was proposed by Hurzeler and Kunsch [1998]; it was originally called the kernel method and has later been referred to as the kernel filter [Musso et al., 2001, page 260]. In fact, the kernel filter can be regarded as a generalisation of the pre-RPF, meaning that, for a certain choice of parameters, the kernel filter and the pre-RPF coincide. Moreover, it can be shown that, in certain situations, the kernel filter is more efficient than the pre-RPF in terms of acceptance rate.

The target distribution in the kernel filter is the same as in the pre-RPF, but, instead of choosing π^n_{t+1:t|t} to be the instrumental distribution, we choose γ_{t+1:t|t} to have the density

q_{t+1:t|t}(x_t, x_{t+1}) =
    (a_i/(a det(H_n))) K_E(H_n^{−1}(x_{t+1} − x^i_{t+1})),    if x_t = x^i_t,
    1,    otherwise,

with respect to λ_B × λ_k. Here B = {x^i_t | i = 1, 2, . . . , n}, a = ∑_{i=1}^n a_i, and a_i > 0 for all i = 1, 2, . . . , n. Clearly, the choice a_i = n^{−1} yields the pre-RPF.

for all i = 1, 2, . . . , n. Clearly, the choice ai = n−1 yields the pre-RPF.In order to define the required coefficients c and d for the instrumental and

the target distribution, let us again choose d = z. Then it remains to choose csuch that

c

z≥ ess sup

(x,y)∈S

p′t+1:t|t+1(x, y)

qt+1:t|t(x, y)= max

i=1,2,...,ness sup

x∈Si

agt+1(x)

znai=

ag∗i∗

znai∗, (5.45)

where i∗ = arg maxi=1,2,...,n g∗i /ai. The unknown normalisation coefficient is can-

celled, yielding a feasible condition c ≥ ag∗i∗/(nai∗).

In order to address the choice of the yet undefined parameters a_i, let us examine the highest possible acceptance rate for a given choice of a_i. This is achieved when c = a g*_{i*}/(n a_{i*}) and, according to Equation (4.38), it is

d/c = z / (a g*_{i*}/(n a_{i*})) = z (a_{i*}/g*_{i*}) ( (1/n) ∑_{i=1}^n a_i )^{−1}.

The following proposition gives the optimal choice of the parameters a_i.

Proposition 10 The highest possible acceptance rate in the kernel filter is obtained when a_i = κ g*_i for all i = 1, 2, . . . , n, where κ > 0 is an arbitrary real number.

Proof: Because the reciprocal of a(zn)^{−1} max_{i=1,2,...,n} g*_i/a_i is the acceptance rate, it is sufficient to minimise a max_{i=1,2,...,n} g*_i/a_i. Because a_i > 0 for all i = 1, 2, . . . , n,

( max_{i=1,2,...,n} g*_i/a_i ) ( ∑_{i=1}^n a_i ) ≥ ∑_{i=1}^n a_i g*_i/a_i = ∑_{i=1}^n g*_i.

Equality is obtained straightforwardly by substituting a_i = κ g*_i.

In practice, the exact evaluation of the essential suprema may be infeasible and, therefore, the optimal choice of a_i cannot be implemented. However, there is a feasible guideline for choosing the values a_i that ensures that the acceptance rate is at least equal to that of the pre-RPF. Suppose that the value c_1 ≥ g* is used as the coefficient of the instrumental distribution in a pre-RPF. Then the values a_i can be chosen to satisfy g*_i ≤ a_i ≤ c_1. In this case, g*_i/a_i ≤ 1 for all i = 1, 2, . . . , n and, according to Equation (5.45), it is sufficient to choose the coefficient c_2 of the instrumental distribution in the kernel filter such that c_2 = a/n. Clearly, a/n = n^{−1} ∑_{i=1}^n a_i ≤ c_1, implying z/c_2 ≥ z/c_1. Because the values z/c_1 and z/c_2 are the acceptance rates in the pre-RPF and the kernel filter, the acceptance rate in the kernel filter is found to be at least equal to the acceptance rate in the pre-RPF.

It is important to note that the improvement of the kernel filter relies on the bounded support of the Epanechnikov kernel. If the support of the kernel is R^k, then g*_i = g* for all i = 1, 2, . . . , n and, according to Proposition 10, the pre-RPF is the optimal kernel filter.
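
As a small numerical illustration of this comparison (not from the thesis), the ratio of the kernel filter acceptance rate with the optimal choice a_i = g*_i to the pre-RPF acceptance rate with c_1 = g* can be computed directly, since the normalisation coefficient z cancels; the per-stratum bounds g*_i are assumed to be available.

```python
import numpy as np

def kernel_filter_gain(g_star):
    """Ratio of acceptance rates: kernel filter (a_i = g*_i, Proposition 10)
    versus pre-RPF (c_1 = g* = max_i g*_i); the common factor z cancels."""
    g_star = np.asarray(g_star, dtype=float)
    return g_star.max() / g_star.mean()   # >= 1: kernel filter accepts at least as often

# a strongly peaked likelihood, where only a few strata overlap the observation:
print(kernel_filter_gain([2.0, 0.1, 0.05, 0.02, 0.01]))
```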

5.6.4 Local Rejection Regularised Particle Filter

The last RPF to be described is the local rejection regularised particle filter (LRRPF) [Musso et al., 2001]. The significance of the LRRPF is that it generalises the post-RPF and the kernel filter under the same framework, in spite of the fact that the former is based on importance sampling and the latter on the rejection method.

For brevity, let us introduce the scaled and shifted regularisation kernel K^i_s, defined as

K^i_s(x) ≜ (1/det(H_n)) K_E(H_n^{−1}(x − x^i_{t+1})).

In the LRRPF, the density of the target distribution with respect to λ_B × λ_k is

p′_{t+1:t|t+1}(x^i_t, x_{t+1}) = (1/z) min( a_i(α) K^i_s(x_{t+1}), (1/α) g_{t+1}(x_{t+1}) K^i_s(x_{t+1}) )

for all x_t = x^i_t, and p′_{t+1:t|t+1}(x_t, x_{t+1}) = 1 when x_t ∉ B. Here B = {x^i_t | i = 1, 2, . . . , n}, z is the normalisation coefficient, α is a parameter taking values in (0, 1], and a_i, i = 1, 2, . . . , n, is a known function of α. A more convenient expression for the target density is obtained by defining the function

b_i(α, x) ≜ a_i(α) min( 1, g_{t+1}(x)/(α a_i(α)) ),

which allows us to write the target density equivalently as

p′_{t+1:t|t+1}(x_t, x_{t+1}) =
    z^{−1} b_i(α, x_{t+1}) K^i_s(x_{t+1}),    if x_t = x^i_t,
    1,    otherwise.

Obviously, the function b_i, and hence the target density p′_{t+1:t|t+1}, are not defined for α = 0. However, under certain conditions, b_i and p′_{t+1:t|t+1} can be shown to have a pointwise limiting function as α → 0. This statement is made more precise by the following proposition.

Proposition 11 If inf_{x∈S_i} g_{t+1}(x) > 0 and if lim_{α→0} a_i(α) = g_{t+1}(x^i_{t+1}), then

lim_{α→0} b_i(α, x) = g_{t+1}(x^i_{t+1}),    x ∈ S_i.

Proof: By the convergence of a_i, for all ε > 0 there is α_1 > 0 such that if α < α_1, then |a_i(α) − g_{t+1}(x^i_{t+1})| < ε. Moreover, because inf_{x∈S_i} g_{t+1} > 0, there is α_2 > 0 such that g_{t+1}(x)/α_2 > g_{t+1}(x^i_{t+1}) + ε for all x ∈ S_i. Consequently, if α < min(α_1, α_2), then a_i(α) < g_{t+1}(x^i_{t+1}) + ε and g_{t+1}(x)/α > g_{t+1}(x^i_{t+1}) + ε, implying that b_i(α, x) = a_i(α) and |b_i(α, x) − g_{t+1}(x^i_{t+1})| < ε.

If, in addition to the conditions of Proposition 11, the likelihood function g_{t+1} is assumed to be continuous, then the definition

a_i(α) ≜ ess sup_{x∈S_i(α)} g_{t+1}(x),    i = 1, 2, . . . , n,    (5.46)

where

S_i(α) = {x ∈ R^k | ‖H_n^{−1}(x − x^i_{t+1})‖ ≤ α},

can be shown to satisfy the conditions of Proposition 11. Because the support of K_E is the unit sphere centred at the origin, the support of K^i_s is an ellipsoid centred at x^i_{t+1}. The definition of S_i(α) means that, instead of seeking the essential supremum of g_{t+1} over the whole support of K^i_s, it is sought over the ellipsoid obtained by multiplying the radii by α. This definition of a_i implies that for α = 1,

p′_{t+1:t|t+1}(x^i_t, x_{t+1}) =
    z^{−1} g_{t+1}(x_{t+1}) K^i_s(x_{t+1}),    if x_t = x^i_t,
    1,    otherwise,

and for α = 0,

p′_{t+1:t|t+1}(x^i_t, x_{t+1}) =
    z^{−1} g_{t+1}(x^i_{t+1}) K^i_s(x_{t+1}),    if x_t = x^i_t,
    1,    otherwise.

When comparing these target distributions with those in Equation (5.43) and Equation (5.42), it is observed that for α = 1 the target distribution is the target distribution of the kernel filter, and for α = 0 the target distribution is the posterior distribution approximation of the post-RPF with the bootstrap filter importance distribution. In this sense, the LRRPF indeed generalises the kernel filter and the post-RPF under the same framework.

In practice, the definition of a_i in Equation (5.46) may be infeasible, because the essential suprema cannot necessarily be evaluated even when they exist. It is, however, sufficient to define a_i to satisfy

a_i(α) ≥ ess sup_{x∈S_i(α)} g_{t+1}(x),    i = 1, 2, . . . , n,

and lim_{α→0} a_i(α) = g_{t+1}(x^i_{t+1}), as proposed in [Musso et al., 2001]. A feasible method for evaluating a_i(α) so that these conditions are satisfied can also be found in [Musso et al., 2001].

The instrumental distribution in the LRRPF is defined to have the density

q_{t+1:t|t}(x_t, x_{t+1}) =
    (a_i(α)/a(α)) K^i_s(x_{t+1}),    if x_t = x^i_t,
    1,    otherwise,

with respect to λ_B × λ_k. Here a(α) = ∑_{i=1}^n a_i(α). If α = 0, then, because the limiting function of b_i is g_{t+1}(x^i_{t+1}), the densities of the instrumental and the target distribution are equal and, consequently, every sample is accepted. The situation α ∈ (0, 1] is more complicated. The coefficients c and d of the instrumental and the target distribution must be chosen to satisfy

c/d ≥ ess sup_{(x,y)∈S} p′_{t+1:t|t+1}(x, y) / q_{t+1:t|t}(x, y) = (a(α)/z) max_{i=1,...,n} min( 1, g*_i/(α a_i(α)) ).

Because, obviously, max_{i=1,...,n} min(1, g*_i/(α a_i(α))) ≤ 1, it suffices to choose c/d = a(α)/z. Although this value cannot be evaluated, the algorithm is feasible, because the probability of accepting the sample (x^{j_i}_t, x^i_{t+1}) ∼ γ_{t+1:t|t} is

d p′_{t+1:t|t+1}(x^{j_i}_t, x^i_{t+1}) / (c q_{t+1:t|t}(x^{j_i}_t, x^i_{t+1})) = min( 1, g_{t+1}(x^i_{t+1})/(α a_{j_i}(α)) ).
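
The acceptance test above is all that is needed to simulate from the LRRPF target; the sketch below, which is not from the thesis, leaves the kernel sampler abstract, and sample_kernel, g and the vector a of values a_i(α) are assumed inputs.

```python
import numpy as np

def lrrpf_step(pred_particles, a, alpha, sample_kernel, g, n_out, rng):
    """LRRPF sampling: a stratum j is drawn with probability a_j(alpha)/a(alpha),
    x is drawn from the scaled, shifted kernel K^j_s, and the pair is accepted
    with probability min(1, g_{t+1}(x) / (alpha * a_j(alpha)))."""
    a = np.asarray(a, dtype=float)
    p = a / a.sum()
    accepted = []
    while len(accepted) < n_out:
        j = rng.choice(len(p), p=p)
        x = sample_kernel(pred_particles[j], rng)   # draw from K^j_s
        if rng.uniform() < min(1.0, g(x) / (alpha * a[j])):
            accepted.append(x)
    return np.array(accepted)
```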

5.6.5 Remarks on the LRRPF

A few concluding remarks on the LRRPF are in order. First of all, the mere generalisation of the post-RPF and the kernel filter under the same framework was not the only motivation for the LRRPF in Musso et al. [2001]. It was also proposed that α should be adapted in order to balance between the post-RPF and the kernel filter while filtering. Because the method proposed by Musso et al. [2001] for evaluating a_i(α) ensures that the probability of accepting a sample increases as α → 0, the time required for the simulation of an IID sample of size n from the target distribution decreases as α → 0. On the other hand, it was claimed by Musso et al. [2001] that for α = 1 the target distribution is more accurate and, hence, α should be chosen as large as possible. To balance between these extremes, the proposal was to set α to the largest possible value that enables the simulation of a sample of size n from the target distribution in the given time. An approximate method for this purpose is also given in [Musso et al., 2001].

The second remark is related to the equivalence of the LRRPF with α = 0 and the post-RPF. In the description of the LRRPF given above, the equivalence is based on using the bootstrap importance distribution in the post-RPF. There is, however, no apparent reason why the LRRPF algorithm could not be further generalised to cover the post-RPF with a general importance distribution as described in Section 5.6.1.

Page 95: Kari Heine A Survey of Sequential Monte Carlo …math.tut.fi/posgroup/heine_k_lic_thesis.pdfKeywords: sequential Monte Carlo, Bayesian filtering, particle filter, importance sampling,

CHAPTER 5. SEQUENTIAL MONTE CARLO 82

Even in the case of the bootstrap filter importance distribution in the post-RPF, there is a minor detail that may cancel the equivalence. Because, in thepost-RPF, the regularisation is based on a weighted sample that represents theposterior distribution, the most intuitive way to evaluate the bandwidth matrixHn would be to take the weighted sample covariance of the samples as proposedin [Arulampalam et al., 2002]. An approximate way of doing this is to resampleaccording to the weights and then evaluate the sample mean of the resultingequally weighted samples. In the LRRPF, the bandwidth matrix must, however,be evaluated before the weighting, since the weights are, in fact, never evaluated.The difference in the bandwidth matrices may be significant in the case of severeimpoverishment. On the other hand, because the choice of the bandwidth matrixis altogether heuristic, there is no reason why the weighted sample covariancecould not be used in the LRRPF as well.

5.6.6 Conclusions on Regularised Particle Filters

There is an interesting detail in the history of the SMC methods. It was already in the original paper by Gordon et al. [1993] that an ad hoc procedure called roughening was proposed to reduce sample impoverishment. The idea was to add independent, zero mean, normally distributed jitter to the samples after resampling. In essence, this is nothing but the post-RPF with a standard normal regularisation kernel. Gordon et al. [1993] proposed choosing the bandwidth matrix H_n = ⌈σ_1^2, . . . , σ_k^2⌋, where σ_i is proportional to n^{−1/k} and to the maximum difference between the ith components of the samples after resampling. According to Section B.3 and Section B.4, a theoretically more sound choice of the bandwidth matrix would be

H_n = (4/(k + 2))^{1/(k+4)} n^{−1/(k+4)} C^{1/2},

where C^{1/2} is the matrix square root of the sample covariance matrix of the samples after resampling. Because the choice of the bandwidth matrix in the post-RPF is heuristic, the ad hoc roughening procedure proposed in [Gordon et al., 1993] could, in fact, be called the original proposal to use the post-RPF.

A severe shortcoming of kernel density estimation and, hence, of regularised particle filters is the choice of the bandwidth matrix. For distributions close enough to a normal distribution, Section B.3 and Section B.4 provide a reasonable choice. However, when the distribution is multimodal with distinct modes, there are severe problems. The problems arise from the evaluation of the sample covariance, which in the case of a multimodal distribution tends to be too large, resulting in oversmoothing, i.e. in posterior distribution approximations that have too large a variance. To overcome this problem, Musso et al. [2001] and Hurzeler and Kunsch [1998] have proposed the heuristic of dividing the bandwidth by 2 or 2.5 in the case of multimodality [Silverman, 1986]. This hardly solves the problem, because it is not known beforehand whether the distribution is unimodal or multimodal; determining the modality would require estimating the number of modes.

Perhaps the most significant problem with kernel density estimation is the well known curse of dimensionality. It has been reported in the literature that kernel density estimation, even in moderate dimensions (k ≥ 5), may be futile [Scott, 1992, page 202]. Although the regularised particle filters employ the methods of kernel density estimation, the main objective is to smooth, or to regularise, as the name implies, the discrete distributions in SMC methods, rather than to actually estimate the density at a given point. Therefore, the curse of dimensionality may not be as dramatic in the SMC context [Hurzeler and Kunsch, 1998].

5.7 Other Sequential Monte Carlo Methods

So far, different SMC methods have been described that are based on two fundamentally different principles of Monte Carlo: importance sampling and the rejection method. Although the majority of the SMC methods proposed in the literature are considered to be covered by the descriptions given in the preceding sections, there is a third major class of Monte Carlo methods that has been proposed for use in the SMC context. This class consists of the Markov chain Monte Carlo (MCMC) methods [Robert and Casella, 1999]. However, a detailed description of these methods would expand the scope of this thesis excessively and, therefore, only some references are mentioned.

Berzuini and Gilks [2001] proposed applying MCMC methods in the SMC context in a method called the resample-move algorithm. The method was based on using MCMC methods to simulate a sample from the joint conditional distribution of x_0, . . . , x_t given the observations y_1, . . . , y_t. In order to save some computation, it was also proposed that the number of dimensions of the joint distribution could be reduced by forgetting the most distant history. In the extreme, the whole history could be forgotten, and MCMC would be used for simulating samples from π′_{t|t} only. This extreme case has also been proposed by Pitt and Shepard [1999]. A slightly different approach to using MCMC can also be found in [Ristic et al., 2004, page 55]. Moreover, it was proposed by Fearnhead [2002] to use sufficient statistics to save computation in the resample-move algorithm. Another way to apply MCMC to importance sampling integration was proposed by Neal [1998], and it was later applied in the SMC context by Godsill and Clapp [2001].


Chapter 6

Summary

The goal of this thesis has been to provide a detailed description of the theoretical foundations of sequential Monte Carlo and to describe some of the best known SMC methods in a unified manner. In this chapter, the main conclusions of the preceding chapters are briefly summarised. A few suggestions for future research are also given.

6.1 Conclusions

Chapter 2 provided the basics of probability theory that enabled a detailed definition of Markov chains and the formulation of the Bayesian filtering problem. The discussion of the Bayesian filter consisted of the proof of the well known Bayes' rule and its extension to the Bayesian filter, a recursion for estimating the unknown realisation of a Markov chain. The formulation of the Bayesian filtering problem included a detailed list of the assumptions under which the filter is known to be valid. A practical formulation of a general class of models that satisfy these assumptions was also given in the form of an example. The discussion of the Bayesian filter was concluded by an alternative formulation of the recursion. This formulation served as a starting point for the description of the SMC methods.

General measure theoretic definitions of the well known principles of the Monte Carlo method were given in Chapter 4. More specifically, this chapter described the principles of importance sampling, the rejection method, and stratified sampling. Classical Monte Carlo integration, which was also included, is considered to be a special case of importance sampling. The importance sampling method was divided into two different cases: importance sampling and approximate importance sampling. It was pointed out that in some cases, despite the name, approximate importance sampling can outperform importance sampling in terms of variance. The chapter also addressed the choice of the importance distribution and gave a justification for setting the importance distribution equal to the target distribution. However, a theoretically sound proof of the optimality of the target distribution in the case of an arbitrary integrand was not provided. The discussion of importance sampling was concluded by the derivation of the effective sample size, which can be used for assessing the choice of the importance distribution. It was pointed out that, occasionally, the results given by the effective sample size may be remarkably misleading.

The description of stratified sampling was accompanied by a theoretical justification for preferring stratification to a simple IID sample. In the provided references, stratified sampling is based on disjoint strata, while in this work the definition of stratified sampling was given in a manner that allows the strata to overlap. All results given in Section 4.4 remain valid even if the strata overlap. The rejection method, in turn, was shown to be valid for general probability distributions instead of only those that admit continuous densities with respect to the Lebesgue measure. In terms of the essential supremum, less stringent conditions for the instrumental distribution could be given than in the provided references.

The description of the SMC methods started from the fundamental idea of approximating the filtering distribution, i.e. the Bayesian posterior distribution, by a discrete probability distribution. It was then pointed out that in the bootstrap filter the posterior probability distribution was approximated by classical Monte Carlo integration. By replacing classical Monte Carlo with the more general importance sampling integration, one obtained the SIR algorithm. An alternative formulation of the SIR was also provided. Although this alternative formulation was shown to be more accurate than the basic SIR, it has rarely been mentioned in the literature, supposedly because of its large computational cost compared to the basic SIR algorithm. The chapter also listed some special cases of the general SIR algorithm. These methods, i.e. the auxiliary particle filter, Monte Carlo weighting, and the Kalman filter importance distributions, were essentially obtained by specific choices of the importance distribution.

In this work, resampling was not considered to be an augmentation of the SIS algorithm. Instead, resampling was regarded as an integral part of the simulation of samples from the importance distribution in the SIR algorithm. The conclusion was that multinomial resampling corresponds to simulating an IID sample from the importance distribution. Some other resampling methods were described in terms of stratified sampling. These descriptions required that the strata were allowed to overlap. It was also concluded that stratified sampling with a certain proportional sample allocation in the general SIR algorithm yields the SIS algorithm.

In addition to the SIR algorithm, a class of SMC methods called regularised particle filters was also described. The set of regularised particle filters appeared not to be theoretically as homogeneous as the SIR methods. The post-RPF was concluded to be an importance sampling based method, while the remaining methods, the pre-RPF, the kernel filter, and the LRRPF, are based on the rejection method. Although the post-RPF is basically an importance sampling algorithm, it was shown to be fundamentally different from the SIR algorithm. This difference arises from the different interpretations of the filtering distribution that is being approximated. Moreover, a generalisation of the choice of importance distribution, similar to that in the SIR algorithm, was proposed for the post-RPF.

The kernel filter was shown to be a generalisation of the pre-RPF. Moreover, it was shown that the kernel filter can be constructed to have at least the same efficiency as the pre-RPF, where efficiency was considered in terms of the acceptance rate. Finally, it was shown that an even more general framework was achieved with the LRRPF, which generalises the kernel filter and the post-RPF under the same framework. This generalisation is remarkable in the sense that it covers the fundamentally different approaches of importance sampling and the rejection method. In this case, the importance distribution was assumed to be the bootstrap filter importance distribution, but it was pointed out that there appeared to be no reason why the generalisation could not be done for other importance distributions as well.

Although empirical results are excluded from this thesis, something can be said about the behaviour of the described methods in practice. The bootstrap filter is perhaps the simplest SMC method, and because of its simplicity it is known to have severe shortcomings. On the other hand, the simplicity of the bootstrap filter is also its major asset: it has the lowest computational cost among the methods described in this work. This should be taken into account when comparing different SMC methods. Some preliminary experiments suggest that in a sufficiently well behaved problem it is indeed difficult to outperform the bootstrap filter. Certainly, a more complicated method gives more accurate results than the bootstrap filter with an equal sample size, but the extra computation required to achieve this accuracy tends to outweigh the gain in accuracy.

6.2 Future Work

As mentioned in Chapter 1, this thesis is an introduction to the theory of SMC methods. Therefore, the main guideline for future work would appear to be a more elaborate theoretical analysis of the SMC methods.

One of the questions that remained unanswered was the inclusion of the systematic resampling method under the importance sampling framework. Moreover, the optimality of systematic resampling, in general, appears to remain an open question. It was also mentioned that, in principle, many different stratifications are possible in addition to those presented in this work. However, nothing was said about the convergence or the efficiency of these alternative stratifications.

A rigorous analysis of the convergence of the SMC methods appears to be theoretically involved and is therefore excluded from this work. Especially because comparisons of different SMC methods are most often based on empirical evidence, it would be convenient to have theoretical results that explicitly state when a certain method should be used instead of another.

The theory of random measures was also briefly mentioned in Section 4.2. It is somewhat surprising that, although there is an apparent connection between SMC and the theory of random measures, these topics have not been discussed in the same context. It remains to be seen whether the theory of random measures can be used for enriching the theory of SMC, or vice versa, but, at least to some extent, they should be studied in the same context.


Bibliography

B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall Information and System Sciences Series. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1979. ISBN 0-13-638122-7.

T. W. Anderson. Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1958.

M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), February 2002.

J. E. Baker. Reducing bias and inefficiency in the selection algorithm. In Proceedings of the Second International Conference on Genetic Algorithms and their Application, pages 14-21, 1987.

H. Bauer. Probability Theory and Elements of Measure Theory. Probability and Mathematical Statistics, A Series of Monographs and Textbooks. Academic Press, London, second edition, 1981.

C. Berzuini and W. Gilks. Resample-move filtering with cross-model jumps. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo in Practice, chapter 6, pages 117-138. Springer-Verlag, New York, 2001.

J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings Radar, Sonar Navigation, 146(1), February 1999.

W. G. Cochran. Sampling Techniques. John Wiley & Sons, Inc., second edition, 1963.

D. Crisan. Particle filters - a theoretical perspective. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo in Practice, chapter 2, pages 16-41. Springer-Verlag, New York, 2001.

D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3), March 2002.


D. Crisan and T. Lyons. Minimal entropy approximation and optimal algorithms for the filtering problem. Monte Carlo Methods and Applications, 8(4):343-356, 2002.

L. Devroye. Non-Uniform Random Variate Generation. Springer, New York, 1986. URL http://jeff.cs.mcgill.ca/~luc/rnbookindex.html.

L. Devroye. A Course in Density Estimation. Number 14 in Progress in Probability and Statistics. Birkhäuser, Boston, 1987. ISBN 0-8176-3365-0.

A. Doucet. On sequential simulation-based methods for Bayesian filtering. Technical Report CUED/F-INFENG/TR 310, Signal Processing Group, Department of Engineering, University of Cambridge, 1998.

A. Doucet, N. de Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo in Practice, chapter 1, pages 3-14. Springer-Verlag, New York, 2001a.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo in Practice. Springer-Verlag, New York, 2001b.

P. Fearnhead. Markov chain Monte Carlo, sufficient statistics, and particle filters. Journal of Computational and Graphical Statistics, 11(4):848-862, 2002.

D. Gamerman. Markov Chain Monte Carlo. Texts in Statistical Science Series. Chapman & Hall/CRC, Boca Raton, Florida, 2002. ISBN 0-412-81820-5.

R. F. Gariepy and W. P. Ziemer. Modern Real Analysis. PWS Publishing Company, Boston, 1995.

J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317-1339, November 1989.

S. Godsill and T. Clapp. Improvement strategies for Monte Carlo particle filters. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo in Practice, chapter 7, pages 139-158. Springer-Verlag, New York, 2001.

N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2), April 1993.

P. R. Halmos. Measure Theory. The University Series in Higher Mathematics. D. Van Nostrand Company, Inc., Princeton, New Jersey, 1950.

J. M. Hammersley and D. C. Handscomb. Monte Carlo Methods. Methuen's Monographs on Applied Probability and Statistics. Methuen & Co. Ltd., London, 1964.


Y. C. Ho and R. C. K. Lee. A Bayesian approach to problems in stochastic estimation and control. IEEE Transactions on Automatic Control, 9(4):333-339, October 1964.

M. Hürzeler and H. R. Künsch. Monte Carlo approximations for general state-space models. Journal of Computational and Graphical Statistics, 7(2):175-193, 1998.

A. H. Jazwinski. Stochastic Processes and Filtering Theory. Number 64 in Mathematics in Science and Engineering. Academic Press Inc., New York, 1970.

O. Kallenberg. Random Measures. Academic Press, London, 3rd edition, 1983.

S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall Signal Processing Series. Prentice Hall International, Inc., Englewood Cliffs, New Jersey, 1993.

G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1-25, 1996.

A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover Publications, Inc., 1975a. ISBN 0-486-61226-0.

A. N. Kolmogorov and S. V. Fomin. Reelle Funktionen und Funktionalanalysis. Number 78 in Hochschulbücher für Mathematik. VEB Deutscher Verlag der Wissenschaften, Berlin, 1975b.

A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278-288, March 1994.

F. LeGland and N. Oudjane. Stability and uniform approximation of nonlinear filters using the Hilbert metric, and applications to particle filters. The Annals of Applied Probability, 14(1):144-187, February 2004.

J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113-119, 1996.

J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, New York, 2001.

J. S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032-1044, September 1998.

P. S. Maybeck. Stochastic Models, Estimation and Control, Volume 1. Number 141 in Mathematics in Science and Engineering. Academic Press, New York, 1979. ISBN 0-12-480701-1.


J. F. Monahan. Numerical Methods of Statistics. Cambridge University Press, 2001.

C. Musso, N. Oudjane, and F. LeGland. Improving regularised particle filters. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo in Practice, chapter 12, pages 247-271. Springer-Verlag, New York, 2001.

R. M. Neal. Annealed importance sampling. Technical Report 9805, University of Toronto, Department of Statistics, 1998.

N. Oudjane and C. Musso. Progressive correction for regularized particle filters. In Proceedings of the Third International Conference on Information Fusion, FUSION 2000, volume 2, pages ThB2/10-ThB2/17, 2000.

M. K. Pitt and N. Shepard. Filtering via simulation: Auxiliary particle filter. Journal of the American Statistical Association, Theory and Methods, 94(446), June 1999.

B. Ristic, S. Arulampalam, and N. Gordon. Beyond Kalman Filter: Particle Filters for Tracking Applications. Artech House, 2004. ISBN 1-58053-631-x.

C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999.

H. L. Royden. Real Analysis. The Macmillan Company, Collier-Macmillan Ltd., London, 2nd edition, 1968.

R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley, New York, 1981.

W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 3rd international edition, 1976. ISBN 0-07-085613-3.

D. W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley & Sons, Inc., New York, 1992. ISBN 0-471-54770-0.

A. N. Shiryayev. Probability. Number 95 in Graduate Texts in Mathematics. Springer-Verlag, New York, 1984.

B. W. Silverman. Density Estimation. Number 26 in Monographs on Statistics and Applied Probability. Chapman and Hall, New York, 1986. ISBN 0-412-24620-1.

G. F. Simmons. Introduction to Topology and Modern Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill Book Company, Inc., New York, 1963.

A. Stuart. A simple presentation of optimum sampling results. Journal of the Royal Statistical Society, Series B, 16(2):239-241, 1954.


R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan. The unscented particle filter. Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, 2000.

M. P. Wand and M. C. Jones. Kernel Smoothing. Number 60 in Monographs on Statistics and Applied Probability. Chapman & Hall, New York, 1995. ISBN 0-412-55270-1.

D. Williams. Probability with Martingales. Cambridge University Press, Cambridge, 1991. ISBN 0-521-40605-6.

D. Williams. Weighing the Odds. Cambridge University Press, Cambridge, 2001. ISBN 0-521-00618-X.


Appendix A

Analysis

Probability theory is closely related to general measure and integration theory and real analysis. This chapter briefly summarises some of the main definitions and results that are used throughout this work. Most of the following discussion is based on [Gariepy and Ziemer, 1995]. Additional details can be found in [see, e.g. Bauer, 1981, Simmons, 1963, Royden, 1968].

Section A.1 defines the basic notations related to functions, sets and systems of sets. Section A.2 gives some basic definitions related to measure theory and also includes a discussion on the measurability of functions. The chapter is concluded in Section A.3 by defining the integral and by giving some fundamental theorems related to integration theory.

A.1 Functions and Set Theory

This section describes some of the basic definitions related to functions and set theory. In the following, unless explicitly otherwise stated, capital letters A, B, C, . . . are reserved for denoting sets and the lower case letters a, b, c, . . . are reserved for denoting elements of sets. In particular, sets whose elements are sets are denoted by script letters A, B, C, . . ., and they are called systems or collections of sets. The notations A ⊂ B and A ⊃ B for the set theoretic inclusion are considered to include the case A = B.

A.1.1 Functions

For two arbitrary sets, X and Y, a rule which assigns a unique value y ∈ Y for every x ∈ X is called a function f on X, and it is denoted by f : X → Y. The set X is called the domain of f and Y is called the image space of f. The element y ∈ Y assigned to x ∈ X by the function f is called the image of x and it is denoted by f(x). The image of a subset A ⊂ X is denoted by f(A) and defined as f(A) ≜ {y ∈ Y | y = f(x) for some x ∈ A}. The image f(X) ⊂ Y is called the range of f. The preimage of B ⊂ Y is denoted by f^{-1}(B) and defined as f^{-1}(B) = {x ∈ X | f(x) ∈ B}.


A.1.2 Topology

Suppose that X is a set. Then P(X) denotes the system of all subsets of X and it is called the power set of X. The mere existence of a set X is not sufficient for defining such fundamental concepts as convergence of a sequence of elements of X, or continuity of mappings between two sets. To this end we have the following definition [see, e.g. Gariepy and Ziemer, 1995, page 34].

Definition 26 Let X be a set. Then T ⊂ P(X) is called a topology in X, if T has the following properties:
i) ∅, X ∈ T;
ii) for any S ⊂ T, ⋃_{S∈S} S ∈ T;
iii) for any finite collection S ⊂ T, ⋂_{S∈S} S ∈ T.
The ordered pair (X, T), where T is a topology in X, is called a topological space.

The elements of a topology T in X are called open sets of X. A set A whose complement ∁A ≜ X − A is open, is closed [Gariepy and Ziemer, 1995, Simmons, 1963]. Because ∁(∁A) = X − (X − A) = A, the complement of an open set A is closed. It is also noted that since ∁X = ∅ and vice versa, ∅ and X are closed. If there is a topological space (X, T) and Y ⊂ X, then T_Y ≜ {A ∈ P(X) | A = B ∩ Y, B ∈ T} is a topology [Gariepy and Ziemer, 1995, page 35]. The topology T_Y is called the relative topology of T with respect to Y.

It would be convenient if a topology T in X could be expressed in terms of some smaller family of open sets. For this purpose we have the following definition [see, e.g. Kolmogorov and Fomin, 1975a].

Definition 27 Let T be a topology. Then B ⊂ T is called a basis of the topology T if every element of T can be expressed as a union of the elements in B. The elements of B are called the basic sets.

A topology is uniquely defined by its basis to be the collection of all unions of the basic sets. This is obvious by noting that, according to the definition of the basis, every open set is a union of basic sets, and by the definition of topology, every union of basic sets belongs to the topology. An even more compact representation of a topology can be obtained by the following definition [Simmons, 1963, page 101].

Definition 28 Let T be a topology. Then S ⊂ T is called a subbase of T if the set of all finite intersections of the elements of S is a basis of T. The elements of S are called subbasic sets.

A.1.3 Metric Spaces

The following definition provides the concept of distance between two elements of an arbitrary set [see, e.g. Gariepy and Ziemer, 1995, page 40].

Definition 29 A mapping ρ : X × X → [0, ∞) is called a metric if for all x, y, z ∈ X,
i) ρ(x, y) = 0 ⇐⇒ x = y;
ii) ρ(x, y) = ρ(y, x);
iii) ρ(x, y) ≤ ρ(x, z) + ρ(z, y).
The ordered pair (X, ρ), where ρ is a metric, is called a metric space.

Occasionally the term distance is used instead of metric [see, e.g. Kolmogorov and Fomin, 1975a, page 37]. If (X, ρ) is a metric space, then the metric ρ can be used for defining a topology in the set X. This is done in terms of open balls [Gariepy and Ziemer, 1995, page 41].

Definition 30 Let (X, ρ) be a metric space. Then an open ball B_ρ(x, r) of radius r centered at x ∈ X is defined as

B_ρ(x, r) ≜ {y ∈ X | ρ(x, y) < r}.

A topology can be defined in X by considering the collection of all open balls as a subbase for the topology, as suggested by the following definition [Gariepy and Ziemer, 1995, page 41].

Definition 31 Let (X, ρ) be a metric space. Then the collection {B_ρ(x, r) | x ∈ X, r > 0} is a subbase for the topology induced by the metric ρ.

In particular, we are interested in the set R^k, which is the collection of all k-tuples x = (x_1, x_2, . . . , x_k) of real numbers. This set can be endowed with the Euclidean metric

ρ_E(x, y) ≜ ( ∑_{i=1}^{k} (x_i − y_i)² )^{1/2},   x, y ∈ R^k,

yielding a metric space (R^k, ρ_E). The topology induced by the Euclidean metric is called the Euclidean topology. Throughout this thesis, the set R^k is always considered to be endowed with the conventional vector operations (addition and product with a scalar), the Euclidean metric, and the induced Euclidean topology. Hence it is called the k-dimensional Euclidean space¹. Moreover, the elements of R^k are considered to be column vectors

x = [x_1, x_2, . . . , x_k]^T,   x_i ∈ R,

where (·)^T stands for transpose.

A.1.4 Borel Sets

Topology does not provide sufficient structure for sets to enable a rigorous definition of probability theoretic concepts. Therefore the following system of sets is defined [Gariepy and Ziemer, 1995, page 72].

¹ More precisely, any linear space endowed with a scalar product is a Euclidean space [see, e.g. Kolmogorov and Fomin, 1975a, page 144]. In the case of R^k with the conventional vector operations, the scalar product is defined to be (x, y) = ∑_{i=1}^{k} x_i y_i, and it induces the norm, the metric and the topology, which are all called Euclidean.


Definition 32 A non-empty collection F ⊂ P(X) is called a σ-algebra if it has the properties:
i) A ∈ F =⇒ ∁A ∈ F;
ii) A_1, A_2, . . . ∈ F =⇒ ⋃_{i=1}^{∞} A_i ∈ F.

Additionally, the properties of σ-algebras imply that a σ-algebra is also closed under countable² intersections and finite differences. Note also that since F is nonempty, there are sets A, ∁A ∈ F and hence A ∩ ∁A = ∅ and A ∪ ∁A = X are elements of F as well.

Obviously, an arbitrary collection S ⊂ P(X) is not necessarily a σ-algebra. However, an arbitrary subset E ⊂ P(X) can be expanded into a σ-algebra by adding appropriate elements of P(X) into E. This leads to the following definition [see, e.g. Bauer, 1981, pages 5-6].

Definition 33 The smallest σ-algebra including S ⊂ P(X) is denoted by σ(S) and it is called the σ-algebra generated by S.

The smallest, in this case, means that if F is a σ-algebra such that S ⊂ F, then σ(S) ⊂ F. It should be pointed out that σ(S) always exists. This follows from the fact that the intersection of σ-algebras is a σ-algebra, and therefore σ(S) can be defined as the intersection of all the σ-algebras that contain S [see, e.g. Bauer, 1981, pages 5-6]. This collection is nonempty since, e.g. P(X) belongs to it.

The σ-algebra generated by a topology plays an important role, for example in probability theory. For this reason we give the following definition [Bauer, 1981, page 204].

Definition 34 Suppose that (X, T) is a topological space. Then σ(T) is the σ-algebra of Borel sets (or the Borel σ-algebra) in X.

Because the Borel σ-algebra is the smallest σ-algebra containing the open sets, the definition of Borel sets requires the knowledge of the underlying topology. In the case of R^k, the underlying topology is always assumed to be the Euclidean topology. The shorthand notation B(A), where A ⊂ R^k, is used for denoting the Borel sets in A with respect to the Euclidean topology.

Often in the literature, the Borel σ-algebra in R^k is defined to be the σ-algebra generated by the left half-open intervals [see, e.g. Bauer, 1981, page 27]

(a, b] ≜ {x ∈ R^k | a_i < x_i ≤ b_i},   a, b ∈ R^k.

Regardless of the different characterisation, the resulting σ-algebra is the same as the σ-algebra generated by the Euclidean topology [Bauer, 1981, page 28]. In fact, the same σ-algebra is generated by open, closed, left half-open, and right half-open intervals in R^k [see, e.g. Shiryayev, 1984, pages 141-142].

Let us conclude this section with two more definitions of systems of sets that play an important role in measure theory, e.g. in the construction of measures [Royden, 1968, pages 16, 259].

² Countable means finite or countably infinite.


Definition 35 A system A ⊂ P(X) is called a ring if
i) ∅ ∈ A;
ii) A, B ∈ A =⇒ A − B ∈ A;
iii) A, B ∈ A =⇒ A ∪ B ∈ A.
If, in addition, X ∈ A, then A is an algebra.

Definition 36 A nonempty collection C ⊂ P(X) is called a semialgebra if
i) A, B ∈ C =⇒ A ∩ B ∈ C;
ii) A ∈ C =⇒ there are C_i ∈ C, i = 1, 2, . . . , n, such that C_i ∩ C_j = ∅ for all i ≠ j and ⋃_{i=1}^{n} C_i = ∁A.

It should be noted that every σ-algebra is an algebra but the converse does not hold in general.

A.1.5 Extended Real Numbers

Although on many occasions it is sufficient to consider events on the real line only, it is sometimes convenient to extend the real line to include infinities, positive and negative. For this reason, the set of extended real numbers R̄ ≜ R ∪ {+∞, −∞} is introduced. Because the extended real numbers ∞ ≜ +∞ and −∞ do not obey the traditional arithmetic operations, we use the following conventions [Gariepy and Ziemer, 1995, pages 111-112].

Definition 37 Let x ∈ R and a > 0. Then

x + (±∞) ≜ ±∞,    (±∞) + x ≜ ±∞,    (±∞) + (±∞) ≜ ±∞,
a(±∞) ≜ ±∞,    (−a)(±∞) ≜ ∓∞,    0 · (±∞) ≜ 0,
(±∞)x ≜ x(±∞),    (±∞)(±∞) ≜ ∞,    (±∞)(∓∞) ≜ −∞.

Note that the expressions (±∞) + (∓∞) and (±∞) − (±∞) remain undefined.

The extended real line certainly is not the same as the one-dimensional Euclidean space R. In particular, we are interested in the systems of open and Borel sets in R̄. These concepts can be defined in terms of the order topology T_o on R̄ [Gariepy and Ziemer, 1995, page 112]. Let us define the sets L_a ≜ [−∞, a) and R_a ≜ (a, ∞], where a ∈ R. Then the collection {L_a | a ∈ R} ∪ {R_b | b ∈ R} is taken as a subbase for the order topology, yielding the basis S = {L_a | a ∈ R} ∪ {R_a | a ∈ R} ∪ {L_b ∩ R_a | a < b, a, b ∈ R}. The resulting topology is

T_o = {A ∪ B | A ∈ T, B ∈ {∅, {−∞}, {∞}, {−∞, ∞}}},

where T is the Euclidean topology in R. Obviously, the relative topology of T_o with respect to R is the conventional Euclidean topology.

The Borel sets of R̄ are denoted by B(R̄) and are considered to be taken with respect to the order topology. The fact that R ∈ T_o implies that B(R̄) must be closed under differences of the form R − A. Hence B(R) ⊂ B(R̄). Moreover, because {∞}, {−∞}, and {−∞, ∞} are elements of T_o, they must be elements of B(R̄) as well. In fact [Bauer, 1981, page 43],

B(R̄) = {A ∪ B | A ∈ B(R), B ∈ {∅, {−∞}, {∞}, {−∞, ∞}}}.


A.2 Measure Theory

Measure theory is a branch of mathematics that is involved with the concept of the measure of sets. In some elementary situations, the measure of a set coincides with the conventional concept of the volume or area of a set. In general, measure is a mathematical abstraction of volume which is defined as follows [Bauer, 1981, pages 10, 19].

Definition 38 Suppose that there is a set X, a system of sets M ⊂ P(X) and a function µ : M → [0, ∞] such that
i) µ(∅) = 0;
ii) for a sequence of disjoint sets A_1, A_2, . . . ∈ M,

µ( ⋃_{i=1}^{∞} A_i ) = ∑_{i=1}^{∞} µ(A_i).

Then we have the following definitions:
i) if M is a ring, µ is a premeasure on M;
ii) if M is a σ-algebra, µ is a measure on M.

A premeasure or measure µ is finite if µ(X) < ∞. If there is a sequence of sets A_1, A_2, . . . ∈ M such that X = ⋃_{i=1}^{∞} A_i and µ(A_i) < ∞ for all i, then µ is said to be σ-finite.

Definition 39 The ordered pair (X, M), where M ⊂ P(X) is a σ-algebra, is called a measurable space and M is the system of measurable subsets of X. If in addition µ is a measure on M, the ordered triple (X, M, µ) is called a measure space.

If there are two measure spaces (X, M, µ) and (Y, N, ν), then M ⊗ N is the direct product of the σ-algebras M and N, i.e. the smallest σ-algebra that includes the sets A × B, where A ∈ M and B ∈ N. A measure µ × ν on M ⊗ N is called a product measure of µ and ν if for all A ∈ M and B ∈ N, (µ × ν)(A × B) = µ(A)ν(B). For the proof of the existence and uniqueness of the product measure see, e.g. [Shiryayev, 1984, page 197]. If a function f : X → Y defined on a measure space (X, M, µ) satisfies some property P for all x ∈ X − N, where N ⊂ X and µ(N) = 0, then the property P is said to be satisfied µ-almost everywhere, or, more briefly, µ-a.e.

When associating measure with the physical concept of volume, it is intuitive that measures can only have nonnegative values. In theory, however, it is often convenient to assign negative measures to measurable sets. This results in the following definition [Gariepy and Ziemer, 1995, pages 151-152].

Definition 40 Let (X, M) be a measurable space. A set function µ : M → R̄ is called a signed measure, if
i) µ takes at most one of the values ±∞;
ii) µ(∅) = 0;


iii) for a sequence of disjoint sets A_1, A_2, . . . ∈ M,

µ( ⋃_{i=1}^{∞} A_i ) = ∑_{i=1}^{∞} µ(A_i).

Obviously a measure is a signed measure but the converse does not hold in general.

Definition 41 Suppose that µ is a measure and ν is a signed measure defined on M. If

µ(A) = 0 =⇒ ν(A) = 0,   A ∈ M,

then ν is said to be absolutely continuous with respect to µ. This is denoted by ν ≪ µ.

A.2.1 Lebesgue Measure

The measure that most closely coincides with the physical concept of volume is known as the Lebesgue or Lebesgue-Borel measure. The detailed derivation of the Lebesgue measure is lengthy and can be found in several sources in the literature [see, e.g. Gariepy and Ziemer, 1995, Bauer, 1981]. In the following, only the outline of the derivation of the Lebesgue measure is given. The proofs of the following theorems can be found, e.g. in [Bauer, 1981, pages 10-27].

Definition 42 Let a, b ∈ R^k. Then the real number ∏_{i=1}^{k} (b_i − a_i) assigned to every right half-open interval [a, b) ≜ {x ∈ R^k | a_i ≤ x_i < b_i} in R^k is called the elementary content.

Obviously the elementary content of a set represents the volume of a k-dimensional cube.

Theorem 13 The system A^k of all finite unions of right half-open intervals [a, b) in R^k is a ring. Moreover, there is a unique premeasure λ^k on A^k, such that λ^k(A) is equal to the elementary content of A for every right half-open interval.

Theorem 14 Suppose that µ is a σ-finite premeasure on A. Then there is a unique measure µ̄ on the σ-algebra σ(A) such that µ̄(A) = µ(A) for all A ∈ A.

This theorem is quite useful. It implies, for example, that if two different measures can be shown to be equal on an algebra of sets, then the measures are equal on the σ-algebra generated by the algebra. By combining the two theorems above we obtain the following theorem.

Theorem 15 There is a unique measure λ^k on B(R^k) such that for every right half-open interval A, λ^k(A) is equal to the elementary content of A.

Definition 43 The measure λ^k on B(R^k) that assigns each right half-open interval its elementary content is called the Lebesgue measure.


A.2.2 Measurable Functions

In analysis, a significant role is played by continuous functions. In probability theory, however, continuity is often not the most crucial property of functions. Instead, it is important to know whether a function is measurable or not.

Definition 44 Let (X, M) and (Y, N) be measurable spaces. A function f : X → Y is M/N-measurable, if for every N ∈ N, f^{-1}(N) ∈ M.

The following lemma provides a convenient way of determining whether a given function f is measurable or not [Gariepy and Ziemer, 1995, page 113].

Lemma 4 Suppose that (X, M) is a measurable space and there is a function f : X → R̄. Then the following conditions are equivalent:
i) f is M/B(R̄)-measurable;
ii) f^{-1}((a, ∞]) ∈ M for all a ∈ R;
iii) f^{-1}([a, ∞]) ∈ M for all a ∈ R;
iv) f^{-1}([−∞, a)) ∈ M for all a ∈ R;
v) f^{-1}([−∞, a]) ∈ M for all a ∈ R.

A strategy that is often encountered in the proofs of many measure and integration theoretic results is that the proof is initially given for nonnegative functions and then generalised to cover functions f : X → R̄ using the decomposition f = f⁺ − f⁻, where

f⁺ ≜ max(0, f),   f⁻ ≜ −min(0, f).

Obviously, f⁺ and f⁻ are nonnegative.

Proposition 12 Suppose that (X, M) is a measurable space and f : X → R̄ is a M/B(R̄)-measurable function. Then f⁺ and f⁻ are M/B(R̄)-measurable.

Proof: If a ≥ 0, then (f⁺)^{-1}((a, ∞]) = f^{-1}((a, ∞]) ∈ M. If a < 0, then (f⁺)^{-1}((a, ∞]) = X ∈ M. Hence (f⁺)^{-1}((a, ∞]) ∈ M for all a ∈ R and, according to Lemma 4, f⁺ is M/B(R̄)-measurable. The measurability of f⁻ is proved similarly.

In the case of continuous functions, it is typically of interest whether a given function is bounded or not, i.e. whether the least upper bound or supremum exists. In measure theory, it is often sufficient to know if a function is bounded except on sets of measure zero.

Definition 45 Suppose that (X, M, µ) is a measure space and f : X → [0, ∞] is M/B([0, ∞])-measurable. Then f is essentially bounded if the essential supremum

ess sup_{x∈X} f(x) ≜ inf( {a ∈ [0, ∞] | µ(f^{-1}((a, ∞])) = 0} ∪ {∞} )

is finite.


Example 13: The triple (R, B(R), λ¹) is a measure space. Let us define f : R → (0, ∞) as

f(x) = 1/x,                if x ∈ Q ∩ (0, ∞),
       e^{−x²/2}/√(2π),    otherwise.

Then f is not bounded, but it is essentially bounded with essential supremum equal to 1/√(2π).

Definition 46 A measurable space (X, M) is a Borel space if there is an injection φ : X → R such that
i) φ(X) ∈ B(R);
ii) φ is M/B(R)-measurable;
iii) φ^{-1} is B(R)/M-measurable.

A.2.3 Simple Functions

On several occasions throughout this thesis, the following convenient function has been used for characterising sets.

Definition 47 The characteristic function χ_A : X → {0, 1} of the set A ⊂ X is defined by

χ_A(x) = 1,   if x ∈ A,
         0,   otherwise.

Especially in the context of integration theory, characteristic functions are used for constructing more complicated functions according to the following definition [Gariepy and Ziemer, 1995, pages 127, 133].

Definition 48 A function f : X → R̄ is a simple function if its range is {a_i ∈ R̄ | i = 1, 2, . . . , n} where n < ∞, that is, f has at most a finite number of values. Then it has the representation

f = ∑_{i=1}^{n} a_i χ_{A_i},

where A_i = f^{-1}({a_i}). If n ≤ ∞, then f is countably simple (c.s.).

Obviously a simple function is also countably simple. The significance of simple functions is based on the ability to approximate arbitrary functions. The following theorem describes some of these approximability properties [Gariepy and Ziemer, 1995, page 127].

Theorem 16 Suppose that (X, M) is a measurable space and that there is a function f : X → R̄. Then there is a sequence {f_i}_{i=1}^{∞} of simple functions such that lim_{i→∞} f_i(x) = f(x) for all x ∈ X. In addition the following statements hold:


i) If f is M/B(R̄)-measurable, the functions f_i can be chosen to be M/B(R̄)-measurable.
ii) If f is nonnegative, the functions f_i can be chosen to satisfy 0 ≤ f_1 ≤ f_2 ≤ · · · ≤ f.
iii) If f is nonnegative and M/B(R̄)-measurable, the functions f_i can be chosen to be M/B(R̄)-measurable and to satisfy 0 ≤ f_1 ≤ f_2 ≤ · · · ≤ f.

The convergence lim_{i→∞} f_i(x) = f(x) for all x ∈ X is called pointwise convergence and the shorthand notation f_i → f is used. In particular, if f_1 ≤ f_2 ≤ · · · ≤ f and f_i → f, then the shorthand notation f_i ↑ f is used.
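For a nonnegative f, the increasing approximation of item iii) can be realised with the standard dyadic construction f_i = min(i, 2^{-i}⌊2^i f⌋). The following short Python/NumPy sketch illustrates the construction numerically; the particular function f is chosen purely for illustration and is not taken from the thesis.

    import numpy as np

    def dyadic_approximation(f_values, i):
        """Increasing simple-function approximation of a nonnegative function:
        f_i = min(i, floor(2^i f) / 2^i). Each f_i takes finitely many values
        and f_i increases pointwise to f as i grows."""
        return np.minimum(i, np.floor((2.0 ** i) * f_values) / (2.0 ** i))

    x = np.linspace(-3.0, 3.0, 601)
    f = np.exp(-x ** 2)                  # an illustrative nonnegative function bounded by 1
    for i in (1, 2, 4, 8):
        fi = dyadic_approximation(f, i)
        print(i, np.max(f - fi))         # the maximum pointwise error is at most 2**(-i)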

The measurability properties of countably simple functions are characterised by the following theorem [Kolmogorov and Fomin, 1975b, page 286].

Theorem 17 Suppose that (X, M) is a measurable space. A countably simple function f : X → R̄ with range {a_i ∈ R̄ | i = 1, 2, . . . , n}, where n ≤ ∞, is M/B(R̄)-measurable if and only if f^{-1}({a_i}) ∈ M for all i = 1, 2, . . . , n.

For example, the characteristic function of a set A is measurable if and only if A ∈ M.

A.3 Integration

Integration is an essential part of probability theory. For example, the probability theoretic concept of expectation is identified with an integral with respect to a probability measure. Obviously, this identification would not be possible using the conventional Riemann integral. Therefore we need to have a more general definition of the integral. The following construction of the integral and its properties follows the discussion in [Gariepy and Ziemer, 1995, pages 133-138].

Definition 49 Suppose that (X, M, µ) is a measure space. The integral of a countably simple, nonnegative, M/B(R̄)-measurable function f : X → R̄ is defined as

∫ f dµ ≜ ∑_{i=1}^{n} a_i µ(f^{-1}({a_i})).

The integral of a countably simple, M/B(R̄)-measurable function f : X → R̄ is defined as

∫ f dµ ≜ ∫ f⁺ dµ − ∫ f⁻ dµ,

if min( ∫ f⁺ dµ, ∫ f⁻ dµ ) < ∞.

The integral of countably simple functions can be generalised to arbitrary measurable functions as follows.

Definition 50 Suppose that (X, M, µ) is a measure space, f : X → R̄ is a M/B(R̄)-measurable function, and

G⁺_f ≜ {g : X → R̄ | g is M/B(R̄)-measurable, c.s., and g ≥ f µ-a.e.},
G⁻_f ≜ {g : X → R̄ | g is M/B(R̄)-measurable, c.s., and g ≤ f µ-a.e.}.


The upper and lower integrals of f are defined as

∫* f dµ ≜ inf_{g ∈ G⁺_f} ∫ g dµ,    ∫_* f dµ ≜ sup_{g ∈ G⁻_f} ∫ g dµ,

respectively. If the upper and lower integrals of f have a common, finite value, then f is said to be (µ-)integrable and the common value is denoted by

∫ f dµ ≜ ∫* f dµ = ∫_* f dµ,

and called the (Lebesgue) integral of f.

Moreover, if there is a characteristic function in the integrand, it is common to use the notation

∫_A f dµ ≜ ∫ χ_A f dµ.

The following theorem lists some of the most important properties of the integral [Gariepy and Ziemer, 1995, pages 134-135].

Theorem 18 Suppose that (X, M, µ) is a measure space, f : X → R̄ and g : X → R̄ are µ-integrable, and h : X → R̄ is M/B(R̄)-measurable. Moreover, let a, b ∈ R and A ∈ M. Then
i) h = f µ-a.e. =⇒ h is µ-integrable and ∫ h dµ = ∫ f dµ;
ii) f is µ-a.e. finite and fχ_A is µ-integrable;
iii) af + bg is µ-integrable and ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ;
iv) f ≤ g =⇒ ∫ f dµ ≤ ∫ g dµ;
v) h is µ-integrable ⇐⇒ |h| is µ-integrable;
vi) |∫ f dµ| ≤ ∫ |f| dµ.

To be precise, the definition of the integral in Definition 50 does not allow an infinite value, even if the upper and lower integrals have it as a common value. Nevertheless, we have the following two results [Gariepy and Ziemer, 1995, page 137].

Lemma 5 Suppose that (X, M, µ) is a measure space and f : X → R̄ is M/B(R̄)-measurable and µ-a.e. nonnegative. Then

∫* f dµ = ∫_* f dµ.

The common value of the upper and lower integral is denoted by ∫ f dµ and it takes values in [0, ∞].

Corollary 1 Suppose that (X, M, µ) is a measure space, f : X → R̄ is nonnegative and M/B(R̄)-measurable, and g is µ-integrable. Then

∫_* (f + g) dµ = ∫* (f + g) dµ = ∫ f dµ + ∫ g dµ.


These results allow us to give the following convenient formulation for the integral.

Corollary 2 Suppose that (X, M, µ) is a measure space and f : X → R̄ is M/B(R̄)-measurable with min( ∫ f⁺ dµ, ∫ f⁻ dµ ) < ∞. Then

∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ.

Proof: Follows directly from Corollary 1.

It should be noted that the integral ∫ f dµ of a function f may thus be finite, positively or negatively infinite, or undefined. Only in the case of a finite integral is f said to be integrable.

Let us conclude this section by stating some important and well-known results.

Theorem 19 (Monotone Convergence) Suppose that (X, M, µ) is a measure space and {f_i}_{i=1}^{∞} is a sequence of nonnegative, M/B(R̄)-measurable functions such that f_i ≤ f_{i+1} for i = 1, 2, . . . Then

lim_{i→∞} ∫ f_i dµ = ∫ lim_{i→∞} f_i dµ.

Proof: The proof is given, e.g. in [Gariepy and Ziemer, 1995, page 139].

Theorem 20 (Fubini's Theorem) Suppose that (X, M, µ) and (Y, N, ν) are measure spaces and f : X × Y → R̄ is µ × ν-integrable. Then for the functions

f_y(y) ≜ ∫ f(·, y) dµ,    f_x(x) ≜ ∫ f(x, ·) dν,

the following statements hold:
i) f_x and f_y are defined for all x ∈ X and y ∈ Y, respectively;
ii) f_x and f_y are M/B(R̄)- and N/B(R̄)-measurable, respectively;
iii) f_x is µ-integrable ν-a.e. and f_y is ν-integrable µ-a.e.;
iv) ∫ f d(µ × ν) = ∫ [∫ f dµ] dν = ∫ [∫ f dν] dµ.

Proof: The proof is given, e.g. in [Gariepy and Ziemer, 1995, pages 170-174].

In order to apply Fubini's theorem, one must ensure that f is µ × ν-integrable. In practice, this may be difficult. However, if the product measure µ × ν is known to be σ-finite and f is nonnegative, then it can be shown that

∫ f d(µ × ν) = ∫ [∫ f dµ] dν = ∫ [∫ f dν] dµ,

where the common value of the integrals may be finite or infinite. This result is known as Tonelli's theorem and it can be found, e.g. in [Gariepy and Ziemer, 1995, Royden, 1968]. In particular, if µ and ν are both σ-finite, then µ × ν is σ-finite as well. As a consequence of Tonelli's theorem, the applicability of Fubini's theorem can be verified by showing that µ and ν are σ-finite, and that one of the integrals ∫ [∫ |f| dµ] dν or ∫ [∫ |f| dν] dµ is finite. This implies that f is µ × ν-integrable [Shiryayev, 1984, page 200].


Theorem 21 (Radon-Nikodym theorem) Suppose that (X, M) is a measurable space, µ is a σ-finite measure on M, and ν is a signed measure on M such that ν ≪ µ. Then there is a µ-a.e. unique, M/B(R̄)-measurable function f : X → R̄ such that

ν(A) = ∫_A f dµ,   A ∈ M.   (A.1)

Proof: The proof is given, e.g. in [Shiryayev, 1984, page 194].

The function f in Equation (A.1) is often denoted by

f = dν/dµ,

and is referred to as the Radon-Nikodym derivative (RND) or the density of ν with respect to µ. In particular, if µ = λ^k, then the RND dν/dλ^k is called the density of ν.


Appendix B

Kernel Density Estimation

The goal in density estimation is to recover the probability density function of some unknown distribution when only a set of samples from the distribution of interest is available. Density estimation can be divided into two classes: parametric and nonparametric density estimation. In parametric density estimation the unknown density is assumed to belong to a family of densities that are distinguished by certain parameters. The task is then to recover the unknown parameters using the samples from the distribution of interest. In nonparametric density estimation the density is not assumed to be defined by certain parameters but to be of a more general form. This chapter will focus on a nonparametric density estimation method known as kernel density estimation. The literature on the topic is extensive and only a few elementary results are discussed here. For an introduction to the topic see, e.g. [Scott, 1992, Silverman, 1986, Wand and Jones, 1995]. A more theoretical discussion can be found, e.g. in [Devroye, 1987].

Section B.1 gives the definition of the kernel density estimator. Sections B.2 and B.3 give some results and guidelines for choosing the parameters of the kernel density estimator. Finally, Section B.4 describes how kernel density estimation can be applied to the estimation of a general density. The chapter is concluded by an example.

B.1 Kernel Density Estimators

The kernel density estimator can be considered as a convolution of a discrete unweighted probability measure and some predefined kernel function [Musso et al., 2001]. Without going into the details of this convolution interpretation, we give the following definition. The definition and the following discussions are mostly based on [Scott, 1992].

Definition 51 Suppose that µ is a probability measure on B(R^k). Let µ_n be a discrete approximation of µ, based on the IID sample {x_1, x_2, . . . , x_n} from µ. Moreover, let µ have a density f with respect to λ^k. A mapping K : R^k → [0, ∞) is called a regularisation kernel if it satisfies
i) ∫ K(x) λ^k(dx) = 1;
ii) ∫ xK(x) λ^k(dx) = 0_k;
iii) ∫ ||x||² K(x) λ^k(dx) < ∞.
If H_n ∈ R^{k×k} is symmetric and positive definite, then

f̂_n(x | x_1, x_2, . . . , x_n) ≜ ( 1 / (n det(H_n)) ) ∑_{i=1}^{n} K( H_n^{-1}(x − x_i) ),   x_i ∼ µ,   (B.1)

is a kernel density estimator (KDE) of f, and the matrix H_n is called the bandwidth matrix.
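A direct transcription of Equation (B.1) into Python/NumPy could look like the following sketch; the kernel K, the bandwidth matrix H_n and the sample are supplied by the caller, and no claim of numerical efficiency is made.

    import numpy as np

    def kde(x, samples, H, kernel):
        """Kernel density estimate (B.1) evaluated at the points x.

        x       : (m, k) array of evaluation points
        samples : (n, k) array holding the IID sample x_1, ..., x_n
        H       : (k, k) symmetric positive definite bandwidth matrix H_n
        kernel  : function mapping an array of k-dimensional points to kernel values
        """
        x = np.atleast_2d(np.asarray(x, dtype=float))
        samples = np.atleast_2d(np.asarray(samples, dtype=float))
        n, k = samples.shape
        H_inv = np.linalg.inv(H)
        # u[i, j] holds H^{-1} (x_i - sample_j), stored as a row vector.
        u = (x[:, None, :] - samples[None, :, :]) @ H_inv.T
        values = kernel(u.reshape(-1, k)).reshape(len(x), n)
        # Sum the kernel contributions and normalise by n det(H_n).
        return values.sum(axis=1) / (n * np.linalg.det(H))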

It should be noted that a more general definition of the regularisation kernel, and hence of the KDE, can be given. For example, the kernel does not necessarily have to be nonnegative [see, e.g. Devroye, 1987, Scott, 1992]. This chapter will however only focus on kernels that can be regarded as probability density functions with zero mean and finite second moment. Let us illustrate the kernel density estimator by an example.

Example 14: Suppose that an IID sample of size n is simulated according to the density f_N(x; 0, 1). Let there be two regularisation kernels

K_U(x) = χ_{[−1,1]}(x)/2,    K_N(x) = f_N(x; 0, 1).

The bandwidth matrix in this one-dimensional case is set to be h = 0.2. Figures B.1(a) and B.1(b) illustrate the resulting KDEs for sample size n = 10 using K_U and K_N, respectively. Figures B.1(c) and B.1(d) show the same results for n = 50.

Figure B.1: Kernel density estimates f̂_n of f_N(x; 0, 1). (a) Uniform kernel and n = 10. (b) Standard normal kernel and n = 10. (c) Uniform kernel and n = 50. (d) Standard normal kernel and n = 50.
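With the kde sketch above, the setting of Example 14 can be mimicked as follows; the data is freshly simulated, so the resulting curves only resemble, not reproduce, Figure B.1.

    import numpy as np

    rng = np.random.default_rng(0)

    def K_U(u):                      # uniform kernel on [-1, 1]
        u = u.ravel()
        return 0.5 * (np.abs(u) <= 1.0)

    def K_N(u):                      # standard normal kernel
        u = u.ravel()
        return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

    grid = np.linspace(-4.0, 4.0, 201).reshape(-1, 1)
    h = np.array([[0.2]])            # one-dimensional bandwidth matrix, h = 0.2
    for n in (10, 50):
        sample = rng.standard_normal((n, 1))     # IID sample from f_N(x; 0, 1)
        estimate_U = kde(grid, sample, h, K_U)   # KDE with the uniform kernel
        estimate_N = kde(grid, sample, h, K_N)   # KDE with the normal kernel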


For brevity of notation, let us introduce the following shorthand notations.

m_f(x) ≜ E[ f̂_n(x | x_1, x_2, . . . , x_n) ],    v_f(x) ≜ V[ f̂_n(x | x_1, x_2, . . . , x_n) ],

where x_i ∼ µ for all i = 1, 2, . . . , n. An important property of KDEs is that they are biased. This is to say that

ε_bias(x) ≜ m_f(x) − f(x)

is not necessarily equal to zero for a given value of x. In fact, a KDE is not necessarily even asymptotically unbiased, meaning that the bias may not tend to zero even if the sample size n tends to infinity. To avoid this problem, the bandwidth matrix H_n is defined to depend on the sample size n. By an appropriate definition of the bandwidth matrices, KDEs can be shown to be asymptotically unbiased.

A KDE is completely defined by its regularisation kernel and bandwidth matrix, and these parameters can be chosen in many ways. As usual, the parameters should not be chosen arbitrarily but with great care in order to guarantee reasonable performance.

B.2 Choice of Regularisation Kernel

The optimality of any set of parameters is always intimately linked to the choice of the optimality criterion. The significance of the following theorem is that it provides one optimality criterion which in turn can be used for deriving optimal choices of KDE parameters. Except for some details, the theorem can be found, e.g. in [Scott, 1992, Wand and Jones, 1995].

Theorem 22 Let f : R^k → R be a probability density function on R^k which has continuous third order partial derivatives. Moreover, let f̂_n be a KDE of f, obtained with bandwidth matrix H_n = a_n H and regularisation kernel K, where H is symmetric and positive definite, det(H) = 1, {a_n}_{n=1}^{∞} is a sequence of real numbers such that a_n → 0 and (n a_n^k)^{-1} → 0 as n → ∞, and ∫ xx^T K(x) λ^k(dx) = αI_k. Then, as n → ∞,

E[ ∫ ( f̂_n(x | x_1, x_2, . . . , x_n) − f(x) )² λ^k(dx) ] − ε_AMISE(n) → 0,   (B.2)

where

ε_AMISE(n) = (1/4) a_n^4 α² ∫ ( tr(H² ∇²f(x)) )² λ^k(dx) + ( 1 / (n a_n^k) ) ∫ K²(y) λ^k(dy)   (B.3)

is the asymptotic mean integrated squared error (AMISE).

Proof: By Fubini's theorem, the mean integrated squared error (MISE) in Equation (B.2) can be written as

E[ ∫ ( f̂_n(x | x_1, x_2, . . . , x_n) − f(x) )² λ^k(dx) ] = ∫ [ v_f(x) + ε_bias(x)² ] λ^k(dx).   (B.4)


The differentiability assumptions of f enable us to construct a second order Taylor series expansion of f at x. Thus, by a change of variables,

m_f(x) = (1/det(H_n)) ∫ K(H_n^{-1}(x − y)) f(y) λ^k(dy) = ∫ K(y) f(x − H_n y) λ^k(dy)
       = ∫ K(y) ( f(x) − ∇f(x)^T H_n y + (1/2) y^T H_n ∇²f(x) H_n y + r(H_n y) ) λ^k(dy)
       = f(x) + (1/2) ∫ y^T H_n ∇²f(x) H_n y K(y) λ^k(dy) + ∫ r(H_n y) K(y) λ^k(dy).   (B.5)

By Taylor's theorem, the remainder term r is

r(H_n y) = (1/6) ∑_{i,j,k,l,m,s} a_n³ h_{kl} h_{jm} h_{is} y_l y_m y_s D_{ijk} f(x*),   (B.6)

where h_{ij} = [H]_{ij} and D_{ijk} f(x*) denotes the third order partial derivatives of f at x* ∈ (x − H_n y, x). The bias can now be written as

ε_bias(x) = (1/2) ∫ y^T H_n ∇²f(x) H_n y K(y) λ^k(dy) + ∫ r(H_n y) K(y) λ^k(dy).   (B.7)

Because the integrand in the first integral is scalar, and hence equal to its trace, the first integral in Equation (B.7) is equal to

(1/2) ∫ tr( y^T H_n ∇²f(x) H_n y ) K(y) λ^k(dy)
   = (1/2) tr( ∫ a_n² H² ∇²f(x) yy^T K(y) λ^k(dy) )
   = (1/2) a_n² α tr( H² ∇²f(x) ).

The trace can be taken outside the integral because the integral is taken elementwise. Moreover, the first equality uses the property of the trace that if A ∈ R^{m×n} and B ∈ R^{n×m}, then tr(AB) = tr(BA). According to Equation (B.6), the second integral in Equation (B.7) tends to zero faster than the first integral as n → ∞. Thus the asymptotic bias is

ε*_bias(x) = (1/2) a_n² α tr( H² ∇²f(x) ).

Clearly, the bias vanishes as n → ∞, implying that the conditions of the theorem are sufficient for ensuring that the KDE is asymptotically unbiased.

Let us then examine the variance term in Equation (B.4). Straightforwardly,

v_f(x) = ( 1 / (n det(H_n)²) ) E[ K²(H_n^{-1}(x − y)) ] − m_f(x)²/n.


Using a change of variables and a zeroth order Taylor series expansion, the expectation can be written as

E[ K²(H_n^{-1}(x − y)) ] = det(H_n) ( f(x) ∫ K²(y) λ^k(dy) + ∫ r(H_n y) K²(y) λ^k(dy) ),

where, by Taylor's theorem, the remainder term is

r(H_n y) = ∑_{i,j} a_n h_{ij} y_j D_i f(x*),   x* ∈ (x − H_n y, x).   (B.8)

It is noted that det(H_n) = a_n^k det(H) = a_n^k because H ∈ R^{k×k}. Thus the variance v_f(x) becomes

v_f(x) = ( f(x) / (n a_n^k) ) ∫ K²(y) λ^k(dy) + ( 1 / (n a_n^k) ) ∫ r(H_n y) K²(y) λ^k(dy) − m_f(x)²/n.

The second and the third terms converge to zero faster than the first term as n → ∞. Therefore the asymptotic variance, as n → ∞, is

v*_f(x) = ( f(x) / (n a_n^k) ) ∫ K²(y) λ^k(dy).

Finally, the substitution of ε*_bias(x) and v*_f(x) into Equation (B.4) yields the AMISE

ε_AMISE(n) = ∫ [ v*_f(x) + ε*_bias(x)² ] λ^k(dx)
           = ( 1 / (n a_n^k) ) ∫ K²(y) λ^k(dy) + (1/4) a_n^4 α² ∫ ( tr(H² ∇²f(x)) )² λ^k(dx),

which completes the proof.

Roughly speaking, the additional requirement for the regularisation kernel in Theorem 22 is that the kernel is symmetric about the origin. An example of such a kernel is of course the standard normal density f_N(x; 0_k, I_k). Another regularisation kernel satisfying the additional condition in Theorem 22 is given by the following definition [see, e.g. Wand and Jones, 1995, Musso et al., 2001].

Definition 52 Let V^k be the volume of the k-dimensional unit hypersphere. A density K_E defined as

K_E(x) ≜ (1/2)(V^k)^{-1}(k + 2)(1 − ||x||²),   if ||x|| ≤ 1,
         0,                                    otherwise,   (B.9)

is called the Epanechnikov kernel.
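Equation (B.9) translates directly into code. In the following Python sketch the unit-ball volume V^k is computed from the Gamma function; the function accepts evaluation points as rows and can be plugged into the kde sketch of Section B.1.

    import math
    import numpy as np

    def epanechnikov(u):
        """Epanechnikov kernel (B.9) evaluated at the rows of u (shape (m, k))."""
        u = np.atleast_2d(u)
        k = u.shape[1]
        V_k = math.pi ** (k / 2) / math.gamma(k / 2 + 1)   # volume of the k-dimensional unit ball
        sq = np.sum(u ** 2, axis=1)
        return np.where(sq <= 1.0, 0.5 * (k + 2) * (1.0 - sq) / V_k, 0.0)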

An illustration of the Epanechnikov kernel in two dimensions is given in Figure B.2(a). For comparison purposes, Figure B.2(b) shows the normal distribution with the same covariance matrix.

The following theorem gives a reason for introducing the Epanechnikov kernel. A similar proof can also be found in [Devroye, 1987, page 120].


Figure B.2: (a) Epanechnikov kernel in two dimensions. (b) Normal kernel in two dimensions. The kernels have the same covariance matrix.

Theorem 23 Let K be a regularisation kernel satisfying

∫ ||x||² K(x) λ^k(dx) = ∫ ||x||² K_E(x) λ^k(dx).   (B.10)

Then

∫ K²(x) λ^k(dx) ≥ ∫ K_E²(x) λ^k(dx).

Proof: Let A be the support of the Epanechnikov kernel, i.e. A = {x ∈ R^k | x^T x ≤ 1}. Then

∫ K² dλ^k = ∫ (K − K_E)² dλ^k + ∫ K_E² dλ^k + 2 ∫ K_E(K − K_E) dλ^k.   (B.11)

By the assumption in Equation (B.10),

∫ (1 − ||x||²)(K(x) − K_E(x)) λ^k(dx)
   = ∫ K dλ^k − ∫ ||x||² K(x) λ^k(dx) − ∫ K_E dλ^k + ∫ ||x||² K_E(x) λ^k(dx) = 0.

Therefore one has

∫_A (1 − ||x||²)(K(x) − K_E(x)) λ^k(dx) = ∫_{∁A} (||x||² − 1)(K(x) − K_E(x)) λ^k(dx).

In the latter integral the integrand is clearly greater than or equal to zero, and since K_E is proportional to (1 − ||x||²) on A and vanishes on ∁A, this implies that 2 ∫ K_E(K − K_E) dλ^k ≥ 0 in Equation (B.11). Consequently

∫ K² dλ^k ≥ ∫ (K − K_E)² dλ^k + ∫ K_E² dλ^k,

where the equality is obtained when K = K_E.


It is observed that if the covariance matrix of the kernel is fixed to some value αI_k, then the AMISE in Equation (B.3) depends on the regularisation kernel only through the integral ∫ K² dλ^k. In the literature, this integral is occasionally referred to as the roughness of K [Scott, 1992, page 53]. Because of this observation, we have the following corollary.

Corollary 3 Let ε*_AMISE(n) be the AMISE obtained with the Epanechnikov kernel. Then

ε_AMISE(n) ≥ ε*_AMISE(n),

where ε_AMISE(n) is obtained with any regularisation kernel K such that

∫ xx^T K(x) λ^k(dx) = (k + 4)^{-1} I_k.

Proof: It can be shown that ∫ xx^T K_E(x) λ^k(dx) = (k + 4)^{-1} I_k. After fixing the value of αI_k to this value, the AMISE in Equation (B.3) depends on K only through ∫ K² dλ^k, and hence, according to Theorem 23, the Epanechnikov kernel minimises the AMISE.

In fact, Theorem 23 implies a stronger result stating that among the regularisation kernels such that ∫ xx^T K(x) λ^k(dx) = αI_k, the minimal AMISE is obtained by a kernel of the same shape as the Epanechnikov kernel in Definition 52 but with covariance αI_k.

It has now been shown that when estimating a sufficiently smooth density, the optimal choice of the regularisation kernel, in terms of AMISE, is the Epanechnikov kernel. This does not however imply that K_E would be the best kernel for all purposes. Because the KDE inherits the differentiability properties of the kernel, it follows that the KDE obtained with the Epanechnikov kernel does not have continuous first order partial derivatives [Scott, 1992].

Practical convenience may often suggest using the density of a normal distribution as a regularisation kernel. However, it has been shown that the normal kernel is indeed inefficient [see, e.g. Scott, 1992, page 140]. Therefore we quote the following remark by Scott [1992, page 139]: “Given the computational overhead computing exponentials, it is difficult to recommend the actual use of the normal kernel except as a point of reference”.
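The roughness comparison behind Corollary 3 is easy to verify numerically in one dimension. The sketch below compares the Epanechnikov kernel with a normal kernel scaled to the same second moment 1/(k + 4) = 1/5; the numerical values quoted in the final comments are approximate and serve only as a sanity check.

    import numpy as np

    x = np.linspace(-6.0, 6.0, 200001)
    dx = x[1] - x[0]

    K_E = np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x ** 2), 0.0)   # Epanechnikov, k = 1
    sigma = np.sqrt(1.0 / 5.0)                                     # normal kernel with second moment 1/5
    K_N = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    for name, K in (("Epanechnikov", K_E), ("normal", K_N)):
        second_moment = np.sum(x ** 2 * K) * dx   # both are approximately 0.2
        roughness = np.sum(K ** 2) * dx           # Epanechnikov about 0.60, normal about 0.63
        print(name, round(second_moment, 4), round(roughness, 4))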

B.3 Choice of Bandwidth Matrix

The discussion in the previous section provided only a partial answer to the choice of the parameters for a KDE. The remaining part is to choose the bandwidth matrix H_n. Unfortunately, a general closed-form expression for the optimal bandwidth matrix does not exist [Scott, 1992, page 151]. To overcome this difficulty, it has been proposed in the literature to use the optimal diagonal bandwidth matrix for the standard normal or Epanechnikov kernel when the target distribution is assumed to be a standard normal distribution [Musso et al., 2001, page 253]. Optimality is again considered in terms of AMISE.


Under the assumptions given above, the task of finding the optimal bandwidth matrix is reduced to that of finding a_n which minimises ε_AMISE(n). By setting the first derivative of ε_AMISE(n) with respect to a_n to zero, the optimal a_n is given by

a_n = ( k ∫ K² dλ^k / ( n α² ∫ ( tr(∇²f(x)) )² λ^k(dx) ) )^{1/(k+4)}.   (B.12)

When seeking the optimal value of a_n for the Epanechnikov kernel K_E and for the standard normal kernel K_N, one must evaluate the roughness values ∫ K² dλ^k for both of the kernels. It is straightforward to show that

∫ K_E² dλ^k = 2(k + 2) / ((k + 4)V^k),    ∫ K_N² dλ^k = 2^{-k} π^{-k/2}.

Also one must evaluate the value of α in Equation (B.3). For the standard normal kernel this is straightforwardly α = 1, while, according to the preceding section, for K_E it can be shown to be α = (k + 4)^{-1}. Because the density of interest was assumed to be standard normal, it can be shown after some computations that

∫ ( tr(∇²f_N(x; 0_k, I_k)) )² λ^k(dx) = (k² + 2k) / (2^{k+2} π^{k/2}).

The substitution of these values into Equation (B.12) yields the values

a_n^N = ( 4 / (n(k + 2)) )^{1/(k+4)},    a_n^E = ( 8(k + 4)(2√π)^k / (nV^k) )^{1/(k+4)}.
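Both constants are straightforward to evaluate in code; a small helper, reusing the unit-ball volume from the Epanechnikov sketch above:

    import math

    def bandwidth_scale(n, k, kernel="epanechnikov"):
        """Optimal AMISE scale a_n for a standard normal target density."""
        if kernel == "normal":
            return (4.0 / (n * (k + 2))) ** (1.0 / (k + 4))
        V_k = math.pi ** (k / 2) / math.gamma(k / 2 + 1)   # unit-ball volume V^k
        return (8.0 * (k + 4) * (2.0 * math.sqrt(math.pi)) ** k / (n * V_k)) ** (1.0 / (k + 4))

    # e.g. bandwidth_scale(50, 2) gives a_n^E for n = 50 samples in two dimensions.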

B.4 Estimation of General Densities

The assumptions under which the optimal values for a_n were derived in Section B.3 appeared quite restrictive, since in general the unknown distribution is often other than the standard normal distribution. The following approach to overcome this problem was proposed by Musso et al. [2001].

Suppose that the distribution of interest has mean m and covariance matrix C. Then the sample points x_i can be transformed into sample points z_i = T_M(x_i) in such a manner that the z_i are distributed with mean 0_k and covariance I_k. The required transformation is

T_M(x) = C^{-1/2}(x − m),

where C^{-1/2} is an inverse matrix square root of C. This transformation is also known as the Mahalanobis transformation. It should be noted that the resulting samples z_i are not in general distributed according to a standard normal distribution. However, because the mean and covariance are guaranteed to be 0_k and I_k, the problem is brought closer to the assumptions given in the preceding section. A KDE of the density of the sample {z_i | i = 1, 2, . . . , n} can now be constructed as

ĝ_n(z) = ( 1 / (n det(H_n)) ) ∑_{i=1}^{n} K( H_n^{-1}(z − C^{-1/2}(x_i − m)) ).   (B.13)


The density estimate ĝ_n should then be transformed to represent the density of the original sample {x_i | i = 1, 2, . . . , n}. If a random variable z has the probability density g, and if there is a bijection T : R^k → R^k such that det(∇T(x)) ≠ 0 and z = T(x), then the probability density function of the random variable x is [see, e.g. Anderson, 1958]

f(x) = g(T(x)) |det(∇T(x))|,   (B.14)

where ∇T denotes the Jacobian matrix of the mapping T. By applying this result to the KDE ĝ_n and the Mahalanobis transformation T_M one has

f̂_n(x) = ĝ_n(C^{-1/2}(x − m)) |det(C^{-1/2})|
        = ( ∑_{i=1}^{n} K(H_n^{-1} C^{-1/2}(x − x_i)) ) / ( n det(H_n) √(det(C)) )
        = ( 1 / (n det(H̄_n)) ) ∑_{i=1}^{n} K( H̄_n^{-1}(x − x_i) ),

where H̄_n = C^{1/2} H_n.

It is often the case that the exact mean m and covariance C are not available. Therefore, in practice, these values are approximated by the sample mean and sample covariance of the given data in order to evaluate the transformation T_M.
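Putting the pieces together, a regularised estimate of a general density can be sketched as follows; the sample covariance replaces the exact C (the mean cancels in the final expression above), and the back-transformation is absorbed into the bandwidth matrix H̄_n = C^{1/2}(a_n I_k), exactly as derived. The helper names kde, epanechnikov and bandwidth_scale refer to the earlier sketches in this appendix, and the names in the usage comment are purely illustrative.

    import numpy as np

    def general_kde(x, samples, kernel, a_n):
        """KDE of a general density via the Mahalanobis transformation (Section B.4)."""
        samples = np.atleast_2d(np.asarray(samples, dtype=float))
        k = samples.shape[1]
        C = np.cov(samples, rowvar=False).reshape(k, k)     # sample covariance
        # Symmetric square root C^{1/2} via an eigendecomposition of C.
        w, U = np.linalg.eigh(C)
        C_sqrt = (U * np.sqrt(w)) @ U.T
        H_bar = C_sqrt * a_n                                 # H_bar = C^{1/2} (a_n I_k)
        return kde(x, samples, H_bar, kernel)

    # e.g. estimate = general_kde(grid_points, data, epanechnikov, bandwidth_scale(len(data), k=2))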

Example 15: Consider the joint distribution of x_t and x_{t+1} defined in Example 10. The transformation of the samples is based on the sample mean and covariance of the data. Figure B.3 illustrates the construction of a KDE of the joint density using n = 50 samples from the distribution and the Epanechnikov kernel with a_n^E. Figure B.3(a) shows the samples after the Mahalanobis transformation, while the original data is illustrated in Figure B.3(b). Figure B.3(c) shows the contour plot of the density estimate of the transformed data and Figure B.3(d) shows the contour plot of the density estimate of the original data. For comparison, the contour plot of the exact density is shown in Figure B.3(e).

Figure B.3: (a) The Mahalanobis transformed data. (b) The original data. (c) A contour plot of the density estimate based on the transformed data. (d) A contour plot of the density estimate of f. (e) A contour plot of the true density f.