
  • Bayesian Nonparametric Modeling and Data Analysis: An Introduction (Draft)

    Raffaele Argiento
    CNR-IMATI, National Research Council - Milano (Italy)

    October 20, 2016

  • Aims and Prerequisites

    Aims: This course offers a theoretical and practical introduction to Bayesian nonparametric statistical procedures, a rapidly developing area of statistics. Key themes:

    • Exchangeability and de Finetti's Theorem.
    • Dirichlet process.
    • Dirichlet process mixture models.
    • Computation under Dirichlet process mixture models (marginal and conditional algorithms).

    Prerequisites: I will assume you know:

    • Basic probability theory;
    • Bayesian parametric modelling;
    • The R software and OpenBUGS or WinBUGS.

  • Reading list

    General references:

    • Regazzini, E. (1996). Impostazione nonparametrica di problemi d'inferenza bayesiana. IMATI Tech. Report 96-21, http://web.mi.imati.cnr.it/iami/abstracts/96-21.html.
    • Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer, New York.
    • Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G., eds. (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge Univ. Press.
    • Müller, P. and Rodriguez, A. (2013). Nonparametric Bayesian Inference. NSF-CBMS Regional Conference Series in Probability and Statistics 9, Institute of Mathematical Statistics.
    • Müller, P., Quintana, F.A., Jara, A. and Hanson, T. (2015). Bayesian Nonparametric Data Analysis. Springer.
    • Ghosal, S. and van der Vaart, A. (2016). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.

  • Terminology

    Parametric model: the number of parameters is fixed (or bounded by a constant) w.r.t. the sample size.

    Nonparametric model:
    • the number of parameters grows with the sample size;
    • ∞-dimensional parameter space.

    Example: in density estimation the parameter is the ∞-dimensional object f_Y, the density of the observations.

  • Nonparametric Bayesian Model

    Definition
    A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

    Interpretation
    Parameter space Θ = {set of possible parameters}, for example:

    Problem              Θ
    Density estimation   Probability distributions
    Regression           Smooth functions
    Clustering           Partitions

    ✓ The target of the Bayesian statistician is the posterior distribution on the space of all parameters.

  • Parametric Bayesian Modelling

  • Independence

    ✓ Classical statistical inference is based on the assumption of independence:

    P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i).

    This assumption:
    • is convenient from a mathematical point of view, in view of the factorisation;
    • implies that the information in one observation does not provide any information on the subsequent ones, that is

    P(X_{n+1} \in A \mid X^{(n)}) = P(X_{n+1} \in A),   where X^{(n)} = (X_1, \ldots, X_n).

  • Independence

    This is hardly justified in practice:
    ✓ independence among observations is a strong assumption, difficult to verify;
    ✓ collecting observations of a quantity of interest must tell me something about what I am going to observe next; this information should be incorporated into my model and used for updating my knowledge of the phenomenon.

    "I am trying to learn about something and have some current knowledge. My current knowledge is encapsulated in a small model. I learn through further observations that this small model is wrong or misplaced. I must change it, whether my foundations for the inference I am undertaking allow me to do this or not. The current knowledge is being altered through further observations, and then revised from these observations." [Walker, S.G. Bayesian nonparametrics. In Bayesian Theory and Applications, pp. 249-270. Oxford University Press.]

  • Exchangeability

    Exchangeability:
    • assumes homogeneity/symmetry among the elements of the data sequence;
    • does not assume the events physically influence one another;
    • the order in which r.v.s are observed is irrelevant for inference;
    • among the weakest forms of dependence (e.g., Markovianity implies a natural order), a minimal assumption of symmetry;
    • the implied mathematical framework remains analytically tractable, thanks to de Finetti's Theorem.

    Definition
    A sequence (X_n)_{n \ge 1} is said to be exchangeable if

    (X_1, \ldots, X_n) \overset{d}{=} (X_{\pi(1)}, \ldots, X_{\pi(n)})

    for all n ≥ 1 and all permutations π of (1, \ldots, n).

    ✓ Interpretation: the order of appearance of the observations does not matter in terms of their joint distribution.

  • Pólya urn process

    How can we sample an exchangeable sequence?

    ✓ Consider an urn with B_0 black and W_0 white balls. The sequence of observations X_1, X_2, \ldots is sampled by the following procedure. Set n = 1:

    1. Draw a ball at random from the urn and note its color;
    2. If the ball is black then X_n = 1, otherwise X_n = 0;
    3. Place the ball plus 1 extra ball of the observed color in the urn;
    4. Set n = n + 1 and go to step 1.

    It is not difficult to realize that

    P(X_1 = x_1, \ldots, X_n = x_n) = \frac{\prod_{s=0}^{S_n-1}(B_0 + s)\;\prod_{s=0}^{n-S_n-1}(W_0 + s)}{\prod_{s=0}^{n-1}(W_0 + B_0 + s)},   where S_n = \sum_{i=1}^{n} x_i.

    The sequence of observed colors is exchangeable!
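    A minimal R sketch of this urn scheme (the values B0 = 2, W0 = 3 and the sample size are illustrative choices, and the function names are mine): the joint probability of a 0/1 sequence depends on it only through S_n, which is exactly the exchangeability property.

    ## Binary Polya urn: sample a sequence and evaluate its joint probability
    polya_urn <- function(n, B0 = 2, W0 = 3) {
      x <- integer(n)
      B <- B0; W <- W0
      for (i in 1:n) {
        x[i] <- rbinom(1, 1, B / (B + W))          # draw a ball: 1 = black, 0 = white
        if (x[i] == 1) B <- B + 1 else W <- W + 1  # reinforcement
      }
      x
    }

    ## Joint probability from the product formula: it depends on x only through S_n
    polya_prob <- function(x, B0 = 2, W0 = 3) {
      n <- length(x); Sn <- sum(x)
      num <- prod(B0 + seq_len(Sn) - 1) * prod(W0 + seq_len(n - Sn) - 1)
      num / prod(B0 + W0 + seq_len(n) - 1)
    }

    set.seed(1)
    x <- polya_urn(10)
    polya_prob(x)           # same value...
    polya_prob(sample(x))   # ...for any permutation of x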

  • de Finetti Theorem

    de Finetti's representation theorem for binary sequences [de Finetti (1933a). Classi di numeri aleatori equivalenti. Rendiconti della R. Accademia Nazionale dei Lincei 18, 107-110] formalises the relationship between exchangeable and iid sequences.

    Theorem (de Finetti's representation theorem for binary sequences)
    A sequence (X_n)_{n \ge 1} taking values in {0, 1} is exchangeable if and only if there exists a probability measure π on [0, 1] such that

    P(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 \theta^{k} (1 - \theta)^{n-k} \, \pi(d\theta),

    where k = \sum_{i=1}^{n} x_i is the number of successes. Moreover,

    \frac{1}{n} \sum_{i=1}^{n} X_i \to \theta   a.s.,   with \theta \sim \pi.

  • Comments

    • Given the value of θ, we have

    P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \theta^{k} (1 - \theta)^{n-k},

    that is, given θ, the observations are conditionally independent with Bern(θ) distribution. Hence they are conditionally iid.

    • The parameter θ (the Bernoulli success probability) is taken to be a r.v., instead of an unknown constant, with distribution π.

    • π is called the de Finetti measure and is the prior distribution, i.e., a distribution on the parameter space (here [0, 1]) that represents the initial opinion on the parameter before observing data.

    • The integral representation for the joint distribution of the sequence tells us that a binary exchangeable sequence is a mixture of iid Bernoulli sequences:

    exchangeability ⇔ mixture of iid

  • Hierarchical modelling

    Bayesian Model

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} f(y; \theta)   (sampling model)
    \theta \sim \pi(\theta)   (prior)

    The conditional independence can be stated as follows:
    ✓ there is a state of the world θ which is unknown; here θ is a random variable;
    ✓ given θ, the events are iid;
    ✓ without knowing θ, they are not independent, only exchangeable.

    Two equivalent ways to set a Bayesian model:

    1. predictive approach: choose an (infinite) exchangeable model for the observations;
    2. hierarchical approach: choose a conditional sampling model and a prior for its parameter.

  • A predictive approach

    ✓ Consider the sequence of observations (X_n)_{n \ge 1} sampled via the Pólya urn scheme.

    ✓ We already observed that (X_n)_{n \ge 1} is exchangeable. Can we find its de Finetti measure?

    P(X_1 = x_1, \ldots, X_n = x_n)
      = \frac{\prod_{s=0}^{S_n-1}(B_0 + s)\;\prod_{s=0}^{n-S_n-1}(W_0 + s)}{\prod_{s=0}^{n-1}(W_0 + B_0 + s)}
      = \frac{\Gamma(B_0 + W_0)}{\Gamma(B_0)\Gamma(W_0)} \cdot \frac{\Gamma(B_0 + S_n)\,\Gamma(W_0 + n - S_n)}{\Gamma(W_0 + B_0 + n)}

    (in fact \prod_{j=0}^{k-1}(a + j) = \Gamma(a + k)/\Gamma(a) for each a and k)

      = \int_0^1 \theta^{S_n} (1 - \theta)^{n - S_n} \left\{ \frac{\Gamma(W_0 + B_0)}{\Gamma(B_0)\Gamma(W_0)}\, \theta^{B_0 - 1} (1 - \theta)^{W_0 - 1} \right\} d\theta
      = \int_0^1 \theta^{S_n} (1 - \theta)^{n - S_n}\, \pi(\theta)\, d\theta,

    where π(θ) is the density of a Beta(B_0, W_0): the de Finetti measure of the sequence (X_n)_{n \ge 1}.
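    This can also be checked by simulation: the long-run fraction of black draws in a single Pólya urn run is approximately a draw from the de Finetti measure Beta(B_0, W_0). A hedged R sketch (B0 = 2, W0 = 3, the run length and the number of runs are illustrative choices):

    set.seed(2)
    B0 <- 2; W0 <- 3
    n_draws <- 500      # length of each urn run
    n_runs  <- 2000     # number of independent runs

    run_frequency <- function() {
      B <- B0; W <- W0; S <- 0
      for (i in 1:n_draws) {
        x <- rbinom(1, 1, B / (B + W))
        S <- S + x
        if (x == 1) B <- B + 1 else W <- W + 1
      }
      S / n_draws       # fraction of black draws in this run
    }

    freqs <- replicate(n_runs, run_frequency())

    ## Compare the limiting frequencies with the Beta(B0, W0) density
    hist(freqs, breaks = 40, freq = FALSE, main = "de Finetti measure check")
    curve(dbeta(x, B0, W0), add = TRUE, lwd = 2)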

  • Prior choice

    The elicitation (e.g., how to choose B_0 and W_0) or choice of the prior π is the subject of a long-lasting and still unresolved debate:

    • some believe incorporating the researcher's knowledge into the prior is at the very essence of the Bayesian approach: subjective approach;

    • some believe we should incorporate as little knowledge as possible, in order to limit the effect of the prior opinion and let the data swamp the prior: non-informative priors or objective Bayes;

    • this debate is somewhat restricted to low-dimensional, parametric approaches, where one has at least a hope of being able, if willing, to specify a prior which encapsulates some desired knowledge;

    • in high-dimensional or nonparametric approaches, it is usually very difficult even to understand what the implications of a prior choice on the model are, so the choice is usually dictated by mathematical convenience or guided, among plausible alternatives, by considerations about some specific aspects of the choice.

  • Bayes’ Theorem for dominated models

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} f(y; \theta)   (sampling model)
    \theta \sim \pi(\theta)   (prior)

    Interpretation: \theta \mapsto \prod_{i=1}^{n} f(x_i; \theta) is the likelihood, π(dθ) the prior.

    Then the posterior distribution of θ, given X_1 = x_1, \ldots, X_n = x_n, can be computed by Bayes' Theorem:

    P(\theta \in B \mid X_1 = x_1, \ldots, X_n = x_n) \overset{a.s.}{=} \frac{\int_B \prod_{i=1}^{n} f(x_i; \theta)\, \pi(d\theta)}{\int_\Theta \prod_{i=1}^{n} f(x_i; \theta)\, \pi(d\theta)},   B \in \mathcal{B}(\Theta).

    Proof: definition of conditional distribution (as the solution of an integral equation) + Radon-Nikodym Theorem.

  • Our simple example

    Let

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} Bern(y; \theta)
    \theta \sim Beta(\theta; \alpha, \beta),

    with \alpha = B_0 > 0, \beta = W_0 > 0. Then by Bayes' theorem

    \pi(\theta \mid X^{(n)}) \propto \pi(\theta) \prod_{i=1}^{n} f(X_i; \theta)   (keeping only the θ terms)
      = \theta^{\alpha + \sum_i X_i - 1} (1 - \theta)^{\beta + n - \sum_i X_i - 1},

    so, normalising, we get \pi(\theta \mid X^{(n)}) = Beta(\alpha + \sum_i X_i,\; \beta + n - \sum_i X_i). The posterior expected value of θ is

    E(\theta \mid X^{(n)}) = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{E(\theta)\,=\,\text{prior mean}} + \frac{n}{\alpha + \beta + n} \cdot \underbrace{\frac{1}{n}\sum_{i=1}^{n} X_i}_{\text{sample mean}}.
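    A short R sketch of this conjugate update (the prior parameters, the true θ and the sample size are illustrative choices):

    set.seed(3)
    alpha <- 2; beta <- 3          # prior Beta(alpha, beta), i.e. B0 and W0
    n <- 50
    x <- rbinom(n, 1, 0.7)         # simulated 0/1 data

    ## Posterior parameters
    alpha_post <- alpha + sum(x)
    beta_post  <- beta + n - sum(x)

    ## Posterior mean as a weighted average of prior mean and sample mean
    w <- (alpha + beta) / (alpha + beta + n)
    post_mean <- w * alpha / (alpha + beta) + (1 - w) * mean(x)
    all.equal(post_mean, alpha_post / (alpha_post + beta_post))   # TRUE

    ## 95% posterior credible interval for theta
    qbeta(c(0.025, 0.975), alpha_post, beta_post)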

  • Conjugacy

    The Beta example exemplifies the notion of conjugacy.

    Definition (Conjugacy)
    Let {f(·; θ), θ ∈ Θ}, Θ ⊂ R^d, be a family of distributions. We say that the family of prior distributions {π(θ | γ), γ ∈ R^K}, where γ is a vector of parameters for π, is conjugate to the model f if, given X_i \overset{iid}{\sim} f(·; θ), the posterior distribution can be written as

    \pi(\theta \mid \gamma'(X^{(n)})),

    i.e. the posterior has the same analytical structure as the prior, with updated parameters.

    ✓ In the previous example we had

    \gamma = (\alpha, \beta),   \gamma'(X^{(n)}) = \left( \alpha + \sum_{i=1}^{n} X_i,\; \beta + n - \sum_{i=1}^{n} X_i \right).

  • Comments

    In other words, the family of prior distributions is closed under the operation of Bayesian updating based on the collected data. Distributions in the exponential family admit conjugate priors. Typical examples:

    • Beta-Binomial model (Beta prior on the success probability p)
    • Gamma-Poisson (Gamma prior on the Poisson rate)
    • Normal-Normal (Normal prior on the Normal mean)
    • InverseGamma-Normal (IG prior on the Normal variance)

    and some of these generalise to the multivariate case.

  • Bayesian Nonparametric Modelling

  • Tiny Bit of Probability Notation

    • X is a separable and complete metric space (think of R^p);
    • 𝒳 is the Borel σ-algebra of subsets of X;
    • (X_n)_{n≥1} is a sequence of random elements defined on some probability space (Ω, F, P) and taking values in (X^∞, 𝒳^∞);
      (X_n)_{n≥1} is the sequence of observations (DATA); X_n is the result of the random experiment at trial n;
    • PX is the space of all probability measures on (X, 𝒳), with the topology of weak convergence;
    • 𝒫(X) is the Borel σ-algebra of subsets of PX;
    • a random element P defined on (Ω, F, P) and taking values in (PX, 𝒫(X)) is a random probability measure.

  • de Finetti's representation theorem (general case)

    Theorem (de Finetti's representation theorem for general sequences)
    The sequence X^{(∞)} = (X_n)_{n≥1} is exchangeable if and only if there exists a probability measure Π on (PX, 𝒫(X)) such that

    P[X_1 \in A_1, \ldots, X_n \in A_n] = \int_{PX} \prod_{i=1}^{n} P(A_i)\, \Pi(dP)

    for any n ≥ 1 and A_1, \ldots, A_n in 𝒳, where the probability Π is uniquely determined.

    Equivalently, (X_n)_{n≥1} is exchangeable if and only if there exists a random probability measure P on (X, 𝒳) such that P ∼ Π and

    P[X_1 \in A_1, \ldots, X_n \in A_n \mid P] = \prod_{i=1}^{n} P(A_i)

    for any n ≥ 1 and A_1, \ldots, A_n in 𝒳.

    [Hewitt, E. and Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 470-501. Schervish, M.J. (1995). Theory of Statistics. Springer-Verlag, New York.]

  • de Finetti’s representation theorem (general case) (1933) - continued

    Π is a probability measure on PX −→ the de Finetti measure of (X_n)_{n≥1}.

    ✓ If (X_n)_{n≥1} is exchangeable, then its empirical distribution is such that

    \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i} \Rightarrow P   a.s.-P,

    where ⇒ denotes weak convergence.

    Hierarchical representation: (X_n)_{n≥1} exchangeable is equivalent to

    X_i \mid P \overset{iid}{\sim} P
    P \sim \Pi,   Π = prior distribution.

    The Bayesian nonparametric framework is equivalent to exchangeability of (X_n)_{n≥1}.

  • Parametric case through the representation theorem

    Parametric model: Π degenerate on a finite-dimensional subset P*_X of PX, that is

    \Pi(P^*_X) = \Pi(\{P \in PX : P = P_\theta,\ \theta \in \Theta\}) = 1,

    and there exists a bijective function

    g : P^*_X \to \Theta.

    Θ ⊂ R^p is called the parameter space. The prior Π induces a probability on Θ:

    \pi(B) = \Pi(g^{-1}(B)),   B \in \mathcal{B}(\Theta).

    In these cases:

    X_i \mid \theta \overset{iid}{\sim} P_\theta(dx)
    \theta \sim \pi   (prior distribution)

    For instance:

    \Pi(\{P \in PX : P(dx) = (1/\sigma)\,\varphi((x - \mu)/\sigma)\, dx,\ (\mu, \sigma) \in R \times R^+\}) = 1,

    with ϕ the density function of a N(0, 1) distribution.

  • Parametric vs Nonparametric case

    When can we assume Π(P*_X) = 1, where P*_X is finite-dimensional? In other words, when can we assume the model is parametric?

    • if, from past experience in cases similar to the one analysed, we believe that the parametric family approximates the "true" distribution well;

    • if, in addition to exchangeability, we assume further conditions for the sequence of observations. For example, if (X_n)_{n≥1} is also spherically symmetric (L((X_1, \ldots, X_n)^T) = L(A(X_1, \ldots, X_n)^T) for any orthogonal matrix A), then P*_X is the family of Gaussian distributions with mean 0.

    Otherwise: nonparametric model
    −→ greater flexibility when Π has large support, possibly supp(Π) = PX.

  • Why a nonparametric approach?

    There are two main (interrelated) ideas behind a nonparametric prior Π.

    • Flexibility: a nonparametric prior Π is flexible in the sense that it does not place particular constraints, e.g., shape constraints. Parametric models such as the Poisson, Geometric, Gaussian, Gamma and so on impose unimodality.

    • Full support:

    Definition (Full support)
    We say that a prior distribution Π on PX has full support if

    supp(\Pi) \equiv PX,

    i.e., if the smallest closed set over which Π puts probability one is the set of all distributions whose support is contained in X.

    ✓ A prior with full support puts positive mass around any distribution supported on X or its subsets. See [Ferguson(1974)] for more on this point.

  • Bayes Theorem

    Bayes' Theorem in the usual form does not necessarily apply in the nonparametric framework. In fact, many nonparametric models are not dominated, e.g., if realisations are discrete distributions with random locations. In these cases the determination of the posterior distribution relies on

    • ad hoc analytical strategies (e.g. for the Dirichlet process, through projections);
    • MCMC (99% of what's done nowadays); this is not a determination of the posterior per se, but rather a way of obtaining a sample from the posterior, to be used for computing quantities of interest.

  • Conjugacy

    A further desirable aspect is conjugacy, i.e. when the posterior belongs to the same family as the prior, with updated parameters. However, in the nonparametric framework

    • many models are not conjugate, or not strictly conjugate;
    • for most of the others (e.g., recent developments on complicated spaces) this is not known.

    The reason is that the complexity of certain models which are believed to be useful today is such that there is no hope of obtaining analytical results. MCMC is the standard practice in most cases. There are, however, a few exceptions. See [Lijoi and Prünster(2010)] for a review, and [?] for a recent result on two time-dependent models.

  • Historical overview

    • Bruno de Finetti in the 1930s paved the way for the subsequent development of the Bayesian paradigm in its full generality, by laying the theoretical foundations by means of the representation theorem for exchangeable sequences and the theory of processes with stationary and independent increments.

    • Until the beginning of the '70s, the necessary tools for the actual implementation of the nonparametric approach were still missing.

    • The first breakthrough arrives with Ferguson (1973), who introduces the Dirichlet process. This changed everything by showing that the nonparametric approach to Bayesian inference was indeed feasible from an analytical point of view, and hence could be implemented.

  • Historical overview 2

    • Focus on survival analysis in the '70s and '80s:
      • neutral to the right processes: Doksum (1974); Ferguson (1974); Ferguson and Phadia (1979)
      • extended gamma process: Dykstra and Laud (1981)
      • beta process: Hjort (1990)

    • Focus on density estimation in the '90s:
      • mixtures w.r.t. the Dirichlet process: Lo (1984), to date the most successful Bayesian nonparametric model
      • Pólya trees: Mauldin, Sudderth and Williams (1992)
      • Bernstein polynomials: Petrone (1999)

  • Historical overview 3

    • Second breakthrough, the computational revolution: popularisation of MCMC techniques for Bayesian inference, mainly started with Escobar and West (1995), who introduce a Gibbs sampler for Dirichlet mixtures. Important subsequent contributions by Müller, MacEachern, Ishwaran and James (2001), Walker (2007), Papaspiliopoulos and Roberts (2008).

    • Frequentist asymptotic validation of BNP procedures: Diaconis and Freedman (1983; 1986) pose the question "what if" from a frequentist point of view; then the first major positive results: Barron, Schervish and Wasserman (1999), Ghosal, Ghosh and Ramamoorthi (1999), Ghosal, Ghosh and van der Vaart (2000), Walker (2004).

    • Recent theoretical advances rely hugely on the theory of stochastic processes, on the use of completely random measures introduced by Kingman (1967), and on the theory of random partitions and combinatorial stochastic processes initiated in a series of papers by Kingman and later developed by Pitman (2006).

  • Historical overview 4

    • Priors with discrete structures: species sampling models, stick-breaking priors, normalised random measures with independent increments, Gibbs-type processes, etc.

    • Dependent priors for inference on non-exchangeable data: Cifarelli and Regazzini (1978), MacEachern (1999; 2000). Recent reviews: Dunson (2010), Teh and Jordan (2010), Müller and Mitra (2013).

    • Lively interactions with other communities and research areas such as, e.g., Machine Learning, Combinatorics, Population Genetics, Bioinformatics, etc.

    General references:
    • Regazzini (1996)
    • Ghosh and Ramamoorthi (2003)
    • Hjort, Holmes, Müller and Walker (2010)
    • Müller and Rodriguez (2013)
    • Müller, Quintana, Jara and Hanson (2015)
    • Ghosal and van der Vaart (2016)

  • Dirichlet Process

  • The Dirichlet distribution

    Let α_1, \ldots, α_K be positive numbers and define

    \Delta_{K-1} = \left\{ x \in [0,1]^{K-1} : \sum_{i=1}^{K-1} x_i \le 1 \right\} \equiv \left\{ x \in [0,1]^{K} : \sum_{i=1}^{K} x_i = 1 \right\},

    the (K-1)-dimensional simplex. We say that

    (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K)

    if it has density w.r.t. the Lebesgue measure on \Delta_{K-1} given by

    f_{K-1}(w; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}\; w_1^{\alpha_1 - 1} \cdots w_{K-1}^{\alpha_{K-1} - 1} \Big(\underbrace{1 - \sum_{i=1}^{K-1} w_i}_{w_K}\Big)^{\alpha_K - 1} \mathbf{1}_{\Delta_{K-1}}(w).

    Note: this models K coordinates, by taking w_K = 1 - \sum_{i=1}^{K-1} w_i.

  • The Dirichlet Distribution

    [Figure slide: illustration of the Dirichlet distribution.]

  • Properties of the Dirichlet distribution

    Some key properties:

    • Marginalisation: if (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K) and i_1, \ldots, i_\ell are \ell indices in {1, \ldots, K}, then

    (w_{i_1}, \ldots, w_{i_\ell}, w_{i_{\ell+1}}) \sim Dir_{\ell}\left( \alpha_{i_1}, \ldots, \alpha_{i_\ell},\; \sum_{i=1}^{K} \alpha_i - \sum_{j=1}^{\ell} \alpha_{i_j} \right),

    where w_{i_{\ell+1}} = 1 - \sum_{j=1}^{\ell} w_{i_j}.

    • Special case of the previous:

    w_i \sim Beta\left( \alpha_i,\; \sum_{j=1}^{K} \alpha_j - \alpha_i \right).

  • Properties

    • Construction from independent normalised gamma r.v.'s
      Let Y_j \overset{ind}{\sim} Ga(\alpha_j, 1), \alpha_j > 0, for j = 1, \ldots, K, and define

      w_j = \frac{Y_j}{\sum_{i=1}^{K} Y_i};

      then (w_1, \ldots, w_{K-1}) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K).

    • Aggregation of coordinates
      If (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K) and 0 < r_1 < \cdots < r_\ell = K, then

      \left( \sum_{i=1}^{r_1} w_i, \ldots, \sum_{i=r_{\ell-1}+1}^{r_\ell} w_i \right) \sim Dir_{\ell-1}\left( \sum_{i=1}^{r_1} \alpha_i, \ldots, \sum_{i=r_{\ell-1}+1}^{r_\ell} \alpha_i \right).

    ✓ This property suggests that one can "interpret" the vector of parameters (\alpha_1, \ldots, \alpha_K) as a measure on the coordinate indices: \alpha(A) = \sum_{i \in A} \alpha_i.
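    The construction from normalised gammas gives a direct way to simulate Dirichlet vectors in R. The sketch below (the function name rdirichlet, the parameter values and the checks are illustrative choices) also verifies the marginal Beta property and the aggregation property by simulation:

    rdirichlet <- function(n, alpha) {
      K <- length(alpha)
      y <- matrix(rgamma(n * K, shape = rep(alpha, each = n), rate = 1), nrow = n)
      y / rowSums(y)                     # normalise each row
    }

    set.seed(4)
    alpha <- c(2, 1, 3)
    w <- rdirichlet(10000, alpha)

    ## Marginal property: w_1 ~ Beta(alpha_1, sum(alpha) - alpha_1)
    c(mean(w[, 1]), alpha[1] / sum(alpha))                  # empirical vs exact mean
    ks.test(w[, 1], pbeta, alpha[1], sum(alpha) - alpha[1])

    ## Aggregation: w_1 + w_2 ~ Beta(alpha_1 + alpha_2, alpha_3)
    ks.test(w[, 1] + w[, 2], pbeta, alpha[1] + alpha[2], alpha[3])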

  • The Dirichlet distribution in Bayesian statistics

    A hierarchical model:

    X_1, \ldots, X_n \mid (w_1, \ldots, w_K) \overset{iid}{\sim} Multinomial(w_1, \ldots, w_K)   (sampling model)
    (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K)   (prior)

    Recall that under the multinomial likelihood P(X_1 = j \mid w_1, \ldots, w_K) = w_j.

    It is not difficult to realize that

    (w_1, \ldots, w_K) \mid X_1 = j \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_j + 1, \ldots, \alpha_K).

    ✓ The Dirichlet prior is conjugate w.r.t. the multinomial model.

  • Predictive distributions

    Since the Dirichlet-Multinomial model is conjugate, we can also easily compute the predictive distributions:

    P(X_1 = j) = \frac{\alpha_j}{\sum_{i=1}^{K} \alpha_i}

    P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{\sum_{i=1}^{K} \alpha_i + 1}

    \ldots

    P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^{n} \delta_{X_i}(j)}{\sum_{i=1}^{K} \alpha_i + n}

    The fact that I observe category j at time n reinforces the probability that I will observe the same category in the future, at times n + 1, n + 2, \ldots.
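    A minimal R sketch of the Dirichlet-Multinomial update and of the resulting predictive distribution (the prior vector and the observed labels are illustrative choices):

    alpha <- c(1, 2, 2)                 # prior Dir(alpha) over K = 3 categories
    x     <- c(3, 1, 2, 1, 3, 3, 2)     # observed category labels in {1, 2, 3}

    counts     <- tabulate(x, nbins = length(alpha))
    alpha_post <- alpha + counts        # posterior is Dir(alpha + counts)

    ## Predictive distribution of the next observation X_{n+1}
    alpha_post / sum(alpha_post)        # P(X_{n+1} = j | X_1, ..., X_n)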

  • Pólya urn process

    ✓ Consider an urn with K different colors {1, \ldots, K}. At the beginning there are α_j balls of color j in the urn. We sample a sequence X_1, X_2, \ldots as follows (set n = 1):

    1. Draw a ball at random from the urn and note its color;
    2. If the ball is of color j, then X_n = j;
    3. Return the ball and place 1 extra ball of the observed color in the urn (reinforcement);
    4. Set n = n + 1 and go to step 1.

    It is not difficult to realize that

    P(X_1 = j) = \frac{\alpha_j}{\sum_{i=1}^{K} \alpha_i},
    P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{\sum_{i=1}^{K} \alpha_i + 1},
    \ldots,
    P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^{n} \delta_{X_i}(j)}{\sum_{i=1}^{K} \alpha_i + n}.

  • Pólya urn

    ✓ The predictive structure of the sequence sampled with the Pólya urn is the same as the one obtained under the Dirichlet-Multinomial model, so

    1. the Pólya sequence is exchangeable,
    2. its de Finetti measure is the Dirichlet distribution.

    We can thus obtain the Dirichlet-Multinomial model via the predictive approach.

  • The Dirichlet process

    Definition
    Let α > 0 and P_0 a probability measure on X. A random probability measure P on (X, 𝒳) is said to be a Dirichlet process, denoted P ∼ DP(α, P_0), if for every finite measurable partition A_1, \ldots, A_K of X we have

    (P(A_1), \ldots, P(A_K)) \sim Dir(\alpha P_0(A_1), \ldots, \alpha P_0(A_K)).

    ✓ The DP is an infinite-dimensional process whose finite-dimensional distributions are Dirichlet.

    ✓ The existence and well-definedness of this object are non-trivial.

    ✓ Example: given A ∈ 𝒳,

    P(A) \sim Beta\big(\alpha P_0(A),\; \alpha(1 - P_0(A))\big).

  • Properties of the Dirichlet process

    Two parameters:
    ✓ α is called the total mass;
    ✓ P_0(·) is called the baseline or centering measure.

    A priori moments, for any A ∈ 𝒳:

    ✓ Expected value: E[P(A)] = P_0(A),
      from which the interpretation of the parameter measure P_0 ∈ PX as centering distribution or prior guess.

    ✓ Variance: Var(P(A)) = \frac{P_0(A)(1 - P_0(A))}{\alpha + 1},
      so α controls the variability of P around the prior guess P_0, and is also called the precision parameter.

    ✓ Covariance: Cov(P(A), P(B)) = \frac{P_0(A \cap B) - P_0(A) P_0(B)}{\alpha + 1} for any B ∈ 𝒳.
      Drawback: if A and B are disjoint, the covariance is always negative.

  • Example of draws from a DP

    [Figure: four panels showing CDF trajectories drawn from DP(α, P_0) for α = 2, 4, 20, 100; x-axis: x, y-axis: cdf.]

    ✓ Samples from DP(α, P_0) with P_0 = N(0, 1) (thick blue curve) and varying precision parameter α.

    ✓ Note how α controls not only the variability of the realizations around P_0, but also the relative size of the jumps.

  • Pólya urn process

    Consider an urn with a continuum of colors [Blackwell, D. and MacQueen, J.B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics]. The initial composition of the urn is summarized by a finite measure αP_0(·). We sample a sequence X_1, X_2, \ldots as follows (set n = 1):

    1. Draw a ball at random from the urn according to the probability law obtained by normalizing its composition measure;
    2. If the ball is of color x_n, then X_n = x_n;
    3. Place 1 extra ball of the observed color in the urn (reinforcement);
    4. Set n = n + 1 and go to step 1.

    It is possible to see that

    P(X_1 \in B) = \frac{\alpha P_0(B)}{\alpha P_0(X)} = P_0(B),

    P(X_2 \in B \mid X_1) = \frac{\alpha P_0(B) + \delta_{X_1}(B)}{\alpha + 1} = \frac{\alpha}{\alpha + 1} P_0(B) + \frac{1}{\alpha + 1} \delta_{X_1}(B),

    \ldots,

    P(X_{n+1} \in B \mid X_1, \ldots, X_n) = \frac{\alpha P_0(B) + \sum_{i=1}^{n} \delta_{X_i}(B)}{\alpha + n} = \frac{\alpha}{\alpha + n} P_0(B) + \frac{n}{\alpha + n}\, \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}(B).
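    This predictive scheme can be simulated directly, without ever representing P. The R sketch below uses P_0 = N(0, 1) and α = 5 as illustrative choices (the function name dp_polya_urn is mine):

    dp_polya_urn <- function(n, alpha = 5, r_p0 = function(m) rnorm(m)) {
      x <- numeric(n)
      x[1] <- r_p0(1)                          # X_1 ~ P0
      for (i in 2:n) {
        if (runif(1) < alpha / (alpha + i - 1)) {
          x[i] <- r_p0(1)                      # new color, drawn from P0
        } else {
          x[i] <- x[sample.int(i - 1, 1)]      # repeat one of the past values
        }
      }
      x
    }

    set.seed(6)
    x <- dp_polya_urn(200)
    length(unique(x))                          # ties occur: P is a.s. discrete
    sort(table(x), decreasing = TRUE)[1:5]     # the most repeated values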

  • A simple Bayesian nonparametric model

    ✓ It is possible to prove that
    1. the Pólya sequence is exchangeable,
    2. its de Finetti measure is the Dirichlet process.

    ✓ We can use the predictive approach to set our first Bayesian nonparametric model:

    X_1, \ldots, X_n \mid P \overset{iid}{\sim} P   (sampling model)
    P \sim DP(\alpha, P_0)   (prior)

  • Density estimation using DP priors

    ✓ Our first Bayesian nonparametric model is not dominated, so we cannot apply Bayes' Theorem as in the parametric case. However, it is possible to show (Ferguson, 1973) that the DP is conjugate:

    P \mid X^{(n)} \sim DP\left( \alpha + n,\; \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{X_i} \right).

    Interpretation: after observing X^{(n)} = (X_1, \ldots, X_n), the updated guess is

    \hat{P}_X = E(P \mid X^{(n)} = x^{(n)}) = \frac{\alpha P_0 + \sum_{i=1}^{n} \delta_{x_i}}{\alpha + n} = \underbrace{\frac{\alpha}{\alpha + n} P_0}_{\text{prior guess}} + \underbrace{\frac{n}{\alpha + n} P_n}_{\text{empirical measure}},

    where P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i} and \hat{P}_X is the Bayesian estimate. It has a continuous component (if P_0 is continuous) and a discrete component determined by the data.
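    The posterior mean of the CDF is straightforward to compute and plot in R; in the sketch below α = 5, P_0 = N(0, 1) and the simulated data are illustrative choices (they mirror the illustration on the next slide):

    set.seed(7)
    alpha <- 5
    x     <- rnorm(50, mean = 2, sd = 2)      # "observed" data
    grid  <- seq(-4, 8, length.out = 400)

    F0    <- pnorm(grid, 0, 1)                # prior guess P0 = N(0, 1)
    Fn    <- ecdf(x)(grid)                    # empirical CDF
    Fpost <- alpha / (alpha + length(x)) * F0 +
             length(x) / (alpha + length(x)) * Fn   # E(P | data) on the grid

    plot(grid, Fpost, type = "l", lwd = 2, col = "red",
         xlab = "x", ylab = "E(P | X1, ..., Xn)")
    lines(grid, F0, col = "orange")           # centering P0
    lines(grid, Fn, type = "s", col = "blue") # empirical CDF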

  • Density estimation using DP prior Illustration

    [Figure: two panels, "Sample size n=8, alpha=5" and "Sample size n=50, alpha=5"; x-axis: x, y-axis: E(P | X_1 = x_1, \ldots, X_n = x_n); legend: posterior mean, 95% posterior credible bounds, empirical, true, centering P_0.]

    ✓ Two independent samples of sizes n = 8 and n = 50 generated from a normal N(2, 4) (gray dashed line, the "truth").

    ✓ In both cases α = 5, while P_0 = N(0, 1) (orange line).

    ✓ The empirical CDF (blue step function) and the posterior mean (thick red) with its 95% credible bounds (dashed red) are also shown.

  • Predictive structure of a sample from DP

    ✓ The joint distribution of a sample X_1, \ldots, X_n from a DP is

    P(X_1 \in dx_1, \ldots, X_n \in dx_n) = \int_{PX} \prod_{i=1}^{n} P(dx_i)\, DP(\alpha, P_0)(dP),

    where we want to integrate out the random probability measure P.

    ✓ By resorting to the chain rule we have

    P(X_1 \in dx_1, \ldots, X_n \in dx_n) = P(X_1 \in dx_1) \times P(X_2 \in dx_2 \mid X_1 = x_1) \times \cdots \times P(X_n \in dx_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}).

    With the Pólya urn we have already characterized P(X_{i+1} \in dx_{i+1} \mid X^{(i)}), the predictive distribution of X_{i+1} given the previous observations X_1, \ldots, X_i.

  • Predictive

    In particular,

    P(X_{n+1} \in dx_{n+1} \mid X^{(n)}) = \frac{\alpha}{\alpha + n} P_0(dx_{n+1}) + \frac{n}{\alpha + n} \underbrace{\frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}(dx_{n+1})}_{P_n},

    which is a mixture of the prior guess and the empirical measure of the observations.

    X_{n+1} is
      ∼ P_0   with probability α/(α + n)
      = X_1   with probability 1/(α + n)
      ...
      = X_n   with probability 1/(α + n)

  • Interpretation

    Let P_0 be diffuse, or non-atomic, i.e. P_0({x}) = 0 for all x ∈ X. Then

    • n/(α + n) is the probability that X_{n+1} is an "old" value, i.e. already observed in X_1, \ldots, X_n;

    • α/(α + n) is the probability that X_{n+1} is a new value, not previously observed.

    ✓ As the number of observations n increases, we have more information on the data-generating mechanism and the weight associated with the prior guess goes to zero.

    ✓ This predictive structure characterises the DP, i.e. the de Finetti measure Π of the sequence (X_n)_{n≥1} is a DP prior iff the prediction rule is a linear combination of P_0 and the empirical measure [Regazzini(1978), Lo(1991)].

  • Discreteness

    ✓ Even when the base measure P_0 of the DP is absolutely continuous [Ferguson(1973), Blackwell(1973)], realisations from the DP are almost surely discrete distributions. This can be seen from the different constructions of the DP we'll see in the next slides (through gamma processes and the stick-breaking representation).

  • Stick-breaking construction

    Idea:
    • break a unit-length stick to construct random weights:
      W_1 = V_1, leaving a piece of length 1 − V_1;
      W_2 = V_2(1 − V_1), leaving (1 − V_1)(1 − V_2);
      W_3 = V_3(1 − V_1)(1 − V_2), and so on;
    • sample locations from X;
    • attach the locations to the weights to construct a random discrete measure.

    If we assign the correct distributions to these objects, we obtain a DP.

    [Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica.]

  • Stick-breaking construction

    In particular, letting
    • V_i \overset{iid}{\sim} Beta(1, \alpha), i = 1, 2, \ldots,
    • W_i = V_i \prod_{j=1}^{i-1} (1 - V_j), i = 1, 2, \ldots,
    • \tilde{X}_i \overset{iid}{\sim} P_0,

    the random discrete measure

    P = \sum_{i \ge 1} W_i \delta_{\tilde{X}_i} \sim DP(\alpha, P_0).
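    A hedged R sketch of an (approximate) draw from the DP obtained by truncating the stick-breaking series; the truncation level N = 500 and P_0 = N(0, 1) are illustrative choices, and truncation is discussed again at the end of this section:

    rdp_stick <- function(N = 500, alpha = 2, r_p0 = function(m) rnorm(m)) {
      v <- rbeta(N, 1, alpha)
      v[N] <- 1                                  # close the stick so the weights sum to 1
      w <- v * cumprod(c(1, 1 - v[-N]))          # W_i = V_i * prod_{j<i} (1 - V_j)
      list(weights = w, locations = r_p0(N))
    }

    set.seed(8)
    P <- rdp_stick(alpha = 2)

    ## Plot the (a.s. discrete) CDF of the realisation against the centering P0
    ord <- order(P$locations)
    plot(P$locations[ord], cumsum(P$weights[ord]), type = "s",
         xlab = "x", ylab = "cdf", main = "A draw from DP(2, N(0, 1))")
    curve(pnorm(x), add = TRUE, lwd = 2, col = "blue")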

  • Support of the DP

    Realisations of the DP are discrete distributions. However, we mentioned that it is desirable for a nonparametric prior Π to have large, possibly full, support, i.e.

    supp(\Pi) \equiv PX.

    In particular, for a random probability measure,

    supp(P) = \bigcap \{ A : A \text{ closed and } P(A^c) = 0 \}.

    [Ferguson(1973), Ferguson(1974)] prove that

    supp(DP(\alpha, P_0)) = \{ P \in PX : supp(P) \subset supp(P_0) \}.

    Then

    supp(P_0) = X \;\Rightarrow\; supp(DP(\alpha, P_0)) = PX.

    Hence the DP prior has full support, including continuous distributions.

  • Distinct values and induced partition

    The variables X_1, \ldots, X_n \mid P \overset{iid}{\sim} P, where P ∼ DP, induce a random partition ρ of the data indices {1, \ldots, n}.

    Since P is a.s. discrete, we will observe only K_n ≤ n different values:
    ✓ X^*_1, \ldots, X^*_{K_n}: the unique values among X_1, \ldots, X_n;
    ✓ ρ = {C_1, \ldots, C_{K_n}}: i ∈ C_j ⇔ X_i = X^*_j, with #C_j = n_j.

    Note that we can rewrite the predictive distribution as

    P(X_{n+1} \in dx \mid X^{(n)}) = \frac{\alpha}{\alpha + n} P_0(dx) + \frac{1}{\alpha + n} \sum_{j=1}^{K_n} n_j \delta_{X^*_j}(dx),

    from which X_{n+1} = X^*_j with probability n_j/(\alpha + n).

  • Chinese restaurant process

    The law of the random partition ρ can be characterized by the so-called Chinese restaurant process:

    ✓ the first customer sits at table 1;
    ✓ given that k tables are occupied by the first n customers, customer n + 1 sits:
      • at table j = 1, \ldots, k with probability n_j/(\alpha + n), where n_j is the number of customers at table j;
      • at a new table k + 1 with probability \alpha/(\alpha + n).

    [Diagram: tables C_1, C_2, C_3, C_4, C_5, \ldots being filled by successive customers.]

    Under this metaphor: customers ⇔ observation indices, tables ⇔ clusters, colors ⇔ X^*'s.
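    A minimal R sketch of the Chinese restaurant process (α and the number of customers are illustrative choices; the function name crp is mine):

    crp <- function(n, alpha = 1) {
      table_of <- integer(n)     # table index of each customer
      table_of[1] <- 1
      sizes <- 1                 # current table sizes
      for (i in 2:n) {
        p <- c(sizes, alpha) / (alpha + i - 1)    # existing tables, then a new one
        j <- sample(length(p), 1, prob = p)
        if (j > length(sizes)) sizes <- c(sizes, 1) else sizes[j] <- sizes[j] + 1
        table_of[i] <- j
      }
      table_of
    }

    set.seed(9)
    rho <- crp(100, alpha = 2)
    table(rho)      # cluster sizes: typically a few large tables and many small ones
    max(rho)        # K_n, the number of occupied tables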

  • Ewens sampling formula

    Prior of ρ: the exchangeable partition probability function (EPPF)

    P(\rho = \{C_1, \ldots, C_{K_n}\}) = eppf(\#C_1, \ldots, \#C_{K_n}) := \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}\, \alpha^{K_n} \prod_{j=1}^{K_n} (n_j - 1)!

    ✓ From this EPPF we can obtain the probability mass function of the number of unique values K_n (Antoniak, 1974):

    P(K_n = k) = S_{n,k}\, \alpha^{k}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)},

    where S_{n,k} is an (unsigned) Stirling number of the first kind.

    ✓ Using a conditional expectation argument we find

    E(K_n) = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\left( \frac{\alpha + n}{\alpha} \right).

  • Prior number of different values

    Much is known about the behaviour of the number of distinct values K_n:
    • E[K_n] ≈ Var[K_n] ≈ α log n
    • K_n / \log n \to \alpha   a.s.
    • (K_n - E[K_n]) / \sqrt{Var(K_n)} \overset{d}{\to} N(0, 1)
    • d_{TV}(\mathcal{L}(K_n), Po(E[K_n])) = o(1/\log n).

    ✓ "Rich get richer" behaviour: the DP favors partitions with a small number of large clusters and a large number of smallish ones.

    ✓ This feature of the model is often inappropriate in applications, which has motivated many generalizations [Argiento, R., et al. (2015). Modelling the association between clusters of SNPs and disease responses. In Nonparametric Bayesian Methods in Biostatistics and Bioinformatics (R. Mitra, P. Mueller, eds.), Springer].
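    These quantities are easy to explore numerically. The R sketch below compares the exact value of E(K_n) with the logarithmic approximation, and simulates K_n using the fact, implicit in the predictive rule, that observation i is "new" with probability α/(α + i − 1), independently across i (α and n are illustrative choices):

    alpha <- 5; n <- 1000

    ## Exact expectation and the logarithmic approximation
    EKn_exact  <- sum(alpha / (alpha + (1:n) - 1))
    EKn_approx <- alpha * log((alpha + n) / alpha)
    c(EKn_exact, EKn_approx)

    ## Monte Carlo: K_n as a sum of independent "new value" indicators
    set.seed(10)
    Kn <- replicate(2000, sum(rbinom(n, 1, alpha / (alpha + (1:n) - 1))))
    c(mean(Kn), var(Kn))    # both of order alpha * log(n), as stated above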

  • Distribution of Kn

    It is clear that the distribution of the number of distinct values depends on the base measure only through the precision parameter α:

    • α small ⇒ few distinct observations, and vice versa;
    • from

    P(X_2 \in \cdot \mid X_1) = \frac{\alpha}{\alpha + 1} P_0(\cdot) + \frac{1}{\alpha + 1} \delta_{X_1}(\cdot)

    we see that
      • α → 0 implies all observations are equal,
      • α → ∞ implies all observations are different.

    [Figure: probability mass functions of K_n for α = 1, 5, 10, 20, 50, 100; x-axis: k_n, y-axis: P(K_n = k_n).]

  • DP as a normalized completely random measure (NormCRM)

    Constructive definition: a gamma completely random measure on X is defined as

    \mu(\cdot) \overset{d}{:=} \sum_{i=1}^{+\infty} J_i \delta_{\tau_i}(\cdot),

    where
    ✓ the jumps {J_i} are the points of a Poisson process on R^+ with intensity \rho(s) = \alpha s^{-1} e^{-s}, \alpha > 0;
    ✓ the support points {\tau_i} are an iid sequence from P_0;
    ✓ {J_i} and {\tau_i} are independent.

    Since \int_0^{+\infty} \min\{1, s\}\, \rho(s)\, ds < +\infty and \int_0^{\infty} \rho(s)\, ds = \infty, the total mass satisfies

    0 < T := \sum_{i=1}^{+\infty} J_i < +\infty   a.s.,

    and the normalised random measure P(\cdot) = \mu(\cdot)/T is a DP(\alpha, P_0).

  • Finite dimensional approximations

✓ Let M > 0, (w1, . . . , wM) ∼ Dir(α/M, . . . , α/M), and τ1, . . . , τM i.i.d. ∼ P0; then

    PM(·) := Σ_{h=1}^{M} wh δ_{τh}(·)  →  P(·)  in law as M → ∞,

where P ∼ DP(α, P0).

Observation. Let v1, . . . , vN−1 i.i.d. ∼ Beta(1, α) and vN = 1. If wh = vh ∏_{k=1}^{h−1} (1 − vk), then Σ_{h=1}^{N} wh = 1 a.s.

✓ If τ1, . . . , τN i.i.d. ∼ P0 then, by the stick-breaking construction,

    PN(·) := Σ_{h=1}^{N} wh δ_{τh}(·)  →  P(·)  in law as N → ∞.

Moreover, if N := Nε ∼ Poi(−α log(ε)) then dTV(P, Pε) < ε [Muliere and Tardella (1998)].

    R. Argiento October 20, 2016
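A minimal R sketch of the first (Dirichlet-multinomial) approximation, assuming a standard normal base measure P0; the Dirichlet weights are obtained by normalizing independent Gamma(α/M, 1) variables so no extra package is needed:

rPM <- function(M, alpha, rP0 = rnorm) {
  g   <- rgamma(M, shape = alpha / M, rate = 1)  # normalized independent gammas give Dir(alpha/M, ..., alpha/M)
  tau <- rP0(M)                                  # atoms drawn i.i.d. from P0
  list(w = g / sum(g), tau = tau)                # P_M = sum_h w_h delta_{tau_h}
}

set.seed(3)
PM <- rPM(M = 1000, alpha = 5)
sum(PM$w[PM$tau <= 0])    # P_M((-inf, 0]); on average equal to P0((-inf, 0]) = 0.5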


  • Simulation of trajectories

Based on the stick-breaking representation of the DP. It has found wide application for inference, especially because it lends itself easily to the simulation of the DP. Two ideas here:

• truncation: fix N large enough, simulate X1, . . . , XN and V1, . . . , VN and write

    P ≈d Σ_{i=1}^{N} Wi δ_{Xi},

which gives an approximate trajectory [Ishwaran and James (2001)]. Alternatively, fix ε > 0 and choose a random Nε ∼ Po(−α log ε), which guarantees dTV(P, Pε) < ε [Muliere and Tardella (1998)].

• stochastic truncation: simulate an exact trajectory by resorting to a stochastic truncation method; the most famous are the slice sampler [Walker (2007)], which develops a Gibbs sampler on an augmented space, and the retrospective sampler [Papaspiliopoulos and Roberts (2008)], which simulates additional weights and locations only when these are needed.

These are called conditional methods, as opposed to the so-called marginal methods, which exploit the marginal distribution of the observables for simulating trajectories and tend to be more efficient.

    R. Argiento October 20, 2016
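A minimal R sketch of the ε-truncated stick-breaking simulation with the random Poisson truncation level described above, again assuming P0 = N(0, 1); all names are illustrative:

rdp_eps <- function(alpha, eps = 1e-3, rP0 = rnorm) {
  N <- rpois(1, -alpha * log(eps)) + 1        # random truncation level (at least one atom)
  v <- c(rbeta(N - 1, 1, alpha), 1)           # stick-breaking fractions, last one set to 1
  w <- v * cumprod(c(1, 1 - v[-N]))           # w_h = v_h * prod_{k<h} (1 - v_k)
  list(w = w, atoms = rP0(N))
}

set.seed(4)
P <- rdp_eps(alpha = 2)
sum(P$w)      # the weights sum to 1 by construction
max(P$w)      # a few large weights dominate when alpha is small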

  • Mixture models

  • What is Cluster Analysis?

✓ Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

[Figure: scatterplot of eruption duration (x axis, roughly 1.5 to 5 minutes) versus waiting time (y axis, roughly 50 to 90 minutes); subsequent overlays superimpose the contours and a perspective plot of an estimated bivariate density.]

    R. Argiento October 20, 2016
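The data shown here are the built-in faithful dataset in R. A short sketch reproducing a plot of this kind, using a classical kernel density estimate for the contours purely for illustration (not the DPM fit discussed later):

library(MASS)   # for kde2d; MASS ships with standard R installations
plot(faithful$eruptions, faithful$waiting, pch = 19, col = "grey40",
     xlab = "eruptions", ylab = "waiting")
dens <- kde2d(faithful$eruptions, faithful$waiting, n = 100)
contour(dens, add = TRUE)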


  • Model-based cluster analysis

What is Cluster Analysis?
An attempt to group a collection of data objects such that objects are
- similar to one another within the same group (or cluster);
- dissimilar to the objects in the other groups (or clusters).

Model-based clustering

✓ Data come from a "random" source with several (possibly infinitely many) subpopulations.
✓ Each subpopulation is modeled separately.
✓ The distribution of the overall population is a mixture of these subpopulations.
✓ The resulting model for the data is a mixture model.

    R. Argiento October 20, 2016


  • Infinite Mixture Models

Two ingredients

✓ {f(y; θ), θ ∈ Θ}, a parametric family of densities on R^p, with Θ ⊂ R^s (the kernels);
✓ P(·) := Σ_{h=1}^{∞} wh δ_{τh}(·), a discrete probability measure on Θ (the mixing distribution).

Under a mixture model the population variable has conditional distribution:

    Y | P ∼ ∫_Θ f(y; θ) P(dθ) = Σ_{h=1}^{∞} wh f(y; τh)

Interpretation:
✓ Infinite number of possible clusters: h = 1, 2, . . .
✓ wh is the probability that an observation lies in the h-th cluster.
✓ f(·; τh) is the density of the data lying in the h-th cluster.

    R. Argiento October 20, 2016
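To make the interpretation concrete, a small self-contained R sketch that draws one random mixing distribution by (truncated) stick breaking and evaluates the corresponding mixture density with Gaussian kernels of fixed scale; the truncation level and hyperparameters are illustrative assumptions:

set.seed(5)
H     <- 50                                    # truncation level for the infinite sum
alpha <- 2
v     <- c(rbeta(H - 1, 1, alpha), 1)
w     <- v * cumprod(c(1, 1 - v[-H]))          # stick-breaking weights w_h
mu    <- rnorm(H, 0, 3)                        # atoms tau_h drawn from P0 = N(0, 3^2)

mix_density <- function(y)                     # sum_h w_h N(y; mu_h, 0.5^2)
  colSums(w * sapply(y, function(yy) dnorm(yy, mean = mu, sd = 0.5)))

ygrid <- seq(-10, 10, length.out = 400)
plot(ygrid, mix_density(ygrid), type = "l", xlab = "y", ylab = "density")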


  • Bayesian nonparametric density estimation

✓ Let Y1, Y2, . . . be an i.i.d. sample with unknown density f. A Dirichlet process mixture (DPM) prior on f posits that

    Yi | P ∼ f(yi | P) = ∫_Θ f(yi; θ) P(dθ)
    P ∼ DP(α, P0)

where f(yi; θ) is a parametric density (often referred to as the kernel of the mixture), indexed by a finite-dimensional parameter θ ∈ Θ ⊂ R^s.

✓ Exploiting the stick-breaking construction of the Dirichlet process we can write

    Yi | (wh, τh)_{h=1}^{∞} ∼ Σ_{h=1}^{∞} wh f(yi | τh)  =: f(yi | P)

where τh i.i.d. ∼ P0 and wh = vh ∏_{k=1}^{h−1} (1 − vk), with vh i.i.d. ∼ Beta(1, α).
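For intuition on the model as a data-generating mechanism, a small R sketch that simulates a synthetic sample from a (truncated) DPM of normals; the truncation level, base measure and kernel scale are illustrative assumptions:

set.seed(6)
n <- 200; alpha <- 1; H <- 100
v   <- c(rbeta(H - 1, 1, alpha), 1)
w   <- v * cumprod(c(1, 1 - v[-H]))            # stick-breaking weights
tau <- rnorm(H, 0, 3)                          # atoms from P0 = N(0, 3^2)
z   <- sample(H, n, replace = TRUE, prob = w)  # latent mixture component of each observation
y   <- rnorm(n, mean = tau[z], sd = 0.5)       # Yi drawn from the selected Gaussian kernel
table(z)                                       # sizes of the occupied components
hist(y, breaks = 30, main = "")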


  • Support

✓ From a density estimation point of view, working with an infinite number of components is particularly appealing because it ensures that, for appropriate choices of the kernel f(y; θ), the DPM model has support on a large class of distributions.

✓ For example, [Lo (1984)] showed that a DP location-scale mixture of normals,

    Yi | P ∼ ∫ N(yi; μ, σ²) P(dμ, dσ²),   P ∼ DP(α, P0),

has full support on the space of absolutely continuous distributions.

    R. Argiento October 20, 2016

  • Hierarchical model

An alternative representation of the Dirichlet process mixture model (DPM) introduces latent random effects6 θi to replace the mixture by a hierarchical model:

    Y1, . . . , Yn | θ1, . . . , θn  ind. ∼  f(yi; θi)
    θ1, . . . , θn | P  i.i.d. ∼  P
    P ∼ DP(α, P0)

Note: I slightly changed my notation: here θ1, . . . , θn is a sample from the Dirichlet process, while in the previous classes I denoted this sample by X1, . . . , Xn. So all the considerations we made on the X's now hold true for the variables θ's!

6 Argiento, R., et al. (2014). Estimation, prediction and interpretation of NGG random effects models: an application to Kevlar fibre failure times. Statistical Papers.

    R. Argiento October 20, 2016

  • Clustering

✓ Given (θ1, . . . , θn), Yi and Yj belong to the same cluster iff

    θi = θj,  and we write Yi ↔ Yj.

✓ θ∗ = (θ∗1, . . . , θ∗Kn) are the unique values among the θi's.
✓ ρ = {C1, . . . , CKn} is the clustering induced on the data indices by the DP sample (θ1, . . . , θn).

The prior on (θ1, . . . , θn) is equivalent to the prior on (ρ, θ∗), i.e.

    P(θ1 ∈ dθ1, . . . , θn ∈ dθn)  ⇔  P(ρ = {C1, . . . , CKn}) ∏_{j=1}^{Kn} P0(dθ∗j)

    R. Argiento October 20, 2016


  • Nonparametric Bayesian approach to clustering

Conditional likelihood

    Y1, . . . , Yn | C1, . . . , Ck, θ∗1, . . . , θ∗k ∼ ∏_{j=1}^{k} { ∏_{i∈Cj} f(yi; θ∗j) }     (1)

- ρ := {C1, . . . , Ck} is a partition of the data index set {1, . . . , n};
- {f(·; θ∗), θ∗ ∈ Θ} is a family of densities on the sample space X.

Prior specification:

    ∏_{j=1}^{k} P0(dθ∗j) π(ρ)     (2)

    π(ρ) = P(ρ = {C1, . . . , Ck}) = eppf(#C1, . . . , #Ck)

- The infinite exchangeable partition probability function under the DPM model is

    eppf(n1, . . . , nk) = Γ(α)/Γ(α + n) α^k ∏_{i=1}^{k} (ni − 1)!

    R. Argiento October 20, 2016
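The EPPF above is easy to evaluate directly; a small R sketch computing the log prior probability of a partition from its cluster sizes (the function name is illustrative):

log_eppf_dp <- function(sizes, alpha) {
  n <- sum(sizes); k <- length(sizes)
  # sum(lgamma(sizes)) = sum of log (n_i - 1)!
  lgamma(alpha) - lgamma(alpha + n) + k * log(alpha) + sum(lgamma(sizes))
}
# prior probability of a partition of n = 8 items into clusters of sizes 3, 2, 1, 2
exp(log_eppf_dp(c(3, 2, 1, 2), alpha = 1))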

  • Two points of view

Hierarchical model

    Y1, . . . , Yn | θ1, . . . , θn  ind. ∼  f(yi; θi)
    θ1, . . . , θn | P  i.i.d. ∼  P
    P ∼ DP(α, P0)

✓ The density f(·; P) of the population variable Y is random.
✓ The law of this random density is assigned by a mixture model:

    f(y; P) = ∫_Θ f(y; θ) P(dθ)

Targets:

• Density estimation: L(f(y; P) | Y1, . . . , Yn)
• Cluster analysis: L(ρ | Y1, . . . , Yn)

    R. Argiento October 20, 2016

  • Computation under DPM model

  • A computational problem: existing approaches

Critical issue: the infinite-dimensional parameter P = Σ_{i=1}^{∞} wi δ_{τi}

Marginal Gibbs sampler algorithms [Escobar, 1988] [Neal, 2000]
✓ Integrate out P and resort to generalized Pólya urn schemes.
✓ Inference is limited to point estimates: the predictive f_{Xn+1}(· | X1, . . . , Xn).

Conditional methods
✓ Use some tricks to build a Gibbs sampler whose state space encompasses P.
✓ Full Bayesian posterior analysis.

For instance:
✓ Slice sampler [Walker, 2007] [Griffin, 2013]
✓ Retrospective methods [Papaspiliopoulos et al., 2008]
✓ Truncation of the infinite sum defining the r.p.m. P, either a priori or a posteriori [Muliere & Tardella (1998)] [Argiento et al., 2015a]

As in [Ishwaran & James (2001)] or [Argiento et al., 2010], build a finite-dimensional approximation of the random probability measure

    P^(N) = Σ_{i=1}^{N} wi δ_{τi}

    R. Argiento October 20, 2016


  • MCMC integration

✓ Let Y = (Y1, . . . , Yn) be our dataset.
Target: the predictive density of the data, i.e. for each y in the sample space (P_Y below denotes the space of probability measures on the sample space):

    fY(y) dy = L(Yn+1 | Y) = P(Yn+1 ∈ dy | Y) = ∫_{P_Y} P(Yn+1 ∈ dy, P ∈ dP | Y)
             = ∫_{P_Y} P(Yn+1 ∈ dy | P) P(P ∈ dP | Y)      (Yn+1 is independent of Y given P)
             = ∫_{P_Y} f(y; P) dy P(P ∈ dP | Y)
             = E( f(y; P) dy | Y )

So, if P^(1), P^(2), . . . , P^(G) is an MCMC sample from L(P | Y) = P(P ∈ dP | Y), then

    f̂Y(y) = (1/G) Σ_{g=1}^{G} f(y; P^(g))

    R. Argiento October 20, 2016
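A sketch of this ergodic average in R, assuming each MCMC draw is stored as a (truncated) vector of weights and Gaussian kernel locations; the list structure, the kernel, and the toy "posterior" draws used below are illustrative assumptions:

predictive_density <- function(ygrid, draws, kernel_sd = 0.5) {
  est <- numeric(length(ygrid))
  for (d in draws)                                 # average f(y; P^(g)) over the sampled trajectories
    est <- est + colSums(d$w * sapply(ygrid, function(y) dnorm(y, mean = d$tau, sd = kernel_sd)))
  est / length(draws)
}

# toy usage: prior stick-breaking draws stand in for posterior MCMC output
set.seed(7)
draws <- replicate(100, {
  v <- c(rbeta(29, 1, 1), 1)
  list(w = v * cumprod(c(1, 1 - v[-30])), tau = rnorm(30, 0, 2))
}, simplify = FALSE)
ygrid <- seq(-6, 6, length.out = 200)
plot(ygrid, predictive_density(ygrid, draws), type = "l", xlab = "y", ylab = "density")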


  • Full Bayesian analysis

Note that by the stick-breaking construction P^(g) ≡ {(wh^(g)), (τh^(g))}, so

    f(y; P^(g)) = Σ_{h=1}^{∞} wh^(g) f(y; τh^(g))

✓ If we are able to sample P^(1), P^(2), . . . , P^(G) from L(P | Y), plugging this sample into the formula above, we can use it as a proxy to study the posterior law of the unknown density of the data Y, that is

    L(f(Yn+1; P) | Y1, . . . , Yn)

✓ However, in many applications it is enough to study linear functionals of the stochastic process f(Yn+1; P), like the predictive distribution fY(y) dy we computed in the previous slide, i.e.

    fY(y) = E(f(y; P) | Y1, . . . , Yn) = ∫_{P_Y} f(y | P) P(P ∈ dP | Y).

Marginalization: we analytically compute the integral with respect to L(P | Y1, . . . , Yn) in the expectation above.

    R. Argiento October 20, 2016


  • Marginalization

Target: the predictive distribution of the data, i.e.

    fY(y) = ∫_{P_Y} f(y | P) P(P ∈ dP | Y)

          = ∫_{P_Y} ∫_Θ f(y; θ) P(dθ) P(P ∈ dP | Y)

          = ∫_{P_Y} ∫_{Θ^n} [ ∫_Θ f(y; θ) P(dθ) ] P(P ∈ dP, θ ∈ dθ | Y)

          = ∫_{P_Y} ∫_{Θ^n} [ ∫_Θ f(y; θ) P(dθ) ] P(P ∈ dP | θ, Y) P(θ ∈ dθ | Y)
            (P and Y are conditionally independent given θ, so P(P ∈ dP | θ, Y) = P(P ∈ dP | θ))

          = ∫_{Θ^n} ∫_Θ f(y; θ) [ ∫_{P_Y} P(dθ) P(P ∈ dP | θ) ] P(θ ∈ dθ | Y)

          = ∫_{Θ^n} ∫_Θ f(y; θ) [ α/(α + n) P0(dθ) + Σ_{j=1}^{Kn} nj/(α + n) δ_{θ∗j}(dθ) ] P(θ ∈ dθ | Y),

where the term in square brackets is ∫_{P_Y} P(dθ) P(P ∈ dP | θ), i.e. the Pólya urn predictive distribution of the DP.

    R. Argiento October 20, 2016


  • Marginalization

Finally,

    fY(y) = α/(α + n) ∫_Θ f(y; θ) P0(dθ) + ∫_{Θ^n} { Σ_{j=1}^{Kn} nj/(α + n) f(y; θ∗j) } P(dθ | Y)

Let θ^(1), θ^(2), . . . , θ^(G) be a Markov chain sample from P(dθ | Y); then

    f̂Y(y) = α/(α + n) ∫_Θ f(y; θ) P0(dθ) + (1/G) Σ_{g=1}^{G} Σ_{j=1}^{Kn^(g)} nj^(g)/(α + n) f(y; θ∗j^(g))

    R. Argiento October 20, 2016
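A sketch of this estimator in R for a Gaussian kernel with known scale and a conjugate Gaussian base measure, so that the integral ∫ f(y; θ) P0(dθ) is available in closed form; theta_draws stands for the Gibbs output, and all names and hyperparameters are illustrative assumptions:

sigma <- 0.5; m0 <- 0; s0 <- 3                 # kernel sd and base measure N(m0, s0^2)

marginal_predictive <- function(ygrid, theta_draws, alpha) {
  n <- length(theta_draws[[1]]); G <- length(theta_draws)
  # prior part: int N(y; theta, sigma^2) N(theta; m0, s0^2) dtheta = N(y; m0, sigma^2 + s0^2)
  prior_part <- alpha / (alpha + n) * dnorm(ygrid, m0, sqrt(sigma^2 + s0^2))
  post_part <- numeric(length(ygrid))
  for (th in theta_draws) {
    tab <- table(th)                           # unique values theta*_j and cluster sizes n_j
    theta_star <- as.numeric(names(tab)); nj <- as.numeric(tab)
    post_part <- post_part +
      colSums(nj / (alpha + n) * sapply(ygrid, function(y) dnorm(y, theta_star, sigma)))
  }
  prior_part + post_part / G
}

# toy usage: fake Gibbs output with three well separated clusters
theta_draws <- replicate(50, sample(c(-2, 0, 2), 100, replace = TRUE), simplify = FALSE)
ygrid <- seq(-5, 5, length.out = 200)
plot(ygrid, marginal_predictive(ygrid, theta_draws, alpha = 1), type = "l", xlab = "y", ylab = "density")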


  • Pólya Urn Gibbs sampler

Target: sample θ = (θ1, . . . , θn) from P(θ | Y) = P(θ1 ∈ dθ1, . . . , θn ∈ dθn | Y).

Idea: use a Gibbs sampler, that is, sequentially draw values of θi from

    P(θi ∈ dθi | θ−i, Y)   for all i = 1, . . . , n,

where θ−i = (θ1, . . . , θi−1, θi+1, . . . , θn).

✓ We describe this algorithm under the assumption that f(y; θ) and P0(dθ) are conjugate. For extensions to the non-conjugate case see Algorithm 8 in [Neal, 2000].

    R. Argiento October 20, 2016

  • Notation

✓ Associated parametric model
• The one-observation parametric Bayesian model associated with the DPM is

    Y | θ ∼ f(y; θ)
    θ ∼ P0

  Its posterior is denoted by π̃(θ | y) and its marginal by m(y) = ∫ f(y; θ) P0(dθ).

• Let C be a subset of the indices {1, . . . , n}; the associated parametric model on the subset C is

    (Yi)_{i∈C} | θ∗  i.i.d. ∼  f(yi; θ∗)
    θ∗ ∼ P0

  whose posterior is proportional to ∏_{i∈C} f(yi; θ∗) P0(dθ∗).

    R. Argiento October 20, 2016
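In the Gaussian-Gaussian conjugate case (N(θ, σ²) kernel with known σ and P0 = N(m0, s0²), an assumption made here only for illustration) both quantities on this slide have closed forms; a compact R sketch:

sigma <- 0.5; m0 <- 0; s0 <- 3                 # kernel sd and base measure N(m0, s0^2)

# marginal of one observation: m(y) = int N(y; theta, sigma^2) N(theta; m0, s0^2) dtheta = N(y; m0, sigma^2 + s0^2)
m_marg <- function(y) dnorm(y, m0, sqrt(sigma^2 + s0^2))

# posterior of theta* given the observations in a cluster C (standard normal-normal update)
post_params <- function(yC) {
  prec <- 1 / s0^2 + length(yC) / sigma^2
  list(mean = (m0 / s0^2 + sum(yC) / sigma^2) / prec, sd = sqrt(1 / prec))
}

m_marg(1.0)
post_params(c(1.2, 0.8, 1.1))                  # posterior mean and sd of theta* for a cluster of three points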

  • Pólya Urn Gibbs sampler

We recall again that, a priori,

    P(θn ∈ dθ | θ1, . . . , θn−1) ∝ α P0(dθ) + Σ_{j=1}^{Kn−1} n_{n−1,j} δ_{θ∗j}(dθ)

✓ Since the sequence of latent observations θ1, . . . , θn is exchangeable, this expression gives us the form of the full conditional prior distribution for any θi given θ−i = (θ1, . . . , θi−1, θi+1, . . . , θn).

✓ Multiplying by the likelihood f(yi; θi) we find the full conditional posterior distribution for θi:

    P(θi ∈ dθi | θ−i, Y) ∝ α f(yi; θi) P0(dθi) + Σ_{j=1}^{Kn^−} nj^− f(yi; θj^{−∗}) δ_{θj^{−∗}}(dθi)

where the super