
  • Bayesian Nonparametric Modeling and Data Analysis: An Introduction (Draft)

    Raffaele Argiento
    CNR-IMATI, National Research Council - Milano (Italy)

    October 20, 2016

  • Aims and Prerequisites

    Aims: This course offers a theoretical and practical introduction to Bayesian nonparametric statistical procedures, a rapidly developing area of statistics. Key themes:

    • Exchangeability and de Finetti's Theorem.
    • Dirichlet process.
    • Dirichlet process mixture models.
    • Computation under Dirichlet process mixture models (marginal and conditional algorithms).

    Prerequisites: I will assume you know:

    • Basic probability theory;
    • Bayesian parametric modelling;
    • The R software and OpenBUGS or WinBUGS.

  • Reading list

    General references:

    • Regazzini, E. (1996). Impostazione nonparametrica di problemi d'inferenza bayesiana. IMATI Tech. Report 96-21, http://web.mi.imati.cnr.it/iami/abstracts/96-21.html.
    • Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer, New York.
    • Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G., eds. (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge Univ. Press.
    • Müller, P. and Rodriguez, A. (2013). Nonparametric Bayesian Inference. NSF-CBMS Regional Conference Series in Probability and Statistics 9, Institute of Mathematical Statistics.
    • Müller, P., Quintana, F.A., Jara, A. and Hanson, T. (2015). Bayesian Nonparametric Data Analysis. Springer.
    • Ghosal, S. and van der Vaart, A. (2016). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.

  • Terminology

    Parametric model: the number of parameters is fixed (or bounded by a constant) w.r.t. the sample size.

    Nonparametric model:
    • the number of parameters grows with the sample size;
    • ∞-dimensional parameter space.

    Example: in density estimation the parameter is the ∞-dimensional object f_Y, the density of the observations.

  • Nonparametric Bayesian Model

    Definition
    A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

    Interpretation
    Parameter space Θ = {set of possible parameters}, for example:

    Problem              Θ
    Density estimation   Probability distributions
    Regression           Smooth functions
    Clustering           Partitions

    ✓ The target of the Bayesian statistician is the posterior distribution on the space of all parameters.

  • Parametric Bayesian Modelling

  • Independence

    ✓ Classical statistical inference is based on the assumption of independence:

    P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i).

    This assumption:
    • is convenient from a mathematical point of view, in view of the factorisation;
    • implies that the information in one observation does not provide any information on the subsequent ones, that is

    P(X_{n+1} \in A \mid X^{(n)}) = P(X_{n+1} \in A),   where X^{(n)} = (X_1, \ldots, X_n).

  • Independence

    This is hardly justified in practice:
    ✓ independence among observations is a strong assumption, difficult to verify;
    ✓ collecting observations of a quantity of interest must tell me something about what I am going to observe next; this information should be incorporated into my model and used for updating my knowledge of the phenomenon.

    "I am trying to learn about something and have some current knowledge. My current knowledge is encapsulated in a small model. I learn through further observations that this small model is wrong or misplaced. I must change it, whether my foundations for the inference I am undertaking allow me to do this or not. The current knowledge is being altered through further observations, and then revised from these observations." [Walker, S.G. Bayesian nonparametrics. In Bayesian Theory and Applications, pp. 249-270. Oxford University Press.]

  • Exchangeability

    Exchangeability:
    • assumes homogeneity/symmetry among the elements of the data sequence;
    • does not assume the events physically influence one another;
    • the order in which r.v.s are observed is irrelevant for inference;
    • among the weakest forms of dependence (e.g., Markovianity implies a natural order), a minimal assumption of symmetry;
    • the implied mathematical framework remains analytically tractable, thanks to de Finetti's Theorem.

    Definition
    A sequence (X_n)_{n \ge 1} is said to be exchangeable if

    (X_1, \ldots, X_n) \overset{d}{=} (X_{\pi(1)}, \ldots, X_{\pi(n)})

    for all n ≥ 1 and all permutations π of (1, \ldots, n).

    ✓ Interpretation: the order of appearance of the observations does not matter in terms of their joint distribution.

  • Pólya urn process

    How can we sample an exchangeable sequence?

    ✓ Consider an urn with B_0 black and W_0 white balls. The sequence of observations X_1, X_2, \ldots is sampled by the following procedure. Set n = 1:

    1. Draw a ball at random from the urn and note its color;
    2. If the ball is black then X_n = 1, otherwise X_n = 0;
    3. Place the ball plus 1 extra ball of the observed color in the urn;
    4. Set n = n + 1 and go to step 1.

    It is not difficult to realize that

    P(X_1 = x_1, \ldots, X_n = x_n) = \frac{\prod_{s=0}^{S_n-1}(B_0 + s)\;\prod_{s=0}^{n-S_n-1}(W_0 + s)}{\prod_{s=0}^{n-1}(W_0 + B_0 + s)},   where S_n = \sum_{i=1}^{n} x_i.

    The sequence of observed colors is exchangeable!
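    A minimal R sketch of this urn scheme (the values B0 = 2, W0 = 3 and the sample size are illustrative choices, and the function names are mine): the joint probability of a 0/1 sequence depends on it only through S_n, which is exactly the exchangeability property.

    ## Binary Polya urn: sample a sequence and evaluate its joint probability
    polya_urn <- function(n, B0 = 2, W0 = 3) {
      x <- integer(n)
      B <- B0; W <- W0
      for (i in 1:n) {
        x[i] <- rbinom(1, 1, B / (B + W))          # draw a ball: 1 = black, 0 = white
        if (x[i] == 1) B <- B + 1 else W <- W + 1  # reinforcement
      }
      x
    }

    ## Joint probability from the product formula: it depends on x only through S_n
    polya_prob <- function(x, B0 = 2, W0 = 3) {
      n <- length(x); Sn <- sum(x)
      num <- prod(B0 + seq_len(Sn) - 1) * prod(W0 + seq_len(n - Sn) - 1)
      num / prod(B0 + W0 + seq_len(n) - 1)
    }

    set.seed(1)
    x <- polya_urn(10)
    polya_prob(x)           # same value...
    polya_prob(sample(x))   # ...for any permutation of x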

  • de Finetti Theorem

    de Finetti's representation theorem for binary sequences [de Finetti (1933a). Classi di numeri aleatori equivalenti. Rendiconti della R. Accademia Nazionale dei Lincei 18, 107-110] formalises the relationship between exchangeable and iid sequences.

    Theorem (de Finetti's representation theorem for binary sequences)
    A sequence (X_n)_{n \ge 1} taking values in {0, 1} is exchangeable if and only if there exists a probability measure π on [0, 1] such that

    P(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 \theta^{k} (1 - \theta)^{n-k} \, \pi(d\theta),

    where k = \sum_{i=1}^{n} x_i is the number of successes. Moreover,

    \frac{1}{n} \sum_{i=1}^{n} X_i \to \theta   a.s.,   with \theta \sim \pi.

  • Comments

    • Given the value of θ, we have

    P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \theta^{k} (1 - \theta)^{n-k},

    that is, given θ, the observations are conditionally independent with Bern(θ) distribution. Hence they are conditionally iid.

    • The parameter θ (the Bernoulli success probability) is taken to be a r.v., instead of an unknown constant, with distribution π.

    • π is called the de Finetti measure and is the prior distribution, i.e., a distribution on the parameter space (here [0, 1]) that represents the initial opinion on the parameter before observing data.

    • The integral representation for the joint distribution of the sequence tells us that a binary exchangeable sequence is a mixture of iid Bernoulli sequences:

    exchangeability ⇔ mixture of iid

  • Hierarchical modelling

    Bayesian Model

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} f(y; \theta)   (sampling model)
    \theta \sim \pi(\theta)   (prior)

    The conditional independence can be stated as follows:
    ✓ there is a state of the world θ which is unknown; here θ is a random variable;
    ✓ given θ, the events are iid;
    ✓ without knowing θ, they are not independent, only exchangeable.

    Two equivalent ways to set a Bayesian model:

    1. predictive approach: choose an (infinite) exchangeable model for the observations;
    2. hierarchical approach: choose a conditional sampling model and a prior for its parameter.

  • A predictive approach

    ✓ Consider the sequence of observations (X_n)_{n \ge 1} sampled via the Pólya urn scheme.

    ✓ We already observed that (X_n)_{n \ge 1} is exchangeable. Can we find its de Finetti measure?

    P(X_1 = x_1, \ldots, X_n = x_n)
      = \frac{\prod_{s=0}^{S_n-1}(B_0 + s)\;\prod_{s=0}^{n-S_n-1}(W_0 + s)}{\prod_{s=0}^{n-1}(W_0 + B_0 + s)}
      = \frac{\Gamma(B_0 + W_0)}{\Gamma(B_0)\Gamma(W_0)} \cdot \frac{\Gamma(B_0 + S_n)\,\Gamma(W_0 + n - S_n)}{\Gamma(W_0 + B_0 + n)}

    (in fact \prod_{j=0}^{k-1}(a + j) = \Gamma(a + k)/\Gamma(a) for each a and k)

      = \int_0^1 \theta^{S_n} (1 - \theta)^{n - S_n} \left\{ \frac{\Gamma(W_0 + B_0)}{\Gamma(B_0)\Gamma(W_0)}\, \theta^{B_0 - 1} (1 - \theta)^{W_0 - 1} \right\} d\theta
      = \int_0^1 \theta^{S_n} (1 - \theta)^{n - S_n}\, \pi(\theta)\, d\theta,

    where π(θ) is the density of a Beta(B_0, W_0): the de Finetti measure of the sequence (X_n)_{n \ge 1}.
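    This can also be checked by simulation: the long-run fraction of black draws in a single Pólya urn run is approximately a draw from the de Finetti measure Beta(B_0, W_0). A hedged R sketch (B0 = 2, W0 = 3, the run length and the number of runs are illustrative choices):

    set.seed(2)
    B0 <- 2; W0 <- 3
    n_draws <- 500      # length of each urn run
    n_runs  <- 2000     # number of independent runs

    run_frequency <- function() {
      B <- B0; W <- W0; S <- 0
      for (i in 1:n_draws) {
        x <- rbinom(1, 1, B / (B + W))
        S <- S + x
        if (x == 1) B <- B + 1 else W <- W + 1
      }
      S / n_draws       # fraction of black draws in this run
    }

    freqs <- replicate(n_runs, run_frequency())

    ## Compare the limiting frequencies with the Beta(B0, W0) density
    hist(freqs, breaks = 40, freq = FALSE, main = "de Finetti measure check")
    curve(dbeta(x, B0, W0), add = TRUE, lwd = 2)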

  • Prior choice

    The elicitation (e.g., how to choose B_0 and W_0) or choice of the prior π is the subject of a long-lasting and still unresolved debate:

    • some believe incorporating the researcher's knowledge into the prior is at the very essence of the Bayesian approach: subjective approach;

    • some believe we should incorporate as little knowledge as possible, in order to limit the effect of the prior opinion and let the data swamp the prior: non-informative priors or objective Bayes;

    • this debate is somewhat restricted to low-dimensional, parametric approaches, where one has at least a hope of being able, if willing, to specify a prior which encapsulates some desired knowledge;

    • in high-dimensional or nonparametric approaches, it is usually very difficult even to understand what the implications of a prior choice on the model are, so the choice is usually dictated by mathematical convenience or guided, among plausible alternatives, by considerations about some specific aspects of the choice.

  • Bayes’ Theorem for dominated models

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} f(y; \theta)   (sampling model)
    \theta \sim \pi(\theta)   (prior)

    Interpretation: \theta \mapsto \prod_{i=1}^{n} f(x_i; \theta) is the likelihood, π(dθ) the prior.

    Then the posterior distribution of θ, given X_1 = x_1, \ldots, X_n = x_n, can be computed by Bayes' Theorem:

    P(\theta \in B \mid X_1 = x_1, \ldots, X_n = x_n) \overset{a.s.}{=} \frac{\int_B \prod_{i=1}^{n} f(x_i; \theta)\, \pi(d\theta)}{\int_\Theta \prod_{i=1}^{n} f(x_i; \theta)\, \pi(d\theta)},   B \in \mathcal{B}(\Theta).

    Proof: definition of conditional distribution (as the solution of an integral equation) + Radon-Nikodym Theorem.

  • Our simple example

    Let

    X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} Bern(y; \theta)
    \theta \sim Beta(\theta; \alpha, \beta),

    with \alpha = B_0 > 0, \beta = W_0 > 0. Then by Bayes' theorem

    \pi(\theta \mid X^{(n)}) \propto \pi(\theta) \prod_{i=1}^{n} f(X_i; \theta)   (keeping only the θ terms)
      = \theta^{\alpha + \sum_i X_i - 1} (1 - \theta)^{\beta + n - \sum_i X_i - 1},

    so, normalising, we get \pi(\theta \mid X^{(n)}) = Beta(\alpha + \sum_i X_i,\; \beta + n - \sum_i X_i). The posterior expected value of θ is

    E(\theta \mid X^{(n)}) = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{E(\theta)\,=\,\text{prior mean}} + \frac{n}{\alpha + \beta + n} \cdot \underbrace{\frac{1}{n}\sum_{i=1}^{n} X_i}_{\text{sample mean}}.
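    A short R sketch of this conjugate update (the prior parameters, the true θ and the sample size are illustrative choices):

    set.seed(3)
    alpha <- 2; beta <- 3          # prior Beta(alpha, beta), i.e. B0 and W0
    n <- 50
    x <- rbinom(n, 1, 0.7)         # simulated 0/1 data

    ## Posterior parameters
    alpha_post <- alpha + sum(x)
    beta_post  <- beta + n - sum(x)

    ## Posterior mean as a weighted average of prior mean and sample mean
    w <- (alpha + beta) / (alpha + beta + n)
    post_mean <- w * alpha / (alpha + beta) + (1 - w) * mean(x)
    all.equal(post_mean, alpha_post / (alpha_post + beta_post))   # TRUE

    ## 95% posterior credible interval for theta
    qbeta(c(0.025, 0.975), alpha_post, beta_post)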

  • Conjugacy

    The Beta example exemplifies the notion of conjugacy.

    Definition (Conjugacy)
    Let {f(·; θ), θ ∈ Θ}, Θ ⊂ R^d, be a family of distributions. We say that the family of prior distributions {π(θ | γ), γ ∈ R^K}, where γ is a vector of parameters for π, is conjugate to the model f if, given X_i \overset{iid}{\sim} f(·; θ), the posterior distribution can be written as

    \pi(\theta \mid \gamma'(X^{(n)})),

    i.e. the posterior has the same analytical structure as the prior, with updated parameters.

    ✓ In the previous example we had

    \gamma = (\alpha, \beta),   \gamma'(X^{(n)}) = \left( \alpha + \sum_{i=1}^{n} X_i,\; \beta + n - \sum_{i=1}^{n} X_i \right).

  • Comments

    In other words, the family of prior distributions is closed under the operation of Bayesian updating based on the collected data. Distributions in the exponential family admit conjugate priors. Typical examples:

    • Beta-Binomial model (Beta prior on the success probability p)
    • Gamma-Poisson (Gamma prior on the Poisson rate)
    • Normal-Normal (Normal prior on the Normal mean)
    • InverseGamma-Normal (IG prior on the Normal variance)

    and some of these generalise to the multivariate case.

  • Bayesian Nonparametric Modelling

  • Tiny Bit of Probability Notation

    • X is a separable and complete metric space (think of R^p);
    • 𝒳 is the Borel σ-algebra of subsets of X;
    • (X_n)_{n≥1} is a sequence of random elements defined on some probability space (Ω, F, P) and taking values in (X^∞, 𝒳^∞);
      (X_n)_{n≥1} is the sequence of observations (DATA); X_n is the result of the random experiment at trial n;
    • PX is the space of all probability measures on (X, 𝒳), with the topology of weak convergence;
    • 𝒫(X) is the Borel σ-algebra of subsets of PX;
    • a random element P defined on (Ω, F, P) and taking values in (PX, 𝒫(X)) is a random probability measure.

  • de Finetti's representation theorem (general case)

    Theorem (de Finetti's representation theorem for general sequences)
    The sequence X^{(∞)} = (X_n)_{n≥1} is exchangeable if and only if there exists a probability measure Π on (PX, 𝒫(X)) such that

    P[X_1 \in A_1, \ldots, X_n \in A_n] = \int_{PX} \prod_{i=1}^{n} P(A_i)\, \Pi(dP)

    for any n ≥ 1 and A_1, \ldots, A_n in 𝒳, where the probability Π is uniquely determined.

    Equivalently, (X_n)_{n≥1} is exchangeable if and only if there exists a random probability measure P on (X, 𝒳) such that P ∼ Π and

    P[X_1 \in A_1, \ldots, X_n \in A_n \mid P] = \prod_{i=1}^{n} P(A_i)

    for any n ≥ 1 and A_1, \ldots, A_n in 𝒳.

    [Hewitt, E. and Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 470-501. Schervish, M.J. (1995). Theory of Statistics. Springer-Verlag, New York.]

  • de Finetti’s representation theorem (general case) (1933) - continued

    Π is a probability measure on PX −→ the de Finetti measure of (X_n)_{n≥1}.

    ✓ If (X_n)_{n≥1} is exchangeable, then its empirical distribution is such that

    \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i} \Rightarrow P   a.s.-P,

    where ⇒ denotes weak convergence.

    Hierarchical representation: (X_n)_{n≥1} exchangeable is equivalent to

    X_i \mid P \overset{iid}{\sim} P
    P \sim \Pi,   Π = prior distribution.

    The Bayesian nonparametric framework is equivalent to exchangeability of (X_n)_{n≥1}.

  • Parametric case through the representation theorem

    Parametric model: Π degenerate on a finite-dimensional subset P*_X of PX, that is

    \Pi(P^*_X) = \Pi(\{P \in PX : P = P_\theta,\ \theta \in \Theta\}) = 1,

    and there exists a bijective function

    g : P^*_X \to \Theta.

    Θ ⊂ R^p is called the parameter space. The prior Π induces a probability on Θ:

    \pi(B) = \Pi(g^{-1}(B)),   B \in \mathcal{B}(\Theta).

    In these cases:

    X_i \mid \theta \overset{iid}{\sim} P_\theta(dx)
    \theta \sim \pi   (prior distribution)

    For instance:

    \Pi(\{P \in PX : P(dx) = (1/\sigma)\,\varphi((x - \mu)/\sigma)\, dx,\ (\mu, \sigma) \in R \times R^+\}) = 1,

    with ϕ the density function of a N(0, 1) distribution.

  • Parametric vs Nonparametric case

    When can we assume Π(P*_X) = 1, where P*_X is finite-dimensional? In other words, when can we assume the model is parametric?

    • if, from past experience in cases similar to the one analysed, we believe that the parametric family approximates the "true" distribution well;

    • if, in addition to exchangeability, we assume further conditions for the sequence of observations. For example, if (X_n)_{n≥1} is also spherically symmetric (L((X_1, \ldots, X_n)^T) = L(A(X_1, \ldots, X_n)^T) for any orthogonal matrix A), then P*_X is the family of Gaussian distributions with mean 0.

    Otherwise: nonparametric model
    −→ greater flexibility when Π has large support, possibly supp(Π) = PX.

  • Why a nonparametric approach?

    There are two main (interrelated) ideas behind a nonparametric prior Π.

    • Flexibility: a nonparametric prior Π is flexible in the sense that it does not place particular constraints, e.g., shape constraints. Parametric models such as the Poisson, Geometric, Gaussian, Gamma and so on impose unimodality.

    • Full support:

    Definition (Full support)
    We say that a prior distribution Π on PX has full support if

    supp(\Pi) \equiv PX,

    i.e., if the smallest closed set over which Π puts probability one is the set of all distributions whose support is contained in X.

    ✓ A prior with full support puts positive mass around any distribution supported on X or its subsets. See [Ferguson(1974)] for more on this point.

  • Bayes Theorem

    Bayes' Theorem in the usual form does not necessarily apply in the nonparametric framework. In fact, many nonparametric models are not dominated, e.g., if realisations are discrete distributions with random locations. In these cases the determination of the posterior distribution relies on

    • ad hoc analytical strategies (e.g. for the Dirichlet process, through projections);
    • MCMC (99% of what's done nowadays); this is not a determination of the posterior per se, but rather a way of obtaining a sample from the posterior, to be used for computing quantities of interest.

  • Conjugacy

    A further desirable aspect is conjugacy, i.e. when the posterior belongs to the same family as the prior, with updated parameters. However, in the nonparametric framework

    • many models are not conjugate, or not strictly conjugate;
    • for most of the others (e.g., recent developments on complicated spaces) this is not known.

    The reason is that the complexity of certain models which are believed to be useful today is such that there is no hope of obtaining analytical results. MCMC is the standard practice in most cases. There are, however, a few exceptions. See [Lijoi and Prünster(2010)] for a review, and [?] for a recent result on two time-dependent models.

  • Historical overview

    • Bruno de Finetti in the 1930s paved the way for the subsequent development of the Bayesian paradigm in its full generality, by laying the theoretical foundations by means of the representation theorem for exchangeable sequences and the theory of processes with stationary and independent increments.

    • Until the beginning of the '70s, the necessary tools for the actual implementation of the nonparametric approach were still missing.

    • The first breakthrough arrives with Ferguson (1973), who introduces the Dirichlet process. This changed everything by showing that the nonparametric approach to Bayesian inference was indeed feasible from an analytical point of view, and hence could be implemented.

  • Historical overview 2

    • Focus on survival analysis in the '70s and '80s:
      • neutral to the right processes: Doksum (1974); Ferguson (1974); Ferguson and Phadia (1979)
      • extended gamma process: Dykstra and Laud (1981)
      • beta process: Hjort (1990)

    • Focus on density estimation in the '90s:
      • mixtures w.r.t. the Dirichlet process: Lo (1984), to date the most successful Bayesian nonparametric model
      • Pólya trees: Mauldin, Sudderth and Williams (1992)
      • Bernstein polynomials: Petrone (1999)

  • Historical overview 3

    • Second breakthrough, the computational revolution: popularisation of MCMC techniques for Bayesian inference, mainly started with Escobar and West (1995), who introduce a Gibbs sampler for Dirichlet mixtures. Important subsequent contributions by Müller, MacEachern, Ishwaran and James (2001), Walker (2007), Papaspiliopoulos and Roberts (2008).

    • Frequentist asymptotic validation of BNP procedures: Diaconis and Freedman (1983; 1986) pose the question "what if" from a frequentist point of view; then the first major positive results: Barron, Schervish and Wasserman (1999), Ghosal, Ghosh and Ramamoorthi (1999), Ghosal, Ghosh and van der Vaart (2000), Walker (2004).

    • Recent theoretical advances rely hugely on the theory of stochastic processes, on the use of completely random measures introduced by Kingman (1967), and on the theory of random partitions and combinatorial stochastic processes initiated in a series of papers by Kingman and later developed by Pitman (2006).

  • Historical overview 4

    • Priors with discrete structures: species sampling models, stick-breaking priors, normalised random measures with independent increments, Gibbs-type processes, etc.

    • Dependent priors for inference on non-exchangeable data: Cifarelli and Regazzini (1978), MacEachern (1999; 2000). Recent reviews: Dunson (2010), Teh and Jordan (2010), Müller and Mitra (2013).

    • Lively interactions with other communities and research areas such as, e.g., Machine Learning, Combinatorics, Population Genetics, Bioinformatics, etc.

    General references:
    • Regazzini (1996)
    • Ghosh and Ramamoorthi (2003)
    • Hjort, Holmes, Müller and Walker (2010)
    • Müller and Rodriguez (2013)
    • Müller, Quintana, Jara and Hanson (2015)
    • Ghosal and van der Vaart (2016)

  • Dirichlet Process

  • The Dirichlet distribution

    Let α_1, \ldots, α_K be positive numbers and define

    \Delta_{K-1} = \left\{ x \in [0,1]^{K-1} : \sum_{i=1}^{K-1} x_i \le 1 \right\} \equiv \left\{ x \in [0,1]^{K} : \sum_{i=1}^{K} x_i = 1 \right\},

    the (K-1)-dimensional simplex. We say that

    (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K)

    if it has density w.r.t. the Lebesgue measure on \Delta_{K-1} given by

    f_{K-1}(w; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}\; w_1^{\alpha_1 - 1} \cdots w_{K-1}^{\alpha_{K-1} - 1} \Big(\underbrace{1 - \sum_{i=1}^{K-1} w_i}_{w_K}\Big)^{\alpha_K - 1} \mathbf{1}_{\Delta_{K-1}}(w).

    Note: this models K coordinates, by taking w_K = 1 - \sum_{i=1}^{K-1} w_i.

  • The Dirichlet Distribution

    [Figure slide: illustration of the Dirichlet distribution.]

  • Properties of the Dirichlet distribution

    Some key properties:

    • Marginalisation: if (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K) and i_1, \ldots, i_\ell are \ell indices in {1, \ldots, K}, then

    (w_{i_1}, \ldots, w_{i_\ell}, w_{i_{\ell+1}}) \sim Dir_{\ell}\left( \alpha_{i_1}, \ldots, \alpha_{i_\ell},\; \sum_{i=1}^{K} \alpha_i - \sum_{j=1}^{\ell} \alpha_{i_j} \right),

    where w_{i_{\ell+1}} = 1 - \sum_{j=1}^{\ell} w_{i_j}.

    • Special case of the previous:

    w_i \sim Beta\left( \alpha_i,\; \sum_{j=1}^{K} \alpha_j - \alpha_i \right).

  • Properties

    • Construction from independent normalised gamma r.v.'s
      Let Y_j \overset{ind}{\sim} Ga(\alpha_j, 1), \alpha_j > 0, for j = 1, \ldots, K, and define

      w_j = \frac{Y_j}{\sum_{i=1}^{K} Y_i};

      then (w_1, \ldots, w_{K-1}) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K).

    • Aggregation of coordinates
      If (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K) and 0 < r_1 < \cdots < r_\ell = K, then

      \left( \sum_{i=1}^{r_1} w_i, \ldots, \sum_{i=r_{\ell-1}+1}^{r_\ell} w_i \right) \sim Dir_{\ell-1}\left( \sum_{i=1}^{r_1} \alpha_i, \ldots, \sum_{i=r_{\ell-1}+1}^{r_\ell} \alpha_i \right).

    ✓ This property suggests that one can "interpret" the vector of parameters (\alpha_1, \ldots, \alpha_K) as a measure on the coordinate indices: \alpha(A) = \sum_{i \in A} \alpha_i.
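    The construction from normalised gammas gives a direct way to simulate Dirichlet vectors in R. The sketch below (the function name rdirichlet, the parameter values and the checks are illustrative choices) also verifies the marginal Beta property and the aggregation property by simulation:

    rdirichlet <- function(n, alpha) {
      K <- length(alpha)
      y <- matrix(rgamma(n * K, shape = rep(alpha, each = n), rate = 1), nrow = n)
      y / rowSums(y)                     # normalise each row
    }

    set.seed(4)
    alpha <- c(2, 1, 3)
    w <- rdirichlet(10000, alpha)

    ## Marginal property: w_1 ~ Beta(alpha_1, sum(alpha) - alpha_1)
    c(mean(w[, 1]), alpha[1] / sum(alpha))                  # empirical vs exact mean
    ks.test(w[, 1], pbeta, alpha[1], sum(alpha) - alpha[1])

    ## Aggregation: w_1 + w_2 ~ Beta(alpha_1 + alpha_2, alpha_3)
    ks.test(w[, 1] + w[, 2], pbeta, alpha[1] + alpha[2], alpha[3])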

  • The Dirichlet distribution in Bayesian statistics

    A hierarchical model:

    X_1, \ldots, X_n \mid (w_1, \ldots, w_K) \overset{iid}{\sim} Multinomial(w_1, \ldots, w_K)   (sampling model)
    (w_1, \ldots, w_K) \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_K)   (prior)

    Recall that under the multinomial likelihood P(X_1 = j \mid w_1, \ldots, w_K) = w_j.

    It is not difficult to realize that

    (w_1, \ldots, w_K) \mid X_1 = j \sim Dir_{K-1}(\alpha_1, \ldots, \alpha_j + 1, \ldots, \alpha_K).

    ✓ The Dirichlet prior is conjugate w.r.t. the multinomial model.

  • Predictive distributions

    Since the Dirichlet-Multinomial model is conjugate, we can also easily compute the predictive distributions:

    P(X_1 = j) = \frac{\alpha_j}{\sum_{i=1}^{K} \alpha_i}

    P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{\sum_{i=1}^{K} \alpha_i + 1}

    \ldots

    P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^{n} \delta_{X_i}(j)}{\sum_{i=1}^{K} \alpha_i + n}

    The fact that I observe category j at time n reinforces the probability that I will observe the same category in the future, at times n + 1, n + 2, \ldots.
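    A minimal R sketch of the Dirichlet-Multinomial update and of the resulting predictive distribution (the prior vector and the observed labels are illustrative choices):

    alpha <- c(1, 2, 2)                 # prior Dir(alpha) over K = 3 categories
    x     <- c(3, 1, 2, 1, 3, 3, 2)     # observed category labels in {1, 2, 3}

    counts     <- tabulate(x, nbins = length(alpha))
    alpha_post <- alpha + counts        # posterior is Dir(alpha + counts)

    ## Predictive distribution of the next observation X_{n+1}
    alpha_post / sum(alpha_post)        # P(X_{n+1} = j | X_1, ..., X_n)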

  • Pólya urn process

    ✓ Consider an urn with K different colors {1, \ldots, K}. At the beginning there are α_j balls of color j in the urn. We sample a sequence X_1, X_2, \ldots as follows (set n = 1):

    1. Draw a ball at random from the urn and note its color;
    2. If the ball is of color j, then X_n = j;
    3. Return the ball and place 1 extra ball of the observed color in the urn (reinforcement);
    4. Set n = n + 1 and go to step 1.

    It is not difficult to realize that

    P(X_1 = j) = \frac{\alpha_j}{\sum_{i=1}^{K} \alpha_i},
    P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{\sum_{i=1}^{K} \alpha_i + 1},
    \ldots,
    P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^{n} \delta_{X_i}(j)}{\sum_{i=1}^{K} \alpha_i + n}.

  • Pólya urn

    ✓ The predictive structure of the sequence sampled with the Pólya urn is the same as the one obtained under the Dirichlet-Multinomial model, so

    1. the Pólya sequence is exchangeable,
    2. its de Finetti measure is the Dirichlet distribution.

    We can thus obtain the Dirichlet-Multinomial model via the predictive approach.

  • The Dirichlet process

    Definition
    Let α > 0 and P_0 a probability measure on X. A random probability measure P on (X, 𝒳) is said to be a Dirichlet process, denoted P ∼ DP(α, P_0), if for every finite measurable partition A_1, \ldots, A_K of X we have

    (P(A_1), \ldots, P(A_K)) \sim Dir(\alpha P_0(A_1), \ldots, \alpha P_0(A_K)).

    ✓ The DP is an infinite-dimensional process whose finite-dimensional distributions are Dirichlet.

    ✓ The existence and well-definedness of this object are non-trivial.

    ✓ Example: given A ∈ 𝒳,

    P(A) \sim Beta\big(\alpha P_0(A),\; \alpha(1 - P_0(A))\big).

  • Properties of the Dirichlet process

    Two parameters:
    ✓ α is called the total mass;
    ✓ P_0(·) is called the baseline or centering measure.

    A priori moments, for any A ∈ 𝒳:

    ✓ Expected value: E[P(A)] = P_0(A),
      from which the interpretation of the parameter measure P_0 ∈ PX as centering distribution or prior guess.

    ✓ Variance: Var(P(A)) = \frac{P_0(A)(1 - P_0(A))}{\alpha + 1},
      so α controls the variability of P around the prior guess P_0, and is also called the precision parameter.

    ✓ Covariance: Cov(P(A), P(B)) = \frac{P_0(A \cap B) - P_0(A) P_0(B)}{\alpha + 1} for any B ∈ 𝒳.
      Drawback: if A and B are disjoint, the covariance is always negative.

  • Example of draws from a DP

    [Figure: four panels showing CDF trajectories drawn from DP(α, P_0) for α = 2, 4, 20, 100; x-axis: x, y-axis: cdf.]

    ✓ Samples from DP(α, P_0) with P_0 = N(0, 1) (thick blue curve) and varying precision parameter α.

    ✓ Note how α controls not only the variability of the realizations around P_0, but also the relative size of the jumps.

  • Pólya urn process

    Consider an urn with a continuum of colors [Blackwell, D. and MacQueen, J.B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics]. The initial composition of the urn is summarized by a finite measure αP_0(·). We sample a sequence X_1, X_2, \ldots as follows (set n = 1):

    1. Draw a ball at random from the urn according to the probability law obtained by normalizing its composition measure;
    2. If the ball is of color x_n, then X_n = x_n;
    3. Place 1 extra ball of the observed color in the urn (reinforcement);
    4. Set n = n + 1 and go to step 1.

    It is possible to see that

    P(X_1 \in B) = \frac{\alpha P_0(B)}{\alpha P_0(X)} = P_0(B),

    P(X_2 \in B \mid X_1) = \frac{\alpha P_0(B) + \delta_{X_1}(B)}{\alpha + 1} = \frac{\alpha}{\alpha + 1} P_0(B) + \frac{1}{\alpha + 1} \delta_{X_1}(B),

    \ldots,

    P(X_{n+1} \in B \mid X_1, \ldots, X_n) = \frac{\alpha P_0(B) + \sum_{i=1}^{n} \delta_{X_i}(B)}{\alpha + n} = \frac{\alpha}{\alpha + n} P_0(B) + \frac{n}{\alpha + n}\, \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}(B).
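    This predictive scheme can be simulated directly, without ever representing P. The R sketch below uses P_0 = N(0, 1) and α = 5 as illustrative choices (the function name dp_polya_urn is mine):

    dp_polya_urn <- function(n, alpha = 5, r_p0 = function(m) rnorm(m)) {
      x <- numeric(n)
      x[1] <- r_p0(1)                          # X_1 ~ P0
      for (i in 2:n) {
        if (runif(1) < alpha / (alpha + i - 1)) {
          x[i] <- r_p0(1)                      # new color, drawn from P0
        } else {
          x[i] <- x[sample.int(i - 1, 1)]      # repeat one of the past values
        }
      }
      x
    }

    set.seed(6)
    x <- dp_polya_urn(200)
    length(unique(x))                          # ties occur: P is a.s. discrete
    sort(table(x), decreasing = TRUE)[1:5]     # the most repeated values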

  • A simple Bayesian nonparametric model

    ✓ It is possible to prove that
    1. the Pólya sequence is exchangeable,
    2. its de Finetti measure is the Dirichlet process.

    ✓ We can use the predictive approach to set our first Bayesian nonparametric model:

    X_1, \ldots, X_n \mid P \overset{iid}{\sim} P   (sampling model)
    P \sim DP(\alpha, P_0)   (prior)

  • Density estimation using DP priors

    ✓ Our first Bayesian nonparametric model is not dominated, so we cannot apply Bayes' Theorem as in the parametric case. However, it is possible to show (Ferguson, 1973) that the DP is conjugate:

    P \mid X^{(n)} \sim DP\left( \alpha + n,\; \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{X_i} \right).

    Interpretation: after observing X^{(n)} = (X_1, \ldots, X_n), the updated guess is

    \hat{P}_X = E(P \mid X^{(n)} = x^{(n)}) = \frac{\alpha P_0 + \sum_{i=1}^{n} \delta_{x_i}}{\alpha + n} = \underbrace{\frac{\alpha}{\alpha + n} P_0}_{\text{prior guess}} + \underbrace{\frac{n}{\alpha + n} P_n}_{\text{empirical measure}},

    where P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i} and \hat{P}_X is the Bayesian estimate. It has a continuous component (if P_0 is continuous) and a discrete component determined by the data.
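    The posterior mean of the CDF is straightforward to compute and plot in R; in the sketch below α = 5, P_0 = N(0, 1) and the simulated data are illustrative choices (they mirror the illustration on the next slide):

    set.seed(7)
    alpha <- 5
    x     <- rnorm(50, mean = 2, sd = 2)      # "observed" data
    grid  <- seq(-4, 8, length.out = 400)

    F0    <- pnorm(grid, 0, 1)                # prior guess P0 = N(0, 1)
    Fn    <- ecdf(x)(grid)                    # empirical CDF
    Fpost <- alpha / (alpha + length(x)) * F0 +
             length(x) / (alpha + length(x)) * Fn   # E(P | data) on the grid

    plot(grid, Fpost, type = "l", lwd = 2, col = "red",
         xlab = "x", ylab = "E(P | X1, ..., Xn)")
    lines(grid, F0, col = "orange")           # centering P0
    lines(grid, Fn, type = "s", col = "blue") # empirical CDF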

  • Density estimation using DP prior Illustration

    [Figure: two panels, "Sample size n=8, alpha=5" and "Sample size n=50, alpha=5"; x-axis: x, y-axis: E(P | X_1 = x_1, \ldots, X_n = x_n); legend: posterior mean, 95% posterior credible bounds, empirical, true, centering P_0.]

    ✓ Two independent samples of sizes n = 8 and n = 50 generated from a normal N(2, 4) (gray dashed line, the "truth").

    ✓ In both cases α = 5, while P_0 = N(0, 1) (orange line).

    ✓ The empirical CDF (blue step function) and the posterior mean (thick red) with its 95% credible bounds (dashed red) are also shown.

  • Predictive structure of a sample from DP

    ✓ The joint distribution of a sample X_1, \ldots, X_n from a DP is

    P(X_1 \in dx_1, \ldots, X_n \in dx_n) = \int_{PX} \prod_{i=1}^{n} P(dx_i)\, DP(\alpha, P_0)(dP),

    where we want to integrate out the random probability measure P.

    ✓ By resorting to the chain rule we have

    P(X_1 \in dx_1, \ldots, X_n \in dx_n) = P(X_1 \in dx_1) \times P(X_2 \in dx_2 \mid X_1 = x_1) \times \cdots \times P(X_n \in dx_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}).

    With the Pólya urn we have already characterized P(X_{i+1} \in dx_{i+1} \mid X^{(i)}), the predictive distribution of X_{i+1} given the previous observations X_1, \ldots, X_i.

  • Predictive

    In particular,

    P(X_{n+1} \in dx_{n+1} \mid X^{(n)}) = \frac{\alpha}{\alpha + n} P_0(dx_{n+1}) + \frac{n}{\alpha + n} \underbrace{\frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}(dx_{n+1})}_{P_n},

    which is a mixture of the prior guess and the empirical measure of the observations.

    X_{n+1} is
      ∼ P_0   with probability α/(α + n)
      = X_1   with probability 1/(α + n)
      ...
      = X_n   with probability 1/(α + n)

  • Interpretation

    Let P_0 be diffuse, or non-atomic, i.e. P_0({x}) = 0 for all x ∈ X. Then

    • n/(α + n) is the probability that X_{n+1} is an "old" value, i.e. already observed in X_1, \ldots, X_n;

    • α/(α + n) is the probability that X_{n+1} is a new value, not previously observed.

    ✓ As the number of observations n increases, we have more information on the data-generating mechanism and the weight associated with the prior guess goes to zero.

    ✓ This predictive structure characterises the DP, i.e. the de Finetti measure Π of the sequence (X_n)_{n≥1} is a DP prior iff the prediction rule is a linear combination of P_0 and the empirical measure [Regazzini(1978), Lo(1991)].

  • Discreteness

    ✓ Even when the base measure P_0 of the DP is absolutely continuous [Ferguson(1973), Blackwell(1973)], realisations from the DP are almost surely discrete distributions. This can be seen from the different constructions of the DP we'll see in the next slides (through gamma processes and the stick-breaking representation).

  • Stick-breaking construction

    Idea:
    • break a unit-length stick to construct random weights:
      W_1 = V_1, leaving a piece of length 1 − V_1;
      W_2 = V_2(1 − V_1), leaving (1 − V_1)(1 − V_2);
      W_3 = V_3(1 − V_1)(1 − V_2), and so on;
    • sample locations from X;
    • attach the locations to the weights to construct a random discrete measure.

    If we assign the correct distributions to these objects, we obtain a DP.

    [Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica.]

  • Stick-breaking construction

    In particular, letting
    • V_i \overset{iid}{\sim} Beta(1, \alpha), i = 1, 2, \ldots,
    • W_i = V_i \prod_{j=1}^{i-1} (1 - V_j), i = 1, 2, \ldots,
    • \tilde{X}_i \overset{iid}{\sim} P_0,

    the random discrete measure

    P = \sum_{i \ge 1} W_i \delta_{\tilde{X}_i} \sim DP(\alpha, P_0).
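    A hedged R sketch of an (approximate) draw from the DP obtained by truncating the stick-breaking series; the truncation level N = 500 and P_0 = N(0, 1) are illustrative choices, and truncation is discussed again at the end of this section:

    rdp_stick <- function(N = 500, alpha = 2, r_p0 = function(m) rnorm(m)) {
      v <- rbeta(N, 1, alpha)
      v[N] <- 1                                  # close the stick so the weights sum to 1
      w <- v * cumprod(c(1, 1 - v[-N]))          # W_i = V_i * prod_{j<i} (1 - V_j)
      list(weights = w, locations = r_p0(N))
    }

    set.seed(8)
    P <- rdp_stick(alpha = 2)

    ## Plot the (a.s. discrete) CDF of the realisation against the centering P0
    ord <- order(P$locations)
    plot(P$locations[ord], cumsum(P$weights[ord]), type = "s",
         xlab = "x", ylab = "cdf", main = "A draw from DP(2, N(0, 1))")
    curve(pnorm(x), add = TRUE, lwd = 2, col = "blue")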

  • Support of the DP

    Realisations of the DP are discrete distributions. However, we mentioned that it is desirable for a nonparametric prior Π to have large, possibly full, support, i.e.

    supp(\Pi) \equiv PX.

    In particular, for a random probability measure,

    supp(P) = \bigcap \{ A : A \text{ closed and } P(A^c) = 0 \}.

    [Ferguson(1973), Ferguson(1974)] prove that

    supp(DP(\alpha, P_0)) = \{ P \in PX : supp(P) \subset supp(P_0) \}.

    Then

    supp(P_0) = X \;\Rightarrow\; supp(DP(\alpha, P_0)) = PX.

    Hence the DP prior has full support, including continuous distributions.

  • Distinct values and induced partition

    The variables X_1, \ldots, X_n \mid P \overset{iid}{\sim} P, where P ∼ DP, induce a random partition ρ of the data indices {1, \ldots, n}.

    Since P is a.s. discrete, we will observe only K_n ≤ n different values:
    ✓ X^*_1, \ldots, X^*_{K_n}: the unique values among X_1, \ldots, X_n;
    ✓ ρ = {C_1, \ldots, C_{K_n}}: i ∈ C_j ⇔ X_i = X^*_j, with #C_j = n_j.

    Note that we can rewrite the predictive distribution as

    P(X_{n+1} \in dx \mid X^{(n)}) = \frac{\alpha}{\alpha + n} P_0(dx) + \frac{1}{\alpha + n} \sum_{j=1}^{K_n} n_j \delta_{X^*_j}(dx),

    from which X_{n+1} = X^*_j with probability n_j/(\alpha + n).

  • Chinese restaurant process

    The law of the random partition ρ can be characterized by the so-called Chinese restaurant process:

    ✓ the first customer sits at table 1;
    ✓ given that k tables are occupied by the first n customers, customer n + 1 sits:
      • at table j = 1, \ldots, k with probability n_j/(\alpha + n), where n_j is the number of customers at table j;
      • at a new table k + 1 with probability \alpha/(\alpha + n).

    [Diagram: tables C_1, C_2, C_3, C_4, C_5, \ldots being filled by successive customers.]

    Under this metaphor: customers ⇔ observation indices, tables ⇔ clusters, colors ⇔ X^*'s.
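    A minimal R sketch of the Chinese restaurant process (α and the number of customers are illustrative choices; the function name crp is mine):

    crp <- function(n, alpha = 1) {
      table_of <- integer(n)     # table index of each customer
      table_of[1] <- 1
      sizes <- 1                 # current table sizes
      for (i in 2:n) {
        p <- c(sizes, alpha) / (alpha + i - 1)    # existing tables, then a new one
        j <- sample(length(p), 1, prob = p)
        if (j > length(sizes)) sizes <- c(sizes, 1) else sizes[j] <- sizes[j] + 1
        table_of[i] <- j
      }
      table_of
    }

    set.seed(9)
    rho <- crp(100, alpha = 2)
    table(rho)      # cluster sizes: typically a few large tables and many small ones
    max(rho)        # K_n, the number of occupied tables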

  • Ewens sampling formula

    Prior of ρ: the exchangeable partition probability function (EPPF)

    P(\rho = \{C_1, \ldots, C_{K_n}\}) = eppf(\#C_1, \ldots, \#C_{K_n}) := \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}\, \alpha^{K_n} \prod_{j=1}^{K_n} (n_j - 1)!

    ✓ From this EPPF we can obtain the probability mass function of the number of unique values K_n (Antoniak, 1974):

    P(K_n = k) = S_{n,k}\, \alpha^{k}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)},

    where S_{n,k} is an (unsigned) Stirling number of the first kind.

    ✓ Using a conditional expectation argument we find

    E(K_n) = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\left( \frac{\alpha + n}{\alpha} \right).

  • Prior number of different values

    Much is known about the behaviour of the number of distinct values K_n:
    • E[K_n] ≈ Var[K_n] ≈ α log n
    • K_n / \log n \to \alpha   a.s.
    • (K_n - E[K_n]) / \sqrt{Var(K_n)} \overset{d}{\to} N(0, 1)
    • d_{TV}(\mathcal{L}(K_n), Po(E[K_n])) = o(1/\log n).

    ✓ "Rich get richer" behaviour: the DP favors partitions with a small number of large clusters and a large number of smallish ones.

    ✓ This feature of the model is often inappropriate in applications, which has motivated many generalizations [Argiento, R., et al. (2015). Modelling the association between clusters of SNPs and disease responses. In Nonparametric Bayesian Methods in Biostatistics and Bioinformatics (R. Mitra, P. Mueller, eds.), Springer].
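    These quantities are easy to explore numerically. The R sketch below compares the exact value of E(K_n) with the logarithmic approximation, and simulates K_n using the fact, implicit in the predictive rule, that observation i is "new" with probability α/(α + i − 1), independently across i (α and n are illustrative choices):

    alpha <- 5; n <- 1000

    ## Exact expectation and the logarithmic approximation
    EKn_exact  <- sum(alpha / (alpha + (1:n) - 1))
    EKn_approx <- alpha * log((alpha + n) / alpha)
    c(EKn_exact, EKn_approx)

    ## Monte Carlo: K_n as a sum of independent "new value" indicators
    set.seed(10)
    Kn <- replicate(2000, sum(rbinom(n, 1, alpha / (alpha + (1:n) - 1))))
    c(mean(Kn), var(Kn))    # both of order alpha * log(n), as stated above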

  • Distribution of Kn

    It is clear that the distribution of the number of distinct values depends on the base measure only through the precision parameter α:

    • α small ⇒ few distinct observations, and vice versa;
    • from

    P(X_2 \in \cdot \mid X_1) = \frac{\alpha}{\alpha + 1} P_0(\cdot) + \frac{1}{\alpha + 1} \delta_{X_1}(\cdot)

    we see that
      • α → 0 implies all observations are equal,
      • α → ∞ implies all observations are different.

    [Figure: probability mass functions of K_n for α = 1, 5, 10, 20, 50, 100; x-axis: k_n, y-axis: P(K_n = k_n).]

  • DP as a normalized completely random measure (NormCRM)

    Constructive definition: a gamma completely random measure on X is defined as

    \mu(\cdot) \overset{d}{:=} \sum_{i=1}^{+\infty} J_i \delta_{\tau_i}(\cdot),

    where
    ✓ the jumps {J_i} are the points of a Poisson process on R^+ with intensity \rho(s) = \alpha s^{-1} e^{-s}, \alpha > 0;
    ✓ the support points {\tau_i} are an iid sequence from P_0;
    ✓ {J_i} and {\tau_i} are independent.

    Since \int_0^{+\infty} \min\{1, s\}\, \rho(s)\, ds < +\infty and \int_0^{\infty} \rho(s)\, ds = \infty, the total mass satisfies

    0 < T := \sum_{i=1}^{+\infty} J_i < +\infty   a.s.,

    and the normalised random measure P(\cdot) = \mu(\cdot)/T is a DP(\alpha, P_0).

  • Finite dimensional approximations

✓ Let M > 0, (w1, . . . , wM) ∼ Dir(α/M, . . . , α/M), and τ1, . . . , τM i.i.d. ∼ P0; then

    PM(·) := Σ_{h=1}^{M} wh δ_{τh}(·)  →  P(·)  in law as M → ∞,

where P ∼ DP(α, P0).

Observation. Let v1, . . . , vN−1 i.i.d. ∼ Beta(1, α) and vN = 1. If wh = vh ∏_{k=1}^{h−1} (1 − vk), then Σ_{h=1}^{N} wh = 1 a.s.

✓ If τ1, . . . , τN i.i.d. ∼ P0 then, by the stick-breaking construction,

    PN(·) := Σ_{h=1}^{N} wh δ_{τh}(·)  →  P(·)  in law as N → ∞.

Moreover, if N := Nε ∼ Poi(−α log(ε)) then dTV(P, Pε) < ε [Muliere and Tardella (1998)].

    R. Argiento October 20, 2016
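A minimal R sketch of the first (Dirichlet-multinomial) approximation, assuming a standard normal base measure P0; the Dirichlet weights are obtained by normalizing independent Gamma(α/M, 1) variables so no extra package is needed:

rPM <- function(M, alpha, rP0 = rnorm) {
  g   <- rgamma(M, shape = alpha / M, rate = 1)  # normalized independent gammas give Dir(alpha/M, ..., alpha/M)
  tau <- rP0(M)                                  # atoms drawn i.i.d. from P0
  list(w = g / sum(g), tau = tau)                # P_M = sum_h w_h delta_{tau_h}
}

set.seed(3)
PM <- rPM(M = 1000, alpha = 5)
sum(PM$w[PM$tau <= 0])    # P_M((-inf, 0]); on average equal to P0((-inf, 0]) = 0.5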


  • Simulation of trajectories

Based on the stick-breaking representation of the DP. It has found wide application for inference, especially because it lends itself easily to the simulation of the DP. Two ideas here:

• truncation: fix N large enough, simulate X1, . . . , XN and V1, . . . , VN and write

    P ≈d Σ_{i=1}^{N} Wi δ_{Xi},

which gives an approximate trajectory [Ishwaran and James (2001)]. Alternatively, fix ε > 0 and choose a random Nε ∼ Po(−α log ε), which guarantees dTV(P, Pε) < ε [Muliere and Tardella (1998)].

• stochastic truncation: simulate an exact trajectory by resorting to a stochastic truncation method; the most famous are the slice sampler [Walker (2007)], which develops a Gibbs sampler on an augmented space, and the retrospective sampler [Papaspiliopoulos and Roberts (2008)], which simulates additional weights and locations only when these are needed.

These are called conditional methods, as opposed to the so-called marginal methods, which exploit the marginal distribution of the observables for simulating trajectories and tend to be more efficient.

    R. Argiento October 20, 2016
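A minimal R sketch of the ε-truncated stick-breaking simulation with the random Poisson truncation level described above, again assuming P0 = N(0, 1); all names are illustrative:

rdp_eps <- function(alpha, eps = 1e-3, rP0 = rnorm) {
  N <- rpois(1, -alpha * log(eps)) + 1        # random truncation level (at least one atom)
  v <- c(rbeta(N - 1, 1, alpha), 1)           # stick-breaking fractions, last one set to 1
  w <- v * cumprod(c(1, 1 - v[-N]))           # w_h = v_h * prod_{k<h} (1 - v_k)
  list(w = w, atoms = rP0(N))
}

set.seed(4)
P <- rdp_eps(alpha = 2)
sum(P$w)      # the weights sum to 1 by construction
max(P$w)      # a few large weights dominate when alpha is small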

  • Mixture models

  • What is Cluster Analysis?

✓ Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

[Figure: scatterplot of eruption duration (x axis, roughly 1.5 to 5 minutes) versus waiting time (y axis, roughly 50 to 90 minutes); subsequent overlays superimpose the contours and a perspective plot of an estimated bivariate density.]

    R. Argiento October 20, 2016
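The data shown here are the built-in faithful dataset in R. A short sketch reproducing a plot of this kind, using a classical kernel density estimate for the contours purely for illustration (not the DPM fit discussed later):

library(MASS)   # for kde2d; MASS ships with standard R installations
plot(faithful$eruptions, faithful$waiting, pch = 19, col = "grey40",
     xlab = "eruptions", ylab = "waiting")
dens <- kde2d(faithful$eruptions, faithful$waiting, n = 100)
contour(dens, add = TRUE)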


  • Model-based cluster analysis

What is Cluster Analysis?
An attempt to group a collection of data objects such that objects are
- similar to one another within the same group (or cluster);
- dissimilar to the objects in the other groups (or clusters).

Model-based clustering

✓ Data come from a "random" source with several (possibly infinitely many) subpopulations.
✓ Each subpopulation is modeled separately.
✓ The distribution of the overall population is a mixture of these subpopulations.
✓ The resulting model for the data is a mixture model.

    R. Argiento October 20, 2016


  • Infinite Mixture Models

Two ingredients

✓ {f(y; θ), θ ∈ Θ}, a parametric family of densities on R^p, with Θ ⊂ R^s (the kernels);
✓ P(·) := Σ_{h=1}^{∞} wh δ_{τh}(·), a discrete probability measure on Θ (the mixing distribution).

Under a mixture model the population variable has conditional distribution:

    Y | P ∼ ∫_Θ f(y; θ) P(dθ) = Σ_{h=1}^{∞} wh f(y; τh)

Interpretation:
✓ Infinite number of possible clusters: h = 1, 2, . . .
✓ wh is the probability that an observation lies in the h-th cluster.
✓ f(·; τh) is the density of the data lying in the h-th cluster.

    R. Argiento October 20, 2016
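To make the interpretation concrete, a small self-contained R sketch that draws one random mixing distribution by (truncated) stick breaking and evaluates the corresponding mixture density with Gaussian kernels of fixed scale; the truncation level and hyperparameters are illustrative assumptions:

set.seed(5)
H     <- 50                                    # truncation level for the infinite sum
alpha <- 2
v     <- c(rbeta(H - 1, 1, alpha), 1)
w     <- v * cumprod(c(1, 1 - v[-H]))          # stick-breaking weights w_h
mu    <- rnorm(H, 0, 3)                        # atoms tau_h drawn from P0 = N(0, 3^2)

mix_density <- function(y)                     # sum_h w_h N(y; mu_h, 0.5^2)
  colSums(w * sapply(y, function(yy) dnorm(yy, mean = mu, sd = 0.5)))

ygrid <- seq(-10, 10, length.out = 400)
plot(ygrid, mix_density(ygrid), type = "l", xlab = "y", ylab = "density")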


  • Bayesian nonparametric density estimation

✓ Let Y1, Y2, . . . be an i.i.d. sample with unknown density f. A Dirichlet process mixture (DPM) prior on f posits that

    Yi | P ∼ f(yi | P) = ∫_Θ f(yi; θ) P(dθ)
    P ∼ DP(α, P0)

where f(yi; θ) is a parametric density (often referred to as the kernel of the mixture), indexed by a finite-dimensional parameter θ ∈ Θ ⊂ R^s.

✓ Exploiting the stick-breaking construction of the Dirichlet process we can write

    Yi | (wh, τh)_{h=1}^{∞} ∼ Σ_{h=1}^{∞} wh f(yi | τh)  =: f(yi | P)

where τh i.i.d. ∼ P0 and wh = vh ∏_{k=1}^{h−1} (1 − vk), with vh i.i.d. ∼ Beta(1, α).
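For intuition on the model as a data-generating mechanism, a small R sketch that simulates a synthetic sample from a (truncated) DPM of normals; the truncation level, base measure and kernel scale are illustrative assumptions:

set.seed(6)
n <- 200; alpha <- 1; H <- 100
v   <- c(rbeta(H - 1, 1, alpha), 1)
w   <- v * cumprod(c(1, 1 - v[-H]))            # stick-breaking weights
tau <- rnorm(H, 0, 3)                          # atoms from P0 = N(0, 3^2)
z   <- sample(H, n, replace = TRUE, prob = w)  # latent mixture component of each observation
y   <- rnorm(n, mean = tau[z], sd = 0.5)       # Yi drawn from the selected Gaussian kernel
table(z)                                       # sizes of the occupied components
hist(y, breaks = 30, main = "")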


  • Support

✓ From a density estimation point of view, working with an infinite number of components is particularly appealing because it ensures that, for appropriate choices of the kernel f(y; θ), the DPM model has support on a large class of distributions.

✓ For example, [Lo (1984)] showed that a DP location-scale mixture of normals,

    Yi | P ∼ ∫ N(yi; μ, σ²) P(dμ, dσ²),   P ∼ DP(α, P0),

has full support on the space of absolutely continuous distributions.

    R. Argiento October 20, 2016

  • Hierarchical model

An alternative representation of the Dirichlet process mixture model (DPM) introduces latent random effects6 θi to replace the mixture by a hierarchical model:

    Y1, . . . , Yn | θ1, . . . , θn  ind. ∼  f(yi; θi)
    θ1, . . . , θn | P  i.i.d. ∼  P
    P ∼ DP(α, P0)

Note: I slightly changed my notation: here θ1, . . . , θn is a sample from the Dirichlet process, while in the previous classes I denoted this sample by X1, . . . , Xn. So all the considerations we made on the X's now hold true for the variables θ's!

6 Argiento, R., et al. (2014). Estimation, prediction and interpretation of NGG random effects models: an application to Kevlar fibre failure times. Statistical Papers.

    R. Argiento October 20, 2016

  • Clustering

✓ Given (θ1, . . . , θn), Yi and Yj belong to the same cluster iff

    θi = θj,  and we write Yi ↔ Yj.

✓ θ∗ = (θ∗1, . . . , θ∗Kn) are the unique values among the θi's.
✓ ρ = {C1, . . . , CKn} is the clustering induced on the data indices by the DP sample (θ1, . . . , θn).

The prior on (θ1, . . . , θn) is equivalent to the prior on (ρ, θ∗), i.e.

    P(θ1 ∈ dθ1, . . . , θn ∈ dθn)  ⇔  P(ρ = {C1, . . . , CKn}) ∏_{j=1}^{Kn} P0(dθ∗j)

    R. Argiento October 20, 2016


  • Nonparametric Bayesian approach to clustering

Conditional likelihood

    Y1, . . . , Yn | C1, . . . , Ck, θ∗1, . . . , θ∗k ∼ ∏_{j=1}^{k} { ∏_{i∈Cj} f(yi; θ∗j) }     (1)

- ρ := {C1, . . . , Ck} is a partition of the data index set {1, . . . , n};
- {f(·; θ∗), θ∗ ∈ Θ} is a family of densities on the sample space X.

Prior specification:

    ∏_{j=1}^{k} P0(dθ∗j) π(ρ)     (2)

    π(ρ) = P(ρ = {C1, . . . , Ck}) = eppf(#C1, . . . , #Ck)

- The infinite exchangeable partition probability function under the DPM model is

    eppf(n1, . . . , nk) = Γ(α)/Γ(α + n) α^k ∏_{i=1}^{k} (ni − 1)!

    R. Argiento October 20, 2016
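The EPPF above is easy to evaluate directly; a small R sketch computing the log prior probability of a partition from its cluster sizes (the function name is illustrative):

log_eppf_dp <- function(sizes, alpha) {
  n <- sum(sizes); k <- length(sizes)
  # sum(lgamma(sizes)) = sum of log (n_i - 1)!
  lgamma(alpha) - lgamma(alpha + n) + k * log(alpha) + sum(lgamma(sizes))
}
# prior probability of a partition of n = 8 items into clusters of sizes 3, 2, 1, 2
exp(log_eppf_dp(c(3, 2, 1, 2), alpha = 1))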

  • Two points of view

Hierarchical model

    Y1, . . . , Yn | θ1, . . . , θn  ind. ∼  f(yi; θi)
    θ1, . . . , θn | P  i.i.d. ∼  P
    P ∼ DP(α, P0)

✓ The density f(·; P) of the population variable Y is random.
✓ The law of this random density is assigned by a mixture model:

    f(y; P) = ∫_Θ f(y; θ) P(dθ)

Targets:

• Density estimation: L(f(y; P) | Y1, . . . , Yn)
• Cluster analysis: L(ρ | Y1, . . . , Yn)

    R. Argiento October 20, 2016

  • Computation under DPM model

  • A computational problem: existing approaches

Critical issue: the infinite-dimensional parameter P = Σ_{i=1}^{∞} wi δ_{τi}

Marginal Gibbs sampler algorithms [Escobar, 1988] [Neal, 2000]
✓ Integrate out P and resort to generalized Pólya urn schemes.
✓ Inference is limited to point estimates: the predictive f_{Xn+1}(· | X1, . . . , Xn).

Conditional methods
✓ Use some tricks to build a Gibbs sampler whose state space encompasses P.
✓ Full Bayesian posterior analysis.

For instance:
✓ Slice sampler [Walker, 2007] [Griffin, 2013]
✓ Retrospective methods [Papaspiliopoulos et al., 2008]
✓ Truncation of the infinite sum defining the r.p.m. P, either a priori or a posteriori [Muliere & Tardella (1998)] [Argiento et al., 2015a]

As in [Ishwaran & James (2001)] or [Argiento et al., 2010], build a finite-dimensional approximation of the random probability measure

    P^(N) = Σ_{i=1}^{N} wi δ_{τi}

    R. Argiento October 20, 2016


  • MCMC integration

✓ Let Y = (Y1, . . . , Yn) be our dataset.
Target: the predictive density of the data, i.e. for each y in the sample space (P_Y below denotes the space of probability measures on the sample space):

    fY(y) dy = L(Yn+1 | Y) = P(Yn+1 ∈ dy | Y) = ∫_{P_Y} P(Yn+1 ∈ dy, P ∈ dP | Y)
             = ∫_{P_Y} P(Yn+1 ∈ dy | P) P(P ∈ dP | Y)      (Yn+1 is independent of Y given P)
             = ∫_{P_Y} f(y; P) dy P(P ∈ dP | Y)
             = E( f(y; P) dy | Y )

So, if P^(1), P^(2), . . . , P^(G) is an MCMC sample from L(P | Y) = P(P ∈ dP | Y), then

    f̂Y(y) = (1/G) Σ_{g=1}^{G} f(y; P^(g))

    R. Argiento October 20, 2016
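A sketch of this ergodic average in R, assuming each MCMC draw is stored as a (truncated) vector of weights and Gaussian kernel locations; the list structure, the kernel, and the toy "posterior" draws used below are illustrative assumptions:

predictive_density <- function(ygrid, draws, kernel_sd = 0.5) {
  est <- numeric(length(ygrid))
  for (d in draws)                                 # average f(y; P^(g)) over the sampled trajectories
    est <- est + colSums(d$w * sapply(ygrid, function(y) dnorm(y, mean = d$tau, sd = kernel_sd)))
  est / length(draws)
}

# toy usage: prior stick-breaking draws stand in for posterior MCMC output
set.seed(7)
draws <- replicate(100, {
  v <- c(rbeta(29, 1, 1), 1)
  list(w = v * cumprod(c(1, 1 - v[-30])), tau = rnorm(30, 0, 2))
}, simplify = FALSE)
ygrid <- seq(-6, 6, length.out = 200)
plot(ygrid, predictive_density(ygrid, draws), type = "l", xlab = "y", ylab = "density")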


  • Full Bayesian analysis

Note that by the stick-breaking construction P^(g) ≡ {(wh^(g)), (τh^(g))}, so

    f(y; P^(g)) = Σ_{h=1}^{∞} wh^(g) f(y; τh^(g))

✓ If we are able to sample P^(1), P^(2), . . . , P^(G) from L(P | Y), plugging this sample into the formula above, we can use it as a proxy to study the posterior law of the unknown density of the data Y, that is

    L(f(Yn+1; P) | Y1, . . . , Yn)

✓ However, in many applications it is enough to study linear functionals of the stochastic process f(Yn+1; P), like the predictive distribution fY(y) dy we computed in the previous slide, i.e.

    fY(y) = E(f(y; P) | Y1, . . . , Yn) = ∫_{P_Y} f(y | P) P(P ∈ dP | Y).

Marginalization: we analytically compute the integral with respect to L(P | Y1, . . . , Yn) in the expectation above.

    R. Argiento October 20, 2016


  • Marginalization

Target: the predictive distribution of the data, i.e.

    fY(y) = ∫_{P_Y} f(y | P) P(P ∈ dP | Y)

          = ∫_{P_Y} ∫_Θ f(y; θ) P(dθ) P(P ∈ dP | Y)

          = ∫_{P_Y} ∫_{Θ^n} [ ∫_Θ f(y; θ) P(dθ) ] P(P ∈ dP, θ ∈ dθ | Y)

          = ∫_{P_Y} ∫_{Θ^n} [ ∫_Θ f(y; θ) P(dθ) ] P(P ∈ dP | θ, Y) P(θ ∈ dθ | Y)
            (P and Y are conditionally independent given θ, so P(P ∈ dP | θ, Y) = P(P ∈ dP | θ))

          = ∫_{Θ^n} ∫_Θ f(y; θ) [ ∫_{P_Y} P(dθ) P(P ∈ dP | θ) ] P(θ ∈ dθ | Y)

          = ∫_{Θ^n} ∫_Θ f(y; θ) [ α/(α + n) P0(dθ) + Σ_{j=1}^{Kn} nj/(α + n) δ_{θ∗j}(dθ) ] P(θ ∈ dθ | Y),

where the term in square brackets is ∫_{P_Y} P(dθ) P(P ∈ dP | θ), i.e. the Pólya urn predictive distribution of the DP.

    R. Argiento October 20, 2016


  • Marginalization

Finally,

    fY(y) = α/(α + n) ∫_Θ f(y; θ) P0(dθ) + ∫_{Θ^n} { Σ_{j=1}^{Kn} nj/(α + n) f(y; θ∗j) } P(dθ | Y)

Let θ^(1), θ^(2), . . . , θ^(G) be a Markov chain sample from P(dθ | Y); then

    f̂Y(y) = α/(α + n) ∫_Θ f(y; θ) P0(dθ) + (1/G) Σ_{g=1}^{G} Σ_{j=1}^{Kn^(g)} nj^(g)/(α + n) f(y; θ∗j^(g))

    R. Argiento October 20, 2016
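A sketch of this estimator in R for a Gaussian kernel with known scale and a conjugate Gaussian base measure, so that the integral ∫ f(y; θ) P0(dθ) is available in closed form; theta_draws stands for the Gibbs output, and all names and hyperparameters are illustrative assumptions:

sigma <- 0.5; m0 <- 0; s0 <- 3                 # kernel sd and base measure N(m0, s0^2)

marginal_predictive <- function(ygrid, theta_draws, alpha) {
  n <- length(theta_draws[[1]]); G <- length(theta_draws)
  # prior part: int N(y; theta, sigma^2) N(theta; m0, s0^2) dtheta = N(y; m0, sigma^2 + s0^2)
  prior_part <- alpha / (alpha + n) * dnorm(ygrid, m0, sqrt(sigma^2 + s0^2))
  post_part <- numeric(length(ygrid))
  for (th in theta_draws) {
    tab <- table(th)                           # unique values theta*_j and cluster sizes n_j
    theta_star <- as.numeric(names(tab)); nj <- as.numeric(tab)
    post_part <- post_part +
      colSums(nj / (alpha + n) * sapply(ygrid, function(y) dnorm(y, theta_star, sigma)))
  }
  prior_part + post_part / G
}

# toy usage: fake Gibbs output with three well separated clusters
theta_draws <- replicate(50, sample(c(-2, 0, 2), 100, replace = TRUE), simplify = FALSE)
ygrid <- seq(-5, 5, length.out = 200)
plot(ygrid, marginal_predictive(ygrid, theta_draws, alpha = 1), type = "l", xlab = "y", ylab = "density")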


  • Pólya Urn Gibbs sampler

Target: sample θ = (θ1, . . . , θn) from P(θ | Y) = P(θ1 ∈ dθ1, . . . , θn ∈ dθn | Y).

Idea: use a Gibbs sampler, that is, sequentially draw values of θi from

    P(θi ∈ dθi | θ−i, Y)   for all i = 1, . . . , n,

where θ−i = (θ1, . . . , θi−1, θi+1, . . . , θn).

✓ We describe this algorithm under the assumption that f(y; θ) and P0(dθ) are conjugate. For extensions to the non-conjugate case see Algorithm 8 in [Neal, 2000].

    R. Argiento October 20, 2016

  • Notation

✓ Associated parametric model
• The one-observation parametric Bayesian model associated with the DPM is

    Y | θ ∼ f(y; θ)
    θ ∼ P0

  Its posterior is denoted by π̃(θ | y) and its marginal by m(y) = ∫ f(y; θ) P0(dθ).

• Let C be a subset of the indices {1, . . . , n}; the associated parametric model on the subset C is

    (Yi)_{i∈C} | θ∗  i.i.d. ∼  f(yi; θ∗)
    θ∗ ∼ P0

  whose posterior is proportional to ∏_{i∈C} f(yi; θ∗) P0(dθ∗).

    R. Argiento October 20, 2016
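In the Gaussian-Gaussian conjugate case (N(θ, σ²) kernel with known σ and P0 = N(m0, s0²), an assumption made here only for illustration) both quantities on this slide have closed forms; a compact R sketch:

sigma <- 0.5; m0 <- 0; s0 <- 3                 # kernel sd and base measure N(m0, s0^2)

# marginal of one observation: m(y) = int N(y; theta, sigma^2) N(theta; m0, s0^2) dtheta = N(y; m0, sigma^2 + s0^2)
m_marg <- function(y) dnorm(y, m0, sqrt(sigma^2 + s0^2))

# posterior of theta* given the observations in a cluster C (standard normal-normal update)
post_params <- function(yC) {
  prec <- 1 / s0^2 + length(yC) / sigma^2
  list(mean = (m0 / s0^2 + sum(yC) / sigma^2) / prec, sd = sqrt(1 / prec))
}

m_marg(1.0)
post_params(c(1.2, 0.8, 1.1))                  # posterior mean and sd of theta* for a cluster of three points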

  • Pólya Urn Gibbs sampler

We recall again that, a priori,

    P(θn ∈ dθ | θ1, . . . , θn−1) ∝ α P0(dθ) + Σ_{j=1}^{Kn−1} n_{n−1,j} δ_{θ∗j}(dθ)

✓ Since the sequence of latent observations θ1, . . . , θn is exchangeable, this expression gives us the form of the full conditional prior distribution for any θi given θ−i = (θ1, . . . , θi−1, θi+1, . . . , θn).

✓ Multiplying by the likelihood f(yi; θi) we find the full conditional posterior distribution for θi:

    P(θi ∈ dθi | θ−i, Y) ∝ α f(yi; θi) P0(dθi) + Σ_{j=1}^{Kn^−} nj^− f(yi; θj^{−∗}) δ_{θj^{−∗}}(dθi)

where the super