Dirichlet labeling and hierarchical
processes for clustering functional data
XuanLong Nguyen
Department of Statistics
University of Michigan
Joint work with Alan Gelfand, Duke University
1
Functional data clustering
• Many applications involve data that can be viewed as functions
(e.g., curves, surfaces, multivariate functions)
– borrowing statistical information across spatial domain
– borrowing across replicates in the collection
• Model-based clustering using nonparametric labeling processes
– differs from, e.g., Ramsay & Silverman (2002), Ferraty & Vieu (2006)
– global/local clustering and partitioning of curves
– application to progesterone hormone analysis
– application to image analysis
• Variational Bayesian inference for labeling processes,
i.e., implicit optimization over the space of posterior distributions
2-b
Example 1: Clustering curves
[Figure: Progesterone data (daily index vs. log(PGD))]
• To characterize hormone levels in terms of typical behaviors
– global clustering of entire curves; local clustering on a temporal segment
– effects of contraception, detection of anomalous behavior
3
Example 2: Natural images
• image segmentation
• image ranking (retrieval from databases), image clustering
4
Example 3 – a little twist: Learning functional
clusters from non-functional data
[Figure: Spatially varying mixture distributions (locations u vs. Y)]
• data are daily hormone levels from a number of un-identified women
(not necessarily realizations of curves)
– due to confidentiality or lack of information, hormone levels collected
on different days may belong to different (groups of) women
• local clusters = typical daily hormone behaviors in a menstrual cycle
• global (functional) clusters = typical monthly hormone behaviors
5
Talk outline
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectories and image segmentation
• Functional clustering from (possibly) non-functional data
– applications to data with partial information (e.g., confidentiality
constraints)
– nested hierarchy of Dirichlet process mixtures
6-a
Global clustering via mixture modeling
curve replicate
θ∗1
θ∗2
θ∗3
• Curves are functions defined on domain D
• Specify a mixing distribution over the space of entire curves
θ ∼ G = ∑_{j=1}^∞ π_j δ_{θ*_j},   θ*_j iid∼ G0
– proportions {π_j} are drawn from a “stick-breaking” prior
(Gelfand, Kottas, MacEachern, JASA 2005)
– this allows “global” clustering, but not local clustering!
7
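The stick-breaking prior above can be sketched numerically. This is a minimal illustration, assuming a finite truncation level J and a toy base measure G0 (a 5-point Gaussian vector standing in for a draw of an entire curve):

```python
import numpy as np

def stick_breaking_weights(alpha, J, rng):
    """Truncated stick-breaking (GEM) draw of mixture proportions pi_1..pi_J."""
    v = rng.beta(1.0, alpha, size=J)
    v[-1] = 1.0  # close the final stick so the weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def draw_mixing_distribution(alpha, J, base_draw, rng):
    """Sample a discrete mixing distribution G = sum_j pi_j * delta_{theta*_j}."""
    pi = stick_breaking_weights(alpha, J, rng)
    atoms = np.array([base_draw(rng) for _ in range(J)])
    return pi, atoms

rng = np.random.default_rng(0)
# toy G0: each canonical "curve" is a 5-point Gaussian vector (assumption)
pi, atoms = draw_mixing_distribution(1.0, 50, lambda r: r.normal(size=5), rng)
```

A draw θ ∼ G then amounts to picking atom j with probability π_j, which is what makes global clustering of entire curves possible.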
Local clustering
[Figure: replicates Y(x1), Y(x2) with local atoms θ*_j(x1), θ*_j(x2) and marginal mixing distributions Gx1, Gx2]
• At each location x ∈ D, specify a mixing distribution for the marginal
Gx of curve value Y (x):
Gx = ∑_{j=1}^∞ π_j(x) δ_{θ*_j(x)}
– spatial coupling between Gx (MacEachern, 2000; Teh, Jordan, Blei, Beal,
2006; Dunson & Park, 2006; Griffin & Steel, 2006; etc)
– there is no global clustering!
8
Our modeling approach
Y (x), x ∈ D L(x), x ∈ D
• entire collection described by a number of “canonical” curves
• each curve can be viewed as a “hybrid” species
(Duan et al, 2005; Petrone et al, 2007)
• canonical curve allocation via a labeling process
• “hybrid” view also popular in other contexts, e.g., population genetics
and text analysis (Pritchard et al, 2001; Blei et al, 2003)
9
[Graphical model with nodes G0, k, p, τ, θ*, L, θ, Y over n replicates; panels show Y(x), x ∈ D and L(x), x ∈ D]
• n replicates Y1, . . . , Yn observed at m locations x1, . . . , xm
• k canonical curves θ*_j iid∼ G0, j = 1, . . . , k
• random label functions Li : D → {1, . . . , k}
Li | p iid∼ p, i = 1, . . . , n
• given θ* and L, hybrid curve θi satisfies: θi(x) = θ*_{Li(x)}(x)
• mixing with a pure error process
Yi(x) | θi(x) ∼ N(θi(x), τ²)
10-c
Labeling process L
• q is probability measure on L ∈ {1, . . . , k}D
– Desideratum: q allows spatial dependency of labels within the same curve
• Draw η from a mean zero Gaussian process with covariance function
ρ(x1, x2) = exp{−φ‖x1 − x2‖},
where φ is the scale parameter
• Threshold η to obtain L ∼ q
[Figure: a realization of η thresholded at levels t1, t2, . . . into labels j = 2, . . . , 7; e.g., L1 = 7, L2 = 6]
11-a
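The two steps above (draw η from the GP, then threshold) can be sketched as follows; the choice of equiprobable cutpoints, which gives each label marginal probability 1/k at every location, is one concrete assumption:

```python
import numpy as np
from math import erf

def draw_label_function(xs, k, phi, rng):
    """One draw L ~ q: threshold a mean-zero GP with covariance
    rho(x1, x2) = exp(-phi * |x1 - x2|) into k labels."""
    xs = np.asarray(xs, float)
    cov = np.exp(-phi * np.abs(np.subtract.outer(xs, xs)))
    eta = rng.multivariate_normal(np.zeros(len(xs)), cov + 1e-9 * np.eye(len(xs)))
    # map eta through the standard normal CDF, then cut [0, 1] into k equal
    # bins, so each label has marginal probability 1/k at every location
    u = 0.5 * (1.0 + np.vectorize(erf)(eta / np.sqrt(2.0)))
    return np.minimum((k * u).astype(int), k - 1)

rng = np.random.default_rng(0)
labels = draw_label_function(np.linspace(0, 10, 60), k=7, phi=0.5, rng=rng)
```

Larger φ makes the covariance decay faster, so labels switch more often along the domain.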
Random measure p on labeling functions L
• Want p to be a random probability measure, whose mean measure is
the measure q defined above
– Desideratum: random p facilitates labeling sharing across curves
[Figure, left: C.I. for q; right: C.I. for a realization of p — label functions on D valued in {1 . . . k}, at locations x1, x2, x3]
• Definition: p is a random measure generated by a Dirichlet process
with base measure q and concentration parameter α
p ∼ DP (αq)
L|p ∼ p
12-a
• i.e., for arbitrary x1, . . . , xm, L takes k^m possible values, whose
associated probabilities are given by the k^m-dim vector p_{x1,...,xm}
– p_{x1,...,xm} has a k^m-dim Dirichlet distribution:
(p_{x1,...,xm}(·)) ∼ Dir(α q_{x1,...,xm}(·)),
where q is the probability measure on {1, . . . , k}^D.
13
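The finite-dimensional Dirichlet marginal can be illustrated directly; here a toy base pmf stands in for the thresholded-GP measure q:

```python
import numpy as np
from itertools import product

k, m, alpha = 3, 4, 2.0
rng = np.random.default_rng(1)

configs = list(product(range(k), repeat=m))     # all k^m label configurations
q_probs = rng.dirichlet(np.ones(len(configs)))  # toy base pmf q (assumption)

# finite-dimensional marginal: the probability vector of p over the k^m
# configurations is Dirichlet with parameters alpha * q
p_probs = rng.dirichlet(alpha * q_probs)

L = configs[rng.choice(len(configs), p=p_probs)]  # one label draw L | p ~ p
```

Small α concentrates p_probs on few configurations (strong label sharing across curves); large α pulls p back toward q.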
Illustration of label properties
• Spatial dependence of labels is driven by φ
Posterior distribution mode of the labels for the leftmost image
φ = 0.5, 0.05, 0.01
• Local clustering: Let L1, L2 | p iid∼ p. Then, unconditionally,
P(L1(x) = L2(x)) = 1/(α + 1) + (1/k) · α/(α + 1).
• Global clustering:
P(L1 = L2) = 1/(α + 1).
14-a
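The local-clustering formula can be checked by Monte Carlo at a single location, using the Polya-urn view of p ∼ DP(αq): the second draw repeats the first with probability 1/(α+1), and otherwise is a fresh draw from q (whose marginal at one location is uniform on {1, . . . , k}). A sketch:

```python
import numpy as np

def co_label_prob(alpha, k, n_sim, rng):
    """Monte Carlo estimate of P(L1(x) = L2(x)) for L1, L2 | p iid~ p,
    p ~ DP(alpha * q), via the Polya urn: L2 repeats L1 w.p. 1/(alpha+1),
    else is a fresh draw from q, uniform on {1..k} at a single x."""
    l1 = rng.integers(k, size=n_sim)
    repeat = rng.random(n_sim) < 1.0 / (alpha + 1.0)
    l2 = np.where(repeat, l1, rng.integers(k, size=n_sim))
    return float(np.mean(l1 == l2))

alpha, k = 2.0, 4
rng = np.random.default_rng(0)
est = co_label_prob(alpha, k, 200_000, rng)
theory = 1 / (alpha + 1) + (1 / k) * alpha / (alpha + 1)  # = 0.5 here
```

The estimate should agree with the formula to Monte Carlo accuracy.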
Hierarchical model behavior and identifiability
G0
k
p τ
Lθ∗
θ Y
n
• interplay of three distributions
– canonical curve distribution G0, labeling process p (driven by φ), and pure
error process (driven by τ)
• additional constraints/priors available depending on modeling goals
– inference of canonical curves (e.g., hormone analysis):
⇒ roles of φ and τ
– inference of labels (e.g., image segmentation):
⇒ roles of G0
– predictive posterior inference (e.g., on new locations/curves)
15
Model fitting via MCMC
• Gibbs updates of canonical curves are standard
• Updates of precision parameters α, τ are also standard
(Escobar & West, 1995)
• The most computationally intensive part is the update of labels L1, . . . , Ln
and label switching parameter φ
– the sampler cannot handle this step exactly
– the challenge lies in the (latent) high dimensional thresholded Gaussian
process q nested within the Dirichlet process
• We propose a variational inference method to obtain approximate
posterior distributions for labels L and label parameter φ
16
Variational Bayesian inference
• The core of Bayesian inference is the posterior computation:
posterior ∝ likelihood × prior
w(θ | X1, . . . , Xn) ∝ ∏_{i=1}^n P(Xi | θ) π(θ)
• w(· | X) is the solution to the following optimization problem:
w(· | X1, . . . , Xn) = argmin_{w(·)} − ∫_θ ∑_{i=1}^n log P(Xi | θ) dw(θ) + D(w‖π),
where D denotes the Kullback-Leibler (KL) divergence
• If inference is intractable, consider modifying either loss function or the
space of distributions
17
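The variational characterization can be verified on a toy discrete parameter, where the exact posterior is computable and attains the minimum of the objective (the three-point parameter space and Bernoulli likelihood are illustrative assumptions):

```python
import numpy as np

def kl_objective(w, log_lik, prior):
    """Evaluate -E_w[log P(X|theta)] + KL(w || prior) for discrete theta."""
    w = np.asarray(w, float)
    mask = w > 0
    return -(w * log_lik).sum() + (w[mask] * np.log(w[mask] / prior[mask])).sum()

# toy discrete parameter with 3 values (assumption), Bernoulli likelihood
thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([0.3, 0.4, 0.3])
x = np.array([1, 1, 0, 1])  # observed coin flips
s = x.sum()
log_lik = np.array([np.log(t) * s + np.log(1 - t) * (len(x) - s) for t in thetas])

post = prior * np.exp(log_lik)   # Bayes rule: posterior ∝ likelihood × prior
post /= post.sum()

# the exact posterior attains the minimum of the variational objective
rng = np.random.default_rng(0)
for _ in range(100):
    w = rng.dirichlet(np.ones(3))
    assert kl_objective(post, log_lik, prior) <= kl_objective(w, log_lik, prior) + 1e-9
```

Variational methods exploit this: restricting the search space W (or simplifying the loss) gives a tractable approximation to the minimizer.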
Variational approximation for q(L(x)|φ,Data)
[Figure: projection of q onto the family of MRFs Q_E, with edge sets E1, E2 and divergence D(q‖qE), over locations x1, x2, x3]
• Let E be a subset of edges connecting pairs of locations x1, . . . , xm
• qE is the best approximation (in the sense of KL divergence) among all
Markov random fields (MRF) QE using pairwise potential functions
qE = argmin_{Q∈Q_E} D(q‖Q).
• If E1 ⊃ E2 then D(q‖q_{E1}) ≤ D(q‖q_{E2})
• How to choose the collection of edges E?
– e.g., shortest spanning tree for x1, . . . , xm, resulting in computational
complexity of inference that is linear in m
18-a
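Reading “shortest spanning tree” as the minimum spanning tree of the pairwise distances, the edge set E can be computed with a small Prim-style routine (a sketch for a 1-d domain; the example locations are made up):

```python
import numpy as np

def minimum_spanning_tree_edges(xs):
    """Prim's algorithm on pairwise distances: choose E as a shortest spanning
    tree of the locations, so q_E is tree-structured and exact inference in the
    approximating MRF is linear in the number of locations m."""
    xs = np.asarray(xs, float)
    m = len(xs)
    d = np.abs(np.subtract.outer(xs, xs))  # pairwise distances on a 1-d domain
    in_tree, out, edges = [0], set(range(1, m)), []
    while out:
        # cheapest edge from the current tree to an uncovered location
        i, j = min(((i, j) for i in in_tree for j in out), key=lambda e: d[e])
        edges.append((i, j))
        in_tree.append(j)
        out.remove(j)
    return edges

xs = np.array([0.0, 2.5, 1.0, 4.0, 3.2])
E = minimum_spanning_tree_edges(xs)
```

On a tree, the pairwise MRF q_E factorizes over its m − 1 edges, which is what makes the KL projection cheap.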
Variational approximation for φ|L,Data
• Resultant posterior distribution for φ:
P(φ | L) ∝ ∏_{(xi,xj)∈E} q(L(xi), L(xj) | φ)^λ π(φ)
i.e., P(· | L) = argmin_{w(·)∈W} ∫_φ λ ∑_{(xi,xj)∈E} − log q(L(xi), L(xj) | φ) dw(φ) + D(w‖π)
• P(φ | L) shrinks toward the true φ* in an appropriately defined metric:
Proposition 1. Suppose that L = (L(x1), . . . , L(xm)) is drawn from
q for some “true” φ* > 0, and that π(·) puts sufficient mass around
φ*. Then, under the distribution q,
∫ (1/|E|) |log(P(φ*|L)/P(φ|L))| dP(φ|L) = O_P(1/m).
• The posterior mode converges in probability to φ* at the rate √(r_m)/m, where
r_m is the number of edges whose lengths are bounded by a fixed constant
19-a
Predictive posterior distributions
• a simulated example on 1-dim domain D = [1, 40]
[Figure: predictive posterior densities of Y at locations 2, 4, 6, 8, 10, 12, 14, 16, 26, 28, 30, 32]
20
Progesterone hormone data
• Inference of canonical curves
– contraception grouping is unknown to the analysis
[Figure, left: Progesterone data (daily index vs. log(PGD)); right: fit with φ = .05 (global clustering)]
• Left figure: Distinctive canonical behaviors emerge approximately 7 days
after ovulation date
• Right figure: φ is set very small to insist on “global” clusters
⇒ global clustering not possible!
21
Pairwise comparison of hormone curves
[Figure: two 85 × 85 heatmaps of the proportion of equal labels for pairs of replicates]
• Proportion of equal labels for pairs of replicates
– left: for the whole curve
– right: for a curve segment [20, 24]
(last 5 days of the monitored cycle)
• Curves indexed from 67 to 88 belong to the contraception group
22
Detecting unusual hormone levels
[Figure: Unusual hormone levels (daily index vs. log(PGD))]
• Curves numbered 12, 27, 30, 50, 74, 82, 85 show a large departure
from the “mean” behavior
23
Image segmentation
• canonical curves are random constant functions
• image segments correspond to (random) level sets
24
Similar image retrieval
• For a given image (leftmost), list 5 most “similar” images
[Figure: retrieval results for queries labeled “happy cows”, “trees”, “cars”, “aircrafts”, “buildings”, “faces”]
25
Summary
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectory curves and image segmentation
• Dirichlet labeling processes provide a useful alternative to existing mod-
eling frameworks, e.g., Markov random fields
– label behavior easy to interpret (via latent hierarchies)
– label behavior quite rich (via latent stochastic processes)
– computationally tractable (compared to MRFs)
• Functional clustering from non-functional grouped data
– applications in data with partial information (e.g., confidentiality
constraints)
– nested hierarchical Dirichlet processes
26
Recall – a little twist: Learning functional
clusters from non-functional data
[Figure: Spatially varying mixture distributions (locations u vs. Y)]
• data are daily hormone levels from a number of un-identified women
(not necessarily realizations of curves)
– due to confidentiality or lack of information, hormone levels collected
on different days may belong to different (groups of) women
• local clusters = typical daily hormone behaviors in a menstrual cycle
• global (functional) clusters = typical monthly hormone behaviors
27
Set-up of non-functional grouped data
• Non-functional data set-up: for group u and data yui:
yui | θui ∼ F(· | θui)
θui iid∼ Gu, for i = 1, 2, . . .
• Gu is the mixing distribution for local clusters associated with group u
• Modeling problem:
– notion of global (functional) clusters that emerge when
all groups aggregate
– specification of distribution Q for global clusters
– linking up Q to the non-functional data
28-b
Formalizing the notion of global clusters
• Let Θu be the space for local atom θu associated with group u
• Define the product space Θ = ∏_{u∈V} Θu
• A global atom φ := (φu)_{u∈V} is an element of Θ
• Modeling approach: hierarchical nonparametric Bayes
– to specify a (random) distribution Q for “global” φ (using Dirichlet
process)
– Linking up Q to the mixing distribution of local clusters Gu’s via a
nested hierarchy of Dirichlet processes
29
Nested hierarchical Dirichlet processes (nHDP)
• extending the hierarchical Dirichlet process (HDP) framework
(Teh, Jordan, Blei, Beal, JASA 2006) to enable functional clustering
– while HDP is a recursive hierarchy of Dirichlet processes (DPs),
nHDP is a nested hierarchy of DPs
• key features of the nHDP approach:
– global and local atoms are related as different levels in the Bayesian
hierarchy
– fully nonparametric specification
30
Hierarchical model specification
• Data: Let yu1, yu2, . . . , yunu be the observations obtained within group
u, and assume that each observation is drawn independently from a
mixture model:
yui | θui ∼ F(· | θui)
θui | Gu iid∼ Gu
for any u ∈ V; i = 1, . . . , nu
• The Gu are given by a nested hierarchy of DPs:
Gu | αu, Q indep∼ DP(αu, Qu), for all u ∈ V
Q | γ, H ∼ DP(γ, H)
• Base measure H is taken to be a Gaussian process indexed by u ∈ V
31
Nested hierarchy of Dirichlet processes
[Diagram: H with marginals Hu, Hv, Hw; Q with marginals Qu, Qv, Qw; local mixing distributions Gu, Gv, Gw; local draws θui, θvi, θwi]
32
Stick-breaking representation
• Q provides the support for global clusters:
Q = ∑_{k=1}^∞ βk δ_{φk},
where φk = (φuk : u ∈ V) are independent draws from H, and
β = (βk)_{k=1}^∞ ∼ GEM(γ).
• Qu is the induced marginal of Q at u, while Gu centers around Qu
and provides the support for local clusters:
Qu = ∑_{k=1}^∞ βk δ_{φuk}
Gu = ∑_{k=1}^∞ πuk δ_{φuk}
33
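A truncated sketch of this construction, with an assumed toy equicorrelated Gaussian playing the role of H: the key point is that every Gu re-weights the same shared atoms, with weights Dirichlet(αu β) by the finite-dimensional property of the DP.

```python
import numpy as np

def gem(gamma, K, rng):
    """Truncated GEM(gamma) stick-breaking weights beta_1..beta_K."""
    v = rng.beta(1.0, gamma, size=K)
    v[-1] = 1.0
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

def nhdp_stick_breaking(gamma, alphas, K, rng):
    """Truncated nHDP sketch: global atoms phi_k are vectors across groups
    (toy equicorrelated Gaussian H assumed); Q puts GEM weights beta on them,
    and each G_u re-weights the SAME local atoms phi_{u,k}."""
    n_groups = len(alphas)
    beta = gem(gamma, K, rng)
    cov = 0.5 * np.ones((n_groups, n_groups)) + 0.5 * np.eye(n_groups)
    phi = rng.multivariate_normal(np.zeros(n_groups), cov, size=K)  # K x n_groups
    # finite-dimensional property of G_u | Q ~ DP(alpha_u, Q_u):
    # pi_u ~ Dirichlet(alpha_u * beta) over the shared atoms
    pis = np.array([rng.dirichlet(a * beta) for a in alphas])
    return beta, phi, pis

rng = np.random.default_rng(0)
beta, phi, pis = nhdp_stick_breaking(gamma=2.0, alphas=[1.0, 1.0, 5.0], K=30, rng=rng)
```

A larger αu pulls the group weights πu toward β, i.e., toward the global weights.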
Polya-urn characterization
• Sampling of local atoms distributed by Gu (which is integrated out):
θui | θu1, . . . , θu,i−1, αu, Q ∼ ∑_{t=1}^{mu} nut/(i − 1 + αu) δ_{ψut} + αu/(i − 1 + αu) Qu.
– Qu is the induced marginal of distribution Q
• Sampling of global atoms distributed by Q (which is integrated out):
ψt | {ψl}_{l≠t}, γ, H ∼ ∑_{k=1}^K qk/(q· + γ) δ_{φk} + γ/(q· + γ) H.
34-a
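Both urns above can be simulated with the same generic routine; the base measures here (scalar normals in place of H, and resampling from the simulated global draws in place of the exact Qu) are simplifying assumptions:

```python
import numpy as np

def polya_urn_draws(alpha, n, base_draw, rng):
    """Draw theta_1..theta_n from a DP-distributed measure (integrated out)
    via the Polya urn: repeat a past value w.p. (i-1)/(i-1+alpha) -- which
    gives each value weight count/(i-1+alpha) -- else draw fresh from base."""
    draws = []
    for i in range(1, n + 1):
        if draws and rng.random() < (i - 1) / (i - 1 + alpha):
            draws.append(draws[rng.integers(len(draws))])  # uniform past draw
        else:
            draws.append(base_draw(rng))
    return draws

rng = np.random.default_rng(0)
# global urn: draws from Q (scalar normals standing in for curve-valued H)
psi = polya_urn_draws(alpha=1.0, n=20, base_draw=lambda r: r.normal(), rng=rng)
# local urn for one group u: base measure approximated by resampling psi
theta_u = polya_urn_draws(alpha=2.0, n=50,
                          base_draw=lambda r: psi[r.integers(len(psi))], rng=rng)
```

Picking a past draw uniformly inside the repeat branch reproduces the count-proportional weights of the urn scheme above.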
Sharing lunch in a communal eatery
global atoms = lunch boxes = φ = (φu)u∈V
local atoms = lunch items = φu
u indexes categories, e.g., appetizer or main entree
• lunch items are consumed based on popularity within their category
• lunch boxes are bought based on their popularity, and are shared by all
[Diagram: global atoms {φk} with components φuk, φvk; tables {ψt} with components ψut, ψvt; local draws {θui}, {θvi}]
Statistical problem: Inference about distribution of lunch boxes bought
based on distributions of lunch items consumed
35
Spatial dependence among Gu’s
• spatial dependence conferred by the base measure H entails spatial
dependence among the local distributions Gu
• suppose that H is a Gaussian process, φ = (φu : u ∈ V ) ∼ N(µ, Σ),
where Σ takes standard exponential form
• Corr(Gu(A), Gv(B)) → 1 as u → v, and vanishes as ‖u − v‖ → ∞
• Corr(Gu(A), Gv(B)) increases as either αu or αv increases
Relations between the two levels in the Bayesian hierarchy:
• Corr(Gu(A), Gv(B)) ≤ Corr(Qu(A), Qv(B))
• as γ ranges from 0 to ∞, the correlation ratio
Corr(Gu(A), Gv(B))/Corr(Qu(A), Qv(B))
decreases from 1 to 0.
36
Posterior consistency and identifiability
• The nested hierarchical Dirichlet process prior placed on the
space of joint densities of (yu)_{u∈V} yields consistent posterior distributions
• Under additional assumptions on the base measure H, it can be shown that
the global atoms (i.e., those distributed by Q) can be identified given a collection
37
Posterior inference
• Hierarchical model is amenable to Gibbs sampling
– sampling local atoms by integrating out Gu’s
– sampling global atoms by integrating out Q, and possibly the base
measure H
• Conditional distribution of DP-distributed measure is again a Dirichlet
process
• Computational speedup is achieved by replacing the spatial process H
by a graphical model
– inference for tree-structured or chain-structured models requires time
linear in the number of locations u
38
Exploiting the stick-breaking representation
Construct a Markov chain on the space of stick-breaking representations (z, q, β, φ).
Sampling β: β | q ∼ Dir(q1, . . . , qK, γ).
Sampling z:
p(zui = k | z−ui, q, β, φk, Data) ∝
  (n^{−ui}_{u·k} + αu βk) F(yui | φuk)   if k previously used,
  αu β_new f^{yui}_{u,k^new}(yui)        if k = k^new.
Sampling q: qk = ∑_{u∈V} muk, where
p(muk = m | z, m−uk, β) = [Γ(αuβk)/Γ(αuβk + nu·k)] s(nu·k, m) (αuβk)^m.
Sampling φ:
p(φk | z, Data) ∝ H(φk) ∏_{ui: zui=k} F(yui | φuk), for each k = 1, . . . , K.
39
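In practice the table counts muk need not be sampled via the Stirling numbers s(nu·k, m) directly: the same distribution is obtained by simulating a Chinese restaurant with concentration αuβk, a standard equivalent trick (sketched here, not the authors' code):

```python
import numpy as np

def sample_table_count(alpha_beta, n, rng):
    """Sample m ~ p(m) ∝ s(n, m) * alpha_beta^m * Gamma(alpha_beta)/Gamma(alpha_beta + n)
    by seating n customers in a Chinese restaurant with concentration
    alpha_beta: customer i opens a new table w.p. alpha_beta/(alpha_beta + i - 1)."""
    m = 0
    for i in range(1, n + 1):
        if rng.random() < alpha_beta / (alpha_beta + i - 1):
            m += 1
    return m

rng = np.random.default_rng(0)
samples = [sample_table_count(1.0, 10, rng) for _ in range(5000)]
# E[m] = sum_{i=1}^{n} alpha_beta/(alpha_beta + i - 1); for alpha_beta = 1,
# n = 10, this is the harmonic number H_10 ≈ 2.93
```

This avoids computing (and normalizing over) the Stirling numbers for every Gibbs update.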
Tracking example
[Figure: Spatially varying mixture distributions (locations u vs. Y)]
Prior specification:
• concentration parameters γ ∼ Gamma(5, .1) and α ∼ Gamma(20, 20)
• variance σ²_ε of F(·) is given prior InvGamma(5, 1)
• prior for global atoms H is a mean-0 Gaussian process using (σ, ω) = (1, 0.01)
40
Clustering bifurcation behavior
[Figure: Spatially varying mixture distributions (locations u vs. Y)]
• prior for global atoms H is a mean-0 Gaussian process using (σ, ω) = (1, 0.05)
• other prior specifications are the same as previous data example
41
Inference of global clusters (tracks)
[Figure, left: posterior of the number of global clusters; right: posterior distributions of global cluster centers (locations u vs. Y)]
Left: Number of global clusters is 5 with > 90% posterior probability
Right: (.05,.95) credible intervals of global cluster estimates
42
Global clusters of bifurcating behavior
[Figure, left: posterior of the number of global clusters; right: posterior distributions of global cluster centers (locations u vs. Y)]
Left: Number of global clusters is 3 with > 90% posterior probability
Right: (.05,.95) credible intervals of global cluster estimates
43
Evolution of local clusters
Posterior distribution of the number of local clusters associating with
different group index (location) u.
[Figure: posterior distributions of the number of local clusters at locations u = 1, 3, 5, 8, 10, 11, 13, 15]
44
Effects of vague prior for H
[Figure, left: posterior of the number of global clusters; right: posterior distributions of global cluster centers (locations u vs. Y)]
Very vague prior for H results in weak identifiability of global clusters,
even as the local clusters are identified reasonably well.
45
Clustering progesterone hormone
[Figure: Progesterone hormone curves (locations u vs. Y)]
• Hormone levels collected from a number of women
• Subject ids are withheld, so hormone trajectories are not given
• Comparison to the hybrid DP approach (Petrone et al, 2009), which does
use the trajectory information
46
Temporally varying number of local clusters
[Figure: posterior distributions of the number of local clusters at locations u = 5, 18, 22, 24]
Number of global clusters:
[Figure: posterior distribution of the number of global clusters]
47
Estimates of global clusters
[Figure: two panels of posterior distributions of global cluster centers (locations u vs. Y)]
Left: Clustering results using the nHDP mixture model
Right: The hybrid-DP approach of Petrone, Guindani and Gelfand (2009)
Black solid lines are sample mean curves of the contraceptive and
non-contraceptive groups
48
Pairwise comparison of hormone curves
[Figure: two 60 × 60 heatmaps of pairwise posterior label agreement]
Each entry in the heatmap depicts the posterior probability that the two
curves share the same local clusters, averaged over the last 4 days in the
menstrual cycle
nHDP approach (Left panel) provides sharper clusterings than the hybrid
DP approach (Right panel)
49
Modeling of temperature/depth in Atlantic ocean
[Figure, left: temperature vs. depth profiles; right: map of numbered sampling stations in the Atlantic (longitude vs. latitude), marked by groups gr 1–gr 4]
• data are (temp, depth) samples collected at 4 different locations at
different times
• functional clustering within each location
• functional comparisons (ANOVA) between locations
50
Model formalization
• Data are collection of Yux(i), for i = 1, . . . , nux
• u indexes group (location), x indexes depth
• At each location, there are global clusters representing typical functional
relationship between depth vs temp
• Modeling data associated with each location using a nHDP: for each
u = 1, . . . , 4:
{θux(i)} | γ, α, H iid∼ nHDP(γ, α, H)
Yux(i) | θux(i) indep∼ N(θux(i), τ²_u)
• Base probability measure H is a draw from a Dirichlet process:
H ∼ DP(α0, H0),
H0 ≡ GP(µ0, Σ0)
51
Fully nonparametric hierarchical specification
H0 ≡ GP(µ0, Σ0)
H | H0 ∼ DP(γ, H0),
Qu | H iid∼ DP(α0, H), for all u ∈ V
Gu;x | Qu indep∼ DP(αu, Qu;x).
As before,
θux(i) | Gu;x ∼ Gu;x for all i = 1, . . . , nu; u ∈ V
Yux(i) | θux(i) ∼ N(θux(i), τ²_u) for all i = 1, . . . , nu; u ∈ V.
52
Number of functional clusters
[Figure: posterior distribution of the number of functional clusters, all groups]
53
Posterior distribution of global atoms
[Figure: posterior means and credible intervals of functional mean curves φ1, . . . , φ6]
54
Posterior mean/std of mixing proportions
of the dominant functional clusters for each group of data
group (u) πu1 πu2 πu3 πu4 πu5 πu6 πu7
1 0.98 (0.01) 0.00 (0) 0.00 (0) 0.0022 (0) 0.00 (0) 0.00 (0) 0.00 (0)
2 0.07 (0.20) 0.70 (0.16) 0.08 (0.05) 0.06 (0.03) 0.01 (0.02) 0.02 (0.04) 0.00 (0)
3 0.08 (0.24) 0.01 (0.02) 0.01 (0.02) 0.01 (0.02) 0.86 (0.24) 0.00 (0.01) 0.00 (0)
4 0.07 (0.23) 0.01 (0.02) 0.01 (0.03) 0.01 (0.02) 0.86 (0.22) 0.00 (0.01) 0.00 (0)
[Figure: estimated dominant functional clusters (depth vs. temperature)]
55
Posterior mean of noise variance
group (u) 1 2 3 4
mean/std for τu 0.50 (0.22) 1.15 (2.13) 2.13 (0.29) 1.88 (0.25)
56
Varying number of local clusters with depth
[Figure: number of local clusters vs. depth for Groups 1–4]
Posterior mean (solid) and (.05,.95) credible intervals (dash)
57
Summary
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectory curves and image segmentation
– appears in Nguyen & Gelfand (Statistica Sinica, 2011)
• Learning functional clusters from non-functional grouped data
– applications in data with partial information (e.g., confidentiality
constraints)
– nested hierarchical Dirichlet processes
– appears in Nguyen (Bayesian Analysis, 2010)
• This talk focuses mainly on modeling and computation
– the identifiability and posterior consistency of nonparametric
mixing distributions in hierarchical models are currently
under investigation
58