Dirichlet labeling and hierarchical
processes for clustering functional data
XuanLong Nguyen
Department of Statistics
University of Michigan
Joint work with Alan Gelfand, Duke University
1
Functional data clustering
• Many applications involve data that can be viewed as functions
(e.g., curves, surfaces, multivariate functions)
– borrowing statistical information across spatial domain
– borrowing across replicates in the collection
• Model-based clustering using nonparametric labeling processes
– differs from, e.g., Ramsay & Silverman (2002), Ferraty & Vieu (2006)
– global/local clustering and partitioning of curves
– application to progesterone hormone analysis
– application to image analysis
• Variational Bayesian inference for labeling processes,
i.e., implicit optimization over the space of posterior distributions
2
Example 1: Clustering curves
[Figure: Progesterone data — log(PGD) vs. daily index]
• To characterize hormone levels in terms of typical behaviors
– global clustering of entire curves; local clustering on a temporal segment
– effects of contraception, detection of anomalous behavior
3
Example 2: Natural images
• image segmentation
• image ranking (retrieval from databases), image clustering
4
Example 3 – a little twist: Learning functional
clusters from non-functional data
[Figure: spatially varying mixture distributions — Y vs. locations u]
• data are daily hormone levels from a number of un-identified women
(not necessarily realizations of curves)
– due to confidentiality or lack of information, hormone levels collected on
different days may belong to different (groups of) women
• local clusters = typical daily hormone behaviors in a menstrual cycle
• global (functional) clusters = typical monthly hormone behaviors
5
Talk outline
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectories and image segmentation
• Functional clustering from (possibly) non-functional data
– applications to data with partial information (e.g., confidentiality
constraints)
– nested hierarchy of Dirichlet process mixtures
6
Global clustering via mixture modeling
[Figure: curve replicates allocated to canonical curves θ∗1, θ∗2, θ∗3]
• Curves are functions defined on domain D
• Specify a mixing distribution over the space of entire curves
θ ∼ G = Σ_{j=1}^∞ π_j δ_{θ∗_j},    θ∗_j ∼ G0
– proportions {π_j} are drawn from a “stick-breaking” prior
(Gelfand, Kottas, MacEachern, JASA 2005)
– this allows “global” clustering, but not local clustering!
7
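A minimal sketch (not from the talk) of the global mixture just described: a truncated stick-breaking draw of G whose atoms θ∗_j come from an assumed Gaussian-process G0 on a grid, so that each replicate copies one entire canonical curve. Function names and constants are illustrative.

    import numpy as np

    def stick_breaking_weights(alpha, trunc, rng):
        """Truncated stick-breaking proportions pi_1, ..., pi_trunc."""
        v = rng.beta(1.0, alpha, size=trunc)
        return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    def draw_global_mixture(n_curves, m, alpha=1.0, trunc=20, seed=0):
        rng = np.random.default_rng(seed)
        x = np.linspace(0.0, 1.0, m)
        # G0: an assumed zero-mean GP prior over curves on the grid x
        K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2) + 1e-8 * np.eye(m)
        atoms = rng.multivariate_normal(np.zeros(m), K, size=trunc)  # canonical curves theta*_j
        pi = stick_breaking_weights(alpha, trunc, rng)
        labels = rng.choice(trunc, size=n_curves, p=pi / pi.sum())   # one label per whole curve
        return atoms[labels], labels                                 # "global" clustering only

    curves, labels = draw_global_mixture(n_curves=10, m=50)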
Local clustering
[Figure: marginal mixing distributions Gx1 and Gx2 — replicate values Y(x1), Y(x2) allocated to atoms θ∗j(x1), θ∗j(x2)]
• At each location x ∈ D, specify a mixing distribution for the marginal
Gx of curve value Y (x):
Gx = Σ_{j=1}^∞ π_j(x) δ_{θ∗_j(x)}
– spatial coupling between Gx (MacEachern, 2000; Teh, Jordan, Blei, Beal,
2006; Dunson & Park, 2006; Griffin & Steel, 2006; etc)
– there is no global clustering!
8
Our modeling approach
[Figure: data surface Y(x), x ∈ D (left) and label surface L(x), x ∈ D (right)]
• entire collection described by a number of “canonical” curves
• each curve can be viewed as a “hybrid” species
(Duan et al, 2005; Petrone et al, 2007)
• canonical curve allocation via a labeling process
• “hybrid” view also popular in other contexts, e.g., population genetics
and text analysis (Pritchard et al, 2001; Blei et al, 2003)
9
[Graphical model for the hierarchy: G0 and k generate θ∗, p generates L, θ∗ and L give θ, and θ with noise τ generates Y, over n replicates. Right: data surface Y(x), x ∈ D and label surface L(x), x ∈ D]
• n replicates Y1, . . . , Yn observed at m locations x1, . . . , xm
• k canonical curves θ∗_j iid∼ G0, j = 1, . . . , k
• random label functions Li : D → {1, . . . , k}
L_i | p iid∼ p, i = 1, . . . , n
• given θ∗ and L, hybrid curve θ_i satisfies: θ_i(x) = θ∗_{L_i(x)}(x)
• mixing with a pure error process
Y_i(x) | θ_i(x) ∼ N(θ_i(x), τ²)
10
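A hedged generative sketch of this hierarchy (assumptions: G0 is a zero-mean GP, and the label functions below are drawn iid over locations merely as a stand-in for the Dirichlet labeling process defined next):

    import numpy as np

    def generate_hybrid_curves(n, m, k, tau=0.1, seed=0):
        rng = np.random.default_rng(seed)
        x = np.linspace(0.0, 1.0, m)
        # canonical curves theta*_1, ..., theta*_k drawn iid from an assumed GP G0
        K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2) + 1e-8 * np.eye(m)
        theta_star = rng.multivariate_normal(np.zeros(m), K, size=k)
        # placeholder label functions L_i : D -> {1, ..., k} (iid here, not the labeling process)
        L = rng.integers(0, k, size=(n, m))
        theta = theta_star[L, np.arange(m)]            # theta_i(x) = theta*_{L_i(x)}(x)
        Y = theta + tau * rng.standard_normal((n, m))  # Y_i(x) | theta_i(x) ~ N(theta_i(x), tau^2)
        return x, theta_star, L, Y

    x, theta_star, L, Y = generate_hybrid_curves(n=5, m=40, k=3)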
Labeling process L
• q is a probability measure on label functions L ∈ {1, . . . , k}^D
– Desideratum: q allows spatial dependency of labels within the same curve
• Draw η from a mean zero Gaussian process with covariance function
ρ(x1, x2) = exp{−φ‖x1 − x2‖},
where φ is the scale parameter
• Threshold η to obtain L ∼ q
[Figure: a realization η cut at thresholds t1, t2, . . . into label levels j = 2, . . . , 7; e.g., L1 = 7, L2 = 6]
11
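A minimal sketch of one draw L ∼ q via thresholding, assuming equal-probability thresholds (standard-normal quantiles), which makes the marginal of each L(x) uniform on {1, . . . , k}; the covariance follows the exponential form above:

    import numpy as np
    from scipy.stats import norm

    def draw_label_function(x, k, phi, rng):
        """One draw L ~ q on the 1-D locations x."""
        C = np.exp(-phi * np.abs(x[:, None] - x[None, :])) + 1e-8 * np.eye(len(x))
        eta = rng.multivariate_normal(np.zeros(len(x)), C)
        thresholds = norm.ppf(np.arange(1, k) / k)      # k - 1 cut points
        return np.searchsorted(thresholds, eta) + 1     # labels in {1, ..., k}

    rng = np.random.default_rng(1)
    L = draw_label_function(np.linspace(0.0, 1.0, 30), k=5, phi=0.5, rng=rng)

Smaller φ gives covariance close to 1 over longer distances, hence labels that change less often across the domain.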
Random measure p on labeling functions L
• Want p to be a random probability measure, whose mean measure is
the measure q defined above
– Desideratum: random p facilitates label sharing across curves
[Figure: C.I. for q (left) and C.I. for a realization of p (right), over labels {1...k} at locations x1, x2, x3]
• Definition: p is a random measure generated by a Dirichlet process
with base measure q and concentration parameter α
p ∼ DP (αq)
L|p ∼ p
• i.e., for arbitrary x1, . . . , xm, L takes k^m possible values, whose
associated probabilities are given by the k^m-dimensional vector p_{x1,...,xm}
– p_{x1,...,xm} has a k^m-dimensional Dirichlet distribution:
(p_{x1,...,xm}(·)) ∼ Dir(α q_{x1,...,xm}(·)),
where q is the probability measure on {1, . . . , k}^D.
13
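A hedged sketch of drawing label functions through p ∼ DP(αq): a truncated stick-breaking draw whose atoms are entire label functions from q (the thresholded GP above), so different curves can reuse an identical label function while each atom still varies over x. The truncation level and constants are illustrative assumptions.

    import numpy as np
    from scipy.stats import norm

    def draw_label_function(x, k, phi, rng):
        C = np.exp(-phi * np.abs(x[:, None] - x[None, :])) + 1e-8 * np.eye(len(x))
        eta = rng.multivariate_normal(np.zeros(len(x)), C)
        return np.searchsorted(norm.ppf(np.arange(1, k) / k), eta) + 1

    def draw_labels_from_dp(n_curves, x, k, phi, alpha, trunc=50, seed=2):
        rng = np.random.default_rng(seed)
        atoms = np.stack([draw_label_function(x, k, phi, rng) for _ in range(trunc)])
        v = rng.beta(1.0, alpha, size=trunc)
        w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        idx = rng.choice(trunc, size=n_curves, p=w / w.sum())
        return atoms[idx]                     # row i is the label function L_i

    L = draw_labels_from_dp(n_curves=8, x=np.linspace(0, 1, 30), k=5, phi=0.5, alpha=1.0)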
Illustration of label properties
• Spatial dependence of labels is driven by φ
[Figure: posterior mode of the labels for the leftmost image, for φ = 0.5, 0.05, 0.01]
• Local clustering: Let L1, L2 | p iid∼ p. Then, unconditionally,
P(L1(x) = L2(x)) = 1/(α + 1) + (1/k) · α/(α + 1).
• Global clustering:
P(L1 = L2) = 1/(α + 1).
14
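A quick Monte Carlo sanity check (not in the talk) of the local-clustering probability, using a truncated stick-breaking approximation of p and assuming the marginal of q at a single location is uniform on {1, . . . , k}:

    import numpy as np

    def agreement_probability(alpha, k, n_sims=20000, trunc=200, seed=3):
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(n_sims):
            v = rng.beta(1.0, alpha, size=trunc)
            w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
            atoms = rng.integers(1, k + 1, size=trunc)          # q marginal at x: uniform labels
            l1, l2 = atoms[rng.choice(trunc, size=2, p=w / w.sum())]
            hits += (l1 == l2)
        return hits / n_sims

    alpha, k = 1.0, 5
    print(agreement_probability(alpha, k), 1/(alpha + 1) + (1/k) * alpha/(alpha + 1))

The two printed numbers should agree up to Monte Carlo and truncation error.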
Hierarchical model behavior and identifiability
[Graphical model for the hierarchy: G0, k, p, τ, θ∗, L, θ, Y over n replicates]
• interplay of three distributions
– canonical curve distribution G0, labeling process p (driven by φ), and pure
error process (driven by τ)
• additional constraints/priors available depending on modeling goals
– inference of canonical curves (e.g., hormone analysis):
⇒ roles of φ and τ
– inference of labels (e.g., image segmentation):
⇒ roles of G0
– predictive posterior inference (e.g., on new locations/curves)
15
Model fitting via MCMC
• Gibbs updates of canonical curves are standard
• Updates of precision parameters α, τ are also standard
(Escobar & West, 1995)
• The most computationally intensive part is the update of labels L1, . . . , Ln
and label switching parameter φ
– the sampler cannot handle this step exactly
– the challenge lies in the (latent) high dimensional thresholded Gaussian
process q nested within the Dirichlet process
• We propose a variational inference method to obtain an approximate posterior
distribution for the labels L and label parameter φ
16
Variational Bayesian inference
• The core of Bayesian inference is the posterior computation:
posterior ∝ likelihood × prior
w(θ | X1, . . . , Xn) ∝ π(θ) ∏_{i=1}^n P(Xi | θ)
• w(·|X) is the solution to the following optimization problem:
w(· | X1, . . . , Xn) = argmin_{w(·)}  −∫ Σ_{i=1}^n log P(Xi | θ) dw(θ) + D(w‖π),
where D denotes the Kullback–Leibler (KL) divergence
• If inference is intractable, consider modifying either the loss function or the
space of distributions
17
Variational approximation for q(L(x)|φ,Data)
[Figure: projection of q onto the family of MRFs Q_E; KL divergences D(q‖q_E) for edge sets E1, E2 over locations x1, x2, x3]
• Let E be a subset of edges connecting pairs of locations x1, . . . , xm
• qE is the best approximation (in the sense of KL divergence) among all
Markov random fields (MRF) QE using pairwise potential functions
qE = argminQ∈QED(q||Q).
• If E1 ⊃ E2 then D(q‖q_{E1}) ≤ D(q‖q_{E2})
• How to choose collection of edges E?
– e.g., shortest spanning tree for x1, . . . , xm, resulting in linear computational
complexity of inference (in m)
18
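One plausible way to build such an edge set (a sketch under the stated assumption that a shortest/minimum spanning tree is used) is scipy's csgraph routines; the pairwise potentials of q_E would then be read off the bivariate marginals of q on these edges:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def spanning_tree_edges(locations):
        """Edges (i, j) of a shortest spanning tree over the locations (an m x d array)."""
        D = squareform(pdist(np.asarray(locations, dtype=float)))
        tree = minimum_spanning_tree(D).tocoo()
        return list(zip(tree.row.tolist(), tree.col.tolist()))

    x = np.linspace(0.0, 1.0, 10).reshape(-1, 1)   # m = 10 locations on a 1-D domain
    E = spanning_tree_edges(x)                     # m - 1 edges; tree-structured q_E supports O(m) inference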
Variational approximation for φ|L,Data
• Resultant posterior distribution for φ:
P(φ | L) ∝ ∏_{(xi,xj)∈E} q(L(xi), L(xj) | φ)^λ · π(φ)
= argmin_{w(·)∈W} ∫ [ λ Σ_{(xi,xj)∈E} − log q(L(xi), L(xj) | φ) ] dw(φ) + D(w‖π)
• P(φ | L) shrinks toward the true φ∗ in an appropriately defined metric:
Proposition 1. Suppose that L = (L(x1), . . . , L(xm)) is drawn from
q for some “true” φ∗ > 0, and that π(·) puts sufficient mass around
φ∗. Then, under the distribution q,
∫ (1/|E|) | log(P(φ∗ | L)/P(φ | L)) | dP(φ | L) = O_P(1/m).
• Posterior mode converges in probability to φ∗ at the rate √r_m/m, where
r_m is the number of edges whose lengths are bounded by a fixed constant
19
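A hedged numerical sketch of this pseudo-posterior on a grid of φ values: each bivariate marginal q(L(xi), L(xj) | φ) of the thresholded GP is estimated here by crude Monte Carlo (simulate correlated normal pairs and bin them); the flat grid prior, λ, and all constants are assumptions.

    import numpy as np
    from scipy.stats import norm

    def pair_prob(a, b, rho, k, n_mc=5000, rng=None):
        """Monte Carlo estimate of P(L(xi) = a, L(xj) = b) under GP correlation rho."""
        if rng is None:
            rng = np.random.default_rng(0)
        z1 = rng.standard_normal(n_mc)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_mc)
        cuts = norm.ppf(np.arange(1, k) / k)
        l1 = np.searchsorted(cuts, z1) + 1
        l2 = np.searchsorted(cuts, z2) + 1
        return max(np.mean((l1 == a) & (l2 == b)), 1e-12)

    def phi_pseudo_posterior(L, x, edges, k, phi_grid, lam=1.0):
        logpost = np.zeros(len(phi_grid))
        for t, phi in enumerate(phi_grid):
            for (i, j) in edges:
                rho = np.exp(-phi * abs(x[i] - x[j]))
                logpost[t] += lam * np.log(pair_prob(L[i], L[j], rho, k))
        w = np.exp(logpost - logpost.max())     # flat prior over the grid (an assumption)
        return w / w.sum()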
Predictive posterior distributions
• a simulated example on 1-dim domain D = [1, 40]
[Figure: predictive posterior densities of Y(x) at locations 2, 4, 6, 8, 10, 12, 14, 16, 26, 28, 30, 32]
20
Progesterone hormone data
• Inference of canonical curves
(contraception grouping is unknown to the analysis)
[Figure: log(PGD) vs. daily index — left: Progesterone data; right: φ = .05 (global clustering)]
• Left figure: Distinctive canonical behaviors emerge approximately 7 days
after ovulation date
• Right figure: φ is set very small to insist on “global” clusters
⇒ global clustering not possible!
21
Pairwise comparison of hormone curves
[Figure: two 88 × 88 heatmaps of the proportion of equal labels for pairs of curves]
• Proportion of equal labels for pairs of replicates
– left: for the whole curve
– right: for a curve segment [20, 24]
(last 5 days of the monitored cycle)
• Curves indexed from 67 to 88 belong to the contraception group
22
Detecting unusual hormone levels
[Figure: Unusual hormone levels — log(PGD) vs. daily index]
• Curves numbered 12, 27, 30, 50, 74, 82, 85 have a large degree of departure
from the “mean” behavior
23
Image segmentation
• canonical curves are random constant functions
• image segments correspond to (random) level sets
24
Similar image retrieval
• For a given image (leftmost), list the 5 most “similar” images
[Figure: retrieval examples for queries labeled “happy cows”, “trees”, “cars”, “aircrafts”, “buildings”, “faces”]
25
Summary
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectory curves and image segmentation
• Dirichlet labeling processes provide a useful alternative to existing modeling
frameworks, e.g., Markov random fields
– label behavior easy to interpret (via latent hierarchies)
– label behavior quite rich (via latent stochastic processes)
– computationally tractable (compared to MRFs)
• Functional clustering from non-functional grouped data
– applications in data with partial information (e.g., confidentiality constraints)
– nested hierarchical Dirichlet processes
26
Recall – a little twist: Learning functional
clusters from non-functional data
[Figure: spatially varying mixture distributions — Y vs. locations u]
• data are daily hormone levels from a number of un-identified women
(not necessarily realizations of curves)
– due to confidentiality or lack of information, hormone levels collected on
different days may belong to different (groups of) women
• local clusters = typical daily hormone behaviors in a menstrual cycle
• global (functional) clusters = typical monthly hormone behaviors
27
Set-up of non-functional grouped data
• Non-functional data set-up: For group u, and data yui:
y_ui | θ_ui ∼ F(· | θ_ui)
θ_ui iid∼ G_u, for i = 1, 2, . . .
• Gu is the mixing distribution for local clusters associated with group u
• Modeling problem:
– notion of global (functional) clusters that emerge when
all groups aggregate
– specification of distribution Q for global clusters
– linking up Q to the non-functional data
28
Formalizing the notion of global clusters
• Let Θu be the space for local atom θu associated with group u
• Define the product space Θ = ∏_{u∈V} Θ_u
• A global atom φ := (φ_u)_{u∈V} is an element of Θ
• Modeling approach: hierarchical nonparametric Bayes
– to specify a (random) distribution Q for “global” φ (using Dirichlet
process)
– Linking up Q to the mixing distribution of local clusters Gu’s via a
nested hierarchy of Dirichlet processes
29
Nested hierarchical Dirichlet processes (nHDP)
• extending the hierarchical Dirichlet process (HDP) framework
(Teh, Jordan, Blei, Beal, JASA 2006) to enable functional clustering
– while HDP is a recursive hierarchy of Dirichlet processes (DPs),
nHDP is a nested hierarchy of DPs
• key features of the nHDP approach:
– global and local atoms are related as different levels in the Bayesian
hierarchy
– fully nonparametric specification
30
Hierarchical model specification
• Data: Let y_u1, y_u2, . . . , y_{u n_u} be the observations obtained within group
u, and assume that each observation is drawn independently from a
mixture model:
y_ui | θ_ui iid∼ F(· | θ_ui)
θ_ui | G_u iid∼ G_u
for any u ∈ V ; i = 1, . . . , n_u
• The G_u are given by a nested hierarchy of DPs:
G_u | α_u, Q indep∼ DP(α_u, Q_u), for all u ∈ V
Q | γ, H ∼ DP(γ, H)
• Base measure H is taken to be a Gaussian process indexed by u ∈ V
31
Nested hierarchy of Dirichlet processes
[Diagram: H (marginals Hu, Hv, Hw) → Q (marginals Qu, Qv, Qw) → Gu, Gv, Gw → θui, θvi, θwi]
32
Stick-breaking representation
• Q provides the support for global clusters:
Q = Σ_{k=1}^∞ β_k δ_{φ_k},
where φ_k = (φ_uk : u ∈ V) are independent draws from H, and β =
(β_k)_{k=1}^∞ ∼ GEM(γ).
• Q_u is the induced marginal of Q at u, while G_u is centered around Q_u
and provides the support for local clusters:
Q_u = Σ_{k=1}^∞ β_k δ_{φ_uk}
G_u = Σ_{k=1}^∞ π_uk δ_{φ_uk}
33
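A hedged, truncated stick-breaking sketch of these layers: global atoms φ_k = (φ_uk)_{u∈V} drawn from an assumed GP base measure H, global weights β from a truncated GEM(γ), group weights π_u obtained by re-breaking the sticks (finite Dirichlet(αβ) approximation, as in the HDP), and local atoms read off the chosen global atom. All constants are illustrative.

    import numpy as np

    def gem(gamma, trunc, rng):
        v = rng.beta(1.0, gamma, size=trunc)
        return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    def nhdp_sketch(locations, n_per_group, gamma=2.0, alpha=1.0, trunc=25, seed=0):
        rng = np.random.default_rng(seed)
        u = np.asarray(locations, dtype=float)
        K = np.exp(-0.5 * (u[:, None] - u[None, :]) ** 2) + 1e-8 * np.eye(len(u))
        phi = rng.multivariate_normal(np.zeros(len(u)), K, size=trunc)  # global atoms phi_k = (phi_uk)
        beta = gem(gamma, trunc, rng)
        beta /= beta.sum()
        theta = {}
        for j, loc in enumerate(u):
            pi_u = rng.dirichlet(alpha * beta + 1e-6)       # group weights: finite approx of DP(alpha, beta)
            ks = rng.choice(trunc, size=n_per_group, p=pi_u)
            theta[loc] = phi[ks, j]                         # local atoms theta_ui = phi_{u, z_ui}
        return phi, beta, theta

    phi, beta, theta = nhdp_sketch(locations=np.arange(5), n_per_group=10)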
Polya-urn characterization
• Sampling of local atoms distributed by G_u (which is integrated out):
θ_ui | θ_u1, . . . , θ_{u,i−1}, α_u, Q ∼ Σ_{t=1}^{m_u} [n_ut / (i − 1 + α_u)] δ_{ψ_ut} + [α_u / (i − 1 + α_u)] Q_u
– Q_u is the induced marginal of distribution Q
• Sampling of global atoms distributed by Q (which is integrated out):
ψ_t | {ψ_l}_{l≠t}, γ, H ∼ Σ_{k=1}^K [q_k / (q_· + γ)] δ_{φ_k} + [γ / (q_· + γ)] H
34
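A hedged sketch of the local-level urn for a single group u: each new θ_ui joins an existing local atom ψ_ut with probability proportional to its count n_ut, or draws a fresh atom from Q_u with probability proportional to α_u. Here Q_u is stood in for by a simple callable; in the model it is the u-marginal of Q.

    import numpy as np

    def polya_urn_group(n, alpha_u, draw_from_Qu, seed=0):
        rng = np.random.default_rng(seed)
        atoms, counts, assignments = [], [], []
        for i in range(n):
            probs = np.array(counts + [alpha_u], dtype=float)
            t = rng.choice(len(probs), p=probs / probs.sum())
            if t == len(atoms):                  # open a new local atom psi_ut ~ Q_u
                atoms.append(draw_from_Qu(rng))
                counts.append(1)
            else:                                # join atom t with prob. n_ut / (i - 1 + alpha_u)
                counts[t] += 1
            assignments.append(t)
        return atoms, counts, assignments

    atoms, counts, z = polya_urn_group(n=20, alpha_u=1.0, draw_from_Qu=lambda r: r.standard_normal())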
Sharing lunch in a communal eatery
global atoms = lunch boxes = φ = (φu)u∈V
local atoms = lunch items = φu
u indexes categories, e.g., appetizer or main entree
• lunch items are consumed based on popularity within their category
• lunch boxes are bought based on their popularity, and are to be shared by all
[Diagram: global atoms {φk} with components φuk, φvk; local atoms {ψt} with components ψut, ψvt; group-level draws {θui}, {θvi}]
Statistical problem: Inference about distribution of lunch boxes bought
based on distributions of lunch items consumed
35
Spatial dependence among Gu’s
• spatial dependence conferred by the base measure H entails spatial dependence
among the local distributions G_u
• suppose that H is a Gaussian process, φ = (φu : u ∈ V ) ∼ N(µ, Σ),
where Σ takes standard exponential form
• Corr(G_u(A), G_v(B)) → 1 as u → v, and vanishes as ‖u − v‖ → ∞
• Corr(Gu(A), Gu(B)) increases as either αu or αv increases
Relations between the two levels in the Bayesian hierarchy:
• Corr(Gu(A), Gv(B)) ≤ Corr(Qu(A), Qv(B))
• as γ ranges from 0 to ∞, the correlation ratio
Corr(G_u(A), G_v(B)) / Corr(Q_u(A), Q_v(B))
decreases from 1 to 0.
36
Posterior consistency and identifiability
• The nested hierarchical Dirichlet process prior placed on the
space of joint densities of (y_u)_{u∈V} yields consistent posterior distributions
• Under additional assumptions on the base measure H, it can be shown that
global atoms (i.e., those distributed by Q) can be identified from a collection
of data locally grouped by index u
37
Posterior inference
• Hierarchical model is amenable to Gibbs sampling
– sampling local atoms by integrating out Gu’s
– sampling global atoms by integrating out Q, and possibly the base measure H
• The conditional distribution of a DP-distributed measure is again a Dirichlet
process
• Computational speedup is achieved by replacing the spatial process H
by a graphical model
– inference for tree-structured or chain-structured models requires time
linear in the number of locations u
38
Exploiting the stick-breaking representation
Construct a Markov chain on space of stick-breaking representations (z, q, β, φ).
Sampling β: β|q ∼ Dir(q1, . . . , qK , γ).
Sampling z:
p(z_ui = k | z_−ui, q, β, φ_k, Data) =
  (n^{−ui}_{u·k} + α_u β_k) F(y_ui | φ_uk)   if k previously used
  α_u β_new f^{−y_ui}_{u k^new}(y_ui)        if k = k^new.
Sampling q: q_k = Σ_{u∈V} m_uk, where
p(m_uk = m | z, m_−uk, β) = [Γ(α_u β_k) / Γ(α_u β_k + n_u·k)] s(n_u·k, m) (α_u β_k)^m.
Sampling φ:
p(φ_k | z, Data) ∝ H(φ_k) ∏_{ui: z_ui=k} F(y_ui | φ_uk), for each k = 1, . . . , K.
39
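A hedged sketch of the z-update alone, for a Gaussian kernel F with known variance: each currently used global atom k is scored by (n^{−ui}_{u·k} + α_u β_k) F(y_ui | φ_uk), and one extra "new atom" option is scored by α_u β_new times a prior predictive; z_ui is then sampled from the normalized scores. Shapes, the Gaussian prior predictive, and constants are illustrative assumptions, not the talk's exact implementation.

    import numpy as np

    def normal_pdf(y, mean, var):
        return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def sample_zui(y_ui, counts_uk, beta, phi_u, alpha_u, sigma2, prior_var, rng):
        """counts_uk: n_{u.k} excluding y_ui (length K); beta: length K+1 with beta[-1] = beta_new;
        phi_u: atom values phi_uk at group u (length K)."""
        used = (counts_uk + alpha_u * beta[:-1]) * normal_pdf(y_ui, phi_u, sigma2)
        new = alpha_u * beta[-1] * normal_pdf(y_ui, 0.0, sigma2 + prior_var)  # prior predictive for a fresh atom
        probs = np.append(used, new)
        return rng.choice(len(probs), p=probs / probs.sum())   # returning K means "open a new atom"

    rng = np.random.default_rng(0)
    z = sample_zui(0.3, np.array([4.0, 1.0]), np.array([0.5, 0.3, 0.2]), np.array([0.2, -1.0]),
                   alpha_u=1.0, sigma2=0.25, prior_var=1.0, rng=rng)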
Tracking example
[Figure: spatially varying mixture distributions — Y vs. locations u]
Prior specification:
• concentration parameters γ ∼ Gamma(5, .1) and α ∼ Gamma(20, 20)
• variance σ²_ε of F(·) is given an InvGamma(5, 1) prior
• prior for global atoms H is a mean-zero Gaussian process with (σ, ω) =
(1, 0.01)
40
Clustering bifurcation behavior
[Figure: spatially varying mixture distributions — Y vs. locations u]
• prior for global atoms H is a mean-zero Gaussian process with (σ, ω) =
(1, 0.05)
• other prior specifications are the same as previous data example
41
Inference of global clusters (tracks)
[Figure — left: posterior over the number of global clusters; right: posterior distributions of global cluster centers (Y vs. locations u)]
Left: the number of global clusters is 5 with posterior probability > 90%
Right: (.05,.95) credible intervals of global cluster estimates
42
Global clusters of bifurcating behavior
[Figure — left: posterior over the number of global clusters; right: posterior distributions of global cluster centers (Y vs. locations u)]
Left: the number of global clusters is 3 with posterior probability > 90%
Right: (.05,.95) credible intervals of global cluster estimates
43
Evolution of local clusters
Posterior distributions of the number of local clusters associated with
different group indices (locations) u.
[Figure: posterior histograms of the number of local clusters at locations u = 1, 3, 5, 8, 10, 11, 13, 15]
44
Effects of vague prior for H
[Figure — left: posterior over the number of global clusters; right: posterior distributions of global cluster centers (Y vs. locations u)]
A very vague prior for H results in weak identifiability of the global clusters,
even though the local clusters are identified reasonably well.
45
Clustering progesterone hormone
[Figure: progesterone hormone curves — Y vs. locations u]
• Hormone levels collected from a number of women
• Subject ids are withheld, so hormone trajectories are not given
• Comparison to the hybrid DP approach (Petrone et al, 2009), which does
use the trajectory information
46
Temporally varying number of local clusters
[Figure: posterior histograms of the number of local clusters at locations u = 5, 18, 22, 24]
Number of global clusters:
[Figure: posterior histogram of the number of global clusters]
47
Estimates of global clusters
[Figure — left and right: posterior distributions of global cluster centers (Y vs. locations u)]
Left: Clustering results using the nHDP mixture model
Right: The hybrid-DP approach of Petrone, Guindani and Gelfand (2009)
Black solid curves are the sample mean curves of the contraceptive and
non-contraceptive groups
48
Pairwise comparison of hormone curves
[Figure: two heatmaps (curves 1–60 × 1–60) of posterior label agreement]
Each entry in the heatmap depicts the posterior probability that the two
curves share the same local clusters, averaged over the last 4 days in the
menstrual cycle
nHDP approach (Left panel) provides sharper clusterings than the hybrid
DP approach (Right panel)
49
Modeling of temperature/depth in the Atlantic Ocean
[Figure — left: temperature vs. depth profiles; right: map (longitude vs. latitude) of numbered sampling stations in groups gr 1–4]
• data are (temp, depth) samples collected at 4 different locations at
different times
• functional clustering within each location
• functional comparisons (ANOVA) between locations
50
Model formalization
• Data are a collection of Y_ux(i), for i = 1, . . . , n_ux
• u indexes group (location), x indexes depth
• At each location, there are global clusters representing typical functional
relationships between depth and temperature
• Modeling data associated with each location using a nHDP: for each
u = 1, . . . , 4:
{θ_ux(i)} | γ, α, H iid∼ nHDP(γ, α, H)
Y_ux(i) | θ_ux(i) indep∼ N(θ_ux(i), τ²_u)
• Base probability measure H is a draw from a Dirichlet process:
H ∼ DP(α0, H0),
H0 ≡ GP(µ0,Σ0)
51
Fully nonparametric hierarchical specification
H0 ≡ GP(µ0,Σ0)
H|H0 ∼ DP(γ, H0),
Q_u | H iid∼ DP(α0, H), for all u ∈ V
G_{u;x} | Q_u indep∼ DP(α_u, Q_{u;x}).
As before,
θ_ux(i) | Q_{u;x} ∼ Q_{u;x} for all i = 1, . . . , n_u; u ∈ V
Y_ux(i) | θ_ux(i) ∼ N(θ_ux(i), τ²_u) for all i = 1, . . . , n_u; u ∈ V.
52
Number of functional clusters
[Figure: posterior histogram of the number of functional clusters (all groups)]
53
Posterior distribution of global atoms
[Figure: posterior means and credible intervals of the functional mean curves φ1–φ6 (temperature vs. depth)]
54
Posterior mean/std of mixing proportions
of the dominant functional clusters for each group of data
group (u) πu1 πu2 πu3 πu4 πu5 πu6 πu7
1 0.98 (0.01) 0.00 (0) 0.00 (0) 0.0022 (0) 0.00 (0) 0.00 (0) 0.00 (0)
2 0.07 (0.20) 0.70 (0.16) 0.08 (0.05) 0.06 (0.03) 0.01 (0.02) 0.02 (0.04) 0.00 (0)
3 0.08 (0.24) 0.01 (0.02) 0.01 (0.02) 0.01 (0.02) 0.86 (0.24) 0.00 (0.01) 0.00 (0)
4 0.07 (0.23) 0.01 (0.02) 0.01 (0.03) 0.01 (0.02) 0.86 (0.22) 0.00 (0.01) 0.00 (0)
[Figure: temperature vs. depth profiles]
55
Posterior mean of noise variance
group (u) 1 2 3 4
mean/std for τu 0.50 (0.22) 1.15 (2.13) 2.13 (0.29) 1.88 (0.25)
56
Varying number of local clusters with depth
[Figure: number of local clusters vs. depth for Groups 1–4]
Posterior mean (solid) and (.05, .95) credible intervals (dashed)
57
Summary
• Global clusters and local clusters in functional data
– Dirichlet labeling process for label allocation
– variational inference
– clustering trajectory curves and image segmentation
– appears in Nguyen & Gelfand (Statistica Sinica, 2011)
• Learning functional clusters from non-functional grouped data
– applications in data with partial information (e.g., confidentiality constraints)
– nested hierarchical Dirichlet processes
– appears in Nguyen (Bayesian Analysis, 2010)
• This talk focuses mainly on modeling and computation
– the identifiability and posterior consistency of nonparametric
mixing distributions in hierarchical models are currently
under investigation
58