Mixture model clustering for mixed data with missing information

Intelligent Database Systems Lab

國立雲林科技大學 (National Yunlin University of Science and Technology)

Advisor: Dr. Hsu

Graduate: Yu Cheng Chen

Authors: Lynette Hunt, Murray Jorgensen

Mixture model clustering for mixed data with missing information

Computational Statistics & Data Analysis, 2002


N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
The Mixture approach to Clustering Data
Application
Discussion
Personal Opinion


Motivation

Missing observations are frequently seen in data sets.

Specimens may be damaged, so that some results cannot be recorded.

Expensive tests may be administered only to a random sub-sample of the items.


Objective

We need a technique for clustering data that are incomplete.

The paper extends the mixture likelihood approach to analyse data with mixed categorical and continuous attributes where some of the data are missing at random.


Introduction

Data are described as ‘missing at random’ when the probability that a variable is missing for a particular individual may depend on the values of the observed variables, but not on the value of the missing variable.

That is, the distribution of the missing-data mechanism does not depend on the missing values.
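The definition above can be illustrated with a small simulation. This is only a sketch under assumed settings: the function names, the two-variable setup and the missingness probabilities (0.1 and 0.4) are hypothetical and do not come from the paper. The variable x2 is deleted with a probability that depends only on the always-observed x1, which is what ‘missing at random’ permits.

```python
import numpy as np


def make_mar_sample(n=10000, seed=0):
    """Simulate two variables where x2 is missing at random given x1."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)                  # always observed
    x2 = 0.8 * x1 + rng.normal(size=n)       # may be deleted below
    # Assumed mechanism: missingness depends on the observed x1 only,
    # never on the value of x2 itself.
    p_missing = np.where(x1 > 0, 0.4, 0.1)
    is_missing = rng.random(n) < p_missing
    x2_observed = np.where(is_missing, np.nan, x2)
    return x1, x2_observed


def missing_rates(x1, x2_observed):
    """Fraction of missing x2 values for x1 <= 0 and for x1 > 0."""
    is_missing = np.isnan(x2_observed)
    low = float(is_missing[x1 <= 0].mean())
    high = float(is_missing[x1 > 0].mean())
    return low, high
```

Calling missing_rates(*make_mar_sample()) should give rates close to the two assumed probabilities, showing that the chance of a missing x2 is predictable from x1 alone.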


Introduction

Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameters of the data if the data are ‘missing at random’.

The EM algorithm of Dempster et al. is a general iterative procedure for maximum likelihood estimation in incomplete-data problems.

Little and Schluchter (1985) present a maximum likelihood procedure using the EM algorithm for the general location model with missing data.


The Mixture approach to Clustering Data

Suppose p attributes are measured on n individuals. Let x1,…,xn be the observed values of a random sample from a mixture of K populations in unknown proportions π1,…,πK.

Let the density of xi in the kth group be fk(xi; θk), where θk is the parameter vector for group k.

Let ψ=(θ’, π’)’, where π=(π1,…,πK)’, θ=(θ1,…, θK)’. The mixture density of xi is

\[ f(x_i; \psi) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \]
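As a minimal sketch of the mixture density, the weighted sum over groups can be computed as follows. Univariate normal components are assumed here purely for illustration, and the names normal_pdf and mixture_density are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, sds):
    """f(x; psi) = sum over k of pi_k * f_k(x; theta_k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
```

For example, mixture_density(0.0, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]) evaluates an equal-weight two-group mixture at x = 0.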


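In the EM algorithm of Dempster et al., the ‘missing’ data are the unobserved indicators of group membership.

Let zi=(zi1,…,ziK)’ be the vector of indicator variables, where

\[ z_{ik} = \begin{cases} 1 & \text{if individual } i \in \text{group } k \\ 0 & \text{if individual } i \notin \text{group } k \end{cases} \]

The estimated posterior probability of group membership is

\[ \hat{z}_{ik} = \mathrm{pr}(\text{individual } i \in \text{group } k \mid x_i; \hat{\psi}) = \frac{\hat{\pi}_k f_k(x_i; \hat{\theta}_k)}{\sum_{k'=1}^{K} \hat{\pi}_{k'} f_{k'}(x_i; \hat{\theta}_{k'})} \]

for k=1,…,K; xi is assigned to group k if ẑik > ẑik’ for all k’ ≠ k.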
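The posterior probabilities and the assignment rule can be sketched as below. The sketch assumes the component densities fk(xi; θk) have already been evaluated into an n × K array; the function names are hypothetical and groups are numbered from 0.

```python
import numpy as np


def responsibilities(densities, weights):
    """Posterior group probabilities z_hat[i, k].

    densities: (n, K) array holding f_k(x_i; theta_k)
    weights:   length-K mixing proportions pi_k
    """
    numerators = np.asarray(densities, dtype=float) * np.asarray(weights, dtype=float)
    return numerators / numerators.sum(axis=1, keepdims=True)


def assign(z_hat):
    """Assign each individual to the group with the largest z_hat[i, k]."""
    return np.argmax(z_hat, axis=1)
```

Each row of the result sums to one, so the rows can be read directly as membership probabilities.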


The latent class model is a finite mixture model for data where each of the p attributes is discrete.

Suppose that the jth attribute can take on levels 1,…,Mj and let λkjm be the probability that, for individuals from group k, the jth attribute has level m. Then the density for individual i, conditional on belonging to group k, is

\[ f_k(x_i; \theta_k) = \prod_{j=1}^{p} \lambda_{k j x_{ij}} \]
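A short sketch of the latent class density for one group follows. It assumes, for convenience, that levels are coded 0,…,Mj−1 rather than 1,…,Mj; the function name is hypothetical.

```python
import math


def latent_class_density(x, lam_k):
    """Product over attributes j of lambda[k][j][x_ij] for a single group k.

    x:     length-p sequence of observed levels, coded from 0 (an assumption)
    lam_k: list of p probability vectors, one per attribute, for group k
    """
    return math.prod(lam_k[j][level] for j, level in enumerate(x))
```

With two attributes having probability vectors [0.2, 0.8] and [0.5, 0.3, 0.2], the observation (0, 1) has density 0.2 × 0.3.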


Multimix

Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) proposed a general class of mixture models for data having both continuous and categorical attributes.

The observation vector xi is partitioned into L cells such that

\[ x_i = (x_{i1} \mid \dots \mid x_{il} \mid \dots \mid x_{iL})' \]

With the cells taken to be independent within each group, if individual i belongs to group k we can write

\[ f_k(x_i) = \prod_{l=1}^{L} f_{kl}(x_{il}) \]


Multimix

Discrete distribution:

xil is a one-dimensional discrete attribute taking values 1,…,Ml with probabilities λklm, m = 1,…,Ml.

Multivariate Normal distribution:

xil is a pl-dimensional vector with a Npl(μkl, Σkl) distribution.
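The product over partition cells can be sketched for a single group as below. The data layout (a list of cells paired with a list of ("discrete", probs) or ("normal", mean, cov) tuples) and the function names are assumptions made for this illustration, not the Multimix program's interface; discrete levels are coded from 0.

```python
import numpy as np


def discrete_cell(level, probs):
    """Probability of the observed level in a discrete cell."""
    return probs[level]


def normal_cell(x, mean, cov):
    """Multivariate normal density of a continuous cell."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** len(x) * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)


def multimix_group_density(cells, params):
    """f_k(x_i) as the product over cells l of f_kl(x_il)."""
    density = 1.0
    for value, (kind, *theta) in zip(cells, params):
        if kind == "discrete":
            density *= discrete_cell(value, *theta)
        else:
            density *= normal_cell(value, *theta)
    return density
```

For instance, an observation with one discrete cell and one bivariate normal cell is passed as cells = [1, [0.0, 0.0]].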


Graphical models

An alternative way of looking at these multivariate models is within the framework of graphical models.

The graph of a model contains vertices and edges, with one vertex corresponding to each variable.

Edges show the dependence structure: the absence of an edge between two vertices indicates conditional independence of the corresponding variables.

Latent class models for p variables are represented by a graph on p+1 vertices, corresponding to the p variables plus one categorical variable indicating the cluster.


Missing data

We put forward a method for mixture model clustering based on the assumption that the data are missing at random.

We write the observation vector xi in the form (xobs,i, xmiss,i), where

xobs,i contains the observed attributes for observation i

xmiss,i contains the missing attributes for observation i


Missing data

The E step of the EM algorithm requires the calculation of Q(ψ, ψ(t))=E{ LC(ψ)|xobs; ψ(t)}, the expectation of the complete data log-likelihood conditional on the observed data and the current value of the parameters.

We calculate Q(ψ, ψ(t)) by replacing zik with

\[ \hat{z}_{ik}^{(t)} = E(z_{ik} \mid x_{obs,i}; \psi^{(t)}) = \frac{\pi_k^{(t)} f_k(x_{obs,i}; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)} f_{k'}(x_{obs,i}; \theta_{k'}^{(t)})} \]
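The E step with incomplete observations can be sketched as follows. To keep the example small it assumes a single multivariate normal cell per group and marks missing values with NaN; the marginal density of the observed attributes is obtained by keeping only the observed entries of the mean and covariance. The function names are hypothetical.

```python
import numpy as np


def observed_normal_density(x, mean, cov):
    """Normal density of the observed (non-NaN) part of x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = ~np.isnan(x)
    if not obs.any():
        return 1.0  # nothing observed: the cell carries no information
    diff = x[obs] - mean[obs]
    cov_oo = cov[np.ix_(obs, obs)]
    quad = diff @ np.linalg.solve(cov_oo, diff)
    norm = np.sqrt((2.0 * np.pi) ** obs.sum() * np.linalg.det(cov_oo))
    return float(np.exp(-0.5 * quad) / norm)


def e_step_observed(X, weights, means, covs):
    """z_hat[i, k] computed from the observed attributes only."""
    X = np.asarray(X, dtype=float)
    K = len(weights)
    numerators = np.array([
        [weights[k] * observed_normal_density(x, means[k], covs[k]) for k in range(K)]
        for x in X
    ])
    return numerators / numerators.sum(axis=1, keepdims=True)
```

An individual with a missing attribute still receives membership probabilities, based on whatever was observed.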


Missing data

The remaining calculations in the E step are the expected values of the complete-data sufficient statistics for each partition cell l.


Missing data

For multivariate normal partition cells, Eliminating one cluster at a time

Calculate the between-cluster entropy based on remaining clusters


Missing data

The sweep operator is useful in maximum likelihood estimation for multivariate missing data problems.

We form the augmented covariance matrix Al using the current estimates of the parameters for group k in cell l.


Missing data

Sweeping on the elements of Al corresponding to the observed xij in cell l yields the conditional distribution of the missing xij’ given the observed xij in the cell.
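A minimal sketch of the sweep operator is given below. It assumes the common convention in which the augmented matrix has −1 in its top-left corner, the mean vector in the first row and column, and the covariance matrix in the remaining block; the slides do not show the exact layout of Al, so this layout and the function names are assumptions.

```python
import numpy as np


def sweep(g, k):
    """Sweep the symmetric matrix g on pivot index k."""
    g = np.asarray(g, dtype=float)
    pivot = g[k, k]
    h = g - np.outer(g[:, k], g[k, :]) / pivot   # entries off the pivot row/column
    h[:, k] = g[:, k] / pivot
    h[k, :] = g[k, :] / pivot
    h[k, k] = -1.0 / pivot
    return h


def conditional_from_sweep(mu, sigma, obs, x_obs):
    """Mean and covariance of the missing attributes given the observed ones."""
    p = len(mu)
    a = np.empty((p + 1, p + 1))
    a[0, 0] = -1.0
    a[0, 1:] = mu
    a[1:, 0] = mu
    a[1:, 1:] = sigma
    for j in obs:                      # sweep on each observed attribute
        a = sweep(a, j + 1)
    mis = [j for j in range(p) if j not in obs]
    rows_o = [j + 1 for j in obs]
    cols_m = [j + 1 for j in mis]
    intercept = a[0, cols_m]                     # regression intercepts
    slopes = a[np.ix_(rows_o, cols_m)]           # regression coefficients
    cond_mean = intercept + np.asarray(x_obs, dtype=float) @ slopes
    cond_cov = a[np.ix_(cols_m, cols_m)]         # residual covariance
    return cond_mean, cond_cov
```

After sweeping, the first row holds the regression intercepts, the observed rows hold the regression coefficients, and the missing block holds the residual covariance, which together define the conditional normal distribution.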


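Missing data

The new parameter estimates θ(t+1) are computed from the expected complete-data sufficient statistics.

Mixing proportions:

\[ \hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ik}^{(t)} \quad \text{for } k = 1, \dots, K \]

Discrete distribution parameters:

\[ \hat{\lambda}_{klm}^{(t+1)} = \frac{1}{n \hat{\pi}_k^{(t+1)}} \sum_{i=1}^{n} \hat{z}_{ik}^{(t)} \delta_{ilm} \quad \text{for } k = 1, \dots, K, \; m = 1, \dots, M_l \]

where δilm denotes the indicator that xil takes level m, replaced by its expected value when xil is missing.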
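The updates for the mixing proportions and for one discrete cell can be sketched as below. The sketch assumes levels coded from 0 with −1 marking a missing value, and it assumes that a missing indicator is replaced by the current probability for that group, which is its conditional expectation when cells are independent within groups. The function name is hypothetical.

```python
import numpy as np


def m_step_discrete(z_hat, x, lam):
    """Update mixing proportions and the probabilities of one discrete cell.

    z_hat: (n, K) posterior probabilities from the E step
    x:     length-n levels coded from 0, with -1 for missing (an assumption)
    lam:   (K, M) current level probabilities for this cell
    """
    z_hat = np.asarray(z_hat, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n, K = z_hat.shape
    pi_new = z_hat.sum(axis=0) / n
    lam_new = np.zeros_like(lam)
    for i in range(n):
        for k in range(K):
            if x[i] < 0:
                # Missing value: use the expected indicator under group k.
                lam_new[k] += z_hat[i, k] * lam[k]
            else:
                lam_new[k, x[i]] += z_hat[i, k]
    lam_new /= (n * pi_new)[:, None]
    return pi_new, lam_new
```

Each row of the updated probability table still sums to one, because every individual contributes a total weight of z_hat[i, k] to group k whether or not its value was observed.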


Missing data

Multivariate Normal parameters:


Application

Prostate cancer clinical trial data of Byar and Green (1980).

The data were obtained from a randomized clinical trial comparing 4 treatments for 506 patients with prostatic cancer.

Twelve pre-trial covariates are measured on each patient: 7 may be taken to be continuous, 4 to be discrete, and 1 (SG) is an index. We treat SG as a continuous variable.


Application

Some individuals have at least one pre-trial covariate missing, giving a total of 62 missing values.

With 506 × 12 = 6072 values in all, only approximately 1% of the data are missing.

To study larger amounts of missing data, further missing values were created artificially: each attribute of each individual was assigned a random 0/1 digit marking it as missing, with the missingness probability set in turn to .10, .15, .20, .25 and .30.
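The artificial deletion step can be sketched as below, assuming each value is marked missing independently with a fixed probability. This is an illustration of the idea rather than the authors' actual procedure, and the function name and seed are hypothetical.

```python
import numpy as np


def random_missing_mask(n_rows, n_cols, p_missing, seed=0):
    """Boolean mask: True where a value is to be treated as missing."""
    rng = np.random.default_rng(seed)
    return rng.random((n_rows, n_cols)) < p_missing
```

A mask for the 506 × 12 data set at probability .30 is random_missing_mask(506, 12, 0.30); it can be applied to a numeric array with np.where(mask, np.nan, data).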


Application

The data set reported in detail here had 1870 values recorded as missing.

The data are separated into two clusters.

We regard the data as a random sample from the distribution


Discussion

The Multimix approach allows the clustering of mixed data containing both categorical and continuous variables.

The finite mixture model lends itself well to coping with missing values.

The approach implemented in this paper worked well for a mixed data set that had a very large amount of missing data.


Personal Opinion

…
