mixture model clustering for mixed data with missing information
DESCRIPTION
Mixture model clustering for mixed data with missing information. Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Lynette Hunt, Murray Jorgensen. Computation statistics & Data Analysis, 2002. Outline. Motivation Objective Introduction The Mixture approach to Clustering Data - PowerPoint PPT PresentationTRANSCRIPT
Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
Advisor : Dr. Hsu
Graduate : Yu Cheng Chen
Author: Lynette Hunt, Murray Jorgensen
Mixture model clustering for mixed data with missing information
Computation statistics & Data Analysis, 2002
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Outline
Motivation Objective Introduction The Mixture approach to Clustering Data Application Discussion Personal Opinion
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Motivation
Missing observations are frequently seen in data sets.Specimen may be damaged result.
Expensive test may only be administered to a random sub-sample of the items.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Objective
We need to implement some technique when the data to be clustered are incomplete.
Extends mixture likelihood approach to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Introduction
Data are described as ‘missing at random’ when the probability that a variable is missing for a particular individual may depend on the values of the observed variables, but not for on the value of the missing variable.
The distribution of the missing data does not depend on the missing data.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Introduction
Rubin(1976) showed the process that causes the missing data can be ignored when making likelihood-based about the parameter of the data if the data are ‘missing at random’.
The EM algorithms of Dempster et al . is a general iterative procedure maximum likelihood estimation in incomplete data problems.
Little and Schluchter(1985) present maximum likelihood procedure using the EM algorithms for the general location model with missing data.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The Mixture approach to Clustering Data
Suppose p attributes are measured on n individuals. Let xi,…, xn be the observed values of a random sample from a mixture of K populations in known proportions, π1,…,πk
Let the density of xi in the kth group be fk(xi; θk), where θk is the parameter vector for group k.
Let ψ=(θ’, π’)’, where π=(π1,…,πk)’, θ=(θ1,…, θk)’
K
kkikki xfxf
1
);();(
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The Mixture approach to Clustering Data
In EM algorihm of Dempster et al., the ‘missing’ data are the unobserved indicators of group membership.
Let the vector of indicator variables, zi=(zi1,…,zik)
k group i individual if 0
k group i individual if 1ikz
K
k kik
kikk
ik
x
xf
group kiindividualprz
1
i
)ˆ;(ˆ
)ˆ;(ˆ
)ˆ; x| (ˆ
for k=1,…K; and xi is assigned to group k if zik > zik’ , k != k’
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The Mixture approach to Clustering Data
The latent class model is a finite mixture model for data where each of the p attributes is discrete.
Suppose that the jth attribute can take on 1,…,M1 and let λkjm be the probability that for individuals from group k, the jth attribute has level m. Then, individual I belonging to group k is defined as
p
jkjxkik ij
xf1
),(
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Multimix
Jorgensen and Hunt(1996) Hunt and Jorgensen(1999) proposed a general class of mixture models to include data having continuous and categorical attributes.
By partitioning the observational vector xi such that
If individual I belongs to group k, we can write
)'|...||...|( 1 iLilii xxxx
L
lilklik xfxf
1
)()(
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Multimix
Discrete distribution:
where is a one-dimensional discrete attribute taking values
1,…Ml with probabilities λklM1
Multivariate Normal distribution:
where is a pl-dimensional vector with a Npl(μkl,∑kl)
ilx
ilx
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Graphical modelsA alternative way of looking at these multivariate models within the framework of graphical models.
The graph of a model contains vertices and edges
vertex corresponding to each variable.
Edges shows the independence of corresponding vertices.
Latent class models for p variable are represented by a graph on p+1 vertices corresponding to the variables plus 1 categorical variable indicating the cluster.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
We put forward a method for mixture model clustering based on the assumption that the data are missing at random.
We write the observation vector xi in the form (xobs,i ,xmiss,i)
xobs,i is the observed attributes for observation i
xmiss,i is the missing attributes for observation i
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
The E step of the EM algorithm require the calculation of Q(ψ, ψ(t))=E{ LC(ψ)|xobs; ψ(t)}, the expectation of the complete data log-likelihood conditional on the observed data and the current value of the parameters.
We calculate Q(ψ, ψ(t)) by replace zik with
K
1k
(t)kiobs,kk
(t)kiobs,kk
)(,
)(
);(xf
);(xf
);|(
tiobsik
tikik XzEzz
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
The remaining calculations in the E step require the calculation of the expected value of the complete data sufficient statistics for each partition cell l.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
For multivariate normal partition cells, Eliminating one cluster at a time
Calculate the between-cluster entropy based on remaining clusters
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
Sweep is usefulness in maximum likelihood estimation for multivariate missing data problems.
We form the augmented covariance matrix Al using the current estimates of the parameters for group k in cell l
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
Sweeping on the elements of Al corresponding to the observed xij in cell l, yields the conditional distribution of the missing xij’ on the observed xij in the cell.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
The new parameter estimates θ(t+1) of parameters are estimated form the complete data sufficient statistic.
Mixing proportion:
Discrete distribution parameters:
kkforzn
n
i
tik
tk ,...,1
1ˆ
1
)()1(
l1
M1,...,m K,1,...,k 1
forzn
n
iilmik
kklm
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Missing data
Multivariate Normal parameters:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.ApplicationProstate cancer clinical trial data of Byar and Green(1980).
The data were obtained from a randomized clinical trial comparing 4 treatments for 506 patients with prostatic cancer.
There are 12 pre-trial covariates measured on each patient, 7 variables may be taken to be continuous, 4 to be discrete and 1 variable (SG) is an index. We treat SG as a continuous variable.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Application
1/3 individual have at least one of pre-trial covariates missing, giving a total of 62 missing values.
As only approximately 1% of the data are missing.
Missing values were created by assigning each attribute of each individual a random digit generated from the discrete[0,1], respectively, as .10, .15, .20, .25 and .30.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Application
The data set reported in detail here had 1870values recorded as missing.
Separate data into two clusters.
We regard the data as a random sample from the distribution
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Application
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Discussion
The multimix approach allows to clustering of mixed finite data containing both types of variables.
The finite mixture model leads itself well into coping with missing values.
The approach implemented in this paper works well for mixed data set that had a very large amount of missing data.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Personal Opinion
…