Panel Data Analysis: A Survey on Model-Based Clustering of Time Series
An Academic presentation by
Dr. Nancy Agens, Head, Technical Operations, Statswork Group, www.statswork.com, Email: [email protected]
In Brief
Longitudinal Data
Model Based Clustering
Example on Model Based Clustering
Dirichlet Prior
MCMC Simulation
Conclusion
Outline of Topics
TODAY'S DISCUSSION
In Brief
In statistical analysis, clustering techniques are used to identify subsets (clusters) in the data using a specified distance measure.
We will discuss some of the methods used for modeling longitudinal or panel data with cluster analysis.
Longitudinal data is a sample of observations that are measured repeatedly over time.
Nowadays, longitudinal (repeated measures) or panel data exists in all areas of applied statistics, such as finance, psychology, economics and the social sciences.
Most studies deal with analyzing homogeneity in such time series data.
The most common method of capturing heterogeneity is to assume the presence of latent classes, with each class stratified using the covariates.
Longitudinal Data
Measuring the distance between time series is not appropriate; thus a cluster-based modeling strategy for finite mixture models is adopted using Bayes' rule.
Model-based clustering treats each time series as a single unit belonging to an unknown latent class.
An excellent review of finite mixture models for longitudinal data can be found in Vermunt (2010), especially for psychology, biostatistics and other applied areas.
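The Bayes' rule step mentioned above can be sketched in a few lines. This is a minimal illustration, not the method of the survey: the function name `class_posterior` and the numbers in the example are assumptions made up for the sketch. Given the log-likelihood of one time series under each latent class and the class weights, it returns the posterior probability that the series belongs to each class.

```python
import numpy as np

def class_posterior(log_lik, weights):
    """Posterior class probabilities for one time series via Bayes' rule.

    log_lik : log p(series | class h) for each class h (hypothetical input)
    weights : prior class weights, positive and summing to one
    """
    log_post = np.log(weights) + np.asarray(log_lik, dtype=float)
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()            # normalise over the classes

# Invented example: a series that is far more likely under class 0.
example_post = class_posterior(np.log([0.6561, 0.0625]), [0.5, 0.5])
```

Assigning each series to the class with the largest posterior probability gives a hard clustering; keeping the full vector preserves the classification uncertainty.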
Model Based Clustering
The data cover marijuana use among 237 teenagers over the years 1976-1980.
Marijuana use is categorized into three levels: never, not more than once a month, and more than once a month.
The following figure shows a sample of 10 observed response sequences of marijuana use among the 237 teenagers.
The model considered for analyzing marijuana use is based on a generalized transition model.
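To make the idea of a transition model for a three-category response concrete, here is a small sketch. It assumes a first-order Markov structure, which is a simplification chosen for illustration; the slides only name a generalized transition model. The state coding, the function names and the example sequence are all hypothetical.

```python
import numpy as np

# Assumed coding of the three response levels (illustrative only).
STATES = ("never", "not more than once a month", "more than once a month")

def transition_counts(series, n_states=len(STATES)):
    """Count moves from state j to state k along one coded time series."""
    counts = np.zeros((n_states, n_states), dtype=int)
    for prev, curr in zip(series[:-1], series[1:]):
        counts[prev, curr] += 1
    return counts

def markov_log_likelihood(series, trans_mat):
    """Log-likelihood of a series given a transition matrix,
    conditioning on the first observation."""
    counts = transition_counts(series, trans_mat.shape[0])
    observed = counts > 0               # skip cells with no transitions
    return float(np.sum(counts[observed] * np.log(trans_mat[observed])))

# Invented five-year sequence, one observation per year.
example_series = [0, 0, 1, 2, 2]
example_counts = transition_counts(example_series)
example_loglik = markov_log_likelihood(example_series, np.full((3, 3), 1 / 3))
```

The count matrix is a sufficient summary of each series under this assumption, so clustering can work with one small matrix per teenager instead of the raw sequence.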
Example on Model Based Clustering
Figure: Model Based Clustering
A Dirichlet prior is chosen in this case since the observed response variable is categorical.
Five different kernel classes are considered, the model is evaluated using the Dirichlet prior distribution, and the results are presented in the following table.
The clustering kernels M2 to M5 show that there is a common behaviour in marijuana usage.
If the value is smaller than one, one may conclude that the method is overfitting; in this case, the H3 class of kernel seems to be overfitting.
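The reason a Dirichlet prior suits a categorical response is conjugacy: the posterior is again Dirichlet, with the observed counts added to the prior parameters. The sketch below illustrates this for one row of a transition matrix. The flat prior, the counts and the function names are assumptions invented for the example, not values from the table.

```python
import numpy as np

def dirichlet_posterior(prior_alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior plus category counts."""
    return np.asarray(prior_alpha, dtype=float) + np.asarray(counts, dtype=float)

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with parameter vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

prior_alpha = np.ones(3)                 # flat prior over three categories (assumed)
row_counts = np.array([12, 5, 3])        # invented transition counts out of one state
post_alpha = dirichlet_posterior(prior_alpha, row_counts)
post_mean = dirichlet_mean(post_alpha)

# Posterior draws of the transition probabilities for this row.
rng = np.random.default_rng(0)
post_draws = rng.dirichlet(post_alpha, size=1000)
```

Because the update is a simple addition, drawing transition probabilities inside an MCMC sampler costs one Dirichlet draw per row.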
Dirichlet Prior
Table: Dirichlet Prior Distribution
An MCMC simulation is carried out for M3 with H2, and the following figure shows a sample of boxplots of the posterior probabilities for the male and female groups.
Comparing the likelihood results from the above table (598.5) and the previous table (596.5), the stratified model-based clustering reduces to standard model-based clustering, and it is clear that marijuana use is not associated with gender.
From these results, it is concluded that marijuana use among teenagers may be clustered into two groups: a never-use group and a group of more frequent users.
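For readers who want to see how such an MCMC run works in principle, the following is a minimal Gibbs sampler sketch for a mixture of first-order Markov chains. It is not the M3/H2 configuration used above: the flat Dirichlet(1, ..., 1) priors, the first-order kernel, the function name `gibbs_markov_mixture` and the toy panel are all assumptions chosen for illustration. To sidestep label switching, it reports how often two series share a class rather than per-class probabilities.

```python
import numpy as np

def gibbs_markov_mixture(panel, n_states, n_classes, n_iter=200, seed=0):
    """Return an (n, n) matrix: share of retained draws in which
    series i and series j were allocated to the same latent class."""
    rng = np.random.default_rng(seed)
    n = len(panel)
    # Transition counts per series, shape (n, n_states, n_states).
    counts = np.zeros((n, n_states, n_states))
    for i, series in enumerate(panel):
        for prev, curr in zip(series[:-1], series[1:]):
            counts[i, prev, curr] += 1

    labels = rng.integers(n_classes, size=n)     # random initial allocation
    same_class = np.zeros((n, n))
    burn_in = n_iter // 2
    for it in range(n_iter):
        # 1. Class weights given labels: Dirichlet(1 + class sizes).
        sizes = np.bincount(labels, minlength=n_classes)
        weights = rng.dirichlet(1.0 + sizes)
        # 2. Transition rows given labels: Dirichlet(1 + pooled counts).
        trans = np.empty((n_classes, n_states, n_states))
        for h in range(n_classes):
            pooled = counts[labels == h].sum(axis=0)
            for j in range(n_states):
                trans[h, j] = rng.dirichlet(1.0 + pooled[j])
        # 3. Labels given weights and transitions, via Bayes' rule.
        log_post = np.log(weights) + np.einsum("ijk,hjk->ih", counts, np.log(trans))
        log_post -= log_post.max(axis=1, keepdims=True)
        prob = np.exp(log_post)
        prob /= prob.sum(axis=1, keepdims=True)
        labels = np.array([rng.choice(n_classes, p=p) for p in prob])
        if it >= burn_in:                        # keep draws after burn-in
            same_class += labels[:, None] == labels[None, :]
    return same_class / (n_iter - burn_in)

# Toy panel (invented, not the marijuana data): ten series that stay in
# state 0 and ten that alternate between states 0 and 2.
toy_panel = [[0] * 9] * 10 + [[0, 2] * 4 + [0]] * 10
co_clustering = gibbs_markov_mixture(toy_panel, n_states=3, n_classes=2)
```

Entries of `co_clustering` near one indicate series that the sampler consistently groups together, which is the kind of posterior summary that boxplots of class probabilities visualise per group.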
MCMC Simulation
Figure: Boxplots for MCMC Simulation
Table: Gender Specific Posterior Inference
To sum up, model-based clustering combined with a Bayesian approach yields better results, since it provides an answer to the most troublesome problems in cluster analysis.
In longitudinal or panel data studies, the use of Euclidean distance may not be valid; hence kernel-based clustering for time series data analysis is considered, and the best method is selected using different information criteria.
An MCMC simulation is carried out to find the optimal clustering methodology.
Conclusion