
Model-Based Clustering for Cross-Sectional Time Series Data

H. Holly WANG and Hao ZHANG

A model-based clustering method for cross-sectional time series data is proposed and applied to crop insurance programs. To design an effective group risk plan, an important step is to group together the farms that resemble each other and to decide the number of clusters, both of which can be achieved via model-based clustering. The mixture maximum likelihood is employed for inferences. However, in the presence of correlation and missing values, the exact maximum likelihood estimators (MLEs) are difficult to obtain. An approach for obtaining approximate MLEs is proposed and evaluated through simulation studies. A bootstrapping method is used to choose the number of components in the mixture model.

Key Words: Akaike's information criterion; Bootstrapping; Classification; Mixture distribution.

1. INTRODUCTION

The Risk Management Agency (RMA), previously called the Federal Crop Insurance Corporation, has provided Multiple Peril Crop Insurance (MPCI) for most major crops to U.S. farmers since the 1980s. The indemnity of this insurance is based on each insured's farm yield; that is, if a farm's yield falls below the insured's preselected coverage level, the difference is paid. However, farmers may alter their management plans after purchasing MPCI in order to save production costs and claim more indemnity, and only farmers with high risks tend to buy the insurance. These two problems, known as moral hazard and adverse selection, are believed to cause insurers great financial losses. MPCI also incurs high administration costs because the yield loss of each farm has to be evaluated individually and the management practices monitored individually. All these problems have prevented RMA from providing MPCI at a low cost. Indeed, the government paid 4.2 billion dollars to support this program between 1981 and 1990 (U.S. General Accounting Office 1995).

H. Holly Wang is with the Department of Agricultural Economics, Washington State University, Pullman, WA 99164. Hao Zhang is with The Program in Statistics, Washington State University, Pullman, WA 99164 (E-mail: [email protected]).

©2002 American Statistical Association and the International Biometric Society Journal of Agricultural, Biological, and Environmental Statistics, Volume 7, Number 1, Pages 107-127


An alternative crop insurance program, the Group Risk Plan (GRP), in which the indemnity is based on the average yield of a group of farms, was introduced by RMA in 1993. In any particular year, an insured receives an indemnity payment only when the group average yield of that year is lower than his preselected coverage level. Under this insurance, not only do moral hazard and adverse selection have no basis, but the administration cost is also greatly reduced. For administrative reasons, the current GRP bases the indemnity on the average yield of a county, since county-level average yields for major crops have been recorded for decades by the U.S. Department of Agriculture's National Agricultural Statistics Service.

The risk management effectiveness of GRP depends heavily on the homogeneity of the farms. Only when all farms have a high proximity in yields can GRP be a meaningful and effective insurance to farmers. However, geographical and natural conditions may vary from one area to another within a county, which results in different farming practices and farm yields. For example, Whitman County in eastern Washington has three distinct precipitation zones: low, intermediate, and high, with precipitation of 9-14, 15-18, and 19-24 inches, respectively. The cropping systems in the three zones are also different. Crops are grown once every 2 years in the low precipitation zone with winter wheat-summer fallow as the primary rotation, typically twice every 3 years in the intermediate precipitation zone with winter wheat-spring barley-summer fallow as the primary rotation, and annually in the high precipitation zone with wheat rotated with peas or other crops (U.S. Department of Agriculture 1978). All of these result in different wheat yield levels across the county. The county-level GRP is thus not an effective risk management instrument for farmers. This is one of the most important reasons that no Whitman farmer participated in GRP in recent years.

In a county like this, it is sensible to group together farms that resemble each other and apply GRP separately to each group. This leads to a subcounty-based GRP, which could be more effective than the county-based GRP (Wang, Hanson, Myers, and Black 1998). An important step in establishing a subcounty-based GRP is thus to group together the farms that resemble each other. An appropriate statistical approach is a cluster analysis, which searches for groups in the data in such a way that objects belonging to the same cluster resemble each other whereas objects in different clusters are dissimilar. Here it seems particularly appropriate to use partitioning, or nonhierarchical, clustering. Nonhierarchical clustering partitions n objects into a number of clusters, where the number of clusters may need to be determined. For any clustering method, a clustering criterion must be adopted; the criterion is usually a function to be optimized. Friedman and Rubin (1967) proposed some criteria and algorithms for nonhierarchical clustering based on sums of squared distances.

In recent years, there has been more research on model-based clustering (cf. Symons 1981; McLachlan 1982; McLachlan and Basford 1988; Banfield and Raftery 1993, among others). An advantage of model-based clustering is that it allows for appropriate statistical inferences about the model parameters, such as the number of clusters.


In a model-based cluster analysis, objects in a cluster are assumed to have the same probability distribution, while objects from different clusters have different distributions. These distributions are assumed to have a parametric form such as a normal distribution. The cluster analysis then consists of determining the number of clusters, estimating the parameters in each of the distributions, and classifying each observation into a proper cluster.

In most model-based clustering methods, observations are assumed to be independent. In our particular problem, wheat yields are obviously dependent; for example, they are all affected by weather conditions such as precipitation and the average temperatures of certain months. This dependence should be incorporated into the model. Considering this and the fact that there are between-cluster and within-cluster variations, we propose the following mixed effects model:

$$X_{ijt} = \mu + u_i + f_{ij} + y_t + e_{ijt}, \qquad i = 1, \ldots, I,\ \ j = 1, \ldots, n_i,\ \ t = 1, \ldots, T, \qquad (2.1)$$

where $X_{ijt}$ denotes the yield of farm $j$ in cluster $i$ in year $t$; $\mu$ is the mean yield of all farms; $u_i$ is the fixed cluster effect with $\sum_i u_i = 0$; $f_{ij}$ is the random farm effect with mean zero and standard deviation $\sigma_{i,f}$, which accounts for the variability among farms within each cluster; $y_t$ is the random year effect with mean zero and standard deviation $\sigma_y$, caused by precipitation, temperature, and other yearly factors; $e_{ijt}$ is the random error with mean zero and standard deviation $\sigma_e$; $I$ is the number of clusters; $n_i$ is the number of farms in cluster $i$; and $T$ is the total number of years. We assume that all random variables are independently and normally distributed. The effect of deviations of $y_t$ from normality will be addressed later in the article.
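To make the data-generating process concrete, the following sketch simulates yields from a model of this form. It is a minimal illustration only: the function name and the numerical values of $\mu$, $u_i$, $\sigma_{i,f}$, $\sigma_y$, $\sigma_e$, $n_i$, and $T$ are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_yields(mu, u, sigma_f, sigma_y, sigma_e, n_i, T):
    """Simulate X_{ijt} = mu + u_i + f_{ij} + y_t + e_{ijt} as in model (2.1).

    mu      : overall mean yield
    u       : fixed cluster effects (length I, summing to zero)
    sigma_f : within-cluster farm standard deviations (length I)
    sigma_y : standard deviation of the common year effect
    sigma_e : standard deviation of the residual error
    n_i     : number of farms in each cluster (length I)
    T       : number of years
    Returns an array of shape (total farms, T) and the true cluster labels.
    """
    y = rng.normal(0.0, sigma_y, size=T)            # year effects shared by all farms
    rows, labels = [], []
    for i, (ni, sf) in enumerate(zip(n_i, sigma_f)):
        f = rng.normal(0.0, sf, size=ni)            # random farm effects in cluster i
        e = rng.normal(0.0, sigma_e, size=(ni, T))  # residual errors
        rows.append(mu + u[i] + f[:, None] + y[None, :] + e)
        labels += [i] * ni
    return np.vstack(rows), np.array(labels)

# Hypothetical parameter values, chosen only for illustration.
X, labels = simulate_yields(mu=50.0, u=[-20.0, 2.0, 18.0],
                            sigma_f=[7.8, 5.7, 5.7], sigma_y=4.0, sigma_e=4.0,
                            n_i=[150, 110, 740], T=10)
x_bar = X.mean(axis=1)   # farm-level T-year average yields used for clustering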

For a fixed pair $(i, j)$, the random vector $(X_{ijt},\ t = 1, \ldots, T)$ consists of the observations of a farm over the years and may follow one of $I$ multivariate distributions, which in the model are assumed to be normal. This observation leads to the classification maximum likelihood procedure for clustering. It is not known to which cluster a random vector belongs, but these random vectors may be viewed as an identically (not independently) distributed sample from a mixture distribution with the normal distributions as components and $p_i = n_i/n$ as the mixing weights, where $n = \sum_i n_i$ is the total number of farms. This leads to the mixture maximum likelihood procedure. Both procedures are described in the next section. We employ the latter because the mixture maximum likelihood estimators are asymptotically normal and consistent (Redner and Walker 1984), whereas the classification maximum likelihood estimators can be inconsistent (Marriott 1975) and heavily biased, particularly if the groups differ greatly in size (Bryant and Williamson 1978).

In the next section, we present the clustering methods based on model (2.1) and in Section 3 apply this method to Whitman County wheat yield data. Some concluding remarks and discussion are presented in Section 4.

2. STATISTICAL ANALYSIS

Because we do not know which cluster a farm belongs to, the likelihood function (in the classical sense) for a sample from model (2.1) cannot be evaluated. Even though we could treat the farms' cluster identities as missing values and use an algorithm such as the EM algorithm to estimate the model parameters, there is another difficulty: because the farm yields $X_{ijt}$ are correlated across both time and section, the exact likelihood function involves the inverse of an $n \times n$ matrix with no block structure, which is cumbersome to handle both mathematically and numerically when the total number of observations $n$ is large. For the Whitman County data, there are 2,945 farms and each farm has as many as 10 years of yields. In addition, many farms have missing values due to crop rotations and summer fallow. For these reasons, we do not use the likelihood function of the observed $X_{ijt}$ values. Instead, we use the average yield of each farm for clustering and classification. Let $X_{ij\cdot}$, $e_{ij\cdot}$, and $y_{\cdot}$ be the averages over the omitted indices. Then
$$X_{ij\cdot} = \mu + u_i + f_{ij} + y_{\cdot} + e_{ij\cdot}$$
are normal random variables. If there are no missing values, their covariance structure is as follows:
$$\operatorname{cov}(X_{ij\cdot}, X_{kl\cdot}) =
\begin{cases}
\sigma_{i,f}^2 + \dfrac{\sigma_y^2 + \sigma_e^2}{T}, & i = k,\ j = l, \\[6pt]
\dfrac{\sigma_y^2}{T}, & \text{otherwise}.
\end{cases} \qquad (2.2)$$
We see that the correlation coefficient between any two distinct $X_{ij\cdot}$ and $X_{kl\cdot}$ approaches zero as $T$ approaches infinity. Because the random variables are normal, they are approximately independent for large $T$. If there are missing values, the $T$ in Equation (2.2) should be replaced by the corresponding number of years for which observations are available.
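The two cases in (2.2) follow directly from the averaged model: a farm shares all three random terms with itself, while two distinct farms share only the averaged year effect. Since $f_{ij}$, $y_t$, and $e_{ijt}$ are mutually independent,
$$\operatorname{var}(X_{ij\cdot}) = \operatorname{var}(f_{ij}) + \operatorname{var}(y_{\cdot}) + \operatorname{var}(e_{ij\cdot}) = \sigma_{i,f}^2 + \frac{\sigma_y^2}{T} + \frac{\sigma_e^2}{T},
\qquad
\operatorname{cov}(X_{ij\cdot}, X_{kl\cdot}) = \operatorname{var}(y_{\cdot}) = \frac{\sigma_y^2}{T} \quad \text{for } (i,j) \neq (k,l).$$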

2.1 CLUSTERING THROUGH MIXTURE DISTRIBUTION

The $T$-year average yields $\{X_{ij\cdot}\}$ consist of $I$ sets of identically (not independently) distributed random variables, $\{X_{1j\cdot},\ j = 1, \ldots, n_1\}$, $\{X_{2j\cdot},\ j = 1, \ldots, n_2\}$, $\ldots$, $\{X_{Ij\cdot},\ j = 1, \ldots, n_I\}$, that are normally distributed with mean $m_i$ and standard deviation $\sigma_i$, $i = 1, 2, \ldots, I$, respectively, where $\sigma_i^2 = \sigma_{i,f}^2 + (\sigma_y^2 + \sigma_e^2)/T$. However, these random variables are mixed together in one sample, and our goal is to identify the distributions, or clusters, and to classify each datum into an appropriate cluster.

Such mixture distribution problems appear in many fields, including agriculture, economics, fisheries, and medicine. Mixture distributions have been extensively studied and provide an important model-based approach to clustering; we refer to Everitt and Hand (1981), Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), and the references provided there. When $I$ is known, clustering and classification can be done through either the classification maximum likelihood procedure or the mixture maximum likelihood procedure, both of which are discussed below. A nice description and comparison of the two procedures are given by McLachlan (1982). Determination of the number of components is a difficult problem, and a review of some existing methods is given in Section 2.1.2.

Let us denote our sample data by $x_1, x_2, \ldots, x_n$ and the set of identifying labels by $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)$, where $\gamma_j = i$ if $x_j$ comes from the $i$th probability distribution $f_i(x; \theta)$, where $\theta$ is the parameter. Here, for our model, $f_i(x; \theta)$ is normal with mean $m_i$ and standard deviation $\sigma_i$, and the model parameter $\theta$ is the vector $(p_1, \ldots, p_{I-1}, m_i, \sigma_i,\ i = 1, 2, \ldots, I)$. The classification maximum likelihood procedure chooses the $\theta$ and $\gamma$ that maximize
$$L(\theta, \gamma) = \prod_{j=1}^{n} f_{\gamma_j}(x_j; \theta).$$

The observations $x_1, x_2, \ldots, x_n$ are then classified according to the estimates $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_n$.

Some procedures based on sums of squared distances can be obtained from the classification maximum likelihood procedure; we refer to Gordon (1981, Chapter 3) for a review of these ideas. For example, when all the mixture component distributions are normal with the same variance, the classification maximum likelihood procedure is equivalent to the minimum det(W) criterion of Friedman and Rubin (1967).

In the mixture maximum likelihood procedure, each $x_j$ is assumed to belong to the $i$th cluster with probability $p_i$ (which in our problem is $n_i/n$) and hence has the mixture probability density
$$f(x; \theta) = \sum_{i=1}^{I} p_i f_i(x; \theta). \qquad (2.3)$$

Parameters can be estimated by maximizing the corresponding likelihood function.

The classification maximum likelihood estimators are not necessarily consistent. Indeed, they are definitely inconsistent under the standard assumption of normal distributions with common variance matrices (Marriott 1975) and are asymptotically biased (Bryant and Williamson 1978). On the other hand, the mixture maximum likelihood estimators are asymptotically normal and consistent under general conditions (Redner and Walker 1984). Many recent works on estimating the number of components use mixture maximum likelihood estimation (e.g., Bozdogan 1993; Feng and McCulloch 1996; Chen and Cheng 1997). For these reasons, we will employ the mixture maximum likelihood procedure for clustering.

2.1.1 Mixture Maximum Likelihood Estimation When I Is Known

Given an i.i.d. sample $x_1, x_2, \ldots, x_n$ from the mixture distribution (2.3) with the number of components, $I$, known, the maximum likelihood estimators can be obtained by maximizing the log-likelihood function
$$l(\theta; x) = \sum_{i=1}^{n} \log f(x_i; \theta). \qquad (2.4)$$
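A standard way to maximize (2.4) numerically is the EM algorithm for a univariate normal mixture. The sketch below is only an illustration of that computation; the paper does not report which optimizer was actually used, and the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(x, I, n_iter=500, seed=0):
    """Approximate MLE of (p_i, m_i, sigma_i) for an I-component normal mixture
    by maximizing the log-likelihood (2.4) with the EM algorithm."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    p = np.full(I, 1.0 / I)
    m = np.sort(rng.choice(x, I, replace=False))   # crude starting values
    s = np.full(I, x.std())
    for _ in range(n_iter):
        # E-step: posterior probability that x_j belongs to component i
        dens = p * norm.pdf(x[:, None], m, s)       # shape (n, I)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of the mixture parameters
        nk = w.sum(axis=0)
        p = nk / n
        m = (w * x[:, None]).sum(axis=0) / nk
        s = np.sqrt((w * (x[:, None] - m) ** 2).sum(axis=0) / nk)
    loglik = np.log((p * norm.pdf(x[:, None], m, s)).sum(axis=1)).sum()
    return p, m, s, loglik
```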

However, in our problem the average yields are not exactly independent; consequently, (2.4) with $x_i$ replaced by $X_{ij\cdot}$ is not the exact likelihood but an approximation of the likelihood function of our sample from model (2.1). Because the exact likelihood function is computationally cumbersome to handle, we still maximize (2.4) for parameter estimation and investigate how this approximation affects the estimators. Note that this does not give estimates of $\sigma_{i,f}$, $\sigma_y$, and $\sigma_e$ in model (2.1), but these parameters are not needed in our approach to clustering.

We have done some simulations to evaluate these approximated maximum likelihood estimators. In the simulations, we first generate sample data from model (2.1) for a given set of parameters and then estimate the parameters by maximizing the approximated likelihood function. We compare these estimates with the maximum likelihood estimates obtained from i.i.d. samples from the corresponding mixture distribution (2.3). If the two sets of estimates are reasonably close, the approximation is considered acceptable for parameter estimation.

parameter estimation. We first considered I = 2, i.e., two components in the mixture. We have generated

i.i.d. samples from the mixture distribution (2.3) with various parameter values of pi, mi, m2, cl, and c2 and found that the MLE performs well when the two normal components are well separated, say by 1.5 standard deviations from each of the two means, i.e., the cross

point (the value at which the two densities equal) is at least 1.5 standard deviations from each of the two means. We used those parameter values for which the MLE preforms well to generate samples from the mixed model (2.1) for comparison purposes. Findings for the I = 2 case helped us choose the parameter values to be used for I = 3 since the three

components should be at least as separated as they are for I = 2 for the MLE to perform comparably to the I = 2 case.

Presented in Table 1 are the results for $m_1 = 30$, $m_2 = 52$, $\sigma_1 = 8$, $\sigma_2 = 6$, and $p_1 = 0.60$, $0.90$, $0.10$, $0.75$, and $0.25$. Sample size $n = 1,000$ and $T = 10$ (for the mixed model (2.1)) are fixed, and the results are based on 500 simulations. Two different choices of $\sigma_{1,f}$, $\sigma_{2,f}$, $\sigma_y$, and $\sigma_e$ were used, namely, $\sigma_{1,f} = 7.797$, $\sigma_{2,f} = 5.727$, $\sigma_y = 4$, $\sigma_e = 4$, and $\sigma_{1,f} = 7.155$, $\sigma_{2,f} = 4.817$, $\sigma_y = 8$, $\sigma_e = 8$.

Presented in Table 2 are the results for $I = 3$ with $m_1 = 30$, $m_2 = 52$, $m_3 = 70$, $\sigma_1 = 8$, $\sigma_2 = \sigma_3 = 6$, and $(p_1, p_2) = (0.15, 0.10)$, $(0.15, 0.75)$, $(0.75, 0.15)$, $(0.30, 0.30)$, $(0.30, 0.40)$, and $(0.40, 0.30)$, with $T = 10$. Sample size $n = 1,000$ is fixed for all sets of parameters. Two different choices of $\sigma_y$ and $\sigma_e$ were used, namely, $\sigma_y = \sigma_e = 4$ and $\sigma_y = \sigma_e = 8$; $\sigma_{i,f}$ is determined by $\sigma_i^2 = \sigma_{i,f}^2 + (\sigma_y^2 + \sigma_e^2)/T$. The $y_t$ term in the mixed model is simulated both from the normal distribution with mean zero and standard deviation $\sigma_y$ and from the beta distribution with shape parameters six and two, whose density is drawn in Figure 1; more precisely, from the transformed beta variable $(x - 0.75)\sigma_y/0.14434$, where 0.75 and 0.14434 are the mean and standard deviation of the beta(6, 2) distribution, respectively. The transformed distribution is skewed to the left with mean zero and standard deviation $\sigma_y$. We used the beta distribution to check how sensitive the estimates are to asymmetry of the distribution. Note that, even though $y_t$ is nonnormal, the average yield $X_{ij\cdot}$ is approximately normal by the central limit theorem when $T$ is large. We keep the other two random terms in the mixed model normal, since it is less justifiable to assume their nonnormality than the $y_t$'s: rarely occurring bad weather can greatly reduce crop yields, while more frequently occurring good weather can improve the yield by only a small margin, so $y_t$ might be skewed to the left.
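For reference, the left-skewed year effect used in the simulations can be generated as in the following sketch, where 0.75 and 0.14434 are the mean and standard deviation of the beta(6, 2) distribution; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def beta_year_effect(T, sigma_y):
    """Left-skewed year effects with mean zero and standard deviation sigma_y,
    obtained by shifting and scaling beta(6, 2) draws as in the simulations."""
    b = rng.beta(6.0, 2.0, size=T)
    return (b - 0.75) * sigma_y / 0.14434

y_t = beta_year_effect(T=10, sigma_y=4.0)
```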


Table 1. Summary of Estimates Based on 500 Simulations and Sample Size n = 1,000 to Compare the MLEs and AMLEs. Column I is for the i.i.d. sample from the mixture distribution; column II is for the mixed model (2.1) with T = 10, σ_y = σ_e = 4; and column III is for the mixed model (2.1) with T = 10, σ_y = σ_e = 8. All random terms in the mixed model are normal.

                      Mean of estimates             Standard deviation of estimates
Set  Parameter        I        II       III         I        II       III
1    p1 = 0.60        0.5998   0.5995   0.5988      0.0208   0.0208   0.0127
     m1 = 30          29.988   29.989   29.901      0.5602   1.4221   1.2944
     σ1 = 8           7.978    7.841    7.26        0.4238   0.3954   0.3058
     m2 = 52          51.989   52.005   51.918      0.5286   1.3875   1.3084
     σ2 = 6           5.992    5.843    4.981       0.3548   0.3411   0.2573
2    p1 = 0.90        0.8942   0.8960   0.8987      0.0265   0.0226   0.0125
     m1 = 30          29.940   30.034   30.033      0.4571   1.3869   1.3341
     σ1 = 8           7.946    7.862    7.243       0.3145   0.2921   0.2399
     m2 = 52          51.753   51.945   51.967      1.9151   2.1897   1.5938
     σ2 = 6           6.017    5.846    4.976       1.0171   0.8979   0.6712
3    p1 = 0.10        0.1043   0.1030   0.1014      0.0207   0.0189   0.0093
     m1 = 30          30.352   30.159   30.207      2.2893   2.3753   1.8472
     σ1 = 8           8.117    7.909    7.241       1.2642   1.2727   0.9436
     m2 = 52          52.007   51.918   52.056      0.2700   1.2223   1.2418
     σ2 = 6           5.977    5.862    4.973       0.2096   0.1962   0.1488
4    p1 = 0.75        0.7486   0.7505   0.7506      0.0217   0.0195   0.0170
     m1 = 30          29.987   30.012   30.041      0.4830   1.4138   1.3358
     σ1 = 8           7.979    7.894    7.266       0.3529   0.3344   0.3338
     m2 = 52          51.947   52.026   52.074      0.7768   1.5096   1.4008
     σ2 = 6           5.980    5.802    4.957       0.5003   0.4866   0.3747
5    p1 = 0.25        0.2525   0.2514   0.2508      0.0194   0.0180   0.0106
     m1 = 30          30.073   29.980   30.013      1.1319   1.6768   1.4224
     σ1 = 8           7.964    7.882    7.266       0.6983   0.6717   0.5091
     m2 = 52          51.993   51.949   51.969      0.3281   1.2713   1.2543
     σ2 = 6           5.991    5.845    4.971       0.2263   0.2086   0.1710

To investigate the effects of the sample size $n$ and the number of years $T$ on the estimates, we also used $n = 2,000$ with $T = 10$ and $n = 1,000$ with $T = 20$. Presented in Table 3 are the results for $I = 3$ with sample size $n = 2,000$ and $T = 10$, and Table 4 presents the results for $n = 1,000$ with $T = 20$. Here we used only one set of parameter values, which corresponds to set 2 in Table 2.

From the simulation results presented in these tables, the MLE from the approximated likelihood function (hereafter called the approximated MLE, or AMLE) gives competitive estimates of $p_i$. In fact, the AMLE for $p_i$ has smaller variances and seems unbiased. The AMLE for the mean $m_i$ has a larger variance than the MLE, and stronger correlations (i.e., larger $\sigma_y$ and $\sigma_e$) increase the variance. For a component with a smaller $p_i$, the variance of the AMLE for $m_i$ increases by a smaller amount than for a component with a larger $p_i$.

The AMLE persistently underestimates the standard deviations $\sigma_i$. This has to do with the positive correlations among the $X_{ijt}$.


Table 2. Summary of Estimates Based on 500 Simulations. Sample size n = 1,000 and T = 10 are fixed. Column I is for the i.i.d. sample from the mixture distribution; II is for the mixed model (2.1) with σ_y = σ_e = 4; III is for model (2.1) with σ_y = σ_e = 8; IV is for model (2.1) with σ_y = σ_e = 4 and y_t from the beta(6, 2) distribution; V is for model (2.1) with σ_y = σ_e = 8 and y_t from the beta(6, 2) distribution. The other two random terms are normal in all the simulations.

                    Mean of estimates                                 Standard deviation of estimates
Set  Parameter      I        II       III      IV       V            I        II       III      IV       V
1    p1 = 0.15      0.1473   0.1482   0.1478   0.1477   0.1483       0.0170   0.0158   0.0126   0.0163   0.0113
     p2 = 0.10      0.1176   0.1143   0.1100   0.1133   0.1095       0.0533   0.0474   0.0370   0.0438   0.0342
     m1 = 30        29.833   29.926   29.811   29.913   29.892       1.4738   1.8822   2.8979   1.8908   2.6762
     σ1 = 8         7.890    7.793    7.478    7.788    7.489        0.9216   0.9097   0.8036   0.9484   0.7695
     m2 = 52        52.532   52.587   52.279   52.428   52.401       2.8134   2.7633   3.3152   2.7612   3.0641
     σ2 = 6         6.377    6.225    5.841    6.223    5.786        2.4900   2.3427   2.0396   2.3303   1.8973
     m3 = 70        70.078   70.093   69.945   70.057   70.056       0.4285   1.3122   2.5585   1.3769   2.5187
     σ3 = 6         5.955    5.841    5.417    5.829    5.413        0.3029   0.2825   0.2257   0.2635   0.2315
2    p1 = 0.15      0.1514   0.1508   0.1515   0.1502   0.1520       0.0212   0.0174   0.0136   0.0198   0.0157
     p2 = 0.75      0.7430   0.7433   0.7465   0.7442   0.7460       0.0489   0.0414   0.0248   0.0445   0.0263
     m1 = 30        30.043   30.000   29.997   29.965   30.081       1.7255   1.968    2.7963   2.0673   2.8356
     σ1 = 8         7.941    7.797    7.573    7.815    7.616        1.0742   0.955    0.8762   0.9927   0.9097
     m2 = 52        51.971   51.964   51.865   51.985   51.903       0.3774   1.286    2.4634   1.2948   2.5368
     σ2 = 6         6.004    5.834    5.435    5.867    5.433        0.4154   0.3800   0.2922   0.4050   0.3001
     m3 = 70        69.836   69.815   69.845   69.790   69.780       2.3069   2.4684   2.8266   2.5348   2.8539
     σ3 = 6         5.943    5.865    5.400    5.886    5.436        1.1255   1.0407   0.7814   1.1105   0.8427
3    p1 = 0.75      0.7377   0.7441   0.7440   0.7438   0.7453       0.0416   0.0338   0.0272   0.0325   0.0282
     p2 = 0.15      0.1646   0.1563   0.1575   0.1577   0.1586       0.0537   0.0459   0.0365   0.0450   0.0512
     m1 = 30        29.940   29.900   30.134   29.913   29.988       0.6269   1.4466   2.5561   1.3998   2.8910
     σ1 = 8         7.944    7.884    7.542    7.874    7.542        0.3989   0.3894   0.3443   0.3725   0.3356
     m2 = 52        51.614   51.747   52.012   51.792   51.893       1.9331   2.0520   2.7520   1.9747   2.8943
     σ2 = 6         6.435    5.989    5.671    6.059    5.645        2.0395   1.8152   1.5066   1.7800   1.5893
     m3 = 70        70.140   69.946   70.313   70.040   70.422       1.8302   2.0558   2.7462   2.0461   4.6477
     σ3 = 6         5.880    5.806    5.376    5.748    5.318        1.0430   1.0033   0.7685   0.9530   0.9375


Table 2. (Cont'd)

                    Mean of estimates                                 Standard deviation of estimates
Set  Parameter      I        II       III      IV       V            I        II       III      IV       V
4    p1 = 0.30      0.2966   0.2994   0.2989   0.2984   0.2992       0.0233   0.0186   0.0153   0.0207   0.0148
     p2 = 0.30      0.3104   0.3010   0.3036   0.3052   0.3027       0.0484   0.0395   0.0303   0.0447   0.0305
     m1 = 30        29.837   30.005   29.978   29.976   29.830       1.0704   1.5577   2.7674   1.5729   2.7232
     σ1 = 8         7.914    7.845    7.519    7.858    7.546        0.6605   0.6073   0.5550   0.6348   0.5103
     m2 = 52        52.061   51.997   52.050   52.096   51.928       0.8506   1.5355   2.6609   1.5038   2.6587
     σ2 = 6         6.219    5.886    5.512    5.945    5.474        1.1605   0.9452   0.7762   1.0492   0.7850
     m3 = 70        70.094   70.006   70.040   70.073   69.925       0.6985   1.4646   2.6197   1.3936   2.6461
     σ3 = 6         5.924    5.863    5.388    5.829    5.421        0.4112   0.4003   0.3450   0.4275   0.3456
5    p1 = 0.30      0.2999   0.3010   0.3009   0.2987   0.3003       0.0208   0.0199   0.0199   0.0203   0.0140
     p2 = 0.40      0.4005   0.3983   0.3983   0.4023   0.3993       0.0430   0.0379   0.0379   0.0379   0.0263
     m1 = 30        29.957   30.009   30.009   30.014   30.039       0.9836   1.5782   1.5782   1.6060   2.648
     σ1 = 8         7.955    7.924    7.924    7.858    7.574        0.6504   0.6185   0.6185   0.6442   0.521
     m2 = 52        52.000   51.952   51.952   52.066   52.040       0.6500   1.3933   1.3933   1.4421   2.520
     σ2 = 6         6.032    5.847    5.847    5.903    5.424        0.7708   0.6911   0.6911   0.7066   0.508
     m3 = 70        69.987   69.944   69.944   70.101   70.021       0.8925   1.5536   1.5536   1.5803   2.522
     σ3 = 6         5.986    5.847    5.847    5.848    5.440        0.5200   0.4707   0.4707   0.4862   0.396
6    p1 = 0.40      0.3960   0.3977   0.3977   0.3989   0.3983       0.0255   0.0215   0.0215   0.0223   0.0170
     p2 = 0.30      0.3085   0.3054   0.3054   0.3034   0.3038       0.0487   0.4253   0.0425   0.0419   0.0308
     m1 = 30        29.877   29.974   29.974   29.989   29.804       0.8820   1.6001   1.6001   1.4980   2.5715
     σ1 = 8         7.934    7.844    7.844    7.868    7.544        0.5484   0.5304   0.5304   0.5232   0.4814
     m2 = 52        52.022   52.047   52.047   52.054   51.893       0.8194   1.5261   1.5261   1.4837   2.5014
     σ2 = 6         6.152    5.977    5.977    5.906    5.526        1.1510   1.0106   1.0101   1.0131   0.7616
     m3 = 70        70.079   70.070   70.070   70.045   69.942       0.8923   1.5435   1.5435   1.5097   2.5061
     σ3 = 6         5.295    5.811    5.810    5.822    5.397        0.5287   0.5006   0.5006   0.4797   0.3913


Figure 1. Density Curve of Beta(6, 2) Distribution.

We can see clearly why this occurs when there is only one cluster, i.e., $I = 1$. We see from (2.2) that $X = (X_{1j\cdot},\ j = 1, 2, \ldots, n_1)$ has a covariance matrix with diagonal elements all equal to $\sigma^2$ and off-diagonal elements all equal to $\rho\sigma^2$, where $\rho$ is the correlation coefficient between any two distinct elements of $X$. We simply write $X_j$ for $X_{1j\cdot}$. If the $X_j$'s are viewed as independently normally distributed, the MLE of $\sigma^2$ is $\sum_{j=1}^{n_1}(X_j - \bar X)^2 / n_1$, which is the AMLE of $\sigma^2$. Note that
$$E\left[\frac{1}{n_1}\sum_{j=1}^{n_1}(X_j - \bar X)^2\right]
= \frac{1}{n_1}\,E\left[\Bigl(1 - \frac{1}{n_1}\Bigr)\sum_{j=1}^{n_1}(X_j - \mu)^2 - \frac{2}{n_1}\sum_{i<j}(X_i - \mu)(X_j - \mu)\right]
= \Bigl(1 - \frac{1}{n_1}\Bigr)(1 - \rho)\,\sigma^2.$$
Therefore, it is biased and underestimates $\sigma^2$. Since we do not have an analytic expression for the MLEs of the variances $\sigma_i^2$ in the mixture distribution, it is difficult to give the biases of the AMLEs for the variances.
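A quick Monte Carlo check of this bias is sketched below for illustrative values $n_1 = 50$, $\rho = 0.1$, and $\sigma^2 = 1$ (these numbers are ours, not the paper's); the simulated mean of the naive variance estimator should be close to $(1 - 1/n_1)(1 - \rho)\sigma^2 = 0.882$.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, rho, sigma2 = 50, 0.10, 1.0

# Equicorrelated normals: sigma2 on the diagonal, rho * sigma2 off the diagonal.
cov = sigma2 * (rho * np.ones((n1, n1)) + (1.0 - rho) * np.eye(n1))
est = []
for _ in range(5000):
    x = rng.multivariate_normal(np.zeros(n1), cov)
    est.append(((x - x.mean()) ** 2).mean())   # the "independence" MLE of sigma^2

print(np.mean(est))                        # close to the biased expectation below
print((1 - 1 / n1) * (1 - rho) * sigma2)   # (1 - 1/n1)(1 - rho) * sigma^2 = 0.882
```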

AMLEs for the variances. When ay = a, = 4, alf = 7.7974, o2f = cr3f = 5.7271, and T = 10, the

correlation coefficient between two average farm yields Xij. is, depending on which cluster

or clusters the two farms are from, 0.025, 0.033, or 0.044. Note that the correlation coefficient

between two farm yields at any particular year is much higher: 0.172, 0.247, 0.206. When

ay = a- = 4, alf = 7.8994, a-2f = c3f = 5.8652, and T = 20, the correlation coefficient

between two average farm yields is 0.013, 0.017, or 0.022 for T = 20. When (ay = oa - 8,
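These correlation coefficients follow directly from (2.2); for example, the first set of values quoted above can be reproduced as in the following sketch.

```python
import numpy as np

sigma_y, sigma_e, T = 4.0, 4.0, 10
sigma_f = np.array([7.7974, 5.7271, 5.7271])           # sigma_{i,f} for the three clusters

var_avg = sigma_f**2 + (sigma_y**2 + sigma_e**2) / T   # variance of T-year average yields
cov_avg = sigma_y**2 / T                               # covariance between two distinct farms

# Correlations between average yields of farms from clusters (1,1), (1,2), (2,2):
print(cov_avg / var_avg[0])                            # 0.025
print(cov_avg / np.sqrt(var_avg[0] * var_avg[1]))      # 0.033
print(cov_avg / var_avg[1])                            # 0.044

# Correlations between single-year yields of two distinct farms:
var_year = sigma_f**2 + sigma_y**2 + sigma_e**2
print(sigma_y**2 / var_year[0])                        # 0.172
print(sigma_y**2 / np.sqrt(var_year[0] * var_year[1])) # 0.206
print(sigma_y**2 / var_year[1])                        # 0.247
```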


Table 3. Summary of Estimates Based on 500 Simulations. Sample size is n = 2,000 and T = 10. The columns have the same meaning as in Table 2.

              Mean of estimates                                Standard deviation of estimates
Parameter     I        II       III      IV       V           I        II       III      IV       V
p1 = 0.15     0.1502   0.1512   0.1510   0.1506   0.1504      0.0126   0.0123   0.0094   0.0124   0.0095
p2 = 0.75     0.7477   0.7454   0.7480   0.7487   0.7495      0.0262   0.0256   0.0176   0.0238   0.0166
m1 = 30       30.009   30.157   30.068   30.0753  30.034      1.0868   1.6427   2.7435   1.6548   2.7869
σ1 = 8        7.940    7.892    7.605    7.870    7.569       0.6997   0.6710   0.5709   0.6862   0.6125
m2 = 52       51.986   52.047   51.992   52.049   52.012      0.2733   1.2006   2.6464   1.2461   2.5979
σ2 = 6        6.001    5.843    5.428    5.857    5.446       0.2773   0.2640   0.2136   0.2633   0.1990
m3 = 70       69.981   69.876   69.968   70.079   70.023      1.4448   1.8163   2.8639   1.7367   2.7528
σ3 = 6        5.902    5.864    5.424    5.805    5.388       0.7545   0.7256   0.5796   0.6692   0.5726


Table 4. Means and Standard Deviations of the Estimates Based on 500 Simulations. Sample size n = 1,000, T = 20. Columns II and IV correspond to σ_y = σ_e = 4, with column II having a normal y_t term and column IV a beta y_t; columns III and V correspond to σ_y = σ_e = 8, with column III having a normal y_t term and column V a beta y_t.

              Mean of estimates                     Standard deviation of estimates
Parameter     II       III      IV       V         II       III      IV       V
p1 = 0.15     0.1522   0.1527   0.1515   0.1511    0.0201   0.0184   0.0204   0.0173
p2 = 0.75     0.7408   0.7437   0.7396   0.7421    0.0453   0.0338   0.0555   0.0430
m1 = 30       30.187   30.105   29.958   30.112    1.9169   2.3686   1.7763   2.4853
σ1 = 8        7.947    7.820    7.910    7.796     1.0132   0.9931   0.9872   0.9247
m2 = 52       51.994   51.934   51.923   52.022    0.9753   1.7914   0.9543   1.9561
σ2 = 6        5.883    6.694    5.900    5.713     0.3999   0.3680   0.4560   0.3833
m3 = 70       69.717   69.843   69.610   69.727    2.3949   2.5607   2.8267   2.8926
σ3 = 6        5.967    5.698    5.985    5.815     1.0475   0.9561   1.2148   1.0264

When $\sigma_y = \sigma_e = 8$, $\sigma_{1,f} = 7.155$, $\sigma_{2,f} = \sigma_{3,f} = 4.817$, and $T = 10$, the correlation coefficient between two average farm yields is 0.10, 0.133, or 0.178, and the correlation coefficient between two individual farm yields in any particular year is 0.357, 0.423, or 0.389. When $\sigma_y = \sigma_e = 8$, $\sigma_{1,f} = 7.5895$, $\sigma_{2,f} = \sigma_{3,f} = 5.4406$, and $T = 20$, the correlation coefficient between two average farm yields is 0.050, 0.067, or 0.089. Comparing Table 4 with Table 2, we observe that weaker correlations among the $X_{ijt}$ cause the variances $\sigma_i^2$ to be underestimated to a lesser degree. Persistently, the AMLE for $\sigma_i$ has a smaller variance than the corresponding MLE.

Nonnormality of the $y_t$ term has no obvious effect on the estimates. Comparing Table 3 with set 2 in Table 2, we see that increasing the sample size $n$ decreases the variances of the estimators, and that increasing $T$ (and therefore decreasing the correlations among the average farm yields) makes $\sigma_i^2$ less underestimated.

We conclude that the AMLE performs satisfactorily, particularly when the correlations among the individual farm yields are not too high. Note that there are no missing values in the simulations and thus $T$ is a constant. In applications, it is common to have missing values. Though the impact of the resulting lack of balance is not addressed here, we believe the AMLE is still applicable for the purpose of clustering in some cases with missing values. For example, there are missing values in the Whitman County wheat yield data due to crop rotation and summer fallow. However, the farms in a cluster are believed to follow a similar pattern of crop rotation and summer fallow. Thus, for each farm in a cluster, the actual number of years for which yield data are available remains approximately constant. Therefore, it is reasonable to view the average yields in a cluster as having approximately a common probability distribution and to employ the mixture maximum likelihood approach for clustering.

2.1.2 Determining the Number of Components

The histogram of the pooled sample generally suggests some choices for the number of components in a mixture distribution. However, formal inference for $I$ is a difficult problem.


A review of some existing methods can be found in Titterington (1990); these include, among others, Akaike's information criterion (AIC), the generalized likelihood ratio test, and the bootstrapping likelihood ratio test. Recent developments include Bozdogan (1993), Roeder (1994), Dacunha-Castelle and Gassiat (1997), and Chen and Cheng (1997).

For the mixture model (2.3), AIC (see Sclove 1983; Bozdogan and Sclove 1984) chooses the $I$ that minimizes the Akaike information,
$$\mathrm{AIC}(I) = -2\,l(\hat\theta) + 2N(I),$$
where $l$ is the log-likelihood, $\hat\theta$ is the MLE of the parameter $\theta$, and $N(I)$ is the number of free parameters in the mixture model with $I$ components. However, as explained by Titterington et al. (1985), this criterion relies on regularity conditions in order for AIC to have its usual asymptotic properties, and these regularity conditions do not hold in the context of the mixture model. The performance of AIC for mixture models has not yet been fully evaluated.
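For a univariate $I$-component normal mixture, the number of free parameters is $N(I) = 3I - 1$ ($I - 1$ mixing weights, $I$ means, and $I$ standard deviations), so the criterion is simple to compute from the maximized log-likelihood; the small sketch below uses, purely for illustration, the maximized log-likelihoods later reported in Table 5.

```python
def aic(loglik, I):
    """AIC(I) = -2 * l(theta_hat) + 2 * N(I) for a univariate I-component
    normal mixture, where N(I) = 3I - 1 is the number of free parameters."""
    return -2.0 * loglik + 2.0 * (3 * I - 1)

# e.g., using the maximized log-likelihoods reported in Table 5:
print(aic(-12586.0, 2), aic(-12581.0, 3))   # smaller values are preferred
```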

fully evaluated. The null parameter space is on the boundary of the parameter space rather than in

its interior, as is assumed in classical theory. Thus, the likelihood ratio test for the null

hypothesis I = k against the alternative hypothesis that I = k + 1 for some specific k has the same difficulty. Rather than converging to a chi-square distribution, the likelihood ratio converges to infinity at a very slow rate when the sample size increases to infinity (Hartigan 1985). McLachlan (1987) used a bootstrapping method to obtain the distribution of the likelihood ratio test statistic for any finite sample size. Some theoretical justifications for the bootstrapping test were given by Feng and McCulloch (1996). In our particular test, the bootstrap procedure proceeds as follows: First, find the MLE 00 using I = k and the MLE 0 using I = k + 1. Then calculate the likelihood ratio test statistics

w (so0, x)=2( I(0X) - I (oX)),

where X is the vector of observations and I is the log-likelihood function. To bootstrap W(o0, X), generate bootstrapping samples from the k-normal mixture with the parameter 00. From each of the bootstrap sample X*'s, construct the MLE 0* under the assumption that I = k + 1 and calculate

w (0, ) = 2 (I(0*,x*) -I (I ,X))*

From these values, the bootstrapping distribution of W(o0, X) can be obtained and so can the p-value of the test.
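A sketch of this bootstrap test is given below. It reuses the em_normal_mixture function from the earlier sketch as a stand-in for whatever maximization routine is actually used, so it is an illustration of the procedure rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_lrt(x, k, B=99, seed=0):
    """Bootstrap likelihood ratio test of I = k against I = k + 1,
    following the procedure described above."""
    rng = np.random.default_rng(seed)
    p0, m0, s0, ll0 = em_normal_mixture(x, k)
    _, _, _, ll1 = em_normal_mixture(x, k + 1)
    w_obs = 2.0 * (ll1 - ll0)

    w_boot = []
    for b in range(B):
        # Generate a bootstrap sample from the fitted k-component null mixture.
        comp = rng.choice(k, size=len(x), p=p0)
        x_star = rng.normal(m0[comp], s0[comp])
        _, _, _, ll1_star = em_normal_mixture(x_star, k + 1, seed=b)
        ll0_star = np.log((p0 * norm.pdf(x_star[:, None], m0, s0)).sum(axis=1)).sum()
        w_boot.append(2.0 * (ll1_star - ll0_star))

    p_value = (1 + sum(w >= w_obs for w in w_boot)) / (B + 1)
    return w_obs, p_value
```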

Though the bootstrapping likelihood ratio test is promising and theoretically justified in some cases, full justification has not yet been given. Indeed, the results of Feng and McCulloch (1996) are not adequate for testing $I = k$ against $I = k + 1$ when $k > 1$, since their assumption (b) (p. 611), which says that the mixture density is the same for all parameters in the null parameter space, does not hold in this case. Chen and Cheng (1997) developed a bootstrapping likelihood ratio test of $I = k$ against $I = k + 1$ in mixture models with known component distributions. However, their results are not applicable to our model, because in our model the component distributions are normal but have unknown parameters.

We note that determining the number of components in a mixture distribution remains a difficult problem. Bozdogan (1993) derived an information complexity criterion based on the inverse Fisher information matrix; this criterion is similar to AIC but has a different penalty term, and it is not clear whether it provides consistent estimators. Roeder (1994) used a graphical technique in which the components are assumed to have common variances. Dacunha-Castelle and Gassiat (1997) utilized penalized Hankel matrices to obtain a consistent estimator of the number of components; however, the choice of the penalty term is difficult. It is not our intention here to provide a new method or to evaluate the existing methods for determining the number of components. We choose the bootstrapping likelihood ratio approach to determine the number of components, as in McLachlan (1987) and Feng and McCulloch (1996). In addition, we will use quantile-quantile plots to check the adequacy of the fit of the mixture distribution.

2.2 CLASSIFICATION

We now consider how to classify a farm into one of the clusters. We will employ the following three classification methods. The first is the Bayesian classification, which maximizes the posterior probability. When an observed value $x$ (an average yield) is from the mixture distribution, the probability that it belongs to cluster $i$ is, by Bayes' formula,
$$\pi_i = \frac{p_i f_i(x)}{\sum_{s=1}^{I} p_s f_s(x)},$$
where $f_i(x)$ is the probability density function of cluster $i$. Therefore, $x$ is classified into cluster $i$ if $p_i f_i(x) = \max_j \{p_j f_j(x)\}$.

The second method is similar to minimum distance classification, but we use the standardized distance. More specifically, the farm is classified into cluster $i$ if
$$\frac{|x - m_i|}{\sigma_i} = \min_j \frac{|x - m_j|}{\sigma_j},$$

where $m_i$ and $\sigma_i$ are the mean and standard deviation of cluster $i$. The third method is to maximize the probability density function, i.e., to classify $x$ into cluster $i$ if $f_i(x) = \max_j f_j(x)$.

As with any classification method, misclassification can occur. Misclassification rates for each of the three methods can be calculated for a given set of parameters, as illustrated in the next section. Therefore, the methods can be compared and evaluated for a given set of parameters.
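The three rules can be written compactly as in the following sketch, where p, m, and s denote the estimated mixing weights, component means, and component standard deviations (the function name is ours).

```python
import numpy as np
from scipy.stats import norm

def classify(x_bar, p, m, s):
    """Return cluster labels for farm average yields x_bar under the three rules:
    Bayesian (posterior), minimum standardized distance, and maximum density."""
    x = np.asarray(x_bar)[:, None]
    dens = norm.pdf(x, m, s)                          # f_i(x) for each component
    bayes = np.argmax(p * dens, axis=1)               # maximize p_i f_i(x)
    distance = np.argmin(np.abs(x - m) / s, axis=1)   # minimize |x - m_i| / sigma_i
    density = np.argmax(dens, axis=1)                 # maximize f_i(x)
    return bayes, distance, density
```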

3. APPLICATION TO WHITMAN COUNTY FARMS

3.1 THE DATA AND EXPLORATORY ANALYSIS

Whitman is a county in eastern Washington where dryland wheat production is a prominent industry and which produces some of the highest wheat yields in the world.


Figure 2. Annual Average Winter Wheat Yields of Whitman County from 1981 to 1995.

It accounts for 20% of the wheat production in Washington. RMA has recorded dryland winter wheat farm yields for each MPCI participant for a maximum of 10 production years between 1981 and 1995. We obtained the yields for 2,945 farms and plotted the annual average yields of the farms in Figure 2, which shows no obvious trend in the data. Figure 3 shows the histogram of the temporal average yield $X_{ij\cdot}$ of each farm. The distribution seems to have three modes, with the lowest one clearly differentiated from the other two, while the two higher ones are close to each other.


Figure 3. Histogram of the Average Yields and Fitted Probability Density Functions.


Table 5. Parameter Estimates for Whitman County Wheat Yields Using a Three-Normal Mixture and a Two-Normal Mixture. Values in parentheses are the estimated standard deviations of the estimators, given by the inverse of the information matrix.

Distribution                            Component   p               μ               σ               Maximum of log-likelihood
Mixture of three normal distributions       1       0.146 (0.019)   30.36 (1.025)   7.64 (0.551)    -12,581
                                            2       0.107 (0.078)   52.56 (1.01)    5.63 (1.72)
                                            3       0.747 (-)       67.01 (2.10)    13.53 (0.91)
Mixture of two normal distributions         1       0.10 (0.010)    28.09 (0.688)   6.61 (0.448)    -12,586
                                            2       0.90 (-)        63.85 (0.394)   14.77 (0.301)

Crop yield distributions are usually skewed to the left. As discussed earlier, this is because rarely occurring bad weather can totally destroy crops, while more frequently occurring good weather can improve yields by only a small margin. Therefore, beta distributions are used by some agricultural economists when studying yield risk and crop insurance (Nelson 1991; Hennessy, Babcock, and Hayes 1997). However, the average yields over the years will be asymptotically normal as $T$ gets larger, by the central limit theorem, and the simulation results show that parameter estimation is not sensitive to departures from normality.

3.2 PARAMETER ESTIMATION AND CLUSTERING

From the histogram, the reasonable choices for $I$ are two and three. We therefore conduct a test of $I = 2$ against $I = 3$ using the bootstrapping likelihood ratio method. The bootstrap p-value for the likelihood ratio test is 0.044, which rejects the null hypothesis of $I = 2$ at the 5% level.

The approximated MLEs are presented in Table 5. For the three-component mixture model, 15% of the farms are from one cluster, 11% from another, and the remaining 74% from the third cluster. The mean yields of the three clusters are 30.36, 52.56, and 67.01 bushels (bu)/acre (ac), respectively, and the standard deviations are 7.64, 5.63, and 13.53 bu/ac. For the two-component mixture model, 10% of the farms are from the component with mean 28.09 and standard deviation 7.33, and 90% from the component with mean 63.85 and standard deviation 14.77. It seems that the two-normal mixture combines two components of the three-normal mixture into one.

To examine the goodness of fit, we plot the fitted probability density functions onto the histogram (Figure 3). We see that the three-normal mixture fits slightly better than the two-normal mixture. This is confirmed by the Q-Q plots given in Figure 4.


Figure 4. Quantile-Quantile Plots (left: quantiles of the 3-normal mixture; right: quantiles of the 2-normal mixture).

3.3 CLASSIFICATION

We use the three-component mixture distribution for classification. The Bayesian classification is unsatisfactory, since no farm is classified into cluster 2. The reason is that, according to the clustering result in the previous subsection, the second and third clusters have means close to each other, but cluster 2 contains only 11% of all farms while cluster 3 contains 74%. The huge difference in the proportions causes more farms to be classified into cluster 3 by the Bayesian method.

maximum probability density method. For the minimum distance classification, 17.4% are classified into cluster 1, 22.3% into cluster 2, and 60.3% into cluster 3. For the maximum

probability density classification, 16.9% are classified into cluster 1, 31.1% into cluster

2, and 52% into cluster 3. The proportions of farms classified to the three clusters do not

agree well with the proportions in the clustering result or in the mixture model due to misclassifications. As we will see below, these classification results are really what should be expected.

We now briefly discuss the misclassification rates of the three methods. For simplicity, we assume that the sample is from the mixture distribution with three normal components with parameters as in Table 5, i.e., that the estimates are the true parameter values. The Bayesian classification does not perform well because of the huge difference between $p_2$ and $p_3$; in fact, $p_3 f_3(x) > p_2 f_2(x)$ for all $x$, so no data are classified into cluster 2. We therefore focus on the misclassification rates of the other two methods.

It is easy to see that the minimum distance classification classifies $x$ into cluster 1 if $x < 43.1426$, into cluster 2 if $x$ is between 43.1426 and 56.8072, and into cluster 3 if $x > 56.8072$ (the three distance functions are plotted in Figure 5). Let $a_{ij}$ be the conditional probability that $x$ is classified into cluster $j$ given that it is from cluster $i$.

Page 19: may use content in the JSTOR archive only for your ...zhanghao/Paper/JABES2002.pdf · A model-based clustering method for cross-sectional time series data is proposed and applied

H. H. WANG AND H. ZHANG

Table 6. Classification of Farms (percentage of farms classified into each cluster)

           Classification results
Cluster    Minimum distance    Maximum density
1          17.4                16.9
2          22.3                31.1
3          60.3                52.0

The $a_{ij}$ can be calculated directly from the normal distributions:
$$a_{i1} = \Phi\!\left(\frac{43.1426 - m_i}{\sigma_i}\right), \qquad
a_{i2} = \Phi\!\left(\frac{56.8072 - m_i}{\sigma_i}\right) - \Phi\!\left(\frac{43.1426 - m_i}{\sigma_i}\right), \qquad
a_{i3} = 1 - \Phi\!\left(\frac{56.8072 - m_i}{\sigma_i}\right),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. For example, $a_{12} = 0.046744$, $a_{13} = 0.000266$, $a_{21} = 0.047011$, $a_{23} = 0.225302$, $a_{31} = 0.038817$, and $a_{32} = 0.186486$. The probability that $x$ is classified into cluster $j$ is then $\sum_i p_i a_{ij}$; for $j = 1, 2, 3$, this probability is 0.1732, 0.2240, and 0.6028, respectively. Note that the classification results in Table 6 are very close to these theoretical probabilities.
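For example, the cutoffs and the theoretical classification probabilities quoted above can be reproduced from the Table 5 estimates as follows.

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.146, 0.107, 0.747])
m = np.array([30.36, 52.56, 67.01])
s = np.array([7.64, 5.63, 13.53])

# Cutoffs where the standardized distances to adjacent clusters are equal:
c12 = (s[1] * m[0] + s[0] * m[1]) / (s[0] + s[1])   # about 43.14
c23 = (s[2] * m[1] + s[1] * m[2]) / (s[1] + s[2])   # about 56.81

# a[i, j] = P(classified into cluster j | from cluster i) for the distance rule
a = np.column_stack([norm.cdf(c12, m, s),
                     norm.cdf(c23, m, s) - norm.cdf(c12, m, s),
                     1 - norm.cdf(c23, m, s)])

print(c12, c23)
print(p @ a)    # about (0.173, 0.224, 0.603), matching the text
```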

classification results in Table 5 are very close to these theoretical probabilities. The maximum probability density method classifies x into cluster 1 if x < 42.6, into

cluster 2 if x is between 42.6 and 60.52, and into cluster 3 if x > 60.52 (see Figure 6).

Figure 5. Plots of the Distance Functions for Clusters 1, 2, and 3.



Figure 6. Probability Densities for the Three Normal Components.

Analogously, the probability that $x$ is classified into cluster 1, 2, or 3 is 0.1687, 0.3117, and 0.5196, respectively. These results agree well with the classification results in Table 6.

In summary, the minimum distance classification outperforms the other two methods for the Whitman County wheat yield data. The actual classification proportions agree well with the theoretical classification probabilities, and this agreement corroborates the adequacy of the mixture distribution fit to the data.

Based on the minimum distance classification results, we plotted the means, medians, and first and third quartiles of the individual farm yields within each cluster against time (Figure 7). These plots reveal three distinct clusters. Though the means change from year to year within a cluster, such changes are due to the year effects. The variation of individual wheat yields remains approximately the same from year to year within a cluster but differs between clusters. This supports our model assumptions.

4. SUMMARY AND DISCUSSION

In this article, we have used a model-based clustering method to group together farms that are similar to each other, for the purpose of designing an effective subcounty-based GRP for crop insurance. We built the model by incorporating variations in yields due to various causes. Clusters were identified through estimation of the parameters of a mixture distribution. An advantage of this model-based approach to clustering is that not only does the model itself define a cluster, but formal inferences about the model parameters can also be carried out.



Figure 7. Plots of the Mean, Median, and the First and Third Quartiles Against Time for Each of the Three Clusters.

We applied the method to wheat yields of Whitman County, Washington, and identified three clusters. Because the identities of farms in the data are not known, the geographical interpretation of the clusters can only be conjectured. The clustering results agree well with agricultural economists' understanding of Whitman County wheat yields and may correspond more or less to the three distinct precipitation zones. However, it is impossible to verify this conjecture without farm identities. The method is nonetheless still valuable to the design of a subcounty-based GRP since it statistically identifies clusters and may validate the empirical hypothesis about the heterogeneity of farms in a region.

ACKNOWLEDGMENTS

The authors wish to thank the associate editor and two reviewers for insightful comments.

[Received December 1998. Accepted February 2001.]

REFERENCES

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.

Bozdogan, H. (1993), "Choosing the Number of Component Clusters in the Mixture-Model Using a New Information Complexity Criterion of the Inverse-Fisher Information Matrix," in Studies in Classification, Data Analysis, and Knowledge Organization, eds. O. Opitz, B. Lausen, and R. Klar, Heidelberg: Springer-Verlag, pp. 40-54.

Bozdogan, H., and Sclove, S. L. (1984), "Multi-Sample Cluster Analysis Using Akaike's Information Criterion," Annals of the Institute of Statistical Mathematics, 36, 163-180.

Bryant, P., and Williamson, J. A. (1978), "Asymptotic Behavior of Classification Maximum Likelihood Estimates," Biometrika, 65, 272-281.

Chen, J., and Cheng, P. (1997), "On Testing the Number of Components in Finite Mixture Models With Known Relevant Component Distributions," Canadian Journal of Statistics, 25, 389-400.

Dacunha-Castelle, D., and Gassiat, E. (1997), "The Estimation of the Order of a Mixture Model," Bernoulli, 3, 279-299.

Everitt, B. S., and Hand, D. J. (1981), Finite Mixture Distributions, London: Chapman and Hall.

Feng, Z. Z., and McCulloch, C. E. (1996), "Using Bootstrap Likelihood Ratios in Finite Mixture Models," Journal of the Royal Statistical Society, Series B, 58, 609-617.

Friedman, H. P., and Rubin, J. (1967), "On Some Invariant Criteria for Grouping Data," Journal of the American Statistical Association, 62, 1159-1178.

Gordon, A. D. (1981), Classification: Methods for the Exploratory Analysis of Multivariate Data, New York: Chapman and Hall.

Hartigan, J. A. (1985), "A Failure of Likelihood Asymptotics for Normal Mixtures," in Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (Vol. II), eds. L. LeCam and R. A. Olshen, Belmont, CA: Wadsworth and Brooks, pp. 807-810.

Hennessy, D. A., Babcock, B. A., and Hayes, D. J. (1997), "Budgetary and Producer Welfare Effects of Revenue Insurance," American Journal of Agricultural Economics, 79, 1024-1034.

McLachlan, G. J. (1982), "The Classification and Mixture Maximum Likelihood Approaches to Cluster Analysis," in Handbook of Statistics (Vol. 2), eds. P. R. Krishnaiah and L. N. Kanal, Amsterdam: North-Holland, pp. 199-208.

McLachlan, G. J. (1987), "On Bootstrapping the Likelihood Ratio Test Statistic for the Number of Components in a Normal Mixture," Applied Statistics, 36, 318-324.

McLachlan, G. J., and Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker.

Marriott, F. H. C. (1975), "Separating Mixtures of Normal Distributions," Biometrics, 31, 767-769.

Nelson, C. H. (1991), "The Influence of Distributional Assumption on the Calculation of Crop Insurance Premia," North Central Journal of Agricultural Economics, 12, 71-78.

Redner, R. A., and Walker, H. F. (1984), "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Review, 26, 195-239.

Roeder, K. (1994), "A Graphical Technique for Determining the Number of Components in a Mixture of Normals," Journal of the American Statistical Association, 89, 487-495.

Sclove, S. L. (1983), "Application of the Conditional Population-Mixture Model to Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 428-433.

Symons, M. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 35-43.

Titterington, D. M. (1990), "Some Recent Research on the Analysis of Mixture Distributions," Statistics, 21, 619-641.

Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, London: Wiley.

Wang, H. H., Hanson, S. D., Myers, R. J., and Black, J. R. (1998), "The Effects of Yield Insurance Designs on Farmer Participation and Welfare," American Journal of Agricultural Economics, 80, 806-820.

U.S. Department of Agriculture (1978), "Palouse Cooperative River Basin Study," Cooperative Study by the Soil Conservation Service, Forest Service, and Economics, Statistics, and Cooperatives Service of Whitman County, WA, Washington, DC: U.S. Government Printing Office.

U.S. General Accounting Office (1995), "Crop Insurance: Additional Actions Could Further Improve Program's Financial Condition," Report to the Ranking Minority Member, Committee on Agriculture, Nutrition, and Forestry, U.S. Senate, GAO/RCED-95-269, Washington, DC: Government Printing Office.
