clustering on a microcomputer with an application to the classification of coals

Analytica Chimica Acta, 153 (1983) 257-260 Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

Short Communication

CLUSTERING ON A MICROCOMPUTER WITH AN APPLICATION TO THE CLASSIFICATION OF COALS

L. KAUFMAN, A. PIERREUX and P. ROUSSEEUW

Department of Statistics, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels (Belgium)

M. P. DERDE, M. R. DETAEVERNIER and D. L. MASSART*

Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels (Belgium)

G. PLATBROOD

LABORELEC, P.O. Box 11, B-1640 Sint-Genesius-Rode (Belgium)

(Received 6th June 1983)

Summary. The widespread introduction of microcomputers in laboratories, where large data sets are gathered more and more frequently, makes suitably adapted statistical software necessary. This paper describes a BASIC program for the Macnaughton-Smith clustering method adapted to the Apple II microcomputer. An application of the program to the classification of a set of coals is reported.

Clustering programs have the reputation of needing a lot of computer memory and time. This is due to the need to work with a large dissimilarity matrix (n × n for n objects). The usual agglomerative hierarchical methods reduce the matrix one unit at a time, while nonhierarchical methods need the whole matrix all the time. For use with microcomputers, this is an important handicap. In view of the growing number of microcomputers in laboratories, it seemed worthwhile to examine the possibility of using microcomputers for clustering.

Macnaughton-Smith’s algorithm [1] was selected to achieve this. It is of the hierarchical type but, in contrast to the better-known agglomerative algorithms, it is divisive. This means that one starts with all the objects to be clustered and divides them first into two groups; each of these groups is then further divided into two, and so on. This leads to the kind of result shown in Fig. 1. A more convenient way of representing the results is given in Fig. 2. In this way, one more or less halves the matrix immediately and keeps on doing so, so the algorithm should be more readily adapted to microcomputers.

0003-2670/83/$03.00 © 1983 Elsevier Science Publishers B.V.

Fig. 1. Sample output of a divisive clustering algorithm.

Methodology

A set of n objects is considered; each of them is measured on p variables. The dissimilarity between the ith and the jth object is denoted by dist(i,j). The whole set of objects, A, has to be divided into two subsets. Therefore a splinter group B is constructed by sequential addition of one object at a time.

Initiation of the splinter group. For each k belonging to A, calculate

$$G_k = (|A| - 1)^{-1} \sum_{i \in A} \mathrm{dist}(k,i)$$

where $|A|$ is the number of objects in set A. Call m the object which maximizes $G_k$:

$$G_m = \max_{k \in A} G_k.$$

When m is not unique, it is chosen arbitrarily. Now m is the object with the greatest total dissimilarity from the rest of the group. Then construct a new set A by taking away object m. Object m goes to the splinter group B.
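The initiation step can be sketched in a few lines of Python. This is only an illustration, not the authors' BASIC program: the function name `pick_splinter_seed`, the list-of-lists matrix layout and the toy matrix are assumptions made for the example. Ties are resolved by taking the first maximizer, which is one possible reading of "chosen arbitrarily".

```python
def pick_splinter_seed(dist, members):
    """Return the object in `members` with the largest average dissimilarity G_k."""
    if len(members) < 2:
        raise ValueError("need at least two objects to start a splinter group")

    def g(k):
        # Average dissimilarity of k to the other objects of the set
        return sum(dist[k][i] for i in members if i != k) / (len(members) - 1)

    # max() returns the first maximizer, used here as the arbitrary tie-break
    return max(members, key=g)


# Hypothetical 4-object dissimilarity matrix (symmetric, zero diagonal)
toy = [
    [0, 2, 3, 9],
    [2, 0, 1, 8],
    [3, 1, 0, 7],
    [9, 8, 7, 0],
]
seed = pick_splinter_seed(toy, [0, 1, 2, 3])
```

In the toy matrix, the last object is far from the other three, so it is the one selected to start the splinter group.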

Fig. 2. Graphical representation of the output of Fig. 1.

Fig. 3. Graphical representation of the program output of the classification of coals. For symbols, see text.

Allocating new objects to the splinter group and setting up a stopping rule. For each k belonging to A, calculate

$$DA_k = \begin{cases} (|A| - 1)^{-1} \sum_{i \in A} \mathrm{dist}(k,i) & \text{if A contains at least 2 objects,} \\ 0 & \text{if A contains a single object,} \end{cases}$$

$$DB_k = |B|^{-1} \sum_{i \in B} \mathrm{dist}(k,i),$$

$$D_k = DA_k - DB_k.$$

Call m the object which maximizes $D_k$: $D_m = \max_{k \in A} D_k$. There are two possibilities.

(1) If $D_m > 0 \Leftrightarrow DA_m > DB_m$, the object m is more similar to B than to the rest of the set A; object m is taken away from A and placed in B; repeat the allocation step.

(2) If $D_m \le 0 \Leftrightarrow DA_m \le DB_m$, no more objects will be transferred to the splinter group (stopping rule).

The set A is now divided into two subsets, namely B and a new set A. These subsets can be split up again until each subset consists of only one object.
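The complete procedure, initiation, allocation with the stopping rule, and repeated splitting, might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Applesoft program described here: the names `split` and `divisive`, the nested-tuple output and the toy matrix are hypothetical. The sketch also assumes that a set A reduced to a single object is never emptied, and it splits every subset recursively because the text does not prescribe an order.

```python
def split(dist, members):
    """Split `members` into (remainder A, splinter group B)."""
    a = list(members)

    def avg(k, group):
        # Average dissimilarity of k to the other objects of `group`
        others = [i for i in group if i != k]
        return sum(dist[k][i] for i in others) / len(others) if others else 0.0

    # Initiation: the object with the largest average dissimilarity starts B
    seed = max(a, key=lambda k: avg(k, a))
    a.remove(seed)
    b = [seed]

    # Allocation: move the object maximizing D_k = DA_k - DB_k while D_k > 0
    while len(a) > 1:  # assumption: a single remaining object stays in A
        m = max(a, key=lambda k: avg(k, a) - avg(k, b))
        if avg(m, a) - avg(m, b) <= 0:
            break  # stopping rule
        a.remove(m)
        b.append(m)
    return a, b


def divisive(dist, members):
    """Split recursively down to single objects; returns nested tuples."""
    if len(members) == 1:
        return members[0]
    a, b = split(dist, members)
    return (divisive(dist, a), divisive(dist, b))


# Hypothetical 4-object dissimilarity matrix (symmetric, zero diagonal)
toy = [
    [0, 2, 3, 9],
    [2, 0, 1, 8],
    [3, 1, 0, 7],
    [9, 8, 7, 0],
]
tree = divisive(toy, [0, 1, 2, 3])
```

Each level of the nested tuple corresponds to one division, so the structure can be drawn by hand as a diagram of the kind referred to in Fig. 2.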

The method and applications are discussed in a book about clustering applied to analytical chemistry [2]. As in all clustering applications, a dissimilarity measure is needed; here it is the Euclidean distance. The program was written in Applesoft and implemented on an Apple II with 48K. The output is a list of clusters. Figures are not produced by the program and must be drawn manually.
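As an illustration of the dissimilarity measure, the following Python sketch builds the full matrix of Euclidean distances from a table of measurements. The function name `euclidean_matrix` and the sample rows are hypothetical. The text does not say whether the variables were scaled before the distances were computed, so the sketch uses the raw values.

```python
import math


def euclidean_matrix(rows):
    """Return the n x n matrix of Euclidean distances between measurement rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d  # symmetric, so each pair is computed once
    return dist


# Hypothetical measurements: 3 objects, 2 variables each
rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = euclidean_matrix(rows)
```

Because the matrix is symmetric with a zero diagonal, only n(n - 1)/2 values are distinct, which matters when memory is as limited as on a 48K machine.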

Data

To test the program, a set of data about coals was used. The data set consisted of the results of the determination of the elements Al, Si, Fe, Ca, Mg, Na, K and S and the minerals illite, dolomite, kaolinite, gypsum, pyrite, calcite and silica for 53 coals of different origin and quality. Coals from the following origins were present: Australia (A), Canada (C), the U.S.A. (U), Western Europe (E) and South Africa (S). These coals were also subjected to a sintering test and divided into three categories. These categories are related to the possible use of the coals and are symbolized by 4 (low sintering value), t (high sintering value) and - (intermediate sintering value).

Results

The clustering was obtained in 96 min of computation time. The result is given in Fig. 3. The first division separates a cluster containing all the Australian and Canadian coals together with two South African coals and one U.S. coal. All these coals have a low or intermediate sintering value. The larger cluster is split up into two subclusters, the smaller of which contains all the South African coals. The larger subcluster then splits again into two: one of the resulting clusters contains only intermediate quality coals and all the remaining U.S. coals; the other contains only European coals. The clustering was repeated on a CDC CYBER 170/750 mainframe computer with the MASLOC clustering program, the quality of which is well proven [3, 4]. Very similar results were obtained. As a result, the following conclusions can be drawn.


TABLE 1

Computer time required for problems of different size

Number of objects                                      10   20   30   40   50
Required computing time (minutes, excluding output)     1    6   17   39   83

First, the clustering program developed for the microcomputer performs well. In order to estimate the upper limit on the number of objects that can be handled with the present program, the required computing time was noted for several problems. The dissimilarity matrix was used as the starting point for the program. The results are given in Table 1. In view of these results, it seems reasonable to say that up to approximately 60 objects can be clustered with this program on the Apple II computer. The rapid increase in performance of present-day microcomputers makes it reasonable to expect that clustering programs for a few hundred objects will be available quite soon.
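As a rough check on how quickly the computing time grows, the timings in Table 1 can be fitted to a power law t ≈ c·n^k by least squares on the logarithms. The power-law form is an assumption made for this sketch; the text does not state a complexity model, and the helper `predicted_minutes` is hypothetical.

```python
import math

# Timings from Table 1: number of objects and minutes (excluding output)
sizes = [10, 20, 30, 40, 50]
minutes = [1, 6, 17, 39, 83]

# Least-squares line through (log n, log t); the slope is the exponent k
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in minutes]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
log_c = y_mean - k * x_mean


def predicted_minutes(n):
    """Extrapolate the fitted power law (an assumed model) to n objects."""
    return math.exp(log_c + k * math.log(n))


print(f"fitted exponent k = {k:.2f}")
print(f"predicted time for 60 objects = {predicted_minutes(60):.0f} min")
```

The fitted exponent lies between 2 and 3, so doubling the number of objects multiplies the computing time several times over, which is consistent with treating roughly 60 objects as a practical limit on this machine.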

Secondly, the clustering of coals reveals that geographical origin is a very important factor in determining the mineral content of coal. The agreement between sintering values and the clustering is somewhat weaker and, in fact, the relation may be fortuitous.

REFERENCES

1 P. Macnaughton-Smith, W. T. Williams, M. B. Dale and L. G. Mockett, Nature, 202 (1964) 1034.

2 D. L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.

3 A. M. Massart-Leën and D. L. Massart, Biochem. J., 196 (1981) 611.

4 D. L. Massart, L. Kaufman and K. H. Esbensen, Anal. Chem., 54 (1982) 911.
