Enhancing grid-density based clustering for high dimensional data

The Journal of Systems and Software 84 (2011) 1524–1539
doi:10.1016/j.jss.2011.02.047
© 2011 Elsevier Inc. All rights reserved.

Yanchang Zhao (a), Jie Cao (b,*), Chengqi Zhang (c), Shichao Zhang (d,*)

a Centrelink, Australia
b Jiangsu Provincial Key Laboratory of E-business, Nanjing University of Finance and Economics, Nanjing, 210003, P.R. China
c Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
d College of CS & IT, Guangxi Normal University, Guilin, China

* Corresponding authors. E-mail addresses: [email protected] (Y. Zhao), [email protected] (J. Cao), [email protected] (C. Zhang), [email protected] (S. Zhang).

Article history: Received 26 July 2010; Received in revised form 9 February 2011; Accepted 25 February 2011; Available online 8 March 2011.

Keywords: Clustering; Subspace clustering; High dimensional data

Abstract

We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units, so that the size requirement on cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of the ith-order neighbors is defined to avoid considering the exponential number of neighboring cells; and a novel density compensation is proposed for improving the clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves the clustering accuracy and quality.

1. Introduction

Clustering, as one of the main techniques in data mining, aims to find "natural" groups in datasets. Not only can it be used stand-alone in database segmentation and data compression, it can also be employed in the preprocessing procedures of other data mining techniques, such as classification, association rules, and so on.

Density-based clustering (Ankerst et al., 1999; Ester et al., 1996; Hinneburg and Keim, 1998) and grid-based clustering (Sheikholeslami et al., 1998; Wang et al., 1997) are two well-known clustering approaches. The former is famous for its capabilities of discovering clusters of various shapes, effectively eliminating outliers and being insensitive to the order of inputs, whereas the latter is well known for its high speed. However, neither approach is scalable to high dimensionality. For density-based approaches, the reason is that the index structures, such as the R*-tree, are not scalable to high-dimensional spaces. For grid-based approaches, the reason is that both the number of cells and the count of neighboring cells grow exponentially with the dimensionality of data. Grid-based algorithms take cells as atomic units which are inseparable, and thus the interval partitioned in each dimension must be small enough to ensure the resolution of clustering. Therefore, the number of cells will increase exponentially with dimensionality. Some researchers try to break the curse of dimensionality by using the adaptive grid (Nagesh et al., 1999), the optimal grid (Hinneburg and Keim, 1999), or an a priori-like approach (Agrawal et al., 1998).

Previously, we developed an algorithm called AGRID (Advanced GRid-based Iso-Density line clustering), which combines density-based and grid-based approaches to cluster large high-dimensional data (Zhao and Song, 2003). Based on the idea of density-based clustering, it employs a grid to reduce the complexity of distance computation and can discover clusters of arbitrary shapes efficiently. However, in order to reduce the complexity of density computation, only (2d + 1) out of all 3^d neighbors are considered for each cell when computing the densities of objects in it. When the dimensionality is high, most neighboring cells are ignored and the accuracy becomes very poor.

In this paper, we present an enhanced grid-density based algorithm for clustering high dimensional data, referred to as AGRID+, which substantially improves the accuracy of density computation and clustering. AGRID+ has four main distinct technical features. The first is that objects (or points), instead of cells, are taken as the atomic units. In this way, it is no longer necessary to set the intervals very small, so the number of cells does not grow dramatically with the dimensionality of data. The second feature is the concept of ith-order neighbors, with which the neighboring cells are organized into a couple of groups to improve efficiency and meet different requirements of accuracy. As a result, we obtain a trade-off between accuracy and speed in AGRID+. The third is the technique of density compensation, which improves the accuracy greatly. Last but not the least, a new distance measure, minimal subspace distance, is designed for subspace clustering.

The rest of the paper is organized as follows. In Section 2, we present the related work and some concepts needed. The AGRID+ clustering algorithm is designed in Section 3, in which an idea to adapt our algorithm for subspace clustering is also given.




Section 4 shows the results of experiments on both synthetic and public datasets. Some discussions are given in Section 5. Conclusions are made in Section 6.

2. Related work

Most clustering algorithms fall into four categories: partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering. The idea of partitioning clustering is to partition the dataset into k clusters, each represented by the centroid of the cluster (k-Means) or by one representative object of the cluster (k-Medoids). It uses an iterative relocation technique that improves the partitioning by moving objects from one group to another. Well-known partitioning algorithms are k-Means (Alsabti et al., 1998), k-Medoids (Huang, 1998) and CLARANS (Ng and Han, 1994).

Hierarchical clustering creates a hierarchical decomposition of the dataset in a bottom-up (agglomerative) or top-down (divisive) manner. A major problem of hierarchical methods is that they cannot correct erroneous decisions. Famous hierarchical algorithms are AGNES, DIANA, BIRCH (Zhang et al., 1996), CURE (Guha et al., 1998), ROCK (Guha et al., 1999) and Chameleon (Karypis et al., 1999).

The general idea of density-based clustering is to continue growing a given cluster as long as the density (i.e., the number of objects) in the neighborhood exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shapes. The density of an object is defined as the number of objects in its neighborhood, so the density of each object has to be computed first. A naive way is to calculate the distance between each pair of objects and count the number of objects in the neighborhood of each object as its density, which is not scalable with the size of datasets, since the computational complexity is O(N^2), where N is the number of objects. Typical density-based methods are DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999) and DENCLUE (Hinneburg and Keim, 1998).

Grid-based algorithms quantize the data space into a finite number of cells that form a grid structure, and all of the clustering operations are performed on the grid structure. The main advantage of this approach is its fast processing time. However, it does not work effectively and efficiently in high-dimensional space due to the so-called "curse of dimensionality". Well-known grid-based approaches for clustering include STING (Wang et al., 1997), WaveCluster (Sheikholeslami et al., 1998), OptiGrid (Hinneburg and Keim, 1999), CLIQUE (Agrawal et al., 1998) and MAFIA (Nagesh et al., 1999); they are sometimes called density-grid based approaches (Han and Kamber, 2001; Kolatch, 2001).

STING is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells and organized into a statistical information cell hierarchy (Wang et al., 1997). Thus, the statistical information associated with spatial cells is captured, and queries and clustering problems can be answered without recourse to the individual objects. The hierarchical structure of grid cells and the statistical information associated with them make STING very fast. STING assumes that K, the number of cells at the bottom layer of the hierarchy, is much less than the number of objects, and the overall computational complexity is O(K). However, K can be much greater than N in high-dimensional data.

Sheikholeslami et al. (1998) proposed a technique named WaveCluster to look at the multidimensional data space from a signal processing perspective. The objects are taken as a d-dimensional signal, so the high frequency parts of the signal correspond to the boundaries of clusters, while the low frequency parts which have high amplitude correspond to the areas of the data space where data are concentrated. It first partitions the data space into cells, then applies a wavelet transform on the quantized feature space and detects the dense regions in the transformed space. With the multi-resolution property of the wavelet transform, it can detect clusters at different scales and levels of detail. The time complexity of WaveCluster is O(dN log N).

The basic idea of OptiGrid is to use contracting projections of the data to determine the optimal cutting hyper-planes for partitioning the data (Hinneburg and Keim, 1999). The data space is partitioned with arbitrary (non-equidistant, irregular) grids based on the distribution of data, which avoids the effectiveness problems of the existing grid-based approaches and guarantees that all clusters are found by the algorithm, while still retaining the efficiency of a grid-based approach. The time complexity of OptiGrid is between O(dN) and O(dN log N).

CLIQUE (Agrawal et al., 1998), MAFIA (Nagesh et al., 1999) and Random Projection (Fern and Brodley, 2003) are three algorithms for discovering clusters in subspaces. CLIQUE discovers clusters in subspaces in a way similar to the Apriori algorithm. It partitions each dimension into intervals and computes the dense units in all dimensions. Then these dense units are combined to generate the dense units in higher dimensions.

MAFIA is an efficient algorithm for subspace clustering using a density and grid based approach (Nagesh et al., 1999). It uses adaptive grids to partition a dimension depending on the distribution of data in the dimension. The bins and cells that have low density of data are pruned to reduce the computation. The boundaries of the bins are not rigid, which improves the quality of clustering.

Fern and Brodley proposed Random Projection to find the subspaces of clusters in a random projection and ensemble way (Fern and Brodley, 2003). The dataset is first projected into random subspaces, and then the EM algorithm is used to discover clusters in the projected dataset. The algorithm generates several groups of clusters with the above method and then combines them into a similarity matrix, from which the final clusters are discovered with an agglomerative clustering algorithm.

Moise et al. (2008) proposed P3C, a robust algorithm for projected clustering. Based on the computation of so-called cluster cores, it can effectively discover projected clusters in the data while minimizing the number of required parameters. Moreover, it can work on both numerical and categorical datasets.

Assent et al. (2008) proposed an algorithm capable of finding parallel clusters in different subspaces in spatial and temporal databases. Although they also use the notions of neighborhood and density, their target problem is clustering sequence data, rather than the generic data considered in this paper.

Previously, we proposed AGRID, a grid-density based algorithm for clustering (Zhao and Song, 2003). It has the advantages of both density-based clustering and grid-based clustering, and is effective and efficient for clustering large high-dimensional data. However, it is not accurate enough, because only the 2d immediate neighbors are taken into consideration. Moreover, it is incapable of discovering clusters in subspaces.

With the AGRID algorithm, firstly, each dimension is divided into multiple intervals and the data space is thus partitioned into many hyper-rectangular cells. Objects are assigned to cells according to their attribute values. Secondly, for an object α in a cell, we only compute the distances between it and the objects in its neighboring cells, and use the count of those objects which are close to α as its density. Objects that are not in the neighboring cells are far away from α, and therefore do not contribute to the density of α. Thirdly, each object is taken as a cluster, and every pair of objects which are in the neighborhood of each other is checked to see whether they are close enough to be merged into one cluster. If yes, then the two clusters to which the two objects respectively belong are merged into a single cluster. All eligible pairs of clusters meeting the above requirement are merged to generate larger clusters, and the clustering finishes when all such object pairs have been checked.
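For illustration only (this is our sketch, not the authors' code), the grid-accelerated density step just described can be rendered in Python as follows: objects are bucketed into cells of side L, and the density of an object is the number of other objects within L-infinity distance r found by scanning only the neighboring cells. For clarity the sketch scans all 3^d neighboring cells rather than the (2d + 1) subset that AGRID actually uses; L and r are hypothetical parameters.

import itertools
from collections import defaultdict

def cell_of(point, L):
    # Map a point to the ID of the hyper-rectangular cell containing it.
    return tuple(int(x // L) for x in point)

def grid_density(points, L, r):
    # Density of each point = number of other points within L-infinity
    # distance r, counted by scanning only the neighboring cells.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_of(p, L)].append(idx)

    densities = []
    for p in points:
        cid = cell_of(p, L)
        count = 0
        # visit the cell itself and all cells whose interval IDs differ by at most 1
        for offset in itertools.product((-1, 0, 1), repeat=len(cid)):
            neighbor = tuple(c + o for c, o in zip(cid, offset))
            for j in cells.get(neighbor, []):
                if max(abs(a - b) for a, b in zip(p, points[j])) <= r:
                    count += 1
        densities.append(count - 1)  # exclude the point itself
    return densities

points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9)]
print(grid_density(points, L=0.5, r=0.2))  # [1, 1, 0]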
Fig. 1. Two definitions of neighbors. The grey cell labelled with "(i,j)" in the center is C_α, and the grey cells around it are its neighbors. (a) The 3^d neighbors; (b) the (2d + 1) immediate neighbors.

With the idea of grid and neighbor, 3^d neighboring cells (see Fig. 1(a)) or (2d + 1) immediate neighboring cells (including C_α itself) (see Fig. 1(b)) are considered when computing the densities of objects in cell C_α and clustering in the AGRID algorithm. If all 3^d neighboring cells are considered, the computation is prohibitively expensive when the dimensionality is high. Nevertheless, if only (2d + 1) immediate neighboring cells are considered, a lot of cells are ignored and the densities computed become inaccurate for high-dimensional data. To tackle the above dilemma, the ith-order neighbor will be defined in this paper to classify the 3^d neighboring cells into groups according to their significance. By considering only the most significant neighboring cells, high speed is achieved and high accuracy is kept. To improve the accuracy further, density compensation and minimal subspace distance are proposed, which will be described in the following section.

For a comprehensive survey of other approaches to clustering, please refer to Berkhin (2002), Grabmeier and Rudolph (2002), Han and Kamber (2001), Jain et al. (1999) and Kolatch (2001).

3. AGRID+: an enhanced density-grid based clustering

The proposed AGRID+ algorithm for clustering high-dimensional data will be presented in this section. The ith-order neighbor is first introduced to improve efficiency, and then density compensation is proposed to improve accuracy and make the algorithm more effective for clustering high-dimensional data. In addition, a measure of minimal subspace distance is introduced to make the algorithm capable of finding clusters in subspaces effectively. Our techniques of partitioning the data space and choosing parameters will also be discussed in this section.

The following notations are used throughout this paper. N is the number of objects (or points, or instances) and d is the dimensionality of the dataset. L is the length of an interval, r is the radius of the neighborhood, and DT is the density threshold. α is an object or a point, and C_α is the cell in which α is located. X is an object with coordinates (x_1, x_2, ..., x_d), and Dist_p(X, Y) is the distance between X and Y with the L_p-metric as the distance measure. C_{i1 i2 ... id} stands for the cell whose ID is i1 i2 ... id, where i_j is the ID of the interval in which the cell is located in the jth dimension. V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood. Cnt_q(α) is the count of points in the considered part of the neighborhood of α, and Den_q(α) is the compensated density of α when all ith-order neighbors of α (0 ≤ i ≤ q) are considered for density computation.


3.1. The ith-order neighbors

In this section, our definitions of neighbors will be presented and discussed. Note that neighborhood and neighbors (or neighboring cells) are two different concepts in this paper. The former is defined for a point, and its neighborhood is an area or a space, while the latter is defined for a cell, and its neighbors are those cells adjacent to it. Sometimes we use "the neighbors of point α" to denote the neighbors of cell C_α w.r.t. point α (see Definition 4), where C_α is the cell in which α is located.

An intuitive way is to define all the cells around a cell as its neighbors, as Definition 1 shows.

Definition 1 (Neighbors). Cells C_{i1 i2 ... id} and C_{j1 j2 ... jd} are neighbors of each other iff

∀p, 1 ≤ p ≤ d, |i_p − j_p| ≤ 1,

where i1 i2 ... id and j1 j2 ... jd are respectively the interval IDs of cells C_{i1 i2 ... id} and C_{j1 j2 ... jd}.

Generally speaking, there are altogether 3^d neighbors for each cell in a d-dimensional data space according to Definition 1 (see Fig. 1(a)). Assume α is an object and C_α is the cell that α is located in. When calculating the density of object α, we need to compute the distances between α and the objects in cell C_α and its neighboring cells only. Those objects in other cells are relatively far away from object α, so they contribute nothing or little to the density of α. Therefore, for object α, we do not care about the objects which are not in the neighboring cells of cell C_α.

With Definition 1, each cell has 3^d neighbors, which makes the computation very expensive when the dimensionality is high. Therefore, the idea of immediate neighbors is defined as follows to reduce the computational complexity.

Definition 2 (Immediate Neighbors). Cells C_{i1 i2 ... id} and C_{j1 j2 ... jd} are immediate neighbors of each other iff

∃l, 1 ≤ l ≤ d, |i_l − j_l| = 1, and ∀p ≠ l, 1 ≤ p ≤ d, i_p = j_p,

where l is an integer between 1 and d, and i1 i2 ... id and j1 j2 ... jd are respectively the interval IDs of cells C_{i1 i2 ... id} and C_{j1 j2 ... jd}.

Generally speaking, in a d-dimensional space, each cell has 2d immediate neighbors (see Fig. 1(b)).

With only immediate neighbors considered according to Definition 2, the computational complexity is greatly reduced, but at the cost of accuracy. It is effective when the clusters are compact and dense. Nevertheless, when the dimensionality is high and the data are sparse, the density values and the clustering become inaccurate, since many cells are ignored when computing densities. To improve the accuracy, we classify the neighbors according to their significance by defining ith-order neighbors as follows.
Fig. 2. The ith-order neighbors of cell C_α. (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.

Definition 3 (ith-order Neighbors). Let C_α be a cell in a d-dimensional space. A cell which shares a (d − i)-dimensional facet with cell C_α is an ith-order neighbor of C_α, where i is an integer between 0 and d. Especially, we set the 0th-order neighbor of C_α to be C_α itself.

Examples of ith-order neighbors in a 3D space are shown in Fig. 2. The grey cell in Fig. 2(a) is C_α, and the 0th-order neighbor of C_α is itself. The grey cells in Fig. 2(b)–(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α, respectively. With the introduction of ith-order neighbors, the neighbors of cell C_α are classified into groups according to their positions relative to C_α, and an ith-order neighbor's contribution to the density of α is greater with lower i. Therefore, we only consider low-order neighbors when clustering. More specifically, only those neighbors whose order is not greater than q are taken into account, where q is a positive integer and 0 ≤ q ≤ d. The ith-order neighbor is a generalized notion of Definitions 1 and 2. When q is set to 1, only the 0th- and 1st-order neighbors are considered, and the low-order neighbors are C_α itself and the immediate neighbors defined by Definition 2. In this case, the speed is very fast, but the accuracy is poor. If q is set to d, all neighbors are considered, which is the same as Definition 1. Thus, the accuracy is guaranteed, but the computation is prohibitively costly. Since lower-order neighbors are of more significance, our technique of considering only low-order neighbors helps to improve performance and keep accuracy as high as possible. Moreover, the accuracy can be further improved with our technique of density compensation, which will be discussed later in this paper.
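As a small illustration (ours, not from the paper), the order of a neighboring cell can be computed directly from the two interval-ID sequences: it is the number of dimensions in which the IDs differ, provided no difference exceeds 1.

def neighbor_order(cell_a, cell_b):
    # Return i if cell_b is an ith-order neighbor of cell_a
    # (i.e. they share a (d - i)-dimensional facet), else None.
    if any(abs(a - b) > 1 for a, b in zip(cell_a, cell_b)):
        return None  # not neighbors at all
    return sum(1 for a, b in zip(cell_a, cell_b) if a != b)

print(neighbor_order((3, 5, 2), (3, 5, 2)))  # 0 (the cell itself)
print(neighbor_order((3, 5, 2), (3, 6, 2)))  # 1 (immediate neighbor)
print(neighbor_order((3, 5, 2), (2, 6, 3)))  # 3
print(neighbor_order((3, 5, 2), (3, 8, 2)))  # None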

In the following, the relationship between the radius of the neighborhood and the length of an interval will be discussed to further improve the performance of our algorithm. Assume r to be the radius of the neighborhood and L the length of an interval. When r is large enough that all the objects in all the neighbors of a cell are within the neighborhood, AGRID+ will behave somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell will be the same and that the density is simply the count of those objects in all its neighboring cells. With a very large r, both the densities and the neighborhood become very large, which will lead to the merging of adjacent clusters into bigger clusters and to clusters consisting of noise. On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ will become somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. With a very small r, both the densities and the neighborhood become very small, so the result will be composed of many small clusters and a large number of objects will be taken as outliers. Therefore, it is reasonable to set r to be of the same order as the length of an interval.

Fig. 3. Neighborhood and neighbors. The black point is α, the grey cell in the center is C_α, and the other grey cells around it are its neighbors. The area within the dashed line is the neighborhood of α.

Fig. 4. The ith-order neighbors of C_α w.r.t. point α. (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.

If r > L/2, all the 3^d cells around C_α should be considered to compute the density of object α accurately. If r < L/2, some of the 3^d cells around C_α will not overlap with the neighborhood and can be excluded from density computation. An illustration of the above observation is given in Fig. 3, which shows the neighborhood and neighbors in a 2D space. In the figure, the L∞-metric is used as the distance measure and the neighborhood of α becomes a hypercube. Note that the above observation also holds for other distance measures. As Fig. 3 shows, if object α is located near the top-left corner of C_α, only the cells that are on the top-left side of C_α need to be considered, and the computation becomes less expensive. In what follows, we assume that r, the radius of the neighborhood, is less than L/2.



The above observation can be generalized as follows: if the radius of the neighborhood is less than L/2, only those neighbors which are located on the same side of C_α as α contribute to the density of α. Therefore, for each point α in cell C_α, the neighbors that need to be considered are related to the relative position of α in C_α, so a new definition of ith-order neighbors with respect to the position of α is given as follows.

Definition 4 (ith-order Neighbor w.r.t. Point α). In a d-dimensional space, let α be a point and C_α be the cell in which α is located. Assume that the coordinate of α is (x_1, x_2, ..., x_d), the center of C_α is (a_1, a_2, ..., a_d) and the center of C_β is (b_1, b_2, ..., b_d). Point α and cell C_β are on the same side of cell C_α iff

∀i, 1 ≤ i ≤ d, (x_i − a_i)(b_i − a_i) ≥ 0.

Cell C_β is an ith-order neighbor of C_α w.r.t. α (or an ith-order neighbor of α for short) iff: (1) C_β is an ith-order neighbor of C_α, and (2) C_β and α are on the same side of C_α.

Since an ith-order neighbor of α shares a (d − i)-dimensional facet with C_α, the ID sequences of the ith-order neighbors of C_α have i different IDs from that of C_α, and the difference between each pair of IDs can be either +1 or −1. Because the ith-order neighbors of α lie on the same side of C_α as α, the number of ith-order neighbors of α is (d choose i). Examples of the ith-order neighbors of C_α w.r.t. α in a 3D space are shown in Fig. 4. Assume that α is a point on the top-right-back side of the center of C_α (the grey cell) in Fig. 4(a), so C_α is the 0th-order neighbor of C_α w.r.t. α. The grey cells in Fig. 4(b)–(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α w.r.t. α, respectively.
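The following sketch (our illustration, not the authors' code) enumerates the ith-order neighbors of C_α with respect to α under Definition 4: choose i dimensions and step by +1 or −1 in each, where the sign is forced by the side of the cell center on which α lies; the count is therefore (d choose i). Cells that fall outside the grid range would be filtered by the caller.

from itertools import combinations

def same_side_neighbors(cell_id, point, cell_center, order):
    # ith-order neighbors of the cell w.r.t. the point (Definition 4):
    # pick `order` dimensions and step by +1 or -1, where the sign is the
    # side of the cell center on which the point lies in that dimension.
    d = len(cell_id)
    signs = [1 if x >= c else -1 for x, c in zip(point, cell_center)]
    neighbors = []
    for dims in combinations(range(d), order):
        nb = list(cell_id)
        for k in dims:
            nb[k] += signs[k]
        neighbors.append(tuple(nb))
    return neighbors

# a point in the top-right part of cell (3, 5) whose center is (3.5, 5.5)
print(same_side_neighbors((3, 5), (3.8, 5.2), (3.5, 5.5), 1))
# [(4, 5), (3, 4)] -- (2 choose 1) = 2 first-order neighbors w.r.t. the point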

3.2. Density compensation

With the introduction of ith-order neighbors, the efficiency is much improved by considering low-order neighbors only. However, the clustering still becomes less accurate with the increase of dimensionality. To further improve accuracy, an idea of density compensation will be proposed in this section to make up for the loss introduced by ignoring high-order neighbors.

3.2.1. Idea of density compensation

Since only low-order neighbors are considered, a part of the neighborhood is ignored and the clustering becomes less accurate, especially as d increases. To make up for the loss, we propose a notion of density compensation. The idea is that, for each object, the ratio of the volume of the neighborhood to that of the considered part is calculated as a compensation coefficient, and the final density of an object is the product of its original density and its compensation coefficient. According to Definition 4, if all ith-order neighbors of α (i = 0, 1, ..., q, where 0 ≤ q ≤ d) are considered when computing the density of α, the density we get should be compensated as

Den_q(α) = (V_n / V_c) · Cnt_q(α),    (1)

where V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood, which is covered by the ith-order neighbors of α with 0 ≤ i ≤ q. Cnt_q(α) is the count of points in the considered part of the neighborhood, and Den_q(α) is the compensated density of α.

Unfortunately, since there are too many neighbors for each cell when q is a large integer, it is impractical to compute the contribution of each cell individually. Therefore, we simplify the density compensation by assuming that, for a specific i, the considered parts of all ith-order neighbors are of the same volume. Let V_{S_i} be the volume of the overlapped space of an ith-order neighbor and the neighborhood of α. Based on Eq. (1), we can get

Den_q(α) = (V_n / Σ_{i=0}^{q} (d choose i) · V_{S_i}) · Cnt_q(α),    (2)

where S_i is the overlapped space of an ith-order neighbor and the neighborhood of α, and V_{S_i} is the volume of S_i. In the above equation, the density becomes more accurate with the increase of q. When q = d, the density we obtain is the exact value of the density. Nevertheless, the number of neighbors considered increases dramatically with the increase of q. When the value of q is set to 1, the 0th- and 1st-order neighbors together are the (2d + 1) neighbors defined in AGRID. The value of V_{S_i} in Eq. (2) varies with the measure of distance. A method for computing V_{S_i} will be presented in the next section.

3.2.2. Density compensation

Euclidean distance is the most widely used distance measure. However, Euclidean distance increases with dimensionality, which makes it difficult to select a value for r. Assume that there are two points α(a, a, ..., a) and β(0, 0, ..., 0) in a d-dimensional space. The L_p-distance between them is Dist_p(α, β) = (d·a^p)^{1/p}, that is, a·d^{1/p}. We can see that the distance increases with the dimensionality, especially when p is small. For example, for a dataset within the unit cube in a 100D space, if Euclidean distance (p = 2) is used, the distance of the point (0.2, 0.2, ..., 0.2) from the origin is 2. If r is set to 1.5, it will cover the whole range in every single dimension, but still cannot cover the above point! It is the same case with most other L_p-metrics, especially when p is a small integer. However, when the L∞-metric is used, the distance becomes Dist_∞(α, β) = a. For the above example, r can be set to 0.3 to cover the point, and it will cover only a part of every dimension. Therefore, the L∞-metric is more meaningful for measuring distance for clustering in high dimensional spaces. Moreover, for subspace clustering in a high-dimensional space, clusters are defined by researchers as axis-parallel hyper-rectangles in subspaces (Agrawal et al., 1998; Procopiuc et al., 2002). Therefore, it is reasonable to define a cluster in this paper to be composed of those objects which are in a hyper-rectangle in a subspace. Since a hyper-rectangle in a subspace can be obtained by bounding the subspace distance with the L∞-metric, we select the L∞-metric as the distance measure, which is defined as follows:

Dist_∞(X, Y) = max_{i=1...d} |x_i − y_i|.    (3)

When the L∞-metric is used as the distance measure, the neighborhood of an object becomes a hyper-cube with edge length 2r and its volume is V_n = (2r)^d, where r is the radius of the neighborhood.

Let (a_1, a_2, ..., a_d) be the coordinate of α relative to the start point of C_α (see Fig. 5(a)). Let b_j = min{a_j, L_j − a_j}, where L_j is the length of the interval in the jth dimension. If b_j < r, then the neighborhood of α is beyond the boundary of C_α in the jth dimension. Suppose that there are altogether d′ dimensions with b_j < r and a is the mean of such b_j. To approximate the ratio of overlapped spaces, we assume that the current object is located on the diagonal of cell C_α and (a, a, ..., a) is its coordinate relative to the start point of C_α (Fig. 5(b)), where a = (1/d) Σ_{i=1}^{d} a_i. With such an assumption, for a specific i, all the ith-order neighbors of α have the same volume of overlapped space with the neighborhood of α. Let S_i be the overlapped space of an ith-order neighbor and the neighborhood of α, and V_{S_i} be the volume of S_i. S_i is a hyper-rectangle which has i edges of length (r − a) and (d′ − i) edges of length (r + a), so V_{S_i} = (r + a)^{d′−i} (r − a)^i, where 0 ≤ a ≤ r. Since the neighborhood overlaps C_α in only d′ dimensions, the following equation can be derived by replacing d, V_n and V_{S_i} with d′, (2r)^{d′} and (r + a)^{d′−i}(r − a)^i respectively in Eq. (2):

Den_q(α) = ((2r)^{d′} / Σ_{i=0}^{q} (d′ choose i) (r + a)^{d′−i} (r − a)^i) · Cnt_q(α),    (4)

where q is a positive integer no larger than d′.

In fact, the number of ith-order neighbors of a point is much less than (d′ choose i), especially in high-dimensional spaces where most cells are empty. Thus k_i, the actual number of ith-order neighbors, can be used to replace (d′ choose i), leading to

Den_q(α) = ((2r)^{d′} / Σ_{i=0}^{q} k_i (r + a)^{d′−i} (r − a)^i) · Cnt_q(α).    (5)

By tuning the parameter q, we can obtain different clustering accuracy. Clearly, both the accuracy and the cost will increase as q increases. Therefore, we need a trade-off between accuracy and efficiency. The value of q can be chosen according to the requirement on accuracy and the performance of computers. A large value of q will improve the accuracy of the clustering result, but at the cost of time. On the contrary, high speed can be achieved by setting a small value of q, but the accuracy will become lower accordingly. Interestingly, our experiments show that setting q to two or three achieves good accuracy in most situations. The effect of different values of q will be shown in Section 4.

Fig. 5. Density compensation. As (a) shows, the black point is α, and the area circumscribed by the dotted line is the neighborhood of α. To approximate the volumes of the overlapped spaces (in grey) between the neighborhood and each neighbor of α, we assume that α is located on the diagonal line of the current cell C_α, as (b) shows.
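As a rough Python rendering of Eq. (5) (our sketch, not the authors' code), the compensation coefficient is the ratio of the full neighborhood volume (2r)^{d′} to the volume actually covered by the 0th- to qth-order neighbors, with k_i the number of ith-order neighbors actually present. The numbers below are purely hypothetical.

def compensated_density(count_q, r, a, d_prime, ks):
    # Eq. (5): scale the raw count by the ratio of the full neighborhood
    # volume (2r)^d' to the covered volume
    # sum_i k_i * (r + a)^(d' - i) * (r - a)^i, for i = 0..q (q = len(ks) - 1).
    covered = sum(k * (r + a) ** (d_prime - i) * (r - a) ** i
                  for i, k in enumerate(ks))
    return (2 * r) ** d_prime / covered * count_q

# hypothetical numbers: q = 1, d' = 3, k_0 = 1 (the cell itself), k_1 = 3
print(compensated_density(count_q=12, r=0.4, a=0.1, d_prime=3, ks=[1, 3]))  # about 17.6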

3.3. Minimal subspace distance

Euclidean distance is the most widely used distance measure. However, the difference between the nearest and the farthest points becomes less discriminating with the increase of dimensionality (Hinneburg et al., 2000). Aggarwal et al. suggest using fractional distance metrics (i.e., the L_p-norm with 0 < p < 1) to measure the similarity between objects in high dimensional space (Aggarwal et al., 2001). Nevertheless, many researchers think that most meaningful clusters only exist in subspaces, so they use the traditional L_p-norm (p = 1, 2, 3, ...) to discover clusters in subspaces (Agrawal et al., 1998; Fern and Brodley, 2003; Nagesh et al., 1999; Procopiuc et al., 2002). For subspace clustering in high-dimensional space, clusters are constrained to be axis-parallel hyper-rectangles in subspaces by Agrawal et al. (1998), and projective clusters are defined as axis-aligned boxes by Procopiuc et al. (2002). Therefore, it is reasonable to define a cluster to be composed of those objects which are in a hyper-rectangle in a subspace. To improve the traditional L_p-norm (p = 1, 2, 3, ...) for subspace clustering in high-dimensional space, a new distance measure, minimal subspace distance, is defined as follows.

Definition 5 (Minimal Subspace Distance). Suppose that X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d) are two objects or points in a d-dimensional space. The minimal k-dimensional subspace distance between X and Y is the minimal distance between them over all possible k-dimensional subspaces:

Dist^{(k)}(X, Y) = min_{all J_k} { Dist(X_{J_k}, Y_{J_k}) },  J_k ⊂ {1, 2, ..., d}, 1 ≤ k < d,    (6)

where J_k = (j_1, j_2, ..., j_k) is a k-dimensional subspace, X_{J_k} and Y_{J_k} are respectively the projected vectors of X and Y in subspace J_k, and Dist(·) is a traditional distance measure in the full dimensional space.

When the L_p-metric is used as the measure of distance, the minimal subspace distance is the L_p distance of the k minimal differences between each pair of x_i and y_i:

Dist_p^{(k)}(X, Y) = ( Σ_{i=1}^{k} |x_{j_i} − y_{j_i}|^p )^{1/p}.    (7)

If the L∞-norm is used as the distance measure, the minimal subspace distance becomes the kth minimum of |x_i − y_i|, which can easily be obtained by sorting |x_i − y_i| (i = 1..d) in ascending order and then picking the kth value. Then Dist^{(k)}(X, Y) ≤ r means that X and Y are in a hyper-rectangle with edge of r in k dimensions and without limits in the other dimensions. Therefore, the above distance measure provides an effective measure for hyper-rectangular clusters in subspaces.

With the help of the above minimal subspace distance, it will be easier to discover clusters in subspaces. For two objects, it finds the subspace in which they are the most similar or the nearest to each other. Assume that the L∞-norm is used. For example, if the 4D minimal subspace distance between two objects is 7, it means that the two objects are within a 4D hyper-rectangle with edge length 7.

Minimal subspace distance tries to measure the distance between objects in the subspace where they are closest to each other, so it is effective for finding subspaces where clusters exist and then discovering clusters in those subspaces. With the above definition of minimal subspace distance, our algorithm is capable of finding projected clusters and subspaces automatically when the average dimensionality of subspaces is given. The effectiveness of the above distance measure will be shown in the experiments.


3.4. Partitioning data space

3.4.1. Technique of partitioning

The performance of our algorithm largely depends on the partitioning of the data space. Given a certain number of objects, the more cells the objects are in and the more uniformly the objects are distributed, the better the performance is. In some papers (Agrawal et al., 1998; Sheikholeslami et al., 1998), each dimension is divided into the same number (say, m) of intervals, and there are m^d cells in the data space. The above method of partitioning is effective when the dimensionality is low. Nevertheless, it is inapplicable in a high dimensional data space, because the number of cells increases exponentially with the dimensionality and the computation becomes extremely expensive. For example, if d is 80, the number of cells is too large to be applicable even if m is set to two. However, the value of m cannot be lower, because there will be only one cell, and the density calculation of each object will need N distance computations, if m is set to one. In addition, when the dimensionality is high, it is very difficult to choose an appropriate value for m, the interval number, and a little change of it can lead to a great variance in the number of cells. For example, if d is 30, m^d is 2.06 × 10^14 when m = 3 and 1.07 × 10^9 when m = 2. To tackle the above problem, a technique of dividing different dimensions into different numbers of intervals is employed to partition the whole data space.

With our technique, different interval numbers are used for different dimensions. For the first p dimensions, each dimension is divided evenly into m intervals, while (m − 1) intervals are used for each of the remaining (d − p) dimensions. With such a partitioning, the total number of cells is m^p (m − 1)^{d−p}, and the number of cells can be adjusted smoothly by changing m and p.
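For illustration (our sketch, not the authors' code), a point in the unit cube [0, 1]^d can be mapped to its cell ID under this partitioning as follows, with m intervals in the first p dimensions and (m − 1) in the rest:

def cell_id(point, m, p):
    # Interval IDs of a point in [0, 1]^d: the first p dimensions are cut
    # into m equal intervals, the remaining ones into (m - 1) intervals.
    ids = []
    for j, x in enumerate(point):
        k = m if j < p else m - 1
        ids.append(min(int(x * k), k - 1))  # clamp x = 1.0 into the last interval
    return tuple(ids)

print(cell_id((0.10, 0.55, 0.99), m=3, p=2))  # (0, 1, 1)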

Let ω be the percentage of non-empty cells and N the number of objects. The number of non-empty cells is N_ne = ω m^p (m − 1)^{d−p}. The average number of objects contained in each non-empty cell is N_avg = N / N_ne. Let N_nc be the average number of neighboring cells of a non-empty cell (including itself). For each non-empty cell, the number of distance computations is N_avg · N_nc · N_avg. So the total time complexity is

t = N_avg · N_nc · N_avg · N_ne = N_avg · N_nc · N = N^2 N_nc / (ω m^p (m − 1)^{d−p}).    (8)

By setting the time complexity to be linear in both N and d, we can get

N^2 N_nc / (ω m^p (m − 1)^{d−p}) = N d,    (9)

that is,

N N_nc / (ω m^p (m − 1)^{d−p}) = d.    (10)

Then the values of m and p can be derived from the above equation.

3.4.2. Average number of neighbors per cell

For simplicity, we consider the case with q = 1 to select the values of m and p. Actually, the m and p calculated in this way are also used when q is set to other values in our algorithm.

With q = 1, when m is a large number, most cells have (2d + 1) neighbors, i.e., N_nc ≈ 2d + 1, so the following can be derived from Eq. (9):

N (2d + 1) / (ω m^p (m − 1)^{d−p}) = d,    (11)

where both m and p are positive integers, m ≥ 2 and 1 ≤ p ≤ d.

However, when the dimensionality is high, m may become small and the majority of cells would have fewer than (2d + 1) neighbors, so Eq. (11) would be inapplicable for computing the value of p. In the following, a theorem will be presented to compute the average number of neighbors of a cell.

Theorem 1. In a d-dimensional data space, if each of the first p dimensions is evenly divided into m intervals and each of the remaining (d − p) dimensions into (m − 1) intervals, where m ≥ 2, the average number of immediate neighbors of a cell with q = 1 is

N_nc = 1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p).    (12)

Proof. If a dimension is partitioned into m intervals, the total number of neighboring intervals of the intervals in the dimension is (2m − 2), since each of the two intervals at the two ends has one neighbor and each of the remaining (m − 2) intervals has two neighbors.

As to the immediate neighbors (q = 1), the interval ID in one dimension only is different from the ID sequence of the current cell. If the difference is in one of the first p dimensions, there are p m^{p−1} (m − 1)^{d−p} cases, and in each case there are (2m − 2) neighbors, so the count of neighbors in the first p dimensions is n_1 = p m^{p−1} (m − 1)^{d−p} (2m − 2). If the difference is in one of the last (d − p) dimensions, there are (d − p) m^p (m − 1)^{d−p−1} cases, and in each case there are (2m − 4) neighbors, so the count of neighbors in the last (d − p) dimensions is n_2 = (d − p) m^p (m − 1)^{d−p−1} (2m − 4).

The count of cells is n_3 = m^p (m − 1)^{d−p}, and the average number of neighbors of each cell is

(n_1 + n_2) / n_3 = [p m^{p−1} (m − 1)^{d−p} (2m − 2) + (d − p) m^p (m − 1)^{d−p−1} (2m − 4)] / [m^p (m − 1)^{d−p}]
                  = (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p).

In addition, each cell is also considered as a neighbor of itself, so N_nc = 1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p). □
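A quick brute-force check of Eq. (12) on a small grid (our verification script, not part of the paper):

import itertools

def theorem1(m, p, d):
    return 1 + 2 * (m - 1) / m * p + 2 * (m - 2) / (m - 1) * (d - p)

def brute_force_avg(m, p, d):
    # Average number of immediate (q = 1) neighbors per cell, counting the
    # cell itself, for a grid with m intervals in the first p dimensions and
    # (m - 1) intervals in the remaining (d - p) dimensions.
    sizes = [m] * p + [m - 1] * (d - p)
    cells = list(itertools.product(*[range(s) for s in sizes]))
    total = 0
    for cell in cells:
        count = 1  # the cell itself
        for dim, size in enumerate(sizes):
            count += (cell[dim] - 1 >= 0) + (cell[dim] + 1 < size)
        total += count
    return total / len(cells)

print(theorem1(4, 2, 5), brute_force_avg(4, 2, 5))  # both print 8.0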

From Eq. (9) and Theorem 1, we can get

N (1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p)) / (ω m^p (m − 1)^{d−p}) = d,    (13)

where m ≥ 2 and 1 ≤ p ≤ d. For a given m, Eq. (13) is a transcendental equation and cannot be solved directly. In fact, for each m, p is an integer no less than one and no greater than d. Therefore, the values of p fall in a small range and the optimal value can be derived by trying every possible pair of values in Eq. (13).
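One way to realize this "try every pair" selection in Python is sketched below (ours, not the authors' code). The fraction of non-empty cells ω is not known in advance, so the default value used here is only an assumed estimate.

def expected_cost_ratio(N, d, m, p, omega):
    # Left-hand side of Eq. (13): estimated cost factor relative to d.
    n_nc = 1 + 2 * (m - 1) / m * p + 2 * (m - 2) / (m - 1) * (d - p)
    n_cells = (m ** p) * ((m - 1) ** (d - p))
    return N * n_nc / (omega * n_cells)

def choose_m_p(N, d, omega=0.5, max_m=64):
    # Pick (m, p) so that Eq. (13) is satisfied as closely as possible,
    # i.e. the estimated time complexity is roughly linear in N and d.
    best = None
    for m in range(2, max_m + 1):
        for p in range(1, d + 1):
            gap = abs(expected_cost_ratio(N, d, m, p, omega) - d)
            if best is None or gap < best[0]:
                best = (gap, m, p)
    return best[1], best[2]

# example: N = 100000 objects in d = 20 dimensions
print(choose_m_p(100000, 20))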

3.5. Storage of cells

In a high dimensional data space, the number of cells can be huge and it is impossible to store all cells in memory. Fortunately, not all of the cells contain objects. Especially when the dimensionality is high, the space is very sparse and the majority of cells are empty. Therefore, it is not necessary to store all cells. With our technique, only the non-empty cells are stored, and a hash table is used to store them. Because each non-empty cell contains at least one object, the number of non-empty cells is no more than N, the number of objects.
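In Python this storage scheme amounts to a dictionary keyed by the cell-ID tuple, so only non-empty cells ever occupy memory (a sketch, not the authors' implementation):

from collections import defaultdict

def build_cell_table(points, interval_lengths):
    # Hash table of non-empty cells: cell ID tuple -> list of point indices.
    # Empty cells are simply never created.
    table = defaultdict(list)
    for idx, point in enumerate(points):
        cid = tuple(int(x // L) for x, L in zip(point, interval_lengths))
        table[cid].append(idx)
    return table

points = [(0.1, 0.9), (0.2, 0.8), (5.0, 5.0)]
table = build_cell_table(points, interval_lengths=(1.0, 1.0))
print(len(table), table[(0, 0)])  # 2 non-empty cells; points 0 and 1 in cell (0, 0)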

3.6. Parameters r and DT

While it is very easy to count the number of objects in the neighborhood of an object, it is not so easy to choose an appropriate value for r, the radius of the neighborhood. When r is large enough that all the objects in all the neighbors of a cell are within the neighborhood, AGRID+ will behave somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell will be the same and that the density is simply the count of those objects in all its neighboring cells. But they are different in that AGRID+ considers only the significant low-order neighbors, instead of all 3^d neighboring cells. On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ will become somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. However, the partitioning of the data space into cells helps to reduce the number of distance computations and makes AGRID+ much faster than density-based clustering.

Since there is an assumption in Section 3.1 that the radius of the neighborhood is less than L/2, r is simply set to a value less than L/2, where L is the length of the shortest interval over all dimensions. Because a small r can make the densities too low to find any useful clusters, r is set to be between L/4 and L/2 in our algorithm.

Besides r, the result of clustering is also decided by the value of the density threshold, DT. With AGRID+, we calculate DT dynamically from the mean of the densities according to the following equation:

DT = (1/λ) · (Σ_{i=1}^{N} Density(i)) / N,    (14)

where λ is a coefficient which can be tuned to get clustering results at different levels of resolution. Actually, by tuning λ, various clustering results can be achieved with different DT. On the one hand, a small λ will lead to a big DT; the merging condition will become strict, and the result would be composed of many small clusters, with many objects taken as noise. On the other hand, a large λ will make DT small. Then it will lead to a few large clusters, because adjacent clusters will be merged, and some noise will be mistaken as clusters. With a set of different values of λ, a multi-resolution clustering can be obtained. Since the proposed algorithm is based on AGRID, the effect of DT and the multi-resolution clustering of the two algorithms are similar, and some experimental results and more discussions on that can be found in our previous work on AGRID (Zhao and Song, 2003).
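A direct rendering of Eq. (14) in Python (our sketch; the coefficient is written here with the name used in the reconstruction above):

def density_threshold(densities, lam=1.0):
    # Eq. (14): DT is the mean of the compensated densities scaled by 1/lam.
    # A smaller lam gives a larger DT and hence a stricter merging condition.
    return sum(densities) / (lam * len(densities))

print(density_threshold([4, 6, 10, 20], lam=2.0))  # 5.0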

3.7. The procedure of AGRID+

AGRID+ is composed of the following seven steps. Detailed pseudocode can be found in Figs. 6–8.

(1) Partitioning. The whole data space is partitioned into cells according to m and p computed with Eq. (13). Each object is then assigned to a cell according to its coordinates, and non-empty cells are inserted into a hash table.
(2) Computing the distance threshold. The distance threshold is computed according to the interval lengths of every dimension with the method given in Section 3.6.
(3) Calculating densities. For each object, count the number of objects that are both in its neighboring cells and in its neighborhood as its density.
(4) Compensating densities. For each object α, compute the ratio of the volume of all neighbors to that of the neighbors considered, and use the product of the ratio and the density of α as the new density of α, according to Eq. (5).
(5) Calculating the density threshold DT. The average of all compensated densities is calculated, and then the density threshold DT is computed with Eq. (14).
(6) Clustering automatically. At first, each object whose density is greater than DT is taken as a cluster. Then, for each object α, check each object in the neighboring cells of C_α to see whether its density is greater than the density threshold and whether its distance from object α is less than the distance threshold. If yes, then merge the two clusters to which the two objects respectively belong. Continue the above merging procedure until all eligible object pairs have been checked.
(7) Removing noise. In those clusters obtained, many are too small to be considered as meaningful clusters, so they are removed as noise.

3.8. Complexity analysis

The performance of AGRID+ depends on the values of N (the size of the data) and d (the dimensionality of the data). With our partitioning technique proposed in Section 3.4, the time complexity is controlled by m and p, the two parameters for space partitioning. The time complexity is set to be linear in N and d in Eq. (9) in Section 3.4.1. Nevertheless, the time complexity we computed is under an ideal condition in which every cell contains the same number of objects. In nearly all cases, the number of objects varies from cell to cell, so the time complexity is to some degree dependent on the distribution of objects in the data. Our experimental results in the next section will show that the running time is nearly linear in both data size and dimensionality.

Regarding space complexity, our algorithm stores only non-empty cells in a hash table, and the number of non-empty cells is no more than the number of objects. Besides, the densities of objects and the discovered clusters are also kept in memory, and the space used to store densities and clusters is also linear in N. Therefore, the space complexity is linear in the size and the dimensionality of the data.

4. Experimental evaluation

Our experiments were performed on a PC with 256 MB RAM and an Intel Pentium III 1 GHz CPU. In the experiments, we will show the improvement of AGRID+ over AGRID in terms of scalability, performance and accuracy, and will also show the effectiveness of AGRID+ for discovering clusters in subspaces. In addition, we will compare AGRID+ with Random Projection (Fern and Brodley, 2003) on a public dataset.

4.1. Synthetic data generator

The function nngenc(X, C, N, D) from Matlab1 is used to generate clusters of data points, where X is an R × 2 matrix of cluster bounds, C is the number of clusters, N is the number of data points in each cluster, and D is the standard deviation of the clusters. The function returns a matrix containing C × N R-element vectors arranged in C clusters with centers inside the bounds set by X, with N elements each, distributed randomly around the centers with standard deviation D. The range is set to [0, 1000]. For some clusters, we set the values in some dimensions to be of uniform distribution to make subspace clusters. Noises, which are uniformly distributed in all dimensions, are added to the data.

1 http://www.mathworks.com/.

4.2. Superiority over AGRID

The first dataset is a 15D dataset of 10,000 points. It is generated with the deviation set to 130. There are 4 clusters, and 5% of the data are noise. All clusters are in the full-dimensional space. The clustering results of AGRID+ and AGRID are shown in Fig. 9. Fig. 9(a) shows the clusters discovered by AGRID, where the numbers below the sub-figures are the sizes of the clusters. AGRID+ also found four clusters, but with more objects. Fig. 9(b) shows the additional objects discovered by AGRID+ as opposed to those by AGRID, where the numbers below the sub-figures are the counts of additional objects in the clusters. The objects in Fig. 9(b) are missed by AGRID, which shows that the compensation of densities makes AGRID+ more accurate than AGRID.

Tables 1 and 3 show the comparison between AGRID+ and AGRID on the densities of objects and the accuracy of clustering. The confusion matrix of densities is given in Table 1, in which DT stands for the density threshold and the figures are the counts of objects. It is clear from the table that the accuracy of AGRID+ is greater than that of AGRID (Table 2).

Table 1
Comparison of densities.

Standard density    Density in AGRID+         Density in AGRID
                    ≥DT        <DT            ≥DT        <DT
≥DT                 9314       190            8292       1212
<DT                 222        274            16         480
Accuracy            95.9%                     87.7%

Fig. 6. Pseudo-code of AGRID+.

Page 10: Enhancing grid-density based clustering for high dimensional data

Y. Zhao et al. / The Journal of Systems and Software 84 (2011) 1524– 1539 1533

Fig. 7. Pseudo-code of computing density().

We then further studied the effectiveness of ith-order neighbors and density compensation, and the results are shown in Table 3. Table 3 gives the clusters discovered and the accuracy. It shows the clustering results of four algorithms: NAIVE, AGRID, IORDER and AGRID+. NAIVE is a naive density-based clustering algorithm without using any grid. AGRID uses the grid and (2d + 1) neighboring cells based on NAIVE. IORDER uses ith-order neighbors to improve the performance of NAIVE, but no compensation is conducted for the density computation. AGRID+ uses all techniques designed in this paper.

ALGORITHM: Clustering
INPUT: data
OUTPUT: clusters

/* creating a new cluster for each object whose density is no less than DT */
FOR all objects Oi
    IF Denq(Oi) ≥ DT
        cluster(Oi) = {Oi};
    ENDIF
ENDFOR

/* combining clusters */
FOR all cells Ci in hash table
    FOR all Cj, non-empty kth-order neighbors of Ci (0 ≤ k ≤ q, j ≥ i)
        FOR all objects Om in Ci
            FOR all objects On in Cj
                IF Denq(Om) ≥ DT AND Denq(On) ≥ DT AND dist(Om, On) ≤ r
                    cluster(Om) = cluster(Om) ∪ cluster(On);
                    cluster(On) = cluster(Om);
                ENDIF
            ENDFOR
        ENDFOR
    ENDFOR
ENDFOR

Fig. 8. Pseudo-code of Clustering().
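As a rough illustration of the clustering step in Fig. 8, the sketch below is a hypothetical Python rendering rather than the authors' implementation; it assumes that the densities Denq, the cell index and the neighbor lists have already been computed, and it replaces the explicit cluster-set merging with a union-find structure.

import numpy as np

def clustering_step(points, densities, cells, neighbors, DT, r):
    """Sketch of Fig. 8: merge dense objects in neighboring cells into clusters.

    points:    (n, d) array of coordinates
    densities: length-n array of estimated densities Den_q
    cells:     dict cell_id -> list of indices of the points in that cell
    neighbors: dict cell_id -> cell_ids of its neighbors of order <= q
               (assumed to include the cell itself)
    DT, r:     density threshold and neighborhood radius
    """
    # each object whose density is no less than DT starts as its own cluster
    parent = {i: i for i in range(len(points)) if densities[i] >= DT}

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # combine clusters of dense objects that lie in neighboring cells
    # and are within distance r of each other
    for ci, members in cells.items():
        for cj in neighbors[ci]:
            for m in members:
                if m not in parent:
                    continue
                for n in cells.get(cj, []):
                    if n == m or n not in parent:
                        continue
                    if np.linalg.norm(points[m] - points[n]) <= r:
                        rm, rn = find(m), find(n)
                        if rm != rn:
                            parent[rm] = rn

    # label each dense object by the root of its cluster; all other objects are outliers
    return {i: find(i) for i in parent}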


Fig. 9. Experimental results of AGRID and AGRID+. (a) The four clusters discovered by AGRID, and the number under each sub-figure gives the size of the cluster (cluster 1: 1968, cluster 2: 2158, cluster 3: 2163, cluster 4: 2019). (b) The additional objects in each cluster found by AGRID+, and the number under each sub-figure gives the count of additional objects in each cluster (cluster 1: 325, cluster 2: 221, cluster 3: 211, cluster 4: 304).

Table 2
Four algorithms and their techniques.

Algorithms    Grid    ith-order neighbors    Density compensation
NAIVE         –       –                      –
AGRID         √       –                      –
IORDER        √       √                      –
AGRID+        √       √                      √

Table 3
Comparison of accuracy.

Algorithms    Cluster 1    Cluster 2    Cluster 3    Cluster 4    Accuracy    Time (s)
NAIVE         2368         2396         2374         2366         95.0%       44.21
AGRID         1968         2158         2163         2019         83.1%       3.62
IORDER        1990         2216         2240         2081         85.3%       4.54
AGRID+        2293         2379         2374         2323         93.7%       4.60

The results show that NAIVE is of the highest accuracy but with the longest running time (around 10 times as long as the other three). This is because it does not use any grid to reduce computation and the distance between every two objects has to be calculated. The other three algorithms are about 10 times faster than NAIVE, so the grid is very effective for speeding up density computation, at the cost of accuracy. IORDER is more accurate than AGRID, which demonstrates the effectiveness of ith-order neighbors. The higher accuracy of AGRID+ over IORDER shows that density compensation enhances clustering quality, at the cost of a marginally longer running time. The table clearly demonstrates the effectiveness of the grid in improving speed and the effectiveness of ith-order neighbors and density compensation in improving clustering quality.

While the clusters in the above dataset can be easily discovered by both algorithms, another dataset is used to demonstrate the superiority of AGRID+ over AGRID in subspace clustering. It is a 15D dataset of 20,000 points, and the clusters exist in 11D subspaces. As shown in Fig. 10, the first cluster exists in the first 11 dimensions, while the attribute values in the last 4 dimensions are uniformly distributed. For the second cluster, the attribute values in dimensions 3–6 are of uniform distribution, i.e., the second cluster exists in the subspace composed of dimensions 1, 2 and 7–15. For the other three clusters, the uniformly distributed dimensions are respectively 7–10, 8–11, and 6–9. The last sub-figure shows the noise, which accounts for 10% of the data. Our experiment shows that AGRID+ can discover the five clusters correctly with an accuracy of 91%. In contrast, AGRID cannot find the five clusters correctly even by fine-tuning the parameters r and DT.

Fig. 10. Experimental results of AGRID+. The first 5 sub-figures are the clusters discovered by AGRID+ in a dataset of 11D subspace clusters, while the last sub-figure shows the noise.

4.3. Scalability

The performance of AGRID+ and AGRID is shown in Fig. 11, where the solid lines represent AGRID+ and the dashed lines represent AGRID. Ten experiments have been conducted for each method and the average results are given in the figure. In Fig. 11(a), the dimensionalities of the datasets are all 20, and the sizes of the datasets range from 10,000 to 100,000. In Fig. 11(b), the size is 100,000 and the dimensionalities are from 3 to 100. In each dataset, 10% of the objects are noise. From the figure, it is clear that the running time of AGRID+ is nearly linear in both the size and the dimensionality of the datasets, and is a little longer than that of AGRID.

Fig. 11. Scalability with the size and dimensionality of datasets. (a) Scalability with the size N; (b) scalability with the dimensionality d.

In the above experiments, we set q = 1 when applying Eq. (5). To test the effect of different q, another experiment was conducted on a dataset of 100,000 objects and 15 dimensions. The running time and accuracy of the clusters discovered with different q are shown in Fig. 12(a) and (b), respectively. When q is zero, the algorithm is fastest, but the accuracy is very low. With the increase of q, more neighbors are taken into consideration and the accuracy goes up dramatically, but the running time becomes longer. When q is larger than three, there is no significant increase in accuracy in the experiment. From the figure, it is reasonable to set q to 2 or 3 to achieve both high speed and high accuracy. Users can set the value of q according to computer performance and accuracy requirements in their applications.

Fig. 12. Experimental results of running time and accuracy with various values of q, the order of neighbors. (a) Running time; (b) accuracy.
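To give a feel for why a small q already captures most of the neighborhood, the hypothetical helper below (not from the paper) enumerates the cells that differ from a given cell in at most q coordinates by one interval each, a reading consistent with the (2d + 1) first-order neighborhood used by AGRID; their number, 1 + Σ_{i=1..q} C(d,i)·2^i, stays far below the 3^d cells of the full neighborhood.

from itertools import combinations, product
from math import comb

def neighbors_up_to_order(cell, q):
    """Yield the cells differing from `cell` in at most q coordinates by +/-1."""
    d = len(cell)
    yield tuple(cell)                      # 0th-order neighbor: the cell itself
    for i in range(1, q + 1):              # ith-order: exactly i coordinates differ
        for dims in combinations(range(d), i):
            for signs in product((-1, 1), repeat=i):
                nb = list(cell)
                for dim, s in zip(dims, signs):
                    nb[dim] += s
                yield tuple(nb)

d, q = 15, 2
count = sum(1 for _ in neighbors_up_to_order((0,) * d, q))
closed_form = 1 + sum(comb(d, i) * 2**i for i in range(1, q + 1))
print(count, closed_form, 3 ** d)   # 451 451 14348907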

4.4. Multi-resolution clustering

Multi-resolution clustering can be achieved by using different values of DT in Eq. (14), which helps to detect clusters at different levels, as shown in Fig. 13. Although the multi-resolution property of our technique is somewhat like that of WaveCluster, they are much different in that AGRID+ achieves it by adjusting the value of the density threshold, whereas WaveCluster does so by "increasing the size of a cell's neighborhood". The clustering of a 2D dataset of 2000 objects
is used to demonstrate the effectiveness of multi-resolution clustering. Fig. 13(a) shows the original data before clustering, and the other figures are the clustering results with different DTs. The values of the density threshold are respectively 5, 10, 20, 30, 35, 40 and 50 in Fig. 13(b)–(h). In Fig. 13(b), DT is set to 5 and three clusters are found. The two groups of objects in the top-right are in one cluster, because they are connected by some objects
between them. When DT increases to 10, they are split into two clusters, as shown in Fig. 13(c). The bottom cloud of objects is classified into two clusters when DT is 20 in Fig. 13(d). As DT increases further, all clusters shrink, resulting in the splitting or disappearance of some clusters, as shown in Fig. 13(e)–(h). When DT is set to 50, only three clusters are found, composed of objects in very densely populated areas (see Fig. 13(h)).




Generally speaking, the greater the density threshold is, the smaller the clusters are, and the more objects would be treated as outliers. When DT is a very small number, the number of clusters is small and just a few objects are treated as
outliers. With the increase of DT, some clusters break into more and smaller clusters. A hierarchical clustering tree can be built by selecting a series of DT values, and the appropriate resolution level for choosing clusters can be decided based on the needs of users.
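Such a sweep over DT is easy to script; the driver below is a hypothetical sketch in which cluster_at stands for any implementation of the clustering step (for example, the one sketched after Fig. 8) and simply records how many clusters and outliers appear at each threshold, which is the information needed to pick a resolution level or to build the hierarchical clustering tree mentioned above.

def multi_resolution_sweep(points, cluster_at, thresholds=(5, 10, 20, 30, 35, 40, 50)):
    """Cluster the same data at a series of density thresholds DT.

    cluster_at(points, DT) is assumed to return a dict mapping each clustered
    point index to a cluster label; points missing from the dict are outliers.
    """
    levels = []
    for dt in thresholds:
        labels = cluster_at(points, dt)
        n_clusters = len(set(labels.values()))
        n_outliers = len(points) - len(labels)
        levels.append((dt, n_clusters, n_outliers))
    return levels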

Fig. 13. Multi-resolution clustering. (a) Data; (b) 3 clusters, DT=5; (c) 4 clusters, DT=10; (d) 5 clusters, DT=20; (e) 5 clusters, DT=30; (f) 5 clusters, DT=35; (g) 4 clusters, DT=40; (h) 3 clusters, DT=50.




4.5. Comparison with Random Projection on public data

In addition to the experiments with the above synthetic datasets, experiments have been conducted with the control chart time series dataset from the UCI KDD Archive2 and a comparison was made with Random Projection (Fern and Brodley, 2003), an algorithm for subspace clustering. The dataset is of 60 dimensions and 600 records. There are six clusters: normal, cyclic, increasing trend, decreasing trend, upward shift and downward shift (see Fig. 14). To make it easy to see the six clusters, only a few time series are shown in each cluster in the figure.

Fig. 14. Control chart time series data.

The clustering given in the UCI KDD Archive is used as the standard result, and Conditional Entropy (CE) and Normalized Mutual Information (NMI) are employed to measure the quality of clustering. Compactness (Zait and Messatfa, 1997) is also widely used to measure the quality of clustering, but it favours sphere-shaped clusters since the diameter is used. CE and NMI have been used to measure the quality of clustering by Strehl and Ghosh (2002), Fern and Brodley (2003) and Pfitzner et al. (2009), and similar measures based on entropy have also been used by Hu and Sung (2006). Conditional Entropy measures the uncertainty of the class labels given a clustering solution. For one clustering with m clusters and a second clustering with k clusters, the Conditional Entropy is

efined as CE =∑k

j=1nj∗Ej

N , where entropy Ej = −∑m

i=1pij log(pij), nj

s the size of cluster j in the second clustering, pij is the probabilityhat a member of cluster i in the first clustering belongs to cluster j inhe second clustering, pi is the probability of cluster i, pj is the prob-bility of cluster j, and N is the size of dataset. The value of CE is aon-negative real number. The less CE is, the more the tested resultpproaches the standard result. The two results become the same asach other when CE is zero. For two clustering solutions C1 and C2,he normalized mutual information is defined as NMI = MI√

H(C1)H(C2),∑ (

pij)

here mutual information MI = i,jpij log pipj, and H(C1) and

(C2) denote the entropy of C1 and C2, respectively. The value ofMI lies in [0,1]. Contrary to CE, the larger the value of NMI is, the

2 http://kdd.ics.uci.edu/.

605040 6050403020100

time series data.

better is the clustering. If NMI is one, then the two clusterings arethe same as each other. In all, we would like to minimize CE andmaximize NMI.
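Both measures can be computed directly from the contingency table of the two clusterings. The sketch below is a straightforward implementation of the definitions above (not the authors' code); it uses natural logarithms and treats the first clustering as the standard result.

import numpy as np

def ce_and_nmi(labels_true, labels_pred):
    """Conditional Entropy and Normalized Mutual Information of two clusterings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    N = len(labels_true)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)

    # contingency table: n[i, j] = number of objects in class i and cluster j
    n = np.array([[np.sum((labels_true == ci) & (labels_pred == cj))
                   for cj in clusters] for ci in classes], dtype=float)
    p_ij = n / N                       # joint probabilities
    p_i = p_ij.sum(axis=1)             # class probabilities
    p_j = p_ij.sum(axis=0)             # cluster probabilities

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    # CE = sum_j (n_j / N) * E_j, with E_j the entropy of class labels in cluster j
    n_j = n.sum(axis=0)
    E_j = np.array([entropy(n[:, j] / n_j[j]) if n_j[j] > 0 else 0.0
                    for j in range(len(clusters))])
    CE = float(np.sum(n_j * E_j) / N)

    # NMI = MI / sqrt(H(C1) * H(C2))
    nz = p_ij > 0
    MI = np.sum(p_ij[nz] * np.log(p_ij[nz] / np.outer(p_i, p_j)[nz]))
    NMI = float(MI / np.sqrt(entropy(p_i) * entropy(p_j)))
    return CE, NMI

# Two identical partitions (up to label names) give CE = 0 and NMI = 1
print(ce_and_nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))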

Since the dimensionality is high and the records are relatively few, subspace clustering is employed to find the clusters. The parameters are selected based on studying the data and fine-tuning of the parameters. Generally speaking, the greater the average dimensionality of subspace clusters and the density threshold are, the stricter is the condition for merging two objects or clusters into one cluster, and the result tends to be composed of many smaller clusters. On the contrary, the smaller the two parameters are, the looser is the condition for merging, and the result tends to consist of a few bigger clusters. From Fig. 14, we can see that, for the time series in a cluster, the values in most dimensions are close and there are big differences in around 10–20 dimensions. Therefore, it is reasonable to set the average dimensionality of subspace clusters to 40–50. In our experiments on the above data, we found that, with an average dimensionality less than 40, it often merges increasing trend and upward shift into one cluster and decreasing trend and downward shift into another cluster. The best result was achieved by setting the average dimensionality to 45 and the density threshold to 8. The results of clustering with our algorithm and Random Projection are given in Table 4. From the table, we can see that the clustering of our algorithm is of the lowest CE and the highest NMI, which shows that our algorithm performs better than Random Projection. The superiority of AGRID+ over IORDER also shows the effectiveness of density compensation.

Generally speaking, the average dimensionality of subspaces can be set based on domain knowledge in specific applications. However, if a user has no idea at all about how to set a right value for it, he may run the algorithm multiple times with various values of the parameter, and then choose the best clustering with the help of some internal validation measures or relative validation measures, such as Compactness, the Silhouette index, Figure of merit and Stability (Brun et al., 2007; Halkidi et al., 2001).

Table 4
AGRID+ vs Random Projection.

        AGRID+    IORDER    Random Projection
CE      0.466     0.517     0.706
NMI     0.845     0.822     0.790



5. Discussions

To reduce the cost of computation, some assumptions are made in this paper. One assumption is to reduce the computation cost of Vc in Eq. (1) by approximating it in a simple way. For an object in a d-dimensional space, when all neighboring cells with order no more than q are considered, we need to calculate the volume of the overlapped space for every neighboring cell with order no more than q. That is, the number of volume calculations would be 1 + 2·C(d,1) + 2^2·C(d,2) + · · · + 2^q·C(d,q), where C(d,i) is the binomial coefficient. We can see that the above computation is costly, especially when d and q are big numbers. To simplify the above calculation, we use the Vc for the object (a, a, . . ., a) to approximate the Vc for the object (a1, a2, . . ., ad), where a is the mean of a1, a2, . . ., ad. With the approximation, we need to calculate only (d + 1) overlapped spaces (i.e., one for each order of neighbors). Although for a specific neighbor the volume may be over- or under-estimated, the overall volume Vc, which is the sum of the overlapped spaces with all neighbors of order no more than q, is well approximated. The effectiveness of the approximation is shown in Table 1, where the accuracy is improved from 87.7% to 95.9% with density compensation. It is also validated by an improvement in accuracy from 85.3% (IORDER) to 93.7% (AGRID+), as shown in Table 3. Moreover, the above assumption is used only to calculate the volume for density compensation, and the calculation of Cntq(α) in Eqs. (1) and (2) is not affected by it.
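Under this reading of the neighbor counts (a cell has C(d,i)·2^i neighbors of order i, which matches the (2d + 1) first-order neighborhood used by AGRID), a worked example for d = 15 and q = 2 gives 1 + 2·15 + 4·105 = 451 overlap-volume calculations per object without the approximation, versus only d + 1 = 16 with it.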

Regarding space partitioning for producing the grid and cells, some other techniques, such as the adaptive grid (Nagesh et al., 1999) and the optimal grid (Hinneburg and Keim, 1999), partition the data space by considering the data distribution in every dimension. However, they do not fit our algorithm for the following reasons. Firstly, equal-sized cells are preferred in our algorithm to cover a neighborhood with cells, since the density is defined based on the neighborhood. Distribution-based partitioning often produces cells with great variance in their sizes. Secondly, the number of cells needs to be able to change smoothly, so that it would be easier to choose an appropriate value for DT or to fine-tune DT. The above two features are important to our algorithm, but distribution-based partitioning fails to provide them. Although our proposed partitioning method looks simple, it addresses the above two issues well and its effectiveness is shown in our experiments.

6. Conclusions

In this paper, we have presented a novel and efficient grid-density based clustering approach, which has four novel technical features. The first feature is that it takes objects (or points) as atomic units, in which the size requirement on cells is waived without losing clustering accuracy. The second one is the concept of ith-order neighbors, with which the neighboring cells are organized into a couple of groups to lower the computational complexity and meet different requirements of accuracy. The third is the idea of density compensation to improve the accuracy of densities and clustering. Last but not least, the measure of minimal subspace distance is used to help AGRID+ to discover clusters in subspaces. We have experimentally evaluated our approach and demonstrated that our algorithm significantly reduces computation cost and improves clustering quality. In fact, besides AGRID+, our measure of minimal subspace distance can also help other algorithms to find clusters in subspaces, which will be included in our future work. Another two future works are: (1) finding an optimal order of the dimensions based on the distribution of data in every single dimension before partitioning; and (2) using internal indices to obtain optimal settings of parameters.


Acknowledgements

This research was done when Yanchang Zhao was an Australian Postdoctoral Fellow (Industry) at the Faculty of Engineering & IT, University of Technology, Sydney, Australia.

This work is supported in part by the Australian Research Council (ARC) under large grant DP0985456, the China "1000-Plan" Distinguished Professorship, the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics, and the Guangxi NSF (Key) grants.

References

Aggarwal, C.C., Hinneburg, A., Keim, D.A., 2001. On the surprising behavior of distance metrics in high dimensional space. In: Proc. of the 8th International Conference on Database Theory.

Agrawal, R., Gehrke, J., et al., 1998. Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 94–105.

Alsabti, K., Ranka, S., Singh, V., 1998. An efficient K-means clustering algorithm. In: Proc. of the First Workshop on High Performance Data Mining, Orlando, FL.

Ankerst, M., Breunig, M., et al., 1999. OPTICS: ordering points to identify the clustering structure. In: Proc. of the 1999 ACM-SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, June, pp. 49–60.

Assent, I., Krieger, R., Glavic, B., Seidl, T., 2008. Clustering multidimensional sequences in spatial and temporal databases. Knowledge and Information Systems 16 (July (1)), 29–51.

Berkhin, P., 2002. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software.

Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition, vol. 40. Elsevier Science Inc., pp. 807–824.

Ester, M., Kriegel, H.-P., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August, pp. 226–231.

Fern, X.Z., Brodley, E., 2003. Random projection for high dimensional data clustering: a clustering ensemble approach. In: Proc. of the 20th International Conference on Machine Learning (ICML'03), Washington, DC.

Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6, 303–360.

Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 73–84.

Guha, S., Rastogi, R., Shim, K., 1999. ROCK: a robust clustering algorithm for categorical attributes. In: Proc. of the 1999 International Conference on Data Engineering (ICDE'99), Sydney, Australia, March, pp. 512–521.

Halkidi, M., Batistakis, Y., Vazirgiannis, M., 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17, 107–145.

Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Higher Education Press, Morgan Kaufmann Publishers.

Hinneburg, A., Keim, D.A., 1998. An efficient approach to clustering in large multimedia databases with noise. In: Proc. of the 1998 International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, August, pp. 58–65.

Hinneburg, A., Keim, D.A., 1999. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proc. of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.

Hu, T., Sung, S.Y., 2006. Finding centroid clusterings with entropy-based criteria. Knowledge and Information Systems 10 (November (4)), 505–514.

Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304.

Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Computing Surveys 31 (September (3)).

Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: a hierarchical clustering algorithm using dynamic modelling. IEEE Computer, Special Issue on Data Analysis and Mining 32 (August (8)), 68–75.

Kolatch, E., 2001. Clustering Algorithms for Spatial Databases: A Survey. Dept. of Computer Science, University of Maryland, College Park.

Hinneburg, A., Aggarwal, C.C., Keim, D.A., 2000. What is the nearest neighbor in high dimensional spaces? In: Proc. of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, pp. 506–515.

Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and Information Systems 14 (March (3)), 273–298.

Nagesh, H., Goil, S., Choudhary, A., 1999. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report 9906-010, Northwestern University, June 1999.

Procopiuc, M., Jones, M., Agarwal, P., Murali, T.M., 2002. A Monte-Carlo algorithm for fast projective clustering. In: Proc. of the 2002 International Conference on Management of Data.

Ng, R., Han, J., 1994. Efficient and effective clustering method for spatial data mining. In: Proc. of the 1994 International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, September, pp. 144–155.



Pfitzner, D., Leibbrandt, R., Powers, D., 2009. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems 19 (June (3)), 361–394.

Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proc. of the 1998 International Conference on Very Large Data Bases (VLDB'98), New York, August, pp. 428–429.

Strehl, A., Ghosh, J., 2002. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Machine Learning Research 3, 583–617.

Wang, W., Yang, J., Muntz, R., 1997. STING: a statistical information grid approach to spatial data mining. In: Proc. of the 1997 International Conference on Very Large Data Bases (VLDB'97), Athens, Greece, August, pp. 186–195.

Zait, M., Messatfa, H., 1997. A comparative study of clustering methods. Future Generation Computer Systems 13, 149–159.

Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. In: Proc. of the 1996 ACM-SIGMOD International Conference on Management of Data (SIGMOD'96), Montreal, Canada, June, pp. 103–114.

Zhao, Y., Song, J., 2003. AGRID: an efficient algorithm for clustering large high-dimensional datasets. In: Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), Seoul, Korea, April, pp. 271–282.

Yanchang Zhao is a Senior Data Mining Specialist in Centrelink, Australia. He was an Australian Postdoctoral Fellow (Industry) at the Data Sciences and Knowledge Discovery Research Lab, Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Australia, from 2007 to 2009. His research interests are clustering, sequential patterns, time series, association rules and their applications. He is a member of the IEEE.

Jie Cao is a Professor and the Chair of the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics. He is a winner of the Program for New Century Excellent Talents in Universities (NCET). He received his PhD degree from the Southeast University, China, in 2002. His main research interests include cloud computing, business intelligence and data mining. Dr. Cao has published one book and more than 40 refereed papers in various journals and conferences.


Chengqi Zhang has been a Professor of Information Technology at the University of Technology, Sydney (UTS) since December 2001. He has been the Director of the UTS Priority Investment Research Centre for Quantum Computation and Intelligent Systems since April 2008. He has been Chairman of the Australian Computer Society National Committee for Artificial Intelligence since November 2005.

Prof. Zhang obtained his PhD degree from the University of Queensland in 1991, followed by a Doctor of Science (DSc – Higher Doctorate) from Deakin University in 2002.

Prof. Zhang's research interests mainly focus on Data Mining and its applications. He has published more than 200 research papers, including several in first-class international journals, such as Artificial Intelligence, IEEE and ACM Transactions. He has published six monographs and edited 16 books. He has delivered 12 keynote/invited speeches at international conferences over the last six years. He has attracted seven Australian Research Council grants.

He is a Fellow of the Australian Computer Society (ACS) and a Senior Member of the IEEE Computer Society (IEEE). He has been serving as an Associate Editor for three international journals, including IEEE Transactions on Knowledge and Data Engineering, from 2005 to 2008; and he served as General Chair, PC Chair, or Organising Chair for five international conferences including ICDM and WI/IAT. His personal web page can be found at: http://www-staff.it.uts.edu.au/~chengqi/.

Shichao Zhang is a China "1000-Plan" Distinguished Professor and the Dean of the College of Computer Science and Information Technology at the Guangxi Normal University, Guilin, China. He received his PhD degree in Applied Mathematics from the China Academy of Atomic Energy in 1997. His research interests include information quality and multi-source data mining. He has published 10 solely-authored international journal papers, about 50 international journal papers and over 60 international conference papers. He has been a CI for winning 10 national-class projects in China and Australia. He is serving as an associate editor for IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, and IEEE Intelligent Informatics Bulletin, and served as a (vice-)PC Chair for 5 international conferences.