


Dynamic Cluster Formation Using Level Set Methods

Andy M. Yip, Chris Ding, and Tony F. Chan

Abstract— Density-based clustering has the advantages of (i) allowing arbitrary cluster shapes and (ii) not requiring the number of clusters as input. However, when clusters touch each other, both the cluster centers and cluster boundaries (as the peaks and valleys of the density distribution) become fuzzy and difficult to determine. We introduce the notion of the cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well-separated, CIFs are similar to density functions. But when clusters come close to each other, CIFs still clearly reveal the cluster centers, the cluster boundaries, and the degree of membership of each data point to the cluster it belongs to. Clustering through bump hunting and valley seeking based on these functions is more robust than that based on density functions obtained by kernel density estimation, which are often oscillatory or over-smoothed. These problems of kernel density estimation are resolved using Level Set Methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, are presented which illustrate the advantages of our approach.

Index Terms— Dynamic clustering, level set methods, cluster intensity functions, kernel density estimation, cluster contours, partial differential equations.

I. INTRODUCTION

RECENT computer, internet and hardware advances produce massive data which are accumulated rapidly. Applications include sky surveys, genomics, remote sensing, pharmacy, network security and web analysis. Undoubtedly, knowledge acquisition and discovery from such data become an important issue. One common technique to analyze data is clustering, which aims at grouping entities with similar characteristics together so that main trends or unusual patterns may be discovered. Clustering is an example of unsupervised learning: there is no provision of any training examples to guide the grouping of the data. In other words, cluster analysis can be applied without a priori knowledge of the class distribution. We refer the reader to [1], [2] for more detailed examples of the usage of clustering techniques in a variety of contexts.

This work has been partially supported by grants from DOE under contract DE-AC03-76SF00098, NSF under contracts DMS-9973341, ACI-0072112 and INT-0072863, ONR under contract N00014-03-1-0888, NIH under contract P20 MH65166, and the NIH Roadmap Initiative for Bioinformatics and Computational Biology U54 RR021813 funded by the NCRR, NCBC, and NIGMS.

A. Yip is with the Department of Mathematics, National University of Singapore, 2, Science Drive 2, Singapore 117543, Singapore. Email: [email protected]

C. Ding is with the Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720. Email: [email protected]

T. Chan is with the Department of Mathematics, University of California, Los Angeles, CA 90095-1555. Email: [email protected]

A successful clustering task depends on a number of factors: collection of data, selection of variables, cleaning of data, choice of similarity measures, choice of a clustering algorithm, and interpretation of clustering results. In this paper, we focus on proposing a general-purpose clustering algorithm. We assume that we are given a set of data in a Euclidean space, i.e. each object is described by a set of numerical attributes, and pair-wise dissimilarity is measured by Euclidean distance. We also assume that preprocessing steps such as variable selection, data cleaning, and missing value imputation are treated separately. However, the proposed algorithm is robust to noise and outliers, so it can tolerate a certain amount of imperfection in data cleaning.

There are a number of paradigms for defining clusters. Our approach is density-based. The idea is that clusters are high density regions separated by low density regions. While this approach has a number of desirable properties (detailed in §II), a potential drawback common to all algorithms of this type is over-fitting of the density. For example, the density estimated from a set of samples drawn from a uniform distribution is generally not uniform. A density-based clustering method must take this effect into account; otherwise, the clustering results could be disastrous. We show experimentally that several well-known density-based methods fail to comply with such a robustness requirement. As a result, natural clusters are split into pieces due to spurious differences in the estimated density.

To remedy such a problem, we propose a partial differential equation model to detect high density regions. The model has a built-in mechanism, which can be thought of as adding surface tension to cluster boundaries, to overcome the localized roughness of the density landscape. To implement the dynamical evolution of cluster boundaries, we employ Level Set Methods, which allow moving boundaries to split or merge easily. We further introduce the concept of cluster intensity functions, which clearly reveal cluster structures. Partitioning the data according to the valleys of such functions provides an extra degree of robustness.

The organization of the rest of the paper is as follows. In the next subsection, we give a brief review of clustering algorithms. Then, we highlight some characteristics of density-based methods and some related concepts in §II. In §III, we outline the main steps of our methodology. The initialization steps of our method are presented in §IV. Next, in §V, comes the major step, which is to advance cluster boundaries robustly. A novel concept, the cluster intensity function, for finalizing the clusters is introduced in §VI. Experimental results are then presented in §VII. Finally, some concluding remarks are given in §VIII.

A. A Review of Clustering Algorithms

To facilitate a better understanding of our density-based method, we include a brief summary of various classes of clustering algorithms so as to contrast the different assumptions underlying each class of algorithms. Pointers to the literature are also given. A more thorough discussion of density-based methods and relevant concepts is presented in the next section. For more detailed reviews of clustering techniques, we refer the reader to [1], [2], [3], [4].

(i) Optimization-based methods. They seek a partition of the dataset so as to optimize an objective. Usually, a measure of within-cluster similarity is maximized and/or a measure of between-cluster dissimilarity is maximized. Constraints such as the number of clusters or the minimum separation between clusters may be incorporated [5]. Perhaps the most well-known algorithm of this type is Lloyd's Algorithm [6] for optimizing the k-means objective [7]. A generalization of the k-means objective and Lloyd's Algorithm to Bregman divergences can be found in [8]. Algorithms for optimizing the k-medoids objective, a variant of k-means which is more robust to outliers, include PAM [3], CLARA [3] and CLARANS [9].

(ii) Hierarchical methods. They aim at producing a hierarchical tree (dendrogram) which depicts the successive merging (agglomerative methods) or splitting (divisive methods) of clusters. Examples of hierarchical methods include DIANA, Single Linkage, Average Linkage, Complete Linkage, Centroid and Ward's methods [3]. Different agglomerative methods differ by the way that the similarity between two clusters is updated. If the linkage satisfies a cluster aggregate inequality, then the algorithm can be implemented efficiently at a time complexity of only O(N^2) [10]. A more recent method, CHAMELEON [11], uses a sophisticated merging criterion which takes the clusters' internal structure into account as well.

(iii) Density-based methods. They are based on the idea that clusters are high density regions separated by low density regions. Methods of this type include Valley Seeking [12], DBSCAN [13], GDBSCAN [14], CLIQUE [15], DENCLUE [16] and OPTICS [17]. Our approach is density-based. We will further elaborate the pros and cons of this paradigm in the next section.

(iv) Grid-based methods. The feature space is projected onto a regular grid. Presumably, most non-empty grid cells are highly populated. Thus, by using a few representatives or summary statistics for each grid cell, a form of data compression is obtained. Such an approach is usually used for large databases. STING [18] and WaveCluster [19] fall into this category.

(v) Graph-based methods. Data points are represented by nodes and pair-wise similarities are depicted by the edge weights. Once a graph is constructed, graph partitioning methods can be used to obtain clusters of data points. Examples of graph-based methods are Spectral Min-Max Cut [20] and Shared Nearest Neighbors [21].

(vi) Model-based methods. They are application-specific. They assume knowledge of a model which prescribes the nature of the data. CLICK [22] uses a finite mixture model [23] to model pair-wise similarity between points in the same cluster and points in different clusters.


II. DENSITY-BASED APPROACHES AND LEVEL SET METHODS

A. Density-based Approaches

Among the various classes of clustering algorithms, density-based methods are of special interest for their connections to statistical models, which are very useful in many applications. Density-based clustering has the advantages of (i) allowing arbitrary cluster shapes and (ii) not requiring the number of clusters as input, which is usually difficult to determine.

There are several basic approaches for density-based clustering:

(A1) A common approach is so-called bump-hunting: first find the density peaks or "hot spots" and then expand the cluster boundaries outward until they meet somewhere, presumably in the valley regions (local minima) of the density contours. The CLIQUE algorithm [15] adopted this methodology.

(A2) Another direction is to start from valley regions and gradually work uphill to connect data points in low-density regions to clusters defined by density peaks. This approach has been used in Valley Seeking [12] (see below) and DENCLUE [16].

(A3) A recent approach, DBSCAN [13], is to compute reachability from some seed data and then connect those "reachable" points to their corresponding seed. Here, a point p is reachable from a point q (with respect to MinPts and Eps) if there is a chain of points p_1 = q, p_2, …, p_n = p such that, for each i, the Eps-neighborhood of p_i contains at least MinPts points and contains p_{i+1} (a small sketch of this notion is given below). A variant, called OPTICS, has been proposed in [17] which orders the data in such a way that clusterings at different density parameters are efficiently obtained.
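For concreteness, the reachability notion in (A3) can be sketched as follows. This is our own illustrative reimplementation of the definition above, not the reference DBSCAN code; the function name reachable_from is ours.

```python
import numpy as np
from collections import deque

def reachable_from(X, q_idx, eps, min_pts):
    """Return indices of points density-reachable from the point with index q_idx.

    A point p is reachable from q if there is a chain q = p1, p2, ..., pn = p
    such that each Eps-neighborhood of p_i contains at least min_pts points
    and contains p_{i+1}.
    """
    n = len(X)
    # Pairwise Euclidean distances (fine for small datasets).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighborhoods]

    reached, queue = {q_idx}, deque([q_idx])
    while queue:
        i = queue.popleft()
        if not core[i]:
            continue  # only core points can extend the chain
        for j in neighborhoods[i]:
            if j not in reached:
                reached.add(j)
                queue.append(j)
    return sorted(reached)
```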

When clusters are well-separated, density-based methods work well because the peak and valley regions are well-defined and easy to detect. When clusters are close to each other, which is often the case in real situations, both the cluster centers and cluster boundaries (as the peaks and valleys of the density distribution) become fuzzy and difficult to determine. In higher dimensions, the boundaries become wiggly and over-fitting often occurs.

In this paper, we adopt the framework of bump-hunting, but with several new ingredients incorporated to overcome problems that many density-based algorithms share. The major steps of our method are as follows: (i) obtain a probability density function (PDF) by Kernel Density Estimation; (ii) identify peak regions of the density function using a surface evolution equation implemented by the Level Set Methods (LSM); (iii) construct a distance-based function called the Cluster Intensity Function (CIF); (iv) apply Valley Seeking on the CIF. In the next subsections, we describe each of the above four notions.

B. Kernel Density Estimation (KDE)

In density-based approaches, one needs to estimate the density of the data. We particularly consider the use of kernel density estimation [24], [25], a non-parametric technique to estimate the underlying probability density from samples. More precisely, given a set of data {x_i}_{i=1}^N ⊂ R^p, the probability density function (PDF) is defined to be

$$f(x) := \frac{1}{N h^p} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right), \qquad (1)$$

where K(x) is a positive kernel and h is a scale parameter. Clusters may then be obtained according to the partition defined by the valleys of f. An efficient valley seeking algorithm is reviewed below.
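As a small illustration of (1), the following sketch evaluates a Gaussian-kernel estimate at a set of query points. The Gaussian kernel and the bandwidth value in the example are our choices for illustration; the paper itself leaves K and h open.

```python
import numpy as np

def kde(x_query, X, h):
    """Kernel density estimate f(x) = 1/(N h^p) * sum_i K((x - x_i)/h),
    using the Gaussian kernel K(u) = (2*pi)^(-p/2) * exp(-||u||^2 / 2)."""
    N, p = X.shape
    # Squared distances between query points and data points, scaled by h.
    u2 = np.sum((x_query[:, None, :] - X[None, :, :]) ** 2, axis=-1) / h**2
    K = (2 * np.pi) ** (-p / 2) * np.exp(-u2 / 2)
    return K.sum(axis=1) / (N * h**p)

# Example: density of a 2-D Gaussian blob evaluated at the data points themselves.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
f = kde(X, X, h=0.5)
```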

There are a number of important advantages of the kernel density approach. Identifying high density regions is independent of the shape of the regions. The smoothing effect of kernels makes density estimates robust to noise. Kernels are localized in space, so outliers do not affect the majority of the data. The number of clusters is automatically determined from the estimated density function, but one needs to adjust the scale parameter h to obtain a good estimate. When h is chosen too large, f will be over-smoothed and will eventually become unimodal as h → ∞. On the other hand, when h is chosen too small, f may contain many spurious peaks and will eventually contain one peak for each data point as h → 0. Theoretically, an optimal h is one which minimizes the integrated squared error (ISE) $\int_{\mathbb{R}^p} (f(x) - f_{\mathrm{true}}(x))^2\,dx$ or the expected value of the ISE, where f_true is the (unknown) underlying true density. A common practice is to apply heuristics such as cross-validation to obtain a reasonably good estimate of the optimal h [26].
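One simple heuristic in this spirit (our own minimal sketch; [26] discusses more principled cross-validation criteria) is to pick the h that maximizes the leave-one-out log-likelihood of the data under the kernel estimate:

```python
import numpy as np

def loo_log_likelihood(X, h):
    """Leave-one-out log-likelihood of a Gaussian-kernel KDE with bandwidth h."""
    N, p = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2 * np.pi) ** (-p / 2) * np.exp(-d2 / (2 * h**2))
    np.fill_diagonal(K, 0.0)          # leave each point out of its own estimate
    f_loo = K.sum(axis=1) / ((N - 1) * h**p)
    return np.log(f_loo + 1e-300).sum()

def select_bandwidth(X, candidates):
    """Pick the candidate bandwidth with the largest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(X, h))
```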

Despite the numerous advantages of kernel density methods, there are some fundamental drawbacks which deteriorate the quality of the resulting clusterings. PDFs obtained through KDE are very often oscillatory (uneven), since they are constructed by adding many kernels together. Such an oscillatory nature may lead to the problem of over-fitting, whereas a smooth boundary between clusters is usually preferred to an oscillatory one. Last but not least, valleys and peaks of PDFs are often very vague, especially when clusters are close together.

C. Level Set Methods (LSM)

We recognize that the key issue in the density-based approach is how to advance the boundary, either from peak regions outward towards valley regions, or the other way around. In this paper, we employ LSM, which are effective tools for computing boundaries in motion, to resolve the boundary advancing problem. LSM have well-established mathematical foundations and have been successfully applied to solve a variety of problems in image processing, computer vision, computational fluid dynamics and optimal design. LSM use implicit functions to represent complicated boundaries conveniently. While implicit representations of static surfaces have been widely used in computer graphics, LSM move one step further, allowing the surfaces to dynamically evolve in an elegant and highly controllable way; see [27], [28] for details.

Advantages of LSM include: (i) the boundaries in motion can be made smooth conveniently, and smoothness can be easily controlled by a parameter that characterizes surface tension; (ii) merging and splitting of boundaries can be easily done in a systematic way. Property (ii) is very important in data clustering, as clusters can be merged or split in an automatic fashion. Furthermore, the advancing of boundaries is achieved naturally within the framework of partial differential equations (PDEs) which govern the dynamics of the boundaries.

In LSM, a surface Γ(t) at time t is represented by the zero level set of a Lipschitz function φ = φ(x, t), i.e., Γ(t) = {x : φ(x, t) = 0}. The value of φ on non-zero level sets can be arbitrary, but a common practice is to choose φ to be the signed distance function ψ_{Γ(t)}(x) for numerical accuracy reasons [27]. Our convention is that φ < 0 inside Γ(t) and φ > 0 outside Γ(t).

In general, the signed distance function with respect to a set of surfaces Γ is defined by

$$\psi_\Gamma(x) = \begin{cases} -\min_{y\in\Gamma} \|x-y\|_2 & \text{if } x \text{ lies inside } \Gamma, \\ \phantom{-}\min_{y\in\Gamma} \|x-y\|_2 & \text{if } x \text{ lies outside } \Gamma, \end{cases} \qquad (2)$$

where ‖·‖_2 denotes the Euclidean norm. To evolve Γ(t) (where Γ(0) is the initial data) with speed β = β(x, t), the equation is given by

$$\frac{\partial\phi}{\partial t} = \beta\,\|\nabla\phi\|_2, \qquad (3)$$

which is known as the level set equation [28]. Our PDE also takes this form. The art is to design the speed function β effectively to achieve one's goal.
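To make the implicit representation concrete, here is a small sketch, of our own construction, of a signed distance function whose zero level set is a circle, following the convention φ < 0 inside and φ > 0 outside:

```python
import numpy as np

# Grid on [-2, 2]^2.
n = 201
xs = np.linspace(-2.0, 2.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")

# Signed distance to the circle of radius 1 centered at the origin:
# negative inside, positive outside; the zero level set is the circle itself.
phi = np.sqrt(X**2 + Y**2) - 1.0

# The contour Gamma is approximated by grid cells where |phi| is below one cell width.
on_boundary = np.abs(phi) < (xs[1] - xs[0])
```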

D. Cluster Intensity Functions (CIF)

We may use LSM strictly as an effective mechanism for advancing boundaries. For example, in the above approach (A1), once the density peaks are detected, we may advance cluster boundaries towards low-density regions using LSM. This would be an LSM-based bump hunting approach.

However, it turns out that by utilizing LSM we can further develop a new and useful concept of cluster intensity function. A suitably modified version of LSM becomes an effective mechanism to formulate CIFs in a dynamic fashion. Therefore our approach goes beyond the three approaches (A1)–(A3) described earlier.

CIFs are effective at capturing important characteristics of clusters. When clusters are well-separated, CIFs become similar to density functions. But when clusters come close to each other, CIFs still clearly describe the cluster structure, whereas density functions, and hence the cluster structure, become blurred. In this sense, CIFs are a better representation of clusters than density functions.

CIFs resolve the problems of PDFs while the advantages of PDFs are inherited. Although CIFs are also built on top of PDFs, they are cluster-oriented, so that only the information contained in PDFs that is useful for clustering is kept while other irrelevant information is filtered out. We have shown that such a filtering process is very important in clustering, especially when the clusters touch each other. On the other hand, it is well-known that when the clusters are well-separated, valley seeking on PDFs results in very good clusterings. Since the valleys of CIFs and PDFs are very similar, if not identical, when the clusters are well-separated, clustering based on CIFs is as good as that based on PDFs. However, the advantages of CIFs over PDFs become very significant when the clusters are close together.

E. Valley Seeking

In the graph-based version described in [12], the idea is to connect each point to another point nearby having a higher density. In this way, we obtain a forest where each tree is a cluster. Presumably, the root node of each tree is located at a density peak, whereas most leaf nodes are in valley regions.

This method starts off with a density estimate evaluated at each data point. The density can be obtained using the PDF described in (1) or using the k-nearest neighbor method [25]. In the experiments below, we use the PDF f. For each point x_i, denote by N_r(x_i) the set of neighboring points of x_i within a distance of r, excluding x_i itself. For each x_j ∈ N_r(x_i), the directional derivative of f at x_i along x_j − x_i is estimated by s_i(j) = [f(x_j) − f(x_i)]/‖x_j − x_i‖_2. Let j_max = arg max_j s_i(j). Next, if s_i(j_max) < 0, then x_i is set to be a root node. If s_i(j_max) > 0, then the point x_{j_max} is set to be the parent node of x_i. If s_i(j_max) = 0 and x_i is connected to a root node, then that root node is assigned to be the parent of x_i; otherwise, x_i becomes a root node. When all the points have been visited, a forest has been constructed and each connected component (a tree) is a cluster.

In our method, once the CIF is obtained, cluster labels can be easily assigned by applying the above algorithm but with the density function replaced by the distance-based CIF.
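A direct sketch of the graph-based valley seeking procedure described above (our own reimplementation; it can be applied to the PDF f or, as in our method, to the CIF):

```python
import numpy as np

def valley_seeking(X, g, r):
    """Cluster by steepest ascent of g (a PDF or CIF evaluated at each data point).

    Each point is linked to its neighbor (within radius r) of steepest ascent;
    the trees of the resulting forest are the clusters.
    """
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    parent = np.arange(n)                      # root nodes point to themselves
    for i in range(n):
        nbrs = np.flatnonzero((dists[i] <= r) & (np.arange(n) != i))
        if len(nbrs) == 0:
            continue
        slopes = (g[nbrs] - g[i]) / dists[i, nbrs]
        j = nbrs[np.argmax(slopes)]
        if slopes.max() > 0:
            parent[i] = j                      # climb towards higher g
        # the s_i(j_max) = 0 tie case is treated as a root here for simplicity
    # Follow parent pointers to the root to get a cluster label for each point.
    labels = np.empty(n, dtype=int)
    for i in range(n):
        k = i
        while parent[k] != k:
            k = parent[k]
        labels[i] = k
    return labels
```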

III. AN OUTLINE OF OUR CLUSTER FORMATION STRATEGIES

Our clustering method consists of several major components. We here give a high-level description of the method in order to provide a global picture.

We start by introducing some terminology. A cluster core contour (CCC) is a closed surface surrounding the core part/density peak of a cluster, at which the density is relatively high. A cluster boundary refers to the interface between two clusters, i.e., a surface separating two clusters. A CCC is usually located near a density peak, while a cluster boundary is located in the valley regions of a density distribution. Here, a point x is said to belong to a valley region of f if there exists a direction along which f is a local minimum. The gradient and the Laplacian of a function g are denoted by ∇g and ∆g respectively.

Our method consists of the following main steps, which will be elaborated in detail in the next sections:
(1) initialize CCCs to surround high density regions;
(2) advance the CCCs using LSM to find density peaks;
(3) apply the valley seeking algorithm on the CIF constructed from the final CCCs to obtain clusters.

IV. INITIALIZATION OF CLUSTER CORE CONTOURS (CCC)

We now describe how to construct an initial set of cluster core contours Γ_0 effectively. The basic idea is to locate the contours where f has a relatively large (norm of) gradient. In this way, the regions inside Γ_0 contain most of the data points — we refer to these regions as cluster regions. Similarly, the regions outside Γ_0 contain no data points at all, and we refer to them as non-cluster regions. Such an interface Γ_0 is constructed as follows.

Definition 1: An initial set of CCCs Γ_0 is the set of zero crossings of ∆f, the Laplacian of f. Here, a point x is a zero crossing if ∆f(x) = 0 and, within any arbitrarily small neighborhood of x, there exist x_+ and x_− such that ∆f(x_+) > 0 and ∆f(x_−) < 0.

We note that Γ_0 often contains several closed surfaces, denoted by {Γ_{0,i}}. The idea of using zero crossings of ∆f is that they outline the shape of datasets very well and that, for many commonly used kernels (e.g. Gaussian and cubic B-spline), the sign of ∆f(x) indicates whether x is inside (∆f(x) < 0) or outside (∆f(x) > 0) Γ_0.

The reasons for using zero crossings of ∆f to outline the shape of datasets are severalfold: (a) the solution is a set of surfaces at which ‖∇f‖_2 is relatively large; (b) the resulting Γ_0 is a set of closed surfaces; (c) Γ_0 captures the shape of clusters well; (d) the Laplacian operator is an isotropic operator which does not bias towards certain directions; (e) the equation is simple and easy to solve; (f) it coincides with the "definition" of edges in the case of image processing — in fact, a particular application of zero crossings of the Laplacian in image processing is to detect edges that outline objects [29]; (g) the sign of ∆f(x) indicates whether x is inside (negative) or outside (positive) a cluster region.

An analytic formula for ∆f is often available. In the case of Gaussian kernels, we have

$$\Delta f(x) = \sum_{i=1}^{N} \frac{\|x - x_i\|_2^2 - h^2 p}{N h^{p+4} (2\pi)^{p/2}}\, K\!\left(\frac{x - x_i}{h}\right).$$
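A rough sketch (our own discretization) of locating the initial CCCs as sign changes of ∆f on a regular grid, with ∆f evaluated through the closed-form Gaussian-kernel expression above:

```python
import numpy as np

def laplacian_f(x_query, X, h):
    """Closed-form Laplacian of the Gaussian-kernel KDE at the query points."""
    N, p = X.shape
    diff2 = np.sum((x_query[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-diff2 / (2 * h**2))
    coef = (diff2 - h**2 * p) / (N * h**(p + 4) * (2 * np.pi) ** (p / 2))
    return np.sum(coef * K, axis=1)

def zero_crossing_mask(lap_grid):
    """Mark 2-D grid cells where the Laplacian changes sign against a neighbor."""
    mask = np.zeros_like(lap_grid, dtype=bool)
    s = np.sign(lap_grid)
    mask[:-1, :] |= s[:-1, :] != s[1:, :]
    mask[1:, :] |= s[1:, :] != s[:-1, :]
    mask[:, :-1] |= s[:, :-1] != s[:, 1:]
    mask[:, 1:] |= s[:, 1:] != s[:, :-1]
    return mask
```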

In Fig. 1(a) and (b), we show a dataset drawn from a mixture of three Gaussian components and the PDF f obtained by KDE, respectively. The dataset is generated so that the three clusters are close to each other whilst the Gaussian mixture still has three distinct peaks theoretically. Such a dataset is expected to be tough for most clustering algorithms. We observe that the valleys and peaks of the PDF corresponding to the two smaller clusters are very vague or may not even exist. Thus, the performance of PDF-based bump-hunting and/or valley seeking could be poor. In Fig. 1(c), we show the initial CCCs juxtaposed with the dataset. We observe that the CCCs capture the shape of the dataset very well.

V. ADVANCING CLUSTER CORE CONTOURS

Next, we discuss how to advance the initial CCCs to obtain peak regions through hill climbing in a smooth way. We found that this is a key issue in density-based approaches and is also where ideas from LSM come into play. More precisely, we employ PDE techniques to advance contours in an elegant way.

A. Evolution Equation

Since each initial CCC Γ_{0,i} in Γ_0 changes its shape as the evolution goes on, we parameterize such a family of CCCs by a time variable t, i.e., the i-th CCC at time t is denoted by Γ_i(t). Moreover, Γ(0) ≡ Γ_0.

Using a level set representation, the mean curvature κ = κ(x, t) (see [28]) of Γ(t) at x is given by

$$\kappa(x,t) = \nabla\cdot\!\left(\frac{\nabla\phi(x,t)}{\|\nabla\phi(x,t)\|_2}\right).$$

Roughly speaking, the value of κ at x indicates how curved the surface {x : φ(x, t) = 0} is at x. Moreover, if the surface is convex (respectively concave) at x, then κ > 0 (respectively κ < 0).

Fig. 1. (a) A mixture of three Gaussian distributions. (b) The PDF f using a Gaussian kernel with window size h = 1. (c) The initial CCCs. In (b), the peaks and valleys corresponding to the two smaller clusters are so vague that bump-hunting and/or valley seeking based on the PDF is expected to perform poorly. Clusters obtained from our method are shown in Fig. 3. In (c), we observe that the initial CCCs capture the shape of the dataset and that the resulting boundaries capture the hot spots of the dataset very well.

Given the initial CCCs Γ(0), represented by the zero level set of φ(x, 0) (which is chosen to be the signed distance function ψ_{Γ(0)}(x)), the time dependent PDE that we employ for hill climbing on density functions is given by

$$\frac{\partial\phi}{\partial t} = \left(\frac{1}{1+\|\nabla f\|_2} + \alpha\kappa\right)\|\nabla\phi\|_2, \qquad (4)$$
$$\phi(x,0) = \psi_{\Gamma(0)}(x).$$

This equation is solved independently for each cluster region defined according to Γ(t). During the evolution, each contour, and hence each cluster region, may split. For example, if an initial CCC encloses two density peaks, then the CCC will eventually split into two as it climbs uphill. The evolution is stopped when no further splitting occurs.

The aim of the factor 1/(1+‖∇f‖_2) is to perform hill climbing to look for density peaks. Moreover, the factor also adjusts the speed of each point on the CCCs in such a way that the speed is lower where ‖∇f‖_2 is larger. Thus the CCCs stay in steep regions of f, where peak regions are better defined. In the limiting case where f has a sharp jump (‖∇f‖_2 → ∞), the CCCs actually stop moving at the jump. We remark that in traditional steepest descent methods for solving minimization problems, the speed (step size) is usually higher where ‖∇f‖_2 is larger, which is the opposite of what we do. This is because our goal is to locate steep regions of f rather than local minima.

The curvature κ exerts surface tension to smooth out the CCCs [30]. In contrast, without surface tension, the CCCs could become wiggly, which may lead to the common problem of over-fitting of PDFs. Therefore, we employ the term κ to resolve such a problem. In fact, if φ is kept a signed distance function for all t, i.e., ‖∇φ‖_2 ≡ 1, then κ = ∆φ, so that φ is smoothed out by Gaussian filtering. From the variational point of view, the curvature term exactly corresponds to the steepest descent of the length (in the 2-dimensional case) and surface area (in the n-dimensional case where n ≥ 3) of the CCCs. More precisely, under the level set representation, the length/surface area of the CCCs at time t can be expressed as (see [31])

$$\int |\nabla H(\phi(x,t))|\, dx,$$

where H(x) is the Heaviside function defined by

$$H(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$

Then the derivative of the length/surface area with respect to φ gives the curvature of φ.

B. Dynamic Adjustment of α

The scalar α ≥ 0 controls the amount of tension added to the surface and is adjusted dynamically during the course of the evolution. At the beginning of the evolution of each Γ_i(0), we set α = 0 in order to prevent smoothing out of important features. After a CCC is split into pieces, tension is added and is then gradually decreased to 0. In this way, spurious oscillations can be removed without destroying other useful features. Such a mechanism is similar to cooling in simulated annealing. In our implementation, the PDE (4) is solved at discrete times t_k = k∆t. The α for each component contour Γ_i is dynamically adjusted as follows:

• Set α = 0 for each Γ_i initially.
• At each time point t_k and for each component Γ_i, if Γ_i is split into two contours Γ_{i1} and Γ_{i2}, then set the α for both Γ_{i1} and Γ_{i2} to α_max. Otherwise, replace the α for Γ_i by γα, where 0 < γ < 1 is a fixed constant.

In our experiments, we fix α_max = 1 and γ = 0.99. Empirically, we found that the evolution is quite robust to the choice of γ. For example, one may set γ to be any number between 0.8 and 1.0. The parameter α should be chosen to be small so that the dominant motion is hill climbing. The purpose of the curvature term is to regularize the way that the contours climb up the hill. If α is set too large, then the motion of the contours will have nothing to do with the PDF f at all.
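The tension schedule can be paraphrased in a few lines (our own sketch; alpha_max and gamma mirror the values quoted above):

```python
def update_alpha(alpha, just_split, alpha_max=1.0, gamma=0.99):
    """Surface-tension schedule for one component contour at one time step.

    Tension is switched on right after a split and then cooled geometrically,
    similar in spirit to cooling in simulated annealing.
    """
    return alpha_max if just_split else gamma * alpha

# Example: a contour starts with no tension, splits at step 50,
# and its children cool back towards zero afterwards.
alpha = 0.0
for step in range(200):
    alpha = update_alpha(alpha, just_split=(step == 50))
```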

C. Termination of Evolution

If we do not stop the evolution, then the contours will reach the peak regions, presumably one contour for each peak. Eventually, since ‖∇f‖_2 is never infinite in practice (and hence the speed of the contours is never 0), each contour will shrink to a point and disappear in finite time. Ideally, we want to stop the evolution when each contour encloses one density peak. In this case, there will be no further splitting of the contours. However, it is generally difficult to determine in advance whether the contours will split at a later time or not.

Our strategy is to let the contours evolve until they disappear, but we keep a record of the contours right after each splitting. More precisely, if Γ_i is split into two contours Γ_{i1} and Γ_{i2}, then we store Γ_{i1} and Γ_{i2} and delete Γ_i from our record. In this way, we retain the earliest contours which will not split any further. Let t′ be the earliest time such that no splitting occurs at a later time, and let t″ be the time at which all contours disappear. In most situations, we have t′ ≫ t″ − t′, so that the computation spent on the time interval [t′, t″] is small compared with that on [0, t′].

D. A Summary of the Evolution Equation

In summary, the PDE simply (i) moves the initial CCCs uphill in order to locate peak regions; (ii) adjusts the speed according to the slope of the PDF; (iii) removes small oscillations of the CCCs by adding tension so that hill climbing is more robust to the unevenness of the PDF (cf. Examples 1 and 2 in §VII). In addition, the use of LSM allows the CCCs to change their topology easily.

In Fig. 3(a)–(c), we show the CCCs during the course of the evolution governed by (4). We observe that the contours are attracted to density peaks. When a contour is split into several contours, the pieces are not very smooth near the splitting points. Since tension is added in such cases, the contours are straightened quickly.

The numerical implementation of our method is given in the Appendix.

E. Some Remarks

Remark 1: Our evolution equation (4) resembles some features of the geometric active contours for image segmentation [32]:

$$\frac{\partial\phi(x,t)}{\partial t} = g(\|\nabla u(x)\|_2)\,[c + \kappa(x,t)]\,\|\nabla\phi(x,t)\|_2.$$

Here c is a nonnegative constant, u(x) is the given image and g(‖∇u(x)‖_2) is an edge-detector which is a decreasing function of ‖∇u(x)‖_2 and is zero at an edge. If we define g by

$$g(\|\nabla u(x)\|_2) = \frac{1}{1+\|\nabla u(x)\|_2},$$

then we have

$$\frac{\partial\phi(x,t)}{\partial t} = \frac{1}{1+\|\nabla u(x)\|_2}\,[c + \kappa(x,t)]\,\|\nabla\phi(x,t)\|_2.$$

This equation is similar to our PDE (4) except for the coefficient of κ. Presumably, contours will stop at the edges of the objects in the image. The curvature term is used to regularize the motion so that it is robust to the noise in the image.

Remark 2: While our approach dynamically evolves a set of contours to detect density peaks, another approach called support vector clustering [33] obtains a set of static surfaces each of which encloses one cluster. The idea is to first use kernel methods to map the data to a high dimensional feature space. Then, one finds the sphere with the smallest radius which encloses most of the data points in the feature space (a few outliers are allowed). Finally, the inverse mapping is applied to map the sphere back to the original space, where the sphere becomes a set of surfaces each enclosing one cluster. In this method, there is no control over the smoothness of the back-transformed surfaces in the original space. Thus, it can be clearly seen from their experiments that the surfaces are very wiggly (for example, see Fig. 3 in [33]).

VI. CLUSTER INTENSITY FUNCTIONS

In non-parametric modelling, one may obtain clusters by employing valley seeking on PDFs. However, as mentioned above, such methods perform well only when the clusters are well-separated and of approximately the same density, in which case the peaks and valleys of the PDF are clearly defined. On the other hand, even though we use the density peaks identified by our PDE (4) as a starting point, if we expand the CCCs outward according to the PDF, we still have to face the problems of the PDF; we may still get stuck in local optima due to its oscillatory nature.

In this section, we further explore cluster intensity functions, which are a better representation of clusters than PDFs. Due to the advantages of CIFs, we propose to perform valley seeking on CIFs to construct clusters, rather than on PDFs. Here, CIFs are constructed based on the final CCCs obtained by solving the PDE (4).

CIFs capture the essential features of clusters and inherit the advantages of PDFs, while information contained in PDFs that is irrelevant to clustering is filtered out. Moreover, peaks and valleys of CIFs stand out clearly, which is not the case for PDFs. The principle behind this is that clustering should not be done solely based on density. Instead, it is better done based on both density and distance. For example, it is well-known that the density-based algorithm DBSCAN [13] cannot separate clusters that are close together even though their densities are different, and the density-based algorithm OPTICS [17] cannot separate clusters when they have similar densities.

CIFs, however, are constructed by calculating the signed distance from the CCCs (which are themselves constructed based on density). Thus, CIFs combine both density and distance information about the dataset. This is a form of regularization to avoid over-specification of density peaks.

The definition of a CIF is as follows.

Definition 2: Given a set of closed hypersurfaces Γ (the final CCCs), the CIF ϕ with respect to Γ is defined to be

$$\varphi(x) = -\psi_\Gamma(x),$$

where ψ_Γ is the signed distance function in (2).

The value of a CIF at x is simply the distance between x and Γ, with its sign being positive if x lies inside Γ and negative if x lies outside Γ. Roughly speaking, a large positive (respectively negative) value indicates that the point is deep inside (respectively outside) Γ, while a small absolute value indicates that the point lies close to the interface Γ. To illustrate the expressive power of CIFs, an example based on the "C"-shape clusters is shown in Fig. 2.
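On a grid, the CIF of Definition 2 can be approximated from a binary "inside Γ" mask with two Euclidean distance transforms. This is a sketch using SciPy's distance transform, not the fast marching/fast sweeping machinery referred to in the Appendix:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def cif_from_mask(inside, dx=1.0):
    """Cluster intensity function: positive inside Gamma, negative outside,
    with magnitude equal to the (approximate) distance to Gamma."""
    dist_out = distance_transform_edt(~inside) * dx   # distance to Gamma for outside points
    dist_in = distance_transform_edt(inside) * dx     # distance to Gamma for inside points
    return np.where(inside, dist_in, -dist_out)
```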

In Fig. 3(d), the CIF constructed from the CCCs in Fig. 3(c) is shown. The peaks corresponding to the three large clusters can be clearly seen, which shows that our PDE is able to find cluster cores effectively.

Based on the CIF, valley seeking (cf. §II) can be easily done in a very robust way. In Fig. 3(e), we show the valleys of the CIF juxtaposed with the dataset and the final CCCs.

Fig. 2. (a) Two "C"-shape clusters juxtaposed with the zero crossings of ∆f and the valleys of the CIF. (b) The CIF constructed from the zero crossings of ∆f. In (a), the valleys of the CIF clearly separate the two clusters. This cannot be done effectively by distance-based methods such as k-means.

We remark that the use of signed distances as CIFs has the property that their valleys are nothing but the equidistant surfaces between the CCCs. Moreover, cluster core contours play a similar role to cluster centers in the k-means algorithm [7]. Thus, our method may be viewed as a generalization of the k-means algorithm in the sense that a "cluster center" may be of arbitrary shape instead of just a point.

Under the LSM framework, valleys and peaks are easily obtained. The valleys are just the singularities of the level set function (i.e., the CIF) having negative values. On the other hand, the singularities of the level set function having positive values are the peaks or ridges of the CIF (also known as the skeleton).

Fig. 3. Evolution of the cluster core contours (CCCs) under the bump-hunting PDE (4). The dataset is the one used in Fig. 1. (a) Initial CCCs. (b) Intermediate CCCs. (c) Final CCCs. (d) CIF constructed from the contours in (c). Peaks corresponding to the three large clusters are clearly seen. (e) Valleys of the CIF. In (e), the three cluster cores are well-discovered.

VII. EXPERIMENTS

In addition to the examples shown in Figs. 1–3, we give more examples to further illustrate the usefulness of the concepts introduced. Comparisons with the valley seeking [12] and DBSCAN [13] algorithms (cf. §II) are given in Examples 1 and 2. Clustering results on two real datasets are also presented. For visualization of the CIFs, which are one dimension higher than the datasets, two-dimensional datasets are used, while the theory presented above applies to any number of dimensions.

When applying DBSCAN, the parameter MinPts is fixed at 4, as suggested by the authors in [13].

Example 1: We illustrate how the problem of over-fitting (or under-fitting) of PDFs is resolved by our method. In Fig. 4, we compare the clustering results of valley seeking using the scale parameters h = 0.6, 0.7 and of the DBSCAN algorithm using Eps = 0.28, 0.29. The best result is observed in Fig. 4(a), but it still contains several small clusters due to the spurious oscillations of the PDF. For the other cases, a mixture of many small clusters and some over-sized clusters is present. In contrast, our method (shown in Fig. 3) resolves these problems by (i) outlining the shape of the dataset well by keeping the CCCs smooth; (ii) using curvature motion to smooth out oscillations due to the unevenness of PDFs.

Fig. 4. Clusters obtained from applying valley seeking and DBSCAN to the dataset in Fig. 1. (a) Valley seeking with h = 0.6. (b) Valley seeking with h = 0.7. (c) DBSCAN with Eps = 0.28. (d) DBSCAN with Eps = 0.29. In (a) and (c), many small clusters are present due to the unevenness of the density functions. These results are not as good as the results of our method shown in Fig. 3.

Example 2: A dataset with 4000 uniformly distributed points lying in two touching circles is considered. The dataset, together with the zero crossings of ∆f, is shown in Fig. 5(a). The result of our method is in Fig. 5(b). We observe that the final CCCs adapt to the size of the clusters suitably. The results of valley seeking on PDFs (h = 0.05, 0.06) are shown in Fig. 5(c) and (d), where the unevenness of the PDFs results in either 2 or 4 large clusters. The results of DBSCAN with Eps = 0.010, 0.011 are in Fig. 5(e) and (f), which contain many small clusters. In addition, this example also makes it clear that density functions must be regularized, which is done implicitly by adding surface tension in our method.

Fig. 5. Comparisons of our method with valley seeking and DBSCAN. (a) Dataset with the zero crossings of ∆f superimposed. (b) Final CCCs (the closed curves in red and blue) and the valleys of the CIF (the line in black). (c) Valley seeking with h = 0.05. (d) Valley seeking with h = 0.06. (e) DBSCAN with Eps = 0.010. (f) DBSCAN with Eps = 0.011. The CCCs are able to capture the cluster shape while valley seeking and DBSCAN seem to suffer from over-fitting and result in many small spurious clusters.

Example 3: This example uses a dataset constructed from the co-expression patterns of the genes in yeast during the cell cycle [34]. Clusters of the data are expected to reflect functional modules. The results are shown in Fig. 6. We observe that the valleys of the CIF in Fig. 6(c) are right on the low density regions and thus a reasonable clustering is obtained. We also notice that the clusters are very close to, if not overlapping, each other, especially the two clusters at the bottom.

Fig. 6. (a) DNA gene expression dataset. (b) Zero crossings of ∆f. (c) The final CCCs and the valleys of the CIF. (d) The CIF. The cluster cores are well-retrieved and the valleys successfully separate the data into clusters of relatively high density.

Example 4: Our next example uses a real dataset from text documents in three newsgroups. For ease of visualization, the dataset is first projected to a 2-D space using principal component analysis [35]. The results in Fig. 7 show that the clustering results agree with the true clustering very well.

Fig. 7. (a) A dataset of 3 internet newsgroups with 100 news items in each group (news items in the same group are displayed with the same symbol). (b) The zero crossings of ∆f. (c) Clustering results (lines are valleys of the final CIF and closed curves are the cluster cores). (d) The CIF.

VIII. CONCLUDING REMARKS

In this paper, we introduced level set methods to identify density peaks and valleys in the density landscape for data clustering. The method relies on advancing contours to form cluster cores. One key point is that, during contour advancement, smoothness is enforced via LSM. Another is that important features of clusters are captured by cluster intensity functions, which serve as a form of regularization. The usual problem of roughness of density functions is overcome. The method is shown to be more robust and reliable than traditional methods that perform bump hunting or valley seeking on density functions.

Our method can also identify outliers effectively. After the initial cluster core contours are constructed, outliers are clearly revealed and can be easily identified. In this method, different contours evolve independently, so outliers do not affect normal cluster formation via contour advancing. Such a nice property does not hold for clustering algorithms such as k-means, where several outliers could skew the clustering.

Our method for contour advancement given by (4) is based on the dynamics of interface propagation in LSM. A more elegant approach is to recast the cluster core formation as a minimization problem in which the boundary advancement can be derived from first principles; this will be presented in a later paper.

APPENDIX

NUMERICAL IMPLEMENTATIONS

To solve the PDE (4) numerically, we apply standard finite difference schemes for level set equations. We refer the reader to [27], [28], [36] for excellent overviews of level set methods, evolutionary equations and numerical implementations.

Consider a grid (i_1∆x, i_2∆x, …, i_p∆x) where i_1, i_2, …, i_p are integers and ∆x is the spatial step size. We recall that p is the dimensionality of the data. The time variable is discretized to k∆t, where k is an integer and ∆t is the temporal step size. Let i = (i_1, i_2, …, i_p). The sample of φ(x, t) at the grid point (i_1∆x, i_2∆x, …, i_p∆x) at time k∆t is denoted by φ^k_i.

To obtain the initial data φ(x, 0) in (4), we need to compute the signed distance from the initial cluster core contours. This requires solving the Eikonal equation

$$\|\nabla\psi\|_2 = 1$$

with boundary conditions ψ(x) = 0 on Γ(0), which can be done quickly by the Fast Marching Method [27] or the Fast Sweeping Method [37]. Since the implementation is straightforward and has been detailed in [27], we omit the details.
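If the scikit-fmm package is available (an assumption on our part; the paper itself relies on [27], [37] for these algorithms), the signed distance initialization can be obtained in one call:

```python
import numpy as np
import skfmm  # scikit-fmm: fast marching solver for |grad psi| = 1

# phi0: any grid function that is negative inside Gamma(0) and positive outside
# (for instance, -1 inside the initial CCCs and +1 outside).
n = 200
xs = np.linspace(-2.0, 2.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
phi0 = np.where(X**2 + Y**2 < 1.0, -1.0, 1.0)

# Fast marching turns the sign pattern into the signed distance to Gamma(0).
psi = skfmm.distance(phi0, dx=xs[1] - xs[0])
```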

Let e_j be the vector (0, …, 0, 1, 0, …, 0) whose j-th entry is 1 and whose other entries are 0. Define the forward difference operator in the j-th spatial dimension, D^+_j, by

$$D_j^+ \phi_i^k := \frac{\phi_{i+e_j}^k - \phi_i^k}{\Delta x}.$$

Similarly, define the backward difference operator in the j-th spatial dimension, D^-_j, by

$$D_j^- \phi_i^k := \frac{\phi_i^k - \phi_{i-e_j}^k}{\Delta x}.$$

A first order accurate upwind scheme for the level set equation

$$\frac{\partial\phi(x,t)}{\partial t} = \beta(x,t)\,\|\nabla\phi(x,t)\|_2$$

is given by

$$\frac{\phi_i^{k+1} - \phi_i^k}{\Delta t} = -\left[\max(-\beta_i^k, 0)\,\nabla^+ + \min(-\beta_i^k, 0)\,\nabla^-\right],$$

where

$$\nabla^+ = \left[\max(D_1^-\phi_i^k, 0)^2 + \min(D_1^+\phi_i^k, 0)^2 + \cdots + \max(D_p^-\phi_i^k, 0)^2 + \min(D_p^+\phi_i^k, 0)^2\right]^{1/2},$$
$$\nabla^- = \left[\max(D_1^+\phi_i^k, 0)^2 + \min(D_1^-\phi_i^k, 0)^2 + \cdots + \max(D_p^+\phi_i^k, 0)^2 + \min(D_p^-\phi_i^k, 0)^2\right]^{1/2};$$

see [36, p. 80] and also [27, p. 65].


In our PDE (4), the speed is given by

$$\beta(x,t) = \frac{1}{1+\|\nabla f\|_2} + \alpha\kappa = \frac{1}{1+\|\nabla f\|_2} + \alpha\,\nabla\cdot\!\left(\frac{\nabla\phi}{\|\nabla\phi\|_2}\right),$$

which is approximated by central differences to obtain β^k_i. For higher order accurate schemes, see [27, pp. 66–67].
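The curvature part of β can be approximated with central differences, for instance as in the following two-dimensional sketch (the periodic boundary handling via np.roll and the eps safeguard are our own simplifications):

```python
import numpy as np

def curvature(phi, dx, eps=1e-12):
    """Mean curvature kappa = div(grad phi / |grad phi|) by central differences."""
    phi_x = (np.roll(phi, -1, 0) - np.roll(phi, 1, 0)) / (2 * dx)
    phi_y = (np.roll(phi, -1, 1) - np.roll(phi, 1, 1)) / (2 * dx)
    norm = np.sqrt(phi_x**2 + phi_y**2) + eps
    nx, ny = phi_x / norm, phi_y / norm
    return ((np.roll(nx, -1, 0) - np.roll(nx, 1, 0))
            + (np.roll(ny, -1, 1) - np.roll(ny, 1, 1))) / (2 * dx)

def speed(grad_f_norm, kappa, alpha):
    """beta = 1 / (1 + ||grad f||_2) + alpha * kappa, as in PDE (4)."""
    return 1.0 / (1.0 + grad_f_norm) + alpha * kappa
```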

In level set methods, we are often only interested in the evolving contours, which are located at the zero level set of φ. Thus, there is no need to solve the PDE on the whole grid. We apply the Narrow Band Level Set Method [27], which updates the grid points near the moving contours. In this way, computational costs are greatly reduced.

Sparse grid techniques [38], [39] may be used to further reduce the computational complexity. In the original full grid, if each dimension has n grid points, then the number of grid points (in space) is n^p. Using sparse grids, which are optimally chosen subsets of grid points, we can reduce the number of grid points to O(n (log n)^{p−1}) with a fairly small loss in accuracy. Indeed, sparse grid techniques have already been successfully applied to a number of data mining problems [40]–[42]. We have yet to explore this direction.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their detailed, valuable suggestions and for bringing the support vector clustering algorithm to our attention.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comp. Surveys, vol. 31, no. 3, pp. 264–323, 1999.


[3] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, Inc., 1990.
[4] P. Berkhin, "Survey of clustering data mining techniques," Accrue Software, Tech. Rep., 2002.
[5] P. Hansen and B. Jaumard, "Cluster analysis and mathematical programming," Mathematical Programming, vol. 79, pp. 191–215, 1997.
[6] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[7] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. I: Statistics, 1967, pp. 281–297.
[8] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," in Proc. 4th SIAM Int. Conf. on Data Mining, 2004, pp. 234–245.
[9] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. 20th Int. Conf. on Very Large Data Bases. Santiago, Chile: Morgan Kaufmann, 1994, pp. 144–155.
[10] C. Ding and X. He, "Cluster aggregate inequality and multi-level hierarchical clustering," in Proc. 9th European Conf. Principles of Data Mining and Knowledge Discovery, Porto, Portugal, 2005, pp. 71–83.
[11] G. Karypis, E. Han, and V. Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling," Computer, vol. 32, pp. 68–75, 1999.
[12] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Boston: Academic Press, 1990.
[13] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Int. Conf. Knowledge Discovery and Data Mining. Portland, OR: AAAI Press, 1996, pp. 226–231.
[14] J. Sander, M. Ester, H. Kriegel, and X. Xu, "Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169–194, 1998.
[15] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in Proc. ACM-SIGMOD Int. Conf. Management of Data. Seattle, WA: ACM Press, 1998, pp. 94–105.
[16] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Int. Conf. Knowledge Discovery and Data Mining. New York City, NY: AAAI Press, 1998, pp. 58–65.
[17] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proc. ACM-SIGMOD Int. Conf. Management of Data. Philadelphia, PA: ACM Press, 1999, pp. 49–60.
[18] W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997, pp. 186–195.

[19] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, pp. 289–304, 2000.
[20] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "Spectral min-max cut for graph partitioning and data clustering," in Proc. 1st IEEE Int. Conf. on Data Mining, San Jose, CA, 2001, pp. 107–114.
[21] L. Ertoz, M. Steinbach, and V. Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data," in Proc. of the 3rd SIAM Int. Conf. on Data Mining, D. Barbara and C. Kamath, Eds., San Francisco, CA, 2003, pp. 47–58.
[22] R. Sharan and R. Shamir, "CLICK: A clustering algorithm with applications to gene expression analysis," in Proc. ISMB '00. Menlo Park, CA: AAAI Press, 2000, pp. 307–316.
[23] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2001.
[24] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Stat., vol. 33, pp. 1065–1076, 1962.
[25] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: Wiley-Interscience, 2001.
[26] S. R. Sain, K. A. Baggerly, and D. W. Scott, "Cross-validation of multivariate densities," Journal of the American Statistical Association, vol. 89, pp. 807–817, 1992.
[27] J. A. Sethian, Level Set Methods and Fast Marching Methods, 2nd ed. New York: Cambridge Univ. Press, 1999.
[28] S. Osher and R. Fedkiw, Level Set Methods and Dynamic Implicit Surfaces. New York: Springer Verlag, 2003.
[29] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall, 1988.
[30] S. Osher and J. A. Sethian, "Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations," Journal of Computational Physics, vol. 79, pp. 12–49, 1988.
[31] H. K. Zhao, T. Chan, B. Merriman, and S. Osher, "A variational level set approach to multiphase motion," Journal of Computational Physics, vol. 127, pp. 179–195, 1996.
[32] V. Caselles, F. Catte, T. Coll, and F. Dibos, "A geometric model for active contours in image processing," Numerische Mathematik, vol. 66, pp. 1–31, 1993.
[33] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2001.
[34] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, and K. Anders, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Mol. Biol. Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[35] I. T. Jolliffe, Principal Component Analysis, 2nd ed. Springer, 2002.
[36] G. Sapiro, Geometric Partial Differential Equations. New York: Cambridge Univ. Press, 2001.
[37] Y. R. Tsai, L. Cheng, S. Osher, and H. Zhao, "Fast sweeping algorithms for a class of Hamilton-Jacobi equations," SIAM J. Numer. Anal., vol. 41, no. 2, pp. 673–694, 2003.
[38] C. Zenger, "Sparse grids," in Parallel Algorithms for Partial Differential Equations, ser. Notes on Numerical Fluid Mechanics, W. Hackbusch, Ed. Braunschweig/Wiesbaden: Vieweg, 1991, vol. 31.
[39] H. Bungartz and M. Griebel, "Sparse grids," Acta Numerica, pp. 147–269, 2004.
[40] J. Garcke and M. Griebel, "Classification with sparse grids using simplicial basis functions," in Proc. 7th ACM SIGKDD, F. Provost and R. Srikant, Eds. ACM, 2001, pp. 87–96.
[41] J. Garcke, M. Griebel, and M. Thess, "Data mining with sparse grids," Computing, vol. 67, no. 3, pp. 225–253, 2001.
[42] J. Garcke, M. Hegland, and O. Nielsen, "Parallelisation of sparse grids for large scale data analysis," in Proc. of the Int. Conf. on Computational Science 2003, P. M. A. Sloot, D. Abramson, A. Bogdanov, J. Dongarra, A. Zomaya, and Y. Gorbachev, Eds., 2003, pp. 683–692.