
Macrostate mixture models for probabilistic multiscale nonparametric kernelized spectral clustering

Daniel Korenblum∗

Bayes Impact, San Francisco, CA

http://www.bayesimpact.org

February 25, 2021

Abstract

Automating the discovery of meaningful structures in large complex datasets is an important problem in many application areas including machine learning, source separation, and dimensionality reduction. Mixture models are one category of methods for discovering structure using convex sums of probability distributions to represent structures or clusters in data. Spectral clustering is another category of methods where eigenspaces of Laplacian matrices are used prior to or as part of the clustering process. Macrostate theory defines nonparametric mixture models directly from Laplacian eigensystems, providing a connection between nonhierarchical spectral clustering and nonparametric mixture modeling. Unlike other spectral clustering methods, macrostates are self-contained and predict both the appropriate number of mixture components and the cluster assignment distributions directly from Laplacian eigensystems. Macrostates reduce the number of input parameters and steps required compared to other spectral clustering methods and avoid issues of explicit density estimation in higher dimensional input data spaces. Previous formulations used customized algorithms to compute macrostate clustering solutions, limiting their practical accessibility. The new formulation presented here depends only on standardized linear programming solvers and is very easily parallelized, improving the practicality and performance compared to previous formulations. Numerical examples compare the performance of other finite mixture modeling and spectral clustering methods to macrostate clustering.

Keywords. data analysis, unsupervised learning, finite mixture models, spectral clustering, graph partitioning, Laplacian matrices, stochastic processes

1 Introduction

Many scientific and engineering problems arise where multiple distinct processes, sources, or latent factors combine to generate multimodal superpositions of overlapping non-Gaussian distributions. Estimating or inferring the components of a mixture is called the mixture problem. Mixture problems can be solved using mixture models, also known as latent class models [Everitt, 1996].

Macrostates were originally defined for analyzing physical systems where a single process generates the mixture components, in an attempt to rigorously define the concept of metastability [Shalloway, 1996]. For physical problems the mixture components do not represent independent processes but subprocesses that describe a system's metastable dynamics. Instead, the assumption that mixture components represent independent superimposed processes is made for nonphysical applications of finite mixture modeling and in statistics.

Mixture models have numerous applications in data science and statistics-related fields. They are useful tools for solving signal processing and unsupervised learning problems such as source separation and cluster analysis [McLachlan and Peel, 2000]. Although countably or uncountably infinite numbers of mixture components are also possible, only finite mixture models are described here.

1.1 Finite Mixture Models

Over a century ago, Karl Pearson stated that the analytical difficulties, even for the case n = 2, are so considerable that it may be questioned whether the general theory could ever be applied in practice to any numerical case [Pearson, 1894]. To this day, most mixture density separation or unmixing algorithms cannot predict the number of components directly from observations of the mixture without additional information, or else they are parametric approaches that restrict components to fixed functional forms, which are often unrealistic assumptions, especially in high dimensional spaces [McAuliffe et al., 2006].

∗Electronic address: [email protected]

arXiv:1502.00727v3 [stat.ML] 21 Jul 2015


Methods for separating the components of nonnegative mixtures of the form

f(x) = \sum_{k=1}^{m} a_k f_k(x)    (1)

are called mixture models, where m ∈ N is the number of components and x ∈ Ω ⊆ R^n. For all practical problems, it is safe to assume f(x) is normalized as a probability distribution without loss of generality.

The f_k(x) are known as the mixture components or component probability distributions of each independent process or signal source, and are also assumed to be normalized without loss of generality. The a_k ∈ [0, 1] are the weights or mixing proportions summing to one and forming a discrete probability distribution P(k) = a_k, which is known as the class prior distribution in the context of probabilistic or Bayes classifiers [Bishop, 2006].

Finite mixture models can be used to define probabilistic classifiers and vice versa. From exact knowledge of f, f_1, . . . , f_m, a_1, . . . , a_m, the posterior conditional distribution of an optimal Bayes classifier for any observed y ∈ Ω can be expressed as

P(k|y) = \frac{P(y|k)\,P(k)}{P(y)}    (2)
       = \frac{a_k f_k(y)}{f(y)}    (3)

forming a partition of unity over the space of observations or data [Fasshauer, 2007]. The component distributions f_k(x) can be understood as the class conditional distributions P(x|k) and f(x) as the evidence P(x) in a Bayes classifier context.

Switching between the f and the P notations helps to build connections between concepts and algorithms from separate fields such as physics and statistics, at some risk of confusion. Supervised classification algorithms that directly optimize the posterior P(k|y) are referred to as discriminative classifiers and sometimes outperform generative methods that fit the f_k and a_k directly [Bishop, 2006].
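To make (2)-(3) concrete, the following minimal sketch (not from the paper; the weights and Gaussian components are hypothetical choices invented for illustration) evaluates the Bayes posterior P(k|y) = a_k f_k(y)/f(y) for a two-component one-dimensional mixture:

    import numpy as np
    from scipy.stats import norm

    a = np.array([0.3, 0.7])                         # mixing proportions a_k (sum to 1)
    components = [norm(-2.0, 1.0), norm(3.0, 1.5)]   # assumed components f_k

    def posterior(y):
        """P(k|y) = a_k f_k(y) / f(y), the partition of unity of (2)-(3)."""
        fk = np.array([c.pdf(y) for c in components])
        f = a @ fk                                   # evidence f(y) = sum_k a_k f_k(y)
        return a * fk / f

    print(posterior(0.5))                            # two probabilities summing to one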

Just as probabilistic/Bayes classifiers form partitions of unity from finite mixture models, so can finite mixture models be defined from partitions of unity. In other words, partitions of unity can be written

w_k(x) ≡ \frac{a_k f_k(x)}{f(x)}    (4)

subject to

\sum_{k=1}^{m} a_k = 1, \qquad \int_\Omega f_k(x)\,dx = 1, \qquad k = 1, . . . , m.    (5)

Partitions of unity {w_k(x)}_{k=1}^{m} as defined in (4) determine mixture components f_k(x) and weights a_k from mixture distributions f according to

a_k = \int_\Omega w_k(x) f(x)\,dx, \qquad f_k(x) = a_k^{-1} w_k(x) f(x), \qquad k = 1, . . . , m    (6)

explicitly showing the formal equivalence of the generative and discriminative descriptions of finite mixture models. The discriminative form (4) uses the additional partition of unity constraints (5) to combine the information in each f_k(x), a_k pair of components and weights into a single discriminative posterior distribution or partition of unity w_k(x).
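As a sketch of how (6) converts a partition of unity back into weights and components (an illustration, not the paper's implementation), the snippet below assumes the window functions and the mixture density have been evaluated on a uniform one-dimensional grid and approximates the integral with a Riemann sum:

    import numpy as np

    def unmix_from_partition(w, f, dx):
        """w: array of shape (m, N) with w[k, i] = w_k(x_i), a partition of unity;
        f: array of shape (N,) with f[i] = f(x_i); dx: uniform grid spacing.
        Returns the weights a_k (integral of w_k f) and components f_k = a_k^{-1} w_k f."""
        a = (w * f).sum(axis=1) * dx          # quadrature approximation of (6)
        fk = (w * f) / a[:, None]
        return a, fk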

1.2 Spectral Clustering

The eigensystems of Laplacian matrices are widely used by spectral clustering methods [Azran and Ghahramani, 2006]. Spectral clustering methods typically use the eigenvectors with small-magnitude eigenvalues as a basis for projecting data onto before applying some other clustering method on the projected item coordinates [Ng et al., 2002].

1.2.1 Laplacian matrices

When clustering data items, pairwise interitem similarity or distance measures describe the regions of data space that represent closely related locations. It is the choice of this pairwise similarity or distance measure that is of utmost importance in creating accurate and useful results when generating Laplacian matrices from data items.


Forming Laplacian matrices can be prohibitively expensive in terms of memory resources for sufficiently large problems. Applying additional constraints may be necessary to fit within the available storage space. Sparsity can be achieved by applying a threshold to the similarities (dissimilarities) that are sufficiently low (high) to reduce the memory footprint.
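A minimal sketch of this construction (illustrative only; the Gaussian kernel, scale, and cutoff are assumed choices rather than values prescribed by the paper) builds a hard-thresholded similarity matrix and the corresponding unnormalized Laplacian L = D - S:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.spatial.distance import pdist, squareform

    def thresholded_laplacian(X, scale, cutoff):
        """Gaussian-kernel similarities between the rows of X, hard-thresholded
        below `cutoff` to control sparsity, returned as a sparse Laplacian L = D - S."""
        d = squareform(pdist(X))             # pairwise Euclidean distances
        S = np.exp(-(d / scale) ** 2)        # assumed Gaussian kernel similarity
        np.fill_diagonal(S, 0.0)
        S[S < cutoff] = 0.0                  # hard threshold for memory footprint
        L = np.diag(S.sum(axis=1)) - S
        return csr_matrix(L)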

1.2.2 Markov stochastic processes

Negative Laplacian matrices are also known as transition rate matrices of continuous-time Markov chains (CTMCs). Macrostates define finite mixture models in terms of Markov processes, combining aspects of both spectral clustering and finite mixture modeling methods.

Every Laplacian matrix can be associated with a continuous-time Markov stochastic process. For the discrete case, a continuous-time Markov chain (CTMC) is obtained.

For continuous mixture density functions, there is a continuous-space analog of Laplacian matrices known as drift-diffusion or Smoluchowski operators. Spectral clustering methods can be applied to both discrete observation data and also continuous mixture density function data.

1.2.3 Data spectroscopy and kernelized spectral clustering

Similar to macrostate clustering, data spectroscopy [Shi et al., 2009] is another spectral clustering algorithm for constructing finite mixture models directly from Laplacian eigenvectors. Data spectroscopy uses heuristics to approximate the optimal finite mixture model defined by the selected eigenvectors, assuming that the mixture components are well separated.

Unlike macrostate clustering, only the hard cluster labels are obtained and data spectroscopy does not compute the soft cluster assignment probabilities. Another difference between data spectroscopy and macrostate clustering is that their heuristic for selecting the correct number of clusters involves only the scaling parameter for a specific choice of the Gaussian kernel function.

By contrast, macrostate clustering applies the same multiscale spectral gap criterion for choosing the optimal cluster number regardless of which kernel function is used, as described in section 3.2. This allows more choices of kernel functions to be used immediately, without developing any additional heuristics.

In this sense, macrostate clustering methods can also be described as kernel methods or as kernelized spectral clustering algorithms [Schiebinger et al., 2015], since they involve the use of kernel functions to map data into possibly infinite-dimensional feature spaces [Hofmann et al., 2008]. The high level of interest in kernel methods [Fasshauer, 2007] and the huge variety of available kernel functions [Schaback and Wendland, 2006] mean that macrostate clustering methods are applicable to almost any type of problem involving automated discovery of patterns in unlabeled data.

2 Macrostates

Macrostates were originally created to rigorously define both the concept of metastability and also the physical mixture component distributions based on slow and fast time scales in the relaxation dynamics of nonequilibrium distributions to stationary states [Shalloway, 1996]. Mixture models are used for linearly separating these stationary or Boltzmann distributions in systems with nonconvex potential energy landscapes where minima on multiple size scales occur, e.g. high-dimensional overdamped drift-diffusions such as macromolecules in solution. Proteins folding in aqueous solution are one type of biological macromolecule that can be described in terms of overdamped drift-diffusions [Shalloway, 1996].

Transitions between states belonging to different components of a mixture occur on relatively slow timescales in such systems, making them appear as the distinct discrete states of a finite-state continuous-time Markov process when measured over appropriate timescales. Such systems are called metastable [Risken, 1996, Shalloway, 1996].

Mathematically, drift-diffusions are described as a type of continuous-time Markov process. Samples or stochastic realizations of drift-diffusions are described by systems of stochastic differential equations known as Langevin equations [Risken, 1996].

The distribution dynamics of drift-diffusions are described by partial differential equations (PDEs) known as Fokker-Planck equations [Risken, 1996]. Distribution dynamics of overdamped drift-diffusions are described by PDEs known as Smoluchowski equations [Risken, 1996]. Smoluchowski equations are used to define macrostates in section 2.4, after reviewing their basic properties.


2.1 Smoluchowski Equations

Smoluchowski equations have the form

\frac{\partial P(x, t; \beta)}{\partial t} = D \nabla \cdot e^{-\beta V(x)} \nabla e^{\beta V(x)} P(x, t; \beta)    (7)

and belong to a class of reversible continuous-time, continuous-state Markov processes used to describe multiscale and multidimensional physical systems that can exhibit metastability [Risken, 1996].

The potential energy function V(x) determines the deterministic drift forces acting on ensemble members, i.e. sample paths or realizations of this stochastic process, and can often be viewed as a fixed parameter that defines the system structure. The drift forces bias the temporal evolution of initial distributions P(x, 0) to flow towards regions of lower energy as t increases compared to free diffusion (Brownian motion).

Technically there is a different Smoluchowski equation for each distinct choice of V(x). Hence the plural form should be used generally, although this is often overlooked in the literature.

2.2 Smoluchowski Operators

The elliptic differential operator

L_0 \equiv D \nabla \cdot e^{-\beta V(x)} \nabla e^{\beta V(x)}    (8)

has e^{-\beta V(x)} as an eigenfunction with eigenvalue zero, also called the stationary state or unnormalized Boltzmann distribution. It is easy to evaluate L_0[e^{-\beta V(x)}] directly to verify that it equals zero, satisfying the eigenvalue equation.

L_0 is a normal operator and therefore a similarity transform S^{-1} L_0 S to a self-adjoint form L exists [Risken, 1996]. S has the simple form of a multiplication operator with kernel e^{-\frac{1}{2}\beta V(x)}, giving

L \equiv D e^{\frac{1}{2}\beta V(x)} \nabla \cdot e^{-\beta V(x)} \nabla e^{\frac{1}{2}\beta V(x)} = \left[\sqrt{D}\left(\nabla - \frac{\beta}{2}\nabla V\right)\right] \cdot \left[\sqrt{D}\left(\nabla + \frac{\beta}{2}\nabla V\right)\right].    (9)

Without loss of generality, we can assume that D can be reduced to a constant for a large class of potentials [Risken, 1996] and does not have significance for the nonphysical problems described here. Therefore it can be set equal to one and safely ignored, allowing the model structure to be defined by the potential energy V(x) alone.

If Ω is compact or V(x) → ∞ as ‖x‖ → ∞ for unbounded domains, then the stationary state is normalizable and L_0 has a discrete one-sided spectrum with one zero eigenvalue for every ergodic region. Ergodicity corresponds to the concept of irreducible Markov processes in probability theory. For reducible/nonergodic processes, the multiplicity of the zero eigenvalue gives the number of connected components of the process.

The normalized Boltzmann distribution is denoted

p_B(x; \beta) \equiv \frac{e^{-\beta V(x)}}{\int_\Omega e^{-\beta V(x)}\,dx}    (10)

and plays the role of the mixture distribution f(x) in (1) as described in more detail in Section 3.

The stationary state of L from equation (9) is denoted

\psi_0(x; \beta) \equiv e^{-\frac{\beta}{2} V(x)}    (11)

and is used along with other eigenfunctions of L in the separation/unmixing of the Boltzmann distribution into m macrostates as described below. Adapting the notation slightly from Shalloway [1996], the eigenfunctions of L are denoted {ψ_i}_{i=0}^{∞} and the eigenvalues are denoted {λ_i}_{i=0}^{∞}.

2.3 Notation

Macrostates of physical systems satisfy the macrostate expansion equation (24) of Shalloway [1996, page 9990], restated here as

\psi_0(x; \beta) = \sum_{\alpha=1}^{m} p_\alpha \Phi_\alpha(x; \beta).    (12)


where the coordinate vector R from the original notation has been replaced by x to match the rest of the notation used here. The similarity of (1) and (12) up to a factor of ψ_0(x;β) shows that the ψ_0(x;β)Φ_α(x;β) are mixture components and that macrostates are a type of finite mixture model.

The macrostates Φ_α(x;β) are defined by equation (26) of Shalloway [1996] as

\Phi_\alpha(x; \beta, M) = \sum_{i=0}^{m-1} M_{\alpha i} \psi_i(x; \beta)

which can be written in matrix notation as Φ = Mψ, where

\Phi \equiv \begin{pmatrix} \Phi_1(x;\beta) & \Phi_2(x;\beta) & \cdots & \Phi_m(x;\beta) \end{pmatrix}^T

and

\psi \equiv \begin{pmatrix} \psi_0(x;\beta) & \psi_1(x;\beta) & \cdots & \psi_{m-1}(x;\beta) \end{pmatrix}^T

are column vector valued functions and M is an invertible m-by-m matrix of expansion coefficients with zero-based column indices. Using this notation, macrostates Φ_α(x;β) can be defined in terms of an optimization problem for M, the square m-by-m matrix of macrostate expansion coefficients.

2.4 Macrostate Definition

Minimizing the expected error or overlap of the macrostate boundaries serves as an objective for determining the optimal matrix M_opt(β) having the least total overlap of the macrostate boundaries allowable by the subspace spanned by the selected eigenfunctions. The quantity 0 < Υ(M;β) < 1, as defined by equation (36) of Shalloway [1996], represents the expected overlap over all macrostates between a window function w_α(x;M) and the other states α′ ≠ α,

\Upsilon(M; \beta) = 1 - \sum_{\alpha=1}^{m} \langle w_\alpha^2(x; \beta, M) \rangle_{p_B}
                   = 1 - \sum_{\alpha=1}^{m} \left\langle \left( \frac{p_\alpha \Phi_\alpha(x; \beta, M)}{\psi_0(x; \beta)} \right)^2 \right\rangle_{p_B}

where the wα (evaluated at β = 1) are defined by Shalloway [1996, equation (27)] as

w_\alpha(x; \beta, M) \equiv \frac{p_\alpha \sum_{i=0}^{m-1} M_{\alpha i} \psi_i(x; \beta)}{\psi_0(x; \beta)}    (13)
                      = \frac{p_\alpha \Phi_\alpha(x; \beta, M)}{\psi_0(x; \beta)}    (14)

which suggests the central connection between macrostates and probabilistic classifiers (2) derived from finite mixture models.

The minimum uncertainty condition

M_{opt}(\beta) \equiv \operatorname{argmin}_{M} \Upsilon(M; \beta)    (15)

subject to the partition of unity constraints

w_\alpha(x; \beta, M) > 0, \quad \alpha = 1, . . . , m, \qquad \sum_{\alpha} w_\alpha(x; \beta, M) = 1, \qquad \forall x \in \Omega    (16)

selects maximally crisp (non-overlapping) or minimally fuzzy (overlapping) decision boundaries among classifiers formed by the span of the selected eigenvectors. This objective function attempts to minimize the expected overlap or model error between the component distributions of the Boltzmann distribution p_B(x;β). Note that, although the w_α are defined in terms of ψ_0(x;β), all of the expectations in the definitions of the w_α, p_α, and Υ are taken over p_B(x;β) = ψ_0^2(x;β) and thus apply to this distribution.


Minimization of Υ occurs over the eigenspace spanned by the selected eigenvectors {ψ_i}_{i=0}^{m-1}, which serve as an orthogonal basis. The truncated spectral expansion of (9) provided by {ψ_i}_{i=0}^{m-1} is a form of spectral regularization which prevents overfitting for appropriately chosen values of m [Engl et al., 1996].

Eigenfunction smoothness confers smoothness on the resulting mixture components, which is a necessary property for transitions between macrostates to occur. This makes macrostates well-suited for probabilistic nonphysical mixture modeling problems where some amount of smoothness or overlap between components is desired or expected.

3 Macrostate Clustering

Smoluchowski operators L can be viewed as generalized Laplacian operators, and discretized Smoluchowski operators define transition rate matrices for CTMCs. Conversely, Laplacian matrices are derived from similarity and distance matrices and can be interpreted as generalized stencils for discretized Laplace operators identical to those used in finite-difference based discrete approximation methods as described in section 4.3 [Smola and Kondor, 2003].

The Laplacian matrices used in spectral clustering are mathematically equivalent to discretized Smoluchowski operators used to define macrostates. This means that for problems outside of physics a general-purpose macrostate clustering algorithm can also be defined.

Macrostates of metastable physical systems are finite mixture models for physical systems described by continuous-time Markov processes. Extending macrostates to arbitrary data spaces provides interpretations of spectral clustering methods in terms of stochastic processes.

Imagine a dynamic process that generates the observed mixture f(x) as its equilibrium Boltzmann distribution p_B(x; 1). The metastable states of such a mixture distribution generating process having f(x) as its stationary distribution correspond to the components of a finite mixture model for the mixture distribution.

The Kolmogorov extension theorem guarantees the existence of such processes for arbitrary mixture distributions as their stationary states [Øksendal, 1996]. Viewing energy landscapes as functions over arbitrary data spaces instead of as functions over physical configuration spaces gives statistical meaning to metastability for nonphysical systems having separable mixture distributions.

3.1 Macrostate Clustering Definition

Let the negative logarithm of the nonnegative mixture distribution f(x) act as a pseudo potential energy

V(x) \equiv -\log f(x)    (17)

for some mixture distribution f(x) given by (1), so that

\left. e^{-\beta V(x)} \right|_{\beta=1} = f(x)

and therefore

p_B(x; 1) = \frac{f(x)}{\int_\Omega f(x)\,dx}.    (18)

This leads to a form of (14) derived from nonphysical or statistical mixture distributions f(x) after substituting k for α and setting a_k ≡ p_k:

w_k(x; \beta, M_{opt}) = \frac{a_k(\beta; M_{opt})\, \Phi_k(x; \beta, M_{opt})}{\psi_0(x; \beta)}.    (19)

Combining this with

f_k(x; \beta, M_{opt}) \equiv a_k^{-1}(\beta; M_{opt})\, w_k(x; \beta, M_{opt})\, f(x)    (20)

provides the definition of a generalized data macrostate via (6):

f(x) = \sum_{k=1}^{m} a_k(\beta; M_{opt})\, f_k(x; \beta, M_{opt})    (21)

for any 0 < β < ∞ and M_opt(β) satisfying (15).


Because multiplying f(x) by the w_k(x;β,M_opt) in (20) applies an envelope to f(x) reminiscent of the windowing operations used in signal processing, they were called window functions in the original formulation. The formal equivalence of the window functions from Shalloway [1996] to discriminative finite mixture models (6) is the central idea in the definition of macrostates.

Satisfying the partition of unity constraints (16) allows the a_k and f_k to be computed from the w_k via (6), working backwards from the definition (14).

Note that the relationship of p_B(x;β) to ψ_0(x;β) via the symmetrization of L_0 to L in (9) has no consequence on the functions w_k(x;β,M_opt) and only serves to symmetrize the spectral decomposition. Numerically, the symmetrization also stabilizes and reduces the computational complexity of calculating small magnitude eigenvectors and eigenvalues of matrix approximations of L, as illustrated in Section 5.

3.2 Multiscale Spectral Gap

Macrostate clustering methods use a spectral criterion to select the appropriate number of components. The separability of the spectrum for each choice of m can be measured using

r_m(\beta) \equiv \frac{\lambda_m(\beta)}{\lambda_{m-1}(\beta)}    (22)

the corresponding eigenvalue gap for m > 1. Larger gap values correspond to better separability for a given choice of m.
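A small sketch of how the gaps (22) might be computed from a symmetric (discretized or graph) Laplacian, assuming only the smallest-magnitude end of the spectrum is needed; this is an illustration, not the paper's Matlab code:

    import numpy as np
    from scipy.sparse.linalg import eigsh

    def spectral_gaps(L, m_max=12):
        """Return the m_max+1 smallest-magnitude eigenvalues of the symmetric
        Laplacian L (sorted ascending) and the gaps r_m = lambda_m / lambda_{m-1}
        of (22) for m = 2, ..., m_max (lambda_0 is the near-zero eigenvalue)."""
        vals = eigsh(L, k=m_max + 1, which="SM", return_eigenvectors=False)
        vals = np.sort(np.abs(vals))
        gaps = {m: vals[m] / vals[m - 1] for m in range(2, m_max + 1)}
        return vals, gaps

    # e.g. pick the m with the largest gap: m_best = max(gaps, key=gaps.get)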

Using a ratio of eigenvalues means that gaps are measured relative to the fastest rate of the representative equilibration modes, allowing the detection of multiscale structure. In the context of macrostate clustering, the term spectral gap refers to the ratio defined by (22).

Spectra without sufficiently large gaps indicate lack of separability of the mixture components, perhaps because only a single component exists or because of a strong background “noise” component that reduces detectability of other components. Values of m having larger gaps will have less expected error in the corresponding optimized mixture model.

Large problems require iterative techniques for computing the extremal eigenvectors. In this case the process can be stopped after the first sufficiently large gap appears or some maximum acceptable computational time is exhausted, whichever comes sooner.

Once the spectral decomposition of L in (9) is obtained and the appropriate value(s) of m is (are) chosen, the optimization over all feasible M must be performed subject to the partition of unity constraints (16). Although this is an NP-hard global optimization problem in general, it can be solved with reasonable computational expense for moderate values of m as illustrated in Section 5.

3.3 Global Optimization of Uncertainty

In terms of probabilistic classifiers, the uncertainty Υ(M;β) represents the mean squared error of the probabilistic classifier derived from the mixture model specified by each value of M. By minimizing Υ(M;β), macrostates provide spectrally regularized minimum mean squared error classifiers.

For any m of interest, the column vector valued function

w \equiv \begin{pmatrix} w_1(x;\beta,M) & \cdots & w_m(x;\beta,M) \end{pmatrix}^T

is numerically optimized to compute the macrostate clusters. The optimal w is determined via global optimization of Υ(M;β) over the set of coordinate transformation matrices M satisfying the partition of unity constraints (16).

In matrix notation, any feasible w can be written in terms of M and the column vector of basis functions

\omega(x; \beta) \equiv \begin{pmatrix} 1 \\ \psi_1(x;\beta)\,\psi_0^{-1}(x;\beta) \\ \vdots \\ \psi_{m-1}(x;\beta)\,\psi_0^{-1}(x;\beta) \end{pmatrix}    (23)

as w = Mω, subject to

M^T e = e_1, \qquad M\omega(x;\beta) > 0 \quad \forall x


where e = \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}^T and e_1 = \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix}^T are m-by-1 column vectors.

Now the objective function Υ(M ;β) can be expressed in terms of ω and f(x) ≡ pB(x;β) as

\Upsilon(M; \beta) = 1 - \sum_{k=1}^{m} \langle w_k^2(x; \beta, M) \rangle_f
                   = 1 - \langle \omega^T M^T M \omega \rangle_{\psi_0^2}
                   = 1 - \int_\Omega (\psi^T M^T M \psi)(x; \beta)\,dx
                   = 1 - \sum_{i,j=0}^{m-1} (M^T M)_{ij} \underbrace{\int_\Omega \psi_i(x;\beta)\psi_j(x;\beta)\,dx}_{\delta_{ij}}
                   = 1 - \operatorname{tr} M^T M,

a concave quadratic function, where tr M^T M can also be expressed as ‖M‖_F^2, the squared Frobenius norm of the basis transformation matrix M.

These results allow the linearly constrained global optimization problem for Mopt to be succinctly stated as

\underset{M}{\text{minimize}} \quad 1 - \|M\|_F^2
\text{subject to} \quad M^T e = e_1, \quad (M\psi)(x;\beta) > 0 \;\; \forall x, \quad M \in GL(m)    (24)

for some fixed β > 0 scaling parameter.

The linear inequality constraints Mψ > 0 encode all of the problem-dependent data from the input f(x) via

\psi = \begin{pmatrix} \psi_0(x;\beta) & \psi_1(x;\beta) & \cdots & \psi_{m-1}(x;\beta) \end{pmatrix}^T.

These inequality constraints are difficult to satisfy for all x over domains containing infinitely many points. In the next section some suggestions for handling these constraints under various discretizations are provided, along with stable and scalable numerical implementations.
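The following sketch illustrates one natural discretization (not the paper's exact implementation): the eigenfunction values are collected into a matrix Psi with one column per grid point or data item, the partition of unity inequalities become columnwise sign constraints on M Psi, and the objective of (24) is evaluated directly.

    import numpy as np

    def uncertainty(M):
        """Objective of (24): Upsilon(M) = 1 - ||M||_F^2 (to be minimized)."""
        return 1.0 - np.sum(M * M)

    def is_feasible(M, Psi, tol=1e-10):
        """Discretized constraints of (24). Psi has shape (m, N), with row i
        holding psi_i evaluated at the N points, so M @ Psi collects the
        (unnormalized) window functions at every point and must be nonnegative
        columnwise; M^T e must equal e_1."""
        m = M.shape[0]
        e1 = np.zeros(m); e1[0] = 1.0
        col_sums = M.T @ np.ones(m)
        return np.allclose(col_sums, e1, atol=tol) and bool(np.all(M @ Psi >= -tol))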

3.4 Temperature Scaling

This section describes how macrostate clustering methods depend on the scaling parameter β and its physical interpretation as an inverse temperature parameter. Since higher values of β correspond to lower intensity of random forces for the corresponding generating process relative to the deterministic drift forces, from a physical standpoint it could alternatively be interpreted as a drift sensitivity parameter.

According to this interpretation, higher drift-sensitivities occur when the magnitudes of random forces are low relative to the magnitude of the deterministic drift forces, i.e. negative gradients of the energy surface. Increasing β may resolve components more effectively than using β = 1 in some cases, a hypothesis explored in Section 5. Although the equilibrium distribution of the stochastic process does not equal f when β ≠ 1, it relates to f(x) via the negative log density V(x) ≡ −log f(x) from (17).

Combining equations (18) and (17) means that p_B(x; 1) ∝ f(x) for β = 1. If β ≠ 1 then p_B ∝ f^β and the normalization constant can be safely ignored since it does not affect the eigenfunctions.

Scaling V(x) by β applies a power law transformation to the corresponding unnormalized distribution f^β(x) = e^{−βV(x)}. This affects the separation between relatively higher and lower probability density values, deforming the shape of the original mixture distribution.


Physically, the β parameter sets the inverse thermal energy of the underlying Brownian dynamics process yielding f^β(x) as the unnormalized equilibrium distribution or stationary state. Both the number and separability of the components therefore depend on the choice of β, the inverse temperature parameter.

In physical systems, larger values of β will reduce the ability of the sample paths of the underlying stochastic process to transition between macrostates relative to smaller values, reducing overlap between regions of high probability. The effect of β < 1 on the shape of the distribution is to increase the chance that a sample path will transition between macrostates, resulting in a flatter shape and more overlapping of components.

Although discontinuities in the number and structure of the mixture components can occur for sufficiently large deviations from β = 1, for smaller excursions continuity can be expected. If a specific number of clusters m is expected from some external knowledge, it is possible to use this prior information to adjust β until a sufficiently large eigenvalue gap r_m(β) is found for this cluster number.

Computing M_opt(β) for multiple values of β is often prohibitively expensive. Fortunately, its effects can be understood via the resulting spectral gaps r_m(β), which are significantly easier to compute. This economy of resources allows examining several values of β at reasonable computational expense, as explained with examples in the next section.
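A sketch of that economy: only the few smallest eigenvalues of the symmetrized operator are needed for each β, so the gaps r_m(β) can be tabulated over a grid of β values without ever solving (24). Here assemble_L is a caller-supplied, hypothetical function that builds the symmetric Laplacian for a given β, for example from f(x)^β on a grid as in section 4.3.

    import numpy as np
    from scipy.sparse.linalg import eigsh

    def scan_beta(assemble_L, betas, m_max=11):
        """Tabulate r_m(beta) = lambda_m(beta)/lambda_{m-1}(beta) for each beta,
        where assemble_L(beta) returns a symmetric Laplacian/rate matrix."""
        table = {}
        for beta in betas:
            vals = np.sort(np.abs(eigsh(assemble_L(beta), k=m_max + 1,
                                        which="SM", return_eigenvectors=False)))
            table[beta] = {m: vals[m] / vals[m - 1] for m in range(2, m_max + 1)}
        return table

    # e.g. gaps = scan_beta(lambda b: build_L(f_grid, beta=b), np.linspace(1, 3, 512))
    # (build_L and f_grid are placeholders for a problem-specific construction)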

3.5 Previous Formulations

The macrostate data clustering algorithm is an earlier formulation developed in Korenblum and Shalloway [2003], White and Shalloway [2009].

These earlier papers did not explicitly mention that macrostates are a type of finite mixture model of the form (1), nor did they mention the Bayes classifier posterior form and probabilistic interpretations of (2). The probabilistic nature of (14) was not explicitly mentioned previously, and more generic terms such as fuzzy spectral clustering and cluster assignment strengths were used to avoid misinterpretation.

Previous formulations used a different objective function and a different global optimization solver. The objective function was a logarithm of the geometric mean uncertainty, which led to a nonlinear optimization problem that was not as well-studied as quadratic programs.

Here, the original objective function from Shalloway [1996] is used, providing a more standard concave quadratic programming (QP) problem that can be more easily solved. Linearly constrained concave QPs can be solved using established heuristics such as modified Frank-Wolfe type procedures [Pardalos and Guisewite, 1993].

The modified Frank-Wolfe method is easily parallelizable since the linear programming (LP) problems associated with each of the starting points are standard independent problems, making scalability feasible on high-performance computing architectures such as GPUs. Reducing the software implementation to a sequence of LP solver calls simplifies the ease of use and practical applicability of macrostates and encourages more active future developments.
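A compact sketch of this idea, using scipy.optimize.linprog as the standardized LP solver; Psi is the (m x N) matrix of selected eigenvectors evaluated at the discretization points and M0 is one feasible starting matrix. This is an illustrative reading of the modified Frank-Wolfe heuristic, not the author's Matlab implementation; in practice it would be run from many random feasible starts in parallel and the best local solution kept.

    import numpy as np
    from scipy.optimize import linprog

    def frank_wolfe_macrostate(Psi, M0, iters=50, tol=1e-9):
        """Local search for the concave QP (24): repeatedly linearize
        1 - ||M||_F^2 at the current M and jump to the vertex of the polytope
        {M : M^T e = e_1, M Psi >= 0} returned by the LP solver."""
        m, N = Psi.shape
        A_ub = -np.kron(np.eye(m), Psi.T)             # -(M Psi) <= 0, row-major vec(M)
        b_ub = np.zeros(m * N)
        A_eq = np.kron(np.ones((1, m)), np.eye(m))    # column sums of M ...
        b_eq = np.zeros(m); b_eq[0] = 1.0             # ... must equal e_1
        M = M0.copy()
        for _ in range(iters):
            grad = -2.0 * M                           # gradient of 1 - ||M||_F^2
            res = linprog(grad.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                          b_eq=b_eq, bounds=(None, None), method="highs")
            if not res.success:
                break
            V = res.x.reshape(m, m)                   # vertex minimizing the linearization
            if np.sum(V * V) <= np.sum(M * M) + tol:  # no further decrease of Upsilon
                break
            M = V                                     # concave objective: take the full step
        return M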

Detailed comparisons between the two formulations are not included here because the previous methods required customized algorithm implementations that limited their practicality.

Another difference is that in Korenblum and Shalloway [2003], White and Shalloway [2009] only unbounded inverse quadratic similarity measures and soft Gaussian thresholds that do not directly control sparsity were tested. Here, other choices of similarity/distance measures are tested and the use of hard thresholding is examined to directly control sparsity of the resulting Laplacian matrix used as the primary input into the algorithm.

By starting from the continuous mixture function representation (1), the range of input datatypes is expanded to density separation/unmixing problems where the values of the mixture density function f(x) are available as inputs. The previous formulation defined in Korenblum and Shalloway [2003] did not mention the applications to density function unmixing/separation via (21) and the connection to discrete approximations of Smoluchowski equations as described in section 4.3.

4 Applications

4.1 Markov Chain and Graph Embeddings

When data items are samples {x_i}_{i=1}^{N} from the mixture distribution f(x), the values of f(x) are not directly available. For these problems the density function values f(x) are never computed and the final outputs are the probabilistic cluster assignments (19) evaluated at the item locations. Laplacian matrices provide the connection between pairwise similarity or dissimilarity matrices and CTMCs, allowing macrostate clustering methods to be used for systems where accurate estimates of the mixture distribution f(x) are not available.

4.2 Graph Partitioning and Network Analysis

Eigensystems of Laplacian matrices are also used for spectral partitioning of graphs in order to identify strongly intraconnected subgraphs or communities [Fortunato, 2010]. For graph clustering or partitioning problems, adjacency matrices are primary inputs rather than coming from similarity/distance measure evaluations over item pairs [Schaeffer, 2007]. In terms of adjacency matrices, the identification of separable subgraphs corresponds to an optimal symmetric permutation of rows and columns to achieve approximate block-diagonal form.

Unweighted adjacency matrices are a type of binary similarity matrix that encode connections between nodes as a 1 or a 0. Weighted symmetric graphs can also be encoded using weighted adjacency matrices that play the same role as similarity matrices for data items.

Directed graph (digraph) Laplacian matrices are also defined [Chung, 2004]. These also define valid macrostate models provided that the corresponding CTMCs are reversible.

Network graphs of links between world wide web pages or citations between publications are examples of currently relevant practical graph partitioning problems. The Google PageRank score can be interpreted in terms of the first and second eigenvectors of the graph Laplacian matrix of the graph of outgoing links on websites [Smola and Kondor, 2003].

4.3 Density Function Separation by Discrete Approximation

For sufficiently low dimensional functions (e.g. n ≤ 4) evaluated on evenly-spaced grids, the discrete approximation of (7) can be used via a nearest-neighbor Laplacian approximation to construct a sparse approximation of L in (9) as described in Banushkina et al. [2005]. The discrete approximation approach is useful for applications where the mixture function f(x) can be evaluated on a grid, such as density estimates generated by histograms or kernel density estimation. This was the method used for the numerical example described in Section 5.1.
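The sketch below gives one possible nearest-neighbor construction of such a sparse symmetric matrix from density values on a regular 2-D grid; it is a plausible discretization chosen so that f^{β/2} is an exact zero-eigenvalue vector (the discrete analogue of ψ_0 in (11)), not the specific stencil of Banushkina et al. [2005].

    import numpy as np
    from scipy.sparse import lil_matrix, csr_matrix

    def discrete_smoluchowski(f_grid, h=1.0, beta=1.0, D=1.0):
        """Sparse symmetric approximation of L in (9) from strictly positive
        density values f_grid on a regular ny-by-nx grid with spacing h.
        Neighboring cells are coupled with -D/h^2; diagonals are set so that the
        flattened f_grid**(beta/2) is annihilated exactly (zero eigenvalue)."""
        ny, nx = f_grid.shape
        psi0 = (f_grid ** (0.5 * beta)).ravel()
        n = nx * ny
        L = lil_matrix((n, n))
        for i in range(ny):
            for j in range(nx):
                k = i * nx + j
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < ny and 0 <= jj < nx:
                        kk = ii * nx + jj
                        L[k, kk] = -D / h ** 2
                        L[k, k] += (D / h ** 2) * psi0[kk] / psi0[k]
        return csr_matrix(L)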

Discrete approximations can also be applied to nonnegative signals such as spectral density estimates and 2- and 3-dimensional images sampled on evenly-spaced nodes after preprocessing to remove random noise. Since discrete approximations of Smoluchowski equations are microscopically reversible continuous-time Markov chains (CTMCs), macrostate models can also be constructed by embedding input data into Markov chains.

Just like in the continuous case for (8), discrete transition rate matrices for time reversible processes are similar to symmetric matrices. Similarly, their eigenvalues and eigenvectors are real and the eigenvectors corresponding to distinct eigenvalues are orthogonal.

5 Numerical Examples

In this section, several numerical examples are presented to illustrate the practical application of macrostate clustering methods using both synthetic and real-world problems. The results show high levels of performance across a range of problem types without encountering problems such as numerical instability or sensitivity to small variations of tunable parameters. All run-times were less than 20 minutes using Matlab on a 2.80 GHz Intel Core i7-2640M CPU running the 64-bit Windows 7 Home Premium operating system with 8GB of RAM.

5.1 Density function example: Gaussian/Laplace/hyperbolic-secant Mixture

Starting with a purely synthetic example allows unambiguous interpretation of the effect of parameter choices on performance. For the first example, a three-component mixture f(x) ∝ f_1(x) + f_2(x) + f_3(x) was constructed, consisting of one component made from a mixture of three Gaussians, a second component made from a mixture of three unnormalized Laplace distributions, and a third made from a mixture of three hyperbolic-secant functions. Then the discretized density function values are taken as input for macrostate clustering via direct approximation of the Smoluchowski equation as described in section 4.3.

Since these components do not share a common parametrization, using any one class of distribution to compute the unmixing will result in errors. This simple yet nontrivial example provides a test of the accuracy of the nonparametric macrostate mixture modeling approach.


Figure 1 shows an image of f(x) evaluated at 40000 Cartesian gridpoints x_i ∈ [−10, 10]^2 (a 200-by-200 grid) in two dimensions, with colors indicating the value of f(x) at each point.

Figure 1: Gaussian/Laplace/hyperbolic-secant mixture density function surface plot colored by probability density. Each of the three separable components was constructed by adding randomly generated anisotropic radial functions with either Gaussian, Laplacian, or hyperbolic-secant radial profiles. Finally, these randomized components were superimposed to generate the final mixture distribution shown here.

The panels of Figure 2 show the corresponding unthresholded (row a) and hard-thresholded (row b) macrostate mixture model solutions using the default β = 1 scaling parameter setting. Row c shows the original mixture components prior to forming their superposition for comparison (ground truth).

Figure 2: Macrostate mixture model components for the Laplace (red)/hyperbolic-secant (green)/Gaussian (blue) 2-D test problem. Row a: unthresholded macrostate mixture model components; white arrows indicate regions of unwanted overlap that contribute to the error. Row b: hard-thresholded components to remove regions of unwanted overlap; edges are artifacts contributing to the error. Row c: original unmixed components representing the ground truth for comparison. Column 1 of Table 1 lists the corresponding mean squared errors.

Algorithm performance can be improved by tuning the β parameter by examining the behavior of the eigenvalue spectrum as β varies. Since computing the eigenvalue gaps r_m(β) is relatively inexpensive, a matrix of r_m(β) values was computed for m = 1, . . . , 11 with 512 values of β between 1 and 3.

These computed values are shown as a series of lines colored by m with β on the horizontal axis in Figure 3. Notice that the largest gap occurs at m = 3 for all β ∈ [0, 3], indicating stability of the algorithm in selecting the correct number of clusters over a wide range of the scaling parameter.


The downward slope of r_4(β) versus β is also an indicator that the m = 3 solution is optimal, since it implies that the higher-magnitude transition rates for eigenstates not used in forming the m = 3 solution are becoming increasingly similar as β increases, supporting the conclusion that m = 3 is the optimal choice.

Figure 3: Semilog plot of eigenvalue gap r_m(β) vs. β colored by m for m = 2, . . . , 11. The m = 3 case in blue quickly goes above the vertical axis limit but continues increasing on the logarithmically-scaled vertical axis with a near constant slope, as shown by the inset plot.

Because the components of this mixture are significantly overlapping, the eigenvalue gap r_3(β) for m = 3 at β = 1 is small, as shown in Figure 3. This leads to the unwanted overlap between the optimally separated components visible in the top row of Figure 2.

Since the true solution was known for this test problem, the relative error vs. β was computed as shown in Figure 4.

Figure 4: Relative error of the m = 3 macrostate mixture model for the Gaussian/Laplace/hyperbolic-secant mixture test problem vs. β.

For β = 2.6 the computed eigenvalue gap was r_3(2.6) = 93.09. The relative error was below 0.94 for β > 2.1, where the eigenvalue gap r_3(2.1) = 35.82. This indicates that setting a sufficiently large eigenvalue gap cutoff and increasing β until this cutoff is reached would lead to near-optimal accuracy over a relatively wide range of thresholds.

Figure 3 indicates that the optimal value of β = 2.6 occurs where r_4(β) reaches the same order of magnitude as the gaps r_j(β) for j > 4. This observation suggests that perhaps an optimal β can be found by ensuring that higher-order gaps are sufficiently small.

Generalizing from this example, perhaps the optimal value of β can be found for real-world problems by placing an upper bound on r_{j>m} and increasing β until this criterion is met, up to some maximum cluster/mixture component number m of interest. For the Gaussian/Laplace/hyperbolic-secant test problem, the optimal value of β occurs at a corresponding cutoff of r_4(β) ≤ 1.2. While more tests are needed to establish this as a useful estimate in general, the results from the examples in sections 5.2 and 5.3 support this hypothesis.

Figure 5 shows the optimized macrostate clustering for this test problem. Compared to the β = 1 solution shown in Figure 2, rows a and b of Figure 5 are visually improved, with unwanted mixing across components no longer visible.


Figure 5: Optimally-scaled β = 2.6 macrostate mixture model components for the Laplace (red)/hyperbolic-secant (green)/Gaussian (blue) 2-D test problem. Row a: unthresholded macrostate mixture model components. Row b: hard-thresholded components. Row c: original (unmixed) components. Column 2 of Table 1 lists the corresponding mean squared errors.

Column 2 of Table 1 shows the relative errors for the probabilistic/unthresholded and hard-thresholded optimal β = 2.6 solutions. In contrast to the β = 1 solution shown in the first column of this table, hard-thresholding increases the error value compared to the original probabilistic/unthresholded version.

Table 1: Relative errors of macrostate mixture models for the Gaussian/Laplace/hyperbolic-secant mixture density function separation/unmixing test problem.

                   β = 1     β = 2.6
no threshold       0.2078    0.0541
hard threshold     0.1803    0.0718

5.2 Graph embedding - Gene expression data

614 genes were selected by following the instructions for removing genes with low temporal variation described in Mathworks [2014], with (logarithmic) fold-change values ranging from -4.479 to 4.372. Rounding these values to the nearest 1/2 and removing duplicate items yielded an input dataset of 536 unique discretized items. The Gaussian kernel similarity measure was chosen based on the Euclidean distances of the discretized fold-change vector time series data, with a scale factor of 1/5 the item standard deviations averaged over all time points (0.3489).

Gene expression datasets can contain dozens of clusters, some having very few (e.g. 2) members. For the purposes of demonstrating the macrostate clustering approach, reducing the number of clusters to something closer to 10 is appropriate.

To achieve this goal, outlier filtering was applied to remove an additional 293 items. Outliers were iteratively selected as items having mean similarity less than 0.2 times the mean similarity over all items, until all such points were removed.
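One reading of this filtering step, sketched below for illustration (the exact rule used in the paper may differ in detail): repeatedly drop the items whose mean similarity to the remaining items falls below a fraction of the overall mean similarity.

    import numpy as np

    def filter_outliers(S, frac=0.2):
        """S: symmetric pairwise similarity matrix. Returns the indices of the
        items kept after iteratively removing low mean-similarity outliers."""
        keep = np.arange(S.shape[0])
        while True:
            Sk = S[np.ix_(keep, keep)]
            low = Sk.mean(axis=1) < frac * Sk.mean()
            if not low.any():
                return keep
            keep = keep[~low]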

The remaining 243 items were clustered using the eigenvectors of the (unnormalized) Laplacian matrix derived from the Gaussian kernel pairwise similarity matrix. Spectral gaps were computed for m = 2, . . . , 20 as plotted in Figure 6, indicating that the m = 9 clustering was the largest m having a gap greater than a cutoff of 1.5. Using a value of 1.5 for the cutoff was observed to apply well across a wide range of problems tested, but was otherwise arbitrarily chosen.


Figure 6: Spectral gap vs. m using the Gaussian kernel similarity measure with scaling parameter of 0.3489, showing m = 9 as the rightmost peak above a cutoff value of 1.5.

Built-in Matlab functions were used to compute both the k-means and GMM outputs. The data spectroscopy software (DaSpec) was obtained online from the original author at http://www.stat.osu.edu/~taoshi/software/software.html.

Figure 7 shows the final hard/crisp cluster assignment labels for four different clustering algorithms in order to visually compare their effectiveness. Panel a shows the m = 9 macrostate clustering, panel b shows the k-means, panel c shows the Gaussian mixture model (GMM), and panel d shows the data spectroscopy (DaSpec) output.

The macrostate clustering and data spectroscopy results agree perfectly for this example, an unexpected outcome discussed in more detail in section 6.7.

Figure 7: Relative expression level vs. time curves for the 243 gene yeast expression dataset tested, showing the m = 9 clusterings. Panel a shows the macrostate clustering results using a Euclidean distance based Gaussian kernel similarity measure. Panels b and c show the results of the k-means and GMM algorithms, respectively.

Silhouette scores provide a means of quantitatively assessing data cluster validity [Rousseeuw, 1987]. The average silhouette score over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Averaging the silhouette scores over all data thus provides a quantitative measure of the overall clustering quality.
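For reference, a minimal sketch of how such scores can be computed with scikit-learn (an assumed tool here; the paper's tables were produced in Matlab):

    from sklearn.metrics import silhouette_score, silhouette_samples

    # X: (n_items, n_features) data matrix; labels: hard cluster labels per item
    def mean_silhouette(X, labels):
        """Mean silhouette score over all items (higher is better)."""
        return silhouette_score(X, labels, metric="euclidean")

    # per-item values, e.g. for silhouette plots like Figure 8:
    # s_i = silhouette_samples(X, labels)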

Silhouette data clustering validation scores for the gene expression test problem are shown in Table 2. Macrostate clustering yields a higher average silhouette score compared to the k-means and GMM algorithms, indicating highly improved performance on test data from biological gene expression experiments.

Table 2: Mean silhouette scores

macrostate clustering     0.7946
k-means                   0.4123
GMM                      -0.4011
DaSpec                    0.7946

The corresponding silhouette plots (Figure 8 a-d) indicate that the k-means and GMM algorithms contain clusters with negative silhouette scores, suggesting bad performance. By contrast, the identical macrostate clustering and data spectroscopic outputs both provide higher mean silhouette scores for this biological-data test problem.

Figure 8: Silhouette plots, 9-cluster solutions, for the 243 item yeast expression dataset. Panels a and d show the macrostate clustering and data spectroscopy outputs, respectively, which were identical for this test case. Panels b and c show the plots for the k-means and GMM algorithms, respectively. (Negative silhouette scores indicate inferior performance.)

5.3 Graph partitioning - C. elegans neural network graph

The biological neural network graph for the nematode C. elegans, a well-studied animal model organism in biology, was obtained from Watts and Strogatz [1998] and used as a real-world graph-partitioning example. Nodes of the original directed graph without both input and output connections were removed in order to create an irreducible and reversible CTMC using

D^{-1/2} (D - A_{sym}) D^{-1/2},

the normalized graph Laplacian of the symmetrized adjacency matrix A_{sym} ≡ A + A^T.

Although symmetrization of digraphs is not essential to define a macrostate clustering, it can be useful when overall connectivity is of interest and also simplifies the numerical implementation. The symmetrized Laplacian matrix is sufficient for the present purpose of demonstrating the potential usefulness of macrostates on graph partitioning problems.
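A short sketch of this preprocessing (one possible reading, not the paper's Matlab code): keep only nodes that have at least one incoming and one outgoing edge, symmetrize the adjacency matrix, and form the normalized Laplacian.

    import numpy as np

    def symmetrized_normalized_laplacian(A):
        """A: dense 0/1 (or weighted) directed adjacency matrix.
        Returns D^{-1/2} (D - A_sym) D^{-1/2} for A_sym = A + A^T restricted to
        nodes with both input and output connections, plus the kept indices.
        Assumes the restricted symmetrized graph has no isolated nodes."""
        A = np.asarray(A)
        indeg = A.sum(axis=0)
        outdeg = A.sum(axis=1)
        keep = np.where((indeg > 0) & (outdeg > 0))[0]
        A_k = A[np.ix_(keep, keep)]
        A_sym = A_k + A_k.T
        d = A_sym.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = D_inv_sqrt @ (np.diag(d) - A_sym) @ D_inv_sqrt
        return L, keep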

Adapting the temperature scaling methodology from section 5.1, the off-diagonal elements of the Laplacian matrix can be interpreted as transition rates and scaled by an arbitrary β exponent to tune the accuracy of the resulting clustering. Figure 9 shows the multiscale gap structure r_m(β) vs. β for the m = 2, . . . , 20 cluster cases. The spectral gaps r_m(β) suggest an optimal value of m = 5 near β = 2 since r_5(β) > 1.5 and max{r_m}_{m=6}^{20} < 1.5 satisfy the chosen cutoff of 1.5 for sufficient separability.

Figure 9: Multiscale spectral gaps r_m(β) vs. β for 1 ≤ β ≤ 5 and m = 2, . . . , 20, showing a large value of r_5(2) > 1.5 and max{r_m(2)}_{m=6}^{20} < 1.5, indicating strong separability for the m = 5 at β = 2 model.


The sparsity patterns of the adjacency matrix for the original (left) and partitioned (right) graphs are shown in Figure 10. Notice how blocks along the diagonal appear to emerge after reordering the nodes according to the cluster/partition assignments. Visually, this indicates a significant increase in the block-diagonal structure of the adjacency matrix corresponding to each cluster, suggesting that a meaningful partitioning of the graph was obtained.

As expected for small-world networks obeying power laws as described in Watts and Strogatz [1998], the clusters are quite varied in their sizes and occur at all scales. These results support the conclusion that macrostates are useful for multiscale network analysis applications, although more work needs to be done to complete the interpretation of the results obtained here.

Figure 10: Original (left) and partitioned (right) C. elegans neural network graph sparsity pattern, where dots indicate edges between nodes. Colors indicate cluster/partition assignment using the m = 5 cluster solution where Υ(M_opt;β) = 0.5893 with β = 2. Diagonal blocks suggested by the colors indicate strong relative connectivity of member nodes within each cluster. Cluster membership numbers are: 12, 35, 114, 51, 47 (from top to bottom).

6 Discussion

Macrostates are versatile and powerful tools for challenging problems such as linear separation/unmixing, cluster analysis, and graph partitioning. Their applicability to a variety of currently active research areas reveals deep similarities in the structures of these seemingly distinct and independently-derived problems.

Macrostates make no explicit assumptions about the functional form of the mixture density f(x) and are completely nonparametric, allowing almost any type of mixture density function, item sample, or graph/network to be accurately separated. Relying only on the spectral properties of Laplacian matrices addresses Pearson's original concerns, raised over a century ago, about rigorously computing nonparametric density function components for arbitrary mixture distributions.

Compared to existing methods such as data spectroscopy, macrostate clustering methods use fewer heuristics and provide probabilistic models containing additional information, at the expense of solving a nontrivial optimization problem.

6.1 Interpretation

Macrostates were originally defined for separating physical systems where the equilibrium distribution is generated by a single process and the mixture components attempt to approximate the system's metastable dynamics [Shalloway, 1996]. For physical problems the mixture components (macrostates) do not represent independent sources or processes but are expansions of a single process into subprocesses in terms of the system dynamics. In nonphysical statistical finite mixture modeling and unsupervised learning, the assumption is that the observations actually come from distinct independent processes that happen to be superimposed.

This philosophical difference is worth noting since the form of the underlying finite mixture model is the same regardless of whether or not the physical interpretation is used. This implies that the physical interpretations of macrostates are not a cause for concern about their applicability to nonphysical systems.

Understanding this mathematical equivalence between dynamic systems, finite mixture modeling, and spectral clustering creates new connections between methods from physics and statistics. These connections allow the effectiveness of Laplacian matrices in spectral clustering to be understood in terms of the metastable dynamics of continuous-time Markov processes and also allow ideas from spectral clustering to be used in finite mixture modeling.

Stochastic process or random walk interpretations of spectral clustering algorithms have been discussed previously for the simplest case of two clusters [Smola and Kondor, 2003]. Macrostates extend the stochastic process interpretation to the general case where there may be more than two clusters.

This enables the results of spectral clustering methods such as data spectroscopy to be interpreted in terms of a metastable relaxation of the corresponding drift-diffusion's distribution dynamics towards an equilibrium distribution.
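As a deliberately simplified illustration of this interpretation (not the paper's exact construction), a negative graph Laplacian can be read as the transition-rate matrix Q of a continuous-time Markov process, and the relaxation p(t) = p(0) exp(tQ) toward equilibrium is slow across weakly connected groups of nodes. The toy similarity matrix below is hypothetical:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical similarity/adjacency matrix: two tightly connected groups
# ({0,1,2} and {3}) joined by one weak link.
W = np.array([[0.0, 1.0, 1.0, 0.00],
              [1.0, 0.0, 1.0, 0.00],
              [1.0, 1.0, 0.0, 0.05],
              [0.0, 0.0, 0.05, 0.0]])

D = np.diag(W.sum(axis=1))
Q = W - D              # negative Laplacian: off-diagonals >= 0, rows sum to zero

p0 = np.array([1.0, 0.0, 0.0, 0.0])   # distribution concentrated on node 0
for t in (0.1, 1.0, 100.0):
    pt = p0 @ expm(Q * t)              # distribution after time t
    print(t, np.round(pt, 3))          # mass spreads quickly within the tight group,
                                       # then slowly crosses the weak link (metastability)
```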

6.2 Geometry

It may seem surprising that a physical method for modeling dynamic systems can be adapted to nonphysical linear separation/unmixing problems. The reverse logic of considering generating stochastic processes for arbitrary mixture distributions makes sense because the PDE (7) defining the dynamics of the physical process is entirely determined by the geometry of the potential energy function V(x).

The geometric equivalence of separating a single dynamic process into weakly coupled subprocesses and separating mixtures of processes into independent components is the key idea behind macrostates. This motivates the use of generating stochastic processes for mixture distributions f(x), defined by the negative log densities V(x) in (17), to provide useful information about the geometry or structure of f(x).

6.3 Parameter tuning

The role of the inverse temperature parameter β is analogous to the scaling parameter used in many reproducing kernel functions such as Gaussian or other radial basis function kernels. In the context of similarity measures defined by kernel functions K(x, y; ε), β can often be related to the scaling factor ε, which is typically determined by cross-validation.

Although macrostate clustering methods involve fewer parameters than most other spectral clustering and finite mixture modeling approaches, the temperature parameter can be useful for choosing Laplacian matrices having optimal macrostates. Using β > 1 was shown to improve unmixing accuracy for distributions with significant overlaps by enforcing sharper boundaries between components. This makes intuitive sense from a physical standpoint, since it is analogous to lowering the temperature of a weakly metastable physical process in order to slow the transitions between the metastable states.
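For the Gaussian kernel this analogy can be made explicit: raising K(x, y; ε) = exp(−‖x − y‖²/(2ε²)) to the power β is algebraically identical to using the smaller effective bandwidth ε/√β, so β > 1 sharpens the similarity structure in the same way that shrinking the kernel scale does. A minimal numerical check of this identity (illustrative only; the kernel form and values are assumptions, not taken from the paper):

```python
import numpy as np

def gaussian_kernel(x, y, eps):
    # K(x, y; eps) = exp(-||x - y||^2 / (2 eps^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * eps ** 2))

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=3)
eps, beta = 0.7, 2.5

k_beta = gaussian_kernel(x, y, eps) ** beta              # kernel raised to the power beta
k_shrunk = gaussian_kernel(x, y, eps / np.sqrt(beta))    # same kernel, bandwidth eps/sqrt(beta)
assert np.isclose(k_beta, k_shrunk)                      # identical up to floating point
```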

6.4 Combined representation and inference

Macrostates provide a single algorithm for probabilistic clustering directly from the eigensystems of Laplacian matrices. Most other existing spectral clustering methods use a two-stage approach, with a representation step using Laplacian eigenspaces followed by a separate inference step.

In exchange for the added cost or complexity of (24), all of the information used for defining the model comes from the eigenspace of a Laplacian or transition-rate matrix, enabling inference without the additional model selection and execution steps required by most other spectral clustering algorithms. Although macrostate clustering involves solving an NP-hard linearly constrained concave quadratic global optimization problem, the modified Frank-Wolfe heuristic of Pardalos and Guisewite [1993, page 94] provides an effective approximate solver for the example problems tested.
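A minimal sketch of a Frank-Wolfe-type heuristic for linearly constrained concave quadratic minimization using a standard LP solver is given below; the objective, constraints, and starting point are hypothetical stand-ins for (24) rather than the paper's actual problem data, and the loop is a generic variant rather than the exact algorithm of Pardalos and Guisewite [1993]:

```python
import numpy as np
from scipy.optimize import linprog

def frank_wolfe_concave(Q, c, A_ub, b_ub, x0, max_iter=100, tol=1e-9):
    """Heuristically minimize the concave quadratic f(x) = c@x - x@Q@x
    (Q positive semidefinite) over {x >= 0, A_ub @ x <= b_ub}.
    Each step linearizes f at the current point and solves an LP for the
    vertex minimizing the linearization; the vertex is accepted only if it
    improves the objective, otherwise the loop terminates."""
    f = lambda x: c @ x - x @ Q @ x
    x = x0
    for _ in range(max_iter):
        grad = c - 2 * Q @ x
        res = linprog(grad, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        s = res.x
        if f(s) < f(x) - tol:
            x = s            # move to the improving vertex
        else:
            break            # no improving vertex found: heuristic local solution
    return x

# Tiny illustrative instance (hypothetical data).
Q = np.array([[1.0, 0.2], [0.2, 1.0]])
c = np.array([0.5, 0.3])
A_ub = np.array([[1.0, 1.0]])            # x1 + x2 <= 1
b_ub = np.array([1.0])
x_opt = frank_wolfe_concave(Q, c, A_ub, b_ub, x0=np.array([0.4, 0.4]))
```

Because the objective is concave, its minimum along the segment between the current point and the LP vertex is attained at an endpoint, so each iteration either jumps to an improving vertex or stops; this is what makes the LP relaxation a natural building block for the heuristic.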

6.5 Cluster number prediction

Many other mixture modeling, clustering, and partitioning algorithms in use today are parametric and cannot easily predict the correct number of mixture components. By contrast, macrostates are nonparametric and can accurately predict this number directly from efficiently computable eigenspectra.

Using a multiscale spectral gap involving a ratio, rather than a difference, of adjacent eigenvalues further differentiates macrostate mixture models from existing spectral clustering and partitioning methods.
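A minimal sketch of this selection rule follows, assuming (for illustration only) that r_m is the ratio of adjacent Laplacian eigenvalues; the spectrum below is hypothetical:

```python
import numpy as np

def multiscale_gaps(eigvals):
    """Ratios of adjacent Laplacian eigenvalues, sorted ascending with the
    trivial zero eigenvalue first.  Here r_m is taken as lam[m] / lam[m-1]
    (the gap after the m-th eigenvalue in 1-based indexing); this is an
    assumed definition for illustration, not necessarily the paper's r_m."""
    lam = np.sort(np.asarray(eigvals))
    return {m: lam[m] / lam[m - 1] for m in range(2, len(lam))}

# Hypothetical spectrum with a pronounced gap after the 5th eigenvalue.
eigvals = [0.0, 0.02, 0.03, 0.05, 0.06, 0.31, 0.35, 0.40]
r = multiscale_gaps(eigvals)
m_opt = max(r, key=r.get)           # cluster number suggested by the largest ratio
print(m_opt, round(r[m_opt], 2))    # -> 5, well above the 1.5 threshold used in Figure 9
```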

6.6 Density Separation

The direct application of spectral clustering to mixture density separation by discrete approximation of the corresponding Smoluchowski operator is a new connection made here. Since spectral clustering methods rely on Laplacian matrices as their primary inputs and discrete approximations of Smoluchowski operators are negative Laplacian matrices, spectral clustering methods can be extended to density separation problems. Based on the nonparametric test problem results presented here, the high accuracy (> 95%) indicates that macrostate clustering warrants further study as a density separation approach, in addition to its applications to data clustering and graph partitioning problems.
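A minimal sketch of one standard nearest-neighbor discretization on a one-dimensional grid is given below, using a hypothetical two-component Gaussian mixture for f(x) and V(x) = −log f(x); the construction is illustrative and not necessarily identical to the discretization used here, but it yields a transition-rate matrix with non-negative off-diagonals and zero row sums, i.e. a negative Laplacian matrix:

```python
import numpy as np

# Hypothetical two-component Gaussian mixture density on a 1-D grid.
x = np.linspace(-4.0, 6.0, 200)
f = 0.6 * np.exp(-0.5 * x**2) + 0.4 * np.exp(-0.5 * (x - 3.0)**2)

beta = 2.0
V = -np.log(f)      # potential from the negative log density (normalization only shifts V)

# Nearest-neighbor rates satisfying detailed balance with respect to exp(-beta*V):
# rate(i -> i+1) proportional to exp(-beta*(V[i+1]-V[i])/2), and symmetrically back.
n = len(x)
Q = np.zeros((n, n))
up = np.exp(-beta * (V[1:] - V[:-1]) / 2)      # i -> i+1
down = np.exp(-beta * (V[:-1] - V[1:]) / 2)    # i+1 -> i
Q[np.arange(n - 1), np.arange(1, n)] = up
Q[np.arange(1, n), np.arange(n - 1)] = down
Q -= np.diag(Q.sum(axis=1))                    # zero row sums: Q is a negative Laplacian

# Its spectrum feeds directly into the spectral clustering machinery:
eigvals = np.sort(np.linalg.eigvals(Q).real)[::-1]   # largest eigenvalue is 0 (equilibrium)
```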

6.7 Comparison to Data Spectroscopy

The heuristics used by data spectroscopy analyze the values of the eigenvectors and are computationally less expensive than macrostate clustering since they do not involve a global optimization algorithm. As shown in Section 5.2, at the level of the hard/crisp cluster labels, data spectroscopy can indeed provide accurate estimates of the macrostate clustering solutions when the same kernel functions or Laplacian matrices are used. However, it does not provide a full probabilistic model including the soft/fuzzy cluster assignment probabilities, and it involves kernel-specific heuristics for choosing the appropriate cluster number.

It may be possible to use macrostate clustering and data spectroscopic methods together, or to merge them so that they can be used consistently on different analyses within the same project. For example, the multiscale spectral gap criterion used by macrostates could also be incorporated into data spectroscopy and used to select an optimal kernel function. Once a kernel function has been selected, the more expensive macrostate clustering solution might be computed next, in order to provide additional soft/fuzzy cluster assignment probabilities and also to verify the results of the approximate solution provided by data spectroscopy.

Since macrostate clusters are globally optimized finite mixture models derived from Laplacian eigenvectors, these results provide independent verification that data spectroscopy generates accurate approximations of the optimal mixture model for well-separated clusters. These two algorithms have very different derivations and conceptual origins, so a better understanding of their similarities could provide important insights into the relationship between spectral clustering and finite mixture modeling.

The mathematical arguments used to prove the accuracy of data spectroscopy in [Shi et al., 2009] and of other kernelized spectral clustering methods described more recently in [Schiebinger et al., 2015] may yield a better understanding of the assumptions used in macrostate clustering. Likewise, the physical interpretations of macrostates in terms of drift-diffusions, and the relationship of the kernel scaling parameter to the temperature or energy of the Brownian motion of generating stochastic processes, may provide additional insight into the accuracy of the approximations used by data spectroscopy.

6.8 Future work

With the description of the easily implemented Frank-Wolfe heuristic here, it should now be easier for others to evaluate the performance of macrostate clustering on a wide variety of linear separation problems for themselves. By combining aspects of spectral clustering and finite mixture modeling methods in a self-contained theoretical framework, macrostate clustering methods have the potential to address some of the most challenging problems in data analysis without unnecessary heuristics. It would be interesting to obtain a probabilistic interpretation of the set of eigenvalue gaps {r_m}_{m≥2}, i.e. an estimate of P(m), the probability that a particular choice of m is appropriate, perhaps using a training dataset for calibration.

The close correspondence between the derivation and outcomes of macrostate clustering methods and the data spectroscopy kernelized spectral clustering algorithm suggests that a more detailed evaluation of their similarities and differences would be useful. The perfect agreement of the outputs of these two different algorithms is remarkable and warrants further study to understand their connections and to provide more detailed comparative performance evaluations.

There may be cases where the more rigorous approximation of the NP-hard optimal finite mixture modeling solution provided by macrostate clustering yields higher accuracy than DaSpec, at some additional computational cost. The use of the multiscale spectral gap criterion may give more accurate cluster number estimates than the heuristic used by DaSpec, even in cases where the two algorithms agree in their hard cluster assignments.

The topological structure of macrostate splitting as β increases from a sufficiently small nonzero value towards ∞ has been useful for solving challenging global optimization problems [Pardalos et al., 1994]. Such homotopy-related aspects of macrostate theory are not explored here for the sake of brevity.

Recursive applications for creating hierarchical levels of partitions, e.g. as used by multigrid methods, are another interesting area left for future work. User-friendly and open-source software implementations are also necessary to promote the more widespread use of macrostates but are beyond the scope of this work.



7 Acknowledgements

I gratefully acknowledge David Shalloway and Walter Murray for their instruction and advice, and Panos Pardalos for suggesting the use of the Frank-Wolfe heuristic. This research was supported by the Bayes Impact nonprofit organization, http://bayesimpact.org.

References

A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clustering. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 190–197, 2006.

P. Banushkina, O. Schenk, and M. Meuwly. Efficiency considerations in solving Smoluchowski equations for rough potentials. In CompLife’05: Proceedings of the First International Conference on Computational Life Sciences, pages 208–216. Springer-Verlag, 2005.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

F. Chung. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9:1–19, 2004.

H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 1996.

B. S. Everitt. An introduction to finite mixture distributions. Statistical Methods in Medical Research, 5:107–127, 1996.

G. Fasshauer. Meshfree Approximation Methods with MATLAB. World Scientific, 2007.

S. Fortunato. Community detection in graphs. Physics Reports, 486:75–174, 2010.

T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36:1171–1220, 2008.

D. Korenblum and D. Shalloway. Macrostate data clustering. Physical Review E, 67:056704, 2003.

Mathworks. Analyzing gene expression profiles. Website, 2014. URL http://www.mathworks.com/help/bioinfo/ug/example-analyzing-gene-expression-profiles.html.

J. D. McAuliffe, D. M. Blei, and M. I. Jordan. Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16:5–14, 2006.

G. McLachlan and D. Peel. Finite Mixture Models. John Wiley & Sons, Inc., 2000.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, 2002.

B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 1996.

P. M. Pardalos and G. M. Guisewite. Parallel computing in nonconvex programming. Annals of Operations Research, 43:87–107, 1993.

P. M. Pardalos, D. Shalloway, and G. Xue. Optimization methods for computing global minima of nonconvex potential energy functions. Journal of Global Optimization, 4:117–133, 1994.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions A, 185:71–110, 1894.

H. Risken. The Fokker-Planck Equation: Methods of Solution and Applications. Springer, 1996.

P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

R. Schaback and H. Wendland. Kernel techniques: From machine learning to meshless methods. Acta Numerica, 15:543–639, 2006.

S. E. Schaeffer. Graph clustering. Computer Science Review, 1:26–64, 2007.

G. Schiebinger, M. Wainwright, and B. Yu. The geometry of kernelized spectral clustering. The Annals of Statistics, 43:819–846, 2015.

D. Shalloway. Macrostates of classical stochastic systems. The Journal of Chemical Physics, 105:9986–10007, 1996.



T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, 37:3960–3984, 2009.

A. J. Smola and R. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines. Springer-Verlag, 2003.

D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393:440–442, 1998.

B. S. White and D. Shalloway. Efficient uncertainty minimization for fuzzy spectral clustering. Physical Review E, 80:056705, 2009.
