HAL Id: hal-01999453
https://hal.archives-ouvertes.fr/hal-01999453

Submitted on 30 Jan 2019


Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution

Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia

To cite this version: Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC '19 - 34th ACM/SIGAPP Symposium on Applied Computing, Apr 2019, Limassol, Cyprus. pp. 502-509, 10.1145/3297280.3297327. hal-01999453


Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution

Khadidja Meguelati
Inria and LIRMM, Montpellier, France
[email protected]

Bénédicte Fontez
MISTEA, Montpellier SupAgro, Univ. Montpellier, Montpellier, France
[email protected]

Nadine Hilgert
MISTEA, INRA, Univ. Montpellier, Montpellier, France
[email protected]

Florent Masseglia
Inria and LIRMM, Montpellier, France
[email protected]

ABSTRACT

Clustering with accurate results has become a topic of high interest. The Dirichlet Process Mixture (DPM) is a model used for clustering with the advantage of discovering the number of clusters automatically, while offering nice properties such as its potential convergence to the actual clusters in the data. These advantages come at the price of prohibitive response times, which impairs its adoption and makes centralized DPM approaches inefficient. We propose DC-DPM, a parallel clustering solution that gracefully scales to millions of data points while remaining DPM compliant, which is the challenge of distributing this process. Our experiments, on both synthetic and real world data, illustrate the high performance of our approach on millions of data points. The centralized algorithm does not scale and reaches its limit at 100K data points, where it needs more than 7 hours. In this case, our approach needs less than 30 seconds.

KEYWORDS

Dirichlet Process Mixture Model, Clustering, Parallelism

ACM Reference Format:
Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, and Florent Masseglia. 2019. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. In The 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3297280.3297327

1 INTRODUCTION

Clustering, or cluster analysis, is the task of grouping similar data into the same cluster and separating dissimilar data into different clusters. It is a data mining technique and is intensively used for data analytics, with applications to marketing [2], security [13], or sciences like astronomy [21], and many more. Clustering may be used for identification in the new challenge of high throughput plant phenotyping [17], a research field with the purpose of crop improvement in response to present and future demographic and climate scenarios.


Figure 1: Durum in an experimental field. RGB image.

In this case, the data to be considered include data on plants and crop images, like the one illustrated by Figure 1, showing a view of a Durum crop. Automatic identification, from such images, of leaves, soil, and distinguishing plants from the foreground, is of high value for experts since it provides the fundamental information used for popular supervised methods in the domain [17]. One of the main difficulties for clustering is the fact that we do not know, in advance, the number of clusters to be discovered. To help perform cluster analysis despite this unknown number of clusters, statistics advocates for:

(1) Setting a number of clustering runs, with varying values of K, and selecting the one that minimizes a goodness of fit criterion. It may be a quadratic risk or the Residual Mean Squared Error of Prediction (RMSEP) [14]. This approach needs the implementation of a cross-validation algorithm [14]. The clustering approach in this case may be a mixture model with an Expectation-Maximization (EM) algorithm [6], or K-means [14], for instance.

(2) Making a hierarchical clustering and then cutting off the tree at a given depth, usually decided by the end-user. Different approaches for pruning, with advantages and drawbacks, exist; see [14].

(3) Using a Dirichlet Process Mixture (DPM), which automatically detects the number of clusters [8].

In this work, we focus on the DPM approach since it allows estimating the number of clusters and assigning observations to clusters in the same process. Furthermore, its implementation is quite straightforward in a Bayesian framework. Such properties make DPM a very appealing solution for many use-cases. Unfortunately, DPM is highly time consuming. Consequently, several attempts have been made to distribute it [16, 25, 26]. However, while being effectively distributed, these approaches usually suffer from convergence issues (imbalanced data distribution on computing nodes) [10, 16, 26] or do not fully benefit from DPM properties [25] (see our discussion in Section 3). Furthermore, making DPM parallel is not straightforward since it must compare each record to the set of existing clusters, a highly repeated number of times. That impairs the global performance of the approach in parallel, since comparing all the records to all the clusters would call for a high number of communications and make the process impracticable. Our goal is to propose a parallel DPM approach that fully exploits parallel architectures for better performance and offers meaningful results. Our main contribution is to keep the clusters consistent among worker nodes, and between the worker and master nodes, with regard to DPM properties.

Our motivating example comes from the biology use-case described above, where the processing time on one image may go up to several days in a centralized environment. These response times are the main reason why DPM is not used in the domain. The results are very informative for experts, but the processing times are prohibitive. In this case, parallelization is an appealing solution, but it has to guarantee that results remain as informative as the ones targeted by a centralized run. We propose DC-DPM (Distributed Clustering by Dirichlet Process Mixture), a distributed DPM algorithm that allows each node to have a view on the local results of all the other nodes, while avoiding exhaustive data exchanges. The main novelty of our work is to propose a model and its estimation at the master level by exploiting the sufficient statistics from the workers, in a DPM compliant approach. Our solution takes advantage of the computing power of distributed systems by using parallel frameworks such as MapReduce or Spark [27]. Our DC-DPM solution distributes the Dirichlet Process by identifying local clusters on the workers and synchronizing these clusters on the master. These clusters are then communicated to the workers as a basis for local clustering consistency. We modified the Dirichlet Process to consider this basis in each worker. By iterating this process we seek global consistency of DPM in a distributed environment. Our experiments, using real and synthetic datasets, illustrate both the high efficiency and the linear scalability of our approach. We report significant gains in response time compared to centralized DPM approaches, with processing times of a few minutes, compared to several days in the centralized case.

The paper is organized as follows. In Section 2 we state the problem and give the necessary background on Dirichlet Process Mixture. In Section 3 we discuss related work and in Section 4 we describe the details of our distributed solution for clustering by means of Dirichlet Process Mixture. Section 5 reports the results of our experimental evaluation to verify the efficiency and effectiveness of our approach, and Section 6 concludes.

2 PROBLEM DEFINITION AND BACKGROUND

2.1 Dirichlet Process Mixture Models

A Dirichlet Process (DP) is a probability distribution over distributions. In our use-case, a distribution over the image pixels could be "plant" with probability p1 and "not plant" with probability p2, with the property that p1 + p2 = 1. A DP generates a probability distribution G. We observe a sample θ1, ..., θN from G. In our use-case, each θi is the vector of possible pixel color values.

$$\theta_n \mid G \overset{iid}{\sim} G, \quad n = 1, \dots, N$$
$$G \sim DP(\alpha, G_0)$$

where G is by construction a discrete probability distribution [22]:
$$G = \sum_{i=1}^{\infty} \pi_i(v)\, \delta_{\phi_i}$$

Therefore, observed variables θn have a non-null probability of having the same value ϕi, and this allows for clustering. In our use-case, "plant" pixels will have the same color vector ϕ expressing the green value. Clustering is very sensitive to the DP parameters given by the end user. G0 is a continuous probability distribution from which the (ϕi)_{i∈ℕ} are initially drawn. In our use-case, G0 gives the color probability of all possible clusters in the image.

$$\phi_1, \dots, \phi_i, \dots \sim G_0$$

α is a scale parameter (α > 0) which tunes the probability weights πi(v):

$$V_1, \dots, V_i, \dots \sim \mathrm{Beta}(1, \alpha), \qquad \pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j)$$
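The stick-breaking construction above can be made concrete with a short, truncated simulation. The sketch below is illustrative only: the truncation level, the choice G0 = N(0, 1) and all identifiers are assumptions made for the example, not part of the paper's implementation.

```scala
import scala.util.Random

object StickBreakingSketch {
  // Truncated stick-breaking draw of the DP weights pi_i and atoms phi_i.
  // G0 is taken here as a standard normal distribution, purely for illustration.
  def truncatedDP(alpha: Double, truncation: Int, rng: Random): (Array[Double], Array[Double]) = {
    // V_i ~ Beta(1, alpha), drawn by inverse CDF: 1 - (1 - u)^(1/alpha)
    val v = Array.fill(truncation)(1.0 - math.pow(1.0 - rng.nextDouble(), 1.0 / alpha))
    val pi = new Array[Double](truncation)
    var remaining = 1.0                                   // length of the stick still unbroken
    for (i <- 0 until truncation) {
      pi(i) = v(i) * remaining                            // pi_i = v_i * prod_{j<i} (1 - v_j)
      remaining *= (1.0 - v(i))
    }
    val phi = Array.fill(truncation)(rng.nextGaussian())  // phi_i ~ G0
    (pi, phi)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(0)
    val (pi, phi) = truncatedDP(alpha = 1.0, truncation = 20, rng = rng)
    println(f"sum of the kept weights = ${pi.sum}%.4f (the remaining mass is truncated)")
    pi.zip(phi).take(5).foreach { case (w, p) => println(f"weight $w%.3f at atom $p%.3f") }
  }
}
```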

α indirectly tunes the probability mass function of kN, the number of unique values (namely ϕi) in a sample of size N [3]:

$$p(k_N) = \left|S_{N,k_N}\right| \, \frac{N!\, \alpha^{k_N}\, \Gamma(\alpha)}{\Gamma(\alpha + N)} \qquad (1)$$
where $\left|S_{N,k_N}\right|$ is the unsigned Stirling number of the first kind.
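The effect of α on k_N can also be seen by simulating the sequential (Chinese restaurant) view of the DP, in which observation n joins an existing cluster c with probability #(c)/(n-1+α) and opens a new one with probability α/(n-1+α). The following sketch is only a hedged illustration of that behaviour; the sample sizes, seed and names are arbitrary choices, not taken from the paper.

```scala
import scala.util.Random

object NumberOfClustersSketch {
  // Sequentially assigns N observations: observation n joins an existing cluster c with
  // probability #(c)/(n-1+alpha) and opens a new cluster with probability alpha/(n-1+alpha).
  def sampleKN(n: Int, alpha: Double, rng: Random): Int = {
    val counts = scala.collection.mutable.ArrayBuffer[Int]()
    for (i <- 0 until n) {
      val u = rng.nextDouble() * (i + alpha)   // total unnormalized mass is i + alpha
      var acc = 0.0
      var chosen = -1
      for (c <- counts.indices if chosen < 0) {
        acc += counts(c)
        if (u < acc) chosen = c
      }
      if (chosen < 0) counts += 1 else counts(chosen) += 1   // new cluster or join an existing one
    }
    counts.size
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1)
    for (alpha <- Seq(0.1, 1.0, 10.0)) {
      val ks = Seq.fill(200)(sampleKN(1000, alpha, rng))
      println(f"alpha = $alpha%5.1f   mean k_N over 200 runs: ${ks.sum.toDouble / ks.size}%.1f")
    }
  }
}
```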

With a Dirichlet Process Mixture we observe a sample y1, ..., yN from a mixture of distributions F(θn). In our use-case, we assume that colors are observed with a noise distributed according to F. The mixture is controlled by a DP on the parameters θn:

$$y_n \sim F(\theta_n), \quad n = 1, \dots, N$$
$$\theta_n \sim G$$
$$G \sim DP(\alpha_0, G_0)$$

In a Bayesian framework, the estimation of θn is done on the posterior P(θ1, ..., θN | y1, ..., yN). Instead of this representation, another parameterization is used to speed up the computation of the posterior:
$$P(\phi_{c_1}, \dots, \phi_{c_N} \mid y_1, \dots, y_N)$$

where θn = ϕ_{c_n}, c_n is the cluster label of observation n, and ϕ_{c_n} is the unique value shared by the θn belonging to the same cluster. The Gibbs algorithm [11] samples the cluster labels c1, ..., cN and then the cluster parameters (here ϕc, for all c ∈ {1, ..., K}, where K denotes the number of cluster label values) instead of θ1, ..., θN.



Several versions of a Gibbs sampler are proposed by Neal in [19] to simulate values from the posterior. The principle is to repeat the following loops at least until convergence to the posterior:

(1) Cluster assignment, for n = 1, ..., N:
• Remove observation n from its cluster. If the cluster is now empty, remove the cluster and ϕ_{c_n} from the list {ϕ} of all possible values.
• Draw c_n from
$$P(c_n = c \mid \{c_j\}_{j \neq n}, y_n, \{\phi\}) \propto \begin{cases} \dfrac{\#(c)}{N-1+\alpha}\, F(y_n \mid \phi_c) & \text{existing cluster} \\[4pt] \dfrac{\alpha}{N-1+\alpha} \displaystyle\int F(y_n \mid \phi)\, dG_0(\phi) & \text{new cluster} \end{cases}$$
where #(c) denotes the number of observations assigned to cluster c (after removing observation n from the sample).
• If c denotes a new cluster, draw ϕ_c from P(ϕ | y_n) ∝ F(y_n | ϕ) G_0(ϕ).
• End of the loop over n.
(2) Update of {ϕ}:
• Draw ϕ_c from the posterior distribution of cluster c, P(ϕ | {y}_c), which is proportional to the product of the prior G_0 and the likelihood of all observations assigned to cluster c.

When the distributions F and G0 are conjugate, ϕ can be integrated out from the Gibbs sampling, which then becomes time-efficient (there is no need to update {ϕ}). Then
$$P(c_n = c \mid \{c_j\}_{j \neq n}, y_n) \propto \begin{cases} \dfrac{\#(c)}{N-1+\alpha} \displaystyle\int F(y_n \mid \phi)\, dP(\phi \mid \{y\}_c) & \text{existing cluster} \\[4pt] \dfrac{\alpha}{N-1+\alpha} \displaystyle\int F(y_n \mid \phi)\, dG_0(\phi) & \text{new cluster} \end{cases}$$
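As an illustration of this collapsed (conjugate) Gibbs sampler, here is a minimal one-dimensional Scala sketch assuming F(· | ϕ) = N(ϕ, σ²) and G0 = N(m0, s0) with known variances. It is a simplified sketch of the sampling scheme above, not the paper's DC-DPM code; all names and parameter values are illustrative.

```scala
import scala.util.Random

object CollapsedGibbsSketch {
  def normalPdf(y: Double, mean: Double, variance: Double): Double =
    math.exp(-0.5 * (y - mean) * (y - mean) / variance) / math.sqrt(2 * math.Pi * variance)

  // Collapsed Gibbs sweeps: phi is integrated out thanks to the Normal-Normal conjugacy.
  def run(y: Array[Double], alpha: Double, sigma2: Double, m0: Double, s0: Double,
          iters: Int, rng: Random): Array[Int] = {
    val n = y.length
    val labels = Array.fill(n)(0)                          // start with a single cluster
    val counts = scala.collection.mutable.Map(0 -> n)      // #(c)
    val sums   = scala.collection.mutable.Map(0 -> y.sum)  // sum of observations in cluster c

    for (_ <- 0 until iters; i <- 0 until n) {
      // remove observation i from its current cluster
      val c = labels(i)
      counts(c) -= 1
      sums(c) -= y(i)
      if (counts(c) == 0) { counts -= c; sums -= c }

      // posterior predictive weight of each existing cluster (phi integrated out)
      val existing = counts.keys.toArray
      val weights = existing.map { k =>
        val postVar  = 1.0 / (counts(k) / sigma2 + 1.0 / s0)   // posterior variance of phi_k
        val postMean = postVar * (sums(k) / sigma2 + m0 / s0)  // posterior mean of phi_k
        counts(k) * normalPdf(y(i), postMean, postVar + sigma2)
      }
      // prior predictive for a new cluster
      val newWeight = alpha * normalPdf(y(i), m0, s0 + sigma2)

      // draw the new label proportionally to the weights
      val u = rng.nextDouble() * (weights.sum + newWeight)
      var acc = 0.0; var chosen = -1; var j = 0
      while (chosen < 0 && j < existing.length) {
        acc += weights(j)
        if (u < acc) chosen = existing(j)
        j += 1
      }
      if (chosen < 0) chosen = if (counts.isEmpty) 0 else counts.keys.max + 1  // fresh label

      labels(i) = chosen
      counts(chosen) = counts.getOrElse(chosen, 0) + 1
      sums(chosen) = sums.getOrElse(chosen, 0.0) + y(i)
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(2)
    val data = Array.fill(100)(rng.nextGaussian() - 5.0) ++ Array.fill(100)(rng.nextGaussian() + 5.0)
    val labels = run(data, alpha = 1.0, sigma2 = 1.0, m0 = 0.0, s0 = 100.0, iters = 50, rng = rng)
    println(s"discovered clusters: ${labels.distinct.length}")
  }
}
```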

2.2 Massive Distribution and Spark

Clustering via Dirichlet Process Mixture based on Gibbs sampling is unable to scale to large datasets due to the high computational costs associated with Bayesian inference. For this reason, we aim to implement a parallel algorithm for DPM clustering in a massively distributed environment called Spark, a parallel programming framework aiming to efficiently process large datasets. This programming model can perform analytics with in-memory techniques to overcome disk bottlenecks. Similar to MapReduce [4], Spark can be deployed on the Hadoop Distributed File System (HDFS) [23]. Unlike traditional in-memory systems, the main feature of Spark is its distributed memory abstraction, called resilient distributed datasets (RDD), which is an efficient and fault-tolerant abstraction for distributing data in a cluster. With RDDs, the data can easily be persisted in main memory as well as on the hard drive. Spark is designed to support the execution of iterative algorithms.

To execute a Spark job, we need a master node to coordinate the job execution, and worker nodes to execute the parallel operations. These parallel operations fall into two types: (i) Transformations, which create a new RDD from an existing one (e.g., Map, MapToPair, MapPartition, FlatMap); and (ii) Actions, which return a final value to the user (e.g., Reduce, Aggregate or Count).
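As a small, self-contained illustration of this transformation/action distinction (assuming a local Spark installation; the job and values are made up for the example, not taken from the paper's code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkBasicsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))
    val points  = sc.parallelize(1 to 1000000)       // an RDD distributed over the workers
    val squared = points.map(x => x.toLong * x)      // transformation: lazy, builds a new RDD
    val cached  = squared.persist()                  // keep the RDD in memory across actions
    val total   = cached.reduce(_ + _)               // action: triggers the computation
    val count   = cached.count()                     // second action reuses the cached RDD
    println(s"sum of squares = $total over $count elements")
    sc.stop()
  }
}
```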

2.3 Problem Definition

The problem we address is as follows. Given a (potentially big) dataset of records, find, by means of a process performed in parallel, a partition of the dataset into disjoint subsets called clusters, such that:

• Similar records are assigned to the same cluster.
• Dissimilar records are assigned to different clusters.
• The union of the clusters is the original dataset.

3 RELATED WORK

We set our work in the context of parallel clustering. Previous work on distributed algorithms for unsupervised clustering already exists. Ene et al. [7] gave MapReduce algorithms for the k-center and k-median problems. Both algorithms use Iterative-Sample as a sub-procedure to obtain a substantially smaller subset of points that represents all of the points well. However, these algorithms require the number of clusters k to be specified in advance, which is considered one of the most difficult problems in data clustering. Debatty et al. [5] proposed a MapReduce implementation of G-means, an iterative algorithm that uses the Anderson-Darling test to verify whether a subset of data follows a Gaussian distribution. However, this algorithm overestimates the number of clusters and then requires a post-processing step to merge clusters.

In our work, we focus on algorithms inspired by the DPM. Lovell et al. [16] and Williamson et al. [26] suggested an alternative parametrisation of the Dirichlet process in order to derive non-approximate parallel MCMC inference for it. These approaches are criticized by Gal and Ghahramani [10], who showed that they are impractical due to an extremely imbalanced distribution of the data, and who gave directions for future research such as the development of better approximate parallel inference. The main idea when data is distributed is to perform a DPM in each worker. The issues are then to share information between workers, and to synchronize and update the clusters arising from the workers at the master level. For synchronization, the main challenge is a problem of identification and of label switching of clusters. In this context one can use a relabelling algorithm, for example the one proposed by Stephens [15] for mixture models. For parallel Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP), Newman et al. [20] suggested measuring the distance between clusters and then proposed a greedy matching. Wang and Lin [25] gave a detailed review of the literature and of recent advances in this topic before giving a new proposal. They proposed to use a stepwise hierarchical classification at the master level with half chance for a split or a merge at each step. They began with a full model considering all clusters from all workers as different components of the model. Their algorithm uses the standard Bayes Factor to compare nested models and choose the best split or merge. In conclusion, at the master level, the proposed algorithms diverge from a DPM-classifier and are not scalable estimations of a DPM. Moreover, Wang and Lin [25] used a fixed value for the scale parameter (α) in their implementation of the DPM at the worker level. The number of final clusters is related to this value (see Equation 1). Authors like Miller and Harrison [18] have demonstrated the inconsistency of the number of components of a DPM model with a fixed α value. If the number of components identified at the worker level is underestimated, then the number of clusters at the master level might be underestimated. The reverse will considerably increase the running time at the master level.

In our approach, we suggest keeping to a DPM algorithm as much as possible, even at the master level, in order to stay close to the good properties of a DPM-classifier, despite the fact that data is distributed. We also suggest a modification of the DPM model to share information among workers. In this way we expect to improve our clustering (better estimation) and to suppress label switching. Finally, we do not fix a value for α but allow a different estimation in each worker to add flexibility to our model.

Furthermore, [25] is restricted to specific cases where the noise in the observations follows the conjugate distribution of the cluster centers distribution. For example, a Gaussian noise imposes a Gaussian distribution of the centers. Therefore, this method is not suited for centers having positive values only.

Our goal is to work on any data, even with exclusively positivecenters.

4 DC-DPM: DISTRIBUTED CLUSTERING VIA DPM

In this section, we present a novel parallel clustering approach called DC-DPM, adapted to independent data. Parallelization calls for particular attention to two main issues. The first one is the load balance between computing nodes. In our approach we distribute data evenly across the different nodes, and there is no data exchange between nodes during the processing. The second issue is the cost of communications. In order to be efficient, nodes send and receive as little information as possible by performing many iterations of Gibbs sampling independently in each worker before synchronizing the global state at the master level, and by only communicating sufficient statistics between workers and master. The challenge of using sufficient statistics in a distributed environment is to remain within the DPM approach at all steps, including the synchronization between the worker and master nodes. The novelty of our approach is to approximate the DPM model even at the master level, where the local data is replaced by sufficient statistics between iterations.

4.1 Architecture and Distributed Algorithm

Data Distribution

Data is evenly distributed on the computing nodes when the process starts. This is a mere sequential distribution that splits the dataset into equal-sized partitions.

Worker

This level handles the innovation part of DPM (the detection of new clusters) and the individual cluster assignments in each worker. The updates of the cluster labels in worker j depend on the sample size proportions of the distributed data:
$$P(c_{n,j} = c \mid c_{-(n,j)}, y_{n,j}, \{\phi\}) \propto \begin{cases} \dfrac{\#(c)_j}{N_j - 1 + \alpha}\, F(y_{n,j}, \phi_c), & c = 1, \dots, K \\[4pt] \dfrac{\alpha}{N_j - 1 + \alpha} \displaystyle\int F(y_{n,j}, \phi)\, dG_0(\phi) & \text{new} \end{cases}$$
As the clusters are not known at the beginning, we cannot ensure that the sample size proportions of each cluster will be respected in each worker. If the data were uniformly distributed, each cluster would have, on average, the same weight/proportion in all workers. Therefore we modify the update, as can be seen in Algorithm 1.

Algorithm 1 DPM at worker level

for each data point y_n do
  Draw c_{n,j} from
  $$P(c_{n,j} = c \mid \{c_{l,j}\}_{l \neq n}, y_{n,j}, \{\phi\}, \{w\}, \alpha_j) \propto \begin{cases} \dfrac{\#(c) + \alpha_j w_c}{N_j - 1 + \alpha_j}\, F(y_{n,j}, \phi_c), & c = 1, \dots, K \\[4pt] \dfrac{\alpha_j w_u}{N_j - 1 + \alpha_j} \displaystyle\int F(y_{n,j}, \phi)\, dG_0(\phi) & \text{new} \end{cases}$$
end for
Update α_j
Draw ϕ_c for the new clusters

where the weight w_c is the proportion of observations from cluster c evaluated on the whole dataset, and w_u is the proportion of non-affected observations (awaiting the creation, or discovery, of their real clusters). These parameters are therefore updated at the master level during the synchronization. The scale parameter α_j can now be viewed as a tuning parameter between the local (worker) and global (master) proportions. Following [9] we use an inverse gamma prior to infer this parameter.
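A minimal sketch of this worker-level draw in the Normal case is given below. It assumes one-dimensional data, cluster centers ϕ_c and global weights (w_c, w_u) received from the master, and G0 = N(m0, s0); these names and values are illustrative assumptions, not the paper's implementation.

```scala
import scala.util.Random

object WorkerAssignmentSketch {
  def normalPdf(y: Double, mean: Double, variance: Double): Double =
    math.exp(-0.5 * (y - mean) * (y - mean) / variance) / math.sqrt(2 * math.Pi * variance)

  def drawLabel(y: Double, localCounts: Array[Int], alphaJ: Double,
                phi: Array[Double], w: Array[Double], wu: Double,
                sigma2: Double, m0: Double, s0: Double, rng: Random): Int = {
    // existing clusters: (#(c) + alpha_j * w_c) * F(y | phi_c);
    // the common factor 1/(N_j - 1 + alpha_j) cancels after normalization
    val existing = phi.indices.map(c => (localCounts(c) + alphaJ * w(c)) * normalPdf(y, phi(c), sigma2))
    // innovation: alpha_j * w_u times the prior predictive density under G0 = N(m0, s0)
    val innovation = alphaJ * wu * normalPdf(y, m0, s0 + sigma2)

    val u = rng.nextDouble() * (existing.sum + innovation)
    var acc = 0.0; var chosen = -1; var c = 0
    while (chosen < 0 && c < existing.length) {
      acc += existing(c)
      if (u < acc) chosen = c
      c += 1
    }
    if (chosen >= 0) chosen else phi.length   // index phi.length signals "open a new local cluster"
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(3)
    val phi = Array(-5.0, 0.0, 5.0)            // cluster centers shared by the master
    val w   = Array(0.3, 0.3, 0.3)             // global proportions of the shared clusters
    val label = drawLabel(y = 4.7, localCounts = Array(10, 12, 8), alphaJ = 1.0,
                          phi = phi, w = w, wu = 0.1, sigma2 = 1.0, m0 = 0.0, s0 = 100.0, rng = rng)
    println(if (label < phi.length) s"assigned to shared cluster $label" else "opened a new local cluster")
  }
}
```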

This modification of the update implies a slightly modified DPM in each worker j:
$$y_{n,j} \sim F(\theta_{n,j})$$
$$\theta_{n,j} \sim G_j$$
$$G_j \sim DP(\alpha_j, G)$$
$$\alpha_j \sim IG(a, b)$$
$$G = \sum_{c=1}^{K} w_c\, \delta_{\phi_c} + w_u\, G_0, \qquad \text{with } w_u + \sum_{c=1}^{K} w_c = 1$$

Master

This level handles the final individual assignment in the master node and therefore the final common number of clusters K. The master gets the following input from each worker: the sample size of cluster k in worker j (n_{j,k}), the cluster parameter values or sufficient statistics, and an individual predictive value (traditionally the cluster mean value in the worker: $y_{n,j} = \bar{y}_{j,k}$ for $c_{n,j} = k$). At the master level, the observations are assigned by clusters. A cluster corresponds to a set of individuals belonging to the same cluster of the same worker. Each cluster has a representative, or individual predictive value, which is used to perform the end of the Gibbs sampling at the master level:
$$P(c_{n,j} = c \mid c_{-(n,j)}) \propto \begin{cases} \dfrac{\#(c)}{N - 1 + \gamma}\, F(y_{n,j}, \phi_c), & c = 1, \dots, K \\[4pt] \dfrac{\gamma}{N - 1 + \gamma} \displaystyle\int F(y_{n,j}, \phi)\, dG_0(\phi) & \text{new} \end{cases}$$
Working at an individual level implies a slow Gibbs sampling with poor mixing [12]. We therefore suggest an update by clusters. In this view, we denote by z_{j,k} the master label of cluster k in worker j. To take into account the worker information ({ϕ^{worker}_{j,k}}), we replace the prior predictive distribution ($\int F(y_{n,j}, \phi)\, dG_0(\phi)$) by a posterior predictive distribution. Eventually, we use the cluster mean value ($\bar{y}_{j,k}$) as the individual predictive value:
$$P(z_{j,k} = c \mid z_{-(j,k)}) \propto \begin{cases} \dfrac{\#(c)}{N - 1 + \gamma}\, F(\bar{y}_{j,k}, \phi_c), & c = 1, \dots, K \\[4pt] \dfrac{\gamma}{N - 1 + \gamma} \displaystyle\int F(\bar{y}_{j,k}, \phi)\, dG(\phi \mid \phi^{worker}_{j,k}) & \text{new} \end{cases}$$
The labels {c_{n,j}} of all the observations in cluster k of worker j are then assigned to the master label z_{j,k}. Next, the cluster parameters ({ϕ_c}_{c=1,...,K}) are updated from the posterior computed on the whole dataset. We assume that we do not need all the data, but only sufficient statistics from all clusters of all workers, to compute the posterior. This assumption is straightforward for many distributions, such as the exponential family [1]. Last, the synchronization of the workers is done through the definition of G, using the updated parameters ({ϕ_c}_{c=1,...,K}) and weights drawn from a Dirichlet distribution Dir(n_1, ..., n_K, γ). The end-user parameters of this Dirichlet distribution are updated at the master level from the whole dataset. The size n_k is the sum of all observations having label k at the end of the master Gibbs sampling.

$$(w_1, \dots, w_K, w_u) \sim Dir(n_1, \dots, n_K, \gamma)$$
$$\gamma \sim IG(c, d)$$

By doing so, we do not have to consider label switching. Clusters are explicitly defined at the master level and parameter values are not updated in the worker. At the worker level, only innovation (the creation of new clusters) is implemented. This is summarized by Algorithm 2.

Algorithm 2 DPM at master level

for each (j, k) do
  Draw z_{j,k} from
  $$P(z_{j,k} = c \mid \{c\}_{-(j,k)}, \bar{y}_{j,k}, \{\phi\}, \gamma) \propto \begin{cases} \dfrac{\#(c)}{N - 1 + \gamma}\, F(\bar{y}_{j,k}, \phi_c), & c = 1, \dots, K \\[4pt] \dfrac{\gamma}{N - 1 + \gamma} \displaystyle\int F(\bar{y}_{j,k}, \phi)\, dG(\phi \mid \phi^{worker}_{j,k}) & \text{new} \end{cases}$$
end for
Update ϕ and (w_1, ..., w_K, w_u)
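The following simplified Scala sketch illustrates this cluster-by-cluster reassignment in the Normal case, with each worker cluster summarized by its size and mean. The initialization, the handling of the "new cluster" option (a predictive density centered on the worker's own estimate, standing in for $\int F(\bar{y}_{j,k}, \phi)\, dG(\phi \mid \phi^{worker}_{j,k})$) and all names are simplifying assumptions made for the example, not the authors' exact scheme.

```scala
import scala.util.Random

object MasterRelabelSketch {
  case class WorkerCluster(worker: Int, size: Int, mean: Double)

  def normalPdf(y: Double, mean: Double, variance: Double): Double =
    math.exp(-0.5 * (y - mean) * (y - mean) / variance) / math.sqrt(2 * math.Pi * variance)

  // One sweep of master-label draws over all worker clusters.
  def relabel(summaries: Array[WorkerCluster], phi: scala.collection.mutable.ArrayBuffer[Double],
              gamma: Double, sigma2: Double, rng: Random): Array[Int] = {
    // crude initialization: nearest shared center
    val labels = summaries.map(s => phi.indices.minBy(c => math.abs(phi(c) - s.mean)))
    val obsCount = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
    for ((s, i) <- summaries.zipWithIndex) obsCount(labels(i)) += s.size

    for ((s, i) <- summaries.zipWithIndex) {
      obsCount(labels(i)) -= s.size                    // remove this worker cluster from its label
      // existing master clusters, weighted by their observation counts #(c)
      val weights = phi.indices.map(c => obsCount(c) * normalPdf(s.mean, phi(c), sigma2)).toArray
      // "new cluster": predictive centered on the worker's own estimate
      val innovation = gamma * normalPdf(s.mean, s.mean, sigma2)
      val u = rng.nextDouble() * (weights.sum + innovation)
      var acc = 0.0; var chosen = -1; var c = 0
      while (chosen < 0 && c < weights.length) { acc += weights(c); if (u < acc) chosen = c; c += 1 }
      if (chosen < 0) { phi += s.mean; chosen = phi.length - 1 }   // open a new master cluster
      labels(i) = chosen
      obsCount(chosen) += s.size
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(4)
    val summaries = Array(WorkerCluster(0, 50, -5.1), WorkerCluster(0, 40, 5.2),
                          WorkerCluster(1, 60, -4.8), WorkerCluster(1, 45, 4.9))
    val phi = scala.collection.mutable.ArrayBuffer(-5.0, 5.0)
    val labels = relabel(summaries, phi, gamma = 1.0, sigma2 = 1.0, rng = rng)
    println(labels.mkString("master labels: ", ", ", ""))
  }
}
```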

The workflow of our DC-DPM approach is illustrated by Figure 2. It consists of 4 steps:

(1) Identify new local clusters in the workers.
(2) Compute and send sufficient statistics and cluster sizes from each worker to the master.
(3) Synchronize and estimate cluster labels from the sufficient statistics.
(4) Send updated cluster parameters and cluster sizes from the master to the workers.

Our first proposition concerns the synchronization and estimation of the DPM. It is done with a Gibbs sampling conditioned on the sufficient statistics instead of the whole dataset of individual observations. Our second proposition is the construction of a shared prior distribution, updated at the master level and sent to the workers' DPM. This distribution reflects the information collected from all workers and synchronized at the master.
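A sketch of the synchronization performed in steps (3) and (4) is shown below for the Normal case: the master pools the (size, sum) sufficient statistics per master label, draws each ϕ_c from its conjugate posterior, and draws the shared weights (w_1, ..., w_K, w_u) from Dir(n_1, ..., n_K, γ). The Gamma sampler is the standard Marsaglia-Tsang method; everything here is an illustrative assumption rather than the paper's code.

```scala
import scala.util.Random

object SynchronizationSketch {
  // Marsaglia-Tsang Gamma sampler (shape boost used when shape < 1).
  def sampleGamma(shape: Double, rng: Random): Double = {
    if (shape < 1.0) return sampleGamma(shape + 1.0, rng) * math.pow(rng.nextDouble(), 1.0 / shape)
    val d = shape - 1.0 / 3.0
    val c = 1.0 / math.sqrt(9.0 * d)
    var result = -1.0
    while (result < 0) {
      val x = rng.nextGaussian()
      val v = math.pow(1.0 + c * x, 3)
      if (v > 0 && math.log(rng.nextDouble()) < 0.5 * x * x + d - d * v + d * math.log(v)) result = d * v
    }
    result
  }

  def sampleDirichlet(params: Array[Double], rng: Random): Array[Double] = {
    val g = params.map(sampleGamma(_, rng))   // normalize independent Gamma draws
    val s = g.sum
    g.map(_ / s)
  }

  // sizes(c) and sums(c) are pooled over all worker clusters carrying master label c.
  def synchronize(sizes: Array[Int], sums: Array[Double], gamma: Double,
                  sigma2: Double, m0: Double, s0: Double, rng: Random): (Array[Double], Array[Double]) = {
    val phi = sizes.indices.map { c =>
      val postVar  = 1.0 / (sizes(c) / sigma2 + 1.0 / s0)     // conjugate Normal posterior of phi_c
      val postMean = postVar * (sums(c) / sigma2 + m0 / s0)
      postMean + math.sqrt(postVar) * rng.nextGaussian()      // draw phi_c from its posterior
    }.toArray
    val weights = sampleDirichlet(sizes.map(_.toDouble) :+ gamma, rng)  // (w_1, ..., w_K, w_u)
    (phi, weights)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(5)
    val (phi, w) = synchronize(sizes = Array(110, 85), sums = Array(-560.0, 430.0),
                               gamma = 1.0, sigma2 = 1.0, m0 = 0.0, s0 = 100.0, rng = rng)
    println(phi.mkString("updated centers: ", ", ", ""))
    println(w.mkString("weights (last component = w_u): ", ", ", ""))
  }
}
```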

Figure 2: Diagram/workflow of DC-DPM.

The likelihood for one observation y_i from the exponential family is:
$$F(y_i \mid \eta) = h(y_i)\, \exp\left(\eta^T \psi(y_i) - a(\eta)\right)$$
where:
• η are the natural parameters, a function of ϕ;
• ψ(y_i) are the sufficient statistics;
• a(η) is the log-normalizing factor (log-partition), which can be expressed as a function of ϕ: a(η(ϕ));
• h(y_i) is the base measure;
and the likelihood for all the observations is:
$$F(y_1, \dots, y_N \mid \eta) = \left(\prod_{i=1}^{N} h(y_i)\right) \exp\left(\eta^T \left(\sum_{i=1}^{N} \psi(y_i)\right) - N a(\eta)\right)$$
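In the distributed setting, this is what makes step (2) of the workflow cheap: each worker only ships, per cluster, the aggregated sufficient statistics $\sum_i \psi(y_i)$ rather than the observations themselves. The sketch below illustrates such an aggregation with Spark in the Normal case, where a (count, sum) pair per cluster is enough; the labels and values are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SufficientStatisticsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("suff-stats").setMaster("local[*]"))
    // (cluster label, observation) pairs distributed over the workers
    val labelled = sc.parallelize(Seq((0, 1.2), (0, 0.8), (1, 5.1), (1, 4.9), (1, 5.0)))
    // aggregate (count, sum) per cluster; this is all the master needs to update phi_c
    val stats = labelled
      .mapValues(y => (1L, y))
      .reduceByKey { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 + s2) }
      .collectAsMap()
    stats.foreach { case (c, (n, s)) => println(f"cluster $c: n = $n, mean = ${s / n}%.2f") }
    sc.stop()
  }
}
```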

Among all the distributions included in the exponential family, we implemented the Normal case for the experiments: F(· | ϕ_c) = N(ϕ_c, Σ_2). This choice corresponds to the simple linear model y_n = ϕ_c + ε_n with ε_n normally distributed as N(0, Σ_2). In this case, the sample mean of cluster c, namely $\bar{y}_c$, is a sufficient statistic and the posterior distribution can be conditioned only on its value:
$$P(\phi \mid \{y_n\}_{c_n = c}) = P(\phi \mid \bar{y}_c) \propto F(\bar{y}_c \mid \phi)\, G_0(\phi)$$
When G_0 is not a conjugate prior (i.e., not a normal distribution here), the posterior distribution is not a standard one, but a value from this posterior can be simulated with a Metropolis-Hastings (MH) within Gibbs algorithm.
When the variances are known and G_0 is a conjugate prior (a normal distribution N(m, Σ_1)), there is no need for the MH algorithm. The posterior is a normal distribution $N(\phi_c^{post}, \Sigma_c^{post})$ with
$$\phi_c^{post} = \Sigma_c^{post}\left(\#(c)\,\Sigma_2^{-1}\,\bar{y}_c + \Sigma_1^{-1}\, m\right), \qquad \Sigma_c^{post} = \left(\#(c)\,\Sigma_2^{-1} + \Sigma_1^{-1}\right)^{-1}.$$
The predictive posterior is a normal distribution $N(\phi_c^{post}, \Sigma_2 + \Sigma_c^{post})$. In our context, $\Sigma_c^{post}$ was considered negligible and the mean value $\phi_c^{post}$ was replaced by a value drawn from the posterior.

5 EXPERIMENTS

The parallel experimental evaluation was conducted on a cluster of 32 machines, each operated by Linux, with 64 Gigabytes of main memory, an 8-core Intel Xeon CPU and a 250 Gigabyte hard disk. The centralized approach is an implementation of DPM in Scala, and was executed on a single machine with the same characteristics.

The distributed algorithm we propose is an approximation of a classic DPM; we therefore compare its properties to a centralized DPM implementation, on synthetic data and also on our use-case in digital agronomy. The first step of our process is a distributed K-means that sets the initial state (usually we set K to be one tenth of the dataset size).


Reproducibility: All our experiments are fully reproducible. We make our code and data available at https://github.com/khadidjaM/DC-DPM.

In the rest of this section, we describe the datasets in Section 5.1 and our evaluation criteria in Section 5.2. Then, in Section 5.3, we measure the performance, in response time, of our approach compared to the centralized approach, also reporting its scalability and speed-up. We evaluate the clusters obtained by DC-DPM on the real and synthetic datasets in Section 5.4, and Section 5.5 discusses the results and the interest of our work in a real use-case of agronomy.

5.1 Datasets

We carried out our experiments on a real world and a synthetic dataset.

Our synthetic data are generated using a two-step principle. In the first step we generate cluster centers according to a multivariate normal distribution with the same variance σ1² for all dimensions. In the second step, we generate the data corresponding to each center, by using a multivariate normal distribution parameterized on the center with the same variance σ2² for all dimensions. We generated a first batch of 5 datasets having sizes 20K, 40K, 60K, 80K and 100K, with σ1² = 1000 and σ2² = 1. They represent 10 clusters. We generated a second batch of 5 datasets having sizes 2M, 4M, 6M, 8M and 10M, with σ1² = 100000 and σ2² = 10. They represent 100 clusters. This type of generator is widely used in statistics, where methods are evaluated first on synthetic data before being applied to real data.
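A sketch of this two-step generator is given below; the dimensions, sizes and seed are illustrative, and it is a plain re-implementation of the description above, not the exact script used for the experiments.

```scala
import scala.util.Random

object SyntheticDataSketch {
  def generate(nClusters: Int, pointsPerCluster: Int, dim: Int,
               sigma1: Double, sigma2: Double, rng: Random): Array[(Int, Array[Double])] = {
    // step 1: cluster centers drawn with standard deviation sigma1 on every dimension
    val centers = Array.fill(nClusters)(Array.fill(dim)(sigma1 * rng.nextGaussian()))
    // step 2: points drawn around their center with standard deviation sigma2
    val data = for {
      c <- 0 until nClusters
      _ <- 0 until pointsPerCluster
    } yield (c, centers(c).map(x => x + sigma2 * rng.nextGaussian()))
    data.toArray
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(6)
    // e.g. 10 clusters of 2,000 points each in 2 dimensions, with sigma1^2 = 1000 and sigma2^2 = 1
    val data = generate(nClusters = 10, pointsPerCluster = 2000, dim = 2,
                        sigma1 = math.sqrt(1000.0), sigma2 = 1.0, rng = rng)
    println(s"generated ${data.length} labelled points in ${data.map(_._1).distinct.length} clusters")
  }
}
```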

Our real data correspond to the use-case described in Section 1. The image used to test our algorithm was in RGB format. After preprocessing, it contains 1,081,200 data points, each described by a vector of 3 values (red, green and blue) belonging to [0,1].

5.2 Clustering Evaluation Criteria

There are two cases for evaluating the results of a clustering algorithm: either a ground truth is available, or it is not. When a ground truth is available, there are measures that compare the clustering results to the reference, such as the ARI described below. This is usually exploited in experiments when one wants to check performance in a controlled environment, on synthetic data or on labelled real data. When there is no ground truth (which is the usual case, because we do not know what should be discovered in real world applications of a clustering algorithm), the results may be evaluated by means of relative measures, such as the RSS described below. In our experiments, we chose the following three criteria.

1. The Adjusted Rand Index (ARI): the corrected-for-chance version of the Rand Index [24], a function that measures the similarity between two data clustering results, for example between the ground truth class assignments (if known) and the clustering algorithm assignments. ARI values are in the range [-1, 1], with a best value of 1.

2. The residual sum of squares (RSS): a measure of how well the centroids (means) represent the members of their clusters. It is the squared distance of each data point from its centroid, summed over all vectors. In the univariate case, the RSS value divided by the number of observations gives the Mean Squared Error (MSE), an estimator of the residual variance. In a multivariate dataset with independent variables, the RSS value divided by the number of observations gives an estimator of the sum of the variable variances. This sum represents its lower bound and also the best value to be observed in the clustering of synthetic data. To simplify, we report in the following the RSS value divided by the number of data points N and by the variance; the lower bound is then known and should equal the number of variables (for example 2 for our synthetic data). A small sketch of this normalized criterion is given after this list.

3. K, the number of discovered clusters.
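For reference, here is a small sketch of the normalized RSS criterion of item 2 (RSS divided by N and by the noise variance); the data, labels and names are illustrative and not taken from the paper's code.

```scala
object RssSketch {
  def normalizedRss(points: Array[Array[Double]], labels: Array[Int], sigma2sq: Double): Double = {
    val byCluster = points.indices.groupBy(i => labels(i))
    val rss = byCluster.values.map { idx =>
      val dim = points(idx.head).length
      // centroid of the cluster, one coordinate at a time
      val centroid = Array.tabulate(dim)(d => idx.map(i => points(i)(d)).sum / idx.size)
      // squared distance of every point of the cluster to its centroid
      idx.map(i => points(i).zip(centroid).map { case (x, m) => (x - m) * (x - m) }.sum).sum
    }.sum
    rss / (points.length * sigma2sq)   // normalization by N and by the noise variance
  }

  def main(args: Array[String]): Unit = {
    val pts = Array(Array(0.0, 0.0), Array(0.2, -0.1), Array(5.0, 5.0), Array(5.1, 4.9))
    println(f"normalized RSS = ${normalizedRss(pts, Array(0, 0, 1, 1), sigma2sq = 0.01)}%.2f")
  }
}
```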

5.3 Response Time

In this section we measure the clustering time of DC-DPM and compare it to the centralized approach. Figure 3 reports the response times of DC-DPM and of the centralized approach on our synthetic data, limited to 100K data points. Indeed, the centralized approach does not scale and would take several days for larger datasets. The distributed approach is run on a cluster of 8 nodes. The results reported in Figure 3 are in logarithmic scale. The clustering time increases with the number of data points for all approaches. This time is much lower for DC-DPM than for the centralized approach. On 8 machines (64 cores) and for a dataset of 100K data points, DC-DPM performs the clustering in 24 seconds, while the centralized approach needs more than 7 hours on a single machine.

Figure 4 reports an extended view of the clustering time, only for DC-DPM, on datasets having up to 10 million data points from the synthetic collection. DC-DPM is run on a cluster of 16 machines. The running time increases with the number of data points. Let us note that the centralized approach does not scale and cannot execute on such dataset sizes. DC-DPM enjoys linear scalability with the dataset size.

Figures 5 and 6 illustrate the parallel speed-up of our approach on 2M data points from the synthetic dataset and on the more than 1 million data points obtained after preprocessing the image of our use-case. The results show optimal or near-optimal gains. In Figure 6 we observe that the response time for 2 nodes is more than twice the response time for 4 nodes. That is unexpected when measuring a speed-up. However, the response times of our approach are very fast (a few minutes) and do not account for the time it takes for Spark to load up before running DC-DPM. The slight difference between an optimal speed-up and the results reported in Figure 6 is due to that loading time.

5.4 Clustering Evaluation

In the following experiments, we evaluate the clustering performance of DC-DPM and compare it to the centralized approach.

Table 1 reports 1) the ARI value computed between the obtained clustering and the ground truth, 2) the RSS value divided by the number of data points N and by the variance (σ2²), and 3) the number of clusters, obtained with DC-DPM and with the centralized approach on our synthetic data while increasing the dataset size. DC-DPM is run on a cluster of 8 nodes. DC-DPM performs as well as the centralized approach; there is a small gap in RSS values which is negligible compared to the time gained.

Table 2 reports an extended view of the ARI value, the RSS value divided by the number of data points N and by the variance (σ2²), and the number of clusters obtained with DC-DPM, while increasing the dataset size (up to 10 million data points). DC-DPM is run on a cluster of 16 machines. The performance keeps showing the maximum possible accuracy, even with a large number of data points.


Figure 3: Response time of DC-DPM and centralized DPM as a function of the dataset size (20K to 100K data points), in minutes, logarithmic scale.

Figure 4: Response time (minutes) of DC-DPM as a function of the dataset size (2M to 10M data points).

Figure 5: Clustering time as a function of the number of computing nodes (2 to 16) on the synthetic data.

Figure 6: Clustering time as a function of the number of computing nodes (2 to 16) on the image of our use-case.

Figure 7: Visual representation of the synthetic dataset.

Figure 8: Visual representation of the results obtained by DC-DPM.

Table 1: Clustering evaluation criteria obtained with the centralized DPM and with DC-DPM.

                 Centralized DPM                     DC-DPM
  Size    ARI   RSS/(N·σ2²)   Clusters     ARI   RSS/(N·σ2²)   Clusters
  20K     1.00  2.01          10           1.00  2.04          10
  40K     1.00  2.00          10           1.00  2.03          10
  60K     1.00  2.00          10           1.00  2.02          10
  80K     1.00  2.00          10           1.00  2.01          10
  100K    1.00  2.00          10           1.00  2.02          10

Table 2: Clustering evaluation criteria obtained by DC-DPM.

  Size    ARI   RSS/(N·σ2²)   Clusters
  2M      1.00  2.00          102
  4M      1.00  2.00          100
  6M      1.00  2.00          100
  8M      1.00  2.02          99
  10M     1.00  2.10          101

Figure 7 gives a visual representation of our 4M data points synthetic dataset. Each cluster is assigned a color. Our goal is to retrieve these clusters. Figure 8 represents the results obtained by our approach on the data of Figure 7, with 16 nodes. Each cluster is assigned a different color. It shows the performance of our approach, with almost perfect results where the discovered clusters are the same as the actual ones in the data. This is confirmed by Table 2, line 2.

5.5 Use-case

Phenotyping and precision agriculture use more and more information from sensors and drones, such as aerial images, leading to the emerging domain of digital agriculture (see for example http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/h2020/topics/dt-rur-12-2018.html). An important challenge in this context is to be able to distinguish clusters of plants: status (normal, hydric stress, disease, ...) or species, for example. Clustering, applied to images, is key in this domain.

We want to discover clusters in the image presented in Section 1 and transformed as described in Section 5.1. We set σ1² = 1 because our data is in the range [0,1]. For both parameter values σ2² = 0.01 and σ2² = 0.0025, the clusters are extracted in approximately 3 minutes with DC-DPM running in parallel on 16 computing nodes. This is confirmed by Figure 6. The centralized approach does not scale on this data and we could not obtain results.

The number of clusters depends on the value of the error variance. A value of σ2² = 0.01 gave a rough clustering with only K = 3 clusters. Those clusters identified the brightness in the RGB image (see Figure 9). A lower value of σ2² = 0.0025 gave a clustering with K = 12 clusters, which is enough to reconstruct the image (see Figure 10). Depending on the aim of the clustering, different types of wavelength or data must be used for identification.


Figure 9: Clustering of the Durum image by DC-DPM with σ2² = 0.01.

Figure 10: Clustering of the Durum image by DC-DPM with σ2² = 0.0025.

Figure 11: Clustering of a part of the Durum image by DC-DPM (top) and centralized DPM (bottom).

The accuracy of the clustering (the number of clusters) relies on the variance value (σ2²). Clustering is used to detect structures in the data (genetic, population, status) before processing. This group detection allows reducing the data dimension and the bias in further prediction analysis.

DC-DPM was compared to the centralized DPM on a part of the Durum image from the experimental field, with σ2² = 0.0025: DC-DPM found 12 clusters (top of Figure 11) and the centralized DPM found 17 clusters (bottom). The results are quite similar, as shown in Figure 11. The impact of σ2² on the number of clusters varies between the centralized and distributed approaches and may be adjusted by the end-user.

6 CONCLUSION

We proposed DC-DPM, a novel and efficient parallel solution to perform clustering via DPM on millions of data points. We evaluated the performance of our solution on real world and synthetic datasets. The experimental results illustrate the excellent performance of DC-DPM (e.g., a clustering time of less than 30 seconds for 100K data points, while the centralized algorithm needs several hours). The results also illustrate the high quality of our approach, with results that are comparable to those of the centralized version. Overall, the experimental results show that, by using our parallel techniques, the clustering of very large volumes of data can now be done in small execution times, which are impossible to achieve with the centralized DPM approach. A nice perspective of our approach is that it opens fundamental research tracks such as DPM clustering on complex data, e.g. time series.

7 ACKNOWLEDGEMENTS

The research leading to these results has received funds from the European Union's Horizon 2020 Framework Programme for Research and Innovation, under grant agreement No. 732051.

REFERENCES
[1] 2010. Exponential Family. Wiley-Blackwell, Chapter 18, 93-97. https://doi.org/10.1002/9780470627242.ch18
[2] A. Alamsyah and B. Nurriz. 2017. Monte Carlo simulation and clustering for customer segmentation in business organization. In 2017 3rd International Conference on Science and Technology - Computer (ICST). 104-109. https://doi.org/10.1109/ICSTC.2017.8011861
[3] Charles E. Antoniak. 1974. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Ann. Statist. 2, 6 (1974), 1152-1174. https://doi.org/10.1214/aos/1176342871
[4] J. Dean and S. Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (2008), 107-113.
[5] Thibault Debatty, Pietro Michiardi, Wim Mees, and Olivier Thonnard. 2014. Determining the k in k-means with MapReduce. In EDBT/ICDT Workshops. 19-28.
[6] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) (1977), 1-38.
[7] Alina Ene, Sungjin Im, and Benjamin Moseley. 2011. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 681-689.
[8] Michael D. Escobar. 1994. Estimating normal means with a Dirichlet process prior. J. Amer. Statist. Assoc. 89, 425 (1994), 268-277.
[9] Michael D. Escobar and Mike West. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 430 (1995), 577-588.
[10] Yarin Gal and Zoubin Ghahramani. 2014. Pitfalls in the use of parallel inference for the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 208-216.
[11] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis (2nd ed.). Chapman and Hall/CRC.
[12] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. 2011. Parallel Gibbs sampling: From colored fields to thin junction trees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 324-332.
[13] Victoria Hodge and Jim Austin. 2004. A Survey of Outlier Detection Methodologies. Artificial Intelligence Review 22, 2 (2004), 85-126. https://doi.org/10.1023/B:AIRE.0000045502.10941.a9
[14] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
[15] Ajay Jasra, Chris C. Holmes, and David A. Stephens. 2005. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. (2005), 50-67.
[16] Dan Lovell, Ryan P. Adams, and V. K. Mansinghka. 2012. Parallel Markov chain Monte Carlo for Dirichlet process mixtures. In Workshop on Big Learning, NIPS.
[17] Chuang Ma, Hao Helen Zhang, and Xiangfeng Wang. 2014. Machine learning for Big Data analytics in plants. Trends in Plant Science 19, 12 (2014), 798-808.
[18] Jeffrey W. Miller and Matthew T. Harrison. 2014. Inconsistency of Pitman-Yor process mixtures for the number of components. The Journal of Machine Learning Research 15, 1 (2014), 3333-3370.
[19] Radford M. Neal. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2 (2000), 249-265.
[20] David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2009. Distributed algorithms for topic models. Journal of Machine Learning Research 10 (2009), 1801-1828.
[21] I. Ordovás-Pascual and J. Sánchez Almeida. 2014. A fast version of the k-means classification algorithm for astronomical applications. Astronomy & Astrophysics 565 (2014), A53. https://doi.org/10.1051/0004-6361/201423806
[22] Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica (1994), 639-650.
[23] J. Shafer, S. Rixner, and A. L. Cox. 2010. The Hadoop distributed filesystem: Balancing portability and performance. In Int. ISPASS.
[24] Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 11 (2010), 2837-2854. http://dl.acm.org/citation.cfm?id=1756006.1953024
[25] Ruohui Wang and Dahua Lin. 2017. Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 4632-4639. http://dl.acm.org/citation.cfm?id=3171837.3171935
[26] Sinead Williamson, Avinava Dubey, and Eric Xing. 2013. Parallel Markov chain Monte Carlo for nonparametric mixture models. In International Conference on Machine Learning. 98-106.
[27] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. 2010. Spark: Cluster Computing with Working Sets. In HotCloud.