
A GPU-Accelerated Algorithm for Self-Organizing Maps in a Distributed Environment

Peter Wittek and Sándor Darányi∗

Swedish School of Library and Information Science
University of Borås
Borås, Sweden
Email: [email protected], [email protected]

∗ This work was supported by Amazon Web Services.

Abstract. In this paper we introduce a MapReduce-based implementation of self-organizing maps that performs compute-bound operations on distributed GPUs. The kernels are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050 cards attached to each, and we achieve a 10x speedup for self-organizing maps over a distributed CPU algorithm.

1 Introduction

While recent advances in the programmability of GPUs have made graphics hardware more accessible for general-purpose computations, the process of developing efficient GPU implementations is still highly application-dependent. Many applications require significant restructuring of the algorithms to realize high performance on GPUs, and the problem is even more convoluted if the GPUs are distributed across several computing nodes. In data-intensive tasks, MapReduce is a popular paradigm to distribute load and enable quick code development [1]. There have been some attempts to use MapReduce in the compute-intensive workloads that are typical of GPU-accelerated algorithms. We build on these to speed up a demanding neural network method, self-organizing maps [2], in a distributed GPU environment.

2 GPU-aware MapReduce

GPUs excel at compute-bound operations, but it is not always easy to formulate a problem in the data-parallel fashion that is required by lower-level frameworks to program GPUs. The core concept of GPGPU computing is the kernel. A kernel is a unit of code that executes on a graphics device after being launched from the host CPU. The execution is asynchronous in most cases: the host may continue working on other tasks while the kernel is being executed. The graphics device cannot access the main memory; the data has to be copied to the memory of the device through the system bus. This is a costly operation, and if a sufficient level of parallelism cannot be achieved in the kernel, GPU-based computation may decrease the overall performance due to this overhead [3]. Several research efforts have ported MapReduce to GPUs to help develop applications faster on this kind of hardware.
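Before turning to those frameworks, the host/device interaction just described can be made concrete. The following minimal CUDA sketch is our illustration, not code from the implementation presented later: it pays the bus-transfer cost explicitly with cudaMemcpy, launches a trivial kernel asynchronously, and copies the result back.

#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel: each thread squares one element of the array.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 0.5f * i;

    // The device cannot read host memory directly: the data crosses the bus.
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch is asynchronous; the host could queue more work here.
    square<<<(n + 255) / 256, 256>>>(d, n);

    // This copy waits for the pending kernel before transferring back.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[3] = %f\n", h[3]);

    cudaFree(d);
    delete[] h;
    return 0;
}

The frameworks surveyed next wrap exactly this pattern of transfers and kernel launches in a MapReduce interface.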

Mars was the first framework to program a GPU with the MapReduce paradigm [4]. While it did show good potential compared to a CPU-based MapReduce, it does not utilize the GPU efficiently. It has not been updated for a long time and it is not capable of using more than one GPU.

Inspired by Mars, GPMR is the latest attempt to harness the power of GPUs with MapReduce [5]. It can use any number of GPUs in a node and it is also capable of running in a distributed system. Both features are provided by MPI. The performance efficiency declines after about 16 GPUs in most of the tests. GPMR does not try to hide the complexity of GPU programming, and it allows full control over the individual GPUs. It is a compromise made to achieve maximum performance. The default MapReduce scheduler moves the data from the GPU to the main memory after each map step, and then pushes it back before reduction (if needed), which might be inefficient in certain applications.

Taking the approach of GPMR further, the MapReduce and GPU parts of an algorithm can be entirely decomposed. This way, for instance, one can use MR-MPI [6], which is a lightweight MapReduce framework built with MPI, or Hadoop [7], which is a more complex framework that also incorporates a distributed filesystem. These frameworks are good for a high-level load distribution, and one can use GPU code only inside the map or reduce jobs. This is the approach we have taken.

3 Self-organizing maps

The self-organizing map (SOM) training algorithm constructs a nonlinear, topology-preserving mapping of the input data set X = {x(t) | t ∈ {t0, ..., tf}}, where t0 and tf are the beginning and the end of the current epoch, onto a set of neurons M = {n1, ..., nk} of a neural network with associated weight vectors W = {w1(t), ..., wk(t)} at a given time step t. Each data point x(t) is mapped to its best match neuron bm(x(t)) = nb ∈ M such that d(x(t), wb(t)) ≤ d(x(t), wj(t)) ∀ wj(t) ∈ W, where d is the distance on the data set. The neurons are arranged on a two-dimensional map: each neuron i possesses a set of two coordinates embedded in a two-dimensional surface. Next the weight vector of the best match neuron and its neighbours are adjusted toward the input pattern using the following equation: wj(t+1) = wj(t) + α hbj(t) [x(t) − wj(t)], where 0 < α < 1 is the learning factor, and hbj(t) is the neighbourhood function that decreases for neurons further away from the best match neuron in grid coordinates. A frequently used neighbourhood function is the Gaussian: hbj(t) = exp(−||rb − rj|| / δ(t)), where rb and rj stand for the grid coordinates of the respective neurons. The width δ(t) decreases from iteration to iteration to narrow the area of influence. It is assumed that the SOM gives a mapping with minimal, or at least tolerable, topological errors [8]. The training might be repeated on the same data set to increase the fit; a training cycle is referred to as an epoch.


Eventually, the neighbourhood function decreases to an extent that training might stop. The time needed to train an SOM grows linearly with the data set size, and it also grows linearly with the number of neurons in the SOM.
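To ground the update rule above, the following plain host-side C++ sketch performs one online training step: it finds the best matching neuron by brute force and then pulls every weight vector toward the input, weighted by the neighbourhood function. This is our illustration only; names such as som_step are not taken from the paper's code.

#include <cmath>
#include <vector>

// One online update: w_j(t+1) = w_j(t) + alpha * h_bj(t) * (x(t) - w_j(t)).
// W holds k weight vectors of length dim (row-major); grid holds k pairs of
// two-dimensional neuron coordinates.
void som_step(std::vector<float> &W, const std::vector<float> &grid,
              const float *x, int k, int dim, float alpha, float delta) {
    // Find the best matching neuron; squared distances suffice for the argmin.
    int b = 0;
    float best = INFINITY;
    for (int j = 0; j < k; ++j) {
        float d = 0.0f;
        for (int i = 0; i < dim; ++i) {
            float diff = x[i] - W[j * dim + i];
            d += diff * diff;
        }
        if (d < best) { best = d; b = j; }
    }
    // Neighbourhood h_bj = exp(-||r_b - r_j|| / delta) in grid coordinates.
    for (int j = 0; j < k; ++j) {
        float dx = grid[2 * b] - grid[2 * j];
        float dy = grid[2 * b + 1] - grid[2 * j + 1];
        float h = std::exp(-std::sqrt(dx * dx + dy * dy) / delta);
        for (int i = 0; i < dim; ++i)
            W[j * dim + i] += alpha * h * (x[i] - W[j * dim + i]);
    }
}

In a training loop this is called once per data point per epoch, with alpha and delta shrinking over time; the batch formulation of the next section replaces this incremental update with sums that are easy to distribute.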

4 Acceleration of self-organizing maps

A distributed, MapReduce-based SOM builds on the batch formulation of updating the weights [9]:

wj(tf) = [ Σ_{t′=t0}^{tf} hbj(t′) x(t′) ] / [ Σ_{t′=t0}^{tf} hbj(t′) ].   (1)

This implementation builds on MR-MPI. The key idea is that before the data set is partitioned by a map call, the current set of weight vectors is broadcast to all worker nodes. The update in Equation 1 is performed locally at each node, the reduce step sums up the updates, and eventually the master node sets the values of the new weight vectors. The process repeats with subsequent epochs.
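To keep the illustration independent of the MR-MPI API, the same decomposition is sketched below in plain MPI (our sketch, with made-up names such as batch_epoch, not the released code): every rank accumulates the numerator and denominator of Equation 1 over its local chunk, MPI_Allreduce plays the role of the reduce step, and every rank then forms identical new weight vectors, whereas in the paper's version the master sets them.

#include <cmath>
#include <mpi.h>
#include <vector>

// One epoch of the batch update of Equation 1, distributed over MPI ranks.
// X_local holds this rank's chunk of the data; W holds k weight vectors of
// length dim; grid holds k pairs of two-dimensional neuron coordinates.
void batch_epoch(std::vector<float> &W, const std::vector<float> &grid,
                 const std::vector<float> &X_local, int k, int dim,
                 float delta, MPI_Comm comm) {
    // Broadcast the current weight vectors to every worker (rank 0 is the master).
    MPI_Bcast(W.data(), k * dim, MPI_FLOAT, 0, comm);

    std::vector<float> num(k * dim, 0.0f);   // local part of sum_t h_bj(t) x(t)
    std::vector<float> den(k, 0.0f);         // local part of sum_t h_bj(t)
    int n_local = static_cast<int>(X_local.size()) / dim;
    for (int t = 0; t < n_local; ++t) {
        const float *x = &X_local[t * dim];
        // Best matching neuron, as in the earlier sketch.
        int b = 0;
        float best = INFINITY;
        for (int j = 0; j < k; ++j) {
            float d = 0.0f;
            for (int i = 0; i < dim; ++i) {
                float diff = x[i] - W[j * dim + i];
                d += diff * diff;
            }
            if (d < best) { best = d; b = j; }
        }
        // Accumulate h_bj-weighted contributions for every neuron j.
        for (int j = 0; j < k; ++j) {
            float dx = grid[2 * b] - grid[2 * j];
            float dy = grid[2 * b + 1] - grid[2 * j + 1];
            float h = std::exp(-std::sqrt(dx * dx + dy * dy) / delta);
            for (int i = 0; i < dim; ++i) num[j * dim + i] += h * x[i];
            den[j] += h;
        }
    }

    // The reduce step: sum the partial numerators and denominators over all ranks.
    MPI_Allreduce(MPI_IN_PLACE, num.data(), k * dim, MPI_FLOAT, MPI_SUM, comm);
    MPI_Allreduce(MPI_IN_PLACE, den.data(), k, MPI_FLOAT, MPI_SUM, comm);
    for (int j = 0; j < k; ++j)
        if (den[j] > 0.0f)
            for (int i = 0; i < dim; ++i)
                W[j * dim + i] = num[j * dim + i] / den[j];
}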

We extended the MapReduce SOM algorithm by moving all calculations on local nodes to the GPU with the matrix-based Euclidean distance matrix and reduction algorithm described below. The chunk of X assigned to the node and the corresponding norms of X are kept in the GPU memory between subsequent epochs, and the weight vectors are copied to the GPU memory in each step after the node receives the broadcast of the update.

The most time-consuming part of the calculations is finding the best matching neuron. This involves calculating the pairwise distances between the elements of the data set and the weight vectors. Euclidean distance is one of the most common distances used in SOMs. The distance between a data point and a weight vector in the batch formulation can be calculated by d(wj(t0), x(t)) = sqrt( Σ_{i=1}^{N} (xi(t) − wji(t0))² ), where N is the dimension of the space that embeds the data set, and t ∈ {t0, ..., tf}. The pairwise distances can be efficiently computed on a GPU if the distances are calculated on the same space [10]. The above formula is more general, and a much higher performance can be achieved if the entire distance matrix is calculated and the steps are decomposed into matrix-level operations (see Algorithm 1 [11, 12]).

Algorithm 1 Calculate a Euclidean distance matrix with matrix operations (◦ is the Hadamard product)

1: v1 = (X ◦ X)[1, 1, ..., 1]′
2: v2 = (W ◦ W)[1, 1, ..., 1]′
3: P1 = [v1 v1 ... v1]
4: P2 = [v2 v2 ... v2]
5: P3 = XW′
6: D = (P1 + P2 − 2P3)


The algorithm does not calculate the Euclidean distances, but the squares of the distances. Taking the square root is an expensive operation on the GPU, and since we seek the minimum of the distances, we can omit calculating it. The experimental results in [11] have shown a speedup of 15x on data sets which contain more than half a million data points with the above algorithm on the GPU.

An important point to note is that as the norms of X do not change between subsequent epochs, these values can be computed before the main training loop starts. Step 5 is a BLAS matrix multiplication, which can be calculated on the GPU with a high-level CUBLAS call. The CUBLAS library is distributed with CUDA, and it may not be the fastest implementation at a given time, but it gives an optimized performance [13].
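The sketch below is our reconstruction of Algorithm 1 in CUDA C++, not the released source; the kernel and function names are ours. It assumes the chunk X (n x N) and the weights W (k x N) are already in device memory in column-major order, that the squared norms of X are precomputed once per chunk, and it folds steps 3, 4 and 6 into one element-wise kernel after cublasSgemm has computed -2XW′ for step 5, so that D ends up holding the squared distances directly.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Steps 1-2 of Algorithm 1: squared norm of every row of a column-major matrix.
__global__ void row_sq_norms(const float *A, float *norms, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float s = 0.0f;
    for (int c = 0; c < cols; ++c) {       // adjacent threads read adjacent rows:
        float v = A[r + c * rows];         // coalesced access in column-major layout
        s += v * v;
    }
    norms[r] = s;
}

// Steps 3-4 and 6: D(i,j) += ||x_i||^2 + ||w_j||^2, with D already holding -2*X*W'.
__global__ void add_norms(float *D, const float *xnorm, const float *wnorm,
                          int n, int k) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * k) return;
    int i = idx % n;   // data point index (row)
    int j = idx / n;   // neuron index (column)
    D[idx] += xnorm[i] + wnorm[j];
}

// dX: n x N data chunk, dW: k x N weight vectors, both column-major, on the GPU.
// dXnorm is precomputed once per chunk since X does not change between epochs.
void squared_distance_matrix(cublasHandle_t handle, const float *dX, const float *dW,
                             const float *dXnorm, float *dWnorm, float *dD,
                             int n, int k, int N) {
    const int threads = 256;
    row_sq_norms<<<(k + threads - 1) / threads, threads>>>(dW, dWnorm, k, N);

    // Step 5: P3 = X W', computed as D = -2 * X * W' with a single CUBLAS call.
    const float alpha = -2.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, k, N,
                &alpha, dX, n, dW, k, &beta, dD, n);

    add_norms<<<(n * k + threads - 1) / threads, threads>>>(dD, dXnorm, dWnorm, n, k);
}

Because only the minimum of the distances matters, no square root is ever taken, in line with the remark above.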

The other steps have to be implemented carefully to minimize global memory access and avoid bank conflicts in the shared memory of the streaming multiprocessors. This is achieved by using a column-major representation of the matrices. The minimum is found by a multi-step reduction algorithm. Existing GPU-based SOM algorithms take a similar approach in finding the best matching unit, but they do not necessarily use matrix operations to derive the distance matrix [14, 15].
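The multi-step reduction can be sketched as follows (again our illustration, not the paper's kernel): one block scans one row of the squared distance matrix, each thread keeps a running minimum, and the candidates are then reduced in shared memory; the block size is assumed to be a power of two.

#include <cfloat>
#include <cuda_runtime.h>

// One block per data point: find the index of the closest neuron (the BMU).
// D is the n x k squared distance matrix in column-major order.
__global__ void best_matching_unit(const float *D, int *bmu, int n, int k) {
    extern __shared__ float sval[];            // blockDim.x candidate minima
    int *sidx = (int *)&sval[blockDim.x];      // and their neuron indices
    int i = blockIdx.x;

    // Strided scan: each thread keeps the smallest distance it has seen.
    float best = FLT_MAX;
    int best_j = 0;
    for (int j = threadIdx.x; j < k; j += blockDim.x) {
        float d = D[i + j * n];
        if (d < best) { best = d; best_j = j; }
    }
    sval[threadIdx.x] = best;
    sidx[threadIdx.x] = best_j;
    __syncthreads();

    // Tree reduction in shared memory (requires a power-of-two block size).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && sval[threadIdx.x + s] < sval[threadIdx.x]) {
            sval[threadIdx.x] = sval[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) bmu[i] = sidx[0];
}

// Launch example:
// best_matching_unit<<<n, 128, 128 * (sizeof(float) + sizeof(int))>>>(dD, dBmu, n, k);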

5 Discussion of experimental results

5.1 Cluster Configuration

We built a Beowulf cluster of eight nodes using Amazon Web Services (AWS). AWS ensures that cluster instances that are launched simultaneously are physically close and connected with a high-speed network. A cluster compute GPU instance in AWS has two Intel Xeon X5570 quad-core CPUs and 23 GB of memory, and it is equipped with two NVidia Tesla M2050 cards. One Tesla card has 448 CUDA cores and 3 GByte of device memory.

The implementation of the algorithms¹, except for creating the inverted index, used OpenMPI 1.5, the June 2011 version of MR-MPI, and CUDA 4.0. The inverted index was created using Lucene 3.4 and was not timed.

5.2 Data set

The collection consists of 84,283 PhD theses crawled from the database of the German National Library. The files are in PDF format and the total size is 498 GByte. File sizes vary widely, averaging around 6 MByte, but many files are over 100 MByte. The collection is multilingual, with the majority of documents being in English or German.

We created an inverted index with Lucene, applying PDF extraction. We did not use stemming or any other language-specific processing step. The merged and optimized index was approximately 11 GByte, with a total of 34,965,200 terms. We applied random projection [16] to reduce the number of dimensions to two hundred.

¹ The source code is available online: https://github.com/peterwittek/mr-mpi-som-gpu
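As an illustration of the random projection step mentioned above (our sketch only; the paper does not detail this part of the pipeline, and all names are made up), a random-indexing-style projection assigns every term a fixed sparse +/-1 signature over the 200 target dimensions and sums the signatures of a document's terms, weighted by their frequencies.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Project a sparse document vector (term id -> weight) into target_dim dimensions.
// Each term id is hashed deterministically to a few +/-1 positions, so the same
// term always maps to the same sparse signature without storing a matrix.
std::vector<float> random_project(const std::unordered_map<uint64_t, float> &doc,
                                  int target_dim, int nonzeros_per_term = 8) {
    std::vector<float> out(target_dim, 0.0f);
    for (const auto &term : doc) {
        uint64_t h = term.first;
        for (int s = 0; s < nonzeros_per_term; ++s) {
            // Cheap deterministic hash chain (LCG constants from Knuth's MMIX).
            h = h * 6364136223846793005ULL + 1442695040888963407ULL;
            int pos = static_cast<int>(h % static_cast<uint64_t>(target_dim));
            float sign = ((h >> 32) & 1ULL) ? 1.0f : -1.0f;
            out[pos] += sign * term.second;
        }
    }
    return out;
}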


Method                       Execution time   Speedup over CPU
CPU (64 cores)               2042 s           -
GPU (16 Tesla)               433 s            4.71x
CPU (64 cores), one epoch    1882 s           -
GPU (16 Tesla), one epoch    194 s            9.68x

Table 1: Execution time of self-organizing maps

5.3 Running time of self-organizing maps

The baseline distributed, multicore CPU implementation was based on MR-MPI-SOM [9]. We used all sixty-four physical cores across the eight nodes. When scaling up, we noticed nearly linear scaling, which shows that the overhead of broadcasting the SOM weight vectors is not considerable.

We had to use all sixteen GPUs to fit the problem in the distributed device memories. Even then, the tensor describing the SOM at a given time was a serious limiting factor, and we had to limit our experiments to a small 10x10 map in both the CPU and GPU cases. The bulk of the memory is taken up by the dense matrix chunk that is assigned to a GPU. Since such a tiny map does not show emergent clustering features very well, in the future we intend to work on a variant of random projection that keeps the resulting structure sparse.

The total wall time of the CPU variant was 2041.71 seconds (Table 1). The wall time of the GPU implementation was 433.25 seconds, including initializing the CUDA context and the memory transfers before and after the SOM training. This translates to a speedup of 4.71x. Considering that this is a speedup over sixty-four CPU cores, we believe that this is a significant result.

Turning our attention to the execution time of one epoch, a CPU core finished its chunk in 1881.71 seconds. The GPU variant took 194.54 seconds, which means a speedup of 9.68x. Note that we only calculated the best matching units on the GPU, so there is room for further improvement.

Looking at the GPU occupancy results, we were not far from fully loading the stream processors. Steps 1 and 2 of Algorithm 1, that is, calculating the norms, were performed with 100% occupancy. Finding the minimum is also optimal. The CUBLAS dense matrix multiplication proved to be the weak point, with an occupancy varying between 33% and 83%.

6 Conclusions

MapReduce jobs are fairly easy to develop, and a wide range of machine learning algorithms has already been adapted to this paradigm, hence developers of data mining applications will find it convenient to use. By rephrasing the computational problems underlying self-organizing maps, we showed that it is feasible to leverage GPUs in a distributed system, and we achieved a speedup of close to 10x.

References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI-04, 6th International Symposium on Operating Systems Design & Implementation, San Francisco, CA, USA, December 2004.

[2] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a massive text document collection. IEEE Transactions on Neural Networks, 11(3):574–585, 2000.

[3] NVidia Compute Unified Device Architecture C Best Practices Guide 4.0, 2011.

[4] B. He, W. Fang, Q. Luo, N.K. Govindaraju, and T. Wang. Mars: A MapReduce framework on graphics processors. In Proceedings of PACT-08, 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260–269, Toronto, ON, Canada, October 2008.

[5] J.A. Stuart and J.D. Owens. Multi-GPU MapReduce on GPU clusters. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, Anchorage, AK, USA, May 2011.

[6] S.J. Plimpton and K.D. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 2011.

[7] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[8] T. Kohonen. Self-Organizing Maps. 2001.

[9] S.J. Sul and A. Tovchigrechko. Parallelizing BLAST and SOM algorithms with MapReduce-MPI library. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, pages 476–483, Anchorage, AK, USA, May 2011.

[10] D. Chang, N.A. Jones, D. Li, M. Ouyang, and R.K. Ragade. Compute pairwise Euclidean distances of data points with GPUs. In Proceedings of CBB-08, International Symposium on Computational Biology and Bioinformatics, pages 278–283, Orlando, FL, USA, November 2008. ACTA Press.

[11] Q. Li, V. Kecman, and R. Salman. A chunking method for Euclidean distance matrix calculation on large dataset using multi-GPU. In Proceedings of ICMLA-10, 9th International Conference on Machine Learning and Applications, pages 208–213, Washington, DC, USA, December 2010.

[12] K.E.A. van de Sande, T. Gevers, and C.G.M. Snoek. Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1):60–70, 2011.

[13] S. Barrachina, M. Castillo, F.D. Igual, R. Mayo, and E.S. Quintana-Orti. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Proceedings of IPDPS-08, 22nd International Symposium on Parallel and Distributed Processing, pages 1–8, Miami, FL, USA, April 2008.

[14] K.S. Oh and K. Jung. GPU implementation of neural networks. Pattern Recognition, 37(6):1311–1314, 2004.

[15] G. Strong and M. Gong. Browsing a large collection of community photos based on similarity on GPU. Advances in Visual Computing, pages 390–399, 2008.

[16] P. Kanerva, J. Kristofersson, and A. Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of CogSci-00, 22nd Annual Conference of the Cognitive Science Society, volume 1036, Philadelphia, PA, USA, 2000.
