A GPU-Accelerated Algorithm for Self-Organizing Maps in a Distributed Environment

Peter Wittek, Sándor Darányi



In this poster we introduce a MapReduce-based implementation of self-organizing maps that performs compute-bound operations on distributed GPUs. The kernels are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050 cards attached to each, and we achieve a 10x speedup for self-organizing maps over a distributed CPU algorithm.


Summary

This work extends a MapReduce-based distributed variant of self-organizing maps (SOMs) to use graphics hardware available to the individual nodes. The key contributions are:

• GPU-optimized calculations in the map phase, with particular interest in the batch formulation of computing Euclidean distances and ensuring coalesced memory access.

• Benchmarks on a real-life text collection of close to half a terabyte.

Experimental setup

We built a Beowulf cluster of eight nodes using Amazon Web Services (AWS). AWS ensures that cluster instances launched simultaneously are physically close and connected by a high-speed network. A cluster compute GPU instance in AWS has two Intel Xeon X5570 quad-core CPUs and 23 GB of memory, and it is equipped with two NVidia Tesla M2050 cards. One Tesla card has 448 CUDA cores and 3 GB of device memory.

The collection consists of 84,283 PhD theses crawled from the database of the German National Library. The files are in PDF format and the total size is 498 GB. File sizes vary widely, averaging around 6 MB, but many files are over 100 MB. The collection is multilingual, with the majority of documents being in English or German.

We created an inverted index with Lucene, applying PDF text extraction. We did not use stemming or any other language-specific processing step. The merged and optimized index was approximately 11 GB, with a total of 34,965,200 terms. We applied random projection [2] to reduce the number of dimensions to two hundred; a sketch of this step is given at the end of this section.

The baseline distributed, multicore CPU implementation was based on MR-MPI-SOM [6]. We used all sixty-four physical cores across the eight nodes.

We had to use all sixteen GPUs to fit the problem in the distributed device memories. Even then, the tensor describing the SOM at a given time was a serious limiting factor, and we had to limit our experiments to a small 10x10 map in both the CPU and GPU cases.
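To picture the dimensionality reduction step, the following is a minimal numpy sketch that uses a dense Gaussian projection matrix. The function name, the toy matrix sizes, and the Gaussian choice are illustrative assumptions; the actual pipeline followed the random indexing approach of [2], and at 34,965,200 terms the projection matrix would in practice be sparse and built incrementally rather than materialized densely.

```python
import numpy as np

def random_project(X, target_dim=200, seed=0):
    """Reduce the dimensionality of the row vectors in X (n_docs x n_terms)
    by multiplying with a random matrix. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n_terms = X.shape[1]
    # A Gaussian random matrix approximately preserves pairwise distances
    # (Johnson-Lindenstrauss); the scaling keeps expected norms unchanged.
    R = rng.standard_normal((n_terms, target_dim)) / np.sqrt(target_dim)
    return X @ R

# Toy usage: 1,000 documents over a 50,000-term vocabulary reduced to 200 dims.
X = np.random.rand(1000, 50000)
X_200 = random_project(X)          # shape: (1000, 200)
```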

Availability

The source code is available under GPLv3 at https://github.com/peterwittek/mr-mpi-som-gpu

GPU-aware MapReduce

GPMR is the latest attempt to harness the power of GPUs with MapReduce [5]. It can use any number of GPUs in a node and it is also capable of running in a distributed system; both features are provided by MPI. The performance efficiency declines after about 16 GPUs in most of the tests. GPMR does not try to hide the complexity of GPU programming, and it allows full control over the individual GPUs. This is a compromise made to achieve maximum performance. The default MapReduce scheduler moves the data from the GPU to the main memory after each map step, and then pushes it back before reduction (if needed), which might be inefficient in certain applications.

Taking the approach of GPMR further, the MapReduce and GPU parts of an algorithm can be entirely decomposed. This way, for instance, one can use MR-MPI [4], which is a lightweight MapReduce framework built with MPI, or Hadoop [8], which is a more complex framework that also incorporates a distributed filesystem. These frameworks are well suited to high-level load distribution, and GPU code can be confined to the map or reduce jobs. This is the approach we have taken; a schematic sketch follows.
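The decomposition can be pictured with the following schematic sketch. The driver loop and the names map_fn and reduce_fn are hypothetical stand-ins for the MR-MPI (or Hadoop) calls; the point is only that the framework partitions the data and combines results, while all numerically heavy work sits inside the map callback, which is where the GPU is used. A concrete version of what the map step computes for the SOM is given after Equation 1 below.

```python
import numpy as np

def map_fn(chunk):
    """Worker-side callback: all numerically heavy work on this node's chunk
    happens here; in the real implementation this body is CUDA code. The
    placeholder returns a per-node partial sum."""
    return chunk.sum(axis=0)

def reduce_fn(partials):
    """Reduce-side callback: combine the per-node partial results."""
    return sum(partials)

# Hypothetical driver standing in for MR-MPI or Hadoop: the framework splits
# the data into chunks, schedules map_fn on each worker, and feeds the
# collected outputs to reduce_fn; only map_fn needs to touch the GPU.
data = np.random.rand(400, 200)
chunks = np.array_split(data, 4)              # pretend: four worker nodes
partials = [map_fn(c) for c in chunks]        # "map" phase, one call per node
total = reduce_fn(partials)                   # "reduce" phase
assert np.allclose(total, data.sum(axis=0))
```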

GPU-based SOM

A distributed, MapReduce-based SOM builds on the batch formulation of updating the weights [6]:

\[ w_j(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{bj}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{bj}(t')}. \tag{1} \]
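Read as a matrix operation, Equation 1 is a weighted average of the data points, with the neighborhood function supplying the weights. The following is a minimal numpy sketch, under the assumption that H holds $h_{bj}(t')$ in its $(t', j)$ entry and that every neuron receives a nonzero total weight:

```python
import numpy as np

def batch_update(X, H):
    """Batch SOM weight update of Equation 1.
    X: (n_samples, dim) data points x(t').
    H: (n_samples, n_neurons) neighborhood weights h_bj(t').
    Returns the new weight vectors w_j(t_f) as an (n_neurons, dim) array."""
    numerator = H.T @ X                      # sum over t' of h_bj(t') * x(t')
    denominator = H.sum(axis=0)              # sum over t' of h_bj(t')
    return numerator / denominator[:, None]
```

In the distributed setting the numerator and denominator are exactly the quantities that each node computes on its own chunk and that the reduce step sums, as described next.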

The key idea is that before the data set is partitioned by a map call, the current set of weight vectors is broadcast to all worker nodes. The update in Equation 1 is performed locally at each node, the reduce step sums up the updates, and eventually the master node sets the values of the new weight vectors.

We extended the MapReduce SOM algorithm by moving all calculations on the local nodes to the GPU, using the matrix-based Euclidean distance computation and reduction algorithm described below. The chunk of X assigned to the node and the corresponding norms of X are kept in GPU memory between subsequent epochs, and the weight vectors are copied to GPU memory in each step after the node receives the broadcast of the update.

The most time-consuming part of the calculation is finding the best matching neuron. This involves calculating the pairwise distances between the elements of the data set and the weight vectors. Euclidean distance is one of the most common distances used in SOMs. The distance between a data point and a weight vector in the batch formulation is $d(w_j(t_0), x(t)) = \sqrt{\sum_{i=1}^{N} (x_i(t) - w_{ji}(t_0))^2}$, where $N$ is the dimension of the space that embeds the data set, and $t \in \{t_0, \ldots, t_f\}$. Pairwise distances can be computed efficiently on a GPU when all distances are calculated within the same space [1]. The above formula is more general, and much higher performance can be achieved if the entire distance matrix is calculated and the steps are decomposed into matrix-level operations [3, 7].

An important point is that since the norms of X do not change between subsequent epochs, these values can be computed before the main training loop starts.
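The per-node map step can be summarized in the following numpy sketch. It mirrors what the GPU code computes: the full squared Euclidean distance matrix via the decomposition $\|x\|^2 - 2 x^\top w + \|w\|^2$ (the $-2XW^\top$ term is the dense matrix product handled by CUBLAS), the best matching units by a row-wise argmin, and the local numerator and denominator of Equation 1. The Gaussian neighborhood function, the variable names, and the grid layout are assumptions made for the sake of the example; x_norms corresponds to the precomputed norms of X that are cached on the GPU between epochs.

```python
import numpy as np

def map_step(X_chunk, x_norms, W, grid, sigma):
    """One node's map step.
    X_chunk: (n, dim) local data, x_norms: (n,) precomputed ||x||^2,
    W: (m, dim) broadcast weight vectors, grid: (m, 2) neuron coordinates.
    Returns the local numerator and denominator of Equation 1."""
    w_norms = np.sum(W * W, axis=1)              # ||w_j||^2, cheap to recompute
    # Squared distance matrix: ||x||^2 - 2 x.w + ||w||^2 for all pairs;
    # the X_chunk @ W.T term is the dense matrix product done on the GPU.
    D = x_norms[:, None] - 2.0 * (X_chunk @ W.T) + w_norms[None, :]
    bmu = np.argmin(D, axis=1)                   # best matching unit per point
    # Gaussian neighborhood around each BMU on the map grid (an assumption;
    # any standard SOM neighborhood function fits here).
    grid_d2 = np.sum((grid[bmu][:, None, :] - grid[None, :, :]) ** 2, axis=2)
    H = np.exp(-grid_d2 / (2.0 * sigma ** 2))    # (n, m) matrix of h_bj(t')
    numerator = H.T @ X_chunk                    # local sum of h_bj(t') x(t')
    denominator = H.sum(axis=0)                  # local sum of h_bj(t')
    return numerator, denominator
```

The reduce step sums the numerators and denominators over all nodes, and the master node divides the two to obtain the new weight vectors, which are then broadcast again for the next epoch.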

Results

When scaling up, we observed nearly linear scaling, which shows that the overhead of broadcasting the SOM weight vectors is not considerable.

Calculating the norms was performed at 100% occupancy, and the kernel that finds the minimum is also optimal in this respect. The CUBLAS dense matrix multiplication proved to be the weak point, with an occupancy varying between 33% and 83%.

Method                      Execution time   Speedup over CPU
CPU (64 cores)              2042 s           -
GPU (16 Tesla)              433 s            4.71x
CPU (64 cores), one epoch   1882 s           -
GPU (16 Tesla), one epoch   194 s            9.68x

References

[1] D. Chang, N. Jones, D. Li, M. Ouyang, and R. Ragade. Compute pairwise Euclidean distances of data points with GPUs. In Proceedings of CBB-08, International Symposium on Computational Biology and Bioinformatics, pages 278–283, Orlando, FL, USA, November 2008. ACTA Press.
[2] P. Kanerva, J. Kristofersson, and A. Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of CogSci-00, 22nd Annual Conference of the Cognitive Science Society, volume 1036, Philadelphia, PA, USA, 2000.
[3] Q. Li, V. Kecman, and R. Salman. A chunking method for Euclidean distance matrix calculation on large dataset using multi-GPU. In Proceedings of ICMLA-10, 9th International Conference on Machine Learning and Applications, pages 208–213, Washington, DC, USA, December 2010.
[4] S. Plimpton and K. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 2011.
[5] J. Stuart and J. Owens. Multi-GPU MapReduce on GPU clusters. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, Anchorage, AK, USA, May 2011.
[6] S. Sul and A. Tovchigrechko. Parallelizing BLAST and SOM algorithms with MapReduce-MPI library. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, pages 476–483, Anchorage, AK, USA, May 2011.
[7] K. van de Sande, T. Gevers, and C. Snoek. Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1):60–70, 2011.
[8] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
