A GPU-Accelerated Algorithm for Self-Organizing Maps in a Distributed Environment

Peter Wittek, Sándor Darányi



In this poster we introduce a MapReduce-based implementation of self-organizing maps that performs compute-bound operations on distributed GPUs. The kernels are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050 cards attached to each, and we achieve a 10x speedup for self-organizing maps over a distributed CPU algorithm.


Summary

This work extends a MapReduce-based distributed variant of self-organizing maps (SOMs) to use graphics hardware available to the individual nodes. The key contributions are:

• GPU-optimized calculations in the map phase, with particular interest in the batch formulation of computing Euclidean distances and ensuring coalesced memory access.

• Benchmarks on a real-life text collection of close to half a terabyte.

Experimental setup

We built a Beowulf cluster of eight nodes using Amazon Web Services (AWS). AWS ensures that cluster instances launched simultaneously are physically close and connected by a high-speed network. A cluster compute GPU instance in AWS has two Intel Xeon X5570 quad-core CPUs and 23 GB of memory, and it is equipped with two NVidia Tesla M2050 cards. One Tesla card has 448 CUDA cores and 3 GB of device memory.

The collection consists of 84,283 PhD theses crawled from the database of the German National Library. The files are in PDF format and the total size is 498 GB. File sizes vary widely, averaging around 6 MB, but many files are over 100 MB. The collection is multilingual, with the majority of documents being in English or German.

We created an inverted index with Lucene, applying PDF text extraction. We did not use stemming or any other language-specific processing step. The merged and optimized index was approximately 11 GB, with a total of 34,965,200 terms. We applied random projection [2] to reduce the number of dimensions to two hundred; a sketch of this step is given at the end of this section.

The baseline distributed, multicore CPU implementation was based on MR-MPI-SOM [6]. We used all sixty-four physical cores across the eight nodes.

We had to use all sixteen GPUs to fit the problem in the distributed device memories. Even then, the tensor describing the SOM at a given time was a serious limiting factor, and we had to limit our experiments to a small 10x10 map in both the CPU and GPU cases.
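To picture the dimensionality reduction step, the following is a minimal numpy sketch that uses a dense Gaussian projection matrix. The function name, the toy matrix sizes, and the Gaussian choice are illustrative assumptions; the actual pipeline followed the random indexing approach of [2], and at 34,965,200 terms the projection matrix would in practice be sparse and built incrementally rather than materialized densely.

```python
import numpy as np

def random_project(X, target_dim=200, seed=0):
    """Reduce the dimensionality of the row vectors in X (n_docs x n_terms)
    by multiplying with a random matrix. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n_terms = X.shape[1]
    # A Gaussian random matrix approximately preserves pairwise distances
    # (Johnson-Lindenstrauss); the scaling keeps expected norms unchanged.
    R = rng.standard_normal((n_terms, target_dim)) / np.sqrt(target_dim)
    return X @ R

# Toy usage: 1,000 documents over a 50,000-term vocabulary reduced to 200 dims.
X = np.random.rand(1000, 50000)
X_200 = random_project(X)          # shape: (1000, 200)
```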

Availability

The source code is available under GPLv3 at https://github.com/peterwittek/mr-mpi-som-gpu

GPU-aware MapReduce

GPMR is the latest attempt to harness the power of GPUs with MapReduce [5]. It can use any number of GPUs in a node and it is also capable of running in a distributed system; both features are provided by MPI. The performance efficiency declines after about 16 GPUs in most of the tests. GPMR does not try to hide the complexity of GPU programming, and it allows full control over the individual GPUs. This is a compromise made to achieve maximum performance. The default MapReduce scheduler moves the data from the GPU to the main memory after each map step, and then pushes it back before reduction (if needed), which might be inefficient in certain applications.

Taking the approach of GPMR further, the MapReduce and GPU parts of an algorithm can be entirely decomposed. This way, for instance, one can use MR-MPI [4], which is a lightweight MapReduce framework built with MPI, or Hadoop [8], which is a more complex framework that also incorporates a distributed filesystem. These frameworks are well suited to high-level load distribution, and GPU code can be confined to the map or reduce jobs. This is the approach we have taken; a schematic sketch follows.
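The decomposition can be pictured with the following schematic sketch. The driver loop and the names map_fn and reduce_fn are hypothetical stand-ins for the MR-MPI (or Hadoop) calls; the point is only that the framework partitions the data and combines results, while all numerically heavy work sits inside the map callback, which is where the GPU is used. A concrete version of what the map step computes for the SOM is given after Equation 1 below.

```python
import numpy as np

def map_fn(chunk):
    """Worker-side callback: all numerically heavy work on this node's chunk
    happens here; in the real implementation this body is CUDA code. The
    placeholder returns a per-node partial sum."""
    return chunk.sum(axis=0)

def reduce_fn(partials):
    """Reduce-side callback: combine the per-node partial results."""
    return sum(partials)

# Hypothetical driver standing in for MR-MPI or Hadoop: the framework splits
# the data into chunks, schedules map_fn on each worker, and feeds the
# collected outputs to reduce_fn; only map_fn needs to touch the GPU.
data = np.random.rand(400, 200)
chunks = np.array_split(data, 4)              # pretend: four worker nodes
partials = [map_fn(c) for c in chunks]        # "map" phase, one call per node
total = reduce_fn(partials)                   # "reduce" phase
assert np.allclose(total, data.sum(axis=0))
```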

GPU-based SOM

A distributed, MapReduce-based SOM builds on the batch formulation of updating the weights [6]:

\[ w_j(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{bj}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{bj}(t')}. \tag{1} \]
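Read as a matrix operation, Equation 1 is a weighted average of the data points, with the neighborhood function supplying the weights. The following is a minimal numpy sketch, under the assumption that H holds $h_{bj}(t')$ in its $(t', j)$ entry and that every neuron receives a nonzero total weight:

```python
import numpy as np

def batch_update(X, H):
    """Batch SOM weight update of Equation 1.
    X: (n_samples, dim) data points x(t').
    H: (n_samples, n_neurons) neighborhood weights h_bj(t').
    Returns the new weight vectors w_j(t_f) as an (n_neurons, dim) array."""
    numerator = H.T @ X                      # sum over t' of h_bj(t') * x(t')
    denominator = H.sum(axis=0)              # sum over t' of h_bj(t')
    return numerator / denominator[:, None]
```

In the distributed setting the numerator and denominator are exactly the quantities that each node computes on its own chunk and that the reduce step sums, as described next.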

The key idea is that before the data set is partitioned by a map call, the current set of weight vectors is broadcast to all worker nodes. The update in Equation 1 is performed locally at each node, the reduce step sums up the updates, and eventually the master node sets the values of the new weight vectors.

We extended the MapReduce SOM algorithm by moving all calculations on the local nodes to the GPU, using the matrix-based Euclidean distance computation and reduction algorithm described below. The chunk of X assigned to the node and the corresponding norms of X are kept in GPU memory between subsequent epochs, and the weight vectors are copied to GPU memory in each step after the node receives the broadcast of the update.

The most time-consuming part of the calculation is finding the best matching neuron. This involves calculating the pairwise distances between the elements of the data set and the weight vectors. Euclidean distance is one of the most common distances used in SOMs. The distance between a data point and a weight vector in the batch formulation is $d(w_j(t_0), x(t)) = \sqrt{\sum_{i=1}^{N} (x_i(t) - w_{ji}(t_0))^2}$, where $N$ is the dimension of the space that embeds the data set, and $t \in \{t_0, \ldots, t_f\}$. Pairwise distances can be computed efficiently on a GPU when all distances are calculated within the same space [1]. The above formula is more general, and much higher performance can be achieved if the entire distance matrix is calculated and the steps are decomposed into matrix-level operations [3, 7].

An important point is that since the norms of X do not change between subsequent epochs, these values can be computed before the main training loop starts.
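The per-node map step can be summarized in the following numpy sketch. It mirrors what the GPU code computes: the full squared Euclidean distance matrix via the decomposition $\|x\|^2 - 2 x^\top w + \|w\|^2$ (the $-2XW^\top$ term is the dense matrix product handled by CUBLAS), the best matching units by a row-wise argmin, and the local numerator and denominator of Equation 1. The Gaussian neighborhood function, the variable names, and the grid layout are assumptions made for the sake of the example; x_norms corresponds to the precomputed norms of X that are cached on the GPU between epochs.

```python
import numpy as np

def map_step(X_chunk, x_norms, W, grid, sigma):
    """One node's map step.
    X_chunk: (n, dim) local data, x_norms: (n,) precomputed ||x||^2,
    W: (m, dim) broadcast weight vectors, grid: (m, 2) neuron coordinates.
    Returns the local numerator and denominator of Equation 1."""
    w_norms = np.sum(W * W, axis=1)              # ||w_j||^2, cheap to recompute
    # Squared distance matrix: ||x||^2 - 2 x.w + ||w||^2 for all pairs;
    # the X_chunk @ W.T term is the dense matrix product done on the GPU.
    D = x_norms[:, None] - 2.0 * (X_chunk @ W.T) + w_norms[None, :]
    bmu = np.argmin(D, axis=1)                   # best matching unit per point
    # Gaussian neighborhood around each BMU on the map grid (an assumption;
    # any standard SOM neighborhood function fits here).
    grid_d2 = np.sum((grid[bmu][:, None, :] - grid[None, :, :]) ** 2, axis=2)
    H = np.exp(-grid_d2 / (2.0 * sigma ** 2))    # (n, m) matrix of h_bj(t')
    numerator = H.T @ X_chunk                    # local sum of h_bj(t') x(t')
    denominator = H.sum(axis=0)                  # local sum of h_bj(t')
    return numerator, denominator
```

The reduce step sums the numerators and denominators over all nodes, and the master node divides the two to obtain the new weight vectors, which are then broadcast again for the next epoch.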

Results

When scaling up, we observed nearly linear scaling, which shows that the overhead of broadcasting the SOM weight vectors is not considerable.

Calculating the norms was performed at 100% occupancy, and the kernel that finds the minimum is also optimal in this respect. The CUBLAS dense matrix multiplication proved to be the weak point, with an occupancy varying between 33% and 83%.

Method                      Execution time   Speedup over CPU
CPU (64 cores)              2042 s           -
GPU (16 Tesla)              433 s            4.71x
CPU (64 cores), one epoch   1882 s           -
GPU (16 Tesla), one epoch   194 s            9.68x

References

[1] D. Chang, N. Jones, D. Li, M. Ouyang, and R. Ragade. Compute pairwise Euclidean distances of data points with GPUs. In Proceedings of CBB-08, International Symposium on Computational Biology and Bioinformatics, pages 278–283, Orlando, FL, USA, November 2008. ACTA Press.
[2] P. Kanerva, J. Kristofersson, and A. Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of CogSci-00, 22nd Annual Conference of the Cognitive Science Society, volume 1036, Philadelphia, PA, USA, 2000.
[3] Q. Li, V. Kecman, and R. Salman. A chunking method for Euclidean distance matrix calculation on large dataset using multi-GPU. In Proceedings of ICMLA-10, 9th International Conference on Machine Learning and Applications, pages 208–213, Washington, DC, USA, December 2010.
[4] S. Plimpton and K. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 2011.
[5] J. Stuart and J. Owens. Multi-GPU MapReduce on GPU clusters. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, Anchorage, AK, USA, May 2011.
[6] S. Sul and A. Tovchigrechko. Parallelizing BLAST and SOM algorithms with MapReduce-MPI library. In Proceedings of IPDPS-11, 25th International Parallel and Distributed Computing Symposium, pages 476–483, Anchorage, AK, USA, May 2011.
[7] K. van de Sande, T. Gevers, and C. Snoek. Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1):60–70, 2011.
[8] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
