Parallelizing Applications With a Reduction Based
Framework on Multi-Core Clusters
Venkatram Ramanathan
1
Outline
Motivation
Evolution of multi-core machines and the challenges
Summary of contributions
Background: MapReduce and FREERIDE
Wavelet Transform on FREERIDE
Co-clustering on FREERIDE
Conclusion
2
Motivation - Evolution of Multi-Core Machines
Performance increase: more cores at lower clock frequencies
Cost-effective scalability of performance
HPC environments - clusters of multi-cores
3
Challenges
Multi-level parallelism
Within cores in a node - shared memory parallelism (Pthreads, OpenMP)
Across nodes - distributed memory parallelism (MPI)
Achieving both programmability and performance - the major challenge
4
Challenges
Possible solution: use higher-level/restricted APIs
Reduction-based APIs
Map-Reduce: a higher-level API; program a cluster of multi-cores with one API; expressive power considered limited
Expressing computations using reduction-based APIs
5
Summary of Contributions
Two algorithms, Wavelet Transform and Co-Clustering,
expressed as reduction structures and parallelized on FREERIDE
Speedup of 42 on 64 cores for Wavelet Transform
Speedup of 21 on 32 cores for Co-clustering
6
Background
MapReduce
Map(in_key, in_value) -> list(out_key, intermediate_value)
Reduce(out_key, list(intermediate_value)) -> list(out_value)
FREERIDE
Users explicitly declare a Reduction Object and update it
Map and Reduce steps combined: each data element is processed and reduced before the next element is processed
7
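The difference between the two APIs can be illustrated with a small word-count sketch of the FREERIDE-style loop (a sketch only; the class and function names are illustrative, not FREERIDE's actual C++ API): each element is folded into an explicitly declared reduction object immediately, instead of emitting (key, value) pairs for a separate reduce phase.

```python
class WordCount:
    def __init__(self):
        self.reduction_object = {}  # explicitly declared by the user

    def process(self, element):
        # Map and reduce combined: each element updates the
        # reduction object in place before the next is processed.
        for word in element.split():
            self.reduction_object[word] = self.reduction_object.get(word, 0) + 1

def run(chunks):
    # One WordCount per simulated node/thread, each with a local copy.
    locals_ = [WordCount() for _ in chunks]
    for wc, chunk in zip(locals_, chunks):
        for element in chunk:
            wc.process(element)
    # Global combination: merge the per-node reduction objects.
    final = {}
    for wc in locals_:
        for k, v in wc.reduction_object.items():
            final[k] = final.get(k, 0) + v
    return final
```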
MapReduce and FREERIDE: Comparison
8
One Iteration of Reduction Loop on FREERIDE
9
Wavelet Transform On FREERIDE- Motivation
Wavelet Transform - an important tool in medical imaging
fMRI - a probing mechanism for brain activation
Seeks to study behavior across spatio-temporal data
10
Introduction To Wavelet Transform
Discrete Wavelet Transform
Defined for input with 2^n numbers
Convolution along the time domain yields 2^n output values
Steps:
Pair up input values
Store the differences
Pass the sums
Repeat until there are 2^n - 1 differences and 1 sum
11
Introduction To Wavelet Transform
Serial Wavelet Transform algorithm
Input: a1, a2, a3, a4, a5, a6, a7, a8
Output: a1-a2, a3-a4, a5-a6, a7-a8,
a1+a2-a3-a4, a5+a6-a7-a8,
a1+a2+a3+a4-a5-a6-a7-a8,
a1+a2+a3+a4+a5+a6+a7+a8
12
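The pairwise difference-and-sum pyramid above can be sketched in a few lines (an unnormalized Haar-style transform; the function name is illustrative):

```python
def wavelet_transform(a):
    # Pair up values, store the differences, pass the sums to the
    # next level; repeat until one sum remains.
    out = []
    while len(a) > 1:
        out += [a[i] - a[i + 1] for i in range(0, len(a), 2)]
        a = [a[i] + a[i + 1] for i in range(0, len(a), 2)]
    return out + a  # 2^n - 1 differences followed by the final sum
```

For input a1..a8 this reproduces exactly the output order shown on the slide.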
Parallelization Overview
Time series length = T; number of nodes = P; time series per node = T/P
If P is a power of 2:
T/P - 1 final output values are calculated locally on each node,
so T - P final values are produced without communication
The remaining P values require inter-process communication
Allocate a reduction object of size P on each node
Each node updates the reduction object with its contribution
After a global reduction, the last P values can be calculated
Since the output is out of order, the index in the output where
each final value needs to go can be calculated
13
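The scheme above can be simulated in a single process (a sketch only; the nodes, `local_transform`, and the size-P reduction object are simulated here, not FREERIDE's API):

```python
def local_transform(a):
    # Same pairwise difference/sum pyramid as the serial algorithm.
    out = []
    while len(a) > 1:
        out += [a[i] - a[i + 1] for i in range(0, len(a), 2)]
        a = [a[i] + a[i + 1] for i in range(0, len(a), 2)]
    return out + a

def parallel_wavelet(data, P):
    # Each simulated node keeps its T/P - 1 local final values and puts
    # its chunk sum into a reduction object of size P; a global reduction
    # over those P sums yields the remaining P output values.
    n = len(data) // P
    reduction_obj = [0] * P
    local_out = []
    for node in range(P):
        vals = local_transform(data[node * n:(node + 1) * n])
        local_out += vals[:-1]          # computed without communication
        reduction_obj[node] = vals[-1]  # this node's contribution
    return local_out + local_transform(reduction_obj)
```

The result contains the same values as the serial transform, but out of order, which is why the index mapping on the next slides is needed.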
Shared/Distributed Memory Parallelization
14
Hybrid Parallelization
Input data distributed among nodes; threads share data
Size of reduction object = #Threads x #Nodes
Each thread computes its local final values
Updates the reduction object at offset ThreadID + (#Threads x NodeID)
Global combination: calculate the last #Threads x #Nodes values from
the data in the reduction object
15
Hybrid Parallelization
16
Hybrid Parallelization – An Optimization
Computation of the last #Threads x #Nodes values is parallelized:
a local reduction step, then a global reduction step over a global array
Size of reduction object:
Local reduction step: #Threads
Global reduction step: #Nodes
17
Hybrid Parallelization – An Optimization
18
Generation of final output index
Index when iteration I = 0
Index when iteration I > 0
The term is the local index of the value calculated in the current iteration
ChunkID is ThreadID + (NodeID x #Threads); I is the current iteration
19
Experimental Results
Experimental setup: cluster of multi-core machines
Intel Xeon CPU E5345, quad core, clock frequency 2.33 GHz, main memory 6 GB
Datasets: varying p, the dimension of the spatial cube, and s, the time-steps in the time series
p = 10; s = 262144 (DS1)
p = 32; s = 2048 (DS2)
p = 32; s = 4096 (DS3)
p = 32; s = 8192 (DS4)
p = 39; s = 8192 (DS5)
20
Experimental Results
21
Experimental Results
22
Experimental Results
23
Co-Clustering on FREERIDE
Clustering - grouping together of "similar" objects
Hard clustering - each object belongs to a single cluster
Soft clustering - each object is probabilistically assigned to clusters
24
Co-clustering of Text data
Co-clustering clusters both words and documents simultaneously
25
Co-clustering
Involves simultaneous clustering of rows into row clusters and columns into column clusters
Maximizes mutual information; uses Kullback-Leibler divergence:
KL(p, q) = sum over x of p(x) log(p(x) / q(x))
26
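The KL divergence formula above is straightforward to compute directly (a minimal sketch; the function name is illustrative):

```python
from math import log

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0
    # contribute nothing, since t * log(t) -> 0 as t -> 0.
    return sum(px * log(px / qx) for px, qx in zip(p, q) if px > 0)
```

Note that KL divergence is zero only when the two distributions are identical, and is asymmetric in p and q.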
Overview of Co-clustering Algorithm – Preprocessing
27
Overview of Co-clustering Algorithm – Iterative Procedure
28
Parallelizing Co-clustering on FREERIDE
Input matrix and its transpose pre-computed
Input matrix and transpose divided into files and distributed among nodes
Each node gets the same amount of row and column data
rowCL and colCL replicated on all nodes
Initial clustering in round-robin fashion for consistency across nodes
29
Parallelizing Preprocess Step
In preprocessing, pX and pY are normalized by the total sum,
so normalization must wait until all nodes have processed their data
Each node calculates pX and pY with local data
The reduction object is updated with the partial sum and the pX and pY values
The accumulated partial sums give the total sum; pX and pY are then normalized
xnorm and ynorm are calculated in a second iteration, as they need the total sum
30
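The partial-sum scheme above can be sketched as a single-process simulation (illustrative only; the per-node loop and names stand in for FREERIDE's reduction object, and pX/pY are taken as the row and column marginals of the matrix):

```python
def preprocess(chunks):
    # Each simulated node computes unnormalized pX (row sums) and
    # pY (column sums) over its local rows; the accumulated partial
    # sums give the total, and normalization happens only afterwards.
    ncols = len(chunks[0][0])
    pX, pY, total = [], [0.0] * ncols, 0.0
    for rows in chunks:                 # one pass per node
        for row in rows:
            s = sum(row)
            pX.append(s)                # partial row marginal
            total += s                  # partial total sum
            for j, v in enumerate(row):
                pY[j] += v              # partial column marginal
    pX = [x / total for x in pX]        # normalize after accumulation
    pY = [y / total for y in pY]
    return pX, pY
```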
Parallelizing Preprocess Step
A compressed matrix of size #rowclusters x #colclusters is calculated with local data:
the sum of the values of each row cluster across each column cluster
The final compressed matrix is the sum of the local compressed matrices
Local compressed matrices are updated in the reduction object,
producing the final compressed matrix on accumulation
Cluster centroids are then calculated
31
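The local compressed-matrix computation above amounts to summing each entry into its (row cluster, column cluster) cell (a sketch; names are illustrative):

```python
def compressed_matrix(M, row_cl, col_cl, k, l):
    # Build the local k x l compressed matrix: each entry of M is added
    # to the cell for its row cluster and column cluster. Per-node copies
    # of this matrix are summed via the reduction object to get the
    # final compressed matrix.
    C = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(M):
        for j, v in enumerate(row):
            C[row_cl[i]][col_cl[j]] += v
    return C
```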
Parallelizing Iterative Procedure
Reassign row clustering, determined by Kullback-Leibler divergence
Reduction object updated
Compute compressed matrix; update reduction object
Column clustering - similar
Objective function - finalize; next iteration
32
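The row-reassignment step can be sketched as follows. This is a simplified stand-in: each row moves to the row cluster whose aggregated, normalized column profile is nearest in KL divergence, whereas the actual algorithm compares against distributions derived from the compressed matrix.

```python
from math import log

def kl(p, q):
    # D_KL(p || q); zero-probability terms in p contribute nothing.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reassign_rows(M, row_cl, k):
    # Build one prototype distribution per row cluster by summing its
    # rows and normalizing; then assign each row to the cluster with
    # minimum KL divergence from the row's own distribution.
    ncols = len(M[0])
    proto = [[1e-12] * ncols for _ in range(k)]  # smoothed to avoid log(0)
    for i, row in enumerate(M):
        for j, v in enumerate(row):
            proto[row_cl[i]][j] += v
    proto = [[v / sum(p) for v in p] for p in proto]
    new_cl = []
    for row in M:
        s = sum(row)
        dist = [v / s for v in row]
        new_cl.append(min(range(k), key=lambda c: kl(dist, proto[c])))
    return new_cl
```

On FREERIDE, the resulting assignments and the recomputed compressed matrix would be accumulated through the reduction object, as the slide describes.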
Parallelizing Co-clustering on FREERIDE
33
Parallelizing Iterative Procedure
34
Experimental Results
Algorithm is the same for shared memory, distributed memory, and hybrid parallelization
Experiments conducted on 2 clusters:
env1: Intel Xeon E5345 quad core, clock frequency 2.33 GHz, main memory 6 GB, 8 nodes
env2: AMD Opteron 8350 CPU, 8 cores, main memory 16 GB, 4 nodes
35
Experimental Results
2 datasets:
1 GB dataset - matrix dimensions 16k x 16k
4 GB dataset - matrix dimensions 32k x 32k
Datasets and their transposes split into 32 files each (row partitioning) and distributed among nodes
Number of row and column clusters: 4
36
Experimental Results
37
Experimental Results
38
Experimental Results
39
Experimental Results
Preprocessing stage is the bottleneck for the smaller dataset - not compute intensive
Speedup with preprocessing: 12.17; speedup without preprocessing: 18.75
Preprocessing stage scales well for the larger dataset - more computation
Speedup is the same with and without preprocessing
Speedup for the larger dataset: 20.7
40
Conclusion
Parallelized two data-intensive applications, Wavelet Transform and Co-clustering, by
representing the algorithms as generalized reduction structures and
implementing them on FREERIDE
Wavelet Transform - speedup of 42 on 64 cores
Co-clustering - speedup of 21 on 32 cores
41
Thank You!
42