Parallelizing Applications With a Reduction Based Framework on Multi-Core Clusters

Venkatram Ramanathan


TRANSCRIPT

Page 1: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Applications With a Reduction Based

Framework on Multi-Core Clusters

Venkatram Ramanathan

1

Page 2: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Outline

- Motivation: Evolution of Multi-Core Machines and the challenges
- Summary of Contributions
- Background: MapReduce and FREERIDE
- Wavelet Transform on FREERIDE
- Co-clustering on FREERIDE
- Conclusion

2

Page 3: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Motivation: Evolution of Multi-Core Machines

- Performance increase: growing numbers of cores at lower clock frequencies
- Cost-effective scalability of performance
- HPC environments: clusters of multi-cores

3

Page 4: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Challenges

- Multi-level parallelism:
  - Within the cores of a node: shared-memory parallelism (Pthreads, OpenMP)
  - Across nodes: distributed-memory parallelism (MPI)
- Achieving both programmability and performance is the major challenge

4

Page 5: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Challenges

Possible solution: use higher-level / restricted, reduction-based APIs

- MapReduce: a higher-level API that lets one program a cluster of multi-cores with a single API, though its expressive power is considered limited
- Expressing computations using reduction-based APIs

5

Page 6: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Summary of Contributions

- Two algorithms, Wavelet Transform and Co-Clustering, expressed as reduction structures and parallelized on FREERIDE
- Speedup of 42 on 64 cores for Wavelet Transform
- Speedup of 21 on 32 cores for Co-clustering

6

Page 7: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Background

MapReduce:
- Map(in_key, in_value) -> list(out_key, intermediate_value)
- Reduce(out_key, list(intermediate_value)) -> list(out_value)

FREERIDE:
- Users explicitly declare a Reduction Object and update it
- Map and Reduce steps are combined
- Each data element is processed and reduced before the next element is processed

7
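As a rough illustration of this processing model (a sketch only: FREERIDE itself is a C++ framework, and the names `ReductionObject`, `update`, and `global_reduce` below are hypothetical, not the actual API), each data element folds into a reduction object as it is read, and per-node objects are merged in a global combination step:

```python
# Illustrative sketch of the FREERIDE processing model: each data
# element updates a user-declared reduction object as it is read,
# combining the map and reduce steps. Names here are hypothetical,
# not the real FREERIDE API.

class ReductionObject:
    def __init__(self, size):
        self.values = [0] * size

    def update(self, index, contribution):
        # Local reduction: fold each element in as it is processed.
        self.values[index] += contribution

def global_reduce(objects):
    # Global combination: element-wise merge of per-node objects.
    merged = ReductionObject(len(objects[0].values))
    for obj in objects:
        for i, v in enumerate(obj.values):
            merged.values[i] += v
    return merged
```

The key contrast with plain MapReduce is that no list of intermediate (key, value) pairs is materialized: each element is reduced immediately into the shared object.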

Page 8: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

MapReduce and FREERIDE: Comparison

8

Page 9: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

One Iteration of Reduction Loop on FREERIDE

9

Page 10: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Wavelet Transform On FREERIDE- Motivation

- Wavelet Transform is an important tool in medical imaging
- fMRI is a probing mechanism for brain activation
- Seeks to study behavior across spatio-temporal data

10

Page 11: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Introduction To Wavelet Transform

Discrete Wavelet Transform

- Defined for input having 2^n numbers
- Convolution along the time domain results in 2^n output values
- Steps:
  - Pair up input values
  - Store the difference
  - Pass the sum
  - Repeat until there are 2^n - 1 differences and 1 sum

11

Page 12: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Introduction To Wavelet Transform

Serial Wavelet Transform Algorithm

Input: a1, a2, a3, a4, a5, a6, a7, a8
Output: a1-a2, a3-a4, a5-a6, a7-a8, a1+a2-a3-a4, a5+a6-a7-a8, a1+a2+a3+a4-a5-a6-a7-a8, a1+a2+a3+a4+a5+a6+a7+a8

12
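The pair-up / store-difference / pass-sum procedure above can be sketched as a short serial routine (an illustrative implementation of the algorithm from the slides; `wavelet_transform` is a name chosen here, not part of FREERIDE):

```python
def wavelet_transform(a):
    # Serial transform from the slides: pair up input values,
    # store the differences, pass the sums up, and repeat until
    # 2^n - 1 differences and 1 sum remain (input length is 2^n).
    out = []
    level = list(a)
    while len(level) > 1:
        sums = []
        for i in range(0, len(level), 2):
            out.append(level[i] - level[i + 1])   # store difference
            sums.append(level[i] + level[i + 1])  # pass the sum
        level = sums
    out.append(level[0])  # the final overall sum
    return out
```

For a1..a8 = 1..8 this produces exactly the pattern listed above: the four pairwise differences, then the two second-level differences, then a1+a2+a3+a4-a5-a6-a7-a8, then the overall sum.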

Page 13: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelization Overview

Time series length = T; number of nodes = P; time series per node = T/P

If P is a power of 2:
- T/P - 1 final output values can be calculated locally on each node, so T - P final values are produced without communication
- The remaining P values require inter-process communication:
  - Allocate a reduction object of size P on each node
  - Each node updates the reduction object with its contribution
  - After global reduction, the last P values can be calculated
- Since the output is out of order, the index on the output where each final value needs to go can be calculated

13

Page 14: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Shared/Distributed Memory Parallelization

14

Page 15: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Hybrid Parallelization

- Input data is distributed among nodes; threads within a node share data
- Size of reduction object: #Threads x #Nodes
- Each thread computes local final values and updates the reduction object at index ThreadID + (#Threads x NodeID)
- After global combination, the last #Threads x #Nodes values are calculated from the data in the reduction object

15
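The indexing scheme above can be sketched as follows (illustrative Python, not FREERIDE code; `chunk_id` and `hybrid_reduce` are names invented here): each (node, thread) pair writes its contribution into its own slot of a reduction object of size #Threads x #Nodes.

```python
# Sketch of the hybrid scheme's reduction-object indexing
# (illustrative only). Each (node, thread) pair owns one slot
# of a reduction object of size num_threads * num_nodes.

def chunk_id(thread_id, node_id, num_threads):
    # Slot updated by a given thread on a given node:
    # ThreadID + (#Threads x NodeID).
    return thread_id + num_threads * node_id

def hybrid_reduce(partials, num_threads, num_nodes):
    # partials[node_id][thread_id] holds each thread's local
    # contribution; place every contribution in its own slot.
    obj = [0] * (num_threads * num_nodes)
    for node_id in range(num_nodes):
        for thread_id in range(num_threads):
            slot = chunk_id(thread_id, node_id, num_threads)
            obj[slot] = partials[node_id][thread_id]
    return obj
```

Because every (node, thread) pair maps to a distinct slot, threads never contend for the same entry, and the global combination can simply merge the per-node objects.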

Page 16: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Hybrid Parallelization

16

Page 17: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Hybrid Parallelization – An Optimization

- Computation of the last #Threads x #Nodes values is itself parallelized, into a local reduction step and a global reduction step over a global array
- Size of reduction object: #Threads in the local reduction step, #Nodes in the global reduction step

17

Page 18: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Hybrid Parallelization – An Optimization

18

Page 19: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Generation of final output index

- [Index formulas for iteration I = 0 and iteration I > 0 were given on the slide]
- The term in them is the local index of the value calculated in the current iteration
- ChunkID is ThreadID + (NodeID x #Threads); I is the current iteration

19

Page 20: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

Experimental setup: cluster of multi-core machines
- Intel Xeon CPU E5345 (quad core), clock frequency 2.33 GHz, main memory 6 GB

Datasets: varying p, the dimension of the spatial cube, and s, the time steps in the time series
- DS1: p = 10, s = 262144
- DS2: p = 32, s = 2048
- DS3: p = 32, s = 4096
- DS4: p = 32, s = 8192
- DS5: p = 39, s = 8192

20

Page 21: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

21

Page 22: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

22

Page 23: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

23

Page 24: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Co-Clustering on FREERIDE

Clustering: grouping together of "similar" objects
- Hard clustering: each object belongs to a single cluster
- Soft clustering: each object is probabilistically assigned to clusters

24

Page 25: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Co-clustering of Text data

Co-clustering clusters both words and documents simultaneously

25

Page 26: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Co-clustering

Involves simultaneous clustering of rows into row clusters and columns into column clusters

- Maximizes mutual information
- Uses the Kullback-Leibler divergence:

  KL(p, q) = Σ_x p(x) log( p(x) / q(x) )

26
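As a small sketch of the divergence used here (plain Python, assuming discrete distributions given as equal-length lists of probabilities; `kl_divergence` is an illustrative name):

```python
import math

def kl_divergence(p, q):
    # KL(p, q) = sum over x of p(x) * log(p(x) / q(x));
    # terms with p(x) = 0 contribute 0 by convention.
    return sum(px * math.log(px / qx)
               for px, qx in zip(p, q) if px > 0)
```

It is 0 when the two distributions are identical and positive otherwise, which is what makes it usable as the reassignment criterion in the iterative procedure described later.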

Page 27: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Overview of Co-clustering Algorithm – Preprocessing

27

Page 28: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Overview of Co-clustering Algorithm – Iterative Procedure

28

Page 29: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Co-clustering on FREERIDE

- The input matrix and its transpose are pre-computed, divided into files, and distributed among nodes
- Each node gets the same amount of row and column data
- rowCL and colCL (the row and column cluster assignments) are replicated on all nodes
- Initial clustering is assigned in round-robin fashion for consistency across nodes

29

Page 30: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Preprocess Step

- In preprocessing, pX and pY are normalized by the total sum, so each node must wait until all nodes have processed their data before normalizing
- Each node calculates pX and pY with its local data
- The reduction object is updated with the partial sum and the pX and pY values
- The accumulated partial sums give the total sum, and pX and pY are normalized
- xnorm and ynorm are calculated in a second iteration, as they need the total sum

30

Page 31: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Preprocess Step

- A compressed matrix of size #rowclusters x #colclusters is calculated with local data: the sum of the values of each row cluster across each column cluster
- The final compressed matrix is the sum of the local compressed matrices
- Local compressed matrices are updated in the reduction object, which produces the final compressed matrix on accumulation
- Cluster centroids are then calculated

31

Page 32: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Iterative Procedure

- Reassign row clusters, as determined by the Kullback-Leibler divergence, and update the reduction object
- Compute the compressed matrix and update the reduction object
- Column clustering is handled similarly
- Finalize the objective function, then proceed to the next iteration

32

Page 33: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Co-clustering on FREERIDE

33

Page 34: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Parallelizing Iterative Procedure

34

Page 35: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

The algorithm is the same for shared-memory, distributed-memory, and hybrid parallelization.

Experiments were conducted on 2 clusters:
- env1: Intel Xeon E5345 (quad core), clock frequency 2.33 GHz, main memory 6 GB, 8 nodes
- env2: AMD Opteron 8350 (8 cores), main memory 16 GB, 4 nodes

35

Page 36: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

2 datasets:
- 1 GB dataset: matrix dimensions 16k x 16k
- 4 GB dataset: matrix dimensions 32k x 32k

The datasets and their transposes are split into 32 files each (row partitioning) and distributed among nodes.

Number of row and column clusters: 4

36

Page 37: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

37

Page 38: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

38

Page 39: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

39

Page 40: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Experimental Results

- The preprocessing stage is the bottleneck for the smaller dataset, since it is not compute intensive: speedup is 12.17 with preprocessing and 18.75 without
- The preprocessing stage scales well for the larger dataset, which involves more computation: speedup is the same with and without preprocessing
- Speedup for the larger dataset: 20.7

40

Page 41: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Conclusion

Parallelized two data-intensive applications, Wavelet Transform and Co-clustering, by:
- Representing the algorithms as generalized reduction structures
- Implementing them on FREERIDE

Wavelet Transform achieved a speedup of 42 on 64 cores; Co-clustering achieved a speedup of 21 on 32 cores.

41

Page 42: Parallelizing Applications With a Reduction Based Framework on  Multi-Core  Clusters

Thank You!

42