Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
Xin Huo, Vignesh T. Ravi, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University


Page 1

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations

Xin Huo, Vignesh T. Ravi, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University

Page 2

Irregular Reduction - Context

• A dwarf in the Berkeley view on parallel computing
  – Unstructured grid pattern
  – More random and irregular accesses
  – Indirect memory references
• Previous efforts in porting to different architectures
  – Distributed memory machines
  – Distributed shared memory machines
  – Shared memory machines
  – Cache performance improvement on uniprocessors
  – Many-core architectures – GPGPU (our work in ICS'11)
  – No systematic study on heterogeneous architectures (CPU + GPU)

Page 3

Why CPU + GPU ? – A Glimpse

• Dominant positions in different areas, but tightly connected
  – GPU: computation-intensive problems, large numbers of threads executing in SIMD
  – CPU: data-intensive problems, branch processing, high-precision and complicated computation
  – The GPU depends on scheduling and data from the CPU
• One of the most popular heterogeneous architectures
  – 3 of the 5 fastest supercomputers are based on a CPU + GPU architecture (Top500 list, 11/2011)
  – AMD Fusion, Intel Sandy Bridge, and NVIDIA Denver
  – Cloud compute instances on Amazon

Page 4

Outline

• Background
  – Irregular Reduction Structure
  – Partitioning-based Locking Scheme
  – Main Issues
• Contributions
• Multi-level Partitioning Framework
• Runtime Support
  – Pipelining Scheme
  – Task Scheduling
• Experimental Results
• Conclusions

Page 5

Irregular Reduction Structure

• Robj: reduction object
• e: iteration of the computation loop
• IA(e,x): indirection array
• IA iterates over e (the computation space)
• Robj is accessed through the indirection array (the reduction space)

/* Outer Sequential Loop */
while ( ) {
    /* Reduction Loop */
    Foreach (element e) {
        (IA(e,0), val1) = Process(IA(e,0));
        (IA(e,1), val2) = Process(IA(e,1));
        Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
        Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
    }
    /* Global Reduction to Combine Robj */
}
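To make the pattern concrete, the following minimal C++ sketch instantiates the reduction loop above for an edge list; the names (Edge, coord, force) and the arithmetic are illustrative placeholders, not the authors' application code.

#include <vector>

// Illustrative sketch (not the authors' code): a sequential irregular
// reduction over an edge list. 'edges' plays the role of the indirection
// array IA, and 'force' is the reduction space Robj.
struct Edge { int u, v; };   // IA(e,0) and IA(e,1)

void reductionLoop(const std::vector<Edge>& edges,
                   const std::vector<double>& coord,   // per-node input attribute
                   std::vector<double>& force)         // reduction objects Robj
{
    for (const Edge& e : edges) {                      // Foreach(element e)
        // Process(): compute a contribution from the two endpoints.
        double val = coord[e.u] - coord[e.v];          // placeholder physics
        // Reduce(): accumulate into elements selected through the indirection.
        force[e.u] += val;                             // Robj(IA(e,0))
        force[e.v] -= val;                             // Robj(IA(e,1))
    }
}

The accesses to force[] are known only at runtime, through the edge endpoints, which is what makes the reduction irregular and prevents a static, conflict-free assignment of iterations to threads.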

Page 6

Application Context

Molecular Dynamics
• Indirection array -> edges (interactions)
• Reduction objects -> molecules (attributes)
• Computation space -> interactions between molecules
• Reduction space -> attributes of molecules

Page 7

Partitioning-based Locking Strategy

• Reduction space partitioning (see the sizing sketch below)
  – Efficient shared memory utilization
  – Eliminates intra- and inter-block combination
• Multi-dimensional partitioning method
  – Balances minimizing cut edges against partitioning time

Huo et al., 25th International Conference on Supercomputing (ICS 2011)
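The shared-memory constraint is what bounds the partition size. The following hedged C++ sketch shows one way the number of reduction-space partitions could be derived from it, assuming 48 KB of the configurable 64 KB is used as shared memory and an illustrative per-node attribute size; neither value is taken from the paper.

#include <cstddef>

// Hedged sketch: pick the number of reduction-space partitions so that each
// partition's reduction objects fit into the GPU's shared memory. The
// bytes-per-node value is an assumed example, not a figure from the paper.
int partitionsForSharedMemory(std::size_t numNodes,
                              std::size_t bytesPerNode   = 4 * sizeof(double),
                              std::size_t sharedMemBytes = 48 * 1024)
{
    std::size_t nodesPerPartition = sharedMemBytes / bytesPerNode;   // capacity per block
    return static_cast<int>((numNodes + nodesPerPartition - 1) / nodesPerPartition);
}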

Page 8

Main Issues

• Device memory limitation on the GPU
• Partitioning overhead
  – Partitioning cost grows with data volume
  – The GPU sits idle waiting for partitioning results
• Low utilization of the CPU
  – The CPU only performs partitioning
  – The CPU sits idle while the GPU computes

Page 9

Contributions

• A novel Multi-level Partitioning Framework
  – Parallelizes irregular reductions on heterogeneous architectures (CPU + GPU)
  – Removes the GPU device memory limitation
• Runtime support
  – Pipelining scheme
  – Work-stealing based scheduling strategy
• Significant performance improvements
  – Extensive evaluation
  – 11% and 22% improvements for Euler and Molecular Dynamics, respectively

Page 10

Multi-level Partitioning Framework

Page 11

Computation Space Partitioning

• Partitioning over the iterations of the computation loop (a code sketch follows the figure below)
• Pros
  – Load balance on computation
• Cons
  – Unequal reduction size in each partition
  – Replicated reduction elements (4 out of 16 nodes in the example)
  – Combination cost
    • Between CPU and GPU (first level)
    • Between different thread blocks (second level)

[Figure: example mesh of 16 reduction elements split into Partitions 1-4 by computation-space partitioning; 4 of the 16 nodes are replicated across partitions]
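A minimal C++ sketch of computation-space partitioning under the simplifying assumption of contiguous, equal-sized edge chunks (the function and type names are illustrative): it balances the iterations but also reports how many reduction elements become replicated and therefore need a later combination step.

#include <vector>
#include <unordered_set>

struct Edge { int u, v; };

// Illustrative sketch of computation-space partitioning: the edge list is
// split into equal contiguous chunks (balanced iterations), and we count how
// many reduction elements are touched by more than one chunk, i.e. are
// replicated and must be combined afterwards.
std::vector<std::vector<Edge>> partitionEdges(const std::vector<Edge>& edges,
                                              std::size_t numParts,
                                              std::size_t& replicatedNodes)
{
    std::vector<std::vector<Edge>> parts(numParts);
    std::size_t chunk = (edges.size() + numParts - 1) / numParts;
    for (std::size_t i = 0; i < edges.size(); ++i)
        parts[i / chunk].push_back(edges[i]);

    std::unordered_set<int> seen, replicated;
    for (const auto& p : parts) {
        std::unordered_set<int> local;
        for (const Edge& e : p) { local.insert(e.u); local.insert(e.v); }
        for (int n : local)
            if (!seen.insert(n).second) replicated.insert(n);   // seen in an earlier chunk
    }
    replicatedNodes = replicated.size();
    return parts;
}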

Page 12

Reduction Space Partitioning

• Partitioning over the reduction elements (a code sketch follows the figure below)
• Pros
  – Balanced reduction space
    • Shared memory is feasible on the GPU
  – Each pair of partitions is independent
    • No communication between CPU and GPU (first level)
    • No communication between thread blocks (second level)
  – Avoids combination cost
    • Between different thread blocks
    • Between CPU and GPU
• Cons
  – Imbalance in the computation space
  – Replicated work caused by crossing edges

[Figure: the same 16-node example mesh split into Partitions 1-4 by reduction-space partitioning, with crossing edges between partitions]
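A matching C++ sketch of reduction-space partitioning, assuming a simple contiguous node-to-partition mapping rather than the multi-dimensional partitioner used in the framework: crossing edges are replicated into both partitions, trading extra work for the absence of any combination step.

#include <vector>

struct Edge { int u, v; };

// Illustrative sketch of reduction-space partitioning: nodes are assigned to
// partitions (here by contiguous ranges). An edge is given to the partition
// owning each of its endpoints, so a crossing edge is processed twice
// (replicated work), but each partition updates only the nodes it owns and
// no combination between partitions is needed afterwards.
std::vector<std::vector<Edge>> partitionNodes(const std::vector<Edge>& edges,
                                              int numNodes, int numParts)
{
    auto owner = [=](int node) {
        return static_cast<int>(static_cast<long long>(node) * numParts / numNodes);
    };
    std::vector<std::vector<Edge>> parts(numParts);
    for (const Edge& e : edges) {
        parts[owner(e.u)].push_back(e);
        if (owner(e.v) != owner(e.u))          // crossing edge: replicate the work
            parts[owner(e.v)].push_back(e);
    }
    return parts;
}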

Page 13

Task Scheduling Framework

Page 14

Task Scheduling Framework

• Pipelining Scheme
  – K blocks are assigned to the GPU in one global load
  – Partitioning and computation are pipelined within a global load
• Work Stealing Scheduling
  – Scheduling granularity (large for the GPU, small for the CPU)
    • Too large: better pipelining effect, but worse load balance
    • Too small: better load balance, but a shorter pipeline
  – Work stealing achieves both maximum pipeline length and good load balance (a scheduler sketch follows below)
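A hedged C++ sketch of the scheduling idea: both devices claim blocks from one shared counter, the GPU in large chunks and the CPU in small ones, so the device that finishes early simply keeps claiming ("stealing") the remaining blocks. The class, chunk sizes, and names are illustrative, not the paper's values.

#include <atomic>
#include <algorithm>
#include <utility>

// Hedged sketch of work stealing between CPU and GPU: blocks are claimed
// from a single atomic counter, the GPU in large chunks (longer pipeline)
// and the CPU in small chunks (finer load balance).
class BlockScheduler {
    std::atomic<int> next_{0};
    int total_;
public:
    explicit BlockScheduler(int totalBlocks) : total_(totalBlocks) {}

    // Claim up to 'chunk' blocks; returns {begin, count}, count == 0 => done.
    std::pair<int, int> claim(int chunk) {
        int begin = next_.fetch_add(chunk);
        int count = std::max(0, std::min(chunk, total_ - begin));
        return {begin, count};
    }
};

// Usage sketch: a GPU driver thread loops on scheduler.claim(kGpuChunk),
// copying each claimed range of partitions to the device and launching the
// reduction kernel, while CPU worker threads loop on scheduler.claim(kCpuChunk)
// and process their ranges locally; both stop once the returned count is zero.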

Page 15

Experiment Setup

• Platform
  – GPU
    • NVIDIA Tesla C2050 “Fermi” (14 x 32 = 448 cores)
    • 2.68 GB device memory
    • 64 KB configurable shared memory
  – CPU
    • Intel Xeon E5520 quad-core, 2.27 GHz
    • 48 GB memory
    • x16 PCI Express 2.0
• Applications
  – Euler (computational fluid dynamics)
  – MD (molecular dynamics)

Page 16

Scalability – Molecular Dynamics
• Scalability of Molecular Dynamics on a multi-core CPU and a GPU across different datasets (MD)

Page 17

Pipelining – Euler
• Effect of pipelining CPU partitioning with GPU computation (EU)
• Performance increases as the degree of pipelining increases
• Partitioning overhead is hidden for all but the first partition

Page 18

Heterogeneous Performance – EU and MD
• Benefits from dividing computation between the CPU and GPU for EU and MD
• 11% improvement for EU, 22% for MD

Page 19

Work Stealing – Euler
• Comparison of fine-grained, coarse-grained, and work-stealing strategies
• Fine-grained (granularity = 1): good load balance, but poor pipelining effect
• Coarse-grained (granularity = 5): good pipelining effect, but poor load balance
• Work stealing: good pipelining effect and good load balance

Page 20

Conclusions

• A Multi-level Partitioning Framework to port irregular reductions onto heterogeneous architectures

• Pipelining Scheme can overlap partitioning on CPU and computation on GPU

• Work Stealing Scheduling achieves the best pipelining effect and load balance