Parallelization Strategies for Implementing N-Body Codes on Multicore Architectures


Upload: filipo-mor

Posted on 27-Jun-2015


TRANSCRIPT

Page 1: Parallelization Strategies for Implementing N-Body Codes on Multicore Architectures

Pontifical Catholic University of Rio Grande do Sul – PUCRS

Graduate Program in Computer Science

Faculty of Informatics

Parallelization Strategies for N-Body Simulations on Multicore Architectures

Filipo Novo Mór

Thais Christina Webber dos Santos

César Augusto Missio Marcon

GPU Implementation

[Figure: GPU pipeline diagram. Three panels show Global Memory, a Shared Memory Bank, Active Tasks, and Active Transfers for particles 0-14: (1) data is buffered into shared memory; (2) the buffered data is consumed by parallel threads; (3) after a barrier, global memory is updated.]

Initially, information about the particles is copied from RAM to GPU memory. The code then runs as a pipeline in which several data transfers between shared and global memory take place while the data in the shared memory buffer is consumed. At the end of the process, all remaining data, now updated, is copied back to global memory and then back to RAM on the CPU.

The computational cost is given by (1), where C is the cost of the force-calculation function, n is the number of particles, p is the number of concurrently running threads, M is the cost of a data transfer between shared and global memory on the GPU, and T is the cost of a data transfer between RAM and GPU memory.

C·n²/p + 2Mn + 2Tn    (1)

Multicore CPU Implementation

For a multicore CPU, standard serial code was parallelized by adding OpenMP directives directly to it. The computational cost drops from n² to (2), since no memory transfers are needed; here p is the number of parallel threads, normally one per CPU core.

C·n²/p    (2)

[email protected]

The N-Body Simulation

The Particle-Particle method does not round off the force summation, so its accuracy equals the machine precision.

Energy monitoring during the simulation.

Computational cost: O(n²).

Partial Results and Perspectives

Cost savings through real speedup achievement.

Visualization module already implemented.

Next Steps:

Cluster implementation (CUDA + MPI + GPUDirect).

Exploration of hierarchical algorithms (such as Barnes-Hut and mesh methods).

OpenCL version.

OpenACC version.

Execution cost quoted on the Amazon EC2 service:

Version  Cost
Serial   $0.33
OpenMP   $0.08
CUDA     $0.05