Parallelization Strategies for Implementing N-Body Codes on Multicore Architectures


Upload: filipo-mor

Posted on 27-Jun-2015


TRANSCRIPT

Page 1: Parallelization Strategies for Implementing N-Body Codes on Multicore Architectures

Pontifical Catholic University of Rio Grande do Sul – PUCRS

Graduate Program in Computer Science

Faculty of Informatics

Parallelization Strategies for N-Body Simulations on Multicore Architectures

Filipo Novo Mór

Thais Christina Webber dos Santos

César Augusto Missio Marcon

GPU Implementation

[Figure: GPU pipeline diagram. Three panels show Global Memory, a Shared Memory Bank, Active Tasks, and Active Transfers for particles 0-14: (1) data is buffered into shared memory; (2) the buffered data is consumed by parallel threads; (3) after a barrier, global memory is updated.]

Initially, information about the particles is copied from RAM to GPU memory. The code then runs as a pipeline in which several data transfers between shared and global memory take place while the data in the shared memory buffer is consumed. At the end of the process, all remaining data, now updated, is copied back to global memory and then back to RAM on the CPU.

The computational cost is given by (1), where C is the cost of the force-calculation function, n is the number of particles, p is the number of concurrently running threads, M is the cost of a data transfer between shared and global memory on the GPU, and T is the cost of a data transfer between RAM and GPU memory.

C·n²/p + 2Mn + 2Tn    (1)

Multicore CPU Implementation

For a multicore CPU, standard serial code was parallelized by adding OpenMP directives directly to it. The computational cost drops from n² to (2), since no memory transfers are needed; here p is the number of parallel threads, normally one per CPU core.

C·n²/p    (2)

[email protected]

The N-Body Simulation

The Particle-Particle method does not round off the force summation, so its accuracy equals the machine precision.

Energy monitoring during the simulation.

Computational cost: O(n²).

Partial Results and Perspectives

Cost savings through real speedup achievement.

Visualization module already implemented.

Next Steps:

Cluster implementation (CUDA + MPI + GPUDirect).

Exploration of hierarchical algorithms (such as Barnes-Hut and mesh methods).

OpenCL version.

OpenACC version.

Execution cost quoted on the Amazon EC2 service:

Version  Cost
Serial   $0.33
OpenMP   $0.08
CUDA     $0.05