Brief overview of a parallel nbody code
DESCRIPTION
This is a brief overview of an nbody code, including its sequential and parallel versions (OpenMP and CUDA) and its computational complexity.
TRANSCRIPT
![Page 1: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/1.jpg)
Brief overview of a parallel nbody code
Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013
Implementation and analysis
![Page 2: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/2.jpg)
Overview
• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion
![Page 3: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/3.jpg)
About the nbody problem
Features:
• Force calculation between all pairs of particles.
• Complexity $O(N^2)$.
• Energy should be constant.
• The brute-force algorithm demands huge computational power.
![Page 4: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/4.jpg)
The Serial Implementation: NAIVE!
• Clearly $N^2$.
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end.
![Page 5: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/5.jpg)
The Serial Implementation: smart
• It is still in the $N^2$ domain, but:
• Each pair is evaluated only once.
• Acceleration is already correct at the end!
![Page 6: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/6.jpg)
The OpenMP Implementation
• MUST be based on the “naive” version.
• We lose the “/2”, but we gain the “/p”!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.
![Page 7: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/7.jpg)
Analysis
“naive” Serial
“smart” Serial

    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            printf("*");
        }
        printf("\n");
    }

[Figure: triangle of asterisks printed by the loop, one asterisk per evaluated pair.]

$\approx \frac{n(n-1)}{2}$

OpenMP Parallel
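The iteration counts behind this comparison can be checked by counting instead of printing. The sketch below is an illustration only; the function names are hypothetical. The per-thread figure assumes the rows of the naive loop are split as evenly as possible over `p` threads.

```c
/* Inner-loop iterations of the "naive" double loop (every j != i). */
unsigned long count_naive(unsigned long n)
{
    unsigned long c = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = 0; j < n; j++)
            if (j != i)
                c++;
    return c; /* n(n-1) */
}

/* Inner-loop iterations of the "smart" triangular loop (j > i). */
unsigned long count_smart(unsigned long n)
{
    unsigned long c = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = i + 1; j < n; j++)
            c++;
    return c; /* n(n-1)/2 */
}

/* Largest share of the naive loop handled by one of p threads,
 * assuming an even split of rows: ceil(n/p) rows of (n-1) pairs. */
unsigned long count_naive_per_thread(unsigned long n, unsigned long p)
{
    if (n == 0 || p == 0)
        return 0;
    return ((n + p - 1) / p) * (n - 1);
}
```

For `n = 10` the naive loop performs 90 evaluations and the smart loop 45. With 4 threads the busiest thread performs 27, which is below the smart serial count as soon as `p` exceeds 2.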
![Page 8: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/8.jpg)
The CUDA Implementation
Basic CUDA GPU architecture
![Page 9: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/9.jpg)
[Figure: indices 0 to 14, with regions labeled Global Memory and Shared Memory Bank. N = 15, K = 3.]
![Page 10: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/10.jpg)
[Figure: animation frame with a BARRIER across indices 0 to 14. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 11: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/11.jpg)
[Figure: animation frame with elements 0, 1, 2 highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 12: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/12.jpg)
[Figure: animation frame with a BARRIER across indices 0 to 14. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 13: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/13.jpg)
[Figure: animation frame with elements 3, 4, 5 highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 14: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/14.jpg)
[Figure: animation frame with a BARRIER across indices 0 to 14. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 15: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/15.jpg)
[Figure: animation frame with elements 6, 7, 8 highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 16: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/16.jpg)
[Figure: animation frame with a BARRIER across indices 0 to 14. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 17: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/17.jpg)
[Figure: animation frame with elements 9, 10, 11 highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 18: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/18.jpg)
[Figure: animation frame with a BARRIER across indices 0 to 14. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 19: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/19.jpg)
[Figure: animation frame with elements 12, 13, 14 highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 20: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/20.jpg)
[Figure: animation frame showing indices 0 to 14 with no element highlighted. Labels: Global Memory, Shared Memory Bank. Legend: Active Tasks, Active Transfers.]
![Page 21: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/21.jpg)
[Figure: final animation frame showing indices 0 to 14. Labels: Global Memory, Shared Memory Bank.]
![Page 22: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/22.jpg)
Analysis
CUDA implementation:
• First, all elements are transferred from host to device memory.
• Each thread is responsible for exactly one particle.
• Barriers synchronize the transfers between global and shared memory.
• At each barrier, K elements are transferred to shared memory at a time.
• At the end, all elements are copied from local to global memory at once, and finally copied back to CPU memory.

C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.

Access to shared memory is around 100x faster than access to global memory.
![Page 23: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/23.jpg)
Experimental Results
Testing Environment:
• Dell PowerEdge R610
• 2 × Intel Xeon Quad-Core E5520, 2.27 GHz, Hyper-Threading (8 physical cores, 16 threads)
• RAM: 16 GB
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS, GCC 4.4.3, CUDA 5.0
How much would it cost?

| Version | Cost  |
|---------|-------|
| Naive   | $0.49 |
| Smart   | $0.33 |
| OMP     | $0.08 |
| CUDA    | $0.05 |

Amazon EC2: General Purpose (m1.large plan); GPU Instances (g2.2xlarge plan)
![Page 24: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/24.jpg)
Conclusions
• PRAM is OK for the sequential and OpenMP versions.
• But for CUDA, we need a better model!
  – Considering thread blocks, warps and latency.

Thanks!
![Page 25: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/25.jpg)
Additional Slides
![Page 26: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/26.jpg)
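About the nbody problem
• Calculations

Force (acceleration):

$$f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j\, r_{ij}}{\left(\lVert r_{ij} \rVert^2 + \varepsilon^2\right)^{3/2}}$$

Energy (kinetic and potential):

$$E = E_k + E_p$$

$$E_p = -\sum_{\substack{1 \le j \le N \\ i \ne j}} \frac{G m_i m_j}{\lVert r_{ij} \rVert}$$

$$E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}$$

Softening Factor ($\varepsilon$)
collisionless system
virtual particles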
![Page 27: Brief Overview of a Parallel Nbody Code](https://reader035.vdocuments.net/reader035/viewer/2022081401/558e76501a28abe8478b46d6/html5/thumbnails/27.jpg)
About the nbody problem