griffon topic2 presentation (tia)
GRIFFON GPU PROGRAMMING API FOR SCIENTIFIC AND GENERAL PURPOSE
PISIT MAKPAISIT 4909611727
SUPERVISOR: DR. WORAWAN DIAZ CARBALLO
DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCE AND TECHNOLOGY, THAMMASAT UNIVERSITY
04/08/2023
Griffon - GPU Programming API for Scientific and General Purpose
Motivation
• GPU-CPU performance gap
• GPGPU
• GPU programming model complexity
GPU-CPU performance gap
We all have a graphics card in our PC
The processing unit on the graphics card is called the “GPU”
Therefore every PC has a GPU
GPU performance is now pulling away from that of traditional processors
http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
GPGPU
General-Purpose computation on Graphics Processing Units
Very high computation and data throughput
Scalability
GPGPU Applications
Simulation, Finance, Fluid Dynamics, Medical Imaging, Visualization, Signal Processing, Image Processing, Optical Flow, Differential Equations, Linear Algebra, Finite Element, Fast Fourier Transform, etc.
Vector Addition
Vector A:   1 5 6 8 9 1 2 3 6 5
Vector B: + 5 4 1 1 5 6 5 8 9 2
Vector C: = 6 9 7 9 14 7 7 11 15 7
Vector Addition (Sequential Code)
#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *A, *B, *C;

    /* Memory allocate */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    /* Function call */
    VecAdd(A,B,C);

    /* Memory de-allocate */
    free(A); free(B); free(C);
    return 0;
}
Vector Addition (Sequential Code)
[Animation: the elements of Vector A (1 5 6 8 9 1 2 3 6 5) and Vector B (5 4 1 1 5 6 5 8 9 2) are added one pair at a time, filling Vector C with 6 9 7 9 14 7 7 11 15 7]
Improve Performance
We can speed up vector addition with parallel computing
Data Parallelism – add all elements simultaneously
1st choice: multicore on CPU (OpenMP)
2nd choice: multicore on GPU (CUDA)
Vector Addition (OpenMP)
1. Sequential Code
2. Add Compiler Directive: #pragma omp parallel for
3. Finish

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500
void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma omp parallel for
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}
int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A,B,C);
    free(A); free(B); free(C);
    return 0;
}
Vector Addition (OpenMP)
[Animation: the same element-wise additions of Vector A (1 5 6 8 9 1 2 3 6 5) and Vector B (5 4 1 1 5 6 5 8 9 2), now carried out in parallel by the CPU threads, giving Vector C = 6 9 7 9 14 7 7 11 15 7]
Speed Up (Amdahl’s Law)
Execution time (Sequential): Vector Addition ~ 80%
Execution time (Parallel on CPU): Vector Addition new exec. time = exec. time / cores = 80% / 2 = 40%
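The slide's arithmetic can be turned into an overall speedup with Amdahl's law, speedup = 1 / ((1 - p) + p / n). The sketch below is only an illustration: the function name amdahl_speedup and its parameter names are assumptions, with p the parallelizable fraction (about 0.8 on this slide) and n the number of cores.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of the run time
   (parallel_fraction) is spread evenly over a number of cores. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(cores > 0);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

With p = 0.8 and 2 cores the run time drops to 20% + 40% = 60% of the original, so amdahl_speedup(0.8, 2) is about 1.67 rather than 2.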
OpenMP
Easy and automatic threads management
Few threads on CPU
Vector Addition (GPU - CUDA)
[Animation: Vector A (1 5 6 8 9 1 2 3 6 5) and Vector B (5 4 1 1 5 6 5 8 9 2) are copied from CPU memory to GPU memory, all elements are added in parallel on the GPU, and Vector C (6 9 7 9 14 7 7 11 15 7) is copied back from GPU memory to CPU memory]
Parallel Vector Addition on GPU (CUDA)
#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare kernel function */
__global__ void VecAdd(float* A, float* B, float* C){
    int idx = threadIdx.x;
    if(idx < SIZE)
        C[idx] = A[idx] + B[idx];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    /* CPU memory allocate */
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    /* GPU memory allocate */
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
Parallel Vector Addition on GPU (CUDA)
    /* Data transfer from CPU to GPU */
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    /* Kernel call: one block of SIZE threads */
    VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);

    /* Data transfer from GPU to CPU */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    /* CPU memory de-allocate */
    free(h_A); free(h_B); free(h_C);

    /* GPU memory de-allocate */
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Speed Up (Amdahl’s Law)
Execution time (Sequential): Vector Addition ~ 80%
Execution time (Parallel on GPU): Vector Addition new exec. time = exec. time / cores = 80% / 16 = 5%
CUDA
Faster, but costs more programming effort and time
Many threads on GPU
CUDA Memory Model
Global Memory – Off-chip, large, shared by all threads, slow, host can read and write
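Local Memory – private to each thread; in CUDA it resides in off-chip device memory, so it is not faster than Global Memory (registers are the fast per-thread storage)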
Shared Memory – shared by all threads in block, faster than Global Memory
Griffon
Simple programming model (OpenMP)
+ Computing performance (GPU - CUDA)
= Easy and efficient (Griffon)
Parallel Vector Addition on GPU (Griffon)
1. Sequential Code
2. Add Compiler Directive: #pragma gfn parallel for
3. Finish
So Easy !!

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500
void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma gfn parallel for
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}
int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A,B,C);
    free(A); free(B); free(C);
    return 0;
}
Griffon
Compiler directives for the C language
Source-to-source compiler
Automatic data management
Optimization
Objectives
Objectives (1/2)
To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises a) compiler directives and b) a source-to-source compiler.
Simple – The number of compiler directives does not exceed 20. The grammar of Griffon directives is similar to that of OpenMP, a standard shared-memory API.
Thread safety – The code generated by Griffon behaves correctly, i.e. equivalently to the sequential code.
Objectives (2/2)
To demonstrate that Griffon-generated code can gain reasonable performance over sequential code on two example applications: Pi calculation using numerical integration, and the Monte Carlo method.
Automatic – The GPU memory management of the generated code is done automatically by Griffon.
Efficient – With Griffon, the generated code should achieve the speedup predicted by Amdahl’s law, or come within 20% of it.
Project Constraint
Project Constraint
Griffon is a C-language API that supports both Windows and Linux environments
The generated executable program can only run on NVIDIA graphics cards.
Users can use Griffon in combination with OpenMP.
Related Works
Brook+ & CUDA
General-purpose computation on GPU
Manual management of kernels, data transfer, and the various GPU memories
Vendor dependent
OpenCL (Open Computing Language)
Cross-platform and vendor neutral
Approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors)
Data and task parallelism
OpenMP to GPGPU
Translates OpenMP applications into CUDA-based GPGPU applications
GPU optimization techniques – Parallel Loop-Swap and Loop-Collapsing, to enhance inter-thread locality
hiCUDA
Directive-based GPU programming language
Computation model for identifying the code regions to be executed on the GPU
Data model for allocating and de-allocating GPU memory and for data transfer
Methodology
• Software Architecture
• Directives
• Griffon Compilation Process
• Optimization Techniques
Software Architecture
NVCC is part of the Griffon toolchain.
The Griffon source-to-source compiler comprises a compile-time Memory Allocator and an Optimizer.
[Diagram: Griffon C Application → Griffon Compiler (Compile-time Memory Allocator, Optimizer) → CUDA C Application → NVCC (NVIDIA CUDA Compiler), which splits the program into PTX code → PTX compiler → GPU object code, and C code → GCC (Linux) or CL (MS Windows) → CPU object code; both object codes are combined into the Executable]
Directives
Griffon Directives
Parallel Region – Define parallel region
Control Flow – Specify kernel work flow
GPU/CPU Overlap Compute – Define region where the CPU computes in overlap with the GPU
Synchronous – Define synchronous point
Directives
General Form
#pragma gfn directive-name [clause[ [,] clause]...] new-line
Parallel Region
#pragma gfn parallel for [clause[ [,] clause]...] new-line
for-loops
Clause:
kernelname(name)
waitfor(kernelname-list)
private(var-list)
accurate([low,high])
reduction(operator:var-list)
Parallel Region
Sequential code
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
With Griffon
#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
Kernel Flow Control
#pragma gfn parallel for kernelname( A )
#pragma gfn parallel for kernelname( B ) waitfor( A )
#pragma gfn parallel for kernelname( C ) waitfor( A )
#pragma gfn parallel for kernelname( D ) waitfor( B,C )
[Graph: A → B, A → C, B → D, C → D]
Kernels B and C can run in parallel
Synchronization
Synchronous Point
#pragma gfn barrier new-line
Atomic
#pragma gfn atomic new-line
assignment-statement
Parallel Reduction
#pragma gfn parallel for reduction(operator:var-list)
[Figure: threads P0 to P3 each wait at the barrier until all of them have reached it]
Synchronization
Sequential code
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}
Option 1
#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Option 2
#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Synchronization
Sequential code
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
With Griffon
#pragma gfn parallel for \
    private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
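To see the loop in a complete program, here is a minimal sketch of the trapezoidal rule around it. Because a Griffon toolchain is not assumed to be available, the sketch uses the OpenMP spelling of the directive, which has the same private and reduction clauses; a compiler without OpenMP support simply ignores the pragma and runs the loop sequentially. The integrand f (4 / (1 + x²), whose integral over [0, 1] is π) and the function name trapezoid are assumptions chosen for illustration.

```c
#include <assert.h>

/* Assumed integrand: the integral of 4 / (1 + x^2) over [0, 1] equals pi. */
double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Composite trapezoidal rule on [a, b] with n intervals. */
double trapezoid(double a, double b, int n)
{
    double h, x, integral;
    int i;

    assert(n > 0);
    h = (b - a) / n;
    integral = (f(a) + f(b)) / 2.0;

    /* x is private to each thread; the partial sums of integral
       are combined with + when the loop ends. */
    #pragma omp parallel for private(x) reduction(+:integral)
    for (i = 1; i <= n - 1; i++) {
        x = a + (i * h);
        integral = integral + f(x);
    }
    return integral * h;
}
```

Calling trapezoid(0.0, 1.0, 1000) gives an approximation of π; swapping the pragma for the gfn form would be the only change needed under the directive syntax shown on this slide.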
GPU/CPU Overlap compute
#pragma gfn overlapcompute(kernelname) new-line
structure-block
[Figure: the many GPU threads of the kernel and the CPU function run in parallel until the GPU/CPU synchronization point]
GPU/CPU Overlap compute
Sequential code
for(i=0;i<N;i++){
    …
}
independenceCpuFunction();
With Griffon
#pragma gfn parallel for kernelname( calA )
for(i=0;i<N;i++){
    …
}
#pragma gfn overlapcompute( calA )
independenceCpuFunction();
Accurate Level
#pragma gfn parallel for accurate( [low, high] )
Use low when speed is important
Use high when precision is important
Default is high
Griffon Compilation Process
Create Kernel
Griffon source
int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}
Generated CUDA code
__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;  /* index = __tid * increment + min */
    if(__tid<__N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    // Insert kernel call
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(..., (N - 1 - 0) / 1 + 1);
    return 0;
}
For-Loop Format and Thread Mapping
The for-loop must have the form
for( index = min ; index <= max ; index += increment ){
    …
}
or
for( index = max ; index >= min ; index -= increment ){
    …
} // This case is transformed into the first case
The number of threads is calculated from the loop bounds, as in the generated kernel call: (max - min) / increment + 1
Iteration index and thread mapping:
__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
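The mapping can be checked on the CPU with a few small helpers. This is a sketch only: the function names are assumptions, and block_count uses integer ceiling division, which gives the same result as the (n - 1 + 512.00) / 512.00 expression in the generated kernel calls once that value is truncated to an integer.

```c
#include <assert.h>

/* Number of iterations (threads that do work) of
   for (index = min; index <= max; index += increment). */
int iteration_count(int min, int max, int increment)
{
    assert(increment > 0 && max >= min);
    return (max - min) / increment + 1;
}

/* Thread blocks needed to cover all iterations (ceiling division). */
int block_count(int iterations, int threads_per_block)
{
    assert(iterations >= 0 && threads_per_block > 0);
    return (iterations + threads_per_block - 1) / threads_per_block;
}

/* Loop index handled by the thread with global id tid. */
int index_for_thread(int tid, int min, int increment)
{
    return tid * increment + min;
}
```

For the vector addition loop (min = 0, max = 499, increment = 1) this gives 500 iterations, which fit into a single block of 512 threads; threads 500 to 511 are the ones masked out by the if(__tid < __N) test in the kernel.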
Private and shared variable management
Shared variables must be passed to the kernel function
Private variables must be declared in the kernel function
Declare GPU device variables for the shared variables
Size to allocate:
Static: the size given at declaration, e.g. int A[500];
Dynamic: the size passed to the allocation function (malloc, calloc, realloc)
Private and shared variable management
Griffon source
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}
Generated CUDA code
__global__ void __kernel_0(int * A, int * B, int * C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;  /* index = __tid * increment + min */
    int x, y;
    if(__tid<__N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    int * __d_A ,* __d_B ,* __d_C ;
    cudaMalloc((void**)&__d_C,sizeof(int) * N);
    cudaMalloc((void**)&__d_B,sizeof(int) * N);
    cudaMalloc((void**)&__d_A,sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C); cudaFree(__d_B); cudaFree(__d_A);
    return 0;
}
Reduction variable management
Griffon source
int main(){
    …
    #pragma gfn parallel for \
        reduction(+:sum)
    for(i=0;i<MAX;i++){
        ...
        sum += A[i];
        ...
    }
    ...
}
Generated CUDA code
__global__ void __kernel_0(float *A, int __N, int * global___sum_add){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid ;
    int __rtid = threadIdx.x ;
    __shared__ int __sum_add[512] ;
    int sum = 0 ;
    __sum_add[__rtid] = 0;
    if( __tid < __N ){
        …
        sum += A[i];
    }
    /* Every thread of the block must reach each __syncthreads() */
    __sum_add[__rtid] = sum;
    __syncthreads();
    if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
    __syncthreads();
    if(__rtid < 128) __sum_add[__rtid] += __sum_add[__rtid + 128];
    __syncthreads();
    if(__rtid < 64) __sum_add[__rtid] += __sum_add[__rtid + 64];
    __syncthreads();
    if(__rtid < 32) __sum_add[__rtid] += __sum_add[__rtid + 32];
    __syncthreads();
    if(__rtid < 16) __sum_add[__rtid] += __sum_add[__rtid + 16];
    if(__rtid < 8) __sum_add[__rtid] += __sum_add[__rtid + 8];
    if(__rtid < 4) __sum_add[__rtid] += __sum_add[__rtid + 4];
    if(__rtid < 2) __sum_add[__rtid] += __sum_add[__rtid + 2];
    if(__rtid < 1) __sum_add[__rtid] += __sum_add[__rtid + 1];
    if(__rtid == 0)
        atomicAdd(global___sum_add, __sum_add[0]);
}
The generated code is complex because it uses an optimized parallel reduction
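The halving pattern of the generated kernel (256, 128, ... 1) is easier to follow when simulated on the CPU. The sketch below is an illustration, not Griffon output: tree_reduce_sum and example_sum are assumed names, and the inner loop plays the role of the threads whose __rtid is below the current stride.

```c
#include <assert.h>

/* Sums buf[0..n-1] in place using the halving pattern of a block-level
   parallel reduction; n must be a power of two. The result ends in buf[0]. */
int tree_reduce_sum(int *buf, int n)
{
    int stride, t;

    assert(n > 0 && (n & (n - 1)) == 0);
    for (stride = n / 2; stride >= 1; stride /= 2) {
        /* On the GPU these additions happen at the same time,
           one per thread with id t < stride. */
        for (t = 0; t < stride; t++)
            buf[t] += buf[t + stride];
    }
    return buf[0];
}

/* Example: reduce the first eight elements of Vector A from the earlier slides. */
int example_sum(void)
{
    int buf[8] = {1, 5, 6, 8, 9, 1, 2, 3};
    return tree_reduce_sum(buf, 8);
}
```

Eight values need only three steps (strides 4, 2, 1) instead of seven sequential additions, which is why the kernel reduces 512 values in nine steps.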
Replace math functions & GPU functions
Griffon source
int f1(int a){
    return ++a;
}
int f0(int a){
    return f1(a) + 5;
}
#pragma gfn parallel for
for(i=0;i<N;i++){
    A[i] = f0(A[i]) + sin(B[i]);
}
Generated CUDA code
__device__ int __device_f1(int a){
    return ++a;
}
__device__ int __device_f0(int a){
    return __device_f1(a) + 5;
}
__global__ void __kernel_1(int *A, int *B, int N){
    …
    A[i] = __device_f0(A[i]) + __sinf(B[i]);
}
Barrier and Atomic
Kernel before the directives are replaced
__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        #pragma gfn barrier
        A[i] = B[i];
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Kernel after the directives are replaced
__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        __threadfence();
        A[i] = B[i];
        atomicAdd(&C[i], x / 5);
    }
}
Kernel call and data transfer sort
Details are in the optimization section
__kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
__kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00,512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
cudaMemcpy(&sum,global___sum_add,sizeof(int), cudaMemcpyDeviceToHost );
cudaMemcpy(A,__d_A,sizeof(int) * N, cudaMemcpyDeviceToHost );
cudaMemcpy(D,__d_D,sizeof(int) * N, cudaMemcpyDeviceToHost );
Automatic cache with shared memory
Details are in the optimization section
Griffon source
#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
Generated CUDA code
__global__ void __kernel_0 (int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid * 1 + 1 ;
    __shared__ int sa[514] ;
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
Optimization Techniques
• Maximum thread on GPU
• Reduce data transfer with analysis control flow
• Reduce data transfer with kernel control flow
• Overlapping kernel and data transfer and asynchronous data transfer
• Automatic cache with shared memory
Reduce data transfer with analysis control flow
A and B are transferred from CPU to GPU (used variables)
C is transferred from GPU to CPU (defined variable)
D is transferred in both directions (used and defined)
#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}
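The rule on this slide can be written as a small decision function. This is a sketch under assumptions: the VarAccess record and its two flags are hypothetical stand-ins for whatever the Griffon compiler records during its analysis, and the slides do not show that data structure.

```c
#include <assert.h>
#include <stddef.h>

enum Transfer {
    NO_TRANSFER    = 0,
    HOST_TO_DEVICE = 1,   /* copy before the kernel runs */
    DEVICE_TO_HOST = 2,   /* copy after the kernel finishes */
    BOTH_WAYS      = 3
};

/* Hypothetical summary of how one variable is accessed in the loop body. */
typedef struct {
    const char *name;
    int read_before_written;  /* the loop uses a value that comes from the host */
    int written;              /* the loop defines (writes) the variable */
} VarAccess;

/* Used variables go to the GPU, defined variables come back from it. */
int transfer_direction(VarAccess v)
{
    int direction = NO_TRANSFER;

    assert(v.name != NULL);
    if (v.read_before_written) direction |= HOST_TO_DEVICE;
    if (v.written)             direction |= DEVICE_TO_HOST;
    return direction;
}
```

For the loop on this slide, A and B are only read, C is written before it is read, and D is read and then written, which reproduces the three cases listed above.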
Reduce data transfer with kernel control flow
Memcpy Host to Device for variables that are used in the kernel
Memcpy Device to Host for variables that are defined in the kernel
#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice );
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice );
Kernel <<< … , … >>> ( … )
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
[Graph: A, B → K1 → C]
Reduce data transfer with kernel control flow
Use the graph defined by the kernelname and waitfor constructs
#pragma gfn parallel for \
    kernelname(k1)
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
#pragma gfn parallel for \
    kernelname(k2) waitfor(k1)
for(i=0;i<N;i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}
[Graph: A, B → K1 → C; A, C, D → K2 → E, C]
Reduce data transfer with kernel control flow
If there is a path from k1 to k2:
1. If an invar of k1 is the same as an invar of k2, delete the invar of k2
2. If an outvar of k1 is the same as an outvar of k2, delete the outvar of k1
3. If an outvar of k1 is the same as an invar of k2, delete the invar of k2
[Graph: A, B → K1 → C; A, C, D → K2 → E, C]
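The three rules can be tried out on the k1/k2 example with sets stored as bit masks. This is a minimal sketch: the KernelTransfers type, the VAR_ constants and both function names are assumptions for illustration and do not come from the Griffon implementation.

```c
#include <assert.h>

/* One bit per array of the k1/k2 example. */
enum { VAR_A = 1, VAR_B = 2, VAR_C = 4, VAR_D = 8, VAR_E = 16 };

typedef struct {
    unsigned invars;   /* copied host to device before the kernel */
    unsigned outvars;  /* copied device to host after the kernel */
} KernelTransfers;

/* Transfers left for the earlier kernel k1 (rule 2). */
KernelTransfers reduce_earlier(KernelTransfers k1, KernelTransfers k2)
{
    k1.outvars &= ~k2.outvars;   /* k2 overwrites it, so copy it back only once */
    return k1;
}

/* Transfers left for the later kernel k2 (rules 1 and 3). */
KernelTransfers reduce_later(KernelTransfers k1, KernelTransfers k2)
{
    k2.invars &= ~k1.invars;     /* rule 1: already copied to the GPU for k1 */
    k2.invars &= ~k1.outvars;    /* rule 3: produced on the GPU by k1 */
    return k2;
}

/* The example from the slides: k1 reads A, B and writes C;
   k2 reads A, C, D and writes E, C. */
void check_slide_example(void)
{
    KernelTransfers k1 = { VAR_A | VAR_B, VAR_C };
    KernelTransfers k2 = { VAR_A | VAR_C | VAR_D, VAR_E | VAR_C };

    assert(reduce_earlier(k1, k2).outvars == 0);
    assert(reduce_later(k1, k2).invars == VAR_D);
}
```

After the reduction only A, B and D are copied to the GPU and only E and C are copied back, so five transfers remain out of the original eight.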
Schedule Kernel and Memcpy for Maximum overlap
[Graph: kernels K1, K2, K3 and transfers A, B, C, D, E, after the transfer nodes have been reduced]
How to schedule?
Schedule for synchronous function
[Timeline: kernels K1, K2, K3 and transfers A, B, C, D, E execute one after another, with no overlap]
Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E)
Newer versions of the CUDA API provide asynchronous data transfer functions
Schedule Kernel and Memcpy for Maximum overlap
Memcpy and kernel execution can be overlapped
Maximum is a 3-way overlap: MemcpyHostToDevice, Kernel, MemcpyDeviceToHost
4-way overlap if CPU computation is included through the overlapcompute directive
[Graph: kernels K1, K2, K3 and transfers A, B, C, D, E arranged in Levels 1 to 4; each node carries a stream number]
K1
K2
A
B
D K3
C
E
Level 1
Level 2
Level 3
Level 4
1 2
12 3
12
1
1. Set the queue to empty and set level = 1.
2. Repeat until all nodes are deleted:
2.1. Set stream_num = 1.
2.2. Find a kernel node with zero incoming degree; delete the node and its links, and create its command with stream_num.
2.2.1. If a node was found in 2.2, stream_num += 1.
2.3. Find a GPU-to-CPU node with zero incoming degree; delete the node and its links, and create the transfer command with stream_num.
2.3.1. If a node was found in 2.3, stream_num += 1.
2.4. Find a CPU-to-GPU node with zero incoming degree; delete the node and its links, and create the transfer command with stream_num.
2.4.1. If a node was found in 2.4, stream_num += 1.
2.5. If nothing was found in 2.2 to 2.4, find a kernel node with zero incoming degree and create the transfer command for the CPU-to-GPU node.
2.6. Insert a synchronization call.
2.7. Record the maximum stream_num.
2.8. level += 1.
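The steps above can be sketched in plain C as a level-by-level topological sort. This is a minimal illustration under stated assumptions, not Griffon's actual implementation. The types `Graph` and `Command` and the function `schedule` are hypothetical names. Each level takes at most one ready node of each kind (kernel, GPU-to-CPU copy, CPU-to-GPU copy). Readiness is judged before any node of the level is removed, so that a node never shares a level with its own predecessor. The fallback step for the case where no node is found is left out.

```c
#include <assert.h>

#define MAX_NODES 16

/* Node kinds, listed in the order in which each level searches for them. */
typedef enum { KERNEL, DEV_TO_HOST, HOST_TO_DEV, NUM_KINDS } NodeKind;

typedef struct {
    int n;                           /* number of nodes in use */
    NodeKind kind[MAX_NODES];        /* kind of each node */
    int edge[MAX_NODES][MAX_NODES];  /* edge[u][v] != 0: u must finish before v */
} Graph;

typedef struct {
    int node;    /* index of the scheduled node */
    int level;   /* 1-based level the node was placed in */
    int stream;  /* 1-based stream number within that level */
} Command;

/* Fills cmds[0..g->n-1] in issue order, stores the largest number of streams
 * used by any level in *max_streams, and returns the number of levels. */
int schedule(const Graph *g, Command *cmds, int *max_streams)
{
    int indeg[MAX_NODES] = {0};
    int done[MAX_NODES] = {0};
    int remaining, level = 0, ncmd = 0;

    assert(g->n > 0 && g->n <= MAX_NODES);
    remaining = g->n;
    *max_streams = 0;

    /* Count the incoming links of every node. */
    for (int u = 0; u < g->n; u++)
        for (int v = 0; v < g->n; v++)
            if (g->edge[u][v])
                indeg[v]++;

    while (remaining > 0) {
        int picked[NUM_KINDS];
        int npicked = 0;
        level++;

        /* Pick at most one ready node per kind; links are untouched here,
         * so every pick was ready at the start of the level. */
        for (int k = 0; k < NUM_KINDS; k++) {
            for (int v = 0; v < g->n; v++) {
                if (!done[v] && g->kind[v] == (NodeKind)k && indeg[v] == 0) {
                    picked[npicked++] = v;
                    break;
                }
            }
        }
        assert(npicked > 0);  /* no ready node would mean a cyclic graph */

        /* Delete the picked nodes and their links; one stream per node. */
        for (int s = 0; s < npicked; s++) {
            int v = picked[s];
            done[v] = 1;
            remaining--;
            cmds[ncmd].node = v;
            cmds[ncmd].level = level;
            cmds[ncmd].stream = s + 1;
            ncmd++;
            for (int w = 0; w < g->n; w++)
                if (g->edge[v][w])
                    indeg[w]--;
        }
        if (npicked > *max_streams)
            *max_streams = npicked;
    }
    return level;
}
```

For a small made-up graph with two host-to-device copies feeding two kernels and one device-to-host copy after the first kernel, this sketch produces three levels and never needs more than two streams at once. A synchronization call would be issued between consecutive levels.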
![Page 65: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/65.jpg)
Automatic cache with shared memory
When the compiler detects a “linear access” pattern in a kernel, automatic caching is applied.
[Figure: thread blocks 1 to n each cache their part of global memory in their own shared memory]
#pragma gfn parallel for
for (i = 1; i < (MAX - 1); i++) {
    B[i] = A[i-1] + A[i] + A[i+1];
}
![Page 66: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/66.jpg)
Automatic cache with shared memory
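__global__ void __kernel_0(int *B, int *A, int __N)
{
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 1;
    __shared__ int sa[514];  /* 512 threads per block plus 2 halo elements */

    /* Stage this block's slice of A in shared memory.
       Assuming __N = MAX - 2 iterations, A holds __N + 2 elements. */
    if (__tid < __N + 2)
        sa[threadIdx.x + 0] = A[i + 0 - 1];
    if (threadIdx.x + 512 < 514 && __tid + 512 < __N + 2)
        sa[threadIdx.x + 512] = A[i + 512 - 1];
    __syncthreads();  /* outside the branch so that every thread reaches the barrier */

    if (__tid < __N)
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
}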
#pragma gfn parallel for
for (i = 1; i < (MAX - 1); i++) {
    B[i] = A[i-1] + A[i] + A[i+1];
}
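The caching idea can be tried on the CPU without a GPU. The sketch below is an emulation under stated assumptions, not compiler output: each "block" first copies its slice of A plus two halo elements into a small local tile (standing in for shared memory), then computes only from the tile. The names `stencil_direct` and `stencil_tiled` are hypothetical, and `BLOCK` is set to 4 for readability where the generated kernel uses 512.

```c
#include <assert.h>

#define BLOCK 4  /* hypothetical block size; the generated kernel uses 512 */

/* Reference version: the three-point sum exactly as in the annotated loop. */
void stencil_direct(const int *A, int *B, int max)
{
    assert(max >= 3);
    for (int i = 1; i < max - 1; i++)
        B[i] = A[i - 1] + A[i] + A[i + 1];
}

/* Tiled version: each block stages count + 2 inputs in a local tile,
 * then every output of the block is computed from the tile only. */
void stencil_tiled(const int *A, int *B, int max)
{
    assert(max >= 3);
    int n = max - 2;  /* number of loop iterations */

    for (int base = 0; base < n; base += BLOCK) {
        int tile[BLOCK + 2];
        int count = (n - base < BLOCK) ? n - base : BLOCK;

        /* Load phase: the block's slice of A plus two halo elements. */
        for (int t = 0; t < count + 2; t++)
            tile[t] = A[base + t];

        /* Compute phase: on a GPU this follows the barrier. */
        for (int t = 0; t < count; t++)
            B[base + t + 1] = tile[t] + tile[t + 1] + tile[t + 2];
    }
}
```

Both functions write the same values to B[1] through B[max - 2]. In the tiled version each element of A is read from the large array about once per block instead of three times, which is the saving the shared-memory cache aims for.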
![Page 67: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/67.jpg)
DEMO
![Page 68: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/68.jpg)
• Compiler Directives • Compiler Performance
Evaluation
![Page 69: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/69.jpg)
Compiler Directives
[Figure: bar chart of time (minutes) for Program 1, Program 2, and Program 3, comparing Griffon and CUDA]
Participants: 5 undergraduate students who had studied the concepts of CUDA
They received only a 1.5-hour demonstration
![Page 70: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/70.jpg)
Compiler Directives
[Figure: bar chart of lines of code per application, comparing the Sequential, CUDA, and Griffon versions]
PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element
![Page 71: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/71.jpg)
Compiler Performance
[Figure: bar chart of speed-up per application, comparing Sequential and Parallel (Griffon), with the expected speed-up marked]
PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element
![Page 72: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/72.jpg)
Conclusion
![Page 73: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/73.jpg)
Griffon Instruction
Total number of instructions (directives + clauses): 9
Problem: performance of parallel programs with a high degree of communication
Improve the directives for describing the algorithm in a program (divide and conquer, partial summation, etc.)
New optimization techniques such as caching with shared memory and choosing an appropriate number of threads
![Page 74: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/74.jpg)
Performance factor and speed up
| Application | Parallelism | Data Transfer | Computation Density | Speed Up |
|---|---|---|---|---|
| Calculation of Pi Using Numerical Integration | High | Very Low | Low | 1.76 |
| Calculation of Pi Using the Monte Carlo Method | High | Average | High | 7.36 |
| Trapezoidal Rule | High | Very Low | High | 19.28 |
| Vector Normalization | High | High | Low | 1.21 |
| Calculate Sine of Vector’s Element | Very High | High | High | 3.78 |

Computation density has the greatest effect on performance.
![Page 75: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/75.jpg)
Building S2S Compiler
Source-to-source compilers aren’t popular
A compiler that transforms Griffon code to GPU object code (PTX)
Although programs generated by a PTX compiler could be very efficient, they cannot benefit from manual optimization.
![Page 76: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/76.jpg)
Future Work
Optimization techniques: data structures, loop transformation
Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support
Compiler: support for C++ and other languages, support for popular IDEs
![Page 77: Griffon Topic2 Presentation (Tia)](https://reader030.vdocuments.net/reader030/viewer/2022012910/546a0052af7959653c8b6f6e/html5/thumbnails/77.jpg)