griffon topic2 presentation (tia)

GRIFFON GPU PROGRAMMING API FOR SCIENTIFIC AND GENERAL PURPOSE

PISIT MAKPAISIT 4909611727
SUPERVISOR: DR. WORAWAN DIAZ CARBALLO

DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCE AND TECHNOLOGY, THAMMASAT UNIVERSITY

04/08/2023

Motivation

• GPU-CPU performance gap
• GPGPU
• GPU programming model complexity

GPU-CPU performance gap

We all have a graphics card in our PC. The processing unit on the graphics card is called the “GPU”, so every PC has a GPU.

GPU performance is now pulling away from that of traditional processors.

http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf



GPGPU

General-Purpose computation on Graphics Processing Units

Very high computation and data throughput

Scalability


GPGPU Applications

Simulation, Finance, Fluid Dynamics, Medical Imaging, Visualization, Signal Processing, Image Processing, Optical Flow, Differential Equations, Linear Algebra, Finite Element, Fast Fourier Transform, etc.


Vector Addition

Vector A: 1 5 6 8 9 1 2 3 6 5
+
Vector B: 5 4 1 1 5 6 5 8 9 2
=
Vector C: 6 9 7 9 14 7 7 11 15 7


Vector Addition (Sequential Code)

#include <stdio.h>
#include <stdlib.h>

#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *A, *B, *C;

    /* Memory allocate */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    /* Function call */
    VecAdd(A,B,C);

    /* Memory de-allocate */
    free(A);
    free(B);
    free(C);
    return 0;
}


Vector Addition (Sequential Code)

[Figure: sequential execution. The elements of Vector C are computed one at a time, in order: 6, 9, 7, 9, 14, 7, 7, 11, 15, 7.]


Improve Performance

We can speed up vector addition with parallel computing.

Data parallelism – add all elements simultaneously

1st choice: multicore on CPU (OpenMP)

2nd choice: multicore on GPU (CUDA)


Vector Addition (OpenMP)

1. Sequential code
2. Add compiler directive
3. Finish

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma omp parallel for   /* the only line added to the sequential code */
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A,B,C);
    free(A); free(B); free(C);
    return 0;
}


Vector Addition (OpenMP)

[Figure: vector addition with OpenMP. Vector A + Vector B = Vector C, computed element by element; the result is 6, 9, 7, 9, 14, 7, 7, 11, 15, 7.]


Speed Up (Amdahl’s Law)

Execution time (sequential): vector addition takes about 80% of the total.

Execution time (parallel on CPU): new vector addition time = execution time / number of cores = 80% / 2 = 40% of the original total.



OpenMP

Easy and automatic threads management

Few threads on CPU


Vector Addition (GPU - CUDA)

[Figure: Vector A and Vector B are copied from CPU memory to GPU memory, the GPU computes Vector C = A + B (6 9 7 9 14 7 7 11 15 7), and Vector C is copied back to CPU memory.]


Parallel Vector Addition on GPU (CUDA)

#include <stdio.h>
#include <stdlib.h>

#define SIZE 500

/* Declare kernel function */
__global__ void VecAdd(float* A, float* B, float* C){
    int idx = threadIdx.x;
    if(idx < SIZE)
        C[idx] = A[idx] + B[idx];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    /* CPU memory allocate */
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    /* GPU memory allocate */
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);


Parallel Vector Addition on GPU (CUDA)

    /* Data transfer from CPU to GPU */
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    /* Kernel call: 1 block of SIZE threads */
    VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);

    /* Data transfer from GPU to CPU */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    /* CPU memory de-allocate */
    free(h_A);
    free(h_B);
    free(h_C);

    /* GPU memory de-allocate */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}


Speed Up (Amdahl’s Law)

Execution time (sequential): vector addition takes about 80% of the total.

Execution time (parallel on GPU): new vector addition time = execution time / number of cores = 80% / 16 = 5% of the original total.


CUDA

Faster, but takes more programming effort and time

Many threads on GPU


CUDA Memory Model

Global Memory – off-chip, large, shared by all threads, slow; the host can read and write it

Local Memory – private to each thread; it resides in off-chip device memory, so it is not faster than Global Memory (registers are the fast per-thread storage)

Shared Memory – on-chip, shared by all threads in a block, faster than Global Memory


Griffon

Simple programming model (OpenMP)
+ Computing performance (GPU - CUDA)
= Easy and efficient (Griffon)


Parallel Vector Addition on GPU (Griffon)

1. Sequential code
2. Add compiler directive
3. Finish

So easy!

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma gfn parallel for   /* the only line added to the sequential code */
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A,B,C);
    free(A); free(B); free(C);
    return 0;
}


Griffon

Compiler directives for the C language

Source-to-source compiler
Automatic data management
Optimization

Objectives

Objectives (1/2)

To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises (a) compiler directives and (b) a source-to-source compiler.

Simple – The number of compiler directives does not exceed 20. The grammar of Griffon directives is similar to that of OpenMP, a standard shared-memory API.

Thread safety – The code generated by Griffon behaves correctly, i.e. equivalently to the sequential code.


Objectives (2/2)

To demonstrate that Griffon-generated code can gain reasonable performance over sequential code on two example applications: Pi calculation using numerical integration, and the Monte Carlo method.

Automatic – GPU memory management in the generated code is done automatically by Griffon.

Efficient – Code generated by Griffon should achieve the speedup predicted by Amdahl’s law, or come within 20% of it.


Project Constraint

Griffon is a C-language API that supports both Windows and Linux environments.

The generated executable can run only on NVIDIA graphics cards.

Users can use Griffon together with OpenMP.

Related Works

Brook+ & CUDA

General-purpose computation on GPU
Manual kernel launch, data transfer, and management of the various GPU memories
Vendor dependent


OpenCL (Open Computing Language)

Cross-platform and vendor neutral
Approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors)
Data and task parallelism


OpenMP to GPGPU

Translates OpenMP applications into CUDA-based GPGPU applications

GPU optimization techniques – Parallel Loop-Swap and Loop Collapsing, to enhance inter-thread locality


hiCUDA

Directive-based GPU programming language

Computation model for identifying the code regions to execute on the GPU

Data model for allocating and de-allocating GPU memory and for data transfer

Methodology

• Software Architecture
• Directives
• Griffon Compilation Process
• Optimization Techniques


Software Architecture

NVCC is part of the Griffon toolchain.

The Griffon source-to-source compiler comprises a compile-time Memory Allocator and an Optimizer.

[Figure: compilation flow. Griffon C Application -> Griffon Compiler (Compile-time Memory Allocator, Optimizer) -> CUDA C Application -> NVCC (NVIDIA CUDA Compiler), which separates PTX code and C code. The PTX compiler produces GPU object code; GCC (Linux) or CL (MS Windows) produces CPU object code. Both are combined into the Executable.]

Directives

Griffon Directives

Parallel Region – define a parallel region
Control Flow – specify the kernel work flow
GPU/CPU Overlap Compute – define a region in which the CPU computes while the GPU is computing
Synchronous – define a synchronization point


Directives

General Form

#pragma gfn directive-name [clause[ [,] clause]...] new-line

Parallel Region

#pragma gfn parallel for [clause[ [,] clause]...] new-line
for-loops

Clause:
kernelname(name)
waitfor(kernelname-list)
private(var-list)
accurate([low,high])
reduction(operator:var-list)


Parallel Region

Sequential code:

for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}

With Griffon:

#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}


Kernel Flow Control

#pragma gfn parallel for kernelname( A )
#pragma gfn parallel for kernelname( B ) waitfor( A )
#pragma gfn parallel for kernelname( C ) waitfor( A )
#pragma gfn parallel for kernelname( D ) waitfor( B,C )

[Figure: dependency graph. A -> B, A -> C, B -> D, C -> D.]

Kernels B and C can be computed in parallel.


Synchronization

Synchronous Point

#pragma gfn barrier new-line

Atomic

#pragma gfn atomic new-line
assignment-statement

Parallel Reduction

#pragma gfn parallel for reduction(operator:var-list)

[Figure: processes P0 to P3 all wait at the barrier before any of them continues.]


Synchronization

Sequential code:

for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}

Option 1 (one kernel with a barrier):

#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Option 2 (two kernels):

#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}


Synchronization

Sequential code:

for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}

With Griffon:

#pragma gfn parallel for \
    private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}


GPU/CPU Overlap Compute

#pragma gfn overlapcompute(kernelname) new-line
structured-block

[Figure: the many GPU threads and the CPU function run in parallel until the GPU/CPU synchronization point.]


GPU/CPU Overlap Compute

Sequential code:

for(i=0;i<N;i++){
    …
}
independenceCpuFunction();

With Griffon:

#pragma gfn parallel for kernelname( calA )
for(i=0;i<N;i++){
    …
}
#pragma gfn overlapcompute( calA )
independenceCpuFunction();


Accuracy Level

#pragma gfn parallel for accurate( [low, high] )

Use low when speed is important

Use high when precision is important

The default is high

Griffon Compilation Process

Create Kernel

Source:

int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;   /* __tid * increment + min */
    if(__tid<__N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    /* Inserted kernel call */
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(..., (N - 1 - 0) / 1 + 1);
    return 0;
}


For-Loop Format and Thread Mapping

The for-loop must have the form

for( index = min ; index <= max ; index += increment ){
    …
}

or

for( index = max ; index >= min ; index -= increment ){
    …
}   // This case is transformed to the first case

The number of threads is calculated by the formula

number of threads = (max - min) / increment + 1

Iteration index and thread mapping

__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;


Private and shared variable management

Shared variables must be passed to the kernel function

Private variables must be declared in the kernel function

Declare GPU device variables for the shared variables

Size to allocate
    Static: the size in the declaration, e.g. int A[500];
    Dynamic: the size passed to the allocation function (malloc, calloc, realloc)


Private and shared variable management

Source:

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(int * A, int * B, int * C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;   /* __tid * increment + min */
    int x, y;                /* private variables */
    if(__tid<__N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    int * __d_A ,* __d_B ,* __d_C ;   /* device copies of the shared variables */
    cudaMalloc((void**)&__d_C,sizeof(int) * N);
    cudaMalloc((void**)&__d_B,sizeof(int) * N);
    cudaMalloc((void**)&__d_A,sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C); cudaFree(__d_B); cudaFree(__d_A);
    return 0;
}


Reduction variable management

Source:

int main(){
    …
    #pragma gfn parallel for \
        reduction(+:sum)
    for(i=0;i<MAX;i++){
        ...
        sum += A[i];
        ...
    }
    ...
}

Generated:

__global__ void __kernel_0(float *A, float * global___sum_add){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid ;
    int __rtid = threadIdx.x ;
    __shared__ int __sum_add[512] ;
    int sum = 0 ;
    __sum_add[__rtid] = 0;
    if( __tid < __N ){
        …
        sum += A[i];
        __sum_add[__rtid] = sum;
        __syncthreads();
        if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
        __syncthreads();
        if(__rtid < 128) __sum_add[__rtid] += …
        …
        … __sum_add[__rtid + 16];
        if(__rtid < 8) __sum_add[__rtid] += __sum_add[__rtid + 8];
        if(__rtid < 4) __sum_add[__rtid] += …
        …
        … __sum_add[__rtid + 1];
    }
    if(__rtid == 0)
        atomicAdd(global___sum_add, __sum_add[0]);
}

The generated code is complex because it uses an optimized parallel reduction implementation.


Replace math functions & GPU functions

Source:

int f1(int a){
    return ++a;
}
int f0(int a){
    return f1(a) + 5;
}
…
    A[i] = f0(A[i]) + sin(B[i]);
}

Generated:

__device__ int __device_f1(int a){
    return ++a;
}
__device__ int __device_f0(int a){
    return __device_f1(a) + 5;
}
__global__ void __kernel_1(int *A, int *B, int N){
    …
    A[i] = __device_f0(A[i]) + __sinf(B[i]);
}


Barrier and Atomic

Before:

__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        #pragma gfn barrier
        A[i] = B[i];
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

After:

__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        __threadfence();
        A[i] = B[i];
        atomicAdd(&C[i], x / 5);
    }
}


Kernel call and data transfer sort

Details are in the optimization section

__kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
__kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00,512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
cudaMemcpy(&sum,global___sum_add,sizeof(int), cudaMemcpyDeviceToHost );
cudaMemcpy(A,__d_A,sizeof(int) * N, cudaMemcpyDeviceToHost );
cudaMemcpy(D,__d_D,sizeof(int) * N, cudaMemcpyDeviceToHost );


Automatic cache with shared memory

Details are in the optimization section

Source:

#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

Generated:

__global__ void __kernel_0 (int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid * 1 + 1 ;
    __shared__ int sa[514] ;
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}

Optimization Techniques

• Maximum threads on GPU
• Reduce data transfer with control flow analysis
• Reduce data transfer with kernel control flow
• Overlapping kernel and data transfer, and asynchronous data transfer
• Automatic cache with shared memory

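Reduce data transfer with control flow analysis

Used variables (read in the loop before being written) are transferred from CPU to GPU; defined variables (written in the loop) are transferred from GPU to CPU.

    …
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}

A and B are transferred from CPU to GPU
C is transferred from GPU to CPU
D is transferred in both directions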


Reduce data transfer with kernel control flow

Memcpy Host to Device for variables that are used in the kernel
Memcpy Device to Host for variables that are defined in the kernel

#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}

is translated to

cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice );
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice );
Kernel <<< … , … >>> ( … )
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);

[Figure: transfer graph. A -> K1, B -> K1, K1 -> C.]


Use the graph defined by the kernelname and waitfor constructs

#pragma gfn parallel for \
    kernelname(k1)
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
#pragma gfn parallel for \
    kernelname(k2) waitfor(k1)
for(i=0;i<N;i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}

[Figure: transfer graph. K1 has inputs A, B and output C; K2 has inputs A, C, D and outputs E, C.]


If there is a path from k1 to k2:

1. If an invar of k1 is the same as an invar of k2, delete the invar of k2
2. If an outvar of k1 is the same as an outvar of k2, delete the outvar of k1
3. If an outvar of k1 is the same as an invar of k2, delete the invar of k2

[Figure: the K1/K2 transfer graph with transfer nodes A, B, C, D, E.]


Schedule Kernel and Memcpy for Maximum overlap

[Figure: transfer graph with kernels K1, K2, K3 and transfer nodes A, B, C, D, E, after the redundant transfer nodes have been removed.]

How should it be scheduled?


Schedule for synchronous function

[Figure: timeline in which every kernel and every transfer runs one after another.]

Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E)

Newer versions of the CUDA API provide asynchronous data transfer functions


Schedule Kernel and Memcpy for Maximum overlap

Memcpy and kernels can be overlapped

The maximum is a 3-way overlap: MemcpyHostToDevice, Kernel, MemcpyDeviceToHost

A 4-way overlap is possible if CPU computation is included through the overlapcompute directive

[Figure: the transfer graph (K1, K2, K3, A, B, C, D, E) divided into Levels 1 to 4, with a stream number assigned to each node within a level.]


[Figure: the transfer graph (K1, K2, K3, A, B, C, D, E) divided into Levels 1 to 4, with stream numbers per level.]

1. Set queue to empty
2. Until all nodes are deleted
    1.1. Set level = 1 and stream_num = 1
    1.2. Find a kernel node with 0 incoming degree, delete the node and its links, create a transfer command with stream_num
        1.2.1. If found in 1.2, stream_num += 1
    1.3. Find a GPU-to-CPU node with 0 incoming degree, delete the node and its links, create a transfer command with stream_num
        1.3.1. If found in 1.3, stream_num += 1
    1.4. Find a CPU-to-GPU node with 0 incoming degree, delete the node and its links, create a transfer command with stream_num
        1.4.1. If found in 1.4, stream_num += 1
    1.5. If nothing is found in 1.2-1.4, find a kernel node with 0 incoming degree and create a transfer command for its CPU-to-GPU node
    1.6. Insert synchronous function
    1.7. Collect max stream_num
    1.8. level += 1


When a “linear access” pattern is detected in a kernel, the automatic cache is applied

    …
    B[i] = A[i-1] + A[i] + A[i+1];
}

[Figure: each thread block (1, 2, 3, …, n) loads its part of Global Memory into its own Shared memory.]


Generated:

__global__ void __kernel_0 (int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid * 1 + 1 ;
    __shared__ int sa[514] ;
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}

Source:

    …
    B[i] = A[i-1] + A[i] + A[i+1];
}

DEMO

Evaluation

• Compiler Directives
• Compiler Performance

Compiler Directives

[Figure: bar chart of time in minutes (0 to 30) for Program 1, Program 2, and Program 3, comparing Griffon and CUDA.]

Participants: 5 undergraduate students who had studied the concepts of CUDA

They received only a 1.5-hour demonstration


Compiler Directives

[Figure: bar chart of lines of code (0 to 120) per application (PNI, PMC, TR, VN, SOV), comparing Sequential, CUDA, and Griffon.]

PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element


Compiler Performance

[Figure: bar chart of speed up (0 to 25) per application (PNI, PMC, TR, VN, SOV), comparing Sequential and Parallel (Griffon), with the expected speed up marked.]

Conclusion

Griffon Instruction

Total number of instructions (directives + clauses): 9

The remaining problem is the performance of parallel programs with a high degree of communication

Improve the directives for describing the algorithm used in a program (divide and conquer, partial summation, etc.)

New optimization techniques, such as cache with shared memory and choosing an appropriate number of threads


Performance factor and speed up

Application | Parallelism | Data Transfer | Computation Density | Speed Up
Calculation of Pi Using Numerical Integration | High | Very Low | Low | 1.76
Calculation of Pi Using the Monte Carlo Method | High | Average | High | 7.36
Trapezoidal Rule | High | Very Low | High | 19.28
Vector Normalization | High | High | Low | 1.21
Calculate Sine of Vector’s Element | Very High | High | High | 3.78

Computation density has the greatest effect on performance


Building S2S Compiler

Source-to-source compilers aren’t popular

The alternative is a compiler that transforms Griffon code directly into GPU object code (PTX)

Although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.


Future Work

Optimization techniques: data structures, loop transformation

Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support

Compiler: support for C++ and other languages, support for popular IDEs


Reference

Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes, Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphics Hardware, http://gpgpu.org
Ilias Leontiadis, George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic. PPoPP ’09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han, Tarek S. Abdelrahman. hiCUDA: A High-level Directive-based Language for GPU Programming. GPGPU '09
Wolfe, M. (1996). High Performance Compilers for Parallel Computing. Addison-Wesley
