TRANSCRIPT
GPGPUIGAD – 2014/2015
Lecture 1
Jacco Bikker
Today:
Course introduction
GPGPU background
Getting started
Assignment
Introduction
GPU History
History
3DO-FZ1
console
1991
NVidia NV-1
(Diamond Edge 3D)
1995
History
3Dfx –
Diamond Monster 3D
1996
History
Quake vs GLQuake
1997
History
Fixed function pipeline
vs
Programmable pipeline
2007
History
History
Source: Naffziger, AMD
History
History
GPU - conveyor belt:
input = vertices + connectivity
step 1: transform
step 2: rasterize
step 3: shade
step 4: z-test
output = pixels
History
Introduction

void main( void )
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length( uv ), a = atan( uv.y, uv.x );
    float i = floor( r * 10 );
    a *= floor( pow( 128, i / 10 ) );
    a += 20. * sin( 0.5 * t ) + 123.34 * i - 100. * (r * i / 10) * cos( 0.5 * t );
    r += (0.5 + 0.5 * cos( a )) / 10;
    r = floor( N * r ) / 10;
    gl_FragColor = (1 - r) * vec4( 0.5, 1, 1.5, 1 );
}
https://www.shadertoy.com/view/4sjSRt
Introduction

Historically, the GPU is a co-processor.
GPUs perform well because they have a constrained execution model, which is based on parallelism.
GPU programming requires a very different way of expressing algorithms.
Introduction
This course
Teacher background
Your role
Learning objectives
ECTS / lectures / homework / assessment
This course
AGT6:
7 lectures
We start at 10.00am
Demo time
Break half-way
Lecturer
Me : dr. Jacco Bikker - CUDA – Ray tracing – Rendering
Your role
You:
Maybe a GPGPU / shader expert
Use AGT6 to get further
Or just pass with a 6
Objectives
Objectives:
Get feet wet
Generic GPGPU concepts
*not*:
Detailed API knowledge
Details
AGT6:
3 ECTS = ~80 hours
Weekly homework, unverified
Final assignment: free form
Background
GPU architecture
GPU architecture
CPU: Designed to run one thread as fast as possible.
Use large caches to minimize memory latency
Maximize cache usage using pipeline & branch prediction
Multi-core processing → Task parallelism
Interesting tricks:
SIMD
“Hyperthreading”
GPU architecture
GPU: Designed to combat latency using many threads.
Hide latency by computation
Maximize parallelism
Streaming processing → Data parallelism
Interesting tricks:
Use typical GPU hardware (filtering etc.)
Cache anyway
SIMT
GPU architecture
CPU
Multiple tasks = multiple threads
Tasks run different instructions
10s of complex threads execute on a few cores
Thread execution managed explicitly

GPU
SIMD: same instructions on multiple data
10,000s of light-weight threads on 100s of cores
Threads are managed and scheduled by hardware
GPU architecture
GPU architecture
SIMT Thread execution:
Group 32 threads (vertices, pixels, primitives) into warps
Each warp executes the same instruction
In case of latency, switch to a different warp (thus: switch out 32 threads for 32 different threads)
Flow control: …
GPU architecture
void main( void ) // for each pixel
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length( uv ), a = atan( uv.y, uv.x );
    float i = floor( r * 10 );
    a *= floor( pow( 128, i / 10 ) );
    a += 20. * sin( 0.5 * t ) + 123.34 * i - 100. * (r * i / 10) * cos( 0.5 * t );
    r += (0.5 + 0.5 * cos( a )) / 10;
    r = floor( N * r ) / 10;
    gl_FragColor = (1 - r) * vec4( 0.5, 1, 1.5, 1 );
}
GPU architecture
Easy to port to GPU:
Image postprocessing
Particle effects
Ray tracing
…
Actually, a lot of algorithms are not easy to port at all.
Decades of legacy, or a fundamental problem?
Background
Why GPGPU
OpenCL vs Shaders vs CUDA
Why GPGPU
Some tasks are more efficient on the GPU
GPU has high theoretical peak performance
Prevent wasting processing power
OpenCL vs shaders
No mapping to graphics context needed
Avoid thinking about various transformations of coordinates (world / screen / texture)
Access to memory levels that are implicit in OpenGL
OpenCL also runs on CPUs
OpenCL vs CUDA
…
(but if you must:
“A Comprehensive Performance Comparison of CUDA and OpenCL”, Fang et al., 2011,
http://www.researchgate.net/publication/221084751_A_Comprehensive_Performance_Comparison_of_CUDA_and_OpenCL/links/0c96051c2bd67d9896000000 )
Getting Started
Tools of the trade
Template
Tools
Get your development tools here:
NVidia: https://developer.nvidia.com/opencl
AMD: http://developer.amd.com/tools-and-sdks/opencl-zone/
Intel: https://software.intel.com/en-us/intel-opencl
Template
Template available from N@TSchool!
Template
__kernel void main( write_only image2d_t outimg )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    // calculate checkerboard pattern
    int tileX = column / 40;
    int tileY = line / 40;
    float color = (float)((tileX + tileY) & 1); // 0 or 1
    float4 white = (float4)( 1, 1, 1, 1 );
    write_imagef( outimg, (int2)(column, line), color * white );
}
Template
#version 330

uniform sampler2D color;
in vec2 P;
in vec2 uv;
out vec3 pixel;

void main()
{
    // retrieve input pixel
    pixel = texture( color, uv ).rgb;
    // darken towards edges
    float dx = P.x - 0.5, dy = P.y - 0.5;
    float distance = sqrt( dx * dx + dy * dy );
    float scale = 1.0 - max( 0.0, distance * 2.2 - 0.8 );
    pixel *= scale;
}
Template
bool Game::Init()
{
    // load shader and texture
    clOutput = new Texture( SCRWIDTH, SCRHEIGHT, Texture::FLOAT );
    shader = new Shader( "shaders/checker.vert", "shaders/checker.frag" );
    // load OpenCL code
    kernel = new Kernel( "programs/program.cl", "main" );
    // link cl output texture as an OpenCL buffer
    outputBuffer = clCreateFromGLTexture2D( kernel->GetContext(),
        CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, clOutput->GetID(), 0 );
    kernel->SetArgument( 0, &outputBuffer );
    // done
    return true;
}
Template
void Game::Tick()
{
    // run cl code to fill texture
    kernel->Run( &outputBuffer );
    // run shader on cl-generated texture
    shader->Bind();
    shader->SetInputTexture( GL_TEXTURE0, "color", clOutput );
    shader->SetInputMatrix( "view", mat4( 1 ) );
    DrawQuad();
}
Getting Started
MyFirst OpenCL app
OpenCL terminology
Terminology
A few words you need to know the meaning of:
1. Device
2. Host
3. Context
4. Kernel
5. Program
6. Compute unit (CUDA: streaming multiprocessor)
7. Work item (CUDA: thread)
8. Command queue (synchronous, asynchronous)
MyFirst
To execute an OpenCL program:
1. Query the host system for OpenCL devices
2. Create a context to associate the OpenCL devices
3. Create programs that will run on one or more associated devices
4. From the programs, select kernels to execute
5. Create memory objects on the host or on the device
6. Copy memory data to the device as needed
7. Provide arguments for the kernels
8. Submit the kernels to the command queue for execution
9. Copy the results from the device to the host.
clGetPlatformIDs(…)
clGetDeviceIDs(…)
clCreateContext(…)
clCreateCommandQueue(…)
clCreateProgramWithSource(…)
clBuildProgram(…)
clCreateKernel(…)
clCreateBuffer(…)
clEnqueueWriteBuffer(…)
clSetKernelArg(…)
clEnqueueNDRangeKernel(…)
clFinish(…)
clEnqueueReadBuffer(…)
MyFirst
#include <stdio.h>
#include "CL/cl.h"

#define ITEMS 10

const char *KernelSource =
"__kernel void hello( __global float *input, __global float *output )\n"
"{\n"
"    size_t id = get_global_id( 0 );\n"
"    output[id] = input[id] * input[id];\n"
"}";

int main()
{
    cl_int err;
    cl_uint num_of_platforms = 0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices = 0;
    size_t global = ITEMS;
    float inputData[ITEMS] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }, results[ITEMS] = { 0 };
    clGetPlatformIDs( 1, &platform_id, &num_of_platforms );
    clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &num_of_devices );
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
    cl_context context = clCreateContext( props, 1, &device_id, 0, 0, &err );
    cl_command_queue queue = clCreateCommandQueue( context, device_id, 0, &err );
    cl_program program = clCreateProgramWithSource( context, 1, (const char**)&KernelSource, 0, &err );
    clBuildProgram( program, 0, NULL, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "hello", &err );
    cl_mem input = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    cl_mem output = clCreateBuffer( context, CL_MEM_WRITE_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    clEnqueueWriteBuffer( queue, input, CL_TRUE, 0, sizeof( float ) * ITEMS, inputData, 0, 0, 0 );
    clSetKernelArg( kernel, 0, sizeof( cl_mem ), &input );
    clSetKernelArg( kernel, 1, sizeof( cl_mem ), &output );
    clEnqueueNDRangeKernel( queue, kernel, 1, 0, &global, 0, 0, 0, 0 );
    clFinish( queue );
    clEnqueueReadBuffer( queue, output, CL_TRUE, 0, sizeof( float ) * ITEMS, results, 0, 0, 0 );
    for( int i = 0; i < ITEMS; i++ ) printf( "%f ", results[i] );
    clReleaseMemObject( input );
    clReleaseMemObject( output );
    clReleaseProgram( program );
    clReleaseKernel( kernel );
    clReleaseCommandQueue( queue );
    clReleaseContext( context );
    return 0;
}
MyFirst
bool Kernel::InitCL()
{
    cl_platform_id platform;
    cl_device_id* devices;
    cl_uint devCount;
    cl_int error;
    ...
}
Like I said, I don’t care much for API details…
Just start with the template, and modify /
replace it when the need arises.
Assignment
Create an OpenCL program that calculates Voronoi noise
for a 512x512 buffer and make it available to the CPU.
Measure the performance gain compared to CPU-only.
Reference: https://www.shadertoy.com/view/4djGRh
Words of Advice
WebGL != OpenCL
Can’t do ‘by reference’, use pointers instead
float3 parameter: (float3)(1, 1, 1)
fract requires second parameter
sinf doesn’t exist, use sin
Also, see this helpful chart:
https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
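The advice above can be seen in one place in the following hypothetical OpenCL C device fragment (the kernel name, buffer, and constants are invented for illustration; this is device code, so it only compiles through an OpenCL driver):

```
// hypothetical device code illustrating the advice above
__kernel void advice( __global float* out )
{
    float3 v = (float3)( 1, 1, 1 );   // vector literal, not float3(1,1,1)
    float ipart;
    float f = fract( 1.75f, &ipart ); // OpenCL fract takes a second argument
    float s = sin( 0.5f );            // sin, not sinf
    out[get_global_id( 0 )] = v.x * f * s;
}
```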
“The End”(for now)