TRANSCRIPT
GPGPUIGAD – 2014/2015
Lecture 1
Jacco Bikker
Today:
Course introduction
GPGPU background
Getting started
Assignment
Introduction
GPU History
History
3DO-FZ1
console
1991
NVidia NV-1
(Diamond Edge 3D)
1995
History
3Dfx –
Diamond Monster 3D
1996
History
Quake vs GLQuake
1997
History
Fixed function pipeline
vs
Programmable pipeline
2007
History
History
Source: Naffziger, AMD
History
History
GPU - conveyor belt:
input = vertices + connectivity
step 1: transform
step 2: rasterize
step 3: shade
step 4: z-test
output = pixels
History
Introduction

void main( void )
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length( uv ), a = atan( uv.y, uv.x );
    float i = floor( r * 10 );
    a *= floor( pow( 128, i / 10 ) );
    a += 20. * sin( 0.5 * t ) + 123.34 * i - 100. * (r * i / 10) * cos( 0.5 * t );
    r += (0.5 + 0.5 * cos( a )) / 10;
    r = floor( N * r ) / 10;
    gl_FragColor = (1 - r) * vec4( 0.5, 1, 1.5, 1 );
}
https://www.shadertoy.com/view/4sjSRt
Introduction

Historically, the GPU is a co-processor.
GPUs perform well because they have a constrained execution model, which is based on parallelism.
GPU programming requires a very different way of expressing algorithms.
Introduction
This course
Teacher background
Your role
Learning objectives
ECTS / lectures / homework / assessment
This course
AGT6:
7 lectures
We start at 10.00am
Demo time
Break half-way
Lecturer
Me : dr. Jacco Bikker - CUDA – Ray tracing – Rendering
Your role
You:
Maybe a GPGPU / shader expert
Use AGT6 to get further
Or just pass with a 6
Objectives
Objectives:
Get feet wet
Generic GPGPU concepts
*not*:
Detailed API knowledge
Details
AGT6:
3 ECTS = ~80 hours
Weekly homework, unverified
Final assignment: free form
Background
GPU architecture
GPU architecture
CPU: Designed to run one thread as fast as possible.
Use large caches to minimize memory latency
Maximize cache usage using pipeline & branch prediction
Multi-core processing → Task parallelism
Interesting tricks:
SIMD
“Hyperthreading”
GPU architecture
GPU: Designed to combat latency using many threads.
Hide latency by computation
Maximize parallelism
Streaming processing → Data parallelism
Interesting tricks:
Use typical GPU hardware (filtering etc.)
Cache anyway
SIMT
GPU architecture
CPU
Multiple tasks = multiple threads
Tasks run different instructions
10s of complex threads execute on a few cores
Thread execution managed explicitly

GPU
SIMD: same instructions on multiple data
10,000s of light-weight threads on 100s of cores
Threads are managed and scheduled by hardware
GPU architecture
GPU architecture
SIMT Thread execution:
Group 32 threads (vertices, pixels, primitives) into warps
Each warp executes the same instruction
In case of latency, switch to a different warp (thus: switch out 32 threads for 32 different threads)
Flow control: …
GPU architecture
void main( void ) // for each pixel
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length( uv ), a = atan( uv.y, uv.x );
    float i = floor( r * 10 );
    a *= floor( pow( 128, i / 10 ) );
    a += 20. * sin( 0.5 * t ) + 123.34 * i - 100. * (r * i / 10) * cos( 0.5 * t );
    r += (0.5 + 0.5 * cos( a )) / 10;
    r = floor( N * r ) / 10;
    gl_FragColor = (1 - r) * vec4( 0.5, 1, 1.5, 1 );
}
GPU architecture
Easy to port to GPU:
Image postprocessing
Particle effects
Ray tracing
…
Actually, a lot of algorithms are not easy to port at all.
Decades of legacy, or a fundamental problem?
Background
Why GPGPU
OpenCL vs Shaders vs CUDA
Why GPGPU
Some tasks are more efficient on the GPU
GPU has high theoretical peak performance
Prevent wasting processing power
OpenCL vs shaders
No mapping to graphics context needed
Avoid thinking about various transformations of coordinates (world / screen / texture)
Access to memory levels that are implicit in OpenGL
OpenCL also runs on CPUs
OpenCL vs CUDA
…
(but if you must:
“A Comprehensive Performance Comparison of CUDA and OpenCL”, Fang et al., 2011,
http://www.researchgate.net/publication/221084751_A_Comprehensive_Performance_Comparison_of_CUDA_and_OpenCL/links/0c96051c2bd67d9896000000 )
Getting Started
Tools of the trade
Template
Tools
Get your development tools here:
NVidia: https://developer.nvidia.com/opencl
AMD: http://developer.amd.com/tools-and-sdks/opencl-zone/
Intel: https://software.intel.com/en-us/intel-opencl
Template
Template available from N@TSchool!
Template
__kernel void main( write_only image2d_t outimg )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    // calculate checkerboard pattern
    int tileX = column / 40;
    int tileY = line / 40;
    float color = (float)((tileX + tileY) & 1); // 0 or 1
    float4 white = (float4)( 1, 1, 1, 1 );
    write_imagef( outimg, (int2)(column, line), color * white );
}
Template
#version 330

uniform sampler2D color;
in vec2 P;
in vec2 uv;
out vec3 pixel;

void main()
{
    // retrieve input pixel
    pixel = texture( color, uv ).rgb;
    // darken towards edges
    float dx = P.x - 0.5, dy = P.y - 0.5;
    float distance = sqrt( dx * dx + dy * dy );
    float scale = 1.0 - max( 0.0, distance * 2.2 - 0.8 );
    pixel *= scale;
}
Template
bool Game::Init()
{
    // load shader and texture
    clOutput = new Texture( SCRWIDTH, SCRHEIGHT, Texture::FLOAT );
    shader = new Shader( "shaders/checker.vert", "shaders/checker.frag" );
    // load OpenCL code
    kernel = new Kernel( "programs/program.cl", "main" );
    // link cl output texture as an OpenCL buffer
    outputBuffer = clCreateFromGLTexture2D( kernel->GetContext(),
        CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, clOutput->GetID(), 0 );
    kernel->SetArgument( 0, &outputBuffer );
    // done
    return true;
}
Template
void Game::Tick()
{
    // run cl code to fill texture
    kernel->Run( &outputBuffer );
    // run shader on cl-generated texture
    shader->Bind();
    shader->SetInputTexture( GL_TEXTURE0, "color", clOutput );
    shader->SetInputMatrix( "view", mat4( 1 ) );
    DrawQuad();
}
Getting Started
MyFirst OpenCL app
OpenCL terminology
Terminology
A few words you need to know the meaning of:
1. Device
2. Host
3. Context
4. Kernel
5. Program
6. Compute unit (CUDA: streaming multiprocessor)
7. Work item (CUDA: thread)
8. Command queue (synchronous, asynchronous)
MyFirst
To execute an OpenCL program:
1. Query the host system for OpenCL devices
2. Create a context to associate the OpenCL devices
3. Create programs that will run on one or more associated devices
4. From the programs, select kernels to execute
5. Create memory objects on the host or on the device
6. Copy memory data to the device as needed
7. Provide arguments for the kernels
8. Submit the kernels to the command queue for execution
9. Copy the results from the device to the host.
clGetPlatformIDs(…)
clGetDeviceIDs(…)
clCreateContext(…)
clCreateCommandQueue(…)
clCreateProgramWithSource(…)
clBuildProgram(…)
clCreateKernel(…)
clCreateBuffer(…)
clEnqueueWriteBuffer(…)
clSetKernelArg(…)
clEnqueueNDRangeKernel(…)
clFinish(…)
clEnqueueReadBuffer(…)
MyFirst
#include <stdio.h>
#include "CL/cl.h"

#define ITEMS 10

const char *KernelSource =
"__kernel void hello( __global float *input, __global float *output )\n"
"{\n"
"    size_t id = get_global_id( 0 );\n"
"    output[id] = input[id] * input[id];\n"
"}";

int main()
{
    cl_int err;
    cl_uint num_of_platforms = 0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices = 0;
    size_t global = ITEMS;
    float inputData[ITEMS] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }, results[ITEMS] = { 0 };
    clGetPlatformIDs( 1, &platform_id, &num_of_platforms );
    clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &num_of_devices );
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
    cl_context context = clCreateContext( props, 1, &device_id, 0, 0, &err );
    cl_command_queue queue = clCreateCommandQueue( context, device_id, 0, &err );
    cl_program program = clCreateProgramWithSource( context, 1, (const char**)&KernelSource, 0, &err );
    clBuildProgram( program, 0, NULL, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "hello", &err );
    cl_mem input = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    cl_mem output = clCreateBuffer( context, CL_MEM_WRITE_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    clEnqueueWriteBuffer( queue, input, CL_TRUE, 0, sizeof( float ) * ITEMS, inputData, 0, 0, 0 );
    clSetKernelArg( kernel, 0, sizeof( cl_mem ), &input );
    clSetKernelArg( kernel, 1, sizeof( cl_mem ), &output );
    clEnqueueNDRangeKernel( queue, kernel, 1, 0, &global, 0, 0, 0, 0 );
    clFinish( queue );
    clEnqueueReadBuffer( queue, output, CL_TRUE, 0, sizeof( float ) * ITEMS, results, 0, 0, 0 );
    for( int i = 0; i < ITEMS; i++ ) printf( "%f ", results[i] );
    clReleaseMemObject( input );
    clReleaseMemObject( output );
    clReleaseProgram( program );
    clReleaseKernel( kernel );
    clReleaseCommandQueue( queue );
    clReleaseContext( context );
    return 0;
}
MyFirst
bool Kernel::InitCL()
{
    cl_platform_id platform;
    cl_device_id* devices;
    cl_uint devCount;
    cl_int error;
    ...
}
Like I said, I don’t care much for API details…
Just start with the template, and modify /
replace it when the need arises.
Assignment
Create an OpenCL program that calculates Voronoi noise
for a 512x512 buffer and make it available to the CPU.
Measure the performance gain compared to CPU-only.
Reference: https://www.shadertoy.com/view/4djGRh
Words of Advice
WebGL != OpenCL
Can’t do ‘by reference’, use pointers instead
float3 parameter: (float3)(1, 1, 1)
fract requires second parameter
sinf doesn’t exist, use sin
Also, see this helpful chart:
https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
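The advice above can be seen in one place in the following hypothetical OpenCL C device fragment (the kernel name, buffer, and constants are invented for illustration; this is device code, so it only compiles through an OpenCL driver):

```
// hypothetical device code illustrating the advice above
__kernel void advice( __global float* out )
{
    float3 v = (float3)( 1, 1, 1 );   // vector literal, not float3(1,1,1)
    float ipart;
    float f = fract( 1.75f, &ipart ); // OpenCL fract takes a second argument
    float s = sin( 0.5f );            // sin, not sinf
    out[get_global_id( 0 )] = v.x * f * s;
}
```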
“The End”(for now)