java gpu computing

Post on 12-Jul-2015


Java GPU Computing

Maarten Steur & Arjan Lamers

● Overview of OpenCL
● Simple example
● Case study
● Tips & tricks
● Questions

Why GPU Computing

Abbreviations

● CPU, GPU, APU
● Khronos: OpenCL, OpenGL
● Nvidia: CUDA
● JogAmp JOCL, JavaCL, JOCL

GPU compared with CPU

● Many simple cores
● A lot of high-bandwidth memory

               Intel Core i7    GeForce GT 650M
Cores          8                384
Performance    180 Gflops       650 Gflops

Programming model

● Define a stream (flow)

● Run in parallel

Usage

● Algorithm:
– High concurrency
– Partitionable

● But:
– Extra latency from loading data onto and off the GPU
– Extra complexity

Components

Example (MacBook Pro)

Platform name: Apple
Platform profile: FULL_PROFILE
Platform spec version: OpenCL 1.2
Platform vendor: Apple

Device 16925696 HD Graphics 4000
Driver: 1.2(Aug 17 2014 20:29:07)
Max work group size: 512
Global mem size: 1073741824
Local mem size: 65536
Max clock freq: 1200
Max compute units: 16

Device 16918272 GeForce GT 650M
Driver: 8.26.28 310.40.55b01
Max work group size: 1024
Global mem size: 1073741824
Local mem size: 49152
Max clock freq: 900
Max compute units: 2

Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Driver: 1.1
Max work group size: 1024
Global mem size: 17179869184
Local mem size: 32768
Max clock freq: 2600
Max compute units: 8

Work & Memory

Application / Kernel

● Write .cl files in a C variant
● Kernels are the 'public' functions

● Java bytecode:
– Aparapi (OpenCL)
– RootBeer (CUDA)

Disclaimer

Parallel sort

kernel void sort(global const float* in, global float* out, int size) {
  int i = get_global_id(0); // current thread
  float id = in[i];
  int pos = 0;
  for (int j = 0; j < size; j++) {
    float jd = in[j];
    // in[j] < in[i] ? Ties are broken by index so equal values get distinct positions.
    bool smaller = (jd < id) || (jd == id && j < i);
    pos += (smaller) ? 1 : 0;
  }
  out[pos] = id;
}
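For checking the kernel's output, the same rank-sort logic can be written in plain Java and run on the CPU. The sketch below is not from the slides: the class name `RankSortReference` and the method `rankSort` are hypothetical, and it assumes the input contains no NaN values. Each iteration of the outer loop plays the role of one work-item.

```java
import java.util.Arrays;

/** CPU reference for the rank-sort kernel (hypothetical helper, not part of the original deck). */
final class RankSortReference {

    static float[] rankSort(float[] in) {
        float[] out = new float[in.length];
        // Each outer iteration corresponds to one work-item, i.e. get_global_id(0) == i.
        for (int i = 0; i < in.length; i++) {
            float id = in[i];
            int pos = 0;
            for (int j = 0; j < in.length; j++) {
                float jd = in[j];
                // Count elements that sort before in[i]; ties are broken by index.
                boolean smaller = (jd < id) || (jd == id && j < i);
                if (smaller) {
                    pos++;
                }
            }
            out[pos] = id;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] data = {3.0f, 1.0f, 2.0f, 1.0f};
        System.out.println(Arrays.toString(rankSort(data)));
    }
}
```

Comparing this result with the buffer read back from the device gives a simple regression check; the algorithm is O(n²) on the CPU, so it is only practical for small test inputs.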

Java GPU Computing

CLContext globalContext = CLContext.create();
CLDevice device = globalContext.getMaxFlopsDevice(Type.GPU);
CLContext context = CLContext.create(device);
CLCommandQueue queue = device.createCommandQueue();
CLProgram program = context.createProgram(
    First8GpuComputing.class.getResourceAsStream("MyTask.cl")).build();

You can also build for specific devices: build(device)

CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(input.length, READ_ONLY);
CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(input.length, WRITE_ONLY);
mapToBuffer(inBuffer.getBuffer(), workLoad);


CLKernel kernel = program.createCLKernel("MyTask");

kernel.putArgs(inBuffer, outBuffer).putArg(workLoad.length);


queue.putWriteBuffer(inBuffer, false)
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(outBuffer, true);

FloatBuffer output = outBuffer.getBuffer();
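The slides do not show how `globalWorkSize` and `localWorkSize` are chosen. In OpenCL 1.x the global work size must be a multiple of the local work size when a local size is given, so a common approach is to round the element count up. The sketch below assumes that convention; the class `WorkSizes`, the helper `roundUp` and the numbers in `main` are illustrative and not taken from the deck.

```java
/** Work-size arithmetic for a 1D range (hypothetical helper, not part of the original deck). */
final class WorkSizes {

    /** Rounds globalSize up to the nearest multiple of groupSize. */
    static int roundUp(int groupSize, int globalSize) {
        int remainder = globalSize % groupSize;
        return remainder == 0 ? globalSize : globalSize + groupSize - remainder;
    }

    public static void main(String[] args) {
        int elementCount = 1000;  // illustrative input length
        int localWorkSize = 256;  // illustrative; must not exceed the device's max work group size
        int globalWorkSize = roundUp(localWorkSize, elementCount);
        System.out.println(globalWorkSize);
    }
}
```

When the global size is padded like this, the extra work-items have an id beyond the end of the data, so the kernel needs a guard such as the `if (testId < testCount)` check used in the test kernel in this deck.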

Case study

● Calculation tool supporting the Programmatic Approach to Nitrogen (Programmatische Aanpak Stikstof).

● http://www.aerius.nl

Tips & tricks

● Managing CL sources:
– getResourceAsStream()?
– Java constants → #define
– Locale? Oops!
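The locale pitfall can be illustrated as follows. When Java constants are turned into `#define` lines, formatting with the JVM's default locale may emit a decimal comma (for example under a Dutch locale), which is not a valid literal in the C-like kernel source. This is a minimal sketch, assuming the defines are generated as a string and prepended to the kernel source; the class `ClDefines`, the method `define` and the constant name `MAX_DISTANCE` are hypothetical.

```java
import java.util.Locale;

/** Generates "#define" lines for OpenCL source (hypothetical helper, not part of the original deck). */
final class ClDefines {

    /** Formats one define line; Locale.ROOT forces '.' as the decimal separator. */
    static String define(String name, double value) {
        return String.format(Locale.ROOT, "#define %s %.6f\n", name, value);
    }

    public static void main(String[] args) {
        // Locale-independent output, safe to paste into kernel source.
        System.out.print(define("MAX_DISTANCE", 2.5));

        // For comparison: the same value formatted with a Dutch locale.
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(String.format(dutch, "%.6f", 2.5));
    }
}
```

Passing an explicit locale to every formatting call keeps the generated source identical on all machines, regardless of operating system settings.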

Tips & tricks

● Unit testing:
– Separate test kernels
– Test cases in batches

kernel void testDifficultCalculation(const int testCount,
                                     global const double* distance,
                                     global double* results) {
  const int testId = get_global_id(0);
  if (testId < testCount) {
    results[testId] = difficultCalculation(distance[testId]);
  }
}

Direct memory management

● -XX:MaxDirectMemorySize=??M
● ByteBuffer.allocateDirect(int capacity)
– Max 2 GB per buffer

● Garbage collection comes too late:
– Only triggered by a heap collection
– Release manually:
– ((sun.nio.ch.DirectBuffer) myBuffer).cleaner().clean();

● VisualVM plugin for direct buffers
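The 2 GB limit follows from the `int capacity` parameter, which counts bytes. A minimal sketch of allocating an off-heap float buffer with an explicit size check is shown below; the class `DirectBuffers` and the method `allocateFloats` are hypothetical names, and the sketch uses only the standard `java.nio` API rather than the internal `sun.nio.ch` cleaner call listed above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Direct buffer allocation with a size check (hypothetical helper, not part of the original deck). */
final class DirectBuffers {

    /** Allocates an off-heap FloatBuffer in native byte order, holding count floats. */
    static FloatBuffer allocateFloats(int count) {
        // Compute the byte size in long arithmetic so an overflow is detected instead of wrapping.
        long bytes = (long) count * Float.BYTES;
        if (bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "direct buffer would exceed the int capacity limit: " + bytes + " bytes");
        }
        return ByteBuffer.allocateDirect((int) bytes)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer buffer = allocateFloats(1024);
        System.out.println(buffer.capacity() + " floats, direct=" + buffer.isDirect());
    }
}
```

Because a float takes 4 bytes, a single direct buffer can hold at most roughly 536 million floats; larger data sets have to be split over several buffers.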

GPU vs CPU

● GPUs perform fewer checks than CPUs:
– Division by zero
– Out-of-bounds checks
– Test on the CPU first

Portability

● OpenCL is portable, its performance is not:
– Memory sizes differ
– Memory latencies differ
– Work group sizes differ
– Compute devices differ
– OpenCL implementations differ

● So develop for the production hardware

In closing

● Float vs double:
– Double the precision
– Half the performance
– Double support is optional
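The precision side of this trade-off can be seen on the CPU with plain Java, since Java's `float` and `double` are also IEEE 754 single and double precision. The sketch below is illustrative only: the class `FloatVsDouble` and the choice of repeatedly adding 0.1 are assumptions made for the example, not something measured in the deck.

```java
/** Compares accumulated rounding error in float and double (illustrative, not from the original deck). */
final class FloatVsDouble {

    /** Adds 0.1f to a float accumulator n times. */
    static float sumFloat(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += 0.1f;
        }
        return sum;
    }

    /** Adds 0.1 to a double accumulator n times. */
    static double sumDouble(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 0.1;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        double exact = n / 10.0;
        System.out.println("float error:  " + Math.abs(sumFloat(n) - exact));
        System.out.println("double error: " + Math.abs(sumDouble(n) - exact));
    }
}
```

Long accumulations like this are where single precision drifts visibly, which is worth weighing against the performance cost of doubles and the fact that not every device supports them.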

Conclusion

● When to use it?
– When the performance is really needed
– When the problem has high concurrency
– When the problem is partitionable

Questions?

Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Warming up OpenCL test
[thread 32003 also had an error]
[thread 33027 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error][thread 32259 also had an error] at pc=0x00000001250ded70, pid=99851, tid=29475
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# [thread 17415 also had an error]C [cl_kernels+0x1d70] sort_wrapper+0x1b0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/arjanl/Documents/opencl/workspace/opencl-test/jogamp/hs_err_pid99851.log
[thread 31763 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
