Tutorial: High Performance SBSE Using Commodity Graphics Cards
Simon Poulding, University of York, UK
SSBSE, September 2012
© Simon Poulding & The University of York, 2012
SBSE and High Performance Computing
entire search algorithm parallelisable
operations within algorithm parallelisable
[Diagram: search loop of VARIATION, EVALUATION and SELECTION steps]
Distributed Computing
Multicore Computing
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
    Parallelising Search Algorithm
    Parallelising Fitness Evaluation
    Parallelising Software Execution
GPU Cards
Technical Innovation
[Chart: GFLOP/s (single precision), axis 0 to 4000, against release date, 2008 to 2013, for GeForce GTX 280, GeForce GTX 480, GeForce GTX 580 and GeForce GTX 680]
Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012
General Purpose Computing on GPUs (GPGPU)
NVIDIA GPUs (most since 2009)
NVIDIA GPUs
AMD GPUs
Intel HD GPUs
other vendors ...
Intel Core CPUs
Physical Architecture
[Diagram: the CPU with system memory (DRAM); the GPU with global memory (DRAM) and streaming multiprocessors, each containing shared memory, registers and 'cores']
Adapted from “CUDA C Best Practices Guide”, NVIDIA, May 2012
Logical Architecture
[Diagram: threads, each with local memory; blocks, each with shared memory; global memory]
Mapping Logical to Physical
[Diagram: a block mapped to a streaming multiprocessor with its shared memory, registers and 'cores']
CUDA Performance Features
single-instruction multiple-thread
hardware multithreading
coalesced memory access
Single-Instruction Multiple-Thread
1 warp = 32 threads
Hardware Multithreading & Occupancy
[Diagram: several groups of threads scheduled on one multiprocessor (shared memory, registers, 'cores')]
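As a rough illustration of occupancy, the sketch below asks the CUDA runtime how many blocks of a given size can be resident on one multiprocessor and converts that to a fraction of the maximum number of resident warps. It assumes a CUDA toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which postdates this tutorial; `scaleKernel` and `estimateOccupancy` are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only so there is something to measure.
__global__ void scaleKernel(const int *in, int *out, int factor) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * factor;
}

// Returns the estimated occupancy in [0, 1] for the given block size,
// or -1 if the CUDA runtime reports an error (e.g. no device present).
float estimateOccupancy(int blockSize) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scaleKernel, blockSize, 0) != cudaSuccess) {
        return -1.0f;
    }

    // Warps are the scheduling unit, so round the block size up to whole warps.
    const int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    const int activeWarps = blocksPerSM * warpsPerBlock;
    const int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)activeWarps / (float)maxWarps;
}

int main() {
    const int sizes[] = {32, 128, 256, 512};
    for (int i = 0; i < 4; ++i) {
        printf("block size %d: occupancy %.2f\n", sizes[i], estimateOccupancy(sizes[i]));
    }
    return 0;
}
```

Higher occupancy gives the hardware more warps to switch between while others wait on memory, but it is only one factor in performance.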
Coalesced Memory Access
[Diagram: a group of threads accessing global memory]
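To make the access pattern concrete, here is a minimal sketch contrasting two copy kernels. In `copyCoalesced`, neighbouring threads touch neighbouring addresses; in `copyStrided`, neighbouring threads touch addresses 33 elements apart. The kernel names, array size and stride are assumptions chosen for illustration. Both kernels produce the same result, and the sketch only checks correctness; a profiler would be needed to compare their memory throughput.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i reads and writes element i: accesses within a warp are contiguous.
__global__ void copyCoalesced(const int *in, int *out, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Thread i reads and writes element (i * stride) % n: accesses within a warp are scattered.
// With n a power of two and an odd stride, every element is still copied exactly once.
__global__ void copyStrided(const int *in, int *out, unsigned int n, unsigned int stride) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const unsigned int j = (i * stride) % n;
        out[j] = in[j];
    }
}

// Runs both kernels and returns true if each reproduces the input exactly.
bool copiesMatch() {
    const unsigned int n = 1u << 20, stride = 33, threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    std::vector<int> host(n), back(n);
    for (unsigned int i = 0; i < n; ++i) host[i] = (int)i;

    int *in = nullptr, *out = nullptr;
    if (cudaMalloc((void **)&in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc((void **)&out, n * sizeof(int)) != cudaSuccess) { cudaFree(in); return false; }
    cudaMemcpy(in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    bool ok = true;
    for (int pass = 0; pass < 2 && ok; ++pass) {
        cudaMemset(out, 0xFF, n * sizeof(int));  // clear the previous result
        if (pass == 0) copyCoalesced<<<blocks, threads>>>(in, out, n);
        else           copyStrided<<<blocks, threads>>>(in, out, n, stride);
        // cudaMemcpy waits for the kernel to finish before copying.
        ok = cudaMemcpy(back.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
             && back == host;
    }
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main() {
    printf("%s\n", copiesMatch() ? "both copies correct" : "mismatch or CUDA error");
    return 0;
}
```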
Typical CUDA Application Pattern
[Diagram: host (system memory) and device (global memory). Steps: memory copy, kernel launch, threads running kernel code, kernel completion, memory copy]
Example Problem
a       b     c = a * b
382     17    ?
1124    17    ?
30      17    ?
...
2781    98    ?
824     98    ?
4510    98    ?
...
4088    31    ?
...
256 x 64
Kernel Code (device-side)
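__global__ void exampleKernel(int * a, int * b, int * c) {
    // one copy of sb per block, visible to every thread in that block
    __shared__ int sb;
    const unsigned int thread = threadIdx.x;
    const unsigned int block = blockIdx.x;
    const unsigned int gThread = block * blockDim.x + thread;
    // the first thread of each block loads the block's value of b
    if (thread == 0) {
        sb = b[block];
    }
    // wait until sb has been written before any thread reads it
    __syncthreads();
    c[gThread] = a[gThread] * sb;
}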
Launching a Kernel (host-side)
const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;
dim3 gridD(numBlocks, 1, 1);
dim3 blockD(numThreads, 1, 1);
exampleKernel<<<gridD,blockD>>>(a,b,c);
Allocating and Copying Memory (host-side)
const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;
// each of a, b and c must be declared as a pointer
int *a, *b, *c;
cudaMalloc((void **)&a, numThreads * numBlocks * sizeof(int));
cudaMalloc((void **)&b, numBlocks * sizeof(int));
cudaMalloc((void **)&c, numThreads * numBlocks * sizeof(int));
cudaMemcpy(a, inputA, numThreads * numBlocks * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(b, inputB, numBlocks * sizeof(int), cudaMemcpyHostToDevice);
Putting It All Together
__global__ void exampleKernel(int * a, int * b, int * c) {
    ...
}
int main(...) {
    ...
    cudaMalloc(...);
    cudaMemcpy(...);
    ...
    exampleKernel<<<gridD,blockD>>>(a,b,c);
    ...
    cudaMemcpy(...);
    ...
}
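The skeleton above can be filled in as a complete program along the following lines. This is a sketch rather than the tutorial's own listing: the input values, the `CHECK` macro, the `runExample` helper and the CPU verification loop are assumptions added for illustration, and device memory is not freed on the error paths.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the slides): report a CUDA error and return false.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); return false; } } while (0)

__global__ void exampleKernel(int *a, int *b, int *c) {
    __shared__ int sb;
    const unsigned int thread = threadIdx.x;
    const unsigned int block = blockIdx.x;
    const unsigned int gThread = block * blockDim.x + thread;
    if (thread == 0) { sb = b[block]; }  // one load of b per block
    __syncthreads();
    c[gThread] = a[gThread] * sb;
}

// Runs the example on made-up inputs and checks the result against the CPU.
bool runExample() {
    const unsigned int numThreads = 256, numBlocks = 64;
    const unsigned int n = numThreads * numBlocks;

    // Host-side inputs and output (values are arbitrary test data).
    std::vector<int> inputA(n), inputB(numBlocks), outputC(n);
    for (unsigned int i = 0; i < n; ++i) inputA[i] = (int)(i % 5000);
    for (unsigned int j = 0; j < numBlocks; ++j) inputB[j] = (int)(j + 1);

    // Allocate device memory and copy the inputs to the device.
    int *a = nullptr, *b = nullptr, *c = nullptr;
    CHECK(cudaMalloc((void **)&a, n * sizeof(int)));
    CHECK(cudaMalloc((void **)&b, numBlocks * sizeof(int)));
    CHECK(cudaMalloc((void **)&c, n * sizeof(int)));
    CHECK(cudaMemcpy(a, inputA.data(), n * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(b, inputB.data(), numBlocks * sizeof(int), cudaMemcpyHostToDevice));

    // Launch one block per value of b, one thread per value of a.
    dim3 gridD(numBlocks, 1, 1);
    dim3 blockD(numThreads, 1, 1);
    exampleKernel<<<gridD, blockD>>>(a, b, c);
    CHECK(cudaGetLastError());

    // Copy the result back; cudaMemcpy waits for the kernel to complete.
    CHECK(cudaMemcpy(outputC.data(), c, n * sizeof(int), cudaMemcpyDeviceToHost));
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    for (unsigned int i = 0; i < n; ++i) {
        if (outputC[i] != inputA[i] * inputB[i / numThreads]) return false;
    }
    return true;
}

int main() {
    const bool ok = runExample();
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Saved as, say, `example.cu`, this would typically be built with `nvcc example.cu -o example`.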
Build Process
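[Diagram: nvcc splits a CUDA source file into host source and device source; the device source is compiled to device intermediate code (PTX) and device executable code (cubin); the host source with embedded device code, together with any non-CUDA source file, passes through the standard compiler and linker to produce the host executable]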
Adapted from “CUDA Compiler Driver NVCC”, NVIDIA, May 2012
Compute Capability
compute capability                      1.0         1.1         1.2         1.3         2.x          3.0          3.5
atomic functions (global memory)        No          Yes         Yes         Yes         Yes          Yes          Yes
atomic functions (shared memory)        No          No          Yes         Yes         Yes          Yes          Yes
warp vote functions                     No          No          Yes         Yes         Yes          Yes          Yes
double precision floating point         No          No          No          Yes         Yes          Yes          Yes
additional fence and sync functions     No          No          No          No          Yes          Yes          Yes
max number of threads per block         512         512         512         512         1024         1024         1024
number of registers per multiprocessor  8K          8K          16K         16K         32K          64K          64K
max shared memory per multiprocessor    16KB        16KB        16KB        16KB        48KB         48KB         48KB
local memory per thread                 16KB        16KB        16KB        16KB        512KB        512KB        512KB
max number of instructions per kernel   2 million   2 million   2 million   2 million   512 million  512 million  512 million
Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012
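The compute capability of an installed card, and some of the limits tabulated above, can be queried at run time. The following is a minimal sketch using `cudaGetDeviceProperties`; the function name `printDeviceCapabilities` is hypothetical, and the `regsPerMultiprocessor` and `sharedMemPerMultiprocessor` fields are assumed to be available, which requires a toolkit newer than the one current when this tutorial was given.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints selected properties of every CUDA device.
// Returns the number of devices found, or -1 if the runtime reports an error.
int printDeviceCapabilities() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) return -1;
        printf("device %d: %s\n", d, prop.name);
        printf("  compute capability:            %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:               %d\n", prop.multiProcessorCount);
        printf("  warp size:                     %d\n", prop.warpSize);
        printf("  max threads per block:         %d\n", prop.maxThreadsPerBlock);
        printf("  registers per multiprocessor:  %d\n", prop.regsPerMultiprocessor);
        printf("  shared memory per multiproc.:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    }
    return count;
}

int main() {
    const int count = printDeviceCapabilities();
    if (count < 0) printf("could not query CUDA devices\n");
    else printf("%d CUDA device(s) found\n", count);
    return 0;
}
```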
Additional Tools and Libraries
Development Tools
    debugger
    memory checker
    profiler
CUDA Libraries
    linear algebra (CUBLAS)
    sparse matrices (CUSPARSE)
    random number generation (CURAND)
    fast Fourier transform (CUFFT)
    Thrust
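As a taste of what the libraries offer, the sketch below uses Thrust to repeat the element-wise multiplication from the earlier example and then sum the products, without writing a kernel or managing device memory by hand. The function name `sumOfProducts` and the input values are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Computes sum over i of a[i] * b[i] on the GPU, with a = 0, 1, ..., n-1 and b = 2, 2, ..., 2.
int sumOfProducts(int n) {
    thrust::device_vector<int> a(n), b(n, 2), c(n);
    thrust::sequence(a.begin(), a.end());  // a = 0, 1, 2, ...

    // c[i] = a[i] * b[i], executed in parallel on the device.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::multiplies<int>());

    // Parallel reduction of c to a single value, returned to the host.
    return thrust::reduce(c.begin(), c.end(), 0);
}

int main() {
    printf("%d\n", sumOfProducts(1000));  // 2 * (0 + 1 + ... + 999) = 999000
    return 0;
}
```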
Bayesian Optimisation Algorithm
Ising Spin Glass
[Diagram: lattice of spins, each +1 or -1]
![Page 30: Tutorial: High Performance SBSE Using Commodity Graphics Cards](https://reader030.vdocuments.net/reader030/viewer/2022021209/620639588c2f7b1730059d91/html5/thumbnails/30.jpg)
Implementation
VARIATION: build Bayesian network model (CUDA kernel)
EVALUATION: calculate Ising spin glass energy (CUDA kernel)
SELECTION: restricted tournament replacement (CUDA kernel)
Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011
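To give a feel for what an evaluation kernel might look like, here is a minimal sketch that computes the Ising spin glass energy E = -Σ J_ij s_i s_j for a population of candidate solutions, one thread per candidate. It is not the implementation from the cited competition entry: the cubic lattice with periodic boundaries, the storage layout, the ±1 couplings and all names are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread evaluates one candidate: an L x L x L lattice of +1/-1 spins.
// couplings[3*i + k] is the coupling between site i and its neighbour in the +x, +y or +z direction.
__global__ void isingEnergyKernel(const signed char *spins, const signed char *couplings,
                                  int *energy, int L, int populationSize) {
    const int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= populationSize) return;

    const signed char *s = spins + p * L * L * L;
    int e = 0;
    for (int z = 0; z < L; ++z) {
        for (int y = 0; y < L; ++y) {
            for (int x = 0; x < L; ++x) {
                const int i  = (z * L + y) * L + x;
                const int xn = (z * L + y) * L + (x + 1) % L;        // periodic +x neighbour
                const int yn = (z * L + (y + 1) % L) * L + x;        // periodic +y neighbour
                const int zn = (((z + 1) % L) * L + y) * L + x;      // periodic +z neighbour
                e -= s[i] * (couplings[3 * i] * s[xn] + couplings[3 * i + 1] * s[yn]
                             + couplings[3 * i + 2] * s[zn]);
            }
        }
    }
    energy[p] = e;
}

// Energy of the all-(+1) configuration with all couplings +1, which should be -3 * L^3 for L >= 3.
// Returns 1 (an impossible energy for this configuration) if a CUDA call fails.
int allUpEnergy(int L) {
    const int n = L * L * L;
    std::vector<signed char> spins(n, 1), couplings(3 * n, 1);
    signed char *dSpins = nullptr, *dCouplings = nullptr;
    int *dEnergy = nullptr;
    int energy = 1;

    if (cudaMalloc((void **)&dSpins, n) == cudaSuccess &&
        cudaMalloc((void **)&dCouplings, 3 * n) == cudaSuccess &&
        cudaMalloc((void **)&dEnergy, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(dSpins, spins.data(), n, cudaMemcpyHostToDevice);
        cudaMemcpy(dCouplings, couplings.data(), 3 * n, cudaMemcpyHostToDevice);
        isingEnergyKernel<<<1, 32>>>(dSpins, dCouplings, dEnergy, L, 1);
        if (cudaMemcpy(&energy, dEnergy, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
            energy = 1;
        }
    }
    cudaFree(dSpins);
    cudaFree(dCouplings);
    cudaFree(dEnergy);
    return energy;
}

int main() {
    printf("energy for L = 4: %d\n", allUpEnergy(4));
    return 0;
}
```

With one thread per candidate, neighbouring threads read spins a whole lattice apart, so these global memory accesses are not coalesced; interleaving the candidates in memory would be one way to address that.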
Results
[Chart: GPU speed-up (axis 0 to 100) against problem size: 8x8x8, 12x12x12, 16x16x16, 24x24x24]
Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011
Multi-Objective Test Suite Minimisation
(rows: requirements; columns: test cases)
        t1   t2   t3   ...  tl
r1      1    0    1    ...  0
r2      1    0    0    ...  1
r3      0    1    1    ...  1
...     ...  ...  ...  ...  ...
rm      1    1    0    ...  0
cost    9    7    4    ...  6
Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011
Implementation
MO algorithm: NSGA-II (Java jMetal MOEA library)
EVALUATION: calculation of coverage and cost by matrix multiplication (OpenCL using JavaCL wrapper)
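The idea of evaluating a whole population by matrix multiplication can be sketched as follows. Note that the cited work used OpenCL through JavaCL; this example swaps in CUDA to match the rest of the tutorial and is not the authors' code. The matrix layout, the use of `atomicAdd`, the tiny data set (the first three rows and columns of the table above) and all names are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A is m x l (A[r*l + t] = 1 if test t covers requirement r).
// B is l x n (B[t*n + j] = 1 if candidate suite j includes test t).
// Thread (r, j) computes element (r, j) of the product A x B; a non-zero value
// means requirement r is covered by suite j. Threads with r == 0 also compute the cost.
__global__ void evaluateKernel(const int *A, const int *B, const int *testCost,
                               int *covered, int *cost, int m, int l, int n) {
    const int j = blockIdx.x * blockDim.x + threadIdx.x;  // candidate suite
    const int r = blockIdx.y * blockDim.y + threadIdx.y;  // requirement
    if (r >= m || j >= n) return;

    int dot = 0, c = 0;
    for (int t = 0; t < l; ++t) {
        dot += A[r * l + t] * B[t * n + j];
        c += testCost[t] * B[t * n + j];
    }
    if (dot > 0) atomicAdd(&covered[j], 1);  // count covered requirements per suite
    if (r == 0) cost[j] = c;
}

// Evaluates two candidate suites; writes results to covered[2] and cost[2].
bool evaluateSuites(int *covered, int *cost) {
    const int m = 3, l = 3, n = 2;
    const int hA[m * l] = {1, 0, 1,   1, 0, 0,   0, 1, 1};
    const int hB[l * n] = {1, 1,   0, 0,   0, 1};  // suite 0 = {t1}, suite 1 = {t1, t3}
    const int hCost[l] = {9, 7, 4};

    int *dA, *dB, *dCost, *dCovered, *dOut;
    if (cudaMalloc((void **)&dA, sizeof(hA)) != cudaSuccess) return false;
    cudaMalloc((void **)&dB, sizeof(hB));
    cudaMalloc((void **)&dCost, sizeof(hCost));
    cudaMalloc((void **)&dCovered, n * sizeof(int));
    cudaMalloc((void **)&dOut, n * sizeof(int));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    cudaMemcpy(dCost, hCost, sizeof(hCost), cudaMemcpyHostToDevice);
    cudaMemset(dCovered, 0, n * sizeof(int));

    dim3 block(16, 16, 1);
    dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y, 1);
    evaluateKernel<<<grid, block>>>(dA, dB, dCost, dCovered, dOut, m, l, n);

    const bool ok =
        cudaMemcpy(covered, dCovered, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess &&
        cudaMemcpy(cost, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dCost); cudaFree(dCovered); cudaFree(dOut);
    return ok;
}

int main() {
    int covered[2], cost[2];
    if (!evaluateSuites(covered, cost)) { printf("CUDA error\n"); return 1; }
    for (int j = 0; j < 2; ++j) {
        printf("suite %d: %d of 3 requirements covered, cost %d\n", j, covered[j], cost[j]);
    }
    return 0;
}
```

Here suite 0 covers r1 and r2 at cost 9, and suite 1 covers all three requirements at cost 13.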
Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011
Results
[Chart: GPU speed-up (axis 0 to 30) against problem size: 5.92E+4, 6.62E+5, 1.12E+7]
Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011
Implementation
EVALUATION: execute instrumented software with test inputs (CUDA kernel)
research funded by the MOD Centre for Defence Enterprise (CDE)
Language Compatibility
large subset of C++
including:
    OO features
    templates
    math library
    IEEE 754 floating point compliance
only in compute capability 2.0+:
    dynamic memory allocation
    function pointers
    function recursion
    multiple source code files
missing:
    Standard Template Library
    runtime type information
    network and file IO
    rand()
Results
[Chart: GPU speed-up (axis 0 to 80) against problem size: ~20 LOC, ~100 LOC, ~1,500 LOC]
Resources
NVIDIA CUDA Zone
CUDA SDK samples - ‘template’ application
C Programming Guide
C Best Practices Guide