Performance Analysis: C vs CUDA

DESCRIPTION
Some tests comparing N-queens solutions between the CPU, using C, and the GPU, using CUDA.

TRANSCRIPT
Copyright Vitor F. Pamplona
Goals

● Learn CUDA and its limitations
● Implement some N-queens solutions
  ● CUDA version
  ● C++ version
● Compare performance
● Check for possible papers
  ● Parallel processing
  ● Computer graphics
N by N Queens Problem
http://en.wikipedia.org/wiki/Eight_queens_puzzle
Possibilities vs Solutions

Board Size   Possibilities                  Solutions
 1           1                              1
 2           4                              0
 3           27                             0
 4           256                            2
 5           3,125                          10
 6           46,656                         4
 7           823,543                        40
 8           16,777,216                     92
 9           387,420,489                    352
10           10,000,000,000                 724
11           285,311,670,611                2,680
12           8,916,100,448,256              14,200
13           302,875,106,592,253            73,712
14           11,112,006,825,558,000         365,596
15           437,893,890,380,859,000        2,279,184
16           18,446,744,073,709,600,000     14,772,512
17           827,240,261,886,337,000,000    95,815,104
Cu... what?

● Compute Unified Device Architecture
● C-style language and compiler
● Designed for parallel solutions
● Not a graphics API
● Runs on current graphics hardware (nVidia GeForce 8+)
● Faster transfers between CPU and GPU
● Compiler for CPU and GPU
Hardware Architecture

[Diagram, built up across several slides: the CPU side has the Host, Host Memory, and a Cache; the GPU side is the Device with its own Device Memory. On the device, threads are grouped into warps; each thread has local memory; shared memory is organized in banks; and there is a 64 kB constant memory with a cache, global memory, and texture memory optimized for 2D access with its own cache.]
Memory Access

Basics of Programming

Hardware Architecture

[Slides 18-23: diagram-only slides; their content was not captured in the transcript.]
Libraries and Access

[Diagram, built up across slides 24-28: the Application runs on the CPU and reaches the GPU through a stack of CUDA Libraries, the CUDA Runtime, and the CUDA Driver.]
Startup

● Special Windows/Linux drivers
● CUDA Toolkit
● CUDA Developer SDK, which includes:
  ● API documentation
  ● Programming guide
  ● Compiler (nvcc)
  ● Libraries (CUFFT, CUBLAS)
  ● Source code examples
Host Example

float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);
cudaFree(pOutput);
Kernel Example

__global__ void MyKernel(float* pInData, float* pOutData)
{
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}
Competitors

● AMD/ATI Close to Metal (CTM)
● RapidMind
● Acceleware
● PeakStream (unavailable since its acquisition by Google)
● BrookGPU
● OpenGL/Direct3D + GLSL/HLSL/Cg
● BSGP
Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory
CPU Monothread Depth-first Plain

● Optimized implementation
● Single thread
● Depth-first approach
● No recursion, no function calls
● Memory buffers :)
● Fast, really fast!
CPU N-threads Depth-first Plain

● N threads, where N is the board size
● First column filled in the main thread
● Creates N Linux pthreads
  ● One thread for each line
  ● Each thread processes N-1 columns
● Critical section:
  ● solutions++;
  ● saveSolution(board);
GPU Step Breadth-first

[Diagram, built up across slides 41-45: each step takes the current set of partial solutions as input. One thread per (partial solution, candidate row) pair tries to extend the board by one column and writes the surviving boards to the output buffer, which becomes the input of the next step. Threads per step = Num. Solutions * N.]
Why a Breadth-first Solution?

● Graphics processors are not Intel/AMD CPUs
  ● Slow: 650 MHz
  ● The driver can kill time-expensive kernels
● Lots of threads: good for GPU
● Easy solution-to-thread mapping by indexes
● Fast kernels: good for GPU
GPU Step Breadth-first

● Static memory version
  ● Bad: one sort of the output for each step
  ● Good for GPU
● Dynamic memory version
  ● Bad: synchronized memory access
  ● Bad: global last-output index
Plain Depth-first Dynamic

● Best case: N^4 threads
● Thread indexes fill the first 4 columns
● Depth-first approach
● Synchronized global memory access
Implementations and Threads

Solution                                 Threads
GPU-breadth-first static mem             Sol * N
GPU-breadth-first dynamic mem            Sol * N
GPU-depth-first 1-Thread                 1
GPU-depth-first n-Threads                N
GPU-depth-first n-grids                  N
GPU-depth-first n*n-grids                N*N
GPU-depth-first n*n-grids*n-threads      N*N*N
GPU-depth-first n*n-grids*n*n-threads    N*N*N*N
GPU-depth-first FULL threads             N^N
CPU-Plain                                1
CPU-Recursive                            1
CPU-Plain-Threads                        N
Test Platforms

● CPU: Intel Quad Core 2.4 GHz
  ● Ubuntu
  ● 4 GB RAM
● GPU: GeForce 9600 GT
  ● 8 multiprocessors
  ● 64 processors at 650 MHz
  ● 512 MB RAM at 900 MHz
  ● CUDA 1.0
Results: CPU

[Chart, board sizes 12-14: CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: GPU, Static vs Dynamic

[Chart, board sizes 11-12: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Same Number of Threads

[Chart, board sizes 12-13: depth-first n-Threads, depth-first n-Grids, CPU-Plain-Threads.]

Results: Only 1 Thread

[Chart, board sizes 10-12: depth-first 1-Thread, CPU-Recursive, CPU-Plain.]

Results: Dynamic vs Depth

[Chart, board size 12: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Depth vs CPU

[Chart, board size 12: depth-first n-Threads, n-Grids, n*n-grids, n*n-grids*n-threads, n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: GPU N^N Solution

[Chart, board sizes 7-9: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Dynamic, Depth, CPU

[Chart, board sizes 10-13: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Depth vs CPU Threads

[Chart, board sizes 14-16: depth-first N*N*N*N, CPU-Plain, CPU-Plain-Threads.]
Results

Solution                    Threads     1    2    3    4    5    6    7     8      9
GPU-breadth-first static    Sol * N   171  171  171  174  174  174  178   184    220
GPU-breadth-first dynamic   Sol * N   171  171  171  173  173  173  173   173    174
GPU-depth-first 1-Thread    1         171  171  171  171  171  171  171   185    227
GPU-depth-first n-Threads   N         171  171  171  172  172  173  173   175    230
GPU-depth-first n-grids     N         171  171  171  171  171  173  173   173    177
GPU-depth-first n*n-grids   N*N       172  172  172  172  172  172  172   172    174
GPU-depth-first N^3         N^3       171  172  172  172  172  172  172   172    174
GPU-depth-first N^4         N^4       171  171  171  171  171  171  171   171    171
GPU-depth-first FULL        N^N       171  171  172  172  172  172  230  1682  11420
CPU-Plain                   1           2    2    2    2    2    2    2     2      3
CPU-Recursive               1           2    2    2    2    2    2    2     2      3
CPU-Plain-Threads           N           2    2    2    2    2    2    2     2      5
Results

Solution                    Threads     11    12    13    14     15     16        17
GPU-breadth-first static    Sol * N   1234  6184   Mem   Mem    Mem    Mem       Mem
GPU-breadth-first dynamic   Sol * N    218   407  1481  7886   Cont    Mem       Mem
GPU-depth-first 1-Thread    1         1463  7198
GPU-depth-first n-Threads   N          441  1561  7827
GPU-depth-first n-grids     N          301   824  3604
GPU-depth-first n*n-grids   N*N        216   424  1425  7025
GPU-depth-first N^3         N^3        192   267   661  2937
GPU-depth-first N^4         N^4        181   199   360  1369   7562  43488  05:38.99
GPU-depth-first FULL        N^N
CPU-Plain                   1           18    91   502  3020  19685
CPU-Recursive               1           35   198  1225  8283  58493
CPU-Plain-Threads           N           17    84   290  1393   8578  32010  04:40.95
Conclusions

● CUDA is slow
  ● Low use of GPU graphics resources
  ● GLSL, HLSL and Cg are faster
  ● Compiler needs improvements
  ● More documentation on assembly optimization
● Unstable
  ● The GPU kills some processes (I don't know why)
● Performance depends on the implementation
● Good for mixed solutions: CPU + GPU
Conclusions

● %, * and / are slow
● threadIdx and blockIdx are fantastic
● __shared__ memory helps
● CUDA locks the screen while processing
  ● No inter-process scheduling
● Synchronized architecture
  ● Think synchronized
Questions?
Vitor [email protected]