Performance Analysis: C vs CUDA

DESCRIPTION
Some tests comparing N-queens solutions between the CPU, using C, and the GPU, using CUDA.

TRANSCRIPT
Copyright Vitor F. Pamplona
Goals

● Learn CUDA and its limitations
● Implement some N-queens solutions
  ● CUDA version
  ● C++ version
● Compare performance
● Check for possible papers
  ● Parallel processing
  ● Computer graphics
N by N Queens Problem
http://en.wikipedia.org/wiki/Eight_queens_puzzle
Possibilities vs Solutions

Board Size   Possibilities                  Solutions
 1           1                              1
 2           4                              0
 3           27                             0
 4           256                            2
 5           3,125                          10
 6           46,656                         4
 7           823,543                        40
 8           16,777,216                     92
 9           387,420,489                    352
10           10,000,000,000                 724
11           285,311,670,611                2,680
12           8,916,100,448,256              14,200
13           302,875,106,592,253            73,712
14           11,112,006,825,558,000         365,596
15           437,893,890,380,859,000        2,279,184
16           18,446,744,073,709,600,000     14,772,512
17           827,240,261,886,337,000,000    95,815,104
Cu... what?

● Compute Unified Device Architecture
● C-style language and compiler
● Designed for parallel solutions
● Not a graphics API
● Runs on current graphics hardware (nVidia GeForce 8+)
● Faster transfers between CPU and GPU
● Compiler for CPU and GPU
Hardware Architecture

[Diagram, built up across several slides: the CPU side has the Host, Host Memory, and a Cache; the GPU side is the Device with its own Device Memory. On the device, threads are grouped into warps; each thread has local memory; shared memory is organized in banks; and there is a 64 kB constant memory with a cache, global memory, and texture memory optimized for 2D access with its own cache.]
Memory Access

Basics of Programming

Hardware Architecture

[Slides 18-23: diagram-only slides; their content was not captured in the transcript.]
Libraries and Access

[Diagram, built up across slides 24-28: the Application runs on the CPU and reaches the GPU through a stack of CUDA Libraries, the CUDA Runtime, and the CUDA Driver.]
Startup

● Special Windows/Linux drivers
● CUDA Toolkit
● CUDA Developer SDK, which includes:
  ● API documentation
  ● Programming guide
  ● Compiler (nvcc)
  ● Libraries (CUFFT, CUBLAS)
  ● Source code examples
Host Example

float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);
cudaFree(pOutput);
Kernel Example

__global__ void MyKernel(float* pInData, float* pOutData)
{
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}
Competitors

● AMD/ATI Close to Metal (CTM)
● RapidMind
● Acceleware
● PeakStream (unavailable since its acquisition by Google)
● BrookGPU
● OpenGL/Direct3D + GLSL/HLSL/Cg
● BSGP
Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory
CPU Monothread Depth-first Plain

● Optimized implementation
● Single thread
● Depth-first approach
● No recursion, no function calls
● Memory buffers :)
● Fast, really fast!
CPU N-threads Depth-first Plain

● N threads, where N is the board size
● First column filled in the main thread
● Creates N Linux pthreads
  ● One thread for each line
  ● Each thread processes N-1 columns
● Critical section:
  ● solutions++;
  ● saveSolution(board);
GPU Step Breadth-first

[Diagram, built up across slides 41-45: each step takes the current set of partial solutions as input. One thread per (partial solution, candidate row) pair tries to extend the board by one column and writes the surviving boards to the output buffer, which becomes the input of the next step. Threads per step = Num. Solutions * N.]
Why a Breadth-first Solution?

● Graphics processors are not Intel/AMD CPUs
  ● Slow: 650 MHz
  ● The driver can kill time-expensive kernels
● Lots of threads: good for GPU
● Easy solution-to-thread mapping by indexes
● Fast kernels: good for GPU
GPU Step Breadth-first

● Static memory version
  ● Bad: one sort of the output for each step
  ● Good for GPU
● Dynamic memory version
  ● Bad: synchronized memory access
  ● Bad: global last-output index
Plain Depth-first Dynamic

● Best case: N^4 threads
● Thread indexes fill the first 4 columns
● Depth-first approach
● Synchronized global memory access
Implementations and Threads

Solution                                 Threads
GPU-breadth-first static mem             Sol * N
GPU-breadth-first dynamic mem            Sol * N
GPU-depth-first 1-Thread                 1
GPU-depth-first n-Threads                N
GPU-depth-first n-grids                  N
GPU-depth-first n*n-grids                N*N
GPU-depth-first n*n-grids*n-threads      N*N*N
GPU-depth-first n*n-grids*n*n-threads    N*N*N*N
GPU-depth-first FULL threads             N^N
CPU-Plain                                1
CPU-Recursive                            1
CPU-Plain-Threads                        N
Test Platforms

● CPU: Intel Quad Core 2.4 GHz
  ● Ubuntu
  ● 4 GB RAM
● GPU: GeForce 9600 GT
  ● 8 multiprocessors
  ● 64 processors at 650 MHz
  ● 512 MB RAM at 900 MHz
  ● CUDA 1.0
Results: CPU

[Chart, board sizes 12-14: CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: GPU, Static vs Dynamic

[Chart, board sizes 11-12: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Same Number of Threads

[Chart, board sizes 12-13: depth-first n-Threads, depth-first n-Grids, CPU-Plain-Threads.]

Results: Only 1 Thread

[Chart, board sizes 10-12: depth-first 1-Thread, CPU-Recursive, CPU-Plain.]

Results: Dynamic vs Depth

[Chart, board size 12: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Depth vs CPU

[Chart, board size 12: depth-first n-Threads, n-Grids, n*n-grids, n*n-grids*n-threads, n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: GPU N^N Solution

[Chart, board sizes 7-9: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Dynamic, Depth, CPU

[Chart, board sizes 10-13: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-Plain-Threads.]

Results: Depth vs CPU Threads

[Chart, board sizes 14-16: depth-first N*N*N*N, CPU-Plain, CPU-Plain-Threads.]
Results

Solution                    Threads     1    2    3    4    5    6    7     8      9
GPU-breadth-first static    Sol * N   171  171  171  174  174  174  178   184    220
GPU-breadth-first dynamic   Sol * N   171  171  171  173  173  173  173   173    174
GPU-depth-first 1-Thread    1         171  171  171  171  171  171  171   185    227
GPU-depth-first n-Threads   N         171  171  171  172  172  173  173   175    230
GPU-depth-first n-grids     N         171  171  171  171  171  173  173   173    177
GPU-depth-first n*n-grids   N*N       172  172  172  172  172  172  172   172    174
GPU-depth-first N^3         N^3       171  172  172  172  172  172  172   172    174
GPU-depth-first N^4         N^4       171  171  171  171  171  171  171   171    171
GPU-depth-first FULL        N^N       171  171  172  172  172  172  230  1682  11420
CPU-Plain                   1           2    2    2    2    2    2    2     2      3
CPU-Recursive               1           2    2    2    2    2    2    2     2      3
CPU-Plain-Threads           N           2    2    2    2    2    2    2     2      5
Results

Solution                    Threads     11    12    13    14     15     16        17
GPU-breadth-first static    Sol * N   1234  6184   Mem   Mem    Mem    Mem       Mem
GPU-breadth-first dynamic   Sol * N    218   407  1481  7886   Cont    Mem       Mem
GPU-depth-first 1-Thread    1         1463  7198
GPU-depth-first n-Threads   N          441  1561  7827
GPU-depth-first n-grids     N          301   824  3604
GPU-depth-first n*n-grids   N*N        216   424  1425  7025
GPU-depth-first N^3         N^3        192   267   661  2937
GPU-depth-first N^4         N^4        181   199   360  1369   7562  43488  05:38.99
GPU-depth-first FULL        N^N
CPU-Plain                   1           18    91   502  3020  19685
CPU-Recursive               1           35   198  1225  8283  58493
CPU-Plain-Threads           N           17    84   290  1393   8578  32010  04:40.95
Conclusions

● CUDA is slow
  ● Low use of GPU graphics resources
  ● GLSL, HLSL and Cg are faster
  ● Compiler needs improvements
  ● More documentation on assembly optimization
● Unstable
  ● The GPU kills some processes (I don't know why)
● Performance depends on the implementation
● Good for mixed solutions: CPU + GPU
Conclusions

● %, * and / are slow
● threadIdx and blockIdx are fantastic
● __shared__ memory helps
● CUDA locks the screen while processing
  ● No inter-process scheduling
● Synchronized architecture
  ● Think synchronized
Questions?
Vitor [email protected]