NVIDIA® CUDA™ 5.0 Sample evaluation result
PART Ⅰ
GPU: GTX 560 Ti
CPU: i5-3450S (TDP65W)
RAM: 16GB
OS: Windows 7 x64 Ultimate
Yukio Saitoh | FXFROG.com
21st/Apr/2013
INDEX
Sample binary:
1. alignedTypes.exe
2. asyncAPI.exe
3. bandwidthTest.exe
4. batchCUBLAS.exe
5. bicubicTexture.exe
6. bilateralFilter.exe
7. bindlessTexture.exe / Failure
8. binomialOptions.exe
9. BlackScholes.exe
10. boxFilter.exe
11. boxFilterNPP.exe
12. cdpAdvancedQuicksort.exe / Failure
13. cdpLUDecomposition.exe / Failure
14. cdpQuadTree.exe / Failure
15. cdpSimplePrint.exe / Failure
16. cdpSimpleQuicksort.exe / Failure
17. clock.exe
Sample target path and files
• C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release
alignedTypes.exe 1/2
[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\alignedTypes.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
[GeForce GTX 560 Ti] has 8 MP(s) x 48 (Cores/MP) = 384 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 2.563287 ms / Copy throughput: 18.166525 GB/s.
TEST OK
uint16...
Avg. time: 1.429239 ms / Copy throughput: 32.580981 GB/s.
TEST OK
RGBA8_misaligned...
Avg. time: 1.766606 ms / Copy throughput: 26.359026 GB/s.
TEST OK
LA32_misaligned...
Avg. time: 0.998594 ms / Copy throughput: 46.631585 GB/s.
TEST OK
RGB32_misaligned...
Avg. time: 1.273794 ms / Copy throughput: 36.556941 GB/s.
TEST OK
RGBA32_misaligned...
Avg. time: 1.703606 ms / Copy throughput: 27.333794 GB/s.
TEST OK
alignedTypes.exe 2/2
Testing aligned types...
RGBA8...
Avg. time: 1.131558 ms / Copy throughput: 41.152104 GB/s.
TEST OK
I32...
Avg. time: 1.091073 ms / Copy throughput: 42.679095 GB/s.
TEST OK
LA32...
Avg. time: 0.952468 ms / Copy throughput: 48.889827 GB/s.
TEST OK
RGB32...
Avg. time: 1.431797 ms / Copy throughput: 32.522784 GB/s.
TEST OK
RGBA32...
Avg. time: 0.961305 ms / Copy throughput: 48.440401 GB/s.
TEST OK
RGBA32_2...
Avg. time: 1.340105 ms / Copy throughput: 34.748032 GB/s.
TEST OK
[alignedTypes] -> Test Results: 0 Failures
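Note on the throughput figures: the "Copy throughput" values above can be reproduced from the "Memory Size" and "Avg. time" lines. The sketch below assumes the sample divides by 2^30, so that its "GB/s" is really GiB/s; that assumption is inferred from the numbers in this log, not taken from the sample source. The function name is illustrative only.

```python
# Reproduce the "Copy throughput" figures of alignedTypes from the log.
# Assumption (inferred from the numbers): "GB/s" here means bytes / 2**30.

MEMORY_SIZE_BYTES = 49_999_872  # "Memory Size" line in the log


def copy_throughput(avg_time_ms: float, size_bytes: int = MEMORY_SIZE_BYTES) -> float:
    """Return throughput in units of 2**30 bytes per second."""
    seconds = avg_time_ms * 1e-3
    return size_bytes / seconds / 2**30


# Two entries taken from the log: misaligned uint8 and aligned RGBA32.
uint8_misaligned = copy_throughput(2.563287)
rgba32_aligned = copy_throughput(0.961305)

print(f"uint8 (misaligned): {uint8_misaligned:.3f}")
print(f"RGBA32 (aligned):   {rgba32_aligned:.3f}")
```

With this convention the computed values agree with the logged 18.166525 and 48.440401 to within rounding of the printed times.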
asyncAPI.exe
[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\asyncAPI.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
CUDA device [GeForce GTX 560 Ti]
time spent executing by the GPU: 22.45
time spent by CPU in CUDA calls: 0.04
CPU executed 12884 iterations while waiting for GPU to finish
bandwidthTest.exe
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 560 Ti
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6016.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6103.5
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 108588.2
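Note on reading these figures: the transfer size of 33554432 bytes is 32 MiB, so the host/device bandwidths imply a per-copy time of roughly 5 ms. The sketch below assumes the sample's "MB" means 2^20 bytes; that is an assumption about the sample, not something this log states. The device-to-device line is used only for a ratio, since its accounting may differ.

```python
# Back out the implied per-transfer time from the bandwidthTest figures.
# Assumption: "MB/s" in this log means 2**20 bytes per second.

TRANSFER_BYTES = 33_554_432  # 32 MiB, from the log


def transfer_time_ms(bandwidth_mb_s: float, size_bytes: int = TRANSFER_BYTES) -> float:
    """Time for one transfer, in milliseconds, at the given bandwidth."""
    return size_bytes / (bandwidth_mb_s * 2**20) * 1e3


h2d_ms = transfer_time_ms(6016.1)  # Host to Device
d2h_ms = transfer_time_ms(6103.5)  # Device to Host

# On-device copies are much faster than copies over the bus.
d2d_over_h2d = 108588.2 / 6016.1

print(f"H2D: {h2d_ms:.2f} ms, D2H: {d2h_ms:.2f} ms, D2D/H2D ratio: {d2d_over_h2d:.1f}")
```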
batchCUBLAS.exe 1/3
batchCUBLAS Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00010011 sec GFLOPS=41.8986
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00012166 sec GFLOPS=34.4752
@@@@ dgemm test OK
==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00030251 sec GFLOPS=138.65
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00062913 sec GFLOPS=66.668
@@@@ dgemm test OK
batchCUBLAS.exe 2/3
==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00030580 sec GFLOPS=137.159
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00055826 sec GFLOPS=75.1324
@@@@ dgemm test OK
batchCUBLAS.exe 3/3
==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00051843 sec GFLOPS=80.9036
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00065873 sec GFLOPS=63.6729
@@@@ dgemm test OK
Test Summary
0 error(s)
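Note on the GFLOPS figures: they are consistent with the conventional operation count of 2·m·n·k per matrix product, multiplied by the number of matrices (N=10 in the stream and batched runs). A minimal sketch under that assumption, with an illustrative function name:

```python
# Recompute batchCUBLAS GFLOPS from the logged sizes and elapsed times.
# Assumption: one GEMM counts as 2*m*n*k floating-point operations.


def gemm_gflops(m: int, n: int, k: int, elapsed_s: float, count: int = 1) -> float:
    """GFLOPS for `count` GEMMs of size m x n x k finished in elapsed_s seconds."""
    flops = 2.0 * m * n * k * count
    return flops / elapsed_s / 1e9


single_sgemm = gemm_gflops(128, 128, 128, 0.00010011)             # single kernel
n10_sgemm = gemm_gflops(128, 128, 128, 0.00030251, count=10)      # N=10 without streams
batched_sgemm = gemm_gflops(128, 128, 128, 0.00051843, count=10)  # N=10 batched

print(f"{single_sgemm:.2f} {n10_sgemm:.2f} {batched_sgemm:.2f}")
```

The results match the logged 41.8986, 138.65, and 80.9036 GFLOPS up to the rounding of the printed elapsed times.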
bicubicTexture.exe 1/2
Starting bicubicTexture
[CUDA BicubicTexture] (OpenGL Mode)
CUDA device [GeForce GTX 560 Ti] has 8 Multi-Processors
Loaded 'lena_bw.pgm', 512 x 512 pixels
Controls
=/- : Zoom in/out
b : Run Benchmark g_FilterMode
c : Draw Bicubic Spline Curve
[esc] - Quit
Press number keys to change filtering g_FilterMode:
1 : nearest filtering
2 : bilinear filtering
3 : bicubic filtering
4 : fast bicubic filtering
5 : Catmull-Rom filtering
bicubicTexture.exe 2/2
[CUDA BicubicTexture] (Benchmark Mode)
time: 0.098 ms, 2673.560320 Mpixels/sec
> FilterMode[1] = Nearest
> FilterMode[2] = Bilinear
> FilterMode[3] = Bicubic
> FilterMode[4] = Fast Bicubic
> FilterMode[5] = Catmull-Rom
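Note on the benchmark line: the Mpixels/sec figure is roughly what one gets by dividing a 512 x 512 frame by the reported time. The sketch assumes the rendered frame has the same size as the loaded image (512 x 512); the log does not state the output size, so treat this as a plausibility check only.

```python
# Plausibility check of the bicubicTexture benchmark figure.
# Assumption: each benchmark pass renders a 512 x 512 frame.

WIDTH, HEIGHT = 512, 512


def mpixels_per_second(time_ms: float, width: int = WIDTH, height: int = HEIGHT) -> float:
    """Millions of pixels processed per second for one frame in time_ms."""
    return width * height / (time_ms * 1e-3) / 1e6


estimate = mpixels_per_second(0.098)  # time is printed with only 3 decimals
reported = 2673.560320

print(f"estimate {estimate:.1f} vs reported {reported:.1f} Mpixels/sec")
```

The small gap between the two values is explained by the time being printed with only three decimals.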
bilateralFilter.exe 1/2
Loading ../../../3_Imaging/bilateralFilter/data/nature_monte.bmp...
BMP width: 640
BMP height: 480
BMP file loaded successfully!
Loaded '../../../3_Imaging/bilateralFilter/data/nature_monte.bmp', 640 x 480 pixels
Found 1 CUDA Capable device(s) supporting CUDA
Device 0: "GeForce GTX 560 Ti"
CUDA Runtime Version : 5.0
CUDA Compute Capability : 2.1
Found CUDA Capable Device 0: "GeForce GTX 560 Ti"
Setting active device to 0
Using device 0: GeForce GTX 560 Ti
Running Standard Demonstration with GLUT loop...
Press '+' and '-' to change filter width
Press ']' and '[' to change number of iterations
Press 'e' and 'E' to change Euclidean delta
Press 'g' and 'G' to changle Gaussian delta
Press 'a' or 'A' to change Animation mode ON/OFF
bilateralFilter.exe 2/2
bindlessTexture.exe / Failure
CUDA bindlessTexture Starting...
No GPU device was found that can support CUDA compute capability 3.0.
binomialOptions.exe
[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\binomialOptions.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
Using single precision...
Generating input data...
Running GPU binomial tree...
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 29.790300 msec
Options per second : 17186.802203
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 1.323721E-004
CPU binomial vs. Black-Scholes
L1 norm: 1.045245E-004
CPU binomial vs. GPU binomial
L1 norm: 3.391858E-005
Shutting down...
Test passed
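Note on "Options per second": the value follows directly from the option count and the GPU time printed above. A short check, using only numbers from the log:

```python
# Recompute "Options per second" for binomialOptions from the log values.

OPTIONS_COUNT = 512
GPU_TIME_MS = 29.790300  # binomialOptionsGPU() time

options_per_second = OPTIONS_COUNT / (GPU_TIME_MS * 1e-3)

print(f"{options_per_second:.2f} options/s")
```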
BlackScholes.exe 1/2
[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\BlackScholes.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.806277 msec
Effective memory bandwidth: 99.221508 GB/s
Gigaoptions per second : 9.922151
BlackScholes, Throughput = 9.9222 GOptions/s, Time = 0.00081 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
BlackScholes.exe 2/2
Reading back GPU results...
Checking the results...
...running CPU calculations.
Comparing the results...
L1 norm: 1.768024E-007
Max absolute error: 1.120567E-005
Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.
[BlackScholes] - Test Summary
Test passed
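Note on the BlackScholes throughput figures: both can be reproduced from the option count and kernel time. The sketch assumes that the 8000000 reported options are 4000000 call/put pairs, and that each pair moves five 4-byte floats (three inputs, two outputs). This accounting is inferred from the numbers in the log; it is not confirmed against the sample source.

```python
# Reproduce "Effective memory bandwidth" and "Gigaoptions per second".
# Assumptions: options are counted as call/put pairs, and each pair
# reads 3 floats and writes 2 floats (4 bytes each), with GB = 1e9 bytes.

OPTIONS_COUNT = 8_000_000
KERNEL_TIME_MS = 0.806277
FLOATS_PER_PAIR = 5
BYTES_PER_FLOAT = 4

pairs = OPTIONS_COUNT // 2
seconds = KERNEL_TIME_MS * 1e-3

bandwidth_gb_s = FLOATS_PER_PAIR * pairs * BYTES_PER_FLOAT / seconds / 1e9
gigaoptions_per_s = OPTIONS_COUNT / seconds / 1e9

print(f"{bandwidth_gb_s:.3f} GB/s, {gigaoptions_per_s:.4f} GOptions/s")
```

Under these assumptions the values agree with the logged 99.221508 GB/s and 9.922151 GOptions/s to within the rounding of the printed time.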
boxFilter.exe
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilter.exe Starting...
Loaded '../../../3_Imaging/boxFilter/data/lenaRGB.ppm', 1024 x 1024 pixels
Found 1 CUDA Capable device(s) supporting CUDA
Device 0: "GeForce GTX 560 Ti"
CUDA Runtime Version : 5.0
CUDA Compute Capability : 2.1
Found CUDA Capable Device 0: "GeForce GTX 560 Ti"
Setting active device to 0
Running Standard Demonstration with GLUT loop...
Press '+' and '-' to change filter width
Press ']' and '[' to change number of iterations
Press 'a' or 'A' to change animation ON/OFF
boxFilterNPP.exe
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilterNPP.exe Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
cudaSetDevice GPU0 = GeForce GTX 560 Ti
NPP Library Version 5.0.35
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilterNPP.exe using GPU <GeForce GTX 560 Ti> with 8 SM(s) with Compute 2.1
boxFilterNPP opened: <../../../common/data/Lena.pgm> successfully!
Saved image: ../../../common/data/Lena_boxFilter.pgm
cdpAdvancedQuicksort.exe / Failure
GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism
cdpAdvancedQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...
cdpLUDecomposition.exe / Failure
Starting LU Decomposition (CUDA Dynamic Parallelism)
GPU device GeForce GTX 560 Ti has compute capabilities (SM 2.1)
cdpLUDecomposition requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...
cdpQuadTree.exe / Failure
GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism
cdpQuadTree requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...
cdpSimplePrint.exe / Failure
starting Simple Print (CUDA Dynamic Parallelism)
GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism
cdpSimplePrint requires GPU devices with compute SM 3.5 or higher. Exiting...
cdpSimpleQuicksort.exe / Failure
GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism
cdpSimpleQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...
clock.exe
CUDA Clock sample
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
Total clocks = 15204
Summary
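On the GTX 560 Ti (compute capability 2.1), some samples do not run:
→ bindlessTexture requires a GPU that supports CUDA compute capability 3.0.
→ The cdp* samples (CUDA Dynamic Parallelism) require a GPU with compute capability (SM) 3.5 or higher.
This evaluation is to be continued; it is recorded here for future reference.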
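Note on the failures: all six failing samples stop at a compute capability check. The sketch below restates that check as a lookup table. The minimum versions are the ones printed in the error messages in this log; the table, the function name, and the default for unlisted samples are illustrative and not part of the CUDA samples.

```python
# Which samples in this run are blocked by the device's compute capability?
# Minimum versions are taken from the error messages in the log; samples
# not listed are treated as having no minimum (illustrative default).

DEVICE_CC = (2, 1)  # GeForce GTX 560 Ti, as reported in the log

REQUIRED_CC = {
    "bindlessTexture": (3, 0),
    "cdpAdvancedQuicksort": (3, 5),
    "cdpLUDecomposition": (3, 5),
    "cdpQuadTree": (3, 5),
    "cdpSimplePrint": (3, 5),
    "cdpSimpleQuicksort": (3, 5),
}


def can_run(sample: str, device_cc: tuple = DEVICE_CC) -> bool:
    """True if the device meets the sample's minimum compute capability."""
    return device_cc >= REQUIRED_CC.get(sample, (0, 0))


blocked = sorted(name for name in REQUIRED_CC if not can_run(name))
print("Blocked on compute capability 2.1:", ", ".join(blocked))
```

Tuples compare major version first and then minor version, so a device with compute capability 3.0 would pass the bindlessTexture check but would still fail the cdp* checks.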