NVIDIA® CUDA™ 5.0 Sample Evaluation Result, Part 1


Upload: -office-saitoh

Post on 28-May-2015

Page 1: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

NVIDIA® CUDA™ 5.0 Sample evaluation result

PART Ⅰ

GPU: GTX 560 Ti

CPU: i5-3450S (TDP65W)

RAM: 16GB

OS: Windows 7 x64 Ultimate

Yukio Saitoh | FXFROG.com

21st/Apr/2013

Page 2: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

INDEX

Sample binary:

1. alignedTypes.exe

2. asyncAPI.exe

3. bandwidthTest.exe

4. batchCUBLAS.exe

5. bicubicTexture.exe

6. bilateralFilter.exe

7. bindlessTexture.exe / Failure

8. binomialOptions.exe

9. BlackScholes.exe

10. boxFilter.exe

11. boxFilterNPP.exe

12. cdpAdvancedQuicksort.exe / Failure

13. cdpLUDecomposition.exe / Failure

14. cdpQuadTree.exe / Failure

15. cdpSimplePrint.exe / Failure

16. cdpSimpleQuicksort.exe / Failure

17. clock.exe

Page 3: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

Sample target path and files

• C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release

Page 4: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

alignedTypes.exe 1/2

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\alignedTypes.exe] - Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

[GeForce GTX 560 Ti] has 8 MP(s) x 48 (Cores/MP) = 384 (Cores)

> Compute scaling value = 1.00

> Memory Size = 49999872

Allocating memory...

Generating host input data array...

Uploading input data to GPU memory...

Testing misaligned types...

uint8...

Avg. time: 2.563287 ms / Copy throughput: 18.166525 GB/s.

TEST OK

uint16...

Avg. time: 1.429239 ms / Copy throughput: 32.580981 GB/s.

TEST OK

RGBA8_misaligned...

Avg. time: 1.766606 ms / Copy throughput: 26.359026 GB/s.

TEST OK

LA32_misaligned...

Avg. time: 0.998594 ms / Copy throughput: 46.631585 GB/s.

TEST OK

RGB32_misaligned...

Avg. time: 1.273794 ms / Copy throughput: 36.556941 GB/s.

TEST OK

RGBA32_misaligned...

Avg. time: 1.703606 ms / Copy throughput: 27.333794 GB/s.

TEST OK

Page 5: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

alignedTypes.exe 2/2

Testing aligned types...

RGBA8...

Avg. time: 1.131558 ms / Copy throughput: 41.152104 GB/s.

TEST OK

I32...

Avg. time: 1.091073 ms / Copy throughput: 42.679095 GB/s.

TEST OK

LA32...

Avg. time: 0.952468 ms / Copy throughput: 48.889827 GB/s.

TEST OK

RGB32...

Avg. time: 1.431797 ms / Copy throughput: 32.522784 GB/s.

TEST OK

RGBA32...

Avg. time: 0.961305 ms / Copy throughput: 48.440401 GB/s.

TEST OK

RGBA32_2...

Avg. time: 1.340105 ms / Copy throughput: 34.748032 GB/s.

TEST OK

[alignedTypes] -> Test Results: 0 Failures
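The copy throughput figures above can be cross-checked from the reported memory size and average time. The sketch below is a minimal check, assuming that the sample's "GB/s" means 2^30 bytes per second and that the buffer is counted once per copy; both are inferences from the printed numbers, not something the log states. The function name is hypothetical.

```python
# Cross-check of the "Copy throughput" lines in the alignedTypes log.
# Assumption: 1 GB = 2**30 bytes, and the buffer is counted once per copy.
GIB = 1024 ** 3

def copy_throughput_gbs(memory_size_bytes: int, avg_time_ms: float) -> float:
    """Return throughput in (binary) GB/s for one copy of the buffer."""
    seconds = avg_time_ms * 1e-3
    return memory_size_bytes / seconds / GIB

MEMORY_SIZE = 49_999_872  # "Memory Size" line from the log

if __name__ == "__main__":
    # (label, average time in ms, throughput reported by the sample)
    rows = [
        ("uint8 (misaligned)", 2.563287, 18.166525),
        ("LA32 (aligned)", 0.952468, 48.889827),
    ]
    for label, time_ms, reported in rows:
        computed = copy_throughput_gbs(MEMORY_SIZE, time_ms)
        print(f"{label}: computed {computed:.3f} GB/s, reported {reported:.3f} GB/s")
```

Under these assumptions the computed values agree with the reported ones to within the rounding of the printed times.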

Page 6: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

asyncAPI.exe

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\asyncAPI.exe] - Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

CUDA device [GeForce GTX 560 Ti]

time spent executing by the GPU: 22.45

time spent by CPU in CUDA calls: 0.04

CPU executed 12884 iterations while waiting for GPU to finish

Page 7: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bandwidthTest.exe

[CUDA Bandwidth Test] - Starting...

Running on...

Device 0: GeForce GTX 560 Ti

Quick Mode

Host to Device Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 6016.1

Device to Host Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 6103.5

Device to Device Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 108588.2
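To put the bandwidth figures in more familiar terms, the sketch below converts a reported rate into an implied time per transfer and into GiB/s. It assumes that "MB/s" in this output means 2^20 bytes per second; that unit is an assumption here, and the helper names are hypothetical.

```python
# Unit conversions for the bandwidthTest figures.
# Assumption: the sample's "MB" is 2**20 bytes (MiB).
MIB = 1 << 20

def transfer_time_ms(size_bytes: int, bandwidth_mib_s: float) -> float:
    """Time implied for a single transfer of size_bytes at the given rate."""
    return size_bytes / (bandwidth_mib_s * MIB) * 1e3

def to_gib_per_s(bandwidth_mib_s: float) -> float:
    """Convert MiB/s to GiB/s."""
    return bandwidth_mib_s / 1024.0

if __name__ == "__main__":
    size = 33_554_432  # transfer size from the log (32 MiB)
    for label, rate in [("Host to Device", 6016.1),
                        ("Device to Host", 6103.5),
                        ("Device to Device", 108588.2)]:
        print(f"{label}: {to_gib_per_s(rate):.2f} GiB/s, "
              f"{transfer_time_ms(size, rate):.3f} ms per 32 MiB transfer")
```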

Page 8: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

batchCUBLAS.exe 1/3

batchCUBLAS Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

==== Running single kernels ====

Testing sgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00010011 sec GFLOPS=41.8986

@@@@ sgemm test OK

Testing dgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00012166 sec GFLOPS=34.4752

@@@@ dgemm test OK

==== Running N=10 without streams ====

Testing sgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00030251 sec GFLOPS=138.65

@@@@ sgemm test OK

Testing dgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00062913 sec GFLOPS=66.668

@@@@ dgemm test OK

Page 9: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

batchCUBLAS.exe 2/3

==== Running N=10 with streams ====

Testing sgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00030580 sec GFLOPS=137.159

@@@@ sgemm test OK

Testing dgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00055826 sec GFLOPS=75.1324

@@@@ dgemm test OK

Page 10: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

batchCUBLAS.exe 3/3

==== Running N=10 batched ====

Testing sgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00051843 sec GFLOPS=80.9036

@@@@ sgemm test OK

Testing dgemm

#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)

#### args: lda=128 ldb=128 ldc=128

^^^^ elapsed = 0.00065873 sec GFLOPS=63.6729

@@@@ dgemm test OK

Test Summary

0 error(s)
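The GFLOPS values in the batchCUBLAS output are consistent with the usual operation count for a matrix multiply, 2·m·n·k floating-point operations per GEMM, multiplied by the number of matrices (N=10 in the multi-matrix runs). The sketch below recomputes a few of them; the formula is inferred from the printed numbers, and the function name is hypothetical.

```python
# Recompute GFLOPS from the elapsed times in the batchCUBLAS log.
# Assumption: each GEMM is counted as 2*m*n*k floating-point operations.

def gemm_gflops(m: int, n: int, k: int, elapsed_s: float, batch: int = 1) -> float:
    """GFLOPS for `batch` GEMMs of size m x n x k completed in elapsed_s seconds."""
    flops = 2.0 * m * n * k * batch
    return flops / elapsed_s / 1e9

if __name__ == "__main__":
    # (label, batch count, elapsed seconds, GFLOPS reported by the sample)
    rows = [
        ("single sgemm", 1, 0.00010011, 41.8986),
        ("N=10 sgemm, no streams", 10, 0.00030251, 138.65),
        ("N=10 dgemm, streams", 10, 0.00055826, 75.1324),
        ("N=10 sgemm, batched", 10, 0.00051843, 80.9036),
    ]
    for label, batch, elapsed, reported in rows:
        computed = gemm_gflops(128, 128, 128, elapsed, batch)
        print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")
```

Small differences from the reported values are expected because the log prints the elapsed time with only a few significant digits.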

Page 11: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bicubicTexture.exe 1/2

Starting bicubicTexture

[CUDA BicubicTexture] (OpenGL Mode)

CUDA device [GeForce GTX 560 Ti] has 8 Multi-Processors

Loaded 'lena_bw.pgm', 512 x 512 pixels

Controls

=/- : Zoom in/out

b : Run Benchmark g_FilterMode

c : Draw Bicubic Spline Curve

[esc] - Quit

Press number keys to change filtering g_FilterMode:

1 : nearest filtering

2 : bilinear filtering

3 : bicubic filtering

4 : fast bicubic filtering

5 : Catmull-Rom filtering

Page 12: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bicubicTexture.exe 2/2

[CUDA BicubicTexture] (Benchmark Mode)

time: 0.098 ms, 2673.560320 Mpixels/sec

> FilterMode[1] = Nearest

> FilterMode[2] = Bilinear

> FilterMode[3] = Bicubic

> FilterMode[4] = Fast Bicubic

> FilterMode[5] = Catmull-Rom

Page 13: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bilateralFilter.exe 1/2

Loading ../../../3_Imaging/bilateralFilter/data/nature_monte.bmp...

BMP width: 640

BMP height: 480

BMP file loaded successfully!

Loaded '../../../3_Imaging/bilateralFilter/data/nature_monte.bmp', 640 x 480 pixels

Found 1 CUDA Capable device(s) supporting CUDA

Device 0: "GeForce GTX 560 Ti"

CUDA Runtime Version : 5.0

CUDA Compute Capability : 2.1

Found CUDA Capable Device 0: "GeForce GTX 560 Ti"

Setting active device to 0

Using device 0: GeForce GTX 560 Ti

Running Standard Demonstration with GLUT loop...

Press '+' and '-' to change filter width

Press ']' and '[' to change number of iterations

Press 'e' and 'E' to change Euclidean delta

Press 'g' and 'G' to changle Gaussian delta

Press 'a' or 'A' to change Animation mode ON/OFF

Page 14: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bilateralFilter.exe 2/2

Page 15: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

bindlessTexture.exe / Failure

CUDA bindlessTexture Starting...

No GPU device was found that can support CUDA compute capability 3.0.

Page 16: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

binomialOptions.exe

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\binomialOptions.exe] - Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

Using single precision...

Generating input data...

Running GPU binomial tree...

Options count : 512

Time steps : 2048

binomialOptionsGPU() time: 29.790300 msec

Options per second : 17186.802203

Running CPU binomial tree...

Comparing the results...

GPU binomial vs. Black-Scholes

L1 norm: 1.323721E-004

CPU binomial vs. Black-Scholes

L1 norm: 1.045245E-004

CPU binomial vs. GPU binomial

L1 norm: 3.391858E-005

Shutting down...

Test passed

Page 17: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

BlackScholes.exe 1/2

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\BlackScholes.exe] - Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

Initializing data...

...allocating CPU memory for options.

...allocating GPU memory for options.

...generating input data in CPU mem.

...copying input data to GPU mem.

Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...

Options count : 8000000

BlackScholesGPU() time : 0.806277 msec

Effective memory bandwidth: 99.221508 GB/s

Gigaoptions per second : 9.922151

BlackScholes, Throughput = 9.9222 GOptions/s, Time = 0.00081 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Page 18: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

BlackScholes.exe 2/2

Reading back GPU results...

Checking the results...

...running CPU calculations.

Comparing the results...

L1 norm: 1.768024E-007

Max absolute error: 1.120567E-005

Shutting down...

...releasing GPU memory.

...releasing CPU memory.

Shutdown done.

[BlackScholes] - Test Summary

Test passed
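The BlackScholes throughput lines can be related to each other with simple arithmetic. The sketch below recomputes the options-per-second figure from the option count and kernel time, and derives the memory traffic per counted option implied by the reported bandwidth. It assumes decimal units (1 GB = 10^9 bytes); the function names are hypothetical. The resulting figure of about 10 bytes per option would fit five 4-byte floats per call/put pair, but that interpretation is an inference from the numbers, not something the log states.

```python
# Relate the throughput figures printed by BlackScholes.exe.
# Assumption: "Giga" and "GB" are decimal (1e9).

def gigaoptions_per_second(option_count: int, time_ms: float) -> float:
    """Options processed per second, in units of 1e9."""
    return option_count / (time_ms * 1e-3) / 1e9

def bytes_per_option(bandwidth_gb_s: float, gigaoptions_s: float) -> float:
    """Memory traffic per counted option implied by the two reported rates."""
    return bandwidth_gb_s / gigaoptions_s

if __name__ == "__main__":
    rate = gigaoptions_per_second(8_000_000, 0.806277)
    print(f"computed {rate:.4f} GOptions/s (reported 9.9222)")
    print(f"implied traffic: {bytes_per_option(99.221508, 9.922151):.2f} bytes per option")
```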

Page 19: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

boxFilter.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilter.exe Starting...

Loaded '../../../3_Imaging/boxFilter/data/lenaRGB.ppm', 1024 x 1024 pixels

Found 1 CUDA Capable device(s) supporting CUDA

Device 0: "GeForce GTX 560 Ti"

CUDA Runtime Version : 5.0

CUDA Compute Capability : 2.1

Found CUDA Capable Device 0: "GeForce GTX 560 Ti"

Setting active device to 0

Running Standard Demonstration with GLUT loop...

Press '+' and '-' to change filter width

Press ']' and '[' to change number of iterations

Press 'a' or 'A' to change animation ON/OFF

Page 20: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

boxFilterNPP.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilterNPP.exe Starting...

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

cudaSetDevice GPU0 = GeForce GTX 560 Ti

NPP Library Version 5.0.35

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\boxFilterNPP.exe using GPU <GeForce GTX 560 Ti> with 8 SM(s) with Compute 2.1

boxFilterNPP opened: <../../../common/data/Lena.pgm> successfully!

Saved image: ../../../common/data/Lena_boxFilter.pgm

Page 21: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

cdpAdvancedQuicksort.exe / Failure

GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism

cdpAdvancedQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...

Page 22: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

cdpLUDecomposition.exe / Failure

Starting LU Decomposition (CUDA Dynamic Parallelism)

GPU device GeForce GTX 560 Ti has compute capabilities (SM 2.1)

cdpLUDecomposition requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...

Page 23: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

cdpQuadTree.exe / Failure

GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism

cdpQuadTree requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...

Page 24: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

cdpSimplePrint.exe / Failure

starting Simple Print (CUDA Dynamic Parallelism)

GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism

cdpSimplePrint requires GPU devices with compute SM 3.5 or higher. Exiting...

Page 25: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

cdpSimpleQuicksort.exe / Failure

GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism

cdpSimpleQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...

Page 26: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

clock.exe

CUDA Clock sample

GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

Total clocks = 15204

Page 27: Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1

Summary

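On the GTX 560 Ti (compute capability 2.1), some samples do not run:

→ bindlessTexture requires a GPU that supports CUDA compute capability 3.0.

→ The cdp* samples (CUDA Dynamic Parallelism) require a GPU with compute capability (SM) 3.5 or higher.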
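The failures follow from a plain version comparison between the device's compute capability and each sample's minimum. The sketch below illustrates that comparison using (major, minor) tuples. The requirement table is transcribed from the error messages in this deck, and the names `REQUIRED`, `can_run`, and `unsupported_samples` are hypothetical; this is not how the samples themselves perform the check.

```python
# Illustration of the compute-capability gate behind the failed samples.
# The table below is transcribed from the error messages in this evaluation.
from typing import Dict, List, Tuple

ComputeCapability = Tuple[int, int]  # (major, minor), e.g. (2, 1) for SM 2.1

REQUIRED: Dict[str, ComputeCapability] = {
    "bindlessTexture": (3, 0),
    "cdpAdvancedQuicksort": (3, 5),
    "cdpLUDecomposition": (3, 5),
    "cdpQuadTree": (3, 5),
    "cdpSimplePrint": (3, 5),
    "cdpSimpleQuicksort": (3, 5),
}

def can_run(device: ComputeCapability, required: ComputeCapability) -> bool:
    """Tuples compare lexicographically, so (2, 1) < (3, 0) < (3, 5)."""
    return device >= required

def unsupported_samples(device: ComputeCapability,
                        requirements: Dict[str, ComputeCapability]) -> List[str]:
    """Names of samples whose minimum capability exceeds the device's."""
    return sorted(name for name, req in requirements.items()
                  if not can_run(device, req))

if __name__ == "__main__":
    gtx_560_ti = (2, 1)  # compute capability reported in the logs
    for name in unsupported_samples(gtx_560_ti, REQUIRED):
        major, minor = REQUIRED[name]
        print(f"{name}: needs {major}.{minor}, device has 2.1")
```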

This evaluation is to be continued; the results are recorded here for future reference.