
Page 1:

Introduction to CUDA Programming: Profiler, Assembly, and Floating-Point

Andreas Moshovos, Winter 2009

Some material from: Wen-Mei Hwu and David Kirk, NVIDIA; Robert Strzodka and Dominik Göddeke, NVISION08 presentation

http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf

Page 2:

The CUDA Profiler

• Both GUI and command-line

• Non-GUI control via environment variables:

– CUDA_PROFILE: set to 1 or 0 to enable or disable the profiler

– CUDA_PROFILE_LOG: set to the name of the log file (defaults to ./cuda_profile.log)

– CUDA_PROFILE_CSV: set to 1 or 0 to enable or disable a comma-separated version of the log

– CUDA_PROFILE_CONFIG: specify a configuration file with up to 4 signals (see the sketch below)
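A minimal sketch of a command-line profiling session. The four counter names are examples only; the actual list of signals appears on the next slide, which survives only as a title in this transcript:

    export CUDA_PROFILE=1                      # enable the profiler
    export CUDA_PROFILE_LOG=./myapp.log        # where results are written
    export CUDA_PROFILE_CSV=1                  # comma-separated output
    export CUDA_PROFILE_CONFIG=./signals.cfg   # up to 4 signals, one per line

    # signals.cfg (hypothetical contents):
    #   gld_incoherent
    #   gld_coherent
    #   gst_incoherent
    #   gst_coherent

    ./myapp                                    # run normally; counters land in the log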

Page 3:

Profiler Signals

Page 4:

Profiler Counters

• Grid size X, Y

• Block size X, Y, Z

• Dyn smem per block: dynamic shared memory

• Sta smem per block: static shared memory

• Reg per thread

• Mem transfer dir: direction; 0 = host to device, 1 = device to host

• Mem transfer size: in bytes

Page 5:

Interpreting Profiler Counters

• Values represent events within a single thread warp.

• Only one multiprocessor is targeted

– Values will not correspond to the total number of warps launched for a particular kernel.

– Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.

• Values are best used to identify relative performance differences between unoptimized and optimized code

– e.g., make the number of non-coalesced loads go from some non-zero value to zero.

Page 6:

CUDA Visual Profiler

• Helps measure and find potential performance problems

– GPU and CPU timing for all kernel invocations and memcpys

– Time stamps

• Access to hardware performance counters

Page 7:

Assembly

Page 8:

PTX: Assembly for NVIDIA GPUs

• Parallel Thread eXecution

• Virtual assembly

– Translated to actual machine code at runtime

– Allows for different hardware implementations

• Might enable additional optimizations

– e.g., the %clock register to time blocks of code

Page 9:

Code Generation Flow

• Parallel Thread eXecution (PTX)

– Virtual machine and ISA

– Programming model

– Execution resources and state

• ISA (Instruction Set Architecture)

– Variable declarations

– Instructions and operands

• The translator is an optimizing compiler

– Translates PTX to target code

– Runs at program install time

• The driver implements the VM runtime

– Coupled with the translator

[Flow diagram: the programmer writes a C/C++ application (plus an ASM-level library); the C/C++ compiler emits PTX code; the PTX-to-target translator then produces target code for the specific GPU (G80, …).]

Page 10:

How to See the PTX code

• nvcc -keep

– Produces .ptx and .cubin files

• nvcc --opencc-options -LIST:source=on

Page 11:

PTX Example

CUDA:

    float4 me = gx[gtid];
    me.x += me.y * me.z;

PTX:

    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    # 174  me.x += me.y * me.z;
    mad.f32 $f1, $f5, $f3, $f1;

Registers are virtual: the actual hardware registers are hidden from PTX.

Page 12:

PTX Syntax Example

Page 13:

Another Example: CUDA Function

• CUDA:

    __device__ void interaction(float4 b0, float4 b1, float3 *accel)
    {
        float3 r;    // declaration implied by the slide's excerpt
        r.x = b1.x - b0.x;
        r.y = b1.y - b0.y;
        r.z = b1.z - b0.z;
        float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
        float s = 1.0f / sqrt(distSqr);
        accel->x += r.x * s;
        accel->y += r.y * s;
        accel->z += r.z * s;
    }

• PTX:

    sub.f32 $f18, $f1, $f15;
    sub.f32 $f19, $f3, $f16;
    sub.f32 $f20, $f5, $f17;
    mul.f32 $f21, $f18, $f18;
    mul.f32 $f22, $f19, $f19;
    mul.f32 $f23, $f20, $f20;
    add.f32 $f24, $f21, $f22;
    add.f32 $f25, $f23, $f24;
    rsqrt.f32 $f26, $f25;
    mad.f32 $f13, $f18, $f26, $f13;
    mov.f32 $f14, $f13;
    mad.f32 $f11, $f19, $f26, $f11;
    mov.f32 $f12, $f11;
    mad.f32 $f9, $f20, $f26, $f9;
    mov.f32 $f10, $f9;

Page 14:

PTX Data types

Page 15:

Predicated Execution

• Source:

    if (cond)
        Then Code
    After Code

• With a branch:

    p = evaluate cond
    branch to After if p is not true
    Then Code
    After:
        After Code

• With predication:

    p = evaluate cond
    (p) Then Code
    After Code
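A minimal sketch of the predicated form in PTX syntax (register names and operations are illustrative; the slide that follows showed the real listing as an image):

    setp.lt.s32 $p1, $r1, $r2;     // p1 = (r1 < r2): evaluate cond
    @$p1 add.s32 $r3, $r3, 1;      // Then Code: executes only where p1 is true
    mul.lo.s32  $r4, $r3, $r5;     // After Code: executes unconditionally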

Page 16:

PTX Predicated Execution

Page 17:

Variable Declaration

Page 18:

Parameterized Variable Names

• How to create 100 register "variables":

    .reg .b32 %r<100>

• Declares %r0 through %r99

Page 19:

Addresses as Operands

– The value of x

– The value of tbl[12]

– The base address of tbl
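The instructions themselves were shown as an image; a sketch of how those three operand forms look in PTX (names are illustrative, and 4-byte elements are assumed, so tbl[12] sits at byte offset 48):

    ld.global.u32 %r1, [x];        // the value of x
    ld.global.u32 %r2, [tbl+48];   // the value of tbl[12] (4-byte elements)
    mov.u32       %r3, tbl;        // the base address of tbl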

Page 20:

Compiling a loop that calls a function - again

• CUDA (sx is shared; mx and accel are local):

    for (i = 0; i < K; i++) {
        if (i != threadIdx.x) {
            interaction(sx[i], mx, &accel);
        }
    }

• PTX:

        mov.s32 $r12, 0;
    $Lt_0_26:
        setp.eq.u32 $p1, $r12, $r5;
        @$p1 bra $Lt_0_27;
        mul.lo.u32 $r13, $r12, 16;
        add.u32 $r14, $r13, $r1;
        ld.shared.f32 $f15, [$r14+0];
        ld.shared.f32 $f16, [$r14+4];
        ld.shared.f32 $f17, [$r14+8];

        [function body from the previous slide inlined here]

    $Lt_0_27:
        add.s32 $r12, $r12, 1;
        mov.s32 $r15, 128;
        setp.ne.s32 $p2, $r12, $r15;
        @$p2 bra $Lt_0_26;

Page 21:

Yet Another Example: SAXPY Code

    cvt.u32.u16  $blockid, %ctaid.x;      // Calculate i from thread/block IDs
    cvt.u32.u16  $blocksize, %ntid.x;
    cvt.u32.u16  $tid, %tid.x;
    mad24.lo.u32 $i, $blockid, $blocksize, $tid;
    ld.param.u32 $n, [N];                 // Nothing to do if n ≤ i
    setp.le.u32  $p1, $n, $i;
    @$p1 bra     $L_finish;

    mul.lo.u32   $offset, $i, 4;          // Load y[i]
    ld.param.u32 $yaddr, [Y];
    add.u32      $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32 $xaddr, [X];             // Load x[i]
    add.u32      $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];

    ld.param.f32 $alpha, [ALPHA];         // Compute and store alpha*x[i] + y[i]
    mad.f32      $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;

    $L_finish:
        exit;
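For reference, CUDA source that would compile to code along these lines (a sketch; the parameter names mirror the PTX parameters above):

    __global__ void saxpy(unsigned int N, float ALPHA, float *X, float *Y)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                       // the PTX branches to $L_finish when n ≤ i
            Y[i] = ALPHA * X[i] + Y[i];  // the mad.f32 + st.global.f32 pair
    }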

Page 22:

The %clock register

• A real-time clock cycle counter

• How to read it:

    mov.u32 $r1, %clock;

• Can be used to time code

• It measures real time, not just the time spent executing this thread

– If a thread is blocked, time still elapses
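In CUDA C the same counter is exposed through the clock() device function; a minimal timing sketch (the kernel and its names are illustrative, not from the slides):

    __global__ void timed_kernel(float *data, int *elapsed)
    {
        clock_t start = clock();            // reads the per-SM cycle counter (%clock)
        float v = data[threadIdx.x];
        for (int i = 0; i < 100; i++)       // the work being timed
            v = v * v + 0.5f;
        data[threadIdx.x] = v;
        clock_t stop = clock();
        if (threadIdx.x == 0)
            *elapsed = (int)(stop - start); // cycles, including time spent blocked
    }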

   

Page 23:

PTX Reference

• Please read the PTX ISA specification

– Posted under the handouts section

Page 24:

Occupancy Calculator

• http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

• GPU occupancy = active warps / max warps, determined by:

– Threads/block

– Registers/thread

– Shared memory/block

• nvcc --cubin reports the per-kernel resource usage:

    code {
        name = my_kernel
        lmem = 0
        smem = 24
        reg = 5
        bar = 0
        bincode { }
        const { }
    }

Page 25:

Occupancy Calculator Example

Page 26:

Floating Point Considerations

Page 27:

Comparison of FP Capabilities

                                      G80                  SSE                      IBM Altivec        Cell SPE
    Precision                         IEEE 754             IEEE 754                 IEEE 754           IEEE 754
    Rounding modes for                round to nearest     all 4 IEEE modes         round to nearest   round to zero /
    FADD and FMUL                     and round to zero    (nearest, zero, ±inf)    only               truncate only
    Denormal handling                 flush to zero        supported,               supported,         flush to zero
                                                           1000s of cycles          1000s of cycles
    NaN support                       yes                  yes                      yes                no
    Overflow and infinity support     yes, only clamps     yes                      yes                no, infinity
                                      to max norm
    Flags                             no                   yes                      yes                some
    Square root                       software only        hardware                 software only      software only
    Division                          software only        hardware                 software only      software only
    Reciprocal estimate accuracy      24 bit               12 bit                   12 bit             12 bit
    Reciprocal sqrt estimate accuracy 23 bit               12 bit                   12 bit             12 bit
    log2(x) and 2^x estimate accuracy 23 bit               no                       12 bit             no

Page 28:

IEEE Floating-Point Representation

• A floating-point binary number consists of three parts:

– sign (S), exponent (E), and mantissa (M)

– Each (S, E, M) pattern uniquely identifies a floating-point number.

• For each bit pattern, the IEEE floating-point value is derived as:

– value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B

• The interpretation of S is simple: S = 0 gives a positive number, S = 1 a negative number.
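As a concrete illustration (a sketch of my own, not from the slides), the three fields of an IEEE single can be pulled apart in C/CUDA host code like this:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float x = 0.5f;
        unsigned int bits;
        memcpy(&bits, &x, sizeof bits);        // reinterpret the 32 bits

        unsigned int S = bits >> 31;           // 1-bit sign
        unsigned int E = (bits >> 23) & 0xFF;  // 8-bit biased exponent
        unsigned int M = bits & 0x7FFFFF;      // 23-bit mantissa (the "1." is implicit)

        // 0.5 = (-1)^0 * 1.0B * 2^(126-127), so this prints S=0 E=126 M=0x000000
        printf("S=%u E=%u M=0x%06X\n", S, E, M);
        return 0;
    }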

Page 29:

Normalized Representation

• Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating-point number unique.

– For example, the only mantissa value allowed for 0.5D is M = 1.0:
  0.5D = 1.0B * 2^-1

– Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies.

• Because all mantissa values are of the form 1.XX…, the "1." can be omitted from the representation.

– The mantissa value of 0.5D with a 2-bit mantissa is 00, derived by omitting the "1." from 1.00.

Page 30:

Exponent Representation

• In an n-bits exponent representation, 2n-1-1 is added to its 2's complement representation to form its excess representation. – See Table for a 3-bit exponent

representation

• A simple unsigned integer comparator can be used to compare the magnitude of two FP numbers

• Symmetric range for +/- exponents (111 reserved)

2’s complement Actual decimal Excess-3

000 0 011

001 1 100

010 2 101

011 3 110

100 (reserved pattern)

111

101 -3 000

110 -2 001

111 -1 010

E = represented E - BIAS

Page 31:

A Hypothetical 5-bit Floating Point Representation

• Assume a 1-bit S, a 2-bit E, and a 2-bit M

– 0.5D = 1.00B * 2^-1

– 0.5D = 0 00 00, where

  • S = 0
  • E = 00
  • M = (1.)00

    2's complement    Actual decimal    Excess-1
    00                0                 01
    01                1                 10
    10                (reserved)        11
    11                -1                00

Page 32:

Representable Numbers

• The representable numbers of a given format are the set of all numbers that can be exactly represented in the format.

• See the table for the representable numbers of an unsigned 3-bit integer format:

    000    0
    001    1
    010    2
    011    3
    100    4
    101    5
    110    6
    111    7

[Number line from -1 to 9 with tick marks; only 0 through 7 are representable.]

Page 33:

Hypothetical 5-bit FP: Representable Numbers

              No-zero                        Abrupt underflow               Gradual underflow
E    M        S=0           S=1              S=0           S=1              S=0           S=1
00   00       2^-1          -(2^-1)          0             0                0             0
     01       2^-1+1*2^-3   -(2^-1+1*2^-3)   0             0                1*2^-2        -1*2^-2
     10       2^-1+2*2^-3   -(2^-1+2*2^-3)   0             0                2*2^-2        -2*2^-2
     11       2^-1+3*2^-3   -(2^-1+3*2^-3)   0             0                3*2^-2        -3*2^-2
01   00       2^0           -(2^0)           2^0           -(2^0)           2^0           -(2^0)
     01       2^0+1*2^-2    -(2^0+1*2^-2)    2^0+1*2^-2    -(2^0+1*2^-2)    2^0+1*2^-2    -(2^0+1*2^-2)
     10       2^0+2*2^-2    -(2^0+2*2^-2)    2^0+2*2^-2    -(2^0+2*2^-2)    2^0+2*2^-2    -(2^0+2*2^-2)
     11       2^0+3*2^-2    -(2^0+3*2^-2)    2^0+3*2^-2    -(2^0+3*2^-2)    2^0+3*2^-2    -(2^0+3*2^-2)
10   00       2^1           -(2^1)           2^1           -(2^1)           2^1           -(2^1)
     01       2^1+1*2^-1    -(2^1+1*2^-1)    2^1+1*2^-1    -(2^1+1*2^-1)    2^1+1*2^-1    -(2^1+1*2^-1)
     10       2^1+2*2^-1    -(2^1+2*2^-1)    2^1+2*2^-1    -(2^1+2*2^-1)    2^1+2*2^-1    -(2^1+2*2^-1)
     11       2^1+3*2^-1    -(2^1+3*2^-1)    2^1+3*2^-1    -(2^1+3*2^-1)    2^1+3*2^-1    -(2^1+3*2^-1)
11   (reserved pattern)

Page 34:

Flush To Zero

• Treat all bit patterns with E = 0 as 0.0.

– This takes away several representable numbers near zero and lumps them all into 0.0.

– For a representation with a large M, a large number of representable numbers is removed.

[Number line from 0 to 4 showing the gap this leaves near zero.]

Page 35:

[The representable-numbers table from Page 33 is repeated here; its abrupt-underflow columns show every E = 00 pattern flushed to 0.]

Page 36:

Denormalized Numbers

• The method actually adopted by the IEEE standard is called denormalized numbers, or gradual underflow.

– It relaxes the normalization requirement for numbers very close to 0.

– Whenever E = 0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX.

• In general, if the n-bit exponent field is 0, the value is

– 0.M * 2^(-2^(n-1) + 2)

– For our 2-bit exponent this is 0.M * 2^0; for IEEE single precision (n = 8) it is 0.M * 2^-126.

[Number line from 0 to 3 showing the denormals filling the gap near zero evenly.]

Page 37:

[The representable-numbers table from Page 33 is repeated here; its gradual-underflow columns show the E = 00 patterns spread evenly between 0 and 2^0.]

Page 38:

Floating Point Numbers

• As the exponent gets larger, the distance between two consecutive representable numbers increases.

Page 39:

Arithmetic Instruction Throughput

• int and float add, shift, min, max, and float mul, mad: 4 cycles per warp

– int multiply (*) is by default 32-bit and requires multiple cycles per warp

– Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

– This applies to G80; on G20 it should be OK

• Integer divide and modulo are expensive

– The compiler will convert literal power-of-2 divides to shifts

– Be explicit in cases where the compiler can't tell that the divisor is a power of 2

– Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)
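A minimal sketch combining both points (the kernel and its names are illustrative):

    __global__ void index_math(int *out, int n)   // n must be a power of 2
    {
        int tid = threadIdx.x;

        // 24-bit multiply intrinsic: 4 cycles per warp on G80, versus multiple
        // cycles for a full 32-bit multiply (operands must fit in 24 bits)
        int scaled = __mul24(tid, 16);

        // power-of-2 modulo without the expensive % operator
        int wrapped = scaled & (n - 1);           // == scaled % n for power-of-2 n

        out[tid] = wrapped;
    }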

Page 40:

Arithmetic Instruction Throughput

• Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp

– These are the versions prefixed with "__"

– Examples: __rcp(), __sin(), __exp()

• Other functions are combinations of the above

– y / x == rcp(x) * y: 20 cycles per warp

– sqrt(x) == rcp(rsqrt(x)): 32 cycles per warp

Page 41:

Runtime Math Library

• There are two types of runtime math operations

– __func(): maps directly to the hardware ISA

• Fast but lower accuracy (see the programming guide for details)

• Examples: __sin(x), __exp(x), __pow(x,y)

– func(): compiles to multiple instructions

• Slower but higher accuracy (5 ulp, units in the last place, or less)

• Examples: sin(x), exp(x), pow(x,y)

• The -use_fast_math compiler option forces every func() to compile to __func(); a sketch of the trade-off follows.
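A minimal sketch of the two flavors side by side (the kernel is illustrative; with -use_fast_math the first line would be compiled like the second anyway):

    __global__ void sin_both(const float *in, float *accurate, float *fast)
    {
        int i = threadIdx.x;
        accurate[i] = sinf(in[i]);    // multiple instructions, ~5 ulp or better
        fast[i]     = __sinf(in[i]);  // maps directly to hardware, lower accuracy
    }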

Page 42:

Make your program float-safe!

• G20 has double-precision support

– G80 is single-precision only

– Double precision has an additional performance cost: there is only one double-precision unit per multiprocessor

– Careless use of double or undeclared types may run more slowly on G80+

• Be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed

– Add the 'f' specifier to float literals:

    foo = bar * 0.123;    // double assumed
    foo = bar * 0.123f;   // float explicit

– Use the float versions of standard library functions:

    foo = sin(bar);       // double assumed
    foo = sinf(bar);      // single precision explicit

Page 43:

Deviations from IEEE-754

• Addition and multiplication are IEEE-754 compliant

– Maximum 0.5 ulp (units in the last place) error

• However, they are often combined into a single multiply-add (FMAD), whose intermediate result is truncated

• Division is non-compliant (2 ulp)

• Not all rounding modes are supported

• Denormalized numbers are not supported

• There is no mechanism to detect floating-point exceptions

Page 44:

Units in the Last Place Error

• If the result of an FP computation is:

– 3.12 x 10^-2 = 0.0312

• but the answer when computed to infinite precision is:

– 0.0312159

• then the error is:

– 0.0312159 - 0.0312 = 0.0000159 = 0.159 x 10^-4, i.e., 0.159 ulp, since one unit in the last place of 3.12 x 10^-2 is 10^-4

• For binary representations with round to nearest, the maximum error is 0.5 ulp

Page 45:

Mixed Precision Methods

From slides by Robert Strzodka and Dominik Göddeke

http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf

Page 46:

What is a Mixed Precision Method?

Page 47:

Mixed Precision Performance Gains

Page 48:

Single vs Double Precision FP

Float: s23e8 (1 sign, 8 exponent, 23 mantissa bits). Double: s52e11 (1 sign, 11 exponent, 52 mantissa bits).

Page 49:

Round off and Cancellation

Page 50:

Double Precision != Better Accuracy

Page 51:

The Dominant Data Error

Page 52:

Understanding Floating Point Operations

Page 53:

Commutative Summation

Page 54:

Commutative Summation Example

Say we want to calculate 1 + 0.0000004 - 0.00000003.
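The worked example itself is an image in this transcript. As a runnable illustration of the same point (the values here are my own, chosen so the order-dependence is visible in single precision):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f;
        float b = 5.9604645e-8f;            // 2^-24, half an ulp of 1.0f

        float one_at_a_time = (a + b) + b;  // each b rounds away: stays 1.0
        float grouped       = a + (b + b);  // b + b = 2^-23 survives the addition

        printf("%.9f vs %.9f\n", one_at_a_time, grouped);
        // prints: 1.000000000 vs 1.000000119
        return 0;
    }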

Page 55:

High Precision Emulation

Page 56:

Example: Addition c = a + b
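The slide's listing is an image in this transcript. The standard building block such emulated additions rest on is Knuth's two-sum, sketched below (not necessarily the slide's exact code; it requires round-to-nearest arithmetic, so G80's truncating FMAD and -use_fast_math must be avoided):

    // Error-free transformation: represent a + b exactly as hi + lo.
    __device__ void two_sum(float a, float b, float *hi, float *lo)
    {
        float s   = a + b;                      // rounded sum (the "head")
        float bb  = s - a;                      // portion of b absorbed into s
        float err = (a - (s - bb)) + (b - bb);  // rounding error (the "tail")
        *hi = s;
        *lo = err;                              // hi + lo == a + b exactly
    }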

Page 57:

Please read the following: