Transcript
Page 1: Cvim half precision floating point

Half Precision Floating Point Number

-half- @tomoaki_teshima

Page 2: Cvim half precision floating point

How big is the image?
• Multiplying two images (floating point operation)

Page 3: Cvim half precision floating point

Size ! Size !! Size !!!
• RGB 3 bytes / pixel
• float 4 bytes / pixel
• Any more space to reduce?

Page 4: Cvim half precision floating point

Summary
• Explanation of half
• Example on ARM
• Example on ARM w/ SIMD instruction
• Example on Intel, AMD (x86)
• Example on CUDA

Page 5: Cvim half precision floating point

Format of Floating points (IEEE 754)

Type | Sign bit | Exponent | Significand
64bit = double, double precision | 1bit | 11bit | 52bit
32bit = float, single precision | 1bit | 8bit | 23bit
16bit = half, half precision | 1bit | 5bit | 10bit
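The 16bit layout (1 sign bit, 5 exponent bits, 10 significand bits) can be sketched in portable C as below. The struct `HalfFields` and the function `split_half` are hypothetical names made up for this illustration; they are not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of a 16bit half. */
typedef struct {
    unsigned sign;        /* 1 bit */
    unsigned exponent;    /* 5 bits, stored with a bias of 15 */
    unsigned significand; /* 10 bits, the leading 1 is not stored */
} HalfFields;

/* Split a raw 16bit pattern into sign / exponent / significand. */
HalfFields split_half(uint16_t bits)
{
    HalfFields f;
    f.sign        = (bits >> 15) & 0x1u;   /* top bit */
    f.exponent    = (bits >> 10) & 0x1Fu;  /* next 5 bits */
    f.significand = bits & 0x3FFu;         /* low 10 bits */
    return f;
}
```

For example, the pattern 0x3e00 (1.5 in half) splits into sign 0, exponent 15 and significand 0x200.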

Page 6: Cvim half precision floating point

ARM has fp16

https://ja.wikipedia.org/wiki/半精度浮動小数点数

Page 7: Cvim half precision floating point

What to prepare
• An ARM machine which runs Linux
  • Raspberry Pi zero/1/2/3
  • ODROID XU4/C2
  • Jetson TK1/TX1
  • PINE64
  • The 64bit architectures are Raspberry Pi 3, ODROID C2, Jetson TX1 and PINE64
• Buy one for better understanding

Page 8: Cvim half precision floating point

Example on ARM

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello World !!\n");
    __fp16 halfPrecision = 1.5f;
    // the half value is converted to double when passed to printf
    printf("half precision:%f\n", halfPrecision);
    printf("half precision:sizeof %d\n", (int)sizeof(halfPrecision));
    // read the 2 bytes back as an integer to see the bit pattern
    printf("half precision:0x%04x\n", *(unsigned short*)(void*)&halfPrecision);
    float original[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f,
                        9.0f,10.0f,11.0f,12.0f,13.0f,14.0f,15.0f,16.0f,};
    for (unsigned int i = 0; i < 16; i++)
    {
        __fp16 stub = original[i]; // float -> half
        printf("%2d 0x%04x\n", (int)original[i], *(unsigned short*)&stub);
    }
    return 0;
}

https://github.com/tomoaki0705/sampleFp16

Page 9: Cvim half precision floating point

Build it

• The option “-mfp16-format” is required
• Try it on ARM gcc, otherwise you get an “unknown option” error

$ gcc -std=c99 -mfp16-format=ieee main.c

Page 10: Cvim half precision floating point

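Result (7.0 = 0x4700)

Sign bit (+): 0
Exponent (17): 1 0 0 0 1
Significand: 1 1 0 0 0 0 0 0 0 0
Weights of the significand bits, from the left: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024

$2^{(17-15)} \times (1 + \frac{1}{2} + \frac{1}{4}) = 2^{2} \times \frac{7}{4} = 7$

When exponent is all 0, the number is zero or subnormal.
When exponent is all 1, the number is Inf or NaN.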
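The decoding rule (sign, exponent with bias 15, significand with an implicit leading 1, plus the two special exponent values) can be written out as a small sketch in portable C. The function name `half_bits_to_float` is an assumption for this example; it is not the routine gcc calls, only an illustration of the same arithmetic.

```c
#include <stdint.h>
#include <string.h>

/* Decode a raw half bit pattern:
   value = (-1)^sign * 2^(exponent - 15) * (1 + significand / 1024) */
float half_bits_to_float(uint16_t bits)
{
    unsigned sign        = (bits >> 15) & 0x1u;
    int      exponent    = (int)((bits >> 10) & 0x1Fu);
    unsigned significand = bits & 0x3FFu;
    float value;

    if (exponent == 0x1F) {
        /* exponent all 1: Inf (significand 0) or NaN, built as float bits */
        uint32_t raw = 0x7F800000u | ((uint32_t)significand << 13);
        memcpy(&value, &raw, sizeof value);
    } else if (exponent == 0) {
        /* exponent all 0: zero or subnormal, no implicit leading 1 */
        value = (float)significand / 16777216.0f; /* significand * 2^-24 */
    } else {
        value = 1.0f + (float)significand / 1024.0f;
        int shift = exponent - 15;                /* remove the bias */
        while (shift > 0) { value *= 2.0f; shift--; }
        while (shift < 0) { value *= 0.5f; shift++; }
    }
    return sign ? -value : value;
}
```

With this sketch, 0x4700 decodes to 7.0 and 0x3e00 decodes to 1.5, matching the bit patterns shown on these pages.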

Page 11: Cvim half precision floating point

Summary
• Floating point format is more complicated than integer
• Half can express floating point numbers in 2 bytes

Page 12: Cvim half precision floating point

Check in Assembly
• Soft implemented conversion
• What’s the point of doing it on the SW side?

$ gcc -S -std=c99 -mfp16-format=ieee -o main.c.s main.c

movw r3, #15872            <- 0x3e00
strh r3, [r7, #8] @ __fp16 <- store to stack
ldrh r3, [r7, #8] @ __fp16 <- load from stack
mov r0, r3 @ __fp16        <- copy to r0
bl __gnu_h2f_ieee          <- function call (half2float)

Page 13: Cvim half precision floating point

Half conversion instructions
• Conversion instructions between half and float
  • VCVTB.F16.F32 (float -> half)
  • VCVTB.F32.F16 (half -> float)
  • VCVTT.F16.F32 (float -> half)
  • VCVTT.F32.F16 (half -> float)

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html

Page 14: Cvim half precision floating point

Half instructions
• ARM CPU might not have an FPU
• To use the FPU, compiler has to know
• Give an option to tell gcc

$ gcc -mfp16-format=ieee main.c
    ↓
$ gcc -mfp16-format=ieee -mfpu=vfpv4 main.c

Page 15: Cvim half precision floating point

Check in Assembler 2

With -mfpu=vfpv4:

movw r3, #15872
strh r3, [r7, #8] @ __fp16
add r2, r7, #8
vld1.16 {d7[2]}, [r2]
vcvtb.f32.f16 s15, s15

w/o FPU option:

movw r3, #15872
strh r3, [r7, #8] @ __fp16
ldrh r3, [r7, #8] @ __fp16
mov r0, r3 @ __fp16
bl __gnu_h2f_ieee

Page 16: Cvim half precision floating point

fp16 instructions on ARM
• Conversion between half <-> float only
  • VCVTB.F16.F32
  • VCVTB.F32.F16
  • VCVTT.F16.F32
  • VCVTT.F32.F16
• If you perform an operation on a half number, the number is promoted to single precision float just before the operation

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html

Page 17: Cvim half precision floating point

Summary
• ARM
  • To use the HW instruction, specify the FPU
  • No arithmetic instructions for half, only conversion to and from fp32
• ARM(SIMD)
• Intel, AMD (x86)
• CUDA

Page 18: Cvim half precision floating point

fp16 instruction on ARM (SIMD)
• The v in vcvt stands for vector
• Let’s try using SIMD instructions
• Conversion instruction using SIMD
  • float16x4_t vcvt_f16_f32(float32x4_t a);
    • VCVT.F16.F32 d0, q0
  • float32x4_t vcvt_f32_f16(float16x4_t a);
    • VCVT.F32.F16 q0, d0

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html

Page 19: Cvim half precision floating point

Try the operation in vector

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uint8x8_t   srcInteger      = vld1_u8(src + x);                          // load 64bits (8 x uchar)
    float16x4_t gainHalfLow     = *(float16x4_t*)(gain + x    );             // load 64bits (4 x half)
    float16x4_t gainHalfHigh    = *(float16x4_t*)(gain + x + 4);             // load 64bits (4 x half)
    uint16x8_t  srcIntegerShort = vmovl_u8(srcInteger);                      // uchar -> ushort
    uint32x4_t  srcIntegerLow   = vmovl_u16(vget_low_u16 (srcIntegerShort)); // ushort -> uint
    uint32x4_t  srcIntegerHigh  = vmovl_u16(vget_high_u16(srcIntegerShort)); // ushort -> uint
    float32x4_t srcFloatLow     = vcvtq_f32_u32(srcIntegerLow );             // uint -> float
    float32x4_t srcFloatHigh    = vcvtq_f32_u32(srcIntegerHigh);             // uint -> float
    float32x4_t gainFloatLow    = vcvt_f32_f16(gainHalfLow );                // half -> float
    float32x4_t gainFloatHigh   = vcvt_f32_f16(gainHalfHigh);                // half -> float
    float32x4_t dstFloatLow     = vmulq_f32(srcFloatLow,  gainFloatLow );    // float * float
    float32x4_t dstFloatHigh    = vmulq_f32(srcFloatHigh, gainFloatHigh);    // float * float
    uint32x4_t  dstIntegerLow   = vcvtq_u32_f32(dstFloatLow );               // float -> uint
    uint32x4_t  dstIntegerHigh  = vcvtq_u32_f32(dstFloatHigh);               // float -> uint
    uint16x8_t  dstIntegerShort = vcombine_u16(vmovn_u32(dstIntegerLow),
                                               vmovn_u32(dstIntegerHigh));   // uint -> ushort
    uint8x8_t   dstInteger      = vmovn_u16(dstIntegerShort);                // ushort -> uchar
    vst1_u8(dst + x, dstInteger);                                            // store
}

https://github.com/tomoaki0705/sampleFp16Vector

Page 20: Cvim half precision floating point

Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                              // load 64bits (8 x uchar)
    half4   gainHalfLow     = load_half4(gain + x    );                          // load 64bits (4 x half)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                          // load 64bits (4 x half)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);                // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);         // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);         // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );              // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);              // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );                // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);                // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );      // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);      // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );                // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);                // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);           // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                           // store
}

Page 21: Cvim half precision floating point

Let’s build
• Specify one of the FPU options that has both SIMD and half: neon-fp16, neon-vfpv4, neon-fp-armv8 or crypto-neon-fp-armv8
• The FPU has to have feature of SIMD and half

List of FPU option
vfp
vfpv3
vfpv3-fp16
vfpv3-d16
vfpv3-d16-fp16
vfpv3xd
vfpv3xd-fp16
neon
neon-fp16
vfpv4
vfpv4-d16
fpv4-sp-d16
neon-vfpv4
fp-armv8
neon-fp-armv8
crypto-neon-fp-armv8

http://dench.flatlib.jp/opengl/fpu_vfp
http://tessy.org/wiki/index.php?ARM%A4%CEFPU

Page 22: Cvim half precision floating point

Check in Assembly

VCVT instruction

Page 23: Cvim half precision floating point

Summary
• ARM
  • Done
• ARM(SIMD)
  • Specify the FPU which is capable of both SIMD and half
• Intel, AMD (x86)
• CUDA

Page 24: Cvim half precision floating point

half instructions on x86
• F16C instruction set

https://en.wikipedia.org/wiki/F16C

Page 25: Cvim half precision floating point

Try the operation in vector

const unsigned int cParallel = 8;
const __m128i v_zero = _mm_setzero_si128(); // assumed definition: all-zero register used for unpacking
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    __m128i srcInteger      = _mm_loadl_epi64((__m128i const *)(src + x));      // load 64bits (8 x uchar)
    __m128i gainHalfLow     = _mm_loadl_epi64((__m128i const *)(gain + x    )); // load 64bits (4 x half)
    __m128i gainHalfHigh    = _mm_loadl_epi64((__m128i const *)(gain + x + 4)); // load 64bits (4 x half)
    __m128i srcIntegerShort = _mm_unpacklo_epi8(srcInteger, v_zero);            // uchar -> ushort
    __m128i srcIntegerLow   = _mm_unpacklo_epi16(srcIntegerShort, v_zero);      // ushort -> uint
    __m128i srcIntegerHigh  = _mm_unpackhi_epi16(srcIntegerShort, v_zero);      // ushort -> uint
    __m128  srcFloatLow     = _mm_cvtepi32_ps(srcIntegerLow );                  // int -> float
    __m128  srcFloatHigh    = _mm_cvtepi32_ps(srcIntegerHigh);                  // int -> float
    __m128  gainFloatLow    = _mm_cvtph_ps(gainHalfLow );                       // half -> float
    __m128  gainFloatHigh   = _mm_cvtph_ps(gainHalfHigh);                       // half -> float
    __m128  dstFloatLow     = _mm_mul_ps(srcFloatLow , gainFloatLow );          // float * float
    __m128  dstFloatHigh    = _mm_mul_ps(srcFloatHigh, gainFloatHigh);          // float * float
    __m128i dstIntegerLow   = _mm_cvtps_epi32(dstFloatLow );                    // float -> int
    __m128i dstIntegerHigh  = _mm_cvtps_epi32(dstFloatHigh);                    // float -> int
    __m128i dstIntegerShort = _mm_packs_epi32(dstIntegerLow, dstIntegerHigh);   // int -> short
    __m128i dstInteger      = _mm_packus_epi16(dstIntegerShort, v_zero);        // short -> uchar
    _mm_storel_epi64((__m128i *)(dst + x), dstInteger);                         // store
}

https://github.com/tomoaki0705/sampleFp16Vector

Page 26: Cvim half precision floating point

Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                              // load 64bits (8 x uchar)
    half4   gainHalfLow     = load_half4(gain + x    );                          // load 64bits (4 x half)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                          // load 64bits (4 x half)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);                // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);         // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);         // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );              // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);              // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );                // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);                // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );      // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);      // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );                // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);                // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);           // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                           // store
}

$ gcc -mf16c main.cpp

Page 27: Cvim half precision floating point

Check in Assembly
• Note that inline functions are not expanded inline when built in Debug mode

Page 28: Cvim half precision floating point

Check in Assembly
• Built in RelWithDebInfo mode
• Instructions are more packed
• Conversion instruction (vcvtph2ps)

Page 29: Cvim half precision floating point

Check in Assembly (gcc)
• Same behavior as Visual Studio: inline functions are kept as function calls

Page 30: Cvim half precision floating point

Check in Assembly (gcc)
• Assembly of Release mode
• Much more packed instructions
• Conversion instruction (vcvtph2ps)

Page 31: Cvim half precision floating point

Summary
• ARM
  • Done
• ARM(SIMD)
  • Done
• Intel, AMD (x86)
  • x86 has half conversion as one of the SIMD instructions
  • Implemented on Ivy Bridge and later CPU (Intel)
  • Implemented on Piledriver and later CPU (AMD)
  • Done
• CUDA

Page 32: Cvim half precision floating point

CUDA

unsigned short a = g_indata[y*imgw+x];
float gain;
gain = __half2float(a);

float b = imageData[(y*imgw+x)*3  ];
float g = imageData[(y*imgw+x)*3+1];
float r = imageData[(y*imgw+x)*3+2];

g_odata[(y*imgw+x)*3  ] = clamp(b * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+1] = clamp(g * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+2] = clamp(r * gain, 0.0f, 255.0f);

Page 33: Cvim half precision floating point

The best point of using half
• The size of the data transferred to the GPU is reduced

GPU memory

Page 34: Cvim half precision floating point

Summary
• ARM
  • Done
• ARM(SIMD)
  • Done
• Intel, AMD (x86)
  • Done
• CUDA
  • CUDA 7.5 and later support half natively
  • Pascal has been announced to have direct operations on half <- Announced on 5 April
  • Partially available on Jetson TX1
  • The conversion instruction itself has existed for a long time

http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html

Page 35: Cvim half precision floating point

Summary of each platform

Platform | Conversion (single variable) | Conversion (vector) | Direct operation with fp16
ARM | ◯ | ◯ | ×
x86 | × | ◯ | ×
CUDA (Maxwell and older) | ◯ | ◯ | ×
CUDA (Pascal and later) | ◯ | ◯ <-New! | ◯ <-New!

Page 36: Cvim half precision floating point

Limit of half precision -Overflow-
• The maximum of float (32bit)
  • Exponent 8bits, significand 23bits -> Up to about 3.4 × 10^38
  • This is larger than the maximum of signed int (+2,147,483,647)
• The maximum of half (16bit)
  • Exponent 5bits, significand 10bits -> Up to 65504
  • This is smaller than the maximum of unsigned short (65535)
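Where the two maxima come from can be checked with a short calculation: the largest finite value is (2 - 2^-significand_bits) × 2^bias, with bias = 2^(exponent_bits - 1) - 1. The function name `largest_finite` below is a hypothetical helper for this sketch, not a library call.

```c
/* Largest finite value of an IEEE 754 style binary format
   with the given exponent and significand widths. */
double largest_finite(int exponent_bits, int significand_bits)
{
    int bias = (1 << (exponent_bits - 1)) - 1;  /* 15 for half, 127 for float */

    double last_bit = 1.0;                      /* becomes 2^-significand_bits */
    for (int i = 0; i < significand_bits; i++)
        last_bit *= 0.5;

    double value = 2.0 - last_bit;              /* significand with all bits set */
    for (int i = 0; i < bias; i++)
        value *= 2.0;                           /* scale by 2^bias */
    return value;
}
```

Calling `largest_finite(5, 10)` gives 65504 for half, and `largest_finite(8, 23)` gives roughly 3.4 × 10^38 for float.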

Page 37: Cvim half precision floating point

Limit of half precision -Rounding Error-
• Rounding error of float (32bit)
  • Exact integers can be expressed up to 16777216 (=2^24)
• Rounding error of half (16bit)
  • Exact integers can be expressed up to 2048 (=2^11)
  • Between 1024 and 2047, half can only express exact integers
  • Between 512 and 1024, half can only express numbers with a step of 0.5
  • Ex. 180.5 + 178.2 + 185.2 + 150.3 + 160.3 = 854.5
  • Correct average: 854.5/5 = 170.9
  • Computing with half: 171.0 <- rounding error
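The step sizes follow from the 10bit significand: inside each power-of-two range the gap between neighbouring half values is 2^-10 of the lower end of the range. A minimal sketch in portable C, assuming a finite input no larger than the half maximum; `half_spacing` is a hypothetical name for this illustration.

```c
/* Gap between neighbouring half values around |x|.
   Assumes x is finite and |x| <= 65504. */
float half_spacing(float x)
{
    float magnitude = x < 0.0f ? -x : x;

    /* Start at the smallest normal range [2^-14, 2^-13);
       below it (subnormals) the gap stays at 2^-24. */
    float lower = 6.103515625e-05f;       /* 2^-14 */
    float gap   = 5.9604644775390625e-08f; /* 2^-24 = 2^-14 * 2^-10 */

    /* Move up one power-of-two range at a time; the gap doubles each time. */
    while (lower < 32768.0f && magnitude >= lower * 2.0f) {
        lower *= 2.0f;
        gap   *= 2.0f;
    }
    return gap;
}
```

This returns 1.0 for values between 1024 and 2047 and 0.5 for values between 512 and 1023, which is the behaviour listed above.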

Page 38: Cvim half precision floating point

Summary
• Explanation of FP16, half precision floating point
• Available on platform
  • ARM (single variable / SIMD, storage only)
  • X86 (SIMD only, storage only)
  • CUDA (operation of fp16 coming on TX1)

