cvim half precision floating point
TRANSCRIPT
Half Precision Floating Point Number
-half-@tomoaki_teshima
How big is the image?
• Multiplying two images (floating point operation)
Size! Size!! Size!!!
• RGB: 3 bytes / pixel
• float: 4 bytes / pixel
• Any more space to reduce?
Summary
• Explanation of half
• Example on ARM
• Example on ARM w/ SIMD instructions
• Example on Intel, AMD (x86)
• Example on CUDA
Format of Floating points (IEEE 754)

Precision                 | Total  | Signed bit | Exponent | Significand
double (double precision) | 64 bit | 1 bit      | 11 bit   | 52 bit
float (single precision)  | 32 bit | 1 bit      | 8 bit    | 23 bit
half (half precision)     | 16 bit | 1 bit      | 5 bit    | 10 bit
ARM has fp16
https://ja.wikipedia.org/wiki/半精度浮動小数点数
What to prepare
• An ARM machine which runs Linux
  • Raspberry Pi Zero/1/2/3
  • ODROID XU4/C2
  • Jetson TK1/TX1
  • PINE64
• Red ones are 64-bit architecture
• Buy one for better understanding
Example on ARM

#include <stdio.h>

int main(int argc, char** argv)
{
    printf("Hello World !!\n");
    __fp16 halfPrecision = 1.5f;
    printf("half precision:%f\n", halfPrecision);
    printf("half precision:sizeof %zu\n", sizeof(halfPrecision));
    printf("half precision:0x%04x\n", *(short*)(void*)&halfPrecision);
    float original[] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f,
                         9.0f,10.0f,11.0f,12.0f,13.0f,14.0f,15.0f,16.0f,};
    for (unsigned int i = 0; i < 16; i++)
    {
        __fp16 stub = original[i];
        printf("%2d 0x%04x\n", (int)original[i], *(short*)&stub);
    }
    return 0;
}
https://github.com/tomoaki0705/sampleFp16
Build it
• Required to add the option "-mfp16-format"
• Try it with ARM gcc; other compilers give an "unknown option" error
$ gcc -std=c99 -mfp16-format=ieee main.c
Result

7.0f -> 0x4700 = 0 10001 1100000000
        Signed bit (+) | Exponent (17) | Significand (1/2 + 1/4)

Significand bit weights: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024

2^(17-15) × (1 + 1/2 + 1/4) = 2^2 × 7/4 = 7

When the exponent is all 0, the number is subnormal.
When the exponent is all 1, the number is Inf or NaN.
Summary
• The floating point format is more complicated than integer
• half can express floating point numbers in 2 bytes
Check in Assembly
• The conversion is implemented in software
• What's the point of doing it on the SW side?

$ gcc -S -std=c99 -mfp16-format=ieee -o main.c.s main.c

movw r3, #15872              <- 0x3e00
strh r3, [r7, #8] @ __fp16   <- store to stack
ldrh r3, [r7, #8] @ __fp16   <- load from stack
mov  r0, r3       @ __fp16   <- copy to r0
bl   __gnu_h2f_ieee          <- function call (half2float)
Half conversion instructions
• Conversion instructions between half and float
  • VCVTB.F16.F32 (float -> half)
  • VCVTB.F32.F16 (half -> float)
  • VCVTT.F16.F32 (float -> half)
  • VCVTT.F32.F16 (half -> float)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html
Half instructions
• An ARM CPU might not have an FPU
• To use the FPU, the compiler has to know about it
• Give gcc an option to tell it

$ gcc -mfp16-format=ieee main.c
↓
$ gcc -mfp16-format=ieee -mfpu=vfpv4 main.c
Check in Assembly 2

With -mfpu=vfpv4:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
add  r2, r7, #8
vld1.16 {d7[2]}, [r2]
vcvtb.f32.f16 s15, s15

Without the FPU option:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
ldrh r3, [r7, #8] @ __fp16
mov  r0, r3 @ __fp16
bl   __gnu_h2f_ieee
fp16 instructions on ARM
• Conversion between half <-> float only
  • VCVTB.F16.F32
  • VCVTB.F32.F16
  • VCVTT.F16.F32
  • VCVTT.F32.F16
• If you perform an operation with a half number, the number is promoted to single precision float just before the operation
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html
Summary
• ARM
  • To use the HW instruction, specify the FPU
  • No operation instructions, only conversion to/from fp32
• ARM (SIMD)
• Intel, AMD (x86)
• CUDA
fp16 instructions on ARM (SIMD)
• vcvt stands for vector convert
• Let's try using SIMD instructions
• Conversion instructions using SIMD
  • float16x4_t vcvt_f16_f32(float32x4_t a);  -> VCVT.F16.F32 d0, q0
  • float32x4_t vcvt_f32_f16(float16x4_t a);  -> VCVT.F32.F16 q0, d0
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html
Try the operation in vector

#include <arm_neon.h>

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uint8x8_t   srcInteger   = vld1_u8(src + x);                  // load 64 bits
    float16x4_t gainHalfLow  = *(float16x4_t*)(gain + x);         // load 64 bits
    float16x4_t gainHalfHigh = *(float16x4_t*)(gain + x + 4);     // load 64 bits
    uint16x8_t  srcIntegerShort = vmovl_u8(srcInteger);           // uchar -> ushort
    uint32x4_t  srcIntegerLow  = vmovl_u16(vget_low_u16 (srcIntegerShort)); // ushort -> uint
    uint32x4_t  srcIntegerHigh = vmovl_u16(vget_high_u16(srcIntegerShort)); // ushort -> uint
    float32x4_t srcFloatLow  = vcvtq_f32_u32(srcIntegerLow);      // uint -> float
    float32x4_t srcFloatHigh = vcvtq_f32_u32(srcIntegerHigh);     // uint -> float
    float32x4_t gainFloatLow  = vcvt_f32_f16(gainHalfLow);        // half -> float
    float32x4_t gainFloatHigh = vcvt_f32_f16(gainHalfHigh);       // half -> float
    float32x4_t dstFloatLow  = vmulq_f32(srcFloatLow,  gainFloatLow);  // float * float
    float32x4_t dstFloatHigh = vmulq_f32(srcFloatHigh, gainFloatHigh); // float * float
    uint32x4_t  dstIntegerLow  = vcvtq_u32_f32(dstFloatLow);      // float -> uint
    uint32x4_t  dstIntegerHigh = vcvtq_u32_f32(dstFloatHigh);     // float -> uint
    uint16x8_t  dstIntegerShort = vcombine_u16(vmovn_u32(dstIntegerLow),
                                               vmovn_u32(dstIntegerHigh)); // uint -> ushort
    uint8x8_t   dstInteger = vmovn_u16(dstIntegerShort);          // ushort -> uchar
    vst1_u8(dst + x, dstInteger);                                 // store
}
https://github.com/tomoaki0705/sampleFp16Vector
Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger   = load_uchar8(src + x);                  // load 64 bits
    half4   gainHalfLow  = load_half4(gain + x);                  // load 64 bits
    half4   gainHalfHigh = load_half4(gain + x + 4);              // load 64 bits
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger); // uchar -> ushort
    uint4   srcIntegerLow  = convert_ushort8_lo_uint4(srcIntegerShort); // ushort -> uint
    uint4   srcIntegerHigh = convert_ushort8_hi_uint4(srcIntegerShort); // ushort -> uint
    float4  srcFloatLow  = convert_uint4_float4(srcIntegerLow);   // uint -> float
    float4  srcFloatHigh = convert_uint4_float4(srcIntegerHigh);  // uint -> float
    float4  gainFloatLow  = convert_half4_float4(gainHalfLow);    // half -> float
    float4  gainFloatHigh = convert_half4_float4(gainHalfHigh);   // half -> float
    float4  dstFloatLow  = multiply_float4(srcFloatLow,  gainFloatLow);  // float * float
    float4  dstFloatHigh = multiply_float4(srcFloatHigh, gainFloatHigh); // float * float
    uint4   dstIntegerLow  = convert_float4_uint4(dstFloatLow);   // float -> uint
    uint4   dstIntegerHigh = convert_float4_uint4(dstFloatHigh);  // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger = convert_ushort8_uchar8(dstIntegerShort); // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                            // store
}
Let's build
• Specify one of the FPU options
• The FPU has to support both SIMD and half (e.g. neon-fp16, neon-vfpv4)

List of FPU options:
vfp, vfpv3, vfpv3-fp16, vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16,
neon, neon-fp16, vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4,
fp-armv8, neon-fp-armv8, crypto-neon-fp-armv8
http://dench.flatlib.jp/opengl/fpu_vfp
http://tessy.org/wiki/index.php?ARM%A4%CEFPU
Check in Assembly
VCVT instruction
Summary
• ARM
  • Done
• ARM (SIMD)
  • Specify an FPU which is capable of both SIMD and half
• Intel, AMD (x86)
• CUDA
half instructions on x86
• F16C instruction set
https://en.wikipedia.org/wiki/F16C
Try the operation in vector

#include <immintrin.h>

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    __m128i srcInteger   = _mm_loadl_epi64((__m128i const *)(src + x));      // load 64 bits
    __m128i gainHalfLow  = _mm_loadl_epi64((__m128i const *)(gain + x));     // load 64 bits
    __m128i gainHalfHigh = _mm_loadl_epi64((__m128i const *)(gain + x + 4)); // load 64 bits
    __m128i srcIntegerShort = _mm_unpacklo_epi8(srcInteger, v_zero);         // uchar -> ushort
    __m128i srcIntegerLow  = _mm_unpacklo_epi16(srcIntegerShort, v_zero);    // ushort -> uint
    __m128i srcIntegerHigh = _mm_unpackhi_epi16(srcIntegerShort, v_zero);    // ushort -> uint
    __m128  srcFloatLow  = _mm_cvtepi32_ps(srcIntegerLow);                   // uint -> float
    __m128  srcFloatHigh = _mm_cvtepi32_ps(srcIntegerHigh);                  // uint -> float
    __m128  gainFloatLow  = _mm_cvtph_ps(gainHalfLow);                       // half -> float
    __m128  gainFloatHigh = _mm_cvtph_ps(gainHalfHigh);                      // half -> float
    __m128  dstFloatLow  = _mm_mul_ps(srcFloatLow,  gainFloatLow);           // float * float
    __m128  dstFloatHigh = _mm_mul_ps(srcFloatHigh, gainFloatHigh);          // float * float
    __m128i dstIntegerLow  = _mm_cvtps_epi32(dstFloatLow);                   // float -> uint
    __m128i dstIntegerHigh = _mm_cvtps_epi32(dstFloatHigh);                  // float -> uint
    __m128i dstIntegerShort = _mm_packs_epi32(dstIntegerLow, dstIntegerHigh);// uint -> ushort
    __m128i dstInteger = _mm_packus_epi16(dstIntegerShort, v_zero);          // ushort -> uchar
    _mm_storel_epi64((__m128i *)(dst + x), dstInteger);                      // store
}
https://github.com/tomoaki0705/sampleFp16Vector
Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger   = load_uchar8(src + x);                  // load 64 bits
    half4   gainHalfLow  = load_half4(gain + x);                  // load 64 bits
    half4   gainHalfHigh = load_half4(gain + x + 4);              // load 64 bits
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger); // uchar -> ushort
    uint4   srcIntegerLow  = convert_ushort8_lo_uint4(srcIntegerShort); // ushort -> uint
    uint4   srcIntegerHigh = convert_ushort8_hi_uint4(srcIntegerShort); // ushort -> uint
    float4  srcFloatLow  = convert_uint4_float4(srcIntegerLow);   // uint -> float
    float4  srcFloatHigh = convert_uint4_float4(srcIntegerHigh);  // uint -> float
    float4  gainFloatLow  = convert_half4_float4(gainHalfLow);    // half -> float
    float4  gainFloatHigh = convert_half4_float4(gainHalfHigh);   // half -> float
    float4  dstFloatLow  = multiply_float4(srcFloatLow,  gainFloatLow);  // float * float
    float4  dstFloatHigh = multiply_float4(srcFloatHigh, gainFloatHigh); // float * float
    uint4   dstIntegerLow  = convert_float4_uint4(dstFloatLow);   // float -> uint
    uint4   dstIntegerHigh = convert_float4_uint4(dstFloatHigh);  // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger = convert_ushort8_uchar8(dstIntegerShort); // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                            // store
}
$ gcc -mf16c main.cpp
Check in Assembly
• Note that inline functions have not been expanded inline when built in Debug mode

Check in Assembly
• Built with RelWithDebInfo mode
• Instructions are more packed
• Conversion instruction (vcvtph2ps)
Check in Assembly (gcc)
• Same behavior as Visual Studio: inline functions are kept as function calls

Check in Assembly (gcc)
• Assembly of Release mode
• Much more packed instructions
• Conversion instruction (vcvtph2ps)
Summary
• ARM
  • Done
• ARM (SIMD)
  • Done
• Intel, AMD (x86)
  • x86 has half conversion as one of the SIMD instructions
  • Implemented on Ivy Bridge and later CPUs (Intel)
  • Implemented on Piledriver and later CPUs (AMD)
  • Done
• CUDA
CUDA

unsigned short a = g_indata[y*imgw+x];
float gain = __half2float(a);

float b = imageData[(y*imgw+x)*3  ];
float g = imageData[(y*imgw+x)*3+1];
float r = imageData[(y*imgw+x)*3+2];

g_odata[(y*imgw+x)*3  ] = clamp(b * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+1] = clamp(g * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+2] = clamp(r * gain, 0.0f, 255.0f);
The best point of using half
• The data size transferred to GPU memory is reduced
Summary
• ARM
  • Done
• ARM (SIMD)
  • Done
• Intel, AMD (x86)
  • Done
• CUDA
  • CUDA 7.5 and later support half natively
  • Pascal has been announced to have direct operations on half <- announced on 5th April
  • Partially available on Jetson TX1
  • The conversion instruction itself has existed for a long time

http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html
Summary of each platform

Platform                 | Conversion (single variable) | Conversion (vector) | Direct operation with fp16
ARM                      | ◯                            | ◯                   | ×
x86                      | ×                            | ◯                   | ×
CUDA (Maxwell and older) | ◯                            | ◯                   | ×
CUDA (Pascal and later)  | ◯                            | ◯                   | ◯ <- New!
Limit of half precision - Overflow -
• The maximum of float (32 bit)
  • Exponent 8 bits, significand 23 bits -> up to about 3.4 × 10^38
  • This is larger than the maximum of signed int (+2,147,483,647)
• The maximum of half (16 bit)
  • Exponent 5 bits, significand 10 bits -> up to 65504
  • This is smaller than the maximum of unsigned short (65535)
Limit of half precision - Rounding Error -
• Rounding error of float (32 bit)
  • Exact integers can be expressed up to 16777216 (= 2^24)
• Rounding error of half (16 bit)
  • Exact integers can be expressed up to 2048 (= 2^11)
  • Between 1024 and 2047, half can only express exact integers
  • Between 512 and 1024, half can only express numbers with a step of 0.5
  • Ex. 180.5 + 178.2 + 185.2 + 150.3 + 160.3 = 854.5
    • Correct average: 854.5 / 5 = 170.9
    • Computed with half: 171.0 <- rounding error
Summary
• Explanation of FP16, half precision floating point
• Availability on each platform:
  • ARM (single variable / SIMD, storage only)
  • x86 (SIMD only, storage only)
  • CUDA (fp16 operations coming; partially available on TX1)