Cvim saisentan: half-precision floating-point numbers (half)

Post on 08-Jan-2017

Category:

Engineering

TRANSCRIPT

<p>OpenCV float</p>
<p>half @tomoaki_teshimaCV</p>
<p>Float </p>
<p>RGB: 3 bytes / pixel; float: 4 bytes / pixel</p>
<p>Deep Learning, Python, C/C++, Chainer</p>
<p>half: ARM; ARM (SIMD); Intel, AMD (x86); CUDA</p>
<p>IEEE 754</p>
<p>64 bit = double, 32 bit = float, 16 bit = half</p>
<p>double: sign 1 bit, exponent 11 bit, mantissa 52 bit; float: sign 1 bit, exponent 8 bit, mantissa 23 bit; half: sign 1 bit, exponent 5 bit, mantissa 10 bit</p>
<p>ARM fp16</p>
<p>https://ja.wikipedia.org/wiki/</p>
<p>Linux, ARM: Raspberry Pi zero/1/2/3, ODROID XU4/C2, Jetson TK1/TX1, PINE64; 64bit</p>
<p>#include &lt;stdio.h&gt;
int main(int argc, char**argv){
 printf("Hello World !!\n");
 __fp16 halfPrecision = 1.5f;
 printf("half precision:%f\n", halfPrecision);
 printf("half precision:sizeof %d\n", (int)sizeof(halfPrecision));
 /* dump the raw 16 bit pattern */
 printf("half precision:0x%04x\n", *(unsigned short*)(void*)&amp;halfPrecision);</p>
<p> float original[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f,10.0f,11.0f,12.0f,13.0f,14.0f,15.0f,16.0f,};
 for (unsigned int i = 0;i &lt; 16;i++) {
  __fp16 stub = original[i]; /* float -&gt; half */
  printf("%2d 0x%04x\n", (int)original[i], *(unsigned short*)&amp;stub);
 }
 return 0;
}
https://github.com/tomoaki0705/sampleFp16</p>
<p>ARM, gcc, "unknown option": $ gcc -std=c99 -mfp16-format=ieee main.c</p>
<p>0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0; mantissa bit weights: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024</p>
<p>Sign (+), exponent (17)</p>
<p>Exponent all 0: subnormal; exponent all 1: Inf, NaN</p>
<p>/2byte</p>
<p>$ gcc -S -std=c99 -mfp16-format=ieee -o main.c.s main.c
movw r3, #15872 @ 0x3e00
strh r3, [r7, #8] @ __fp16 (stack)
ldrh r3, [r7, #8] @ __fp16 (stack)
mov r0, r3 @ __fp16
bl __gnu_h2f_ieee @ (half2float)</p>
<p>ARM half: half / float. VCVTB.F16.F32: float to half; VCVTB.F32.F16: half to float; VCVTT.F16.F32: float to half; VCVTT.F32.F16: half to float</p>
<p>http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html</p>
<p>Half and the FPU option:
$ gcc -mfp16-format=ieee main.c
$ gcc -mfp16-format=ieee -mfpu=vfpv4 main.c</p>
<p>FPU=vfpv4:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
add r2, r7, #8
vld1.16 {d7[2]}, [r2]
vcvtb.f32.f16 s15, s15
No FPU option:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
ldrh r3, [r7, #8] @ __fp16
mov r0, r3 @ __fp16
bl __gnu_h2f_ieee</p>
<p>ARM half: half / 
float: VCVTB.F16.F32, VCVTB.F32.F16, VCVTT.F16.F32, VCVTT.F32.F16; half / float cast</p>
<p>ARM: FPU, HW float cast; ARM (SIMD); Intel, AMD (x86); CUDA</p>
<p>ARM fp16 (SIMD): vcvt, V: SIMD
float16x4_t vcvt_f16_f32(float32x4_t a); // VCVT.F16.F32 d0, q0
float32x4_t vcvt_f32_f16(float16x4_t a); // VCVT.F32.F16 q0, d0
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html</p>
<p>const unsigned int cParallel = 8;
for (unsigned int x = 0;x [...] ushort
 uint32x4_t srcIntegerLow = vmovl_u16(vget_low_u16 (srcIntegerShort)); // ushort -&gt; uint
 uint32x4_t srcIntegerHigh = vmovl_u16(vget_high_u16(srcIntegerShort)); // ushort -&gt; uint
 float32x4_t srcFloatLow = vcvtq_f32_u32(srcIntegerLow ); // uint -&gt; float
 float32x4_t srcFloatHigh = vcvtq_f32_u32(srcIntegerHigh); // uint -&gt; float
 float32x4_t gainFloatLow = vcvt_f32_f16(gainHalfLow ); // half -&gt; float
 float32x4_t gainFloatHigh = vcvt_f32_f16(gainHalfHigh); // half -&gt; float
 float32x4_t dstFloatLow = vmulq_f32(srcFloatLow, gainFloatLow ); // float * float
 float32x4_t dstFloatHigh = vmulq_f32(srcFloatHigh, gainFloatHigh); // float * float
 uint32x4_t dstIntegerLow = vcvtq_u32_f32(dstFloatLow ); // float -&gt; uint
 uint32x4_t dstIntegerHigh = vcvtq_u32_f32(dstFloatHigh); // float -&gt; uint
 uint16x8_t dstIntegerShort = vcombine_u16(vmovn_u32(dstIntegerLow), vmovn_u32(dstIntegerHigh)); // uint -&gt; ushort
 uint8x8_t dstInteger = vmovn_u16(dstIntegerShort); // ushort -&gt; uchar
 vst1_u8(dst+x, dstInteger); // store
}</p>
<p>https://github.com/tomoaki0705/sampleFp16Vector</p>
<p>const unsigned int cParallel = 8;
for (unsigned int x = 0;x [...] ushort
 uint4 srcIntegerLow = convert_ushort8_lo_uint4(srcIntegerShort); // ushort -&gt; uint
 uint4 srcIntegerHigh = convert_ushort8_hi_uint4(srcIntegerShort); // ushort -&gt; uint
 float4 srcFloatLow = convert_uint4_float4(srcIntegerLow ); // uint -&gt; float
 float4 srcFloatHigh = convert_uint4_float4(srcIntegerHigh); // uint -&gt; float
 float4 gainFloatLow = convert_half4_float4(gainHalfLow ); // half -&gt; 
float
 float4 gainFloatHigh = convert_half4_float4(gainHalfHigh); // half -&gt; float
 float4 dstFloatLow = multiply_float4(srcFloatLow , gainFloatLow ); // float * float
 float4 dstFloatHigh = multiply_float4(srcFloatHigh, gainFloatHigh); // float * float
 uint4 dstIntegerLow = convert_float4_uint4(dstFloatLow ); // float -&gt; uint
 uint4 dstIntegerHigh = convert_float4_uint4(dstFloatHigh); // float -&gt; uint
 ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -&gt; ushort
 uchar8 dstInteger = convert_ushort8_uchar8(dstIntegerShort); // ushort -&gt; uchar
 store_uchar8(dst + x, dstInteger); // store
}</p>
<p>-mfpu: half; half, NEON (SIMD), FPU</p>
<p>vfp, vfpv3, vfpv3-fp16, vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16, neon, neon-fp16, vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4, fp-armv8, crypto-neon-fp-armv8
: mfpu
http://dench.flatlib.jp/opengl/fpu_vfp
http://tessy.org/wiki/index.php?ARM%A4%CEFPU</p>
<p>ARM; ARM (SIMD): fp16, neon, FPU; Intel, AMD (x86); CUDA</p>
<p>x86 half: F16C</p>
<p>https://en.wikipedia.org/wiki/F16C</p>
<p>const unsigned int cParallel = 8;
for (unsigned int x = 0;x [...] ushort
 __m128i srcIntegerLow = _mm_unpacklo_epi16(srcIntegerShort, v_zero); // ushort -&gt; uint
 __m128i srcIntegerHigh = _mm_unpackhi_epi16(srcIntegerShort, v_zero); // ushort -&gt; uint
 __m128 srcFloatLow = _mm_cvtepi32_ps(srcIntegerLow ); // uint -&gt; float
 __m128 srcFloatHigh = _mm_cvtepi32_ps(srcIntegerHigh); // uint -&gt; float
 __m128 gainFloatLow = _mm_cvtph_ps(gainHalfLow ); // half -&gt; float
 __m128 gainFloatHigh = _mm_cvtph_ps(gainHalfHigh); // half -&gt; float
 __m128 dstFloatLow = _mm_mul_ps(srcFloatLow , gainFloatLow ); // float * float
 __m128 dstFloatHigh = _mm_mul_ps(srcFloatHigh, gainFloatHigh); // float * float
 __m128i dstIntegerLow = _mm_cvtps_epi32(dstFloatLow ); // float -&gt; uint
 __m128i dstIntegerHigh = _mm_cvtps_epi32(dstFloatHigh); // float -&gt; uint
 __m128i dstIntegerShort = _mm_packs_epi32(dstIntegerLow, dstIntegerHigh); // uint -&gt; ushort
 __m128i dstInteger = 
_mm_packus_epi16(dstIntegerShort, v_zero); // ushort -&gt; uchar
 _mm_storel_epi64((__m128i *)(dst + x), dstInteger); // store
}
https://github.com/tomoaki0705/sampleFp16Vector</p>
<p>const unsigned int cParallel = 8;
for (unsigned int x = 0;x [...] ushort
 uint4 srcIntegerLow = convert_ushort8_lo_uint4(srcIntegerShort); // ushort -&gt; uint
 uint4 srcIntegerHigh = convert_ushort8_hi_uint4(srcIntegerShort); // ushort -&gt; uint
 float4 srcFloatLow = convert_uint4_float4(srcIntegerLow ); // uint -&gt; float
 float4 srcFloatHigh = convert_uint4_float4(srcIntegerHigh); // uint -&gt; float
 float4 gainFloatLow = convert_half4_float4(gainHalfLow ); // half -&gt; float
 float4 gainFloatHigh = convert_half4_float4(gainHalfHigh); // half -&gt; float
 float4 dstFloatLow = multiply_float4(srcFloatLow , gainFloatLow ); // float * float
 float4 dstFloatHigh = multiply_float4(srcFloatHigh, gainFloatHigh); // float * float
 uint4 dstIntegerLow = convert_float4_uint4(dstFloatLow ); // float -&gt; uint
 uint4 dstIntegerHigh = convert_float4_uint4(dstFloatHigh); // float -&gt; uint
 ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -&gt; ushort
 uchar8 dstInteger = convert_ushort8_uchar8(dstIntegerShort); // ushort -&gt; uchar
 store_uchar8(dst + x, dstInteger); // store
}</p>
<p>$ gcc -mf16c main.cpp</p>
<p>Debug: Inline</p>
<p>Release: CMake RelWithDebInfo (vcvtph2ps)</p>
<p>(gcc)</p>
<p>VS Debug: inline</p>
<p>(gcc)</p>
<p>Release</p>
<p>(vcvtph2ps)</p>
<p>ARM; ARM (SIMD); Intel, AMD (x86): x86 SSE, Ivy Bridge (Intel), Piledriver (AMD); CUDA
https://blogs.msdn.microsoft.com/chuckw/2012/09/11/directxmath-f16c-and-fma/</p>
<p>CUDA:
unsigned short a = g_indata[y*imgw+x];
float gain;
gain = __half2float(a);</p>
<p>float b = imageData[(y*imgw+x)*3 ];
float g = imageData[(y*imgw+x)*3+1];
float r = imageData[(y*imgw+x)*3+2];</p>
<p>g_odata[(y*imgw+x)*3 ] = clamp(b * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+1] = clamp(g * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+2] = clamp(r * gain, 0.0f, 255.0f);</p>
<p>half GPU</p>
<p>(GPU)</p>
<p>ARM; ARM (SIMD); Intel, AMD (x86); CUDA: CUDA 7.5, Half; Pascal (New!!); Jetson TX1 GPU, HW
http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html</p>
<p>half SIMD Fp16: ARM, X86, CUDA (Maxwell), CUDA (Pascal) New! New!</p>
<p>float: exponent 8 bit, mantissa 23 bit, up to about 10E38; signed int: +2,147,483,647. half: exponent 5 bit, mantissa 10 bit, up to 65504; unsigned short: +65535</p>
<p>float: 16777216 (=2^24); half: 2048 (=2^11); 1024-2048: step 1; 512-1024: step 0.5. Ex. 180.5 + 178.2 + 185.2 + 150.3 + 160.3 = 854.5, average 170.9; Half: 171.0, error 0.1</p>
<p>CodeIQ: pow</p>
<p> - https://codeiq.jp/q/2549
 - https://codeiq.jp/magazine/2015/12/35521/</p>
<p>n | Fn | pn | digits (10) | digits (2)
73 | 806515533049393 | 806515533049393 | 15 | 50
74 | 1304969544928657 | 1304969544928657 | 16 | 51
75 | 2111485077978050 | 2111485077978050 | 16 | 51
76 | 3416454622906707 | 3416454622906706 | 16 | 52
77 | 5527939700884757 | 5527939700884756 | 16 | 53
78 | 8944394323791464 | 8944394323791464 | 16 | 53
79 | 14472334024676221 | 14472334024676218 | 17 | 54</p>
<p> - Wikipedia https://ja.wikipedia.org/wiki/
 - tomoaki0705/sampleFp16: sample code to treat FP16 on ARM https://github.com/tomoaki0705/sampleFp16
 - ARM Information Center http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html
 - ARM Information Center http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html
 - tomoaki0705/sampleFp16Vector: float16bit sample code on x86 and ARM https://github.com/tomoaki0705/sampleFp16Vector
 - opengl:fpu_vfp [HYPER] http://dench.flatlib.jp/opengl/fpu_vfp
 - ARM FPU - AkiWiki http://tessy.org/wiki/index.php?ARM%A4%CEFPU
 - F16C - Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/F16C
 - DirectXMath: F16C and FMA | Games for Windows and the DirectX SDK https://blogs.msdn.microsoft.com/chuckw/2012/09/11/directxmath-f16c-and-fma/
 - 1071: GPU CUDA 7.5 Maxwell http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
 - GPU Pascal HBM2 720GB/sec - PC Watch http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html
 - | CodeIQ https://codeiq.jp/q/2549
 - CodeIQ MAGAZINE https://codeiq.jp/magazine/2015/12/35521/</p>
