Intel® AVX-512 Architecture
Comprehensive vector extension for HPC and enterprise

• 512-bit wide vectors, 32 SIMD registers
• 8 new mask registers
• Embedded Rounding Control
• Embedded Broadcast
• New math instructions
• 2-source shuffles
• Gather and Scatter
• Compress and Expand
• Conflict Detection

AVX-512 – What’s new?

ISA support by microarchitecture:

  NHM: SSE*
  SNB: SSE*, AVX
  HSW: SSE*, AVX, AVX2
  KNL: SSE*, AVX, AVX2, AVX-512

AVX-512 on KNL:

• AVX-512 Foundation
• Exponential and Reciprocal
• Prefetching
• Conflict Detection

More And Bigger Registers

• 32 SIMD registers, 512 bits wide

Conflict Detection

• Sparse computations are hard to vectorize
• The naive vectorized sequence below is wrong if any values within B[i] are duplicated
• The VPCONFLICT instruction detects the elements with conflicts

Scalar source:

for(i=0; i<16; i++) { A[B[i]]++; }

Naive vectorization (incorrect when indices repeat):

index   = vload &B[i]          // Load 16 B[i]
old_val = vgather A, index     // Grab A[B[i]]
new_val = vadd old_val, +1.0   // Compute new values
vscatter A, index, new_val     // Update A[B[i]]

Conflict-free vectorization:

index = vload &B[i]            // Load 16 B[i]
pending_elem = 0xFFFF;
do {
    curr_elem = get_conflict_free_subset(index, pending_elem)
    old_val = vgather {curr_elem} A, index     // Grab A[B[i]]
    new_val = vadd old_val, +1.0               // Compute new values
    vscatter A {curr_elem}, index, new_val     // Update A[B[i]]
    pending_elem = pending_elem ^ curr_elem    // remove done indices
} while (pending_elem)
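A minimal C sketch of this conflict-free loop with AVX-512F/CD intrinsics (the function name and layout are ours, not from the talk; assumes 32-bit counters and exactly 16 indices per step):

#include <immintrin.h>

// Increment A[B[0..15]] once per occurrence, even when indices repeat.
void histogram_step(int *A, const int *B) {
    __m512i index = _mm512_loadu_si512(B);         // Load 16 B[i]
    // VPCONFLICTD: bit j is set in lane i if index[j] == index[i] for j < i.
    __m512i conf  = _mm512_conflict_epi32(index);
    __mmask16 pending = 0xFFFF;
    do {
        // A lane is ready when none of its conflicting earlier lanes is still pending.
        __m512i pend_vec = _mm512_set1_epi32((int)pending);
        __mmask16 ready  = _mm512_mask_testn_epi32_mask(pending, conf, pend_vec);
        __m512i old_val = _mm512_mask_i32gather_epi32(
            _mm512_setzero_si512(), ready, index, A, 4);        // Grab A[B[i]]
        __m512i new_val = _mm512_add_epi32(old_val, _mm512_set1_epi32(1));
        _mm512_mask_i32scatter_epi32(A, ready, index, new_val, 4); // Update A[B[i]]
        pending = (__mmask16)(pending & ~ready);                // remove done indices
    } while (pending);
}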

Compress And Expand

Compress values from a 512-bit vector of f64, f32, i64, or i32 elements using a mask, and store the result in a register or in memory:

VCOMPRESSPD zmm1/mV {k1}, zmm2

VEXPANDPS zmm1 {k1}{z}, zmm2/mV

compress, mask k[] = 01110011 (k0 on the left):

  source: A0 A1 A2 A3 A4 A5 A6 A7
  result: A1 A2 A3 A6 A7  X  X  X    (trailing elements are zeroed in the register form)

expand, mask k[] = 01110011 (k0 on the left):

  source: A0 A1 A2 A3 A4 (remaining elements unused)
  result:  X A0 A1 A2  X  X A3 A4    (all “X” are zeroed or remain unchanged)

A typical pattern that maps to a compressing store:

for (i = 0; i < N; i++) {
    if (topVal > b[i]) {
        *dst = a[i];
        dst++;
    }
}
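This pattern maps directly onto VCOMPRESSPS. A minimal sketch with AVX-512F intrinsics, covering one 16-element step (the function name and the popcount-based pointer bump are our own illustration; POPCNT assumed available):

#include <immintrin.h>

// Store a[i] for each lane where topVal > b[i], packed contiguously at dst.
// Returns the updated dst position.
float *compress_step(float topVal, const float *a, const float *b, float *dst) {
    __m512 top = _mm512_set1_ps(topVal);
    __m512 vb  = _mm512_loadu_ps(b);
    __mmask16 k = _mm512_cmp_ps_mask(top, vb, _CMP_GT_OQ);     // topVal > b[i]
    _mm512_mask_compressstoreu_ps(dst, k, _mm512_loadu_ps(a)); // vcompressps to memory
    return dst + _mm_popcnt_u32((unsigned)k);                  // advance by #stored elements
}

The matching expand direction is the maskz expand-load, e.g. _mm512_maskz_expandloadu_ps(k, src), which pulls consecutive elements from memory into the masked lanes.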

Predication Scheme

Source code:

if (condition) {
    A = ..
} else {
    A = ..
}

LLVM IR:

%Mask = cmp (condition)
%A1 = ..
%A2 = ..
%A = SELECT %Mask, %A1, %A2

Machine code, with a blend:

CMP   %regMask = ops
ADD   %regA1 = x1, y1
ADD   %regA2 = x2, y2
BLEND %regA = %regMask, %regA1, %regA2

Machine code, predicated:

Mask = cmp (condition)
Not-Mask = not(Mask)
{Mask} C1 = ..
{Mask} B1 = ..
{Mask} A = B1+C1
{Not-Mask} C2 = ..
{Not-Mask} B2 = ..
{Not-Mask} A = B2+C2

Goal: set predicates (Mask and Not-Mask) on the instructions that calculate A1 and A2.

Result: if the mask is all-zero, the instruction is not executed.

Mask Propagation Pass – design ideas

topVal = ..
Mask = cmp()
C1 = topVal*2
B1 = ..
A1 = B1 + C1
C2 = ..
B2 = ..
A2 = B2 + C2
A = BLEND(Mask, A1, A2)

A new Machine Pass:

• Runs before register allocation
• Starts from the “blend” operands and walks up recursively to the mask definition
• Checks all users of the destination operand before applying the mask

Predicated memory accesses in LLVM IR would be helpful!

The Mask Propagation Pass does not guarantee full mask propagation over the whole path from blend to compare:

• Load/store operations require exact masking
• FP operations require masking if exceptions are not suppressed

For these cases, IR generators should use compiler intrinsics.

Propagation example, stage 1 – Mask propagated up the A1 side:

topVal = ..
Mask = cmp()
{Mask} C1 = topVal*2
{Mask} B1 = ..
{Mask} A1 = B1 + C1
C2 = ..
B2 = ..
A2 = B2 + C2
A = BLEND(Mask, A1, A2)

Stage 2 – !Mask propagated up the A2 side as well:

topVal = ..
Mask = cmp()
!Mask = Not(Mask)
{Mask} C1 = topVal*2
{Mask} B1 = ..
{Mask} A1 = B1 + C1
{!Mask} C2 = ..
{!Mask} B2 = ..
{!Mask} A2 = B2 + C2
A = BLEND(Mask, A1, A2)

Stage 3 – both sides fully predicated, so the blend is redundant and both sides write A directly:

topVal = ..
Mask = cmp()
!Mask = Not(Mask)
{Mask} C1 = topVal*2
{Mask} B1 = ..
{Mask} A = B1 + C1
{!Mask} C2 = ..
{!Mask} B2 = ..
{!Mask} A = B2 + C2

Masking in LLVM

Unmasked elements remain unchanged:
  VADDPD zmm1 {k1}, zmm2, zmm3
or zeroed:
  VADDPD zmm1 {k1}{z}, zmm2, zmm3

float32 A[N], B[N], C[N];

for(i=0; i<16; i++)

{

if (B[i] != 0)

A[i] = A[i] / B[i];

else

A[i] = A[i] / C[i];

}

VMOVUPS zmm2, A

VCMPPS k1, zmm0, B

VDIVPS zmm1 {k1}{z}, zmm2, B

KNOT k2, k1

VDIVPS zmm1 {k2}, zmm2, C

VMOVUPS A, zmm1
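The same pattern with compiler intrinsics, as a minimal sketch (assumes AVX-512F and one 16-float step; the function name is ours):

#include <immintrin.h>

// A[i] = A[i] / (B[i] != 0 ? B[i] : C[i]) for one 16-element chunk.
// Masked-off lanes of the divides are suppressed, so B[i] == 0 cannot fault.
void masked_div_step(float *A, const float *B, const float *C) {
    __m512 a = _mm512_loadu_ps(A);
    __m512 b = _mm512_loadu_ps(B);
    __m512 c = _mm512_loadu_ps(C);
    __mmask16 k = _mm512_cmp_ps_mask(b, _mm512_setzero_ps(), _CMP_NEQ_OQ); // B[i] != 0
    __m512 r = _mm512_maskz_div_ps(k, a, b);          // A/B where B != 0, else 0
    r = _mm512_mask_div_ps(r, _mm512_knot(k), a, c);  // A/C merged into the remaining lanes
    _mm512_storeu_ps(A, r);
}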

Embedded Broadcast

A scalar source from memory is repeated across all the elements:

vbroadcastss zmm3, [rax]
vaddps zmm1, zmm2, zmm3

becomes

vaddps zmm1, zmm2, [rax] {1to16}
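In C this typically comes from code like the following sketch; a compiler targeting AVX-512 is free to fold the set1 into the arithmetic instruction as an embedded broadcast (the function name is ours):

#include <immintrin.h>

// v[i] + *scale for 16 floats; *scale is repeated across all lanes.
__m512 scale_add(__m512 v, const float *scale) {
    return _mm512_add_ps(v, _mm512_set1_ps(*scale)); // may compile to vaddps ..., [mem] {1to16}
}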

Embedded Rounding Control

• Static (per instruction) rounding mode
• No need to access MXCSR any more!

vaddps zmm7 {k6}, zmm2, zmm4 {rd}
vcvtdq2ps zmm1, zmm2 {ru}

All exceptions are always suppressed when embedded RC is used.
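With intrinsics, the rounding mode is passed per operation; embedded rounding implies suppress-all-exceptions, which is why _MM_FROUND_NO_EXC must be OR'ed in (minimal sketch, our function name):

#include <immintrin.h>

// Add with round-toward-negative-infinity embedded in the instruction;
// MXCSR is neither read nor modified.
__m512 add_round_down(__m512 a, __m512 b) {
    return _mm512_add_round_ps(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}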

Masking benefits:

• Memory fault suppression
• Avoid FP exceptions
• Avoid extra blends

Merge masking example, zmm1 {k1} = zmm2 + zmm3:

zmm1 (before): a7    a6 a5    a4    a3    a2    a1 a0
zmm2:          b7    b6 b5    b4    b3    b2    b1 b0
zmm3:          c7    c6 c5    c4    c3    c2    c1 c0
k1:             1     0  1     1     1     1     0  0
zmm1 (after):  b7+c7 a6 b5+c5 b4+c4 b3+c3 b2+c2 a1 a0
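The diagram (8 lanes of a zmm, i.e. 64-bit elements) corresponds to a merge-masked add; with intrinsics (sketch, our function name):

#include <immintrin.h>

// result[i] = k1[i] ? b[i] + c[i] : a[i]  — unmasked lanes keep a.
__m512d merge_masked_add(__m512d a, __mmask8 k1, __m512d b, __m512d c) {
    return _mm512_mask_add_pd(a, k1, b, c);
}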

Register files:

XMM0-15: 128 bits
YMM0-15: 256 bits
ZMM0-31: 512 bits

• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. • A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR

ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. • Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. • The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. • Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm • Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. • Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. • Copyright © 2013 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Elena Demikhovsky
Intel® Software and Services Group, Israel