OpenPOWER ABI Performance Improvements

Ian McIntosh

November 5, 2014

CASCON 2014

www.ibm.com

© 2006-2008 IBM Corporation

Abstract

OpenPOWER Linux ABI changes to improve performance

Many of the differences between the AIX and Linux on PowerPC ABIs and the new OpenPOWER Linux ABI are designed to improve the performance of calls, parameter passing, result returning, and static memory access. They also reduce memory usage. Some of the approaches are novel.

OpenPOWER

OpenPOWER is a new partnership led by IBM, with 68 others. See www.openpowerfoundation.org and www.en.wikipedia.org/wiki/OpenPOWER_Foundation.

IBM will openly license PowerPC CPU designs (not just the architecture) to its partners.

The minimum PowerPC hardware architecture level is Power8.

Nvidia GPU processors integrated with PowerPC speed some programs.

OpenPOWER ABI

The OpenPOWER Linux on PowerPC variant includes a new ABI (Application Binary Interface).

An ABI defines the interfaces between program components; e.g., how calls are made, how parameters are passed, how results are returned, what object files and debug information look like, etc.

This ABI is 64-bit mode only and Little Endian (LE) only.

Some of it resembles the 64-bit Big Endian ABI and the AIX ABI, parts resemble the 32-bit Big Endian ABI, but some important parts are new and different, some very different.

Most of the changes are to improve performance.

OpenPOWER ABI

The object file format is ELF (the same as Big Endian Linux on PowerPC), which differs from AIX’s XCOFF format.

The debug format is DWARF4.

The default code model is medium, with each library or the main module (executable) limited to 4 GB (a 32-bit displacement). There are also small and large code models. Currently the XL compilers only support the medium model.

Most changes affect only compiler back ends, not front ends, and not user application programming.

OpenPOWER ABI - Addressing

Pointers are 64 bits.

For the medium code model, normally two instructions (add immediate shifted, then load / store / add immediate) are used to access static memory (function instructions, the GOT / TOC, some extern variables, static variables and the constant area). Each provides 16 bits of the displacement.

A GOT (Global Offset Table) holds pointers to extern functions and some extern / static data used by a module.

A TOC (Table of Contents) contains the GOT and also may contain directly accessible data.

Other static memory might be accessed indirectly via the GOT / TOC, directly (if not visible outside the library) or via an absolute address (main module only, and only up to 2 GB).
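
The difference between direct and indirect access can be modeled in portable C. This is only a sketch under stated assumptions: `module_image`, `toc_base`, `got`, `load_direct` and `load_via_got` are hypothetical names, and `toc_base` merely stands in for the TOC pointer. It illustrates that direct access is one address computation plus one load, while access through the GOT needs a pointer load before the data load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a module's static memory as one byte array,
   with toc_base standing in for the TOC pointer. */
enum { MODULE_BYTES = 256 };
static unsigned char module_image[MODULE_BYTES];
static unsigned char *const toc_base = module_image + 128;

/* Direct access: address = TOC pointer + signed displacement, then one load. */
int32_t load_direct(int32_t displacement) {
    int32_t value;
    memcpy(&value, toc_base + displacement, sizeof value);
    return value;
}

/* Indirect access: load a pointer from a GOT slot, then load the data
   it points to. The second load depends on the first. */
static int32_t *got[4];

int32_t load_via_got(int slot) {
    return *got[slot];
}
```

In this model the displacement may be negative, because the base pointer sits in the middle of the region it addresses.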

OpenPOWER ABI Major Changes

Major Changes:

– Accessing static and some extern variables is faster.

– How calls are made is normally faster.

– How parameters are passed is sometimes faster.

– How results are returned is sometimes faster.

– The minimum stack frame size is smaller.

– Code size can be smaller.

– Some of the changes take advantage of Power8 CPU improvements like Instruction Fusion.

– There are some changes to Altivec / VMX functions.

OpenPOWER ABI – Static/Extern Addressing

How static variables and some extern variables are accessed is normally faster:

– Static and some extern variables can be accessed directly with an add immediate shifted (addis) instruction based on gpr2, then a load / store / add immediate, without first having to load a GOT / TOC pointer to the block they’re in. (For many loads and stores, the two instructions are fusable on Power8.)

– The default is nopic, allowing direct access in the main module.

– In libraries, use pic (which uses the medium model). Extern variables that could be accessed from the main module must be accessed indirectly via the GOT / TOC, because the library loader may move them to the main module. Using datalocal and visibility to ensure they are not exported allows the compiler to use direct access, which is often faster.
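
As an illustration of the visibility point, here is a minimal C sketch of the three kinds of variable a library might hold. The variable and function names are hypothetical, and the attribute spelling is the GCC / Clang one, which is an assumption about the toolchain in use. The comments describe the access the slide says a compiler is allowed to choose, not what any particular compiler emits.

```c
#include <assert.h>

/* File-local: never visible outside this object file, so direct
   access relative to the TOC pointer is possible. */
static int call_count = 0;

/* Hypothetical library-internal variable. Hidden visibility keeps it
   from being exported, so the loader cannot move it to the main module
   and direct access remains possible. */
__attribute__((visibility("hidden"))) int cache_size = 64;

/* Default visibility: exported from a shared library, so pic code is
   expected to reach it indirectly via the GOT / TOC. */
int exported_limit = 100;

int record_call(void) {
    call_count += 1;
    return call_count + cache_size + exported_limit;
}
```

The program behaves the same whichever visibility is chosen; only the generated access sequence is expected to differ.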

OpenPOWER ABI - Calls

How calls are made is normally faster:

– Functions have both a Global Entry Point (GEP) and a Local Entry Point (LEP). The Global code typically uses two instructions (addis + addi) to point to the GOT / TOC, then falls into the Local code.

– If the function doesn’t use the GOT / TOC then it can omit those.

– There are no Function Descriptors (FDs):

• Calls to extern functions and calls via pointers (including C++ virtual calls) do not need to load the GOT / TOC pointer because the GEP will construct it when it's needed. Local calls skip that, so are faster.

• Calls to nested functions via pointers do not need to load the parent’s environment pointer because a “trampoline” will load it.

• This eliminates a load of the FD address then 2 or 3 loads from it.

– The GOT / TOC pointer only needs to be saved once per function not once per call.
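
The call kinds discussed above can be seen in a few lines of C. This is a sketch with hypothetical function names; the comments state which entry point the slide's description suggests each call would use, which is an interpretation rather than compiler output.

```c
#include <assert.h>

/* Static function: every caller is in this file, so calls to it can be
   direct local calls that skip the GOT / TOC setup. */
static int add_one(int x) {
    return x + 1;
}

/* Externally visible function: callers in other modules would enter
   through the Global Entry Point, which sets up the GOT / TOC pointer
   if the function needs it. */
int add_two(int x) {
    return add_one(add_one(x));
}

typedef int (*unary_fn)(int);

/* Call via pointer: the callee is unknown at compile time, so the call
   goes to whatever address the pointer holds, with no descriptor to load. */
int apply(unary_fn fn, int x) {
    return fn(x);
}
```

For example, `apply(add_two, 5)` is a pointer call, while the two calls inside `add_two` are local calls.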

OpenPOWER ABI - Parameters

How parameters are passed is sometimes faster:

– When practical, value parameters are passed in registers, including cases where AIX and BE Linux did not.

– Homogeneous floating point struct etc. parameters (with up to 8 compatible floating point members) are normally passed via FPRs instead of GPRs or memory. That includes complex and similar structs.

– Homogeneous vector struct etc. parameters (with up to 8 vector members) are normally passed via VRs instead of GPRs or memory.

– Using the right registers saves instructions, avoids store-reload delays (aka load-hit-store), and eliminates compiler temporaries that waste stack space and hurt cache locality.
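
To make "homogeneous" concrete, here is a small C sketch with hypothetical type and function names. Under the rules described above, `struct vec4` (four members of one floating-point type) would be a candidate for passing in FPRs, while `struct tagged` mixes member types and would not be.

```c
#include <assert.h>

/* Four members of the same floating-point type: a homogeneous aggregate
   with fewer than the 8-member limit mentioned above. */
struct vec4 {
    double x, y, z, w;
};

/* Mixed integer and floating-point members: not homogeneous. */
struct tagged {
    int tag;
    double value;
};

static_assert(sizeof(struct vec4) == 4 * sizeof(double),
              "vec4 is four packed doubles");

/* Both arguments are passed by value. */
double dot(struct vec4 a, struct vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

double scaled(struct tagged t, double factor) {
    return t.tag * t.value * factor;
}
```

The C source is identical on every ABI; only the registers or memory used to carry `a`, `b` and `t` change.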

OpenPOWER ABI - Results

How results are returned is sometimes faster:

– When practical, results are returned in registers, including cases where AIX and BE Linux did not.

– Homogeneous floating-point struct results are returned via FPRs instead of in memory. That includes complex and similar structs.

– Homogeneous vector struct results are returned via VRs instead of in memory.

– Small non-floating-point non-vector struct results of up to 2 doublewords are returned via GPRs instead of in memory.

– Using the right registers saves store and load instructions, store-reload delays and compiler temps.
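
The result rules can be illustrated with two small struct-returning functions. This is a sketch with hypothetical names; the comments restate the slide's rules and do not show actual compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Two doublewords of non-floating-point data: small enough to be
   returned via GPRs under the rule above. */
struct quot_rem {
    int64_t quot;
    int64_t rem;
};

/* Three doubles: a homogeneous floating-point struct, a candidate for
   being returned via FPRs. */
struct point3 {
    double x, y, z;
};

static_assert(sizeof(struct quot_rem) == 16, "exactly 2 doublewords");

struct quot_rem divide(int64_t n, int64_t d) {
    struct quot_rem r = { n / d, n % d };
    return r;
}

struct point3 midpoint(struct point3 a, struct point3 b) {
    struct point3 m = { (a.x + b.x) / 2, (a.y + b.y) / 2, (a.z + b.z) / 2 };
    return m;
}
```

On an ABI that returns such structs in memory, the caller would instead pass a hidden pointer and the callee would store each member through it.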

OpenPOWER ABI – ABI Overhead

Work instructions in green, AIX ABI overhead in red. NO OpenPOWER overhead.

     | 000000                       PDEF    _mm256_mul_ps
1 199|                              PROC    #retvalptr_49,left,right,gr3,gr5-gr10
0   0| 0031A0 stdu     F821FF31   1 ST8U    gr1,#stack(gr1,-208)=gr1
1 206| 0031A4 addi     380000C0   1 LI      gr0=192
1 203| 0031A8 addi     38810130   1 LA      gr4=right(gr1,304)
0   0| 0031AC std      F8C10118   1 ST8     left(gr1,280)=gr6
0   0| 0031B0 std      F8A10110   1 ST8     left(gr1,272)=gr5
0   0| 0031B4 std      F8E10120   1 ST8     left(gr1,288)=gr7
0   0| 0031B8 std      F9010128   1 ST8     left(gr1,296)=gr8
0   0| 0031BC std      F9210130   1 ST8     right(gr1,304)=gr9
1 202| 0031C0 addi     38A10110   1 LA      gr5=left(gr1,272)
0   0| 0031C4 std      F9410138   1 ST8     right(gr1,312)=gr10
1 203| 0031C8 lxvd2x   7C202698   1 VLQD    vs1=right(gr4,0)
1 202| 0031CC lxvd2x   7C002E98   1 VLQD    vs0=left(gr5,0)
1 203| 0031D0 addi     38C40010   1 AI      gr6=gr4,16
1 202| 0031D4 addi     38850010   1 AI      gr4=gr5,16
1 205| 0031D8 xvmulsp  F0400A80   1 VFM     vs2=vs0,vs1,fcr
1 203| 0031DC lxvd2x   7C203698   1 VLQD    vs1=right(gr6,0)
1 202| 0031E0 lxvd2x   7C002698   1 VLQD    vs0=left(gr4,0)
1 206| 0031E4 xvmulsp  F0000A80   1 VFM     vs0=vs0,vs1,fcr
1 205| 0031E8 addi     388000B0   1 LI      gr4=176
1 205| 0031EC stxvd2x  7C412798   1 VSTQD   result.m128_0(gr1,gr4,0)=vs2
0   0| 0031F0 ori      60420000   1 XNOP
1 206| 0031F4 stxvd2x  7C010798   1 VSTQD   result.m128_1(gr1,gr0,0)=vs0
1 207| 0031F8 ld       E80100B0   1 L8      gr0=result(gr1,176)
1 207| 0031FC std      F8030000   2 ST8     #retval_49(gr3,0)=gr0
1 207| 003200 ld       E88100B8   1 L8      gr4=result(gr1,184)
1 207| 003204 std      F8830008   2 ST8     #retval_49(gr3,8)=gr4
0   0| 003208 ori      60420000   1 XNOP
1 207| 00320C ld       E80100C0   1 L8      gr0=result(gr1,192)
1 207| 003210 ld       E88100C8   1 L8      gr4=result(gr1,200)
1 208| 003214 addi     382100D0   1 AI      gr1=gr1,208
1 207| 003218 std      F8030010   1 ST8     #retval_49(gr3,16)=gr0
1 207| 00321C std      F8830018   1 ST8     #retval_49(gr3,24)=gr4
1 208| 003220 bclr     4E800020   0 BA      lr
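
The C source behind this listing is not shown on the slide. The following is a hypothetical reconstruction of a function of the same shape, assuming a 256-bit value modeled as a struct of two 128-bit vector halves; the member names follow the `result.m128_0` and `result.m128_1` operands in the listing. The type name `m256_sketch`, the function name `mul_ps_sketch` and the use of the GCC / Clang `vector_size` extension are assumptions made for illustration.

```c
#include <assert.h>

/* 128-bit vector of four floats (GCC / Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical stand-in for a 256-bit type built from two 128-bit halves. */
typedef struct {
    v4sf m128_0;
    v4sf m128_1;
} m256_sketch;

/* Element-wise multiply of both halves. Two struct parameters are passed
   by value and one struct result is returned by value, matching the
   left / right / result names in the listing. */
m256_sketch mul_ps_sketch(m256_sketch left, m256_sketch right) {
    m256_sketch result;
    result.m128_0 = left.m128_0 * right.m128_0;
    result.m128_1 = left.m128_1 * right.m128_1;
    return result;
}
```

Only the loads of the halves, the two multiplies and the stores of the result correspond to statements in this source. The remaining stores and reloads in the listing move the by-value parameters and the result between GPRs and memory.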

OpenPOWER ABI – Stack Frame Size

The minimum stack frame size is smaller:

– The minimum Parameter Save Area is 0 not 64 bytes, and two unneeded linkage doublewords are eliminated, so the minimum stack frame size is 32 bytes.

– AIX and BE Linux 32-bit mode minimum is 64 bytes.

– AIX and BE Linux 64-bit mode minimum is 112 bytes.

– The smaller size may allow better cache locality, deeper recursion, or more threads.
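
The arithmetic behind these numbers can be checked in a few lines of C. The constant names are hypothetical; the values are the ones given above. `max_min_frames` is an illustrative helper that gives an upper bound on recursion depth, since real functions usually need frames larger than the minimum.

```c
#include <assert.h>

enum {
    DOUBLEWORD_BYTES = 8,
    OLD_MIN_FRAME_64 = 112,           /* AIX and BE Linux 64-bit minimum */
    OLD_PARAM_SAVE_AREA = 64,         /* minimum Parameter Save Area, now 0 */
    REMOVED_LINKAGE_DOUBLEWORDS = 2,  /* unneeded linkage doublewords */
    NEW_MIN_FRAME = OLD_MIN_FRAME_64 - OLD_PARAM_SAVE_AREA
                    - REMOVED_LINKAGE_DOUBLEWORDS * DOUBLEWORD_BYTES
};

/* 112 - 64 - 2 * 8 = 32 bytes. */
static_assert(NEW_MIN_FRAME == 32, "matches the 32-byte minimum");

/* How many frames of a given size fit in a stack of a given size. */
unsigned long max_min_frames(unsigned long stack_bytes,
                             unsigned long frame_bytes) {
    return stack_bytes / frame_bytes;
}
```

With an assumed 8 MiB stack, 262144 minimum-size frames fit at 32 bytes each, compared with 74898 at 112 bytes each.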

OpenPOWER ABI – Code Size

Many calls take fewer instructions.

Because the cost of a call, its parameter passing and value returning can be lower, often inlining is less important. Doing less inlining can reduce code size even more.

In addition to fetching and executing fewer instructions, smaller code size can make the instruction cache more effective, improving performance more.

OpenPOWER ABI - Fusion

Some of the changes take advantage of Power8 CPU improvements like Instruction Fusion:

– For the medium code model, normally two instructions (add immediate shifted then load / store / add immediate) are used to access all static memory (function instructions, the GOT / TOC, extern variables, static variables and the constant area). The addis provides the upper 16 bits of the displacement, and the load / store / add immediate provides the lower 16 bits.

– For the main module with a 2 GB limit the add immediate shifted can be a load immediate shifted without using the GOT / TOC.

– The Power8 CPU usually fuses the two instructions used to load a general purpose register into one instruction going down the pipeline, and makes some other instruction pairs faster too.
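
The split of a 32-bit displacement into two 16-bit immediates can be sketched in C. The function names are hypothetical. One detail the sketch makes explicit: the low half is treated as a signed 16-bit value by the load / store / add immediate, so the upper half has to be adjusted by one whenever the low half is negative.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of the displacement, sign-extended the way the
   displacement field of a load / store / add immediate is. */
int32_t low_signed(int32_t displacement) {
    return ((displacement & 0xFFFF) ^ 0x8000) - 0x8000;
}

/* Upper part for addis: whatever remains once the signed low half has
   been subtracted, divided by 65536. The division is always exact. */
int32_t high_adjusted(int32_t displacement) {
    return (int32_t)(((int64_t)displacement - low_signed(displacement)) / 65536);
}

/* What the instruction pair computes: (high << 16) + low. */
int64_t recombine(int32_t high, int32_t low) {
    return (int64_t)high * 65536 + low;
}
```

For a displacement of 0x12348000 the low half is -32768, so the upper half becomes 0x1235 rather than 0x1234, and the two still sum to the original value.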

OpenPOWER ABI - Performance

Unchanged performance in programs with:

– Large functions / subroutines (including most of SPEC).

– Extensive inlining.

– Even for these, accessing static memory can be faster.

Faster performance in programs with:

– Many calls to small functions.

– Many pointer calls (except to nested functions), including many C++ virtual function calls.

– Homogeneous floating-point struct parameters and homogeneous vector struct parameters.

– Homogeneous floating-point struct and vector struct results.

– Small non-FP non-vector results.

Discussion and Questions

?