implementation of binary floating-point arithmetic on

120
Implementation of binary floating-point arithmetic on embedded integer processors Polynomial evaluation-based algorithms and certified code generation Guillaume Revy Advisors: Claude-Pierre Jeannerod and Gilles Villard Ar´ enaire INRIA project-team (LIP, Ens Lyon) Universit ´ e de Lyon CNRS Ph.D. Defense – December 1 st , 2009 Guillaume Revy – December 1 st , 2009. Implementation of binary floating-point arithmetic on embedded integer processors 1/45

Upload: others

Post on 18-Dec-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Implementation of binary floating-point arithmetic on

Implementation of binary floating-point arithmeticon embedded integer processors

Polynomial evaluation-based algorithmsand

certified code generation

Guillaume RevyAdvisors: Claude-Pierre Jeannerod and Gilles Villard

Arenaire INRIA project-team (LIP, Ens Lyon) Universite de Lyon CNRS

Ph.D. Defense – December 1st, 2009

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 1/45

Page 2: Implementation of binary floating-point arithmetic on

Motivation

Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost

Some embedded systems do not have any FPU (floating-point unit)

Highly used in audio and video applicationsI demanding on floating-point computations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 2/45

Page 3: Implementation of binary floating-point arithmetic on

Motivation

Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost

Some embedded systems do not have any FPU (floating-point unit)

Embedded systems

No FPU

Highly used in audio and video applicationsI demanding on floating-point computations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 2/45

Page 4: Implementation of binary floating-point arithmetic on

Motivation

Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost

Some embedded systems do not have any FPU (floating-point unit)

Applications

FP computations

Embedded systems

No FPU

Highly used in audio and video applicationsI demanding on floating-point computations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 2/45

Page 5: Implementation of binary floating-point arithmetic on

Motivation

Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost

Some embedded systems do not have any FPU (floating-point unit)

floating-point arithmeticSoftware implementing

Applications

FP computations

Embedded systems

No FPU

Highly used in audio and video applicationsI demanding on floating-point computations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 2/45

Page 6: Implementation of binary floating-point arithmetic on

Motivation

Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost

Some embedded systems do not have any FPU (floating-point unit)

ICache

ST231 core

DTLBMul

Registerfile (64

registers8 read

4 write)

Load

(LSU)

IU IU IU IU

Trapcontroller

4 x SDI

STBus

SDI ports

61 interrupts Debuglink

PeripheralsDebug

Timers3 x

controllersupport unit 32-bit

I-sidememorysubsystem

Interrupt

registerPC andbranch

unit

Branch

file

DCache

bufferWrite

Prefetchbuffer

SCU

CMC

STBus

64-bit

registersControl

UTLBMul

D-sidememorysubsystem

StoreUnit

ITLB

Instructionbuffer

floating-point arithmeticSoftware implementing

Applications

FP computations

Embedded systems

No FPU

Highly used in audio and video applicationsI demanding on floating-point computations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 2/45

Page 7: Implementation of binary floating-point arithmetic on

Overview of the ST231 architecture

ICache

DTLB

Mul

Registerfile (64

registers8 read

4 write)

Load

(LSU)

IU IU IU IU

Trapcontroller

4 x SDI

STBus

SDI ports

61 interrupts Debuglink

Peripherals

DebugTimers

3 xcontroller support unit 32-bit

I-sidememorysubsystem

Interrupt

registerPC andbranch

unit

Branch

file

DCache

bufferWrite

Prefetchbuffer

SCU

CMC

STBus

64-bit

registersControl

UTLBMul

D-sidememorysubsystem

StoreUnit

ITLB

Instructionbuffer

ST231 core

4-issue VLIW 32-bit integer processor→ no FPU

Parallel execution unitI 4 integer ALUI 2 pipelined multipliers 32 × 32→ 32

Latencies: ALU→ 1 cycle, Mul→ 3cycles

VLIW (Very Long Instruction Word)→ instructions grouped into bundles

→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler

uint32_t R1 = A0 + C;uint32_t R2 = A3 * X;uint32_t R3 = A1 * X;uint32_t R4 = X * X;

Issue 1 Issue 2 Issue 3 Issue 4

0 R1 R2 R3

1 R4

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 3/45

Page 8: Implementation of binary floating-point arithmetic on

Overview of the ST231 architecture

ICache

DTLB

Mul

Registerfile (64

registers8 read

4 write)

Load

(LSU)

IU IU IU IU

Trapcontroller

4 x SDI

STBus

SDI ports

61 interrupts Debuglink

Peripherals

DebugTimers

3 xcontroller support unit 32-bit

I-sidememorysubsystem

Interrupt

registerPC andbranch

unit

Branch

file

DCache

bufferWrite

Prefetchbuffer

SCU

CMC

STBus

64-bit

registersControl

UTLBMul

D-sidememorysubsystem

StoreUnit

ITLB

Instructionbuffer

ST231 core

4-issue VLIW 32-bit integer processor→ no FPU

Parallel execution unitI 4 integer ALUI 2 pipelined multipliers 32 × 32→ 32

Latencies: ALU→ 1 cycle, Mul→ 3cycles

VLIW (Very Long Instruction Word)→ instructions grouped into bundles

→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler

uint32_t R1 = A0 + C;uint32_t R2 = A3 * X;uint32_t R3 = A1 * X;uint32_t R4 = X * X;

Issue 1 Issue 2 Issue 3 Issue 4

0 R1 R2 R3

1 R4

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 3/45

Page 9: Implementation of binary floating-point arithmetic on

Overview of the ST231 architecture

ICache

DTLB

Mul

Registerfile (64

registers8 read

4 write)

Load

(LSU)

IU IU IU IU

Trapcontroller

4 x SDI

STBus

SDI ports

61 interrupts Debuglink

Peripherals

DebugTimers

3 xcontroller support unit 32-bit

I-sidememorysubsystem

Interrupt

registerPC andbranch

unit

Branch

file

DCache

bufferWrite

Prefetchbuffer

SCU

CMC

STBus

64-bit

registersControl

UTLBMul

D-sidememorysubsystem

StoreUnit

ITLB

Instructionbuffer

ST231 core

4-issue VLIW 32-bit integer processor→ no FPU

Parallel execution unitI 4 integer ALUI 2 pipelined multipliers 32 × 32→ 32

Latencies: ALU→ 1 cycle, Mul→ 3cycles

VLIW (Very Long Instruction Word)→ instructions grouped into bundles

→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler

uint32_t R1 = A0 + C;uint32_t R2 = A3 * X;uint32_t R3 = A1 * X;uint32_t R4 = X * X;

Issue 1 Issue 2 Issue 3 Issue 4

0 R1 R2 R3

1 R4

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 3/45

Page 10: Implementation of binary floating-point arithmetic on

How to emulate floating-point arithmetic in software?

Design and implementation of efficient software support forIEEE 754 floating-point arithmetic on integer processors

Existing software for IEEE 754 floating-point arithmetic:

I Software floating-point support of GCC, Glibc and µClibc, GoFastFloating-Point Library

I SoftFloat (→ STlib)

I FLIP (Floating-point Library for Integer Processors)

• software support for binary32 floating-point arithmetic on integer processors

• correctly-rounded addition, subtraction, multiplication, division, square root,reciprocal, ...

• handling subnormals, and handling special inputs

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 4/45

Page 11: Implementation of binary floating-point arithmetic on

Towards the generation of fast and certified codes

Underlying problem: development “by hand”

I long and tedious, error prone

I new target ? new floating-point format ?

⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generationof efficient and certified programs

I optimized for a given format, for the target architecture

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 5/45

Page 12: Implementation of binary floating-point arithmetic on

Towards the generation of fast and certified codes

Underlying problem: development “by hand”

I long and tedious, error prone

I new target ? new floating-point format ?

⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generationof efficient and certified programs

I optimized for a given format, for the target architecture

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 5/45

Page 13: Implementation of binary floating-point arithmetic on

Towards the generation of fast and certified codes

Underlying problem: development “by hand”

I long and tedious, error prone

I new target ? new floating-point format ?

⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generationof efficient and certified programs

I optimized for a given format, for the target architecture

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 5/45

Page 14: Implementation of binary floating-point arithmetic on

Towards the generation of fast and certified codes

Arenaire’s developments: hardware (FloPoCo) and software (Sollya,Metalibm)

Spiral project: hardware and software code generation for DSP algorithms

Can we teach computers to write fast libraries?

Our tool: CGPE (Code Generation for Polynomial Evaluation)

In the particular case of polynomial evaluation, can we teach computersto write fast and certified codes, for a given target and optimized for a

given format?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 6/45

Page 15: Implementation of binary floating-point arithmetic on

Towards the generation of fast and certified codes

Arenaire’s developments: hardware (FloPoCo) and software (Sollya,Metalibm)

Spiral project: hardware and software code generation for DSP algorithms

Can we teach computers to write fast libraries?

Our tool: CGPE (Code Generation for Polynomial Evaluation)

In the particular case of polynomial evaluation, can we teach computersto write fast and certified codes, for a given target and optimized for a

given format?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 6/45

Page 16: Implementation of binary floating-point arithmetic on

Basic blocks for implementing correctly-rounded operators(X, Y )

no

Floating-point number unpacking

Normalization

Result significand approximation

Rounding condition decision

Correct rounding computation

Result reconstruction

R

Range reduction

Result sign/exponentcomputation

Special output selection

yes

Special input detectionfunction independent

function dependent

Objectives

→ Low latency, correctly-roundedimplementations

→ ILP exposure

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 7/45

Page 17: Implementation of binary floating-point arithmetic on

Basic blocks for implementing correctly-rounded operators(X, Y )

no

Floating-point number unpacking

Normalization

Result significand approximation

Rounding condition decision

Correct rounding computation

Result reconstruction

R

Range reduction

Result sign/exponentcomputation

Special output selection

yes

Special input detection

yes

Special output selection

Special input detectionfunction independent

function dependent

Objectives

→ Low latency, correctly-roundedimplementations

→ ILP exposure

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 7/45

Page 18: Implementation of binary floating-point arithmetic on

Basic blocks for implementing correctly-rounded operators(X, Y )

no

Floating-point number unpacking

Normalization

Result significand approximation

Rounding condition decision

Correct rounding computation

Result reconstruction

R

Range reduction

Result sign/exponentcomputation

Special output selection

yes

Special input detection

yes

Special output selection

Special input detection

automatedFully

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Uniform approach for nth rootsand their reciprocals→ polynomial evaluation

Extension to division

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 7/45

Page 19: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 20: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 21: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 22: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 23: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 24: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

CGPECGPE

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 25: Implementation of binary floating-point arithmetic on

Flowchart for generating efficient and certified C codes

generation

Problem: function to be evaluated

Efficient and certified C code

ST231 features

C code Certificate

Computation of polynomial approximant

CGPECGPE

Constraints

Accuracy of approximant andC code

I Sollya

I interval arithmetic (MPFI),Gappa

Low evaluation latency onST231, ILP exposure

I ?

Efficiency of the generationprocess

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 8/45

Page 26: Implementation of binary floating-point arithmetic on

Outline of the talk

1. Design and implementation of floating-point operatorsBivariate polynomial evaluation-based approachImplementation of correct rounding

2. Low latency parenthesization computationClassical evaluation methodsComputation of all parenthesizationsTowards low evaluation latency

3. Selection of effective evaluation parenthesizationsGeneral frameworkAutomatic certification of generated C codes

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 9/45

Page 27: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators

Outline of the talk

1. Design and implementation of floating-point operatorsBivariate polynomial evaluation-based approachImplementation of correct rounding

2. Low latency parenthesization computation

3. Selection of effective evaluation parenthesizations

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 10/45

Page 28: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators

Notation and assumptions

Division C code(x, y) RN(x/y)

Input (x ,y ) and output RN(x/y): normal numbers

→ no underflow nor overflow

→ precision p, extremal exponents emin, emax

x =±1.mx ,1 . . .mx ,p−1 ·2ex with ex ∈ {emin, . . . ,emax}

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 11/45

Page 29: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators

Notation and assumptions

Division C code(x, y) RN(x/y)

Input (x ,y ) and output RN(x/y): normal numbers

→ no underflow nor overflow

→ precision p, extremal exponents emin, emax

x =±1.mx ,1 . . .mx ,p−1 ·2ex with ex ∈ {emin, . . . ,emax}

→ RoundTiesToEvenx/y RN(x/y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 11/45

Page 30: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators

Notation and assumptions

Division C code(X, Y ) R

Standard binary encoding: k -bit unsigned integer X encodes input x

Tx = mx,1 . . . mx,p−1

p− 1 bits

Ex = ex − emin − 1

w = k − p bits1 bit

sx

Computation: k -bit unsigned integers

→ integer and fixed-point arithmetic

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 11/45

Page 31: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators

Notation and assumptions

Division C code(X, Y ) R??

Standard binary encoding: k -bit unsigned integer X encodes input x

Tx = mx,1 . . . mx,p−1

p− 1 bits

Ex = ex − emin − 1

w = k − p bits1 bit

sx

Computation: k -bit unsigned integers

→ integer and fixed-point arithmetic

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 11/45

Page 32: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Range reduction of divisionExpress the exact result r = x/y as:

r = ` ·2d ⇒ RN(x/y) = RN(`) ·2d

with` ∈ [1,2) and d ∈ {emin, . . . ,emax}

Definition

c = 1 if mx ≥my , and c = 0 otherwise

Range reduction

x/y =(21−c ·mx/my

)︸ ︷︷ ︸:= ` ∈ [1,2)

· 2d with d = ex −ey −1+ c

How to compute the correctly-rounded significand RN(`) ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 12/45

Page 33: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Range reduction of divisionExpress the exact result r = x/y as:

r = ` ·2d ⇒ RN(x/y) = RN(`) ·2d

with` ∈ [1,2) and d ∈ {emin, . . . ,emax}

Definition

c = 1 if mx ≥my , and c = 0 otherwise

Range reduction

x/y =(21−c ·mx/my

)︸ ︷︷ ︸:= ` ∈ [1,2)

· 2d with d = ex −ey −1+ c

How to compute the correctly-rounded significand RN(`) ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 12/45

Page 34: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Range reduction of divisionExpress the exact result r = x/y as:

r = ` ·2d ⇒ RN(x/y) = RN(`) ·2d

with` ∈ [1,2) and d ∈ {emin, . . . ,emax}

Definition

c = 1 if mx ≥my , and c = 0 otherwise

Range reduction

x/y =(21−c ·mx/my

)︸ ︷︷ ︸:= ` ∈ [1,2)

· 2d with d = ex −ey −1+ c

How to compute the correctly-rounded significand RN(`) ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 12/45

Page 35: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Range reduction of divisionExpress the exact result r = x/y as:

r = ` ·2d ⇒ RN(x/y) = RN(`) ·2d

with` ∈ [1,2) and d ∈ {emin, . . . ,emax}

Definition

c = 1 if mx ≥my , and c = 0 otherwise

Range reduction

x/y =(21−c ·mx/my

)︸ ︷︷ ︸:= ` ∈ [1,2)

· 2d with d = ex −ey −1+ c

How to compute the correctly-rounded significand RN(`) ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 12/45

Page 36: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...

I Oberman and Flynn (1997)

I minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt

I Pineiro and Bruguera (2002) – Raina’s Ph.D., FLIP 0.3 (2006)

I exploit available multipliers, more ILP exposure

Polynomial-based methods

I Agarwal, Gustavson and Schmookler (1999)→ univariate polynomial evaluation

I Our approach→ bivariate polynomial evaluation: maximal ILP exposure

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 13/45

Page 37: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...

I Oberman and Flynn (1997)

I minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt

I Pineiro and Bruguera (2002) – Raina’s Ph.D., FLIP 0.3 (2006)

I exploit available multipliers, more ILP exposure

Polynomial-based methods

I Agarwal, Gustavson and Schmookler (1999)→ univariate polynomial evaluation

I Our approach→ bivariate polynomial evaluation: maximal ILP exposure

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 13/45

Page 38: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...

I Oberman and Flynn (1997)

I minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt

I Pineiro and Bruguera (2002) – Raina’s Ph.D., FLIP 0.3 (2006)

I exploit available multipliers, more ILP exposure

Polynomial-based methods

I Agarwal, Gustavson and Schmookler (1999)→ univariate polynomial evaluation

I Our approach→ bivariate polynomial evaluation: maximal ILP exposure

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 13/45

Page 39: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Correct rounding via truncated one-sided approximation

How to compute RN(`), with ` = 21−c ·mx/my ?

Three steps for correct rounding computation

1. compute v = 1.v1 . . .vk−2 such that −2−p ≤ `− v < 0

→ implied by |(`+2−p−1)− v |< 2−p−1

→ bivariate polynomial evaluation

2. compute u as the truncation of v after p fraction bits

3. determine RN(`) after possibly adding 2−p

How to compute the one-sided approximation v and then deduce RN(`)?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 14/45

Page 40: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Correct rounding via truncated one-sided approximation

How to compute RN(`), with ` = 21−c ·mx/my ?

Three steps for correct rounding computation

1. compute v = 1.v1 . . .vk−2 such that −2−p ≤ `− v < 0

→ implied by |(`+2−p−1)− v |< 2−p−1

→ bivariate polynomial evaluation

2. compute u as the truncation of v after p fraction bits

3. determine RN(`) after possibly adding 2−p

How to compute the one-sided approximation v and then deduce RN(`)?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 14/45

Page 41: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

One-sided approximation via bivariate polynomials1. Consider `+2−p−1 as the exact result of the function

F(s, t) = s/(1+ t)+2−p−1

at the points s∗ = 21−c ·mx and t∗ = my −1

2. Approximate F(s, t) by a bivariate polynomial P(s, t)

P(s, t) = s ·a(t)+2−p−1

→ a(t): univariate polynomial approximant of 1/(1+ t)

→ approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P

v = P (s∗, t∗)→ evaluation error Eeval

How to ensure that |(`+2−p−1)− v |< 2−p−1 ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 15/45

Page 42: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

One-sided approximation via bivariate polynomials1. Consider `+2−p−1 as the exact result of the function

F(s, t) = s/(1+ t)+2−p−1

at the points s∗ = 21−c ·mx and t∗ = my −1

2. Approximate F(s, t) by a bivariate polynomial P(s, t)

P(s, t) = s ·a(t)+2−p−1

→ a(t): univariate polynomial approximant of 1/(1+ t)

→ approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P

v = P (s∗, t∗)→ evaluation error Eeval

How to ensure that |(`+2−p−1)− v |< 2−p−1 ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 15/45

Page 43: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

One-sided approximation via bivariate polynomials1. Consider `+2−p−1 as the exact result of the function

F(s, t) = s/(1+ t)+2−p−1

at the points s∗ = 21−c ·mx and t∗ = my −1

2. Approximate F(s, t) by a bivariate polynomial P(s, t)

P(s, t) = s ·a(t)+2−p−1

→ a(t): univariate polynomial approximant of 1/(1+ t)

→ approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P

v = P (s∗, t∗)→ evaluation error Eeval

How to ensure that |(`+2−p−1)− v |< 2−p−1 ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 15/45

Page 44: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

One-sided approximation via bivariate polynomials1. Consider `+2−p−1 as the exact result of the function

F(s, t) = s/(1+ t)+2−p−1

at the points s∗ = 21−c ·mx and t∗ = my −1

2. Approximate F(s, t) by a bivariate polynomial P(s, t)

P(s, t) = s ·a(t)+2−p−1

→ a(t): univariate polynomial approximant of 1/(1+ t)

→ approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P

v = P (s∗, t∗)→ evaluation error Eeval

How to ensure that |(`+2−p−1)− v |< 2−p−1 ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 15/45

Page 45: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Sufficient error bounds

To ensure |(`+2−p−1)− v |< 2−p−1

it suffices to ensure that µ ·Eapprox +Eeval < 2−p−1,

since

|(`+2−p−1)− v | ≤ µ ·Eapprox +Eeval with µ = 4−23−p

This gives the following sufficient conditions

Eapprox < 2−p−1/µ ⇒ Eeval < 2−p−1−µ ·Eapprox

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 16/45

Page 46: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Sufficient error bounds

To ensure |(`+2−p−1)− v |< 2−p−1

it suffices to ensure that µ ·Eapprox +Eeval < 2−p−1,

since

|(`+2−p−1)− v | ≤ µ ·Eapprox +Eeval with µ = 4−23−p

This gives the following sufficient conditions

Eapprox < 2−p−1/µ ⇒ Eeval < 2−p−1−µ ·Eapprox

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 16/45

Page 47: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Sufficient error bounds

To ensure |(`+2−p−1)− v |< 2−p−1

it suffices to ensure that µ ·Eapprox +Eeval < 2−p−1,

since

|(`+2−p−1)− v | ≤ µ ·Eapprox +Eeval with µ = 4−23−p

This gives the following sufficient conditions

Eapprox ≤ θ with θ < 2−p−1/µ ⇒ Eeval < η = 2−p−1−µ ·θ

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 16/45

Page 48: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Example for the binary32 division

Sufficient conditions with µ = 4−2−21

Eapprox ≤ θ with θ < 2−25/µ and Eeval < η = 2−25−µ ·θ

Approximation of 1/(1+ t) by a Remez-like polynomial of degree 10

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation error Required bound

I Eapprox ≤ θ,

with θ = 3 ·2−29 ≈ 6 ·10−9

I Eeval < η,

with η≈ 7.4 ·10−9

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 17/45

Page 49: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Example for the binary32 division

Sufficient conditions with µ = 4−2−21

Eapprox ≤ θ with θ < 2−25/µ and Eeval < η = 2−25−µ ·θ

Approximation of 1/(1+ t) by a Remez-like polynomial of degree 10

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation error Required bound

I Eapprox ≤ θ,

with θ = 3 ·2−29 ≈ 6 ·10−9

I Eeval < η,

with η≈ 7.4 ·10−9

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 17/45

Page 50: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Bivariate polynomial evaluation-based approach

Flowchart for generating efficient and certified C codes

Selection of effective parenthesizations

Eapprox ≤ θF(s,t) Eeval < η ST231 features

C code Certificate

Computation of polynomial approximant

??

Computation of low latency parenthesizations

ST231 features

??

v |(` + 2−p−1)− v| < 2−p−1

truncation

u

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 18/45

Page 51: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: definitionApproximation u of ` with

` = 21−c ·mx/my

The exact value ` may have an infinite number of bits→ the sticky bit cannot always be computed

midpointfloating-point

u u

` `

Compute RN(`) requires to be able to decide whether u ≥ `

→ ` cannot be a midpoint

Rounding condition: u ≥ `

u ≥ ` ⇐⇒ u ·my ≥ 21−c ·mx

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 19/45

Page 52: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: definitionApproximation u of ` with

` = 21−c ·mx/my

The exact value ` may have an infinite number of bits→ the sticky bit cannot always be computed

midpointfloating-point

u u

` `

Compute RN(`) requires to be able to decide whether u ≥ `

→ ` cannot be a midpoint

Rounding condition: u ≥ `

u ≥ ` ⇐⇒ u ·my ≥ 21−c ·mx

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 19/45

Page 53: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: definitionApproximation u of ` with

` = 21−c ·mx/my

The exact value ` may have an infinite number of bits→ the sticky bit cannot always be computed

midpointfloating-point

u u

` `

Compute RN(`) requires to be able to decide whether u ≥ `

→ ` cannot be a midpoint

Rounding condition: u ≥ `

u ≥ ` ⇐⇒ u ·my ≥ 21−c ·mx

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 19/45

Page 54: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: implementation in integer arithmetic

Rounding condition: u ·my ≥ 21−c ·mx

Approximation u and my : representable with 32 bits

u

my×

u · my

I u ·my is exactly representable with 64 bits

I 21−c ·mx is representable with 32 bits since c ∈ {0,1}

⇒ one 32×32→ 32-bit multiplication and one comparison

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45

Page 55: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: implementation in integer arithmetic

Rounding condition: u ·my ≥ 21−c ·mx

Approximation u and my : representable with 32 bits

u

my×

u · my

21−c · mx

I u ·my is exactly representable with 64 bitsI 21−c ·mx is representable with 32 bits since c ∈ {0,1}

⇒ one 32×32→ 32-bit multiplication and one comparison

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45

Page 56: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: implementation in integer arithmetic

Rounding condition: u ·my ≥ 21−c ·mx

Approximation u and my : representable with 32 bits

u

my×

u ·my

21−c ·mx

I u ·my is exactly representable with 64 bitsI 21−c ·mx is representable with 32 bits since c ∈ {0,1}

⇒ one 32×32→ 32-bit multiplication and one comparison

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45

Page 57: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Flowchart for generating efficient and certified C codes

Selection of effective parenthesizations

Eapprox ≤ θF(s,t) Eeval < η ST231 features

C code Certificate

Computation of polynomial approximant

??

Computation of low latency parenthesizations

ST231 features

??

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 21/45

Page 58: Implementation of binary floating-point arithmetic on

Design and implementation of floating-point operators Implementation of correct rounding

Flowchart for generating efficient and certified C codes

Computation of low latency parenthesizations

Selection of effective parenthesizations

a(t)

Eapprox ≤ θF(s,t) Eeval < η ST231 features

C code Certificate

Computation of polynomial approximant

??

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 21/45

Page 59: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation

Outline of the talk

1. Design and implementation of floating-point operators

2. Low latency parenthesization computationClassical evaluation methodsComputation of all parenthesizationsTowards low evaluation latency

3. Selection of effective evaluation parenthesizations

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 22/45

Page 60: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)

→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation

→ dominates the cost

Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and

Stockmeyer (1964), ...→ ill-suited in the context of fixed-point arithmetic

I algorithms without coefficient adaptation

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 23/45

Page 61: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)

→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation

→ dominates the cost

Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and

Stockmeyer (1964), ...→ ill-suited in the context of fixed-point arithmetic

I algorithms without coefficient adaptation

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 23/45

Page 62: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)

→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation

→ dominates the cost

Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and

Stockmeyer (1964), ...→ ill-suited in the context of fixed-point arithmetic

I algorithms without coefficient adaptation

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 23/45

Page 63: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)

→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation

→ dominates the cost

Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and

Stockmeyer (1964), ...→ ill-suited in the context of fixed-point arithmetic

I algorithms without coefficient adaptation

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 23/45

Page 64: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

Horner’s rule: (3+1)×11 = 44 cycles

→ no ILP exposure

Second-order Horner’s rule: 27 cycles

→ evaluation of odd and even parts independently with Horner, more ILP

Estrin’s method: 19 cycles

→ evaluation of high and low parts in parallel, even more ILP

→ distributing the multiplication by s in the evaluation of a(t)→ 16 cycles

... We can do better.

How to explore the solution space of parenthesizations?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 24/45

Page 65: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

Horner’s rule: (3+1)×11 = 44 cycles

→ no ILP exposure

Second-order Horner’s rule: 27 cycles

→ evaluation of odd and even parts independently with Horner, more ILP

Estrin’s method: 19 cycles

→ evaluation of high and low parts in parallel, even more ILP

→ distributing the multiplication by s in the evaluation of a(t)→ 16 cycles

... We can do better.

How to explore the solution space of parenthesizations?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 24/45

Page 66: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

Horner’s rule: (3+1)×11 = 44 cycles

→ no ILP exposure

Second-order Horner’s rule: 27 cycles

→ evaluation of odd and even parts independently with Horner, more ILP

Estrin’s method: 19 cycles

→ evaluation of high and low parts in parallel, even more ILP

→ distributing the multiplication by s in the evaluation of a(t)→ 16 cycles

... We can do better.

How to explore the solution space of parenthesizations?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 24/45

Page 67: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

Horner’s rule: (3+1)×11 = 44 cycles

→ no ILP exposure

Second-order Horner’s rule: 27 cycles

→ evaluation of odd and even parts independently with Horner, more ILP

Estrin’s method: 19 cycles

→ evaluation of high and low parts in parallel, even more ILP

→ distributing the multiplication by s in the evaluation of a(t)→ 16 cycles

... We can do better.

How to explore the solution space of parenthesizations?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 24/45

Page 68: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Algorithm for computing all parenthesizations

a(x ,y) = ∑0≤i≤nx

∑0≤j≤ny

ai ,j · x i · y j with n = nx +ny , and anx ,ny 6= 0

ExampleLet a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y . Then

a1,0 +a1,1 · y is a valid expression, while a1,0 · x +a1,1 · x is not.

Exhaustive algorithm: iterative process→ step k = computation of all the valid expressions of total degree k

3 building rules for computing all parenthesizations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 25/45

Page 69: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Algorithm for computing all parenthesizations

a(x ,y) = ∑0≤i≤nx

∑0≤j≤ny

ai ,j · x i · y j with n = nx +ny , and anx ,ny 6= 0

ExampleLet a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y . Then

a1,0 +a1,1 · y is a valid expression, while a1,0 · x +a1,1 · x is not.

Exhaustive algorithm: iterative process→ step k = computation of all the valid expressions of total degree k

3 building rules for computing all parenthesizations

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 25/45

Page 70: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm

E(k): valid expressions of total degree k

P(k): powers x iy j of total degree k = i + j

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 26/45

Page 71: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm

E(k): valid expressions of total degree k

P(k): powers x iy j of total degree k = i + j

Rule R1 for building the powers

p1 p2

p

deg(p) = deg(p1) + deg(p2)

deg(p1) ≤ bk/2c dk/2e ≤ deg(p2) < k

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 26/45

Page 72: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm

E(k): valid expressions of total degree k

P(k): powers x iy j of total degree k = i + j

Rule R2 for expressions by multiplications

e′ p

deg(p) ≤ kdeg(e′) < k

deg(e) = deg(e′) + deg(p)

e

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 26/45

Page 73: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm

E(k): valid expressions of total degree k

P(k): powers x iy j of total degree k = i + j

Rule R3 for expressions by additions

e2e1

deg(e1) = k deg(e2) ≤ k

deg(e) = max ( deg(e1), deg(e2))

e

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 26/45

Page 74: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Number of parenthesizations

nx = 1 nx = 2 nx = 3 nx = 4 nx = 5 nx = 6

ny = 0 1 7 163 11602 2334244 1304066578

ny = 1 51 67467 1133220387 207905478247998 · · · · · ·ny = 2 67467 106191222651 10139277122276921118 · · · · · · · · ·

Number of generated parenthesizations for evaluating a bivariate polynomial

Timings for parenthesization computation→ for univariate polynomial of degree 5 ≈ 1h on a 2.4 GHz core

→ for bivariate polynomial of degree (2,1) ≈ 30s

→ for P(s, t) of degree (3,1) ≈ 7s (88384 schemes)

Optimization for univariate polynomial and P(s, t)→ univariate polynomial of degree 5 ≈ 4min

→ for P(s, t) of degree (3,1) ≈ 2s (88384 schemes)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 27/45

Page 75: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Number of parenthesizations

10

100

1000

10000

100000

1e+06

10 12 14 16 18 20

Num

ber

ofde

gree

-5pa

rent

hesi

zati

ons

Latency on unbounded parallelism (# cycles)

→ minimal latency for univariate polynomial of degree 5: 10 cycles(36 schemes)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 27/45

Page 76: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Computation of all parenthesizations

Number of parenthesizations

10

100

1000

10000

100000

1e+06

10 12 14 16 18 20

Num

ber

ofde

gree

-5pa

rent

hesi

zati

ons

Latency on unbounded parallelism (# cycles)

→ minimal latency for univariate polynomial of degree 5: 10 cycles(36 schemes)

How to compute only parenthesizations of low latency?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 27/45

Page 77: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency = minimal cost for evaluating

a0,0 +anx ,ny · xnx yny

I if no scheme satisfies τ then increase τ and restart

Static target latency τstatic

I as general as evaluating a0,0 + xnx +ny +1

τstatic = A+M×dlog2(nx +ny +1)e

Dynamic target latency τdynamic

I cost of operator on anx ,ny and delay on intederminatesI dynamic programming

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 28/45

Page 78: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency = minimal cost for evaluating

a0,0 +anx ,ny · xnx yny

I if no scheme satisfies τ then increase τ and restart

Static target latency τstatic

I as general as evaluating a0,0 + xnx +ny +1

τstatic = A+M×dlog2(nx +ny +1)e

Dynamic target latency τdynamic

I cost of operator on anx ,ny and delay on intederminatesI dynamic programming

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 28/45

Page 79: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency = minimal cost for evaluating

a0,0 +anx ,ny · xnx yny

I if no scheme satisfies τ then increase τ and restart

Static target latency τstatic

I as general as evaluating a0,0 + xnx +ny +1

τstatic = A+M×dlog2(nx +ny +1)e

Dynamic target latency τdynamic

I cost of operator on anx ,ny and delay on intederminatesI dynamic programming

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 28/45

Page 80: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency = minimal cost for evaluating

a0,0 +anx ,ny · xnx yny

I if no scheme satisfies τ then increase τ and restart

Example

Degree-9 bivariate polynomial: nx = 8 and ny = 1

Latencies: A = 1 and M = 3

Delay: y available 9 cycles later than x

τstatic τdynamic

1+3×dlog2(10)e = 13 cycles 16 cycles

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 28/45

Page 81: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 82: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

(a0,0 +a1,0 · x +a0,1 · y

)+(

a1,1 · x · y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 83: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

((a0,0 +a1,0 · x

)+a0,1 · y

)+(

a1,1 · x · y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 84: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

(a0,0 +

(a1,0 · x +a0,1 · y

))+(

a1,1 · x · y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 85: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

(a0,0 +a1,0 · x

)+(

a0,1 · y +a1,1 · x · y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 86: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

a0,0 +(

a1,0 · x +a0,1 · y +a1,1 · x · y)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 87: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizationsExampleLet a(x ,y) be a degree-2 bivariate polynomial

a(x ,y) = a0,0 +a1,0 · x +a0,1 · y +a1,1 · x · y .

⇒ find a best splitting of the polynomial→ low latency

Level 1

a(x, y)

a0,0

a′′(x, y)if support ≤max

exhaustive search

a(x, y)

a′′(x, y)

a′(x, y)

keep

Level 2

if support > max

a′(x, y) a′(x, y)

a′2(x, y)a′1(x, y)a′2(x, y)

a′1(x, y)

a(x, y)

a′(x, y)

anx,nyxnxyny

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 29/45

Page 88: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Efficient evaluation parenthesization generation

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

First target latency τ = 13→ no parenthesization found

Second target latency τ = 14→ obtained in about 10 sec.

Classical methodsI Horner: 44 cycles,I Estrin: 19 cycles,I Estrin by distributing s: 16 cycles

131211109876543210

s

a0

a1

s

t t

a2

t a3

a4

t a5

a6

t a7

a8

t a9

a10

14

t

v

2−25

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 30/45

Page 89: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Efficient evaluation parenthesization generation

P(s, t) = 2−25 + s · ∑0≤i≤10

ai · t i

First target latency τ = 13→ no parenthesization found

Second target latency τ = 14→ obtained in about 10 sec.

Classical methodsI Horner: 44 cycles,I Estrin: 19 cycles,I Estrin by distributing s: 16 cycles

131211109876543210

s

a0

a1

s

t t

a2

t a3

a4

t a5

a6

t a7

a8

t a9

a10

14

t

v

2−25

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 30/45

Page 90: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Flowchart for generating efficient and certified C codes

Computation of low latency parenthesizations

Selection of effective parenthesizations

a(t)

Eapprox ≤ θF(s,t) Eeval < η ST231 features

C code Certificate

Computation of polynomial approximant

??

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 31/45

Page 91: Implementation of binary floating-point arithmetic on

Low latency parenthesization computation Towards low evaluation latency

Flowchart for generating efficient and certified C codes

Computation of low latency parenthesizations

Selection of effective parenthesizations

a(t)

Eapprox ≤ θF(s,t) Eeval < η ST231 features

C code Certificate

Computation of polynomial approximant

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 31/45

Page 92: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations

Outline of the talk

1. Design and implementation of floating-point operators

2. Low latency parenthesization computation

3. Selection of effective evaluation parenthesizationsGeneral frameworkAutomatic certification of generated C codes

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 32/45

Page 93: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations General framework

Selection of effective parenthesizations

1. Arithmetic Operator Choice

I all intermediate variables are of constant sign

2. Scheduling on a simplified model of the ST231

I constraints of architecture: cost of operators, instructions bundling, ...I delays on indeterminates

3. Certification of generated C code

I straightline polynomial evaluation programI “certified C code”: we can bound the evaluation error in integer arithmetic

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 33/45

Page 94: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations General framework

Selection of effective parenthesizations

1. Arithmetic Operator Choice

I all intermediate variables are of constant sign

2. Scheduling on a simplified model of the ST231

I constraints of architecture: cost of operators, instructions bundling, ...I delays on indeterminates

3. Certification of generated C code

I straightline polynomial evaluation programI “certified C code”: we can bound the evaluation error in integer arithmetic

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 33/45

Page 95: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations General framework

Selection of effective parenthesizations

1. Arithmetic Operator Choice

I all intermediate variables are of constant sign

2. Scheduling on a simplified model of the ST231

I constraints of architecture: cost of operators, instructions bundling, ...I delays on indeterminates

3. Certification of generated C code

I straightline polynomial evaluation programI “certified C code”: we can bound the evaluation error in integer arithmetic

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 33/45

Page 96: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Sufficient conditions with µ = 4−2−21

Eapprox ≤ θ with θ < 2−25/µ and Eeval < η = 2−25−µ ·θ

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation error Required bound

I Eapprox ≤ θ,

with θ = 3 ·2−29 ≈ 6 ·10−9

I Eeval < η,

with η≈ 7.4 ·10−9

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 34/45

Page 97: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Case 1: mx ≥my → condition satisfiedCase 2: mx < my → condition not satisfied: Eeval ≥ η

s∗ = 3.935581684112548828125 and t∗ = 0.97490441799163818359375

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0.965 0.97 0.975 0.98 0.985 0.99 0.995

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation errorRequired bound 2−25/(4− 2−21) ≈ 8 · 10−9

Approximation error bound θ = 3 · 2−29 ≈ 6 · 10−9

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 35/45

Page 98: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Case 1: mx ≥my → condition satisfiedCase 2: mx < my → condition not satisfied: Eeval ≥ η

s∗ = 3.935581684112548828125 and t∗ = 0.97490441799163818359375

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0.965 0.97 0.975 0.98 0.985 0.99 0.995

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation errorRequired bound 2−25/(4− 2−21) ≈ 8 · 10−9

Approximation error bound θ = 3 · 2−29 ≈ 6 · 10−9

1. determine an interval I around this point

2. compute Eapprox over I3. determine an evaluation error bound η

4. check if Eeval < η ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 35/45

Page 99: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Case 1: mx ≥my → condition satisfiedCase 2: mx < my → condition not satisfied: Eeval ≥ η

s∗ = 3.935581684112548828125 and t∗ = 0.97490441799163818359375

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0.965 0.97 0.975 0.98 0.985 0.99 0.995

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation errorRequired bound 2−25/(4− 2−21) ≈ 8 · 10−9

Approximation error bound θ = 3 · 2−29 ≈ 6 · 10−9

1. determine an interval I around this point

2. compute Eapprox over I3. determine an evaluation error bound η

4. check if Eeval < η ?

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 35/45

Page 100: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Sufficient conditions for each subinterval, with µ = 4−2−21

E(i)approx ≤ θ

(i) with θ(i) < 2−25/µ and E(i)

eval < η(i) = 2−25−µ ·θ(i)

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation error Required bound

E(i)approx ≤ θ(i)

E(i)eval < η(i)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 36/45

Page 101: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of evaluation error for binary32 division

Sufficient conditions for each subinterval, with µ = 4−2−21

E(i)approx ≤ θ

(i) with θ(i) < 2−25/µ and E(i)

eval < η(i) = 2−25−µ ·θ(i)

-8e-09

-6e-09

-4e-09

-2e-09

0

2e-09

4e-09

6e-09

8e-09

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Abs

olut

eap

prox

imat

ion

erro

r

t

Approximation error Required bound

I E(i)approx ≤ θ(i)

I E(i)eval < η(i)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 36/45

Page 102: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy

I for each T (i)

1. compute a certified approximation error bound θ(i)

Sollya

2. determine an evaluation error bound η(i)

Sollya

3. check this bound: E(i)eval < η(i)

Gappa

⇒ if this bound is not satisfied, T (i) is split up into 2 subintervals

Example of binary32 implementation

→ launched on a 64 processor grid

→ 36127 subintervals found in several hours (≈ 5h.)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 37/45

Page 103: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy

I for each T (i)

1. compute a certified approximation error bound θ(i)

Sollya

2. determine an evaluation error bound η(i)

Sollya

3. check this bound: E(i)eval < η(i)

Gappa

⇒ if this bound is not satisfied, T (i) is split up into 2 subintervals

Example of binary32 implementation

→ launched on a 64 processor grid

→ 36127 subintervals found in several hours (≈ 5h.)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 37/45

Page 104: Implementation of binary floating-point arithmetic on

Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy

I for each T (i)

1. compute a certified approximation error bound θ(i)

Sollya

2. determine an evaluation error bound η(i)

Sollya

3. check this bound: E(i)eval < η(i)

Gappa

⇒ if this bound is not satisfied, T (i) is split up into 2 subintervals

Example of binary32 implementation

→ launched on a 64 processor grid

→ 36127 subintervals found in several hours (≈ 5h.)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 37/45

Page 105: Implementation of binary floating-point arithmetic on

Numerical results

Outline of the talk

1. Design and implementation of floating-point operators

2. Low latency parenthesization computation

3. Selection of effective evaluation parenthesizations

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 38/45

Page 106: Implementation of binary floating-point arithmetic on

Numerical results

Performances of FLIP on ST231

0

20

40

60

80

100

120

140

160

180

addsub

muldiv

sqrt

Lat

ency

(#cy

cles

)

Floating-point operators

FLIP 1.0FLIP 0.3

STlib

0

20

40

60

80

100

addsub

muldiv

sqrt

Spee

d-up

(%)

Floating-point operators

FLIP 1.0 vs STlibFLIP 1.0 vs FLIP 0.3

Performances on ST231, in RoundTiesToEven

⇒ Speed-up between 20 and 50 %

Implementations of other operators

x−1 x−1/2 x1/3 x−1/3 x−1/4

25 29 34 40 42

Performances on ST231, in RoundTiesToEven (in number of cycles)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 39/45

Page 107: Implementation of binary floating-point arithmetic on

Numerical results

Performances of FLIP on ST231

0

20

40

60

80

100

120

140

160

180

addsub

muldiv

sqrt

Lat

ency

(#cy

cles

)

Floating-point operators

FLIP 1.0FLIP 0.3

STlib

0

20

40

60

80

100

addsub

muldiv

sqrt

Spee

d-up

(%)

Floating-point operators

FLIP 1.0 vs STlibFLIP 1.0 vs FLIP 0.3

Performances on ST231, in RoundTiesToEven

⇒ Speed-up between 20 and 50 %

Implementations of other operators

x−1 x−1/2 x1/3 x−1/3 x−1/4

25 29 34 40 42

Performances on ST231, in RoundTiesToEven (in number of cycles)

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 39/45

Page 108: Implementation of binary floating-point arithmetic on

Numerical results

Impact of dynamic target latency

x1/3 x−1/3

Degree (nx ,ny ) (8,1) (9,1)

Delay on the operand s (# cycles) 9 9

Static target latency 13 13

Dynamic target latency 16 16

Latency on unbounded parallelism and on ST231 16 16

Latency (# cycles) on unbounded parallelism and on ST231

=⇒ Conclude on the optimality in terms of polynomial evaluation latency

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 40/45

Page 109: Implementation of binary floating-point arithmetic on

Numerical results

Impact of dynamic target latency

x1/3 x−1/3

Degree (nx ,ny ) (8,1) (9,1)

Delay on the operand s (# cycles) 9 9

Static target latency 13 13

Dynamic target latency 16 16

Latency on unbounded parallelism and on ST231 16 16

Latency (# cycles) on unbounded parallelism and on ST231

=⇒ Conclude on the optimality in terms of polynomial evaluation latency

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 40/45

Page 110: Implementation of binary floating-point arithmetic on

Numerical results

Timings for code generationx1/2 x−1/2 x1/3 x−1/3 x−1

Degree (nx ,ny ) (8,1) (9,1) (8,1) (9,1) (10,0)

Static target latency 13 13 13 13 13

Dynamic target latency 13 13 16 16 13

Latency on unbounded parallelism 13 13 16 16 13

Latency on ST231 13 14 16 16 13

Parenthesization generation 172ms 152ms 53s 56s 168ms

Arithmetic Operator Choice 6ms 6ms 7ms 11ms 4ms

Scheduling 29s 4m21s 32ms 132ms 7s

Certification (Gappa) 6s 4s 1m38s 1m07s 11s

Total time (≈) 35s 4m25s 2m31s 2m03s 18s

Timing of each step of the generation flow

Impact of the target latency on the first step of the generation

What may dominate the cost

→ scheduling algorithm

→ certification using Gappa

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 41/45

Page 111: Implementation of binary floating-point arithmetic on

Numerical results

Timings for code generationx1/2 x−1/2 x1/3 x−1/3 x−1

Degree (nx ,ny ) (8,1) (9,1) (8,1) (9,1) (10,0)

Static target latency 13 13 13 13 13

Dynamic target latency 13 13 16 16 13

Latency on unbounded parallelism 13 13 16 16 13

Latency on ST231 13 14 16 16 13

Parenthesization generation 172ms 152ms 53s 56s 168ms

Arithmetic Operator Choice 6ms 6ms 7ms 11ms 4ms

Scheduling 29s 4m21s 32ms 132ms 7s

Certification (Gappa) 6s 4s 1m38s 1m07s 11s

Total time (≈) 35s 4m25s 2m31s 2m03s 18s

Timing of each step of the generation flow

Impact of the target latency on the first step of the generation

What may dominate the cost

→ scheduling algorithm

→ certification using Gappa

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 41/45

Page 112: Implementation of binary floating-point arithmetic on

Numerical results

Timings for code generationx1/2 x−1/2 x1/3 x−1/3 x−1

Degree (nx ,ny ) (8,1) (9,1) (8,1) (9,1) (10,0)

Static target latency 13 13 13 13 13

Dynamic target latency 13 13 16 16 13

Latency on unbounded parallelism 13 13 16 16 13

Latency on ST231 13 14 16 16 13

Parenthesization generation 172ms 152ms 53s 56s 168ms

Arithmetic Operator Choice 6ms 6ms 7ms 11ms 4ms

Scheduling 29s 4m21s 32ms 132ms 7s

Certification (Gappa) 6s 4s 1m38s 1m07s 11s

Total time (≈) 35s 4m25s 2m31s 2m03s 18s

Timing of each step of the generation flow

Impact of the target latency on the first step of the generation

What may dominate the cost

→ scheduling algorithm

→ certification using Gappa

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 41/45

Page 113: Implementation of binary floating-point arithmetic on

Conclusions and perspectives

Outline of the talk

1. Design and implementation of floating-point operators

2. Low latency parenthesization computation

3. Selection of effective evaluation parenthesizations

4. Numerical results

5. Conclusions and perspectives

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 42/45

Page 114: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Conclusions

Conclusions

Design and implementation of floating-point operatorsI uniform approach for correctly-rounded roots and their reciprocalsI extension to correctly-rounded division

I polynomial evaluation-based method, very high ILP exposure

⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluationI methodologies and tools for automating polynomial evaluation

implementationI heuristics and techniques for generating quickly efficient and certified C

codes

⇒ CGPE: allows to write and certify automatically ≈ 50 % of the codes of FLIP

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 43/45

Page 115: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Conclusions

Conclusions

Design and implementation of floating-point operatorsI uniform approach for correctly-rounded roots and their reciprocalsI extension to correctly-rounded division

I polynomial evaluation-based method, very high ILP exposure

⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluationI methodologies and tools for automating polynomial evaluation

implementationI heuristics and techniques for generating quickly efficient and certified C

codes

⇒ CGPE: allows to write and certify automatically ≈ 50 % of the codes of FLIP

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 43/45

Page 116: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Conclusions

Conclusions

Design and implementation of floating-point operatorsI uniform approach for correctly-rounded roots and their reciprocalsI extension to correctly-rounded division

I polynomial evaluation-based method, very high ILP exposure

⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluationI methodologies and tools for automating polynomial evaluation

implementationI heuristics and techniques for generating quickly efficient and certified C

codes

⇒ CGPE: allows to write and certify automatically ≈ 50 % of the codes of FLIP

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 43/45

Page 117: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Perspectives

Perspectives

Faithful implementation of floating-point operators

→ other floating-point operators:• log2(1+ x) over [0.5,1), 1/

√1+ x2 over [0,0.5), ...

→ roots and their reciprocals: rounding condition decision not automated yet

Extension to other binary floating-point formats

→ square root in binary64: 171 cycles on ST231, 396 cycles with STlib

Extension to other architectures, typically FPGAs

→ polynomial evaluation-based approach: already seems to be a goodalternative to multiplicative methods on FPGAs

→ the other techniques introduced of this thesis: should be investigated further

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 44/45

Page 118: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Perspectives

Perspectives

Faithful implementation of floating-point operators

→ other floating-point operators:• log2(1+ x) over [0.5,1), 1/

√1+ x2 over [0,0.5), ...

→ roots and their reciprocals: rounding condition decision not automated yet

Extension to other binary floating-point formats

→ square root in binary64: 171 cycles on ST231, 396 cycles with STlib

Extension to other architectures, typically FPGAs

→ polynomial evaluation-based approach: already seems to be a goodalternative to multiplicative methods on FPGAs

→ the other techniques introduced of this thesis: should be investigated further

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 44/45

Page 119: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Perspectives

Perspectives

Faithful implementation of floating-point operators

→ other floating-point operators:• log2(1+ x) over [0.5,1), 1/

√1+ x2 over [0,0.5), ...

→ roots and their reciprocals: rounding condition decision not automated yet

Extension to other binary floating-point formats

→ square root in binary64: 171 cycles on ST231, 396 cycles with STlib

Extension to other architectures, typically FPGAs

→ polynomial evaluation-based approach: already seems to be a goodalternative to multiplicative methods on FPGAs

→ the other techniques introduced of this thesis: should be investigated further

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 44/45

Page 120: Implementation of binary floating-point arithmetic on

Conclusions and perspectives Perspectives

Implementation of binary floating-point arithmeticon embedded integer processors

Polynomial evaluation-based algorithmsand

certified code generation

Guillaume RevyAdvisors: Claude-Pierre Jeannerod and Gilles Villard

Arenaire INRIA project-team (LIP, Ens Lyon) Universite de Lyon CNRS

Ph.D. Defense – December 1st, 2009

Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 45/45