
HAL Id: lirmm-00835508
https://hal-lirmm.ccsd.cnrs.fr/lirmm-00835508

Submitted on 25 Nov 2015

Computing Time for Summation Algorithm: Less Hazard and More Scientific Research

Bernard Goossens, Philippe Langlois, David Parello, Kathy Porada

To cite this version:
Bernard Goossens, Philippe Langlois, David Parello, Kathy Porada. Computing Time for Summation Algorithm: Less Hazard and More Scientific Research. Numerical Software: Design, Analysis and Verification, Jul 2012, Santander, Spain. ⟨lirmm-00835508⟩

Numerical Software: Design, Analysis and Verification

IFIP WG2.5, Santander, July 4–6 2012

Computing Time of Summation Algorithms:

Less Hazard and More Scientific Research

Bernard Goossens, Philippe Langlois, David Parello, Kathy Porada

University of Perpignan Via Domitia, DALI,

University Montpellier 2, LIRMM,

CNRS UMR 5506, France


1 Why measure summation algorithm performance?

2 How to measure summation algorithm performance?

3 ILP and the PerPI Tool

4 Experiments with recent summation algorithms

5 Conclusion


How to manage accuracy and speed?

A new “better” algorithm every year since 1999

1965 Møller, Ross

1969 Babuska, Knuth

1970 Nickel

1971 Dekker, Malcolm

1972 Kahan, Pichat

1974 Neumaier

1975 Kulisch/Bohlender

1977 Bohlender, Mosteller/Tukey

1981 Linnainmaa

1982 Leuprecht/Oberaigner

1983 Jankowski/Smoktunowicz/Wozniakowski

1985 Jankowski/Wozniakowski

1987 Kahan

1991 Priest

1992 Clarkson, Priest

1993 Higham

1997 Shewchuk

1999 Anderson

2001 Hlavacs/Uberhuber

2002 Li et al. (XBLAS)

2003 Demmel/Hida, Nievergelt, Zielke/Drygalla

2005 Ogita/Rump/Oishi, Zhu/Yong/Zeng

2006 Zhu/Hayes

2008 Rump/Ogita/Oishi

2009 Rump, Zhu/Hayes

2010 Zhu/Hayes


Accuracy of the floating point summation

Precision

$u$ = arithmetic precision

$u = 2^{-53} \approx 10^{-16}$ for b64 in IEEE-754 (2008)

Accuracy for backward stable algorithms

Accuracy of the computed sum $\le (n-1) \times \mathrm{cond} \times u$

$\mathrm{cond}\left(\sum x_i\right) = \dfrac{\sum |x_i|}{\left|\sum x_i\right|}$

No significant digit is left in IEEE-b64 for large cond, i.e. $\mathrm{cond} > 10^{16}$

More accuracy . . .

More precision: double-double, quad-double, . . .

Compensated algorithms: Kahan (72), . . . , Sum2 (05), SumK (05)

Accuracy of the computed sum $\lesssim u + \mathrm{cond} \times u^{K}$

. . . but still depending on the conditioning
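The condition number and the first bound above can be evaluated on a small example. The following is a minimal sketch, assuming Python floats (IEEE-754 binary64) and exact rational arithmetic from the standard `fractions` module as the reference; the names `cond_sum`, `recursive_sum` and `recursive_sum_bound` are illustrative and do not come from the talk.

```python
from fractions import Fraction

U = 2.0 ** -53  # arithmetic precision u for IEEE-754 binary64


def cond_sum(xs):
    """Condition number sum|x_i| / |sum x_i|, evaluated exactly with rationals."""
    exact = [Fraction(x) for x in xs]
    total = sum(exact)
    if total == 0:
        raise ZeroDivisionError("cond is undefined when the exact sum is zero")
    return sum(abs(x) for x in exact) / abs(total)


def recursive_sum(xs):
    """Plain left-to-right summation (explicit loop, no compensation)."""
    s = 0.0
    for x in xs:
        s += x
    return s


def recursive_sum_bound(xs):
    """First-order bound (n - 1) * cond * u on the relative error."""
    return (len(xs) - 1) * float(cond_sum(xs)) * U


data = [1.0, 1e16, -1e16, 1.0]  # the exact sum is 2
exact = sum(Fraction(x) for x in data)
computed = recursive_sum(data)
rel_err = float(abs(Fraction(computed) - exact) / abs(exact))
print(float(cond_sum(data)), computed, rel_err, recursive_sum_bound(data))
```

With cond close to $10^{16}$ the computed sum keeps no correct digit, which is the situation the slide describes. The explicit loop is deliberate: recent Python versions apply compensation inside the built-in `sum` for floats, which would hide the effect.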


Skip over the conditioning

Distillation: iterate until faithful or exact rounding

Error-free transformation $[x] \to [x^{(1)}] \to \cdots \to [x^{*}]$ such that $\sum x_i = \sum x_i^{*}$ and $[x^{*}]$ provides the expected rounded value.

Kahan (87), . . . , Zhu-Hayes: iFastSum (SISC-09)

More space to keep everything

Long accumulator, hardware oriented: Malcolm (71), Kulisch (80)

Cut the summands: AccSum (SISC-08), FastAccSum (SISC-09)

Sum by fixed exponent: HybridSum (SISC-09), OnLineExact (TOMS-10)

From faithful to exact rounding

costly choice of the right side when close to breakpoints

e.g. $1 + 2^{-53} \pm 2^{-106}$


→ Run-time and memory efficiency are now the discriminating factors
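The distillation idea mentioned on this slide, a chain of error-free transformations that preserves the exact sum, can be illustrated with one pass of TwoSum over the vector. This is a minimal sketch assuming the VecSum scheme of Ogita, Rump and Oishi and Python binary64 floats; it is not the iFastSum algorithm itself, and the function names are illustrative.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: x = fl(a + b) and x + y = a + b exactly (barring overflow)."""
    x = a + b
    z = x - a
    y = (a - (x - z)) + (b - z)
    return x, y


def vec_sum(xs):
    """One distillation pass: same exact sum, rounding errors pushed to the front."""
    p = list(xs)
    for i in range(1, len(p)):
        p[i], p[i - 1] = two_sum(p[i], p[i - 1])
    return p


data = [1.0, 1e16, -1e16, 1.0]
out = vec_sum(data)
# The exact sum is unchanged by the transformation.
same_exact_sum = sum(map(Fraction, out)) == sum(map(Fraction, data))
print(out, same_exact_sum)
```

After the pass, the last entry holds the ordinary recursive sum and the other entries hold the rounding errors; a distillation algorithm repeats such passes until the vector yields the faithfully or exactly rounded result.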


Reliable and significant measure of the time complexity?

The classic way: count the number of flop

A usual problem: double the accuracy of a computed result

A usual answer for polynomial evaluation (degree n)

| Metric | Horner | CompHorner | DDHorner |
|---|---|---|---|
| Flop count | 2n | 22n + 5 | 28n + 4 |
| Flop count ratio | 1 | ≈ 11 | ≈ 14 |
| Measured #cycles ratio | 1 | 2.8 – 3.2 | 8.7 – 9.7 |

Flop count vs. run-time measures

Flop counts and measured run-times are not proportional

Run-time measure is a very difficult experimental process

Which one to trust?

How to trust non-reproducible experiment results?

Measures are mostly non-reproducible

The execution time of a binary program varies, even using the same data input and the same execution environment.

Why? Experimental uncertainty of the hardware performance counters

Spoiling events: background tasks, concurrent jobs, OS interrupts

Non-deterministic issues: instruction scheduler, branch predictor

External conditions: temperature of the room

Timing accuracy: no constant cycle period on modern processors (i7. . . )

Uncertainty increases as computer system complexity does

Architecture and micro-architecture issues: multicore, hybrid, speculation

Compiler options and their effects
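The run-to-run variation described on this slide can be observed even with a coarse timer. The following is a minimal sketch, assuming Python's wall-clock `time.perf_counter_ns` as a stand-in for the hardware cycle counters discussed here (it is not PAPI); `time_repeated` and `recursive_sum` are illustrative names.

```python
import statistics
import time


def recursive_sum(xs):
    """Plain left-to-right summation."""
    s = 0.0
    for x in xs:
        s += x
    return s


def time_repeated(func, arg, repeats=30):
    """Time the same call on the same data several times, in nanoseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func(arg)
        samples.append(time.perf_counter_ns() - start)
    return samples


data = [float(i) for i in range(100_000)]
samples = time_repeated(recursive_sum, data)
print("min", min(samples), "median", statistics.median(samples), "max", max(samples))
```

The spread between the minimum and the maximum generally differs from one execution of the script to the next, although the program and the data are identical.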


How to read the current literature?

Numerical results in S.M. Rump contributions (for summation)

26% for Sum2-SumK (SISC-05): 9 pages out of 34

20% for AccSum (SISC-08): 7 pages out of 35

20% for AccSumK-NearSum (SISC-08b): 6 pages out of 30

less than 3% for FastAccSum (SISC-09): 1 page out of 37

Lack of proof, or at least of reproducibility

Measuring the computing time of summation algorithms in a high-level language on today’s architectures is more of a hazard than scientific research. S.M. Rump (SISC, 2009)

. . . in the paper entitled Ultimately Fast Accurate Summation

Software and System Performance experts’ point of view

The limited Accuracy of Performance Counter Measurements

We caution performance analysts to be suspicious of cycle counts

. . . gathered with performance counters.

D. Zaparanuks, M. Jovic, M. Hauswirth (2009)

Can Hardware Performance Counters Produce Expected, Deterministic Results?

In practice counters that should be deterministic show variation from run to run on the x86_64 architecture. . . . it is difficult to determine known “good” reference counts for comparison.

V.M. Weaver, J. Dongarra (2010)

The picture is blurred: the computing chain is wobbling around

If we combine all the published speedups (accelerations) on the well

known public benchmarks since four decades, why don’t we observe

execution times approaching to zero? S. Touati (2009)



ILP and the performance potential of an algorithm

Instruction Level Parallelism (ILP) describes how many instructions of a program can potentially be executed simultaneously

Hennessy-Patterson’s ideal machine (H-P IM)

every instruction is executed one cycle after the last of the producers it depends on

no other constraint than the true instruction dependency (RAW)

Measure the #cycles and the #IPC running the code with the H-P IM

maximal exploitation of the program ILP

processor and ILP in practice: superscalar and out-of-order execution

ILP measures the potential of the algorithm performance


What is ILP?

A synthetic sample: e = (a+b) + (c+d)

x86 binary

...
i1: mov eax,DWP[ebp-16]
i2: mov edx,DWP[ebp-20]
i3: add edx,eax
i4: mov ebx,DWP[ebp-8]
i5: add ebx,DWP[ebp-12]
i6: add edx,ebx
...


Instruction and cycle counting


Cycle 0: i1 i2 i4

Cycle 1: i3 i5

Cycle 2: i6

# of instructions = 6, # of cycles = 3

ILP = # of instructions/# of cycles = 2
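The cycle assignment above follows mechanically from the read-after-write dependencies. The following is a minimal sketch of the ideal-machine schedule, assuming the dependency lists below, which were read off the register names of the six instructions (they are not given explicitly on the slide); `ideal_schedule` is an illustrative name and has nothing to do with the PerPI implementation.

```python
def ideal_schedule(deps):
    """Ideal machine: each instruction runs one cycle after its last producer.

    deps maps each instruction to the producers it reads from, in program order.
    """
    cycle = {}
    for instr, producers in deps.items():
        cycle[instr] = 1 + max((cycle[p] for p in producers), default=-1)
    return cycle


# RAW dependencies of e = (a+b) + (c+d), inferred from the registers used.
deps = {
    "i1": [],            # mov eax, [ebp-16]
    "i2": [],            # mov edx, [ebp-20]
    "i3": ["i1", "i2"],  # add edx, eax
    "i4": [],            # mov ebx, [ebp-8]
    "i5": ["i4"],        # add ebx, [ebp-12]
    "i6": ["i3", "i5"],  # add edx, ebx
}

cycles = ideal_schedule(deps)
n_cycles = max(cycles.values()) + 1
ilp = len(deps) / n_cycles
print(cycles, n_cycles, ilp)
```

Only true dependencies constrain the schedule here; a real processor adds limits on issue width, functional units and memory that this model ignores.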


ILP explains why compensated algorithms run fast

N. Louvet, PhD (07)

CompHorner: #C = 2n + 8, ILP ≈ 11

DDHorner: #C = 17n + 2, ILP ≈ 1.65

[Figure: data-flow graphs of CompHorner and DDHorner, panels (a), (b), (c), showing the splitter of x into x_hi and x_lo, the coefficients P[i], and the iterates r(i), c(i) for CompHorner and sh(i), sl(i) for DDHorner]
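The two ILP values can be checked against the flop counts quoted earlier in the talk (22n + 5 for CompHorner, 28n + 4 for DDHorner). This assumes that ILP is taken here as the ratio of the flop count to the ideal cycle count #C, which the slide does not state explicitly:

```latex
\[
\mathrm{ILP}_{\mathrm{CompHorner}} \approx \frac{22n + 5}{2n + 8}
  \xrightarrow[n \to \infty]{} 11,
\qquad
\mathrm{ILP}_{\mathrm{DDHorner}} \approx \frac{28n + 4}{17n + 2}
  \xrightarrow[n \to \infty]{} \frac{28}{17} \approx 1.65 .
\]
```

Under that assumption, CompHorner performs fewer flops than DDHorner and also exposes far more independent work per cycle, which is consistent with the measured cycle ratios.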

The PerPI Tool automates this ILP analysis

PerPI: a pintool to analyse and visualise the ILP of x86-coded algorithms

Pin (Intel) tool (http://www.pintool.org)

Outputs: ILP measure (#C, #I), IPC histogram, data-dependency graph

Input: x86_64 binary file

Developed and maintained by B. Goossens and D. Parello (DALI)






Some recent accurate and fast summation algorithms

Twice the precision

Sum2: Compensated with a VectSum that uses TwoSum

DDSum: Recursive sum + double-double arithmetic

Faithful or exact rounding

iFastSum: SumK with dynamic error control

AccSum and FastAccSum: Adaptive computational effort with respect to cond. Split the summands into chunks that sum exactly (width depends on n), then carefully sum the chunks. The chunk cutting line is fixed in AccSum and more dynamic in FastAccSum.

HybridSum and OnLineExactSum: Exponent extraction of the summands, careful accumulation in one (HS) or two (OLE) vectors of fixed and short length (2048 in IEEE-b64), then distillation of the short (very short for OLE) vector with iFastSum.
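Of the algorithms listed above, Sum2 is the simplest to write down. The following is a minimal sketch, assuming Knuth's TwoSum as the error-free transformation and Python binary64 floats; it follows the usual textbook formulation of compensated summation and is not taken from the code benchmarked in this talk.

```python
def two_sum(a, b):
    """Knuth's TwoSum: x = fl(a + b) and x + y = a + b exactly (barring overflow)."""
    x = a + b
    z = x - a
    y = (a - (x - z)) + (b - z)
    return x, y


def sum2(p):
    """Compensated summation: recursive sum plus the accumulated rounding errors."""
    s = p[0]
    sigma = 0.0
    for x in p[1:]:
        s, q = two_sum(s, x)  # q is the rounding error of this addition
        sigma += q
    return s + sigma


def recursive_sum(p):
    """Plain left-to-right summation, for comparison."""
    s = 0.0
    for x in p:
        s += x
    return s


data = [1.0, 1e100, 1.0, -1e100]  # the exact sum is 2
print(recursive_sum(data), sum2(data))
```

On this input the plain sum loses both small terms, while the compensated version recovers them. The rounding errors `q` do not feed back into `s`, so the extra operations are largely independent of the main accumulation chain.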


How to choose the test data?

Time complexity parameters of the summation algorithms

Only n for Sum2, SumK: constant accuracy improvement

n and cond for AccSum, iFastSum: adaptive accuracy improvement

Exponent range of the summands: Zhu and Hayes report no influence on HybridSum and OnLineExactSum for large n

Rump’s generator of arbitrarily ill-conditioned dot products (SISC-05), modified for summation and to cover an arbitrary exponent range.

Length: $n \in [10^{3}, 10^{7}]$ and $\mathrm{cond} \in [10^{8}, 10^{40}] = [\sqrt{1/u},\ 1/u^{2.5}]$

Our fuzzy PAPI picture

Parameters: sum length from $10^{3}$ to $10^{6}$, cond from $10^{8}$ to $10^{40}$

| cond | Sum | Sum2 | FastAccSum | iFastSum | HybridSum | OnLineExact |
|---|---|---|---|---|---|---|
| $10^{8}$ | 1 | 2–3 | 4–5 | 7–8 | 5 ($n > 10^{5}$) | 4 ($n > 10^{5}$) |
| $10^{16}$ | 1 | 2–3 | 5–6 | 7–8 | 5 ($n > 10^{5}$) | 4 ($n > 10^{5}$) |
| $10^{24}$ | 1 | – | 7 | 13 | 5 ($n > 10^{5}$) | 4 ($n > 10^{5}$) |
| $10^{32}$ | 1 | – | 8 | 18 | 5 ($n > 10^{5}$) | 4 ($n > 10^{5}$) |
| $10^{40}$ | 1 | – | 9 | 18+ | 5 ($n > 10^{5}$) | 4 ($n > 10^{5}$) |

Experimental process: PAPI, counter delay, hot caches, average over 50 samples for each n and cond. . . .

Intel(R) Core(TM) i7 CPU 870 2.93GHz, x86_64, GNU/Linux kernel 2.6.38-8-generic

– gcc (4.6) -std=c99 -march=corei7 -mfpmath=sse -O3 -funroll-all-loops

– icc (12.0.4 20110427) -std=c99 -O3 -mtune=corei7 -xSSE -axsse4.2 -funroll-all-loops

Focus inside OnLineExact

Low level choices are crucial

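cond = $10^{16}$

| n | gcc, 2sum | gcc, Fast2sum | icc, 2sum | icc, Fast2sum |
|---|---|---|---|---|
| $10^{3}$ | 13 | 11 | 13.7 | 9 |
| $10^{4}$ | 6.5 | 6.5 | 5.8 | 4 |
| $10^{5}$ | 5 | 5.5 | 3.8 | 3.3 |
| $10^{6}$ | 4.7 | 5.6 | 3.8 | 3.3 |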

Cycle ratios (vs. Sum) vary for different EFT and compilers
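The two error-free transformations compared in this table differ in cost and in precondition. The following is a minimal sketch, assuming that "2sum" denotes Knuth's TwoSum and "Fast2sum" Dekker's variant, written for Python binary64 floats; how the benchmarked code guarantees the ordering required by Fast2Sum is not shown on the slide and is not modelled here.

```python
def two_sum(a, b):
    """Knuth's TwoSum: 6 floating-point operations, no condition on a and b."""
    x = a + b
    z = x - a
    y = (a - (x - z)) + (b - z)
    return x, y


def fast_two_sum(a, b):
    """Dekker's Fast2Sum: 3 floating-point operations, requires |a| >= |b|."""
    x = a + b
    y = b - (x - a)
    return x, y


tiny = 2.0 ** -60
print(two_sum(tiny, 1.0))       # error term recovered in either order
print(fast_two_sum(1.0, tiny))  # precondition holds: error term recovered
print(fast_two_sum(tiny, 1.0))  # precondition violated: error term lost
```

Fast2Sum halves the operation count, but the caller must either know the relative magnitudes or test them, and such a test introduces a branch. This is one reason why the cheaper transformation is not always the faster one in the measurements.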


PerPI and reproducibility: one run is enough


PerPI: # cycle ratios for summation algorithms

Number of cycles: ratios vs. Sum

for cond = $10^{32}$ and n = 2048, $10^{4}$, $10^{5}$

PerPI: # cycle ratios for summation algorithms

Twice more accurate computed sum

No overhead: compensation is the right choice

Faithfully or exactly rounded computed sum

The newest, the potentially fastest but be cautious: sensitive in practice

FastAccSum (3n) not faster than AccSum (4n) [GLPP-Para10]

OnLineExact for large n, else iFastSum

PerPI highlights the control, e.g. iteration counters

Fewer #C in HybridSum than in Sum?

Sum is unrolled 8 times by gcc, but C forbids changing the evaluation order of the arithmetic expression

Every cycle of HybridSum has enough parallel work with different summands: 2 here

OnLineExact introduces a dependency between iterations: x[i] and x[i+1] may have the same exponent
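The constraint on the evaluation order matters because floating-point addition is not associative: splitting the sum over independent accumulators shortens the dependency chain but can change the result. The following is a minimal sketch in Python binary64 floats; `sum_in_order` and `sum_two_accumulators` are illustrative names, and the second function mimics a reordering that the compiler is not allowed to perform on its own.

```python
def sum_in_order(xs):
    """Single accumulator: one long dependency chain, order fixed by the source."""
    s = 0.0
    for x in xs:
        s += x
    return s


def sum_two_accumulators(xs):
    """Unrolled by 2 with independent accumulators: the additions are reordered."""
    s0 = s1 = 0.0
    for i in range(0, len(xs) - 1, 2):
        s0 += xs[i]
        s1 += xs[i + 1]
    if len(xs) % 2:
        s0 += xs[-1]  # leftover element when the length is odd
    return s0 + s1


data = [1e16, 1.0, -1e16, 1.0]  # the exact sum is 2
print(sum_in_order(data), sum_two_accumulators(data))
```

The two functions return different values on this input, so an unrolled Sum that keeps the source order still executes its additions one after the other.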


[Figure: HS (left) and OLE (right), unrolled (top) or not (bottom)]


Conclusion

Highly accurate algorithm −→ reliable performance evaluation

Flop count: not significant

Hardware counter based measure: uncertainty and no reproducibility

PerPI: a software platform to analyze and visualise ILP

Reliable: reproducibility both in time and location

Realistic: correlates with the measured results

Useful: a detailed picture of the intrinsic behavior of the algorithm

Optimisation tool: analyse the effect of some hardware constraints [GLPP-Para10]

Exploratory tool: gives us a taste of the behavior of our algorithms running on “tomorrow” processors

Conclusion

Computing time: More science? Less hazard?

No definitive answer

PerPI result is far from perfect

Not abstract enough: instruction set dependence, compiler choice

Good abstraction level? Assembler program or high level programming language?

Next step for f.p. summation: reproducibility to improve productivity

Web site with common and shared resources: tested + test + make file sources, data files and generators, real and abstract associated measures

Open and dynamic interaction: load your new algorithm, your new data, run them and let’s contribute

architectures? compilers?

suggestions and partners are welcome!

References

D. H. Bailey. Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputing Review, pages 54–55, Aug. 1991.

B. Goossens, P. Langlois, D. Parello, and E. Petit. PerPI: A tool to measure instruction level parallelism. In K. Jonasson, editor, Applied Parallel and Scientific Computing - 10th International Conference, PARA 2010, Reykjavík, Iceland, June 6-9, 2010, Revised Selected Papers, Part I, volume 7133 of Lecture Notes in Computer Science, pages 270–281. Springer, 2012.

N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.

J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2010.

T. Ogita, S. M. Rump, and S. Oishi. Accurate sum and dot product. SIAM J. Sci. Comput., 26(6):1955–1988, 2005.

S. M. Rump. Ultimately fast accurate summation. SIAM J. Sci. Comput., 31(5):3466–3502, 2009.

S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation – part I: Faithful rounding. SIAM J. Sci. Comput., 31(1):189–224, 2008.

V. Weaver and J. Dongarra. Can hardware performance counters produce expected, deterministic results? In 3rd Workshop on Functionality of Hardware Performance Monitoring, 2010.

D. Zaparanuks, M. Jovic, and M. Hauswirth. Accuracy of performance counter measurements. In ISPASS, pages 23–32. IEEE, 2009.

Y.-K. Zhu and W. B. Hayes. Correct rounding and hybrid approach to exact floating-point summation. SIAM J. Sci. Comput., 31(4):2981–3001, 2009.

Y.-K. Zhu and W. B. Hayes. Algorithm 908: Online exact summation of floating-point streams. ACM Transactions on Mathematical Software, 37(3):37:1–37:13, Sept. 2010.
