computing time for summation algorithm: less hazard and ... · numerical software: design, analysis...
TRANSCRIPT
HAL Id: lirmm-00835508https://hal-lirmm.ccsd.cnrs.fr/lirmm-00835508
Submitted on 25 Nov 2015
HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.
Computing Time for Summation Algorithm: LessHazard and More Scientific Research
Bernard Goossens, Philippe Langlois, David Parello, Kathy Porada
To cite this version:Bernard Goossens, Philippe Langlois, David Parello, Kathy Porada. Computing Time for SummationAlgorithm: Less Hazard and More Scientific Research. Numerical Sofware: Design, Analysis andVerification, Jul 2012, Santander, Spain. �lirmm-00835508�
Numerical Software: Design, Analysis and Verification
IFIP WG2.5, Santander, July 4–6 2012
Computing Time of Summation Algorithms:
Less Hazard and More Scientific Research
Bernard Goossens, Gregoire Langlois, David Parello, Kathy Porada
University of Perpignan Via Domitia, DALI,
University Montpellier 2, LIRMM,
CNRS UMR 5506, France
1 / 30
1 Why measure summation algorithm performance?
2 How to measure summation algorithm performance?
3 ILP and the PerPI Tool
4 Experiments with recent summation algorithms
5 Conclusion
2 / 30
How to manage accuracy and speed?
A new “better” algorithm every year since 1999
1965 Møller, Ross
1969 Babuska, Knuth
1970 Nickel
1971 Dekker, Malcolm
1972 Kahan, Pichat
1974 Neumaier
1975 Kulisch/Bohlender
1977 Bohlender, Mosteller/Tukey
1981 Linnaimaa
1982 Leuprecht/Oberaigner
1983 Jankowski/Semoktunowicz/-
Wozniakowski
1985 Jankowski/Wozniakowski
1987 Kahan
1991 Priest
1992 Clarkson, Priest
1993 Higham
1997 Shewchuk
1999 Anderson
2001 Hlavacs/Uberhuber
2002 Li et al. (XBLAS)
2003 Demmel/Hida, Nievergelt,
Zielke/Drygalla
2005 Ogita/Rump/Oishi,
Zhu/Yong/Zeng
2006 Zhu/Hayes
2008 Rump/Ogita/Oishi
2009 Rump, Zhu/Hayes
2010 Zhu/Hayes
3 / 30
Accuracy of the floating point summation
Precision
u= arithmetic precision
u = 2−53 ≈ 10−16 for b64 in IEEE-754 (2008)
Accuracy for backward stable algorithms
Accuracy of the computed sum ≤ (n − 1)× cond × u
cond(∑
xi ) =∑|xi |
|∑
xi |
No more significant digit in IEEE-b64 for large cond, i.e. > 1016
More accuracy . . .
More precision: double-double, quad-double, . . .
Compensated algorithms: Kahan(72), . . . , Sum2(05), SumK(05)
Accuracy of the computed sum . u + cond × uK
. . . but still depending on the conditioning
4 / 30
Skip over the conditioning
Distillation: iterate until faithful or exact rounding
Error free transformation of [x ]→ [x (1)]→ · · · → [x∗] such that∑xi =
∑x∗i and [x∗] provides the expected rounded value.
Kahan (87), . . . , Zhu-Hayes: iFastSum (SISC-09)
More space to keep everything
Long accumulator, hardware oriented: Malcolm (71), Kulish (80)
Cut the summands: AccSum (SISC-08), FastAccSum (SISC-09)
Sum by fixed exponent: HybridSum (SISC-09), OnLineExact (TOMS-10)
From faithful to exact rounding
costly choice of the right side when closed to breakpoints
e.g. 1 + 2−53 ± 2−106
5 / 30
Skip over the conditioning
Distillation: iterate until faithful or exact rounding
Error free transformation of [x ]→ [x (1)]→ · · · → [x∗] such that∑xi =
∑x∗i and [x∗] provides the expected rounded value.
Kahan (87), . . . , Zhu-Hayes: iFastSum (SISC-09)
More space to keep everything
Long accumulator, hardware oriented: Malcolm (71), Kulish (80)
Cut the summands: AccSum (SISC-08), FastAccSum (SISC-09)
Sum by fixed exponent: HybridSum (SISC-09), OnLineExact (TOMS-10)
From faithful to exact rounding
−→ Run-time and memory efficiencies are now the discriminant factors
5 / 30
1 Why measure summation algorithm performance?
2 How to measure summation algorithm performance?
3 ILP and the PerPI Tool
4 Experiments with recent summation algorithms
5 Conclusion
6 / 30
Reliable and significant measure of the time complexity?
The classic way: count the number of flop
A usual problem: double the accuracy of a computed result
A usual answer for polynomial evaluation (degree n)
Metric Horner CompHorner DDHorner
Flop count 2n 22n + 5 28n + 4
Flop count ratio 1 ≈ 11 ≈ 14
Measured #cycles ratio 1 2.8 – 3.2 8.7 – 9.7
Flop count vs. run-time measures
Flop counts and measured run-times are not proportional
Run-time measure is a very difficult experimental process
Which one trust?
7 / 30
How to trust non-reproducible experiment results?
Measures are mostly non-reproducible
The execution time of a binary program varies, even using the same data
input and the same execution environment.
Why? Experimental uncertainty of the hardware performance counters
Spoiling events: background tasks, concurrent jobs, OS interrupts
Non deterministic issues: instruction scheduler, branch predictor
External conditions: temperature of the room
Timing accuracy: no constant cycle period on modern processors (i7. . . )
Uncertainty increases as computer system complexity does
Architecture and micro-architecture issues: multicore, hybrid, speculation
Compiler options and its effects
8 / 30
How to read the current literature?
Numerical results in S.M. Rump contributions (for summation)
26% for Sum2-SumK (SISC-05) : 9 pages over 34
20% for AccSum (SISC-08) : 7 pages over 35
20% for AccSumK-NearSum (SISC-08b) : 6 pages over 30
less that 3% for FastAccSum (SISC-09) : 1 page over 37
Lack of proof, or at least of reproducibility
Measuring the computing time of summation algorithms in a high-level
language on today’s architectures is more of a hazard than scientific
research. S.M. Rump (SISC, 2009)
. . . in the paper entitled Ultimately Fast Accurate Summation
9 / 30
Software and System Performance experts’ point of view
The limited Accuracy of Performance Counter Measurements
We caution performance analysts to be suspicious of cycle counts
. . . gathered with performance counters.
D. Zaparanuks, M. Jovic, M. Hauswirth (2009)
Can Hardware Performance Counters Produces Expected, Deterministic Results?
In practice counters that should be deterministic show variation from
run to run on the x86 64 architecture. . . . it is difficult to determine
known “good” reference counts for comparison.
V.M. Weaver, J. Dongarra (2010)
The picture is blurred: the computing chain is wobbling around
If we combine all the published speedups (accelerations) on the well
known public benchmarks since four decades, why don’t we observe
execution times approaching to zero? S. Touati (2009)
10 / 30
Outline
1 Why measure summation algorithm performance?
2 How to measure summation algorithm performance?
3 ILP and the PerPI Tool
4 Experiments with recent summation algorithms
5 Conclusion
11 / 30
ILP and the performance potential of an algorithm
Instruction Level Parallelism (ILP) describes the potential of the instructions of
a program that can be executed simultaneously
Hennessy-Patterson’s ideal machine (H-P IM)
every instruction is executed one cycle after the execution one of the
producers it depends
no other constraint than the true instruction dependency (RAW)
Measure the #cycles and the #IPC running the code with the H-P IM
maximal exploitation of the program ILP
processor and ILP in practice: superscalar and out-of-order execution
ILP measures the potential of the algorithm performance
12 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
13 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
Instruction and cycle counting
13 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
Instruction and cycle counting
Cycle 0: i1 i2 i4
13 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
Instruction and cycle counting
Cycle 0: i1 i2 i4
Cycle 1: i3 i5
13 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
Instruction and cycle counting
Cycle 0: i1 i2 i4
Cycle 1: i3 i5
Cycle 2: i6
13 / 30
What is ILP?
A synthetic sample: e = (a+b) + (c+d)
x86 binary...
mov eax,DWP[ebp-16]i1
mov edx,DWP[ebp-20]i2
add edx,eaxi3
mov ebx,DWP[ebp-8]i4
add ebx,DWP[ebp-12]i5
add edx,ebxi6
...
Instruction and cycle counting
Cycle 0: i1 i2 i4
Cycle 1: i3 i5
Cycle 2: i6
# of instructions = 6, # of cycles = 3
ILP = # of instructions/# of cycles = 2
13 / 30
ILP explains why compensated algorithms run fast
N. Louvet, PhD (07) CompHorner DDHorner
#C=2n+8, ILP≈ 11 #C=17n+2, ILP≈ 1.65
*
* *
*
*
+
+
−
−
−
− −
−
−
−
+
+
+
+
+
x_lo
x_hi
x
P[i]
P[i]
r
c
c
(i)
(i+1)
(i)
* *
x_lo
x_hi
x
(i+1)r
splitter
1
2
3
4
5
6
7
8
9
10
c
c
r
(i+1)
(i)
(i)
(i+1)
r
(n−1)
(n−2)
(n−3)
(n−4)
(n−5)
(0)
(1)
(2)
(3)
(4)
(a) (b) (c)
+
−
−
−
+
−
−
+
+
+
−
+
+
x_lox_hi
x_hix_lo
x
P[i]
sh
sh
sl
(i+1)
(i)
(i)
x
(i+1)sl
**
+
−
−
−
−* *
* *−
+
+
splitter
sh
sl
sh
sl
(i+1)
(i+1)
(i)
(i)
10
12
19
2
1
3
4
5
6
7
8
9
11
13
14
15
16
17
18
(n−1)
(n−2)
(n−3)
(n−4)
(a) (b) (c)
14 / 30
The PerPI Tool automatizes this ILP analysis
PerPI: a pintool to analyse and visualise the ILP of x86-coded algorithms
Pin (Intel) tool (http://www.pintool.org)
Outputs: ILP measure (#C, #I), IPC histogram, data-dependency graph
Input: x86 64 binary file
Developed and maintained by B. Goossens and D. Parello (DALI)
15 / 30
1 Why measure summation algorithm performance?
2 How to measure summation algorithm performance?
3 ILP and the PerPI Tool
4 Experiments with recent summation algorithms
5 Conclusion
16 / 30
Some recent accurate and fast summation algorithms
Twice more precision
Sum2: Compensated with a VectSum that uses TwoSum
DDSum: Recursive sum + double-double arithmetic
Faithful or exact rounding
iFastSum: SumK with dynamic error control
AccSum and FastAccSum: Adaptative computational effort wrt cond.
Split the summands by chunk that sums exactly (width depends on n),
careful sum of the chunks.
Chunk cutting-line fixed in AccSum while more dynamic in FastAccSum
HybridSum and OnLineExactSum: Exponent extraction of the summands,
careful accumulation in one (HS) or two vectors (OLE) of fixed and short
length (2048 in IEEE-b64), and distillate the (very for OLE) short vector
with iFastSum.
17 / 30
How to chose the data test?
Time complexity parameters of the summation algorithms
Only n for Sum2, SumK: constant accuracy improvement
n and cond for AccSum, iFastSum: adaptive accuracy improvement
Exponent range of the summands: Z-H exhibit no influence for HybridSum
and OnlineExactSum for large n
Rump’s generator of arbitrary ill-conditioned dot product (SISC-05),
modified for summation and to cover an arbitrary exponent range.
Length: n ∈ [103, 107] and cond ∈ [108, 1040] = [√
1/u, 1/u2.5]
18 / 30
Our fuzzy PAPI picture
Parameters : sum length: 103 to 106, cond: 108 to 1040
cond Sum Sum2 FastAccSum iFastSum HybridSum OnLineExact
108 1 2–3 4–5 7–8 5 (n > 105) 4 (n > 105)
1016 1 2–3 5–6 7–8 5 (n > 105) 4 (n > 105)
1024 1 – 7 13 5 (n > 105) 4 (n > 105)
1032 1 – 8 18 5 (n > 105) 4 (n > 105)
1040 1 – 9 18+ 5 (n > 105) 4 (n > 105)
Experimental process: PAPI, counter delay, hot caches, average over 50 samples
for each n and cond. . . .
Intel(R) Core(TM) i7 CPU870 2.93GHz, x86 64, GNU/Linux noyau 2.6.38-8-generic
– gcc (4.6) -std=c99 -march=corei7 -mfpmath=sse -O3 -funroll-all-loops
– icc (12.0.420110427) -std=c99 -O3 -mtune=corei7 -xSSE -axsse4.2 -funroll-all-loops
19 / 30
Focus inside OnLineExact
Low level choices are crucial
cond = 1016 gcc icc
n 2sum Fast2sum 2sum Fast2sum
103 13 11 13.7 9
104 6.5 6.5 5.8 4
105 5 5.5 3.8 3.3
106 4.7 5.6 3.8 3.3
Cycle ratios (vs. Sum) vary for different EFT and compilers
20 / 30
PerPI and reproducibility: one run is enough
21 / 30
PerPI: # cycle ratios for summation algorithms
Number of cycles: ratios vs. Sum
for cond = 1032 and n = 2048, 104, 10522 / 30
PerPI: # cycle ratios for summation algorithms
Twice more accurate computed sum
No overhead: compensation is the right choice
Faithfully or exactly rounded computed sum
The newest, the potentially fastest but be cautious: sensitive in practice
FastAccSum (3n) not faster than AccSum (4n) [GLPP-Para10]
OnLineExact for large n, else iFastSum
PerPI highlights the control, e.g. iteration counters
Less #C in HybridSum than in Sum?
Sum is unrolled 8 times by gcc but C forbids to change the evaluation order
of the arithmetic expression
Every cycle of HybridSum has enough parallel work with different
summands: 2 here
OnLineExact introduces dependency between iterations: x [i ] and x [i + 1]
may have the same exponent
23 / 30
HS (left) and OLE (right), unrolled (up) or not (down)
24 / 30
Conclusion
1 Why measure summation algorithm performance?
2 How to measure summation algorithm performance?
3 ILP and the PerPI Tool
4 Experiments with recent summation algorithms
5 Conclusion
25 / 30
Conclusion
Highly accurate algorithm −→ reliable performance evaluation
Flop count: not significant
Hardware counter based measure: uncertainty and no reproducibility
PerPI: a software platform to analyze and visualise ILP
Reliable: reproducibility both in time and location
Realistic: correlation with measured ones
Useful: a detailed picture of the intrinsic behavior of the algorithm
Optimisation tool: analyse the effect of some hardware constraints
[GLPP-Para10]
Exploratory tool: gives us the taste of the behavior of our algorithms
running on “tomorrow” processors
26 / 30
Conclusion
Computing time: More science? Less hazard?
No definitive answer
PerPI result is far from perfect
Not abstract enough: instruction set dependence, compiler choice
Good abstraction level? Assembler program or high level programming
language?
Next step for f.p. summation: reproducibility to improve productivity
Web site with common and shared resources: tested + test + make file
sources, data files and generators, real and abstract associated measures
Open and dynamic interaction: load your new algorithm, your new data,
run them and let’s contribute
architectures? compilers?
suggestions and partners are welcome!
27 / 30
References I
D. H. Bailey.
Twelve ways to fool the masses when giving performance results on parallel computers.
Supercomputing Review, pages 54–55, Aug. 1991.
B. Goossens, P. Langlois, D. Parello, and E. Petit.
PerPI: A tool to measure instruction level parallelism.
In K. Jonasson, editor, Applied Parallel and Scientific Computing - 10th International
Conference, PARA 2010, Reykjavık, Iceland, June 6-9, 2010, Revised Selected Papers,
Part I, volume 7133 of Lecture Notes in Computer Science, pages 270–281. Springer,
2012.
N. J. Higham.
Accuracy and Stability of Numerical Algorithms.
SIAM, 2nd edition, 2002.
J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefevre, G. Melquiond,
N. Revol, D. Stehle, and S. Torres.
Handbook of Floating-Point Arithmetic.
Birkhauser Boston, 2010.
28 / 30
References II
T. Ogita, S. M. Rump, and S. Oishi.
Accurate sum and dot product.
SIAM J. Sci. Comput., 26(6):1955–1988, 2005.
S. M. Rump.
Ultimately fast accurate summation.
SIAM J. Sci. Comput., 31(5):3466–3502, 2009.
S. M. Rump, T. Ogita, and S. Oishi.
Accurate floating-point summation – part I: Faithful rounding.
SIAM J. Sci. Comput., 31(1):189–224, 2008.
V. Weaver and J. Dongarra.
Can hardware performance counters produce expected, deterministic results?
In 3rd Workshop on Functionality of Hardware Performance Monitoring, 2010.
D. Zaparanuks, M. Jovic, and M. Hauswirth.
Accuracy of performance counter measurements.
In ISPASS, pages 23–32. IEEE, 2009.
29 / 30
References III
Y.-K. Zhu and W. B. Hayes.
Correct rounding and hybrid approach to exact floating-point summation.
SIAM J. Sci. Comput., 31(4):2981–3001, 2009.
Y.-K. Zhu and W. B. Hayes.
Algorithm 908: Online exact summation of floating-point streams.
ACM Transactions on Mathematical Software, 37(3):37:1–37:13, Sept. 2010.
30 / 30