GTE: Arbitrary Precision Arithmetic

David Eberly, Geometric Tools, Redmond WA 98052
https://www.geometrictools.com/

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Created: November 25, 2014
Last Modified: September 11, 2020

Contents

1 Introduction
2 IEEE 754-2008 Binary Representations
  2.1 Representation of 32-bit Floating-Point Numbers
  2.2 Representation of 64-bit Floating-Point Numbers
3 Binary Scientific Numbers
  3.1 Multiplication
  3.2 Addition
    3.2.1 The Case p - n > q - m
    3.2.2 The Case p - n < q - m
    3.2.3 The Case p - n = q - m
    3.2.4 Determining the Maximum Number of Bits for Addition
  3.3 Subtraction
  3.4 Unsigned Integer Arithmetic
    3.4.1 Addition
    3.4.2 Subtraction
    3.4.3 Multiplication
    3.4.4 Shift Left
    3.4.5 Shift Right to Odd Number
    3.4.6 Comparisons
  3.5 Conversion of Floating-Point Numbers to Binary Scientific Numbers
  3.6 Conversion of Binary Scientific Numbers to Floating-Point Numbers
4 Binary Scientific Rationals
  4.1 Arithmetic Operations
  4.2 Conversion of Floating-Point Numbers to Binary Scientific Rationals
  4.3 Conversion of Binary Scientific Rationals to Floating-Point Numbers
5 Implementation of Binary Scientific Numbers
6 Implementation of Binary Scientific Rationals
7 Performance Considerations
  7.1 Static Computation of the Maximum Bits of Precision
  7.2 Dynamic Computation of the Maximum Bits of Precision
  7.3 Memory Management
8 Alternatives for Computing Signs of Determinants
  8.1 Interval Arithmetic
  8.2 Alternate Logic for Signs of Determinants
9 Miscellaneous Items

1 Introduction

The information in this document occurs in revised and expanded form in the book Robust and Error-Free Geometric Computations.

Deriving the mathematical details for an algorithm that solves a geometric problem is a well-understood process. A direct implementation of the algorithm, however, does not always work out so well in practice. It is difficult to achieve robustness when computing using floating-point arithmetic. The IEEE 754-2008 Standard for floating-point numbers defines a 32-bit binary representation, float, that has 24 bits of precision and a 64-bit binary representation, double, that has 53 bits of precision. For many geometric problems, the number of bits of precision required for robustness is larger than the number available from the standard floating-point types.

Example 1. Consider the 2D geometric query for testing whether a point $\mathbf{P}$ is inside or outside a convex polygon with counterclockwise ordered vertices $\mathbf{V}_i$ for $0 \le i < n$. The point is strictly inside the polygon when the point triples $\{\mathbf{P}, \mathbf{V}_i, \mathbf{V}_{i+1}\}$ are counterclockwise ordered for all $i$, with the understanding that $\mathbf{V}_n = \mathbf{V}_0$. If at least one of the triples is clockwise ordered, the point is outside the polygon. It is possible that the point is exactly on a polygon edge. An implementation of the algorithm is

    // Return: +1 (P strictly inside), -1 (P strictly outside), 0 (P on edge)
    template <typename Real>
    int GetClassification(Vector2<Real> const& P, int n, Vector2<Real> const* V)
    {
        int numPositive = 0, numNegative = 0, numZero = 0;
        for (int i0 = n - 1, i1 = 0; i1 < n; i0 = i1++)
        {
            Vector2<Real> diff0 = P - V[i0], diff1 = P - V[i1];
            Real dotPerp = diff0[0] * diff1[1] - diff0[1] * diff1[0];
            if (dotPerp > (Real)0)
            {
                ++numPositive;
            }
            else if (dotPerp < (Real)0)
            {
                ++numNegative;
            }
            else
            {
                ++numZero;
            }
        }

        return (numZero == 0 ? (numPositive == n ? +1 : -1) : 0);
    }

    Vector2<float> P(0.5f, 0.5f), V[3];
    V[0] = Vector2<float>(-7.29045947e-013f, 6.29447341e-013f);
    V[1] = Vector2<float>(1.0f, 8.11583873e-013f);
    V[2] = Vector2<float>(9.37735566e-013f, 1.0f);

    // The return value is 0, so P is determined to be on an edge of the
    // triangle. The function computes numPositive = 2, numNegative = 0,
    // and numZero = 1.
    int classifyUsingFloat = GetClassification(P, 3, V);

    // If rP and rV[i] are rational representations of P and V[i] and you
    // have type Rational that supports exact rational arithmetic, then
    // the return value is +1, so P is actually strictly inside the triangle.
    // The function computes numPositive = 3, numNegative = 0, and numZero = 0.
    int classifyUsingExact = GetClassification(rP, 3, rV);

The values of the sign counters are sensitive to numerical rounding errors when $\mathbf{P}$ is nearly on a polygon edge.

A geometric algorithm that depends on a theoretically correct classification can suffer when the floating-point results are not correct. For example, the convex hull of a set of 2D points involves this query. Incorrect classification can cause havoc with a convex hull finder.

Example 2. Compute the distance between two lines in $n$ dimensions. Each line is specified using two points. The first line contains points $\mathbf{P}_0$ and $\mathbf{P}_1$ and is parameterized by $\mathbf{P}(s) = (1 - s)\mathbf{P}_0 + s\mathbf{P}_1$ for all real values $s$. The second line contains points $\mathbf{Q}_0$ and $\mathbf{Q}_1$ and is parameterized by $\mathbf{Q}(t) = (1 - t)\mathbf{Q}_0 + t\mathbf{Q}_1$ for all real values $t$. The squared distance between a pair of points, one on each line, is

$$F(s,t) = |\mathbf{P}(s) - \mathbf{Q}(t)|^2 = a s^2 - 2 b s t + c t^2 + 2 d s - 2 e t + f$$

where

$$a = |\mathbf{P}_1 - \mathbf{P}_0|^2, \quad b = (\mathbf{P}_1 - \mathbf{P}_0) \cdot (\mathbf{Q}_1 - \mathbf{Q}_0), \quad c = |\mathbf{Q}_1 - \mathbf{Q}_0|^2,$$

$$d = (\mathbf{P}_1 - \mathbf{P}_0) \cdot (\mathbf{P}_0 - \mathbf{Q}_0), \quad e = (\mathbf{Q}_1 - \mathbf{Q}_0) \cdot (\mathbf{P}_0 - \mathbf{Q}_0), \quad f = |\mathbf{P}_0 - \mathbf{Q}_0|^2$$

Observe that $ac - b^2 = |(\mathbf{P}_1 - \mathbf{P}_0) \times (\mathbf{Q}_1 - \mathbf{Q}_0)|^2 \ge 0$. If $ac - b^2 > 0$, the lines are not parallel and the graph of $F(s,t)$ is a paraboloid with a unique global minimum that occurs when the gradient is the zero vector, $\nabla F(s,t) = (0,0)$. If the solution is $(\bar{s}, \bar{t})$, the closest points on the lines are $\mathbf{P}(\bar{s})$ and $\mathbf{Q}(\bar{t})$ and the distance between them is $\sqrt{F(\bar{s}, \bar{t})}$.

If $ac - b^2 = 0$, the lines are parallel and the graph of $F(s,t)$ is a parabolic cylinder with a global minimum that occurs along an entire line. This line contains all the vertices of the parabolas that form the cylinder. Each point $(\bar{s}, \bar{t})$ on this line produces a pair of closest points on the original lines, $\mathbf{P}(\bar{s})$ and $\mathbf{Q}(\bar{t})$, and the distance between them is $\sqrt{F(\bar{s}, \bar{t})}$.

Mathematically, the problem is straightforward to solve. The gradient equation is $\nabla F(s,t) = 2(as - bt + d, -bs + ct - e) = (0,0)$, which is a linear system of two equations in two unknowns,

$$\begin{bmatrix} a & -b \\ -b & c \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} -d \\ e \end{bmatrix}$$

The determinant of the coefficient matrix is $ac - b^2$, so the equations have a unique solution $(\bar{s}, \bar{t})$ when $ac - b^2 > 0$,

$$\begin{bmatrix} \bar{s} \\ \bar{t} \end{bmatrix} = \frac{1}{ac - b^2} \begin{bmatrix} be - cd \\ ae - bd \end{bmatrix}$$

This is the case when the two lines are not parallel. Let us look at a particular example using floating-point arithmetic to solve the system,

    Vector3<double> P0(-1.0896217473782599, 9.7236145595088601e-007, 0.0);
    Vector3<double> P1( 0.91220578597858548, -9.4369829432107506e-007, 0.0);
    Vector3<double> Q0(-0.90010447502136237, 9.0671446351334441e-007, 0.0);
    Vector3<double> Q1( 1.0730877178721130, -9.8185787633992740e-007, 0.0);
    Vector3<double> P1mP0 = P1 - P0, Q1mQ0 = Q1 - Q0, P0mQ0 = P0 - Q0;
    double a = Dot(P1mP0, P1mP0);   // 4.0073134733092228
    double b = Dot(P1mP0, Q1mQ0);   // 3.9499904603425491
    double c = Dot(Q1mQ0, Q1mQ0);   // 3.8934874300993294
    double d = Dot(P1mP0, P0mQ0);   // -0.37938089385085144
    double e = Dot(Q1mQ0, P0mQ0);   // -0.37395400223322062
    double det = a * c - b * b;     // 1.7763568394002505e-015
    double sNumer = b * e - c * d;  // 2.2204460492503131e-016
    double tNumer = a * e - b * d;  // 4.4408920985006262e-016
    double s = sNumer / det;        // 0.125
    double t = tNumer / det;        // 0.25
    Vector3<double> PClosest = (1.0 - s) * P0 + s * P1;
    Vector3<double> QClosest = (1.0 - t) * Q0 + t * Q1;
    double distance = Length(PClosest - QClosest);  // 0.43258687891076358

Now repeat the experiment using a type Rational that supports exact rational arithmetic.

    Vector3<Rational> rP0(P0[0], P0[1], P0[2]);
    Vector3<Rational> rP1(P1[0], P1[1], P1[2]);
    Vector3<Rational> rQ0(Q0[0], Q0[1], Q0[2]);
    Vector3<Rational> rQ1(Q1[0], Q1[1], Q1[2]);
    Vector3<Rational> rP1mP0 = rP1 - rP0, rQ1mQ0 = rQ1 - rQ0, rP0mQ0 = rP0 - rQ0;
    Rational ra = Dot(rP1mP0, rP1mP0);
    Rational rb = Dot(rP1mP0, rQ1mQ0);
    Rational rc = Dot(rQ1mQ0, rQ1mQ0);
    Rational rd = Dot(rP1mP0, rP0mQ0);
    Rational re = Dot(rQ1mQ0, rP0mQ0);
    Rational rdet = ra * rc - rb * rb;       // 2.4974018083084524e-020
    Rational rsNumer = rb * re - rc * rd;    // -3.6091745045569584e-017
    Rational rtNumer = ra * re - rb * rd;    // -3.6617913970635857e-017
    Rational rs = rsNumer / rdet;            // -1445.1717351007826
    Rational rt = rtNumer / rdet;            // -1466.2403882632732
    Rational one(1);
    Vector3<Rational> rPClosest = (one - rs) * rP0 + rs * rP1;
    Vector3<Rational> rQClosest = (one - rt) * rQ0 + rt * rQ1;
    Vector3<Rational> rdiff = rPClosest - rQClosest;  // Rational(0)
    Rational rsqrDistance = Dot(rdiff, rdiff);
    double converted = sqrt((double)rsqrDistance);   // 0.0

The values of rdet, rsNumer, rtNumer, rs, and rt shown in the comments were converted back to double values with some loss in precision. Clearly, the double-precision computations suffer greatly from the numerical rounding errors. In fact, the distance must be zero because the two lines are not parallel and lie in the xy-plane, so they must intersect.

Example 3. A common subproblem in geometric algorithms is computing the real-valued roots of a quadratic polynomial $q(x) = a_0 + a_1 x + a_2 x^2$, where $a_2 \ne 0$. The quadratic formula provides the roots,

$$r = \frac{-a_1 \pm \sqrt{a_1^2 - 4 a_0 a_2}}{2 a_2}$$

The discriminant is $\Delta = a_1^2 - 4 a_0 a_2$. If $\Delta > 0$, the polynomial has 2 distinct real-valued roots. If $\Delta = 0$, the polynomial has 1 repeated real-valued root. If $\Delta < 0$, the polynomial has no real-valued roots (both are complex-valued). The numerical problems arise when $\Delta$ is nearly zero. Observe that $\Delta$ is in the form of a determinant, so it can suffer from the same numerical problems in determining its theoretically correct sign.

To obtain a correct classification, you can use exact rational arithmetic to compute $\Delta$. When it is nonnegative, convert the number back to a floating-point quantity and compute the roots using floating-point arithmetic. The conversion can lose precision, and so the square root operation can magnify the errors just as if you used floating-point arithmetic entirely for the computations. Alternatively, you can implement an algorithm for approximating the square root, compute it using exact rational arithmetic, and then convert back to floating-point, hopefully producing a more accurate result than the conversion-first-sqrt-second approach.
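The sign problem for $\Delta$ can be reproduced with a small experiment. The following is a minimal sketch, not GTE code: the function names are hypothetical, and it assumes finite float coefficients. It relies on the fact that the product of two 24-bit significands fits in the 53 bits of a double, so both products are exact in double precision and the sign of their difference is the exact sign of $\Delta$.

```cpp
#include <cassert>
#include <cmath>

// Discriminant a1^2 - 4*a0*a2 evaluated in float. The volatile temporaries
// force each product to be rounded to float before the subtraction.
float DiscriminantFloat(float a0, float a1, float a2)
{
    volatile float a1Sqr = a1 * a1;
    volatile float fourA0A2 = 4.0f * a0 * a2;
    return a1Sqr - fourA0A2;
}

// Sign (+1, 0, -1) of the discriminant for finite float inputs. Both
// products are exact in double, so the sign of their difference is exact.
int DiscriminantSign(float a0, float a1, float a2)
{
    assert(std::isfinite(a0) && std::isfinite(a1) && std::isfinite(a2));
    volatile double a1Sqr = static_cast<double>(a1) * a1;
    volatile double fourA0A2 = 4.0 * a0 * a2;
    double delta = a1Sqr - fourA0A2;
    return (delta > 0.0) - (delta < 0.0);
}
```

With $a_2 = 1$, $a_1 = 1 + 2^{-23}$ and $a_0 = (1 + 2^{-22})/4$, all exactly representable as float, the exact discriminant is $2^{-46} > 0$. The float evaluation returns 0 because $a_1^2$ rounds to $1 + 2^{-22}$, so the polynomial is misclassified as having a repeated root. The double-product trick does not extend to double inputs, which is where exact rational arithmetic is needed.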


2 IEEE 754-2008 Binary Representations

The 32-bit floating-point type float and the 64-bit floating-point type double are designed according to the IEEE 754-2008 standard for floating-point numbers. The binary scientific notation used to represent such numbers is the basis for the arbitrary precision arithmetic implemented in GTE.

2.1 Representation of 32-bit Floating-Point Numbers

The layout of float is shown in Figure 1.

Figure 1. The layout of the floating-point type float.

The sign of the number is stored in bit 31. A 0-valued bit is used for a nonnegative number and a 1-valued bit is used for a negative number. Bits 23 through 30 form an 8-bit unsigned integer $b = e + 127$ that is a biased representation of the exponent $e$; the biased exponent satisfies $0 \le b \le 255$ and the exponent satisfies $-127 \le e \le 128$. The trailing significand is stored as a 23-bit unsigned integer in bits 0 through 22. It represents the fractional part of a number when written in binary scientific notation. The integer part of the notation is implied by the representation and not stored explicitly. It is 1 for a normal floating-point number or 0 for a subnormal floating-point number. Listing 1 shows how the binary encoding represents 32-bit floating-point numbers.

Listing 1. Decoding a number of type float.

    binary32 x = /* some 32-bit floating-point number */;
    uint32_t s = (0x80000000 & x.encoding) >> 31;  // sign
    uint32_t b = (0x7f800000 & x.encoding) >> 23;  // biased exponent
    uint32_t t = (0x007fffff & x.encoding);        // trailing significand

    if (b == 0)
    {
        if (t == 0)  // zeros
        {
            // x = (-1)^s * 0 [allows for +0 and -0]
        }
        else  // subnormal numbers
        {
            // x = (-1)^s * 0.t * 2^{-126}
        }
    }
    else if (b < 255)  // normal numbers
    {
        // x = (-1)^s * 1.t * 2^{b-127}
    }
    else  // special numbers
    {
        if (t == 0)
        {
            // x = (-1)^s * infinity
        }
        else
        {
            if (t & 0x00400000)
            {
                // x = quiet NaN
            }
            else
            {
                // x = signaling NaN
            }
            // payload = t & 0x003fffff
        }
    }

The encoding has signed zeros, $+0$ (encoding 0x00000000) and $-0$ (encoding 0x80000000). There are numerical applications for which it is important to have both representations. The encoding also has signed infinities, $+\infty$ (encoding 0x7f800000) and $-\infty$ (encoding 0xff800000). Infinities have special rules applied to them during arithmetic operations. Finally, the encoding has special values, each called Not-a-Number (NaN). Some of these are called quiet NaNs and are used to provide diagnostic information when unexpected conditions occur during floating-point computations. The others are called signaling NaNs and also provide diagnostic information but might also be used to support the needs of specialized applications. A NaN has an associated payload whose meaning is at the discretion of the implementer. The IEEE 754-2008 Standard has many requirements regarding the handling of NaNs in numerical computations.

In the following, $p = 24$ is the precision, $e_{\min} = -126$ is the minimum exponent of a normal number and $e_{\max} = 127$ is the maximum exponent. The smallest positive subnormal number occurs when $b = 0$ and $t = 1$, which is $2^{e_{\min}+1-p} = 2^{-149}$. All finite floating-point numbers are integral multiples of this number. The largest positive subnormal number occurs when $b = 0$ and $t$ has all 1-valued bits, which is $2^{e_{\min}}(1 - 2^{1-p}) = 2^{-126}(1 - 2^{-23})$. The smallest positive normal number occurs when $b = 1$ and $t = 0$, which is $2^{e_{\min}} = 2^{-126}$. The largest positive normal number occurs when $b = 254$ and $t$ has all 1-valued bits, which is $2^{e_{\max}}(2 - 2^{1-p}) = 2^{127}(2 - 2^{-23})$.
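The decoding and the extreme values above can be checked on a machine whose float is the IEEE 754 binary32 format. The sketch below is illustrative only: `Binary32Fields`, `Decode` and `Value` are hypothetical names, and the bit pattern is obtained with `std::memcpy` rather than through the `binary32` type used in Listing 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

struct Binary32Fields
{
    uint32_t sign;      // bit 31
    uint32_t biased;    // bits 23 through 30
    uint32_t trailing;  // bits 0 through 22
};

// Extract the fields with the same masks as Listing 1. std::memcpy copies
// the bit pattern without violating the aliasing rules.
Binary32Fields Decode(float x)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
    uint32_t encoding;
    std::memcpy(&encoding, &x, sizeof(encoding));
    return { (0x80000000u & encoding) >> 31,
             (0x7f800000u & encoding) >> 23,
             (0x007fffffu & encoding) };
}

// Rebuild the value of a zero, subnormal or normal number from its fields.
double Value(Binary32Fields const& f)
{
    assert(f.biased < 255);  // infinities and NaNs are not handled here
    double signFactor = (f.sign ? -1.0 : 1.0);
    if (f.biased == 0)
    {
        // 0.t * 2^{-126} = t * 2^{-149}; this also covers the zeros.
        return signFactor * std::ldexp(static_cast<double>(f.trailing), -149);
    }
    // 1.t * 2^{b-127} = (2^{23} + t) * 2^{b-150}
    return signFactor * std::ldexp(static_cast<double>(0x00800000u + f.trailing),
        static_cast<int>(f.biased) - 150);
}
```

For example, `Decode(1.0f)` has sign 0, biased exponent 127 and trailing significand 0, and `Value(Decode(x))` reproduces any finite float `x` exactly because every float is representable as a double.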

2.2 Representation of 64-bit Floating-Point Numbers

The layout of double is shown in Figure 2.

Figure 2. The layout of the floating-point type double.

The sign of the number is stored in bit 63. A 0-valued bit is used for a nonnegative number and a 1-valued bit is used for a negative number. Bits 52 through 62 form an 11-bit unsigned integer $b = e + 1023$ that is a biased representation of the exponent $e$; the biased exponent satisfies $0 \le b \le 2047$ and the exponent satisfies $-1023 \le e \le 1024$. The trailing significand is stored as a 52-bit unsigned integer in bits 0 through 51. It represents the fractional part of a number when written in binary scientific notation. The integer part of the notation is implied by the representation and not stored explicitly. It is 1 for a normal floating-point number or 0 for a subnormal floating-point number. Listing 2 shows how the binary encoding represents 64-bit floating-point numbers.

Listing 2. Decoding a number of type double.

    binary64 x = /* some 64-bit floating-point number */;
    uint64_t s = (0x8000000000000000 & x.encoding) >> 63;  // sign
    uint64_t b = (0x7ff0000000000000 & x.encoding) >> 52;  // biased exponent
    uint64_t t = (0x000fffffffffffff & x.encoding);        // trailing significand

    if (b == 0)
    {
        if (t == 0)  // zeros
        {
            // x = (-1)^s * 0 [allows for +0 and -0]
        }
        else  // subnormal numbers
        {
            // x = (-1)^s * 0.t * 2^{-1022}
        }
    }
    else if (b < 2047)  // normal numbers
    {
        // x = (-1)^s * 1.t * 2^{b-1023}
    }
    else  // special numbers
    {
        if (t == 0)
        {
            // x = (-1)^s * infinity
        }
        else
        {
            if (t & 0x0008000000000000)
            {
                // x = quiet NaN
            }
            else
            {
                // x = signaling NaN
            }
            // payload = t & 0x0007ffffffffffff
        }
    }

The encoding has signed zeros, $+0$ (encoding 0x0000000000000000) and $-0$ (encoding 0x8000000000000000). There are numerical applications for which it is important to have both representations. The encoding also has signed infinities, $+\infty$ (encoding 0x7ff0000000000000) and $-\infty$ (encoding 0xfff0000000000000). Infinities have special rules applied to them during arithmetic operations. Finally, the encoding has special values, each called Not-a-Number (NaN). Some of these are called quiet NaNs and are used to provide diagnostic information when unexpected conditions occur during floating-point computations. The others are called signaling NaNs and also provide diagnostic information but might also be used to support the needs of specialized applications. A NaN has an associated payload whose meaning is at the discretion of the implementer. The IEEE 754-2008 Standard has many requirements regarding the handling of NaNs in numerical computations.

In the following, $p = 53$ is the precision, $e_{\min} = -1022$ is the minimum exponent of a normal number and $e_{\max} = 1023$ is the maximum exponent. The smallest positive subnormal number occurs when $b = 0$ and $t = 1$, which is $2^{e_{\min}+1-p} = 2^{-1074}$. All finite floating-point numbers are integral multiples of this number. The largest positive subnormal number occurs when $b = 0$ and $t$ has all 1-valued bits, which is $2^{e_{\min}}(1 - 2^{1-p}) = 2^{-1022}(1 - 2^{-52})$. The smallest positive normal number occurs when $b = 1$ and $t = 0$, which is $2^{e_{\min}} = 2^{-1022}$. The largest positive normal number occurs when $b = 2046$ and $t$ has all 1-valued bits, which is $2^{e_{\max}}(2 - 2^{1-p}) = 2^{1023}(2 - 2^{-52})$.
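The branch structure of Listing 2 can be turned into a small classifier. This is a sketch under the assumption that double is the IEEE 754 binary64 format; the enumeration `Binary64Class` and the function `Classify` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

enum class Binary64Class { Zero, Subnormal, Normal, Infinity, QuietNaN, SignalingNaN };

// Classify a double by its biased exponent b and trailing significand t,
// following the branches of Listing 2.
Binary64Class Classify(double x)
{
    static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 double required");
    static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");
    uint64_t encoding;
    std::memcpy(&encoding, &x, sizeof(encoding));
    uint64_t b = (0x7ff0000000000000ull & encoding) >> 52;  // biased exponent
    uint64_t t = (0x000fffffffffffffull & encoding);        // trailing significand
    if (b == 0)
    {
        return (t == 0 ? Binary64Class::Zero : Binary64Class::Subnormal);
    }
    if (b < 2047)
    {
        return Binary64Class::Normal;
    }
    if (t == 0)
    {
        return Binary64Class::Infinity;
    }
    return ((t & 0x0008000000000000ull) != 0
        ? Binary64Class::QuietNaN : Binary64Class::SignalingNaN);
}
```

Both zeros map to `Zero` and both infinities map to `Infinity` because the sign bit is not examined. Passing a signaling NaN by value can quiet it on some hardware, so the `SignalingNaN` branch is best exercised on raw encodings.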

3 Binary Scientific Numbers

All positive numbers $r$ of type float and double can be written in binary scientific notation as $r = 1.c * 2^p$, where $c$ is a finite sequence of bits $c_i$ that are either 0 or 1 with the last bit required to be 1. The shorthand notation expands to a summation

$$r = 1.c * 2^p = \left(1 + \sum_{i=0}^{n-1} c_i 2^{-(i+1)}\right) 2^p \tag{1}$$

where $c = c_0 \ldots c_{n-1}$ has $n$ bits and where $p$ is an integer-valued power. If we allow for an infinite sum, replacing $n - 1$ by $\infty$, we can represent all positive real-valued numbers. However, rational numbers require either a finite number of bits or a repeating pattern of bits. For example, in base-10 scientific notation the rational number $51/4 = 1.275 * 10^1$. The rational number $1/3 = 3.\overline{3}^{\infty} * 10^{-1}$, where the notation $\overline{3}^{\infty}$ means that the overlined block of digits repeats ad infinitum. In binary scientific notation, $51/4 = 1.10011 * 2^3$ and $1/3 = 1.\overline{01}^{\infty} * 2^{-2}$. Irrational numbers such as $\sqrt{2}$ or $\pi$ always require an infinite number of bits that do not settle into a repeating finite block. We may keep track of an integer $\sigma \in \{-1, 0, 1\}$ that represents the sign of the number, $+1$ for a positive number, $-1$ for a negative number, or 0 for the number zero. When the number is zero, in which case the sign is 0, the values $c$ and $p$ are irrelevant and ignored.

The numbers of the form $\sigma(1.c) * 2^p$ in Equation (1), together with the number zero ($\sigma = 0$ but no bits $c$ or power $p$ assigned to the representation), are referred to as binary scientific numbers, and the set of them is denoted $B$. Addition, subtraction, and multiplication of numbers in $B$ are naturally defined based on the arithmetic of the real numbers. If $x$ and $y$ are in $B$, then so are $x + y$, $x - y$, and $x * y$. However, division is not allowed. The ratio $x/y$ of numbers in $B$ for nonzero $y$ is a real number of the form $1.c * 2^p$, but that number is not necessarily representable with a finite number of bits. For example, $1 = 1.0 * 2^0$ and $3 = 1.1 * 2^1$, but the ratio $1/3$ requires an infinite number of bits in its representation.

The binary scientific numbers are the basis for arbitrary precision arithmetic in GTE. In an implementation, we may store the quantities described next. Let $r = 1.c * 2^p$ be a binary scientific number where $c$ has $n$ bits $c_0$ through $c_{n-1}$ with the last bit $c_{n-1} = 1$. Rewrite $r = \hat{c} * 2^{p-n} = \hat{c} * 2^b$, where $\hat{c}$ is an $(n+1)$-bit integer whose first and last bits are 1. The power $b$ is referred to as a biased exponent. We may represent $r$ using an integer-valued sign $\sigma$, an $(n+1)$-bit unsigned integer $\hat{c}$, and an integer biased exponent $b$. It is convenient to store also the number $n$. Addition, subtraction, or multiplication applied to two binary scientific numbers reduces to the same arithmetic operation applied to two unsigned integers. Manipulation of the number of bits and the biased exponents is relatively simple.
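As an illustration of the stored quantities, the sketch below converts a finite double to a sign $\sigma$, an odd integer $\hat{c}$ and a biased exponent $b$. It is not the GTE conversion described later in the document; the type `BinarySci` and function `ToBinarySci` are hypothetical, and a 64-bit integer is sufficient only because a double has at most 53 significant bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct BinarySci
{
    int sign;       // sigma: -1, 0 or +1
    uint64_t cHat;  // odd integer with n + 1 bits (0 when sign is 0)
    int biased;     // b = p - n
    int numBits;    // n + 1, the number of bits of cHat
};

BinarySci ToBinarySci(double r)
{
    assert(std::isfinite(r));
    BinarySci result{ 0, 0, 0, 0 };
    if (r == 0.0)
    {
        return result;
    }
    result.sign = (r > 0.0 ? +1 : -1);

    // |r| = f * 2^e with 0.5 <= f < 1, so f * 2^{53} is an integer.
    int e = 0;
    double f = std::frexp(std::fabs(r), &e);
    uint64_t c = static_cast<uint64_t>(std::ldexp(f, 53));
    int b = e - 53;

    // Shift right until the integer is odd, adjusting the biased exponent.
    while ((c & 1u) == 0)
    {
        c >>= 1;
        ++b;
    }
    result.cHat = c;
    result.biased = b;
    for (uint64_t temp = c; temp != 0; temp >>= 1)
    {
        ++result.numBits;
    }
    return result;
}
```

For $r = 51/4 = 1.10011 * 2^3$ the result is $\hat{c} = 110011_2 = 51$ with $n + 1 = 6$ bits and $b = p - n = 3 - 5 = -2$, and indeed $51 * 2^{-2} = 12.75$.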

3.1 Multiplication

The product of $x = 1.u * 2^p$ and $y = 1.v * 2^q$ is $z = 1.w * 2^r$, where we need to determine the values of $w$ and $r$. If $u = 0$, then $x * y = 1.v * 2^{p+q}$. If $v = 0$, then $x * y = 1.u * 2^{p+q}$. Otherwise, $u > 0$ and $v > 0$, so at least one bit of $u$ is not zero and at least one bit of $v$ is not zero. Let $u$ have $n > 0$ bits and $v$ have $m > 0$ bits. The product is written as

$$x * y = 1.u * 2^p * 1.v * 2^q = \hat{u} * 2^{p-n} * \hat{v} * 2^{q-m} = \hat{u} * \hat{v} * 2^{p-n+q-m} = \hat{w} * 2^{p-n+q-m} \tag{2}$$

where $\hat{w} = \hat{u} * \hat{v}$ is the product of odd integers. The integer $\hat{u}$ has $n + 1$ bits and the integer $\hat{v}$ has $m + 1$ bits, so $\hat{w}$ is an odd integer with either $n + m + 1$ or $n + m + 2$ bits. For example, consider the case $n = 4$ and $m = 3$. The product of the two smallest odd integers is $10001 * 1001 = 10011001$, which has $n + m + 1 = 8$ bits. The product of the two largest odd integers is $11111 * 1111 = 111010001$, which has $n + m + 2 = 9$ bits.

Define $c = 0$ when the leading bit of $\hat{w}$ is at index $n + m$ or $c = 1$ when the leading bit is at index $n + m + 1$, and define $\ell = n + m + c$. The integer $\hat{w}$ is an $(\ell + 1)$-bit odd integer of the form $\hat{w} = 1 w_0 \ldots w_{\ell-1} = 1.w_0 \ldots w_{\ell-1} * 2^{\ell} = 1.w * 2^{\ell}$ with $w_{\ell-1} = 1$. The product is

$$x * y = \hat{w} * 2^{p-n+q-m} = 1.w * 2^{p-n+q-m+\ell} = 1.w * 2^{p+q+c} = 1.w * 2^r = z \tag{3}$$

which implies $r = p + q + c$. The implementation of multiplication for BSNumber is based on performing the integer multiplication $\hat{w} = \hat{u} * \hat{v}$, selecting $c$ by determining the index $\ell$ of the leading 1-bit of $\hat{w}$, and computing $r = p + q + c$, finally representing $z = 1.w * 2^r$ as the unsigned integer $\hat{w}$ and the biased exponent $b = r - \ell = (p - n) + (q - m)$.
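The multiplication rule can be sketched with a 64-bit integer standing in for the arbitrary precision unsigned integer. The type `SmallBSN` and the functions below are hypothetical and handle only positive numbers whose product fits in 64 bits; they are not the BSNumber implementation.

```cpp
#include <cassert>
#include <cstdint>

// A positive binary scientific number cHat * 2^biased with cHat odd. The
// 64-bit integer is a stand-in for an arbitrary precision unsigned integer.
struct SmallBSN
{
    uint64_t cHat;
    int biased;
};

// Index of the leading 1-bit of a positive integer.
int LeadingBit(uint64_t value)
{
    assert(value != 0);
    int index = 0;
    while (value >>= 1)
    {
        ++index;
    }
    return index;
}

// The power p in the notation 1.c * 2^p.
int Exponent(SmallBSN const& x)
{
    return x.biased + LeadingBit(x.cHat);
}

// Multiply the odd integers and add the biased exponents, as in Equation (2).
SmallBSN Multiply(SmallBSN const& x, SmallBSN const& y)
{
    // The product has at most n + m + 2 bits; require that it fits.
    assert(LeadingBit(x.cHat) + LeadingBit(y.cHat) + 2 <= 64);
    return { x.cHat * y.cHat, x.biased + y.biased };
}
```

With $n = 4$ and $m = 3$, multiplying $\hat{u} = 17$ by $\hat{v} = 9$ gives $\hat{w} = 153$ with leading bit at index $7 = n + m$, so $c = 0$. Multiplying $\hat{u} = 31$ by $\hat{v} = 15$ gives $\hat{w} = 465$ with leading bit at index $8 = n + m + 1$, so $c = 1$ and the exponent of the product is $p + q + 1$.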

3.2 Addition

The sum of $x = 1.u * 2^p$ and $y = 1.v * 2^q$ is $z = 1.w * 2^r$, where we need to determine the values of $w$ and $r$. The cases $x = 0$ or $y = 0$ are trivial to handle, so assume $x > 0$ and $y > 0$ where $u$ has $n$ bits and $v$ has $m$ bits. The sum is

$$x + y = 1.u * 2^p + 1.v * 2^q = \hat{u} * 2^{p-n} + \hat{v} * 2^{q-m} \tag{4}$$

The computation depends on the relative values of $p - n$ and $q - m$.

3.2.1 The Case p - n > q - m

Let $p - n > q - m$. The sum is written as

$$\hat{u} * 2^{p-n} + \hat{v} * 2^{q-m} = \left(\hat{u} * 2^d + \hat{v}\right) * 2^{q-m} = \hat{w} * 2^{q-m} \tag{5}$$

where $d = (p - n) - (q - m)$. The integer $\hat{u} * 2^d$ is even because $\hat{u}$ is odd and $d > 0$. The integer $\hat{v}$ is odd. Therefore, $\hat{w}$ is an odd number. We can determine the index $\ell$ of the leading 1-bit of $\hat{w}$,

$$x + y = \hat{w} * 2^{q-m} = 1.w * 2^{q-m+\ell} = 1.w * 2^r = z \tag{6}$$

where $r = q - m + \ell$. The implementation of addition for BSNumber in this case involves computing the integer $\hat{u} * 2^d$ by a shift-left operation, adding the result to $\hat{v}$, determining the index $\ell$ of the leading 1-bit of $\hat{w}$, and finally representing $z = 1.w * 2^r$ as the unsigned integer $\hat{w}$ and the biased exponent $b = q - m$.

3.2.2 The Case p - n < q - m

Let $p - n < q - m$. The sum is written as

$$\hat{u} * 2^{p-n} + \hat{v} * 2^{q-m} = \left(\hat{u} + \hat{v} * 2^d\right) * 2^{p-n} = \hat{w} * 2^{p-n} \tag{7}$$

where $d = (q - m) - (p - n)$. The integer $\hat{v} * 2^d$ is even because $\hat{v}$ is odd and $d > 0$. The integer $\hat{u}$ is odd. Therefore, $\hat{w}$ is an odd number. We can determine the index $\ell$ of the leading 1-bit of $\hat{w}$,

$$x + y = \hat{w} * 2^{p-n} = 1.w * 2^{p-n+\ell} = 1.w * 2^r = z \tag{8}$$

where $r = p - n + \ell$. The implementation of addition for BSNumber in this case involves computing the integer $\hat{v} * 2^d$ by a shift-left operation, adding the result to $\hat{u}$, determining the index $\ell$ of the leading 1-bit of $\hat{w}$, and finally representing $z = 1.w * 2^r$ as the unsigned integer $\hat{w}$ and the biased exponent $b = p - n$.

3.2.3 The Case p - n = q - m

Let $p - n = q - m$. The sum is written as

$$\hat{u} * 2^{p-n} + \hat{v} * 2^{q-m} = (\hat{u} + \hat{v}) * 2^{q-m} = \tilde{w} * 2^{q-m} \tag{9}$$

The integers $\hat{u}$ and $\hat{v}$ are odd, so $\tilde{w}$ is even. Let $\ell$ be the index of the leading 1-bit of $\tilde{w}$ and let $t$ be the index of the trailing 1-bit of $\tilde{w}$. It is necessary that $\ell \ge t > 0$. We can shift-right $\tilde{w}$ by $t$ bits to obtain an odd integer $\hat{w}$ whose leading 1-bit is at index $f = \ell - t$; then

$$x + y = \tilde{w} * 2^{q-m} = \hat{w} * 2^{q-m+t} = 1.w * 2^{q-m+t+f} = 1.w * 2^{q-m+\ell} = 1.w * 2^r = z \tag{10}$$

where $r = q - m + \ell$. The fractional part $w$ (as an integer) has $\ell - t$ bits. The implementation of addition for BSNumber in this case involves adding the integers $\hat{u}$ and $\hat{v}$ to obtain $\tilde{w}$, determining the indices $\ell$ and $t$, shifting right $\tilde{w}$ by $t$ bits to obtain $\hat{w}$, and finally representing $z = 1.w * 2^r$ as the unsigned integer $\hat{w}$ and the biased exponent $b = q - m + t$.
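The three cases can be collected into one routine. The following is a minimal sketch rather than the BSNumber code: `SmallBSN` and `Add` are hypothetical names, the numbers are positive, and the operands are restricted so that every intermediate integer fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// A positive binary scientific number cHat * 2^biased with cHat odd.
struct SmallBSN
{
    uint64_t cHat;
    int biased;
};

SmallBSN Add(SmallBSN const& x, SmallBSN const& y)
{
    assert((x.cHat & 1u) == 1 && (y.cHat & 1u) == 1);
    // Restrictions of this sketch so that shifts and sums cannot overflow.
    assert(x.cHat < (uint64_t(1) << 31) && y.cHat < (uint64_t(1) << 31));

    if (x.biased > y.biased)
    {
        // Case p - n > q - m: wHat = uHat * 2^d + vHat is odd, b = q - m.
        int d = x.biased - y.biased;
        assert(d <= 31);
        return { (x.cHat << d) + y.cHat, y.biased };
    }
    if (x.biased < y.biased)
    {
        // Case p - n < q - m: wHat = uHat + vHat * 2^d is odd, b = p - n.
        int d = y.biased - x.biased;
        assert(d <= 31);
        return { x.cHat + (y.cHat << d), x.biased };
    }

    // Case p - n = q - m: the sum is even, so shift right by the index t
    // of the trailing 1-bit and add t to the biased exponent.
    uint64_t w = x.cHat + y.cHat;
    int t = 0;
    while ((w & 1u) == 0)
    {
        w >>= 1;
        ++t;
    }
    return { w, x.biased + t };
}
```

For example, adding $3 * 2^0$ and $5 * 2^0$ falls into the third case: $\tilde{w} = 8$, $t = 3$, and the result is $1 * 2^3$.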

3.2.4 Determining the Maximum Number of Bits for Addition

We may concisely write the sum $x + y = z = 1.w * 2^r$. The implementation of BSNumber stores the biased exponent $b$, the integer $\hat{w}$, and the number of bits in $\hat{w}$. In all three cases, we have a summation of integers. When adding a $k_0$-bit positive integer and a $k_1$-bit positive integer, the maximum number of bits required for the sum is $\max\{k_0, k_1\} + 1$. The extra bit occurs only when there is a carry-out by the addition. For example, 11100 is a 5-bit integer and 11 is a 2-bit integer. The sum is $11100 + 11 = 11111$, which is a 5-bit integer. However, 111 is a 3-bit integer and the sum $11100 + 111 = 100011$ is a 6-bit integer; there is a carry-out in this example.

By definition we know that $\hat{u}$ has $n + 1$ bits and $\hat{v}$ has $m + 1$ bits. In the case $p - n > q - m$, $\hat{u} * 2^{(p-n)-(q-m)}$ is an even integer with $n + 1 + (p - n) - (q - m)$ bits. Adding this to $\hat{v}$ to produce $\hat{w}$, the maximum number of bits required is $\max\{n + 1 + (p - n) - (q - m), m + 1\} + 1$. Similarly in the case $p - n < q - m$, the number of bits for $\hat{v} * 2^{(q-m)-(p-n)}$ is $m + 1 + (q - m) - (p - n)$. The integer $\hat{w}$ requires at most $\max\{n + 1, m + 1 + (q - m) - (p - n)\} + 1$ bits. Finally, in the case $p - n = q - m$ we must first compute $\tilde{w} = \hat{u} + \hat{v}$ before we can determine $\ell$ and $t$. The maximum number of bits for $\tilde{w}$ is $\max\{n + 1, m + 1\} + 1$.

The maximum number of bits for the three cases was determined algebraically. A more intuitive argument is the following. Imagine marking the integer points on the real line at which the consecutive bits of $x = 1.u * 2^p$ are located. The largest marked integer is $p$, which corresponds to $1.0 * 2^p$, the highest-order bit of $x$. The smallest marked integer is $p - n$, where $u$ is an odd integer with $n$ bits, and which corresponds to $1 * 2^{p-n}$, the lowest-order bit of $x$ (and of $u$). The same argument applies to $y = 1.v * 2^q$. The highest-order bit occurs at $q$ and the lowest-order bit occurs at $q - m$, where $v$ is an odd integer with $m$ bits. The bit locations of $x + y = 1.w * 2^r$ are in the union of the bit locations for $x$ and $y$, plus one more at the highest-order end in case the addition has a carry-out. Therefore, the maximum number of bits required for the sum is

$$\ell = \max\{p, q\} - \min\{p - n, q - m\} + 2 \tag{11}$$

The right-hand side involves the maximum of exponents and the minimum of biased exponents. When $p - n > q - m$,

$$\begin{aligned} \ell &= \max\{p, q\} - (q - m) + 2 \\ &= \max\{p - q + m + 1, m + 1\} + 1 \\ &= \max\{n + 1 + (p - n) - (q - m), m + 1\} + 1 \end{aligned}$$

which agrees with the previous construction. When $p - n \le q - m$,

$$\begin{aligned} \ell &= \max\{p, q\} - (p - n) + 2 \\ &= \max\{n + 1, q - p + n + 1\} + 1 \\ &= \max\{n + 1, m + 1 + (q - m) - (p - n)\} + 1 \end{aligned}$$

which agrees with the previous construction. In particular, when $p - n = q - m$,

$$\begin{aligned} \ell &= \max\{p, q\} - (p - n) + 2 \\ &= \max\{n + 1, m + 1 + (q - m) - (p - n)\} + 1 \\ &= \max\{n + 1, m + 1\} + 1 \end{aligned}$$

which agrees with the previous construction.
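Equation (11) is easy to check by brute force for small operands. The sketch below uses hypothetical helper names and assumes nonnegative biased exponents only so that the aligned integers fit in a 64-bit type; it compares the bit count of the integer sum (before any shift-right to an odd number) with the bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of bits of a nonnegative integer (0 for the value 0).
int NumBits(uint64_t value)
{
    int count = 0;
    for (; value != 0; value >>= 1)
    {
        ++count;
    }
    return count;
}

// Equation (11): exponents p, q and biased exponents p - n, q - m.
int MaxBitsForSum(int p, int n, int q, int m)
{
    return std::max(p, q) - std::min(p - n, q - m) + 2;
}

// Check the bound for all odd uHat, vHat < 64 and biased exponents 0..8.
bool BoundHoldsForSmallOperands()
{
    for (uint64_t uHat = 1; uHat < 64; uHat += 2)
    {
        for (uint64_t vHat = 1; vHat < 64; vHat += 2)
        {
            for (int bu = 0; bu <= 8; ++bu)
            {
                for (int bv = 0; bv <= 8; ++bv)
                {
                    int n = NumBits(uHat) - 1, m = NumBits(vHat) - 1;
                    int p = bu + n, q = bv + m;
                    int bmin = std::min(bu, bv);
                    // Align both integers to the smaller biased exponent.
                    uint64_t sum = (uHat << (bu - bmin)) + (vHat << (bv - bmin));
                    if (NumBits(sum) > MaxBitsForSum(p, n, q, m))
                    {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
```

The bound is attained, for instance, by $\hat{u} = 11111$ and $\hat{v} = 111$ with equal biased exponents: $p - q = 2$, the bound is 6, and $11111 + 111 = 100110$ has 6 bits.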

    3.3 Subtraction

    We wish to compute x − y. The cases x = 0 and y = 0 are trivial to handle. It is sufficient to analyze the case x > y > 0; when 0 < x < y, we know x − y = −(y − x), which leads to the case y − x with y > x > 0. The difference of x = 1.u ∗ 2^p and y = 1.v ∗ 2^q is z = 1.w ∗ 2^r, where we need to determine the values of w and r. Let u have n bits and v have m bits. The difference is

    $$x - y = 1.u * 2^p - 1.v * 2^q = \hat{u} * 2^{p-n} - \hat{v} * 2^{q-m} \tag{12}$$

    The factoring depends on the relative values of p− n and q −m, just as it did for addition.

    Handling the various cases is the same as for addition, but we need to subtract two integers, the first larger than the second. Our choice for subtraction is to use two’s complement, the same as for native integers on the CPU. For example, consider the subtraction 19 = 29 − 10 = 29 + (−10). In binary notation, 29 = 11101 and 10 = 01010, using the same number of bits for both numbers. To obtain the two’s complement of a number, you invert the bits and add one to the result. For the example at hand, −10 = ∼01010 + 1 = 10101 + 1 = 10110 and 29 + (−10) = 11101 + 10110 = 110011. Because we always choose the same number of bits for both arguments and the first is larger than the second, there will always be a carry-out. This bit must be discarded, so 29 + (−10) = 10011 = 19.

    The subtraction result is no larger than x, so it has at most the number of bits of x. However, to support two’s complement and the carry-out bit, we need an extra bit; thus, the number of required bits is n + 2, where x has n + 1 bits.
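
    The 29 − 10 example can be checked with a few lines of C++. This is only a sketch under stated assumptions: the function name SubtractTwosComplement and the fixed-width parameter numBits are hypothetical and are not part of GTE; the operands are assumed to fit in numBits bits with x > y and numBits < 32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from GTE): compute x - y in a field of numBits
// bits using two's complement. Assumes x > y and numBits < 32.
uint32_t SubtractTwosComplement(uint32_t x, uint32_t y, int numBits)
{
    uint32_t mask = (1u << numBits) - 1u;  // keeps the low numBits bits
    uint32_t negY = ((~y) & mask) + 1u;    // invert the bits, then add one
    uint32_t sum = x + negY;               // has a carry-out at bit numBits
    return sum & mask;                     // discard the carry-out
}
```

    With numBits = 5, the intermediate values are the ones in the text: the inverted bits of 10 are 10101, adding one gives 10110, and the sum with 11101 is 110011 before the carry-out is discarded.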

    3.4 Unsigned Integer Arithmetic

    In each of multiplication, addition, and subtraction of binary scientific numbers, determining the sign and biased exponent is simple. The heart of the computation reduces to the same operations applied to unsigned integers with an arbitrary number of bits.

    One of the primary concerns when implementing arbitrary precision integer arithmetic is dealing with carries and overflow. A typical way to handle this is to embed the N-bit unsigned integer operands in a native type with 2N bits. For now, we assume that the CPU supports a 64-bit unsigned integer type uint64_t, which is defined in the header <cstdint>. We may write the unsigned integers

    $$\hat{u} = \sum_{i=0}^{i_{\max}-1} u_i \,(2^{32})^i, \qquad \hat{v} = \sum_{j=0}^{j_{\max}-1} v_j \,(2^{32})^j \tag{13}$$

    where u_i and v_j are 32-bit unsigned integers.

    3.4.1 Addition

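    The straightforward algorithm for addition is illustrated for imax = 2 and jmax = 1 (in this illustration the values are the largest block indices, so û has the blocks u2, u1, u0 and v̂ has the blocks v1, v0), computing ŵ = û + v̂ using the old-style notation of stacking the numbers and annotating them with carry values. A specific example is shown on the left, the general layout on the right.

    $$
    \begin{array}{cccc}
    1 & 1 & 0 &   \\
      & 9 & 2 & 5 \\
      &   & 8 & 3 \\ \hline
    1 & 0 & 0 & 8
    \end{array}
    \qquad\qquad
    \begin{array}{ccccl}
    c_3 & c_2 & c_1 &     & \text{carry bits} \\
        & u_2 & u_1 & u_0 & \\
        &     & v_1 & v_0 & \\ \hline
    w_3 & w_2 & w_1 & w_0 & \hat{u} + \hat{v}
    \end{array}
    \tag{14}
    $$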

    There is no carry-in value for the first term, but we can think of c0 = 0. Add u0, v0, and the carry-in, say, p0 = u0 + v0 + c0. The operands are of type uint32_t and, in the worst case, the output p0 must be stored in a uint64_t because its value can be 2^32 or larger. The low-order 32 bits are extracted to obtain w0 = Low(p0) and the high-order 32 bits are extracted to obtain the carry-out c1 = High(p0). The process is repeated for the next u-term, but now the carry value can be positive: p1 = u1 + v1 + c1, w1 = Low(p1), and c2 = High(p1). There are no more v-terms. Repeating the algorithm produces p2 = u2 + c2, w2 = Low(p2), and c3 = High(p2). There are no more u-terms, so w3 = c3.

    Pseudocode for the general algorithm is shown in Listing 3, where the assumption is that the leading blocks of u[] and v[] each contain at least one 1-bit. Also, let imax ≥ jmax.

    Listing 3. Standard addition of unsigned integers.

    // inputs
    int32_t imax, jmax;
    uint32_t u[imax], v[jmax];

    // outputs
    int32_t kmax = imax + 1;  // = max(imax, jmax) + 1
    uint32_t upv[kmax];  // u + v
    upv[kmax - 1] = 0;  // the top block stays zero when there is no final carry-out

    uint64_t carry = 0, sum;
    int32_t i, j;
    for (j = 0; j < jmax; ++j)
    {
        sum = u[j] + (v[j] + carry);
        upv[j] = (uint32_t)(sum & 0x00000000FFFFFFFF);
        carry = (sum >> 32);
    }

    // We have no more v-blocks. Propagate the carry-out if there is one or
    // copy the remaining blocks if there is not.
    i = jmax;
    if (carry > 0)
    {
        for (/**/; i < imax; ++i)
        {
            sum = u[i] + carry;
            upv[i] = (uint32_t)(sum & 0x00000000FFFFFFFF);
            carry = (sum >> 32);
        }
        if (carry > 0)
        {
            upv[i] = (uint32_t)(carry & 0x00000000FFFFFFFF);
        }
    }
    else
    {
        for (/**/; i < imax; ++i)
        {
            upv[i] = u[i];
        }
    }
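
    A compilable variant of the block-wise addition may help when experimenting with the carry logic. The sketch below is not the GTE implementation: the name AddBlocks and the use of std::vector are assumptions made for illustration. Blocks are stored low-order first, and a single loop treats missing blocks of the shorter operand as zero instead of splitting into the two phases of the pseudocode.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from GTE): add two unsigned integers stored as
// 32-bit blocks, low-order block first.
std::vector<uint32_t> AddBlocks(std::vector<uint32_t> const& u,
    std::vector<uint32_t> const& v)
{
    std::size_t const numBlocks = std::max(u.size(), v.size());
    std::vector<uint32_t> upv(numBlocks + 1, 0u);
    uint64_t carry = 0;
    for (std::size_t k = 0; k < numBlocks; ++k)
    {
        // Widen to 64 bits so the block sum cannot overflow.
        uint64_t uBlock = (k < u.size() ? u[k] : 0u);
        uint64_t vBlock = (k < v.size() ? v[k] : 0u);
        uint64_t sum = uBlock + vBlock + carry;
        upv[k] = static_cast<uint32_t>(sum & 0x00000000FFFFFFFFull);
        carry = (sum >> 32);
    }
    upv[numBlocks] = static_cast<uint32_t>(carry);

    // Drop the extra block when there was no final carry-out.
    if (upv.size() > 1 && upv.back() == 0u)
    {
        upv.pop_back();
    }
    return upv;
}
```

    For example, adding 1 to the two-block value with all bits set carries through both blocks and produces a third block equal to 1.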

    3.4.2 Subtraction

    Subtraction of two unsigned integers, û − v̂, uses two’s-complement arithmetic, and we assume that û > v̂ > 0. The result is computed as û + (∼ŵ + 1), where ŵ is equal to v̂ but has the same number of bits as û (think of padding v̂ on the left with the appropriate number of zeros) and where ∼ŵ is the number obtained by inverting the bits of ŵ: a 0-bit is changed to a 1-bit and vice versa. We know that v̂ ≠ 0; thus, ŵ ≠ 0, ∼ŵ cannot have all bits equal to 1, and the addition ∼ŵ + 1 cannot have a carry-out. Once that sum is computed, the sum û + (∼ŵ + 1) is computed using addition of unsigned integers. This sum can have a carry-out, but that bit is irrelevant to the result and can be ignored.

    Pseudocode for the general algorithm is shown in Listing 4, where the assumption is that the leading blocks of u[] and v[] each contain at least one 1-bit. Also, let imax ≥ jmax.

    Listing 4. Two’s-complement subtraction of unsigned integers.

    // inputs
    int32_t imax, jmax;
    uint32_t u[imax], v[jmax];

    // outputs
    uint32_t umv[imax];  // u - v

    // Create the two's-complement number compV[] from v[]. Firstly, negate
    // the bits; secondly, add 1.
    uint32_t compV[imax];
    uint64_t carry, sum;
    int32_t i;
    for (i = 0; i < jmax; ++i)
    {
        compV[i] = ~v[i];
    }
    for (/**/; i < imax; ++i)
    {
        compV[i] = 0xFFFFFFFF;
    }
    carry = 1;
    for (i = 0; i < jmax; ++i)
    {
        sum = compV[i] + carry;
        compV[i] = (uint32_t)(sum & 0x00000000FFFFFFFF);
        carry = (sum >> 32);
    }

    // Add the numbers as positive integers. Set the last block to zero in
    // case no carry-out occurs.
    umv[imax - 1] = 0;
    carry = 0;
    for (i = 0; i < imax; ++i)
    {
        sum = compV[i] + (u[i] + carry);
        umv[i] = (uint32_t)(sum & 0x00000000FFFFFFFF);
        carry = (sum >> 32);
    }

    The subtraction can lead to umv[] having leading zero-valued blocks. The implementation enforces an invariant that the leading block contains at least one 1-bit. The array umv[] is resized to eliminate the leading zero-valued blocks.
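
    The two’s-complement subtraction, including the removal of leading zero-valued blocks, can be sketched in compilable form as follows. This is an illustration rather than the GTE code: the name SubBlocks and the std::vector storage are assumptions, and the "+1" of the two’s complement is folded into the initial carry instead of being applied in a separate pass over compV[]. The function assumes u > v > 0 with blocks stored low-order first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from GTE): compute u - v for u > v > 0, with
// the integers stored as 32-bit blocks, low-order block first.
std::vector<uint32_t> SubBlocks(std::vector<uint32_t> const& u,
    std::vector<uint32_t> const& v)
{
    std::vector<uint32_t> umv(u.size(), 0u);
    uint64_t carry = 1;  // the "+1" of the two's complement
    for (std::size_t i = 0; i < u.size(); ++i)
    {
        // Pad v on the left with zero blocks, then invert the bits.
        uint32_t vBlock = (i < v.size() ? v[i] : 0u);
        uint32_t compBlock = ~vBlock;
        uint64_t sum = static_cast<uint64_t>(u[i]) + compBlock + carry;
        umv[i] = static_cast<uint32_t>(sum & 0x00000000FFFFFFFFull);
        carry = (sum >> 32);
    }
    // The final carry-out is irrelevant to the result and is ignored.

    // Enforce the invariant that the leading block is nonzero.
    while (umv.size() > 1 && umv.back() == 0u)
    {
        umv.pop_back();
    }
    return umv;
}
```

    Subtracting 1 from 2^64 (three blocks) yields two blocks with all bits set, which exercises both the borrow propagation and the resizing step.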

    3.4.3 Multiplication

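    The straightforward algorithm for multiplication is illustrated for imax = 0 and jmax = 2 (again the largest block indices, so û has the single block u0 and v̂ has the blocks v2, v1, v0), computing ŵ = û ∗ v̂ using the old-style notation of stacking the numbers and annotating them with carry values. A specific example is shown on the left, the general layout on the right.

    $$
    \begin{array}{cccc}
    2 & 1 & 3 &   \\
      & 3 & 2 & 5 \\
      &   &   & 7 \\ \hline
    2 & 2 & 7 & 5
    \end{array}
    \qquad\qquad
    \begin{array}{ccccl}
    c_3 & c_2 & c_1 &     & \text{carry bits} \\
        & v_2 & v_1 & v_0 & \\
        &     &     & u_0 & \\ \hline
    w_3 & w_2 & w_1 & w_0 & \hat{u} * \hat{v}
    \end{array}
    \tag{15}
    $$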

    There is no carry-in value for the first term, but we can think of c0 = 0. Multiply v0 by u0 and add the carry-in, say, p0 = u0 ∗ v0 + c0. The operands are of type uint32_t and, in the worst case, the output p0 must be stored in a uint64_t because its value can be 2^32 or larger. The low-order 32 bits are extracted to obtain w0 = Low(p0) and the high-order 32 bits are extracted to obtain the carry-out c1 = High(p0). The process is repeated for the next v-term, but now the carry value can be positive: p1 = u0 ∗ v1 + c1, w1 = Low(p1), and c2 = High(p1). One more repetition produces p2 = u0 ∗ v2 + c2, w2 = Low(p2), and c3 = High(p2). There are no more v-terms, so w3 = c3.

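    Now consider the case imax = 1. A specific example is shown on the left, the general layout on the right.

    $$
    \begin{array}{ccccc}
      & 1 & 1 & 2 &   \\
      & 2 & 1 & 3 &   \\
      &   & 3 & 2 & 5 \\
      &   &   & 5 & 7 \\ \hline
    0 & 0 & 1 &   &   \\
      & 2 & 2 & 7 & 5 \\
    1 & 6 & 2 & 5 &   \\ \hline
    1 & 8 & 5 & 2 & 5
    \end{array}
    \qquad\qquad
    \begin{array}{cccccl}
        & d_3 & d_2 & d_1 &     & \text{carry bits for } \hat{q} = u_1 * \hat{v} \\
        & c_3 & c_2 & c_1 &     & \text{carry bits for } \hat{w} = u_0 * \hat{v} \\
        &     & v_2 & v_1 & v_0 & \\
        &     &     & u_1 & u_0 & \\ \hline
    e_3 & e_2 & e_1 &     &     & \text{carry bits for } \hat{w} + \hat{q} * 2^{32} \\
        & w_3 & w_2 & w_1 & w_0 & \hat{w} = u_0 * \hat{v} \\
    q_3 & q_2 & q_1 & q_0 &     & \hat{q} = u_1 * \hat{v} \\ \hline
    s_4 & s_3 & s_2 & s_1 & s_0 & \hat{u} * \hat{v}
    \end{array}
    \tag{16}
    $$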

    As in the previous example, the initial carry-in values are c_0 = 0, d_0 = 0, and e_0 = 0. The product of u_0 with v̂ is computed by p_j = u_0 ∗ v_j + c_j, w_j = Low(p_j), and c_{j+1} = High(p_j). The product of u_1 with v̂ is computed by p_j = u_1 ∗ v_j + d_j, q_j = Low(p_j), and d_{j+1} = High(p_j). The sum of products is computed by w_4 = 0, s_0 = w_0, p_j = w_{j+1} + q_j + e_j, s_{j+1} = Low(p_j), and e_{j+1} = High(p_j).

    Pseudocode for the general algorithm is shown in Listing 5, where the assumption is that the leading blocks of u[] and v[] each contain at least one 1-bit.

    Listing 5. Standard multiplication of unsigned integers.

    // inputs
    int32_t imax, jmax;
    uint32_t u[imax], v[jmax];

    // outputs
    int32_t kmax = imax + jmax;
    uint32_t utv[kmax + 1];  // u * v

    SetToZero(utv, kmax);  // utv is the accumulator of u[i]*v
    uint32_t product[kmax];
    int32_t i, j, k;
    for (i = 0; i < imax; ++i)
    {
        // Compute the product p = u[i]*v.
        uint64_t uBlock = u[i], carry = 0;
        for (j = 0, k = i; j < jmax; ++j, ++k)
        {
            uint64_t vBlock = v[j];
            uint64_t term = uBlock * vBlock + carry;
            product[k] = (uint32_t)(term & 0x00000000FFFFFFFF);
            carry = (term >> 32);
        }
        if (k < kmax)
        {
            product[k] = (uint32_t)carry;
        }

        // Add p to the accumulator utv.
        uint64_t sum;
        carry = 0;
        for (j = 0, k = i; j < jmax; ++j, ++k)
        {
            sum = product[k] + (utv[k] + carry);
            utv[k] = (uint32_t)(sum & 0x00000000FFFFFFFF);
            carry = (sum >> 32);
        }
        if (k < kmax)
        {
            sum = product[k] + carry;  // utv[k] == 0 is guaranteed
            utv[k] = (uint32_t)(sum & 0x00000000FFFFFFFF);
        }
    }

    For inputs with a large number of bits, the initialization of utv[] to zero is expensive (based on analysis with a profiler). The GTE implementation avoids this with some extra logic, setting the minimum number of required elements of utv[] to u[0]*v on the first pass and then accumulating the remaining u[i]*v products on subsequent passes.
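
    For experimentation, the schoolbook multiplication can be written compactly in compilable C++. The sketch below is an assumption-laden illustration, not the GTE implementation: the name MulBlocks and the std::vector storage are hypothetical, and the partial product and the accumulation are fused into one inner loop rather than using a separate product[] array. The fused term cannot overflow 64 bits because (2^32 − 1)^2 + 2(2^32 − 1) = 2^64 − 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from GTE): multiply two unsigned integers stored
// as 32-bit blocks, low-order block first.
std::vector<uint32_t> MulBlocks(std::vector<uint32_t> const& u,
    std::vector<uint32_t> const& v)
{
    std::vector<uint32_t> utv(u.size() + v.size(), 0u);
    for (std::size_t i = 0; i < u.size(); ++i)
    {
        uint64_t const uBlock = u[i];
        uint64_t carry = 0;
        for (std::size_t j = 0; j < v.size(); ++j)
        {
            // Partial product plus the accumulator block plus the carry-in.
            uint64_t term = uBlock * v[j] + utv[i + j] + carry;
            utv[i + j] = static_cast<uint32_t>(term & 0x00000000FFFFFFFFull);
            carry = (term >> 32);
        }
        // This block has not been written yet, so it receives the carry.
        utv[i + v.size()] = static_cast<uint32_t>(carry);
    }

    // Remove leading zero-valued blocks.
    while (utv.size() > 1 && utv.back() == 0u)
    {
        utv.pop_back();
    }
    return utv;
}
```

    The product of two single blocks with all bits set is 0xFFFFFFFE00000001, which appears as the two blocks 0x00000001 (low) and 0xFFFFFFFE (high).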

    3.4.4 Shift Left

    The algorithms for addition and subtraction involved computing v̂ = û ∗ 2^σ, where û is an odd integer and σ > 0. This is equivalent to applying a left shift of the bits of û by σ bits.

    To illustrate, let û = 0x08192A3B4C5D6E7F and σ = 40. The input requires two 32-bit blocks, is odd, and has 60 bits starting with the leading 1-bit at index 59. The trailing 1-bit, of course, is at index 0. The left shift by σ moves the leading 1-bit to index 99, so v̂ requires four 32-bit blocks. Also, v̂ has one trailing zero-valued block because σ/32 = 1 (using integer division); thus, v[0] = 0. The trailing 1-bit of v̂ is at index σ % 32 = 8 in v[1]. The leading 1-bit is in v[3]. Figure 3 illustrates the input and output.

    Figure 3. Left shift of an unsigned integer by 40.

    Pseudocode for the general algorithm is shown in Listing 6.

    Listing 6. Shift left of unsigned integers.

    // inputs
    int32_t shift;     // the shift amount, a positive number
    int32_t numUBits;  // the number of bits of u
    int32_t numShiftedUBits = numUBits + shift;
    int32_t imax = 1 + (numUBits - 1) / 32;
    uint32_t u[imax];

    // outputs
    int32_t jmax = 1 + (numShiftedUBits - 1) / 32;
    uint32_t shiftedU[jmax];  // (u << shift)

    // Set the low-order blocks to zero.
    int32_t i, j;
    int32_t shiftBlock = shift / 32;
    for (j = 0; j < shiftBlock; ++j)
    {
        shiftedU[j] = 0;
    }

    // Shift the bits of u into the remaining blocks.
    int32_t leftShift = shift % 32;
    if (leftShift > 0)
    {
        // The trailing 1-bits for source and target are at different relative
        // indices. Each shifted source block straddles a boundary between two
        // target blocks, so we must extract the subblocks and copy accordingly.
        int32_t rightShift = 32 - leftShift;
        uint32_t previous = 0;
        for (j = shiftBlock, i = 0; i < imax; ++j, ++i)
        {
            uint32_t current = u[i];
            shiftedU[j] = (current << leftShift) | (previous >> rightShift);
            previous = current;
        }
        if (j < jmax)
        {
            // The leading 1-bit of the source is at a relative index such that
            // when you left-shift, that bit occurs in a new block.
            shiftedU[j] = (previous >> rightShift);
        }
    }
    else
    {
        // The trailing 1-bits for source and target are at the same relative
        // index. The shift reduces to a block copy.
        for (j = shiftBlock, i = 0; i < imax; ++j, ++i)
        {
            shiftedU[j] = u[i];
        }
    }

    In the example of Figure 3, imax = 2 and jmax = 4. The left shift is 8, which is positive. In the block of code handling this case, the loop terminates with j = 3, which is smaller than jmax, so the if statement after that loop is executed. This indicates that the leading 1-bit of û was relatively shifted out of the 32-bit block that contained it into the left-adjacent block. If instead σ = 36, we would have found that jmax = 3. The leading 1-bit would have stayed in the block and the loop would exit with j = 3, in which case the if statement would not be executed. When the left shift is 0, the relative locations of the source’s leading 1-bit and the target’s leading 1-bit are the same, in which case no subblock shifting needs to be performed.
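
    The worked example with û = 0x08192A3B4C5D6E7F can be reproduced with a compilable sketch of the left shift. The function name ShiftLeft and the std::vector storage are assumptions for illustration only; the logic follows the description above (zero the trailing blocks, then combine adjacent subblocks).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from GTE): shift left by 'shift' > 0 bits an
// unsigned integer with numUBits bits, stored low-order block first.
std::vector<uint32_t> ShiftLeft(std::vector<uint32_t> const& u,
    int32_t numUBits, int32_t shift)
{
    int32_t const imax = 1 + (numUBits - 1) / 32;
    int32_t const jmax = 1 + (numUBits + shift - 1) / 32;
    std::vector<uint32_t> shifted(jmax, 0u);  // low-order blocks start at zero

    int32_t const shiftBlock = shift / 32;
    int32_t const leftShift = shift % 32;
    int32_t i, j;
    if (leftShift > 0)
    {
        // Each source block straddles two target blocks.
        int32_t const rightShift = 32 - leftShift;
        uint32_t previous = 0;
        for (j = shiftBlock, i = 0; i < imax; ++j, ++i)
        {
            uint32_t current = u[i];
            shifted[j] = (current << leftShift) | (previous >> rightShift);
            previous = current;
        }
        if (j < jmax)
        {
            // The leading 1-bit moved into a new block.
            shifted[j] = (previous >> rightShift);
        }
    }
    else
    {
        // The shift is a whole number of blocks: plain block copy.
        for (j = shiftBlock, i = 0; i < imax; ++j, ++i)
        {
            shifted[j] = u[i];
        }
    }
    return shifted;
}
```

    With the blocks {0x4C5D6E7F, 0x08192A3B}, numUBits = 60, and shift = 40, the result has four blocks: 0x00000000, 0x5D6E7F00, 0x192A3B4C, and 0x00000008. With shift = 36 the result has three blocks.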

    3.4.5 Shift Right to Odd Number

    A special case in the algorithms for addition and subtraction involved computing the sum of two odd integers or the difference of two odd integers where the first integer is larger than the second. In both cases the result is a positive even integer. We need to apply a shift right of the bits to obtain an odd number, and we need the shift amount for computing the biased exponents of the BSNumber.

    The example in the section on left shifting can be used to illustrate right shifting, but with v̂ as the input and û as the output. The algorithm is similar to that of left shifting, except that the direction of subblock shifts is swapped and we determine the actual shift amount knowing that we need to obtain an odd integer.

    Pseudocode for the general algorithm is shown in Listing 7.

    Listing 7. Shift right of an unsigned integer to create an odd number.

    // inputs
    int32_t numUBits;  // the number of bits of u
    int32_t imax = 1 + (numUBits - 1) / 32;
    uint32_t u[imax];

    // Get the leading 1-bit.
    int32_t firstBitIndex = 32 * (imax - 1) + GetLeadingBit(u[imax - 1]);

    // Get the trailing 1-bit.
    int32_t i, j, lastBitIndex;
    for (i = 0; i < imax; ++i)
    {
        if (u[i] > 0)
        {
            lastBitIndex = 32 * i + GetTrailingBit(u[i]);
            break;
        }
    }

    // outputs
    int32_t numShiftedUBits = firstBitIndex - lastBitIndex + 1;
    int32_t jmax = 1 + (numShiftedUBits - 1) / 32;
    uint32_t shiftedU[jmax];  // (u >> shift)
    int32_t shiftBlock = lastBitIndex / 32;
    int32_t rshift = lastBitIndex % 32;
    if (rshift > 0)
    {
        int32_t lshift = 32 - rshift;
        i = shiftBlock;
        uint32_t curr = u[i++];
        for (j = 0; i < imax; ++j, ++i)
        {
            uint32_t next = u[i];
            shiftedU[j] = (curr >> rshift) | (next << lshift);
            curr = next;
        }
        if (j < jmax)
        {
            shiftedU[j] = (curr >> rshift);
        }
    }
    else
    {
        for (j = 0, i = shiftBlock; j < jmax; ++j, ++i)
        {
            shiftedU[j] = u[i];
        }
    }

    shift = rshift + 32 * shiftBlock;

    3.4.6 Comparisons

    The comparison operations are based on the unsigned integers being members of the binary scientific representation. They are called only when the two BSNumber arguments to BSNumber::operatorX are of the form 1.u ∗ 2^p and 1.v ∗ 2^p. The less-than comparison applies to 1.u and 1.v as unsigned integers with their leading 1-bits aligned.

    Pseudocode for the equality comparison is shown in Listing 8.

    Listing 8. Equality comparison of two UInteger objects.

    bool Equal(UInteger u[imax], UInteger v[jmax])
    {
        if (u.numBits != v.numBits)
        {
            return false;
        }

        if (u.numBits > 0)
        {
            for (int32_t i = imax - 1; i >= 0; --i)
            {
                if (u[i] != v[i])
                {
                    return false;
                }
            }
        }
        return true;
    }

    Pseudocode for the less-than comparison is shown in Listing 9.

    Listing 9. Less-than comparison of two UInteger objects.

    bool LessThan(UInteger u[imax], UInteger v[jmax]) const
    {
        if (u.numBits > 0 && v.numBits > 0)
        {
            // The numbers must be compared as if their leading 1-bits are aligned.
            int bitIndex0 = u.numBits - 1;
            int bitIndex1 = v.numBits - 1;
            int block0 = bitIndex0 / 32;
            int block1 = bitIndex1 / 32;
            int numBlockBits0 = 1 + (bitIndex0 % 32);
            int numBlockBits1 = 1 + (bitIndex1 % 32);
            uint64_t n0shift = u[block0];
            uint64_t n1shift = v[block1];
            while (block0 >= 0 && block1 >= 0)
            {
                // Shift the bits in the leading blocks to the high-order bit.
                uint32_t value0 =
                    (uint32_t)((n0shift << (32 - numBlockBits0)) & 0x00000000FFFFFFFF);
                uint32_t value1 =
                    (uint32_t)((n1shift << (32 - numBlockBits1)) & 0x00000000FFFFFFFF);

                // Shift bits in the next block (if any) to fill the current block.
                if (--block0 >= 0)
                {
                    n0shift = u[block0];
                    value0 |= (uint32_t)((n0shift >> numBlockBits0) & 0x00000000FFFFFFFF);
                }
                if (--block1 >= 0)
                {
                    n1shift = v[block1];
                    value1 |= (uint32_t)((n1shift >> numBlockBits1) & 0x00000000FFFFFFFF);
                }
                if (value0 < value1)
                {
                    return true;
                }
                if (value0 > value1)
                {
                    return false;
                }
            }
            return block0 < block1;
        }
        else
        {
            // One or both numbers are zero. The only time 'less than' is
            // 'true' is when v is positive.
            return (v.numBits > 0);
        }
    }

    The other comparisons may be implemented based on equality and less-than comparisons.

    3.5 Conversion of Floating-Point Numbers to Binary Scientific Numbers

    The conversion algorithm is described for type float. The algorithm for type double is similar. In fact, the implementation uses a template function that handles either type.

    Listing 1 contains a skeleton for the various flavors of float numbers. The zeros +0 and −0 both map to the same BSNumber, the one whose sign is zero, whose biased exponent is zero, and whose unsigned integer is zero. In the following discussion, it is clear how to extract the sign from a floating-point number and set the binary scientific number sign, so we will consider only positive numbers.

    The subnormals are of the form 0.t ∗ 2^{−126}, where t > 0 is the trailing significand with 23 bits. Let the first 1-bit of t occur at index f and let the last 1-bit occur at index ℓ; then the odd integer û = t_f t_{f−1} ⋯ t_ℓ has f − ℓ + 1 bits. The biased exponent is b = −149 + ℓ. Thus, 0.t ∗ 2^{−126} = û ∗ 2^b = 1.u ∗ 2^p, where p = −149 + f. Pseudocode is shown in Listing 10.

    Listing 10. Conversion of a subnormal float to a BSNumber.

    float subnormal;
    BSNumber bsn;
    uint32_t s = subnormal.GetSign();
    uint32_t e = subnormal.GetBiasedExponent();  // e = 0 for subnormals
    uint32_t t = subnormal.GetTrailingSignificand();
    int32_t last = GetTrailingBit(t);
    int32_t diff = 23 - last;
    bsn.sign = (s > 0 ? -1 : 1);
    bsn.biasedExponent = -126 - diff;
    bsn.uinteger = (t >> last);

    The normals are of the form 1.t ∗ 2^{e−127}, where t ≥ 0 is the trailing significand with 23 bits. If t = 0, the biased exponent is b = e − 127, so 1.t ∗ 2^{e−127} = 1 ∗ 2^{e−127} = û ∗ 2^b, where û = 1. If t > 0, let the last 1-bit of t occur at index ℓ. The leading 1-bit of the floating-point number is implicit; that is, when you extract the trailing significand from the encoding, you will obtain t as an integer with 23 bits. You must append a 1 using an OR-operation at index 23. The first 1-bit of 1t occurs at index f = 23. The number of bits is f − ℓ + 1 = 24 − ℓ. Because 1.t ∗ 2^{e−127} = (1t) ∗ 2^{e−150}, the biased exponent is b = e − 150 + ℓ, which matches the value e − 127 − (23 − ℓ) computed in the pseudocode. Thus, 1.t ∗ 2^{e−127} = û ∗ 2^b, where û = 1 t_22 ⋯ t_ℓ. Pseudocode is shown in Listing 11.

    Listing 11. Conversion of a normal float to a BSNumber.

    float normal;
    BSNumber bsn;
    uint32_t s = normal.GetSign();
    uint32_t e = normal.GetBiasedExponent();  // 0 < e < 255 for normals
    uint32_t t = normal.GetTrailingSignificand();
    if (t > 0)
    {
        int32_t last = GetTrailingBit(t);
        int32_t diff = 23 - last;
        bsn.sign = (s > 0 ? -1 : 1);
        bsn.biasedExponent = e - 127 - diff;
        bsn.uinteger = ((t | (1 << 23)) >> last);
    }
    else
    {
        bsn.sign = (s > 0 ? -1 : 1);
        bsn.biasedExponent = e - 127;
        bsn.uinteger = 1;
    }
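
    The zero, subnormal, and normal cases can be combined into one compilable sketch. Everything named here is an assumption made for illustration: BSFields is a hypothetical stand-in for BSNumber, ConvertFinite is not a GTE function, and GetTrailingBit is implemented as a simple loop. The code assumes float is an IEEE 754 binary32 type and that the input is finite.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for BSNumber: value = sign * uinteger * 2^biasedExponent.
struct BSFields
{
    int32_t sign;            // -1, 0, or +1
    int32_t biasedExponent;
    uint32_t uinteger;       // odd, or 0 for the number zero
};

// Index of the lowest-order 1-bit of a nonzero value.
int32_t GetTrailingBit(uint32_t value)
{
    int32_t index = 0;
    while ((value & 1u) == 0u)
    {
        value >>= 1;
        ++index;
    }
    return index;
}

// Convert a finite float (assumed IEEE 754 binary32) to the fields above.
BSFields ConvertFinite(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    uint32_t s = bits >> 31;
    uint32_t e = (bits >> 23) & 0xFFu;
    uint32_t t = bits & 0x007FFFFFu;

    BSFields bsn{0, 0, 0u};
    if (e == 0u && t == 0u)
    {
        return bsn;  // +0 and -0 map to the same representation
    }

    // Normals have an implicit leading 1 at index 23 and scale 2^(e-150);
    // subnormals have no implicit bit and scale 2^(-149).
    uint32_t significand = (e > 0u ? (t | (1u << 23)) : t);
    int32_t exponent = (e > 0u ? static_cast<int32_t>(e) - 150 : -149);
    int32_t last = GetTrailingBit(significand);
    bsn.sign = (s > 0u ? -1 : 1);
    bsn.biasedExponent = exponent + last;
    bsn.uinteger = significand >> last;
    return bsn;
}
```

    For example, −0.75 = −3 ∗ 2^{−2}, so the fields are sign = −1, biasedExponent = −2, and uinteger = 3.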

    In practice you will use the class BSNumber only for finite floating-point numbers, so you should ensure that the inputs you manipulate are not infinities or NaNs. In our implementation, if the floating-point number is an infinity, we convert it to ±2^{128}; the sign is chosen to be that of the floating-point number. However, we do have a warning assertion that is triggered in our logging system that lets you know when such a number is encountered. If we encounter a NaN, we convert the number to zero and issue an error assertion. Pseudocode is shown in Listing 12.

    Listing 12. Conversion of an infinity or NaN float to a BSNumber.

    float special;  // infinity, NaN
    BSNumber bsn;
    uint32_t s = special.GetSign();
    uint32_t e = special.GetBiasedExponent();  // e = 255 for specials
    uint32_t t = special.GetTrailingSignificand();
    if (t == 0)  // infinities
    {
        // Warning: Input is an infinity.
        bsn.sign = (s > 0 ? -1 : 1);
        bsn.biasedExponent = 128;
        bsn.uinteger = 1;
    }
    else
    {
        // Error: Input is a NaN (quiet or signaling).
        bsn.sign = 0;
        bsn.biasedExponent = 0;
        bsn.uinteger = 0;
    }

    3.6 Conversion of Binary Scientific Numbers to Floating-Point Numbers

    The conversion algorithm is described for type float. The algorithm for type double is similar. In fact, the implementation uses a template function that handles either type.

    The conversion from a BSNumber to a float is generally not exact. Typically, a sequence of operations with binary scientific numbers will produce a result with more bits than the 24 bits of precision for float. We choose to convert the number using the default rounding mode for floating-point arithmetic: round-to-nearest, ties-to-even. Once again it is easy to handle the sign information, so the discussion is about positive binary scientific numbers.

    Figure 4 illustrates the real number line and the location of floating-point numbers on it.

    Figure 4. The real number line and locations of some float numbers.

    Underflow occurs in the interval of real numbers (0, 2^{−149}), which cannot be represented by float numbers. Subnormals occur in the interval of real numbers [2^{−149}, 2^{−126}). Normals occur in the interval of real numbers [2^{−126}, 2^{128}). Overflow occurs in the interval of real numbers [2^{128}, ∞), which cannot be represented by float numbers. The floating-point number 0.0^{22}1 ∗ 2^{−126} = 2^{−149} is the smallest subnormal. The notation 0^{22} means that 0 is repeated 22 times. The largest subnormal is 0.1^{23} ∗ 2^{−126}. The smallest normal is 2^{−126} and the largest normal is 1.1^{23} ∗ 2^{127}.

    Let x be a BSNumber in the interval (0, 2^{−149}). The two closest float numbers are the interval endpoints. The midpoint of the interval is 2^{−150}. If x ≤ 2^{−150}, we round to zero. If x > 2^{−150}, we round to 2^{−149}. Pseudocode is shown in Listing 13.

    Listing 13. Conversion of a very small positive BSNumber to float.

    BSNumber bsn;  // assume bsn is not zero
    uint32_t s = (bsn.sign < 0 ? 1 : 0);
    uint32_t e;  // biased exponent for float
    uint32_t t;  // trailing significand for float
    int32_t p = bsn.GetExponent();  // biasedExponent + numBits - 1
    if (p < -149)
    {
        if (p < -150 || bsn.uinteger == 1)
        {
            // round to 0 (when bsn.uinteger is 1, this is a tie, so round to even)
            e = 0;
            t = 0;
        }
        else
        {
            // round to 2^{-149}
            e = 0;
            t = 1;
        }
    }
    // else: other cases discussed later

    float result = CreateIEEEFloat(s, e, t);  // as uint32_t, (s << 31) | (e << 23) | t

    [...] f > 1/2 when that bit is not the last bit of u, in which case we round up. If that bit is the last bit of u, then f = 1/2 and we round down when the previous bit is 0 or round up when the previous bit is 1.

    Table 1 shows the information that is computed in the code to determine the trailing significand and rounding. The bits extracted from û are stored in a uint32_t prefix and are left-aligned to start at bit-index 31 of the prefix.

    Table 1. Layouts for the bits of a BSNumber that produces a float subnormal.

    σ     0.t̂                           bits to extract from û    prefix index of first bit of f
    0     0.1u                          24                        8
    1     0.01u                         23                        9
    2     0.001u                        22                        10
    ...   ...                           ...                       ...
    22    0.00000000000000000000001u    2                         30

    Pseudocode is shown in Listing 14.

    Listing 14. Conversion of a BSNumber to a subnormal float.

    BSNumber bsn;  // assume bsn is not zero
    uint32_t s = (bsn.sign < 0 ? 1 : 0);
    uint32_t e;  // biased exponent for float
    uint32_t t;  // trailing significand for float
    int32_t p = bsn.GetExponent();  // biasedExponent + numBits - 1
    if (-149 [...]

    [...]
        uint32_t prefix = uinteger.GetPrefix(numRequested);

        // The index into 'prefix' of the first bit of f, used for rounding.
        int32_t roundBitIndex = 32 - numRequested;
        uint32_t mask = (1 [...] (roundBitIndex + 1));

        // Apply the rounding.
        trailing += round;
        return trailing;
    }

    The extraction of the prefix is shown in Listing 16.

    Listing 16. Extraction of the requested bits from û and stored in a prefix.

    uint32_t UInteger::GetPrefix(int numRequested)
    {
        // The UInteger is an array u[imax] with u[0] odd and u[imax-1]
        // nonzero. The number of bits starting with the leading 1-bit is
        // numBits.

        // Copy to 'prefix' the leading 32-bit block that is nonzero.
        int32_t bitIndex = numBits - 1;
        int32_t blockIndex = bitIndex / 32;
        uint32_t prefix = u[blockIndex];

        // Get the number of bits in the block starting with the leading 1-bit.
        int32_t firstBitIndex = bitIndex % 32;
        int32_t numBlockBits = firstBitIndex + 1;

        // Shift the leading 1-bit to index 31 of prefix. We have consumed
        // numBlockBits, which might not be the entire budget.
        int32_t targetIndex = 31;
        prefix [...] = 0)
        {
            // More bits are available. Copy and shift the entire 32-bit next
            // block and OR it into the 'prefix'.
            uint32_t nextBlock = u[blockIndex];
            targetIndex -= numBlockBits;
            nextBlock [...]

    [...]

    Listing 18. Construction of the significand for the normal.

    uint32_t BSNumber::GetNormalTrailing()
    {
        // Extract leading bits from hat{u}, shifted so first one is at index 31 of prefix.
        int32_t numRequested = 25;  // Get 1 extra bit for rounding.
        uint32_t prefix = uinteger.GetPrefix(numRequested);

        // The index into 'prefix' of the first bit of f, used for rounding.
        int32_t roundBitIndex = 32 - numRequested;
        uint32_t mask = (1 [...] (roundBitIndex + 1));

        // Apply the rounding.
        trailing += round;
        return trailing;
    }

    The code has only minor differences compared to GetSubnormalTrailing. The actual code combines these and uses a template to handle both float and double. The consolidation then requires the prefix to be stored in a 64-bit unsigned integer because the worst case for double precision is a prefix storing 54 bits. The GetPrefix function also requires some modification to support double because you might have to extract the bits from 3 blocks of û rather than the 2 blocks required by float.

    When x is in the interval [2^{128}, ∞), we have overflow and the BSNumber is mapped to the floating-point infinity.

    4 Binary Scientific Rationals

    As stated previously, a binary scientific number has a finite number of bits in its representation. The sum, difference, and product of two binary scientific numbers also have a finite number of bits. However, the quotient of two binary scientific numbers might require an infinite number of bits; for example, 1/3 has no finite binary expansion. To support arbitrary precision arithmetic for algorithms that require divisions, we can construct ratios of numbers in the same way that the rational numbers are constructed from integers.

    Specifically, let B be the set of binary scientific numbers. Let x ∈ B and y ∈ B where y ≠ 0. Ratios x/y can be represented as 2-tuples

    $$R = \{(x, y) : x \in B,\; y \in B,\; y \neq 0\} \tag{18}$$

    As with rational numbers, multiple representations can occur. For example, 1/3 and 2/6 represent the same rational number. Also, 0/1 and 0/2 are both representations for zero. It is possible to reduce a ratio to a canonical form by cancelling factors common to both numbers and arranging for the denominators to be positive. For example, the ratio 2/6 can be reduced by cancelling the common factor 2 in the numerator and denominator. The ratio 2/(−6) is reduced to −1/3. The rational number zero has canonical form 0/1. We will refer to R as the binary scientific rationals.

    4.1 Arithmetic Operations

    Arithmetic operations for rational numbers are defined as follows.

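    $$
    \begin{aligned}
    \frac{x_0}{y_0} + \frac{x_1}{y_1} &= \frac{x_0 * y_1 + x_1 * y_0}{y_0 * y_1}, && \text{addition} \\
    \frac{x_0}{y_0} - \frac{x_1}{y_1} &= \frac{x_0 * y_1 - x_1 * y_0}{y_0 * y_1}, && \text{subtraction} \\
    \frac{x_0}{y_0} * \frac{x_1}{y_1} &= \frac{x_0 * x_1}{y_0 * y_1}, && \text{multiplication} \\
    \frac{x_0}{y_0} \Big/ \frac{x_1}{y_1} &= \frac{x_0 * y_1}{x_1 * y_0}, && \text{division}
    \end{aligned}
    \tag{19}
    $$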

    Division has the additional constraint x_1 ≠ 0. Equivalent arithmetic operations can be defined for the 2-tuples of R. Let r_0 = (x_0, y_0) and r_1 = (x_1, y_1); then

    $$
    \begin{aligned}
    r_0 + r_1 &= (x_0 * y_1 + x_1 * y_0,\; y_0 * y_1), && \text{addition} \\
    r_0 - r_1 &= (x_0 * y_1 - x_1 * y_0,\; y_0 * y_1), && \text{subtraction} \\
    r_0 * r_1 &= (x_0 * x_1,\; y_0 * y_1), && \text{multiplication} \\
    r_0 \,/\, r_1 &= (x_0 * y_1,\; x_1 * y_0), && \text{division}
    \end{aligned}
    \tag{20}
    $$

    In all cases, the components of the 2-tuples are expressions involving addition, subtraction, and multiplication of binary scientific numbers.
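
    The 2-tuple operations translate directly into code. In the sketch below, int64_t is used as a stand-in for a binary scientific number so that the example is self-contained; the type Ratio and the function names are hypothetical and are not GTE names. No reduction to canonical form is performed, so equality is tested by cross-multiplication, and the small operands are assumed not to overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2-tuple (x, y) with y != 0; int64_t stands in for BSNumber.
struct Ratio
{
    int64_t x;  // numerator
    int64_t y;  // denominator
};

Ratio Add(Ratio r0, Ratio r1) { return Ratio{r0.x * r1.y + r1.x * r0.y, r0.y * r1.y}; }
Ratio Sub(Ratio r0, Ratio r1) { return Ratio{r0.x * r1.y - r1.x * r0.y, r0.y * r1.y}; }
Ratio Mul(Ratio r0, Ratio r1) { return Ratio{r0.x * r1.x, r0.y * r1.y}; }

// Division requires r1.x != 0.
Ratio Div(Ratio r0, Ratio r1) { return Ratio{r0.x * r1.y, r1.x * r0.y}; }

// Two tuples represent the same rational when the cross products agree.
bool Equal(Ratio r0, Ratio r1) { return r0.x * r1.y == r1.x * r0.y; }
```

    For instance, Add applied to (1, 3) and (1, 6) produces the tuple (9, 18), which represents the same rational number as (1, 2).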

    4.2 Conversion of Floating-Point Numbers to Binary Scientific Rationals

    Because a binary scientific rational is represented as a pair of binary scientific numbers, the floating-point number x is converted to a binary scientific number for the numerator and the floating-point number 1 is converted to a binary scientific number for the denominator.

    If you want to represent a floating-point ratio x/y (with y ≠ 0) as a binary scientific rational, you may construct a binary scientific number n for the numerator x and a binary scientific number d for the denominator y. The pair (n, d) represents the ratio. In GTE, we use a canonical form for such ratios. If n = 1.u ∗ 2^p and d = 1.v ∗ 2^q, we represent the ratio as n′ = 1.u ∗ 2^{p−q} and d′ = 1.v; therefore d′ ∈ [1, 2).

    4.3 Conversion of Binary Scientific Rationals to Floating-Point Numbers

    Let the binary scientific rational have numerator n = 1.u ∗ 2p and denominator d = 1.v ∗ 2q 6= 0. The ratiois abstractly of the form

    n

    d=

    1.u1.v ∗ 2p−q, 1.u ≥ 1.v2∗(1.u)1.v ∗ 2

    p−q−1, 1.u < 1.v(21)

    The ratios on the right-hand side are rational numbers in the interval [1, 2), so we may represent the right-hand side as 1.w ∗ 2r with the understanding that w is potentially an infinite sequence of bits.

    To compute the exponent r for the result, we can use the comparison operator for binary scientific numbers. Compute r = p − q and modify n = 1.u and d = 1.v. If n < d, subtract 1 from r (it is now p − q − 1) and add 1 to the exponent for n (it is now 2 ∗ (1.u) and n ≥ d). At this time we have n/d = 1.w and we need to compute enough bits of w to obtain the floating-point number closest to n/d using the mode of round-to-nearest-ties-to-even.

    The conversion algorithm uses binary scientific rational arithmetic. We know that n/d = 1.w_0 w_1 .... Subtracting 1 and multiplying by 2, we have 2(n − d)/d = w_0.w_1 .... The new numerator is n_0 = 2(n − d), so n_0/d = w_0.w_1 .... If n_0 ≥ d, then w_0 = 1 and we can repeat the algorithm to move w_1 before the binary point. If n_0 < d, then w_0 = 0; the next numerator is simply n_1 = 2 n_0 because no subtraction by 1 is necessary. The algorithm is repeated until we have 23 w_i bits for float or 52 w_i bits for double. As the bits are discovered, they are OR-ed into an unsigned integer to be used for the trailing significand for the floating-point number.

    We need to examine the remaining bits of w to determine in which direction to round. At this time in the algorithm, n/d = w_k.w_{k+1} ... ∈ [0, 2) and the trailing significand is the unsigned integer t (with either 23 or 52 bits). We must round the number t.f, where f = 0.w_k w_{k+1} .... If n′ = n − d, the classification of the fraction is f < 1/2 when n′ < 0, f = 1/2 when n′ = 0, and f > 1/2 when n′ > 0. Rounding down uses t as is. Rounding up occurs when n′ > 0 or when n′ = 0 and the last bit of the trailing significand is t_0 = 1.

    We can take advantage of the conversions of binary scientific numbers to floating-point numbers by generating a binary scientific number as the approximation to the binary scientific rational n/d. Thus, we must shift-right t to obtain an odd integer and set the biased exponent accordingly.

    Pseudocode is shown in Listing 19.

    Listing 19. Conversion of a BSRational to a floating-point number.

    // inputs are BSNumber n and d for BSRational n/d
    // output is floating-point of the Real (float or double)
    if (n.sign == 0)
    {
        return (Real)0;
    }

    int32_t sign = n.sign * d.sign;
    n.sign = 1;
    d.sign = 1;
    int32_t pmq = n.exponent - d.exponent;
    n.biasedExponent = 1 - n.uinteger.numBits;
    d.biasedExponent = 1 - d.uinteger.numBits;
    if (n < d)
    {
        ++n.biasedExponent;
        --pmq;
    }

    int precision = std::numeric_limits<Real>::digits;  // 24 for float, 53 for double
    int imax = precision - 1;
    UIntType w = 0;  // UIntType is uint32_t for float, uint64_t for double
    UIntType mask = ((UIntType)1 << imax);
    for (int i = imax; i >= 0; --i, mask >>= 1)
    {
        if (n < d)
        {
            n = 2 * n;
        }
        else
        {
            n = 2 * (n - d);
            w |= mask;
        }
    }

    n = n - d;
    if (n.sign > 0 || (n.sign == 0 && (w & 1) == 1))
    {
        // round up
        ++w;
    }
    // else round down (nothing to do)

    if (w > 0)
    {
        int32_t trailing = GetTrailingBit(w);
        w >>= trailing;  // w is now odd
        pmq += trailing;
        BSNumber result;
        result.uinteger = w;
        result.biasedExponent = pmq - imax;
        Real converted = (Real)result;
        if (sign < 0)
        {
            converted = -converted;
        }
        return converted;
    }
    else
    {
        return (Real)0;
    }
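The bit-by-bit division and the rounding rule can be tried out with ordinary integers. The sketch below is not the GTE implementation; it assumes a positive ratio of two nonzero 32-bit integers whose value lies in the normal range of float, and the function name `RatioToFloat` is hypothetical. It follows the same steps as the pseudocode: normalize so that n/d is in [1, 2), extract 24 bits, then compare the remainder with d to decide the rounding.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch: convert the positive rational numer/denom to the nearest float
// using round-to-nearest-ties-to-even. Assumes both inputs are nonzero and
// the result is a normal float (no subnormal or overflow handling).
float RatioToFloat(std::uint32_t numer, std::uint32_t denom)
{
    assert(numer != 0 && denom != 0);
    std::uint64_t n = numer, d = denom;

    // Scale by powers of two so that d <= n < 2*d, that is, n/d = 1.w.
    // The exponent r tracks the scaling: numer/denom = (n/d) * 2^r.
    int r = 0;
    while (n >= 2 * d) { d *= 2; ++r; }
    while (n < d) { n *= 2; --r; }

    int const precision = 24;
    int const imax = precision - 1;
    std::uint32_t w = 0;
    std::uint32_t mask = (std::uint32_t(1) << imax);
    for (int i = imax; i >= 0; --i, mask >>= 1)
    {
        if (n < d)
        {
            n = 2 * n;        // current bit is 0
        }
        else
        {
            n = 2 * (n - d);  // current bit is 1
            w |= mask;
        }
    }

    // After the loop, n/d is twice the fraction still to be rounded, so
    // comparing n with d classifies that fraction against 1/2.
    if (n > d || (n == d && (w & 1) == 1))
    {
        ++w;  // round up; on a tie, only when w is odd
    }

    return std::ldexp(static_cast<float>(w), r - imax);
}
```

The tie cases are the interesting ones: `RatioToFloat(16777217u, 1u)` lies halfway between two floats and rounds to 16777216, while `RatioToFloat(16777219u, 1u)` rounds up to 16777220, because in each case the even significand is chosen.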

    5 Implementation of Binary Scientific Numbers

    The GTE class that encapsulates the details of binary scientific numbers is BSNumber. The public interface and data members are shown in Listing 20.

    Listing 20. The class BSNumber that represents a binary scientific number.

    template <typename UIntegerType>
    class BSNumber
    {
    public:
        // Construction. The default constructor generates the zero BSNumber.
        BSNumber();
        BSNumber(BSNumber const& number);
        BSNumber(float number);
        BSNumber(double number);
        BSNumber(int32_t number);
        BSNumber(uint32_t number);
        BSNumber(int64_t number);
        BSNumber(uint64_t number);

        // Implicit conversions.
        operator float() const;
        operator double() const;

        // Assignment.
        BSNumber& operator=(BSNumber const& number);

        // Support for std::move.
        BSNumber(BSNumber&& number);
        BSNumber& operator=(BSNumber&& number);

        // Member access.
        int32_t GetSign() const;
        int32_t GetBiasedExponent() const;
        int32_t GetExponent() const;
        UIntegerType const& GetUInteger() const;

        // Comparisons.
        bool operator==(BSNumber const& number) const;
        bool operator!=(BSNumber const& number) const;
        bool operator< (BSNumber const& number) const;
        bool operator<=(BSNumber const& number) const;
        bool operator> (BSNumber const& number) const;
        bool operator>=(BSNumber const& number) const;

        // Unary operations.
        BSNumber operator+() const;
        BSNumber operator-() const;

        // Arithmetic.
        BSNumber operator+(BSNumber const& number) const;
        BSNumber operator-(BSNumber const& number) const;
        BSNumber operator*(BSNumber const& number) const;
        BSNumber& operator+=(BSNumber const& number);
        BSNumber& operator-=(BSNumber const& number);
        BSNumber& operator*=(BSNumber const& number);

    private:
        // Helpers for comparisons.
        static bool EqualIgnoreSign(BSNumber const& n0, BSNumber const& n1);
        static bool LessThanIgnoreSign(BSNumber const& n0, BSNumber const& n1);

        // Helpers for arithmetic.
        static BSNumber AddIgnoreSign(BSNumber const& n0, BSNumber const& n1,
            int32_t resultSign);
        static BSNumber SubIgnoreSign(BSNumber const& n0, BSNumber const& n1,
            int32_t resultSign);

        // Helpers for conversions between BSNumber and float/double.
        template <typename IEEE>
        void ConvertFrom(typename IEEE::FloatType number);

        template <typename IEEE>
        typename IEEE::FloatType ConvertTo() const;

        template <typename IEEE>
        typename IEEE::UIntType GetTrailing(int32_t normal, int32_t sigma) const;

        // The number 0 is represented by: mSign = 0, mBiasedExponent = 0, and
        // mUInteger = 0. For nonzero numbers, mSign != 0 and mUInteger > 0.
        int32_t mSign;
        int32_t mBiasedExponent;
        UIntegerType mUInteger;

        friend class BSRational<UIntegerType>;
    };

    The class is template based, allowing the user to select the underlying representation of unsigned integers. Effectively, the heart of the system is the manipulation of unsigned integers of arbitrary size, stored as an array. The template parameter UIntegerType must have the minimal interface shown in Listing 21.

    Listing 21. The minimal interface required for UIntegerType.

    class UIntegerType
    {
    public:
        // Construction. The default constructor generates 0.
        UIntegerType();
        UIntegerType(UIntegerType const& number);
        UIntegerType(uint32_t number);
        UIntegerType(uint64_t number);
        UIntegerType(int numBits);

        // Assignment.
        UIntegerType& operator=(UIntegerType const& number);

        // Support for std::move.
        UIntegerType(UIntegerType&& number);
        UIntegerType& operator=(UIntegerType&& number);

        // Member access.
        int32_t GetNumBits() const;

        // Comparison.
        bool operator==(UIntegerType const& number) const;
        bool operator< (UIntegerType const& number) const;

        // Arithmetic.
        void Add(UIntegerType const& n0, UIntegerType const& n1);
        void Sub(UIntegerType const& n0, UIntegerType const& n1);
        void Mul(UIntegerType const& n0, UIntegerType const& n1);
        void ShiftLeft(UIntegerType const& number, int shift);
        int32_t ShiftRightToOdd(UIntegerType const& number);

        // Support for conversions to float/double.
        uint64_t GetPrefix(int numRequested) const;
    };

    The standard type used in GTE for arbitrary size unsigned integers is std::vector<uint32_t>.

    The class functions are built on top of the framework discussed previously in this document. The comparisons are simple, as shown in Listing 22.

    Listing 22. The low-level comparisons for BSNumber.

    template <typename UIntegerType>
    bool BSNumber<UIntegerType>::EqualIgnoreSign(BSNumber const& n0, BSNumber const& n1)
    {
        return n0.mBiasedExponent == n1.mBiasedExponent
            && n0.mUInteger == n1.mUInteger;
    }

    template <typename UIntegerType>
    bool BSNumber<UIntegerType>::LessThanIgnoreSign(BSNumber const& n0, BSNumber const& n1)
    {
        int32_t e0 = n0.GetExponent(), e1 = n1.GetExponent();
        if (e0 < e1)
        {
            return true;
        }
        if (e0 > e1)
        {
            return false;
        }
        return n0.mUInteger < n1.mUInteger;
    }

    template <typename UIntegerType>
    bool BSNumber<UIntegerType>::operator==(BSNumber const& number) const
    {
        return (mSign == number.mSign ? EqualIgnoreSign(*this, number) : false);
    }

    template <typename UIntegerType>
    bool BSNumber<UIntegerType>::operator<(BSNumber const& number) const
    {
        if (mSign > 0)
        {
            if (number.mSign <= 0)
            {
                return false;
            }

            // Both numbers are positive.
            return LessThanIgnoreSign(*this, number);
        }
        else if (mSign < 0)
        {
            if (number.mSign >= 0)
            {
                return true;
            }

            // Both numbers are negative.
            return LessThanIgnoreSign(number, *this);
        }
        else
        {
            return number.mSign > 0;
        }
    }

    The low-level addition and subtraction are shown in Listing 23. These are direct implementations of the logic described in Section 3.2.

    Listing 23. The low-level addition and subtraction for BSNumber.

    template <typename UIntegerType>
    BSNumber<UIntegerType> BSNumber<UIntegerType>::AddIgnoreSign(
        BSNumber const& n0, BSNumber const& n1, int32_t resultSign)
    {
        BSNumber result, temp;
        int32_t diff = n0.mBiasedExponent - n1.mBiasedExponent;
        if (diff > 0)
        {
            temp.mUInteger.ShiftLeft(n0.mUInteger, diff);
            result.mUInteger.Add(temp.mUInteger, n1.mUInteger);
            result.mBiasedExponent = n1.mBiasedExponent;
        }
        else if (diff < 0)
        {
            temp.mUInteger.ShiftLeft(n1.mUInteger, -diff);
            result.mUInteger.Add(n0.mUInteger, temp.mUInteger);
            result.mBiasedExponent = n0.mBiasedExponent;
        }
        else
        {
            temp.mUInteger.Add(n0.mUInteger, n1.mUInteger);
            int32_t shift = result.mUInteger.ShiftRightToOdd(temp.mUInteger);
            result.mBiasedExponent = n0.mBiasedExponent + shift;
        }
        result.mSign = resultSign;
        return result;
    }

    template <typename UIntegerType>
    BSNumber<UIntegerType> BSNumber<UIntegerType>::SubIgnoreSign(
        BSNumber const& n0, BSNumber const& n1, int32_t resultSign)
    {
        BSNumber result, temp;
        int32_t diff = n0.mBiasedExponent - n1.mBiasedExponent;
        if (diff > 0)
        {
            temp.mUInteger.ShiftLeft(n0.mUInteger, diff);
            result.mUInteger.Sub(temp.mUInteger, n1.mUInteger);
            result.mBiasedExponent = n1.mBiasedExponent;
        }
        else if (diff < 0)
        {
            temp.mUInteger.ShiftLeft(n1.mUInteger, -diff);
            result.mUInteger.Sub(n0.mUInteger, temp.mUInteger);
            result.mBiasedExponent = n0.mBiasedExponent;
        }
        else
        {
            temp.mUInteger.Sub(n0.mUInteger, n1.mUInteger);
            int32_t shift = result.mUInteger.ShiftRightToOdd(temp.mUInteger);
            result.mBiasedExponent = n0.mBiasedExponent + shift;
        }
        result.mSign = resultSign;
        return result;
    }
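The three cases of the low-level addition can be illustrated with 64-bit integers. The sketch below is an assumption-laden stand-in, not GTE code: the hypothetical `Number` type holds an odd integer u and a biased exponent b with value u ∗ 2^b, both inputs are nonzero, and all intermediate values are assumed to fit in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the magnitude of a binary scientific number:
// value = u * 2^b with u odd (nonzero inputs are assumed).
struct Number
{
    std::uint64_t u;
    int b;
};

// Shift out trailing zero bits so that the integer part is odd again.
Number MakeOdd(std::uint64_t u, int b)
{
    assert(u != 0);
    while ((u & 1) == 0)
    {
        u >>= 1;
        ++b;
    }
    return { u, b };
}

Number AddMagnitudes(Number n0, Number n1)
{
    int diff = n0.b - n1.b;
    assert(-64 < diff && diff < 64);
    if (diff > 0)
    {
        // Shifted n0.u is even and n1.u is odd, so the sum is already odd.
        return { (n0.u << diff) + n1.u, n1.b };
    }
    if (diff < 0)
    {
        return { n0.u + (n1.u << -diff), n0.b };
    }
    // Equal exponents: odd + odd is even, so the sum must be renormalized.
    return MakeOdd(n0.u + n1.u, n0.b);
}
```

For instance, 3 ∗ 2^2 plus 5 ∗ 2^0 gives 17 ∗ 2^0 without any renormalization, while 3 ∗ 2^1 plus 5 ∗ 2^1 gives 8 ∗ 2^1, which is renormalized to 1 ∗ 2^4.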

    The high-level arithmetic operations call the low-level ones, as shown in Listing 24.

    Listing 24. The high-level arithmetic for BSNumber.

    template <typename UIntegerType>
    BSNumber<UIntegerType> BSNumber<UIntegerType>::operator+(BSNumber const& n1) const
    {
        BSNumber const& n0 = *this;
        if (n0.mSign == 0)
        {
            return n1;
        }
        if (n1.mSign == 0)
        {
            return n0;
        }
        if (n0.mSign > 0)
        {
            if (n1.mSign > 0)  // n0 + n1 = |n0| + |n1|
            {
                return AddIgnoreSign(n0, n1, +1);
            }
            else  // n1.mSign < 0
            {
                if (!EqualIgnoreSign(n0, n1))
                {
                    if (LessThanIgnoreSign(n1, n0))  // n0 + n1 = |n0| - |n1| > 0
                    {
                        return SubIgnoreSign(n0, n1, +1);
                    }
                    else  // n0 + n1 = -(|n1| - |n0|) < 0
                    {
                        return SubIgnoreSign(n1, n0, -1);
                    }
                }
                // else n0 + n1 = 0
            }
        }
        else  // n0.mSign < 0
        {
            if (n1.mSign < 0)  // n0 + n1 = -(|n0| + |n1|)
            {
                return AddIgnoreSign(n0, n1, -1);
            }
            else  // n1.mSign > 0
            {
                if (!EqualIgnoreSign(n0, n1))
                {
                    if (LessThanIgnoreSign(n1, n0))  // n0 + n1 = -(|n0| - |n1|) < 0
                    {
                        return SubIgnoreSign(n0, n1, -1);
                    }
                    else  // n0 + n1 = |n1| - |n0| > 0
                    {
                        return SubIgnoreSign(n1, n0, +1);
                    }
                }
                // else n0 + n1 = 0
            }
        }
        return BSNumber();  // = 0
    }

    template <typename UIntegerType>
    BSNumber<UIntegerType> BSNumber<UIntegerType>::operator-(BSNumber const& n1) const
    {
        BSNumber const& n0 = *this;
        if (n0.mSign == 0)
        {
            return -n1;
        }
        if (n1.mSign == 0)
        {
            return n0;
        }
        if (n0.mSign > 0)
        {
            if (n1.mSign < 0)  // n0 - n1 = |n0| + |n1|
            {
                return AddIgnoreSign(n0, n1, +1);
            }
            else  // n1.mSign > 0
            {
                if (!EqualIgnoreSign(n0, n1))
                {
                    if (LessThanIgnoreSign(n1, n0))  // n0 - n1 = |n0| - |n1| > 0
                    {
                        return SubIgnoreSign(n0, n1, +1);
                    }
                    else  // n0 - n1 = -(|n1| - |n0|) < 0
                    {
                        return SubIgnoreSign(n1, n0, -1);
                    }
                }
                // else n0 - n1 = 0
            }
        }
        else  // n0.mSign < 0
        {
            if (n1.mSign > 0)  // n0 - n1 = -(|n0| + |n1|)
            {
                return AddIgnoreSign(n0, n1, -1);
            }
            else  // n1.mSign < 0
            {
                if (!EqualIgnoreSign(n0, n1))
                {
                    if (LessThanIgnoreSign(n1, n0))  // n0 - n1 = -(|n0| - |n1|) < 0
                    {
                        return SubIgnoreSign(n0, n1, -1);
                    }
                    else  // n0 - n1 = |n1| - |n0| > 0
                    {
                        return SubIgnoreSign(n1, n0, +1);
                    }
                }
                // else n0 - n1 = 0
            }
        }
        return BSNumber();  // = 0
    }

    template <typename UIntegerType>
    BSNumber<UIntegerType> BSNumber<UIntegerType>::operator*(BSNumber const& number) const
    {
        BSNumber result;  // = 0
        int sign = mSign * number.mSign;
        if (sign != 0)
        {
            result.mSign = sign;
            result.mBiasedExponent = mBiasedExponent + number.mBiasedExponent;
            result.mUInteger.Mul(mUInteger, number.mUInteger);
        }
        return result;
    }
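Multiplication needs no renormalization because the product of two odd integers is odd. A small sketch with 64-bit integers shows the idea; the `SignedNumber` type and the function names are hypothetical, and the product of the integer parts is assumed to fit in 64 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for BSNumber: value = sign * u * 2^b, where u is odd
// for nonzero numbers and zero is stored as sign = 0, u = 0, b = 0.
struct SignedNumber
{
    int sign;
    std::uint64_t u;
    int b;
};

SignedNumber Multiply(SignedNumber n0, SignedNumber n1)
{
    SignedNumber result{ 0, 0, 0 };  // = 0
    int sign = n0.sign * n1.sign;
    if (sign != 0)
    {
        result.sign = sign;
        result.b = n0.b + n1.b;  // biased exponents add
        result.u = n0.u * n1.u;  // odd * odd is odd, so no shifting is needed
    }
    return result;
}

// Evaluate the represented value; exact when u fits in 53 bits.
double ToDouble(SignedNumber n)
{
    return n.sign * std::ldexp(static_cast<double>(n.u), n.b);
}
```

As a check, −3 ∗ 2^2 times 5 ∗ 2^(−1) is −15 ∗ 2^1, and `ToDouble` of that product is −30.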

    6 Implementation of Binary Scientific Rationals

    The GTE class that encapsulates the details of binary scientific rationals is BSRational. The public interface and data members are shown in Listing 25.

    Listing 25. The class BSRational that represents a binary scientific rational.

    template <typename UIntegerType>
    class BSRational
    {
    public:
        // Construction. The default constructor generates the zero BSRational.
        // The constructors that take only numerators set the denominators to one.
        BSRational();
        BSRational(BSRational const& rational);
        BSRational(float numerator);
        BSRational(double numerator);
        BSRational(int32_t numerator);
        BSRational(uint32_t numerator);
        BSRational(int64_t numerator);
        BSRational(uint64_t numerator);
        BSRational(BSNumber<UIntegerType> const& numerator);
        BSRational(float numerator, float denominator);
        BSRational(double numerator, double denominator);
        BSRational(BSNumber<UIntegerType> const& numerator,
            BSNumber<UIntegerType> const& denominator);

        // Implicit conversions.
        operator float() const;
        operator double() const;

        // Assignment.
        BSRational& operator=(BSRational const& rational);

        // Support for std::move.
        BSRational(BSRational&& rational);
        BSRational& operator=(BSRational&& rational);

        // Member access.
        inline int GetSign() const;
        inline BSNumber<UIntegerType> const& GetNumerator() const;
        inline BSNumber<UIntegerType> const& GetDenominator() const;

        // Comparisons.
        bool operator==(BSRational const& rational) const;
        bool operator!=(BSRational const& rational) const;
        bool operator< (BSRational const& rational) const;
        bool operator<=(BSRational const& rational) const;
        bool operator> (BSRational const& rational) const;
        bool operator>=(BSRational const& rational) const;

        // Unary operations.
        BSRational operator+() const;
        BSRational operator-() const;

        // Arithmetic.
        BSRational operator+(BSRational const& rational) const;
        BSRational operator-(BSRational const& rational) const;
        BSRational operator*(BSRational const& rational) const;
        BSRational operator/(BSRational const& rational) const;
        BSRational& operator+=(BSRational const& rational);
        BSRational& operator-=(BSRational const& rational);
        BSRational& operator*=(BSRational const& rational);
        BSRational& operator/=(BSRational const& rational);

    private:
        // Generic conversion code that converts to the correctly
        // rounded result using round-to-nearest-ties-to-even.
        template <typename RealType>
        RealType Convert() const;

        BSNumber<UIntegerType> mNumerator, mDenominator;
    };

    The implementations of the member functions are straightforward according to the definitions of Section 4.

    7 Performance Considerations

    GTE provides a class UIntegerAP32 that supports the unsigned integer storage and logic for arbitrary precision arithmetic (AP stands for Arbitrary Precision). The storage is of type std::vector<uint32_t>. The arithmetic logic unit (ALU) is implemented in a template base class UIntegerALU. We use the Curiously Recurring Template Pattern to allow other classes to share the ALU.

    Simple examples of using the arbitrary precision classes are shown in the next listing.

        float x = 1.2345f, y = 6.789f;
        BSNumber<UIntegerAP32> nx(x), ny(y);
        BSNumber<UIntegerAP32> nsum = nx + ny;
        float sum = (float)nsum;

        BSRational<UIntegerAP32> rx(x), ry(y);
        BSRational<UIntegerAP32> rdiv = rx / ry;
        float div = (float)rdiv;

    Many of the template classes in GTE that support float or double through a template parameter Real will work when Real is replaced with BSNumber (code has no divisions) or BSRational (code has divisions).

    For a large number of computations and for a large required number of bits of precision to produce an exact result, a profiler will show that the main bottleneck is allocation and deallocation of the std::vector arrays together with the copies of data. To remedy this, we provide support for an unsigned integer type that allows you to specify the maximum number of bits of precision. You specify the maximum number of 32-bit words, say, N, to store the bits. The maximum number of bits of precision is 32N. The class that implements this is UIntegerFP32 (FP stands for Fixed Precision) and the storage is std::array<uint32_t, N>. We have taken care to optimize the code to compute quantities in-place, avoiding as many copies of arrays as possible. Without this, the cost of copying can be nearly as expensive as std::vector allocations and deallocations. The fixed-precision class shares the ALU of the arbitrary precision class.

    7.1 Static Computation of the Maximum Bits of Precision

    The technical challenge for using UIntegerFP32 is to determine how large N must be for your sequence of computations. The analysis for computing N can be tedious. The summary of bit counting in Section 3.4 is listed next. Let bits(x) denote the number of bits of precision for x; this is 24 for float and 53 for double. Let pmax(x) denote the maximum exponent for x; this is 127 for float and 1023 for double. Let bmin(x) denote the minimum biased exponent for x; this is −149 for float and −1074 for double. For multiplication,

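    \[
    \begin{aligned}
    \operatorname{bits}(x * y) &= \operatorname{bits}(x) + \operatorname{bits}(y) \\
    p_{\max}(x * y) &= p_{\max}(x) + p_{\max}(y) + 1 \\
    b_{\min}(x * y) &= b_{\min}(x) + b_{\min}(y)
    \end{aligned}
    \tag{22}
    \]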

    The exponent for a product is either the sum of exponents of the inputs or one more than the sum of exponents. The formula for pmax uses the worst-case behavior. For addition,

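    \[
    \begin{aligned}
    b_{\min}(x + y) &= \min\{b_{\min}(x), b_{\min}(y)\} \\
    p_{\max}(x + y) &= \max\{p_{\max}(x), p_{\max}(y)\} + 1 \\
    \operatorname{bits}(x + y) &= p_{\max}(x + y) - b_{\min}(x + y)
    \end{aligned}
    \tag{23}
    \]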
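The two sets of rules are easy to apply mechanically. The following sketch is not the GTE class; the type `PrecisionSketch` and the functions `Product`, `Sum`, and `NumWords` are hypothetical names that encode the formulas above, with the word count taken as the number of 32-bit words needed to hold the bits. The same rule is assumed for sums and differences.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical bookkeeping record for the bit-counting rules.
struct PrecisionSketch
{
    int numBits;            // bits(x)
    int minBiasedExponent;  // bmin(x)
    int maxExponent;        // pmax(x)
};

// Rules for multiplication.
PrecisionSketch Product(PrecisionSketch a, PrecisionSketch b)
{
    return { a.numBits + b.numBits,
             a.minBiasedExponent + b.minBiasedExponent,
             a.maxExponent + b.maxExponent + 1 };
}

// Rules for addition (assumed to apply to subtraction as well).
PrecisionSketch Sum(PrecisionSketch a, PrecisionSketch b)
{
    int bmin = std::min(a.minBiasedExponent, b.minBiasedExponent);
    int pmax = std::max(a.maxExponent, b.maxExponent) + 1;
    return { pmax - bmin, bmin, pmax };
}

// Number of 32-bit words needed to store numBits bits.
int NumWords(PrecisionSketch p)
{
    assert(p.numBits > 0);
    return (p.numBits + 31) / 32;
}
```

With float inputs whose maximum exponent is 0, that is `PrecisionSketch{24, -149, 0}`, a sum of two products needs 300 bits, which fits in 10 words. The corresponding double inputs `{53, -1074, 0}` need 2150 bits, or 68 words.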

    We have provided a class, BSPrecision, that allows you to specify a sequence of expressions and compute N. The class interface is shown in Listing 26.

    Listing 26. The class interface for BSPrecision.

    class BSPrecision
    {
    public:
        // This constructor is used for 'float' or 'double'. The floating-point
        // inputs for the expressions have no restrictions; that is, the inputs
        // can be any finite floating-point numbers (normal or subnormal).
        BSPrecision(bool isFloat);

        // If you know that your inputs are limited in magnitude, use this
        // constructor. For example, if you know that your inputs x satisfy
        // |x|

    The computed numWords allows you to have any finite floating-point input.

    Suppose that you know x, y, z, and w are in the interval [−1, 1]. The bit counting is then

        // Compute N for an exact 'float' computation when you know that your inputs
        // are in the interval [-1,1] and the maximum power is 0.
        px = BSPrecision(true, 0);    // numbits = 24, minbiasexp = -149, maxexp = 0
        py = px; pz = px; pw = px;
        pxy = px * py; pzw = pxy;     // numbits = 48, minbiasexp = -298, maxexp = 1
        pd = pxy - pzw;
        numBits = pd.GetNumBits();    // 300, minbiasexp = -298, maxexp = 2
        numWords = pd.GetNumWords();  // 10

        // Compute N for an exact 'double' computation when you know that your inputs
        // are in the interval [-1,1] and the maximum power is 0.
        px = BSPrecision(false, 0);   // numbits = 53, minbiasexp = -1074, maxexp = 0
        py = px; pz = px; pw = px;
        pxy = px * py; pzw = pxy;     // numbits = 106, minbiasexp = -2148, maxexp = 1
        pd = pxy - pzw;
        numBits = pd.GetNumBits();    // 2150, minbiasexp = -2148, maxexp = 2
        numWords = pd.GetNumWords();  // 68

    The maximum numbers of bits and words are smaller for the restricted domain.

    Finally, suppose that you want to compute x ∗ y + z ∗ w where the inputs are in the interval [1, 2). It is safe to use the third constructor because the expression does not have subtraction, in which case the choice of minimum biased exponent is valid.

        // Compute N for an exact 'float' computation of d = x*y + z*w, where
        // x, y, z, and w are in [1,2). The minimum and maximum powers are 0.
        // The result d is in [1,8), so there is no chance of subtractions
        // causing negative powers to occur. The number 1.1^{23} * 2^0 is the
        // largest 'float' smaller than 2, and 1.1^{23} * 2^0 = 1^{24} * 2^{-23},
        // so the minimum biased exponent is -23.
        px = BSPrecision(24, -23, 0);  // numbits = 24, minbiasexp = -23, maxexp = 0
        py = px; pz = px; pw = px;
        pxy = px * py; pzw = pxy;      // numbits = 48, minbiasexp = -46, maxexp = 1
        pd = pxy + pzw;
        numBits = pd.GetNumBits();     // 48, minbiasexp = -46, maxexp = 2
        numWords = pd.GetNumWords();   // 2
        IEEEBinary32 tmp;
        tmp.number = 2.0f;
        tmp.encoding = tmp.GetNextD