


Page 1: Floating-Point Arithmetic

ENEE446---Lectures-4/10-15/08

A. Yavuz Oruç, Professor, UMD, College Park

Copyright © 2007 A. Yavuz Oruç. All rights reserved.

Floating-Point Arithmetic

Integer or fixed-point arithmetic provides a complete representation over a domain of integers or fixed-point numbers, but it is inadequate for representing the extreme ranges of real numbers.

Example: With 4 bits we can represent the following sets of numbers and many more:

{0, 1/16, 2/16, …, 15/16} -- all of these numbers are fractions
{0, 1/8, 2/8, …, 7/8, 1, 1+1/8, 1+2/8, 1+3/8, …, 1+7/8}
{0, 1/4, 2/4, 3/4, 1, 1+1/4, 1+2/4, 1+3/4, 2, 2+1/4, 2+2/4, 2+3/4, 3, 3+1/4, 3+2/4, 3+3/4}
{0, 1/2, 1, 1+1/2, 2, 2+1/2, 3, 3+1/2, 4, 4+1/2, 5, 5+1/2, 6, 6+1/2, 7, 7+1/2}
{0, 1, 2, …, 15} -- all of these numbers are integers

So, we can place the representable numbers in any range we like, but with n bits we are always limited to 2^n numbers.
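A small sketch of this point (the helper name is illustrative, not from the slides): with n bits there are always exactly 2^n patterns, and choosing where the binary point sits only rescales the same 2^n values.

```python
def fixed_point_values(n_bits, frac_bits):
    """All values representable with n_bits total and frac_bits after the binary point."""
    return [i / 2**frac_bits for i in range(2**n_bits)]

fractions = fixed_point_values(4, 4)   # {0, 1/16, ..., 15/16}
halves    = fixed_point_values(4, 1)   # {0, 1/2, 1, ..., 7 + 1/2}
integers  = fixed_point_values(4, 0)   # {0, 1, ..., 15}
assert len(fractions) == len(halves) == len(integers) == 16
assert fractions[-1] == 15/16 and halves[-1] == 7.5 and integers[-1] == 15
```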

Page 2

With a floating-point number system, we can represent very large numbers and very small numbers together!

We use the scientific notation:

u = ±mu * b^xu

mu is a p-digit number, called the mantissa or significand

xu is a k-digit number, called the exponent

b ≥ 2 is called the base.

Page 3

The mantissa provides the precision or resolution of a floating-point number system whereas the exponent gives its range.

Example: With p = 10, k = 20, and b = 10, and assuming that mantissas are sign-magnitude decimal fractions and exponents are decimal integers, we can represent numbers in the interval

[-(1 - 10^-10) * 10^20, (1 - 10^-10) * 10^20]

In this representation:

The least and most positive numbers are 10^-10 and (1 - 10^-10) * 10^20.

The least and most negative numbers are -10^-10 and -(1 - 10^-10) * 10^20.

Page 4

In nearly all modern processors,

mu is a binary fraction

xu is a binary exponent

and base b = 2

Very often,

mu is normalized so that it is between 1 and 2 (excluding 2).

If mantissas are expressed in sign-magnitude notation, this means that they always begin with a 1 followed by the binary point as in

1.001101 or 1.111101100, etc.

In some representations, the 1 on the left of the binary point is removed from the notation and is called a hidden bit. (The hidden bit is always 1 for sign-magnitude mantissas.)
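A sketch of this normalization step (the helper name is illustrative, not from the slides): repeatedly halving or doubling brings any positive value into [1, 2), and the leading 1 that results is the hidden bit.

```python
def normalize(value):
    """Rewrite a positive value as m * 2**x with 1 <= m < 2."""
    m, x = value, 0
    while m >= 2:
        m /= 2
        x += 1
    while m < 1:
        m *= 2
        x -= 1
    return m, x

assert normalize(0.15625) == (1.25, -3)   # 0.15625 = 1.01b * 2**-3
assert normalize(96.0) == (1.5, 6)        # 96 = 1.1b * 2**6
```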

Page 5

Machine Representation of Floating-Point Numbers

A floating-point number is stored in three fields, S | X | M:

S: sign bit
X: k-bit biased exponent
M: p-bit mantissa with a hidden bit (the hidden leading 1 is not stored)

The true exponent, x, is found by subtracting a fixed number from the biased exponent, X. This fixed number is called the bias. For a k-bit exponent, the bias is 2^(k-1) - 1, and the true exponent x and the biased exponent X are related by

x = X - (2^(k-1) - 1)
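The bias conversion can be sketched as follows (function names are illustrative, not from the slides):

```python
def true_exponent(X, k):
    """True exponent x from the k-bit biased exponent field X."""
    return X - (2**(k - 1) - 1)

def biased_exponent(x, k):
    """Biased exponent field X from the true exponent x."""
    return x + (2**(k - 1) - 1)

# k = 3: the bias is 3, so field values 0..7 map to exponents -3..4.
assert [true_exponent(X, 3) for X in range(8)] == [-3, -2, -1, 0, 1, 2, 3, 4]
assert biased_exponent(0, 8) == 127   # IEEE-754 single precision bias
```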

Page 6

Example: k = 3, x = X - (2^(3-1) - 1) = X - 3

X     x (3-bit pattern)   algebraic value
000   101                 -3
001   110                 -2
010   111                 -1
011   000                  0
100   001                  1
101   010                  2
110   011                  3
111   100                  4

Page 7

Example: With p=2, k=2, and 1-bit sign, we have 32 floating-point numbers with biased exponents as shown in the table below.

S = 0, X = 00 (denormalized, exponent -1):
  0 00 00 = 0      0 00 01 = 1/8    0 00 10 = 1/4    0 00 11 = 3/8
S = 0, X = 01 (exponent 0):
  0 01 00 = 1      0 01 01 = 5/4    0 01 10 = 3/2    0 01 11 = 7/4
S = 0, X = 10 (exponent 1):
  0 10 00 = 2      0 10 01 = 5/2    0 10 10 = 3      0 10 11 = 7/2
S = 0, X = 11 (exponent 2):
  0 11 00 = 4      0 11 01 = 5      0 11 10 = 6      0 11 11 = 7
S = 1, X = 00 (denormalized, exponent -1):
  1 00 00 = -0     1 00 01 = -1/8   1 00 10 = -1/4   1 00 11 = -3/8
S = 1, X = 01 (exponent 0):
  1 01 00 = -1     1 01 01 = -5/4   1 01 10 = -3/2   1 01 11 = -7/4
S = 1, X = 10 (exponent 1):
  1 10 00 = -2     1 10 01 = -5/2   1 10 10 = -3     1 10 11 = -7/2
S = 1, X = 11 (exponent 2):
  1 11 00 = -4     1 11 01 = -5     1 11 10 = -6     1 11 11 = -7
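A decoding sketch for this 5-bit system (the function is my construction; it follows the table's convention that the denormalized exponent is -1):

```python
def decode(s, X, M, k=2, p=2):
    """Value of a (sign, k-bit biased exponent, p-bit mantissa) pattern."""
    bias = 2**(k - 1) - 1              # bias = 1 here
    frac = M / 2**p                    # mantissa field as a fraction
    if X == 0:                         # denormalized: no hidden bit,
        value = frac * 2**(-1)         # exponent fixed at -1 (table's convention)
    else:
        value = (1 + frac) * 2**(X - bias)
    return -value if s else value

assert decode(0, 0b00, 0b01) == 1/8
assert decode(0, 0b01, 0b01) == 5/4
assert decode(0, 0b11, 0b11) == 7.0
assert decode(1, 0b10, 0b10) == -3.0
```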

Page 8

With a k-bit biased exponent and a p-bit mantissa, the most positive and most negative representable numbers are

±(1 - 1/2^p) * 2^(2^(k-1) - 1)  without a hidden bit,
±(2 - 1/2^p) * 2^(2^(k-1) - 1)  with a hidden bit.

Representation size   Sign   Exponent   Mantissa
 8                    1      2          5
16                    1      4          11
32                    1      8          23
64                    1      11         52

Typical allocation of bits between the mantissa and exponent parts. (The last two rows are the IEEE 754 standard formats for single and double precision floating-point arithmetic.)

Page 9

Precision of a Floating-Point Representation

In the IEEE-754 single precision floating-point representation, the mantissa is 23 bits long. This means that any two numbers in this representation cannot be closer than

1/2^23 = 1.1920928955078125 * 10^-7.

In double precision, this difference reduces to

1/2^52 = 2.220446049250313080847263336181640625 * 10^-16.

Given that 2 is a factor of 10, both binary fractions have an exact representation in decimal.

Page 10

This can be seen if we write

1/2^p = 5^p / 10^p

Hence, we can compute 5^p as a (p * log10(5))-digit number in decimal, and then divide it by 10^p by shifting the radix point p places to the left, where p is 23 or 52.

Indeed, the number of decimal digits in each of these representations is ⌈23 * log10(5)⌉ = 17 and ⌈52 * log10(5)⌉ = 37, respectively.
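The 1/2^p = 5^p/10^p observation can be sketched directly (the helper name is mine): 5^p computed as an exact integer gives the decimal digits of 1/2^p, and dividing by 10^p just places the decimal point.

```python
def ulp_decimal_digits(p):
    """Exact decimal digit string of 1/2**p (the point goes p places from the right)."""
    return str(5**p)

assert ulp_decimal_digits(23) == "11920928955078125"   # so 1/2**23 = 1.1920928955078125e-7
assert len(ulp_decimal_digits(23)) == 17
assert len(ulp_decimal_digits(52)) == 37
```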

Page 11

The least and most positive and negative representations in IEEE-754 single precision floating-point format (with the hidden bit of 1) are

Most positive:  0 11111110 11111111111111111111111
  +(2 - 1/2^23) * 2^(2^7 - 1) = +(1 - 1/2^24) * 2^128

Most negative:  1 11111110 11111111111111111111111
  -(2 - 1/2^23) * 2^(2^7 - 1) = -(1 - 1/2^24) * 2^128

Least positive (denormalized):  0 00000000 00000000000000000000001
  +(1/2^23) * 2^(1 - 2^7) = +2^-150

Least negative (denormalized):  1 00000000 00000000000000000000001
  -(1/2^23) * 2^(1 - 2^7) = -2^-150

The exponent 11111111 is reserved to represent extreme numbers such as ∞, 0/0, ∞/∞, etc.
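The largest-magnitude identity can be checked by decoding the bit pattern with Python's struct module (the decoding mechanism is my addition, not from the slides):

```python
import struct

# 0 11111110 11...1 is the largest finite single-precision pattern.
bits = (0 << 31) | (0b11111110 << 23) | (2**23 - 1)
value, = struct.unpack(">f", bits.to_bytes(4, "big"))
assert value == (2 - 2**-23) * 2**127 == (1 - 2**-24) * 2**128
```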

Page 12

IEEE 754 Normalized and Denormalized Numbers

[Figure: three number lines. The denormalized numbers cover ±1/2^p through ±(1 - 1/2^p), together with ±0; the normalized numbers cover ±1 through ±(2 - 1/2^p); their union covers ±1/2^p through ±(2 - 1/2^p), with ±0 in the middle.]

Page 13

Extreme Numbers in Floating-Point Number Systems

In floating-point computations, besides the problem of precision, two other kinds of errors come from results that are either too large (overflow) or too small (underflow). Any result greater than the largest representable number is converted to ∞, and any result less than the most negative representable number is converted to -∞. Likewise, any positive result smaller than the least positive representable number is truncated to +0, and any negative result greater than the least negative representable number is converted to -0.

Page 14

In mathematics, ∞ is used to represent a number that is greater than all real numbers. It is the limit point of real numbers as they get arbitrarily large, and it is used to represent an arbitrarily large value rather than a specific value as a finite real number would. For example, u^2 and u^3 both tend to ∞ as u becomes arbitrarily large, even though u^3 > u^2 for all u > 1. In real arithmetic, we also encounter numbers and computations such as ∞, 0/0, ∞-∞, 0*∞, and ∞/∞. Ratios such as 0/0 or ∞/∞ arise in the limit of computations such as (u-1)/(u^3-1) as u tends to 1 or ∞. We can also obtain 0*∞ when we try to compute u * (1/u) as u tends to 0.

Page 15

NaNs, QNaNs and SNaNs

Floating-point number systems set aside certain binary patterns to represent ∞ and other undefined expressions and values that involve ∞. In the IEEE-754 floating-point number system, the all-ones exponent (11111111 in single precision) is reserved to represent undefined values such as ∞, 0/0, ∞-∞, 0*∞, and ∞/∞. The last four cases are referred to as Not-a-Number (NaN) and represent the outcomes of undefined real number operations. These special values are represented by setting X to 2^k - 1, or equivalently x to 2^(k-1).

Page 16

The mantissa of the representation is used to distinguish between ∞ and NaNs. If M = 0 and X = 2^k - 1, then the representation denotes ∞. If M ≠ 0 and X = 2^k - 1, then the representation is a NaN.

In all of these special representations, the sign bit is used to distinguish between positive and negative versions of these numbers, i.e., +0, -0, +∞, -∞, +NaN, -NaN.

The NaNs are further refined into quiet NaNs (QNaNs) and signaling NaNs (SNaNs). QNaNs are designated by setting the most significant bit of the mantissa, and SNaNs are specified by clearing the same bit. QNaNs can be viewed as NaNs that can be tolerated during the course of a floating-point computation, whereas SNaNs force the processor to signal an invalid operation, as in the case of division of 0 by 0.

Page 17

Example: The numbers in the first row below represent +∞ and -∞, respectively, and those in the second row represent NaNs in a 16-bit floating-point number system:

0 1111 00000000000 = +∞
1 1111 00000000000 = -∞
0 1111 00000001000 = +SNaN
1 1111 10100010011 = -QNaN
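A classification sketch for this 16-bit format (the function is my construction, following the rules above: all-ones exponent, M = 0 for ∞, and the top mantissa bit separating QNaN from SNaN):

```python
def classify(s, X, M, k=4, p=11):
    """Classify a (sign, k-bit exponent, p-bit mantissa) pattern."""
    if X == 2**k - 1:                      # reserved all-ones exponent
        if M == 0:
            return "-inf" if s else "+inf"
        # most significant mantissa bit distinguishes quiet from signaling
        return "QNaN" if M >> (p - 1) else "SNaN"
    return "finite"

assert classify(0, 0b1111, 0) == "+inf"
assert classify(1, 0b1111, 0) == "-inf"
assert classify(0, 0b1111, 0b00000001000) == "SNaN"
assert classify(1, 0b1111, 0b10100010011) == "QNaN"
assert classify(0, 0b1000, 0b11000000000) == "finite"
```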

Page 18

Approximation of Real Numbers by Floating-Point Numbers

As p gets large, the distance between consecutive mantissas gets smaller, and tends to 0 as p tends to ∞. However, regardless of how large p becomes, not all decimal fractions can be represented in a binary mantissa format.

For example, any decimal fraction whose binary expansion includes a term 2^-s with s > p cannot be represented in p bits. But this is not the end of the story: many other numbers cannot be represented either, even when they are greater than 2^-p.

Page 19

In fact, for any decimal fraction, d, to have an exact binary mantissa representation in p bits, 2^p * d must be an integer, since

d = m_(p-1)/2 + m_(p-2)/2^2 + … + m_0/2^p

if and only if

2^p * d = 2^(p-1) * m_(p-1) + 2^(p-2) * m_(p-2) + … + m_0,

and the left-hand side of this equation must be an integer for the equation to hold, since the right-hand side is an integer.

Page 20

Now, suppose that d is an r-digit decimal fraction and it has an exact representation in p bits. It is easy to show that r ≤ p, and by the argument above,

2^p * d = (2^p * d * 10^r) / 10^r = (2^(p-r) * d * 10^r) / 5^r

must be an integer. This implies that 5^r must evenly divide d * 10^r, or equivalently that d * 2^r must be an integer, since 5 is relatively prime to 2 and 5^r cannot divide 2^(p-r).

Conversely, it can be shown that if r ≤ p, d < 1, and 5^r evenly divides d * 10^r (equivalently, d * 2^r is an integer), then d must have an exact representation in p bits.

Page 21

For example, 0.125 can be represented exactly in p = 3 bits since 0.125 * 2^3 = 1 is an integer and r = 3 ≤ p.

By the same token, all multiples of 0.125 that can be written in three or fewer digits can be represented exactly by a 3-bit mantissa. These are 0.125, 0.25, 0.375, 0.5, 0.625, 0.750, and 0.875. No other decimal fraction can be represented by a 3-bit mantissa.

Likewise, when p = 4, only the integral multiples of the decimal fraction 0.0625 can be represented by 4-bit mantissas, since 5^4 evenly divides only 10^4 * 0.0625 = 625 and its integral multiples. Clearly, there are exactly 15 such proper fractions (excluding 0) when p = 4.
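The criterion can be sketched with exact rational arithmetic (the function name is mine; Fraction avoids any rounding inside the check itself):

```python
from fractions import Fraction

def exactly_representable(d_str, p):
    """True iff the r-digit decimal fraction d is exact in p mantissa bits,
    i.e. r <= p and d * 2**r is an integer."""
    d = Fraction(d_str)
    r = len(d_str.split(".")[1])            # number of decimal digits
    return r <= p and (d * 2**r).denominator == 1

assert exactly_representable("0.125", 3)        # 0.125 * 2**3 = 1
assert exactly_representable("0.0625", 4)       # 0.0625 * 2**4 = 1
assert not exactly_representable("0.1", 23)     # 0.1 has no finite binary form
assert not exactly_representable("0.2", 52)
```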

Page 22

In general, it is easy to verify that, excluding 0, only the 2^p - 1 multiples of the fraction 1/2^p can be represented in p bits, as shown below. These are all the fractions that can be represented in p bits.

1/2^p, 2/2^p, 3/2^p, …, (2^p - 2)/2^p, (2^p - 1)/2^p

Page 23

For each fraction m with an exact representation in p bits, there is an infinite set of numbers in the open interval (m - 1/2^p, m). None of these has an exact representation in p bits. Each such number must therefore be approximated one way or another, and the most natural choices are the boundary points of the interval.

Page 24

This is because one of these boundary points is closer to the number being approximated than any other representable number. If, for a number mu in this interval,

mu - (m - 1/2^p) < m - mu,

that is, if

m - 1/2^p < mu < m - 1/2^(p+1),

then mu is closer to m - 1/2^p than it is to m, and it should be approximated by m - 1/2^p. On the other hand, if

m - 1/2^(p+1) < mu < m,

then mu is closer to m, and it should be approximated by m. Finally, if

mu = m - 1/2^(p+1),

then mu is equidistant from the two endpoints and can be approximated by either one.

Page 25

Example 2.1. Let p = 8 and m = 12/16 = 0.75. In binary, m has an exact representation, .11000000. Now consider the numbers in the interval

(12/16 - 1/256, 12/16)

None of these numbers has an exact representation with an 8-bit mantissa. One such number is

12/16 - 1/256 + 3/1024 = 12/16 - 1/1024,

which is clearly greater than the midpoint 12/16 - 1/512. Therefore, it should be approximated by 12/16. ||

Page 26

The process of approximating a floating-point number is often carried out by rounding or truncating it.

In both cases, digits outside the available number of digits are removed from the representation.

However, when a (p+r)-bit mantissa is rounded to a p-bit mantissa, we drop the last r bits and add 1/2^p to the result if the (p+1)st bit is 1; if that bit is 0, we simply drop the last r bits.

When it is truncated, we simply drop the rightmost r bits.

Page 27

In the above example, the 10-bit fraction

mu = (0.1011111111)2, which represents 12/16 - 1/256 + 3/1024,

is approximated by the 8-bit fraction (0.11000000)2, which represents 12/16. This amounts to rounding rather than truncation, as the latter fraction is obtained by dropping the last two bits and adding (0.00000001)2 to (0.10111111)2 in order to represent mu in 8 bits.

Page 28

Approximating mu by truncation would instead give (0.10111111)2, with the last two bits removed and the remaining digits unaltered. This gives 12/16 - 1/256, which is clearly not the closest 8-bit fraction to mu in this case. On the other hand, if mu = 12/16 - 1/256 + 1/512, i.e.,

mu = (0.1011111110)2,

then mu lies exactly in the middle of the interval (12/16 - 1/256, 12/16). Rounding will carry it to 12/16 and truncating will carry it to 12/16 - 1/256. In this case, both approximations are equally far from mu.
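The two approximations can be sketched on mantissas held as integers (counts of 1/2^(p+r) units; the helper names are mine):

```python
def truncate(m_bits, r):
    """Drop the last r bits of the mantissa."""
    return m_bits >> r

def round_nearest(m_bits, r):
    """Add half of the last kept unit, then drop the last r bits."""
    return (m_bits + (1 << (r - 1))) >> r

mu = 0b1011111111          # 10-bit fraction for 12/16 - 1/1024
assert truncate(mu, 2) == 0b10111111        # 12/16 - 1/256
assert round_nearest(mu, 2) == 0b11000000   # 12/16, the closer endpoint

tie = 0b1011111110         # exactly halfway across the interval
assert round_nearest(tie, 2) == 0b11000000  # rounds up to 12/16
assert truncate(tie, 2) == 0b10111111       # truncates to 12/16 - 1/256
```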

Page 29

In general, rounding a real number always leads to the closest representable floating-point number, except when the number is exactly halfway between two representable numbers; in that case truncation is as precise as rounding. In the rounding of decimal numbers, this happens when the digit to be dropped is 5, and by convention the number is rounded up: for example, 49.5 is rounded to 50 rather than 49. Truncating it would give 49, which is as far from 49.5 as 50 is.

Rounding or truncating a number introduces computational errors into an operation. These errors are usually unavoidable, and can have significant undesirable effects in the result of the computation.

Page 30

Example: Consider, for example, the machine numbers

0 1101 10000000000 = (1.1)2 * 2^6 = 96, and
0 1101 10000000001 = (1.10000000001)2 * 2^6 = 96.03125.

These representations are "adjacent", i.e., we cannot represent any other numbers between 96 and 96.03125 if we use an 11-bit mantissa. Now suppose we want to add 1000 fractions to 96, all of which are less than 0.03125, say around 0.02. If we perform the addition so that each fraction is added to 96 one after another, the result of the first addition will be about 96.02, but it will be truncated back to 96, assuming that we are using an 11-bit mantissa. Adding the second, third, and all subsequent fractions will likewise have no effect, so the result of the computation will be 96, whereas the correct result should have been about 96 + 20 = 116. Therefore, care should be taken when adding fractions or small numbers to large numbers. In this example, a result much closer to 116 can be obtained by first summing the thousand fractions and then adding this sum to 96. ||
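The accumulation-order effect can be simulated by truncating every intermediate sum to 12 significant binary digits, i.e. the hidden bit plus an 11-bit stored mantissa (the simulation itself is my construction; the 0.02 fractions are the illustrative values from the text):

```python
import math

def fl(x, sig_bits=12):
    """Truncate x >= 0 to sig_bits significant binary digits."""
    if x == 0:
        return 0.0
    e = math.floor(math.log2(x))
    scale = 2.0 ** (e - sig_bits + 1)
    return math.floor(x / scale) * scale

fracs = [0.02] * 1000

acc = 96.0
for f in fracs:                # one at a time: every addition is wiped out
    acc = fl(acc + f)
assert acc == 96.0

s = 0.0
for f in fracs:                # small terms first, then one big addition
    s = fl(s + f)
result = fl(96.0 + s)
assert 105.0 < result <= 116.0   # much closer to the true sum of 116
```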

Page 31

2's Complement Floating-Point Number Systems

Most processors use a sign-magnitude representation to represent mantissas in floating-point numbers. Instead, one can also use 1's or 2's complement notation as in fixed-point numbers to represent signed mantissas. This makes the subtraction of mantissas easier to handle.

Determining the value of a floating-point number with a 2's complement mantissa is only slightly more complex. In fact, if the sign bit of the mantissa is 0, then the value of the number is the same as if its mantissa were expressed in sign-magnitude notation. When the leading bit is 1, the number is negative, and its value is determined by complementing its bits and adding 1/2^p to the result, where p is the number of bits in the mantissa part of the number.

Page 32

Example: Consider the floating-point number

101011.01110111

in 2's complement notation. Its value is determined by complementing the bits and adding 0.00000001 to it.

-(010100.10001000 + 0.00000001)2 = -(010100.10001001)2 = -(20.53515625)10.
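This rule can be sketched on bit patterns directly (the function is my construction; frac_bits plays the role of p):

```python
def twos_complement_value(bits, n_bits, frac_bits):
    """Value of an n_bits-wide 2's complement pattern with frac_bits fraction bits."""
    if bits >> (n_bits - 1):                       # leading bit 1: negative
        magnitude = (~bits & (2**n_bits - 1)) + 1  # complement, then add 1 ulp
        return -magnitude / 2**frac_bits
    return bits / 2**frac_bits

x = 0b10101101110111      # the pattern 101011.01110111 from the example
assert twos_complement_value(x, 14, 8) == -20.53515625
```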

Page 34

Floating-Point Addition and Subtraction

When adding or subtracting two floating-point numbers, we must first align their exponents. This is done by shifting the mantissa of the number with the smaller exponent to the right, increasing its exponent with each shift, until its exponent equals that of the other number. After the exponents are aligned, the operation (addition or subtraction) is performed on the two mantissas, and the common exponent becomes the exponent of the result. The final step is to shift the mantissa and increase or decrease the exponent so that the mantissa is in normalized form.

Page 35

Example: Let u = 5.0 and v = 1.25 be represented as 16-bit floating-point numbers with a 4-bit biased exponent and an 11-bit sign-magnitude mantissa with a hidden bit. Let Mu, Mv, and Mr represent the mantissas of u, v, and u-v, and let Eu, Ev, and Er represent the biased exponents of u, v, and u-v. The difference u-v is computed as follows:
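A sketch of this computation (modeling the 12-bit significand, hidden bit included, as an integer in units of 2^-11 is my assumption about the representation, not from the slides):

```python
p, bias = 11, 7

Eu, Mu = 2 + bias, 0b101000000000   # u = 5.0  = 1.01b * 2**2
Ev, Mv = 0 + bias, 0b101000000000   # v = 1.25 = 1.01b * 2**0

# Step 1: align -- shift the mantissa of the smaller exponent to the right
Mv >>= Eu - Ev
Er = Eu
# Step 2: subtract the mantissas
Mr = Mu - Mv
# Step 3: normalize so that the hidden bit (bit 11) is 1 again
while Mr < 2**p:
    Mr <<= 1
    Er -= 1

value = (Mr / 2**p) * 2**(Er - bias)
assert value == 3.75 == 5.0 - 1.25
```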

Page 36

Design of a Floating-Point Adder/Subtractor

[Figure: block diagram of the adder/subtractor. Alignment and shift logic compares the exponents Eu and Ev and aligns the mantissas Mu and Mv; bus and function select logic routes the operands, along with the signs Su and Sv and the add/sub operation input, onto the a-bus and b-bus feeding a (p+1)-bit adder and a (p+1)-bit complementer; a k-bit adder handles the exponents; normalization logic and exponent correction produce the result mantissa Mr and sign Sr; control logic sequences steps 1-4 under the clock.]

Page 37

Algorithm 2.1 (sign-magnitude floating-point addition)
{
  // Add the hidden bits to Mu and Mv if they are not denormalized
  if (Eu != 0) Mu = 1 + Mu;
  if (Ev != 0) Mv = 1 + Mv;
  // Align the exponents
  if (Eu > Ev) { Mv = Mv * 2^(Ev-Eu); Er = Eu; }
  else if (Eu < Ev) { Mu = Mu * 2^(Eu-Ev); Er = Ev; }
  else Er = Eu;
  // Add if operation = 0, subtract if operation = 1
  switch (Su) {
    case 0: switch (Sv) {
      case 0: switch (operation) {
        case 0: Mr = Mu + Mv; break;
        case 1: Mr = Mu + ~Mv + 1; break; } break;
      case 1: switch (operation) {
        case 0: Mr = Mu + ~Mv + 1; break;
        case 1: Mr = Mu + Mv; break; } break;
    } break;
    case 1: switch (Sv) {
      case 0: switch (operation) {
        case 0: Mr = ~Mu + 1 + Mv; break;
        case 1: Mr = ~Mu + 1 + ~Mv + 1; break; } break;
      case 1: switch (operation) {
        case 0: Mr = ~Mu + 1 + ~Mv + 1; break;
        case 1: Mr = ~Mu + 1 + Mv; break; } break;
    } break;
  }
  // Normalize (the sum of two hidden-bit mantissas is below 4,
  // so a single right shift suffices)
  if (Mr >= 2) { Mr = Mr / 2; Er = Er + 1; er = er + 1; }
  else while (Mr < 1/2) { Mr = Mr * 2; Er = Er - 1; er = er - 1; }
  // Overflow check (constants are for k = 8, p = 23; er is the true exponent)
  if (Er < 256) F = 0;
  else { F = 1; Er = 255; er = 128; Mr = 2^23 - 1; }
  // Set the sign bit and magnitude
  if (Mr > 0) Sr = 0; else { Sr = 1; Mr = ~Mr + 1; }
}

Page 38

Floating-Point Multiplication and Division

When multiplying or dividing two floating-point numbers, the exponents and mantissas are again treated separately. Unlike floating-point addition/subtraction, multiplication and division do not require aligning the exponents. The mantissas are simply multiplied (or divided), and the exponents are added (or subtracted). If the sign bits of the two numbers are the same, the resulting sign bit is 0; otherwise it is set to 1. Finally, the resulting number is normalized by shifting the mantissa and adjusting the exponent of the result, if needed.

Page 39

In the case of multiplication, the biased exponent of the product must be corrected, since adding two biased exponents introduces an extra bias. That is, when two floating-point numbers u and v are multiplied, adding their biased exponents Eu = eu + 2^(k-1) - 1 and Ev = ev + 2^(k-1) - 1 results in

(eu + 2^(k-1) - 1) + (ev + 2^(k-1) - 1) = eu + ev + 2 * (2^(k-1) - 1).

This must be corrected by subtracting 2^(k-1) - 1 from it. In contrast, when u is divided by v, subtracting their biased exponents results in

(eu + 2^(k-1) - 1) - (ev + 2^(k-1) - 1) = eu - ev.
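The bias bookkeeping can be sketched directly (function names are mine): the product's biased exponent is Eu + Ev - bias and the quotient's is Eu - Ev + bias.

```python
k = 4
bias = 2**(k - 1) - 1                 # 7 for k = 4

def product_exponent(Eu, Ev):
    """Biased exponent of u * v."""
    return Eu + Ev - bias

def quotient_exponent(Eu, Ev):
    """Biased exponent of u / v."""
    return Eu - Ev + bias

eu, ev = 2, 1
Eu, Ev = eu + bias, ev + bias
assert product_exponent(Eu, Ev) == (eu + ev) + bias
assert quotient_exponent(Eu, Ev) == (eu - ev) + bias
```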

Page 40

Therefore, in this case, we need to add 2^(k-1) - 1 to restore the bias. These extra steps can be carried out concurrently while the mantissas are being multiplied or divided, since the exponent of the result is not needed in the computation of the mantissas. All these ideas are formalized in the algorithm below:

Page 41

Algorithm (floating-point multiplication/division)
{
  // u and v are sign-magnitude, biased-exponent floating-point numbers.
  // operation is a binary variable which specifies whether a multiplication
  // or a division is to be performed.
  // Add the hidden bit to Mu and Mv
  Mu = 1 + Mu; Mv = 1 + Mv;
  // Multiply if operation = 0, divide if operation = 1
  switch (operation) {
    case 0: Mr = Mu * Mv; Er = Eu + Ev - 2^(k-1) + 1; break;
    case 1: Mr = Mu / Mv; Er = Eu - Ev + 2^(k-1) - 1; break;
  }
  // Set the sign bit
  if (Su == Sv) Sr = 0; else Sr = 1;
  // Normalize
  if (Mr >= 2) { Mr = Mr / 2; Er = Er + 1; er = er + 1; }
  else while (Mr < 1/2) { Mr = Mr * 2; Er = Er - 1; er = er - 1; }
  // Overflow check (constants are for k = 8, p = 23; er is the true exponent)
  if (Er < 256) F = 0;
  else { F = 1; Er = 255; er = 128; Mr = 2^23 - 1; }
}

Page 42

The multiplication and division steps in this algorithm are left unspecified and can be carried out using any of the multiplication and division algorithms we described for integer operands. In the case of multiplication, the product of two p-bit mantissas is given by the expression:

(1 + M1/2^p) * (1 + M2/2^p) = ((2^p + M1) * (2^p + M2)) / 2^(2p)

So, effectively, we are multiplying two (p+1)-bit integers to obtain a 2(p+1)-bit product, and then dividing the product by 2^(2p). The division by 2^(2p) amounts to moving the binary point from the right of the rightmost bit of the product to the right of its second leftmost bit.

Page 43

Furthermore, only the highest p+1 bits of the 2(p+1)-bit product are retained, since the precision of the representation is limited to p+1 bits. This makes it redundant to compute the lower p+1 bits of the product, which come from the multiplication of the lower (p+1)/2 bits of the two mantissas.

Page 44

Example: Let u = -6.5 and v = 3.5 be represented as 16-bit floating-point numbers with a 4-bit biased exponent and an 11-bit sign-magnitude mantissa with a hidden bit. The product u * v is computed as follows:

Step 1: Express u and v as floating-point numbers.
        u = 1 1001 10100000000, v = 0 1000 11000000000
Step 2: Compute the exponent Er = Eu + Ev - 2^3 + 1.
        Er = 1001 + 1000 - 0111 = 1010
Step 3: Compute the mantissa Mr = (1 + Mu) * (1 + Mv).
        (1.10100000000) * (1.11000000000) = 10.1101100000
Step 4: Normalize Mr by shifting it right once.
        Mr = 01101100000
Step 5: Adjust the exponent by incrementing it by 1.
        Er = 1011
Step 6: Combine the sign Sr, Er, and Mr.
        u*v = 1 1011 01101100000
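The example can be checked by decoding the operand and result patterns (the decoder is my construction for this 4-bit-exponent, 11-bit-mantissa format, bias 7):

```python
p, bias = 11, 7

def decode(s, E, M):
    """Value of a sign / biased exponent / stored mantissa triple (hidden bit added)."""
    value = (1 + M / 2**p) * 2**(E - bias)
    return -value if s else value

u = decode(1, 0b1001, 0b10100000000)
v = decode(0, 0b1000, 0b11000000000)
r = decode(1, 0b1011, 0b01101100000)
assert u == -6.5 and v == 3.5
assert r == u * v == -22.75
```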

Page 45

(110100 000000) * (111000 000000)
= (110100 * 111000) followed by 12 zeros
+ (110100 * 000000 + 000000 * 111000) followed by 6 zeros
+ 000000 * 000000 (ignored, not just because it is 0, but also because it falls outside the 12-bit representation)

Interpreting the mantissas as fractions:
= (52 * 56) / 1024 = 2912 / 1024
= (2048 + 512 + 256 + 64 + 32) / 1024
= 10.1101100000

Page 46

Division works similarly, with the following formula:

(1 + M1/2^p) ÷ (1 + M2/2^p) = (2^p + M1) / (2^p + M2)

Again, the division of the mantissas is reduced to the division of two (p+1)-bit integers, and any of the division algorithms can be used to carry it out. The (p+1)-bit quotient obtained in the division becomes the mantissa of the result; since the division involves two numbers that both lie between 2^p and 2^(p+1), the mantissa resulting from the division of two normalized numbers always lies between 1/2 and 2.


Let u = -6.5 and v = 3.5 be represented as 16-bit floating-point numbers with a sign bit, a 4-bit biased exponent, and an 11-bit mantissa with a hidden bit. The division u / v is computed as follows:

Step 1: Express u and v as floating-point numbers: u = 1 1001 10100000000, v = 0 1000 11000000000.
Step 2: Compute the exponent Er = Eu - Ev + (2^3 - 1): Er = 1001 - 1000 + 0111 = 1000.
Step 3: Compute the mantissa Mr = (1+Mu) / (1+Mv): 1.10100000000 / 1.11000000000 = 0.11101101101.
Step 4: Normalize Mr by shifting it left once: Mr = 11011011010.
Step 5: Adjust the exponent by decrementing it by 1: Er = 0111.
Step 6: Combine the sign Sr, Er, and Mr: u / v = 1 0111 11011011010.
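The division steps can be sketched the same way as the multiplication. This is an illustrative sketch, not the hardware divider; the helper names (encode, divide) are mine, and the mantissa quotient is computed as the integer division (2^p + M1)·2^p ÷ (2^p + M2), truncated to the available bits as in the text.

```python
BIAS, P = 7, 11   # assumed format: bias 2**3 - 1, 11 mantissa bits

def encode(x):
    """Illustrative helper: real number -> (sign, biased exponent, mantissa bits)."""
    sign = 1 if x < 0 else 0
    x, e = abs(x), 0
    while x >= 2:                 # normalize so that 1 <= x < 2
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    return sign, e + BIAS, round((x - 1) * 2**P)

def divide(u, v):
    su, eu, mu = encode(u)
    sv, ev, mv = encode(v)
    er = eu - ev + BIAS                     # Step 2: subtract exponents, add the bias back
    q = ((2**P + mu) << P) // (2**P + mv)   # Step 3: (2^p + M1)/(2^p + M2), scaled by 2^p
    if q < 2**P:                            # Steps 4-5: quotient in [1/2, 1): shift left
        q, er = q << 1, er - 1
    return su ^ sv, er, q - 2**P            # drop the hidden 1

s, e, m = divide(-6.5, 3.5)
print(s, format(e, '04b'), format(m, '011b'))   # 1 0111 11011011010, as in Step 6
```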


The mantissa division reduces to 110100000000 / 111000000000 = 11010 / 11100 = 1/2 (11010 / 1110), so we divide 11010 (dividend, u) by 1110 (divisor, v) with repeated shift-and-subtract steps. Each line below shows the subtraction and the quotient built so far:

11010 - 1110 = 1100     0.1           (subtract)
11000 - 1110 = 1010     0.11          (shift and subtract)
10100 - 1110 = 0110     0.111         (shift and subtract)
11000 - 1110 = 1010     0.11101       (shift twice and subtract)
10100 - 1110 = 0110     0.111011      (shift and subtract)
11000 - 1110 = 1010     0.11101101    (shift twice and subtract)
10100 - 1110 = 0110     0.111011011   (shift and subtract)
11000 - 1110 = 1010     0.11101101101 (shift twice and subtract)

The quotient is u/v = 0.11101101101, with remainder 1010.


We terminate the division at the end of 12 bits since the number of bits in the mantissa is limited to 12. Moreover, since the remainder is not 0 after the last shift-and-subtract step, the ratio u/v does not have an exact representation in 12 bits. In fact, a closer examination of the process shows that the shift-and-subtract steps enter a repeating pattern once the remainder 0110 is obtained. Therefore, u/v cannot have an exact representation regardless of how many bits we use.


These algorithms can be implemented in hardware using a k-bit 2's complement adder/subtractor for the exponents, together with a (p+1)-bit multiplier and a (p+1)-bit divider for the mantissas.

For multiplication, we can use the compact multiplier hardware described earlier, or design an algorithm that generates only the most significant p bits of the product, since the remaining p bits are discarded.


For division, we can use either restoring or non-restoring division and discard the remainder. Unlike floating-point addition and subtraction, floating-point multiplication and division operations are generally implemented separately in hardware. This stems from the fact that division takes more clock cycles to execute than multiplication.

As we will see in subsequent chapters, in processors in which several operations can be scheduled for execution in parallel, it is desirable to execute these operations on different hardware units to speed up computations.


Machine Arithmetic in Real Processors

Motorola integer arithmetic instructions


PowerPC processors have four multiply and two divide instructions. Multiply instructions provide either the higher or lower half of a 64-bit product when two 32-bit numbers are multiplied. More specifically, in 32-bit mode, mulhw and mulhwu instructions multiply two 32-bit signed or unsigned operands and store the higher 32 bits of the product in a register. Similarly, mullw and mulli instructions retain the lower 32 bits of a 64-bit (or 48-bit) product that results from the multiplication of two 32-bit register operands or a 32-bit operand and a 16-bit signed number. A full 64-bit product can be obtained by a pair of multiply instructions. For example, mulhw and mullw can be used together to multiply two 32-bit signed numbers into a 64-bit signed product. It is also possible to obtain a 64-bit product using the 64-bit multiplication instructions with 32-bit operands. These same instructions can also be used together to obtain a signed 128-bit product of two 64-bit operands.
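The pairing of mulhw and mullw can be mimicked in a short sketch. The function names mirror the instruction mnemonics for readability only; Python's arbitrary-precision integers stand in for the hardware:

```python
MASK32 = (1 << 32) - 1

def mulhw(a, b):
    """High 32 bits of the signed 64-bit product, as an unsigned bit pattern."""
    return ((a * b) >> 32) & MASK32   # Python's shift is arithmetic, preserving the sign

def mullw(a, b):
    """Low 32 bits of the product (the same bits for signed and unsigned inputs)."""
    return (a * b) & MASK32

a, b = -123456789, 987654321
hi, lo = mulhw(a, b), mullw(a, b)
product = (hi << 32) | lo             # reassemble the 64-bit two's complement pattern
if product >= 1 << 63:
    product -= 1 << 64                # reinterpret as a signed 64-bit value
assert product == a * b               # the pair reconstructs the full signed product
```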


PowerPC's divide instructions divw, divd (signed division) and divwu, divdu (unsigned division) divide a 32- or 64-bit dividend by a 32- or 64-bit divisor to produce a 32- or 64-bit quotient without a remainder. Even though the remainder is not computed by these division instructions, it can be obtained by subtracting the product of the quotient and the divisor from the dividend. Division by 0 is not allowed and sets the OV (overflow) flag when it is attempted. The OV flag is also set when the divw or divd instruction is used to divide -2^31 or -2^63 by -1. (Can you guess why?)
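The remainder recovery and both overflow cases can be sketched as follows. The function divw here is an illustrative model of the instruction's behavior, not its actual implementation; returning None stands in for the hardware setting the OV flag:

```python
def divw(dividend, divisor):
    """Model of 32-bit signed division: quotient truncates toward zero."""
    INT_MIN = -2**31
    if divisor == 0 or (dividend == INT_MIN and divisor == -1):
        return None                       # hardware sets the OV flag in these cases
    q = abs(dividend) // abs(divisor)     # magnitude quotient, truncated
    return -q if (dividend < 0) != (divisor < 0) else q

q = divw(-22, 5)                 # quotient -4 (truncated toward zero)
r = -22 - q * 5                  # remainder = dividend - quotient * divisor = -2
assert (q, r) == (-4, -2)
assert divw(7, 0) is None        # division by zero sets OV
assert divw(-2**31, -1) is None  # the true quotient 2**31 does not fit in 32 bits
```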


Instruction        Operation             Comments
faddx, faddsx      fd = fa + fb          Floating-point operands in fa and fb are added and stored in fd.
fsubx, fsubsx      fd = fa - fb          Floating-point operands in fa and fb are subtracted and stored in fd.
fmulx, fmulsx      fd = fa ∗ fb          Floating-point operands in fa and fb are multiplied and stored in fd.
fdivx, fdivsx      fd = fa / fb          Floating-point operand in fa is divided by that in fb and stored in fd.
fmaddx, fmaddsx    fd = fa ∗ fb + fc     fa ∗ fb + fc is stored in fd.
fnmaddx, fnmaddsx  fd = -(fa ∗ fb + fc)  -(fa ∗ fb + fc) is stored in fd.
fmsubx, fmsubsx    fd = fa ∗ fb - fc     fa ∗ fb - fc is stored in fd.
fnmsubx, fnmsubsx  fd = -(fa ∗ fb - fc)  -(fa ∗ fb - fc) is stored in fd.
fabs               fd = |fa|             Sign bit of fa is cleared and fa is stored in fd.
fnabs              fd = -|fa|            Sign bit of fa is set and fa is stored in fd.
fneg               fd = -fa              Sign bit of fa is inverted and fa is stored in fd.
fres               fd = 1/fa             Estimate of the reciprocal of fa is stored in fd.
fsqrtx, fsqrtsx    fd = √fa              Square root of fa is stored in fd.
frsqrtex           fd = 1/√fa            Estimate of the reciprocal of the square root of fa is stored in fd.

Motorola floating-point arithmetic instructions


Instruction  Operation / Comments

add    r/md = r/md + r/ms or immediate operand
       Operands in r/md and r/ms (or an immediate operand) are added and stored in r/md.
adc    r/md = r/md + r/ms or immediate operand + CF (carry)
       Same as add except that CF is included in the addition.
sub    r/md = r/md - r/ms or immediate operand
       Operands in r/md and r/ms are subtracted and stored in r/md.
sbb    r/md = r/md - r/ms or immediate operand - CF (borrow)
       Same as sub except that CF is included in the subtraction.
inc    r/md = r/md + 1
       Increment the operand in r/md and store it in r/md.
dec    r/md = r/md - 1
       Decrement the operand in r/md and store it in r/md.
neg    r/md = -r/md
       Negate the operand in r/md and store it in r/md.
mul    rdx:rax = rax ∗ r/ms
       An unsigned 128-bit product of the 64-bit operands in rax and r/ms is stored in the register pair rdx:rax.
imul   rdx:rax = rax ∗ r/ms; or rd = rd ∗ r/ms or immediate operand; or rd = r/mb ∗ immediate operand
       A signed 128-bit product of the 64-bit operands in rax and r/ms is stored in the register pair rdx:rax; the two- and three-operand forms store the lower 64 bits of the signed product in rd.
div    rax = Quotient[rdx:rax / r/ms], rdx = Remainder[rdx:rax / r/ms]
       Unsigned 128-bit operand in the register pair rdx:rax is divided by the 64-bit operand in r/ms.
idiv   rax = Quotient[rdx:rax / r/ms], rdx = Remainder[rdx:rax / r/ms]
       Signed 128-bit operand in the register pair rdx:rax is divided by the 64-bit operand in r/ms.

A subset of Intel 64 architecture integer arithmetic instructions


Like PowerPC processors, Intel 64 processors support both signed and unsigned multiplication and division. The multiplication instructions imul and mul handle signed and unsigned multiplication with a variety of operand combinations, producing 16-, 32-, 64-, and 128-bit products from 8-, 16-, 32-, and 64-bit operands. Likewise, the idiv and div instructions provide signed and unsigned division of 16-, 32-, 64-, and 128-bit dividends by the corresponding 8-, 16-, 32-, and 64-bit divisors. Results of the multiplication and division instructions are stored in the specialized pair of 64-bit registers rax and rdx, except for some of the signed multiplication instructions.


Intel 64 architecture processors also perform decimal arithmetic using packed and unpacked decimal operands. A packed decimal operand contains 8 decimal digits in the 32-bit mode and 16 decimal digits in the 64-bit mode. An unpacked decimal operand uses only the lower four bits of each byte, so in the 32-bit mode it contains only four decimal digits, and in the 64-bit mode it contains eight.
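The digit counts follow directly from the two encodings. This short sketch (the helper names pack_bcd and unpack_bcd are mine) builds both representations a digit at a time:

```python
def pack_bcd(digits):
    """Packed decimal: 4 bits per digit, so 8 digits fill a 32-bit word."""
    word = 0
    for d in digits:
        word = (word << 4) | d
    return word

def unpack_bcd(digits):
    """Unpacked decimal: one digit per byte, only the low 4 bits of each byte used."""
    word = 0
    for d in digits:
        word = (word << 8) | d
    return word

assert pack_bcd([1, 9, 8, 4]) == 0x1984          # 4 packed digits fit in 16 bits
assert unpack_bcd([1, 9, 8, 4]) == 0x01090804    # the same 4 digits need 32 bits unpacked
```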


Intel 64 architecture processors do not have decimal add or subtract instructions. Instead, they have instructions to convert binary values to packed and unpacked decimals. When two BCD (binary-coded-decimal) digits u and v are added as 4-bit binary numbers, a correction is performed by adding 6 to the sum u + v when it exceeds 9. This is because

x - 10 ≡ x + 6 (mod 16)

since -10 and 6 are congruent mod 16; i.e., adding 6 is the same as subtracting 10 in modulo-16 arithmetic. For example,

(7+8)10 = (0111 + 1000 + 0110)BCD = (1 0101)BCD = (15)10
(9+8)10 = (1001 + 1000 + 0110)BCD = (1 0111)BCD = (17)10
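The correction rule can be written out as a one-digit BCD adder. This is a sketch; the function name bcd_add_digit is mine, and it returns the decimal carry alongside the corrected digit:

```python
def bcd_add_digit(u, v, carry_in=0):
    """Add two BCD digits; when the binary sum exceeds 9, add 6 (= subtract 10 mod 16)."""
    s = u + v + carry_in
    if s > 9:
        return 1, (s + 6) & 0xF   # carry out a decimal 10, keep the low 4 bits
    return 0, s

assert bcd_add_digit(7, 8) == (1, 5)   # (7+8) = 15: carry 1, digit 5
assert bcd_add_digit(9, 8) == (1, 7)   # (9+8) = 17: carry 1, digit 7
```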


Similarly, when subtracting two decimal digits in binary, if the subtraction u - v produces a borrow (that is, u < v), the 4-bit difference must be decreased by 6. For example,

(7-5)10 = (0111 - 0101)BCD = (0 0010)BCD = (2)10

(7-8)10 = (0111 - 1000 - 0110)BCD = (0 1001)BCD = (-1)10

(5-7)10 = (0101 – 0111 - 0110)BCD = (0 1000)BCD = (-2)10
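The subtraction correction mirrors the addition case. Again a sketch with an illustrative name (bcd_sub_digit), returning the decimal borrow alongside the corrected digit:

```python
def bcd_sub_digit(u, v, borrow_in=0):
    """Subtract BCD digits; on a borrow, subtract 6 (= add 10 mod 16) from the result."""
    d = u - v - borrow_in
    if d < 0:
        return 1, (d - 6) & 0xF   # borrow a decimal 10, keep the low 4 bits
    return 0, d

assert bcd_sub_digit(7, 5) == (0, 2)   # 7 - 5 = 2: no correction needed
assert bcd_sub_digit(7, 8) == (1, 9)   # 7 - 8 = -1: borrow, digit 9 (= 10 - 1)
assert bcd_sub_digit(5, 7) == (1, 8)   # 5 - 7 = -2: borrow, digit 8 (= 10 - 2)
```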


Instruction  Operation                                                   Comments
fadd         st(0) = st(0) + st(i) or memory operand
fsub         st(0) = st(0) - st(i) or memory operand
fsubr        st(0) = st(i) or memory operand - st(0)
fmul         st(0) = st(0) ∗ st(i) or memory operand
fdiv         st(0) = st(0) / st(i) or memory operand
fdivr        st(0) = st(i) or memory operand / st(0)
fsin         st(0) = sine(st(0))                                         argument in radians
fcos         st(0) = cosine(st(0))                                       argument in radians
fsincos      st(0) = sine(st(0)), then cosine(st(0)) is pushed           argument in radians
fptan        st(0) = tangent(st(0))                                      argument in radians
fpatan       st(1) = arctangent(st(1)/st(0)), then the stack is popped
fsqrt        st(0) = squareroot(st(0))

A subset of Intel 64 architecture floating-point arithmetic instructions