esc101 l16

1

Floating point representation

2

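IEEE single precision format (Sun, Pentium)

Bit layout (32 bits): sign bit sgn (bit 31) | exponent e = e7 … e0 (bits 23-30) | mantissa d1 d2 … d23 (bits 0-22)

Number is:

$(-1)^{sgn} \times (1.d_1 d_2 \ldots d_{23})_2 \times 2^{e-127}$

or more precisely as

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$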

3

Example: IEEE single precision

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$

Sign is 0: $(-1)^0$ is 1.
Exponent is 01111100 = 4 + 8 + 16 + 32 + 64 = 124.
Mantissa: $1 + 2^{-2} = 1.25$.
Value = $1.25 \times 2^{e-127} = 1.25 \times 2^{-3}$ = 1.25/8 = 0.15625.

Example

4

Given number: 0.15625

0 01111100 01000000000000000000000

Next number: $(1 + \tfrac{1}{4} + 2^{-23}) \times 2^{124-127} = 0.15625 + 2^{-26}$

0 01111100 01000000000000000000001

Previous number: $(1 + \tfrac{1}{8} + \tfrac{1}{16} + \cdots + 2^{-23}) \times 2^{-3} = (1 + \tfrac{1}{4} - 2^{-23}) \times 2^{-3} = 0.15625 - 2^{-26}$

0 01111100 00111111111111111111111

Real numbers in between 0.15625 and $0.15625 + 2^{-26}$ may all be represented as 0.15625. Discrete representation leads to approximation errors.

5

Finiteness of floating-pt

Floating-point is a finite arithmetic. On most computers, for historical reasons, an integer and a floating-point number each occupy one word of storage.

Therefore the same number of bit patterns is available for each. For example, in IBM System/370 a word is 32 bits, so there are $2^{32}$ representable integers (from $-2^{31}$ to $2^{31}-1$), and there are also at most $2^{32}$ representable floating-point numbers.

6

Largest and smallest nos.

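Largest number represented is $(1.111111\ldots1)_2 \times 2^{128}$.
Most negative number represented is $-(1.111111\ldots1)_2 \times 2^{128}$.
Smallest number in absolute value represented is $1.0 \times 2^{0-127} = 2^{-127}$.

(These follow from the formula with e ranging over 0 to 255. The IEEE 754 standard itself reserves e = 0 and e = 255 for special values, so the largest finite single precision value is $(1.111\ldots1)_2 \times 2^{127}$ and the smallest normalized one is $2^{-126}$.)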

IEEE single precision format: sign bit (bit 31) | exponent e (bits 23-30) | mantissa (bits 0-22)

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$

7

Underflow and Overflow

Consider IEEE single precision.

Numbers smaller in magnitude than $2^{-128}$ are indistinguishable from 0.

Such numbers occurring in calculations are said to be in underflow and are treated as 0.

Numbers that are larger in magnitude than $2^{128}$ cannot be represented.

Such numbers occurring in calculations are said to have overflowed.

8

Overflow and underflow: the two issues of floating point computation

The set of representable floating-point numbers is not closed under the arithmetic operations available on the computer.

Product of two large numbers may not be representable: too large to “fit”. This is overflow.

Conversely, underflow is caused by any calculation whose result is too small to be distinguished from zero.

9

Floating Point Arithmetic

Approximation Errors in Calculations

(underflow/overflow)

10

Representation

For simplicity, we will leave the IEEE single precision standard, and work with decimal numbers.

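Assume that floating point numbers are represented as normalized numbers as follows:

$\pm 0.d_1 d_2 \ldots d_k \times 10^{e}$, where $1 \le d_1 \le 9$, $0 \le d_i \le 9$ for $i = 2, \ldots, k$, and $-m \le e \le m$.

Numerical range of machine is: $-0.99\ldots9 \times 10^{m}$ to $0.99\ldots9 \times 10^{m}$.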

11

Errors in representation

Any number x in the numerical range is represented as follows.

x can be written as $x = 0.d_1 d_2 \ldots d_k d_{k+1} \ldots \times 10^{n}$.

To represent x in floating point, we keep only the most significant k digits: $x^* = 0.d_1 d_2 \ldots d_k \times 10^{n}$. This is called chopping the number.

The other option would be to round the number: add $5 \times 10^{n-(k+1)}$ to x and then chop.

12

Approximation errors

Let x be a real number. Let x* be its floating point representation on some machine. Let us use IEEE single precision: 1 sign bit, 23 bits of precision (mantissa).

Error is defined in two ways: absolute error and relative error.
Absolute error: $|x^* - x|$
Relative error: $|x^* - x| / |x|$, provided x is not 0.

13

Evaluating Polynomials

14

Evaluation

15

Horner’s Rule

16

Horner’s Rule

18

Acknowledgments

These slides are modified from those of Sumit Ganguly.
