esc101 l16


Upload: nakul-surana

Post on 09-Aug-2015


TRANSCRIPT

Page 1: Esc101 l16


Floating point representation

Page 2: Esc101 l16

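IEEE single precision format (Sun, Pentium)

Bit layout (32 bits):

sign bit sgn: bit 31

exponent e (bits e0 … e7): bits 23-30

mantissa (bits d1 d2 d3 … d23): bits 0-22

sgn | e0 … e7 | d1 d2 d3 … d23

Number is:

$(-1)^{sgn} \times (1.d_1 d_2 \ldots d_{23})_2 \times 2^{e-127}$

or more precisely as

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$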

Page 3: Esc101 l16

Example: IEEE single precision

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$

Sign is 0: $(-1)^0$ is 1.

Exponent is $01111100_2$ = 4 + 8 + 16 + 32 + 64 = 124.

Mantissa: $1 + 2^{-2} = 1.25$.

Value = $1.25 \times 2^{e-127} = 1.25 \times 2^{-3} = 1.25/8 = 0.15625$.
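The decoding steps of this example can be checked mechanically. The sketch below is an illustration, not part of the original slides: `decode_single` is a hypothetical helper that applies the normalized formula literally, so it assumes the exponent field is neither all zeros nor all ones. The bit string is the one described in the example (sign 0, exponent 01111100, only d2 set).

```python
import struct


def decode_single(bits: str) -> float:
    """Apply (-1)^sgn * (1 + sum d_i * 2^-i) * 2^(e - 127) to a 32-bit string.

    Assumption: normalized numbers only (exponent field not 0 and not 255).
    """
    assert len(bits) == 32
    sgn = int(bits[0], 2)        # bit 31
    e = int(bits[1:9], 2)        # bits 23-30
    mantissa = 1.0 + sum(int(d) * 2.0 ** -(i + 1) for i, d in enumerate(bits[9:]))
    return (-1) ** sgn * mantissa * 2.0 ** (e - 127)


# sign 0, exponent 01111100 (124), mantissa bits d2 = 1 and all others 0
example = "0" + "01111100" + "01" + "0" * 21
value = decode_single(example)
print(value)  # 0.15625

# Cross-check against the machine's own interpretation of the same 4 bytes.
hardware = struct.unpack(">f", int(example, 2).to_bytes(4, "big"))[0]
print(hardware == value)  # True
```

The cross-check with `struct.unpack` is there only to confirm that the hand-applied formula agrees with the format the hardware uses.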

Page 4: Esc101 l16

Given number: 0.15625

0 01111100 01000000000000000000000

Next number: $(1 + 1/4 + 2^{-23}) \times 2^{124-127} = 0.15625 + 2^{-26}$

0 01111100 01000000000000000000001

Previous number: $(1 + 1/8 + 1/16 + \cdots + 2^{-23}) \times 2^{-3} = (1 + 1/4 - 2^{-23}) \times 2^{-3} = 0.15625 - 2^{-26}$

0 01111100 00111111111111111111111

Real numbers between 0.15625 and $0.15625 + 2^{-26}$ may all be represented as 0.15625. Discrete representation leads to approximation errors.
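The spacing of $2^{-26}$ around 0.15625 can be observed by stepping the 32-bit pattern up and down by one. This is a minimal sketch: `float_to_bits` and `bits_to_float` are hypothetical helper names, and Python's `struct` rounds to the nearest single-precision value rather than chopping, so only in-between values below the midpoint collapse onto 0.15625 here.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x stored as IEEE single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(b: int) -> float:
    """Interpret a 32-bit pattern as an IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">I", b))[0]


given = 0.15625
b = float_to_bits(given)
next_up = bits_to_float(b + 1)     # last mantissa bit set
next_down = bits_to_float(b - 1)   # borrow runs through the mantissa

print(next_up - given == 2.0 ** -26)    # True
print(given - next_down == 2.0 ** -26)  # True

# A real number strictly between two neighbours cannot be stored exactly.
in_between = given + 2.0 ** -28
stored = bits_to_float(float_to_bits(in_between))
print(stored == given)  # True: the extra 2^-28 is lost
```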

Page 5: Esc101 l16

Finiteness of floating-pt

Floating-point is a finite arithmetic. On most computers, for historical reasons, an integer and a floating-point number each occupy one word of storage.

Therefore the same number of bit patterns is available for each. For example, in the IBM System/370 a word is 32 bits, so there are $2^{32}$ representable integers (between $-2^{31}$ and $2^{31} - 1$), and there are at most $2^{32}$ representable floating-point numbers.

Page 6: Esc101 l16

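Largest and smallest nos.

IEEE single precision format: sign bit (bit 31), exponent (bits 23-30), mantissa (bits 0-22).

$(-1)^{sgn} \times (1 + d_1 2^{-1} + d_2 2^{-2} + \cdots + d_{23} 2^{-23}) \times 2^{e-127}$

Taking every exponent pattern e = 0, …, 255 as an ordinary number in this formula:

Largest number represented is $(1.111111\ldots1)_2 \times 2^{128}$.

Most negative number represented is $-(1.111111\ldots1)_2 \times 2^{128}$.

Smallest number in absolute value represented is $1.0 \times 2^{0-127} = 2^{-127}$.

The actual IEEE standard reserves e = 255 for infinities and NaN, and e = 0 for zero and subnormal numbers, so the largest finite value is $(1.111111\ldots1)_2 \times 2^{127}$ and the smallest positive normalized value is $2^{-126}$.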

Page 7: Esc101 l16

Underflow and Overflow

Consider IEEE single precision.

Numbers smaller in magnitude than $2^{-128}$ are indistinguishable from 0 in the simplified model used here (with IEEE subnormal numbers the limit is lower, near $2^{-149}$).

Such numbers occurring in calculations are said to be in underflow and are treated as 0.

Numbers that are larger in magnitude than $2^{128}$ cannot be represented.

Such numbers occurring in calculations are said to have overflowed.
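Both effects can be seen by forcing a value into single precision. The sketch below assumes Python's `struct` module with the standard-size `">f"` format, which rounds a double to single precision and raises `OverflowError` when the value is too large. `to_single` is a hypothetical helper name.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to IEEE single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Underflow: far too small for single precision, so it is stored as 0.
print(to_single(2.0 ** -160))  # 0.0

# The smallest positive subnormal single, 2^-149, still survives.
print(to_single(2.0 ** -149) == 2.0 ** -149)  # True

# Overflow: 2^129 is beyond the single-precision range.
try:
    to_single(2.0 ** 129)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```

Whether an overflow surfaces as an exception or as an infinity depends on the language and library. The exception here is specific to `struct.pack`.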

Page 8: Esc101 l16


Overflow and underflow: the two issues of floating point computation

The set of representable floating-point numbers is not closed under the arithmetic operations available on the computer.

Product of two large numbers may not be representable: too large to “fit”. This is overflow.

Conversely, underflow is caused by any calculation whose result is too small to be distinguished from zero.

Page 9: Esc101 l16


Floating Point Arithmetic

Approximation Errors in Calculations

(underflow/overflow)

Page 10: Esc101 l16

Representation

For simplicity, we will leave the IEEE single precision standard and work with decimal numbers.

Assume that floating point numbers are represented as normalized numbers as follows:

$\pm 0.d_1 d_2 \ldots d_k \times 10^{e}$, where $1 \le d_1 \le 9$, $0 \le d_i \le 9$ for $i = 2, \ldots, k$, and $-m \le e \le m$.

Numerical range of the machine is $-0.99\ldots9 \times 10^{m}$ to $0.99\ldots9 \times 10^{m}$.

Page 11: Esc101 l16

Errors in representation

Any number x in the numerical range is represented as follows.

x can be written as

$x = 0.d_1 d_2 \ldots d_k d_{k+1} \ldots \times 10^{n}$

To represent x in floating point, we keep only the most significant k digits:

$x^* = 0.d_1 d_2 \ldots d_k \times 10^{n}$

This is called chopping the number. The other method is to round the number: add $5 \times 10^{n-(k+1)}$ to x and then chop.
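Chopping and rounding to k decimal digits can be sketched with exact decimal arithmetic. The helper names `chop` and `round_by_chopping` are hypothetical, and the code assumes x > 0. It uses `Decimal.adjusted()` to find n such that x = 0.d1 d2 … × 10^n.

```python
from decimal import Decimal


def chop(x: Decimal, k: int) -> Decimal:
    """Keep the k most significant decimal digits of x > 0 and drop the rest."""
    n = x.adjusted() + 1             # x = 0.d1 d2 ... * 10^n
    unit = Decimal(10) ** (n - k)    # place value of the digit d_k
    return (x // unit) * unit


def round_by_chopping(x: Decimal, k: int) -> Decimal:
    """Round x > 0 to k digits: add 5 * 10^(n-(k+1)), then chop."""
    n = x.adjusted() + 1
    return chop(x + 5 * Decimal(10) ** (n - (k + 1)), k)


x = Decimal("3.14159")               # 0.314159 * 10^1, so n = 1
print(chop(x, 4))                    # 3.141
print(round_by_chopping(x, 4))       # 3.142
```

With k = 4 the added amount is 5 × 10^(1-5) = 0.0005, half a unit in the fourth digit, which is why chopping afterwards gives the rounded value.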

Page 12: Esc101 l16

Approximation errors

Let x be a real number. Let x* be its floating point representation on some machine.

Let us use IEEE single precision: 1 sign bit, 8 exponent bits, 23 bits of precision (mantissa).

Error is defined in two ways: absolute error and relative error.

Absolute error: $|x^* - x|$

Relative error: $|x^* - x| / |x|$, provided x is not 0.
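As an illustration of the two definitions, the sketch below measures the error made when x = 1/10 is stored in IEEE single precision. It uses exact rational arithmetic (`fractions.Fraction`) so that the error itself is not disturbed by rounding. The choice of x and the helper name `to_single` are assumptions made for this example.

```python
import struct
from fractions import Fraction


def to_single(x: float) -> float:
    """Round a Python float (a double) to IEEE single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


x = Fraction(1, 10)                  # the real number 0.1, held exactly
x_star = Fraction(to_single(0.1))    # exact value of its single-precision form

abs_err = abs(x_star - x)            # |x* - x|
rel_err = abs_err / abs(x)           # |x* - x| / |x|, valid because x != 0

print(float(abs_err))
print(float(rel_err))
print(rel_err <= Fraction(1, 2 ** 24))  # True: within half a unit of the 23-bit mantissa
```

The bound $2^{-24}$ applies because `struct` rounds to nearest. With chopping, the corresponding bound would be $2^{-23}$.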

Page 13: Esc101 l16


Evaluating Polynomials

Page 14: Esc101 l16


Evaluation

Page 15: Esc101 l16


Horner’s Rule

Page 16: Esc101 l16


Horner’s Rule
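For reference, Horner's rule in its standard textbook form evaluates $a_n x^n + \cdots + a_1 x + a_0$ as $(\cdots((a_n x + a_{n-1}) x + a_{n-2}) \cdots) x + a_0$, using n multiplications and n additions. The sketch below is a generic illustration of that form and is not taken from the slides. The function name `horner` and the coefficient order (highest degree first) are assumptions.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x; coeffs = [a_n, ..., a_1, a_0]."""
    result = 0
    for a in coeffs:
        result = result * x + a   # one multiply and one add per coefficient
    return result


# 2x^3 - 6x^2 + 2x - 1 at x = 3: 54 - 54 + 6 - 1 = 5
print(horner([2, -6, 2, -1], 3))  # 5
```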

Page 17: Esc101 l16


Page 18: Esc101 l16


Acknowledgments

These slides are modified from those of Sumit Ganguly.