representing fractions – fixed point the problem: how to represent fractions with finite number of...

$: Representing fractions – Fixed point The problem: How to represent fractions with finite number of bits ?$
Representing fractions – Fixed point

The problem:

How to represent fractions with finite number of bits ?


a1a2a3a4a5a6a7a8a9a10A number with 10 bits


a1a2a3a4a5a6a7a8a9a10A number with 10 bits

a1a2a3a4a5a6a7a8.a9a10

Fixing the point


Number Fraction

Signed (2 complement)

-27…27-1 Quanta of 0.25

Unsigned 0…28-1 Quanta of 0.25

Range of representation:

Fixed point : the problem

Cannot represent wide ranges of numbers.

In scientific applications.

Representing Fractions – Floating point

10 1 * 101

-1.23 * 10-2

Base (radix) - r

-0.123


10 1 * 101

-1.23 * 10-1

exponent

-0.123


10 1 * 101

-1.23 * 10-1

Number (Mantissa)

-0.123


10

-0.123

(-1)0*1 * 101

(-1)1*1.23 * 10-1

Sign bit

Problem of uniqueness

0.1

100*10-4

0.001*102

Representation is not Unique

Problem of uniqueness - Normalization

0.61

610*10-4

0.0061*102

Standardization6.1*10-1

One digit to the Left of the point

Normalized Binary Floating point

D = (-1)a0 * (1.a1a2a3…)*2b1b2b3…

a0b1b2…bna1a2a3…am

String of bits

Floating point - Questions Representing the (signed) exponent

How to represent zero? And Nan, infinity ?

How to add, subtract and multiply?

Rounding Errors.

Floating point – Representing the exponent

How to represent singed number ?

Sign bit

2-Complement


How to represent singed number ?

Sign bit

2-Complement

Neither


We want the exponent to be binary ordered:

0000 < 0001 < …. < 1000 < … < 1111


Number = Number - B

000…0001

111…1110

emin

emax

Usually B = 2n-1-1

We define the following sizes like this:

Floating point – Representing zero,NAN, ±

Exponent Mantissa Represent

e = emin-1 M = 0 ±0

e = emin-1 M ≠ 0 0.M*2emin

emin ≤ e ≤ emax

1.M*2e

e = emax+1 M = 0 ±

e = emax+1 M ≠ 0 NaN

IEEE754 special values

Denormalized number

normalized number

IEEE 754

Parameter

Single Double

Mantissa 24 53

emax 127 1023

emin -126 -1022

Exponent width

8 11

Format width

32 64

(Including the signBit)

What is NaN (not a number)

Operation NaN produced by

+ + (- )

* 0 * / 0/0 , /

Partial list

Operation

X±NaN

X*NaN

X/NaN

Infinity

Provide a safe was to continue calculation when overflow is encountered.

Calculations with Floating Point numbers

Addition: Equalize the exponents

(smallerlarger exponent)

Sum the mantissa

Renormalize if necessary


Example (in base 10):

91 9.10*101

|E| = 1 , |M| = 3

9.7 9.70*100


9.10*101

+ 9.70*100

Not The same Order.


9.10*101

+ 9.70*100

9.10*101

+ 0.97*101

10.7*101

renormalize

1.07*102


Example II (in base 10):

91 9.10*101

|E| = 1 , |M| = 3

9.75 9.75*100


9.10*101

+ 9.75*100

Not The same Order.


9.10*101

+ 9.75*100

9.10 *101

+ 0.975*101

10.75*101renormalize

1.07*102

5 (rounding error)

Rounding Errors

The Problem: Squeezing infinite many real numbers

into a finite number of bits

Measuring Rounding Errors

Units in last place (Ulps)

Relative Error

Measuring Rounding Errors – ULP

error = |d.dddd – (z/re)|*rp-1

If d.dddd*re represent z

p digits


Example I: r = 10 , p = 3

The number 3.14*10-2 represents 0.0314159

Error = 0.159


What is the maximum ULP if the rounding is toward the nearest number?

0.5 ULP

Measuring Rounding Errors – Relative Error

Relative error = |d.dddd*re – z|/z

If d.dddd*re represent z

p digits

Measuring Rounding Errors – Relative errors

Example I: r = 10 , p = 3

The number 3.14*10-2 represents 0.0314159

Relative Error ~ 0.0005

representing fractions – fixed point the problem: how to represent fractions with finite number of...

Documents

floating point numbersexample

floating point numbersaddition

fractions floating point101

number mantissa

singed number

binary floating pointd

nearest number

finite number of bitsmeasuring