ieee arithmetic uc berkeley fall 2004, e77 pack/e77 copyright 2005, andy packard. this work is...
TRANSCRIPT
IEEE ArithmeticUC Berkeley
Fall 2004, E77http://jagger.me.berkeley.edu/~pack/e77
Copyright 2005, Andy Packard. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to
Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Computer representation of numbers
Integers and rational numbers in different bases
Floating point
IEEE standards
Roundoff
Errors on basic arithmetic
Integers in base 10
The base10 number (2075)10 means
What are the “rules”?–In each digit, there is a number from 0 to 9–The “value” of the slots are increasing powers of 10, starting at
100
0123 105107100102
Throughout the lecture, the “explanatory” expression is always given in base 10
Integers in base 2
The base2 number (1001101)2 means
What are the “rules”?–In each digit, there is a number from 0 to 1–The “value” of the slots are increasing powers of 2, starting at 20
0123456 21202121202021
This numbers in this “explanatory” expression are base 10
Integers in base p
The base p number (d4 d3 d2 d1 d0)p means
What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots are increasing powers of p, starting at p0
00
11
22
33
44 pdpdpdpdpd
Rational numbers in base 10
The base 10 number (75.396)10 means
What are the “rules”?–In each digit, there is a number from 0 to 9–The “value” of the slots to the left of the decimal point are
increasing powers of 10, starting at 100
–The “value” of the slots to the right of the decimal point are decreasing powers of 10, starting at 10-1
–The leading digit is usually not zero, but it is ok if it is.– We usually write 75.396 instead of 075.396, but they are the same.
32101 106109103105107
Rational numbers in base 2
The base 2 number (1001.101)2 means
What are the “rules”?–In each digit, there is a number from 0 to 1–The “value” of the slots to the left of the decimal point are
increasing powers of 2, starting at 20
–The “value” of the slots to the right of the decimal point are decreasing powers of 2, starting at 2-1
–The leading digit is usually not zero (see previous slide)
3210123 21202121202021
Rational numbers in base p
The base p number (d2 d1 d0 .d-1 d-2 d-3)p means
What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots to the left of the decimal point are
increasing powers of p, starting at p0
–The “value” of the slots to the right of the decimal point are decreasing powers of p, starting at p-1
–The leading digit is usually not 0 (see previous slide)
33
22
11
00
11
22
pdpdpdpdpdpd
“Floating Point” numbers in base p
The base p number (d1 d0 .d-1 d-2 d-3) × pe means
What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots to the left of the decimal point are
increasing powers of p, starting at p0
–The “value” of the slots to the right of the decimal point are decreasing powers of p, starting at p-1
–The exponent, e, must be an integer.• It just “moves” the decimal point e places to the right.
eppdpdpdpdpd
33
22
11
00
11
exponentmantissa
Normalized Floating point numbers in base 2
A normalized binary number is written d0 .d-1 d-2 d-3 × 2e. The rules for normalized are:
–Only one digit to left of decimal point
–Value of this digit, d0 is always 1
So, normalized binary numbers appear as (for example)
1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
In an unnormalized number, still allow only one digit to left of decimal point, but it may be 0.
Finally, we need a ± sign in front to denote positive or negative, giving the form
± d0 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
Computer representation of numbers
Computers “store” everything–Integers, Characters–Floating point numbers (eg., 1.435678, cos(2.3), etc.)–Memory location addresses–Computer chip instructions (“multiply contents of register A by
contents of register B, and store result in register C”)
as (long) sequences of digits made up of 0’s and 1’s.–Each single digit is called a “bit” (8 bits is a byte)–So, a “bit” can store a single base2 number (0 or 1)
For scientific computing and engineering calculations, an important question is “How are floating point numbers represented?” Need to store three things:
–the mantissa,– the exponent, and–the sign
Storing the mantissa, exponent and sign
Normalized numbers are of the form
±1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
Store the sign with 1 bit: 0 means positive, 1 means negative.
Store the mantissa (6 digits above) bit-by-bit.
Store the exponent as a binary integer–What about the sign of the exponent? Later…
The number of bits used to store the mantissa controls the precision to which numbers can be stored.
The number of bits used to store the exponent controls the range of numbers (smallest to biggest) that can be stored
IEEE Standard
Single precision–32 bits
• 1 sign bit (0 means positive, 1 means negative)
• 8 bits for the exponent
• 23 bits for the fraction
Double precision–64 bits
• 1 sign bit (0 means positive, 1 means negative)
• 11 bits for the exponent
• 52 bits for the fraction
Plus, some additional rules as how to interpret things.
Let’s use a different precision to explain the rules.
Toy precision, using IEEE rules
7 bits (only 27=128 different possibilities)1 sign bit, S, (0 means positive, 1 means negative)
3 bits for the exponent, E (000,001,010,011,100,101,110,111)
3 bits for the mantissa, M
Rules–If 1≤E≤6, then value is -1S × 2E-3 × 1.M
• 2E-3 takes on values 0.25, 0.5, 1, 2, 4, 8
• 1.M values: 1, 1.125, 1.25, 1.375, 1.5, 1.6125, 1.75, 1.8725
• Minimum value is 0.25, maximum value is 15 (realmin, realmax)
• 48 positive numbers, 48 negative numbers
–If E=0, and M≠0, then value is -1S × 2-2 × 0.M• 1/32, 2/32, 3/32, 4/32, 5/32, 6/32, 7/32 (and their negatives)
• These are the unnormalized numbers (14 of them)
–If E=0 and M=0, then value is 0 or -0 (based on S)– If E=7 and M=0, then value is Inf or –Inf (based on S)– If E=7 and M≠0, then value is NaN (14 of these)
Double precision, using IEEE rules
64 bits (264=18400000000000000000 different possibilities)1 sign bit, S, (0 means positive, 1 means negative)
11 bits for the exponent, E (between 0 and 2047)
52 bits for the mantissa, M (between 0 and 4.5×1015)
Rules–If 1≤E≤2046, then value is -1S × 2E-1023 × 1.M
• 2E-1023 takes on 2046 different values, from 2-1022 to 21023
• 1.M takes on 4.5×1015 evenly spaced values from 1 to 2 (just less)
• Min value is 2.2×10-308 , maximum value is 1.8×10308
–If E=0, and M≠0, then value is -1S × 2-1022 × 0.M• 4.5×1015 evenly spaced values from 0 to 2-1022 (just less)
• These are the unnormalized numbers (9.0×1015 of them)
–If E=0 and M=0, then value is 0 or -0 (based on S)– If E=2047 and M=0, then value is Inf or –Inf (based on S)– If E=2047 and M≠0, then value is NaN (4.5×1015 of these)
Toy precision, using IEEE rules
3-bit mantissa, 3-bit exponent, and sign bit gives
110 distinct numbers between -15 and 15,
+0 and -0, +Inf and –Inf, NaN
There are gaps. For 15<x<15, let fl(x) denote the closest floating point number to x. The relative representation error is
For toy precision, the maximum relative error is 2-4. In IEEE double precision, the maximum relative error is 2-53 which is about 10-16.
Size of gaps between representable numbers depends on number (relative representation error)
)(
)(
xfl
xflx
IEEE Arithmetic
Arithmetic (add, subtract, multiply and divide) on pairs of these numbers can give results that are not representable.
IEEE standard is that the stored result of these operations should be the nearest floating point number. Hence
fl(x+y) = (x+y)×(1+δ) where | δ |<10-16
fl(x-y) = (x-y)×(1+δ) where | δ |<10-16
fl(x×y) = (x×y)×(1+δ) where | δ |<10-16
fl(x/y) = (x/y)×(1+δ) where | δ |<10-16
IEEE arithmetic also has fl(√y) = √y(1+δ)