ieee arithmetic uc berkeley fall 2004, e77 pack/e77 copyright 2005, andy packard. this work is...

17
IEEE Arithmetic UC Berkeley Fall 2004, E77 http://jagger.me.berkeley.edu/~ pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to Creative

Upload: roy-walker

Post on 13-Dec-2015

222 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

IEEE ArithmeticUC Berkeley

Fall 2004, E77http://jagger.me.berkeley.edu/~pack/e77

Copyright 2005, Andy Packard. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to

Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Page 2: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Computer representation of numbers

Integers and rational numbers in different bases

Floating point

IEEE standards

Roundoff

Errors on basic arithmetic

Page 3: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Integers in base 10

The base10 number (2075)10 means

What are the “rules”?–In each digit, there is a number from 0 to 9–The “value” of the slots are increasing powers of 10, starting at

100

0123 105107100102

Throughout the lecture, the “explanatory” expression is always given in base 10

Page 4: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Integers in base 2

The base2 number (1001101)2 means

What are the “rules”?–In each digit, there is a number from 0 to 1–The “value” of the slots are increasing powers of 2, starting at 20

0123456 21202121202021

This numbers in this “explanatory” expression are base 10

Page 5: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Integers in base p

The base p number (d4 d3 d2 d1 d0)p means

What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots are increasing powers of p, starting at p0

00

11

22

33

44 pdpdpdpdpd

Page 6: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Rational numbers in base 10

The base 10 number (75.396)10 means

What are the “rules”?–In each digit, there is a number from 0 to 9–The “value” of the slots to the left of the decimal point are

increasing powers of 10, starting at 100

–The “value” of the slots to the right of the decimal point are decreasing powers of 10, starting at 10-1

–The leading digit is usually not zero, but it is ok if it is.– We usually write 75.396 instead of 075.396, but they are the same.

32101 106109103105107

Page 7: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Rational numbers in base 2

The base 2 number (1001.101)2 means

What are the “rules”?–In each digit, there is a number from 0 to 1–The “value” of the slots to the left of the decimal point are

increasing powers of 2, starting at 20

–The “value” of the slots to the right of the decimal point are decreasing powers of 2, starting at 2-1

–The leading digit is usually not zero (see previous slide)

3210123 21202121202021

Page 8: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Rational numbers in base p

The base p number (d2 d1 d0 .d-1 d-2 d-3)p means

What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots to the left of the decimal point are

increasing powers of p, starting at p0

–The “value” of the slots to the right of the decimal point are decreasing powers of p, starting at p-1

–The leading digit is usually not 0 (see previous slide)

33

22

11

00

11

22

pdpdpdpdpdpd

Page 9: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

“Floating Point” numbers in base p

The base p number (d1 d0 .d-1 d-2 d-3) × pe means

What are the “rules”?–In each digit, there is a number from 0 to p-1–The “value” of the slots to the left of the decimal point are

increasing powers of p, starting at p0

–The “value” of the slots to the right of the decimal point are decreasing powers of p, starting at p-1

–The exponent, e, must be an integer.• It just “moves” the decimal point e places to the right.

eppdpdpdpdpd

33

22

11

00

11

exponentmantissa

Page 10: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Normalized Floating point numbers in base 2

A normalized binary number is written d0 .d-1 d-2 d-3 × 2e. The rules for normalized are:

–Only one digit to left of decimal point

–Value of this digit, d0 is always 1

So, normalized binary numbers appear as (for example)

1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e

In an unnormalized number, still allow only one digit to left of decimal point, but it may be 0.

Finally, we need a ± sign in front to denote positive or negative, giving the form

± d0 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e

Page 11: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Computer representation of numbers

Computers “store” everything–Integers, Characters–Floating point numbers (eg., 1.435678, cos(2.3), etc.)–Memory location addresses–Computer chip instructions (“multiply contents of register A by

contents of register B, and store result in register C”)

as (long) sequences of digits made up of 0’s and 1’s.–Each single digit is called a “bit” (8 bits is a byte)–So, a “bit” can store a single base2 number (0 or 1)

For scientific computing and engineering calculations, an important question is “How are floating point numbers represented?” Need to store three things:

–the mantissa,– the exponent, and–the sign

Page 12: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Storing the mantissa, exponent and sign

Normalized numbers are of the form

±1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e

Store the sign with 1 bit: 0 means positive, 1 means negative.

Store the mantissa (6 digits above) bit-by-bit.

Store the exponent as a binary integer–What about the sign of the exponent? Later…

The number of bits used to store the mantissa controls the precision to which numbers can be stored.

The number of bits used to store the exponent controls the range of numbers (smallest to biggest) that can be stored

Page 13: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

IEEE Standard

Single precision–32 bits

• 1 sign bit (0 means positive, 1 means negative)

• 8 bits for the exponent

• 23 bits for the fraction

Double precision–64 bits

• 1 sign bit (0 means positive, 1 means negative)

• 11 bits for the exponent

• 52 bits for the fraction

Plus, some additional rules as how to interpret things.

Let’s use a different precision to explain the rules.

Page 14: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Toy precision, using IEEE rules

7 bits (only 27=128 different possibilities)1 sign bit, S, (0 means positive, 1 means negative)

3 bits for the exponent, E (000,001,010,011,100,101,110,111)

3 bits for the mantissa, M

Rules–If 1≤E≤6, then value is -1S × 2E-3 × 1.M

• 2E-3 takes on values 0.25, 0.5, 1, 2, 4, 8

• 1.M values: 1, 1.125, 1.25, 1.375, 1.5, 1.6125, 1.75, 1.8725

• Minimum value is 0.25, maximum value is 15 (realmin, realmax)

• 48 positive numbers, 48 negative numbers

–If E=0, and M≠0, then value is -1S × 2-2 × 0.M• 1/32, 2/32, 3/32, 4/32, 5/32, 6/32, 7/32 (and their negatives)

• These are the unnormalized numbers (14 of them)

–If E=0 and M=0, then value is 0 or -0 (based on S)– If E=7 and M=0, then value is Inf or –Inf (based on S)– If E=7 and M≠0, then value is NaN (14 of these)

Page 15: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Double precision, using IEEE rules

64 bits (264=18400000000000000000 different possibilities)1 sign bit, S, (0 means positive, 1 means negative)

11 bits for the exponent, E (between 0 and 2047)

52 bits for the mantissa, M (between 0 and 4.5×1015)

Rules–If 1≤E≤2046, then value is -1S × 2E-1023 × 1.M

• 2E-1023 takes on 2046 different values, from 2-1022 to 21023

• 1.M takes on 4.5×1015 evenly spaced values from 1 to 2 (just less)

• Min value is 2.2×10-308 , maximum value is 1.8×10308

–If E=0, and M≠0, then value is -1S × 2-1022 × 0.M• 4.5×1015 evenly spaced values from 0 to 2-1022 (just less)

• These are the unnormalized numbers (9.0×1015 of them)

–If E=0 and M=0, then value is 0 or -0 (based on S)– If E=2047 and M=0, then value is Inf or –Inf (based on S)– If E=2047 and M≠0, then value is NaN (4.5×1015 of these)

Page 16: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

Toy precision, using IEEE rules

3-bit mantissa, 3-bit exponent, and sign bit gives

110 distinct numbers between -15 and 15,

+0 and -0, +Inf and –Inf, NaN

There are gaps. For 15<x<15, let fl(x) denote the closest floating point number to x. The relative representation error is

For toy precision, the maximum relative error is 2-4. In IEEE double precision, the maximum relative error is 2-53 which is about 10-16.

Size of gaps between representable numbers depends on number (relative representation error)

)(

)(

xfl

xflx

Page 17: IEEE Arithmetic UC Berkeley Fall 2004, E77 pack/e77 Copyright 2005, Andy Packard. This work is licensed under the Creative

IEEE Arithmetic

Arithmetic (add, subtract, multiply and divide) on pairs of these numbers can give results that are not representable.

IEEE standard is that the stored result of these operations should be the nearest floating point number. Hence

fl(x+y) = (x+y)×(1+δ) where | δ |<10-16

fl(x-y) = (x-y)×(1+δ) where | δ |<10-16

fl(x×y) = (x×y)×(1+δ) where | δ |<10-16

fl(x/y) = (x/y)×(1+δ) where | δ |<10-16

IEEE arithmetic also has fl(√y) = √y(1+δ)