
Computer Representation of Numbers and Computer Arithmetic

In a computer, numbers are represented by the binary digits 0 and 1, and computers employ binary arithmetic for performing operations on numbers. Since it is cumbersome to display large numbers in binary form, computers usually display them in the hexadecimal, octal or decimal system. All of these number systems are positional systems. In a positional system a number is represented by a set of symbols, each of which denotes a particular value depending on its position. The number of symbols used in a positional system depends on its 'base'. Let us now discuss the various positional number systems:

Decimal System:

The decimal system uses 10 as its base value and employs ten symbols, 0 to 9, in representing numbers. Let us consider a decimal number 7402 consisting of the four symbols 7, 4, 0, 2. In terms of base 10 it can be expressed as follows.

$$7402 = 7 \times 10^3 + 4 \times 10^2 + 0 \times 10^1 + 2 \times 10^0$$

So each symbol in the set of symbols denoting a number is multiplied by a power of the base (10) that depends on its position counted from the right. The count always begins with 0.

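In general, a decimal number consisting of $n$ symbols $d_{n-1} d_{n-2} \ldots d_1 d_0$ can be expressed as:

$$d_{n-1} \times 10^{n-1} + d_{n-2} \times 10^{n-2} + \cdots + d_1 \times 10^1 + d_0 \times 10^0,$$

where $0 \le d_i \le 9$ for $i = 0, 1, \ldots, n-1$.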

mywbut.com 1

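Similarly, the fractional part $0.d_{-1} d_{-2} \ldots d_{-m}$ of a decimal number can be expressed as

$$d_{-1} \times 10^{-1} + d_{-2} \times 10^{-2} + \cdots + d_{-m} \times 10^{-m}.$$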

Binary system:

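The binary system is the positional system consisting of the two symbols 0 and 1, with 2 as its base. Any binary number $b_{n-1} b_{n-2} \ldots b_1 b_0$ represents the decimal value given by

$$b_{n-1} \times 2^{n-1} + b_{n-2} \times 2^{n-2} + \cdots + b_1 \times 2^1 + b_0 \times 2^0,$$

where each $b_i$ is either 0 or 1.

Consider the binary number 10101. The decimal equivalent of 10101 is given by

$$1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 16 + 4 + 1 = 21.$$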
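The positional expansion can be checked mechanically. The following is a minimal sketch, not part of the original notes; the function name `positional_value` and the symbol table are illustrative assumptions.

```python
def positional_value(digits: str, base: int) -> int:
    """Evaluate a digit string as the sum of digit * base**position,
    with positions counted from the right starting at 0."""
    symbols = "0123456789ABCDEF"  # assumed symbol set, enough for bases up to 16
    total = 0
    for position, symbol in enumerate(reversed(digits.upper())):
        digit = symbols.index(symbol)
        if digit >= base:
            raise ValueError(f"{symbol!r} is not a valid base-{base} digit")
        total += digit * base ** position
    return total


print(positional_value("7402", 10))  # 7402
print(positional_value("10101", 2))  # 21
```

The same function works for any base up to 16, since only the base and the symbol set change between positional systems.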

Hexadecimal System:

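The hexadecimal system is the positional system consisting of the sixteen symbols 0, 1, 2, ..., 9, A, B, C, D, E, F, with 16 as its base. Here the symbol A denotes 10, B denotes 11 and so on, up to F, which denotes 15. The decimal equivalent of a hexadecimal number $h_{n-1} h_{n-2} \ldots h_1 h_0$ is given by

$$h_{n-1} \times 16^{n-1} + h_{n-2} \times 16^{n-2} + \cdots + h_1 \times 16^1 + h_0 \times 16^0.$$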

We can convert a binary number directly to a hexadecimal number by grouping the binary digits, starting from the right, into sets of four and converting each group to its equivalent hexadecimal digit. If in such a grouping the last (leftmost) set falls short of four binary digits, prefix it with as many binary digits '0' as needed.

The converse also holds: a hexadecimal number is converted to binary by replacing each hexadecimal digit with its four-digit binary equivalent.
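The grouping rule can be sketched as follows. This is an illustrative sketch only; the function names and the sample value 110110 are assumptions chosen here, not taken from the notes.

```python
HEX_SYMBOLS = "0123456789ABCDEF"


def binary_to_hex(bits: str) -> str:
    """Group binary digits in fours from the right and map each group to a hex digit."""
    width = -(-len(bits) // 4) * 4  # round the length up to a multiple of four
    padded = bits.zfill(width)      # prefix with zeros where the last group is short
    groups = [padded[i:i + 4] for i in range(0, width, 4)]
    return "".join(HEX_SYMBOLS[int(group, 2)] for group in groups)


def hex_to_binary(hex_digits: str) -> str:
    """Replace each hex digit by its four-digit binary equivalent."""
    return "".join(format(int(h, 16), "04b") for h in hex_digits)


print(binary_to_hex("110110"))  # 36  (groups: 0011 0110)
print(hex_to_binary("36"))      # 00110110
```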

Octal System: The octal system is the positional system that uses 8 as its base and {0, 1, 2, 3, 4, 5, 6, 7} as its symbol set of size 8. The decimal equivalent of an octal number $o_{n-1} o_{n-2} \ldots o_1 o_0$ is given by

$$o_{n-1} \times 8^{n-1} + o_{n-2} \times 8^{n-2} + \cdots + o_1 \times 8^1 + o_0 \times 8^0.$$

We can get the octal equivalent of a binary number by grouping the binary digits, starting from the right, into sets of three binary digits and converting each of these sets to its octal equivalent. If such a grouping results in a last set with fewer than three digits, it may be prefixed with as many binary digits 0 as needed.

Conversion of decimal system to non-decimal system:

To convert a decimal number to a number in any other system, we consider the integer and fractional parts separately and apply the following procedure:

Conversion of integer part:

(a) Consider the integer part of the given decimal number and divide it by the base b of the new number system. The remainder constitutes the rightmost digit of the integer part of the new number.

(b) Next divide the quotient again by the base b. The remainder constitutes the second digit from the right in the new system.

Continue this process until we end up with a zero quotient. The last remainder is the leftmost digit of the new number.
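The repeated-division steps can be sketched as below. The function name `integer_to_base` is an assumption made for illustration.

```python
def integer_to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to the given base by repeated division."""
    symbols = "0123456789ABCDEF"
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)     # quotient feeds the next step
        digits.append(symbols[remainder])  # remainders arrive rightmost digit first
    return "".join(reversed(digits))


print(integer_to_base(54, 2))   # 110110
print(integer_to_base(54, 8))   # 66
print(integer_to_base(54, 16))  # 36
```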

Conversion of fractional part:

(a) Consider the fractional part of the given decimal number and multiply it by the base b of the new system. The integral part of the product constitutes the leftmost digit of the fractional part in the new system.

(b) Now multiply the fractional part resulting from step (a) by the base b of the new system. The integral part of the resultant product is the second digit from the left in the new system.

Repeat the above step until we encounter a zero fractional part or a duplicate fractional part (a duplicate means that the digits repeat from that point on). The integer part of this last product will be the rightmost digit of the fractional part of the new number.

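Eg: Convert 54.45 into its binary equivalent.

(a) Consider the integer part, i.e. 54, and apply the steps listed under conversion of integer part:

Division      Quotient    Remainder
54 / 2        27          0
27 / 2        13          1
13 / 2        6           1
6 / 2         3           0
3 / 2         1           1
1 / 2         0           1

Reading the remainders from the last to the first, $(54)_{10} = (110110)_2$.

(b) Conversion of fractional part:

Product             Integral part (binary digit)
0.45 × 2 = 0.90     0
0.9 × 2 = 1.80      1
0.8 × 2 = 1.6       1
0.6 × 2 = 1.2       1
0.2 × 2 = 0.4       0
0.4 × 2 = 0.8       0
0.8 × 2 = 1.6       1
0.6 × 2 = 1.2       1
0.2 × 2 = 0.4       0
0.4 × 2 = 0.8       0
0.8 × 2 = 1.6       1

The fractional part 0.8 recurs, so the digits 1100 repeat indefinitely: $(0.45)_{10} = (0.01\overline{1100})_2$ and hence $(54.45)_{10} = (110110.01\overline{1100})_2$.

Here the overbar denotes the repetition of the binary digits.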
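The repeated-multiplication procedure, including the stop on a duplicate fractional part, can be sketched with exact rational arithmetic. The function name, the `max_digits` cap and the use of `fractions.Fraction` are assumptions of this sketch; exact fractions are used so that the duplicate test is not disturbed by binary round-off.

```python
from fractions import Fraction


def fraction_to_base(frac: Fraction, base: int, max_digits: int = 64):
    """Return (non_repeating, repeating) digit strings for a fraction in [0, 1)."""
    digits = []
    seen = {}  # fractional part -> index of the digit it produces
    while frac != 0 and frac not in seen and len(digits) < max_digits:
        seen[frac] = len(digits)
        frac *= base
        integral = int(frac)          # integral part of the product is the next digit
        digits.append(str(integral))
        frac -= integral              # keep only the fractional part
    if frac in seen:                  # duplicate fractional part: digits repeat from here
        start = seen[frac]
        return "".join(digits[:start]), "".join(digits[start:])
    return "".join(digits), ""


print(fraction_to_base(Fraction(45, 100), 2))  # ('01', '1100')
```

For 0.45 the result reads as 0.01 followed by the repeating block 1100.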

Note: Using the binary system as an intermediate stage, we can easily convert octal numbers to hexadecimal numbers and vice versa.

To convert octal to hexadecimal, write each octal digit as a binary triplet and regroup the binary digits into quadruplets. To convert hexadecimal to octal, write each hexadecimal digit as a binary quadruplet and regroup the binary digits into triplets.
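A short sketch of conversion between octal and hexadecimal through binary follows. The helper name `regroup` and the sample values are assumptions made for illustration.

```python
def regroup(digits: str, from_bits: int, to_bits: int) -> str:
    """Expand each digit to from_bits binary digits, then regroup into to_bits-sized sets."""
    base_from = 1 << from_bits
    bits = "".join(format(int(ch, base_from), f"0{from_bits}b") for ch in digits)
    width = -(-len(bits) // to_bits) * to_bits  # round up to a multiple of to_bits
    bits = bits.zfill(width)                    # prefix zeros to complete the last set
    out = "".join(
        format(int(bits[i:i + to_bits], 2), "X") for i in range(0, width, to_bits)
    )
    return out.lstrip("0") or "0"


print(regroup("66", 3, 4))  # octal 66 -> hexadecimal 36
print(regroup("36", 4, 3))  # hexadecimal 36 -> octal 66
```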


Computer Representation of Numbers

Computers are designed to use binary digits to represent numbers and other information. The computer memory is organized into strings of bits, called words, all of the same length. Decimal numbers are first converted into their binary equivalents and are then represented in either integer or floating point form.

Integer Representation

The largest decimal number that can be represented, in binary form, in a computer depends on its word length. An n-bit word computer can handle a number as large as $2^n - 1$. For instance, a 16-bit word machine can represent numbers as large as $2^{16} - 1 = 65535$. How do we represent negative numbers? Negative numbers are stored using the 2's complement. This is obtained by taking the 1's complement of the binary representation of the positive number and then adding 1 to it.

In this representation an extra zero is appended to the left of the binary number to indicate that it is positive. If this extra leftmost binary digit is set to 1, it indicates that the binary number is negative. So the general convention for storing signed numbers is to append a binary digit 0 or 1 to the left of the binary number depending on the positive or negative sign of the number. So in an n-bit word computer, as one bit is reserved for the sign, one can use at most $n - 1$ bits to store the magnitude of a signed number. So the largest signed number a 16-bit word can represent is $2^{15} - 1 = 32767$. On this machine, since zero is defined as 0000000000000000, it is redundant to use the number 1000000000000000 to define a "minus zero". It is usually employed to represent an additional negative number, i.e. $-2^{15} = -32768$, and hence the range of signed numbers that can be represented on a 16-bit word machine is from $-32768$ to $32767$.
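The 2's complement rule (1's complement plus one) can be sketched for a 16-bit word as below. The function name and the sample value -13 are illustrative assumptions.

```python
def twos_complement(value: int, bits: int = 16) -> str:
    """Return the bit pattern of a signed integer in a word of the given length."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in a signed {bits}-bit word")
    mask = (1 << bits) - 1
    if value >= 0:
        return format(value, f"0{bits}b")
    ones_complement = ~(-value) & mask  # flip every bit of the magnitude
    return format((ones_complement + 1) & mask, f"0{bits}b")  # then add 1


print(twos_complement(13))      # 0000000000001101
print(twos_complement(-13))     # 1111111111110011
print(twos_complement(-32768))  # 1000000000000000
```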

Floating Point Representation

Fractional numbers and very large numbers, which fall outside the range of a fixed word length machine (say, for instance, a 16-bit word machine), are stored and processed in exponential form. In exponential form these numbers have an embedded decimal point and are called floating point numbers or real numbers. The floating point representation of a real number $x$ is $x = M \times 10^{E}$, where $M$ is called the mantissa and $E$ is the exponent.

A typical 32-bit floating point layout reserves the leftmost bit for the sign, the next seven bits for the exponent and the last twenty-four bits for the mantissa. (The IEEE 754 single precision format, by comparison, uses 8 exponent bits and 23 stored mantissa bits, which together with an implicit leading bit give 24 bits of precision.)

The shifting of the decimal point to the left of the most significant digit is called normalization, and the numbers represented in the normalized form are known as normalized floating point numbers.

For example, the normalized floating point forms of the numbers 0.00695, 56.2547 and -684.6 are:

0.00695 = $0.695 \times 10^{-2}$ = .695E-2

56.2547 = $0.562547 \times 10^{2}$ = .562547E2

-684.6 = $-0.6846 \times 10^{3}$ = -.6846E3
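Normalization can be sketched with Python's `decimal` module, which keeps decimal digits exact. The function name `normalize` is an assumption; `Decimal.adjusted()` returns the exponent of the most significant digit and `Decimal.scaleb()` shifts the decimal point.

```python
from decimal import Decimal


def normalize(number: str):
    """Return (mantissa, exponent) with number = mantissa * 10**exponent
    and 0.1 <= |mantissa| < 1 (zero is returned as (0, 0))."""
    value = Decimal(number)
    if value == 0:
        return Decimal(0), 0
    exponent = value.adjusted() + 1     # places to shift the decimal point
    mantissa = value.scaleb(-exponent)  # point now sits left of the first digit
    return mantissa, exponent


for text in ("0.00695", "56.2547", "-684.6"):
    print(text, "->", normalize(text))
```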

Inherent Errors

Inherent errors arise due to the data errors or due to the conversion errors.

Data Errors


If the data supplied for a problem is obtained from some experiment or from some measurement, then it is prone to errors due to the limitations in instrumentation or reading. Such errors are also referred to as empirical errors. So when the data supplied is correct, say, to two decimals, there is no use performing arithmetic accurate to four decimals!

Conversion Errors

Conversion errors arise due to the limitation on the number of bits used for representing numbers under both integer and floating point representation, so they are also called representation errors. The digits that are not retained constitute the round-off error.

For example, consider the case of representing the decimal number 0.1 in a computer. The binary equivalent of 0.1 has a non-terminating form like 0.00011001100110011..., but the computer has a limited number of bits. If we add ten such numbers in a computer, the result will not be exactly 1.0, due to the round-off error during the conversion of 0.1 to binary form.
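This effect is easy to observe. The sketch below assumes IEEE 754 double precision floats, which is what Python's `float` uses on common platforms; the function name is illustrative.

```python
from decimal import Decimal


def add_tenths(count: int = 10) -> float:
    """Add 0.1 to itself `count` times using ordinary floating point addition."""
    total = 0.0
    for _ in range(count):
        total += 0.1
    return total


total = add_tenths()
print(total == 1.0)  # False
print(total)         # slightly below 1.0
print(Decimal(0.1))  # the value actually stored for 0.1, not exactly one tenth
```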


Computer Arithmetic

The most common forms of computer arithmetic are integer arithmetic and floating point arithmetic. These arithmetic systems are briefly discussed below.

Integer Arithmetic :

The result of any integer arithmetic operation is always an integer. The range of integers that can be represented on a given computer is finite. The result of an integer division is usually given as a quotient; the remainder is truncated, since fractional quantities cannot be represented under the integer representation.

Remark:

(1) Simple rules like $\frac{a + b}{c} = \frac{a}{c} + \frac{b}{c}$, where $a$, $b$, $c$ are integers, may not hold under computer integer arithmetic due to the truncation of the remainder.

(2) An integer operation may result in a very small or a very large number which is beyond the range that the computer can handle. When the result is larger than the maximum limit, it is referred to as an overflow, and when it is less than the lower limit, it is referred to as underflow.
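Both remarks can be illustrated with a small sketch. Python integers have unlimited range, so the 16-bit limits below are simulated; the function name `add_int16` and the sample values are assumptions.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1  # range of a signed 16-bit word


def add_int16(a: int, b: int) -> int:
    """Add two integers, reporting results outside the simulated 16-bit range."""
    result = a + b
    if result > INT16_MAX:
        raise OverflowError("overflow: result above 32767")
    if result < INT16_MIN:
        raise OverflowError("underflow: result below -32768")
    return result


a, b, c = 5, 7, 4
print((a + b) // c)     # 3
print(a // c + b // c)  # 2, the remainders 1 and 3 were discarded separately
```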


Floating Point Arithmetic:

In floating point arithmetic all the numbers are stored and processed in normalized exponential form. The process of addition under floating point arithmetic is discussed first.

Addition under Floating Point Arithmetic:

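Let $x$ and $y$ be the two numbers to be added and $z$ be the result. The normalized floating point representations of $x$, $y$ and $z$ are $x = M_x \times 10^{E_x}$, $y = M_y \times 10^{E_y}$ and $z = M_z \times 10^{E_z}$ respectively. The rules for carrying out the addition are as follows:

(a) Set $E_z = \max(E_x, E_y)$. Say $E_x > E_y$; then $E_z = E_x$.

(b) Right shift $M_y$ by $E_x - E_y$ places, so that the exponents of $x$ and $y$ are the same, and call it $M_y'$.

(c) Set $M_z = M_x + M_y'$.

(d) Normalize $M_z \times 10^{E_z}$ and let $M_z' \times 10^{E_z'}$ be its normalized representation.

(e) Set $z = M_z' \times 10^{E_z'}$.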

E.g : Add the numbers and

a)

b) on right shifting by 3 we get


c)

d) which is already in normalized form

i.e ,

e)
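The addition rules (a) to (e) can be sketched with decimal mantissas. Everything specific here is an assumption for illustration: a four-digit mantissa, chopping of digits shifted out, the function names, and the operands .5678E4 and .1234E1, which differ in exponent by 3.

```python
from decimal import Decimal, ROUND_DOWN


def chop(mantissa: Decimal, d: int) -> Decimal:
    """Keep d digits after the decimal point and drop the rest."""
    return mantissa.quantize(Decimal(1).scaleb(-d), rounding=ROUND_DOWN)


def normalize(mantissa: Decimal, exponent: int):
    """Shift the decimal point so that 0.1 <= |mantissa| < 1."""
    if mantissa == 0:
        return Decimal(0), 0
    shift = mantissa.adjusted() + 1
    return mantissa.scaleb(-shift), exponent + shift


def fp_add(x, y, d: int = 4):
    """Add two numbers given as (mantissa, exponent) pairs."""
    (mx, ex), (my, ey) = x, y
    if ex < ey:  # make x the operand with the larger exponent
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    ez = ex                                      # (a) larger exponent
    my_shifted = chop(my.scaleb(-(ex - ey)), d)  # (b) right shift the smaller operand
    mz = mx + my_shifted                         # (c) add the mantissas
    mz, ez = normalize(mz, ez)                   # (d) normalize the sum
    return chop(mz, d), ez                       # (e) result with a d-digit mantissa


print(fp_add((Decimal("0.5678"), 4), (Decimal("0.1234"), 1)))
```

With these operands the shifted mantissa is 0.0001 after chopping, so the digits 234 of the smaller number are lost and the result is .5679E4.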

Remark: Subtraction is nothing but addition of numbers with different signs.

Multiplication Under Floating Point Arithmetic:

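If $x = M_x \times 10^{E_x}$ and $y = M_y \times 10^{E_y}$ are two real numbers in normalized form, then their product is

$$z = x \cdot y = (M_x \cdot M_y) \times 10^{E_x + E_y},$$

which is then normalized.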

E.g : Say , then

Since is already in normalized form ,


.
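Multiplication can be sketched in the same style: multiply the mantissas, add the exponents, then normalize. The four-digit chopped mantissa, the function names and the operands are assumptions of this sketch.

```python
from decimal import Decimal, ROUND_DOWN


def fp_multiply(x, y, d: int = 4):
    """Multiply two numbers given as (mantissa, exponent) pairs."""
    (mx, ex), (my, ey) = x, y
    mz, ez = mx * my, ex + ey  # multiply mantissas, add exponents
    if mz == 0:
        return Decimal(0), 0
    shift = mz.adjusted() + 1  # normalize so that 0.1 <= |mz| < 1
    mz, ez = mz.scaleb(-shift), ez + shift
    return mz.quantize(Decimal(1).scaleb(-d), rounding=ROUND_DOWN), ez


# 0.2E3 * 0.3E2 = 200 * 30 = 6000 = 0.6E4
print(fp_multiply((Decimal("0.2"), 3), (Decimal("0.3"), 2)))
```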

Remark:

(1)

(after normalization)

During floating point arithmetic the mantissa 'M' may be truncated due to the limitation on the number of bits available for its representation on a computer.

(2) Floating point arithmetic is prone to the following errors:

a) Errors due to inexact representation of a decimal number in binary form. When the binary equivalent of a decimal number has a repeating fraction, it has to be terminated at some point.

b) Error due to round-off effect.

c) Subtractive cancellation: It is possible that some mantissa positions are unspecified. These unspecified positions may be arbitrarily filled by the computer. This may lead to serious loss of significance when two nearly equal numbers are subtracted. For example, if two numbers agree in all but the last digit of their mantissas, their difference has only one significant digit. However, the mantissa has provision to store more significant digits, and these positions may get arbitrarily filled since they are unspecified. Further, if the operands themselves are approximate representations due to this non-specification problem, the overall loss of significance becomes serious.

d) Basic laws of arithmetic, such as the associative and distributive laws, may not be satisfied, i.e. in general

$$(x + y) + z \neq x + (y + z), \qquad x (y + z) \neq x y + x z.$$
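Two of these effects can be observed directly, assuming IEEE 754 double precision floats as used by Python's `float`. The operand values are chosen here for illustration.

```python
# Associativity can fail: the two groupings round differently.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)

# Subtractive cancellation: the small difference of two nearly equal
# numbers inherits the representation error of the larger operand.
x = 1.0 + 1e-15
difference = x - 1.0
relative_error = abs(difference - 1e-15) / 1e-15
print(difference)      # not exactly 1e-15
print(relative_error)  # roughly 10 percent
```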

(3) Numerical computation involves a series of computations consisting of basic arithmetic operations. There may be round-off or truncation error at every step of the computation. These errors accumulate with the increasing number of computations in a process. There can be situations where even a single operation may magnify the round-off errors to a level that completely ruins the result.

A computation process in which the cumulative effect of all input errors is grossly magnified is said to be numerically unstable. It is important to understand the conditions under which the process is likely to be 'sensitive' to input errors and become unstable. Investigations to see how small changes in input parameters influence the output are termed sensitivity analysis.

(4) The effect of round-off and truncation errors on the final numerical result may be reduced by

a) Increasing the significant figures of the computer, either through hardware or through software manipulations. For instance, one may use double precision for floating point arithmetic operations.

b) Minimizing the number of arithmetic operations. Here one may try to rearrange a formula to reduce the number of arithmetic operations. For example, in the evaluation of a polynomial $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$, it may be rearranged as

$$p(x) = (\cdots((a_n x + a_{n-1}) x + a_{n-2}) x + \cdots + a_1) x + a_0,$$

which requires fewer arithmetic operations.

c) Replacing a formula that subtracts two nearly equal quantities by an algebraically equivalent form without that subtraction, to avoid subtractive cancellation.

d) While finding the sum of a set of numbers, arranging the set in ascending order of absolute value, i.e. when $|a| > |b| > |c|$, then $(c + b) + a$ is better than $(a + b) + c$.
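Items b) and d) can be sketched as follows. The function names and sample data are assumptions; the summation uses an explicit loop because Python's built-in `sum` may apply its own error compensation to floats in recent versions.

```python
def horner(coefficients, x):
    """Evaluate a polynomial given as [a_n, ..., a_1, a_0] with n multiplications."""
    result = 0
    for a in coefficients:
        result = result * x + a
    return result


def naive_sum(values):
    """Add the values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


print(horner([2, -6, 2, -1], 3))  # 2*27 - 6*9 + 2*3 - 1 = 5

values = [1e16] + [1.0] * 10
descending = naive_sum(sorted(values, key=abs, reverse=True))
ascending = naive_sum(sorted(values, key=abs))
print(descending)  # the ten small terms are lost one by one
print(ascending)   # the small terms are accumulated first and survive
```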

(5) It may not be possible to simultaneously reduce both the truncation and round-off error effects on the final result of a numerical computation. For instance, in an iterative procedure, when one tries to reduce the round-off error by increasing the step size, it may lead to higher truncation error, and vice versa. Hence proper care has to be taken to balance the two errors.

Numerical Errors:

Numerical errors arise during computations due to round-off errors and truncation errors.

Round-off Errors:

Round-off error occurs because computers use a fixed number of bits, and hence a fixed number of binary digits, to represent numbers. In a numerical computation round-off errors are introduced at every stage of computation. Hence, though an individual round-off error at a given numerical step may be small, the cumulative effect can be significant.

When the number of bits available is less than the number required for representing a number, the number is usually rounded to fit the available number of bits. This is done either by chopping or by symmetric rounding.

Chopping: Rounding a number by chopping amounts to dropping the extra digits. Here the given number is truncated. Suppose that we are using a computer with a fixed word length of four digits. Then only the first four significant digits of a number are retained and the remaining digits are dropped. To evaluate the error due to chopping, let us consider the normalized representation of the given number.

In general, if $x$ is the true value of a given number, $M_x \times 10^{E}$ is the normalized form of the rounded (chopped) number with a $d$-digit mantissa $M_x$, and $g_x \times 10^{E-d}$ is the normalized form of the chopping error, then

$$x = M_x \times 10^{E} + g_x \times 10^{E-d}.$$

Since $0 \le g_x < 1$, the chopping error satisfies

$$g_x \times 10^{E-d} < 10^{E-d}.$$

Symmetric Round-off Error :

In the symmetric round-off method the last retained significant digit is rounded up by 1 if the first discarded digit is greater than or equal to 5. In other words, if $g_x$ in $x = M_x \times 10^{E} + g_x \times 10^{E-d}$, where $M_x$ is the retained $d$-digit mantissa, is such that $g_x \ge 0.5$, then the last digit in $M_x$ is raised by 1 before chopping $g_x \times 10^{E-d}$.

When the last digit is raised, the error has magnitude $(1 - g_x) \times 10^{E-d}$; otherwise it is $g_x \times 10^{E-d}$. In either case,

$$\text{error} \le 0.5 \times 10^{E-d}.$$
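Chopping and symmetric rounding can be compared with the `decimal` module. The function name, the value 42.7893 and the four-digit word length are assumptions chosen here for illustration; `E` is the exponent of the normalized form and `d` the number of retained digits.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def to_d_digits(number: str, d: int, rounding) -> Decimal:
    """Keep d significant digits of a nonzero number using the given rounding mode."""
    value = Decimal(number)
    exponent = value.adjusted() + 1            # E in x = M * 10**E
    quantum = Decimal(1).scaleb(exponent - d)  # 10**(E - d), weight of the last kept digit
    return value.quantize(quantum, rounding=rounding)


x = Decimal("42.7893")
chopped = to_d_digits("42.7893", 4, ROUND_DOWN)
rounded = to_d_digits("42.7893", 4, ROUND_HALF_UP)
print(chopped, abs(x - chopped))  # 42.78 0.0093  (below 10**-2)
print(rounded, abs(x - rounded))  # 42.79 0.0007  (below 0.5 * 10**-2)
```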

Truncation Errors:

Often an approximation is used in place of an exact mathematical procedure. For instance, consider the Taylor series expansion of, say, $\sin x$, i.e.

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$$

Practically we cannot use all of the infinite number of terms in the series for computing the sine of angle x. We usually terminate the process after a certain number of terms. The error that results due to such a termination or truncation is called 'truncation error'.

Usually in evaluating logarithms, exponentials, trigonometric functions, hyperbolic functions etc., an infinite series of the form $\sum_{i=0}^{\infty} a_i x^i$ is replaced by a finite series $\sum_{i=0}^{n} a_i x^i$. Thus a truncation error of $\sum_{i=n+1}^{\infty} a_i x^i$ is introduced in the computation.

For example let us consider evaluation of exponential function using first three terms at

Truncation Error
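The truncation error of a three-term exponential series can be computed as in the sketch below. The evaluation point x = 0.5 is an assumption chosen here for illustration, as is the function name.

```python
import math


def exp_truncated(x: float, terms: int = 3) -> float:
    """Sum the first `terms` terms of the series 1 + x + x**2/2! + ..."""
    return sum(x ** k / math.factorial(k) for k in range(terms))


x = 0.5
approximation = exp_truncated(x)            # 1 + 0.5 + 0.125
truncation_error = math.exp(x) - approximation
print(approximation)     # 1.625
print(truncation_error)  # about 0.0237
```

Adding more terms shrinks the truncation error, at the cost of more arithmetic operations and hence more round-off.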

Some Fundamental definitions of Error Analysis:

Absolute and Relative Errors:

Absolute Error: Suppose that $x_t$ and $x_a$ denote the true and approximate values of a datum. Then the error incurred on approximating $x_t$ by $x_a$ is given by

$$e = x_t - x_a$$

and the absolute error, i.e. the magnitude of the error, is given by

$$e_a = |x_t - x_a|.$$

Relative Error: Relative error or normalized error $e_r$ in representing a true datum $x_t$ by an approximate value $x_a$ is defined by

$$e_r = \frac{e_a}{|x_t|} = \frac{|x_t - x_a|}{|x_t|}$$

and

Sometimes is defined by

If and then
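The two definitions can be sketched as small helpers. The function names and the sample data are assumptions; the relative error is taken here with respect to the true value.

```python
def absolute_error(true_value: float, approx_value: float) -> float:
    """Magnitude of the difference between the true and approximate values."""
    return abs(true_value - approx_value)


def relative_error(true_value: float, approx_value: float) -> float:
    """Absolute error normalized by the magnitude of the true value."""
    return absolute_error(true_value, approx_value) / abs(true_value)


print(absolute_error(100.0, 99.0))  # 1.0
print(relative_error(100.0, 99.0))  # 0.01
```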

Machine Epsilon: Let us assume that we have a decimal computer system.

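We know that we would encounter round-off error when a number is represented in floating-point form. With $x = M_x \times 10^{E} + g_x \times 10^{E-d}$, where $M_x$ is the retained $d$-digit mantissa, the relative round-off error due to chopping is defined by

$$e_r = \frac{g_x \times 10^{E-d}}{M_x \times 10^{E}}.$$

Here we know that $g_x < 1$ and $M_x \ge 0.1$, so that

$$e_r < \frac{1 \times 10^{E-d}}{0.1 \times 10^{E}} = 10^{1-d},$$

i.e. the maximum relative round-off error due to chopping is given by $10^{1-d}$. We know that the value of 'd', i.e. the length of the mantissa, is machine dependent. Hence the maximum relative round-off error due to chopping is also known as the machine epsilon $\epsilon$.

Similarly, the maximum relative round-off error due to symmetric rounding is given by

$$e_r \le \frac{0.5 \times 10^{E-d}}{0.1 \times 10^{E}} = \frac{1}{2} \times 10^{1-d}.$$

Machine epsilon for symmetric rounding is given by

$$\epsilon = \frac{1}{2} \times 10^{1-d}.$$

It is important to note that the machine epsilon represents an upper bound for the round-off error due to floating point representation.

For a computer system with binary representation, the machine epsilon due to chopping and symmetric rounding are given by

$$\epsilon = 2^{1-d} \quad \text{and} \quad \epsilon = 2^{-d}$$

respectively.

Eg: Assume that our binary machine has a 24-bit mantissa. Then $\epsilon = 2^{-24}$ under symmetric rounding. Say that our system can represent a $q$ decimal digit mantissa. Then

$$\frac{1}{2} \times 10^{1-q} = 2^{-24},$$

i.e.

$$q = 1 - \log_{10}\left(2^{-23}\right) \approx 7.9,$$

so that our machine can store numbers with seven significant decimal digits.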
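The machine epsilon of the running system can be found experimentally, and the decimal-digit estimate can be recomputed. This sketch assumes Python floats are IEEE 754 doubles with a 53-bit mantissa; the function names are illustrative. The loop finds the spacing between 1.0 and the next larger float, which corresponds to the chopping formula $2^{1-d}$.

```python
import math
import sys


def machine_epsilon() -> float:
    """Halve eps until adding half of it to 1.0 no longer changes 1.0."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps


def equivalent_decimal_digits(d: int) -> float:
    """Solve 0.5 * 10**(1 - q) = 2**(-d) for q, for a d-bit binary mantissa."""
    return 1 - math.log10(2.0 ** (1 - d))


print(machine_epsilon())              # 2**-52 for a 53-bit mantissa
print(sys.float_info.epsilon)         # the same value as reported by Python
print(equivalent_decimal_digits(24))  # about 7.9, i.e. seven full decimal digits
```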


Approximations and Round-off Errors

Approximations and errors are an integral part of numerical methods. Prior to using numerical methods it is essential to know how errors arise, how they grow during the numerical computations and how they affect the accuracy of a solution. Errors can come in a variety of forms and sizes. To get a quick feel let us look at the following taxonomy of errors:

Further discussion will be focussed on errors due to the computing machine and those due to the numerical method. Firstly the notion of significant digits will be introduced.

Significant Digits

Usually, the numerical solution to a given problem is sought to a desired level of accuracy and precision, wherein the error is below a set tolerance level. The idea of significant digits is essential to understand the concept of accuracy and precision in the solution and also to designate the reliability of a numerical value.

The Significant Digits of a number are those that can be used with confidence. Suppose we seek

a numerical solution to an accuracy of and obtain as solution . Here the

solution is reliable only up to the first three decimal places i.e or the solution has

five significant digits . Some numbers like , , etc. have infinite number

of significant digits. For example consider ,

=

Such numbers can never be represented exactly on a computer which operates with a fixed number of significant digits due to hardware limitations. The omission of certain digits from such numbers results in what is called round-off error. Some rules of thumb on significant digits, within the desired level of accuracy, are:

(a) All non-zero digits are significant,

(b) All zeros occurring between non-zero digits are significant,

(c) Trailing zeros following a decimal point are significant.

(e.g , , have three significant digits),

(d) Zeros between the decimal point and preceding a non-zero digit are not significant. Forexample , , , have

four significant digits.

(e) Trailing zeros in large numbers without the decimal point are not significant. For instance may be written in scientific notation as and contains only two significant

digits.

The concepts of accuracy and precision are closely related to significant digits as follows: Accuracy refers to the number of significant digits in a value. Precision refers to the number of decimal positions, i.e. the order of magnitude of the last digit in a value.
