Computer Arithmetic in Computer Architecture


Page 1: Computer arithmetic in computer architecture

Isha Padhy,CSE Dept, CBIT

Flynn’s Taxonomy
• The purpose of parallel processing is to speed up computation and increase throughput; the extra hardware this requires makes the system costlier.
• Michael Flynn proposed a classification of computer architectures based on the number of instruction streams and data streams (Flynn’s Taxonomy).
• The instruction stream is the sequence of instructions fetched from memory and performed by the processing unit.
• The data stream is the sequence of data items on which those instructions operate.
• A stream simply means a sequence of items (data or instructions).
Flynn’s Taxonomy:
• SISD: Single instruction, single data (one instruction on one data item at a time) – the classical von Neumann architecture

Page 2: Computer arithmetic in computer architecture


• SIMD: Single instruction multiple data

There are a number of processing elements, each with its own registers, but a single common control unit; all the processing elements execute the same instruction. Example: image processing — adding 2 images. Each image is an m×n matrix whose entries are intensity values in digitized form, and the same operation (add) is performed on every element of the matrices.

• MISD: Multiple instructions single data

- Non-existent in practice, listed only for completeness.
- Theoretically possible, but performing multiple operations on the same data stream is hardly ever useful.
• MIMD: Multiple instructions multiple data

- Most common and general parallel machine

- Leads to a type of computing called distributed computing: a number of independent computers, each executing a different instruction stream on a different data set. If the different instruction streams belong to the same task, the computers must interact with each other, which can be done over a LAN or other closely coupled network.

Page 3: Computer arithmetic in computer architecture


SISD SIMD

MISD MIMD

Page 4: Computer arithmetic in computer architecture

Non-pipeline process vs. pipeline process

- Pipelining is a technique of decomposing a sequential process into sub-processes, each executed in a dedicated segment that operates concurrently with all other segments. It is an implementation technique whereby multiple instructions are executed in an overlapped manner. The main aim is to increase the throughput of the processor.
- This technique is efficient for applications that repeat the same task on different sets of data.

Page 5: Computer arithmetic in computer architecture


Non-pipeline process vs. pipeline process

Page 6: Computer arithmetic in computer architecture


General considerations: 4-segment pipeline

• The operands pass through all the segments in a fixed sequence.
• Task T is divided into subtasks t1, t2, t3, t4.
• Assume there are 4 computing blocks S1, S2, S3, S4 executing S1(t1), S2(t2), S3(t3), S4(t4), each taking the same amount of time (τ).
• So completing the 4 subtasks of task T takes 4τ for a single instruction. For n tasks, the completion time without pipelining is 4nτ.
• If we have identical jobs to be done, the pipeline concept can be used to increase the throughput.
• Once the pipeline is full — every processing element has a job to execute — we get one output every τ. Without pipelining we get one output only every 4τ.

Page 7: Computer arithmetic in computer architecture

Page 8: Computer arithmetic in computer architecture

• If there are k stages in the pipeline, where every stage takes τ time, and n jobs to be executed:
- In a non-pipeline architecture the time taken is nkτ.
- In a pipeline architecture it is (n + (k − 1))τ. Example: number of stages k = 4, number of tasks n = 6, so 6 + (4 − 1) = 9 clock cycles.
- In general the stages will not all take the same amount of time; the cycle time is set by the slowest stage:
τ = max{τi} + τl, where τi is the time taken by the i-th stage (i = 1..k) and τl is the latching delay.
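The timing formulas above can be checked with a small sketch; the function names are illustrative (not from the slides), and results are in units of the cycle time τ:

```c
#include <assert.h>

/* Clock cycles for n jobs through a k-stage pipeline vs. no pipeline.
   Hypothetical helper names; results are multiples of the cycle time tau. */
int nonpipeline_cycles(int n, int k) { return n * k; }       /* n*k*tau     */
int pipeline_cycles(int n, int k)    { return n + (k - 1); } /* (n+k-1)*tau */
```

For the slide's example (k = 4, n = 6) the pipeline finishes in 9 cycles instead of 24.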

• 2 areas where pipeline organization is applicable: arithmetic pipelines and instruction pipelines.

Page 9: Computer arithmetic in computer architecture


Arithmetic Pipeline
• Such units are found in high-speed computers for floating-point operations.
• Example: adding 2 floating-point numbers A = a × r^p and B = b × r^q, where a, b are mantissas and p, q are exponents (r = 10 in the example below).
• Steps to execute:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
• Example: A = 0.5 × 10^3, B = 0.3 × 10^5. Equalize the exponents: A becomes 0.005 × 10^5; then 0.005 × 10^5 + 0.3 × 10^5 = 0.305 × 10^5.
• If the result has leading zeros after the point, they are removed and the exponent is reduced accordingly, e.g. 0.003 × 10^4 = 0.3 × 10^2.

Page 10: Computer arithmetic in computer architecture


Pipeline for floating addition
1. The first latch takes one fraction and exponent pair (a, p); the second latch takes (b, q).
2. The exponent comparator compares p and q; depending on the result, the fraction part must be aligned to the larger exponent (e.g. to 10^5).
3. Fraction selector: takes a, b and the comparator result, and right-shifts the smaller fraction (0.5 -> 0.005) according to the difference between p and q.
4. |p − q| gives the number of positions to shift.
5. The number of latches should be equal on both sides (exponent and fraction); otherwise the time skew leads to data loss (we would get a fraction with no exponent, or vice versa).
6. From the exponent comparator we take the maximum exponent.
7. Leading-zero counter: gives the number of positions to left-shift for re-normalization of the result.
8. The result is then left-shifted by the number of leading zeros.
9. Exponent subtractor: the number of left shifts is deducted from the exponent (e.g. 0.003 × 10^4 = 0.3 × 10^2).
10. The final result is d × r^s.

Page 11: Computer arithmetic in computer architecture


Note: e.g. for the addition of 2 linear arrays of floating-point numbers, a pipeline can be used. Latches are required only when the computation speed differs between pipeline stages. This type is called an execution pipeline.


Page 12: Computer arithmetic in computer architecture


Instruction Pipeline
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• Dividing instruction execution across several stages requires each stage to access only a subset of the CPU's resources (ALU, registers, bus, etc.). Ideally a new instruction can be issued every cycle, though this may vary.
• Any instruction can be executed in the following steps:
- Fetch the instruction
- Decode the instruction
- Calculate the effective address
- Fetch the operands from memory
- Execute the instruction
- Store the result in the proper place
• Pipelining instructions can be difficult because the segments do not all require the same amount of time, some segments are skipped (e.g. register-transfer instructions need no effective-address calculation), and two segments may need to read memory at the same time.

Page 13: Computer arithmetic in computer architecture

4-segment instruction pipeline
• We assume all segments take the same time.
• Steps 2 and 3 are combined, and steps 5 and 6 can be combined and executed together. We assume processor registers hold the result.
• Fig 1: while one instruction is being executed in segment 4, another instruction is busy fetching an operand from memory in segment 3.
• When the instruction is a program-control type (a branch out of the normal sequence), the pending operations in the last 2 segments are completed, all instructions in the instruction buffer are deleted, and the new address is entered into the PC.

Page 14: Computer arithmetic in computer architecture

Page 15: Computer arithmetic in computer architecture

Timing of instruction pipeline


- We assume the processor has separate instruction and data memories, so FI and FO can occur at the same time.
- With no branch instructions, each segment operates on a different instruction every cycle.
- When DA of the 3rd instruction decodes it as a branch, FI of the 4th instruction is halted until the branch instruction is executed. If the branch is taken, the new target instruction is fetched (step 7); otherwise the normal 4th instruction proceeds.

FI: segment 1, fetches the instruction. DA: segment 2, decodes the instruction and calculates the effective address. FO: segment 3, fetches the operands. EX: segment 4, executes the instruction.

Page 16: Computer arithmetic in computer architecture


Problems with the instruction pipeline (pipeline hazards)
• A pipeline hazard occurs when some operational condition prevents the pipeline from continuing execution in a given phase. The major pipeline hazards are:
• Resource hazard (resource conflict)
• Data hazard (data dependency)
• Branch hazard (branch difficulties)
1. Resource hazard: caused by two segments accessing shared memory at the same time. This can be resolved by using separate instruction and data memories.

Page 17: Computer arithmetic in computer architecture


• 2. Data hazard: arises when an instruction depends on the result of a previous instruction that is not yet available. For example, an instruction in the FO segment may need to fetch an operand that is still being produced by the previous instruction. The most common techniques used to resolve data hazards are:

(a) Hardware interlock — a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline, and then inserts enough clock cycles to delay the execution of such instructions.

(b) Operand forwarding — special hardware detects the conflict and avoids it by routing the data through a special path between pipeline segments. For example, instead of transferring an ALU result only to its destination register, the hardware checks whether the result is needed by the next instruction and, if so, passes it directly to the ALU input, bypassing the register.

(c) Delayed load — a software solution in which the compiler detects the conflict and reorders the instructions, delaying the load of conflicting data by inserting a no-operation instruction.

Page 18: Computer arithmetic in computer architecture


• 3. Branch hazard: the major obstruction to an instruction pipeline is branch instructions. A branch can be conditional (depending on the condition, execution branches to the target instruction address or continues with the next instruction in sequence) or unconditional (the PC is always assigned the target instruction address). Computers use hardware techniques to minimize this hazard:

(a) Multiple streams — replicates the initial portion of the pipeline so that it fetches both candidate instructions (the branch target and the instruction after the branch), maintaining two streams. When the condition resolves, either the target stream or the fall-through stream is selected.

(b) Branch target buffer (BTB) — a memory in the fetch segment. Each entry holds the address of a previously executed branch instruction and its target instruction. When an instruction decodes as a branch, the BTB is consulted: if information for that branch is available it is used; otherwise the pipeline switches to the new instruction stream and stores the target instruction in the BTB. Thus previously used branch targets are readily available.

Page 19: Computer arithmetic in computer architecture


(c) Loop buffer — a variation of the BTB: a small, very-high-speed memory maintained by the instruction fetch stage of the pipeline. When a loop is detected in the program, the complete set of loop instructions, including any branches inside the loop, is kept in this buffer, and the loop can execute without accessing main memory until it exits.

• (d) Branch prediction — uses additional logic to predict the outcome of a (conditional) branch before it is executed.

• (e) Delayed branch — employed in most RISC processors: the compiler detects branch instructions and rearranges the code, inserting no-operation instructions after branches, to avoid pipeline hazards.

Page 20: Computer arithmetic in computer architecture


RISC PIPELINE
• In RISC, all instruction formats have the same length, so instruction decoding and register selection can occur at the same time. Every operand is held in a register, so no effective-address calculation is needed.
• The pipeline can therefore use 2 or 3 segments, with the steps:
- Fetch the instruction from memory
- Execute the operation in the ALU
- Store the result in the destination register
• For register transfers, only 2 memory instructions are present: load and store.
• To prevent conflicts between a memory access to fetch an instruction and one to load/store an operand, there are 2 separate memories, one for instructions and one for data.
• RISC can complete an instruction in 1 clock cycle, so the pipeline segments are easily balanced; a CISC pipeline would have many segments, with the longest segment taking 2 or 3 clock cycles:
- Fetch the instruction
- Decode the instruction
- Calculate the effective address
- Fetch the operands from memory
- Execute the instruction
- Store the result in the proper place

Page 21: Computer arithmetic in computer architecture


• RISC instruction formats: 32-bit instructions, 31 instructions total.
• Addressing modes: register addressing, immediate addressing, and PC-relative addressing for branch instructions.
• 138 registers: 8 windows of 32 registers each (10 global, 10 local, 6 low-overlapping, 6 high-overlapping).
• Instruction format (32 bits): an 8-bit opcode field (7 bits of operation + 1 status bit depending on the ALU operation); 5-bit register fields (selecting 1 of 32 registers); bit 13 = 0 means the low-order bits of S2 specify another source register, bit 13 = 1 means S2 is a 13-bit immediate operand; Y is a relative address for jumps; COND is a condition field.
• The 31 instructions cover data manipulation, data transfer, and program control.
• All instructions have 3 operands: the 2nd operand can be a register or an immediate operand; the 3rd register can be specified as R0 (all bits 0).
• LDL (R22)#150, R5 // R5 <- M[R22 + 150]
  LDL (R22)#0, R5   // R5 <- M[R22]

Register mode:     | Opcode (8) | Rd (5) | Rs (5) | 0 (1) | Not used (8) | S2 (5) |
Register-imm mode: | Opcode (8) | Rd (5) | Rs (5) | 1 (1) | S2 (13) |
PC-relative mode:  | Opcode (8) | COND (5) | Y (19) |

Page 22: Computer arithmetic in computer architecture


3-segment instruction pipeline
The instruction cycle is divided into 3 segments:
- (I) Instruction fetch: fetches the instruction from program memory; it is decoded and the registers needed for execution are selected.
- (A) ALU operation: the ALU performs one of 3 operations depending on the decoded instruction:
a. a data manipulation operation;
b. evaluation of the effective address for a load or store instruction (register-indirect mode);
c. calculation of the branch address for a program-control instruction: PC <- PC + Y.
- (E) Execute: directs the output of the ALU to one destination (destination register, data memory, or the branch address to the PC).

Page 23: Computer arithmetic in computer architecture


Delayed Load
• LOAD:  R1 <- M[address 1]
  LOAD:  R2 <- M[address 2]
  ADD:   R3 <- R1 + R2
  STORE: M[address 3] <- R3
• When these instructions are pipelined, the A segment in cycle 4 needs the data in R2, but R2 does not yet hold the correct value because the data has not been transferred from memory at that time. To solve this we can insert a no-operation instruction between the 2 conflicting instructions, so that one clock cycle is spent idle and the use of the data loaded from memory is delayed.

Page 24: Computer arithmetic in computer architecture


Vector processing
In computing, a vector processor or array processor is a CPU that implements an instruction set containing instructions that operate on 1-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Vector processors have high-level operations that work on linear arrays of numbers ("vectors"):
=> each result is independent of the previous result
=> long pipelines; the compiler ensures there are no dependencies
=> high clock rate

Page 25: Computer arithmetic in computer architecture


• Five basic types of vector operations:
1. V <- V          Example: complement all elements
2. S <- V          Examples: min, max, sum
3. V <- V x V      Examples: vector addition, multiplication, division
4. V <- V x S      Examples: multiply or add a scalar to a vector
5. S <- V x V      Example: calculate an element of a matrix
• Application areas where vector processing is used: long-range weather forecasting, biological modeling, car-crash simulation, speech and image processing, etc.

Page 26: Computer arithmetic in computer architecture


Vector Operations
• V = [V1, V2, ..., Vn] is a row vector of length n.
• The element Vi of vector V is written V[I], where the index I refers to the memory address or register where the number is stored.
• On a scalar machine, element-wise addition is a loop:

    int a[10], b[10], c[10];
    for (i = 0; i < 10; i++)
        c[i] = a[i] + b[i];

which executes as: initialize I = 0; read a[I] and b[I]; store c[I] = a[I] + b[I]; increment I = I + 1 (pointing to the next location); if I < 10, continue.
• Vector processors have the ability to remove the overhead of instruction fetch and execution in this loop.

Page 27: Computer arithmetic in computer architecture


• The same computation can be written as a single vector instruction: C(1:10) = A(1:10) + B(1:10)
• Any vector instruction has fields for the operation code, the base addresses of the source and destination vectors, and the vector length.
• The addition is done with a pipelined floating-point adder, by the same process as for scalars.

Page 28: Computer arithmetic in computer architecture


Complexity = 27 operations. An n×n matrix multiply requires n³ multiply-add operations, e.g. C11 = ((c11 + a11·b11) + a12·b21) + a13·b31; there are 9 such elements, each needing 3 multiply-adds, so 9 × 3 = 27 operations.
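The count can be confirmed with a naive multiply that tallies multiply-adds; a sketch only, and `matmul3` is a hypothetical name:

```c
#include <assert.h>

/* Naive 3x3 matrix multiply that counts multiply-add operations:
   each of the 9 outputs accumulates 3 products, so 3^3 = 27 in total. */
int matmul3(int A[3][3], int B[3][3], int C[3][3]) {
    int ops = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            C[i][j] = 0;
            for (int k = 0; k < 3; k++) {
                C[i][j] += A[i][k] * B[k][j];  /* one multiply-add */
                ops++;
            }
        }
    return ops;
}
```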

Page 29: Computer arithmetic in computer architecture


• Every inner product is a sum of k products: C = A1B1 + A2B2 + ... + AkBk.
• The multiplier and adder each have 4-segment pipelines. One multiply enters per clock cycle, so in the first 4 cycles A1B1, A2B2, A3B3, A4B4 are finished, while the outputs of the adder are all 0.
• During the next 4 cycles, 0 is added to each of the first 4 products. After these 8 cycles, A1B1 to A4B4 are in the adder and A5B5 to A8B8 are in the multiplier segments. At the beginning of the 9th cycle the adder output A1B1 is added to A5B5, then A2B2 + A6B6, and so on.
• So the sum accumulates in 4 sections:
C = (A1B1 + A5B5 + A9B9 + ...) + (A2B2 + A6B6 + ...) + (A3B3 + A7B7 + ...) + (A4B4 + A8B8 + ...)
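The four interleaved partial sums can be modeled functionally (not cycle-accurate; the helper name is hypothetical):

```c
#include <assert.h>

/* Inner product accumulated into four interleaved partial sums, as the
   4-segment pipelined adder produces them, then combined at the end. */
double inner_product4(const double *a, const double *b, int n) {
    double s[4] = {0, 0, 0, 0};      /* the four "sections" of the sum */
    for (int i = 0; i < n; i++)
        s[i % 4] += a[i] * b[i];     /* A1B1+A5B5+..., A2B2+A6B6+..., etc. */
    return s[0] + s[1] + s[2] + s[3];
}
```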

Page 30: Computer arithmetic in computer architecture


Memory Interleaving
- For pipelining and vector processing, memory may need to be accessed simultaneously by 2 sources.
- The memory can be partitioned into a number of modules connected to a common address bus and data bus.
- Each memory module has its own memory array and its own registers. With 4 modules, the 2 least significant bits of the address select the module and the remaining bits select the exact location within it.
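For 4 modules the address split looks like this (hypothetical helper names, a sketch of the decode only):

```c
#include <assert.h>

/* Four-way interleaving: the 2 least significant address bits select the
   module; the remaining bits select the word within the module. */
unsigned module_of(unsigned addr) { return addr & 3u; }  /* low 2 bits */
unsigned offset_of(unsigned addr) { return addr >> 2; }  /* remaining bits */
```

Consecutive addresses fall in successive modules, so a sequential sweep keeps all four modules busy.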

Page 31: Computer arithmetic in computer architecture


Array Processors
• Suitable for scientific computations involving two-dimensional matrices.
• An array processor has single instructions for mathematical operations, for example:
SUM a, b, N, M — a: base address of one array, b: base address of another vector, N: the number of elements in the column vector, M: the number of elements in the row vector.
• 2 types:
- Attached array processor:
1. An auxiliary processor attached to a general-purpose computer; it enhances the performance of the host computer in specific numerical computation tasks.
2. It achieves this by parallel processing: it contains one or more pipelined floating-point adders and multipliers.
3. It can be programmed by the user.
4. The array processor is connected through an input/output controller to the computer, and data is received from main memory through a high-speed bus.

Page 32: Computer arithmetic in computer architecture


SIMD array processor (example: ILLIAC IV)

Page 33: Computer arithmetic in computer architecture

Page 34: Computer arithmetic in computer architecture

Why use array processor

Page 35: Computer arithmetic in computer architecture


Computer arithmetic

• Addition, subtraction, multiplication, and division of:
1. Fixed-point binary data (integer or fraction) in signed-magnitude representation
2. Fixed-point binary data in signed 2's-complement representation
3. Floating-point binary data
4. Binary-coded decimal (BCD) data

Page 36: Computer arithmetic in computer architecture


Addition and subtraction
• Negative fixed-point binary numbers can be represented in signed-magnitude, signed-1's-complement, or signed-2's-complement form.
• These representations are used for numbers held in registers before and after an arithmetic operation.
• Addition (subtraction) algorithm:
- When the signs of A and B are identical (different), add the 2 magnitudes and attach the sign of A to the result.
- When the signs are different (identical), compare the magnitudes and subtract the smaller from the larger; the sign of the result is the sign of A if A > B, or the complement of the sign of A if A < B. If A = B, subtract B from A and make the result sign +.
- Hardware implementation: 2 registers (Ra, Rb) hold the 2 operands, 2 flip-flops (As, Bs) store the signs, a parallel adder forms A + B, a comparator circuit determines A > B, B > A, or A = B, and 2 parallel subtractor circuits form A − B and B − A. The result sign can be formed from an XOR gate with As, Bs as inputs.
This can be implemented more economically as:
- Subtraction is done as A + (2's complement of B).
- An E flip-flop determines the relative magnitude of A and B.
- AVF holds the overflow bit when A and B are added.
- When the mode signal M = 0, B is transferred to the adder with input carry 0, so the sum is A + B.
- When M = 1, the 1's complement of B (from the complementer) plus A plus the input carry 1 gives A − B.
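The sign-and-magnitude rules above can be sketched as follows (signs encoded as 0 = plus, 1 = minus; the function name is illustrative, not from the slides):

```c
#include <assert.h>

/* Signed-magnitude addition following the rules above: like signs add
   the magnitudes and keep the sign of A; unlike signs subtract the
   smaller magnitude from the larger and pick the sign accordingly. */
void sm_add(int as, int a, int bs, int b, int *rs, int *r) {
    if (as == bs)   { *r = a + b; *rs = as; }  /* identical signs */
    else if (a > b) { *r = a - b; *rs = as; }  /* A > B: sign of A */
    else if (a < b) { *r = b - a; *rs = bs; }  /* A < B: complemented sign of A */
    else            { *r = 0;     *rs = 0;  }  /* A = B: result is +0 */
}
```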

Page 37: Computer arithmetic in computer architecture


Addition and subtraction with signed-magnitude

Page 38: Computer arithmetic in computer architecture


Multiplication Algorithms
• Paper-and-pencil multiplication: if a multiplier bit is 1, copy the multiplicand; if 0, copy all zeros. Each successive partial product is shifted one position to the left of the previous one, and finally all the partial products are added.
• Hardware steps:
1. Initially the multiplicand is in B and the multiplier in Q, with their signs in Bs, Qs.
2. Registers A and E are cleared (A <- 0, E <- 0) and the sequence counter is set to the number of multiplier bits (SC <- n − 1). We assume an n-bit word (operand) is transferred from memory, with n − 1 magnitude bits and 1 sign bit.
3. The low-order bit Qn is tested: if 1, the multiplicand is added to A (initially 0); if 0, nothing is added. Register EAQ is then shifted right, the sequence counter is decremented, and if the new value is not 0 the process repeats.
4. The partial product formed in A is gradually shifted into Q; ultimately the low-order bits of the result are in Q and the high-order (MSB) bits in A.
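In software the same loop looks like this — a sketch of the magnitude part only, with sign handling omitted and an illustrative function name:

```c
#include <assert.h>

/* Shift-and-add multiplication of unsigned magnitudes: test the low
   multiplier bit, add the (aligned) multiplicand if it is 1, then move
   to the next bit. Shifting the multiplicand left here is equivalent to
   the hardware's right shift of the partial product EAQ. */
unsigned shift_add_mul(unsigned multiplicand, unsigned multiplier) {
    unsigned product = 0;
    while (multiplier) {
        if (multiplier & 1u)        /* Qn = 1: add multiplicand to A */
            product += multiplicand;
        multiplicand <<= 1;         /* realign for the next bit */
        multiplier >>= 1;           /* examine the next multiplier bit */
    }
    return product;
}
```

10111 × 10011 (23 × 19) gives 110110101 (437), matching the worked example below.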

Page 39: Computer arithmetic in computer architecture


      10111
    × 10011
    -------
      10111
     10111
    00000
   00000
  10111
  ---------
  110110101

Page 40: Computer arithmetic in computer architecture

Page 41: Computer arithmetic in computer architecture

Booth’s Algorithm
- The algorithm is used for multiplying binary integers in signed 2's-complement representation.
- It exploits runs of 1s in the multiplier: a run can be replaced by one subtraction and one addition, e.g. multiplying the multiplicand M by 14 uses 2^(3+1) − 2^1, i.e. M × 14 = M × 2^4 − M × 2^1.
- The algorithm still checks multiplier bits and shifts the partial product, but prior to each shift one of the following actions may be taken:

Page 42: Computer arithmetic in computer architecture

Page 43: Computer arithmetic in computer architecture

Example of Booth’s algorithm

• 6 × 2; 6 = 0110, 2 = 0010. 6 is placed in BR (the multiplicand register), 2 in QR (the multiplier register).

Qn  Qn+1  Action
0   0     arithmetic shift right
1   1     arithmetic shift right
1   0     A <- A − BR, then shift
0   1     A <- A + BR, then shift

Step             BR    A     QR    Qn+1  SC
(initial)        0110  0000  0010  0     100
00: shift        0110  0000  0001  0     011
10: sub, shift   0110  1101  0000  1     010   (A: 0000 − 0110 = 1010, then shift)
01: add, shift   0110  0001  1000  0     001   (A: 1101 + 0110 = 0011, then shift)
00: shift        0110  0000  1100  0     000

Result: AQ = 0000 1100 = 12.

Exercise: −9 (10111) × −13 (10011).
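A sketch of the whole algorithm on n-bit two's-complement operands, following the bit-pair table above; `booth_mul` is a hypothetical name, and the A, Q, Qn+1 registers are emulated with plain integers:

```c
#include <assert.h>

/* Booth multiplication: examine (Qn, Qn+1); on pair 10 do A <- A - BR,
   on pair 01 do A <- A + BR, then arithmetic-shift-right A,Q,Qn+1 and
   repeat n times. The product is reassembled as A*2^n + Q. */
int booth_mul(int br, int qr, int n) {
    int a = 0, qn1 = 0;
    int mask = (1 << n) - 1;
    qr &= mask;                                /* keep Q to n bits */
    for (int sc = 0; sc < n; sc++) {
        int q0 = qr & 1;
        if (q0 == 1 && qn1 == 0) a -= br;      /* pair 10: subtract BR */
        if (q0 == 0 && qn1 == 1) a += br;      /* pair 01: add BR */
        qn1 = q0;                              /* Qn moves into Qn+1 */
        qr = (qr >> 1) | ((a & 1) << (n - 1)); /* A's low bit enters Q's MSB */
        a >>= 1;                               /* sign-propagating shift of A */
    }
    return a * (1 << n) + qr;                  /* combine A and Q */
}
```

Note: `a >>= 1` relies on arithmetic right shift of negative ints, which is implementation-defined in C but standard behavior on two's-complement targets.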

Page 44: Computer arithmetic in computer architecture


2 bit by 2 bit array multiplier

- Why AND gates: the partial-product bit a0b0 is 1 only if a0 = 1 and b0 = 1, which is exactly a0 AND b0.
- For j multiplier bits and k multiplicand bits, we need j × k AND gates and (j − 1) k-bit adders to produce a product of (j + k) bits.

Page 45: Computer arithmetic in computer architecture


4 bit by 3 bit array multiplier

Page 46: Computer arithmetic in computer architecture


Division

          0001001
    1000 ) 1001010
           1000
           ----
              1010
             −1000
             -----
                10   (remainder)

- Shift the divisor right and compare it with the current partial dividend.
- If the divisor is larger, shift 0 into the quotient as the next bit.
- If the divisor is smaller, subtract it to get the new partial dividend and shift 1 into the quotient as the next bit.

Page 47: Computer arithmetic in computer architecture


Division Algorithm

Steps:
- The divisor is in B; the double-length dividend is stored in A and Q.
- The dividend is shifted left and the divisor is subtracted by adding its 2's-complement value.
- E gives the relative magnitude of A and B.
- If E = 1 (A >= B), 1 is inserted into Qn; then shift left and subtract B again.
- If E = 0 (A < B), Qn = 0; add B back to restore the value of A, then shift left and subtract B.
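The shift-and-subtract loop can be sketched as follows — magnitudes only, with the restore step reduced to a compare; the names are illustrative:

```c
#include <assert.h>

/* Restoring division over n bits: shift the combined AQ left, try to
   subtract the divisor from A; if it fits, keep the difference and set
   Qn = 1, otherwise "restore" (here: simply skip the subtraction) and
   leave Qn = 0. */
void restoring_div(unsigned dividend, unsigned divisor, int n,
                   unsigned *quotient, unsigned *remainder) {
    unsigned a = 0, q = dividend;
    for (int i = 0; i < n; i++) {
        a = (a << 1) | ((q >> (n - 1)) & 1u);  /* shl EAQ: A takes Q's MSB */
        q = (q << 1) & ((1u << n) - 1u);
        if (a >= divisor) { a -= divisor; q |= 1u; }  /* E = 1: Qn <- 1 */
        /* else E = 0: Qn stays 0 and A keeps its restored value */
    }
    *quotient = q;
    *remainder = a;
}
```

Dividing 1001010 by 1000 over 7 bits gives quotient 0001001 and remainder 10, as in the long-division example.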

Page 48: Computer arithmetic in computer architecture


Divide Overflow
• Conditions for overflow:
1. The high-order half of the dividend is greater than or equal to the divisor.
2. Division by 0.
• The overflow condition is recorded in a flip-flop called DVF.
• Handling of overflow:
- Check the overflow condition after every divide operation; if DVF is set, branch to a subroutine that takes measures to rescale the data.
- Divide-stop: halt the operation of the computer.
- Send an interrupt request.
- Stop the program and give the user an error message explaining why the program was stopped.
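The DVF test itself is a one-line check (hypothetical helper name):

```c
#include <assert.h>

/* Divide overflow: set DVF when the high-order half of the double-length
   dividend is >= the divisor, or when dividing by zero. */
int divide_overflow(unsigned high_half, unsigned divisor) {
    return divisor == 0 || high_half >= divisor;
}
```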

Page 49: Computer arithmetic in computer architecture

Page 50: Computer arithmetic in computer architecture

Explanation of the flow chart

• The dividend is kept in AQ, the divisor in B.
• The sign bit of A is As, of B is Bs; Qs = As XOR Bs (if A and B have the same sign, Qs = 0, otherwise Qs = 1).
• EA <- A + B' + 1 // compute A − B to check whether the high half of the dividend is >= the divisor.
• If E = 1: DVF <- 1; dividend >= divisor, so the hardware will not allow the division — the overflow condition prevails.
• If E = 0 (A < B): DVF <- 0, and we move on to the next step.
• In both cases B is then added back (EA <- A + B) to restore the value of A (in the previous step it was subtracted only to test whether A >= B).
• Main loop, repeated until SC reaches 0:
1. shl EAQ.
2. EA <- A + B' + 1 (add the 2's complement of B).
3. If E = 1: Qn <- 1 (the subtraction succeeded).
4. If E = 0: EA <- A + B (restore A).
5. Decrement SC and continue from step 1.

Page 51: Computer arithmetic in computer architecture


Floating-point arithmetic operations

• A floating-point number consists of 2 parts: a mantissa m and an exponent e, representing m × r^e.
• A floating-point number that has a 0 in the most significant position of the mantissa is said to have an underflow (it is unnormalized). To normalize the number, shift the mantissa left and decrement the exponent until a nonzero digit appears in the first position. Ex: 0.00345 × 10^5 = 0.345 × 10^3.
• The register configuration is the same as for fixed-point operations; the difference lies in the way the exponents are handled.

Page 52: Computer arithmetic in computer architecture


Registers for floating point arithmetic operations

• 3 registers: BR, AC, QR.
• Each has 2 parts: a mantissa (upper-case letter) and an exponent (lower-case letter).
• Each floating-point number is assumed to have a mantissa in signed magnitude and a biased exponent (the bias is a number added to the exponent; e.g. for exponents ranging from −50 to +49, we store 00 to 99 as positive numbers: stored values 50 to 99 represent the positive exponents, and 00 to 49 the negative exponents).
• Comparator: compares the biased exponents so the mantissas can be aligned for normalization.
• As, Qs, Bs are the sign bits of A, Q, B; A1 is 1 if the number is to be normalized.
• There are 2 parallel adders: one adds the mantissas (A: Q + B, with carry bit E), the other adds the exponents (a: q + b).

Page 53: Computer arithmetic in computer architecture


Addition and subtraction

Page 54: Computer arithmetic in computer architecture


Multiplication of floating point numbers

Page 55: Computer arithmetic in computer architecture


Division of floating point numbers

Page 56: Computer arithmetic in computer architecture


BCD Add
- To add 2 BCD digits, the maximum value of an operand is 9, so the largest sum is 9 + 9 + 1 (carry) = 19. The adder forms the sum in binary, and the result will be within 19.
- The binary sum must then be converted to BCD form: up to 1001 (9) the binary and BCD representations agree, but above that, adding binary 6 (0110) to the binary sum converts it to the correct BCD digit and carry.
- The circuit detects the condition under which this correction is required:
1. when the output carry K = 1;
2. for sums 1010 to 1111, the Z8 bit is 1 — but Z8 is also 1 for the valid sums 1000 and 1001, so Z8 must be combined with Z4 or Z2.
- This gives the Boolean function:

C = K + Z8·Z4 + Z8·Z2

- When C = 1, add 0110 to the binary sum and provide a carry to the next stage.
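One decimal digit stage of the BCD adder can be sketched as follows (hypothetical function name; the `z > 9` test mirrors C = K + Z8Z4 + Z8Z2):

```c
#include <assert.h>

/* One BCD digit stage: add the two digits plus carry-in in binary
   (sum 0..19); if the sum exceeds 9, the correction adds 0110 and
   produces a carry into the next decimal stage. */
unsigned bcd_digit_add(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    unsigned z = a + b + cin;   /* plain binary sum */
    if (z > 9) {                /* correction needed: C = 1 */
        z = (z + 6) & 0xFu;     /* add 0110, keep the low 4 bits */
        *cout = 1;
    } else {
        *cout = 0;
    }
    return z;
}
```

For example, 9 + 9 + 1 gives digit 9 with carry 1, i.e. decimal 19.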

Page 57: Computer arithmetic in computer architecture

Page 58: Computer arithmetic in computer architecture