

<Undergraduate Thesis: EEE-2007-01-48>

Design of MIPS Processor Supporting

MAC with FPGA

Munju Lee, Taewoo Han

School of Electrical and Electronic Engineering

College of Engineering

Yonsei University

Design of MIPS Processor Supporting

MAC with FPGA

Thesis Advisor: Yongserk Lee

A thesis submitted in partial fulfillment

of the requirements for the senior independent study

June 2007

Munju Lee, Taewoo Han

School of Electrical and Electronic Engineering


College of Engineering

Yonsei University

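Acknowledgments

First and foremost, we thank Professor Yongserk Lee, who gave us rewarding lectures over two semesters in 2006 and supervised our graduation research. Although we spent only a short time in the laboratory during the vacation, we are grateful to the lab members who welcomed us sincerely. They gave us a good research environment while we came to the lab every day in pursuit of better results. We thank teaching assistant Hoyoon Jeon, who, as the assistant in charge of graduation research, spared no advice in helping us set the overall direction. We sincerely thank Wonyoung and Pangi, who gave us concrete help at the very beginning, when we could not do anything on our own. We also thank Hyunpil, the course teaching assistant for two semesters; Hyungpyo, Jongsu, and Jaein, who were always by our side; Yongju and Hayoung, who pointed out major problems after a single passing look; Donghyun, who was always kind no matter how many questions we bothered him with; and Eunji, Jaehee, Sangsik, Minyoung, who keeps the lab running, and Seokhan. We thank everyone in the lab.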

Contents

Figure index
Table index
Abstract
1. Introduction
2. Backgrounds
  2.1. MAC
    2.1.1. A radix-4 modified Booth's algorithm
    2.1.2. Sign or zero Extension
    2.1.3. Wallace Tree
    2.1.4. 4:2 carry save adder
    2.1.5. Carry select adder
  2.2. MIPS processor
    2.2.1. MIPS is the RISC processor
    2.2.2. MIPS Instruction formats
    2.2.3. Pipelining
    2.2.4. Pipeline Hazards
3. Circuit Design Features
  3.1. MAC
    3.1.1. Block diagram of MAC
    3.1.2. Block diagram of multiplier
    3.1.3. Block diagram of an Accumulator
    3.1.4. Signed or unsigned multiplication select mode
    3.1.5. Booth encoder
    3.1.6. Problem of the sign extension bits
    3.1.7. Wallace tree with 4:2 CSA
  3.2. MIPS processor
    3.2.1. Instruction fetch
    3.2.2. Instruction decode
    3.2.3. Execute
    3.2.4. Memory
    3.2.5. Write back
    3.2.6. Using a HDL to Describe and Modeling a MIPS processor
  3.3. MIPS with MAC
    3.3.1. Special instruction MAC
    3.3.2. Total diagram of designed MIPS with MAC
4. Verification and System Analysis
  4.1. MAC
    4.1.1. Booth multiplier
    4.1.2. Basic type MAC
    4.1.3. Advanced MAC
  4.2. MIPS processor
    4.2.1. One instruction mul
    4.2.2. Loop operation
  4.3. MIPS with MAC
    4.3.1. Use MAC as Booth multiplier
    4.3.2. Use MAC as Multiplier and Accumulator
    4.3.3. Compare the two methods of Matrix Calculating
5. Conclusion
  5.1. What we know
  5.2. A fall of system performance after connecting MAC to MIPS processor
  5.3. Solution
6. Reference
Abstract in Korean

Figure index

Figure. 2.1. 4 : 2 carry save adder and implementation
Figure. 2.2. Block diagram of 64bit carry select adder
Figure. 2.3. Block diagram of 8bit carry select adder
Figure. 3.1. Block diagram of MAC (Multiplier and Accumulator Unit)
Figure. 3.2. Block diagram of a 32bit x 32bit Booth multiplier
Figure. 3.3. Block diagram of an accumulator
Figure. 3.4. Unnecessary sign extension bits
Figure. 3.5. Principle of sign generate method
Figure. 3.6. Array of partial products after applying sign generate method
Figure. 3.7. Array of partial products after applying further modified method
Figure. 3.8. Array of partial products in our multiplier
Figure. 3.9. Wallace tree with 4:2 CSA
Figure. 3.10. Wallace tree with full adder
Figure. 3.11. Total diagram of designed MIPS
Figure. 3.12. Total diagram of designed MIPS with MAC
Figure. 4.1. Verification of 2 stage pipelining multiplier with 100,000 samples
Figure. 4.2. Synthesis report and vector wave form file of multiplier with EXCALIBUR_ARM
Figure. 4.3. Operation of MAC with multiplier and 1 accumulator
Figure. 4.4. Synthesis report of the MAC with Stratix II
Figure. 4.5. Block diagram of 8 parallel MAC with 8 multipliers and 7 accumulators
Figure. 4.6. Verilog HDL verification of 8 parallel MAC with 100,000 samples
Figure. 4.7. Synthesis report and timing analyzer summary of the 8 parallel MAC with Stratix II
Figure. 4.8. Post simulation result of mul instruction
Figure. 4.9. Instructions in the MIPS instruction memory
Figure. 4.10. Wave form of loop operation in MIPS
Figure. 4.11. Matrix for verification
Figure. 4.12. Synthesized result of use MAC as Booth multiplier
Figure. 4.13. Result waveform of use MAC as Booth multiplier
Figure. 4.14. Synthesized result of use MAC as Multiplier and Accumulator
Figure. 4.15. Result waveform of use MAC as Multiplier and Accumulator
Figure. 6.1. Distribution of random numbers in Verilog HDL

Table index

Table. 2.1. The Truth table of A Radix-4 Modified Booth's Algorithm
Table. 2.2. Multiplication sign or zero extension with Booth's radix-4
Table. 2.3. Range of signed and unsigned multiplication
Table. 2.4. MIPS instruction formats
Table. 2.5. Pipelined instructions
Table. 2.6. MIPS instructions classically take five steps
Table. 2.7. Sequence of instructions
Table. 3.1. Equation of further modified method
Table. 3.2. The set of Instructions supported by designed MIPS
Table. 4.1. MAC instruction based on [x by 4] matrix multiplication [4 by y]
Table. 4.2. Compare the two methods of Matrix Calculating
Table. 5.1. Clock period of each system
Table. 6.1. MATLAB source for analysis of random numbers

ABSTRACT

Design of MIPS Processor supporting MAC with FPGA

In this thesis, we first designed a 32-bit × 32-bit Booth multiplier and an accumulator, implemented with one 64-bit carry select adder, that sums successive multiplier outputs. This MAC (Multiplier and Accumulator Unit), with one multiplier and one accumulator, operates in our MIPS processor (which supports add, sub, bnq, ... instructions).

The MAC is composed of a multiplier and an accumulator. The multiplier consists of a Booth encoder block, a Wallace tree, and a 64-bit carry select adder block. For 32-bit × 32-bit multiplication, the modified radix-8 Booth algorithm cannot offer fast multiplication because of the 3X problem. Therefore, we used the modified radix-4 Booth algorithm, in which the outputs of the encoding block are the partial products plus an additional 1-bit signal that is added in the Wallace tree. In two's complement, -X is the one's complement of X plus 1; the complement is fast to generate, but adding 1 is slow. Thus, the encoder produces two outputs, and the extra 1 bit for the -X and -2X cases is added in the Wallace tree, which improves system performance. To add 18 operands at the same time, the Wallace tree has four stages of 4:2 carry save adders (CSAs). Implementing the Wallace tree with 4:2 CSAs improves regularity because c_out is independent of c_in, so all CSAs within a stage operate at the same time. The final result is obtained by adding the carry vector and the sum vector in the 64-bit carry select adder. This 64-bit carry select adder is composed of 8-bit carry select adders, and each 8-bit adder is a serial connection of full adders and a half adder. We applied 2-stage pipelining: the first stage contains the encoding block and the Wallace tree, and the second stage contains the 64-bit carry select adder block. Synthesized for the EXCALIBUR_ARM family, this multiplier has a 21 ns clock period and uses 2,879 logic elements. The output of the multiplier is accumulated by the accumulator, which also uses a 64-bit carry select adder. We were therefore able to apply 2-stage pipelining with a multiply stage and an accumulate stage. The synthesis result for the EXCALIBUR_ARM family is a 24.396 ns clock period and 3,129 logic elements; for the Stratix II family, the clock period is 9.326 ns. Finally, we implemented a multiply stage with 8 multipliers and an accumulate stage with 7 accumulators. The synthesis result for Stratix II shows a 30 ns clock period. We obtained a 249% performance improvement; however, our 'mac' instruction in the MIPS processor can handle only two operands per instruction. Hence, we could not connect the 8-parallel MAC to the MIPS processor.

In addition, we generated 100,000 random numbers in Verilog HDL to verify our Booth multiplier, the basic type MAC, and the advanced type MAC. We confirmed through MATLAB simulation that the random numbers generated in Verilog HDL were uniformly distributed.

Our MIPS processor is a standard DLX machine. We designed it according to the MIPS3000 format. It supports R-, I-, and J-type instructions. Pipelining can greatly improve processor performance, but it requires a hazard control unit to handle the hazards that pipelining introduces. The processor also supports a MAC for efficient multiplication. It supports several MIPS assembler instructions, and in addition we created a MAC instruction. The mul instruction is executed simply with the * operator in the ALU, and the MAC instruction adds the concept of an accumulator to it.

We described and modeled the MIPS processor in the following order. First, implement pipelining. With nonblocking assignments in an always block triggered on the clock edge, the register values of every stage are transferred concurrently, which implements the pipeline. Second, connect the control lines. The design process is easier if the data registers are implemented in the previous step and the control lines are then designed separately. The registers are sequential logic, and the logic connecting them is combinational. This methodology has another advantage: the design is at RTL, so during verification we can easily locate faulty logic. Third, add the hazard control units, a forwarding unit and a stall unit, to complete the MIPS processor. Multiply and MAC are R-type instructions and do not use the Memory stage, so we divided the MAC in two and placed the halves in the Execute stage and the Memory stage of the MIPS. This increases the performance of the MIPS by reducing the clock period. For these instructions, the Memory stage serves as a second execute stage. This reduces the possibility of additional hazards caused by the new MAC instruction and preserves the RISC style. Finally, we obtain gate-level logic from the designed RTL with a synthesis tool.

To verify that the designed MIPS processor operates correctly, we analyzed its behavior with a single mul instruction and with a loop operation. The last verification is for the MIPS with MAC. We calculated a matrix product on the MIPS with MAC in two ways: the first uses the MAC only as a Booth multiplier, and the second adds the accumulator to the first.

Key words : Booth multiplier, accumulator, Wallace tree, MIPS, processor, RISC type


1. Introduction

The MAC (Multiplier and Accumulator Unit) is used for image processing and digital signal processing (DSP) in DSP processors. Our MAC uses Booth's radix-4 algorithm, a Wallace tree, 4:2 CSAs, and a 64-bit carry select adder to improve speed. MIPS has been implemented in microprocessors and permits high-performance pipelined implementations through its simple, register-oriented instruction set. Although these techniques (the radix-4 algorithm, pipelining, etc.) are widely used to speed up each part, a MAC on a specific processor cannot run at 100% efficiency. Because the MAC is slower than the other units, the MIPS instruction "mul" (multiplication) takes longer than any other instruction in our MIPS processor. To improve the speed of the MIPS, the MAC needs to be fast and the MIPS needs special handling for the "mul" instruction. The method we chose was to design a multi-clock MAC instead of a one-clock MAC, which improved the speed of the MIPS. In general, the instruction set of a MIPS processor includes complex operations, such as multiplication and floating-point operations, that take multiple execution stages. With this approach, the system clock rate of the processor was increased.

We applied 2-stage pipelining to the MAC in the MIPS processor and, as a result, were able to compute matrix multiplication, which is used in image processing, on our MIPS processor that supports MAC.


2. Backgrounds

2.1. MAC

2.1.1. A radix-4 modified Booth's algorithm

Booth's algorithm is simple but powerful. The speed of a MAC depends on the number of partial products and on the speed of accumulating them. Booth's algorithm reduces the number of partial products. We chose the radix-4 algorithm for the following reasons.

①. The original Booth's algorithm has an inefficient case.

ex)

33 partial products are generated in 32-bit x 32-bit signed or unsigned multiplication.

②. The modified radix-8 Booth algorithm has a prohibitive encoding time in 32-bit x 32-bit multiplication.

The radix-8 algorithm has a 3x term, which means that a partial product cannot be generated by shifting alone. Therefore, 2x + 1x must be computed during encoding. One solution is to handle the additional 1x term in the Wallace tree. However, a larger Wallace tree has problems of its own.

A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely used to reduce the area of a multiplier and to increase its speed. Grouping the multiplier bits three at a time with overlap halves the number of partial products, which improves system speed. The radix-4 modified Booth's algorithm is as follows:

① Set y_{-1} = 0; that is, insert a 0 to the right of the LSB of the multiplier.

② Starting from y_{-1}, group the bits three at a time, with adjacent groups overlapping by one bit.

③ If the number of multiplier bits is odd, add one extra bit to the left of the MSB.

④ Generate each partial product from the truth table.

⑤ Each new partial product is shifted left by 2 bits relative to the previous one before it is added.

yi+1   yi   yi-1   partial product
 0     0     0      0X  (No string)
 0     0     1     +1X  (End of string)
 0     1     0     +1X  (A string)
 0     1     1     +2X  (End of string)
 1     0     0     -2X  (Beginning of string)
 1     0     1     -1X  (-2X+X)
 1     1     0     -1X  (Beginning of string)
 1     1     1     -0X  (Center of string)

Table. 2.1. The Truth table of A Radix-4 Modified Booth's Algorithm

x: multiplicand y: multiplier
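The grouping steps and the truth table above can be sketched in software. The following is only an illustrative Python model, not the Verilog HDL of our design; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the multiplier `y` is assumed to be a signed integer that fits in `bits` bits.

```python
# Truth table of Table 2.1: (y[i+1], y[i], y[i-1]) -> multiple of x.
BOOTH_DIGIT = {
    0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}

def booth_radix4_digits(y, bits):
    """Recode a signed multiplier into radix-4 Booth digits, LSB group first."""
    if bits % 2:                 # step 3: odd width gets one extra sign bit
        bits += 1
    padded = y << 1              # step 1: append y[-1] = 0 to the right of the LSB
    digits = []
    for i in range(0, bits, 2):  # step 2: overlapping 3-bit groups
        group = (padded >> i) & 0b111
        digits.append(BOOTH_DIGIT[group])   # step 4: look up the truth table
    return digits

def booth_multiply(x, y, bits):
    """Sum the partial products, each shifted 2 bits further left (step 5)."""
    digits = booth_radix4_digits(y, bits)
    return sum((d * x) << (2 * i) for i, d in enumerate(digits))
```

For example, `booth_radix4_digits(7, 4)` gives the digits -1 and +2, so 7 x 7 is computed as -7 + (14 << 2) = 49, and a 33-bit multiplier yields 17 digits.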

2.1.2. Sign or zero extension

Our MAC supports signed and unsigned multiplication, and the 64-bit result is stored in two special 32-bit registers. The MAC first receives a multiplicand and a multiplier, but Booth's radix-4 algorithm treats its operands as signed numbers. Hence, an extension bit is required so that 32-bit unsigned numbers can be handled as well. The core idea is that any 32-bit unsigned number can be expressed as a 33-bit signed number. 17 partial products are generated in the 33-bit x 33-bit case (16 partial products in the 32-bit x 32-bit case).

Here is an example of signed and unsigned multiplication. When x (multiplicand) is the 3-bit value 111 and y (multiplier) is the 3-bit value 111, the signed and unsigned products differ. In the signed case, 111 × 111 means -1 x -1 = 1; in the unsigned case, 111 × 111 means 7 x 7 = 49.

signed ( -1 x -1 = 1 )

    x          (1) 1 1 1
    y        × (1) 1 1 1 (0)
    -------------------------
           0 0 0 0 0 1      1x
           0 0 0 0 0        0x
    -------------------------
           0 0 0 0 0 1      1 (dec)

unsigned ( 7 x 7 = 49 )

    x          (0) 1 1 1
    y        × (0) 1 1 1 (0)
    -------------------------
           1 1 1 0 0 1      -x
           1 1 1 0          +2x
    -------------------------
           1 1 0 0 0 1      49 (dec)

Table. 2.2. Multiplication sign or zero extension with Booth's radix-4

In the example above, both the signed and the unsigned result fit in 6 bits, twice the operand width. Likewise, our 32-bit x 32-bit MAC produces a 64-bit result. The ranges for 32-bit x 32-bit signed and unsigned multiplication are shown below:

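               signed multiplication             unsigned multiplication
multiplicand   $-2^{31}$ ∼ $2^{31}-1$            $0$ ∼ $2^{32}-1$
multiplier     $-2^{31}$ ∼ $2^{31}-1$            $0$ ∼ $2^{32}-1$
result         $-2^{62}+2^{31}$ ∼ $2^{62}$       $0$ ∼ $2^{64}-2^{33}+1$

Table. 2.3. Range of signed and unsigned multiplication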

The table above confirms that the result of 32-bit x 32-bit signed or unsigned multiplication can be expressed in 64 bits. Therefore, the 65th (overflow) bit generated in the Wallace tree was not considered.
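The extension idea can be checked with a small model. This is a hedged Python sketch, not part of our HDL; `extend_to_33` and `multiply_64` are hypothetical names, and the operands are assumed to arrive as raw 32-bit patterns together with a signed/unsigned mode flag.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def extend_to_33(word, signed):
    """Return the value of a 32-bit pattern after sign or zero extension.

    Either way the value lies in -2**32 .. 2**32 - 1, i.e. it fits a
    33-bit signed number, which is what the Booth encoder expects.
    """
    word &= MASK32
    if signed and (word >> 31):      # sign extension: MSB carries weight -2**31
        return word - (1 << 32)
    return word                      # zero extension: value unchanged

def multiply_64(a, b, signed):
    """Multiply two 32-bit patterns and return the 64-bit result pattern."""
    return (extend_to_33(a, signed) * extend_to_33(b, signed)) & MASK64
```

With the all-ones pattern as both operands, the signed mode returns 1 (that is, -1 x -1) and the unsigned mode returns 0xFFFFFFFE00000001, and both results fit in 64 bits.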


2.1.3. Wallace tree

In a ripple carry adder, the carry propagates because each stage cannot complete until the carry from the previous stage arrives. Thus, if adders are connected in series, the propagation delay increases in proportion to the number of adders. A Wallace tree has less propagation delay because independent additions are executed in the same stage. Assume we add 4 operands a, b, c, d.

① Inefficient case (serial connection, as in a ripple carry adder):

(l = a + b) next (m = l + c) next (result = m + d)

② Efficient case (concept of Wallace tree):

(l = a + b and m = c + d) next (result = l + m)

In the example above, the first case takes three adder delays, whereas the second takes only two, and the gap widens as the number of operands grows. Similarly, a Wallace tree can process operands that are independent of each other in parallel, because the carry propagation delay is hidden within the time of the adding operation. The output of the Wallace tree consists of a carry vector and a sum vector.
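How a tree ends in a carry vector and a sum vector can be illustrated with a generic 3:2 carry-save reduction. This Python sketch is an assumption-laden illustration rather than our circuit: operands are taken as non-negative integers, and `carry_save_add` and `reduce_operands` are hypothetical names.

```python
def carry_save_add(a, b, c):
    """3:2 carry-save step: no carry ripples between bit positions."""
    sum_vector = a ^ b ^ c
    carry_vector = ((a & b) | (b & c) | (a & c)) << 1
    return sum_vector, carry_vector

def reduce_operands(operands):
    """Reduce many operands to two vectors whose sum equals the total."""
    pending = list(operands)
    while len(pending) > 2:
        s, c = carry_save_add(*pending[:3])
        pending = pending[3:] + [s, c]
    return pending
```

Only the final two vectors need a carry-propagating adder, which is the role of the carry select adder described in 2.1.5.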

2.1.4. 4 : 2 CSA (Carry Save Adder)

Figure. 2.1. 4 : 2 Carry Save Adder and Implementation


A 4:2 carry save adder has 5 inputs and 3 outputs. Four inputs are operands and one is c_in; the outputs are the sum, the carry, and c_out. The schematic is on the right side of the figure above. A 4:2 CSA is slower than a full adder; however, it processes 5 inputs at once where a full adder processes 3, and its regularity improves speed. Since c_out is independent of c_in, the c_out of each bit position can feed the c_in of the next more significant position within the same stage, without a ripple.
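One common way to build a 4:2 CSA is from two cascaded full adders. The Python sketch below assumes that structure (the figure itself is not reproduced here, so this is an assumption), with hypothetical names `full_adder` and `compressor_4_2`, and operates on single bits.

```python
def full_adder(a, b, c):
    """Single-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(x1, x2, x3, x4, c_in):
    """Single-bit 4:2 carry save adder: returns (sum, carry, c_out).

    c_out is produced by the first full adder, which never sees c_in,
    so c_out does not depend on c_in.
    """
    partial, c_out = full_adder(x1, x2, x3)
    total, carry = full_adder(partial, x4, c_in)
    return total, carry, c_out
```

For every input combination, x1 + x2 + x3 + x4 + c_in equals sum + 2 x (carry + c_out), and c_out is the same for c_in = 0 and c_in = 1.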

2.1.5. Carry select adder

The output of the Wallace tree is a sum vector and a carry vector, which must be added to obtain the final result. If we used a ripple carry adder to add them, the carry propagation delay would hurt the performance of the MAC. A carry select adder computes the sum for both carry-in = 0 and carry-in = 1 and then selects the correct one. Despite its larger area, it increases execution speed.

Figure. 2.2. Block diagram of 64bit Carry Select Adder

All 8-bit carry select adders execute at the same time, so the delay of the 64-bit carry select adder is the delay of one 8-bit carry select adder plus the mux delays for selecting each subsequent sum.


Figure. 2.3. Block diagram of 8bit Carry Select adder

Figure 2.3 shows the 8-bit carry select adder. It is a serial connection of full adders and a half adder.
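The select mechanism can be modeled behaviorally. In this Python sketch (hypothetical names `add8` and `carry_select_add64`), each 8-bit block is abstracted as an integer addition instead of the full adder chain of Figure 2.3, which is a simplifying assumption.

```python
def add8(a, b, carry_in):
    """Behavioral 8-bit adder: returns (8-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8

def carry_select_add64(a, b, carry_in=0):
    """64-bit carry select adder built from eight 8-bit blocks."""
    result, carry = 0, carry_in
    for block in range(8):
        a8 = (a >> (8 * block)) & 0xFF
        b8 = (b >> (8 * block)) & 0xFF
        # In hardware both candidates are computed in parallel in every block.
        sum0, carry0 = add8(a8, b8, 0)
        sum1, carry1 = add8(a8, b8, 1)
        # The incoming carry only drives a mux that picks one candidate.
        block_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= block_sum << (8 * block)
    return result, carry
```

The loop is sequential only in software; in the circuit the remaining serial path is the chain of mux selections, which is why the delay is one 8-bit adder plus some mux delays.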

2.2. MIPS Processor

The MIPS architecture is one of the most widely supported processor architectures,

with a vast infrastructure of industry-standard tools, software and services that help

ensure rapid, reliable and cost-effective SoC design. MIPS Technologies delivers the

widest range of low power, high-performance processor cores with distinct cost

advantages to semiconductor companies, ASIC developers and system OEMs

worldwide. The company offers award-winning embedded microprocessor cores,

instruction set architectures, as well as system controllers, and incorporates a complete

array of software and system level debug tools to provide maximum flexibility and

convenience for ever-changing market requirements: high-performance, low power, low

silicon cost and rapid time-to-market.


To command a computer's hardware, you must speak its language. The words of a

computer's language are called instructions, and its vocabulary is called an instruction

set. One might think that the languages of computers would be as diverse as those of humans, but in

reality computer languages are quite similar, more like regional dialects than like

independent languages. This similarity occurs because all computers are constructed

from hardware technologies based on similar underlying principles and because there

are a few basic operations that all computers must provide. Generally, computer

designers have a common goal to find a language that makes it easy to build the

hardware and compiler while maximizing performance and minimizing cost.

2.2.1. MIPS is the RISC processor

The MIPS processor, designed in 1984 by researchers at Stanford University, is a

RISC (Reduced Instruction Set Computer) processor. Compared with their CISC

(Complex Instruction Set Computer) counterparts (such as the Intel Pentium

processors), RISC processors typically support fewer and much simpler instructions.

The premise is, however, that a RISC processor can be made much faster than a

CISC processor because of its simpler design. These days, it is generally accepted that

RISC processors are more efficient than CISC processors; and even the only popular

CISC processor that is still around (Intel Pentium) internally translates the CISC

instructions into RISC instructions before they are executed. RISC processors typically

have a load-store architecture. This means there are two instructions for accessing

memory: a load "l" instruction to load data from memory and a store "s" instruction

to write data to memory. It also means that none of the other instructions can access

memory directly. So, an instruction like "add this byte from memory to register 1"

from a CISC instruction set would need two instructions in a load-store architecture:

"load this byte from memory into register 2" and "add register 2 to register 1". CISC

processors also offer many different addressing modes. Consider the following

instruction from the Intel 80x86 instruction set (with simplified register names):


add r1, [r2+r3*4+60] // x86 (not MIPS) example

This instruction loads a value from memory and adds it to a register. The memory

location is given in between the square brackets. As you can see, the Intel instruction

set, as is typical for CISC architectures, allows for very complicated expressions for

address calculations or "addressing modes". The MIPS processor, on the other hand,

only allows for one, rather simple, addressing mode: to specify an address, you can

specify a constant and a register. So, the above Intel instruction could be translated as:

slli r3, r3, 2 // r3 := r3 << 2 (i.e. r3 := r3 * 4)

add r2, r2, r3 // r2 := r2 + r3

l r4, 60(r2) // r4 := memory[60+r2]

add r1, r1, r4 // r1 := r1 + r4

We need four instructions instead of one, and an extra register (r4) to do what can

be done with one instruction in a CISC architecture. The internal circuitry of the

RISC processor is much simpler, however, and can thus be made very fast.

2.2.2. MIPS Instruction formats

Table. 2.4. MIPS instruction formats

Here is the meaning of each name of the fields in MIPS instructions:

op : opcode, the basic operation of the instruction.

rs : the first register source operand.

rt : the second register source operand.

rd : the register destination operand. It gets the result of the operation.

shamt : shift amount.

funct : function. This field selects the specific variant of the operation in the op field.

The compromise chosen by the MIPS designers is to keep all instructions the same

length, thereby requiring different kinds of instruction formats for different kinds of

instructions. The formats above are called R-type (for register), I-type (for immediate),

and J-type (for jump).
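Field extraction for the three formats can be sketched as follows. The bit positions used here are those of the standard 32-bit MIPS encoding (6-bit op, 5-bit rs, rt, rd and shamt, 6-bit funct, 16-bit immediate, 26-bit jump target); Table 2.4 is not reproduced in this text, so treat these widths as an assumption about our design. `decode_fields` is a hypothetical helper written in Python.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the fields of all three formats."""
    return {
        "op":     (word >> 26) & 0x3F,   # all formats
        "rs":     (word >> 21) & 0x1F,   # R-type and I-type
        "rt":     (word >> 16) & 0x1F,   # R-type and I-type
        "rd":     (word >> 11) & 0x1F,   # R-type only
        "shamt":  (word >> 6) & 0x1F,    # R-type only
        "funct":  word & 0x3F,           # R-type only
        "imm":    word & 0xFFFF,         # I-type only
        "target": word & 0x3FFFFFF,      # J-type only
    }
```

For instance, the word 0x00430820 decodes to op = 0, rs = 2, rt = 3, rd = 1, shamt = 0, funct = 0x20, which in the standard encoding is an add of registers 2 and 3 into register 1.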

2.2.3. Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. All steps, called stages in pipelining, operate concurrently. As long as we have separate resources for each stage, we can pipeline the tasks. The computer pipeline is divided into stages. Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end. Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.

Instruction \ Clock    1     2     3     4     5     6     7     8     9
instruction i          IF    ID    EX    MEM   WB
instruction i+1              IF    ID    EX    MEM   WB

Table. 2.5. Pipelined instructions

The simplicity of the MIPS instruction set makes resource evaluation relatively easy.

The major functional units are used in different cycles and hence overlapping the

execution of multiple instructions introduces relatively few conflicts.
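The staggered schedule of Table 2.5 and the resulting cycle count can be reproduced with a few lines of Python. This is an idealized sketch that assumes no hazards or stalls; `pipeline_rows` and `total_cycles` are hypothetical names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_rows(num_instructions):
    """Row k lists the stage occupied by instruction i+k in each clock."""
    return [[""] * k + STAGES for k in range(num_instructions)]

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Clocks needed to finish all instructions in an ideal pipeline."""
    return num_stages + num_instructions - 1
```

Two instructions finish in 6 clocks and five instructions fill the 9 clock columns of the table, whereas without pipelining five instructions would need 5 x 5 = 25 clocks.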

IF: The Instruction Fetch stage fetches the next instruction from memory using the address in the PC (Program Counter) register and stores this instruction in the IR (Instruction Register).

ID: The Instruction Decode stage decodes the instruction in the IR, calculates the next PC, and reads any operands required from the register file.

EX: The Execute stage "executes" the instruction. In fact, all ALU operations are done in this stage. (The ALU is the Arithmetic and Logic Unit and performs operations such as addition, subtraction, shifts left and right, etc.)

MEM: The Memory Access stage performs any memory access required by the current instruction. So, for loads, it would load an operand from memory. For stores, it would store an operand into memory. For all other instructions, it would do nothing.

WB: For instructions that have a result (a destination register), the Write Back stage writes this result back to the register file. Note that this includes nearly all instructions, except nops (a nop, no-op or no-operation instruction simply does nothing) and s (stores).

Table. 2.6. MIPS instructions classically take five steps

2.2.4. Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the

instruction stream from executing during its designated clock cycle. Hazards

reduce the performance from the ideal speedup gained by pipelining. There are three

classes of hazards:

Structural Hazards: They arise from resource conflicts when the hardware cannot

support all possible combinations of instructions in simultaneous

overlapped execution.

Data Hazards: They arise when an instruction depends on the result of a previous

instruction in a way that is exposed by the overlapping of

instructions in the pipeline.

Control Hazards: They arise from the pipelining of branches and other instructions

that change the PC.


Hazards in pipelines can make it necessary to stall the pipeline. The processor can

stall on different events:

A cache miss: A cache miss stalls all the instructions in the pipeline, both before and

after the instruction causing the miss.

A hazard in the pipeline: Eliminating a hazard often requires that some instructions in

the pipeline be allowed to proceed while others are

delayed. When the instruction is stalled, all the instructions

issued later than the stalled instruction are also stalled.

Instructions issued earlier than the stalled instruction must

continue, since otherwise the hazard will never clear.

ADD R1, R2, R3

SUB R4, R5, R1

AND R6, R1, R7

Table. 2.7. sequence of instructions

The problem with data hazards, introduced by this sequence of instructions can be

solved with a simple hardware technique called forwarding. The key insight in

forwarding is that the result is not really needed by SUB until after the ADD actually

produces it. The only problem is to make it available for SUB when it needs it. If

the result can be moved from where the ADD produces it (EX/MEM register), to

where the SUB needs it (ALU input latch), then the need for a stall can be avoided.

Using this observation, forwarding works as follows:

The ALU result from the EX/MEM register is always fed back to the ALU input

latches.

If the forwarding hardware detects that the previous ALU operation has written the

register corresponding to the source for the current ALU operation, control logic

selects the forwarded result as the ALU input rather than the value read from the

register file.
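The detection rule in the last paragraph can be expressed as a comparison of register names. The Python sketch below is a simplified illustration using the instructions of Table 2.7; the tuple layout `(op, dest, src1, src2)` and the name `forwarded_sources` are assumptions made for this example, and only the immediately preceding ALU instruction is considered.

```python
def forwarded_sources(producer, consumer):
    """Return which source operands (1 or 2) of `consumer` must take the
    forwarded ALU result of `producer` instead of the register file value.

    Each instruction is a tuple (op, dest, src1, src2).
    """
    dest = producer[1]
    return [pos for pos, src in enumerate(consumer[2:], start=1) if src == dest]

ADD = ("ADD", "R1", "R2", "R3")
SUB = ("SUB", "R4", "R5", "R1")
AND = ("AND", "R6", "R1", "R7")
```

Here `forwarded_sources(ADD, SUB)` selects the second source of SUB, since R1 is written by ADD, while `forwarded_sources(SUB, AND)` is empty because AND does not read R4.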


3. Circuit Design Features

3.1. MAC

3.1.1 Block Diagram of MAC

Figure. 3.1. Block diagram of MAC ( Multiplier and Accumulator Unit )

We applied 2-stage pipelining to the MAC. In the first (multiplier) stage, the 32-bit multiplier and the 32-bit multiplicand are stored in two D flip-flop registers. At the rising edge of the system clock, the multiplicand and multiplier become the inputs of the Booth multiplier, which produces a 64-bit output. On the next clock, the D flip-flops in the accumulator release the result of the Booth multiplier and the accumulator's output from the previous clock. The accumulator adds these two inputs in a 64-bit carry select adder and produces two outputs: a 1-bit overflow and the 64-bit accumulation result. Because the multiplier and the accumulator are independent, applying 2-stage pipelining is straightforward.
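The timing of the two stages can be mimicked with a clocked software model. This Python sketch is behavioral only and assumes unsigned operands and a single pipeline register between the stages; the class name `PipelinedMac` and its attributes are hypothetical.

```python
MASK64 = (1 << 64) - 1

class PipelinedMac:
    """Two-stage MAC: stage 1 multiplies, stage 2 accumulates."""

    def __init__(self):
        self.product_reg = 0   # register between multiply and accumulate stage
        self.acc = 0           # 64-bit accumulation register
        self.overflow = 0      # 1-bit overflow of the last accumulation

    def clock(self, x, y):
        """Model one rising clock edge with new operands x and y."""
        # Stage 2 uses the product latched on the previous edge.
        total = self.acc + self.product_reg
        self.overflow = total >> 64
        self.acc = total & MASK64
        # Stage 1 latches the new product for the next edge.
        self.product_reg = (x * y) & MASK64
        return self.acc
```

Feeding (2, 3), then (4, 5), then (0, 0) returns 0, 6 and 26: each product reaches the accumulator one clock after it is computed, while the next multiplication is already in progress.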


3.1.2. Block Diagram of Multiplier

Figure. 3.2. Block diagram of a 32bit x 32bit Booth multiplier

Figure 3.2 is the block diagram of the Booth multiplier. The 32-bit multiplicand and multiplier are sign-extended or zero-extended according to the 1-bit input that selects signed or unsigned multiplication. The Booth encoder produces 17 partial products and 17 additional single bits to be added in the Wallace tree. These additional bits improve system performance by removing the addition that would otherwise take time in the encoding block. The outputs of the Wallace tree are 64-bit carry and sum vectors. Both vectors are added in the 64-bit carry select adder block to generate the final result.


3.1.3. Block Diagram of an Accumulator

Figure. 3.3. Block diagram of an accumulator

The accumulator holds the output of the previous clock in the accumulation register. Holding outputs in the accumulation register removes the need for an additional "add" instruction. The accumulator was implemented with one 64-bit carry select adder.

3.1.4. Signed or unsigned multiplication select mode

The 32-bit multiplicand and multiplier arrive at the same time; then, according to the mode select signal, the 32-bit operands become 33-bit operands through sign or zero extension.

3.1.5. Booth encoder

Because the multiplier is 33 bits wide, 17 partial products are generated by 17 Booth encoders. Each encoder block is connected to the x, 2x, 0x, -2x, and -x signals, so all 17 Booth encoders operate at the same time. However, the -x and -2x signals are only one's complements, whereas two's complement requires complementing each bit and then adding 1. Adding 1 to the complemented number inside the Booth encoding block would incur a carry propagation delay, so it is not a wise choice. Instead, when -x or -2x is generated, an additional register bit is set to indicate that 1 must be added to the one's complement. The data in these additional registers are added in the Wallace tree with a smaller penalty. Our MAC thus generates 17 partial products and 17 one-bit signals to be added in the Wallace tree.
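The split into a one's-complement partial product and a separate 1-bit correction can be sketched as follows. This Python model is illustrative only; `encode_partial_product`, the `digit` argument (one of 0, +1, +2, -1, -2 from the Booth truth table) and the `width` parameter are hypothetical.

```python
def encode_partial_product(x, digit, width):
    """Return (partial_product, neg) for one Booth digit.

    For a negative digit the partial product is only the one's complement
    of |digit| * x; neg = 1 tells the Wallace tree to add the missing 1.
    """
    mask = (1 << width) - 1
    magnitude = (abs(digit) * x) & mask
    if digit < 0:
        return (~magnitude) & mask, 1   # bitwise inversion, no carry chain
    return magnitude, 0
```

For x = 7, digit = -1 and a 6-bit width, the encoder outputs 111000 with neg = 1; adding the two gives 111001, the two's complement of 7, which matches the -x row of the unsigned example in Table 2.2.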

3.1.6. Sign extension problem

When the 17 partial products are added in the Wallace tree, sign extension is essential. But unnecessary sign extension terms reduce the speed of the MAC. In the 8-bit x 8-bit case below, the S0, S1, S2 terms can be reduced by a technique called the sign generate method.

Figure. 3.4. Unnecessary sign extension bits

The S0S0S0S0S0S0S0S0 term in the first row is equivalent to -S0 in two's complement. Therefore, the sign extension part can be reduced. However, -S0 is not a 1-bit value, though it can easily be rewritten as a non-negative number, as the following diagram shows.


Figure. 3.5. Principle of sign generate method

The sign generate method leaves only the essential sign extension terms, as shown below.

Figure. 3.6. Array of partial products after applying sign generate method

The 'neg' terms are the additional terms to be added in the Wallace tree.

Next, applying the following equation changes the array above.


Table. 3.1. Equation of further modified method

This table shows our 32-bit x 32-bit signed or unsigned case. However, the array has the same shape as in the small multiplication below. The partial product arrays are shown below:

Figure. 3.7. Array of partial products after applying further modified method

Figure. 3.8. Array of partial products in our multiplier


3.1.7. Wallace tree with 4:2 CSA

Four stages are needed to generate the carry and sum vectors with the 4:2 CSA Wallace tree.

Figure. 3.9. Wallace tree with 4:2 CSA

The 33rd bit position has 18 operands, so 4 stages and eight 4:2 CSAs are needed in the Wallace tree. The term '4 stages' means that the multiplier can sum the array of 17 partial products within the delay of just four 4:2 CSAs.

Figure. 3.10. Wallace tree with full adder

Our 4:2 CSA Wallace tree has a delay of 4 delta × 1.5. The Wallace tree with full adders has 6 stages and a delay of 6 delta × 1. The two have almost the same performance, but I chose the 4:2 CSA Wallace tree since the regularity of the 4:2 CSA can improve the system performance. When I changed the Wallace tree from three 4:2 CSA stages plus one additional full adder stage to four 4:2 CSA stages, the system performance improved by 10%. In a 4:2 CSA, the output C_out is independent of C_in. This means that all 4:2 carry save adders in the same stage can operate almost at the same time, which further improves the regularity of the system.
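The stage counts and the independence of C_out from C_in can be reproduced with a short model. This is a sketch under assumptions: the 4:2 compressor is modeled as two chained full adders (one common construction, not necessarily the cell used in our design), and `reduction_stages` is a hypothetical helper that greedily packs rows into compressors.

```python
def compressor_4_2(a, b, c, d, c_in):
    """One-bit 4:2 compressor modeled as two chained full adders."""
    s1 = a ^ b ^ c
    c_out = (a & b) | (b & c) | (a & c)          # depends only on a, b, c
    total = s1 ^ d ^ c_in
    carry = (s1 & d) | (d & c_in) | (s1 & c_in)
    return total, carry, c_out


def reduction_stages(operands, inputs):
    """Count (stages, units) needed to reduce `operands` rows to 2 rows,
    using compressors that take `inputs` rows and produce 2 rows."""
    stages = units = 0
    while operands > 2:
        used = operands // inputs
        if used == 0:
            raise ValueError("too few rows left for this compressor size")
        operands -= used * (inputs - 2)  # each unit removes (inputs - 2) rows
        units += used
        stages += 1
    return stages, units
```

With 18 operands, `reduction_stages(18, 4)` returns `(4, 8)`, matching the 4 stages and eight 4:2 CSAs above, while `reduction_stages(18, 3)` returns 6 stages for the full adder tree.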

3.2. MIPS

Figure. 3.11. Total diagram of designed MIPS

This diagram shows our designed MIPS. It is an abstract view of the implementation of the MIPS subset, showing the major functional units and the major connections between them, including the necessary multiplexers and control lines. It also contains a forwarding unit and a hazard detection unit. Executing one instruction proceeds through the IF, ID, EXE, MEM, and WB stages, and the diagram shows the registers named after these stages. These registers also enable the MIPS to perform pipelining, because they separate the stages from each other. For example, the EXE stage lies between the ID/EXE register and the EXE/MEM register. This design methodology is RTL (Register Transfer Level): all registers (state elements) are sequential logic, with combinational logic in between the registers. This method has many advantages: the designed code is easy to debug, and it fits the clocking methodology. Let us look into the behavior of the major blocks and datapath elements in each stage.

3.2.1. Instruction Fetch

The first element is the instruction memory unit, which stores the instructions of a program and supplies the instruction at a given address. We designed it to hold instructions in machine language form, corresponding to MIPS assembly language.

The program counter is a register that holds the address of the current instruction. If a "stall" is needed to resolve a hazard, a control signal from the hazard detection unit controls this behavior.

The adder in this stage is needed to increment the PC to the address of the next instruction. This adder, which is combinational, is built as a carry lookahead adder to support fast addition of 32bit numbers.

The mux in this stage selects the next PC value between the current PC + 4 and the branch PC value. It is controlled by the PC_src control signal from the control unit.
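The next-PC selection described above can be summarized in a few lines. This is a behavioral sketch only: the polarity of PC_src (1 selects the branch PC) and the `stall` argument are assumptions made for illustration, and the carry lookahead adder is replaced by ordinary addition.

```python
def next_pc(pc, branch_pc, pc_src, stall):
    """Value loaded into the program counter at the next clock edge."""
    if stall:
        return pc                      # hazard detection unit holds the PC
    if pc_src:
        return branch_pc               # mux selects the branch PC value
    return (pc + 4) & 0xFFFFFFFF       # mux selects PC + 4 (32-bit wrap)
```

For example, `next_pc(252, 0, 0, False)` is 256, and a taken branch at PC 284 back to 252 is `next_pc(284, 252, 1, False)`.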

3.2.2. Instruction Decode

The processor's 32 general purpose registers are stored in a structure called a register file. It is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the machine. In addition, we need an ALU to operate on the values read from the registers.

Between the register outputs and the ALU, there is a branch unit. It determines whether the next instruction is the instruction that follows sequentially or the instruction at the branch target address.

A sign extend unit is used to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value.

If a hazard occurs and is detected by the hazard detection unit, the control signals are set to zero. Accordingly, no element can read or write anything; this behavior is a "stall". A stall also requires holding the PC value in the program counter, which the hazard detection unit controls. When instructions must be "flushed" at a branch hazard, the PC value in the program counter is not held and the instructions are set to zero.
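The difference between a stall and a flush can be written as a small decision function. This is a hedged sketch: the control signal names in `NOP_CONTROL` and the function name are hypothetical, and the model covers only the behavior stated above (controls zeroed on both, PC held only on a stall, instruction zeroed only on a flush).

```python
# Hypothetical control bundle; all zeros means nothing is read or written.
NOP_CONTROL = {"reg_write": 0, "mem_read": 0, "mem_write": 0}


def decode_hazard_action(control, instruction, stall=False, flush=False):
    """Return (control, instruction, pc_write) after hazard handling."""
    if stall:
        # Bubble: controls forced to zero, PC held so the instruction repeats.
        return dict(NOP_CONTROL), instruction, False
    if flush:
        # Branch hazard: instruction set to zero, PC is not held.
        return dict(NOP_CONTROL), 0, True
    return control, instruction, True
```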

3.2.3. Execution

We can generate the 4-bit ALU control input using a small control unit that takes as inputs the function field of the instruction and a 2-bit control field, which we call ALUop.

ALU32 is a 32-bit arithmetic logic unit that executes according to the instruction. It includes several arithmetic and logic operations. The instructions supported by this ALU and our MIPS are listed in Table 3.2.

If a hazard occurs, some cases can be resolved by a "stall", but stalls decrease the performance of the MIPS. We can avoid this by replacing the "stall" with "forwarding".

* EX/MEM.register Rd = ID/EX.register Rs

* EX/MEM.register Rd = ID/EX.register Rt

* MEM/WB.register Rd = ID/EX.register Rs

* MEM/WB.register Rd = ID/EX.register Rt

These are the hazard conditions; in these cases the ALU inputs are replaced with the forwarded data. The muxes placed before the ALU select its inputs accordingly.
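The four conditions can be folded into one selection function per ALU input. This is a sketch, not our forwarding unit: the RegWrite checks, the exclusion of register 0, and the priority of EX/MEM over MEM/WB follow the usual textbook formulation and are assumptions here, since the conditions above list only the register number comparisons.

```python
def forward_select(id_ex_reg, ex_mem_rd, ex_mem_reg_write,
                   mem_wb_rd, mem_wb_reg_write):
    """Choose the source of one ALU input (call once for Rs, once for Rt)."""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"     # newest result wins (assumed priority)
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"
    return "REGFILE"        # no hazard: use the value read in ID
```

For example, if the previous instruction writes register 17 and the current one reads it as Rs, `forward_select(17, 17, 1, 0, 0)` selects the EX/MEM value.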

3.2.4. Memory

The data memory must be written on store instructions; hence, it has both read and write control signals, an address input, and an input for the data to be written into memory.

3.2.5. Write back

In this stage, the result of the ALU or the data memory value is written to the register file according to the control signals.

Category        Instruction                  Example              Meaning
Arithmetic      add                          add $s1,$s2,$s3      $s1 = $s2 + $s3
                subtract                     sub $s1,$s2,$s3      $s1 = $s2 - $s3
                add immediate                addi $s1,$s2,100     $s1 = $s2 + 100
                sub immediate                subi $s1,$s2,100     $s1 = $s2 - 100
                multiply                     mul $s1,$s2,$s3      $s1 = $s2 * $s3
Data transfer   load word                    lw $s1,100($s2)      $s1 = Memory[$s2+100]
                store word                   sw $s1,100($s2)      Memory[$s2+100] = $s1
                load immediate               li $s1,100           $s1 = 100
Logical         and                          and $s1,$s2,$s3      $s1 = $s2 & $s3
                or                           or $s1,$s2,$s3       $s1 = $s2 | $s3
                nor                          nor $s1,$s2,$s3      $s1 = ~($s2 | $s3)
                and immediate                andi $s1,$s2,100     $s1 = $s2 & 100
                or immediate                 ori $s1,$s2,100      $s1 = $s2 | 100
Branch          branch on equal              beq $s1,$s2,L        if ($s1 == $s2) go to L
                branch on not equal          bne $s1,$s2,L        if ($s1 != $s2) go to L
                set on less than             slt $s1,$s2,$s3      if ($s2 < $s3) $s1 = 1; else $s1 = 0
                set on less than immediate   slti $s1,$s2,100     if ($s2 < 100) $s1 = 1; else $s1 = 0
                branch greater than          bgt $s1,$s2,L        if ($s1 > $s2) go to L
                branch greater than zero     bgtz $s1,L           if ($s1 > 0) go to L
Jump            jump                         j L                  go to L

Table. 3.2. The set of Instructions supported by designed MIPS

3.2.6. Using a HDL to Describe and Modeling a MIPS processor

We designed the MAC and the MIPS in Verilog HDL. This section describes the order of implementation and the methodology for describing and modeling the design.

First, implement pipelining. By using nonblocking assignments in an always block triggered on the clock edge, the register values of all stages are transferred concurrently. This is how the pipelining is implemented.
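The effect of the nonblocking assignments can be imitated in software by computing every new register value from the state held before the clock edge. The sketch below is a Python stand-in for that behavior, not our Verilog HDL code; the register names follow the stage boundaries of the design, and the stored values are just instruction labels.

```python
STAGE_REGS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]


def clock_edge(regs, fetched):
    """All pipeline registers load at once from the values held before
    the edge, mirroring nonblocking assignment semantics."""
    nxt = {"IF/ID": fetched}
    for prev, cur in zip(STAGE_REGS, STAGE_REGS[1:]):
        nxt[cur] = regs[prev]   # right-hand sides read the old state only
    return nxt
```

After four clock edges that fetch `lw`, `add`, `mul`, and `mac` in turn, `lw` has reached MEM/WB while `mac` sits in IF/ID. Updating the registers one by one in place instead would let a single instruction ripple through every stage in one clock.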

Second, connect the control lines. The design process is easier if the data registers are implemented in the previous step and the control lines are then designed separately. The registers are sequential logic, and the logic connecting them is combinational. This methodology has another advantage: because the design is RTL, we can easily locate faulty logic during verification.

Third, add the hazard control units. The MIPS processor is completed by connecting the hazard control units, which are the forwarding unit and the stall unit.

3.3. MIPS with MAC

3.3.1 Special instruction MAC

To support the MAC unit in the MIPS processor, the MIPS provides a "mac" instruction. This instruction differs from the earlier "mul" instruction, which simply uses the "*" operator in the ALU. A mac instruction adds the product of its two operands to the result of the preceding mac instruction. That is, it adds the concept of an accumulator to the multiply operation.

3.3.2 Total diagram of designed MIPS with MAC

This is the total diagram of the MIPS with MAC. It is not a diagram that exists in the references; it shows the MIPS and MAC that we designed ourselves. Multiply and MAC are R-type instructions, which do not use the memory stage, so we divided the MAC in two and placed the halves in the execute stage and the memory stage of the MIPS. A mac instruction therefore passes through the instruction fetch, instruction decode, execute, execute, and write back stages. This increases the performance of the MIPS by reducing the clock period. Because the second execute stage takes the place of the memory stage, the new mac instruction reduces the possibility of additional hazards, and the RISC style is preserved.

Figure. 3.12. Total diagram of designed MIPS with MAC

Figure. 4.1. Verification of 2 stage pipelining multiplier with 100,000 samples.

4. Verification and System Analysis

4.1. MAC

4.1.1 Multiplier

To verify our MAC, we used 100,000 random multipliers and multiplicands in Modelsim. Comparing the results of our MAC with the results of the signed and unsigned Modelsim multiplication functions gave 100% accuracy.

Figure. 4.2. Synthesis report and vector wave form file of multiplier with EXCALIBUR_ARM

The upper figure is the Quartus II vector waveform file, with new inputs applied every 60ns. The result appears after close to 40ns. For example, the 3 x 5 multiplication starts at 60ns and produces the answer 15 near the 100ns mark.

4.1.2. MAC with 1 multiplier and 1 accumulator

The basic type MAC, with 1 multiplier and 1 accumulator, is based on groups of 4 MAC instructions. The instructions below show why the accumulation register must be reset after every 4 "mac" instructions.

Figure. 4.3. Operation of MAC with multiplier and 1 accumulator

MAC($A0,$B0)

MAC($A1,$B1)

MAC($A2,$B2)

MAC($A3,$B3)

MAC($A4,$B0)

MAC($A5,$B1)

MAC($A6,$B2)

MAC($A7,$B3)

Table. 4.1. "mac" instruction based on [x by 4] matrix multiplication [4 by y]

Without a reset, the output of the MAC would accumulate in the special register indefinitely. After the instruction MAC($A4, $B0), the value of the accumulation register would be A0B0 + A1B1 + A2B2 + A3B3 + A4B0. The MAC cannot identify what kind of matrix multiplication is being performed; it just multiplies and accumulates into one special register. We therefore assumed that the MAC instructions in our multiplier and accumulator unit follow an [x by 4] by [4 by y] matrix multiplication, and we reset the accumulation register after every 4 MAC instructions.
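The reset rule can be expressed as a small behavioral model. This is a sketch with hypothetical names (`MacUnit`, `group`), written in Python rather than Verilog HDL; it models only the arithmetic, not the 2 stage pipelining.

```python
class MacUnit:
    """Multiply-accumulate with the accumulator cleared every `group` calls."""

    def __init__(self, group=4):
        self.group = group
        self.acc = 0
        self.count = 0

    def mac(self, a, b):
        if self.count == self.group:   # a new row-by-column product begins
            self.acc = 0
            self.count = 0
        self.acc += a * b
        self.count += 1
        return self.acc
```

Feeding the eight instructions of Table 4.1 with A = 1..8 and B = 1..4 gives 30 after the fourth instruction and 70 after the eighth; without the reset the eighth result would be 100.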

After synthesis, the 2 stage pipelined MAC achieved a clock period of 24.396ns. The non-pipelined MAC has a 34ns clock period, and the vector waveform file shows that the synthesis succeeded.

Figure. 4.4. Synthesis report of the MAC with Stratix II

Figure 4.4 shows the result of synthesizing the same MAC with Stratix II. The clock period is 9.326ns.

4.1.3. Advanced type MAC with 8 multipliers and 7 accumulators

Figure. 4.5. Block diagram of 8 parallel MAC with 8 multipliers and 7 accumulators

Figure. 4.6. Verilog HDL verification of 8 parallel MAC with 100,000 samples

The improved MAC uses 8 multipliers and 7 accumulators. The 8 multipliers operate at the same time in the multiply stage, and the 7 accumulators act in the following accumulator stage.

This MAC can handle 16 operands at the same time. Although its clock period is longer than that of the basic type MAC, just one MAC instruction in the advanced type can replace 8 MAC instructions in the basic type. Stratix II synthesis gives a clock period of about 30ns. This MAC reaches approximately 249% of the performance of the basic type, but our MIPS processor cannot handle 16 operands in just one instruction, hence the basic type MAC was connected to the MIPS processor.
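The 249% figure is consistent with the Stratix II clock periods quoted in this chapter. The short calculation below is a sketch under that assumption (9.326ns for the basic type, about 30ns for the advanced type, 8 basic MAC instructions replaced by one advanced instruction); the variable names are illustrative.

```python
BASIC_PERIOD_NS = 9.326      # basic type MAC, Stratix II
ADVANCED_PERIOD_NS = 30.0    # advanced type MAC, Stratix II (approximate)
MACS_PER_INSTRUCTION = 8     # one advanced instruction replaces 8 basic ones

# Time for 8 multiply-accumulates on each design.
basic_time_ns = MACS_PER_INSTRUCTION * BASIC_PERIOD_NS
advanced_time_ns = ADVANCED_PERIOD_NS

relative_performance = basic_time_ns / advanced_time_ns
print(f"{relative_performance:.2%}")
```

The ratio is 74.608 / 30, or roughly 2.49 times the throughput of the basic type.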

Figure. 4.7. Synthesis report and timing analyzer summary of the 8 parallel MAC with Stratix II

4.2. MIPS

4.2.1. One instruction mul

Figure. 4.8. Post simulation result of mul instruction

To execute this multiplication, the two data values are loaded as immediates and then the mul instruction is executed. The signals in the red box are the stage dividing registers. For example, the MEMWB register is located between the memory and write back stages, so the waveform must be read with pipelining in mind: the data is transferred to the next register on the next clock. The same viewpoint applies to the following waveforms. We can see that the result of this multiplication is correct.

4.2.2. Loop operation

To verify the designed MIPS processor, we wrote some arithmetic algorithms in assembly code. The assembly code is stored in the instruction memory in the form of machine language. The code below is part of the instruction memory. The complete code implements a matrix algorithm, and it is used last for verifying the designed MAC on the MIPS processor.

Figure. 4.9. Instructions in the MIPS instruction memory

We wrote a set of MIPS assembly code and put it into the instruction memory of the MIPS in the form of machine language. It is loop code for matrix calculation.

248, 252, ... are the PC values, which increase by 4 bytes each. The timing diagram in Figure 4.10 shows the result of this code. The code includes data, control, and branch hazards (introduced intentionally for verification), so we can observe the "stall" and "forwarding" behavior with which the MIPS resolves the hazards.

Figure. 4.10. Wave form of loop operation in MIPS

- The terms on the vertical axis are registers. We must remember that the MIPS is pipelined, so a value appears in the register of the next stage one clock after it appears in the register of the current stage.

- PC value 252 is the starting point of Loop1 for the matrix calculation, so we can see the repetition of PC values: 252, 256, 260, ... , 284, 252

- This part calculates 19 * 11 = 209. The mul instruction needs two EXE stages, so the PC value 264 is visible for two clock cycles.

- The IDEXA and IDEXB registers hold the multiply operands. The operands should be 19 and 11, but according to the diagram above they are 18 and 10 when the PC value is 264. The reason is that the previous results have not yet been written into the IDEXA and IDEXB registers. After one clock, the forwarding unit is activated and the value of IDEXB changes to 11 correctly. One clock later, the value of IDEXA is also forwarded by the forwarding unit and changes to 19. Finally, we can see that the value in the MEMWBvalue3 register is 209.

- PC value 280 is the last instruction in Loop1, so after this clock the next instruction, which has PC value 284, is flushed, and we can see that the register values of this instruction are set to zero. The next executed instruction is 252 again.

4.3. MIPS with MAC

This section verifies whether our MIPS and MAC behave correctly together.

Figure. 4.11. Matrix for verification

We calculated this matrix on the MIPS with MAC using two methods. The first uses the MAC just as a Booth multiplier; the second adds the concept of the accumulator to the first.

4.3.1. Use MAC as Booth multiplier

Result of synthesis and post simulation: 8,300 of 38,400 logic elements and a 42 ns clock period.

Figure. 4.12. synthesized result of use MAC as booth multiplier

Figure. 4.13. result waveform of use MAC as booth multiplier

From the waveform, the end time is 15078ns, when the last result data, 782, is output. The result is output from the MEMWBvalue2 register. The value registers 1, 2, and 3 store the data from memory, the ALU, and the MAC, respectively. When the MAC is used just as a Booth multiplier, the matrix calculation needs an add instruction to sum the results of the MAC, so the value written back comes from the ALU.

4.3.2. Use MAC as Multiplier and Accumulator

Result of synthesis and post simulation: 9,025 of 38,400 logic elements and a 56 ns clock period.

Figure. 4.14. synthesized result of use MAC as Multiplier and Accumulator

Figure. 4.15. result waveform of use MAC as Multiplier and Accumulator

This is the last part of the result waveform. The end time is 18312 ns. We can see that the last data is output from the MEMWBvalue3 register. In this case no separate add instruction is needed, so the result of the MAC is written back directly.

4.3.3. Compare the two methods of Matrix Calculating

Table. 4.2. Compare the two methods of Matrix Calculating

This section compares the two methods of matrix calculation. Using the MAC as multiplier and accumulator needs more time and more logic elements, but fewer clock cycles. We performed a relatively simple matrix calculation, so the loop was repeated only a few times. This brings about only a small difference in the number of clock cycles, but the method can be more efficient in a complex matrix calculation. It is a way of keeping the purpose of both the MIPS and the MAC. We can also regard it as a way of designing an efficient DSP machine.
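The trade-off can be made concrete from the end times and clock periods reported in 4.3.1 and 4.3.2. The calculation below is a sketch that assumes each simulation starts at time 0 and that the end time is a whole number of clock periods; the cycle counts are derived here and are not quoted from Table 4.2.

```python
def clock_cycles(end_time_ns, period_ns):
    """Return (whole cycles, leftover ns) for a run of the given length."""
    return divmod(end_time_ns, period_ns)


booth_only = clock_cycles(15078, 42)        # MAC used as Booth multiplier
with_accumulator = clock_cycles(18312, 56)  # MAC used as multiplier and accumulator
print(booth_only, with_accumulator)
```

Under these assumptions the first method takes 359 clocks and the second 327 clocks, so the accumulator saves 32 clocks while the longer clock period makes the total time about 21% longer.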

5. Conclusion

5.1. What we know

First and foremost, we studied the background of the multiplier, the accumulator, and computer architecture, and the design method of each. We then designed each system and developed a set of MIPS assembly code to run on our processor. In doing so we learned the details of the system, the algorithm, the hardware, and finally the software design.

5.2. A fall of the system performance after connecting the MAC to MIPS processor

The clock periods of the basic type of MAC, the advanced type of MAC, and the MIPS processor that supports the basic type of MAC, as discussed in this thesis, are listed below:

Type of a system                        Synthesis results (Quartus II EXCALIBUR_ARM)
Booth multiplier                        21ns
Basic type MAC                          24ns
Advanced type MAC                       30ns
Booth multiplier on MIPS processor      42ns
Basic type MAC on MIPS processor        56ns

Table. 5.1. Clock period of each system

The Booth multiplier and the MAC have small clock periods before they are connected to the MIPS processor. After connection the system performance fell, and the dependence on the synthesis tool played a vital part in this. When the MIPS processor containing the multiplier or the MAC was synthesized, it produced more logic cells and a larger area than the multiplier or the MAC alone, which caused the poor performance of the MIPS synthesis. The other reason for the poor performance was insufficient clock division of the MAC stage. Since our MIPS processor offers 5 stage pipelining, it is easy to implement a "mac" instruction within 5 clocks. If a "mac" instruction takes more than 5 clocks, stall techniques and additional hazard control are needed. Because of this complication in implementing the MIPS processor, we applied only 2 stage pipelining.

5.3. Solution

We could not connect the advanced type MAC, with 8 multipliers and 7 accumulators, to the MIPS processor, since our MIPS processor can handle just 2 operands in one instruction. The advanced type MAC needs the information of 16 operands in one instruction, and an indirect method might solve this problem. In the first clock of a "mac" instruction, the data memory entries for the 8 operand pairs, each holding 2 operands (Hi, Low), are accessed and return 8 index numbers. In the next clock of the "mac" instruction, the MAC accesses the 16 indexed operand fields and multiplies and accumulates them.

6. Appendix

Figure. 6.1. Distribution of random numbers in Verilog HDL (panels: ascent sorting of 100, 1000, 10000, and 100000 samples; frequency of random numbers with an interval of 100,000)

Because I did not know whether the random numbers generated in Verilog HDL for verification were uniform, as they need to be, I used a MATLAB simulation to check this. MATLAB could not support 4294967295 data rows, so I used simple sorting. If the sorted line is linear, then the random numbers generated in Verilog HDL are uniform. I used 100,000 samples with 2 random inputs to verify the multiplier and the MAC. The last graph shows the frequency of the random numbers in intervals of 100,000; it confirms that the random numbers generated in Verilog HDL are uniform and that my verification results are trustworthy.
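The same two checks can be written compactly. The sketch below is a Python illustration, not the MATLAB source of Table 6.1: Python's `random.Random` stands in for the Verilog HDL random number generator, the function names are hypothetical, and the sample count is reduced to 10,000.

```python
import random


def max_deviation_from_line(samples, upper):
    """Largest gap between the sorted, normalized samples and the ideal
    straight line that a uniform distribution would produce."""
    ordered = sorted(samples)
    n = len(ordered)
    return max(abs(v / upper - (i + 0.5) / n) for i, v in enumerate(ordered))


def interval_counts(samples, width, upper):
    """Frequency of samples in consecutive intervals of `width`."""
    counts = [0] * -(-upper // width)   # ceiling division
    for v in samples:
        counts[v // width] += 1
    return counts


rng = random.Random(0)                  # stand-in generator, fixed seed
samples = [rng.getrandbits(32) for _ in range(10000)]
deviation = max_deviation_from_line(samples, 2**32)
counts = interval_counts(samples, 100000, 2**32)
```

A small `deviation` corresponds to the sorted curve being close to linear. Samples confined to the lower half of the range would instead deviate by about 0.5.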

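% Sorting analysis of random numbers
clear;
clc;
data = load('1.txt');               % random numbers to analyse
a = data(1:100);                    % first 100 samples
a_sort = sort(a, 'ascend');
b = data(1:1000);                   % first 1000 samples
b_sort = sort(b, 'ascend');
c = data(1:10000);                  % first 10000 samples
c_sort = sort(c, 'ascend');
data_sort = sort(data, 'ascend');   % all samples

% Range of random numbers
clear;
clc;
%X_axis = 0 : 2^32 - 1;
data = load('1.txt');
freq = zeros(1, 42950);             % one bin per interval of 100,000
for i = 0 : 42949
    % lower edge is inclusive so that values on a bin boundary are counted
    tmp = ((100000 * i <= data) & (data < 100000 * i + 100000));
    freq(i+1) = sum(tmp, 1);
end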

Table. 6.1. MATLAB source for analysis of random numbers

References

[1] M. Amde, I. Blunno, and C. P. Sotiriou. Automating the design of an asynchronous DLX microprocessor. In Proceedings of the 40th Design Automation Conference (DAC), ACM, pages 502-507, 2003.

[2] Israel Koren. Computer Arithmetic Algorithms.

[3] Michael Golden, Trevor Mudge. A Comparison of Two Pipeline Organizations. Electrical Engineering and Computer Science Department, University of Michigan.

[4] Hiroaki Murakami, Naoka Yano, Yukip Ootaguro, Yukio Sugeno, Maki Ueno. A multiplier-accumulator macro for a 45 MIPS embedded RISC processor. IEEE Journal of Solid-State Circuits, vol. 31, no. 7, July 1996.

[5] Laercio Caldeira, Tales Cleber Pimenta, and Evandro D. C. Cotrim. An 9-bit Parallel Pipelined Multiplier Based On 3-bit Recoding From Booth's Algorithm.

[6] Yong Surk Lee. A 4 Clock Cycle 64 x 64 Multiplier with 60Hz clock.

[7] Jonghwan Park (박종환). Design of a 17-bit x 17-bit multiplier for a 32-bit RISC/DSP processor (in Korean).

[8] David A. Patterson, John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann.

[9] Samir Palnitkar. Verilog HDL. Prentice Hall.

[10] MIPS internet URL: http://www.mips.com/

[11] Internet URLs about the DLX processor:

https://www.cs.tcd.ie/Jeremy.Jones/vivio/dlx/dlxtutorial.htm

http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/DLX/aboutDLX.html#Basic

Abstract in Korean


Keywords: Booth's algorithm, Wallace tree, adder, MIPS, processor, RISC format
