<Undergraduate Thesis: EEE-2007-01-48>
Design of MIPS Processor Supporting
MAC with FPGA
Munju Lee, Taewoo Han
School of Electrical and Electronic Engineering
College of Engineering
Yonsei University
Thesis Advisor: Yongserk Lee
A thesis submitted in partial fulfillment
of the requirements for the senior independent study
June 2007
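Acknowledgements
First and foremost, we thank Professor Yongserk Lee, who gave us rewarding lectures over two semesters in 2006 and supervised our graduation research. We also thank the members of the laboratory, who welcomed us sincerely even though we joined only briefly during the vacation. They provided a good research environment for us as we came to the laboratory every day in pursuit of better results. We thank teaching assistant Hoyun Jeon, who, as the assistant in charge of graduation research, spared no advice in helping us set the overall direction. We sincerely thank Wonyeong and Pangi, who readily gave concrete help when at first we could not do anything at all. We also thank Hyeonpil, the course teaching assistant for two semesters; Hyeongpyo, Jongsu and Jaein, who were always at our side; Yongju and Hayeong, who pointed out major problems after just a passing look; Donghyeon, who was always kind no matter how many bothersome questions we asked; and Eunji, Jaehui, Sangsik, Minyeong, who keeps the laboratory running, and Seokhan, along with everyone else in the laboratory.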
Contents
Figure index
Table index
Abstract
1. Introduction
2. Backgrounds
2.1. MAC
2.1.1. A radix-4 modified Booth's algorithm
2.1.2. Sign or zero Extension
2.1.3. Wallace Tree
2.1.4. 4:2 carry save adder
2.1.5. Carry select adder
2.2. MIPS processor
2.2.1. MIPS is the RISC processor
2.2.2. MIPS Instruction formats
2.2.3. Pipelining
2.2.4. Pipeline Hazards
3. Circuit Design Features
3.1. MAC
3.1.1. Block diagram of MAC
3.1.2. Block diagram of multiplier
3.1.3. Block diagram of an Accumulator
3.1.4. Signed or unsigned multiplication select mode
3.1.5. Booth encoder
3.1.6. Problem of the sign extension bits
3.1.7. Wallace tree with 4:2 CSA
3.2. MIPS processor
3.2.1. Instruction fetch
3.2.2. Instruction decode
3.2.3. Execute
3.2.4. Memory
3.2.5. Write back
3.2.6. Using a HDL to Describe and Modeling a MIPS processor
3.3. MIPS with MAC
3.3.1. Special instruction MAC
3.3.2. Total diagram of designed MIPS with MAC
4. Verification and System Analysis
4.1. MAC
4.1.1. Booth multiplier
4.1.2. Basic type MAC
4.1.3. Advanced MAC
4.2. MIPS processor
4.2.1. One instruction mul
4.2.2. Loop operation
4.3. MIPS with MAC
4.3.1. Use MAC as Booth multiplier
4.3.2. Use MAC as Multiplier and Accumulator
4.3.3. Compare the two methods of Matrix Calculating
5. Conclusion
5.1. What we know
5.2. A fall of system performance after connecting MAC to MIPS processor
5.3. Solution
6. Reference
Abstract in Korean
Figure index
Figure. 2.1. 4 : 2 carry save adder and implementation
Figure. 2.2. Block diagram of 64bit carry select adder
Figure. 2.3. Block diagram of 8bit carry select adder
Figure. 3.1. Block diagram of MAC (Multiplier and Accumulator Unit)
Figure. 3.2. Block diagram of a 32bit x 32bit Booth multiplier
Figure. 3.3. Block diagram of an accumulator
Figure. 3.4. Unnecessary sign extension bits
Figure. 3.5. Principle of sign generate method
Figure. 3.6. Array of partial products after applying sign generate method
Figure. 3.7. Array of partial products after applying further modified method
Figure. 3.8. Array of partial products in our multiplier
Figure. 3.9. Wallace tree with 4:2 CSA
Figure. 3.10. Wallace tree with full adder
Figure. 3.11. Total diagram of designed MIPS
Figure. 3.12. Total diagram of designed MIPS with MAC
Figure. 4.1. Verification of 2 stage pipelining multiplier with 100,000 samples
Figure. 4.2. Synthesis report and vector wave form file of multiplier with EXCALIBUR_ARM
Figure. 4.3. Operation of MAC with multiplier and 1 accumulator
Figure. 4.4. Synthesis report of the MAC with Stratix II
Figure. 4.5. Block diagram of 8 parallel MAC with 8 multipliers and 7 accumulators
Figure. 4.6. Verilog HDL verification of 8 parallel MAC with 100,000 samples
Figure. 4.7. Synthesis report and timing analyzer summary of the 8 parallel MAC with Stratix II
Figure. 4.8. Post simulation result of mul instruction
Figure. 4.9. Instructions in the MIPS instruction memory
Figure. 4.10. Wave form of loop operation in MIPS
Figure. 4.11. Matrix for verification
Figure. 4.12. Synthesized result of use MAC as Booth multiplier
Figure. 4.13. Result waveform of use MAC as Booth multiplier
Figure. 4.14. Synthesized result of use MAC as Multiplier and Accumulator
Figure. 4.15. Result waveform of use MAC as Multiplier and Accumulator
Figure. 6.1. Distribution of random numbers in Verilog HDL
Table index
Table. 2.1. The Truth table of A Radix-4 Modified Booth's Algorithm
Table. 2.2. Multiplication sign or zero extension with Booth's radix-4
Table. 2.3. Range of signed and unsigned multiplication
Table. 2.4. MIPS instruction formats
Table. 2.5. Pipelined instructions
Table. 2.6. MIPS instructions classically take five steps
Table. 2.7. Sequence of instructions
Table. 3.1. Equation of further modified method
Table. 3.2. The set of Instructions supported by designed MIPS
Table. 4.1. MAC instruction based on [x by 4] matrix multiplication [4 by y]
Table. 4.2. Compare the two methods of Matrix Calculating
Table. 5.1. Clock period of each system
Table. 6.1. MATLAB source for analysis of random numbers
ABSTRACT
Design of MIPS Processor supporting MAC with FPGA
In this paper, we first designed a 32bit × 32bit Booth multiplier and an accumulator, implemented with one 64bit carry select adder, that adds successive multiplier outputs. This MAC (Multiplier and Accumulator Unit), with 1 multiplier and 1 accumulator, operates in our MIPS processor (supporting add, sub, bnq, ... instructions).
The MAC is composed of a multiplier and an accumulator. The multiplier is composed of a Booth encoder block, a Wallace tree, and a 64bit carry select adder block. In 32bit × 32bit multiplication, the modified Booth's radix-8 algorithm cannot offer fast multiplication because of the 3X problem. Therefore, we used the modified radix-4 Booth's algorithm, in which the outputs of the encoding block are the partial products plus additional 1bit signals that are added in the Wallace tree. In 2's complement, -X is the sum of the bitwise complement and 1; complementing is fast, but adding 1 is slow. Thus, the encoder produces 2 outputs, and adding the 1 bit in the Wallace tree in the -X and -2X cases improves system performance. To add 18 operands at the same time, the Wallace tree has 4 stages of 4:2 carry save adders. Implementing the Wallace tree with 4:2 CSAs improves regularity because c_out is independent of c_in, so the adders in each stage operate at the same time. The final result is the sum of the carry vector and the sum vector, computed by the 64bit carry select adder. This 64bit carry select adder is composed of 8bit carry select adders, and each 8bit adder is a serial connection of full adders and a half adder. We applied 2 stage pipelining: the first stage holds the encoding block and the Wallace tree, and the second stage holds the 64bit carry select adder block. Synthesis of this multiplier for the EXCALIBUR_ARM family gives a 21ns clock period and 2,879 logic elements. The output of the multiplier is accumulated by the accumulator, which also uses a 64bit carry select adder. We were thereby able to apply 2 stages of pipelining, a multiply stage and an accumulate stage. Synthesis of this MAC for the EXCALIBUR_ARM family gives a 24.396ns clock period and 3,129 logic elements, and synthesis for the Stratix II family gives a 9.326ns clock period. Finally, we implemented a multiply stage with 8 multipliers and an accumulate stage with 7 accumulators; synthesis for Stratix II shows a 30ns clock period. We obtained a 249% performance improvement; however, the 'mac' instruction in our MIPS processor can handle only two operands per instruction. Hence, we could not connect the 8 parallel MAC to the MIPS processor.
In addition, we generated 100,000 random numbers in Verilog HDL to verify our Booth multiplier, basic type MAC and advanced type MAC. We confirmed through MATLAB simulation that the random numbers generated in Verilog HDL were uniformly distributed.
The MIPS processor is a standard DLX machine. We designed the MIPS processor according to the MIPS3000 format. It supports R, I, and J type instructions. Pipelining can improve processor performance dramatically, and a hazard control unit is needed to handle the hazards that pipelining introduces. The processor also supports the MAC for efficient multiplication. It supports several MIPS assembler instructions, and in addition to these we created a MAC instruction. The mul instruction is executed simply by the * operator in the ALU, and the MAC instruction adds the concept of an accumulator to it.
We described and modeled the MIPS processor in the following order. First, implement pipelining. With nonblocking assignments in an always block triggered by the clock edge, the register values of every stage are transferred concurrently, which implements the pipeline. Second, connect the control lines. Design is easier if the data registers are implemented in the previous step and the control lines are then designed separately. Registers are sequential logic, and the connections between them are combinational logic. This methodology, which is RTL design, has another advantage: during verification, the location of faulty logic can be found easily. Third, add the hazard control units, namely the forwarding unit and the stall unit; connecting them completes the MIPS processor. Multiply and MAC are R-type instructions and therefore have nothing to do in the Memory stage, so we divided the MAC in two and placed the halves in the Execute stage and the Memory stage of the MIPS. This increases the performance of the MIPS by reducing the clock period. In effect, for these instructions the Memory stage serves as a second Execute stage. This reduces the possibility of additional hazards caused by the new MAC instruction and preserves the RISC style. Finally, we obtained the gate level logic from the designed RTL logic with a synthesis tool.
To verify that the designed MIPS processor operates correctly, we analyzed its behavior with a single mul instruction and with a loop operation algorithm. The last verification is for the MIPS with MAC. We calculated a matrix set on the MIPS with MAC in two ways: the first uses the MAC only as a Booth multiplier, and the second adds the concept of the accumulator to the first.
Key words : Booth multiplier, accumulator, Wallace tree, MIPS, processor, RISC type
1. Introduction
The MAC (Multiplier and Accumulator Unit) is used for image processing and digital signal processing (DSP) in a DSP processor. Our MAC uses Booth's radix-4 algorithm, a Wallace tree, 4:2 CSAs, and a 64bit carry select adder to improve speed. MIPS was implemented as a family of microprocessors whose simple register oriented instruction sets permit high performance pipelined implementations.
Although these techniques (radix-4 algorithm, pipelining, etc.) are widely used to speed up each part, the MAC on a specific processor cannot run at 100% efficiency. Because the MAC is slow, the MIPS instruction "mul" (multiplication) takes longer than any other instruction in our MIPS processor. To improve the speed of the MIPS, the MAC needs to be fast and the MIPS must treat the "mul" instruction specially. The method we chose was to design a multi-clock MAC instead of a one-clock MAC, which improved the speed of the MIPS. In general, the instruction set of a MIPS processor includes complex operations such as multiplication and floating point operations, which have multiple execution stages. Following the same approach, we were able to raise the system clock of the processor.
We applied 2 stage pipelining to the MAC in the MIPS processor, and as a result we were able to compute, on our MIPS processor that supports the MAC, the matrix multiplication used in image processing.
2. Backgrounds
2.1. MAC
2.1.1. A radix-4 modified Booth's algorithm
Booth's algorithm is simple but powerful. The speed of the MAC depends on the number of partial products and on the speed of accumulating them. Booth's algorithm reduces the number of partial products. We chose the radix-4 algorithm for the reasons below.
①. The original Booth's algorithm has an inefficient case: 33 partial products are generated in a 32bit x 32bit signed or unsigned multiplication.
②. The modified Booth's radix-8 algorithm has a fatal encoding time in 32bit x 32bit multiplication.
The radix-8 algorithm has a 3x term, which means that a partial product cannot be generated by shifting alone; 2x + 1x must be computed during encoding. One solution is to handle the additional 1x term in the Wallace tree. However, a large Wallace tree has problems of its own.
A radix-4 modified Booth's algorithm : Booth's radix-4 algorithm is widely used to reduce the area of a multiplier and to increase its speed. Grouping the multiplier bits 3 at a time with overlap halves the number of partial products, which improves system speed. The radix-4 modified Booth's algorithm is shown below:
① Set y-1 = 0; insert 0 on the right side of the LSB of the multiplier.
② Starting from y-1, group the bits 3 at a time with a 1 bit overlap.
③ If the number of multiplier bits is odd, add 1 extra bit on the left side of the MSB.
④ Generate each partial product from the truth table.
⑤ Add each new partial product shifted 2 bits further left than the previous one.
yi+1  yi  yi-1   partial product
 0    0    0     0X   (No string)
 0    0    1     +1X  (End of string)
 0    1    0     +1X  (A string)
 0    1    1     +2X  (End of string)
 1    0    0     -2X  (Beginning of string)
 1    0    1     -1X  (-2X+X)
 1    1    0     -1X  (Beginning of string)
 1    1    1     -0X  (Center of string)
Table. 2.1. The Truth table of A Radix-4 Modified Booth's Algorithm
x: multiplicand y: multiplier
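The recoding steps and the truth table above can be sketched in software. The following Python model is only an illustration under our own assumptions: the function names booth_radix4_digits and booth_multiply are hypothetical, the multiplier y is passed as an ordinary signed integer that fits in the given bit width, and nothing here reflects the Verilog HDL of the actual design.

```python
# Truth table of Table 2.1: (y[i+1], y[i], y[i-1]) -> multiple of X
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}

def booth_radix4_digits(y, bits):
    """Recode a signed multiplier of width `bits` into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant digit first.
    """
    if bits % 2:                              # step 3: pad odd widths by one bit
        bits += 1
    y_ext = (y & ((1 << bits) - 1)) << 1      # step 1: append y[-1] = 0
    digits = []
    for i in range(0, bits, 2):               # step 2: overlapping 3-bit groups
        group = (y_ext >> i) & 0b111          # bits y[i+1], y[i], y[i-1]
        digits.append(BOOTH_TABLE[group])     # step 4: look up the multiple
    return digits

def booth_multiply(x, y, bits):
    """Step 5: add the partial products, each shifted 2 bits further left."""
    digits = booth_radix4_digits(y, bits)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))
```

With a 33bit multiplier the sketch yields 17 digits, and with a 32bit multiplier it yields 16, which matches the partial product counts quoted in this chapter.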
2.1.2. Sign or zero extension
Our MAC supports signed or unsigned multiplication, and the 64bit result is stored in 2 special 32bit registers. The MAC first receives a multiplicand and a multiplier, but Booth's radix-4 algorithm treats 32bit operands only as signed numbers. Hence, an extension bit is required so that 32bit unsigned numbers can also be handled. The core idea is that a 32bit unsigned number can be expressed as a 33bit signed number. 17 partial products are generated in the 33bit x 33bit case (16 partial products in the 32bit x 32bit case).
Here is an example of signed and unsigned multiplication. When x (multiplicand) is 3bit 111 and y (multiplier) is 3bit 111, the signed and unsigned products differ. In the signed case 111 × 111 = 000001 ( -1 x -1 = 1 ), and in the unsigned case 111 × 111 = 110001 ( 7 x 7 = 49 ).
                 signed ( -1 x -1 = 1 )        unsigned ( 7 x 7 = 49 )
x                  (1) 1 1 1                     (0) 1 1 1
y                × (1) 1 1 1 (0)               × (0) 1 1 1 (0)
partial product  0 0 0 0 0 1    -1x            1 1 1 0 0 1    -x
partial product  0 0 0 0        0x             1 1 1 0        +2x
result           0 0 0 0 0 1    1 dec          1 1 0 0 0 1    49 dec
Table. 2.2. Multiplication sign or zero extension with Booth's radix-4
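The same idea at full width can be checked with a short Python sketch. It is a behavioral model only, written under our own assumptions: the names extend_to_33 and multiply_64 are hypothetical, operands are passed as 32bit patterns, and a flag selects signed or unsigned mode.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def extend_to_33(word, signed):
    """Interpret a 32bit pattern as a 33bit signed value (a Python int)."""
    word &= MASK32
    if signed and word & 0x80000000:
        return word - (1 << 32)   # sign extension: bit 32 copies bit 31
    return word                   # zero extension: bit 32 is 0

def multiply_64(a, b, signed):
    """Multiply two 32bit patterns and return the 64bit result pattern."""
    product = extend_to_33(a, signed) * extend_to_33(b, signed)
    return product & MASK64
```

For the all-ones operands, the signed mode gives 1 ( -1 x -1 ) and the unsigned mode gives (2^32 - 1)^2, mirroring the 3bit example in the table above.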
Both the signed and the unsigned case in the example above give a 6bit result. Likewise, our 32bit x 32bit MAC has a 64bit result. The ranges of 32bit x 32bit signed and unsigned multiplication are shown below:
               signed multiplication             unsigned multiplication
multiplicand   -2^31 ∼ 2^31 - 1                  0 ∼ 2^32 - 1
multiplier     -2^31 ∼ 2^31 - 1                  0 ∼ 2^32 - 1
result         -2^31 × (2^31 - 1) ∼ 2^62         0 ∼ (2^32 - 1)^2
Table. 2.3. Range of signed and unsigned multiplication
The table above confirms that the result of 32bit x 32bit signed and unsigned multiplication can be expressed in 64 bits. Therefore, the 65th overflow bit generated in the Wallace tree was not considered.
2.1.3. Wallace tree
A ripple carry adder suffers from carry propagation, because the adder of each stage cannot finish until the carry of the previous stage arrives. Thus, if adders are connected in series, the propagation delay increases in proportion to the number of adders. A Wallace tree has a smaller propagation delay, since independent additions execute in the same stage. Assume we add 4 operands a, b, c, d.
① Inefficient case (serial connection of adders):
(l = a + b) next (m = l + c) next (result = m + d)
② Efficient case (concept of Wallace tree)
(l = a + b and m = c + d) next (result = l + m)
In the example above, the first case takes 3 adder delays, whereas the second takes only 2. Similarly, a Wallace tree adds operands that are independent of each other in parallel, so that the carry propagation delay is hidden within the time of the adding operations. The output of the Wallace tree consists of a carry vector and a sum vector.
2.1.4. 4 : 2 CSA (Carry Save Adder)
Figure. 2.1. 4 : 2 Carry Save Adder and Implementation
A 4 : 2 Carry Save Adder has 5 inputs and 3 outputs. Four inputs are operands and one input is c_in; the outputs are the sum, carry and c_out. The schematic is on the right side of the figure above. The 4:2 CSA is slower than a full adder; however, it processes 5 inputs at the same time where the full adder processes 3, and its regularity improves speed. Since c_out is independent of c_in, the c_out of each bit position can feed the c_in of the next more significant position within the same stage, without a carry rippling through the stage.
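As a small check of this property, a 4:2 CSA cell can be modeled as two chained full adders. This is a common textbook construction that we assume here for illustration; the figure in this section may use a different gate-level implementation, and the function names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) for three 1bit inputs."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def compressor_4_2(x1, x2, x3, x4, c_in):
    """4:2 CSA cell: 5 inputs, outputs (sum, carry, c_out)."""
    s1, c_out = full_adder(x1, x2, x3)    # c_out is formed without using c_in
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out

def check_compressor():
    """Exhaustively verify the arithmetic and the independence of c_out."""
    for x1, x2, x3, x4, c_in in product((0, 1), repeat=5):
        s, carry, c_out = compressor_4_2(x1, x2, x3, x4, c_in)
        # carry and c_out both have weight 2 (they go to the next column)
        assert x1 + x2 + x3 + x4 + c_in == s + 2 * (carry + c_out)
        # flipping c_in never changes c_out
        assert c_out == compressor_4_2(x1, x2, x3, x4, 1 - c_in)[2]
    return True
```

Because c_out never depends on c_in, chaining the cells of one stage side by side does not create a ripple path through the stage.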
2.1.5. Carry select adder
The result from the Wallace tree is a sum vector and a carry vector, which must be added to get the final result. If we used a ripple carry adder to add the sum and carry vectors, the carry propagation delay would hurt the performance of the MAC. A carry select adder computes each block's sum for both possible carry-in values, 0 and 1, and then selects the correct one. Despite its larger implementation, it increases the speed of execution.
Figure. 2.2. Block diagram of 64bit Carry Select Adder
All 8bit carry select adders execute at the same time, so the delay of the 64bit carry select adder is the delay of one 8bit carry select adder plus some mux delay for selecting each following sum.
Figure. 2.3. Block diagram of 8bit Carry Select adder
Figure. 2.3. shows the 8bit carry select adder. It is a serial connection of full adders and a half adder.
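The selection scheme can be sketched in Python as follows. This is a behavioral illustration under our own assumptions: the function names are hypothetical, every bit uses a full adder model (the half adder in the least significant position of the real block is not modeled separately), and the two candidate sums that hardware computes in parallel are simply computed one after the other.

```python
def ripple_add(a, b, carry_in, width):
    """Ripple carry addition of two `width`-bit values; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (x & carry) | (y & carry)
    return total, carry

def carry_select_add(a, b, width=64, block=8):
    """Carry select addition built from `block`-bit ripple adders."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for pos in range(0, width, block):
        a_blk, b_blk = (a >> pos) & mask, (b >> pos) & mask
        sum0, c0 = ripple_add(a_blk, b_blk, 0, block)   # candidate for carry-in 0
        sum1, c1 = ripple_add(a_blk, b_blk, 1, block)   # candidate for carry-in 1
        blk_sum, carry = (sum1, c1) if carry else (sum0, c0)   # the mux
        result |= blk_sum << pos
    return result, carry
```

For example, carry_select_add(2**64 - 1, 1) returns a zero sum with a carry out of 1, as expected for a 64bit adder.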
2.2. MIPS Processor
The MIPS architecture is one of the most widely supported processor architectures,
with a vast infrastructure of industry-standard tools, software and services that help
ensure rapid, reliable and cost-effective SoC design. MIPS Technologies delivers the
widest range of low power, high-performance processor cores with distinct cost
advantages to semiconductor companies, ASIC developers and system OEMs
worldwide. The company offers award-winning embedded microprocessor cores,
instruction set architectures, as well as system controllers, and incorporates a complete
array of software and system level debug tools to provide maximum flexibility and
convenience for ever-changing market requirements: high-performance, low power, low
silicon cost and rapid time-to-market.
To command a computer's hardware, you must speak its language. The words of a
computer's language are called instructions, and its vocabulary is called an instruction
set. One might expect the languages of computers to be as diverse as those of humans, but in
reality computer languages are quite similar, more like regional dialects than like
independent languages. This similarity occurs because all computers are constructed
from hardware technologies based on similar underlying principles and because there
are a few basic operations that all computers must provide. Generally, computer
designers have a common goal to find a language that makes it easy to build the
hardware and compiler while maximizing performance and minimizing cost.
2.2.1. MIPS is the RISC processor
The MIPS processor, designed in 1984 by researchers at Stanford University, is a
RISC (Reduced Instruction Set Computer) processor. Compared with their CISC
(Complex Instruction Set Computer) counterparts (such as the Intel Pentium
processors), RISC processors typically support fewer and much simpler instructions.
The premise is, however, that a RISC processor can be made much faster than a
CISC processor because of its simpler design. These days, it is generally accepted that
RISC processors are more efficient than CISC processors; and even the only popular
CISC processor that is still around (Intel Pentium) internally translates the CISC
instructions into RISC instructions before they are executed. RISC processors typically
have a load-store architecture. This means there are two instructions for accessing
memory: a load "l" instruction to load data from memory and a store "s" instruction
to write data to memory. It also means that none of the other instructions can access
memory directly. So, an instruction like "add this byte from memory to register 1"
from a CISC instruction set would need two instructions in a load-store architecture:
"load this byte from memory into register 2" and "add register 2 to register 1". CISC
processors also offer many different addressing modes. Consider the following
instruction from the Intel 80x86 instruction set (with simplified register names):
add r1, [r2+r3*4+60] // x86 (not MIPS) example
This instruction loads a value from memory and adds it to a register. The memory
location is given in between the square brackets. As you can see, the Intel instruction
set, as is typical for CISC architectures, allows for very complicated expressions for
address calculations or "addressing modes". The MIPS processor, on the other hand,
only allows for one, rather simple, addressing mode: to specify an address, you can
specify a constant and a register. So, the above Intel instruction could be translated as:
slli r3, r3, 2 // r3 := r3 << 2 (i.e. r3 := r3 * 4)
add r2, r2, r3 // r2 := r2 + r3
l r4, 60(r2) // r4 := memory[60+r2]
add r1, r1, r4 // r1 := r1 + r4
We need four instructions instead of one, and an extra register (r4) to do what can
be done with one instruction in a CISC architecture. The internal circuitry of the
RISC processor is much simpler, however, and can thus be made very fast.
2.2.2. MIPS Instruction formats
Table. 2.4. MIPS instruction formats
Here is the meaning of each field name in MIPS instructions:
op : opcode. Basic operation of the instruction.
rs : the first register source operand.
rt : the second register source operand.
rd : the register destination operand. It gets the result of the operation.
shamt : shift amount.
funct : function. This field selects the specific variant of the operation in the op field.
The compromise chosen by the MIPS designers is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. The formats above are called R-type (for register), I-type (for immediate), and J-type (for jump).
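To make the field layout concrete, the sketch below splits a 32bit instruction word into its fields. It assumes the standard MIPS field widths (op 6 bits, rs, rt, rd and shamt 5 bits each, funct 6 bits, a 16bit immediate for I-type and a 26bit address for J-type); the function names are hypothetical and are not taken from the design in this thesis.

```python
def decode_r_type(word):
    """Split a 32bit word into R-type fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }

def decode_i_type(word):
    """Split a 32bit word into I-type fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }

def decode_j_type(word):
    """Split a 32bit word into J-type fields."""
    return {"op": (word >> 26) & 0x3F, "address": word & 0x3FFFFFF}
```

For instance, the word 0x02324020 decodes as op 0, rs 17, rt 18, rd 8, shamt 0 and funct 32, which is the usual encoding of an add with registers 17 and 18 as sources and register 8 as destination.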
2.2.3. Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. The computer pipeline is divided into stages, and each stage completes a part of an instruction in parallel with the others. As long as we have separate resources for each stage, all stages can operate concurrently. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end. Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.
Clock number       1    2    3    4    5    6    7    8    9
instruction i      IF   ID   EX   MEM  WB
instruction i+1         IF   ID   EX   MEM  WB
instruction i+2              IF   ID   EX   MEM  WB
instruction i+3                   IF   ID   EX   MEM  WB
instruction i+4                        IF   ID   EX   MEM  WB
Table. 2.5. pipelined instructions
The simplicity of the MIPS instruction set makes resource evaluation relatively easy.
The major functional units are used in different cycles and hence overlapping the
execution of multiple instructions introduces relatively few conflicts.
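The timing in Table 2.5 can be reproduced with a few lines of Python. The sketch assumes an ideal five stage pipeline with no stalls, and the names pipeline_schedule and total_cycles are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Map each instruction index to {clock number: stage} with no stalls."""
    return {
        i: {i + s + 1: name for s, name in enumerate(STAGES)}
        for i in range(n_instructions)
    }

def total_cycles(n_instructions):
    """Clocks needed to finish n instructions in an ideal pipeline."""
    return len(STAGES) + n_instructions - 1
```

Five instructions finish in 9 clocks, as in the table, instead of the 25 clocks needed if each instruction had to complete all five stages before the next one started.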
IF   The Instruction Fetch stage fetches the next instruction from memory using the address in the PC (Program Counter) register and stores this instruction in the IR (Instruction Register).
ID   The Instruction Decode stage decodes the instruction in the IR, calculates the next PC, and reads any operands required from the register file.
EX   The Execute stage "executes" the instruction. In fact, all ALU operations are done in this stage. (The ALU is the Arithmetic and Logic Unit and performs operations such as addition, subtraction, shifts left and right, etc.)
MEM  The Memory Access stage performs any memory access required by the current instruction. So, for loads, it would load an operand from memory. For stores, it would store an operand into memory. For all other instructions, it would do nothing.
WB   For instructions that have a result (a destination register), the Write Back stage writes this result back to the register file. Note that this includes nearly all instructions, except nops (a nop, no-op or no-operation instruction simply does nothing) and stores (s).
Table. 2.6. MIPS instructions classically take five steps
2.2.4. Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the
instruction stream from executing during its designated clock cycle. Hazards
reduce the performance from the ideal speedup gained by pipelining. There are three
classes of hazards:
Structural Hazards: They arise from resource conflicts when the hardware cannot
support all possible combinations of instructions in simultaneous
overlapped execution.
Data Hazards: They arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of
instructions in the pipeline.
Control Hazards: They arise from the pipelining of branches and other instructions
that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline. The processor can
stall on different events:
A cache miss: A cache miss stalls all the instructions on pipeline both before and
after the instruction causing the miss.
A hazard in pipeline: Eliminating a hazard often requires that some instructions in
the pipeline be allowed to proceed while others are
delayed. When the instruction is stalled, all the instructions
issued later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled instruction must
continue, since otherwise the hazard will never clear.
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
Table. 2.7. sequence of instructions
The problem with data hazards, introduced by this sequence of instructions can be
solved with a simple hardware technique called forwarding. The key insight in
forwarding is that the result is not really needed by SUB until after the ADD actually
produces it. The only problem is to make it available for SUB when it needs it. If
the result can be moved from where the ADD produces it (EX/MEM register), to
where the SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input
latches.
If the forwarding hardware detects that the previous ALU operation has written the
register corresponding to the source for the current ALU operation, control logic
selects the forwarded result as the ALU input rather than the value read from the
register file.
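The selection made by the forwarding hardware can be illustrated with a small Python model of the ADD and SUB sequence above. Everything in it is an assumption for illustration: the function name alu_operand, the dictionary used for the EX/MEM register, the register values, and the convention that register 0 is never forwarded (as in standard MIPS, where it always reads as zero).

```python
def alu_operand(regfile, ex_mem, src):
    """Choose an ALU input: the forwarded EX/MEM result or the register file value."""
    if ex_mem is not None and ex_mem["rd"] == src and ex_mem["rd"] != 0:
        return ex_mem["value"]      # previous ALU result, not yet written back
    return regfile[src]

# ADD R1, R2, R3 has just left EX: its result waits in the EX/MEM register.
regfile = {0: 0, 1: 999, 2: 10, 3: 20, 5: 100}   # R1 still holds a stale value
ex_mem = {"rd": 1, "value": regfile[2] + regfile[3]}

# SUB R4, R5, R1 is now in EX and needs the new R1.
sub_result = alu_operand(regfile, ex_mem, 5) - alu_operand(regfile, ex_mem, 1)
```

Here sub_result is computed from the forwarded value 30 rather than the stale 999, so SUB proceeds without a stall.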
3. Circuit Design Features
3.1. MAC
3.1.1 Block Diagram of MAC
Figure. 3.1. Block diagram of MAC ( Multiplier and Accumulator Unit )
We applied 2 stage pipelining to the MAC. In the first (multiplier) stage, the 32bit multiplier and the 32bit multiplicand are stored in 2 D flip-flop registers. At the rising edge of the system clock, the multiplicand and multiplier become the inputs of the Booth multiplier, which produces a 64bit output. At the next clock, the D flip-flops in the accumulator release the result of the Booth multiplier and the accumulator's output from the previous clock. The accumulator adds these 2 inputs through a 64bit carry select adder and produces two outputs: a 1bit overflow and the 64bit accumulation result. Since the multiplier and the accumulator are independent, applying 2 stage pipelining is straightforward.
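The clock-by-clock behavior described above can be sketched as a small Python model. It is a simplified illustration under our own assumptions: the class name PipelinedMac is hypothetical, operands are treated as unsigned, and the overflow output is modeled as the carry out of the 64bit addition.

```python
MASK64 = (1 << 64) - 1

class PipelinedMac:
    """Two stage MAC: a multiply stage followed by an accumulate stage."""

    def __init__(self):
        self.product_reg = 0   # register between the two stages
        self.acc_reg = 0       # 64bit accumulation register
        self.overflow = 0      # 1bit overflow output

    def clock(self, multiplicand, multiplier):
        """One rising clock edge; both stages use the old register values."""
        total = self.acc_reg + self.product_reg               # accumulate stage
        self.overflow = (total >> 64) & 1
        next_acc = total & MASK64
        self.product_reg = (multiplicand * multiplier) & MASK64   # multiply stage
        self.acc_reg = next_acc
        return self.acc_reg
```

Feeding the pairs (2, 3) and (4, 5) followed by one flushing clock with zero operands gives the accumulator values 0, 6 and 26, which shows the one clock latency between the multiply stage and the accumulate stage.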
3.1.2. Block Diagram of Multiplier
Figure. 3.2. Block diagram of a 32bit x 32bit Booth multiplier
Figure. 3.2. is the diagram of the Booth multiplier. The 32bit multiplicand and multiplier are sign extended or zero extended according to a 1bit input that selects signed or unsigned multiplication. The Booth encoder produces 17 partial products and 17 additional single bits to be added in the Wallace tree. Adding these bits in the Wallace tree, rather than in the encoding block, improves system performance by shortening the encoding time. The outputs of the Wallace tree are 64bit carry and sum vectors. Both vectors are added in the 64bit carry select adder block to generate the final result.
3.1.3. Block Diagram of an Accumulator
Figure. 3.3. Block diagram of an accumulator
The accumulator holds the output of the previous clock in the accumulation register. Holding outputs in the accumulation register removes the need for an additional "add" instruction. The accumulator was implemented with one 64bit carry select adder.
3.1.4. Signed or unsigned multiplication select mode
The 32bit multiplicand and multiplier arrive at the same time; then, according to the mode select signal, the 32bit operands become 33bit through sign or zero extension.
3.1.5. Booth encoder
Because the multiplier is 33 bits wide, 17 partial products are generated by 17 Booth encoders. Each encoder block is connected to the x, 2x, 0x, -2x and -x signals, so all 17 Booth encoders operate at the same time. However, the -x and -2x signals are only 1's complements, whereas a 2's complement requires complementing each bit and then adding 1. Adding 1 to the complemented number inside the Booth encoding block would incur carry propagation delay, so it is not a wise choice. Instead, when -x or -2x is generated, an additional register bit is set to record that a 1 must still be added to the 1's complement. The contents of these additional registers are added in the Wallace tree with a smaller penalty. Our MAC thus generates 17 partial products and 17 signals to be added in the Wallace tree.
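The split between the 1's complement pattern and the deferred +1 can be sketched in Python. The function name encode_partial_product and the width parameter are hypothetical, and the sketch models only the arithmetic identity, not the encoder circuit itself.

```python
def encode_partial_product(x, digit, width):
    """Return (pattern, neg) for one Booth digit in {-2, -1, 0, 1, 2}.

    pattern is a `width`-bit value; neg is the extra 1 that is left
    for the Wallace tree to add.
    """
    mask = (1 << width) - 1
    magnitude = (abs(digit) * x) & mask   # 0, x or 2x (2x is a 1 bit shift)
    if digit < 0:
        return ~magnitude & mask, 1       # 1's complement now, +1 deferred
    return magnitude, 0

def check_encoding(width=8):
    """pattern + neg must equal digit * x modulo 2**width for every case."""
    mask = (1 << width) - 1
    for x in range(-(1 << (width - 2)), 1 << (width - 2)):
        for digit in (-2, -1, 0, 1, 2):
            pattern, neg = encode_partial_product(x, digit, width)
            assert (pattern + neg) & mask == (digit * x) & mask
    return True
```

The encoder itself therefore needs only an inverter per bit, and the carry propagation caused by the +1 is absorbed by the carry save adders of the Wallace tree.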
3.1.6. Sign extension problem
When the 17 partial products are added in the Wallace tree, sign extension is essential. But unnecessary sign extension terms slow down the MAC. In the 8bit x 8bit case below, the S0, S1, S2 terms can be reduced by a technique called the sign generate method.
Figure. 3.4. Unnecessary sign extension bits
The S0S0S0S0S0S0S0S0 term in the first row is equivalent to -S0 in 2's complement, so the sign extension bits can be reduced. -S0 is not a 1bit value, but it can easily be changed to a non-negative form, as the following diagram depicts.
Figure. 3.5. Principle of sign generate method
The sign generate method leaves only the sign extension terms that are really essential, as follows.
Figure. 3.6. Array of partial products after applying sign generate method
The 'neg' terms are the additional terms to be added in the Wallace tree.
Applying some further equations changes the array above.
Table. 3.1. Equation of further modified method
This table shows our 32bit x 32bit signed or unsigned case. The array shape, however, is the same as in the small multiplication below. The partial product arrays are shown below:
Figure. 3.7. Array of partial products after applying further modified method
Figure. 3.8. Array of partial products in our multiplier
3.1.7. Wallace tree with 4:2 CSA
4 stages are needed to generate carry and sum vector with 4:2 CSA wallace tree.
Figure. 3.9. Wallace tree with 4:2 CSA
The 33rd bits has 18 operands, 4 stages and eight 4:2 CSAs are needed in the
wallace tree. The term '4 stages' means multiplier can calculate the 17 partial product
arrays in just four 4:2 CSA delay time.
- xxxi -
Figure. 3.10. Wallace tree with full adder
Our 4:2 CSA Wallace tree has a delay of 4 × 1.5 delta. The Wallace tree with full adders has 6 stages and a delay of 6 × 1 delta. The two have almost the same performance, but I chose the 4:2 CSA Wallace tree since the regularity of the 4:2 CSA can improve the system performance. When I changed the Wallace tree from three 4:2 CSA stages plus one additional full adder stage to four 4:2 CSA stages, the system performance improved by 10%. In a 4:2 CSA, the output C_out is independent of C_in. This means that all 4:2 carry save adders in the same stage can operate at almost the same time, which further improves the regularity of the system.
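The independence of C_out from C_in can be seen in a bit-level model. The Python sketch below builds a 4:2 CSA from two full adders, which is a common textbook construction and an assumption here; the thesis may use a different gate-level circuit. The signal names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, c_in):
    """4:2 carry save adder made of two full adders.

    c_out is produced by the first full adder only, so it never waits for c_in.
    """
    s1, c_out = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, c_in)
    return total, carry, c_out


# Exhaustive check over all 32 input combinations.
for x1, x2, x3, x4, c_in in product((0, 1), repeat=5):
    total, carry, c_out = compressor_4_2(x1, x2, x3, x4, c_in)
    # The five input bits are preserved as sum + 2 * (carry + c_out).
    assert x1 + x2 + x3 + x4 + c_in == total + 2 * (carry + c_out)
    # Flipping c_in never changes c_out.
    assert c_out == compressor_4_2(x1, x2, x3, x4, 1 - c_in)[2]
```

Since c_out of one column feeds c_in of the next column in the same stage, this independence is what prevents a carry from rippling along the row.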
3.2. MIPS
Figure. 3.11. Total diagram of designed MIPS
This diagram shows our designed MIPS. It is an abstract view of the implementation of the MIPS subset, showing the major functional units and the major connections between them, including the necessary multiplexors and control lines. It also has a Forwarding unit and a Hazard detection unit. An instruction is executed in five stages (IF, ID, EXE, MEM, WB), and the registers named after these stages can be seen in the diagram above. These registers divide the stages from each other, which is what allows the MIPS to do pipelining. For example, the EXE stage lies between the ID/EXE register and the EXE/MEM register. This design methodology is RTL (Register Transfer Level): all registers (state elements) are sequential logic, and the logic in between the registers is combinational. This method has many advantages: the designed code is easy to debug, and it relates directly to the clocking methodology. Let us look into the behavior of the major blocks and datapath elements in each stage.
3.2.1. Instruction Fetch
The first element is the Instruction memory unit, which stores the instructions of a program and supplies an instruction for a given address. We designed it to hold the instructions in machine language form, corresponding to MIPS assembly language.
The Program counter is a register that holds the address of the current instruction. If a "stall" is needed to resolve a hazard, a control signal from the Hazard detection unit controls this behavior.
The adder in this stage is needed to increment the PC to the address of the next instruction. This adder, which is combinational, is built as a "Carry Lookahead Adder" to support fast addition of 32-bit numbers.
The mux in this stage selects the next PC value between the current PC + 4 and the branch PC value. It is controlled by the PC_src control signal from the Control unit.
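The next-PC selection described above can be summarised in a few lines. This Python sketch is a behavioral illustration, not the Verilog of the thesis; the argument names other than PC_src are hypothetical, and giving the stall signal priority over the branch selection is an assumption.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc, branch_pc, pc_src, stall=False):
    """Behavioral model of the PC register input in the IF stage.

    stall  : signal from the hazard detection unit, holds the PC (assumed priority)
    pc_src : 1 selects the branch PC value, 0 selects PC + 4
    """
    if stall:
        return pc
    if pc_src:
        return branch_pc & MASK32
    return (pc + 4) & MASK32


print(next_pc(252, 0, 0))              # sequential fetch
print(next_pc(284, 252, 1))            # taken branch back to a loop start
print(next_pc(264, 0, 0, stall=True))  # PC held during a stall
```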
3.2.2. Instruction Decode
The processor's 32 general purpose registers are stored in a structure called a Register file. It is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the machine. In addition, we need an ALU to operate on the values read from the registers.
Between the two register outputs and the ALU there is a branch unit. It determines whether the next instruction is the instruction that follows sequentially or the instruction at the branch target address.
A sign extend unit is used to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value.
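As a small illustration of the sign extend unit, the Python sketch below replicates bit 15 of the offset into the upper 16 bits. The function names are hypothetical; the helper `to_signed32` is included only to show the result as a signed number.

```python
def sign_extend16(offset):
    """Sign-extend a 16-bit field to a 32-bit pattern by replicating bit 15."""
    offset &= 0xFFFF
    if offset & 0x8000:
        return offset | 0xFFFF0000
    return offset


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


print(hex(sign_extend16(100)))             # positive offsets are unchanged
print(hex(sign_extend16(0xFFFC)))          # upper half filled with ones
print(to_signed32(sign_extend16(0xFFFC)))  # the same pattern read as a signed value
```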
If a hazard occurs and is detected by the hazard detection unit, the control signals are set to zero. Accordingly, no element can read or write anything; this behavior is a "stall". A stall also requires the PC value to be held in the Program counter, so the hazard detection unit controls this as well. If instructions must be "flushed" on a branch hazard, the PC value in the Program counter is not held, and the instructions are set to zero.
3.2.3. Execution
We can generate the 4-bit ALU control input using a small control unit that takes as inputs the function field of the instruction and a 2-bit control field, which we call ALUop.
ALU32 is a 32-bit Arithmetic Logic Unit, which executes according to the instruction. It includes several arithmetic and logic operations. The instructions supported by this ALU and our MIPS are listed in Table 3.2.
When a hazard occurs, some cases can be resolved by a "stall", but stalls decrease the performance of the MIPS. We can avoid this by replacing the "stall" with "forwarding".
* EX/MEM.register Rd = ID/EX.register Rs
* EX/MEM.register Rd = ID/EX.register Rt
* MEM/WB.register Rd = ID/EX.register Rs
* MEM/WB.register Rd = ID/EX.register Rt
These are the hazard conditions; in these cases the ALU inputs are replaced by the forwarded data. The muxes placed before the ALU select its inputs accordingly.
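The four conditions above can be written as one selection function per ALU input. The Python sketch below is a behavioral illustration, not the thesis Verilog. The register-number comparisons come from the list above; the checks on the write enable and on register 0, and the priority of EX/MEM over MEM/WB, follow the usual textbook forwarding unit and are assumptions here. All names are hypothetical.

```python
def forward_select(id_ex_reg, ex_mem_rd, mem_wb_rd,
                   ex_mem_write=True, mem_wb_write=True):
    """Choose the source of one ALU input (Rs or Rt of the instruction in EXE).

    Returns "EX/MEM", "MEM/WB" or "REG" (the value read from the register file).
    """
    # The most recent result has priority (assumed, as in the textbook design).
    if ex_mem_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"
    if mem_wb_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"
    return "REG"


# Rs = 17 was written by the instruction now in EX/MEM, Rt = 18 by the one in MEM/WB.
print(forward_select(17, ex_mem_rd=17, mem_wb_rd=18))
print(forward_select(18, ex_mem_rd=17, mem_wb_rd=18))
print(forward_select(19, ex_mem_rd=17, mem_wb_rd=18))
```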
3.2.4. Memory
The data memory must be written on store instructions; hence, it has both read and write control signals, an address input, and an input for the data to be written into memory.
3.2.5. Write back
In this stage, the result of the ALU or the data memory value is written to the register file according to the control signals.
Category        Instruction                   Example             Meaning
Arithmetic      add                           add $s1,$s2,$s3     $s1 = $s2+$s3
                subtract                      sub $s1,$s2,$s3     $s1 = $s2-$s3
                add immediate                 addi $s1,$s2,100    $s1 = $s2+100
                sub immediate                 subi $s1,$s2,100    $s1 = $s2-100
                multiply                      mul $s1,$s2,$s3     $s1 = $s2*$s3
Data transfer   load word                     lw $s1,100($s2)     $s1 = Memory[$s2+100]
                store word                    sw $s1,100($s2)     Memory[$s2+100] = $s1
                load immediate                li $s1, 100         $s1 = 100
Logical         and                           and $s1,$s2,$s3     $s1 = $s2&$s3
                or                            or $s1,$s2,$s3      $s1 = $s2|$s3
                nor                           nor $s1,$s2,$s3     $s1 = ~($s2|$s3)
                and immediate                 andi $s1,$s2,100    $s1 = $s2&100
                or immediate                  ori $s1,$s2,100     $s1 = $s2|100
Branch          branch on equal               beq $s1,$s2,L       if($s1==$s2) go to L
                branch on not equal           bne $s1,$s2,L       if($s1!=$s2) go to L
                set on less than              slt $s1,$s2,$s3     if($s2<$s3) $s1 = 1; else $s1 = 0;
                set on less than immediate    slt $s1,$s2,100     if($s2<100) $s1 = 1; else $s1 = 0;
                branch greater than           bgt $s1,$s2,L       if($s1>$s2) go to L
                branch greater than zero      bgtz $s1,L          if($s1>0) go to L
Jump            jump                          j L                 go to L
Table. 3.2. The set of Instructions supported by designed MIPS
3.2.6. Using a HDL to Describe and Modeling a MIPS processor
We designed the MAC and the MIPS in Verilog HDL. We now look at the order of implementation and the methodology of description and modeling.
First, implement pipelining. With nonblocking assignments in an always block triggered on the clock edge, the register values of all stages are transferred concurrently. This is how the pipelining is implemented.
Second, connect the control lines. The design process becomes easier if the data registers are implemented in the previous step and the control lines are then designed separately. The registers are sequential logic, and the logic connecting them is combinational. This methodology has another advantage: because the design is RTL, the location of faulty logic can be found easily during verification.
Third, add the hazard control units. The MIPS processor is completed by connecting the hazard control units, which are the forwarding unit and the stall unit.
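The effect of nonblocking assignment on the pipeline registers can be imitated in software. The Python sketch below is an analogy and not Verilog: every new register value is computed from the old state and all registers are replaced together, which is how nonblocking assignments on a clock edge behave. The register names follow the stage names used above; the instruction labels are hypothetical.

```python
STAGE_REGISTERS = ["IF/ID", "ID/EXE", "EXE/MEM", "MEM/WB"]


def clock_edge(state, fetched):
    """Return the pipeline register contents after one clock edge.

    Every right-hand side reads the OLD state, so all registers appear to
    update at the same instant.
    """
    return {
        "IF/ID": fetched,
        "ID/EXE": state["IF/ID"],
        "EXE/MEM": state["ID/EXE"],
        "MEM/WB": state["EXE/MEM"],
    }


state = dict.fromkeys(STAGE_REGISTERS)  # all registers start empty (None)
for instruction in ["i0", "i1", "i2", "i3"]:
    state = clock_edge(state, instruction)

print(state)  # each instruction has advanced exactly one stage per clock
```

If the registers were instead updated one after another in place, in the order listed, the newly fetched instruction would fall through every register within a single clock, which is what blocking assignments would do.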
3.3. MIPS with MAC
3.3.1 Special instruction MAC
To support the MAC unit in the MIPS processor, the MIPS provides a "MAC" instruction. This MAC instruction is different from the previous "mul" instruction, which simply uses the "*" operator in the ALU. When a MAC instruction is executed, the product of its two operands is added to the result of the preceding MAC instruction. That is, it adds the concept of an accumulator to the multiply operation.
3.3.2 Total diagram of designed MIPS with MAC
This is the total diagram of the MIPS with MAC. It is not a diagram that exists in the references; it shows the MIPS and MAC that we designed ourselves. Multiply and MAC are R-type instructions, so they do not use the Memory stage. Taking advantage of this, we divided the MAC in two and placed the two halves in the Execute stage and the Memory stage of the MIPS. The MAC instruction therefore has instruction fetch, instruction decode, execute, execute, and write back stages. This can increase the performance of the MIPS by reducing the clock period. We substituted an execute stage for the memory stage. This reduces the possibility of additional hazards caused by the new MAC instruction, and it also lets us keep the RISC style.
Figure. 3.12. Total diagram of designed MIPS with MAC
Figure. 4.1. Verification of 2 stage pipelining multiplier with 100,000 samples.
4. Verification and System Analysis
4.1. MAC
4.1.1 Multiplier
To verify our MAC, we used 100,000 random multipliers and multiplicands in Modelsim. The results of our MAC agreed with the results of the signed and unsigned multiplication built into Modelsim with 100% accuracy.
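The same style of random comparison can be sketched in software. The Python model below is an illustration, not the thesis test bench: it recodes the multiplier with radix-4 Booth digits and compares the sum of the partial products with the built-in multiplication. Treating both signed and unsigned 32-bit operands as 34-bit signed values, which yields 17 Booth digits, is an assumption made so that one model covers both modes; the function names, the seed, and the sample count are hypothetical.

```python
import random


def booth_digits(y, bits=34):
    """Radix-4 Booth digits (each in -2..2) of y taken as a `bits`-wide signed value."""
    pattern = y & ((1 << bits) - 1)
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits=34):
    """Sum of the partial products digit * x * 4**k."""
    return sum(d * x * (1 << (2 * k)) for k, d in enumerate(booth_digits(y, bits)))


def verify(samples, seed=0):
    """Compare the Booth model with '*' on random signed and unsigned operands."""
    rng = random.Random(seed)
    for _ in range(samples):
        xs, ys = rng.randrange(-2**31, 2**31), rng.randrange(-2**31, 2**31)
        xu, yu = rng.randrange(0, 2**32), rng.randrange(0, 2**32)
        if booth_multiply(xs, ys) != xs * ys or booth_multiply(xu, yu) != xu * yu:
            return False
    return True


print(len(booth_digits(12345)))  # number of partial products in this model
print(verify(1000))
```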
Figure. 4.2. Synthesis report and vector wave form file of multiplier with EXCALIBUR_ARM
The figure above is a Quartus II vector waveform file with new inputs applied every 60 ns. The result appears after close to 40 ns. For example, the multiplication 3 x 5, which starts at 60 ns, produces the answer 15 near the 100 ns mark.
4.1.2. MAC with 1 multiplier and 1 accumulator
This basic type MAC with 1 multiplier and 1 accumulator is based on groups of 4 MAC instructions. The instructions below show why the accumulation register must be reset every 4 "mac" instructions.
Figure. 4.3. Operation of MAC with multiplier and 1 accumulator
MAC($A0,$B0)
MAC($A1,$B1)
MAC($A2,$B2)
MAC($A3,$B3)
MAC($A4,$B0)
MAC($A5,$B1)
MAC($A6,$B2)
MAC($A7,$B3)
Table. 4.1. "mac" instruction based on [x by 4] matrix multiplication [4 by y]
The output of the MAC would otherwise be accumulated in a special register indefinitely. After the instruction MAC($A4, $B0), the value of the accumulation register would be A0B0 + A1B1 + A2B2 + A3B3 + A4B0. Since the MAC cannot identify what kind of matrix multiplication is being performed, the system just multiplies and accumulates into one special register. We therefore assumed that the MAC instructions in our multiplier and accumulator unit are based on the multiplication of an [x by 4] matrix by a [4 by y] matrix, and we reset the accumulation register every 4 MAC instructions.
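The reset rule can be illustrated with a behavioral model. The Python sketch below is not the thesis hardware; the class name, the counter, and the sample operand values are hypothetical. It clears the accumulation register before the first "mac" of every group of 4, so that the fifth instruction starts a new dot product instead of adding A4B0 to the previous sum.

```python
class BasicMac:
    """Behavioral model of 1 multiplier and 1 accumulator with a periodic reset."""

    def __init__(self, group=4):
        self.group = group  # number of mac instructions per dot product
        self.acc = 0        # accumulation register
        self.count = 0      # mac instructions seen in the current group

    def mac(self, a, b):
        if self.count == self.group:  # a new group starts: reset first
            self.acc = 0
            self.count = 0
        self.acc += a * b
        self.count += 1
        return self.acc


A = [1, 2, 3, 4, 5, 6, 7, 8]  # stand-ins for $A0 .. $A7
B = [1, 2, 3, 4]              # stand-ins for $B0 .. $B3
unit = BasicMac()
results = [unit.mac(A[i], B[i % 4]) for i in range(8)]
print(results[3], results[7])  # the two finished dot products
```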
After synthesis, the 2-stage pipelined MAC achieved a clock period of 24.396 ns. The non-pipelined MAC has a clock period of 34 ns, and the vector waveform file shows that the synthesis succeeded.
Figure. 4.4. Synthesis report of the MAC with Stratix II
Figure 4.4 shows the result of synthesizing the same MAC with Stratix II. The clock period is 9.326 ns.
4.1.3. Advanced type MAC with 8 multipliers and 7 accumulators
Figure. 4.5. Block diagram of 8 parallel MAC with 8 multipliers and 7 accumulators
Figure. 4.6. Verilog HDL verification of 8 parallel MAC with 100,000 samples
The improved MAC requires 8 multipliers and 7 accumulators. The 8 multipliers operate at the same time in the multiply stage, and the 7 accumulators operate in the following accumulate stage.
This MAC can handle 16 operands at the same time. Although its clock period is longer than that of the basic type MAC, a single MAC instruction of the advanced type can replace 8 MAC instructions of the basic type. Its clock period is about 30 ns in the Stratix II synthesis result. This MAC gives approximately a 249% performance improvement, but our MIPS processor cannot handle 16 operands in a single instruction; hence the basic type MAC was connected to the MIPS processor.
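The structure and the quoted improvement can be checked with a short model. The Python sketch below is an illustration under assumptions: the 7 accumulators are arranged as a binary adder tree, which the text does not state, and the 249% figure is read as the ratio of 8 basic clock periods (9.326 ns each on Stratix II) to one advanced clock period (about 30 ns). The function name and the sample operands are hypothetical.

```python
def parallel_mac8(a, b):
    """Multiply 8 operand pairs at once, then sum the products pairwise.

    Returns the accumulated result and the number of additions used.
    """
    products = [x * y for x, y in zip(a, b)]  # 8 multipliers in parallel
    additions = 0
    while len(products) > 1:                  # adder tree: 4 + 2 + 1 additions
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
        additions += len(products)
    return products[0], additions


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
total, additions = parallel_mac8(a, b)
print(total, additions)

# Time for 8 basic mac instructions relative to one advanced mac instruction.
basic_period_ns = 9.326
advanced_period_ns = 30.0
speedup = 8 * basic_period_ns / advanced_period_ns
print(round(speedup * 100))  # percentage under the assumptions above
```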
Figure. 4.7. Synthesis report and timing analyzer summary of the 8 parallel MAC with Stratix II
4.2. MIPS
4.2.1. One instruction mul
Figure. 4.8. Post simulation result of mul instruction
To execute this multiplication, the two data values are loaded as immediates and then the mul instruction is executed. The items in the red box are the stage-dividing registers. For example, the MEMWB register is located between the memory and write back stages, so the waveform must be read with pipelining in mind: data is transferred to the next register on the next clock. This viewpoint also applies to the following waveforms. We can see that the result of this multiplication is correct.
4.2.2. Loop operation
To verify the designed MIPS processor, we wrote some arithmetic algorithms in assembly code. The assembly code is stored in the instruction memory in machine language form. The code below is part of the instruction memory. The complete code implements a matrix algorithm. It is used last, for the verification of the designed MAC on the MIPS processor.
Figure. 4.9. Instructions in the MIPS instruction memory
We wrote a set of MIPS assembly code and placed it in the instruction memory of the MIPS in machine language form. It is loop code for matrix calculation.
248, 252, ... are the PC values, which increase by 4 bytes each. The timing diagram in Figure 4.10 shows the result of this code. It includes data, control, and branch hazards (introduced intentionally for verification), so we can observe the "stall" and "forwarding" behavior that the MIPS uses to resolve the hazards.
Figure. 4.10. Wave form of loop operation in MIPS
- The terms on the vertical axis are registers. We must remember that the MIPS is pipelined, so a value appears in the register of the next stage one clock after it appears in the register of the current stage.
- PC value 252 is the starting point of Loop1 for the matrix calculation, so we can see the PC values repeat: 252, 256, 260, ... , 284, 252
- This part calculates 19 * 11 = 209. The mul instruction needs two EXE stages, so when the PC value is 264 this instruction is visible for two clock cycles.
- The IDEXA and IDEXB registers hold the multiply operands. The operands we want are 19 and 11 but, according to the diagram above, they are 18 and 10 when the PC value is 264. The reason is that the previous results have not yet been written to the IDEXA and IDEXB registers. After one clock, the value of IDEXB changes correctly to 11 because the forwarding unit is activated. One clock later, the value of IDEXA is also forwarded by the forwarding unit and changes to 19. Finally, we can see that the value in the MEMWBvalue3 register is 209.
- PC value 280 is the last instruction in Loop1, so after this clock the next instruction, which has PC value 284, is "flushed", and we can see that the register values of this instruction are set to zero. The next executed instruction is again the one at 252.
4.3. MIPS with MAC
This section presents the verification results showing whether our MIPS and MAC behave correctly.
Figure. 4.11. Matrix for verification
We calculated this matrix on the MIPS with MAC using two methods. The first uses the MAC only as a Booth multiplier; the second adds the concept of the accumulator to the first.
4.3.1. Use MAC as Booth multiplier
Result of synthesis and post simulation: 8,300 of 38,400 logic elements and a 42 ns clock period.
Figure. 4.12. Synthesis result when using the MAC as a Booth multiplier
Figure. 4.13. Result waveform when using the MAC as a Booth multiplier
In the waveform, the end time is 15078 ns, when the last result, 782, is output. The result is output from the MEMWBvalue2 register. Registers 1, 2, and 3 store the data from memory, the ALU, and the MAC, respectively. When the MAC is used only as a Booth multiplier, an add instruction is needed to sum the MAC results in the matrix calculation, so the value written back comes from the ALU.
4.3.2. Use MAC as Multiplier and Accumulator
Result of synthesis and post simulation: 9,025 of 38,400 logic elements and a 56 ns clock period.
Figure. 4.14. Synthesis result when using the MAC as Multiplier and Accumulator
Figure. 4.15. Result waveform when using the MAC as Multiplier and Accumulator
This is the last part of the result waveform. The end time is 18312 ns. We can see that the last data value is output from the MEMWBvalue3 register. In this case no separate add instruction is needed, so the result of the MAC is written back directly.
4.3.3. Compare the two methods of Matrix Calculating
Table. 4.2. Compare the two methods of Matrix Calculating
We compare the two methods of matrix calculation. Using the MAC as multiplier and accumulator needs more time and more logic elements, but fewer clock cycles. We performed a relatively simple matrix calculation, so the number of loop repetitions was small. This gives only a small difference in the number of clock cycles, but the method can be more efficient for complex matrix calculations. It is a way of keeping the meaning of both the MIPS and the MAC. We can also regard it as a design approach for an efficient DSP machine.
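The trade-off can be made concrete from the figures reported in 4.3.1 and 4.3.2. The Python sketch below derives clock cycle counts from the end times and clock periods given there; it assumes that each simulation starts at time 0 and that the end time marks a whole number of clock periods, which the text does not state. The function name is hypothetical.

```python
def clock_cycles(end_time_ns, period_ns):
    """Return (whole cycles, leftover ns) for a run ending at end_time_ns."""
    return divmod(end_time_ns, period_ns)


booth_cycles, booth_rest = clock_cycles(15078, 42)  # MAC used as Booth multiplier
mac_cycles, mac_rest = clock_cycles(18312, 56)      # MAC used as multiplier and accumulator

print(booth_cycles, booth_rest)
print(mac_cycles, mac_rest)
print(booth_cycles - mac_cycles)  # cycles saved by the accumulator
print(18312 - 15078)              # extra time in ns caused by the longer clock period
```

Both end times divide evenly by their clock periods, which supports the assumption. On this reading the accumulator saves cycles, but the 56 ns clock period outweighs the saving for this small matrix.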
5. Conclusion
5.1. What we know
First and foremost, we studied the background of the multiplier, the accumulator, computer architecture, and each design method. We then designed each system and developed a set of MIPS assembly code to operate our processor. In this way we were able to learn the details of the system, the algorithm, the hardware, and finally the software design.
5.2. A fall of the system performance after connecting the MAC
to MIPS processor
The clock periods of the systems discussed in this thesis, namely the basic type MAC, the advanced type MAC, and the MIPS processor that supports the basic type MAC, are listed below:
Type of system                          Synthesis result (Quartus II, EXCALIBUR_ARM)
Booth multiplier                        21 ns
Basic type MAC                          24 ns
Advanced type MAC                       30 ns
Booth multiplier on MIPS processor      42 ns
Basic type MAC on MIPS processor        56 ns
Table. 5.1. Clock period of each system
The Booth multiplier and the MAC have a small clock period before they are connected to the MIPS processor. After connection the performance falls, and the synthesis tool, on which the system performance depends, plays a major part in this fall. When the MIPS processor containing the multiplier or the MAC was synthesized, it produced more logic cells and a larger area than the multiplier or the MAC alone. This caused the poor performance in the MIPS synthesis. The other reason for the poor performance was the insufficient division of the MAC stage into clock cycles. Since our MIPS processor has a 5-stage pipeline, it is easy to implement a "mac" instruction in 5 clocks. If a "mac" instruction took more than 5 clocks, stall techniques and hazard control would be needed. Because of this complication in the implementation of the MIPS processor, we applied only 2-stage pipelining.
5.3. Solution
We could not connect the advanced type MAC with 8 multipliers and 7 accumulators to the MIPS processor, since our MIPS processor can handle just 2 operands in one instruction. The advanced type MAC needs the information of 16 operands in one instruction, and one may expect that an indirect method can solve this problem. In the first clock of a "mac" instruction, a data memory of 8 entries, each holding 2 operands (Hi, Low), is accessed and returns 8 index numbers. In the next clock of the "mac" instruction, the MAC accesses the 16 indexed operand fields and multiplies and accumulates them.
6. Appendix
[Panels: ascending sort of 100 samples; ascending sort of 1,000 samples; ascending sort of 10,000 samples; ascending sort of 100,000 samples; frequency of random numbers with an interval of 100,000]
Figure. 6.1. Distribution of random numbers in Verilog HDL
Because I did not know whether the random numbers generated in Verilog HDL for the verification were uniform, as they need to be, I used a MATLAB simulation to check this. MATLAB could not support 4294967295 data rows, so I used simple sorting instead. If the sorted line is linear, the random numbers generated in Verilog HDL are uniform. I used 100,000 samples of 2 random inputs to verify the multiplier and the MAC. The last graph shows the frequency of the random numbers in intervals of 100,000; it confirms that the random numbers generated in Verilog HDL are uniform and that my verification results are trustworthy.
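% Sorting analysis of random numbers:
% a uniform sample gives a straight line when sorted and plotted.
clear;
clc;
data = load('1.txt');           % random numbers to be analyzed
a = data(1:100);
a_sort = sort(a, 'ascend');
b = data(1:1000);
b_sort = sort(b, 'ascend');
c = data(1:10000);
c_sort = sort(c, 'ascend');
data_sort = sort(data, 'ascend');
% Range of random numbers: count the samples in each interval of width 100,000
clear;
clc;
%X_axis = 0 : 2^32 - 1;
data = load('1.txt');
freq = zeros(1, 42950);         % one bin per interval, preallocated
for i = 0 : 42949
    % half-open interval [100000*i, 100000*(i+1)) so that boundary values are counted
    tmp = ((100000 * i <= data) & (data < 100000 * i + 100000));
    freq(i+1) = sum(tmp(:));    % works for row and column vectors
end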
Table. 6.1. MATLAB source for analysis of random numbers
References
[1] M. Amde, I. Blunno, and C. P. Sotiriou. Automating the design of an asynchronous DLX microprocessor. In Proceedings of the 40th Design Automation Conference (DAC), ACM, pages 502-507, 2003.
[2] Israel Koren. Computer Arithmetic Algorithms.
[3] Michael Golden, Trevor Mudge. A Comparison of Two Pipeline Organizations. Electrical Engineering and Computer Science Department, University of Michigan.
[4] Hiroaki Murakami, Naoka Yano, Yukip Ootaguro, Yukio Sugeno, Maki Ueno. A multiplier-accumulator macro for a 45 MIPS embedded RISC processor. IEEE Journal of Solid-State Circuits, vol. 31, no. 7, July 1996.
[5] Laercio Caldeira, Tales Cleber Pimenta, and Evandro D. C. Cotrim. An 9-bit Parallel Pipelined Multiplier Based On 3-bit Recoding From Booth's Algorithm.
[6] Yong Surk Lee. A 4 Clock Cycle 64 x 64 Multiplier with 60Hz clock.
[7] Jonghwan Park (박종환). Design of 17-bit x 17-bit multiplication for a 32-bit RISC/DSP processor (in Korean).
[8] David A. Patterson, John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann.
[9] Samir Palnitkar. Verilog HDL. Prentice Hall.
[10] MIPS internet URL: http://www.mips.com/
[11] Internet URLs about the DLX processor:
https://www.cs.tcd.ie/Jeremy.Jones/vivio/dlx/dlxtutorial.htm
http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/DLX/aboutDLX.html#Basic
Abstract (in Korean)
Key words: Booth's algorithm, Wallace tree, adder, MIPS, processor, RISC type