The Processor:
Datapath and Control
Outline
Goals in processor implementation
Brief review of sequential logic design
Pieces of the processor implementation puzzle
A simple implementation of a MIPS integer instruction subset
  Datapath
  Control logic design
A multi-cycle MIPS implementation
  Datapath
  Control logic design
Microcoded control
Exceptions
Datapath and control in some real microprocessors
Goals in processor implementation
Balance the rate of supply of instructions and data and the rate at which the execution core can consume them and can update memory
[Figure: instruction supply, execution core, data supply]
Goals in processor implementation
Recall from Chapter 2
  CPU Time = INST x CPI x CT (instruction count x cycles per instruction x clock cycle time)
INST largely a function of the ISA and compiler
Objective: minimize CPI x CT within design constraints (cost, power, etc.)
Trading off CPI and CT is tricky
[Figure: multiplier and logic blocks]
Brief review of sequential logic design
State elements are clocked devices
  Flip-flops, etc.
Combinatorial elements hold no state
  ALU, caches, multiplier, multiplexers, etc.
In edge triggered clocking, state elements are only updated on the (rising) edge of the clock pulse
Brief review of sequential logic design
The same state element can be read at the beginning of a clock cycle and updated at the end
Example: incrementing the PC
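[Figure: the PC register feeds an adder (Add) whose other input is the constant 4. Timing diagram: while the PC register holds 8, the Add input is 8 and the Add output is 12; at the next rising clock edge the PC register is updated to 12.]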
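The read-then-update behavior in this example can be sketched in a few lines of Python. This is only an illustrative model, assuming a hypothetical `Register` class with a `q` output and a `d` input. The adder settles combinationally during the cycle, and the register changes only when `rising_edge()` is called.

```python
class Register:
    """Hypothetical edge-triggered state element: q changes only on a clock edge."""

    def __init__(self, value=0):
        self.q = value  # current output, readable during the whole cycle
        self.d = value  # input that will be captured at the next rising edge

    def rising_edge(self):
        self.q = self.d


def adder(a, b):
    """Combinational 32-bit adder (holds no state)."""
    return (a + b) & 0xFFFFFFFF


pc = Register(8)
trace = []
for _ in range(3):
    pc.d = adder(pc.q, 4)       # Add output settles while PC still holds the old value
    trace.append((pc.q, pc.d))  # (PC register, Add output) during this cycle
    pc.rising_edge()            # PC is updated only here

print(trace)  # [(8, 12), (12, 16), (16, 20)]
```

The first tuple matches the slide's example: the PC register reads 8 during the cycle while the adder already presents 12.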
Our processor design progression
(1) Instruction fetch, execute, and operand reads from data memory all take place in a single clock cycle
(2) Instruction fetch, execute, and operand reads from data memory take place in successive clock cycles
(3) A pipelined design
Pieces of the processor puzzle
Instruction fetch
Execution
Data memory
[Figure: instruction supply, execution core, data supply]
Instruction fetch datapath
Memory to hold instructions
Register to hold the instruction memory address
Logic to generate the next instruction address
[Figure: instruction fetch datapath; the next instruction address is PC + 4]
Execution datapath
Focus on only a subset of all MIPS instructions
  add, sub, and, or
  lw, sw
  slt
  beq, j
For all instructions except j, we
  Read operands from the register file
  Perform an ALU operation
For all instructions except sw, beq, and j, we write a result into the register file
Execution datapath
Register file block diagram
  Read register 1,2: source operand register numbers
  Read data 1,2: source operands (32 bits each)
  Write register: destination operand register number
  Write data: data written into register file
  RegWrite: when asserted, enables the writing of Write data
Execution datapath
Datapath for R-type (add, sub, and, or, slt)
R-type instruction format:
  Field   op      rs      rt      rd      shamt   funct
  Bits    31-26   25-21   20-16   15-11   10-6    5-0
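As a small illustration of the R-type layout, the sketch below extracts the six fields with shifts and masks. The function name `decode_r_type` is hypothetical. The sample word encodes `add $4, $18, $30`, the R-type example used elsewhere in these slides, with funct 100000 for add.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $4, $18, $30: op=0, rs=18, rt=30, rd=4, shamt=0, funct=100000 (binary)
word = (0 << 26) | (18 << 21) | (30 << 16) | (4 << 11) | (0 << 6) | 0b100000
fields = decode_r_type(word)
print(hex(word), fields)
```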
Execution datapath
Datapath for beq instruction
I-type instruction format:
  Field   op      rs      rt      immediate
  Bits    31-26   25-21   20-16   15-0
The Zero output of the ALU indicates whether rs = rt (branch taken or not taken)
The branch target address is the sign-extended immediate, shifted left two positions and added to PC+4
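The branch target rule can be written out as a short sketch. The helper names (`sign_extend16`, `branch_target`, `next_pc_beq`) are hypothetical, and the subtraction stands in for the ALU whose Zero output decides the branch.

```python
MASK = 0xFFFFFFFF


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Sign-extended immediate, shifted left two positions, added to PC+4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK


def next_pc_beq(pc, rs_val, rt_val, imm16):
    """Select the branch target when the ALU subtract result is zero, else PC+4."""
    zero = ((rs_val - rt_val) & MASK) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK


print(hex(next_pc_beq(0x1000, 5, 5, 3)))  # taken
print(hex(next_pc_beq(0x1000, 5, 6, 3)))  # not taken
```

An immediate of 0xFFFF (that is, -1) gives a target equal to the address of the branch itself, which shows why the offset is relative to PC+4 rather than PC.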
Data memory
Used for lw, sw (I-type format)
Block diagram
  Address: memory location to be read or written
  Read data: data out of the memory on a load
  Write data: data into the memory on a store
  MemRead: indicates a read operation is to be performed
  MemWrite: indicates a write operation is to be performed
Execution datapath + data memory
Datapath for lw, sw
Address is the sign-extended immediate added to the source operand read out of the register file
sw: data written to memory from specified register
lw: data written to register file from specified memory address
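A minimal sketch of the lw/sw behavior, assuming a hypothetical sparse memory (a Python dict keyed by byte address) and a plain list for the register file. The register contents are made up for illustration. The two calls mirror the slide examples `sw $10, 0($3)` and `lw $8, 112($2)`.

```python
MASK = 0xFFFFFFFF


def sign_extend16(imm):
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, imm16):
    """Sign-extended immediate added to the base register value."""
    return (base + sign_extend16(imm16)) & MASK


regs = [0] * 32  # register file (illustrative contents only)
mem = {}         # hypothetical sparse data memory: byte address -> word


def sw(rt, imm16, rs):
    """Store: memory[address] = register rt."""
    mem[effective_address(regs[rs], imm16)] = regs[rt]


def lw(rt, imm16, rs):
    """Load: register rt = memory[address]."""
    regs[rt] = mem.get(effective_address(regs[rs], imm16), 0)


regs[3] = 0x2000
regs[10] = 1234
sw(10, 0, 3)            # sw $10, 0($3)

regs[2] = 0x2000 - 112  # chosen so that 112($2) names the same word
lw(8, 112, 2)           # lw $8, 112($2)
print(regs[8])          # 1234
```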
Putting the pieces together
Single clock cycle for fetch, execute, and operand read from data memory
3 MUXes
  Register file operand or sign-extended immediate to ALU
  ALU or data memory output written to register file
  PC+4 or branch target address written to PC register
Datapath for R-type instructions
Example: add $4, $18, $30
Datapath for I-type ALU instructions
Example: slti $7, $4, 100
Datapath for not taken beq instruction
Example: beq $28, $13, EXIT
Datapath for taken beq instruction
Example: beq $28, $13, EXIT
Datapath for load instruction
Example: lw $8, 112($2)
Datapath for store instruction
Example: sw $10, 0($3)
Control signals we need to generate
ALU operation control
ALU control input codes from Chapter 4
Two steps to generate the ALU control input
  Use the opcode to distinguish R-type, lw and sw, and beq
  If R-type, use funct field to determine the ALU control input
ALU control input   ALU operation      Used for
000                 and                and
001                 or                 or
010                 add                add, lw, sw
110                 subtract           sub, beq
111                 set on less than   slt
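The table of ALU control codes can be modeled behaviorally as follows. This is only a sketch: the function name `alu` and the choice to return a `(result, zero)` pair are assumptions, and set on less than is treated as a signed comparison.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the five ALU control codes in the table."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = a + b
    elif control == 0b110:
        result = a - b
    elif control == 0b111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError(f"unsupported ALU control code: {control:03b}")
    result &= MASK
    return result, result == 0


print(alu(0b110, 7, 7))  # subtract: the zero flag is what beq tests
```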
ALU operation control
Opcode used to generate a 2-bit signal called ALUOp with the following encodings
  00: lw or sw, perform an ALU add
  01: beq, perform an ALU subtract
  10: R-type, ALU operation is determined by the funct field

Funct    Instruction   ALU control input
100000   add           010
100010   sub           110
100100   and           000
100101   or            001
101010   slt           111
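The two-step decoding (ALUOp first, then funct for R-type) can be sketched as a lookup. The name `alu_control` is hypothetical; the encodings are the ones listed in the tables above.

```python
# funct field -> ALU control input, for the R-type subset
FUNCT_TO_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp (plus funct for R-type) to the 3-bit ALU control input."""
    if alu_op == 0b00:   # lw or sw: add
        return 0b010
    if alu_op == 0b01:   # beq: subtract
        return 0b110
    if alu_op == 0b10:   # R-type: decided by funct
        return FUNCT_TO_CONTROL[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")


print(format(alu_control(0b10, 0b101010), "03b"))  # 111
```

For lw, sw, and beq the funct argument is ignored, which matches the idea that the main control unit only has to look at the opcode.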
Comparing instruction fields
Opcode, source registers, function code, and immediate fields always in same place
Destination register is
  bits 15-11 (rd) for R-type
  bits 20-16 (rt) for lw
  MUX to select the right one

Format    31-26     25-21   20-16   15-11   10-6    5-0
R-type    0         rs      rt      rd      shamt   funct
beq       4         rs      rt      immediate (offset), bits 15-0
lw (sw)   35 (43)   rs      rt      immediate (offset), bits 15-0
Datapath with instr fields and ALU control
Main control unit design
Main control unit design
Truth table
[Figure: main control truth table, with entries labeled by opcode: R-type (0), lw (35), sw (43), beq (4)]
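Since the truth table itself did not survive in this transcript, the sketch below reconstructs one plausible version from the datapath description: which MUX inputs each instruction needs, whether it writes the register file, and whether it reads or writes data memory. RegWrite, MemRead, MemWrite, and ALUOp are named in the slides; the names RegDst, ALUSrc, MemtoReg, and Branch are assumptions borrowed from the usual textbook single-cycle design. `None` marks a don't-care.

```python
# opcode -> control signal values (None = don't care)
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),      # R-type
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),      # lw
    43: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),      # sw
    4:  dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),      # beq
}


def main_control(opcode):
    """Look up the control signals for one of the four supported opcodes."""
    if opcode not in CONTROL:
        raise ValueError(f"unsupported opcode: {opcode}")
    return CONTROL[opcode]


print(main_control(35))
```

In this sketch RegDst=1 selects rd (bits 15-11) and RegDst=0 selects rt (bits 20-16) as the destination register.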
Adding support for jump instructions
J-type format
  Field   op (= 2)   target
  Bits    31-26      25-0
Next PC formed by shifting the 26-bit target left two bits and combining it with the 4 high-order bits of PC+4
Now the next PC will be one of
  PC+4
  beq target address
  j target address
We need another MUX and control bit
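The jump address formation can be sketched as a single expression. The function name `jump_target` is hypothetical.

```python
def jump_target(pc, target26):
    """Concatenate the 4 high-order bits of PC+4 with the 26-bit target shifted left 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus_4 & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


print(hex(jump_target(0x40000008, 0x100)))  # 0x40000400
```

Because only 28 bits come from the instruction, a jump in this scheme stays within the 256 MB region selected by the upper 4 bits of PC+4.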
Adding support for jump instructions
Evaluation of the simple implementation
All instructions take one clock cycle (CPI = 1)
Assume the following worst case delays
  Instruction memory: 4 time units
  Data memory: 4 time units (read), 2 time units (write)
  ALU: 4 time units
  Adders: 3 time units
  Register file: 2 time units (read), 1 time unit (write)
  MUXes, sign extension, gates, and shifters: 1 time unit
Large disparity in worst case delays among instruction types
  R-type: 4+2+1+4+1+1 = 13 time units
  beq: 4+2+1+4+1+1+1 = 14 time units
  j: 4+1+1 = 6 time units
  store: 4+2+4+2 = 12 time units
  load: 4+2+4+4+1+1 = 16 time units
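The per-instruction totals and the resulting single-cycle clock period can be checked with a few lines of arithmetic. The delay terms are copied from the sums above; the slides do not say which component each term belongs to, so the sketch keeps them as plain numbers.

```python
# Delay terms along the critical path of each instruction type, in time units
PATHS = {
    "R-type": [4, 2, 1, 4, 1, 1],
    "beq":    [4, 2, 1, 4, 1, 1, 1],
    "j":      [4, 1, 1],
    "store":  [4, 2, 4, 2],
    "load":   [4, 2, 4, 4, 1, 1],
}

totals = {name: sum(terms) for name, terms in PATHS.items()}

# In a single-cycle design every instruction gets the same clock period,
# so the slowest instruction sets the cycle time.
cycle_time = max(totals.values())
slowest = max(totals, key=totals.get)

print(totals)
print(slowest, cycle_time)  # load 16
```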
Evaluation of the simple implementation
Disparity would be worse in a real machine
  Even slower integer instructions (e.g., multiply/divide in MIPS)
  Floating point instructions
Simple instructions take as long as complex ones
A multicycle implementation
Instruction fetch, register file access, etc occur in separate clock cycles
Different instruction types take different numbers of cycles to complete
Clock cycle time should be shorter
High level view of datapath
New registers store results of each step
  Not programmer visible!
Hardware can be shared
  One ALU for PC+4, branch target calculation, EA calculation, and arithmetic operations
  One memory for instructions and data
Detailed multi-cycle datapath
Multi-cycle control
First two cycles for all instructions
Instruction fetch (1st cycle)
  Load the instruction into the IR register
    IR = Memory[PC]
  Increment the PC
    PC = PC+4
Instruction decode and register fetch (2nd cycle)
  Read register file locations rs and rt, results into the A and B registers
    A = Reg[IR[25-21]]
    B = Reg[IR[20-16]]
  Calculate the branch target address and load into ALUOut
    ALUOut = PC + (sign-extend (IR[15-0]) << 2)
Instruction fetch
IR=Mem[PC]
Instruction fetch
PC=PC+4
Instruction decode and register fetch
A=Reg[IR[25-21]], B=Reg[IR[20-16]]
Instruction decode and register fetch
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Additional cycles for R-type
Execution
  ALUOut = A op B
Completion
  Reg[IR[15-11]] = ALUOut
R-type execution cycle
ALUOut = A op B
R-type completion cycle
Reg[IR[15-11]] = ALUOut
Additional cycles for store
Address computation
  ALUOut = A + sign-extend (IR[15-0])
Memory access
  Memory[ALUOut] = B
Store address computation cycle
ALUOut = A + sign-extend (IR[15-0])
Store memory access cycle
Memory[ALUOut] = B
Additional cycles for load
Address computation
  ALUOut = A + sign-extend (IR[15-0])
Memory access
  MDR = Memory[ALUOut]
Read completion
  Reg[IR[20-16]] = MDR
Load memory access cycle
MDR = Memory[ALUOut]
Load read completion cycle
Reg[IR[20-16]] = MDR
Additional cycle for beq
Branch completion
  if (A == B) PC = ALUOut
Branch completion cycle for beq
if (A == B) PC = ALUOut
Additional cycle for j
Jump completion
  PC = PC[31-28] || (IR[25-0]<<2)
Jump completion cycle for j
PC = PC[31-28] || (IR[25-0]<<2)
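The register transfers listed for each cycle can be strung together into a small behavioral model. This is a rough sketch, not the actual datapath: the class `MultiCycleCPU`, the helper encoders, the sample program, and the dict-based unified memory are all hypothetical, and hardwiring of register $0 is not modeled. Each call to `step()` runs one instruction and returns the number of cycles it would take.

```python
MASK = 0xFFFFFFFF

def sext16(x):
    """Sign-extend the low 16 bits of x."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def signed32(x):
    return x - (1 << 32) if x & 0x80000000 else x

# funct code -> operation for the R-type subset
R_OPS = {
    0b100000: lambda a, b: a + b,
    0b100010: lambda a, b: a - b,
    0b100100: lambda a, b: a & b,
    0b100101: lambda a, b: a | b,
    0b101010: lambda a, b: int(signed32(a) < signed32(b)),
}

class MultiCycleCPU:
    def __init__(self, memory, registers=None):
        self.mem = dict(memory)  # one memory for instructions and data
        self.reg = list(registers) if registers else [0] * 32
        self.pc = 0
        self.cycles = 0

    def step(self):
        ir = self.mem[self.pc]                          # cycle 1: IR = Memory[PC]
        self.pc = (self.pc + 4) & MASK                  # cycle 1: PC = PC+4
        a = self.reg[(ir >> 21) & 31]                   # cycle 2: A = Reg[IR[25-21]]
        b = self.reg[(ir >> 16) & 31]                   # cycle 2: B = Reg[IR[20-16]]
        alu_out = (self.pc + (sext16(ir) << 2)) & MASK  # cycle 2: branch target
        op = (ir >> 26) & 0x3F
        if op == 0:                                     # R-type
            alu_out = R_OPS[ir & 0x3F](a, b) & MASK     # cycle 3: ALUOut = A op B
            self.reg[(ir >> 11) & 31] = alu_out         # cycle 4: Reg[IR[15-11]] = ALUOut
            used = 4
        elif op in (35, 43):                            # lw, sw
            alu_out = (a + sext16(ir)) & MASK           # cycle 3: address computation
            if op == 43:
                self.mem[alu_out] = b                   # cycle 4: Memory[ALUOut] = B
                used = 4
            else:
                mdr = self.mem[alu_out]                 # cycle 4: MDR = Memory[ALUOut]
                self.reg[(ir >> 16) & 31] = mdr         # cycle 5: Reg[IR[20-16]] = MDR
                used = 5
        elif op == 4:                                   # beq
            if a == b:
                self.pc = alu_out                       # cycle 3: PC = ALUOut
            used = 3
        elif op == 2:                                   # j
            self.pc = (self.pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
            used = 3
        else:
            raise ValueError(f"unsupported opcode: {op}")
        self.cycles += used
        return used

def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct

program = {
    0:  i_type(35, 2, 8, 112),        # lw  $8, 112($2)
    4:  r_type(8, 8, 4, 0b100000),    # add $4, $8, $8
    8:  i_type(43, 3, 4, 0),          # sw  $4, 0($3)
    12: i_type(4, 4, 4, 1),           # beq $4, $4, skip one instruction
    20: 2 << 26,                      # j   0
    0x170: 21,                        # data word read by the lw
}
regs = [0] * 32
regs[2], regs[3] = 0x100, 0x200
cpu = MultiCycleCPU(program, regs)
per_instruction = [cpu.step() for _ in range(5)]
print(per_instruction, cpu.cycles)  # [5, 4, 4, 3, 3] 19
```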
Control logic design
Implemented as a Finite State Machine
Inputs: 6 opcode bits
Outputs: 16 control signals
State: 4 bits for 10 states
High-level view of FSM
Instruction fetch cycle
Instruction decode/register fetch cycle
R-type execution cycle
R-type completion cycle
Memory address computation cycle
Store memory access cycle
Load memory access cycle
Load read completion cycle
beq branch completion cycle
j jump completion cycle
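The ten states and the transitions between them can be sketched as a next-state function. The state names are hypothetical labels for the cycles listed above, and the outputs (the 16 control signals) are left out; only the state sequencing is modeled.

```python
# opcode -> state entered after instruction decode
AFTER_DECODE = {
    0:  "rtype_execute",
    35: "mem_address",
    43: "mem_address",
    4:  "branch_completion",
    2:  "jump_completion",
}


def next_state(state, opcode):
    """Next-state logic of the control FSM; the opcode is the only input."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return AFTER_DECODE[opcode]
    if state == "rtype_execute":
        return "rtype_completion"
    if state == "mem_address":
        return "load_access" if opcode == 35 else "store_access"
    if state == "load_access":
        return "load_completion"
    # rtype_completion, store_access, load_completion,
    # branch_completion, and jump_completion all return to fetch
    return "fetch"


def cycles_for(opcode):
    """Count the states visited from fetch until the FSM returns to fetch."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, opcode)
        if state == "fetch":
            return count


print({op: cycles_for(op) for op in (0, 35, 43, 4, 2)})
```

The cycle counts that fall out of the state sequencing are 4 for R-type, 5 for lw, 4 for sw, and 3 each for beq and j.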
Complete FSM
Evaluation of the multi-cycle design
CPI calculated based on the instruction mix
For gcc (Figure 4.54)
  23% loads (5 cycles each)
  13% stores (4 cycles each)
  19% branches (3 cycles each)
  2% jumps (3 cycles each)
  43% ALU (4 cycles each)
CPI = 0.23*5 + 0.13*4 + 0.19*3 + 0.02*3 + 0.43*4 = 4.02
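The CPI figure is a weighted average of cycles per instruction class, which is easy to verify. The dictionary layout is just one way to organize the numbers from the gcc mix above.

```python
# instruction class -> (fraction of instructions, cycles each)
MIX = {
    "loads":    (0.23, 5),
    "stores":   (0.13, 4),
    "branches": (0.19, 3),
    "jumps":    (0.02, 3),
    "ALU":      (0.43, 4),
}

total_fraction = sum(fraction for fraction, _ in MIX.values())
cpi = sum(fraction * cycles for fraction, cycles in MIX.values())

print(round(total_fraction, 2), round(cpi, 2))  # 1.0 4.02
```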
Cycle time is calculated from the longest delay path assuming the same timing delays as before
Worst case datapath: branch target
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Delay = 7 time units (delay of simple = 16)
Evaluation of the multi-cycle design
Time per instruction of simple and multi-cycle
  TPI(simple) = CPI(simple) x cycle time(simple) = 1 x 16 = 16
  TPI(multi-cycle) = 4.02 x 7 = 28.1
Simple single-cycle implementation is faster
Multicycle with pipelining will be considerably faster than single-cycle implementation
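The comparison can be reproduced with the numbers from the two evaluations. The last line of this sketch goes one step beyond the slides and computes the cycle time below which the non-pipelined multi-cycle design would break even; that is an illustrative extrapolation, assuming the CPI of 4.02 stays fixed.

```python
cpi_simple, cycle_simple = 1, 16
cpi_multi, cycle_multi = 4.02, 7

tpi_simple = cpi_simple * cycle_simple  # time per instruction, single-cycle
tpi_multi = cpi_multi * cycle_multi     # time per instruction, multi-cycle

slowdown = tpi_multi / tpi_simple
break_even_cycle = tpi_simple / cpi_multi

print(tpi_simple, round(tpi_multi, 2))  # 16 28.14
print(round(slowdown, 2), round(break_even_cycle, 2))
```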
Exceptions
An exception is an event that causes a deviation from the normal execution of instructions
Types of exceptions
  Operating system call (e.g., read a file, print a file)
  Input/output device request
  Page fault (request for instruction/data not in memory – Ch 7)
  Arithmetic error (overflow, underflow, etc.)
  Undefined instruction
  Misaligned memory access (e.g., word access to odd address)
  Memory protection violation
  Hardware error
  Power failure
An exception is not usually due to an error!
We need to be able to restart the program at the point where the exception was detected
Handling exceptions
Detect the exception
Save enough information about the exception to handle it properly
Save enough information about the program to resume it after the exception is handled
Handle the exception
Either terminate the program or resume executing it depending on the exception type
Detecting exceptions
Performed by hardware
Overflow: determined from the opcode and the overflow output of the ALU
Undefined instruction: determined from
  The opcode in the main control unit
  The function code and ALUOp in the ALU control logic
Detecting exceptions
[Figure: detection of the overflow and undefined instruction exceptions]
Saving exception information
Performed by hardware
We need the type of exception and the PC of the instruction when the exception occurred
In MIPS, the Cause register holds the exception type
  Need an encoding for each exception type
  Need a signal from the control unit to load it into the Cause register
and the Exception Program Counter (EPC) register holds the PC
  Need to subtract 4 from the PC register to get the correct PC (since we loaded PC+4 into the PC register during the Instruction Fetch cycle)
  Need a signal from the control unit to load it into EPC
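What the hardware does when an exception is detected can be sketched as three register writes. The cause encodings and the handler address below are placeholders chosen for illustration, not values given in these slides, and the dict stands in for the processor's registers.

```python
CAUSE_UNDEFINED_INSTRUCTION = 0  # hypothetical encoding
CAUSE_OVERFLOW = 1               # hypothetical encoding
HANDLER_ADDRESS = 0xC0000000     # hypothetical exception handler address


def take_exception(state, cause):
    """Record the exception type and restart point, then redirect the PC."""
    state["Cause"] = cause
    # PC already holds PC+4 from the Instruction Fetch cycle, so subtract 4
    state["EPC"] = (state["PC"] - 4) & 0xFFFFFFFF
    state["PC"] = HANDLER_ADDRESS
    return state


# An instruction at 0x100 overflows; PC was incremented to 0x104 during fetch
state = {"PC": 0x104, "EPC": 0, "Cause": 0}
take_exception(state, CAUSE_OVERFLOW)
print({name: hex(value) for name, value in state.items()})
```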
Saving exception information
Saving program information
Needed in order to restart the program from the point where the exception occurred
Performed by hardware and software
EPC register holds the PC of the instruction that had the exception (where we will restart the program)
The software routine that handles the exception saves any registers that it will need to the stack and restores them when it is done
Handling the exception
Performed by hardware and software
Need to transfer control to a software routine to handle the exception (exception handler)
The exception handler runs in a privileged mode that allows it to use special instructions and access all of memory
  Our programs run in user mode
The hardware enables the privileged mode, loads PC with the address of the exception handler, and transitions to the Fetch state
Handling the exception
Loading the PC with the exception handler address
Exception handler
Stores the values of the registers that it will need to the stack
Handles the particular exception
  Operating system call: calls the subroutine associated with the call
  Underflow: sets register to zero or uses denormalized numbers
  I/O: handles the particular I/O request, e.g., keyboard input
Restores registers from the stack (if program is to be restarted)
Terminates the program, or resumes execution by loading the PC with EPC and transitioning to the Instruction Fetch state
FSM modifications
The Intel Pentium processor
Introduced in 1993
Uses a multi-cycle datapath with the following steps for integer instructions
  Prefetch (PF): read instruction from the instruction memory
  Decode 1 (D1): first stage of instruction decode
  Decode 2 (D2): second stage of instruction decode
  Execute (E): perform the ALU operation
  Write back (WB): write the result to the register file
Datapath usage varies by instruction type
  Simple instructions make one pass through the datapath using state machine control
  Complex instructions make multiple passes, reusing the same hardware elements under microcode control
The Intel Pentium processor
The Pentium is a 2-way superscalar design as two instructions can simultaneously execute
Ideal CPI for a 2-way superscalar is 0.5
Conditions for superscalar execution
  Both must be simple instructions
  The result of the first instruction cannot be needed by the second
  Both instructions cannot write the same register
  The first instruction in program sequence cannot be a jump
[Figure: PF and D1 stages feeding two pipes, the U pipe and the V pipe, each with D2, E, and WB stages]
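The four pairing conditions can be expressed as a simple predicate. This is a sketch of the rules as stated on the slide, not of the Pentium's actual issue logic; the `Instr` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """Hypothetical summary of an instruction, just enough for the pairing check."""
    simple: bool
    is_jump: bool = False
    dest: Optional[str] = None
    sources: Tuple[str, ...] = ()


def can_pair(first, second):
    """True if the two instructions may execute simultaneously under the four rules."""
    if not (first.simple and second.simple):
        return False  # both must be simple instructions
    if first.dest is not None and first.dest in second.sources:
        return False  # the second needs the result of the first
    if first.dest is not None and first.dest == second.dest:
        return False  # both would write the same register
    if first.is_jump:
        return False  # the first instruction cannot be a jump
    return True


a = Instr(simple=True, dest="eax", sources=("ebx",))
b = Instr(simple=True, dest="ecx", sources=("edx",))
c = Instr(simple=True, dest="edx", sources=("eax",))
print(can_pair(a, b), can_pair(a, c))  # True False
```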
The Intel Pentium Pro processor
Introduced in 1995 as the successor to the Pentium
The basis for the Pentium II and Pentium III
Implements a 14-cycle, 3-way superscalar integer datapath
  Very high frequency is the goal
Uses out-of-order execution in that instructions may execute out of their original program order
  Completely handled by hardware transparently to software
  Instructions execute as soon as their source operands become available
  Complicates exception handling
    Some instructions before the excepting one may not have executed, while some after it may have executed
The Intel Pentium Pro processor
Pentium Pro designers (and AMD designers before them) used innovative engineering to overcome the disadvantages of CISC ISAs
  Many complex x86 instructions are internally translated by hardware into RISC-like micro-ops with state machine control
Achieves a very low CPI for simple integer operations even on programs compiled for older implementations
Combination of high frequency and low CPI gave the Pentium Pro extremely competitive integer performance versus RISC microprocessors
  Result has been that RISC CPUs have failed to gain the desktop market share that had been expected
The Intel Pentium 4 processor
20 cycle superscalar integer pipeline
Extremely high frequency (>3GHz)
Major effort to lower power dissipation
  Clock gating: clock to a unit is turned off when the unit is not in use
  Trace cache: caches micro-ops of previously decoded complex instructions to avoid power-consuming decode operation