The Processor: Datapath and Control

Outline

Goals in processor implementation

Brief review of sequential logic design

Pieces of the processor implementation puzzle

A simple implementation of a MIPS integer instruction subset
  Datapath
  Control logic design

A multi-cycle MIPS implementation
  Datapath
  Control logic design

Microcoded control

Exceptions

Some real microprocessor datapath and control


Balance the rate of supply of instructions and data and the rate at which the execution core can consume them and can update memory

(Figure: instruction supply, execution core, data supply)


Recall from Chapter 2
  CPU Time = INST x CPI x CT
  (INST = instruction count, CPI = cycles per instruction, CT = clock cycle time)

INST largely a function of the ISA and compiler

Objective: minimize CPI x CT within design constraints (cost, power, etc.)

Trading off CPI and CT is tricky

(Figure: multiplier and logic blocks)


State elements are clocked devices
  Flip-flops, etc.

Combinational elements hold no state
  ALU, caches, multiplier, multiplexers, etc.

In edge triggered clocking, state elements are only updated on the (rising) edge of the clock pulse


The same state element can be read at the beginning of a clock cycle and updated at the end

Example: incrementing the PC

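(Figure: the PC register feeds an adder whose other input is the constant 4. Timing diagram: with the PC register holding 8, the Add input is 8 and the Add output is 12; on the clock edge the PC register changes from 8 to 12.)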
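The read-then-update behavior of an edge-triggered state element can be sketched in a few lines of Python. This is only an illustrative model, not part of the slides: the `Register` class, its `q`/`d` fields, and the `add4` helper are hypothetical names chosen for the sketch.

```python
class Register:
    """Edge-triggered state element: the output changes only on clock_edge()."""

    def __init__(self, value=0):
        self.q = value  # value visible to the rest of the datapath
        self.d = value  # value waiting at the input

    def clock_edge(self):
        """Model the rising clock edge: the input is latched to the output."""
        self.q = self.d


def add4(x):
    """Combinational adder: returns x + 4, wrapped to 32 bits."""
    return (x + 4) & 0xFFFFFFFF


pc = Register(8)
trace = []
for _ in range(3):
    pc.d = add4(pc.q)           # during the cycle: read PC, compute PC + 4
    trace.append((pc.q, pc.d))  # (PC register, Add output) before the edge
    pc.clock_edge()             # at the edge: PC takes the new value
```

During the first cycle the register still reads 8 while the adder output is already 12, which is the situation the example above describes.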

Our processor design progression

(1) Instruction fetch, execute, and operand reads from data memory all take place in a single clock cycle

(2) Instruction fetch, execute, and operand reads from data memory take place in successive clock cycles

(3) A pipelined design

Pieces of the processor puzzle

Instruction fetch

Execution

Data memory

(Figure: instruction supply, execution core, data supply)

Instruction fetch datapath

Memory to hold instructions

Register to hold the instruction memory address

Logic to generate the next instruction address

PC +4

Execution datapath

Focus on only a subset of all MIPS instructions
  add, sub, and, or
  lw, sw
  slt
  beq, j

For all instructions except j, we
  Read operands from the register file
  Perform an ALU operation

For all instructions except sw, beq, and j, we write a result into the register file

Execution datapath

Register file block diagram

Read register 1,2: source operand register numbers
Read data 1,2: source operands (32 bits each)
Write register: destination operand register number
Write data: data written into register file
RegWrite: when asserted, enables the writing of Write Data

Execution datapath

Datapath for R-type (add, sub, and, or, slt)

R-type instruction format:

op (bits 31-26) | rs (bits 25-21) | rt (bits 20-16) | rd (bits 15-11) | shamt (bits 10-6) | funct (bits 5-0)

Execution datapath

Datapath for beq instruction

I-type instruction format:

op (bits 31-26) | rs (bits 25-21) | rt (bits 20-16) | immediate (bits 15-0)

Zero ALU output indicates whether rs = rt (branch is taken/not taken)

Branch target address is the sign-extended immediate, shifted left two positions and added to PC+4
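The beq target calculation and the taken/not-taken decision can be sketched as follows. This is a minimal model under the assumption of 32-bit wraparound arithmetic; the function names are made up for the illustration.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate field to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """PC+4 plus the sign-extended immediate shifted left two positions."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def beq_next_pc(pc, rs_val, rt_val, imm16):
    """The ALU subtracts rt from rs; a zero result means the branch is taken."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF
```

For instance, with the PC at 0x1000 and an immediate of 3, the target is 0x1000 + 4 + 12 = 0x1010; an immediate of 0xFFFF (that is, -1) branches back to 0x1000.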

Data memory

Used for lw, sw (I-type format)

Block diagram

Address: memory location to be read or written
Read data: data out of the memory on a load
Write data: data into the memory on a store
MemRead: indicates a read operation is to be performed
MemWrite: indicates a write operation is to be performed

Execution datapath + data memory

Datapath for lw, sw

Address is the sign-extended immediate added to the source operand read out of the register file

sw: data written to memory from specified register
lw: data written to register file from specified memory address

Putting the pieces together

Single clock cycle for fetch, execute, and operand read from data memory

3 MUXes
  Register file operand or sign-extended immediate to ALU
  ALU or data memory output written to register file
  PC+4 or branch target address written to PC register

Datapath for R-type instructions

Example: add $4, $18, $30

Datapath for I-type ALU instructions

Example: slti $7, $4, 100

Datapath for not taken beq instruction

Example: beq $28, $13, EXIT

Datapath for taken beq instruction

Example: beq $28, $13, EXIT

Datapath for load instruction

Example: lw $8, 112($2)

Datapath for store instruction

Example: sw $10, 0($3)

Control signals we need to generate

ALU operation control

ALU control input codes from Chapter 4

Two steps to generate the ALU control input
  Use the opcode to distinguish R-type, lw and sw, and beq
  If R-type, use funct field to determine the ALU control input

ALU control input   ALU operation      Used for
000                 and                and
001                 or                 or
010                 add                add, lw, sw
110                 subtract           sub, beq
111                 set on less than   slt

ALU operation control

Opcode used to generate a 2-bit signal called ALUOp with the following encodings
  00: lw or sw, perform an ALU add
  01: beq, perform an ALU subtract
  10: R-type, ALU operation is determined by the funct field

Funct    Instruction   ALU control input
100000   add           010
100010   sub           110
100100   and           000
100101   or            001
101010   slt           111
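The two-step ALU control decode described above can be sketched in Python using the ALUOp encodings and the funct table from the slides. The function names and the small `alu` model are assumptions added for illustration, not the hardware implementation.

```python
# funct field -> 3-bit ALU control input, taken from the table above
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct=0):
    """Map the 2-bit ALUOp (and funct, for R-type) to the ALU control input."""
    if alu_op == 0b00:   # lw or sw: add
        return 0b010
    if alu_op == 0b01:   # beq: subtract
        return 0b110
    if alu_op == 0b10:   # R-type: decided by funct (KeyError if unsupported)
        return FUNCT_TO_ALU[funct]
    raise ValueError("unused ALUOp encoding")


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (32-bit result, zero flag) for the five supported operations."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = a + b
    elif control == 0b110:
        result = a - b
    elif control == 0b111:
        result = int(to_signed32(a) < to_signed32(b))
    else:
        raise ValueError("unsupported ALU control input")
    result &= 0xFFFFFFFF
    return result, result == 0
```

A beq comparing two equal registers would use `alu(alu_control(0b01), a, a)`, which produces a zero result and an asserted zero flag.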

Comparing instruction fields

Opcode, source registers, function code, and immediate fields always in same place

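Destination register is
  bits 15-11 (rd) for R-type
  bits 20-16 (rt) for lw
  MUX to select the right one

R-type: op = 0 (bits 31-26) | rs (bits 25-21) | rt (bits 20-16) | rd (bits 15-11) | shamt (bits 10-6) | funct (bits 5-0)

beq: op = 4 (bits 31-26) | rs (bits 25-21) | rt (bits 20-16) | immediate (offset) (bits 15-0)

lw (sw): op = 35 (43) (bits 31-26) | rs (bits 25-21) | rt (bits 20-16) | immediate (offset) (bits 15-0)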

Datapath with instr fields and ALU control

Main control unit design

Main control unit design

Truth table

(Figure: main control truth table with one column per opcode: beq (4), R-type (0), lw (35), sw (43))
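One way to picture the main control unit is as a lookup from opcode to control signal values. The sketch below is an assumption-laden illustration: the signal names RegDst, ALUSrc, MemtoReg, and Branch follow the common textbook naming for the three MUX selects and the branch enable and do not appear in the text of these slides, and `None` stands for a don't-care value.

```python
# opcode -> control signals; None marks a don't-care
CONTROL = {
    0: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,          # R-type
            MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,         # lw
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    43: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,   # sw
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    4: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,    # beq
            MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}


def main_control(opcode):
    """Return the control signals for one of the four supported opcodes."""
    return CONTROL[opcode]


def writes_register(opcode):
    """True when the instruction writes a result into the register file."""
    return main_control(opcode)["RegWrite"] == 1
```

The table is consistent with the earlier observation that sw and beq never write the register file, and with the ALUOp encodings 00, 01, and 10.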

Adding support for jump instructions

J-type format

op = 2 (bits 31-26) | target (bits 25-0)

Next PC formed by shifting left the 26-bit target two bits and combining it with the 4 high-order bits of PC+4

Now the next PC will be one of
  PC+4
  beq target address
  j target address

We need another MUX and control bit
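The jump address formation and the resulting three-way choice of next PC can be sketched as below. This is an illustrative model only; the function and parameter names are hypothetical, and giving the jump select priority over the branch select is an assumption about how the extra MUX is placed.

```python
def jump_target(pc, target26):
    """Top 4 bits of PC+4 concatenated with the 26-bit target shifted left 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus_4 & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def next_pc(pc, is_jump, target26, branch_taken, branch_addr):
    """Select among j target, beq target, and PC+4 (jump MUX assumed last)."""
    if is_jump:
        return jump_target(pc, target26)
    if branch_taken:
        return branch_addr & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF
```

Because only 26 target bits are available, a jump can reach any word within the 256 MB region selected by the upper four bits of PC+4.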

Adding support for jump instructions

Evaluation of the simple implementation

All instructions take one clock cycle (CPI = 1)

Assume the following worst case delays
  Instruction memory: 4 time units
  Data memory: 4 time units (read), 2 time units (write)
  ALU: 4 time units
  Adders: 3 time units
  Register file: 2 time units (read), 1 time unit (write)
  MUXes, sign extension, gates, and shifters: 1 time unit

Large disparity in worst case delays among instruction types
  R-type: 4+2+1+4+1+1 = 13 time units
  beq: 4+2+1+4+1+1+1 = 14 time units
  j: 4+1+1 = 6 time units
  store: 4+2+4+2 = 12 time units
  load: 4+2+4+4+1+1 = 16 time units
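The per-instruction delays above can be reproduced by summing component delays along each path. In this sketch the delay values come from the slide, but the assignment of each term to a named component (the `PATHS` table) is an interpretation of the sums, not something the slides spell out.

```python
# Worst case component delays, in time units
DELAY = {
    "imem": 4, "dmem_read": 4, "dmem_write": 2, "alu": 4,
    "reg_read": 2, "reg_write": 1,
    "misc": 1,  # MUXes, sign extension, gates, and shifters
}

# Assumed component sequence for each instruction type
PATHS = {
    "R-type": ["imem", "reg_read", "misc", "alu", "misc", "reg_write"],
    "beq": ["imem", "reg_read", "misc", "alu", "misc", "misc", "misc"],
    "j": ["imem", "misc", "misc"],
    "store": ["imem", "reg_read", "alu", "dmem_write"],
    "load": ["imem", "reg_read", "alu", "dmem_read", "misc", "reg_write"],
}


def path_delay(kind):
    """Total delay of one instruction type's path."""
    return sum(DELAY[part] for part in PATHS[kind])


delays = {kind: path_delay(kind) for kind in PATHS}
# In a single-cycle design the clock must cover the slowest instruction.
cycle_time = max(delays.values())
```

The load path sets the cycle time, so even a jump that needs 6 time units occupies a full 16-unit cycle.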

Evaluation of the simple implementation

Disparity would be worse in a real machine
  Even slower integer instructions (e.g., multiply/divide in MIPS)
  Floating point instructions

Simple instructions take as long as complex ones

A multicycle implementation

Instruction fetch, register file access, etc. occur in separate clock cycles

Different instruction types take different numbers of cycles to complete

Clock cycle time should be shorter

High level view of datapath

New registers store results of each step
  Not programmer visible!

Hardware can be shared
  One ALU for PC+4, branch target calculation, EA calculation, and arithmetic operations
  One memory for instructions and data

Detailed multi-cycle datapath

Multi-cycle control

First two cycles for all instructions

Instruction fetch (1st cycle)
  Load the instruction into the IR register
    IR = Memory[PC]
  Increment the PC
    PC = PC+4

Instruction decode and register fetch (2nd cycle)
  Read register file locations rs and rt, results into the A and B registers
    A = Reg[IR[25-21]]
    B = Reg[IR[20-16]]
  Calculate the branch target address and load into ALUOut
    ALUOut = PC+(sign-extend (IR[15-0]) <<2)

Instruction fetch

IR=Mem[PC]

Instruction fetch

PC=PC+4

Instruction decode and register fetch

A=Reg[IR[25-21]], B=Reg[IR[20-16]]

Instruction decode and register fetch

ALUOut = PC+(sign-extend (IR[15-0]) <<2)

Additional cycles for R-type

Execution
  ALUOut = A op B

Completion
  Reg[IR[15-11]] = ALUOut

R-type execution cycle

ALUOut = A op B

R-type completion cycle

Reg[IR[15-11]] = ALUOut

Additional cycles for store

Address computation
  ALUOut = A + sign-extend (IR[15-0])

Memory access
  Memory[ALUOut] = B

Store address computation cycle

ALUOut = A + sign-extend (IR[15-0])

Store memory access cycle

Memory[ALUOut] = B

Additional cycles for load

Address computation
  ALUOut = A + sign-extend (IR[15-0])

Memory access
  MDR = Memory[ALUOut]

Read completion
  Reg[IR[20-16]] = MDR

Load memory access cycle

MDR = Memory[ALUOut]

Load read completion cycle

Reg[IR[20-16]] = MDR

Additional cycle for beq

Branch completion
  if (A == B) PC = ALUOut

Branch completion cycle for beq

if (A == B) PC = ALUOut

Additional cycle for j

Jump completion
  PC = PC[31-28] || (IR[25-0]<<2)

Jump completion cycle for j

PC = PC[31-28] || (IR[25-0]<<2)

Control logic design

Implemented as a Finite State Machine

Inputs: 6 opcode bits
Outputs: 16 control signals
State: 4 bits for 10 states

High-level view of FSM

Instruction fetch cycle

Instruction decode/register fetch cycle

R-type execution cycle

R-type completion cycle

Memory address computation cycle

Store memory access cycle

Load memory access cycle

Load read completion cycle

beq branch completion cycle

j jump completion cycle
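The ten states listed above can be modeled as a small next-state function driven by the opcode. This is a sketch of the state sequencing only (no control outputs); the state names and the `cycles` helper are hypothetical labels chosen for the illustration.

```python
LW, SW, RTYPE, BEQ, J = 35, 43, 0, 4, 2  # MIPS opcodes used in the slides


def next_state(state, opcode):
    """Return the state that follows `state` for the given opcode."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {LW: "mem_addr", SW: "mem_addr", RTYPE: "rtype_exec",
                BEQ: "beq_complete", J: "jump_complete"}[opcode]
    if state == "mem_addr":
        return "load_access" if opcode == LW else "store_access"
    if state == "load_access":
        return "load_complete"
    if state == "rtype_exec":
        return "rtype_complete"
    # Every remaining state finishes the instruction.
    return "fetch"


def cycles(opcode):
    """Count the clock cycles from fetch until the FSM returns to fetch."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, opcode)
        if state == "fetch":
            return count
```

Walking the machine this way gives 5 cycles for a load, 4 for a store or R-type instruction, and 3 for beq or j.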

Complete FSM

Evaluation of the multi-cycle design

CPI calculated based on the instruction mix

For gcc (Figure 4.54)
  23% loads (5 cycles each)
  13% stores (4 cycles each)
  19% branches (3 cycles each)
  2% jumps (3 cycles each)
  43% ALU (4 cycles each)

CPI = 0.23*5 + 0.13*4 + 0.19*3 + 0.02*3 + 0.43*4 = 4.02

Cycle time is calculated from the longest delay path assuming the same timing delays as before

Worst case datapath: branch target

ALUOut = PC+(sign-extend (IR[15-0]) <<2)

Delay = 7 time units (delay of simple = 16)

Evaluation of the multi-cycle design

Time per instruction of simple and multi-cycle
  TPI(simple) = CPI(simple) x cycle time(simple) = 1 x 16 = 16
  TPI(multi-cycle) = 4.02 x 7 = 28.1
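The comparison can be checked numerically. The sketch below uses the gcc instruction mix, the cycle counts per instruction class, and the cycle times of 16 and 7 time units quoted in the slides; the variable names are illustrative.

```python
# instruction class -> (fraction of instructions, cycles in multi-cycle design)
MIX = {
    "load": (0.23, 5),
    "store": (0.13, 4),
    "branch": (0.19, 3),
    "jump": (0.02, 3),
    "alu": (0.43, 4),
}

# Weighted average cycles per instruction for the multi-cycle design
cpi_multi = sum(fraction * cycles for fraction, cycles in MIX.values())

# Time per instruction = CPI x cycle time
tpi_single = 1 * 16
tpi_multi = cpi_multi * 7

print(f"CPI(multi-cycle) = {cpi_multi:.2f}")
print(f"TPI(simple) = {tpi_single}, TPI(multi-cycle) = {tpi_multi:.1f}")
```

With these delays the cycle time shrinks by a factor of 16/7, about 2.3, while the CPI grows by a factor of about 4, which is why the single-cycle design comes out ahead here.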

Simple single-cycle implementation is faster

Multicycle with pipelining will be considerably faster than single-cycle implementation

Exceptions

An exception is an event that causes a deviation from the normal execution of instructions

Types of exceptions
  Operating system call (e.g., read a file, print a file)
  Input/output device request
  Page fault (request for instruction/data not in memory, Ch 7)
  Arithmetic error (overflow, underflow, etc.)
  Undefined instruction
  Misaligned memory access (e.g., word access to odd address)
  Memory protection violation
  Hardware error
  Power failure

An exception is not usually due to an error!

We need to be able to restart the program at the point where the exception was detected

Handling exceptions

Detect the exception

Save enough information about the exception to handle it properly

Save enough information about the program to resume it after the exception is handled

Handle the exception

Either terminate the program or resume executing it depending on the exception type

Detecting exceptions

Performed by hardware

Overflow: determined from the opcode and the overflow output of the ALU

Undefined instruction: determined from
  The opcode in the main control unit
  The function code and ALUOp in the ALU control logic

Detecting exceptions

(Figure: datapath with the overflow and undefined instruction signals marked)

Saving exception information

Performed by hardware

We need the type of exception and the PC of the instruction when the exception occurred

In MIPS, the Cause register holds the exception type
  Need an encoding for each exception type
  Need a signal from the control unit to load it into the Cause register

and the Exception Program Counter (EPC) register holds the PC
  Need to subtract 4 from the PC register to get the correct PC (since we loaded PC+4 into the PC register during the Instruction Fetch cycle)
  Need a signal from the control unit to load it into EPC
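What the hardware saves when an exception is detected can be sketched as follows. The numeric cause codes and the `ExceptionState` container are hypothetical, chosen only for the illustration; the slides say an encoding is needed but do not give one.

```python
from dataclasses import dataclass

# Hypothetical cause encodings (not specified in the slides)
CAUSE_UNDEFINED_INSTRUCTION = 0
CAUSE_OVERFLOW = 1


@dataclass
class ExceptionState:
    pc: int         # already holds PC+4 after the Instruction Fetch cycle
    epc: int = 0    # Exception Program Counter
    cause: int = 0  # Cause register


def save_exception(state, cause_code):
    """Record the exception type and the address of the offending instruction."""
    state.epc = (state.pc - 4) & 0xFFFFFFFF  # undo the earlier PC+4
    state.cause = cause_code
    return state


# Example: the instruction at 0x00400024 overflows; the PC already reads 0x00400028.
saved = save_exception(ExceptionState(pc=0x00400028), CAUSE_OVERFLOW)
```

After the call, `saved.epc` holds 0x00400024, the address of the instruction that caused the exception, rather than the incremented PC value.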

Saving exception information

Saving program information

Needed in order to restart the program from the point where the exception occurred

Performed by hardware and software

EPC register holds the PC of the instruction that had the exception (where we will restart the program)

The software routine that handles the exception saves any registers that it will need to the stack and restores them when it is done

Handling the exception

Performed by hardware and software

Need to transfer control to a software routine to handle the exception (exception handler)

The exception handler runs in a privileged mode that allows it to use special instructions and access all of memory
  Our programs run in user mode

The hardware enables the privileged mode, loads PC with the address of the exception handler, and transitions to the Fetch state

Handling the exception

Loading the PC with exception handler address

Exception handler

Stores the values of the registers that it will need to the stack

Handles the particular exception
  Operating system call: calls the subroutine associated with the call
  Underflow: sets register to zero or uses denormalized numbers
  I/O: handles the particular I/O request, e.g., keyboard input

Restores registers from the stack (if program is to be restarted)

Terminates the program, or resumes execution by loading the PC with EPC and transitioning to the Instruction Fetch state

FSM modifications

The Intel Pentium processor

Introduced in 1993

Uses a multi-cycle datapath with the following steps for integer instructions
  Prefetch (PF): read instruction from the instruction memory
  Decode 1 (D1): first stage of instruction decode
  Decode 2 (D2): second stage of instruction decode
  Execute (E): perform the ALU operation
  Write back (WB): write the result to the register file

Datapath usage varies by instruction type
  Simple instructions make one pass through the datapath using state machine control
  Complex instructions make multiple passes, reusing the same hardware elements under microcode control

The Intel Pentium processor

The Pentium is a 2-way superscalar design: two instructions can execute simultaneously

Ideal CPI for a 2-way superscalar is 0.5

Conditions for superscalar execution
  Both must be simple instructions
  The result of the first instruction cannot be needed by the second
  Both instructions cannot write the same register
  The first instruction in program sequence cannot be a jump

(Figure: shared PF and D1 stages feeding two D2, E, WB pipes, labeled U pipe and V pipe)
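The four pairing conditions for superscalar execution can be written as a simple check. This is a sketch of the rules as stated on the slide, not of the actual Pentium pairing logic; the `Instr` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    simple: bool                 # True for a "simple" instruction
    dest: Optional[str] = None   # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read
    is_jump: bool = False


def can_pair(first, second):
    """Check the slide's four conditions for issuing two instructions together."""
    if not (first.simple and second.simple):
        return False             # both must be simple instructions
    if first.is_jump:
        return False             # the first cannot be a jump
    if first.dest is not None and first.dest in second.srcs:
        return False             # second needs the first's result
    if first.dest is not None and first.dest == second.dest:
        return False             # both write the same register
    return True


add1 = Instr(simple=True, dest="eax", srcs=("eax", "ebx"))
add2 = Instr(simple=True, dest="ecx", srcs=("ecx", "edx"))
use1 = Instr(simple=True, dest="edx", srcs=("eax",))
```

Here `add1` and `add2` are independent and can issue together, while `use1` reads the result of `add1` and has to wait.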

The Intel Pentium Pro processor

Introduced in 1995 as the successor to the Pentium

The basis for the Pentium II and Pentium III

Implements a 14-cycle, 3-way superscalar integer datapath
  Very high frequency is the goal

Uses out-of-order execution in that instructions may execute out of their original program order
  Completely handled by hardware transparently to software
  Instructions execute as soon as their source operands become available
  Complicates exception handling
    Some instructions before the excepting one may not have executed, while some after it may have executed

The Intel Pentium Pro processor

Pentium Pro designers (and AMD designers before them) used innovative engineering to overcome the disadvantages of CISC ISAs
  Many complex x86 instructions are internally translated by hardware into RISC-like micro-ops with state machine control
  Achieves a very low CPI for simple integer operations even on programs compiled for older implementations

Combination of high frequency and low CPI gave the Pentium Pro extremely competitive integer performance versus RISC microprocessors
  Result has been that RISC CPUs have failed to gain the desktop market share that had been expected

The Intel Pentium 4 processor

20-cycle superscalar integer pipeline

Extremely high frequency (>3GHz)

Major effort to lower power dissipation
  Clock gating: clock to a unit is turned off when the unit is not in use
  Trace cache: caches micro-ops of previously decoded complex instructions to avoid power-consuming decode operation
