Pipelining Basics
Outline
● Addressing Modes
● MIPS ISA
● MIPS Pipeline
Addressing Modes
● How are operands specified in instructions?
Mode | Instruction | Meaning
Register | Add R4, R3, R2 | Regs[R4] <- Regs[R3] + Regs[R2]
Immediate | Add R4, R3, #5 | Regs[R4] <- Regs[R3] + 5
Displacement | Add R4, R3, 100(R1) | Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Register Indirect | Add R4, R3, (R1) | Regs[R4] <- Regs[R3] + Mem[Regs[R1]]
Absolute | Add R4, R3, (0x475) | Regs[R4] <- Regs[R3] + Mem[0x475]
Memory Indirect | Add R4, R3, @(R1) | Regs[R4] <- Regs[R3] + Mem[Mem[Regs[R1]]]
PC relative | Add R4, R3, 100(PC) | Regs[R4] <- Regs[R3] + Mem[100 + PC]
Scaled | Add R4, R3, 100(R1)[R5] | Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]
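The modes above differ only in how the second source operand is located. A minimal Python sketch of that lookup follows; the register and memory contents, the PC value, and the function name `operand` are all hypothetical, chosen only so that each mode returns a distinct value.

```python
# Hypothetical machine state, chosen only for illustration.
regs = {"R1": 8, "R3": 10, "R5": 2}
mem = {8: 16, 16: 7, 108: 2, 116: 3, 120: 4, 0x475: 5}
PC = 20  # assumed value, so that 100 + PC = 120

def operand(mode, base=None, disp=0, index=None, scale=4):
    """Return the second source operand for the given addressing mode."""
    if mode == "register":
        return regs[base]
    if mode == "immediate":
        return disp
    if mode == "displacement":
        return mem[disp + regs[base]]
    if mode == "register_indirect":
        return mem[regs[base]]
    if mode == "absolute":
        return mem[disp]
    if mode == "memory_indirect":
        return mem[mem[regs[base]]]
    if mode == "pc_relative":
        return mem[disp + PC]
    if mode == "scaled":
        return mem[disp + regs[base] + regs[index] * scale]
    raise ValueError(mode)

# Add R4, R3, 100(R1): Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
r4 = regs["R3"] + operand("displacement", base="R1", disp=100)
```

Register indirect is the displacement case with a zero offset, and absolute is the displacement case with no base register, which is how MIPS64 provides both.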
Data Types and Sizes
● Signed and Unsigned Data
– 2's complement representation
● Real numbers (Floating point)
– IEEE 754 Single precision and Double precision
● Addresses
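As a small illustration of the 2's complement representation, the sketch below converts between signed values and fixed-width bit patterns. The function names are hypothetical helpers, not part of any library.

```python
def to_twos_complement(value, bits):
    """Encode a signed integer as an unsigned bit pattern of the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError("value does not fit in the given width")
    return value & ((1 << bits) - 1)

def from_twos_complement(pattern, bits):
    """Decode a bit pattern of the given width back into a signed integer."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit

pattern = to_twos_complement(-5, 8)   # 8-bit pattern for -5
```

The same 8-bit pattern reads as 251 when treated as unsigned data and as -5 when treated as signed data, which is why load instructions come in signed and unsigned variants.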
ISA Encoding
● Fixed Width
– Eg.: RISC Architectures: MIPS, PowerPC, SPARC, ARM
● Variable Length
– Eg.: CISC Architectures: IBM 360, x86, Motorola 68K, VAX, …
● Mostly Fixed or Compressed
– Eg.: MIPS16, THUMB
● Very Long Instruction Words
– Multiple instructions in a fixed width bundle
– Eg.: Multiflow, HP/ST Lx, TI C6000
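x86 (IA-32) Instruction Encoding
Field | Size
Instruction Prefix | Up to four prefixes (1 byte each)
Opcode | 1, 2 or 3B
ModR/M | 1B (if needed)
Scale, Index, Base | 1B (if needed)
Displacement | 0, 1, 2, or 4B (if needed)
Immediate | 0, 1, 2, or 4B (if needed)
x86 and x86-64 instruction format. Possible instructions are 1 to 15 bytes long (the architectural limit, even though the field maxima above add up to more).
Example: REP MOVSB (a prefix followed by a one-byte opcode)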
Example – MIPS64 ISA
● RISC, load-store architecture
● 32-bit instructions, fixed format
● 32 64-bit GPRs, R0-R31
– R0 is hardwired to 0.
● 32 64-bit FPRs, F0-F31
– Can hold 32-bit floats also (with the other half unused).
– “SIMD” extensions operate on more floats in 1 FPR
● Special registers
– Floating-point status register
● Load/store 8-, 16-, 32-, 64-bit integers
– Sign-extended (or zero-extended by the unsigned loads LBU, LHU, LWU) to fill the 64-bit GPR
– Also 32-bit floats and 64-bit doubles
MIPS64 Addressing Modes
● Register (Arithmetic, Logical ops only)
● Immediate (Arithmetic, Logical) & Displacement (load/stores only)
– 16-bit immediate/offset field
– Register indirect: use 0 as displacement offset
– Direct (absolute): use R0 as displacement base
● Byte-addressed memory, 64-bit address
● Software-settable big-endian/little-endian flag
● Alignment required
[Figure: byte addresses 100 to 103 and 104 to 107, each group forming one word at a word-aligned address]
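The displacement mode and the alignment requirement can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names are hypothetical, the 16-bit offset field is treated as signed, and 64-bit address wraparound is modelled with a mask.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def sign_extend16(field):
    """Interpret a 16-bit offset field as a signed value."""
    return (field ^ 0x8000) - 0x8000

def effective_address(base_value, offset_field):
    """Displacement mode: base register value plus sign-extended offset."""
    return (base_value + sign_extend16(offset_field)) & MASK64

def is_aligned(address, size_bytes):
    """An access of size_bytes is aligned when the address is a multiple of it."""
    return address % size_bytes == 0

# Register indirect: offset 0.  Direct (absolute): base R0, which always holds 0.
indirect = effective_address(104, 0)
absolute = effective_address(0, 100)
```

With these definitions, a word access at address 100 or 104 is aligned, while one at 102 is not.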
MIPS64 Instructions
DATA TRANSFER INSTRUCTIONS
Instruction | Opcode/Mnemonic | Examples
Load | LB, LBU, LH, LHU, LW, LWU, LD, L.S, L.D | LD R1, 30(R2); L.S F0, 50(R3)
Store | SB, SH, SW, SD, S.S, S.D | SH R3, 502(R2); SB R2, 41(R3)
● L: Load
● S: Store
● B: Byte (8b), H: Half Word (16b), W: Word (32b)
● U: Unsigned (as in LBU, LHU, LWU); Upper in LUI
● I: Immediate
Steps: Decode instruction, fetch operands, effective address calculation, memory access, update register file (RF).
MIPS64 Instructions
ARITHMETIC/LOGICAL INSTRUCTIONS
Logical and Arithmetic Shift, Set less than…
DADD, DADDI, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU
AND, OR, XOR, ANDI, ORI, XORI
LUI
DSLL, DSRL, SLT, SLTI, SLTU
DADDU R1, R2, R3
ANDI R1, #43
SLT R1, R2, R3
Decode Instruction, Fetch operands, Arithmetic operation, Update results in RF.
MIPS64 Instructions
CONTROL INSTRUCTIONS
Branch, Jump, Control transfer
BEQZ, BNEZ
BEQ, BNE
J, JR
JAL, JALR
ERET
BEQ R1, R2, label
J label
Decode Instruction, Fetch operands, Compare condition, Update PC.
MIPS Instruction Formats
● R-type: OP rd, rs, rt
op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
● I-type: OP rt, rs, IMM
op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
● J-type: OP LABEL
op (6 bits) | Offset added to PC (26 bits)
op: Opcode (class of instruction). Eg. ALU
funct: Which subunit of the ALU to activate?
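Because every field sits at a fixed bit position, decoding is a handful of shifts and masks. The Python sketch below extracts the fields of all three formats from one 32-bit word; the function names and the field values in the usage line are arbitrary illustrations, not a specific MIPS instruction.

```python
def decode(word):
    """Split a 32-bit instruction word into the fields of all three formats."""
    return {
        "op": (word >> 26) & 0x3F,       # 6 bits, common to R-, I- and J-type
        "rs": (word >> 21) & 0x1F,       # 5 bits
        "rt": (word >> 16) & 0x1F,       # 5 bits
        "rd": (word >> 11) & 0x1F,       # 5 bits, R-type only
        "shamt": (word >> 6) & 0x1F,     # 5 bits, R-type only
        "funct": word & 0x3F,            # 6 bits, R-type only
        "immediate": word & 0xFFFF,      # 16 bits, I-type only
        "offset": word & 0x3FFFFFF,      # 26 bits, J-type only
    }

def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

word = encode_r(op=0, rs=2, rt=3, rd=1, shamt=0, funct=45)  # arbitrary values
fields = decode(word)
```

All fields can be extracted before the opcode is known, which is what makes fixed-field decoding possible in the ID stage.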
Implementation of RISC ISA - 1
● Instruction Fetch (IF)
[Figure: IF stage datapath: PC, an adder (PC + 4), Instruction Memory, IR and NPC]
IR <- Mem[PC]
NPC <- PC + 4
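Implementation of RISC ISA - 2
● Instruction Decode/Register Fetch (ID)
[Figure: ID stage datapath: the rs, rt and rd fields of IR address the Registers, which deliver A and B; a Sign Extend unit widens the 16-bit immediate to 32 bits to give Imm]
A <- Regs[rs]
B <- Regs[rt]
Imm <- Sign-extended immediate field of IR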
Implementation of RISC ISA - 3
● Execution/Effective Address (EX)
[Figure: EX stage datapath: MUXes select the ALU inputs from NPC, A, B and Imm; the ALU produces ALUOutput; a Zero? test on A produces Cond]
– Memory Reference: ALUOutput <- A + Imm
– Register-Register Instructions: ALUOutput <- A func B
– Register-Immediate Instructions: ALUOutput <- A func Imm
– Branch Instruction: ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0)
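Implementation of RISC ISA - 4
● Memory Access/Branch Completion (MEM)
[Figure: MEM stage datapath: Data Memory addressed by ALUOutput, with B as store data and LMD as load result; a MUX controlled by Cond selects NPC or ALUOutput as the next PC]
– Memory Reference (load): LMD <- Mem[ALUOutput]
– Memory Reference (store): Mem[ALUOutput] <- B
– Branch: if (Cond) PC <- ALUOutput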
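Implementation of RISC ISA - 5
● Write back (WB)
[Figure: WB stage datapath: a MUX selects ALUOutput or LMD as the value written to the Register File]
– Register-Register Instructions: Regs[rd] <- ALUOutput
– Register-Immediate Instructions: Regs[rt] <- ALUOutput
– Load Instruction: Regs[rt] <- LMD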
Implementation of RISC ISA - Stages
● Instruction Fetch (IF)
● Instruction Decode/Register Fetch (ID)
– Fixed field decoding
● Execution/Effective address (EX)
● Memory Access (MEM)
● Write back (WB)
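The five steps can be strung together into a small unpipelined interpreter. The Python sketch below is an illustration only: instructions are already-decoded dicts with hypothetical keys (`kind`, `rs`, `rt`, `rd`, `imm`, `func`), the immediate is assumed to be sign-extended already, and the only branch modelled is the "taken when A == 0" form used on the EX slide.

```python
import operator

def step(regs, dmem, imem, pc):
    """Run one instruction through IF, ID, EX, MEM and WB; return the next PC."""
    # IF: fetch the instruction and compute the next sequential PC.
    ir = imem[pc // 4]
    npc = pc + 4
    # ID: read the register operands and the immediate.
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)
    # EX: ALU operation, effective address, or branch target and condition.
    kind, cond = ir["kind"], False
    if kind == "reg-reg":
        alu_output = ir["func"](a, b)
    elif kind == "reg-imm":
        alu_output = ir["func"](a, imm)
    elif kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(kind)
    # MEM: data memory access and selection of the next PC.
    lmd = None
    if kind == "load":
        lmd = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b
    next_pc = alu_output if cond else npc
    # WB: write the result into the register file.
    if kind == "reg-reg":
        regs[ir["rd"]] = alu_output
    elif kind == "reg-imm":
        regs[ir["rt"]] = alu_output
    elif kind == "load":
        regs[ir["rt"]] = lmd
    regs[0] = 0  # R0 is hardwired to 0
    return next_pc

imem = [
    {"kind": "load", "rs": 0, "rt": 1, "imm": 100},                        # R1 <- Mem[100]
    {"kind": "reg-imm", "rs": 1, "rt": 2, "imm": 5, "func": operator.add},  # R2 <- R1 + 5
    {"kind": "reg-reg", "rs": 1, "rt": 2, "rd": 3, "func": operator.add},   # R3 <- R1 + R2
    {"kind": "store", "rs": 0, "rt": 3, "imm": 108},                       # Mem[108] <- R3
    {"kind": "branch", "rs": 0, "imm": 1},                                 # R0 == 0: skip next
    {"kind": "reg-imm", "rs": 0, "rt": 4, "imm": 99, "func": operator.add}, # skipped
]
regs, dmem, pc = [0] * 32, {100: 7}, 0
while pc < 4 * len(imem):
    pc = step(regs, dmem, imem, pc)
```

Each instruction here finishes all five steps before the next one starts; pipelining overlaps these steps across consecutive instructions instead.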
MIPS Datapath
[Figure: the complete datapath, combining the five stages from left to right: Instruction Fetch (IF: PC, adder, IM, NPC, IR), Instruction Decode/Register Fetch (ID: Regs, Sign Extend, A, B, Imm), Execute/Address Calculation (EX: MUXes, ALU, Zero?, Cond, ALUOutput), Memory Access (MEM: DM, LMD, MUX) and Write Back (WB: MUX)]
MIPS Pipeline
Hennessy & Patterson, CA-QA, Appendix C, 5ed. MK, 2013
Time (clock cycles) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
i1 | IF | ID | EX | MEM | WB | | | |
i2 | | IF | ID | EX | MEM | WB | | |
i3 | | | IF | ID | EX | MEM | WB | |
i4 | | | | IF | ID | EX | MEM | WB |
... | | | | | IF | ID | EX | MEM | WB
Example:
● When will i10000 complete? What is the average clock cycles per instruction (CPI)?
● If the processor were not pipelined, when would i10000 complete? What is the average CPI?
(Assume the same clock period for both designs.)
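One way to work this example, sketched in Python under stated assumptions: the pipeline has 5 stages and never stalls, and the unpipelined design spends one clock cycle on each of the 5 steps of every instruction. The function names are hypothetical.

```python
def pipelined_completion_cycle(i, stages=5):
    """Cycle in which instruction i (1-based) leaves the last stage, with no stalls."""
    return i + stages - 1

def unpipelined_completion_cycle(i, stages=5):
    """Cycle in which instruction i finishes when instructions do not overlap."""
    return i * stages

n = 10000
pipe_cycles = pipelined_completion_cycle(n)      # 10004
pipe_cpi = pipe_cycles / n                       # 1.0004
seq_cycles = unpipelined_completion_cycle(n)     # 50000
seq_cpi = seq_cycles / n                         # 5.0
```

The first instruction takes 5 cycles to fill the pipeline and every later one completes one cycle after its predecessor, so the average CPI approaches 1 as the instruction count grows.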
Some Equations
[Figure: Unpipelined Processor: one instruction (IF ID EX MEM WB) takes T, followed by the overhead T_ovh. Pipelined Processor: each stage takes T_stage, which includes T_ovh]
● Unpipelined: Time to execute one instruction: $T_{exec} = T + T_{ovh}$
● N stage pipeline. Time per stage: $T_{stage} = \frac{T}{N} + T_{ovh}$
● Total time per instruction: $T_{inst} = N \times \left(\frac{T}{N} + T_{ovh}\right) = T + N \times T_{ovh}$
● Clock cycle time: $T_{clock} = \frac{T}{N} + T_{ovh}$
● Clock speed: $\frac{1}{T_{clock}}$
● Ideal speedup: $Speedup_{ideal} = \frac{T + T_{ovh}}{T/N + T_{ovh}}$
● Cycles to complete one instruction = N
● Average CPI = 1
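These relations are easy to check numerically. The Python sketch below evaluates them for hypothetical values T = 10 ns, T_ovh = 0.5 ns and N = 5; the function name and the numbers are illustrations, not figures from the slides.

```python
def pipeline_times(T, T_ovh, N):
    """Evaluate the unpipelined and N-stage pipelined timing relations."""
    t_exec = T + T_ovh            # unpipelined time per instruction
    t_stage = T / N + T_ovh       # time per pipeline stage
    t_clock = t_stage             # the clock must fit one stage
    return {
        "t_exec": t_exec,
        "t_stage": t_stage,
        "t_inst": N * t_stage,    # latency of one instruction through the pipeline
        "t_clock": t_clock,
        "clock_speed": 1 / t_clock,
        "speedup_ideal": t_exec / t_stage,
    }

r = pipeline_times(T=10.0, T_ovh=0.5, N=5)
```

Note that the latency of a single instruction (12.5 ns here) is worse than in the unpipelined design (10.5 ns); the gain comes from completing one instruction per clock cycle.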
Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
Average Instruction Execution time = Clock cycle * Average CPI
$CPI = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction Count}} \times CPI_i$
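A worked solution, sketched in Python. It assumes the pipelined processor reaches the ideal CPI of 1 and that its clock cycle is the original 1 ns plus the 0.2 ns overhead; the variable names are illustrative.

```python
clock_unpipelined_ns = 1.0
overhead_ns = 0.2

# (relative frequency, cycles) for ALU operations, branches and memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

# Weighted average CPI: sum over classes of frequency * CPI_i
cpi_unpipelined = sum(freq * cycles for freq, cycles in mix)

time_unpipelined_ns = clock_unpipelined_ns * cpi_unpipelined
time_pipelined_ns = (clock_unpipelined_ns + overhead_ns) * 1.0  # assumed CPI = 1

speedup = time_unpipelined_ns / time_pipelined_ns
```

The average CPI is 4.4, so an instruction takes 4.4 ns unpipelined against 1.2 ns pipelined, a speedup of about 3.7.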
Multiple Issue Integer Pipeline
[Figure: multiple-issue integer pipeline across the stages IF ID EX MEM WB, with labels IM, IR0, IR1, RF Read, A, B, Zero?, DM and RF Write]
EXTRA
Operations and Operands
[Figure: a PROCESSOR containing Control, a Register File and an ALU with inputs i1, i2 and output o, connected to Memory]
Machine Models
[Figure: the ALU and its operand storage in four machine models: STACK (operands at the top of stack, TOS), ACCUMULATOR, REGISTER-MEMORY and REGISTER-REGISTER]
C = A + B
● STACK: Push A; Push B; Add; Pop C
● ACCUMULATOR: Load A; Add B; Store C
● REGISTER-MEMORY: Load R1, A; Add R3, R1, B; Store R3, C
● REGISTER-REGISTER: Load R1, A; Load R2, B; Add R3, R1, R2; Store R3, C
Machine Models – Comparison
● Number of explicitly named operands
● Number of instructions that can access data from memory
● Code size
● Amount of data transferred between memory and processor
● Complexity of hardware
● Ease of compilation (ease of generation of machine code).
The Stack Machine Model
Example
How is the expression x = (a*b)+(c-(d/e)) evaluated on a stack based machine?
● What is the sequence of instructions?
● Convert the equation to its Reverse Polish Notation form.
– ab*cde/-+
The Stack Machine Model
Evaluate ab*cde/-+ on a stack based machine
[Figure: step-by-step contents of the STACK (addresses 0xFF, 0xFE, ...) as operands are pushed from MEMORY (addresses 0x00 to 0x06) and combined by the operators]
What is the minimum size of the stack required to evaluate this expression?
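The evaluation can be traced with a short Python sketch that also records the deepest the stack ever gets. The operand values are hypothetical, and `eval_rpn` is an illustrative helper rather than a model of any particular stack machine.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(tokens, memory):
    """Evaluate an RPN token list; return (result, maximum stack depth)."""
    stack, max_depth = [], 0
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()          # the top of stack is the right operand
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(memory[tok])    # Push: copy an operand from memory
        max_depth = max(max_depth, len(stack))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0], max_depth

memory = {"a": 2, "b": 3, "c": 10, "d": 8, "e": 4}   # hypothetical values
x, depth = eval_rpn(list("ab*cde/-+"), memory)
```

The stack is deepest just after e is pushed, when it holds a*b, c, d and e, so four entries are enough for this expression.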
Class Work Example
Example
For each machine model, write a code sequence to evaluate the following expressions.
$b = a^3 + 3 \cdot a^2 + 2 \cdot a + 7$
$c = x^3 + 3 \cdot a^2 + 2 \cdot b + 7$
For each machine model, what is (a) the total number of instructions in the code sequence, (b) the execution time in clock cycles, (c) the CPI?
Given: Load, store, arithmetic and logic tasks take 1 cycle. Multiply completes in 4 clock cycles.
Real World Instruction Sets
Arch | Type | #Oper | #Mem | Data Size | #Regs | Addr Size | Use
Alpha | Reg-Reg | 3 | 0 | 64b | 32 | 64b | Workstation
ARM | Reg-Reg | 3 | 0 | 32/64b | 16 | 32/64b | Cell Phone, Embedded
MIPS | Reg-Reg | 3 | 0 | 32/64b | 32 | 32b/64b | Workstation
SPARC | Reg-Reg | 3 | 0 | 32/64b | 24-32 | 32b/64b | Workstation
TI C6000 | Reg-Reg | 3 | 0 | 32b | 32 | 32b | DSP
IBM 360 | Reg-Mem | 2 | 1 | 32b | 16 | 24/31/64 | Mainframe
x86 | Reg-Mem | 2 | 1 | 8/16/32/64b | 4/8/24 | 16/32/64 | Personal Computers (PC)
VAX | Mem-Mem | 3 | 3 | 32b | 16 | 32b | Minicomputers
Motorola 6800 | Accumulator | 1 | 1/2 | 8b | 0 | 16b | Microcontroller
MIPS64 Instructions
DATA TRANSFER INSTRUCTIONS
Instruction | Opcode/Mnemonic | Examples
Load | LB, LBU, LH, LHU, LW, LWU, LD, L.S, L.D | LD R1, 30(R2); L.S F0, 50(R3)
Store | SB, SH, SW, SD, S.S, S.D | SH R3, 502(R2); SB R2, 41(R3)
Move | MOV.S, MOV.D, MFC0, MTC0, MFC1, MTC1 | MOV.S F2, F3
● L: Load
● S: Store
● B: Byte (8b), H: Half Word (16b), W: Word (32b)
● U: Unsigned (as in LBU, LHU, LWU); Upper in LUI
● I: Immediate
MIPS64 Instructions
ARITHMETIC/LOGICAL INSTRUCTIONS
Multiply Accumulate, Logical and Arithmetic Shift, Set less than…
DADD, DADDI, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU
AND, OR, XOR, ANDI, ORI, XORI
LUI
DSLL, DSRL, DSRA, DSLLV
SLT, SLTI, SLTU
DADDU R1, R2, R3
LUI R1, #43
SLT R1, R2, R3
[Figure: register contents after LUI R1, #43: the immediate 43 in the upper-halfword field, zeros in the remaining bits]
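The effect of LUI can be sketched in Python. This sketch assumes the MIPS64 rule that the 16-bit immediate is placed in bits 31 to 16, the low 16 bits are cleared, and the 32-bit result is sign-extended into the 64-bit register; the function name `lui` is a hypothetical helper.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def lui(imm16):
    """Value loaded into a 64-bit register by LUI with the given immediate."""
    value = (imm16 & 0xFFFF) << 16               # immediate in bits 31..16, low bits zero
    value = (value ^ 0x80000000) - 0x80000000    # sign-extend the 32-bit result
    return value & MASK64

r1 = lui(43)   # LUI R1, #43
```

For an immediate such as 43, whose top bit is clear, every bit outside the 16-bit field is zero, which matches the figure.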
MIPS64 Instructions
CONTROL INSTRUCTIONS
Branch, Jump, Control transfer
BEQZ, BNEZ
BEQ, BNE
MOVN, MOVZ
J, JR
JAL, JALR
ERET
BEQ R1, R2, label
MOVZ R1, R2, R3
J label
MIPS64 Instructions
FLOATING POINT
FP Arithmetic
ADD.D, ADD.S, ADD.PS
SUB.D, SUB.S, SUB.PS
MUL.D, MUL.S, MUL.PS
DIV.D, DIV.S, DIV.PS
CVT.D.S, CVT.D.L, CVT.D.W, CVT.S._.
C.LT.D, C.GT.D, C.LE.D, C.GE.D, C.EQ.D, C.NE.D, C.__.S