lecture 6: superscalar decode and other pipelining

44
Advanced Microarchitecture Lecture 6: Superscalar Decode and Other Pipelining

Upload: verity-geraldine-thompson

Post on 24-Dec-2015

234 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: Lecture 6: Superscalar Decode and Other Pipelining

Advanced MicroarchitectureLecture 6: Superscalar Decode and Other Pipelining

Page 2: Lecture 6: Superscalar Decode and Other Pipelining

2

RISC ISA Format• This should be review…

– Fixed-length• MIPS all insts are 32-bits/4 bytes

– Few formats• MIPS has 3: R-, I-, J- formats• Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP

– Regularity across formats (when possible/practical)• MIPS, Alpha opcode in same bit-position for all formats• MIPS rs & rt fields in same bit-position for I- and J-

formats• Alpha ra/fa field in same bit-position for all 5 formats

Lecture 6: Superscalar Decode and Other Pipelining

Page 3: Lecture 6: Superscalar Decode and Other Pipelining

3

RISC Decode (MIPS)

000 001 010 011 100 101 110 111000 func rt j jal beq bne blez bgtz

001 addiaddi

uslti sltiu andi ori xori lui

010 rs rs rs rs011100 lb lh lwl lw lbu lhu lwr101 sb sh swl sw swr110 lwc0 lwc1 lwc2 lwc3111 swc0 swc1 swc2 swc3

Lecture 6: Superscalar Decode and Other Pipelining

opcode6

other21

func5

R-format only

op

code[5

,3]

opcode[2,0]

001xxx = Immediate

1xxxxx = Memory(1x0: LD, 1x1: ST)

000xxx = Br/Jump(except for 000000)

Page 4: Lecture 6: Superscalar Decode and Other Pipelining

6

Superscalar Decode for RISC ISAs• To sustain X instructions per cycle, must

decode X instructions per cycle– Just duplicate the hardware

Lecture 6: Superscalar Decode and Other Pipelining

32-bit inst

Decoder

decodedinst

scalar

Decoder Decoder Decoder

32-bit inst

Decoder

decodedinst

superscalar

4-wide superscalar fetch

32-bit inst32-bit inst32-bit inst

decodedinst

decodedinst

decodedinst

1-Fetch

Page 5: Lecture 6: Superscalar Decode and Other Pipelining

7

VLIW/EPIC ISAs• Compiler finds the parallelism, packs

multiple instructions into a “very long” instruction

Lecture 6: Superscalar Decode and Other Pipelining

Add

Load

Branch

Sub

Store

Xor

RISC

IMB Add Load Branch

IMI Sub Store Xor

VLIW (EPIC-like)

“template” bits

Page 6: Lecture 6: Superscalar Decode and Other Pipelining

8

VLIW Decoder• Similar to superscalar RISC decoder

Lecture 6: Superscalar Decode and Other Pipelining

Decoder

inst0

TemplateDecoder

tmplt

Decoder Decoder

inst1 inst2

decodedinst

decodedinst

decodedinst

Page 7: Lecture 6: Superscalar Decode and Other Pipelining

9

CISC ISA• RISC focus on fast access to information

– easy decode, I$, large RF’s, D$• CISCs are older

– designed in era with fewer transistors, chips– each memory access very expensive

• pack as much work into as few bytes as possible• more “expressive” instructions

– compare to simple RISC insts– better potential code generation– more complex code generation in practice

Lecture 6: Superscalar Decode and Other Pipelining

Page 8: Lecture 6: Superscalar Decode and Other Pipelining

10

Example: VAX• Superset of ISAs, incl. IBM360, DEC PDP-11• VAX = “Virtual Address Extension”• 16 32-bit registers• 32-bit memory addressing• Encoding:

– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes

– Opcode implies datatype, size, # operands– Orthogonality: any opcode with any addressing

mode

Lecture 6: Superscalar Decode and Other Pipelining

Page 9: Lecture 6: Superscalar Decode and Other Pipelining

11

VAX operand addressingMode Example Meaning

Register Add R4, R3 R4 = R4 + R3

Immediate Add R4, #3 R4 = R4 + 3

Displacement Add R4, 100(R1) R4 = R4 + Mem[100+R1]

Register Indirect

Add R4, (R1) R4 = R4 + Mem[R1]

Indexed/Base Add R3, (R1+R2) R3 = R3 + Mem[R1+R2]

Direct/Absolute Add R1, (1234) R1 = R1 + Mem[1234]

Memory Indirect

Add R1, @(R3) R1 = R1 + Mem[Mem[R3]]

Auto-Increment Add R1,(R2)+ R1 = R1 + Mem[R2]; R2++

Auto-Decrement

Add R1, -(R2) R2--; R1 = R1 + Mem[R2]Lecture 6: Superscalar Decode and Other Pipelining

Any mode could be applied to any instruction!

Page 10: Lecture 6: Superscalar Decode and Other Pipelining

12

x86• CISC, stemming from the original 4004• Example: “Move” instructions

1. General Purpose data movement– RR, MR, RM, IR, IM

2. Exchanges– EAX ↔ ECX, byte order within a register

3. Stack Manipulation– push pop R ↔ Stack, PUSHA/POPA

4. Type Conversion5. Conditional Moves

Lecture 6: Superscalar Decode and Other Pipelining

Many ways to do the same/similar operation

Page 11: Lecture 6: Superscalar Decode and Other Pipelining

13

x86 Encoding• Basic x86 Instruction:

Lecture 6: Superscalar Decode and Other Pipelining

Prefixes0-4 bytes

Opcode1-2 bytes

Mod R/M0-1 bytes

SIB0-1 bytes

Displacement0/1/2/4 bytes

Immediate0/1/2/4 bytes

Longest Inst 15 bytesShortest Inst: 1 byte

• Opcode specifies operation, and if the Mod R/M byte is used– Most instructions use the Mod R/M byte– Mod R/M specifies if optional SIB byte is

used– Mod R/M and SIB may specify additional

constants

Page 12: Lecture 6: Superscalar Decode and Other Pipelining

14

Mod R/M Byte

• Mode = 00: No-displacement, use Mem[ regmmm ]

• Mode = 01: 8-bit displacement, Mem[ regmmm + SExt(disp) ]

• Mode = 10: 32-bit displacement (similar to previous)

• Mode = 11: Register-to-Register, use regmmm

Lecture 6: Superscalar Decode and Other Pipelining

M M r r r m m m

Mode Register R/M

1 of 8 registers

Add EBX, ECX 11 011 001Mod R/M

Add EBX, [ECX] 00 011 001Mod R/M

Page 13: Lecture 6: Superscalar Decode and Other Pipelining

15

Exceptions• Mod=00, R/M = 5 get operand from 32-

bit immediate– Add EDX =

EDX+Mem[0cff1234]• Mod=00, 01 or 10, R/M = 4 use the “SIB”

byte– SIB = Scale/Index/Base

Lecture 6: Superscalar Decode and Other Pipelining

000101010cff1234

00010100

Mod R/M

ss iii bbb

SIBbbb 5: use regbbb

bbb = 5: use 32-bit imm(Mod = 00 only)

iii 4: use siiii = 4: use 0

si = regiii << ss

Page 14: Lecture 6: Superscalar Decode and Other Pipelining

16

Opcode Confusion• There are different opcodes for AB and

BA

Lecture 6: Superscalar Decode and Other Pipelining

10001011 11000011 MOV EAX, EBX

10001001 11000011 MOV EBX, EAX

10001001 11011000 MOV EAX, EBX

• If Opcode = 0F, then use next byte as opcode

• If Opcode = D8-DF, then FP instruction

11011000 11 R/M

FP opcode

Page 15: Lecture 6: Superscalar Decode and Other Pipelining

17

x86 Decode Example

Lecture 6: Superscalar Decode and Other Pipelining

11000111

MOV regimm (use Mod R/M, 32-bit Imm to follow)

opcode10000100

Mod=2 (use 32-bit Disp)R/M = 4 (use SIB)

reg ignored

Mod R/M

11000011SIB

ss=3 Scale by 8use EAX, EBX

Disp Imm

*( (EAX<<3) + EBX + Disp ) = ImmTotal: 11 byte instruction

Note: Add 4 prefixes, andyou reach the max size

Page 16: Lecture 6: Superscalar Decode and Other Pipelining

18

In RISC (MIPS)

1. lui R1 = Disp[31:16]2. ori R1 = R1,

Disp[15:0]3. add R1 = R1 + R24. shli R3 = R3 << 35. add R3 = R3 + R16. lui R1 = Imm[31:16]7. ori R1 = R1,

Imm[15:0]8. st [R3] R1

Lecture 6: Superscalar Decode and Other Pipelining

8 instructions, 32 bits each32 bytes total

2.9x Bigger!

Page 17: Lecture 6: Superscalar Decode and Other Pipelining

19

x86-64 / EM64T• 816 general purpose registers

– only 3-bit register fields?• Registers extended from 3264 bits each• Default: instructions still 32-bit

– New “REX” prefix byte to specify additional information

Lecture 6: Superscalar Decode and Other Pipelining

0100 m R I B opcode

m=0 64-bit modem=1 32-bit mode

md rrr mrm

R rrr

ss iii bbb

iiiIbbbB

REX

Register specifiers are now 4 bitseach: can choose 1 of 16 registers

Page 18: Lecture 6: Superscalar Decode and Other Pipelining

20

64-bit Extensions to IA32

Lecture 6: Superscalar Decode and Other Pipelining

IA32+64-bit exts IA32

CPU architect

(Taken from Bob Colwell’s Eckert-MauchlyAward Talk, ISCA 2005)

Ugly? Scary?… but it works

Page 19: Lecture 6: Superscalar Decode and Other Pipelining

21

x86 Decode Hardware

Lecture 6: Superscalar Decode and Other Pipelining

PrefixDecoder

Left Shift

Num Prefixes

opcodedecoder

2nd opcodedecoder

Mod R/Mdecoder

SIBdecoder

Left Shift Left Shift

+ +

Instruction bytes

Page 20: Lecture 6: Superscalar Decode and Other Pipelining

22

Decoded x86 Format• RISC: easy to expand union of needed

info– generalized opcode (not too hard)– reg1, reg2, reg3, immediate (possibly extended)– some fields ignored

• CISC: union of all possible info is huge– generalized opcode (too many options)– up to 3 regs, 2 immediates– segment information– “rep” specifiers

• would lead to 100’s of bits• common case only needs a fraction a lot of waste

Lecture 6: Superscalar Decode and Other Pipelining

Page 21: Lecture 6: Superscalar Decode and Other Pipelining

23

x86 RISC-like mops• Each x86 instruction decoded into a variable

number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)– Each uop is RISC-like– Uops have limitations to keep union of info practical

Lecture 6: Superscalar Decode and Other Pipelining

ADD EAX, EBX ADD EAX, EBX 1 uop

ADD EAX, [EBX] Load tmp = [EBX]ADD EAX, tmp

2 uops

ADD [EAX], EBX Load tmp = [EAX]ADD tmp, EBX

STA EAXSTD tmp

4 uops

Page 22: Lecture 6: Superscalar Decode and Other Pipelining

24

uop Limits• How many uops can a decoder generate?

– For complex x86 insts, many are needed (10’s, 100’s?)

– Makes decoder horribly complex– Typically there’s a limit to keep complexity

under control• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops

• Ok, what happens if a complex instruction needs more than 4 uops?

Lecture 6: Superscalar Decode and Other Pipelining

Page 23: Lecture 6: Superscalar Decode and Other Pipelining

25

UROM/MS for Complex x86 Insts• UROM (mcode-ROM) stores the uop

equivalents for nasty x86 instructions– “Nasty” could be large/complex (> 4 uops like

PUSHA or STRREP.MOV) or obsolete instructions (AAA)

• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation

Lecture 6: Superscalar Decode and Other Pipelining

Page 24: Lecture 6: Superscalar Decode and Other Pipelining

26

UROM/MS Example (3 uop-wide)

Lecture 6: Superscalar Decode and Other Pipelining

ADD

STORE

SUB

Cycle 1

ADD

STA

STD

SUB

Cycle 2

REP.MOV

ADD [ ]

Fetch- x86 insts

Decode - uops

UROM - uops

SUB

LOAD

STORE

REP.MOV

ADD [ ]

XOR

Cycle 3

INC

mJCC

LOAD

Cycle 4

STORE

INC

mJCC

Cycle 5

Cycle …

mJCCREP.MOV

ADD [ ]

XOR

LOAD

Cycle n Cycle n+1

ADD

Complex instructions, getuops from mcode sequencer

Page 25: Lecture 6: Superscalar Decode and Other Pipelining

27

Superscalar CISC Decode

1. Instruction Length Decode (ILD)• Where are the instructions?

• Limited decode – just enough to parse prefixes, modes

2. Shift/Alignment• Get the right bytes to the decoders

3. Decode• Crack into uops

Lecture 6: Superscalar Decode and Other Pipelining

And then do this for N instructions per cycle!

Page 26: Lecture 6: Superscalar Decode and Other Pipelining

28

ILD Recurrence/Loop• PCi = X

• PCi+1= PCi + sizeof( Mem[PCi] )

• PCi+2= PCi+1 + sizeof( Mem[PCi+1] )

= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can’t find start of next instruction without decoding the first

• Critical loop not pipelineable– ILD of 4 instructions per cycle imples that clock cycle

time will be 4 x latency(ILD)

Lecture 6: Superscalar Decode and Other Pipelining

Page 27: Lecture 6: Superscalar Decode and Other Pipelining

29

Decode Implementation

Lecture 6: Superscalar Decode and Other Pipelining

Left Shifter

Decoder 3

Cycl

e 3

Decoder 2Decoder 1

ILD dominatescycle time; not

scalable

Instruction Bytes (ex. 16 bytes)

ILD (limited decode)Length 1

Length 2

+

Cycl

e 1

Inst 1 Inst 2 Inst 3 Remainder

Cycl

e 2

+

Length 3

bytesdecoded

ILD (limited decode)

ILD (limited decode)

Page 28: Lecture 6: Superscalar Decode and Other Pipelining

30

Hardware-Intensive Decode

Lecture 6: Superscalar Decode and Other Pipelining

Decode from everypossible instructionstarting point!

Giant MUXes toselect instructionbytes

ILD

DecoderIL

D

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

DecoderDecoder

Page 29: Lecture 6: Superscalar Decode and Other Pipelining

31

ILD in Hardware-Intensive Approach

Lecture 6: Superscalar Decode and Other Pipelining

6 bytes

4 bytes

3 bytes+

Total bytes decode = 11Previous: 3 ILD + 2add

Now: 1ILD + 2(mux+add)

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Page 30: Lecture 6: Superscalar Decode and Other Pipelining

32

Predecoding• ILD loop is hardware intensive, impacts

latency, and can consume substantial power

• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!– cache the ILD work– do once, reuse many times

Lecture 6: Superscalar Decode and Other Pipelining

Page 31: Lecture 6: Superscalar Decode and Other Pipelining

33

Decoder Example: AMD K5

Lecture 6: Superscalar Decode and Other Pipelining

Predecode Logic

From Memory

b0 b1 b2 … b78 bytes

I$

b0 b1 b2 … b7

8 bits

+5 bits

13 bytes

Decode

16 (8-bit inst + 5-bit predecode)

8 (8-bit inst + 5-bit predecode)

Up to 4 ROPs

Page 32: Lecture 6: Superscalar Decode and Other Pipelining

34

Decoder Example: AMD K5• Predecode information makes decode

easier– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations

• Power/performance tradeoffs– Larger I$ (increase data by 62.5%)

• Longer I$ latency, More I$ power consumption– Remove logic from decode

• Shorter branch mispred penalty, simpler logic• Cache and reused decode work less decode power

– Longer effective IL1 miss latency

Lecture 6: Superscalar Decode and Other Pipelining

Page 33: Lecture 6: Superscalar Decode and Other Pipelining

35

Limits on Decode• Max branches (color allocation)• Taken branches• Incomplete instructions

– x86 insts are not aligned, may span two cache lines

– can’t decode until both halves have been fetched

• Instruction complexity– decoding “complex” (2-4 uop) instructions

requires a more complex decoder; expensive to replicate

– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings

Lecture 6: Superscalar Decode and Other Pipelining

Page 34: Lecture 6: Superscalar Decode and Other Pipelining

36

Decoder Example: Intel P-Pro

Lecture 6: Superscalar Decode and Other Pipelining

16 Raw Instruction Bytes

Decoder0

Decoder1

Decoder2

BranchAddress

Calculator

Fetch resteerif needed

mROM

4 uops 1 uop 1 uop

If instruction in Decoder 1 or 2 requires > 1 uop, do not generateany output, and then shift to Decoder to the left on next cycle

Only Decoder 0 can interface with the uROM and MS

Page 35: Lecture 6: Superscalar Decode and Other Pipelining

37

Decoder Example: Intel P4

Lecture 6: Superscalar Decode and Other Pipelining

L2 Cache

Raw instruction bytes

4-uop DecoderuROM

trace const. buffer

Trace Cache

Decode at mostone inst per cycle

Fetch up to 3 uopsper cycle

P4 has a strangled front-end, at best it can only deliver 3 uops per cycle; contrast to P-Pro that can deliver up to 6 uops per cycle (if they’re 4/1/1)

More on this when westudy the P4 in detail

Page 36: Lecture 6: Superscalar Decode and Other Pipelining

38

Pipeline Control

Lecture 6: Superscalar Decode and Other Pipelining

Dispatch ROB, LSQ, RSfull, stall front-end

Disp

ROB, LSQ, RSfull, stall front-end

RenDecDecRotILDI$I$BPred

Except not everyone stalls…

This logic starting toget pretty intense

DecodeFetch

Page 37: Lecture 6: Superscalar Decode and Other Pipelining

39

Just because there’s a stall conditionsomewhere does not imply that

everybody has to stall

Non-Uniform Stall

Lecture 6: Superscalar Decode and Other Pipelining

Disp

nops due to I$ miss

ROBFull!

not full

full

full

full

full

RenDecDecRotILDI$I$BP

Page 38: Lecture 6: Superscalar Decode and Other Pipelining

40

Compressing/Serpentine Pipelines

Lecture 6: Superscalar Decode and Other Pipelining

Disp

ROBFull!

1 entryfree

RenDecDecRotILDI$I$BP

Better “flow”, but much morecomplex since need to track howmany insts can advance per stage

Page 39: Lecture 6: Superscalar Decode and Other Pipelining

41

Lots o’ Stalls• I$ miss, ITLB miss• Decoder limitations

– x86 4-1-1 limit– branch limits (max/cycle, max taken/cycle)

• Renamer – out of physical registers

Lecture 6: Superscalar Decode and Other Pipelining

Page 40: Lecture 6: Superscalar Decode and Other Pipelining

42

Smaller Control Domains• Separate long pipeline into multiple smaller

pipelines

Lecture 6: Superscalar Decode and Other Pipelining

BPred I$ I$ Dec Dec Dec Ren Alloc Sched

BPred I$ I$ Dec Dec Dec

Ren Alloc Sched

Page 41: Lecture 6: Superscalar Decode and Other Pipelining

43

Smaller Control Domains (2)

• Non-decoupled pipe needed logic to simultaneously control ~10 stages

• Decoupled pipe needs multiple control logic circuits– each only needs to interact with ~5 stages (~3

real stages, plus the queue ahead and behind)

Lecture 6: Superscalar Decode and Other Pipelining

Pipeline ControlLogic

Non-decoupled

I$ Dec Dec Dec Ren

Pipeline ControlLogic

Decoupled

XX

No direct control logicfor stages outside of

local pipeline

Page 42: Lecture 6: Superscalar Decode and Other Pipelining

44

Smaller Control Domains (3)

• Queues can effectively add more pipeline stages

Lecture 6: Superscalar Decode and Other Pipelining

cycleboundary

previous stage next stage

Inter-pipe queueenqueue

logic latchdequeue

logic

previous stage next stage

• Avoid this by writing and reading in the same cycle (affects timing, complexity)

Page 43: Lecture 6: Superscalar Decode and Other Pipelining

45

Queues provide Smoothing• Approximation to serpentine pipes

(compress only at certain locations – i.e., the queues)

• Different levels of decoupling possible depending on frequency target, power, complexity tolerance

Lecture 6: Superscalar Decode and Other Pipelining

BPred I$ I$ Dec Ren Sched

The “SimpleScalar” pipeline

Note: RS is effectivelya queue (more later)

Page 44: Lecture 6: Superscalar Decode and Other Pipelining

46

Different Clocking Domains• Decoupling the pipe allows each segment

to operate independently (local control)• Also means each can run at different

speeds (P4)

Lecture 6: Superscalar Decode and Other Pipelining

TC TC Dec … Alloc

Sched … WB

(ROB)

Commit

(IAQ)

(uopQ)

1x freq(3 uops/Mclk)½x freq

(6 uops/2Mclk’s 3 uops/Mclk)

2x freq(2 uops/Fclk 4 uops/Mclk)

1x freq(3 uops/Mclk)