
December 4, 2003 Ilhyun Kim -- MICRO-36 Slide 1 of 23

Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Ilhyun Kim, Mikko H. Lipasti
PHARM Team
University of Wisconsin-Madison


It’s all about granularity
• Instruction-centric hardware design
  • HW structures are built to match an instruction’s specifications
  • Controls occur at every instruction boundary
• Instruction granularity may impose constraints on the hardware design space
• Relaxing the constraints at different processing granularities
[Figure: processing granularity axis from finer to coarser. Operand: half-price architecture (ISCA03). Instruction: conventional. Macro-op: coarser-granular architecture.]


Outline

• Scheduling loop constraints
• Overview of coarser-grained scheduling
• Macro-op scheduling implementation
• Performance evaluation
• Conclusions & future work


Scheduling loop constraints
• Loops in out-of-order execution
  [Figure: pipeline Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit with three loops marked: scheduling loop (wakeup / select), Exe loop (bypass), load latency resolution loop]
• Scheduling atomicity (wakeup / select within a single cycle)
  • Essential for back-to-back instruction execution
  • Hard to pipeline in conventional designs
• Poor scalability
  • Extractable ILP is a function of window size
  • Complexity increases exponentially as the size grows
  • Increasing pressure due to deeper pipelining and slower memory system


Related Work
• Scheduling atomicity
  • Speculation & pipelining: Grandparent scheduling [Stark], Select-free scheduling [Brown]
• Poor scalability
  • Low complexity scheduling logic
    • FIFO style window [Palacharla, H.Kim]
    • Data-flow based window [Canal, Michaud, Raasch …]
  • Judicious window scaling: Segmented windows [Hrishikesh], WIB [Lebeck] …
  • Issue queue entry sharing: AMD K7 (MOP), Intel Pentium M (uops fusion)
• Still based on instruction-centric scheduler designs
  • Making a scheduling decision at every instruction boundary
  • Overcoming atomicity and scalability in isolation


Source of the atomicity constraint
• Minimal execution latency of instruction
  • Many ALU operations have single-cycle latency
  • Schedule should keep up with execution
  • 1-cycle instructions need 1-cycle scheduling
• Multi-cycle operations do not need atomic scheduling
• Relax the constraints by increasing the size of scheduling unit
  • Combine multiple instructions into a multi-cycle latency unit
  • Scheduling decisions occur at multiple instruction boundaries
  • Attack both atomicity and scalability constraints


Macro-op scheduling overview

[Figure: pipeline overview. Stage groups: Fetch / Decode / Rename, Queue, Scheduling, RF / EXE / MEM / WB / Commit. Blocks: I-cache, Fetch, MOP detection, MOP pointers, Rename, MOP formation, Issue queue insert, Wakeup, Select, Payload RAM, Disp, RF, EXE, MEM (cache ports), WB, Commit. Labels: dependence information, wakeup order information, pipelined scheduling, sequencing instructions. Granularity: instruction-grained before and after the scheduler, coarser MOP-grained in the scheduler.]


MOP scheduling (2x) example
• Pipelined instruction scheduling of multi-cycle MOPs
  • Still issues original instructions consecutively
• Larger instruction window
  • Multiple original instructions logically share a single issue queue entry
[Figure: a 16-instruction dependence graph scheduled two ways, with select / wakeup marked at cycles n and n+1. Instruction-grained scheduling: 9 cycles, 16 queue entries. Macro-op (MOP) scheduling: 10 cycles, 9 queue entries.]





Issues in grouping instructions
• Candidate instructions
  • Single-cycle instructions: integer ALU, control, store address generation (agen) operations
  • Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
• The number of source operands
  • Grouping two dependent instructions yields up to 3 source operands
  • Allow up to 2 source operands (conventional) / no restriction (wired-OR)
• MOP size
  • Bigger MOP sizes may be more beneficial
  • 2 instructions in this study
• MOP formation scope
  • Instructions are processed in order before being inserted into the issue queue
  • Candidate instructions need to be captured within a reasonable scope
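The grouping constraints above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `Inst` record, the `op_class` names, and the `can_group` function are hypothetical, register names stand for architected registers, and the sketch ignores any redefinition of the head's destination between head and tail. It covers dependent pairs only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Candidate classes named on the slide; the string labels are hypothetical.
SINGLE_CYCLE_OPS = {"alu", "control", "store_agen"}

@dataclass(frozen=True)
class Inst:
    index: int              # position in program order
    op_class: str           # e.g. "alu", "load"
    dest: Optional[str]     # destination register, None if no value is produced
    srcs: Tuple[str, ...]   # source registers

def external_sources(head: Inst, tail: Inst) -> set:
    """Source operands the grouped pair needs from outside the pair."""
    return set(head.srcs) | {s for s in tail.srcs if s != head.dest}

def can_group(head: Inst, tail: Inst, max_srcs: Optional[int] = 2, scope: int = 8) -> bool:
    """max_srcs=2 models the conventional case, None models wired-OR (no restriction)."""
    if head.op_class not in SINGLE_CYCLE_OPS or tail.op_class not in SINGLE_CYCLE_OPS:
        return False                      # only single-cycle instructions are candidates
    if head.dest is None or head.dest not in tail.srcs:
        return False                      # tail must depend on the head's value
    if not 0 < tail.index - head.index < scope:
        return False                      # both must fall inside the formation scope
    return max_srcs is None or len(external_sources(head, tail)) <= max_srcs

head = Inst(0, "alu", "r1", ("r2", "r3"))
tail = Inst(3, "alu", "r5", ("r1", "r4"))
print(sorted(external_sources(head, tail)))   # three external sources: r2, r3, r4
print(can_group(head, tail))                  # rejected with a 2-source limit
print(can_group(head, tail, max_srcs=None))   # accepted with no restriction
```

A pair whose head and tail each read two registers ends up with three external sources, which is why the 2-source configuration rejects some pairs that the wired-OR configuration accepts.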


Dependence edge distance (instruction count)
• 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
• An 8-instruction scope captures many dependent pairs
  • Variability in distances (e.g. gap vs. vortex): remember this
• Our configuration: grouping 2 single-cycle instructions within an 8-instruction scope
[Figure: MOP potential. Stacked bars per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr); y-axis: total value-generating candidate instructions committed (0% to 100%); categories: 1~3 instructions, 4~7 instructions, 8+ instructions, not MOP candidate, dynamically dead; the 8-instruction scope is marked. Bar annotations (% total insts, in extraction order): 49.2, 50.9, 27.8, 48.7, 37.4, 56.3, 40.2, 47.5, 42.7, 47.7, 37.6, 44.7.]


MOP detection
• Finds groupable instruction pairs
  • Dependence matrix-based detection (detailed in the paper)
  • Performance is insensitive to detection latency (pointers reused repeatedly)
  • A pessimistic 100-cycle latency loses 0.22% of IPC
• Generates MOP pointers
  • 4 bits per instruction, stored in the L1 instruction cache (IL1)
  • A MOP pointer represents a groupable instruction pair
[Figure: pipeline diagram with MOP detection highlighted, producing MOP pointers]


MOP detection: Avoiding cycle conditions
• Cycle condition examples (leading to deadlocks)
  [Figure: two example dependence graphs, one over instructions 1 to 3 and one over instructions 1 to 4, in which grouping would create a cycle]
• Conservative cycle detection heuristic
  • Precise detection is hard (multiple levels of dep tracking)
  • Assume a cycle if both outgoing and incoming edges are detected
  • Captures over 90% of MOP opportunities (compared to the precise detection)
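The difference between precise and conservative cycle detection can be sketched as follows. This is one reading of the heuristic, not the detection logic from the paper: it assumes "outgoing edge" means an edge from the head to an instruction between head and tail, and "incoming edge" means an edge into the tail from an instruction between head and tail. The function names are hypothetical, instructions are numbered in program order, and only cycles around a single candidate pair are modeled.

```python
def creates_cycle_precise(edges, head, tail):
    """True if some dependence path head -> ... -> tail goes through another instruction."""
    succs = {}
    for src, dst in edges:
        succs.setdefault(src, set()).add(dst)
    # Start from the head's consumers, skipping the direct head -> tail edge.
    stack = [n for n in succs.get(head, ()) if n != tail]
    seen = set()
    while stack:
        node = stack.pop()
        if node == tail:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succs.get(node, ()))
    return False

def creates_cycle_conservative(edges, head, tail):
    """Assume a cycle whenever both edge kinds exist, without tracking the path."""
    head_out = any(s == head and head < d < tail for s, d in edges)
    tail_in = any(d == tail and head < s < tail for s, d in edges)
    return head_out and tail_in

# 1 -> 2 -> 3 plus 1 -> 3: grouping (1, 3) would make the MOP wait on 2, which waits on the MOP.
real_cycle = {(1, 2), (2, 3), (1, 3)}
print(creates_cycle_precise(real_cycle, 1, 3), creates_cycle_conservative(real_cycle, 1, 3))

# 1 -> 2, 1 -> 4, 3 -> 4: no path 1 -> ... -> 4 through 2 or 3, yet both edge kinds exist.
no_cycle = {(1, 2), (1, 4), (3, 4)}
print(creates_cycle_precise(no_cycle, 1, 4), creates_cycle_conservative(no_cycle, 1, 4))
```

Under these assumptions the conservative check never misses a cycle that the path search finds, but it also rejects some safe pairs, which is consistent with the slide's note that it captures most, not all, of the opportunities.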


MOP formation
• Locates MOP pairs using MOP pointers
  • MOP pointers are fetched along with instructions
• Converts register dependences to MOP dependences
  • Architected register IDs → MOP IDs
  • Identical to register renaming, except that it assigns a single ID to two groupable instructions
  • Reflects the fact that two instructions are grouped into one scheduling unit
  • Two instructions are later inserted into one issue entry
[Figure: pipeline diagram with MOP formation highlighted next to Rename, forming MOPs]
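The translation from register dependences to MOP dependences can be illustrated with a rename-style table. This is a minimal sketch, not the hardware structure: `translate_to_mop_ids`, its input encoding, and the register names are hypothetical, and the only point it demonstrates is that a tail reuses its head's ID so that consumers of either instruction depend on the same scheduling unit.

```python
def translate_to_mop_ids(instructions, pairs):
    """instructions: list of (dest, srcs) in program order.
    pairs: {head_index: tail_index} for grouped instructions.
    Returns (MOP ID per instruction, set of producer MOP IDs per instruction).
    """
    tail_to_head = {tail: head for head, tail in pairs.items()}
    table = {}       # logical register -> MOP ID of its latest producer
    mop_id_of = {}   # instruction index -> MOP ID
    deps = []
    next_id = 0
    for i, (dest, srcs) in enumerate(instructions):
        if i in tail_to_head:
            mop_id = mop_id_of[tail_to_head[i]]   # tail shares the head's ID
        else:
            mop_id = next_id                      # ungrouped instruction or head
            next_id += 1
        mop_id_of[i] = mop_id
        # Look up sources like register renaming; an edge inside the MOP disappears.
        deps.append({table[s] for s in srcs if s in table} - {mop_id})
        if dest is not None:
            table[dest] = mop_id
    return mop_id_of, deps

program = [
    ("r1", ()),        # I1
    ("r2", ("r1",)),   # I2 depends on I1
    ("r3", ("r1",)),   # I3 depends on I1
    ("r4", ("r3",)),   # I4 depends on I3
]
ids, deps = translate_to_mop_ids(program, pairs={0: 2})   # group I1 and I3
print(ids)    # I1 and I3 share one ID
print(deps)   # I2 and I4 both depend on that shared ID
```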


Scheduling MOPs
• Instructions in a MOP are scheduled as a single unit
  • A MOP is a non-pipelined, 2-cycle operation from the scheduler’s perspective
  • Issued when all source operands are ready, incurs one tag broadcast
• Wakeup / select timings

cycle | Atomic scheduling       | 2-cycle scheduling | 2-cycle MOP scheduling
n     | select 1, wakeup 2, 3   | select 1           | select MOP(1, 3)
n+1   | select 2, 3, wakeup 4   | wakeup 2, 3        | wakeup 2, 4
n+2   | select 4                | select 2, 3        | select 2, 4
n+3   |                         | wakeup 4           |
n+4   |                         | select 4           |

[Figure: dependence graph over instructions 1 to 4, redrawn for each scheme; in the MOP case instructions 1 and 3 are grouped. The pipeline diagram highlights Wakeup / Select.]
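The three timing columns can be reproduced with a toy model of the scheduling loop. The sketch below is a simplification under stated assumptions: the dependence edges (1→2, 1→3, 3→4) are inferred from the timings rather than given explicitly, `select_cycles` is a hypothetical helper, there are no issue-width or resource limits, and a consumer is assumed selectable `max(scheduling loop latency, producer unit size)` cycles after its producer unit is selected.

```python
def select_cycles(units, deps, sched_latency):
    """units: tuples of instruction ids, listed so producers precede consumers.
    deps: set of (producer, consumer) instruction pairs.
    sched_latency: 1 for atomic wakeup/select, 2 for pipelined scheduling.
    Returns {unit: select cycle relative to cycle n}.
    """
    unit_of = {inst: u for u, insts in enumerate(units) for inst in insts}
    select = {}
    for u, insts in enumerate(units):
        earliest = 0
        for producer, consumer in deps:
            p = unit_of[producer]
            if consumer in insts and p != u:
                # One tag broadcast per unit: consumers wait for the slower of the
                # scheduling loop and the unit's occupancy (2 cycles for a 2x MOP).
                earliest = max(earliest, select[p] + max(sched_latency, len(units[p])))
        select[u] = earliest
    return {insts: select[u] for u, insts in enumerate(units)}

deps = {(1, 2), (1, 3), (3, 4)}   # assumed dependence graph
singles = [(1,), (2,), (3,), (4,)]
print(select_cycles(singles, deps, sched_latency=1))               # atomic
print(select_cycles(singles, deps, sched_latency=2))               # 2-cycle
print(select_cycles([(1, 3), (2,), (4,)], deps, sched_latency=2))  # 2-cycle MOP
```

In this model instruction 4 is selected at n+2 under both atomic and MOP scheduling but at n+4 under plain 2-cycle scheduling, while instruction 2 slips from n+1 to n+2 because the MOP broadcasts its tag only once.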


Sequencing instructions
• A MOP is converted back to two original instructions
  • The dual-entry payload RAM sends two original instructions
  • Original instructions are sequentially executed within 2 cycles
  • Register values are accessed using physical register IDs
• ROB separately commits original instructions in order
  • MOPs do not affect precise exception or branch misprediction recovery
[Figure: pipeline diagram with the Payload RAM highlighted: sequence original insts]





Machine parameters
• Simplescalar-Alpha-based 4-wide OoO + speculative scheduling w/ selective replay, 14 stages
• Ideally pipelined scheduler: conceptually equivalent to atomic scheduling + 1 extra stage
• 128 ROB, unrestricted / 32-entry issue queue
• 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
• Combined branch prediction, fetch until the first taken branch
• MOP scheduling
  • 2-cycle (pipelined) scheduling + 2X MOP technique
  • 2 (conventional) or 3 (wired-OR) source operands
  • MOP detection scope: 2 cycles (4-wide X 2-cycle = up to 8 insts)
• Spec2k INT, reduced input sets
  • Reference input sets for crafty, eon, gap (up to 3B instructions)


Number of grouped instructions
• 28~46% of total instructions are grouped
  • 14~23% reduction in the instruction count in the scheduler
• Dependent MOP cases enable consecutive issue of dependent instructions
[Figure: stacked bars per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr), one bar each for the 2-src and 3-src configurations; y-axis: total dynamic instructions committed (0% to 100%); categories: not MOP candidate, MOP candidate but not grouped, independent MOP, MOP]
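The step from the grouped fraction to the scheduler reduction is simple arithmetic: each 2-instruction MOP occupies one entry instead of two, so half of the grouped instructions disappear from the scheduler's point of view. A tiny check, with `queue_entry_reduction` as a hypothetical helper name:

```python
def queue_entry_reduction(grouped_fraction, mop_size=2):
    """Fraction of scheduler entries saved when grouped instructions share entries."""
    return grouped_fraction * (1 - 1 / mop_size)

for grouped in (0.28, 0.46):
    print(f"{grouped:.0%} grouped -> {queue_entry_reduction(grouped):.0%} fewer entries")
```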


MOP scheduling performance (relaxed atomicity constraint only)
• Up to ~19% of IPC loss in 2-cycle scheduling
• MOP scheduling restores performance
  • Enables consecutive issue of dependent instructions
  • 97.2% of atomic scheduling performance on average
[Figure: IPC normalized to base scheduling (y-axis 0.8 to 1.05) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr) for 2-cycle, MOP-2src, and MOP-3src; unrestricted IQ / 128 ROB]


Insight into MOP scheduling
• Performance loss of 2-cycle scheduling
  • Correlated to dependence edge distance
  • Short dependence edges (e.g. gap)
    • Instruction window is filled up with chains of dependent instructions
    • 2-cycle scheduler cannot find plenty of ready instructions to issue
• MOP scheduling captures short-distance dependent instruction pairs
  • They are the important ones
  • Low MOP coverage due to long dependence edges does not matter
    • 2-cycle scheduler can find many instructions to issue (e.g. vortex)
• MOP scheduling complements 2-cycle scheduling
  • Overall performance is less sensitive to code layout


MOP scheduling performance (relaxed atomicity + scalability constraints)
• Benefits from both relaxed atomicity and scalability constraints
  • Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
[Figure: IPC normalized to base scheduling (y-axis 0.8 to 1.05) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr) for 2-cycle, MOP-2src, and MOP-3src; 32 IQ / 128 ROB]


Conclusions & Future work
• Changing processing granularity can relax the constraints imposed by instruction-centric designs
• Constraints in instruction scheduling loop
  • Scheduling atomicity, poor scalability
• Macro-op scheduling relaxes both constraints at a coarser granularity
  • Pipelined, 2-cycle macro-op scheduling can perform comparably to or even better than atomic scheduling
• Potential for narrow-bandwidth microarchitecture
  • Extending the MOP idea to the whole pipeline (Disp, RF, bypass)
  • e.g. achieving 4-wide machine performance using 2-wide bandwidth


Questions??


Select-free (Brown et al.) vs. MOP scheduling
• 4.1% better IPC on average over select-free-scoreboard (best 8.3%)
  • Select-free cannot outperform the atomic scheduling
• Select-free scheduling is speculative and requires recovery operations
  • MOP scheduling is non-speculative, leading to many advantages
[Figure: IPC normalized to base scheduling (y-axis 0.8 to 1.05) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr) for Select-free-squash-dep, Select-free-scoreboard, and MOP-wiredOR; 32 IQ / 128 ROB, no extra stage for MOP formation]


MOP detection: MOP pointer generation
• Finding dependent pairs
  • Dependence matrix-based detection (detailed in MICRO paper)
  • Insensitive to detection latency (pointers reused repeatedly)
    • A pessimistic 100-cycle latency loses 0.22% of IPC
  • Similar to instruction preprocessing in trace cache lines
• MOP pointers (4 bits per instruction)
  • Control bit (1): captures up to 1 control discontinuity
  • Offset bits (3): instruction count from head to tail

MOP pointer (control, offset) : instruction
0 011 : add r1 ← r2, r3
0 000 : lw  r4 ← 0(r3)
1 010 : and r5 ← r4, r2
0 000 : bez r1, 0xff (taken)
0 000 : sub r6 ← r5, 1
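The 4-bit pointer layout can be illustrated with a short encode/decode sketch. The helper names are hypothetical, and two details are assumptions read off the example rather than stated on the slide: the control bit is taken to be the most significant bit (matching the "control offset" order), and an offset of 0 is taken to mean "no MOP pointer".

```python
def encode_mop_pointer(offset, crosses_branch=False):
    """Pack a 1-bit control flag and a 3-bit head-to-tail offset into 4 bits."""
    if not 0 <= offset <= 7:
        raise ValueError("offset must fit in 3 bits")
    return (int(crosses_branch) << 3) | offset

def decode_mop_pointer(pointer):
    """Return (control bit, offset) from a 4-bit pointer."""
    return bool((pointer >> 3) & 1), pointer & 0b111

def tail_index(head_index, pointer):
    """Index of the tail in the fetched instruction stream, or None if not a head."""
    _, offset = decode_mop_pointer(pointer)
    return head_index + offset if offset else None

# Pointers from the example: add (index 0) -> bez (index 3), and (index 2) -> sub (index 4).
pointers = [0b0011, 0b0000, 0b1010, 0b0000, 0b0000]
for index, pointer in enumerate(pointers):
    print(index, format(pointer, "04b"), decode_mop_pointer(pointer), tail_index(index, pointer))
```

In the example the `and` pointer has its control bit set because the pair spans the taken `bez`, while the `add` pointer reaches the `bez` itself without crossing a control discontinuity.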


MOP formation: MOP dependence translation
• Assigns a single ID to two MOPable instructions, reflecting the fact that two instructions are grouped into one unit
• The process and required structure are identical to register renaming
• Register values are still accessed based on original register IDs
[Figure: register rename table (logical reg ID → physical reg ID) beside the MOP translation table (logical reg ID → MOP ID) for instructions I1 to I4. Labels in the rename case: p3, p4, p5, p6, p7, p8. Labels in the MOP case: m3, m4, m5, m5, m6, m6. Annotation: a single MOP ID is allocated to two grouped instructions.]


Inserting MOPs into issue queue
• Inserting instructions across different groups
[Figure: issue queue contents over cycles n, n+1, and n+2 as instruction groups 1 to 4 and 5 to 8 are inserted, with some instructions held as pending across cycles. Legend: MOP pointer. The pipeline diagram highlights Issue queue insert.]


Performance considerations
• Independent MOPs
  • Group independent instructions with the same source dependences
  • No direct performance benefit but reduce queue contention
• Last-arriving operands in tail instructions
  • Unnecessarily delays head instructions
  • MOP detection logic filters out harmful grouping
  • Create an alternative pair if any
[Figure: two timing examples over instructions 1 to 3, annotated CLK 10, CLK 15, CLK 19, CLK 17 and CLK 10, CLK 15, CLK 17, CLK 12]


[Figure: dependence matrix-based MOP detection in three steps (STEP 1 at clk n, STEP 2 at clk n+1, STEP 3 at clk n+2) over an original data dependence graph of instructions 1 to 12. Labels: head, tail, possible cycle detected, priority decoder picks one, not groupable, MOP pointer detected after step 3. Pairs marked as MOP pointers: 3:5, 7:8, 9:10, 11:12.]
