smt/vliw/epic, statically scheduled...

72
SMT/VLIW/EPIC, Statically Scheduled ILP Adapted from UCB cs252 Sp 2007 lectures CprE 581 Computer Systems Architecture Text: Appendix G

Upload: others

Post on 18-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

SMT/VLIW/EPIC,Statically Scheduled ILP

Adapted from UCB cs252 Sp 2007 lectures

CprE 581 Computer SystemsArchitecture

Text: Appendix G

Page 2: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

2

Recap: Advanced SuperscalarsEven simple branch prediction can be quite effectivePath-based predictors can achieve >95% accuracyBTB redirects control flow early in pipe, BHT cheaper per entry but must wait for instruction decodeBranch mispredict recovery requires snapshots of pipeline state to reduce penaltyUnified physical register file design, avoids reading data from multiple locations (ROB+arch regfile)Superscalars can rename multiple dependent instructions in one clock cycleNeed speculative store buffer to avoid waiting for stores to commit

Page 3: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

3

Fetch Decode & Rename Reorder BufferPC

BranchPrediction

Update predictors

Commit

BranchResolution

BranchUnit ALU

Reg. File

MEM Store Buffer D$

Execute

killk ill

k ill k ill

Recap: Branch Predictionand Speculative Execution

Page 4: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

4

Little’s LawParallelism = Throughput * Latency

or

Latency in Cycles

Throughput per Cycle

One Operation

LTN ×=

Page 5: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

5

Example Pipelined ILP Machine

How much instruction-level parallelism (ILP) required to keep machine pipelines busy?

One Pipeline StageTwo Integer Units,

Single Cycle Latency

Two Load/Store Units,Three Cycle Latency

Two Floating-Point Units,Four Cycle Latency

Max Throughput, Six Instructions per Cycle

Latency in Cycles

6 T =322

62x4) 2x3 (2x1 L =

++= 61

322 6 N =×=

Page 6: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

6

Superscalar Control Logic Scaling

Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)For in-order machines, L is related to pipeline latenciesFor out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L

=> Out-of-order control logic grows faster than W2 (~W4)

Lifetime L

Issue Group

Previously Issued Instructions

Issue/dispatch Width W

Page 7: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

7

Out-of-Order Control Complexity:MIPS R10000

Control Logic

[ SGI/MIPS Technologies Inc., 1995 ]

Page 8: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

8

Multithreading

Difficult to continue to extract ILP from a single threadMany workloads can make use of thread-level parallelism (TLP) TLP from multiprogramming (run

independent sequential jobs) TLP from multithreaded applications (run

one job faster using parallel threads)Multithreading uses TLP to improve utilization of a single processor

Page 9: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

9

Pipeline Hazards

Each instruction may depend on the next

LW r1, 0(r2)LW r5, 12(r1)ADDI r5, r5, #12SW 12(r1), r5

F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8

F D X M WD D DF D X M WD D DF F F

F DD D DF F F

t9 t10t11t12t13t14

What can be done to cope with this?

Page 10: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

10

MultithreadingHow can we guarantee no dependencies between

instructions in a pipeline?-- One way is to interleave execution of instructions

from different program threads on same pipeline

F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8

T1: LW r1, 0(r2)T2: ADD r7, r1, r4T3: XORI r5, r4, #12T4: SW 0(r7), r5T1: LW r5, 12(r1)

t9

F D X M WF D X M W

F D X M WF D X M W

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file

Page 11: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

11

CDC 6600 Peripheral Processors(Cray, 1964)

First multithreaded hardware10 “virtual” I/O processorsFixed interleave on simple pipelinePipeline has 100ns cycle timeEach virtual processor executes one instruction every 1000nsAccumulator-based instruction set to reduce processor state

Page 12: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

12

Simple Multithreaded Pipeline

Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stageAppears to software (including OS) as multiple, albeit slower, CPUs

+1

2 Thread select

PC1PC

1PC1PC

1

I$ IR GPR1GPR1GPR1GPR1

X

Y

2

D$

Page 13: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

13

Multithreading Costs

Each thread requires its own user state PC GPRs

Also, needs its own system state virtual memory page table base register exception handling registers

Other costs?

Page 14: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

14

Thread Scheduling PoliciesFixed interleave (CDC 6600 PPUs, 1964) each of N threads executes one instruction every N cycles if thread not ready to go in its slot, insert pipeline bubble

Software-controlled interleave (TI ASC PPUs, 1971) OS allocates S pipeline slots amongst N threads hardware performs fixed interleave over S slots, executing

whichever thread is in that slot

Hardware-controlled thread scheduling (HEP, 1982) hardware keeps track of which threads are ready to go picks next thread to execute based on hardware priority

scheme

Page 15: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

15

Denelcor HEP(Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU 120 threads per processor 10 MHz clock rate Up to 8 processors precursor to Tera MTA (Multithreaded Architecture)

Page 16: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

16

Tera MTA (1990-97)

Up to 256 processorsUp to 128 active threads per processorProcessors and memory modules populate a sparse 3D torus interconnection fabricFlat, shared main memory No data cache Sustains one main memory access per cycle per

processorGaAs logic in prototype, 1KW/processor @ 260MHz CMOS version, MTA-2, 50W/processor

Page 17: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

17

MTA ArchitectureEach processor supports 128 active hardware threads 1 x 128 = 128 stream status word (SSW)

registers, 8 x 128 = 1024 branch-target registers, 32 x 128 = 4096 general-purpose registers

Three operations packed into 64-bit instruction (short VLIW) One memory operation, One arithmetic operation, plus One arithmetic or branch operation

Thread creation and termination instructions

Explicit 3-bit “lookahead” field in instruction gives number of subsequent instructions (0-7) that are independent of this one c.f. instruction grouping in VLIW allows fewer threads to fill machine pipeline used for variable-sized branch delay slots

Page 18: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

18

MTA Pipeline

A

W

C

W

M

Inst Fetch

Mem

ory

Pool

Retry Pool

Interconnection Network

Wri

te P

ool

W

Memory pipeline

Issue Pool• Every cycle, one VLIW instruction from one active thread is launched into pipeline• Instruction pipeline is 21 cycles long• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…What is single-thread performance?

Effective single-thread issue rate is 260/21 = 12.4 MIPS

Page 19: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

19

Coarse-Grain MultithreadingTera MTA designed for supercomputing applications

with large data sets and low locality No data cache Many parallel threads needed to hide large

memory latency

Other applications are more cache friendly Few pipeline bubbles when cache getting hits Just add a few threads to hide occasional

cache miss latencies Swap threads on cache misses

Page 20: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

20

MIT Alewife (1990)

Modified SPARC chips register windows hold

different thread contexts

Up to four threads per nodeThread switch on local

cache miss

Page 21: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

21

IBM PowerPC RS64-IV (2000)Commercial coarse-grain multithreading CPUBased on PowerPC with quad-issue in-order five-stage pipelineEach physical CPU supports two virtual CPUsOn L2 cache miss, pipeline is flushed and execution switches to second thread short pipeline minimizes flush penalty (4 cycles),

small compared to memory access latency flush pipeline to simplify exception handling

Page 22: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

22

Simultaneous Multithreading (SMT) for OoO Superscalars

Techniques presented so far have all been “vertical” multithreading where each pipeline stage works on one thread at a timeSMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

Page 23: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

23

For most apps, most execution units lie idle in an OoO superscalar

From: Tullsen, Eggers, and Levy,“Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995.

For an 8-way superscalar.

Page 24: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

24

Superscalar Machine Efficiency

Issue w idth

Time

Completely idle cycle (vertical waste)

Instruction issue

Partially filled cycle, i.e., IPC < 4(horizontal waste)

Page 25: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

25

Vertical Multithreading

What is the effect of cycle-by-cycle interleaving? removes vertical waste, but leaves some horizontal waste

Issue w idth

Time

Second thread interleaved cycle-by-cycle

Instruction issue

Partially filled cycle, i.e., IPC < 4(horizontal waste)

Page 26: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

26

Chip Multiprocessing (CMP)

What is the effect of splitting into multiple processors? reduces horizontal waste, leaves some vertical waste, and puts upper limit on peak throughput of each thread.

Issue w idth

Time

Page 27: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

27

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

Interleave multiple threads to multiple issue slots with no restrictions

Issue w idth

Time

Page 28: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

28

O-o-O Simultaneous Multithreading[Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneouslyUtilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threadsOOO instruction window already has most of the circuitry required to schedule from multiple threadsAny single thread can utilize whole machine

Page 29: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

29

Power 4Single-threaded predecessor to Power 5. 8 execution units inout-of-order engine, each mayissue an instruction each cycle.

Page 30: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

10/30/2007 30

Power 4

Power 5

2 fetch (PC),2 initial decodes

2 commits (architected register sets)

Page 31: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

31

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Page 32: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

32

Changes in Power 5 to support SMTIncreased associativity of L1 instruction cache and the instruction address translation buffers Added per thread load and store queues Increased size of the L2 (1.92 vs. 1.44 MB) and L3 cachesAdded separate instruction prefetch and buffering per threadIncreased the number of virtual registers from 152 to 240Increased the size of several issue queuesThe Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Page 33: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

33

Pentium-4 Hyperthreading (2002)

First commercial SMT design (2-way SMT) Hyperthreading == SMT

Logical processors share nearly all resources of the physical processor Caches, execution units, branch predictors

Die area overhead of hyperthreading ~ 5%When one logical processor is stalled, the other can make progress No logical processor can use all entries in queues

when two threads are activeProcessor running only one active software thread runs at approximately same speed with or without hyperthreading

Page 34: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

34

Pentium-4 HyperthreadingFront End

[ Intel Technology Journal, Q1 2002 ]

Resource divided between logical CPUs

Resource shared between logical CPUs

Page 35: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

35

Pentium-4 HyperthreadingExecution Pipeline

[ Intel Technology Journal, Q1 2002 ]

Page 36: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

36

SMT adaptation to parallelism type

For regions with high thread level parallelism (TLP) entire machine width is shared by all threads

Issue w idth

Time

Issue w idth

Time

For regions with low thread level parallelism (TLP) entire machine width is available for instruction level parallelism (ILP)

Page 37: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

37

Initial Performance of SMTPentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate Pentium 4 is dual threaded SMT SPECRate requires that each SPEC benchmark be run

against a vendor-selected number of copies of the same benchmark

Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (262 runs) speed-ups from 0.90 to 1.58; average was 1.20Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_ratePower 5 running 2 copies of each app speedup between 0.89 and 1.41 Most gained some Fl.Pt. apps had most cache conflicts and least gains

Page 38: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

38

Power 5 thread performance ...Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.

Page 39: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

39

Icount based Thread selection Policy

Why does this enhance throughput?

Fetch from thread with the least instructions in flight.

Page 40: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

40

SMT Fetch Policies (Locks)Problem:

Spin looping thread consumes resources

Solution:Provide quiescing operation that allows athread to sleep until a memory location changes

loop:ARM r1, 0(r2)BEQ r1, got_itQUIESCEBR loop

got_it:

Load and start watching 0(r2)

Inhibit scheduling of thread until activity observed on 0(r2)

Page 41: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

41

Summary: Multithreaded Categories

Time (

proc

esso

r cyc

le) Superscalar Fine-Grained Coarse-Grained MultiprocessingSimultaneousMultithreading

Thread 1Thread 2

Thread 3Thread 4

Thread 5Idle slot

Page 42: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

42

Check instruction dependencies

Superscalar processor

Sequential ISA Bottleneck

a = foo(b);for (i=0, i<

Sequential source code

Superscalar compiler

Find independent operations

Schedule operations

Sequential machine code

Schedule execution

Page 43: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

43

VLIW: Very Long Instruction Word

Multiple operations packed into one instructionEach operation slot is for a fixed functionConstant operation latencies are specifiedArchitecture requires guarantee of: Parallelism within an instruction => no x-operation RAW

check No data use before data ready => no data interlocks

Two Integer Units,Single Cycle Latency

Two Load/Store Units,Three Cycle Latency Two Floating-Point Units,

Four Cycle Latency

Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2Int Op 1

Page 44: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

44

VLIW Compiler Responsibilities

The compiler:

Schedules to maximize parallel execution

Guarantees intra-instruction parallelism

Schedules to avoid data hazards (no interlocks) Typically separates operations with explicit NOPs

Page 45: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

45

Early VLIW Machines

FPS AP120B (1976) scientific attached array processor first commercial wide instruction machine hand-coded vector math libraries using software

pipelining and loop unrollingMultiflow Trace (1987) commercialization of ideas from Fisher’s Yale group

including “trace scheduling” available in configurations with 7, 14, or 28

operations/instruction 28 operations packed into a 1024-bit instruction word

Cydrome Cydra-5 (1987) 7 operations encoded in 256-bit instruction word rotating register file

Page 46: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

46

Loop Execution

for (i=0; i<N; i++)B[i] = A[i] + C;

Int1

Int 2 M1 M2 FP

+FPx

loop:

How many FP ops/cycle?

ld add r1

fadd

sdadd r2 bne

1 fadd / 8 cycles = 0.125

loop: ld f1, 0(r1)add r1, 8fadd f2, f0, f1sd f2, 0(r2)add r2, 8bne r1, r3, loop

Compile

Schedule

Page 47: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

47

Loop Unrollingfor (i=0; i<N; i++)

B[i] = A[i] + C;

for (i=0; i<N; i+=4){

B[i] = A[i] + C;B[i+1] = A[i+1] + C;B[i+2] = A[i+2] + C;B[i+3] = A[i+3] + C;

}

Unroll inner loop to perform 4 iterations at once

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

Page 48: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

48

Scheduling Loop Unrolled Code

loop: ld f1, 0(r1)ld f2, 8(r1)ld f3, 16(r1)ld f4, 24(r1)add r1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4sd f5, 0(r2)sd f6, 8(r2)sd f7, 16(r2)sd f8, 24(r2)add r2, 32bne r1, r3,

loop

Schedule

Int1

Int 2 M1 M2 FP

+FPx

loop:

Unroll 4 ways

ld f1ld f2ld f3ld f4add r1 fadd f5

fadd f6fadd f7fadd f8

sd f5sd f6sd f7sd f8add r2 bne

How many FLOPS/cycle?

4 fadds / 11 cycles = 0.36

Page 49: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

49

Software Pipelining

loop: ld f1, 0(r1)ld f2, 8(r1)ld f3, 16(r1)ld f4, 24(r1)add r1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4sd f5, 0(r2)sd f6, 8(r2)sd f7, 16(r2)add r2, 32sd f8, -8(r2)bne r1, r3,

loop

Int1 Int2 M1 M2 FP+ FPxUnroll 4 ways first

ld f1ld f2ld f3ld f4

fadd f5fadd f6fadd f7fadd f8

sd f5sd f6sd f7sd f8

add r1

add r2bne

ld f1ld f2ld f3ld f4

fadd f5fadd f6fadd f7fadd f8

sd f5sd f6sd f7sd f8

add r1

add r2bne

ld f1ld f2ld f3ld f4

fadd f5fadd f6fadd f7fadd f8

sd f5

add r1

loop:iterate

prolog

epilog

How many FLOPS/cycle?

4 fadds / 4 cycles = 1

Page 50: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

Software Pipelining

50

Page 51: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

51

Software Pipelining vs. Loop Unrolling

time

performance

time

performance

Loop Unrolled

Software Pipelined

Startup overhead

Wind-down overhead

Loop Iteration

Loop Iteration

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

Page 52: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

52

What if there are no loops?

Branches limit basic block size in control-flow intensive irregular codeDifficult to find ILP in individual basic blocks

Basic block

Page 53: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

53

Trace Scheduling [ Fisher,Ellis]

Pick string of basic blocks, a trace, that represents most frequent branch pathUse profiling feedback or compiler heuristics to find common branch paths Schedule whole “trace” at onceAdd fixup code to cope with branches jumping out of trace

Page 54: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

Trace Scheduling

54

Page 55: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

55

Page 56: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

56

Page 57: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

57

Problems with “Classic” VLIW

Object-code compatibility have to recompile all code for every machine, even for two

machines in same generationObject code size instruction padding wastes instruction memory/cache loop unrolling/software pipelining replicates code

Scheduling variable latency memory operations caches and/or memory bank conflicts impose statically

unpredictable variabilityKnowing branch probabilities Profiling requires an significant extra step in build process

Scheduling for statically unpredictable branches optimal schedule varies with branch path

Page 58: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

58

VLIW Instruction Encoding

Schemes to reduce effect of unused fields Compressed format in memory, expand on I-cache refill

used in Multiflow Trace introduces instruction addressing challenge

Mark parallel groups used in TMS320C6x DSPs, Intel IA-64

Provide a single-op VLIW instruction Cydra-5 UniOp instructions

Group 1 Group 2 Group 3

Page 59: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

Rotating Register FilesProblems: Scheduled loops require lots of registers,

Lots of duplicated code in prolog, epilog

Solution: Allocate new set of registers for each loop iteration

ld r1, ()add r2, r1, #1st r2, ()

ld r1, ()add r2, r1, #1st r2, ()

ld r1, ()add r2, r1, #1st r2, ()

ld r1, ()add r2, r1, #1

st r2, ()ld r1, ()

add r2, r1, #1st r2, ()

ld r1, ()add r2, r1, #1

st r2, ()

Prolog

Epilog

Loop

Page 60: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

60

Rotating Register File

P0P1P2P3P4P5P6P7

RRB=3

+R1

Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers.

dec RRBbloopdec RRBdec RRB

dec RRBld r1, ()add r3, r2, #1

st r4, ()ld r1, ()

add r3, r2, #1st r4, ()

ld r1, ()add r2, r1, #1

st r4, ()

Prolog

Epilog

LoopLoop closing branch decrements RRB

Page 61: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

61

Rotating Register File(Previous Loop Example)

bloopsd f9, ()fadd f5, f4, ...ld f1, ()

Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3)

Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 – f5 = 4)

bloopsd P17, ()fadd P13, P12,ld P9, () RRB=8bloopsd P16, ()fadd P12, P11,ld P8, () RRB=7bloopsd P15, ()fadd P11, P10,ld P7, () RRB=6bloopsd P14, ()fadd P10, P9,ld P6, () RRB=5bloopsd P13, ()fadd P9, P8,ld P5, () RRB=4bloopsd P12, ()fadd P8, P7,ld P4, () RRB=3bloopsd P11, ()fadd P7, P6,ld P3, () RRB=2bloopsd P10, ()fadd P6, P5,ld P2, () RRB=1

Page 62: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

62

Cydra-5:Memory Latency Register (MLR)

Problem: Loads have variable latencySolution: Let software choose desired memory latency

Compiler schedules code for maximum load-use distanceSoftware sets MLR to latency that matches code schedule Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline

Hardware buffers loads that return early Hardware stalls processor if loads return

late

Page 63: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

63

Intel EPIC IA-64

EPIC is the style of architecture (CISC, RISC) Explicitly Parallel Instruction Computing

IA-64 is Intel’s chosen ISA (x86, MIPS) IA-64 = Intel Architecture 64-bit An object-code compatible VLIW

Itanium (aka Merced) is first implementation ( 8086) First customer shipment were expected 1997

(actual 2001) McKinley, second implementation shipped in 2002

Page 64: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

64

IA-64 Instruction Format

Template bits describe grouping of these instructions with others in adjacent bundlesEach group contains instructions that can execute in parallel

Instruction 2 Instruction 1 Instruction 0 Template

128-bit instruction bundle

group i group i+1 group i+2group i-1

bundle j bundle j+1 bundle j+2bundle j-1

Page 65: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

65

IA-64 Registers128 General Purpose 64-bit Integer Registers128 General Purpose 64/80-bit Floating Point Registers64 1-bit Predicate Registers

GPRs rotate to reduce code size for software pipelined loops

Page 66: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

66

IA-64 Predicated ExecutionProblem: Mispredicted branches limit ILPSolution: Eliminate hard to predict branches with predicated execution

Almost all IA-64 instructions can be executed conditionally under predicate Instruction becomes NOP if predicate register false

Inst 1Inst 2br a==b, b2

Inst 3Inst 4br b3

Inst 5Inst 6

Inst 7Inst 8

b0:

b1:

b2:

b3:

if

else

then

Four basic blocks

Inst 1Inst 2p1,p2 <-cmp(a==b)(p1) Inst 3 || (p2) Inst 5(p1) Inst 4 || (p2) Inst 6Inst 7Inst 8

Predication

One basic block

Mahlke et al, ISCA95: On average >50% branches removed

Page 67: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

Predicate Software Pipeline StagesSingle VLIW Instruction

Software pipeline stages turned on by rotating predicate registers Much denser encoding of loops

(p1) bloop(p3) st r4(p2) add r3(p1) ld r1

Dynamic Execution

(p1) ld r1(p1) ld r1(p1) ld r1(p1) ld r1(p1) ld r1

(p2) add r3(p2) add r3(p2) add r3(p2) add r3(p2) add r3

(p3) st r4(p3) st r4(p3) st r4(p3) st r4(p3) st r4

(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloop

Page 68: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

68

Fully Bypassed Datapath

ASrcIRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALU

ImmExt

rd1

GPRs

rs1rs2

wswd rd2

we

wdata

addr

wdata

rdataData Memory

we

31

nop

stall

D

E M W

PC for JAL, ...

BSrc

Where does predication fit in?

Page 69: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

69

IA-64 Speculative ExecutionProblem: Branches restrict compiler code motion

Inst 1Inst 2br a==b, b2

Load r1Use r1Inst 3

Can’t move load above branch because might cause spurious exception

Load.s r1Inst 1Inst 2br a==b, b2

Chk.s r1Use r1Inst 3

Speculative load never causes exception, but sets “poison” bit on destination register

Check for exception in original home block jumps to fixup code if exception detected

Particularly useful for scheduling long latency loads early

Solution: Speculative operations that don’t cause exceptions

Page 70: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

70

IA-64 Data SpeculationProblem: Possible memory hazards limit code scheduling

Requires associative hardware in address check table

Inst 1Inst 2StoreLoad r1Use r1Inst 3

Can’t move load above store because store might be to same address

Load.a r1Inst 1Inst 2StoreLoad.cUse r1Inst 3

Data speculative load adds address to address check table

Store invalidates any matching loads in address check table

Check if load invalid (or missing), jump to fixup code if so

Solution: Hardware to check pointer hazards

Page 71: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

71

Clustered VLIW

Divide machine into clusters of local register files and local functional unitsLower bandwidth/higher latency interconnect between clustersSoftware responsible for mapping computations to minimize communication overheadCommon in commercial embedded VLIW processors, e.g., TI C6x DSPs, HP Lx processor(Same idea used in some superscalar processors, e.g., Alpha 21264)

Cluster Interconnect Local Regfile

Local Regfile

Memory Interconnect

Cache/Memory Banks

Cluster

Page 72: SMT/VLIW/EPIC, Statically Scheduled ILPclass.ece.iastate.edu/tyagi/cpre581/lectures/LecVLIWEPIC.pdfThread Scheduling Policies. Fixed interleave (CDC 6600 PPUs, 1964) ... width is available

72

Limits of Static Scheduling

Unpredictable branchesVariable memory latency (unpredictable cache misses)Code size explosionCompiler complexity

Question:How applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?