SMT/VLIW/EPIC, Statically Scheduled ILP
Adapted from UCB CS252 Sp 2007 lectures
CprE 581 Computer Systems Architecture
Text: Appendix G
Recap: Advanced Superscalars
- Even simple branch prediction can be quite effective; path-based predictors can achieve >95% accuracy
- BTB redirects control flow early in the pipe; a BHT is cheaper per entry but must wait for instruction decode
- Branch mispredict recovery requires snapshots of pipeline state to reduce penalty
- Unified physical register file design avoids reading data from multiple locations (ROB + architectural regfile)
- Superscalars can rename multiple dependent instructions in one clock cycle
- Need a speculative store buffer to avoid waiting for stores to commit
Recap: Branch Prediction and Speculative Execution
[Figure: pipeline with fetch (PC), decode & rename, reorder buffer, and commit; branch prediction redirects fetch, predictors are updated at commit, and branch resolution from the branch unit kills wrong-path instructions throughout the pipe. Execution units: branch unit, ALU, register file, and MEM with a store buffer in front of the D$.]
Little's Law
Parallelism = Throughput × Latency, or

    N = T × L

where N is the number of operations in flight, T is the throughput (operations per cycle), and L is the latency (in cycles) of one operation.
Example Pipelined ILP Machine
How much instruction-level parallelism (ILP) is required to keep the machine pipelines busy?
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
Max throughput: six instructions per cycle (one pipeline stage each).

    T = 6 instructions/cycle
    L = (2×1 + 2×3 + 2×4) / 6 = 16/6 ≈ 2.67 cycles (average latency)
    N = T × L = 6 × 16/6 = 16 instructions in flight
Superscalar Control Logic Scaling
- Each issued instruction must check against W×L instructions, i.e., growth in hardware ∝ W×(W×L)
- For in-order machines, L is related to pipeline latencies
- For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
- As W increases, a larger instruction window is needed to find enough parallelism to keep the machine busy => greater L
=> Out-of-order control logic grows faster than W² (~W⁴)
[Figure: issue group of width W checked against previously issued instructions over lifetime L.]
Out-of-Order Control Complexity: MIPS R10000
[Die photo: control logic occupies a large fraction of the chip. SGI/MIPS Technologies Inc., 1995]
Multithreading
- Difficult to continue to extract ILP from a single thread
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor
Pipeline Hazards
Each instruction may depend on the one before it:

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

[Pipeline diagram: on a non-bypassed F D X M W pipe, each dependent instruction stalls in decode until the previous result is written back, inserting bubbles.]

What can be done to cope with this?
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)

[Pipeline diagram: each instruction proceeds F D X M W with no stalls.]

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
[Figure: 5-stage pipeline (I$, IR, GPRs, ALU stages X/Y, D$) with thread-select muxes choosing among per-thread PCs and per-thread register files; the 2-bit thread-select signal travels down the pipeline.]
- Have to carry the thread select down the pipeline to ensure correct state bits are read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs
Multithreading Costs
- Each thread requires its own user state
  - PC
  - GPRs
- Also needs its own system state
  - virtual memory page table base register
  - exception handling registers
- Other costs?
Thread Scheduling Policies
- Fixed interleave (CDC 6600 PPUs, 1964): each of N threads executes one instruction every N cycles; if a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC PPUs, 1971): OS allocates S pipeline slots amongst N threads; hardware performs fixed interleave over the S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982): hardware keeps track of which threads are ready to go and picks the next thread to execute based on a hardware priority scheme
Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in its main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)
Tera MTA (1990-97)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 kW/processor @ 260 MHz; CMOS version, MTA-2, 50 W/processor
MTA Architecture
- Each processor supports 128 active hardware threads
  - 1 × 128 = 128 stream status word (SSW) registers
  - 8 × 128 = 1024 branch-target registers
  - 32 × 128 = 4096 general-purpose registers
- Three operations packed into a 64-bit instruction (short VLIW)
  - One memory operation, one arithmetic operation, plus one arithmetic or branch operation
- Thread creation and termination instructions
- Explicit 3-bit "lookahead" field in the instruction gives the number of subsequent instructions (0-7) that are independent of this one
  - c.f. instruction grouping in VLIW
  - allows fewer threads to fill the machine pipeline
  - used for variable-sized branch delay slots
MTA Pipeline
[Figure: issue pool feeding the instruction pipeline (W, C, M stages after instruction fetch), with a memory pipeline through the interconnection network, memory pool, retry pool, and write pool.]
- Every cycle, one VLIW instruction from one active thread is launched into the pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS.
Coarse-Grain Multithreading
- Tera MTA designed for supercomputing applications with large data sets and low locality
  - No data cache
  - Many parallel threads needed to hide large memory latency
- Other applications are more cache friendly
  - Few pipeline bubbles when the cache is getting hits
  - Just add a few threads to hide occasional cache miss latencies
  - Swap threads on cache misses
MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss
IBM PowerPC RS64-IV (2000)
- Commercial coarse-grain multithreading CPU
- Based on PowerPC with a quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  - short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - flushing the pipeline simplifies exception handling
Simultaneous Multithreading (SMT) for OoO Superscalars
- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle
- Gives better utilization of machine resources
For most apps, most execution units lie idle in an OoO superscalar
[Figure: breakdown of wasted issue slots for an 8-way superscalar. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]
Superscalar Machine Efficiency
[Figure: issue width vs. time; instruction issue slots show completely idle cycles (vertical waste) and partially filled cycles, i.e., IPC < 4 (horizontal waste).]
Vertical Multithreading
What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste.
[Figure: a second thread interleaved cycle-by-cycle fills the idle cycles; partially filled cycles (IPC < 4) remain as horizontal waste.]
Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.
[Figure: issue width vs. time for two narrower processors running separate threads.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads to multiple issue slots with no restrictions.
[Figure: issue slots filled by instructions from many threads in every cycle.]
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize the wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine
Power 4
Single-threaded predecessor to Power 5. 8 execution units in an out-of-order engine; each may issue an instruction each cycle.
Power 4 vs. Power 5
[Figure: pipeline diagrams of Power 4 and Power 5. For SMT, Power 5 adds 2 fetch streams (PCs) with 2 initial decodes at the front end, and 2 commits (architected register sets) at the back end.]
Power 5 data flow
[Figure: Power 5 instruction and data flow.]
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
Changes in Power 5 to support SMT
- Increased associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT)
  - Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor
  - caches, execution units, branch predictors
- Die area overhead of hyperthreading ~5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in queues when two threads are active
- A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Pentium-4 Hyperthreading Front End
[Figure: front-end structures, some divided between the logical CPUs and some shared between them. From Intel Technology Journal, Q1 2002.]
Pentium-4 Hyperthreading Execution Pipeline
[Figure: execution pipeline resource partitioning between logical CPUs. From Intel Technology Journal, Q1 2002.]
SMT adaptation to parallelism type
- For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads
- For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP)
[Figure: issue width vs. time in the two regimes.]
Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23× faster for SPECint_rate with SMT, 1.16× faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Fl.Pt. apps had the most cache conflicts and least gains
Power 5 thread performance
[Figure: per-thread performance as hardware thread priority is varied.]
- Relative priority of each thread is controllable in hardware
- For balanced operation, both threads run slower than if they "owned" the machine
ICOUNT-Based Thread Selection Policy
Fetch from the thread with the fewest instructions in flight.
Why does this enhance throughput?
SMT Fetch Policies (Locks)
Problem: a spin-looping thread consumes resources.
Solution: provide a quiescing operation that allows a thread to sleep until a memory location changes.

    loop:   ARM     r1, 0(r2)      ; load and start watching 0(r2)
            BEQ     r1, got_it
            QUIESCE                ; inhibit scheduling of thread until activity observed on 0(r2)
            BR      loop
    got_it:
Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots colored by thread (Thread 1-5) or marked as idle.]
Sequential ISA Bottleneck
[Figure: sequential source code (e.g., "a = foo(b); for (i=0; i<") is compiled by a superscalar compiler, which finds independent operations and schedules them, but must emit sequential machine code; the superscalar processor then has to re-check instruction dependencies and schedule execution again at run time.]
VLIW: Very Long Instruction Word
[Figure: instruction with slots Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2, feeding two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency).]
- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified
- Architecture requires guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks
VLIW Compiler Responsibilities
The compiler:
- Schedules to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedules to avoid data hazards (no interlocks)
  - Typically separates operations with explicit NOPs
Early VLIW Machines
- FPS AP120B (1976)
  - scientific attached array processor
  - first commercial wide-instruction machine
  - hand-coded vector math libraries using software pipelining and loop unrolling
- Multiflow Trace (1987)
  - commercialization of ideas from Fisher's Yale group, including "trace scheduling"
  - available in configurations with 7, 14, or 28 operations/instruction; 28 operations packed into a 1024-bit instruction word
- Cydrome Cydra-5 (1987)
  - 7 operations encoded in a 256-bit instruction word
  - rotating register file
Loop Execution

    for (i=0; i<N; i++)
        B[i] = A[i] + C;

compiles to:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: the ld, add r1, fadd, sd, add r2, and bne each occupy one slot, serialized across 8 cycles by the load and fadd latencies.]

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
Loop Unrolling
Unroll the inner loop to perform 4 iterations at once:

    for (i=0; i<N; i++)
        B[i] = A[i] + C;

becomes

    for (i=0; i<N; i+=4) {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }

Need to handle values of N that are not multiples of the unrolling factor with a final cleanup loop.
Scheduling Loop Unrolled Code
Unroll 4 ways:

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: the four loads pair up on M1/M2, the four fadds issue on FP+ once their loads complete, and the stores drain afterwards; the loop body takes 11 cycles.]

How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
Software Pipelining
Unroll 4 ways first:

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          add  r2, 32
          sd   f8, -8(r2)
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: iterations are overlapped so that in steady state the loads of one iteration, the fadds of the previous iteration, and the stores of the one before that all execute in the same cycles; a prolog fills the pipeline and an epilog drains it.]

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
Software Pipelining vs. Loop Unrolling
[Figure: performance vs. time for each scheme. The unrolled loop pays startup and wind-down overhead on every trip around the loop; the software-pipelined loop ramps up once through the prolog, stays at peak performance across all iterations, and winds down once through the epilog.]
Software pipelining pays the startup/wind-down costs only once per loop, not once per iteration.
What if there are no loops?
[Figure: control-flow graph of basic blocks.]
- Branches limit basic block size in control-flow intensive irregular code
- Difficult to find ILP in individual basic blocks
Trace Scheduling [Fisher, Ellis]
- Pick a string of basic blocks, a trace, that represents the most frequent branch path
- Use profiling feedback or compiler heuristics to find common branch paths
- Schedule the whole "trace" at once
- Add fixup code to cope with branches jumping out of the trace
Problems with "Classic" VLIW
- Object-code compatibility
  - have to recompile all code for every machine, even for two machines in the same generation
- Object code size
  - instruction padding wastes instruction memory/cache
  - loop unrolling/software pipelining replicates code
- Scheduling variable-latency memory operations
  - caches and/or memory bank conflicts impose statically unpredictable variability
- Knowing branch probabilities
  - profiling requires a significant extra step in the build process
- Scheduling for statically unpredictable branches
  - optimal schedule varies with branch path
VLIW Instruction Encoding
[Figure: instruction stream divided into parallel groups: Group 1, Group 2, Group 3.]
Schemes to reduce the effect of unused fields:
- Compressed format in memory, expanded on I-cache refill
  - used in Multiflow Trace
  - introduces an instruction addressing challenge
- Mark parallel groups
  - used in TMS320C6x DSPs, Intel IA-64
- Provide a single-op VLIW instruction
  - Cydra-5 UniOp instructions
Rotating Register Files
Problems: scheduled loops require lots of registers, and there is lots of duplicated code in the prolog and epilog.

Solution: allocate a new set of registers for each loop iteration.

[Figure: the three-instruction body

    ld  r1, ()
    add r2, r1, #1
    st  r2, ()

replicated across overlapping iterations, forming a prolog, a steady-state loop, and an epilog.]
Rotating Register File
[Figure: physical registers P0-P7 with RRB=3; logical register R1 maps to a physical register offset by RRB.]
The Rotating Register Base (RRB) register points to the base of the current register set. Its value is added to the logical register specifier to give the physical register number. Usually the file is split into rotating and non-rotating registers.

[Figure: the software-pipelined loop from before, with the loop-closing branch (bloop) decrementing RRB each iteration, so the ld/add/st of each overlapped iteration read and write fresh registers through the prolog, loop, and epilog.]
Rotating Register File (Previous Loop Example)

    bloop   sd f9, ()   fadd f5, f4, ...   ld f1, ()

- Three-cycle load latency encoded as a difference of 3 in the register specifier number (f4 - f1 = 3)
- Four-cycle fadd latency encoded as a difference of 4 in the register specifier number (f9 - f5 = 4)

[Table: the same instruction executed with RRB = 8 down to 1, so the physical registers rotate each iteration: RRB=8 gives sd P17, fadd P13, P12, ld P9; RRB=7 gives sd P16, fadd P12, P11, ld P8; ...; RRB=1 gives sd P10, fadd P6, P5, ld P2.]
Cydra-5: Memory Latency Register (MLR)
Problem: loads have variable latency.
Solution: let software choose the desired memory latency.
- Compiler schedules code for maximum load-use distance
- Software sets MLR to the latency that matches the code schedule
- Hardware ensures that loads take exactly MLR cycles to return values into the processor pipeline
  - Hardware buffers loads that return early
  - Hardware stalls the processor if loads return late
Intel EPIC IA-64
- EPIC is the style of architecture (cf. CISC, RISC)
  - Explicitly Parallel Instruction Computing
- IA-64 is Intel's chosen ISA (cf. x86, MIPS)
  - IA-64 = Intel Architecture 64-bit
  - An object-code compatible VLIW
- Itanium (aka Merced) is the first implementation (cf. 8086)
  - First customer shipment was expected in 1997 (actual: 2001)
  - McKinley, the second implementation, shipped in 2002
IA-64 Instruction Format
[Figure: 128-bit instruction bundle containing Instruction 2 | Instruction 1 | Instruction 0 | Template; groups (group i-1, i, i+1, i+2) can span bundle boundaries (bundle j-1, j, j+1, j+2).]
- Template bits describe the grouping of these instructions with others in adjacent bundles
- Each group contains instructions that can execute in parallel
IA-64 Registers
- 128 general-purpose 64-bit integer registers
- 128 general-purpose 64/80-bit floating-point registers
- 64 1-bit predicate registers
- GPRs rotate to reduce code size for software-pipelined loops
IA-64 Predicated Execution
Problem: mispredicted branches limit ILP.
Solution: eliminate hard-to-predict branches with predicated execution.
- Almost all IA-64 instructions can be executed conditionally under predicate
- An instruction becomes a NOP if its predicate register is false

Four basic blocks:

    b0: Inst 1
        Inst 2
        br a==b, b2
    b1: Inst 3              ; then
        Inst 4
        br b3
    b2: Inst 5              ; else
        Inst 6
    b3: Inst 7
        Inst 8

After predication, one basic block:

        Inst 1
        Inst 2
        p1,p2 <- cmp(a==b)
        (p1) Inst 3 || (p2) Inst 5
        (p1) Inst 4 || (p2) Inst 6
        Inst 7
        Inst 8

Mahlke et al., ISCA95: on average >50% of branches removed.
Predicate Software Pipeline Stages
A single VLIW instruction:

    (p1) bloop   (p3) st r4   (p2) add r3   (p1) ld r1

Software pipeline stages are turned on by rotating predicate registers, giving a much denser encoding of loops.

[Dynamic execution: the one instruction issues every cycle; the ld (p1), add (p2), and st (p3) components become active in successive iterations as the predicates rotate, forming the prolog, steady state, and epilog.]
Fully Bypassed Datapath
[Figure: classic 5-stage datapath (PC, instruction memory, GPRs, ALU, immediate extender, data memory) with full bypassing among the D, E, M, and W stages, stall logic, nop insertion, and the PC path for JAL, etc.]
Where does predication fit in?
IA-64 Speculative Execution
Problem: branches restrict compiler code motion.

    Inst 1
    Inst 2
    br a==b, b2
    ...
    Load r1
    Use r1
    Inst 3

Can't move the load above the branch because it might cause a spurious exception.

Solution: speculative operations that don't cause exceptions.

    Load.s r1
    Inst 1
    Inst 2
    br a==b, b2
    ...
    Chk.s r1
    Use r1
    Inst 3

- A speculative load never causes an exception, but sets a "poison" bit on the destination register
- The check for an exception in the original home block jumps to fixup code if an exception is detected
- Particularly useful for scheduling long-latency loads early
IA-64 Data Speculation
Problem: possible memory hazards limit code scheduling.

    Inst 1
    Inst 2
    Store
    Load r1
    Use r1
    Inst 3

Can't move the load above the store because the store might be to the same address.

Solution: hardware to check pointer hazards.

    Load.a r1
    Inst 1
    Inst 2
    Store
    Load.c
    Use r1
    Inst 3

- A data speculative load adds its address to an address check table
- A store invalidates any matching loads in the address check table
- The check tests whether the load was invalidated (or is missing) and jumps to fixup code if so
- Requires associative hardware in the address check table
Clustered VLIW
[Figure: clusters, each with a local register file and local functional units, joined by a cluster interconnect; a memory interconnect links the clusters to cache/memory banks.]
- Divide the machine into clusters of local register files and local functional units
- Lower-bandwidth/higher-latency interconnect between clusters
- Software is responsible for mapping computations to minimize communication overhead
- Common in commercial embedded VLIW processors, e.g., TI C6x DSPs, HP Lx processor
- (The same idea is used in some superscalar processors, e.g., Alpha 21264)
Limits of Static Scheduling
- Unpredictable branches
- Variable memory latency (unpredictable cache misses)
- Code size explosion
- Compiler complexity

Question: how applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?