SMT/VLIW/EPIC, Statically Scheduled ILP
Adapted from UCB CS252 Sp 2007 lectures
CprE 581 Computer Systems Architecture
Text: Appendix G
Recap: Advanced Superscalars
- Even simple branch prediction can be quite effective; path-based predictors can achieve >95% accuracy
- BTB redirects control flow early in the pipe; a BHT is cheaper per entry but must wait for instruction decode
- Branch mispredict recovery requires snapshots of pipeline state to reduce penalty
- Unified physical register file design avoids reading data from multiple locations (ROB + architectural regfile)
- Superscalars can rename multiple dependent instructions in one clock cycle
- Need a speculative store buffer to avoid waiting for stores to commit
Recap: Branch Prediction and Speculative Execution
[Figure: pipeline with fetch (PC), decode & rename, reorder buffer, and commit; branch prediction redirects fetch, predictors are updated at commit, and branch resolution from the branch unit kills wrong-path instructions throughout the pipe. Execution units: branch unit, ALU, register file, and MEM with a store buffer in front of the D$.]
Little's Law
Parallelism = Throughput × Latency, or

    N = T × L

where N is the number of operations in flight, T is the throughput (operations per cycle), and L is the latency (in cycles) of one operation.
Example Pipelined ILP Machine
How much instruction-level parallelism (ILP) is required to keep the machine pipelines busy?
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
Max throughput: six instructions per cycle (one pipeline stage each).

    T = 6 instructions/cycle
    L = (2×1 + 2×3 + 2×4) / 6 = 16/6 ≈ 2.67 cycles (average latency)
    N = T × L = 6 × 16/6 = 16 instructions in flight
Superscalar Control Logic Scaling
- Each issued instruction must check against W×L instructions, i.e., growth in hardware ∝ W×(W×L)
- For in-order machines, L is related to pipeline latencies
- For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
- As W increases, a larger instruction window is needed to find enough parallelism to keep the machine busy => greater L
=> Out-of-order control logic grows faster than W² (~W⁴)
[Figure: issue group of width W checked against previously issued instructions over lifetime L.]
Out-of-Order Control Complexity: MIPS R10000
[Die photo: control logic occupies a large fraction of the chip. SGI/MIPS Technologies Inc., 1995]
Multithreading
- Difficult to continue to extract ILP from a single thread
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor
Pipeline Hazards
Each instruction may depend on the one before it:

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

[Pipeline diagram: on a non-bypassed F D X M W pipe, each dependent instruction stalls in decode until the previous result is written back, inserting bubbles.]

What can be done to cope with this?
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)

[Pipeline diagram: each instruction proceeds F D X M W with no stalls.]

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
[Figure: 5-stage pipeline (I$, IR, GPRs, ALU stages X/Y, D$) with thread-select muxes choosing among per-thread PCs and per-thread register files; the 2-bit thread-select signal travels down the pipeline.]
- Have to carry the thread select down the pipeline to ensure correct state bits are read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs
Multithreading Costs
- Each thread requires its own user state
  - PC
  - GPRs
- Also needs its own system state
  - virtual memory page table base register
  - exception handling registers
- Other costs?
Thread Scheduling Policies
- Fixed interleave (CDC 6600 PPUs, 1964): each of N threads executes one instruction every N cycles; if a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC PPUs, 1971): OS allocates S pipeline slots amongst N threads; hardware performs fixed interleave over the S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982): hardware keeps track of which threads are ready to go and picks the next thread to execute based on a hardware priority scheme
Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in its main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)
Tera MTA (1990-97)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 kW/processor @ 260 MHz; CMOS version, MTA-2, 50 W/processor
MTA Architecture
- Each processor supports 128 active hardware threads
  - 1 × 128 = 128 stream status word (SSW) registers
  - 8 × 128 = 1024 branch-target registers
  - 32 × 128 = 4096 general-purpose registers
- Three operations packed into a 64-bit instruction (short VLIW)
  - One memory operation, one arithmetic operation, plus one arithmetic or branch operation
- Thread creation and termination instructions
- Explicit 3-bit "lookahead" field in the instruction gives the number of subsequent instructions (0-7) that are independent of this one
  - c.f. instruction grouping in VLIW
  - allows fewer threads to fill the machine pipeline
  - used for variable-sized branch delay slots
MTA Pipeline
[Figure: issue pool feeding the instruction pipeline (W, C, M stages after instruction fetch), with a memory pipeline through the interconnection network, memory pool, retry pool, and write pool.]
- Every cycle, one VLIW instruction from one active thread is launched into the pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS.
Coarse-Grain Multithreading
- Tera MTA designed for supercomputing applications with large data sets and low locality
  - No data cache
  - Many parallel threads needed to hide large memory latency
- Other applications are more cache friendly
  - Few pipeline bubbles when the cache is getting hits
  - Just add a few threads to hide occasional cache miss latencies
  - Swap threads on cache misses
MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss
IBM PowerPC RS64-IV (2000)
- Commercial coarse-grain multithreading CPU
- Based on PowerPC with a quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  - short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - flushing the pipeline simplifies exception handling
Simultaneous Multithreading (SMT) for OoO Superscalars
- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle
- Gives better utilization of machine resources
For most apps, most execution units lie idle in an OoO superscalar
[Figure: breakdown of wasted issue slots for an 8-way superscalar. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]
Superscalar Machine Efficiency
[Figure: issue width vs. time; instruction issue slots show completely idle cycles (vertical waste) and partially filled cycles, i.e., IPC < 4 (horizontal waste).]
Vertical Multithreading
What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste.
[Figure: a second thread interleaved cycle-by-cycle fills the idle cycles; partially filled cycles (IPC < 4) remain as horizontal waste.]
Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.
[Figure: issue width vs. time for two narrower processors running separate threads.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads to multiple issue slots with no restrictions.
[Figure: issue slots filled by instructions from many threads in every cycle.]
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize the wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine
Power 4
Single-threaded predecessor to Power 5. 8 execution units in an out-of-order engine; each may issue an instruction each cycle.
Power 4 vs. Power 5
[Figure: pipeline diagrams of Power 4 and Power 5. For SMT, Power 5 adds 2 fetch streams (PCs) with 2 initial decodes at the front end, and 2 commits (architected register sets) at the back end.]
Power 5 data flow
[Figure: Power 5 instruction and data flow.]
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
Changes in Power 5 to support SMT
- Increased associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT)
  - Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor
  - caches, execution units, branch predictors
- Die area overhead of hyperthreading ~5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in queues when two threads are active
- A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Pentium-4 Hyperthreading Front End
[Figure: front-end structures, some divided between the logical CPUs and some shared between them. From Intel Technology Journal, Q1 2002.]
Pentium-4 Hyperthreading Execution Pipeline
[Figure: execution pipeline resource partitioning between logical CPUs. From Intel Technology Journal, Q1 2002.]
SMT adaptation to parallelism type
- For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads
- For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP)
[Figure: issue width vs. time in the two regimes.]
Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23× faster for SPECint_rate with SMT, 1.16× faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Fl.Pt. apps had the most cache conflicts and least gains
Power 5 thread performance
[Figure: per-thread performance as hardware thread priority is varied.]
- Relative priority of each thread is controllable in hardware
- For balanced operation, both threads run slower than if they "owned" the machine
ICOUNT-Based Thread Selection Policy
Fetch from the thread with the fewest instructions in flight.
Why does this enhance throughput?
SMT Fetch Policies (Locks)
Problem: a spin-looping thread consumes resources.
Solution: provide a quiescing operation that allows a thread to sleep until a memory location changes.

    loop:   ARM     r1, 0(r2)      ; load and start watching 0(r2)
            BEQ     r1, got_it
            QUIESCE                ; inhibit scheduling of thread until activity observed on 0(r2)
            BR      loop
    got_it:
Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots colored by thread (Thread 1-5) or marked as idle.]
Sequential ISA Bottleneck
[Figure: sequential source code (e.g., "a = foo(b); for (i=0; i<") is compiled by a superscalar compiler, which finds independent operations and schedules them, but must emit sequential machine code; the superscalar processor then has to re-check instruction dependencies and schedule execution again at run time.]
VLIW: Very Long Instruction Word
[Figure: instruction with slots Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2, feeding two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency).]
- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified
- Architecture requires guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks
VLIW Compiler Responsibilities
The compiler:
- Schedules to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedules to avoid data hazards (no interlocks)
  - Typically separates operations with explicit NOPs
Early VLIW Machines
- FPS AP120B (1976)
  - scientific attached array processor
  - first commercial wide-instruction machine
  - hand-coded vector math libraries using software pipelining and loop unrolling
- Multiflow Trace (1987)
  - commercialization of ideas from Fisher's Yale group, including "trace scheduling"
  - available in configurations with 7, 14, or 28 operations/instruction; 28 operations packed into a 1024-bit instruction word
- Cydrome Cydra-5 (1987)
  - 7 operations encoded in a 256-bit instruction word
  - rotating register file
Loop Execution

    for (i=0; i<N; i++)
        B[i] = A[i] + C;

compiles to:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: the ld, add r1, fadd, sd, add r2, and bne each occupy one slot, serialized across 8 cycles by the load and fadd latencies.]

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
Loop Unrolling
Unroll the inner loop to perform 4 iterations at once:

    for (i=0; i<N; i++)
        B[i] = A[i] + C;

becomes

    for (i=0; i<N; i+=4) {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }

Need to handle values of N that are not multiples of the unrolling factor with a final cleanup loop.
Scheduling Loop Unrolled Code
Unroll 4 ways:

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: the four loads pair up on M1/M2, the four fadds issue on FP+ once their loads complete, and the stores drain afterwards; the loop body takes 11 cycles.]

How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
Software Pipelining
Unroll 4 ways first:

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          add  r2, 32
          sd   f8, -8(r2)
          bne  r1, r3, loop

[Schedule on Int1 | Int2 | M1 | M2 | FP+ | FPx: iterations are overlapped so that in steady state the loads of one iteration, the fadds of the previous iteration, and the stores of the one before that all execute in the same cycles; a prolog fills the pipeline and an epilog drains it.]

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
Software Pipelining vs. Loop Unrolling
[Figure: performance vs. time for each scheme. The unrolled loop pays startup and wind-down overhead on every trip around the loop; the software-pipelined loop ramps up once through the prolog, stays at peak performance across all iterations, and winds down once through the epilog.]
Software pipelining pays the startup/wind-down costs only once per loop, not once per iteration.
What if there are no loops?
[Figure: control-flow graph of basic blocks.]
- Branches limit basic block size in control-flow intensive irregular code
- Difficult to find ILP in individual basic blocks
Trace Scheduling [Fisher, Ellis]
- Pick a string of basic blocks, a trace, that represents the most frequent branch path
- Use profiling feedback or compiler heuristics to find common branch paths
- Schedule the whole "trace" at once
- Add fixup code to cope with branches jumping out of the trace
Problems with "Classic" VLIW
- Object-code compatibility
  - have to recompile all code for every machine, even for two machines in the same generation
- Object code size
  - instruction padding wastes instruction memory/cache
  - loop unrolling/software pipelining replicates code
- Scheduling variable-latency memory operations
  - caches and/or memory bank conflicts impose statically unpredictable variability
- Knowing branch probabilities
  - profiling requires a significant extra step in the build process
- Scheduling for statically unpredictable branches
  - optimal schedule varies with branch path
VLIW Instruction Encoding
[Figure: instruction stream divided into parallel groups: Group 1, Group 2, Group 3.]
Schemes to reduce the effect of unused fields:
- Compressed format in memory, expanded on I-cache refill
  - used in Multiflow Trace
  - introduces an instruction addressing challenge
- Mark parallel groups
  - used in TMS320C6x DSPs, Intel IA-64
- Provide a single-op VLIW instruction
  - Cydra-5 UniOp instructions
Rotating Register Files
Problems: scheduled loops require lots of registers, and there is lots of duplicated code in the prolog and epilog.

Solution: allocate a new set of registers for each loop iteration.

[Figure: the three-instruction body

    ld  r1, ()
    add r2, r1, #1
    st  r2, ()

replicated across overlapping iterations, forming a prolog, a steady-state loop, and an epilog.]
Rotating Register File
[Figure: physical registers P0-P7 with RRB=3; logical register R1 maps to a physical register offset by RRB.]
The Rotating Register Base (RRB) register points to the base of the current register set. Its value is added to the logical register specifier to give the physical register number. Usually the file is split into rotating and non-rotating registers.

[Figure: the software-pipelined loop from before, with the loop-closing branch (bloop) decrementing RRB each iteration, so the ld/add/st of each overlapped iteration read and write fresh registers through the prolog, loop, and epilog.]
Rotating Register File (Previous Loop Example)

    bloop   sd f9, ()   fadd f5, f4, ...   ld f1, ()

- Three-cycle load latency encoded as a difference of 3 in the register specifier number (f4 - f1 = 3)
- Four-cycle fadd latency encoded as a difference of 4 in the register specifier number (f9 - f5 = 4)

[Table: the same instruction executed with RRB = 8 down to 1, so the physical registers rotate each iteration: RRB=8 gives sd P17, fadd P13, P12, ld P9; RRB=7 gives sd P16, fadd P12, P11, ld P8; ...; RRB=1 gives sd P10, fadd P6, P5, ld P2.]
Cydra-5: Memory Latency Register (MLR)
Problem: loads have variable latency.
Solution: let software choose the desired memory latency.
- Compiler schedules code for maximum load-use distance
- Software sets MLR to the latency that matches the code schedule
- Hardware ensures that loads take exactly MLR cycles to return values into the processor pipeline
  - Hardware buffers loads that return early
  - Hardware stalls the processor if loads return late
Intel EPIC IA-64
- EPIC is the style of architecture (cf. CISC, RISC)
  - Explicitly Parallel Instruction Computing
- IA-64 is Intel's chosen ISA (cf. x86, MIPS)
  - IA-64 = Intel Architecture 64-bit
  - An object-code compatible VLIW
- Itanium (aka Merced) is the first implementation (cf. 8086)
  - First customer shipment was expected in 1997 (actual: 2001)
  - McKinley, the second implementation, shipped in 2002
IA-64 Instruction Format
[Figure: 128-bit instruction bundle containing Instruction 2 | Instruction 1 | Instruction 0 | Template; groups (group i-1, i, i+1, i+2) can span bundle boundaries (bundle j-1, j, j+1, j+2).]
- Template bits describe the grouping of these instructions with others in adjacent bundles
- Each group contains instructions that can execute in parallel
IA-64 Registers
- 128 general-purpose 64-bit integer registers
- 128 general-purpose 64/80-bit floating-point registers
- 64 1-bit predicate registers
- GPRs rotate to reduce code size for software-pipelined loops
IA-64 Predicated Execution
Problem: mispredicted branches limit ILP.
Solution: eliminate hard-to-predict branches with predicated execution.
- Almost all IA-64 instructions can be executed conditionally under predicate
- An instruction becomes a NOP if its predicate register is false

Four basic blocks:

    b0: Inst 1
        Inst 2
        br a==b, b2
    b1: Inst 3              ; then
        Inst 4
        br b3
    b2: Inst 5              ; else
        Inst 6
    b3: Inst 7
        Inst 8

After predication, one basic block:

        Inst 1
        Inst 2
        p1,p2 <- cmp(a==b)
        (p1) Inst 3 || (p2) Inst 5
        (p1) Inst 4 || (p2) Inst 6
        Inst 7
        Inst 8

Mahlke et al., ISCA95: on average >50% of branches removed.
Predicate Software Pipeline Stages
A single VLIW instruction:

    (p1) bloop   (p3) st r4   (p2) add r3   (p1) ld r1

Software pipeline stages are turned on by rotating predicate registers, giving a much denser encoding of loops.

[Dynamic execution: the one instruction issues every cycle; the ld (p1), add (p2), and st (p3) components become active in successive iterations as the predicates rotate, forming the prolog, steady state, and epilog.]
Fully Bypassed Datapath
[Figure: classic 5-stage datapath (PC, instruction memory, GPRs, ALU, immediate extender, data memory) with full bypassing among the D, E, M, and W stages, stall logic, nop insertion, and the PC path for JAL, etc.]
Where does predication fit in?
IA-64 Speculative Execution
Problem: branches restrict compiler code motion.

    Inst 1
    Inst 2
    br a==b, b2
    ...
    Load r1
    Use r1
    Inst 3

Can't move the load above the branch because it might cause a spurious exception.

Solution: speculative operations that don't cause exceptions.

    Load.s r1
    Inst 1
    Inst 2
    br a==b, b2
    ...
    Chk.s r1
    Use r1
    Inst 3

- A speculative load never causes an exception, but sets a "poison" bit on the destination register
- The check for an exception in the original home block jumps to fixup code if an exception is detected
- Particularly useful for scheduling long-latency loads early
IA-64 Data Speculation
Problem: possible memory hazards limit code scheduling.

    Inst 1
    Inst 2
    Store
    Load r1
    Use r1
    Inst 3

Can't move the load above the store because the store might be to the same address.

Solution: hardware to check pointer hazards.

    Load.a r1
    Inst 1
    Inst 2
    Store
    Load.c
    Use r1
    Inst 3

- A data speculative load adds its address to an address check table
- A store invalidates any matching loads in the address check table
- The check tests whether the load was invalidated (or is missing) and jumps to fixup code if so
- Requires associative hardware in the address check table
Clustered VLIW
[Figure: clusters, each with a local register file and local functional units, joined by a cluster interconnect; a memory interconnect links the clusters to cache/memory banks.]
- Divide the machine into clusters of local register files and local functional units
- Lower-bandwidth/higher-latency interconnect between clusters
- Software is responsible for mapping computations to minimize communication overhead
- Common in commercial embedded VLIW processors, e.g., TI C6x DSPs, HP Lx processor
- (The same idea is used in some superscalar processors, e.g., Alpha 21264)
Limits of Static Scheduling
- Unpredictable branches
- Variable memory latency (unpredictable cache misses)
- Code size explosion
- Compiler complexity

Question: how applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?