TRANSCRIPT: EECS 452 – Lecture 8: VLIW (April 28, 2009)
Instructor: Gokhan Memik, EECS Dept., Northwestern University
EECS 452 © Moshovos and Memik http://www.eecs.northwestern.edu/~memik/courses/452 2
Independence ISA

Conventional ISA:
- Instructions execute in order
- No way of stating that instruction A is independent of instruction B
- Vectors and SIMD: only for a set of the same operation

Idea: change the execution model at the ISA level to allow specification of independence.

Goals: flexible enough, and a good match for the technology. These are some of the goals of VLIW.
VLIW: Very Long Instruction Word

#1 defining attribute: the operations in a word are independent.
Instruction format: ALU1 | ALU2 | MEM1 | MEM2
- The four instructions are independent
- Some parallelism can be expressed this way, extending the ability to specify parallelism

Take the technology into consideration (recall delay slots). This leads to the #2 defining attribute: NUAL, non-unit assumed latency.
NUAL vs. UAL

Unit Assumed Latency (UAL): the semantics of the program are that each instruction completes before the next one issues. This is the conventional sequential model.

Non-Unit Assumed Latency (NUAL): at least one operation has a non-unit assumed latency L, which is greater than 1. The semantics of the program are correctly understood if exactly the next L-1 instructions are understood to have issued before this operation completes.
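The NUAL contract above can be sketched in plain Python on a hypothetical machine (not any real ISA): one instruction issues per cycle, and a result with assumed latency L only becomes architecturally visible L cycles after issue, so an instruction issued in between still reads the old value.

```python
def run(program):
    """program: list of (dest, srcs, fn, assumed_latency); one issue per cycle."""
    regs = {}
    pending = []  # (cycle when the result becomes visible, dest, value)
    for cycle, (dest, srcs, fn, lat) in enumerate(program):
        for entry in [p for p in pending if p[0] <= cycle]:
            regs[entry[1]] = entry[2]   # assumed latency elapsed: result visible
            pending.remove(entry)
        args = [regs.get(s, 0) for s in srcs]  # unwritten regs read as 0
        pending.append((cycle + lat, dest, fn(*args)))
    for _, d, v in sorted(pending):     # drain results still in flight
        regs[d] = v
    return regs

# r1 is written with assumed latency 2: the very next instruction still
# reads the old r1 (0), while the one after it sees 5.
prog = [
    ("r1", [], lambda: 5, 2),
    ("r2", ["r1"], lambda x: x + 1, 1),
    ("r3", ["r1"], lambda x: x + 1, 1),
]
```

Under UAL every latency would be 1, and r2 would see the new r1; under NUAL the compiler must schedule around the exposed latency.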
#2 Defining Attribute: NUAL

Instruction format: ALU1 | ALU2 | MEM1 | control
(figure: a sequence of six such instruction words, with the assumed latencies of all operations architecturally visible)

Glorified delay slots; additional opportunities for specifying parallelism.
#3 Defining Attribute: Resource Assignment

The VLIW also implies allocation of resources. The instruction format ALU1 | ALU2 | MEM1 | control maps well onto a datapath with two ALUs, a cache port, and a control-flow unit.
VLIW: Definition

- Multiple independent functional units
- An instruction consists of multiple independent operations, each aligned to a functional unit
- Latencies are fixed and architecturally visible
- The compiler packs operations into a VLIW and schedules all hardware resources
- The entire VLIW issues as a single unit

Result: ILP with simple hardware: compact, fast hardware control and a fast clock.
VLIW Example

(figure: an I-fetch & issue unit feeding two FUs and two memory ports, all connected to a multi-ported register file)
VLIW Example

Instruction format: ALU1 | ALU2 | MEM1 | MEM2
Program order and execution order follow the sequence of VLIWs.

- Instructions in a VLIW are independent
- Latencies are fixed in the architecture specification
- Hardware does not check anything
- Software has to schedule so that everything works
Compilers are King

VLIW philosophy: "dumb" hardware, "intelligent" compiler.

Key technologies:
- Predicated execution
- If-conversion
- Trace scheduling
- Software pipelining
Predicated Execution

Instructions are predicated: if (cond) then perform the instruction. In practice: calculate the result, and if (cond), destination = result. This converts control-flow dependences into data dependences.

    if (a == 0)         pred = (a == 0)
        b = 1;          pred:  b = 1
    else                !pred: b = 2
        b = 2;
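The semantics of the if-converted code above can be sketched as follows (illustrative only, not tied to any real ISA): both arms execute, but a guarded write commits only when its predicate is true, turning the control dependence into a data dependence.

```python
def guarded_write(regs, pred, dest, value):
    """Commit value to dest only if the guarding predicate holds."""
    if regs[pred]:
        regs[dest] = value

def if_converted(a):
    regs = {"a": a, "b": None}
    regs["p"] = (regs["a"] == 0)       # pred = (a == 0)
    regs["np"] = not regs["p"]
    guarded_write(regs, "p", "b", 1)   # pred:  b = 1
    guarded_write(regs, "np", "b", 2)  # !pred: b = 2
    return regs["b"]
```

Both guarded writes issue unconditionally; only the one whose predicate holds has any architectural effect.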
Predicated Execution: Trade-offs
Is predicated execution always a win?
Is predication meaningful for VLIW only?
If-Conversion

Late 70s to early 80s (Ken Kennedy et al., POPL 1983): compiler support.
Early 90s (Mahlke and Hwu, MICRO-92): predicate large chunks of code.

- With no control flow, scheduling allows free motion of code
- All remaining restrictions are data related

Reverse if-conversion reintroduces control flow:
N.J. Warter, S.A. Mahlke, W.W. Hwu, and B.R. Rau. Reverse If-Conversion. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 290-299, June 1993.
Trace Scheduling

Goal: create a large contiguous piece of code and schedule it to the max, exploiting parallelism.

Facts of life:
- Basic blocks are small
- Scheduling across basic blocks is difficult
- But while many control-flow paths exist, there are few "hot" ones

Trace scheduling is static control speculation: assume a specific path, schedule accordingly, and introduce check and repair code where necessary.

First used to compact microcode:
FISHER, J. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers C-30, 7 (July 1981), 478-490.
Trace Scheduling Example

(figure: basic blocks bA, bB, bC, bD on the hot path, with bE off the trace; the scheduled trace bA,bB,bC,bD is followed by a check: if all is OK execution continues, otherwise repair code covers bC,bD or bE)
Trace Scheduling Example

Source:
    test = a[i] + 20;
    if (test > 0) then
        sum = sum + 10
    else
        sum = sum + c[i]
    c[x] = c[y] + 10

Straight-line code for the hot path (test > 0), with a check and repair:
    test = a[i] + 20
    sum = sum + 10
    c[x] = c[y] + 10
    if (test <= 0) then goto repair
    ...
    repair:
        sum = sum - 10
        sum = sum + c[i]
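The transform above can be sketched in plain Python: speculatively execute the hot path, then check the condition and run repair code to undo the guess when it was wrong. Names follow the slide's example.

```python
def original(a_i, c_i, sum_):
    test = a_i + 20
    if test > 0:
        sum_ += 10
    else:
        sum_ += c_i
    return sum_

def trace_scheduled(a_i, c_i, sum_):
    test = a_i + 20
    sum_ += 10        # speculative: scheduled assuming the hot path
    if test <= 0:     # check
        sum_ -= 10    # repair: undo the speculative update
        sum_ += c_i
    return sum_
```

The two versions compute the same result; the scheduled one trades a longer cold path (the repair) for a branch-free hot path.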
Software Pipelining

Rau, MICRO-81 and Lam, PLDI '88. Consider a loop:

    for i = 1 to N
        a[i] = b[i] + C

Loop schedule (assume f30 holds C):
    0: LD  f0, 0(r16)
    1:
    2:
    3: ADD f16, f30, f0
    4:
    5:
    6: ST  f16, 0(r17)
Software Pipelining

Assume a latency of 3 cycles for all ops:
    0: LD  f0, 0(r16)
    1: LD  f1, 8(r16)
    2: LD  f2, 16(r16)
    3: ADD f16, f30, f0
    4: ADD f17, f30, f1
    5: ADD f18, f30, f2
    6: ST  f16, 0(r17)
    7: ST  f17, 8(r17)
    8: ST  f18, 16(r17)

Steady state: LD (i+3), ADD (i), ST (i-3).
Complete Code

PROLOG
    LD f0, 0(r16)
    LD f1, 8(r16)
    LD f2, 16(r16)
    ADD f0,C, f16   LD f0, 0(r16)
    ADD f1,C, f17   LD f1, 8(r16)
    ADD f2,C, f18   LD f2, 16(r16)
KERNEL
    ST f16, 0(r17)  ADD f0,C, f16   LD f0, 0(r16)
    ST f17, 8(r17)  ADD f1,C, f17   LD f1, 8(r16)
    ST f18, 16(r17) ADD f2,C, f18   LD f2, 16(r16)
EPILOGUE
    ST f16, 0(r17)  ADD f0,C, f16
    ST f17, 8(r17)  ADD f1,C, f17
    ST f18, 16(r17) ADD f2,C, f18
    ST f16, 0(r17)
    ST f17, 8(r17)
    ST f18, 16(r17)
Rotating Register Files Help

    ST f19, 0(r17)  ADD f3,C, f16   LD f0, 0(r16)

Special branches (ctop and wtop):
- Register references are relative to a register base
- The branches increment the base modulo the register-file size
- The register file is treated as a circular queue

Finally, predication eliminates the prolog and epilogue:
    (p6) ST f19, 0(r17)  (p3) ADD f3,C, f16  (p0) LD f0, 0(r16)
VLIW - History

- Floating Point Systems Array Processor: very successful in the 70s; all latencies fixed, fast memory
- Multiflow: Josh Fisher (now at HP), 1980s mini-supercomputer
- Cydrome: Bob Rau, 1980s mini-supercomputer
- Tera: Burton Smith, 1990s supercomputer, multithreading
- Intel IA-64 (Intel & HP)
EPIC philosophy

- The compiler creates a complete plan of execution (POE): at what time, and using what resource, each operation runs
- The POE is communicated to the hardware via the ISA
- The processor obediently follows the POE: no dynamic scheduling or out-of-order execution, since these would second-guess the compiler's plan

The compiler is allowed to play the statistics:
- Many types of information are only available at run time (branch directions, pointer values)
- Traditionally, compilers behave conservatively and handle the worst case
- Here the compiler may gamble when it believes the odds are in its favor (profiling)

Expose the microarchitecture to the compiler: the memory system, branch execution.
Defining feature I - MultiOp

Superscalar:
- Operations are sequential
- Hardware figures out resource assignment and time of execution

MultiOp instruction:
- A set of independent operations that are issued simultaneously; there is no sequential notion within a MultiOp
- One instruction issues every cycle, which provides the notion of time
- Resource assignment is indicated by position within the MultiOp
- The POE is communicated to the hardware via MultiOps
Defining feature II - Exposed latency

Superscalar:
- A sequence of atomic operations
- Sequential order defines the semantics (UAL)
- Each conceptually finishes before the next one starts

EPIC, with non-atomic operations:
- Register reads and writes for one operation are separated in time
- Semantics are determined by the relative ordering of reads and writes
- The assumed latency (NUAL if > 1) is a contract between the compiler and the hardware
- Instruction issuance provides the common notion of time
EPIC Architecture Overview

Many specialized registers:
- 32 static + 96 stacked/rotating general-purpose registers (64 bits)
- 32 static + 96 stacked/rotating FP registers (81 bits)
- 8 branch registers (64 bits)
- 16 static + 48 rotating predicates
ISA

128-bit instruction bundles:
- Each contains 3 instructions
- A 6-bit template field specifies which FUs the instructions go to and marks the termination of an independence group
- WAR dependences are allowed within the same bundle
- Independent instructions may spread over multiple bundles
Other architectural features of EPIC

Features added into the architecture to support the EPIC philosophy: create more efficient POEs, expose the microarchitecture, play the statistics.

- Register structure
- Branch architecture
- Data/control speculation
- Memory hierarchy
- Predicated execution (largest impact on the compiler)
Register Structure

Superscalar:
- Small number of architectural registers
- Renamed at run time using a large pool of physical registers

EPIC:
- The compiler is responsible for all resource allocation, including registers
- Renaming happens at compile time, so a large pool of architectural registers is needed
Rotating Register File

Overlapping loop iterations raises a question: how do you prevent a register from being overwritten by a later iteration? The answer is compiler-controlled dynamic register renaming.

Rotating registers:
- Each iteration writes to r13, but this gets mapped to a different physical register
- A block of consecutive registers is allocated for each register in the loop, sized by the number of iterations for which it is needed
Rotating Register File Example

    actual reg = (reg + RRB) % NumRegs

At the end of each iteration, RRB is decremented, so r13 (and likewise r14) lands on a fresh physical register each time:
- iteration n     (RRB = 10): r13 -> R23
- iteration n + 1 (RRB = 9):  r13 -> R22
- iteration n + 2 (RRB = 8):  r13 -> R21
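The mapping above is small enough to sketch directly: the virtual register number is added to the rotating register base (RRB) modulo the file size, and the loop branch decrements RRB each iteration (the file size below is an assumption for the sketch).

```python
NUM_REGS = 96  # rotating-region size; an assumption for this sketch

def physical(reg, rrb):
    """actual reg = (reg + RRB) % NumRegs"""
    return (reg + rrb) % NUM_REGS

rrb = 10
r13_homes = []
for _ in range(3):                   # three overlapped iterations
    r13_homes.append(physical(13, rrb))  # where this iteration's r13 lives
    rrb -= 1                         # loop branch decrements RRB
```

With RRB starting at 10, successive iterations' writes to r13 map to physical registers 23, 22, 21, matching the figure.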
Branch Architecture

Branch actions:
1. Branch condition computed
2. Target address formed
3. Instructions fetched from the taken path, the fall-through path, or both
4. The branch itself executes
5. After the branch, instructions at its target are decoded/executed

Superscalar processors use hardware to hide the latency of all these actions:
- I-cache prefetching
- Branch prediction: guess the outcome of the branch
- Dynamic scheduling: overlap other instructions with the branch
- Reorder buffer: squash when wrong
EPIC Branches

Make each action visible, with an architectural latency: no stalls, and no prediction necessary (though it is sometimes still used).

The branch is separated into 3 distinct operations:
1. Prepare to branch: compute the target address and prefetch instructions from the likely target; executed well in advance of the branch
2. Compute the branch condition (a comparison operation)
3. The branch itself

Branches with latency > 1 have delay slots, which must be filled with operations that execute regardless of the direction of the branch.
Predication

    if (a[i].ptr != 0)
        b[i] = a[i].left;
    else
        b[i] = a[i].right;
    i++

Conventional:
    load a[i].ptr
    p2 = cmp a[i].ptr != 0
    jump if p2 nodecr
    load r8 = a[i].left
    store b[i] = r8
    jump next
    nodecr:
    load r9 = a[i].right
    store b[i] = r9
    next:
    i++

IA-64:
    load a[i].ptr
    p1, p2 = cmp a[i].ptr != 0
    <p1> load r8 = a[i].left    <p2> load r9 = a[i].right
    <p1> store b[i] = r8        <p2> store b[i] = r9
    i++
Speculation

Allow the compiler to play the statistics, reordering operations to find enough parallelism:
- Branch outcome: control speculation
- Lack of memory dependence in pointer code: data speculation

Profiling or clever analysis provides "the statistics."

General plan of action:
- The compiler reorders aggressively
- Hardware support catches the times when it is wrong
- Execution is repaired and continues

Repair is expensive, so the compiler has to be right most of the time or performance will suffer.
"Advanced" Loads

    t1 = t1 + 1
    if (t1 > t2)
        j = a[t1 - t2]

Without speculation:
    add t1 + 1
    comp t1 > t2
    jump donothing
    load a[t1 - t2]
    donothing:

With a speculative load:
    add t1 + 1
    ld.s r8 = a[t1 - t2]
    comp t1 > t2
    jump donothing
    check.s r8
    donothing:

ld.s: load and record any exception. check.s: check for a recorded exception. This allows the load to be performed early. The idea is not IA-64 specific.
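The ld.s/check.s contract above can be sketched as follows (a hypothetical machine, not IA-64 encodings): the hoisted load defers any fault into a token on the register instead of trapping, and the check, which executes only on the taken path, triggers recovery.

```python
class SpecReg:
    def __init__(self, value=None, deferred_fault=False):
        self.value = value
        self.deferred_fault = deferred_fault  # the recorded-exception token

def ld_s(mem, addr):
    """Speculative load: record a fault instead of raising it."""
    if addr in mem:
        return SpecReg(value=mem[addr])
    return SpecReg(deferred_fault=True)   # bad address: defer, don't trap

def check_s(reg, recover):
    """Run recovery code only if the speculative load had faulted."""
    return recover() if reg.deferred_fault else reg.value

mem = {4: 99}
ok = check_s(ld_s(mem, 4), recover=lambda: -1)     # valid address
bad = check_s(ld_s(mem, 100), recover=lambda: -1)  # faulting address
```

If the branch skips the use of the load, the check never runs and a spurious fault on the wrong path never surfaces, which is exactly what makes hoisting the load safe.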
Speculative Loads

Memory Conflict Buffer (Illinois). Goal: move a load above a store when it is unclear whether a dependence exists.

Speculative load:
- Load from memory
- Keep a record of the address in a table

Stores check the table and signal an error in the table on a conflict.

Check load:
- Check the table for a signaled error
- Branch to repair code on error

How are the CHECK and the speculative load linked? Via the target register specifier. The effect is similar to dynamic speculation/synchronization.
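The memory-conflict-buffer mechanism described above can be sketched like this: the data-speculative load records its address keyed by its target register, intervening stores flag a conflict on matching addresses, and the check replays the load when the speculation turned out to be wrong.

```python
class MCB:
    def __init__(self):
        self.table = {}  # target register -> (address, conflicted)

    def spec_load(self, mem, reg, addr):
        """A load hoisted above a possibly aliasing store."""
        self.table[reg] = (addr, False)
        return mem.get(addr, 0)

    def store(self, mem, addr, value):
        mem[addr] = value
        for reg, (a, _) in self.table.items():
            if a == addr:                 # store hit a speculated address
                self.table[reg] = (a, True)

    def check(self, mem, reg):
        """Return the replayed value on conflict, else None (no repair)."""
        addr, conflicted = self.table.pop(reg)
        return mem[addr] if conflicted else None

mcb, mem = MCB(), {8: 1}
early = mcb.spec_load(mem, "r8", 8)  # load moved above the store
mcb.store(mem, 8, 5)                 # aliasing store: conflict flagged
fixed = mcb.check(mem, "r8")         # repair: replay yields the new value
```

Note how the check finds the speculative load purely through the target register name ("r8" here), mirroring the linkage described on the slide.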
Exposed Memory Hierarchy

Conventional memory hierarchies have a storage-presence speculation mechanism (caching) built in, but it is not always effective:
- Streaming data
- Latency-tolerant computations

EPIC gives explicit control over where data goes, e.g. L_B_C3_C2 and S_H_C1 versus the conventional C1/C1:
- Source cache specifier: where the data is coming from, which determines the assumed latency
- Target cache specifier: where to place the data
VLIW Discussion

- Can one build a dynamically scheduled processor with a VLIW instruction set?
- Does VLIW really simplify the hardware?
- Is there enough parallelism visible to the compiler? What are the trade-offs?
- Many DSPs are VLIW. Why?