ece 552 / cps 550 advanced computer architecture i lecture 15 very long instruction word machines...
TRANSCRIPT
ECE 552 / CPS 550 Advanced Computer Architecture I
Lecture 15Very Long Instruction Word
Machines
Benjamin LeeElectrical and Computer Engineering
Duke University
www.duke.edu/~bcl15www.duke.edu/~bcl15/class/class_ece252fall11.html
ECE 552 / CPS 550 2
ECE252 Administrivia13 November – Homework #4 Due
Project ProposalsShould receive comments by end of week.
ECE 552 / CPS 550 3
OoO Superscalar ComplexityOut-of-order Superscalar
- Objective: Increase instruction-level parallelism.- Cost: Hardware logic/mechanisms to track dependencies and
dynamically schedule independent instructions.
Hardware Complexity- Instructions can issue, complete out-of-order.- Instructions must commit in-order- Implement Tomasulo’s algorithm with a variety of structures- Example: Reservation stations, reorder buffer, physical register file
Very Long Instruction Word (VLIW)- Objective: Increase instruction-level parallelism.- Cost: Software compilers/mechanisms to track dependencies and
statically schedule independent instructions.
ECE 552 / CPS 550 4
Review of Fetch/DecodeFetch
- Load instruction by accessing memory at program counter (PC)- Update PC using sequential address (PC+4) or branch prediction (BTB)
Decode/Rename- Take instruction from fetch buffer- Allocate resources, which are necessary to execute instruction:
(1) Destination physical register – if instruction writes a register, rename(2) Reorder buffer (ROB) entry – support in-order commit(3) Issue queue entry – hold instruction as it waits for execution(4) Memory buffer entry – resolve dependencies through memory (next slide)
- Stall if resources unavailable- Rename source/destination registers- Update reorder buffer, issue queue, memory buffer
ECE 552 / CPS 550 5
Review of Memory BufferAllocate memory buffer entry
Store Instructions- Calculate store-address and place in buffer- Take store-data and place in buffer- Instruction commits in-order when store-address, store-data ready
Load Instructions- Calculate load-address and place in buffer- Instruction searches memory buffer for stores with matching address- Forward load data from in-flight stores with matching address- Stall load if buffer contains stores with un-resolved addresses
ECE 552 / CPS 550 6
Review of Issue/ExecuteIssue
- Instruction commits from reorder buffer- A commit wakes-up an instruction by marking its sources ready- Select logic determines which ready instructions should execute- Issue when by sending instructions to functional unit
Execute- Read operands from physical register file and/or forwarding path- Execute instruction in functional unit- Write result to physical register file, store buffer- Produce exception status- Write to reorder buffer
ECE 552 / CPS 550 7
Review of CommitCommit
- Instructions can complete and update reorder buffer out-of-order- Instructions commit from reorder buffer in-order
Exceptions- Check for exceptions- If exception raised, flush pipeline- Jump to exception handler
Release Resources- Free physical register used by last writer to same architected register- Free reorder buffer slot- Free memory buffer slot
ECE 552 / CPS 550 8
Control Logic Scaling
- Lifetime (L) – number of cycles an instruction spends in pipeline- Lifetime depends on pipeline latency, time spent in reorder buffer- Issue width (W) – maximum number of instructions issued per cycle
- As W increases, issue logic must find more instructions to execute in parallel and keep pipeline busy.
- More instructions must be fetched, decoded, and queued. - W x L instructions can impact any of the W issuing instructions (e.g.
forwarding) and growth in hardware proportional to W x (W x L)
Lifetime L
Issue Group
Previously Issued
Instructions
Issue Width W
ECE 552 / CPS 550 9
Control Logic (MIPS R10000)
Control Logic
ECE 552 / CPS 550 10
Sequential Instruction Sets
Check instruction dependencies
Superscalar processor
a = foo(b);for (i=0, i<
Sequential source code
Superscalar compiler
Find independent operations
Schedule operations
Sequential machine code
Schedule execution
ECE 552 / CPS 550 11
Sequential Instruction SetsSuperscalar Compiler
- Takes sequential code (e.g., C, C++)- Check instruction dependencies- Schedule operations to preserve dependencies- Produces sequential machine code (e.g., MIPS)
Superscalar Processor- Takes sequential code (e.g., MIPS)- Check instruction dependencies- Schedule operations to preserve dependencies
Inefficiency of Superscalar Processors- Performs dependency, scheduling dynamically in hardware- Expensive logic rediscovers schedules that a compiler could have
found
ECE 552 / CPS 550 12
VLIW – Very Long Instruction
Word
- Multiple operations packed into one instruction format- Instruction format is fixed, each slot supports particular instruction type- Constant operation latencies are specified (e.g., 1 cycle integer op)- Software schedules operations into instruction format, guaranteeing
(1)Parallelism within an instruction – no RAW checks between ops(2)No data use before ready – no data interlocks/stalls
Two Integer Units,Single Cycle Latency
Two Load/Store Units,Three Cycle Latency Two Floating-Point Units,
Four Cycle Latency
Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2Int Op 1
ECE 552 / CPS 550 13
VLIW Compiler
Responsibilities Schedule operations to maximize parallel execution
- Fill operation slots
Guarantee intra-instruction parallelism- Ensure operations within same instruction are independent
Schedule to avoid data hazards- Separate options with explicit NOPs
ECE 552 / CPS 550 14
Loop Execution
for (i=0; i<N; i++) B[i] = A[i] + C; Int1 Int 2 M1 M2 FP+ FPx
loop: ld add r1
fadd
sd add r2 bne
loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop
Compile
Schedule
- The latency of each instruction is fixed (e.g., 3 cycle ld, 4 cycle fadd)- Instr-1: Load A[i] and increment i (r1) in parallel- Instr-2: Wait for load- Instr-3: Wait for add. Store B[i], increment i (r2), branch in parallel- How many flops / cycle? 1 fadd / 8 cycles = 0.125
ECE 552 / CPS 550 15
Loop Unrolling
- Unroll inner loop to perform k iterations of computation at once.
- If N is not a multiple of unrolling factor k, insert clean-up code
- Example: unroll inner loop to perform 4 iterations at once
for (i=0; i<N; i++) B[i] = A[i] + C;
for (i=0; i<N; i+=4){ B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C;}
ECE 552 / CPS 550 16
Scheduling Unrolled Loopsloop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop
schedule
Int1 Int 2 M1 M2 FP+ FPx
loop: ld f1ld f2ld f3ld f4add r1 fadd f5
fadd f6fadd f7fadd f8
sd f5sd f6sd f7sd f8add r2 bne
- Unroll loop to execute 4 iterations- Reduces number of empty operation slots- How many flops/cycle? 4 fadds / 11 cycles = 0.36
ECE 552 / CPS 550 17
Software PipeliningExploit independent loop iterations
- If loop iterations are independent, then get more parallelism by scheduling instructions from different iterations
- Example: Loop iterations are independent in the code sequence below.- Construct the data-flow graph for one iteration
load A[i] C
+
store B[i]
for (i=0; i<N; i++) B[i] = A[i] + C;
ECE 552 / CPS 550 18
Software Pipelining
(Illustrated)Not pipelinedload A[0] C
+
store B[0]
load A[1] C
+
store B[1]
load A[2] C
+
store B[2]
load A[3] C
+
store B[3]
Fill
Ste
ady S
tate
Dra
in
Pipelined
Load A[0] C
+ load A[1] C
store B[0] + load A[2] C
store B[1] + load A[3] C
store B[2] + load A[4] C
store B[3] + load A[5] C
store B[4] +
store B[5]
ECE 552 / CPS 550 19
Scheduling SW Pipelined
LoopsUnroll the loop to perform 4 iterations at once. Then SW pipeline.for (i=0; i<N; i++) B[i] = A[i] + C;loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop
Int1 Int 2 M1 M2 FP+ FPx
ld f1
ld f2
ld f3
ld f4
fadd f5
fadd f6
fadd f7
fadd f8
sd f5
sd f6
sd f7
sd f8
add r1
add r2
bne
ld f1
ld f2
ld f3
ld f4
fadd f5
fadd f6
fadd f7
fadd f8
sd f5
sd f6
sd f7
sd f8
add r1
add r2
bne
ld f1
ld f2
ld f3
ld f4
fadd f5
fadd f6
fadd f7
fadd f8
sd f5
add r1
loop:steadystate
fill
drain
ECE 552 / CPS 550 21
What if there are no loops?
- Basic block defined by sequence of consecutive instructions. Every basic block ends with a branch.
- Instruction-level parallelism
is hard to find in basic blocks
- Basic blocks illustrated by control flow graph
Basic block
ECE 552 / CPS 550 22
Trace Scheduling
- A trace is a sequence of basic blocks (a.k.a., long string of straight-line code)
- Trace Selection: Use profiling or compiler heuristics to find common sequences/paths
- Trace Compaction: Schedule whole trace into few VLIW instructions.
- Add fix-up code to cope with branches jumping out of trace. Undo instructions if control flower diverges from trace.
ECE 552 / CPS 550 23
Problems with “Classic” VLIWObject Code Challenges
- Compatibility: Need to recompile code for every VLIW machine, even across generations.
- Compatibility: Code specific to operation slots in instruction format and latencies of operations
- Code Size: Instruction padding wastes instruction memory/cache with nops for unfilled slots.
- Code Size: Loop unrolling, software pipelining increases code footprint.
Scheduling Variable Latency Operations- Effective schedules rely on known instruction latencies- Caches, memories produce unpredictable variability
ECE 552 / CPS 550 24
Problems with “Classic” VLIWObject Code Challenges
- Compatibility: Need to recompile code for every VLIW machine, even across generations.
- Compatibility: Code specific to operation slots in instruction format and latencies of operations
- Code Size: Instruction padding wastes instruction memory/cache with nops for unfilled slots.
- Code Size: Loop unrolling, software pipelining increases code footprint.
Scheduling Variable Latency Operations- Effective schedules rely on known instruction latencies- Caches, memories produce unpredictable variability
ECE 552 / CPS 550 25
Intel Itanium, EPIC IA-64Explicitly Parallel Instruction Computing (EPIC)
- Computer architecture style (e.g., CISC, RISC, EPIC)
IA-64- Instruction set architecture (e.g., x86, MIPS, IA-64)- IA-64 – Intel Architecture 64-bit
Implementations- Merced, first implementation, 2001- McKinley, second implementation, 2002- Poulson, recent implementation, 2011
ECE 552 / CPS 550 26
Intel Itanium, EPIC IA-64- Eight cores- 1-cycle 16KB L1 I&D caches- 9-cycle 512KB L2 I-cache- 8-cycle 256KB L2 D-cache- 32 MB shared L3 cache- 544mm2 in 32nm CMOS- 3 billion transistors
- Cores are 2-way multithreaded
- Each VLIW word is 128-bits, containing 3 instructions (op slots)
- Fetch 2 words per cycle 6 instructions (op slots)
- Retire 4 words per cycle 12 instructions (op slots)
ECE 552 / CPS 550 28
VLIW and Control Flow
ChallengesChallenge
- Mispredicted branches limit ILP- Trace selection groups basic blocks into larger ones- Trace compaction schedules instructions into a single VLIW- Requires fix-up code for branches that exit trace
Solution – Predicated Execution- Eliminate hard to predict branches with predicated execution- IA-64 instructions can be executed conditionally under predicate- Instruction becomes a NOP if predicate register is false
ECE 552 / CPS 550 29
VLIW and Control Flow
Challenges
Inst 1Inst 2br a==b, b2
Inst 3Inst 4br b3
Inst 5Inst 6
Inst 7Inst 8
b0:
b1:
b2:
b3:
if
else
then
Four basic blocks
Inst 1Inst 2p1,p2 <- cmp(a==b)(p1) Inst 3 || (p2) Inst 5(p1) Inst 4 || (p2) Inst 6Inst 7Inst 8
Predication
One basic block
On average >50% branches removed. Mahlke et al. 1995.
ECE 552 / CPS 550 30
Limits of Static ILPSoftware Instruction-level Parallelism (VLIW)
- Compiler complexity- Code size explosion- Unpredictable branches- Variable memory latency and unpredictable cache behavior
Current Status- Despite several attempts, VLIW has failed in general-purpose
computing- VLIW hardware complexity similar to in-order, superscalar hardware
complexity. Limited advantage on large, complex applications- Successful in embedded digital signal processing; friendly code
ECE 552 / CPS 550 31
SummaryOut-of-order Superscalar
• Hardware complexity increases super-linearly with issue-width.
Very Long Instruction Word (VLIW)• Compiler explicitly schedules parallel instructions• Unrolling and software pipelining loops
Predication• Mitigates branches in VLIW machines• Predicate operations. If predicate false, operation does not affect
architected state. • Mahlke et al. “A comparison of full and partial predicated execution
support for ILP processors” 1995.
ECE 552 / CPS 550 32
AcknowledgementsThese slides contain material developed and copyright by - Arvind (MIT)- Krste Asanovic (MIT/UCB)- Joel Emer (Intel/MIT)- James Hoe (CMU)- Arvind Krishnamurthy (U. Washington)- John Kubiatowicz (UCB)- Alvin Lebeck (Duke)- David Patterson (UCB)- Daniel Sorin (Duke)