Advanced Computer Architectures: Laboratory on DLX Pipelining (Vittorio Zaccaria)
Vittorio Zaccaria – Laboratory of
Architectures
DLX Load/Store Architecture
– Registers are faster than memory
– The compiler can do deeper optimization
– 16-bit offsets and immediates
– 32-bit integer registers
– 64-bit floating point registers
– Fixed operation encoding:
  – Addressing mode contained in the operation code
  – Fits in one word
  – Faster decoding
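DLX (cont.)
– 32 general purpose registers
– 32-bit instructions (bit 31 on the left, bit 0 on the right):

Register-Register:   Op [31..26] | Rs1 [25..21] | Rs2 [20..16] | Rd [15..11] | Opx [10..0]
Register-Immediate:  Op [31..26] | Rs1 [25..21] | Rd [20..16] | immediate [15..0]
Branch:              Op [31..26] | Rs1 [25..21] | Rs2/Opx [20..16] | immediate [15..0]
Jump / Call:         Op [31..26] | target [25..0]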
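The field boundaries of the four formats can be tried out with a small decoder. This is only a sketch: the names `FORMATS`, `extract` and `decode` are made up for illustration, the Opx field is assumed to occupy bits 10..0, and the opcode value in the example is arbitrary, not a real DLX opcode.

```python
# Field layout as (high_bit, low_bit) for each format listed on the slide.
# Assumption: Opx of the Register-Register format occupies bits 10..0.
FORMATS = {
    "register-register": {"op": (31, 26), "rs1": (25, 21), "rs2": (20, 16),
                          "rd": (15, 11), "opx": (10, 0)},
    "register-immediate": {"op": (31, 26), "rs1": (25, 21), "rd": (20, 16),
                           "immediate": (15, 0)},
    "branch": {"op": (31, 26), "rs1": (25, 21), "rs2_opx": (20, 16),
               "immediate": (15, 0)},
    "jump": {"op": (31, 26), "target": (25, 0)},
}


def extract(word, high, low):
    """Return bits high..low (inclusive) of a 32-bit word."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode(word, fmt):
    """Split a 32-bit instruction word into the fields of the given format."""
    return {name: extract(word, hi, lo) for name, (hi, lo) in FORMATS[fmt].items()}


# Hypothetical word: op=8 (made-up value), rs1=3, rd=4, immediate=16
word = (8 << 26) | (3 << 21) | (4 << 16) | 16
fields = decode(word, "register-immediate")
print(fields)
```

Because every format keeps Op in the same six bits, a decoder can read the opcode first and only then pick the layout for the remaining bits, which is the "faster decoding" point made earlier.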
Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
– Structural hazards: the hardware cannot support this combination of instructions
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that change the PC
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline.
An example program:

        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9
        .text
        .global main
        add   r3,r0,0
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        sub   r5,r5,r4
        addi  r3,r3,4
        bnez  r5,loop
exit:
Hazard Identification
1st Exercise: draw the pipeline chart and indicate:
– Data hazards between WB stages and ID stages
– Control hazards between the EX stage and the IF stage

                    CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub  r5,r5,r4                      IF   ID   EX   MEM  WB
addi r3,r3,4                            IF   ID   EX   MEM  WB
bnez r5,loop                                 IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)                                IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)                                     IF   ID   EX   MEM  WB
sub  r5,r5,r4                                               IF   ID   EX   MEM  WB
addi r3,r3,4                                                     IF   ID   EX   MEM  WB
bnez r5,loop                                                          IF   ID   EX   MEM
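The data hazards asked for in this exercise can also be listed mechanically. The sketch below encodes the example loop by hand and uses a hypothetical helper `raw_hazards`. It assumes no forwarding and assumes that a register written in WB can be read by an ID stage in the same clock cycle, so only consumers one or two instructions after their producer are in conflict.

```python
# Each entry: (text, destination register or None, source registers)
program = [
    ("add  r3,r0,0",       "r3", ["r0"]),
    ("lw   r4,dati_a(r3)", "r4", ["r3"]),
    ("lw   r5,dati_b(r3)", "r5", ["r3"]),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"]),
    ("addi r3,r3,4",       "r3", ["r3"]),
    ("bnez r5,loop",       None, ["r5"]),
]


def raw_hazards(instrs, window=2):
    """List (producer_index, consumer_index, register) RAW conflicts.

    In an ideal 5-stage pipeline the producer at position i writes in
    cycle i + 5 and the consumer at position j reads in cycle j + 2, so
    the read comes too early when j - i is 1 or 2 (the default window).
    """
    found = []
    for j, (_, _, sources) in enumerate(instrs):
        for i in range(max(0, j - window), j):
            dest = instrs[i][1]
            # r0 is hard-wired to zero, so writing it never creates a hazard
            if dest is not None and dest != "r0" and dest in sources:
                found.append((i, j, dest))
    return found


for producer, consumer, reg in raw_hazards(program):
    print(f"{program[consumer][0]:<20} needs {reg} from {program[producer][0]}")
```

Control hazards are not covered by this helper; it only looks at register dependences inside one pass through the loop body.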
2nd Exercise: Hazard Resolution
– Software solution:
  – NOPs insertion
– Hardware solutions:
  – Bubbles/stalls generation
  – Register forwarding
– Software optimizations:
  – Code rescheduling
NOP insertion

        add   r3,r0,0
        nop
        nop
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        nop
        nop
        sub   r5,r5,r4
        addi  r3,r3,4
        nop
        bnez  r5,loop
        nop
NOP dynamic execution

                    CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add  r3,r0,0        IF   ID   EX   MEM  WB
nop                      IF   ID   EX   MEM  WB
nop                           IF   ID   EX   MEM  WB
First loop:
lw   r4,dati_a(r3)                 IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)                      IF   ID   EX   MEM  WB
nop                                          IF   ID   EX   MEM  WB
nop                                               IF   ID   EX   MEM  WB
sub  r5,r5,r4                                          IF   ID   EX   MEM  WB
addi r3,r3,4                                                IF   ID   EX   MEM  WB
nop                                                              IF   ID   EX   MEM  WB
bnez r5,loop                                                          IF   ID   EX   MEM  WB
nop                                                                        IF   ID   EX   MEM  WB
Second loop:
lw   r4,dati_a(r3)                                                              IF   ID   EX   MEM  WB
........

The loop is composed of 5 instructions and 4 NOPs.
Performance Indexes
– CPI = average clock cycles per instruction
– Total clock cycles = n° instr + n° stalls/NOPs + 4
  (4 is the n° of cycles needed to complete the last instruction)
– CPI = [Total clock cycles] / [n° instr]
Performance evaluation of NOPs
\[ \text{Actual CPI} = \frac{\text{Instructions} + \text{NOPs} + 4}{\text{Instructions}} = \frac{13 + 4}{7} \approx 2.43 \]
\[ \text{MIPS} = \frac{\text{frequency}\ [= 200\ \text{MHz}]}{\text{CPI} \times 10^6} \approx 82.35\ \text{MIPS} \]
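The two indexes can be checked with a few lines of Python. The helper names `cpi` and `mips` are illustrative only. The counts are the ones from the NOP chart (7 instructions, 6 NOPs) and the 200 MHz clock is the one quoted on the slide.

```python
def cpi(instructions, stalls, drain_cycles=4):
    """Clock cycles per instruction: useful instructions plus stall/NOP
    cycles plus the cycles needed to complete the last instruction."""
    return (instructions + stalls + drain_cycles) / instructions


def mips(clock_hz, cpi_value):
    """Millions of instructions per second at a given clock frequency."""
    return clock_hz / (cpi_value * 1e6)


nop_cpi = cpi(instructions=7, stalls=6)
nop_mips = mips(200e6, nop_cpi)
print(f"CPI = {nop_cpi:.2f}, MIPS = {nop_mips:.2f}")
```

The same two functions apply unchanged to the bubble and forwarding cases; only the stall count changes.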
NOPs Manual Exercise
Manually execute the loop for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS.
10 minutes
Asymptotic loop performance
– Consider an intermediate iteration of the loop.
– Count the instructions + NOPs of the iteration and divide by the number of effective instructions -> asymptotic CPI
10 minutes
Performance evaluation of NOPs (asymptotic)
\[ \text{Asymptotic loop CPI} = \frac{(\text{Instructions} + \text{NOPs}) \cdot n + 4}{\text{Instructions} \cdot n} = \frac{9n + 4}{5n} \approx 1.8 \]
\[ \text{MIPS} = \frac{\text{frequency}\ [= 200\ \text{MHz}]}{\text{CPI} \times 10^6} \approx 111\ \text{MIPS} \]
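The approximation to 1.8 comes from letting the number of iterations n grow, so that the 4 drain cycles become negligible. A short worked form, using only the counts given on this slide:

```latex
\[
\mathrm{CPI}(n) = \frac{9n + 4}{5n} = \frac{9}{5} + \frac{4}{5n}
\;\xrightarrow{\;n \to \infty\;}\; \frac{9}{5} = 1.8,
\qquad
\mathrm{MIPS} = \frac{200 \times 10^{6}}{1.8 \times 10^{6}} \approx 111.11
\]
```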
Bubbles
– Bubbles are NOPs inserted by the hardware.
– Branch instructions provoke the generation of a NOP.
– The following instructions are stalled.
– The previous instructions continue to execute.
Bubbles Example
(Bub = bubble, Abt = aborted fetch)

                    CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   Bub  Bub  ID   EX   MEM  WB
lw   r5,dati_b(r3)                      IF   ID   EX   MEM  WB
sub  r5,r5,r4                                IF   Bub  Bub  ID   EX   MEM  WB
addi r3,r3,4                                                IF   ID   EX   MEM  WB
bnez r5,loop                                                     IF   Bub  ID   EX   MEM  WB
lw   r4,dati_a(r3)                                                         Abt  IF   ID   EX   MEM  WB
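The stall pattern in the chart above can be reproduced with a very small timing model. This is a sketch under stated assumptions, not the simulator's algorithm: no forwarding, a register written in WB is readable by an ID stage in the same cycle, the next fetch starts when the current instruction enters ID, and a taken branch costs one aborted fetch. The function name `schedule_no_forwarding` and the trace encoding are made up for this example.

```python
def schedule_no_forwarding(trace):
    """Return ([(if_cycle, id_cycle, wb_cycle), ...], total_cycles).

    trace: list of (dest, sources, taken_branch) in execution order.
    """
    ready = {}       # register -> cycle in which its producer reaches WB
    timing = []
    next_if = 1
    for dest, sources, taken in trace:
        if_cycle = next_if
        # ID waits (bubbles) until every source register has been written
        id_cycle = max([if_cycle + 1] + [ready.get(r, 0) for r in sources])
        wb_cycle = id_cycle + 3          # EX, MEM, WB follow ID back to back
        if dest is not None:
            ready[dest] = wb_cycle
        # After a taken branch the fetch issued during its ID is aborted,
        # so the target is fetched one cycle later.
        next_if = id_cycle + 1 if taken else id_cycle
        timing.append((if_cycle, id_cycle, wb_cycle))
    return timing, timing[-1][2]


trace = [
    ("r3", ["r0"], False),        # add  r3,r0,0
    ("r4", ["r3"], False),        # lw   r4,dati_a(r3)
    ("r5", ["r3"], False),        # lw   r5,dati_b(r3)
    ("r5", ["r5", "r4"], False),  # sub  r5,r5,r4
    ("r3", ["r3"], False),        # addi r3,r3,4
    (None, ["r5"], True),         # bnez r5,loop (taken)
    ("r4", ["r3"], False),        # lw   r4,dati_a(r3), second iteration
]
timing, total = schedule_no_forwarding(trace)
print(total)
```

The model ignores structural hazards and memory latency, which is enough for this integer-only loop.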
Performance evaluation of bubbles
\[ \text{Actual CPI} = \frac{\text{Instructions} + \text{Bubbles/aborts} + 4}{\text{Instructions}} = \frac{7 + 6 + 4}{7} \approx 2.43 \]
\[ \text{MIPS} = \frac{\text{frequency}\ [= 200\ \text{MHz}]}{\text{CPI} \times 10^6} \approx 82.35\ \text{MIPS} \]
Verify on the simulator
– File -> load code ... -> pipe1.s -> select -> load -> yes
– Configuration -> disable forwarding
– Open clock cycle diagram
– Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)
Manual Exercise
– Predict what happens in an intermediate iteration.
– Calculate asymptotic CPI and MIPS.
10 minutes
Let’s simulate it
Simulate the program until the 4th iteration.
Solutions
After the 1st iteration, we note the same behavior in every iteration:
– 5 instructions
– 1 nop
– 3 stalls
Asymptotic values: CPI=1.8 MIPS=111.11
Forwarding Example
(Bub = bubble, Abt = aborted fetch)

                    CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub  r5,r5,r4                      IF   ID   Bub  EX   MEM  WB
addi r3,r3,4                            IF   Bub  ID   EX   MEM  WB
bnez r5,loop                                      IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)                                     Abt  IF   ID   EX   MEM  WB
Simulation of 2 iterations of the loop
– Configuration -> enable forwarding
– Open clock cycle diagram
– File -> Reset DLX
– Execute -> single cycle, up to the WB of the 2nd bnez
Manual Exercise
– Calculate CPI and MIPS for the 2 iterations.
– Calculate asymptotic CPI and MIPS.
15 minutes
Results
2 iterations:
– 11 instructions
– 1 nop
– 2 stalls
– 4 cycles to flush the pipe
CPI=18/11≈1.64 MIPS≈122
Asymptotic Results
– 5 instructions
– 1 nop
– 1 stall
CPI=[7n+4]/5n≈1.4 MIPS=142.86
Speedup
Speedup of A w.r.t. B:
\[ \text{Speedup}(A, B) = \frac{\text{Exec. Time}_B}{\text{Exec. Time}_A} \]
Calculate asymptotic speedup
– Speedup(NOPs, Bubbles)
– Speedup(Forwarding, NOPs)
– Speedup(Forwarding, Bubbles)
5 minutes
Calculate Asym. speedup
– Speedup(NOPs, Bubbles) = 1
– Speedup(Forwarding, NOPs) = 1.29
– Speedup(Forwarding, Bubbles) = 1.29
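These three values follow directly from the asymptotic CPIs found earlier (1.8 for NOPs and for bubbles, 1.4 for forwarding). A minimal check, assuming the same clock and the same number of useful instructions in all three versions so that execution time is proportional to CPI; `speedup` is a hypothetical helper name:

```python
def speedup(cpi_a, cpi_b):
    """Speedup of A with respect to B: exec_time_B / exec_time_A.

    With identical clock and useful-instruction count the execution
    times are proportional to the CPIs, so the ratio of CPIs suffices.
    """
    return cpi_b / cpi_a


asymptotic_cpi = {"NOPs": 9 / 5, "Bubbles": 9 / 5, "Forwarding": 7 / 5}

for a, b in [("NOPs", "Bubbles"), ("Forwarding", "NOPs"), ("Forwarding", "Bubbles")]:
    print(f"Speedup({a},{b}) = {speedup(asymptotic_cpi[a], asymptotic_cpi[b]):.2f}")
```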
Scheduling Optimizations
Change the order of operations to minimize stalls/bubbles (forwarding enabled).

Original order:
        lw    r3,0(r2)
        add   r3,r3,r7
        lw    r4,0(r2)
        add   r4,r4,r8
        add   r4,r4,r3
CPI=(5+2+4)/5

Rescheduled order:
        lw    r3,0(r2)
        lw    r4,0(r2)
        add   r3,r3,r7
        add   r4,r4,r8
        add   r4,r4,r3
CPI=(5+4)/5
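The two stall counts can be reproduced by looking only at load-use pairs. The sketch assumes that, with forwarding, the only remaining stall in this straight-line code is one cycle when an instruction uses the result of the load immediately before it; `load_use_stalls` and the tuple encoding are illustrative.

```python
def load_use_stalls(instrs):
    """Count one stall for each instruction that reads the destination
    of the load placed immediately before it."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_sources) in zip(instrs, instrs[1:]):
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (opcode, destination, source registers)
original = [
    ("lw",  "r3", ["r2"]),
    ("add", "r3", ["r3", "r7"]),
    ("lw",  "r4", ["r2"]),
    ("add", "r4", ["r4", "r8"]),
    ("add", "r4", ["r4", "r3"]),
]
# Same instructions with both loads hoisted to the top
rescheduled = [original[0], original[2], original[1], original[3], original[4]]

for name, code in [("original", original), ("rescheduled", rescheduled)]:
    stalls = load_use_stalls(code)
    print(name, stalls, (len(code) + stalls + 4) / len(code))
```

Moving the second load up puts one independent instruction between each load and its first use, which is exactly the distance the forwarding path needs.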
1st Exercise
addi r1,r0,1
seq r2,r1,r1
add r3,r3,r3
Loop: lw r4,0(r3)
sub r3,r3,r4
bnez r1,Loop
Manual Exercises
– Draw the conflicts between operations until the end of the 3rd iteration of the loop (last instruction: bnez). No forwarding possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate CPI and throughput of the trace.
– Calculate the asymptotic CPI of the loop.
20 minutes
Hazard Diagram

addi r1,r0,1        IF   ID   EX   MEM  WB
seq  r2,r1,r1            IF   ID   EX   MEM  WB
add  r3,r3,r3                 IF   ID   EX   MEM  WB
lw   r4,0(r3)                      IF   ID   EX   MEM  WB
sub  r3,r3,r4                           IF   ID   EX   MEM  WB
bnez r1,Loop                                 IF   ID   EX   MEM  WB
lw   r4,0(r3)                                     IF   ID   EX   MEM  WB
sub  r3,r3,r4                                          IF   ID   EX   MEM  WB
bnez r1,Loop                                                IF   ID   EX   MEM  WB
lw   r4,0(r3)                                                    IF   ID   EX   MEM  WB
sub  r3,r3,r4                                                         IF   ID   EX   MEM  WB
bnez r1,Loop                                                               IF   ID   EX   MEM  WB
CPIs
– Trace CPI=[24+4]/12≈2.33
– Asymptotic CPI=[6n+4]/3n≈2
Manual Exercises
– Suppose now that forwarding is possible.
– Draw the new execution pipeline diagram (until the execution of the 3rd bnez) and indicate when stalls must be generated by the hardware.
– Calculate CPI and MIPS.
– Calculate asymptotic CPI and MIPS.
20 minutes
Results
– CPI=21/12=1.75
– Asymptotic CPI=[(4+1)n+4]/3n≈5/3≈1.67
2nd exercise
loop: lw r2,dati_a(r4)
lw r3,dati_b(r5)
add r1,r2,r3
sw dati_a(r6),r1
addi r4,r4,4
addi r5,r5,4
addi r6,r6,4
j loop
1st part
– Assume no forwarding is possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate the asymptotic CPI of the loop.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes
Results
No forwarding and no scheduling: asymptotic CPI = 13/8
A Possible Re-Scheduling
loop: lw r2,dati_a(r4)
lw r3,dati_b(r5)
addi r4,r4,4
addi r5,r5,4
add r1,r2,r3
sw dati_a(r6),r1
addi r6,r6,4
j loop
Idea: increase the distance of the add from the last lw.
Re-Scheduling results
Scheduled code decreases CPI to 11/8
2nd part
– Now assume that forwarding is possible.
– Insert the needed bubbles/aborts in the right place to solve hazards.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
– Calculate the asymptotic CPI of the two loops.
– Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes
Forwarding Results
With forwarding but without rescheduling we obtain CPI = 10/8
Re-scheduling
We use the same re-scheduled code.
By rescheduling the loop we obtain CPI = 9/8
Speedup Results
Total requested speedup is:
\[ \frac{\text{CPI}[\text{unscheduled, unforwarded}]}{\text{CPI}[\text{scheduled, forwarded}]} = \frac{13/8}{9/8} = \frac{13}{9} \approx 1.44 \]
3rd Exercise

loop:   lw    r2,dati_a(r1)
        addi  r2,r2,4
        lw    r3,dati_b(r1)
        addi  r3,r3,4
        lw    r4,dati_a(r1)
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop
1st part
– Assume no forwarding is possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate the asymptotic CPI of the loop.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes
Bubbles insertion
11 instructions, 1 nop, 12 stalls => CPI= 24/11
Rescheduled code

loop:   lw    r2,dati_a(r1)
        lw    r3,dati_b(r1)
        lw    r4,dati_a(r1)
        addi  r2,r2,4
        addi  r3,r3,4
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop

Idea: perform the computations after all data has been loaded.
Scheduled code results
11 instr., 1 nop, 7 stalls => CPI=19/11
2nd part
– Now assume that forwarding is possible.
– Insert the needed bubbles/aborts in the right place to solve hazards.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
– Calculate the asymptotic CPI of the loop.
– Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes
Bubbles insertion (with forwarding)
11 instr. + 1 NOP + 4 stalls => CPI=16/11
Rescheduling Results
11 instr. + 1 NOP + 1 stall => CPI=13/11
Requested Speedup=24/13
DLX FPU Pipeline
– Latency of a FU: number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (-1).
– Initiation interval of a FU: time that must elapse between issuing two operations to the same FU.
– A stall in a pipeline does not mean a stall in the entire processor.
FPU Latencies and I.I.
(WINDLX default latencies)

FU                        Latency   Initiation Interval
Integer ALU               0         1
FP add                    1         1
FP and integer multiply   4         1
FP and integer divide     18        19 [structural hazards!]
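Read together with the two definitions of latency and initiation interval, the table gives the number of stall cycles directly. The sketch below is one possible reading, not the simulator's logic: `raw_stalls` assumes forwarding and counts how many of the required intervening cycles are missing, and `structural_stalls` does the same for two operations sent to one unit. All names are illustrative.

```python
# (latency, initiation interval) per functional unit, copied from the table
FU_TABLE = {
    "integer_alu": (0, 1),
    "fp_add": (1, 1),
    "multiply": (4, 1),
    "divide": (18, 19),
}


def raw_stalls(producer_fu, distance):
    """Stalls for a consumer issued `distance` instructions after its
    producer (distance 1 means the very next instruction)."""
    latency, _ = FU_TABLE[producer_fu]
    return max(0, latency - (distance - 1))


def structural_stalls(fu, distance):
    """Stalls for a second operation sent to the same unit `distance`
    instructions after the first one."""
    _, interval = FU_TABLE[fu]
    return max(0, interval - distance)


print(raw_stalls("multiply", 1))        # consumer right after a multiply
print(structural_stalls("divide", 1))   # two back-to-back divides
```

Only the divide unit has an initiation interval above 1, which is why it is the one flagged for structural hazards.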
Problems with FPUs
– Divide instructions can provoke structural hazards and need to be stalled in the ID stage.
– More than one write to the register file can be pending in the same cycle.
– WAW hazards are possible because WB can be reached out of order.
– RAW hazards are more frequent due to the longer latency of operations.
Register file structural hazard solution
– Structural hazards on the register file
– Solution: stall one of the instructions before it enters the MEM stage.
FPU WAW Hazards

        ld    f6,dati_a(r2)
        ld    f2,dati_b(r3)
        multd f6,f2,f4
        subd  f6,f2,f2
        addd  f6,f8,f2

subd finishes before multd! There is a WAW conflict: if we don't stall subd, multd will overwrite its result.
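The out-of-order completion can be made visible with an idealized timing table. This is a simplified sketch: it ignores RAW stalls entirely, and the EX lengths are an assumption (latency + 1 cycles, taken from the WINDLX default latencies). The names `EX_CYCLES`, `writeback_cycles` and `waw_violations` are made up for the example.

```python
# Assumed EX-stage lengths in cycles (latency + 1 with the default latencies)
EX_CYCLES = {"ld": 1, "multd": 5, "subd": 2, "addd": 2}


def writeback_cycles(ops):
    """Ideal, stall-free timing: instruction k (0-based) is fetched in
    cycle k + 1, decoded in k + 2, executes for EX_CYCLES[op] cycles,
    then spends one cycle in MEM and writes back in the next one."""
    timed = []
    for k, (op, dest) in enumerate(ops):
        id_cycle = k + 2
        timed.append((op, dest, id_cycle + EX_CYCLES[op] + 2))
    return timed


def waw_violations(ops):
    """Pairs (earlier, later) that write the same register where the
    later instruction would reach WB no later than the earlier one."""
    timed = writeback_cycles(ops)
    return [(a[0], b[0])
            for i, a in enumerate(timed)
            for b in timed[i + 1:]
            if a[1] == b[1] and b[2] <= a[2]]


ops = [("ld", "f6"), ("ld", "f2"), ("multd", "f6"), ("subd", "f6"), ("addd", "f6")]
print(writeback_cycles(ops))
print(waw_violations(ops))
```

In this simplified model addd also reaches WB before multd, so stalling subd alone would not be enough; the real pipeline additionally inserts RAW stalls that this sketch leaves out.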
Exercise: execute only one iteration of this loop:
loop: ld f0,dati_a(r2)
ld f4,dati_b(r3)
multd f0,f0,f4
addd f2,f0,f2
addi r2,r2,8
addi r3,r3,8
sub r5,r4,r2
bnez r5,loop
How many cycles between the IF of the 1st ld and the WB of the 1st bnez?