![Page 1: Embedded Computer Architectures Hennessy & Patterson Chapter 4 Exploiting ILP with Software Approaches Gerard Smit (Zilverling 4102), smit@cs.utwente.nlsmit@cs.utwente.nl](https://reader036.vdocuments.net/reader036/viewer/2022062519/5697c0151a28abf838cce0ee/html5/thumbnails/1.jpg)
Embedded Computer Architectures
Hennessy & Patterson, Chapter 4
Exploiting ILP with Software Approaches
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), [email protected]
Contents
• Introduction
• Processor Architecture
• Loop Unrolling
• Software Pipelining
Introduction

| Common name | Issue structure | Hazard detection | Scheduling | Characteristic | Examples |
|---|---|---|---|---|---|
| Superscalar (static) | dynamic | hardware | static | In-order execution | Sun UltraSPARC II/III |
| Superscalar (dynamic) | dynamic | hardware | dynamic | Some out-of-order execution | IBM Power2 |
| Superscalar (speculative) | dynamic | hardware | dynamic with speculation | Out-of-order execution | Pentium III |
| VLIW | static | software | static | No hazards between issue packets | Trimedia, i860 |
| EPIC | mostly static | mostly software | mostly static | Explicit dependencies marked by compiler | Itanium |
Processor Architecture
• 5 stage pipeline
• Static scheduling
• Integer and Floating Point unit
Integer pipeline:        IF ID INTEX MEM WB
Floating point pipeline: IF ID FPEX FPEX FPEX FPEX MEM WB
Processor Architecture
• Latencies:
—Integer ALU => Integer ALU: no latency
Int. ALU: IF ID INTEX MEM WB
Int. ALU:    IF ID INTEX MEM WB
—Floating point ALU => Floating point ALU: latency = 3
FP ALU: IF ID FPEX FPEX FPEX FPEX MEM WB
FP ALU:    IF ID Stall Stall Stall FPEX FPEX FPEX FPEX MEM WB
Processor Architecture
• Latencies:
—Load Memory => Store Memory: no latency
Load:  IF ID EX MEM WB
Store:    IF ID EX MEM WB
Processor Architecture
• Latencies:
—Integer ALU => Store Memory: no latency
Int. ALU: IF ID INTEX MEM WB
Store:       IF ID EX MEM WB
—Floating point ALU => Store Memory: latency = 2
FP ALU: IF ID FPEX FPEX FPEX FPEX MEM WB
Store:     IF ID Stall Stall EX MEM WB
Processor Architecture
• Latencies:
—Load Memory => Integer ALU: latency = 1
Load:     IF ID EX MEM WB
Int. ALU:    IF ID Stall INTEX MEM WB
—Load Memory => Floating point ALU: latency = 1
Load:   IF ID EX MEM WB
FP ALU:    IF ID Stall FPEX FPEX FPEX FPEX MEM WB
Processor Architecture
• Latencies:
—Integer ALU => Branch: latency = 1
Int. ALU: IF ID EX MEM WB
Branch:      IF Stall ID INTEX MEM WB
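The latencies on these slides can be collected in one lookup table. The following is a minimal Python sketch, not part of the original slides; the unit names and the helper `stalls_needed` are made up for illustration.

```python
# Producer -> consumer latencies, transcribed from the latency slides.
# The value is the number of stall cycles if the consumer follows immediately.
LATENCY = {
    ("int_alu", "int_alu"): 0,
    ("fp_alu", "fp_alu"): 3,
    ("load", "store"): 0,
    ("int_alu", "store"): 0,
    ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1,
    ("load", "fp_alu"): 1,
    ("int_alu", "branch"): 1,
}

def stalls_needed(producer, consumer, gap=0):
    """Stall cycles when `consumer` depends on `producer` and `gap`
    independent instructions already sit between the two."""
    return max(0, LATENCY[(producer, consumer)] - gap)

print(stalls_needed("load", "fp_alu"))           # L.D then ADD.D
print(stalls_needed("fp_alu", "store"))          # ADD.D then S.D
print(stalls_needed("fp_alu", "store", gap=1))   # one instruction in between
```

Every independent instruction placed between producer and consumer removes one stall, which is what the scheduling in the loop unrolling slides exploits.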
Loop Unrolling
• For i:=1000 downto 1 do x[i] := x[i]+s;
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        S.D    0(R1),F4    ; x[i] ← x[i]+s
        DADDUI R1,R1,#-8   ; i ← i-1
        BNE    R1,R2,Loop  ; repeat if i≠0
        NOP                ; branch delay slot
• Registers:
—R1: pointer within array
—F2: value to be added (s)
—R2: last element in array
—F0: value in array
—F4: value to be written in array
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        S.D    0(R1),F4    ; x[i] ← x[i]+s
        DADDUI R1,R1,#-8   ; i ← i-1
        BNE    R1,R2,Loop  ; repeat if i≠0
        NOP                ; branch delay slot
• Load Memory => FP ALU: 1 stall
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        S.D    0(R1),F4    ; x[i] ← x[i]+s
        DADDUI R1,R1,#-8   ; i ← i-1
        BNE    R1,R2,Loop  ; repeat if i≠0
        NOP                ; branch delay slot
• FP ALU => Store Memory: 2 stalls
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        stall
        stall
        S.D    0(R1),F4    ; x[i] ← x[i]+s
        DADDUI R1,R1,#-8   ; i ← i-1
        BNE    R1,R2,Loop  ; repeat if i≠0
        NOP                ; branch delay slot
• Integer ALU => Branch: 1 stall
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        stall
        stall
        S.D    0(R1),F4    ; x[i] ← x[i]+s
        DADDUI R1,R1,#-8   ; i ← i-1
        stall
        BNE    R1,R2,Loop  ; repeat if i≠0
        NOP                ; branch delay slot
• A smart compiler can reorder these instructions
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        DADDUI R1,R1,#-8   ; i ← i-1
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        stall
        BNE    R1,R2,Loop  ; repeat if i≠0
        S.D    8(R1),F4    ; x[i] ← x[i]+s
• With ADD.D between DADDUI and BNE, the Integer ALU => Branch stall is gone; the one remaining stall keeps S.D two cycles behind ADD.D (FP ALU => Store Memory, latency 2)
• From 10 cycles per loop to 6 cycles per loop
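The two cycle counts can be checked with a small in-order issue model. This is a rough Python sketch, not part of the original slides: the instruction encoding and `count_cycles` are hypothetical, and the model only counts cycles, it does not decide where a stall is placed.

```python
# Extra cycles a dependent instruction must wait after its producer,
# taken from the latency slides; pairs not listed are treated as 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1,
    ("load", "fp_alu"): 1,
    ("int_alu", "branch"): 1,
}

def count_cycles(seq):
    """seq: list of (unit, destination register or None, source registers)."""
    produced = {}  # register -> (producing unit, issue cycle)
    cycle = 0
    for unit, dest, srcs in seq:
        issue = cycle + 1  # in-order: at the earliest one cycle after the previous issue
        for reg in srcs:
            if reg in produced:
                p_unit, p_cycle = produced[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_unit, unit), 0))
        cycle = issue
        if dest is not None:
            produced[dest] = (unit, issue)
    return cycle

original = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    0(R1),F4
    ("int_alu", "R1", ["R1"]),       # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("nop", None, []),               # branch delay slot
]
scheduled = [
    ("load", "F0", ["R1"]),          # L.D
    ("int_alu", "R1", ["R1"]),       # DADDUI
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D
    ("branch", None, ["R1", "R2"]),  # BNE
    ("store", None, ["F4", "R1"]),   # S.D in the delay slot
]
print(count_cycles(original), count_cycles(scheduled))  # 10 6
```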
Loop Unrolling
• Loop: L.D    F0,0(R1)    ; F0 ← x[i]
        DADDUI R1,R1,#-8   ; i ← i-1
        ADD.D  F4,F0,F2    ; F4 ← x[i]+s
        BNE    R1,R2,Loop  ; repeat if i≠0
        S.D    8(R1),F4    ; x[i] ← x[i]+s
• 5 instructions
—3 ‘doing the job’
—2 control or ‘overhead’
• Reduce overhead => loop unrolling
—Add code
—From 1000 iterations to 500 iterations
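At source level, unrolling by a factor of 2 looks as follows. This is an illustrative Python sketch rather than the MIPS code of the slides; the function names are made up, and the unrolled version assumes an even number of elements (as with the 1000 elements in the example).

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, counting down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s

def add_scalar_unrolled2(x, s):
    """Unrolled by 2: two elements per iteration, half the loop overhead."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch assumes an even number of elements")
    for i in range(len(x) - 1, 0, -2):
        x[i] = x[i] + s          # first copy of the loop body
        x[i - 1] = x[i - 1] + s  # second copy, with adjusted 'data pointer'

a = [1.0, 2.0, 3.0, 4.0]
b = list(a)
add_scalar(a, 0.5)
add_scalar_unrolled2(b, 0.5)
print(a == b)  # True
```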
Loop Unrolling
• Original Code Sequence:
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Copy the L.D / ADD.D / S.D part, with correct ‘data pointer’
Loop Unrolling
• Unrolled Code Sequence (stall counts are the stall cycles that follow the instruction):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F0,-8(R1)   ; F0 ← x[i-1]      (1 stall)
      ADD.D  F4,F0,F2    ; F4 ← x[i-1]+s    (2 stalls)
      S.D    -8(R1),F4   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2          (1 stall)
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• There are still a lot of stalls. Removing them is easier if some additional registers are used
Loop Unrolling
• Unrolled Code Sequence (second copy uses registers F6 and F8):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F6,-8(R1)   ; F6 ← x[i-1]      (1 stall)
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s    (2 stalls)
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2          (1 stall)
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
Loop Unrolling
• Unrolled Code Sequence (second load moved up):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Stall annotations on the slide, in order: 1 stall, 1 stall, 2 stalls, 1 stall
Loop Unrolling
• Unrolled Code Sequence (second add moved up):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Stall annotations on the slide, in order: 2 stalls, 1 stall
• Moving DADDUI above the stores requires store offsets +16 and +8
Loop Unrolling
• Unrolled Code Sequence (DADDUI moved above the stores):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
Loop Unrolling
• Unrolled Code Sequence (last store in the branch delay slot):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      BNE    R1,R2,Loop  ; repeat if i≠0
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s

| Clock cycles | Original loop (1000 times) | Unrolled loop (500 times) | Savings |
|---|---|---|---|
| L.D instructions | 1000 | 1000 | 0 |
| ADD.D instructions | 1000 | 1000 | 0 |
| S.D instructions | 1000 | 1000 | 0 |
| DADDUI instructions | 1000 | 500 | 500 |
| BNE instructions | 1000 | 500 | 500 |
| Stall cycles | 1000 | 0 | 1000 |
| Totals | 6000 | 4000 | 2000 |
Loop Unrolling
• In example: loop-unrolling factor 2
• In general: loop-unrolling factor k
• Limitations concerning k
—Amdahl's law: 3000 cycles are always needed
—Increasing k => increasing number of registers
—Increasing k => increasing code size
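The effect of the unrolling factor on the cycle count can be put into a small formula. This is a back-of-the-envelope Python sketch, not from the slides; `total_cycles` is a made-up name, and it assumes that for every k ≥ 2 all stalls can be scheduled away as in the k = 2 example.

```python
def total_cycles(n=1000, k=1):
    """Cycles to process n elements with loop-unrolling factor k."""
    if k < 1 or n % k != 0:
        raise ValueError("this sketch assumes k >= 1 and k divides n")
    work = 3 * n                  # L.D, ADD.D and S.D for every element
    overhead = 2 * (n // k)       # DADDUI and BNE once per iteration
    stalls = n if k == 1 else 0   # assumption: no stalls left once k >= 2
    return work + overhead + stalls

for k in (1, 2, 4, 1000):
    print(k, total_cycles(1000, k))
```

For k = 1 and k = 2 this reproduces the 6000 and 4000 cycles of the table. Larger k only shrinks the overhead term, so the total approaches but never drops below the 3000 cycles of real work.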
Software Pipelining
• Original loop (not unrolled; stall counts are the stall cycles that follow the instruction):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1          (1 stall)
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Three actions involved with actual calculations:
F0 ← x[i]
F4 ← x[i] + s
x[i] ← x[i] + s
• Consider these as three different stages
Software Pipelining
• Original loop (not unrolled):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Three actions involved with actual calculations:
F0 ← x[i]          Stage 1
F4 ← x[i] + s      Stage 2
x[i] ← x[i] + s    Stage 3
• Associate array element with the stages
Software Pipelining
• Original loop (not unrolled):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Three actions involved with actual calculations:
F0 ← x[i]          Stage 1, x[i]
F4 ← x[i] + s      Stage 2, x[i]
x[i] ← x[i] + s    Stage 3, x[i]
Software Pipelining
• Normal Execution
[Figure: time axis against Stage 1, Stage 2 and Stage 3, with the state (empty or occupied) of registers F0 and F4. X[1000] passes through all three stages, then X[999], then X[998]. Each element incurs 1 stall after Stage 1 and 2 stalls after Stage 2.]
• Stage 1: fill F0
• Stage 2: read F0, fill F4
• Stage 3: read F4
Software Pipelining
• Software Pipelined Execution
[Figure: time axis against Stage 1, Stage 2 and Stage 3, with the state (empty or occupied) of registers F0 and F4. The stages of X[1000], X[999] and X[998] overlap in time, and X[997] enters Stage 1 next. Stall annotations, in order: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls, 1 stall.]
• Stage 1: fill F0
• Stage 2: read F0, fill F4
• Stage 3: read F4
Software Pipelining
• Software Pipelined Execution
      L.D    F0,0(R1)    ; F0 ← x[1000]       Stage 1
      ADD.D  F4,F0,F2    ; F4 ← x[1000] + s   Stage 2
      L.D    F0,-8(R1)   ; F0 ← x[999]        Stage 1
Loop: S.D    0(R1),F4    ; x[i] ← F4          Stage 3, x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i-1] + s    Stage 2, x[i-1]
      L.D    F0,-16(R1)  ; F0 ← x[i-2]        Stage 1, x[i-2]
      BNE    R1,R2,Loop  ; repeat if i≠1
      DADDUI R1,R1,#-8   ; i ← i-1
• Stall annotations on the slide, in order: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls
Software Pipelining
• Software Pipelined Execution
      L.D    F0,0(R1)    ; F0 ← x[1000]       Stage 1
      ADD.D  F4,F0,F2    ; F4 ← x[1000] + s   Stage 2
      L.D    F0,-8(R1)   ; F0 ← x[999]        Stage 1
Loop: S.D    0(R1),F4    ; x[i] ← F4          Stage 3, x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i-1] + s    Stage 2, x[i-1]
      L.D    F0,-16(R1)  ; F0 ← x[i-2]        Stage 1, x[i-2]
      BNE    R1,R2,Loop  ; repeat if i≠1
      DADDUI R1,R1,#-8   ; i ← i-1
• Stall annotations on the slide, in order: 1 stall, 1 stall, 0 stalls, 0 stalls, 0 stalls
Software Pipelining
• No stalls inside loop
• Additional start-up (and clean-up) code
• No reduction of control overhead
• No additional registers
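The structure of a software-pipelined loop (start-up, kernel, clean-up) can be mimicked in a high-level language. This is an illustrative Python sketch, not the slides' MIPS code: the function name is made up, Python indices run from 0 instead of 1, and the sketch assumes at least three elements.

```python
def add_scalar_pipelined(x, s):
    """Software-pipelined x[i] = x[i] + s, counting down.
    f0 and f4 play the role of registers F0 and F4."""
    n = len(x)
    if n < 3:
        raise ValueError("this sketch assumes at least three elements")
    i = n - 1
    # Start-up code: fill the pipeline.
    f0 = x[i]          # Stage 1 for the last element
    f4 = f0 + s        # Stage 2 for the last element
    f0 = x[i - 1]      # Stage 1 for the element before it
    # Kernel: every pass touches three different elements.
    while i >= 2:
        x[i] = f4      # Stage 3 for x[i]
        f4 = f0 + s    # Stage 2 for x[i-1]
        f0 = x[i - 2]  # Stage 1 for x[i-2]
        i -= 1
    # Clean-up code: drain the pipeline (i == 1 here).
    x[i] = f4          # Stage 3 for x[1]
    x[i - 1] = f0 + s  # Stages 2 and 3 for x[0]

data = [1.0, 2.0, 3.0, 4.0, 5.0]
add_scalar_pipelined(data, 10.0)
print(data)  # [11.0, 12.0, 13.0, 14.0, 15.0]
```

Only two temporaries are needed, matching the "no additional registers" point, while the loop still runs once per element, matching "no reduction of control overhead".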
VLIW
• To simplify processor hardware: sophisticated compilers (loop unrolling, software pipelining etc.)
• Extreme form: Very Long Instruction Word processors
VLIW
• Superscalar: hardware between the instructions and the execution units performs grouping, execution unit assignment and initiation
• VLIW: no such hardware between the instructions and the execution units
VLIW
• Suppose 4 functional units
—Memory load unit
—Floating point unit
—Memory store unit
—Integer/Branch unit
• Instruction

| Memory load | FP operation | Memory store | Integer/Branch |
VLIW
• Original loop (not unrolled; stall counts are the stall cycles that follow the instruction):
Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1          (1 stall)
      BNE    R1,R2,Loop  ; repeat if i≠0
      NOP                ; branch delay slot
• Instruction words for the first three instructions (a stall row is an instruction word with no operation):

| Memory load | FP operation | Memory store | Integer/Branch |
|---|---|---|---|
| L.D | | | |
| stall | | | |
| | ADD.D | | |
| stall | | | |
| stall | | | |
| | | S.D | |

• Limit stall cycles by clever compilers (loop unrolling, software pipelining)
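How poorly the unscheduled loop fills the instruction words can be quantified. This is a small Python sketch, not from the slides; the slot names and the `utilization` helper are made up, and only the six words shown on the slide are counted.

```python
SLOTS = ("load", "fp", "store", "int_branch")

# One dict per very long instruction word; an empty dict is a stall cycle.
words = [
    {"load": "L.D"},
    {},
    {"fp": "ADD.D"},
    {},
    {},
    {"store": "S.D"},
]

def utilization(words):
    """Fraction of operation slots that hold a real operation."""
    used = sum(len(word) for word in words)
    return used / (len(words) * len(SLOTS))

print(utilization(words))  # 0.125
```

Only 3 of the 24 slots do useful work here, which is why a VLIW depends on the compiler to fill the remaining slots through loop unrolling and software pipelining.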
VLIW
• Superscalar: hardware between the instructions and the execution units performs grouping, execution unit assignment and initiation
• Dynamic VLIW: hardware between the instructions and the execution units performs initiation only
Dynamic VLIW
• VLIW: no caches because no hardware to deal with cache misses
• Dynamic VLIW: Hardware to stall on a cache miss.
• Not used frequently
VLIW
• Dynamic VLIW: hardware between the instructions and the execution units performs initiation only
• Explicitly Parallel Instruction Computing (EPIC): hardware between the instructions and the execution units performs execution unit assignment and initiation
EPIC
• IA-64 architecture by HP and Intel
• IA-64 is an instruction set architecture intended for implementation on EPIC
• Itanium is the first Intel product
• 64-bit architecture
• Basic concepts:
—Instruction level parallelism indicated by compiler
—Long or very long instruction words
—Branch predication (≠ prediction)
—Speculative loading
Key Features
• Large number of registers
—IA-64 instruction format assumes 256
– 128 * 64 bit integer, logical & general purpose
– 128 * 82 bit floating point and graphic
—64 * 1 bit predicated execution registers (see later)
—To support high degree of parallelism
• Multiple execution units
—Expected to be 8 or more
—Depends on number of transistors available
—Execution of parallel instructions depends on hardware available
– 8 parallel instructions may be split into two lots of four if only four execution units are available
IA-64 Execution Units
• I-Unit
—Integer arithmetic
—Shift and add
—Logical
—Compare
—Integer multimedia ops
• M-Unit
—Load and store
– Between register and memory
—Some integer ALU
• B-Unit
—Branch instructions
• F-Unit
—Floating point instructions
Instruction Format Diagram
Instruction Format
• 128 bit bundle
—Holds three instructions (syllables) plus template
—Can fetch one or more bundles at a time
—Template contains info on which instructions can be executed in parallel
– Not confined to single bundle
– e.g. a stream of 8 instructions may be executed in parallel
– Compiler will have re-ordered instructions to form contiguous bundles
– Can mix dependent and independent instructions in same bundle
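A bundle can be modelled as a 128-bit integer. This is a hedged Python sketch, not from the slides: the slides give no field widths, so the 5-bit template in the low bits followed by three 41-bit instruction slots is taken from the IA-64 architecture definition, and `pack_bundle` / `unpack_bundle` are hypothetical helpers.

```python
TEMPLATE_BITS = 5   # template field (not stated on the slide)
SLOT_BITS = 41      # one instruction slot (not stated on the slide)

def pack_bundle(template, slots):
    """Combine a template and three instruction slots into one 128-bit value."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for n, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & mask for n in range(3)]
    return template, slots

b = pack_bundle(0x10, [1, 2, 3])
print(TEMPLATE_BITS + 3 * SLOT_BITS, unpack_bundle(b))
```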
Assembly Language Format
• [qp] mnemonic [.comp] dest = srcs //
• qp - predicate register
—If 1 at execution time: execute and commit result to hardware
—If 0: result is discarded
• mnemonic - name of instruction
• comp - one or more instruction completers used to qualify mnemonic
• dest - one or more destination operands
• srcs - one or more source operands
• // - comment
• Instruction groups and stops indicated by ;;
—A group is a sequence without read after write or write after write dependencies
—Do not need hardware register dependency checks
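Where a compiler has to place stops can be illustrated with a simple pass over the instruction stream. This is a simplified Python sketch, not from the slides: `split_into_groups` is a made-up helper, and the real IA-64 rules contain exceptions that are ignored here.

```python
def split_into_groups(instrs):
    """instrs: list of (text, destination registers, source registers).
    A new group starts (a ';;' stop is needed) as soon as an instruction
    reads or writes a register written earlier in the current group."""
    groups, current, written = [], [], set()
    for text, dests, srcs in instrs:
        raw = written & set(srcs)    # read after write inside the group
        waw = written & set(dests)   # write after write inside the group
        if current and (raw or waw):
            groups.append(current)
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups

program = [
    ("ld8 r1 = [r5]", ["r1"], ["r5"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
    ("add r6 = r7, r8", ["r6"], ["r7", "r8"]),
]
for group in split_into_groups(program):
    print(group)
```

The second instruction reads r1, which the first one writes, so it opens a new group; the third instruction is independent and joins it.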
Assembly Examples
ld8 r1 = [r5] ;;   //first group
add r3 = r1, r4    //second group
• Second instruction depends on value in r1
—Changed by first instruction
—Can not be in same group for parallel execution
Predication
• Pseudo code:
if a == 0
then j = j+1
else k = k+1
• Using branches:
    cmp a,0
    jne L1
    add j,1
    jmp L2
L1: add k,1
L2:
• Predicated:
cmp.eq p1, p2 = 0, a ;;
(p1) add j = 1, j
(p2) add k = 1, k
—If a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1
• Annotation on the slide: should NOT be there to enable parallelism
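Predicated execution can be emulated in a few lines. This is an illustrative Python sketch, not from the slides: the function names are made up, and the conditional expressions stand in for the hardware deciding whether a result is committed.

```python
def with_branches(a, j, k):
    """The pseudo code: only one of the two additions is executed."""
    if a == 0:
        j = j + 1
    else:
        k = k + 1
    return j, k

def predicated(a, j, k):
    """Both additions are computed; the predicates decide which result is kept."""
    p1 = (a == 0)      # cmp.eq p1, p2 = 0, a
    p2 = not p1
    j_result = j + 1   # (p1) add j = 1, j
    k_result = k + 1   # (p2) add k = 1, k
    j = j_result if p1 else j   # committed only if p1 is 1
    k = k_result if p2 else k   # committed only if p2 is 1
    return j, k

for a in (0, 5):
    print(a, with_branches(a, 10, 20), predicated(a, 10, 20))
```

The two additions in `predicated` do not depend on each other, so they could be issued in the same cycle.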
Speculative Loading
Data Speculation
• Without speculation (the load cannot be moved above the store, and the add stalls waiting for it):
st8 [r4] = r12
ld8 r6 = [r8] ;;
add r5 = r6, r7 ;;
st8 [r18] = r5
• What if r4 contains the same address as r8?
• With an advanced load:
ld8.a r6 = [r8] ;;   // advanced load
st8 [r4] = r12
ld8.c r6 = [r8] ;;   // check load
add r5 = r6, r7 ;;
st8 [r18] = r5
—Advanced load: writes source address (contents of r8) to Advanced Load Address Table (ALAT)
—Each store checks ALAT and removes entry if match
—Check load: if no matching entry in ALAT, load is performed again
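The interplay of advanced load, store and check load can be simulated. This is a simplified Python sketch, not from the slides: `AlatMachine` is a made-up model that keys the ALAT by target register name and ignores table capacity and access sizes.

```python
class AlatMachine:
    """Toy memory plus Advanced Load Address Table (ALAT)."""

    def __init__(self, memory):
        self.mem = dict(memory)
        self.alat = {}      # target register name -> address of its advanced load
        self.reloads = 0    # how often a check load had to repeat the load

    def ld_a(self, reg, addr):
        """Advanced load: load early and record the address in the ALAT."""
        self.alat[reg] = addr
        return self.mem[addr]

    def st(self, addr, value):
        """Store: write memory and remove every ALAT entry for that address."""
        self.mem[addr] = value
        self.alat = {r: a for r, a in self.alat.items() if a != addr}

    def ld_c(self, reg, addr, current):
        """Check load: keep the early value if the entry survived, else reload."""
        if reg in self.alat:
            return current
        self.reloads += 1
        return self.mem[addr]

def run(r4, r8):
    m = AlatMachine({100: 7, 200: 1})
    r6 = m.ld_a("r6", r8)        # ld8.a r6 = [r8]
    m.st(r4, 12)                 # st8 [r4] = r12 (r12 holds 12)
    r6 = m.ld_c("r6", r8, r6)    # ld8.c r6 = [r8]
    return r6, m.reloads

print(run(r4=200, r8=100))  # different addresses: early value kept
print(run(r4=100, r8=100))  # same address: load performed again
```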
Control & Data Speculation
• Control Speculation
—AKA Speculative loading
—Load data from memory before needed
• Data Speculation
—Load moved before store that might alter memory location
—Subsequent check of the loaded value