
Page 1: Embedded Computer Architectures

Embedded Computer Architectures

Hennessy & Patterson, Chapter 4

Exploiting ILP with Software Approaches

Gerard Smit (Zilverling 4102), smit@cs.utwente.nl

André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

Page 2: Embedded Computer Architectures

Contents

• Introduction
• Processor Architecture
• Loop Unrolling
• Software Pipelining

Page 3: Embedded Computer Architectures

Introduction

Common name               | Issue structure | Hazard detection | Scheduling               | Characteristic                           | Examples
Superscalar (static)      | dynamic         | hardware         | static                   | in-order execution                       | Sun UltraSPARC II/III
Superscalar (dynamic)     | dynamic         | hardware         | dynamic                  | some out-of-order execution              | IBM Power2
Superscalar (speculative) | dynamic         | hardware         | dynamic with speculation | out-of-order execution                   | Pentium III
VLIW                      | static          | software         | static                   | no hazards between issue packets         | Trimedia, i860
EPIC                      | mostly static   | mostly software  | mostly static            | explicit dependencies marked by compiler | Itanium

Page 4: Embedded Computer Architectures

Processor Architecture

• 5-stage pipeline
• Static scheduling
• Integer and floating-point unit

Pipeline diagram:
  Integer: IF  ID  INTEX  MEM  WB
  FP:      IF  ID  FPEX   FPEX  FPEX  FPEX  MEM  WB

Page 5: Embedded Computer Architectures

Processor Architecture

• Latencies:
  Integer ALU => Integer ALU: no latency
  FP ALU => FP ALU: latency = 3

(pipeline diagram: a dependent FP ALU operation that immediately follows another FP ALU operation stalls three cycles before entering FPEX; back-to-back integer ALU operations need no stall)

Page 6: Embedded Computer Architectures

Processor Architecture

• Latencies:
  Load memory => Store memory: no latency

(pipeline diagram: a store can directly follow the load that produces its data)

Page 7: Embedded Computer Architectures

Processor Architecture

• Latencies:
  Integer ALU => Store memory: no latency
  FP ALU => Store memory: latency = 2

(pipeline diagram: a store that immediately follows the FP ALU operation producing its data stalls two cycles)

Page 8: Embedded Computer Architectures

Processor Architecture

• Latencies:
  Load memory => Integer ALU: latency = 1
  Load memory => FP ALU: latency = 1

(pipeline diagram: an ALU operation that immediately follows the load producing its operand stalls one cycle)

Page 9: Embedded Computer Architectures

Processor Architecture

• Latencies:
  Integer ALU => Branch: latency = 1

(pipeline diagram: a branch that immediately follows the integer ALU operation it depends on stalls one cycle)

Page 10: Embedded Computer Architectures

Loop Unrolling

• For i := 1000 downto 1 do x[i] := x[i] + s;   (a C-level version is sketched below)

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,#-8   ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        NOP                ; branch delay slot

• R1: pointer within the array
  F2: value to be added (s)
  R2: last element in the array
  F0: value read from the array
  F4: value to be written to the array
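For reference, a minimal C-level sketch of the loop being compiled (the slides use a Pascal-style loop; the 1-based indexing and the function name below are assumptions made for illustration):

    /* Add the scalar s to every element of x[1..1000], walking downwards,
       as in the slides' "for i := 1000 downto 1" loop. */
    void add_scalar(double x[], double s) {
        for (int i = 1000; i >= 1; i--)
            x[i] = x[i] + s;
    }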

Page 11: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,#-8   ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        NOP                ; branch delay slot

Load memory => FP ALU: 1 stall (between L.D and ADD.D)

Page 12: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,#-8   ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        NOP                ; branch delay slot

FP ALU => Store memory: 2 stalls (between ADD.D and S.D)

Page 13: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        stall
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,#-8   ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        NOP                ; branch delay slot

Integer ALU => Branch: 1 stall (between DADDUI and BNE)

Page 14: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        stall
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,#-8   ; i <- i - 1
        stall
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        NOP                ; branch delay slot

A smart compiler can reschedule this code.

Page 15: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        DADDUI R1,R1,#-8   ; i <- i - 1
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        S.D    8(R1),F4    ; x[i] <- x[i] + s   (in the branch delay slot)

One stall remains: BNE fills only one of the two cycles needed between ADD.D and S.D (FP ALU => Store memory).

From 10 cycles per iteration down to 6 cycles per iteration.
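A rough cycle count behind the "10 to 6" claim, per iteration of the two versions just shown:

  Unscheduled: L.D, stall, ADD.D, stall, stall, S.D, DADDUI, stall, BNE, NOP  = 10 cycles
  Scheduled:   L.D, DADDUI, ADD.D, stall, BNE, S.D                            =  6 cycles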

Page 16: Embedded Computer Architectures

Loop Unrolling

• Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        DADDUI R1,R1,#-8   ; i <- i - 1
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        BNE    R1,R2,Loop  ; repeat if i ≠ 0
        S.D    8(R1),F4    ; x[i] <- x[i] + s

• 5 instructions:
  - 3 'doing the job'
  - 2 control or 'overhead'

• Reduce overhead => loop unrolling:
  - add code
  - from 1000 iterations to 500 iterations
  (a source-level sketch follows below)
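As a rough source-level illustration of what unrolling by a factor of 2 does (a sketch under the slides' assumptions of 1-based indexing and an even trip count; the function name is made up):

    /* Two copies of the loop body per iteration: the loop-control
       overhead (decrement + branch) is paid once per two elements. */
    void add_scalar_unrolled2(double x[], double s) {
        for (int i = 1000; i >= 2; i -= 2) {
            x[i]     = x[i]     + s;   /* original body, element i   */
            x[i - 1] = x[i - 1] + s;   /* copied body,  element i-1  */
        }
    }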

Page 17: Embedded Computer Architectures

Loop Unrolling

• Original code sequence:

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,#-8   ; i <- i - 1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Copy the L.D / ADD.D / S.D part, with the correct 'data pointer' (address offset).

Page 18: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence:

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      (1 stall)
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      (2 stalls)
      S.D    0(R1),F4     ; x[i] <- x[i] + s
      L.D    F0,-8(R1)    ; F0 <- x[i-1]
      (1 stall)
      ADD.D  F4,F0,F2     ; F4 <- x[i-1] + s
      (2 stalls)
      S.D    -8(R1),F4    ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      (1 stall)
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      NOP                 ; branch delay slot

• There are still a lot of stalls. Removing them is easier if some additional registers are used.

Page 19: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence (with additional registers F6 and F8):

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      (1 stall)
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      (2 stalls)
      S.D    0(R1),F4     ; x[i] <- x[i] + s
      L.D    F6,-8(R1)    ; F6 <- x[i-1]
      (1 stall)
      ADD.D  F8,F6,F2     ; F8 <- x[i-1] + s
      (2 stalls)
      S.D    -8(R1),F8    ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      (1 stall)
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      NOP                 ; branch delay slot

Page 20: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence (loads moved up):

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      L.D    F6,-8(R1)    ; F6 <- x[i-1]
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      S.D    0(R1),F4     ; x[i] <- x[i] + s
      ADD.D  F8,F6,F2     ; F8 <- x[i-1] + s
      S.D    -8(R1),F8    ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      NOP                 ; branch delay slot

Remaining stall annotations on the slide: 1 stall, 1 stall, 2 stalls, 1 stall.

Page 21: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence (the adds grouped together, then the stores):

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      L.D    F6,-8(R1)    ; F6 <- x[i-1]
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      ADD.D  F8,F6,F2     ; F8 <- x[i-1] + s
      S.D    0(R1),F4     ; x[i] <- x[i] + s
      S.D    -8(R1),F8    ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      NOP                 ; branch delay slot

Remaining stall annotations on the slide: 2 stalls, 1 stall. If DADDUI is moved up before the stores (next slide), the store offsets become +16 and +8.

Page 22: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence (DADDUI moved before the stores; store offsets adjusted):

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      L.D    F6,-8(R1)    ; F6 <- x[i-1]
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      ADD.D  F8,F6,F2     ; F8 <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      S.D    16(R1),F4    ; x[i] <- x[i] + s
      S.D    8(R1),F8     ; x[i-1] <- x[i-1] + s
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      NOP                 ; branch delay slot

Page 23: Embedded Computer Architectures

Loop Unrolling

• Unrolled code sequence (the last store moved into the branch delay slot):

Loop: L.D    F0,0(R1)     ; F0 <- x[i]
      L.D    F6,-8(R1)    ; F6 <- x[i-1]
      ADD.D  F4,F0,F2     ; F4 <- x[i] + s
      ADD.D  F8,F6,F2     ; F8 <- x[i-1] + s
      DADDUI R1,R1,#-16   ; i <- i - 2
      S.D    16(R1),F4    ; x[i] <- x[i] + s
      BNE    R1,R2,Loop   ; repeat if i ≠ 0
      S.D    8(R1),F8     ; x[i-1] <- x[i-1] + s   (branch delay slot)

Clock cycles        | Original loop (1000 iterations) | Unrolled loop (500 iterations) | Savings
L.D instructions    | 1000                            | 1000                           | 0
ADD.D instructions  | 1000                            | 1000                           | 0
S.D instructions    | 1000                            | 1000                           | 0
DADDUI instructions | 1000                            | 500                            | 500
BNE instructions    | 1000                            | 500                            | 500
Stall cycles        | 1000                            | 0                              | 1000
Totals              | 6000                            | 4000                           | 2000

Page 24: Embedded Computer Architectures

Loop Unrolling

• In the example: loop-unrolling factor 2
• In general: loop-unrolling factor k
• Limitations concerning k:
  - Amdahl's law: 3000 cycles of real work are always needed
  - increasing k => increasing number of registers
  - increasing k => increasing code size
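A rough worked limit, using the numbers from the table two slides back (a sketch; it ignores code-size and register-pressure effects):

  cycles ≈ 3000 real-work cycles (L.D + ADD.D + S.D per element)
           + (1000 / k) * 2 overhead cycles (DADDUI + BNE per unrolled iteration)
           + any remaining stall cycles

  k = 2 with good scheduling: 3000 + 1000 + 0 = 4000 cycles (matches the table)
  k -> infinity: cycles -> 3000, so unrolling alone cannot improve on the scheduled
  original loop (6000 cycles) by more than a factor of 2.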

Page 25: Embedded Computer Architectures

Software Pipelining

• Original loop:

Loop: L.D    F0,0(R1)    ; F0 <- x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s    (2 stalls)
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,#-8   ; i <- i - 1        (1 stall)
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

• Three actions are involved with the actual calculations:
  F0 <- x[i]
  F4 <- x[i] + s
  x[i] <- x[i] + s

• Consider these as three different stages.

Page 26: Embedded Computer Architectures

Software Pipelining

• Original loop:

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,#-8   ; i <- i - 1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

• Three actions are involved with the actual calculations:
  F0 <- x[i]          Stage 1
  F4 <- x[i] + s      Stage 2
  x[i] <- x[i] + s    Stage 3

• Associate an array element with the stages.

Page 27: Embedded Computer Architectures

Software Pipelining

• Original loop:

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,#-8   ; i <- i - 1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

• Three actions are involved with the actual calculations:
  F0 <- x[i]          Stage 1, element x[i]
  F4 <- x[i] + s      Stage 2, element x[i]
  x[i] <- x[i] + s    Stage 3, element x[i]

Page 28: Embedded Computer Architectures

Software Pipelining

• Normal execution (diagram): each element x[1000], x[999], x[998], ... passes through
  Stage 1 (fill F0), Stage 2 (read F0, fill F4) and Stage 3 (read F4) in turn, and the
  next element only starts after the previous one has finished Stage 3.
  Per element: 1 stall between Stage 1 and Stage 2, 2 stalls between Stage 2 and Stage 3.
  (Legend: register empty / register occupied.)

Page 29: Embedded Computer Architectures

Software Pipelining

• Software-pipelined execution (diagram): the stages of consecutive elements overlap in
  time; while x[1000] is in Stage 3, x[999] is in Stage 2 and x[998] is in Stage 1.
  Stage 1: fill F0; Stage 2: read F0, fill F4; Stage 3: read F4.
  The per-element stalls shrink: the slide annotates 1, 1, 0, 1, 0, 1 stalls instead of
  1 and 2 stalls per element.
  (Legend: register empty / register occupied.)

Page 30: Embedded Computer Architectures

Software Pipelining

• Software-pipelined execution

Start-up code:
      L.D    F0,0(R1)    ; F0 <- x[1000]
      ADD.D  F4,F0,F2    ; F4 <- x[1000] + s
      L.D    F0,-8(R1)   ; F0 <- x[999]
Steady-state loop:
Loop: S.D    0(R1),F4    ; x[i] <- F4
      ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s
      L.D    F0,-16(R1)  ; F0 <- x[i-2]
      DADDUI R1,R1,#-8   ; i <- i - 1
      BNE    R1,R2,Loop  ; repeat if i ≠ 1

(diagram: in each loop iteration, x[i] is in Stage 3, x[i-1] in Stage 2 and x[i-2] in
Stage 1; the slide annotates the remaining stalls as 1, 1, 0, 1, 0)

Page 31: Embedded Computer Architectures

Software Pipelining

• Software-pipelined execution

Start-up code:
      L.D    F0,0(R1)    ; F0 <- x[1000]
      ADD.D  F4,F0,F2    ; F4 <- x[1000] + s
      L.D    F0,-8(R1)   ; F0 <- x[999]
Steady-state loop:
Loop: S.D    0(R1),F4    ; x[i] <- F4
      ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s
      L.D    F0,-16(R1)  ; F0 <- x[i-2]
      DADDUI R1,R1,#-8   ; i <- i - 1
      BNE    R1,R2,Loop  ; repeat if i ≠ 1

(diagram: same overlap as on the previous slide; the slide annotates the stalls as
1, 1, 0, 0, 0, i.e. the start-up code still stalls but no stalls remain inside the
loop itself)

Page 32: Embedded Computer Architectures

Software Pipelining

• No stalls inside the loop
• Additional start-up (and clean-up) code
• No reduction of control overhead
• No additional registers
(a source-level sketch follows below)
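As a rough source-level sketch of the software-pipelined schedule above (assumptions: 1-based indexing over x[1..1000] and at least three elements; the function and variable names are made up):

    /* Steady state: each pass stores element i (Stage 3), computes the sum
       for element i-1 (Stage 2) and loads element i-2 (Stage 1). */
    void add_scalar_swp(double x[], double s) {
        double f0, f4;
        /* start-up code: get Stages 1 and 2 going */
        f0 = x[1000];                 /* Stage 1 for x[1000] */
        f4 = f0 + s;                  /* Stage 2 for x[1000] */
        f0 = x[999];                  /* Stage 1 for x[999]  */
        for (int i = 1000; i >= 3; i--) {
            x[i] = f4;                /* Stage 3 for x[i]   */
            f4 = f0 + s;              /* Stage 2 for x[i-1] */
            f0 = x[i - 2];            /* Stage 1 for x[i-2] */
        }
        /* clean-up code: drain the last two elements */
        x[2] = f4;
        x[1] = f0 + s;
    }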

Page 33: Embedded Computer Architectures

VLIW

• To simplify processor hardware: rely on sophisticated compilers (loop unrolling, software pipelining, etc.)

• Extreme form: Very Long Instruction Word processors

Page 34: Embedded Computer Architectures

VLIW

• Superscalar vs. VLIW (diagram): instructions flow to the execution units; in a
  superscalar processor the hardware performs the grouping, the execution-unit
  assignment and the initiation, whereas in a VLIW these decisions are made statically
  by the compiler.

Page 35: Embedded Computer Architectures

VLIW

• Suppose 4 functional units:
  - memory load unit
  - floating-point unit
  - memory store unit
  - integer/branch unit

• Instruction word (one slot per functional unit):
  Memory load | FP operation | Memory store | Integer/Branch

Page 36: Embedded Computer Architectures

VLIW

• Original loop:

Loop: L.D    F0,0(R1)    ; F0 <- x[i]        (1 stall)
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s    (2 stalls)
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,#-8   ; i <- i - 1        (1 stall)
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Issued on the VLIW, one row per cycle (Memory load | FP operation | Memory store | Integer/Branch):

  cycle 1: L.D    (memory load slot)
  cycle 2: stall
  cycle 3: ADD.D  (FP operation slot)
  cycle 4: stall
  cycle 5: stall
  cycle 6: S.D    (memory store slot)

Limit the stall cycles by clever compilers (loop unrolling, software pipelining).

Page 37: Embedded Computer Architectures

VLIW

• Superscalar vs. VLIW (same diagram as before): in a superscalar processor the hardware
  performs the grouping, execution-unit assignment and initiation of instructions; in a
  VLIW the compiler does.

Page 38: Embedded Computer Architectures

VLIW

• Superscalar vs. dynamic VLIW (diagram): in a dynamic VLIW the grouping and the
  execution-unit assignment are done by the compiler; only the initiation of
  instructions is handled by hardware.

Page 39: Embedded Computer Architectures

Dynamic VLIW

• VLIW: no caches, because there is no hardware to deal with cache misses

• Dynamic VLIW: hardware can stall on a cache miss

• Not used frequently

Page 40: Embedded Computer Architectures

VLIW

• Dynamic VLIW vs. EPIC, Explicitly Parallel Instruction Computing (diagram): in EPIC
  the grouping is still done by the compiler, while both the execution-unit assignment
  and the initiation of instructions are handled by hardware.

Page 41: Embedded Computer Architectures

EPIC

• IA-64: architecture by HP and Intel
• IA-64 is an instruction set architecture intended for implementation on EPIC
• Itanium is the first Intel product
• 64-bit architecture
• Basic concepts:
  - instruction-level parallelism indicated by the compiler
  - long or very long instruction words
  - branch predication (≠ prediction)
  - speculative loading

Page 42: Embedded Computer Architectures

Key Features

• Large number of registers
  - The IA-64 instruction format assumes 256:
    - 128 * 64-bit integer, logical & general-purpose registers
    - 128 * 82-bit floating-point and graphics registers
  - 64 * 1-bit predicated-execution registers (see later)
  - To support a high degree of parallelism

• Multiple execution units
  - Expected to be 8 or more
  - Depends on the number of transistors available
  - Execution of parallel instructions depends on the hardware available
    - e.g. 8 parallel instructions may be split into two lots of four if only four
      execution units are available

Page 43: Embedded Computer Architectures

IA-64 Execution Units

• I-unit
  - integer arithmetic
  - shift and add
  - logical
  - compare
  - integer multimedia ops
• M-unit
  - load and store (between register and memory)
  - some integer ALU
• B-unit
  - branch instructions
• F-unit
  - floating-point instructions

Page 44: Embedded Computer Architectures

Instruction Format Diagram

Page 45: Embedded Computer Architectures

Instruction Format

• 128-bit bundle
  - Holds three instructions (syllables) plus a template
  - One or more bundles can be fetched at a time
  - The template contains info on which instructions can be executed in parallel
    - not confined to a single bundle
    - e.g. a stream of 8 instructions may be executed in parallel
    - the compiler will have re-ordered instructions to form contiguous bundles
    - dependent and independent instructions can be mixed in the same bundle
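For reference, the bundle layout as defined in Intel's IA-64 documentation (background added here, not shown on this slide):

  bits   0..4    template (5 bits)
  bits   5..45   instruction slot 0 (41 bits)
  bits  46..86   instruction slot 1 (41 bits)
  bits  87..127  instruction slot 2 (41 bits)
  total: 5 + 3 * 41 = 128 bits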

Page 46: Embedded Computer Architectures

Assembly Language Format

• [qp] mnemonic [.comp] dest = srcs // comment
• qp: predicate register
  - 1 at execution: execute and commit the result to hardware
  - 0: the result is discarded
• mnemonic: name of the instruction
• comp: one or more instruction completers used to qualify the mnemonic
• dest: one or more destination operands
• srcs: one or more source operands
• //: comment
• Instruction groups and stops are indicated by ;;
  - a sequence without read-after-write or write-after-write dependences
  - needs no hardware register-dependency checks

Page 47: Embedded Computer Architectures

Assembly Examples

ld8 r1 = [r5] ;;     // first group
add r3 = r1, r4      // second group

• The second instruction depends on the value in r1
  - which is changed by the first instruction
  - so the two cannot be in the same group for parallel execution

Page 48: Embedded Computer Architectures

Predication

Page 49: Embedded Computer Architectures

Predication

Pseudo code:
  if a == 0
  then j = j + 1
  else k = k + 1

Using branches:
      cmp a, 0
      jne L1
      add j, 1
      jmp L2
  L1: add k, 1
  L2:

Predicated (IA-64):
      cmp.eq p1, p2 = 0, a ;;
  (p1) add j = 1, j
  (p2) add k = 1, k

If a == 0, then p1 = 1 and p2 = 0; else p1 = 0 and p2 = 1.
(Slide annotation next to the predicated code: "Should NOT be there to enable parallelism".)

Page 50: Embedded Computer Architectures

Speculative Loading

Page 51: Embedded Computer Architectures

Data Speculation

Without data speculation (the add stalls waiting for the load, and the load cannot simply be moved above the store: what if r4 contains the same address as r8?):

  st8 [r4] = r12
  ld8 r6 = [r8] ;;
  add r5 = r6, r7 ;;
  st8 [r18] = r5

With an advanced load and a check load:

  ld8.a r6 = [r8] ;;   // advanced load
  st8 [r4] = r12
  ld8.c r6 = [r8] ;;   // check load
  add r5 = r6, r7 ;;
  st8 [r18] = r5

• The advanced load writes its source address (the contents of r8) to the Advanced Load Address Table (ALAT)
• Each store checks the ALAT and removes the entry if the addresses match
• If the check load finds no matching entry in the ALAT, the load is performed again
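A C-level illustration of the aliasing problem this solves (a sketch; the function and parameter names are made up):

    /* If p and q may point to the same location, a compiler cannot simply hoist
       the load of *q above the store to *p.  IA-64's ld8.a / ld8.c let it hoist
       the load anyway and redo it only if a conflicting store actually happened. */
    void update(long *p, long *q, long *r, long x, long y) {
        *p = x;          /* roughly: st8 [r4] = r12                    */
        long t = *q;     /* roughly: ld8 r6 = [r8]                     */
        *r = t + y;      /* roughly: add r5 = r6, r7 ; st8 [r18] = r5  */
    }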

Page 52: Embedded Computer Architectures

Control & Data Speculation

• Control speculation
  - a.k.a. speculative loading
  - load data from memory before it is needed
• Data speculation
  - a load is moved before a store that might alter the memory location
  - followed by a subsequent check of the value