Advanced Computer Architectures HB49 Part 2.3 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA

Uploaded by vincenzo-de-florio, posted 06-May-2015

DESCRIPTION

Part 2.3 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B)

TRANSCRIPT

Page 1: Advanced Computer Architectures – Part 2.3

Advanced Computer

Architectures

– HB49 –

Part 2.3

Vincenzo De Florio

K.U.Leuven / ESAT / ELECTA

Page 2: Advanced Computer Architectures – Part 2.3

© V. De Florio, KULeuven 2002

(Sidebar on every slide: Basic Concepts | Computer Design | Computer Architectures for AI | Computer Architectures in Practice)

2.3/2

Course contents

• Basic Concepts

• Computer Design

• Computer Architectures for AI

• Computer Architectures in Practice

Page 3: Advanced Computer Architectures – Part 2.3


Computer Design

• Quantitative assessments

• Instruction sets

• Pipelining

• Parallelism

Page 4: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

• (Data level parallelism): Part 3

• (Task level parallelism): Part 3

Page 5: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program,

process, thread, language aspects

Types of parallelism

• Instruction level parallelism

Page 6: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program,

process, thread

Types of parallelism

• Instruction level parallelism

Page 7: Advanced Computer Architectures – Part 2.3


Granularity

• Definition:

granularity is the complexity/grain size of

some item

e.g. computation item (instruction),

data item (scalar, array, struct),

communication item (token granularity),

hardware building block (gate,

RTL component)

Granularity scale, from low to high:

RISC (e.g. add r1,r2,r4)

CISC (e.g. ld *a0++,r1)

High-level languages (HLLs) (e.g. x = sin(y))

Application-specific (e.g. edge-detect / invert-image)

Page 8: Advanced Computer Architectures – Part 2.3


Granularity

• Deciding the granularity is an important

design choice

• E.g. grain size for the communication

tokens in a parallel computer:

coarse grain: less communication overhead

fine grain: less time penalty when two

communication packets compete for

transmission over the same channel and collide

Page 9: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program, process,

thread

Types of parallelism

• Instruction level parallelism

Page 10: Advanced Computer Architectures – Part 2.3


Types of parallelism

• Functional parallelism

Different computations have to be performed

on the same or different data

E.g. Multiple users submit jobs to the same

computer or a single user submits multiple jobs

to the same computer

this is functional parallelism at the process level

taken care of at run-time by the OS

Important for

the exam!

Page 11: Advanced Computer Architectures – Part 2.3


Types of parallelism

• Data parallelism

Same computations have to be performed on a

whole set of data

E.g. 2D convolution of an image

This is data parallelism at the loop level:

consecutive loop iterations are candidates for

parallel execution, subject to inter-iteration data

dependencies

Often leads to a massive amount of parallelism

Important for

the exam!

Page 12: Advanced Computer Architectures – Part 2.3


Levels of parallelism

• Instruction level parallel (ILP)

Functional parallelism at the instruction level

Example: pipelining

• Data level parallel (DLP)

Data parallelism at the loop level

• Process & thread level parallel (TLP)

Functional parallelism at the thread and

process level

Page 13: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 14: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 15: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Sequential instruction issuing, sequential

instruction execution

von Neumann processors

(Diagram: one instruction word feeding a single EU)

Page 16: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Sequential instruction issuing, parallel

instruction execution

pipelined processors

(Diagram: one instruction word feeding EU1, EU2, EU3, EU4 through the pipeline)

Page 17: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Parallel instruction issuing –

compile-time determined by compiler,

parallel instruction execution

VLIW processors:

Very Long Instruction Word

(Diagram: one very long instruction word feeding EU1, EU2, EU3, EU4 in parallel)

Page 18: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Parallel instruction issuing – run-time

determined by HW dispatch unit,

parallel instruction execution

super-scalar processors (to be seen later)

(Diagram: an instruction window feeding EU1, EU2, EU3, EU4 in parallel)

Page 19: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Most processors provide sequential execution semantics

regardless of how the processor actually executes the instructions (sequentially or in parallel, in-order or out-of-order), the result is the same as sequential execution in the order they were written

• VLIW and IA-64 provide parallel execution semantics

explicit indication in assembly of which instructions are executed in parallel

Page 20: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 21: Advanced Computer Architectures – Part 2.3


VLIW

(Diagram: Main instruction memory (128 bit) → Instruction Cache (128 bit) → Instruction Register; 4 decoders (32 bit each in, 256 decoded bits each out) drive 4 EUs; Register file: 32 bit each, 8 read ports, 4 write ports; Cache/RAM: 32 bit each, 2 read ports, 1 write port; Main data memory: 32 bit, 1 bi-directional port)

Page 22: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Multiple Execution Units: multiple instructions

issued in one clock cycle

Every EU requires 2 operands and delivers one

result every clock cycle: high data memory

bandwidth needed

Careful design of data memory hierarchy

Register file with many ports

Large register file: 64-256 registers

Carefully balanced cache/RAM hierarchy with

decreasing number of ports and increasing

memory size and access time for the higher

levels (IMEC research: DTSE)

Page 23: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Compiler should determine which instructions

can be issued in a single cycle without control

dependency conflict nor data dependency

conflict

Deterministic utilization of parallelism: good for

hard-real-time

Compile-time analysis of source code: worst case

analysis instead of actual case

Very sophisticated compilers are needed, especially when the EUs are pipelined! They have performed well since the early 2000s

Page 24: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Compiler should determine which instructions

can be issued in a single cycle without control

dependency conflict nor data dependency

conflict

Very difficult to write assembly:

programmer should resolve all control flow conflicts

all data flow conflicts

all pipelining conflicts

and at the same time fit data accesses into the

available data memory bandwidth

and all program accesses into the available program

memory bandwidth

e.g. 2 weeks for a sum-of-products (3 lines of C code)

All high end DSP processors since 1999 are

VLIW processors (examples: Philips Trimedia --

high end TV, TI TMS320C6x -- GSM base

stations and ISP modem arrays)

Page 25: Advanced Computer Architectures – Part 2.3


Low power DSP

(Diagram: the same VLIW datapath as before: Main instruction memory (128 bit) → Instruction Cache (128 bit) → Instruction Register → 4 decoders → 4 EUs, with the same register file and memory ports)

Too much power dissipation in fetching wide instructions

Page 26: Advanced Computer Architectures – Part 2.3


Low power DSP

(Diagram: Main instruction memory (24 bit) → 24-bit Instruction Cache → instruction expansion → Instruction Register (128 bit) → 4 decoders (32 bit each, 256 decoded bits each) → 4 EUs, with the same register file and memory ports as before)

E.g. ADD4 is expanded into

ADD || ADD || ADD || ADD

Page 27: Advanced Computer Architectures – Part 2.3


Low power DSP

• Properties

Power consumption in program memory is

reduced by specializing the instructions for the

application

Not all combinations of all instructions for the

EUs are possible, but only a limited set, i.e.

those combinations that lead to a substantial

speed-up of the application

Those relevant combinations are represented

by the smallest possible amount of bits to

reduce program memory width and hence

program memory power consumption

Can only be done for embedded DSP

applications: processor is specialized for 1

application (examples: TI TMS320C54x -- GSM

mobile phones, TI TMS320C55x -- UMTS mobile

phones)

Page 28: Advanced Computer Architectures – Part 2.3


Low power DSP

for interactive

multimedia

(Diagram: as the low-power DSP datapath above, but with reconfigurable execution units (REUs) and a reconfigurable instruction-expansion stage between the 24-bit instruction cache and the 128-bit instruction register)

Run-time reconfiguration allows the specialization to be adapted to changing application requirements

Page 29: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 30: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• Pipeline CPI is the result of many

components

• A number of techniques act on one or

more of these components:

Loop unrolling

Scoreboarding

Dynamic branch prediction

Speculation

• To be seen later

CPUTIME(p) = IC(p) × CPI(p) / clock rate

Page 31: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• Until now, instruction-level parallelism was sought within the boundaries of a basic block (BB)

• A BB is 6-7 instructions on average

too small to reach the expected

performance

• What is worse, there’s a big chance that

these instructions have dependencies

Even less performance can be expected

Page 32: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• To obtain more, we need to go beyond the

BB limitation:

• We must exploit ILP across multiple BB’s

• Simplest way: loop level parallelism (LLP):

Exploiting the parallelism among iterations of a

loop

• Converting LLP into ILP

Loop unrolling

Statically (compiler-based)

Dynamically (HW-based)

• Using vector instructions

Does not require LLP -> ILP conversion

Page 33: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• The efficiency of the conversion depends

On the amount of ILP available

On latencies of the functional units in the

pipeline

On the ability to avoid pipeline stalls by

separating dependent instructions by a

“distance” (in terms of stages) equal to the

latency peculiar to the source instruction

LW x, …

INSTR …, x

a load must not be followed by the

immediate use of the load destination

register

Page 34: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

Assumptions and steps

1. We assume the following latencies

Producer instruction    Consumer instruction    Latency
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1
Load double             Store double            0

Page 35: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

2. We assume to work with a simple loop

such as

for (I=1; I<=1000; I++)
    x[I] = x[I] + s;

• Note: each iteration is independent of

the others

Very simple case

Page 36: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

3. Translated in DLX, this simple loop looks

like this:

; assumptions: R1 = &x[1000]

; F2 = s

Loop: LD F0, 0(R1) ; F0 = x[I]

ADDD F4, F0, F2 ; F4 = F0 + s

SD 0(R1), F4 ; store result

SUBI R1, R1, #8 ; R1 -= 8 (one 8-byte element)

BNEZ R1, Loop ; if (R1)

; goto Loop


Page 37: Advanced Computer Architectures – Part 2.3


4. Tracing the loop (no scheduling!):

Loop: LD F0, 0(R1) ; 1

stall 2

ADDD F4, F0, F2 ; 3

stall 4

stall 5

SD 0(R1), F4 ; 6

SUBI R1, R1, #8 ; 7

BNEZ R1, Loop ; 8

stall ; 9

• 9 clock cycles per iteration, with 4 stalls

Advanced Pipelining

Loop unrolling

Page 38: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

5. With scheduling, we move from

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

to

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SUBI R1, R1, #8

BNEZ R1, Loop

SD 8(R1), F4


whose trace shows that fewer cycles are wasted:

Page 39: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

6. Tracing the loop (with scheduling!):

Loop: LD F0, 0(R1) ; 1

stall 2

ADDD F4, F0, F2 ; 3

SUBI R1, R1, 8 ; 4

BNEZ R1, Loop ; 5

SD 8(R1), F4 ; 6

• 6 clock cycles per iteration, with 1 stall

• 3 stalls less!

• Still the useful cycles are just 3

• How to gain more?


Page 40: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

7. With loop unrolling:

replicating the loop body multiple times

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

LD F6, -8(R1) ; F6 vs. F0

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

• Saves 3 × (SUBI + BNEZ)

Page 41: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Loop unrolling:

replicating the loop body multiple times

Some branches are eliminated

The ratio of useful work to loop overhead increases

The BB artificially increases its size

Higher probability of optimal scheduling

Requires a larger set of registers and adjusted offsets in the load and store instructions

(In the given example,) Every operation is

followed by a dependent instruction

Will cause a stall

Trace of unscheduled unrolled loop: 27 cycles

2 per LD, 3 per ADD, 2 per branch, 1 per any other

6.8 clock cycles per iteration

Pure scheduling is better! (6 cycles)

Page 42: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

LD F6, -8(R1) ; F6 vs. F0

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 43: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 44: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 45: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

LD F14, -24(R1) ; F14 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 46: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

LD F14, -24(R1) ; F14 vs. F0

ADDD F4, F0, F2

ADDD F8, F6, F2 ; F8 vs. F4

ADDD F12, F10, F2 ; F12 vs. F4

ADDD F16, F14, F2 ; F16 vs. F4

SD 0(R1), F4 ; skip SUBI and BNEZ

SD -8(R1), F8 ; skip SUBI and BNEZ

SD -16(R1), F12 ; skip SUBI and BNEZ

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

• 14 clock cycles, or 3.5 clock cycles / iteration

Enough distance to prevent the dependency from turning into a hazard

Page 47: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolling the loop exposes more

computation that can be scheduled to

minimize the stalls

• Unrolling increases the BB; as a result, a better choice can be made for scheduling

• A useful technique with two key

requirements:

Understanding how an instruction depends on

another

Understanding how to change or reorder the

instructions, given the dependencies

• In what follows we concentrate on dependencies

Page 48: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Again, let (Ik), 1 ≤ k ≤ IC(p), be the ordered series of instructions executed during the run of program p

• Given two instructions, Ii and Ij, with i < j, we say that Ij is dependent on Ii (Ii → Ij) iff

R(Ii) ∩ D(Ij) ≠ ∅

(R is the range and D the domain of a given instruction: Ii produces a result which is consumed by Ij)

or

∃ n ∈ {1, …, IC(p)} and ∃ k1 < k2 < … < kn such that Ii → Ik1 → Ik2 → … → Ikn → Ij

Page 49: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• (Ii, Ik1, Ik2, …, Ikn, Ij) is called a dependency (transitive) chain

• Note that a dependency chain can be as

long as the entire execution of p

• A hazard implies dependency

• Dependency does not imply a hazard!

• Scheduling tries to place dependent

instructions in places where no hazard can

occur

Page 50: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• For instance:

SUBI R1, R1, #8

BNEZ R1, Loop

• This is clearly a dependence, but it does

not result in a hazard

Forwarding eliminates the hazard

• Another example:

LD F0, 0(R1)

ADDD F4, F0, F2

• This is a data dependency which does

lead to a hazard and a stall

Page 51: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Dealing with data dependencies

• Two classes of methods:

1. Keeping the dependence though avoiding

the hazard (via scheduling)

2. Eliminating a dependence by

transforming the code

Page 52: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Class 2 implies more work

• These are optimization methods used by

the compilers

• Detecting dependencies when only using

registers is easy; the difficulties come

from detecting dependencies in memory:

• For instance 100(R4) and 20(R6) may

point to the same memory location

• Also the opposite situation may take

place:

LD R2, 20(R4)

ADD R3, R1, 20(R4)

• If R4 changes in between, there is no dependency

Page 53: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Ii → Ij means that Ii produces a result that is consumed by Ij

• When there is no such production, e.g., Ii and Ij are both loads or stores, we call this a name dependency

• Two types of name dependencies:

Antidependence

Corresponds to WAR hazards

Ii reads x; Ij writes x (reordering implies an error)

Output dependence

Corresponds to WAW hazards

Ii writes x; Ij writes x (reordering implies an error)

• No value is transferred between the

instructions

• Register renaming solves the problem

Page 54: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Register renaming: if the register name is

changed, the conflict disappears

• This technique can be either static (and

done by the compiler) or dynamic (done

by the HW)

• Let us consider again the following loop:

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

• Let us perform unrolling w/o renaming:

Page 55: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F0, -8(R1)

ADDD F4, F0, F2

SD -8(R1), F4

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

The yellow arrows are name dependencies. To solve them, we perform renaming

Page 56: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

Page 57: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

Page 58: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

The yellow arrows are data dependencies. To solve them, we reorder the instructions

Page 59: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• A third class of dependencies is the one

of control dependencies

• Examples:

if (p1) s1;
if (p2) s2;

then

p1 →c s1 (s1 is control dependent on p1)
p2 →c s2 (s2 is control dependent on p2)

• Clearly ¬(p1 →c s2), that is, s2 is not control dependent on p1

Page 60: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Two properties are critical to control

dependency:

Exception behaviour

Data flow

• Exception behaviour: suppose we have

the following excerpt:

BEQZ R2, L1

DIVI R1, 8(R2)

L1: …

• We may be able to move the DIVI to

before the BEQZ without violating the

sequential semantics of the program

• Suppose the branch is taken. Normally

one would simply need to undo the DIVI

• What if DIVI triggers a DIVBYZERO

exception?



• Data flow must be preserved

• Let us consider the following excerpt:

ADD R1, R2, R3
BEQZ R4, L
SUB R1, R5, R6
L: OR R7, R1, R8

• The value of R1 depends on the control flow

• The OR depends on both the ADD and the SUB

• It also depends on the nature of the branch

• R1 = (taken) ? ADD result : SUB result


Loop Level Parallelism

• Let us consider the following loop:

for (I=1; I<=100; I++) {
  A[I+1] = A[I] + C[I]; /* S1 */
  B[I+1] = B[I] + A[I+1]; /* S2 */
}

• S1 is a loop-carried dependence (LCD): iteration I+1 depends on iteration I: A’ = f(A)

• S2 is B’ = f(B, A’)

• If a loop has only non-LCD’s, then it is possible to execute more than one loop iteration in parallel – as long as the dependencies within each iteration are not violated


• What to do in the presence of LCD’s?

• Loop transformations. Example:

for (I=1; I<=100; I++) {
  A[I+1] = A[I] + B[I]; /* S1 */
  B[I+1] = C[I] + D[I]; /* S2 */
}

• A’ = f(A, B)
  B’ = f(C, D)

• Note: no dependencies except LCD’s
  The instructions can be swapped!


• Note: the flow of the computations (A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, C2 D2, …) can be changed without violating the dependencies


for (i=1; i <= 100; i=i+1) {
  A[i] = A[i] + B[i]; /* S1 */
  B[i+1] = C[i] + D[i]; /* S2 */
}

becomes

A[1] = A[1] + B[1];
for (i=1; i <= 99; i=i+1) {
  B[i+1] = C[i] + D[i];
  A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
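As a quick check that the transformation preserves the results, the two versions can be run side by side (a Python sketch; the initial array contents are arbitrary, and 1-based indexing follows the slides):

```python
# Compare the original loop (with a loop-carried dependence) against the
# transformed loop (dependence made intra-iteration), element by element.

def original(A, B, C, D):
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2: feeds S1 of the NEXT iteration
    return A, B

def transformed(A, B, C, D):
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]   # dependence is now within the iteration
    B[101] = C[100] + D[100]
    return A, B

A = [0.5 * i for i in range(102)]
B = [0.25 * i for i in range(102)]
C = [2.0 * i for i in range(102)]
D = [3.0 - i for i in range(102)]

assert original(A, B, C, D) == transformed(A, B, C, D)
```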


• A’ = f(A, B)      becomes      B’ = f(C, D)
  B’ = f(C, D)                   A’ = f(A’, B’)

• Now we have dependencies but no more LCD’s!
  It is possible to execute more than one loop iteration in parallel – as long as the dependencies within each iteration are not violated


Dependency avoidance

1. “Batch” approaches: at compile time, the

compiler schedules the instructions in

order to minimize the dependencies

(static scheduling)

2. “Interactive” approaches: at run-time, the

HW rearranges the instructions in order

to minimize the stalls (dynamic

scheduling)

• Advantages of 2:

Only approach when dependencies are only

known at run-time (pointers etc.)

The compiler can be simpler

Given an executable compiled for a machine

with machine-level X and pipeline organization

Y, it can run efficiently on another machine

with the same machine level but a different

pipeline organization Z


Dynamic Scheduling

• Static scheduling: compiler techniques for scheduling (rearranging) the instructions
  so as to separate dependent instructions
  and hence minimize unsolvable hazards causing unavoidable stalls

• Dynamic scheduling: HW-based, run-time techniques

• A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present

• The two techniques can both be used


Dynamic Scheduling: General Idea

• If an instruction is stalled in the pipeline,

no later instruction can proceed

• A dependence between two instructions

close to each other causes a stall

• A stall means that, even though there

may be idle functional units that could

potentially serve other instructions, those

units have to stay idle

• Example:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

• ADDD depends on DIVD; but SUBD does

not. Despite this, it is not issued!


• So SUBD is not issued even though there might be a functional unit ready to perform the requested operation

• Big performance limitation!

• What are the reasons that lead to this

problem?

• In-order instruction issuing and

execution: instructions issue and execute

one at a time, one after the other


• Example: in DLX, the issue of an

instruction occurs at ID (instruction

decode)

• In DLX, ID checks for absence of

structural hazards and waits for the

absence of data hazards

• These two steps may be made distinct


• The issue process gets divided into two parts:

1. Checking the presence of structural hazards

2. Waiting for the absence of a data hazard

• Instructions are issued in order, but they execute and complete as soon as their data operands are available

• Data flow approach


• The ID pipeline stage is divided into two

sub-stages:

• ID.1 (Issue) : decode the instruction,

check for structural hazards

• ID.2 (read operands) : wait until no data

hazards, then read operands


• In the DLX floating point pipeline, the EX

stage of instructions may take multiple

cycles

• For each issued instruction I, depending on the resolution of structural and data hazards, I may be waiting for resources or data, or in execution, or completed

• More than a single instruction can be in

execution at the same time


Scoreboarding

• Scoreboard (CDC 6600, 1964): a technique to allow instructions to execute out of order when there are sufficient resources and no data dependencies

• Goal: execution rate of 1 instruction per clock cycle in the absence of structural hazards

• Large set of FUs:

  4 FPUs

  5 units for memory references

  7 integer FUs

  Highly redundant (parallel) system

• Four steps replace the ID, EX, WB stages


• Issue: IF (a FU is available && no active instruction has the same destination reg) { issue I to the FU; update the scoreboard state; }

  Avoids WAWs

• Read operands: as soon as (the two source operands are available in the registers) { read operands; manage RAW stalls; }

• Execute: for each FU, as soon as (operands are available) { start EX; at end of EX, alert the scoreboard; }

• Write result: when at WB { wait for (no WAR hazards); store output to destination reg; }

  Avoids WARs
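The issue step can be sketched as a toy check (a sketch, not the CDC 6600 logic; the FU names, counts and instruction format are made up for illustration):

```python
# Toy scoreboard issue check: an instruction issues only if a functional
# unit of the right kind is free AND no in-flight instruction writes the
# same destination register (this is what blocks WAW hazards at issue).

free_units = {"fp_add": 1, "fp_mul": 1, "int": 2}
active_dests = set()            # destination regs of in-flight instructions

def try_issue(unit, dest):
    if dest in active_dests:
        return False            # would be a WAW hazard: stall at issue
    if free_units.get(unit, 0) == 0:
        return False            # structural hazard: no free FU
    free_units[unit] -= 1
    active_dests.add(dest)
    return True

assert try_issue("fp_add", "F4") is True
assert try_issue("fp_add", "F8") is False   # no fp_add unit left
assert try_issue("fp_mul", "F4") is False   # WAW on F4
assert try_issue("fp_mul", "F8") is True
```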


• In eliminating stalls, a scoreboard is

limited by several factors:

Amount of parallelism available among the

instructions

(in the presence of many dependencies there’s

not much that one can do…)

Number of scoreboard entries

(How far ahead the pipeline can look for

independent instructions)

Number and types of FUs

Number of WAR’s and WAW’s


• The effectiveness of the scoreboard

heavily depends on the register file

• All operands are read from registers, all

outputs go to destination registers

The availability of registers influences the capability to eliminate stalls


Tomasulo’s approach

• Tomasulo’s approach (IBM 360/91, 1967): an improvement over scoreboarding, for machine architectures that allow only a limited number of registers

• Based on virtual registers

• The IBM 360/91 had two key design goals:

  To be faster than its predecessors

  To be machine-level compatible with its predecessors

• Problem: the 360 family had only 4 FP registers

• Tomasulo combined the key ideas of scoreboarding with register renaming


• IBM 360/91 FUs:

3 ADDD/SUBD, 2 MULD, 6 LD, 6 SD

• Key element: the reservation station (RS):

a buffer which holds the operands of the

instructions waiting to issue

• Key concept:

  An RS fetches and buffers an operand as soon as it is available, eliminating the need to get that operand from a register

  Instead of tracking source and destination registers, we track source and destination RS’s

  (diagram: an operation OP whose operand fields name reservation stations RSa, RSb, RSc)


• A reservation station represents either:

  A static value, read from a register

  A “live” value (a future value) that will be produced by another RS and FU

• Hazard detection and execution control are not centralised into a scoreboard

• They are distributed in each RS, which, independently:

  Controls the FU attached to it,

  And starts that FU the moment the operands become available


• The operands go to the FUs through the (wide set of) RS’s, not through the (small) register file

• This is managed through a broadcast that makes use of a common result-or-data bus (the CDB)

• All units waiting for an operand can load it at the same time

  (diagram: a result broadcast by one RS is picked up simultaneously by all the RS’s waiting for it)


• The execution is driven by a graph of dependencies

  (diagram: RS’s feeding SUBD and MULTD operations, whose results feed further SUBD operations)

• A “live data structure” approach (similar to LINDA): a tuple is made available in the future, when a thread will have finished producing it


Major Advantages of Tomasulo’s

• Distributed approach: the RS’s

independently control the FU’s

• Distributed hazard detection logic

• The CDB broadcasts results -> all pending

instructions depending on that result are

unblocked simultaneously

The CDB, being a bus, reaches many

destinations in a single clock cycle

If the waiting instructions get their missing

operand in that clock cycle, they can all begin

execution on the next clock cycle

• WAR and WAW are eliminated by

renaming registers using the RS’s
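That last point can be illustrated with a toy renaming pass (a sketch, not the 360/91 hardware; the instruction sequence and the "rsN" tag names are made up). Every write gets a fresh tag, so only true (RAW) dependencies survive:

```python
# Rename destinations to fresh "reservation station" tags; sources read
# the newest producer of each register.  WAW and WAR hazards disappear
# because no two instructions ever write the same tag.

def rename(instrs):
    """instrs: list of (dest, src1, src2) register names."""
    latest = {}                  # architectural register -> current tag
    out = []
    for n, (dst, s1, s2) in enumerate(instrs):
        t1 = latest.get(s1, s1)  # read from the newest producer, or the reg
        t2 = latest.get(s2, s2)
        tag = f"rs{n}"           # fresh tag for this instruction's result
        latest[dst] = tag
        out.append((tag, t1, t2))
    return out

prog = [
    ("F0", "F2", "F4"),    # I0
    ("F6", "F0", "F8"),    # I1: RAW on F0
    ("F8", "F10", "F14"),  # I2: WAR on F8 with I1
    ("F6", "F10", "F8"),   # I3: WAW on F6 with I1, RAW on F8 with I2
]
renamed = rename(prog)
assert renamed == [
    ("rs0", "F2", "F4"),
    ("rs1", "rs0", "F8"),
    ("rs2", "F10", "F14"),
    ("rs3", "F10", "rs2"),
]
# all destination tags are distinct: the WAW and WAR hazards are gone
assert len({d for d, _, _ in renamed}) == len(renamed)
```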


Reducing branch penalties

• Static Approaches

Dynamic Approaches


Reducing branch penalties:

Dynamic Branch Prediction

• A branch history table

Address      Branch  Nature
0xA0B2DF37   BNEZ    …
0xA0B2F02A   BEQ     …
0xA0B30504   BNEZ    …
0xA0B30537   BGT     …

(diagram: the low byte of each branch address – 0x37, 0x2A, 0x04, … – indexes a table whose entries record “taken” or “untaken”; note that 0xA0B2DF37 and 0xA0B30537 share entry 0x37)


Dynamic Branch Prediction

Branch History Table Algorithm

/* before the branch is evaluated */
If (current instruction is a branch) {
  entry = PC & 0x000000FF;
  predict branch as ( BHT[entry] );
}

/* after the branch */
If (branch was mispredicted)
  BHT[entry] = 1 - BHT[entry];
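The pseudocode above maps directly onto a few lines of Python (a sketch; the 256-entry table mirrors the slide’s low-byte indexing):

```python
# One-bit branch history table: indexed by the low byte of the branch PC,
# and flipped on every misprediction.

BHT = [0] * 256          # 0 = predict untaken, 1 = predict taken

def predict(pc):
    return BHT[pc & 0xFF]

def update(pc, taken):
    entry = pc & 0xFF
    mispredicted = (BHT[entry] != int(taken))
    if mispredicted:
        BHT[entry] = 1 - BHT[entry]
    return mispredicted

# two different branches may share entry 0x37 (the indexing is not
# one-to-one, as noted on the next slide):
assert (0xA0B2DF37 & 0xFF) == (0xA0B30537 & 0xFF) == 0x37
```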


• Just one bit is enough for coding the Boolean value “taken” vs. “untaken”

• Note: the function associating addresses to entries in the BHT is not guaranteed to be a bijection (one-to-one relationship):

• The algorithm records the most recent behaviour of one or more branches

  For instance, entry 0x37 corresponds to two branches

• Despite this, the scheme works well…

• …though in some cases, the performance of the scheme is not that satisfactory:


Dynamic Branch Prediction

Branch History Table Accuracy

• for (i=0; i<BIGN; i++)
    for (j=0; j<9; j++)
      { do_stg(); }

• The loop branch is

  taken nine times in a row

  then not taken once

• Taken 90%, Untaken 10%

• What is the prediction accuracy?


Actual    Prediction  Result
Taken     Untaken     mispredicted (entry → 1)
Taken     Taken       correct
…         (8 correct predictions in a row)
Untaken   Taken       mispredicted (entry → 0)
(the pattern repeats on every execution of the loop)

Per 10 branches: 8 successful predictions, 2 mispredictions

S.S. prediction accuracy is just 80%!


• Loop branches (taken n-1 times in a row, untaken once)

• Performance of this dynamic branch predictor (based on a single-bit prediction entry):

  Misprediction rate: 2 × 1/n

  Twice the rate of untaken branches
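The 2/n figure can be checked with a short simulation (a sketch; repeating the loop many times approximates the steady state):

```python
# Steady-state behaviour of a 1-bit predictor on a loop branch that is
# taken n-1 times in a row and then untaken once.

def one_bit_mispredictions(n, runs=1000):
    bit = 0                       # 0 = predict untaken, 1 = predict taken
    wrong = total = 0
    for _ in range(runs):
        outcomes = [1] * (n - 1) + [0]   # taken n-1 times, untaken once
        for taken in outcomes:
            if bit != taken:
                wrong += 1
                bit = taken       # 1-bit rule: flip on a misprediction
            total += 1
    return wrong / total

# both the first and the last branch of every loop execution are
# mispredicted, so the rate is 2/n (80% accuracy for n = 10):
assert abs(one_bit_mispredictions(10) - 2 / 10) < 1e-3
```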


Dynamic Branch Prediction

Two-bit Prediction Scheme

• Use a two bit field as a “branch behaviour

recorder”

• Allow a state to change only when two

mispredictions in a row occur:

(state diagram: two “predict taken” states and two “predict not taken” states; each taken branch moves one state towards “predict taken”, each not-taken branch one state towards “predict not taken”, so the prediction flips only after two mispredictions in a row)
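The state machine can be written as a 2-bit saturating counter (a common formulation of the scheme; a sketch):

```python
# Two-bit saturating counter: values 0..3; 2 and 3 mean "predict taken".
# The prediction changes only after two mispredictions in a row.

def predict(c):
    return c >= 2                 # True = predict taken

def update(c, taken):
    if taken:
        return min(c + 1, 3)      # saturate at "strongly taken"
    return max(c - 1, 0)          # saturate at "strongly not taken"

# starting from "strongly taken" (3), a single untaken branch does not
# change the prediction:
c = 3
c = update(c, False)
assert predict(c) is True         # still predicts taken after one miss
c = update(c, False)
assert predict(c) is False        # flips only after the second miss
```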


Actual    Prediction  Result
Taken     Untaken     mispredicted (transient)
Taken     Untaken     mispredicted (transient)
Taken     Taken       correct
…
Untaken   Taken       mispredicted
Taken     Taken       correct
…

2 mispredictions at first; then, in STEADY STATE, 9 successful predictions per loop execution

S.S. prediction accuracy is now 90%


(chart: frequency of mispredictions with programs from SPEC89 – nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li – using a 2-bit prediction buffer of 4096 entries; misprediction rates range from 0% to 18%)


Dynamic Branch Prediction

General Scheme

• In the general case, one could use an n-bit branch behaviour recorder and a branch history table of 2^m entries

• In this case:

  A change of prediction occurs only after 2^(n-1) mispredictions

  There is a higher chance that not too many branch addresses are associated with the same BHT entry

  Larger memory penalty


D.B.P.: Comparing the 2-bit with the General Case

(figure comparing the accuracy of the 2-bit scheme with the general case)


Dynamic Branch Prediction

Schemes

• One-bit prediction buffer

Good, but with limited accuracy

• Two-bit prediction buffer

Very good, greater accuracy, slightly higher

overhead

• Infinite-bit prediction buffer

As good as the two-bit one, but with a very

large overhead

• Correlating predictors


Dynamic Branch Prediction

Correlated predictors

• Two-level predictors

• If the behaviour of a branch is correlated

to the behaviour of another branch,

no single-level predictor would be able to

capture its behaviour

• Example:

if (aa == 2)

aa = 0;

if (bb == 2)

bb = 0;

if (aa != bb) {

• If we keep track of the recent behaviour

of other previous branches, our accuracy

may increase


• A simpler example:

if (d == 0) d = 1;

if (d == 1) …

• In DLX, this is

BNEZ R1, L1 ; b1 ( d != 0 )

MOV R1, #1

L1: SUBI R3, R1, #1

BNEZ R3, L2 ; b2 ( d != 1)

. . .

L2: . . .



• Let us assume that d is 0, 1 or 2

Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
0                    Yes     Untaken   1                      Yes     Untaken
1                    No      Taken     1                      Yes     Untaken
2                    No      Taken     2                      No      Taken


• This means that

  (b1 == untaken) ⇒ (b2 == untaken)

• A one-bit predictor may not be able to capture this property, and may behave very badly


• Let us suppose that d alternates between 2 and 0

• This is the table for the one-bit predictor:

d   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT

• ALL branches are mispredicted!
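The total failure is easy to reproduce (a sketch; b1 and b2 are modelled directly from the C fragment, with d alternating 2, 0, 2, 0, …):

```python
# Two per-branch 1-bit predictors on the alternating-d example:
# b1 tests (d != 0); the code then sets d = 1 when d was 0; b2 tests (d != 1).

def branches(d):
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    return b1, b2

pred = {"b1": False, "b2": False}   # start predicting "untaken"
wrong = total = 0
for d in [2, 0] * 50:
    for name, taken in zip(("b1", "b2"), branches(d)):
        if pred[name] != taken:
            wrong += 1
            pred[name] = taken      # 1-bit rule: flip on a misprediction
        total += 1

assert wrong == total               # every single branch is mispredicted
```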


• Correlated predictor: example

• Every branch, say branch number j>1, has two separate prediction bits

  First bit: predictor used if branch j-1 was NT

  Second bit: otherwise

• At the end of branch j:

  If (branch was mispredicted)
    BHT [ Behaviour_j_min_1 ] [ entry ] = 1 - BHT [ Behaviour_j_min_1 ] [ entry ]

• At the end of branch j-1:

  Behaviour_j_min_1 = (taken?) 1 : 0;

• At the beginning of branch j:

  predict branch as ( BHT [ Behaviour_j_min_1 ] [ entry ] );
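Running this (1,1) scheme on the same alternating-d pattern shows the correlation being captured (a sketch; each branch keeps two 1-bit entries, selected by the previous branch’s behaviour):

```python
# (1,1) correlating predictor on the d = 2, 0, 2, 0, ... example.

def branches(d):
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    return b1, b2

bht = {"b1": [False, False], "b2": [False, False]}  # [if prev NT, if prev T]
last = 0                     # behaviour of the previous branch (0 = NT)
results = []
for d in [2, 0] * 50:
    for name, taken in zip(("b1", "b2"), branches(d)):
        results.append(bht[name][last] == taken)
        if bht[name][last] != taken:
            bht[name][last] = taken     # flip only the selected entry
        last = int(taken)

# after a two-prediction warm-up every prediction is correct --
# unlike the plain 1-bit predictor, which got all of them wrong
assert not any(results[:2]) and all(results[2:])
```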


• The behaviour of a branch

selects a one-bit branch predictor

• If the prediction is not OK, its state is

flipped


• We may also consider the last TWO branches

  The behaviour of these two branches selects, e.g., a one-bit predictor:

  (NT NT, NT T, T NT, T T) → (0–3) → BHT[0..3]

  This is called a (2,1) predictor

  Or, the behaviour of the last two branches selects an n-bit predictor

  This is a (2,n) predictor


A (2,2) predictor: A 2-bit branch history entry selects

a 2-bit predictor


• General case: (m, n) predictors

  Consider the last m branches and their 2^m possible values

  This m-tuple selects an n-bit predictor

  A change in the prediction only occurs after 2^(n-1) mispredictions
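The general case can be sketched in a few lines (a sketch; the table size and the PC-modulo indexing are illustrative choices, not prescribed by the slides):

```python
# Generic (m, n) predictor: the global history of the last m branch
# outcomes selects one n-bit saturating counter per BHT entry.

class MNPredictor:
    def __init__(self, m, n, entries=256):
        self.m, self.n = m, n
        self.history = 0                      # last m outcomes, as m bits
        self.max = (1 << n) - 1               # n-bit counter saturates here
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        c = self.table[pc % len(self.table)][self.history]
        return c >= (1 << (self.n - 1))       # upper half: predict taken

    def update(self, pc, taken):
        row = self.table[pc % len(self.table)]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max) if taken else max(c - 1, 0)
        # shift the new outcome into the m-bit global history
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

# a (2, 2) predictor: a 2-bit history selects one of four 2-bit counters
p = MNPredictor(2, 2)
assert p.predict(0x40) is False               # counters start at "not taken"
```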


Dynamic Branch Prediction

Branch-Target Buffer

• A run-time technique to reduce the

branch penalty

• In DLX, it is possible to “predict” the new

PC, via a branch prediction buffer, during

the second stage of the pipeline

• With a Branch-Target Buffer (BTB), the

new PC can be derived during the first

stage of the pipeline


• The BTB is a branch-prediction cache that stores the addresses of taken branches

• An associative array which works as follows:

  (instruction address) → (branch target address)

• In case of a hit, we know the predicted instruction address one cycle earlier w.r.t. the branch-prediction buffer

• Fetching begins immediately at the predicted PC


• Design issues:

The entire address must be used

(correspondence must be one-to-one)

Limited number of entries in the BTB

Most frequently used

BTB requires a number of actions to be

executed during the first pipeline stage, also in

order to update the state of the buffer

The pipeline management gets more complex and

the clock cycle duration may have to be

increased


• Total branch penalty for a BTB

• Assumptions: penalties are as follows

Instruction is in buffer   Prediction   Actual branch   Penalty cycles
Yes                        Taken        Taken           0
Yes                        Taken        Untaken         2
No                         *            Taken           2

• Prediction accuracy: 90%

• Hit rate in buffer: 90%

• Taken-branch frequency: 60%


• Branch penalty =

  (buffer hit rate) × (incorrect prediction rate) × penalty
  + (1 - buffer hit rate) × (taken-branch frequency) × penalty

• With the figures above:

  90% × 10% × 2 + 10% × 60% × 2 = 0.18 + 0.12 = 0.30 clock cycles (vs. 0.50 for delayed branches)
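Spelled out with the slide’s numbers:

```python
# BTB branch-penalty computation, using the assumptions on the slide.
hit_rate = 0.90          # branch found in the BTB
accuracy = 0.90          # prediction accuracy, so 10% incorrect
taken = 0.60             # taken-branch frequency
penalty = 2              # cycles lost on a mispredict, or a miss + taken

branch_penalty = (hit_rate * (1 - accuracy) * penalty
                  + (1 - hit_rate) * taken * penalty)
assert abs(branch_penalty - 0.30) < 1e-9    # 0.18 + 0.12 cycles
```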


• The same approach can be applied to procedure return addresses

• Example:

0x4ABC CALL 0x30A0
0x4AC0 …
0x4CF4 CALL 0x30A0
0x4CF8 …

  entry 0x30A0 → stack { 0x4CF8, 0x4AC0 }

• Associative arrays of stacks

• If the cache is large enough, all return addresses are predicted correctly


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Superscalar


Superscalar architectures

• So far, the goal was reaching the ideal CPI = 1

• Further increasing performance by having CPI < 1 is the goal of superscalar processors (SP)

• To reach this goal, SPs issue multiple instructions in the same clock cycle

• Multiple-issue processors:

  VLIW (seen already)

  SP

    Statically scheduled (compiler)

    Dynamically scheduled (HW; scoreboarding/Tomasulo)

• In SPs, a varying # of instructions is issued, depending on structural limits and dependencies


• Superscalar version of DLX

• At most two instructions per clock cycle

can be issued

1. One of: load, store (integer or FP), branch,

integer ALU operation

2. A FP ALU operation

• IF and ID operate on 64 bits of

instructions

• Multiple independent FPUs are available


• The superscalar DLX is indeed a sort of “bidimensional pipeline”:

Integer instr.   IF ID EX MEM WB
FP instr.        IF ID EX MEM WB
Integer instr.      IF ID EX MEM WB
FP instr.           IF ID EX MEM WB
Integer instr.         IF ID EX MEM WB
FP instr.              IF ID EX MEM WB
Integer instr.            IF ID EX MEM WB
FP instr.                 IF ID EX MEM WB


• Every new solution breeds new problems…

• Latencies!

• When the latency of the load is 1:

  In the “monodimensional” pipeline, one cannot use the result of the load in the current and next cycle:

  P:   LD   NOP   LDc

  In the bidimensional pipeline of the SP, this means a loss of three cycles:

  Pi:  LD   NOP   LDc
  Pfp: NOP  NOP   LDc’


• Let us consider again the following loop:

Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
BNEZ R1, Loop

• Let us perform unrolling (x5) + scheduling on the superscalar DLX:


Superscalar architectures

        Integer instruction     FP instruction       Cycle
  Loop: LD   F0, 0(R1)                                 1
        LD   F6, -8(R1)                                2
        LD   F10, -16(R1)       ADDD F4, F0, F2        3
        LD   F14, -24(R1)       ADDD F8, F6, F2        4
        LD   F18, -32(R1)       ADDD F12, F10, F2      5
        SD   0(R1), F4          ADDD F16, F14, F2      6
        SD   -8(R1), F8         ADDD F20, F18, F2      7
        SD   -16(R1), F12                              8
        SD   -24(R1), F16                              9
        SUBI R1, R1, #40                              10
        BNEZ R1, Loop                                 11
        SD   -32(R1), F20                             12

• 12 clock cycles per 5 iterations = 2.4 cc/iteration
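The arithmetic behind this figure, as a quick check (plain Python, not from the slides):

```python
# Unrolled superscalar schedule: 5 loop iterations complete in 12 clock cycles.
cycles, iterations = 12, 5
cc_per_iter = cycles / iterations
print(cc_per_iter)  # 2.4 clock cycles per original iteration
```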


Superscalar architectures

• Superscalar = 2.4 cc/iteration vs. normal = 3.5 cc/iteration

• But in the example there were not enough FP instructions to keep the
  FP pipeline in use

  From cycle 8 to cycle 12, and in the first two cycles, each cycle
  issues just one instruction

• How to get more?

  Dynamic scheduling for the superscalar processor

  Multicycle extension of the Tomasulo algorithm


Superscalar architectures and the Tomasulo algorithm

• Idea: employ separate data structures for the integer and the FP
  registers

  Integer Reservation Stations (IRS)

  FP Reservation Stations (FRS)

• In the same cycle, issue a FP instruction (to a FRS) and an integer
  instruction (to an IRS)

• Note: issuing does not mean executing!

  Possible dependencies might serialize the two instructions issued in
  parallel

• Dual issue is obtained by pipelining the instruction-issue stage so
  that it runs twice as fast


Superscalar architectures

• Multiple issue strategy’s inherent

limitations:

The amount of ILP may be limited (see loop

p.134)

Extra HW is required

Multiple FPU and IU

More complex (-> slower) design

Extra need for large memory and register-file

bandwith

Increase in code size due to hard loop unrolling

Recall: CPUTIME

(p) = IC(p) CPI(p)

clock rate
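The CPU-time identity can be turned into a small sketch; the function name and the sample numbers below are mine, chosen to show that a lower CPI alone is not decisive:

```python
def cpu_time(ic, cpi, clock_rate_hz):
    """CPUtime(p) = IC(p) * CPI(p) / clock rate."""
    return ic * cpi / clock_rate_hz

# A lower CPI does not guarantee a win if the clock slows down:
fast_clock = cpu_time(ic=1_000_000, cpi=1.0, clock_rate_hz=1e9)    # 1 ms
low_cpi    = cpu_time(ic=1_000_000, cpi=0.5, clock_rate_hz=400e6)  # 1.25 ms
print(fast_clock < low_cpi)  # True
```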


Superscalar architectures: compiler support

• Symbolic loop unrolling

  The loop is not physically unrolled, but reorganized so as to
  eliminate dependencies

• Software pipelining:

  Dependencies are eliminated by interleaving instructions from
  different iterations of the loop

  The loop is not unrolled

  Before:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

  After software pipelining:

    <startup>
    Loop: SD   0(R1), F4        ; store for iteration i
          ADDD F4, F0, F2       ; add for iteration i+1
          LD   F0, -16(R1)      ; load for iteration i+2
          SUBI R1, R1, #8
          BNEZ R1, Loop
    <clean-up>

  RAW dependences: problematic    WAR dependences: HW removable
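A minimal Python sketch of the same reorganisation, applied to an array update `a[i] = a[i] + c` iterated downward as in the DLX code. The function names and the three-deep interleaving are my illustration, not the slides' code:

```python
def plain_loop(a, c):
    # Each iteration loads, adds and stores the SAME element i
    # (LD / ADDD / SD / SUBI / BNEZ), iterating downward.
    for i in range(len(a) - 1, -1, -1):
        a[i] = a[i] + c

def software_pipelined(a, c):
    # Interleave instructions from three different iterations:
    # store for iteration i+2, add for iteration i+1, load for i.
    n = len(a)
    if n < 3:
        plain_loop(a, c)
        return
    # <startup>: prime the "pipeline" with the first load and add
    loaded = a[n - 1]           # LD   for iteration n-1
    summed = loaded + c         # ADDD for iteration n-1
    loaded = a[n - 2]           # LD   for iteration n-2
    for i in range(n - 3, -1, -1):
        a[i + 2] = summed       # SD:   store for iteration i+2
        summed = loaded + c     # ADDD: add for iteration i+1
        loaded = a[i]           # LD:   load for iteration i
    # <clean-up>: drain the pipeline
    a[1] = summed
    a[0] = loaded + c
```

Both functions produce identical arrays; the pipelined version simply carries values for two in-flight iterations across the loop body, which is exactly what breaks the load-use dependency chain inside one iteration.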


Superscalar architectures: compiler support

• Trace scheduling

• Aim: tackling the problem of too-short basic blocks

• Method:

  Trace selection

  Trace compaction


Superscalar architectures: compiler support

• Trace selection:

  A number of contiguous basic blocks are put together into a "trace"

  Using static branch prediction, the conditional branches are chosen
  as taken/untaken, while loop branches are considered as taken

  [Diagram: basic blocks A, B, C joined into a trace; the test exit to
  off-trace block X leaves the trace, with book-keeping code at the
  exit]


Superscalar architectures: compiler support

• Trace compaction:

  The resulting trace is a longer straight-line sequence of code

  Trace compaction = global code scheduling

  [Diagram: blocks A, B, C scheduled together as one basic block whose
  size is that of A + B + C, with book-keeping at the off-trace exits]

• Speculative movement of code


Superscalar architectures: HW support

• Conditional instructions: instructions like

    CMOVZ R2, R3, R1

  which means

    if (R1 == 0) R2 = R3;

  or, equivalently,

    (R1 == 0) ? R2 = R3 : /* NOP */;

• The instruction turns into a NOP if the condition is not met

  This also means that no exceptions are raised!

• Using conditional instructions we convert a control dependence (due
  to a branch) into a data dependence

• Speculative transformation in a two-issue superscalar with
  conditional instructions:


Superscalar architectures: HW support: conditional instructions

  Integer instruction       FP instruction     Cycle
  LW   R1, 40(R2)           ADDD R3,R4,R5        1
                            ADDD R6,R3,R7        2
  BEQZ R10, L                                    3
  LW   R8, 20(R10)                               4
  LW   R9, 0(R8)                                 5

  With a conditional load:

  Integer instruction       FP instruction     Cycle
  LW   R1, 40(R2)           ADDD R3,R4,R5        1
  LWC  R8, 20(R10), R10     ADDD R6,R3,R7        2
  BEQZ R10, L                                    3
  LW   R9, 0(R8)                                 4

We speculate on the outcome of the branch. If the condition is not met,
we don't slow down the execution, because we used a slot that would
otherwise have been lost
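The semantics of a conditional move such as CMOVZ can be mimicked in Python; this is only a sketch, with registers modelled as plain values:

```python
def cmovz(r2, r3, r1):
    # CMOVZ R2, R3, R1 : copy R3 into R2 only when R1 == 0;
    # otherwise the instruction behaves as a NOP and R2 keeps its value.
    # The branch has become a data dependence on r1.
    return r3 if r1 == 0 else r2

print(cmovz(7, 42, 0))  # 42 (move performed)
print(cmovz(7, 42, 5))  # 7  (acts as a NOP)
```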


Superscalar architectures: HW support: conditional instructions

• Conditional instructions are useful to implement short alternative
  control flows

• Their usefulness, though, is limited by several factors:

  Conditional instructions that are annulled still take execution
  time, unless they are scheduled into otherwise-wasted slots

  They are good only in limited cases, when there is a simple
  alternative sequence

  Moving an instruction across multiple branches would require
  doubly-conditional instructions!

    LWCC R1, R2, R10, R12   (makes no sense)

  They do extra work w.r.t. their "regular" versions

  The extra time required for the test may cost more cycles than the
  regular versions take


Superscalar architectures: HW support: conditional instructions

• Most architectures support a few conditional instructions
  (conditional move)

• The HP PA architecture allows any register-register instruction to
  turn the next instruction into a NOP, which makes that next
  instruction conditional

• Exceptions


Superscalar architectures: HW support: conditional instructions

• Exceptions:

  Fatal (normally causing termination; e.g., memory protection
  violation)

  Resumable (causing a delay, but no termination; e.g., page fault)

• Resumable exceptions can be processed for speculative instructions
  just as if they were normal instructions

  The corresponding time penalty is not considered incorrect behaviour

• Fatal exceptions cannot be handled by speculative instructions,
  hence they must be deferred to the next non-speculative instruction


Superscalar architectures: HW support: conditional instructions

• Moving instructions across a branch must not affect

  The (fatal) exception behaviour

  The data dependences

• How to obtain this?

  1. All the exceptions triggered by speculative instructions are
     ignored by HW and OS

     The HW and OS do handle all exceptions, but return an undefined
     value for any fatal exception. The program is allowed to
     continue, though this will almost certainly lead to incorrect
     results

     Note: scheme 1 can never cause a correct program to fail,
     regardless of whether speculation was used or not


Superscalar architectures: HW support: conditional instructions

  2. Poison bits: a speculative instruction does not trigger any
     exception, but turns a bit on in the involved result registers.
     The next "normal" (non-speculative) instruction using those
     registers will be "poisoned" -> it will cause an exception

  3. Boosting: renaming and buffering in the HW (similar to the
     Tomasulo approach)

• Speculation can be used, e.g., to optimize an if-then-else such as

    if (a == 0) a = b; else a = a + 4;

  or, equivalently,

    a = (a == 0) ? b : a + 4;


Superscalar architectures: HW support: conditional instructions

• Suppose A is in 0(R3) and B in 0(R2)

• Example:

        LW   R1, 0(R3)    ; load A
        BNEZ R1, L1       ; A != 0 ? GOTO L1
        LW   R1, 0(R2)    ; load B
        J    L2           ; skip ELSE
    L1: ADD  R1, R1, 4    ; ELSE part
    L2: SW   0(R3), R1    ; store A

• Speculation:

        LW   R1, 0(R3)    ; load A
        LW   R9, 0(R2)    ; load speculatively B
        BNEZ R1, L3
        ADD  R9, R1, 4    ; here R9 is A+4
    L3: SW   0(R3), R9    ; here R9 is A+4 or B

• In this case, a temporary register is used

• Method 1: speculation is transparent
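A Python rendering of the two sequences above, with memory as a dict and registers as locals (function names are mine); it shows that the speculative version computes the same result whichever way the branch goes:

```python
def update_a(mem, addr_a, addr_b):
    # Non-speculative version: the branch decides which value is stored.
    r1 = mem[addr_a]           # LW   R1, 0(R3) ; load A
    if r1 != 0:                # BNEZ R1, L1
        r1 = r1 + 4            # ADD  R1, R1, 4 ; ELSE part
    else:
        r1 = mem[addr_b]       # LW   R1, 0(R2) ; load B
    mem[addr_a] = r1           # SW   0(R3), R1 ; store A

def update_a_speculative(mem, addr_a, addr_b):
    # Speculative version: B is loaded into the temporary "register" R9
    # before the branch outcome is known; when the speculation turns out
    # wrong (A != 0), the ADD simply overwrites R9.
    r1 = mem[addr_a]           # LW   R1, 0(R3)
    r9 = mem[addr_b]           # LW   R9, 0(R2) ; speculative load of B
    if r1 != 0:                # BNEZ R1, L3
        r9 = r1 + 4            # ADD  R9, R1, 4 ; here R9 is A+4
    mem[addr_a] = r9           # L3: SW 0(R3), R9 ; A+4 or B
```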


Superscalar architectures: HW support: conditional instructions

• Method 2 applied to the previous code fragment:

        LW   R1, 0(R3)    ; load A
        LW*  R9, 0(R2)    ; load speculatively B
        BNEZ R1, L3
        ADD  R9, R1, 4    ; here R9 is A+4
    L3: SW   0(R3), R9    ; here R9 is A+4 or B

• LW* is a speculative version of LW

• LW* is an opcode that turns on the poison bit of register R9

• The next non-speculative instruction using R9 will be "poisoned": it
  will cause an exception

• If another speculative instruction uses R9, the poison bit is
  inherited
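A rough sketch of the poison-bit scheme as a register-file model. The class and method names are hypothetical, and this toy only poisons a register when a speculative write actually faulted; a fuller model would also propagate poison to the destinations of speculative readers:

```python
class PoisonedRead(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""

class RegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = set()

    def write(self, reg, value, faulted=False, speculative=False):
        # A speculative instruction that would fault does not trap:
        # it only sets the poison bit on its destination register.
        self.value[reg] = value
        if speculative and faulted:
            self.poison.add(reg)
        else:
            self.poison.discard(reg)

    def read(self, reg, speculative=False):
        # Only a NON-speculative reader of a poisoned register traps.
        if reg in self.poison and not speculative:
            raise PoisonedRead(reg)
        return self.value.get(reg, 0)
```

Usage: after a faulting `LW*` into R9 (`rf.write("R9", 0, faulted=True, speculative=True)`), speculative reads of R9 proceed, while the first normal read raises `PoisonedRead`.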


Superscalar architectures: HW support: conditional instructions

• Combining speculation with dynamic scheduling

  An attribute bit is added to each instruction (1: speculative,
  0: normal)

  When that bit is 1, the instruction is allowed to execute, but
  cannot enter the commit (WB) stage

  The instruction then has to wait until the end of the speculated
  code

  It is allowed to modify the register file / memory only at the end
  of speculative mode

• Hence: instructions execute out of order, but are forced to commit
  in order

• A special set of buffers holds the results that have finished
  execution but have not committed yet (reorder buffers)
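A minimal model of the out-of-order-complete / in-order-commit discipline described above (class and method names are mine):

```python
from collections import OrderedDict

class ReorderBuffer:
    # Instructions enter in program order; results may arrive out of
    # order; commit drains only the in-order prefix of finished entries,
    # so the register file / memory is updated strictly in order.
    def __init__(self):
        self.entries = OrderedDict()   # tag -> result (None = in flight)

    def issue(self, tag):
        self.entries[tag] = None       # allocate an entry in program order

    def finish(self, tag, result):
        self.entries[tag] = result     # out-of-order completion

    def commit(self):
        committed = []
        while self.entries:
            tag, result = next(iter(self.entries.items()))
            if result is None:
                break                  # head not finished: younger results wait
            self.entries.popitem(last=False)
            committed.append((tag, result))
        return committed
```

For example, if i3 and i1 finish before i2, only i1 commits; once i2 finishes, i2 and i3 commit together, preserving program order.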


Superscalar architectures: HW support: conditional instructions

• Since neither the register values nor the memory values are actually
  WRITTEN until an instruction commits, the processor can easily undo
  its speculative actions when a branch is found to be mispredicted

• If a speculated instruction raises an exception, this is recorded in
  the reorder buffer

• In case of a branch misprediction such that a certain speculative
  instruction should not have been executed, the exception is flushed
  along with the instruction when the reorder buffer is cleared


Superscalar architectures: HW support: conditional instructions

• Reorder buffers:

  An additional set of virtual registers that hold the results of
  instructions that have finished execution but have not committed yet

  Issue: only when both a Reservation Station and a reorder buffer are
  available

  As soon as an instruction completes, its output goes into its
  reorder buffer

  Until the instruction has committed, dependent instructions receive
  their input from the reorder buffer (the Reservation Station is
  freed; the reorder buffer is not)

  The actual updating of the registers takes place when the
  instruction reaches the head of the list of reorder buffers


Superscalar architectures: HW support: conditional instructions

• At this point the commit phase takes place:

  Either the result is written into the register file,

  Or, in case of a mispredicted branch, the reorder buffer is flushed
  and execution restarts at the correct successor of the branch

• Assumption: when a branch with an incorrect prediction reaches the
  head of the buffer, it means that the speculation was wrong


Superscalar architectures: HW support: conditional instructions

• This technique also allows us to tackle situations like

    if (cond) do_this; else do_that;

• One may "bet" on the outcome of the branch and say, e.g., that it
  will be taken

• Even unlikely events do happen, so sooner or later a misprediction
  occurs

• Idea: let the instructions in the else part (do_that) issue and
  execute, with a separate list of reorder buffers (list2)

• This second list is simpler: we don't check for the current
  head-of-list. Elements in there need to be explicitly removed

• In case of a misprediction, the do_that part has already been
  executed in the second list, and we just need to perform its commit

• In case of a correct prediction, the ELSE part is purged from list2


Superscalar architectures

• If a processor A has a lower CPI w.r.t. another processor B, will A
  always run faster than B?

• Not always!

  A higher clock rate is a deterministic measure of the performance
  improvement

  A multiple-issue (superscalar) architecture cannot guarantee its
  improvements (they are stochastic)

  Pushing towards a low CPI means adopting sophisticated (= complex)
  techniques… which slows down the clock rate!

  Improving one aspect of a multiple-issue processor does not
  necessarily lead to overall performance improvements


Superscalar architectures

• A simple question:

  "How much ILP exists in a program?"

  or, in other words, "How much can we expect from techniques that are
  based on the exploitation of ILP?"

• How to proceed:

  Make a set of very optimistic assumptions and measure how much
  parallelism is available under those assumptions


Superscalar architectures

• Assumptions (HW model of an ideal processor):

  1. Infinite # of virtual registers (-> no WAW or WAR hazard can
     suspend the pipeline)

  2. All conditional branches are predicted exactly (!!)

  3. All computed jumps and returns are perfectly predicted

  4. All memory addresses are known exactly, so a store can be moved
     before a load, provided that the addresses are not identical

  5. Infinite-issue processor

  6. No restriction on the types of instructions that can execute in a
     cycle (no structural hazards)

  7. All latencies are 1


Superscalar architectures

• How to match these assumptions??

• Gambling!

• We run a program and produce a trace with the outcomes of all the
  instances of each branch

  Taken, Taken, Taken, Untaken, Taken, …

  Each corresponding target address is recorded and assumed to be
  available

  Then we use a simulator to mimic, e.g., a machine with infinite
  virtual registers, etc.

• Results are depicted in the next picture

• Parallelism is expressed in IPC: instruction issues per clock cycle
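Such a limit study can be sketched as a dataflow simulation: under the ideal assumptions, an instruction can issue one cycle after its last producer, so only true (RAW) dependences bound the IPC. The trace representation below is my own simplification:

```python
def ideal_ipc(program):
    # program: list of (dest_register, [source_registers]) in trace order.
    # Perfect prediction, infinite registers and issue width, unit
    # latency: each instruction issues at 1 + max(issue cycle of its
    # producers), and IPC = instructions / cycles of the longest chain.
    ready = {}            # register -> cycle its producer issued
    last_cycle = 0
    for dest, sources in program:
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(program) / last_cycle

# Two independent two-instruction chains: 4 instructions in 2 cycles.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(ideal_ipc(trace))  # 2.0
```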


Superscalar architectures

• Tomcatv reaches 150 IPC (for a particular run)

  ILP available in the SPEC benchmarks on the ideal processor
  (instruction issues per cycle, scale 0–160):

    gcc        54.8
    espresso   62.6
    li         17.9
    fpppp      75.2
    doduc     118.7
    tomcatv   150.1


Superscalar architectures

• Then we can relax the above assumptions and introduce limitations
  that represent our current possibilities with computer design
  techniques for ILP:

  Window size: the actual range of instructions we inspect when
  looking for candidates for simultaneous issue

  Realistic branch prediction

  Finite # of registers

• See images 4-39 and 4-40
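A rough way to model the window-size limitation in software (my own formulation, not the simulator behind the figures): in each cycle a scheduler may issue, out of the oldest `window` not-yet-issued trace instructions, those whose operands are already available (unit latency):

```python
def windowed_ipc(program, window):
    # program: list of (dest_register, [source_registers]) in trace order.
    n = len(program)
    done = [False] * n
    avail = {}            # register -> first cycle its value can be used
    head, cycle = 0, 0
    while head < n:
        cycle += 1
        # Only the oldest `window` un-issued instructions are visible.
        for i in range(head, min(head + window, n)):
            if done[i]:
                continue
            dest, sources = program[i]
            if all(avail.get(s, 0) <= cycle for s in sources):
                done[i] = True
                avail[dest] = cycle + 1   # unit latency: usable next cycle
        while head < n and done[head]:
            head += 1                     # slide the window forward
    return n / cycle

# Two independent two-instruction chains: a wide window overlaps them,
# a window of 1 serializes the whole trace.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(windowed_ipc(trace, 4))  # 2.0
print(windowed_ipc(trace, 1))  # 1.0
```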


Superscalar architectures

  [Figure (image 4-39): instruction issues per cycle (0–160) vs.
  window size (Infinite, 2k, 512, 128, 32, 8, 4) for gcc, espresso,
  li, fpppp, doduc, tomcatv]


Superscalar architectures

  [Figure (image 4-40): per-benchmark instruction issues per cycle
  (0–160) for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes
  Infinite, 512, 128, 32, 8, 4]


Superscalar architectures: conclusive notes

• In the next 10 years it is realistic to reach an architecture that
  looks like this:

  64 instruction issues per clock cycle

  Selective predictor with 1K entries; 16-entry return predictor

  Perfect disambiguation of memory references

  Register renaming with 64 + 64 extra registers

• Computer architectures in practice: Section 4.8 (PowerPC 620)


Superscalar architectures: conclusive notes

• Reachable performance:

  [Figure: instruction issues per cycle (0–60) vs. window size
  (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li,
  fpppp, doduc, tomcatv]


Pipelining and communications

• Suppose that N+1 processes need to communicate a private value to
  all the others

• They use all the values to produce their next output (e.g., for
  voting)

• Communication is fully synchronous and needs to be repeated m times,
  with m large


Pipelining and communications

• Let us assume that no bus is available

• Point-to-point communication

• Processes are numbered p0 … pN

• Two instructions are available:

  Send(pj, value)

  Receive(pj, &value)

• Blocking functions

• If the receiver is ready to receive, the call lasts one stage time;
  otherwise it blocks the caller for a multiple of the stage time

• Sending and receiving occur at discrete time steps


Pipelining and communications

• At each time t, processor pi may be

  Sending data (next stage pi is unblocked)

  Receiving data (next stage pi is unblocked)

  Blocked in a Receive()

  Blocked in a Send()

• Slot = the time corresponding to an entire stage

• At each time t we have one slot per process

• If pi is blocked, its slot is wasted (it's a "bubble")

• Otherwise the slot is used


Pipelining and communications

• At each time t, processor pi may be in

  State S(j):  sending data to processor pj

  State R(j):  receiving data from pj

  State WR(j): blocked in a Receive(pj, …)

  State WS(j): blocked in a Send(pj, …)

• We use the notation

    proc  s_t  proc'

  to indicate that, at time t, proc is in state s with proc'

• For instance,

    p1  WR(3)_21  p3

  means that the 21st slot of p1 is wasted waiting for p3 to send its
  value to it


Pipelining and communications

• The following algorithm is executed by process j:

  Before gaining the right to broadcast, process j needs to go through
  j pairs of states (WR, R)

  Then it performs the ordered broadcast: the k-th message to be sent
  goes to process pk

  Finally, process j goes through N-j pairs of states (WR, R)


Pipelining and communications

• p is a vector of indices

• For process j, p can be any arrangement of the integers 0, 1, …,
  j-1, j+1, …, N

• Whatever the arrangement, the algorithm works correctly

• For instance, if N = 4 (5 processes) and j = 1, then p can be any
  permutation of 0, 2, 3, and 4

• p determines the order in which process j sends its value to its
  neighbours

• Example: p[] = [3, 2, 0, 4]. Then p1 executes:

    send(p3), send(p2), send(p0), send(p4)


Pipelining and communications

• Example: p[] = the ordered permutation

  E.g., N = 5 and, for process j, p = [0, …, j-1, j+1, …, N]

  [Figure: frequencies of used slots, slots wasted in send, slots
  wasted in receive, and total duration for this case]


Pipelining and communications

• Case N = 20, p[] = the ordered permutation

• Gray = wasted slots

• Black = used slots

• In general, we look at:

  The duration

  Used slots / total # of slots

  The average # of used slots during one stage time

• This image reminds us of another one:


Pipelining and communications

• No pipelining: many slots are wasted!

  [Figure (from Part 2.2): sequential laundry; jobs A, B, C, D each
  occupy four 30-minute stages back-to-back on the time axis from
  6 PM to 2 AM]


Pipelining and communications

• Let us now consider the case in which processor k uses

    p[] = [ k+1, k+2, …, N, 0, 1, …, k-1 ]
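This wrapped-around send order is easy to state as a function (a sketch; the name is mine):

```python
def send_order(k, n_max):
    # Processor k sends to k+1, k+2, ..., N, then wraps around to
    # 0, 1, ..., k-1 (processes are numbered p0 .. pN, n_max = N).
    return list(range(k + 1, n_max + 1)) + list(range(0, k))

print(send_order(2, 4))  # [3, 4, 0, 1]
```

Each processor thus starts by sending to its immediate successor, which staggers the senders and is what enables the pipelined behaviour compared in the following slides.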


Pipelining and communications


Pipelining and communications

• Duration: first case vs. second case


Pipelining and communications

• Efficiency: first case vs. second case


Pipelining and communications

• Algorithm of pipelined broadcast

  Beginning of the steady state

  Every 10 slots, 5 mark the completion of a broadcast

  Throughput = t / 2 (t = 1 slot time): a full broadcast is finished
  every 2t

• The image may remind us of another one…


• Between 7.30 and 9.30 pm, a whole job is completed every 30'

  [Figure (pipelining, slide P2.2/20): the pipelined laundry; jobs A,
  B, C, D overlap in 30-minute stages between 6 PM and 2 AM]

• During that period, each worker is permanently at work…

• …but a new input must arrive within 30'