Advanced Computer Architectures HB49 Part 2.3 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA

Uploaded by vincenzo-de-florio, posted 06-May-2015

DESCRIPTION

Part 2.3 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B)

TRANSCRIPT

Page 1: Advanced Computer Architectures – Part 2.3

Advanced Computer

Architectures

– HB49 –

Part 2.3

Vincenzo De Florio

K.U.Leuven / ESAT / ELECTA

Page 2: Advanced Computer Architectures – Part 2.3

© V. De Florio, KULeuven 2002

(Sidebar on every slide: Basic Concepts | Computer Design | Computer Architectures for AI | Computer Architectures in Practice)

2.3/2

Course contents

• Basic Concepts

• Computer Design

• Computer Architectures for AI

• Computer Architectures in Practice

Page 3: Advanced Computer Architectures – Part 2.3


Computer Design

• Quantitative assessments

• Instruction sets

• Pipelining

• Parallelism

Page 4: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

• (Data level parallelism): Part 3

• (Task level parallelism): Part 3

Page 5: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program,

process, thread, language aspects

Types of parallelism

• Instruction level parallelism

Page 6: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program,

process, thread

Types of parallelism

• Instruction level parallelism

Page 7: Advanced Computer Architectures – Part 2.3


Granularity

• Definition:

granularity is the complexity/grain size of

some item

e.g. computation item (instruction),

data item (scalar, array, struct),

communication item (token granularity),

hardware building block (gate,

RTL component)

Granularity scale, from low to high:

RISC (e.g. add r1,r2,r4)

CISC (e.g. ld *a0++,r1)

High-level languages (HLLs) (e.g. x = sin(y))

Application-specific (e.g. edge-detect / invert-image)

Page 8: Advanced Computer Architectures – Part 2.3


Granularity

• Deciding the granularity is an important

design choice

• E.g. grain size for the communication

tokens in a parallel computer:

coarse grain: less communication overhead

fine grain: less time penalty when two

communication packets compete for

transmission over the same channel and collide

Page 9: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

Basic concepts: granularity, program, process,

thread

Types of parallelism

• Instruction level parallelism

Page 10: Advanced Computer Architectures – Part 2.3


Types of parallelism

• Functional parallelism

Different computations have to be performed

on the same or different data

E.g. Multiple users submit jobs to the same

computer or a single user submits multiple jobs

to the same computer

this is functional parallelism at the process level

taken care of at run-time by the OS

Important for

the exam!

Page 11: Advanced Computer Architectures – Part 2.3


Types of parallelism

• Data parallelism

Same computations have to be performed on a

whole set of data

E.g. 2D convolution of an image

This is data parallelism at the loop level:

consecutive loop iterations are candidates for

parallel execution, subject to inter-iteration data

dependencies

Often leads to a massive amount of parallelism

Important for

the exam!

Page 12: Advanced Computer Architectures – Part 2.3


Levels of parallelism

• Instruction level parallel (ILP)

Functional parallelism at the instruction level

Example: pipelining

• Data level parallel (DLP)

Data parallelism at the loop level

• Process & thread level parallel (TLP)

Functional parallelism at the thread and

process level

Page 13: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 14: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 15: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Sequential instruction issuing, sequential

instruction execution

von Neumann processors

(Diagram: one instruction word feeding a single EU)

Page 16: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Sequential instruction issuing, parallel

instruction execution

pipelined processors

(Diagram: one instruction word feeding EU1, EU2, EU3, EU4 through the pipeline)

Page 17: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Parallel instruction issuing –

compile-time determined by compiler,

parallel instruction execution

VLIW processors:

Very Long Instruction Word

(Diagram: one very long instruction word feeding EU1, EU2, EU3, EU4 in parallel)

Page 18: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Parallel instruction issuing – run-time

determined by HW dispatch unit,

parallel instruction execution

super-scalar processors (to be seen later)

(Diagram: an instruction window feeding EU1, EU2, EU3, EU4 in parallel)

Page 19: Advanced Computer Architectures – Part 2.3


Type of Instruction Level

Parallelism utilization

• Most processors provide sequential execution semantics

regardless of how the processor actually executes the instructions (sequentially or in parallel, in-order or out-of-order), the result is the same as sequential execution in the order they were written

• VLIW and IA-64 provide parallel execution semantics

explicit indication in assembly of which instructions are executed in parallel

Page 20: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 21: Advanced Computer Architectures – Part 2.3


VLIW

(Diagram: Main instruction memory (128 bit) → Instruction Cache (128 bit) → Instruction Register; 4 decoders (32 bit each in, 256 decoded bits each out) drive 4 EUs; Register file: 32 bit each, 8 read ports, 4 write ports; Cache/RAM: 32 bit each, 2 read ports, 1 write port; Main data memory: 32 bit, 1 bi-directional port)

Page 22: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Multiple Execution Units: multiple instructions

issued in one clock cycle

Every EU requires 2 operands and delivers one

result every clock cycle: high data memory

bandwidth needed

Careful design of data memory hierarchy

Register file with many ports

Large register file: 64-256 registers

Carefully balanced cache/RAM hierarchy with

decreasing number of ports and increasing

memory size and access time for the higher

levels (IMEC research: DTSE)

Page 23: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Compiler should determine which instructions

can be issued in a single cycle without control

dependency conflict nor data dependency

conflict

Deterministic utilization of parallelism: good for

hard-real-time

Compile-time analysis of source code: worst case

analysis instead of actual case

Very sophisticated compilers are needed, especially when the EUs are pipelined! They have performed well since the early 2000s

Page 24: Advanced Computer Architectures – Part 2.3


VLIW

• Properties

Compiler should determine which instructions

can be issued in a single cycle without control

dependency conflict nor data dependency

conflict

Very difficult to write assembly:

programmer should resolve all control flow conflicts

all data flow conflicts

all pipelining conflicts

and at the same time fit data accesses into the

available data memory bandwidth

and all program accesses into the available program

memory bandwidth

e.g. 2 weeks for a sum-of-products (3 lines of C code)

All high end DSP processors since 1999 are

VLIW processors (examples: Philips Trimedia --

high end TV, TI TMS320C6x -- GSM base

stations and ISP modem arrays)

Page 25: Advanced Computer Architectures – Part 2.3


Low power DSP

(Diagram: the same VLIW datapath as before: Main instruction memory (128 bit) → Instruction Cache (128 bit) → Instruction Register → 4 decoders → 4 EUs, with the same register file and memory ports)

Too much power dissipation in fetching wide instructions

Page 26: Advanced Computer Architectures – Part 2.3


Low power DSP

(Diagram: Main instruction memory (24 bit) → 24-bit Instruction Cache → instruction expansion → Instruction Register (128 bit) → 4 decoders (32 bit each, 256 decoded bits each) → 4 EUs, with the same register file and memory ports as before)

E.g. ADD4 is expanded into

ADD || ADD || ADD || ADD

Page 27: Advanced Computer Architectures – Part 2.3


Low power DSP

• Properties

Power consumption in program memory is

reduced by specializing the instructions for the

application

Not all combinations of all instructions for the

EUs are possible, but only a limited set, i.e.

those combinations that lead to a substantial

speed-up of the application

Those relevant combinations are represented

by the smallest possible amount of bits to

reduce program memory width and hence

program memory power consumption

Can only be done for embedded DSP

applications: processor is specialized for 1

application (examples: TI TMS320C54x -- GSM

mobile phones, TI TMS320C55x -- UMTS mobile

phones)

Page 28: Advanced Computer Architectures – Part 2.3


Low power DSP

for interactive

multimedia

(Diagram: as the low-power DSP datapath above, but with reconfigurable execution units (REUs) and a reconfigurable instruction-expansion stage between the 24-bit instruction cache and the 128-bit instruction register)

Run-time reconfiguration allows the specialization to be adapted to changing application requirements

Page 29: Advanced Computer Architectures – Part 2.3


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Super scalar

Page 30: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• Pipeline CPI is the result of many

components

• A number of techniques act on one or

more of these components:

Loop unrolling

Scoreboarding

Dynamic branch prediction

Speculation

• To be seen later

CPUTIME(p) = IC(p) × CPI(p) / clock rate

Page 31: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• Until now, instruction-level parallelism was sought within the boundaries of a basic block (BB)

• A BB is 6-7 instructions on average

too small to reach the expected

performance

• What is worse, there’s a big chance that

these instructions have dependencies

Even less performance can be expected

Page 32: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• To obtain more, we need to go beyond the

BB limitation:

• We must exploit ILP across multiple BB’s

• Simplest way: loop level parallelism (LLP):

Exploiting the parallelism among iterations of a

loop

• Converting LLP into ILP

Loop unrolling

Statically (compiler-based)

Dynamically (HW-based)

• Using vector instructions

Does not require LLP -> ILP conversion

Page 33: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

• The efficiency of the conversion depends

On the amount of ILP available

On latencies of the functional units in the

pipeline

On the ability to avoid pipeline stalls by

separating dependent instructions by a

“distance” (in terms of stages) equal to the

latency peculiar to the source instruction

LW x, …

INSTR …, x

a load must not be followed by the

immediate use of the load destination

register

Page 34: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

Assumptions and steps

1. We assume the following latencies

Producer instruction    Consumer instruction    Latency
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1
Load double             Store double            0

Page 35: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

2. We assume to work with a simple loop

such as

for (I=1; I<=1000; I++)
    x[I] = x[I] + s;

• Note: each iteration is independent of

the others

Very simple case

Page 36: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

3. Translated in DLX, this simple loop looks

like this:

; assumptions: R1 = &x[1000]

; F2 = s

Loop: LD F0, 0(R1) ; F0 = x[I]

ADDD F4, F0, F2 ; F4 = F0 + s

SD 0(R1), F4 ; store result

SUBI R1, R1, #8 ; R1 -= 8 (one 8-byte element)

BNEZ R1, Loop ; if (R1)

; goto Loop


Page 37: Advanced Computer Architectures – Part 2.3


4. Tracing the loop (no scheduling!):

Loop: LD F0, 0(R1) ; 1

stall 2

ADDD F4, F0, F2 ; 3

stall 4

stall 5

SD 0(R1), F4 ; 6

SUBI R1, R1, #8 ; 7

BNEZ R1, Loop ; 8

stall ; 9

• 9 clock cycles per iteration, with 4 stalls

Advanced Pipelining

Loop unrolling

Page 38: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

5. With scheduling, we move from

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

to

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SUBI R1, R1, #8

BNEZ R1, Loop

SD 8(R1), F4


whose trace shows that fewer cycles are wasted:

Page 39: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

6. Tracing the loop (with scheduling!):

Loop: LD F0, 0(R1) ; 1

stall 2

ADDD F4, F0, F2 ; 3

SUBI R1, R1, 8 ; 4

BNEZ R1, Loop ; 5

SD 8(R1), F4 ; 6

• 6 clock cycles per iteration, with 1 stall

• 3 stalls less!

• Still the useful cycles are just 3

• How to gain more?


Page 40: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

7. With loop unrolling:

replicating the loop body multiple times

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

LD F6, -8(R1) ; F6 vs. F0

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

• Saves 3 × (SUBI + BNEZ)

Page 41: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Loop unrolling:

replicating the loop body multiple times

Some branches are eliminated

The ratio of useful work to loop overhead increases

The BB artificially increases its size

Higher probability of optimal scheduling

Requires a larger set of registers and adjusted offsets in the load and store instructions

(In the given example,) Every operation is

followed by a dependent instruction

Will cause a stall

Trace of unscheduled unrolled loop: 27 cycles

2 per LD, 3 per ADD, 2 per branch, 1 per any other

6.8 clock cycles per iteration

Pure scheduling is better! (6 cycles)

Page 42: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

LD F6, -8(R1) ; F6 vs. F0

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 43: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

LD F10, -16(R1) ; F10 vs. F0

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 44: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

LD F14, -24(R1) ; F14 vs. F0

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 45: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

LD F14, -24(R1) ; F14 vs. F0

ADDD F4, F0, F2

SD 0(R1), F4 ; skip SUBI and BNEZ

ADDD F8, F6, F2 ; F8 vs. F4

SD -8(R1), F8 ; skip SUBI and BNEZ

ADDD F12, F10, F2 ; F12 vs. F4

SD -16(R1), F12 ; skip SUBI and BNEZ

ADDD F16, F14, F2 ; F16 vs. F4

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

Page 46: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolled loop plus scheduling

Loop: LD F0, 0(R1)

LD F6, -8(R1) ; F6 vs. F0

LD F10, -16(R1) ; F10 vs. F0

LD F14, -24(R1) ; F14 vs. F0

ADDD F4, F0, F2

ADDD F8, F6, F2 ; F8 vs. F4

ADDD F12, F10, F2 ; F12 vs. F4

ADDD F16, F14, F2 ; F16 vs. F4

SD 0(R1), F4 ; skip SUBI and BNEZ

SD -8(R1), F8 ; skip SUBI and BNEZ

SD -16(R1), F12 ; skip SUBI and BNEZ

SD -24(R1), F16 ; skip SUBI and BNEZ

SUBI R1, R1, #32 ; R1 -= 32 (4 elements)

BNEZ R1, Loop

• 14 clock cycles, or 3.5 clock cycles / iteration

Enough distance to prevent the dependency from turning into a hazard

Page 47: Advanced Computer Architectures – Part 2.3


Advanced Pipelining

Loop unrolling

• Unrolling the loop exposes more

computation that can be scheduled to

minimize the stalls

• Unrolling increases the BB; as a result, a better choice can be made for scheduling

• A useful technique with two key

requirements:

Understanding how an instruction depends on

another

Understanding how to change or reorder the

instructions, given the dependencies

• In what follows we concentrate on dependencies

Page 48: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Again, let (Ik), 1 ≤ k ≤ IC(p), be the ordered series of instructions executed during the run of program p

• Given two instructions, Ii and Ij, with i < j, we say that Ij is dependent on Ii (Ii → Ij) iff

R(Ii) ∩ D(Ij) ≠ ∅

(R is the range and D the domain of a given instruction: Ii produces a result which is consumed by Ij)

or

∃ n ∈ {1, …, IC(p)} and ∃ k1 < k2 < … < kn such that Ii → Ik1 → Ik2 → … → Ikn → Ij

Page 49: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• (Ii, Ik1, Ik2, …, Ikn, Ij) is called a dependency (transitive) chain

• Note that a dependency chain can be as

long as the entire execution of p

• A hazard implies dependency

• Dependency does not imply a hazard!

• Scheduling tries to place dependent

instructions in places where no hazard can

occur

Page 50: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• For instance:

SUBI R1, R1, #8

BNEZ R1, Loop

• This is clearly a dependence, but it does

not result in a hazard

Forwarding eliminates the hazard

• Another example:

LD F0, 0(R1)

ADDD F4, F0, F2

• This is a data dependency which does

lead to a hazard and a stall

Page 51: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Dealing with data dependencies

• Two classes of methods:

1. Keeping the dependence though avoiding

the hazard (via scheduling)

2. Eliminating a dependence by

transforming the code

Page 52: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Class 2 implies more work

• These are optimization methods used by

the compilers

• Detecting dependencies when only using

registers is easy; the difficulties come

from detecting dependencies in memory:

• For instance 100(R4) and 20(R6) may

point to the same memory location

• Also the opposite situation may take

place:

LD R2, 20(R4)

ADD R3, R1, 20(R4)

• If R4 changes in between, there is no dependency

Page 53: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Ii → Ij means that Ii produces a result that is consumed by Ij

• When there is no such production, e.g., Ii and Ij are both loads or stores, we call this a name dependency

• Two types of name dependencies:

Antidependence

Corresponds to WAR hazards

Ii reads x; Ij writes x (reordering implies an error)

Output dependence

Corresponds to WAW hazards

Ii writes x; Ij writes x (reordering implies an error)

• No value is transferred between the

instructions

• Register renaming solves the problem

Page 54: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Register renaming: if the register name is

changed, the conflict disappears

• This technique can be either static (and

done by the compiler) or dynamic (done

by the HW)

• Let us consider again the following loop:

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

• Let us perform unrolling w/o renaming:

Page 55: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F0, -8(R1)

ADDD F4, F0, F2

SD -8(R1), F4

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

The yellow arrows are name dependencies. To solve them, we perform renaming

Page 56: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

Page 57: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

Page 58: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

The yellow arrows are data dependencies. To solve them, we reorder the instructions

Page 59: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• A third class of dependencies is the one

of control dependencies

• Examples:

if (p1) s1;
if (p2) s2;

then

p1 →c s1 (s1 is control dependent on p1)
p2 →c s2 (s2 is control dependent on p2)

• Clearly ¬(p1 →c s2), that is, s2 is not control dependent on p1

Page 60: Advanced Computer Architectures – Part 2.3


Loop unrolling: dependencies

• Two properties are critical to control

dependency:

Exception behaviour

Data flow

• Exception behaviour: suppose we have

the following excerpt:

BEQZ R2, L1

DIVI R1, 8(R2)

L1: …

• We may be able to move the DIVI to

before the BEQZ without violating the

sequential semantics of the program

• Suppose the branch is taken. Normally

one would simply need to undo the DIVI

• What if DIVI triggers a DIVBYZERO

exception?



• Data flow must be preserved

• Let us consider the following excerpt:

ADD R1, R2, R3
BEQZ R4, L
SUB R1, R5, R6
L: OR R7, R1, R8

• The value of R1 depends on the control flow

• The OR depends on both the ADD and the SUB

• It also depends on the nature of the branch

• R1 = (taken) ? ADD result : SUB result


Loop Level Parallelism

• Let us consider the following loop:

for (I=1; I<=100; I++) {
  A[I+1] = A[I] + C[I]; /* S1 */
  B[I+1] = B[I] + A[I+1]; /* S2 */
}

• S1 is a loop-carried dependence (LCD): iteration I+1 depends on iteration I: A’ = f(A)

• S2 is B’ = f(B, A’)

• If a loop has only non-LCD’s, then it is possible to execute more than one loop iteration in parallel – as long as the dependencies within each iteration are not violated


• What to do in the presence of LCD’s?

• Loop transformations. Example:

for (I=1; I<=100; I++) {
  A[I+1] = A[I] + B[I]; /* S1 */
  B[I+1] = C[I] + D[I]; /* S2 */
}

• A’ = f(A, B)
  B’ = f(C, D)

• Note: no dependencies except LCD’s
  The instructions can be swapped!


• Note: the flow of the computations (A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, C2 D2, …) can be changed without violating the dependencies


for (i=1; i <= 100; i=i+1) {
  A[i] = A[i] + B[i]; /* S1 */
  B[i+1] = C[i] + D[i]; /* S2 */
}

becomes

A[1] = A[1] + B[1];
for (i=1; i <= 99; i=i+1) {
  B[i+1] = C[i] + D[i];
  A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
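As a quick check that the transformation preserves the results, the two versions can be run side by side (a Python sketch; the initial array contents are arbitrary, and 1-based indexing follows the slides):

```python
# Compare the original loop (with a loop-carried dependence) against the
# transformed loop (dependence made intra-iteration), element by element.

def original(A, B, C, D):
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2: feeds S1 of the NEXT iteration
    return A, B

def transformed(A, B, C, D):
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]   # dependence is now within the iteration
    B[101] = C[100] + D[100]
    return A, B

A = [0.5 * i for i in range(102)]
B = [0.25 * i for i in range(102)]
C = [2.0 * i for i in range(102)]
D = [3.0 - i for i in range(102)]

assert original(A, B, C, D) == transformed(A, B, C, D)
```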


• A’ = f(A, B)      becomes      B’ = f(C, D)
  B’ = f(C, D)                   A’ = f(A’, B’)

• Now we have dependencies but no more LCD’s!
  It is possible to execute more than one loop iteration in parallel – as long as the dependencies within each iteration are not violated


Dependency avoidance

1. “Batch” approaches: at compile time, the

compiler schedules the instructions in

order to minimize the dependencies

(static scheduling)

2. “Interactive” approaches: at run-time, the

HW rearranges the instructions in order

to minimize the stalls (dynamic

scheduling)

• Advantages of 2:

Only approach when dependencies are only

known at run-time (pointers etc.)

The compiler can be simpler

Given an executable compiled for a machine

with machine-level X and pipeline organization

Y, it can run efficiently on another machine

with the same machine level but a different

pipeline organization Z


Dynamic Scheduling

• Static scheduling: compiler techniques for scheduling (rearranging) the instructions
  so as to separate dependent instructions
  and hence minimize unsolvable hazards causing unavoidable stalls

• Dynamic scheduling: HW-based, run-time techniques

• A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present

• The two techniques can both be used


Dynamic Scheduling: General Idea

• If an instruction is stalled in the pipeline,

no later instruction can proceed

• A dependence between two instructions

close to each other causes a stall

• A stall means that, even though there

may be idle functional units that could

potentially serve other instructions, those

units have to stay idle

• Example:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

• ADDD depends on DIVD; but SUBD does

not. Despite this, it is not issued!


• So SUBD is not issued even though there might be a functional unit ready to perform the requested operation

• Big performance limitation!

• What are the reasons that lead to this

problem?

• In-order instruction issuing and

execution: instructions issue and execute

one at a time, one after the other


• Example: in DLX, the issue of an

instruction occurs at ID (instruction

decode)

• In DLX, ID checks for absence of

structural hazards and waits for the

absence of data hazards

• These two steps may be made distinct


• The issue process gets divided into two parts:

1. Checking the presence of structural hazards

2. Waiting for the absence of a data hazard

• Instructions are issued in order, but they execute and complete as soon as their data operands are available

• Data flow approach


• The ID pipeline stage is divided into two

sub-stages:

• ID.1 (Issue) : decode the instruction,

check for structural hazards

• ID.2 (read operands) : wait until no data

hazards, then read operands


• In the DLX floating point pipeline, the EX

stage of instructions may take multiple

cycles

• For each issued instruction I, depending on the resolution of structural and data hazards, I may be waiting for resources or data, or in execution, or completed

• More than a single instruction can be in

execution at the same time


Scoreboarding

• Scoreboard (CDC 6600, 1964): a technique to allow instructions to execute out of order when there are sufficient resources and no data dependencies

• Goal: execution rate of 1 instruction per clock cycle in the absence of structural hazards

• Large set of FUs:

  4 FPUs

  5 units for memory references

  7 integer FUs

  Highly redundant (parallel) system

• Four steps replace the ID, EX, WB stages


• Issue: IF (a FU is available && no active instruction has the same destination reg) { issue I to the FU; update the scoreboard state; }

  Avoids WAWs

• Read operands: as soon as (the two source operands are available in the registers) { read operands; manage RAW stalls; }

• Execute: for each FU, as soon as (operands are available) { start EX; at end of EX, alert the scoreboard; }

• Write result: when at WB { wait for (no WAR hazards); store output to destination reg; }

  Avoids WARs
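The issue step can be sketched as a toy check (a sketch, not the CDC 6600 logic; the FU names, counts and instruction format are made up for illustration):

```python
# Toy scoreboard issue check: an instruction issues only if a functional
# unit of the right kind is free AND no in-flight instruction writes the
# same destination register (this is what blocks WAW hazards at issue).

free_units = {"fp_add": 1, "fp_mul": 1, "int": 2}
active_dests = set()            # destination regs of in-flight instructions

def try_issue(unit, dest):
    if dest in active_dests:
        return False            # would be a WAW hazard: stall at issue
    if free_units.get(unit, 0) == 0:
        return False            # structural hazard: no free FU
    free_units[unit] -= 1
    active_dests.add(dest)
    return True

assert try_issue("fp_add", "F4") is True
assert try_issue("fp_add", "F8") is False   # no fp_add unit left
assert try_issue("fp_mul", "F4") is False   # WAW on F4
assert try_issue("fp_mul", "F8") is True
```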


• In eliminating stalls, a scoreboard is

limited by several factors:

Amount of parallelism available among the

instructions

(in the presence of many dependencies there’s

not much that one can do…)

Number of scoreboard entries

(How far ahead the pipeline can look for

independent instructions)

Number and types of FUs

Number of WAR’s and WAW’s


• The effectiveness of the scoreboard

heavily depends on the register file

• All operands are read from registers, all

outputs go to destination registers

The availability of registers influences the capability to eliminate stalls


Tomasulo’s approach

• Tomasulo’s approach (IBM 360/91, 1967): an improvement over scoreboarding, for machine architectures that allow only a limited number of registers

• Based on virtual registers

• The IBM 360/91 had two key design goals:

  To be faster than its predecessors

  To be machine-level compatible with its predecessors

• Problem: the 360 family had only 4 FP registers

• Tomasulo combined the key ideas of scoreboarding with register renaming


• IBM 360/91 FUs:

3 ADDD/SUBD, 2 MULD, 6 LD, 6 SD

• Key element: the reservation station (RS):

a buffer which holds the operands of the

instructions waiting to issue

• Key concept:

  An RS fetches and buffers an operand as soon as it is available, eliminating the need to get that operand from a register

  Instead of tracking source and destination registers, we track source and destination RS’s

  (diagram: an operation OP whose operand fields name reservation stations RSa, RSb, RSc)


• A reservation station represents either:

  A static value, read from a register

  A “live” value (a future value) that will be produced by another RS and FU

• Hazard detection and execution control are not centralised into a scoreboard

• They are distributed in each RS, which, independently:

  Controls the FU attached to it,

  And starts that FU the moment the operands become available


• The operands go to the FUs through the (wide set of) RS’s, not through the (small) register file

• This is managed through a broadcast that makes use of a common result-or-data bus (the CDB)

• All units waiting for an operand can load it at the same time

  (diagram: a result broadcast by one RS is picked up simultaneously by all the RS’s waiting for it)


• The execution is driven by a graph of dependencies

  (diagram: RS’s feeding SUBD and MULTD operations, whose results feed further SUBD operations)

• A “live data structure” approach (similar to LINDA): a tuple is made available in the future, when a thread will have finished producing it


Major Advantages of Tomasulo’s

• Distributed approach: the RS’s

independently control the FU’s

• Distributed hazard detection logic

• The CDB broadcasts results -> all pending

instructions depending on that result are

unblocked simultaneously

The CDB, being a bus, reaches many

destinations in a single clock cycle

If the waiting instructions get their missing

operand in that clock cycle, they can all begin

execution on the next clock cycle

• WAR and WAW are eliminated by

renaming registers using the RS’s
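That last point can be illustrated with a toy renaming pass (a sketch, not the 360/91 hardware; the instruction sequence and the "rsN" tag names are made up). Every write gets a fresh tag, so only true (RAW) dependencies survive:

```python
# Rename destinations to fresh "reservation station" tags; sources read
# the newest producer of each register.  WAW and WAR hazards disappear
# because no two instructions ever write the same tag.

def rename(instrs):
    """instrs: list of (dest, src1, src2) register names."""
    latest = {}                  # architectural register -> current tag
    out = []
    for n, (dst, s1, s2) in enumerate(instrs):
        t1 = latest.get(s1, s1)  # read from the newest producer, or the reg
        t2 = latest.get(s2, s2)
        tag = f"rs{n}"           # fresh tag for this instruction's result
        latest[dst] = tag
        out.append((tag, t1, t2))
    return out

prog = [
    ("F0", "F2", "F4"),    # I0
    ("F6", "F0", "F8"),    # I1: RAW on F0
    ("F8", "F10", "F14"),  # I2: WAR on F8 with I1
    ("F6", "F10", "F8"),   # I3: WAW on F6 with I1, RAW on F8 with I2
]
renamed = rename(prog)
assert renamed == [
    ("rs0", "F2", "F4"),
    ("rs1", "rs0", "F8"),
    ("rs2", "F10", "F14"),
    ("rs3", "F10", "rs2"),
]
# all destination tags are distinct: the WAW and WAR hazards are gone
assert len({d for d, _, _ in renamed}) == len(renamed)
```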


Reducing branch penalties

• Static Approaches

Dynamic Approaches


Reducing branch penalties:

Dynamic Branch Prediction

• A branch history table

Address      Branch  Nature
0xA0B2DF37   BNEZ    …
0xA0B2F02A   BEQ     …
0xA0B30504   BNEZ    …
0xA0B30537   BGT     …

(diagram: the low byte of each branch address – 0x37, 0x2A, 0x04, … – indexes a table whose entries record “taken” or “untaken”; note that 0xA0B2DF37 and 0xA0B30537 share entry 0x37)


Dynamic Branch Prediction

Branch History Table Algorithm

/* before the branch is evaluated */
If (current instruction is a branch) {
  entry = PC & 0x000000FF;
  predict branch as ( BHT[entry] );
}

/* after the branch */
If (branch was mispredicted)
  BHT[entry] = 1 - BHT[entry];
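The pseudocode above maps directly onto a few lines of Python (a sketch; the 256-entry table mirrors the slide’s low-byte indexing):

```python
# One-bit branch history table: indexed by the low byte of the branch PC,
# and flipped on every misprediction.

BHT = [0] * 256          # 0 = predict untaken, 1 = predict taken

def predict(pc):
    return BHT[pc & 0xFF]

def update(pc, taken):
    entry = pc & 0xFF
    mispredicted = (BHT[entry] != int(taken))
    if mispredicted:
        BHT[entry] = 1 - BHT[entry]
    return mispredicted

# two different branches may share entry 0x37 (the indexing is not
# one-to-one, as noted on the next slide):
assert (0xA0B2DF37 & 0xFF) == (0xA0B30537 & 0xFF) == 0x37
```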


• Just one bit is enough for coding the Boolean value “taken” vs. “untaken”

• Note: the function associating addresses to entries in the BHT is not guaranteed to be a bijection (one-to-one relationship):

• The algorithm records the most recent behaviour of one or more branches

  For instance, entry 0x37 corresponds to two branches

• Despite this, the scheme works well…

• …though in some cases, the performance of the scheme is not that satisfactory:


Dynamic Branch Prediction

Branch History Table Accuracy

• for (i=0; i<BIGN; i++)
    for (j=0; j<9; j++)
      { do_stg(); }

• The loop branch is

  taken nine times in a row

  then not taken once

• Taken 90%, Untaken 10%

• What is the prediction accuracy?


Actual    Prediction  Result
Taken     Untaken     mispredicted (entry → 1)
Taken     Taken       correct
…         (8 correct predictions in a row)
Untaken   Taken       mispredicted (entry → 0)
(the pattern repeats on every execution of the loop)

Per 10 branches: 8 successful predictions, 2 mispredictions

S.S. prediction accuracy is just 80%!


• Loop branches (taken n-1 times in a row, untaken once)

• Performance of this dynamic branch predictor (based on a single-bit prediction entry):

  Misprediction rate: 2 × 1/n

  Twice the rate of untaken branches
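The 2/n figure can be checked with a short simulation (a sketch; repeating the loop many times approximates the steady state):

```python
# Steady-state behaviour of a 1-bit predictor on a loop branch that is
# taken n-1 times in a row and then untaken once.

def one_bit_mispredictions(n, runs=1000):
    bit = 0                       # 0 = predict untaken, 1 = predict taken
    wrong = total = 0
    for _ in range(runs):
        outcomes = [1] * (n - 1) + [0]   # taken n-1 times, untaken once
        for taken in outcomes:
            if bit != taken:
                wrong += 1
                bit = taken       # 1-bit rule: flip on a misprediction
            total += 1
    return wrong / total

# both the first and the last branch of every loop execution are
# mispredicted, so the rate is 2/n (80% accuracy for n = 10):
assert abs(one_bit_mispredictions(10) - 2 / 10) < 1e-3
```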


Dynamic Branch Prediction

Two-bit Prediction Scheme

• Use a two bit field as a “branch behaviour

recorder”

• Allow a state to change only when two

mispredictions in a row occur:

(state diagram: two “predict taken” states and two “predict not taken” states; each taken branch moves one state towards “predict taken”, each not-taken branch one state towards “predict not taken”, so the prediction flips only after two mispredictions in a row)
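The state machine can be written as a 2-bit saturating counter (a common formulation of the scheme; a sketch):

```python
# Two-bit saturating counter: values 0..3; 2 and 3 mean "predict taken".
# The prediction changes only after two mispredictions in a row.

def predict(c):
    return c >= 2                 # True = predict taken

def update(c, taken):
    if taken:
        return min(c + 1, 3)      # saturate at "strongly taken"
    return max(c - 1, 0)          # saturate at "strongly not taken"

# starting from "strongly taken" (3), a single untaken branch does not
# change the prediction:
c = 3
c = update(c, False)
assert predict(c) is True         # still predicts taken after one miss
c = update(c, False)
assert predict(c) is False        # flips only after the second miss
```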


Actual    Prediction  Result
Taken     Untaken     mispredicted (transient)
Taken     Untaken     mispredicted (transient)
Taken     Taken       correct
…
Untaken   Taken       mispredicted
Taken     Taken       correct
…

2 mispredictions at first; then, in STEADY STATE, 9 successful predictions per loop execution

S.S. prediction accuracy is now 90%


(chart: frequency of mispredictions with programs from SPEC89 – nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li – using a 2-bit prediction buffer of 4096 entries; misprediction rates range from 0% to 18%)


Dynamic Branch Prediction

General Scheme

• In the general case, one could use an n-bit branch behaviour recorder and a branch history table of 2^m entries

• In this case:

  A change of prediction occurs only after 2^(n-1) mispredictions

  There is a higher chance that not too many branch addresses are associated with the same BHT entry

  Larger memory penalty


D.B.P.: Comparing the 2-bit with the General Case

(figure comparing the accuracy of the 2-bit scheme with the general case)


Dynamic Branch Prediction

Schemes

• One-bit prediction buffer

Good, but with limited accuracy

• Two-bit prediction buffer

Very good, greater accuracy, slightly higher

overhead

• Infinite-bit prediction buffer

As good as the two-bit one, but with a very

large overhead

• Correlating predictors


Dynamic Branch Prediction

Correlated predictors

• Two-level predictors

• If the behaviour of a branch is correlated

to the behaviour of another branch,

no single-level predictor would be able to

capture its behaviour

• Example:

if (aa == 2)

aa = 0;

if (bb == 2)

bb = 0;

if (aa != bb) {

• If we keep track of the recent behaviour

of other previous branches, our accuracy

may increase


• A simpler example:

if (d == 0) d = 1;

if (d == 1) …

• In DLX, this is

BNEZ R1, L1 ; b1 ( d != 0 )

MOV R1, #1

L1: SUBI R3, R1, #1

BNEZ R3, L2 ; b2 ( d != 1)

. . .

L2: . . .



• Let us assume that d is 0, 1 or 2

Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
0                    Yes     Untaken   1                      Yes     Untaken
1                    No      Taken     1                      Yes     Untaken
2                    No      Taken     2                      No      Taken


• This means that

  (b1 == untaken) ⇒ (b2 == untaken)

• A one-bit predictor may not be able to capture this property, and may behave very badly


• Let us suppose that d alternates between 2 and 0

• This is the table for the one-bit predictor:

d   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT

• ALL branches are mispredicted!
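The total failure is easy to reproduce (a sketch; b1 and b2 are modelled directly from the C fragment, with d alternating 2, 0, 2, 0, …):

```python
# Two per-branch 1-bit predictors on the alternating-d example:
# b1 tests (d != 0); the code then sets d = 1 when d was 0; b2 tests (d != 1).

def branches(d):
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    return b1, b2

pred = {"b1": False, "b2": False}   # start predicting "untaken"
wrong = total = 0
for d in [2, 0] * 50:
    for name, taken in zip(("b1", "b2"), branches(d)):
        if pred[name] != taken:
            wrong += 1
            pred[name] = taken      # 1-bit rule: flip on a misprediction
        total += 1

assert wrong == total               # every single branch is mispredicted
```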


• Correlated predictor: example

• Every branch, say branch number j>1, has two separate prediction bits

  First bit: predictor used if branch j-1 was NT

  Second bit: otherwise

• At the end of branch j:

  If (branch was mispredicted)
    BHT [ Behaviour_j_min_1 ] [ entry ] = 1 - BHT [ Behaviour_j_min_1 ] [ entry ]

• At the end of branch j-1:

  Behaviour_j_min_1 = (taken?) 1 : 0;

• At the beginning of branch j:

  predict branch as ( BHT [ Behaviour_j_min_1 ] [ entry ] );
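Running this (1,1) scheme on the same alternating-d pattern shows the correlation being captured (a sketch; each branch keeps two 1-bit entries, selected by the previous branch’s behaviour):

```python
# (1,1) correlating predictor on the d = 2, 0, 2, 0, ... example.

def branches(d):
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    return b1, b2

bht = {"b1": [False, False], "b2": [False, False]}  # [if prev NT, if prev T]
last = 0                     # behaviour of the previous branch (0 = NT)
results = []
for d in [2, 0] * 50:
    for name, taken in zip(("b1", "b2"), branches(d)):
        results.append(bht[name][last] == taken)
        if bht[name][last] != taken:
            bht[name][last] = taken     # flip only the selected entry
        last = int(taken)

# after a two-prediction warm-up every prediction is correct --
# unlike the plain 1-bit predictor, which got all of them wrong
assert not any(results[:2]) and all(results[2:])
```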


• The behaviour of a branch

selects a one-bit branch predictor

• If the prediction is not OK, its state is

flipped


• We may also consider the last TWO branches

  The behaviour of these two branches selects, e.g., a one-bit predictor:

  (NT NT, NT T, T NT, T T) → (0–3) → BHT[0..3]

  This is called a (2,1) predictor

  Or, the behaviour of the last two branches selects an n-bit predictor

  This is a (2,n) predictor


A (2,2) predictor: A 2-bit branch history entry selects

a 2-bit predictor


• General case: (m, n) predictors

  Consider the last m branches and their 2^m possible values

  This m-tuple selects an n-bit predictor

  A change in the prediction only occurs after 2^(n-1) mispredictions
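The general case can be sketched in a few lines (a sketch; the table size and the PC-modulo indexing are illustrative choices, not prescribed by the slides):

```python
# Generic (m, n) predictor: the global history of the last m branch
# outcomes selects one n-bit saturating counter per BHT entry.

class MNPredictor:
    def __init__(self, m, n, entries=256):
        self.m, self.n = m, n
        self.history = 0                      # last m outcomes, as m bits
        self.max = (1 << n) - 1               # n-bit counter saturates here
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        c = self.table[pc % len(self.table)][self.history]
        return c >= (1 << (self.n - 1))       # upper half: predict taken

    def update(self, pc, taken):
        row = self.table[pc % len(self.table)]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max) if taken else max(c - 1, 0)
        # shift the new outcome into the m-bit global history
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

# a (2, 2) predictor: a 2-bit history selects one of four 2-bit counters
p = MNPredictor(2, 2)
assert p.predict(0x40) is False               # counters start at "not taken"
```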


Dynamic Branch Prediction

Branch-Target Buffer

• A run-time technique to reduce the

branch penalty

• In DLX, it is possible to “predict” the new

PC, via a branch prediction buffer, during

the second stage of the pipeline

• With a Branch-Target Buffer (BTB), the

new PC can be derived during the first

stage of the pipeline


• The BTB is a branch-prediction cache that stores the addresses of taken branches

• An associative array which works as follows:

  (instruction address) → (branch target address)

• In case of a hit, we know the predicted instruction address one cycle earlier w.r.t. the branch-prediction buffer

• Fetching begins immediately at the predicted PC


• Design issues:

The entire address must be used

(correspondence must be one-to-one)

Limited number of entries in the BTB

Most frequently used

BTB requires a number of actions to be

executed during the first pipeline stage, also in

order to update the state of the buffer

The pipeline management gets more complex and

the clock cycle duration may have to be

increased


• Total branch penalty for a BTB

• Assumptions: penalties are as follows

Instruction is in buffer   Prediction   Actual branch   Penalty cycles
Yes                        Taken        Taken           0
Yes                        Taken        Untaken         2
No                         *            Taken           2

• Prediction accuracy: 90%

• Hit rate in buffer: 90%

• Taken-branch frequency: 60%


• Branch penalty =

  (buffer hit rate) × (incorrect prediction rate) × penalty
  + (1 - buffer hit rate) × (taken-branch frequency) × penalty

• With the figures above:

  90% × 10% × 2 + 10% × 60% × 2 = 0.18 + 0.12 = 0.30 clock cycles (vs. 0.50 for delayed branches)
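Spelled out with the slide’s numbers:

```python
# BTB branch-penalty computation, using the assumptions on the slide.
hit_rate = 0.90          # branch found in the BTB
accuracy = 0.90          # prediction accuracy, so 10% incorrect
taken = 0.60             # taken-branch frequency
penalty = 2              # cycles lost on a mispredict, or a miss + taken

branch_penalty = (hit_rate * (1 - accuracy) * penalty
                  + (1 - hit_rate) * taken * penalty)
assert abs(branch_penalty - 0.30) < 1e-9    # 0.18 + 0.12 cycles
```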


• The same approach can be applied to procedure return addresses

• Example:

0x4ABC CALL 0x30A0
0x4AC0 …
0x4CF4 CALL 0x30A0
0x4CF8 …

  entry 0x30A0 → stack { 0x4CF8, 0x4AC0 }

• Associative arrays of stacks

• If the cache is large enough, all return addresses are predicted correctly


Parallelism

• Introduction to parallel processing

• Instruction level parallelism

Introduction

VLIW

Advanced pipelining techniques

Superscalar


Superscalar architectures

• So far, the goal was reaching the ideal CPI = 1

• Further increasing performance by having CPI < 1 is the goal of superscalar processors (SP)

• To reach this goal, SPs issue multiple instructions in the same clock cycle

• Multiple-issue processors:

  VLIW (seen already)

  SP

    Statically scheduled (compiler)

    Dynamically scheduled (HW; scoreboarding/Tomasulo)

• In SPs, a varying # of instructions is issued, depending on structural limits and dependencies


• Superscalar version of DLX

• At most two instructions per clock cycle

can be issued

1. One of: load, store (integer or FP), branch,

integer ALU operation

2. A FP ALU operation

• IF and ID operate on 64 bits of

instructions

• Multiple independent FPUs are available


• The superscalar DLX is indeed a sort of “bidimensional pipeline”:

Integer instr.   IF ID EX MEM WB
FP instr.        IF ID EX MEM WB
Integer instr.      IF ID EX MEM WB
FP instr.           IF ID EX MEM WB
Integer instr.         IF ID EX MEM WB
FP instr.              IF ID EX MEM WB
Integer instr.            IF ID EX MEM WB
FP instr.                 IF ID EX MEM WB


• Every new solution breeds new problems…

• Latencies!

• When the latency of the load is 1:

  In the “monodimensional” pipeline, one cannot use the result of the load in the current and next cycle:

  P:   LD   NOP   LDc

  In the bidimensional pipeline of the SP, this means a loss of three cycles:

  Pi:  LD   NOP   LDc
  Pfp: NOP  NOP   LDc’


• Let us consider again the following loop:

Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
BNEZ R1, Loop

• Let us perform unrolling (x5) + scheduling on the superscalar DLX:


Superscalar architectures

        Integer instruction     FP instruction       Cycle
  Loop: LD   F0, 0(R1)                                 1
        LD   F6, -8(R1)                                2
        LD   F10, -16(R1)       ADDD F4, F0, F2        3
        LD   F14, -24(R1)       ADDD F8, F6, F2        4
        LD   F18, -32(R1)       ADDD F12, F10, F2      5
        SD   0(R1), F4          ADDD F16, F14, F2      6
        SD   -8(R1), F8         ADDD F20, F18, F2      7
        SD   -16(R1), F12                              8
        SD   -24(R1), F16                              9
        SUBI R1, R1, #40                              10
        BNEZ R1, Loop                                 11
        SD   -32(R1), F20                             12

• 12 clock cycles per 5 iterations = 2.4 cc/iteration
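The arithmetic behind this figure, as a quick check (plain Python, not from the slides):

```python
# Unrolled superscalar schedule: 5 loop iterations complete in 12 clock cycles.
cycles, iterations = 12, 5
cc_per_iter = cycles / iterations
print(cc_per_iter)  # 2.4 clock cycles per original iteration
```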


Superscalar architectures

• Superscalar = 2.4 cc/iteration vs. normal = 3.5 cc/iteration

• But in the example there were not enough FP instructions to keep the
  FP pipeline in use

  From cycle 8 to cycle 12, and in the first two cycles, each cycle
  issues just one instruction

• How to get more?

  Dynamic scheduling for the superscalar processor

  Multicycle extension of the Tomasulo algorithm


Superscalar architectures and the Tomasulo algorithm

• Idea: employ separate data structures for the integer and the FP
  registers

  Integer Reservation Stations (IRS)

  FP Reservation Stations (FRS)

• In the same cycle, issue a FP instruction (to a FRS) and an integer
  instruction (to an IRS)

• Note: issuing does not mean executing!

  Possible dependencies might serialize the two instructions issued in
  parallel

• Dual issue is obtained by pipelining the instruction-issue stage so
  that it runs twice as fast


Superscalar architectures

• Multiple issue strategy’s inherent

limitations:

The amount of ILP may be limited (see loop

p.134)

Extra HW is required

Multiple FPU and IU

More complex (-> slower) design

Extra need for large memory and register-file

bandwith

Increase in code size due to hard loop unrolling

Recall: CPUTIME

(p) = IC(p) CPI(p)

clock rate
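The CPU-time identity can be turned into a small sketch; the function name and the sample numbers below are mine, chosen to show that a lower CPI alone is not decisive:

```python
def cpu_time(ic, cpi, clock_rate_hz):
    """CPUtime(p) = IC(p) * CPI(p) / clock rate."""
    return ic * cpi / clock_rate_hz

# A lower CPI does not guarantee a win if the clock slows down:
fast_clock = cpu_time(ic=1_000_000, cpi=1.0, clock_rate_hz=1e9)    # 1 ms
low_cpi    = cpu_time(ic=1_000_000, cpi=0.5, clock_rate_hz=400e6)  # 1.25 ms
print(fast_clock < low_cpi)  # True
```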


Superscalar architectures: compiler support

• Symbolic loop unrolling

  The loop is not physically unrolled, but reorganized so as to
  eliminate dependencies

• Software pipelining:

  Dependencies are eliminated by interleaving instructions from
  different iterations of the loop

  The loop is not unrolled

  Before:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

  After software pipelining:

    <startup>
    Loop: SD   0(R1), F4        ; store for iteration i
          ADDD F4, F0, F2       ; add for iteration i+1
          LD   F0, -16(R1)      ; load for iteration i+2
          SUBI R1, R1, #8
          BNEZ R1, Loop
    <clean-up>

  RAW dependences: problematic    WAR dependences: HW removable
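A minimal Python sketch of the same reorganisation, applied to an array update `a[i] = a[i] + c` iterated downward as in the DLX code. The function names and the three-deep interleaving are my illustration, not the slides' code:

```python
def plain_loop(a, c):
    # Each iteration loads, adds and stores the SAME element i
    # (LD / ADDD / SD / SUBI / BNEZ), iterating downward.
    for i in range(len(a) - 1, -1, -1):
        a[i] = a[i] + c

def software_pipelined(a, c):
    # Interleave instructions from three different iterations:
    # store for iteration i+2, add for iteration i+1, load for i.
    n = len(a)
    if n < 3:
        plain_loop(a, c)
        return
    # <startup>: prime the "pipeline" with the first load and add
    loaded = a[n - 1]           # LD   for iteration n-1
    summed = loaded + c         # ADDD for iteration n-1
    loaded = a[n - 2]           # LD   for iteration n-2
    for i in range(n - 3, -1, -1):
        a[i + 2] = summed       # SD:   store for iteration i+2
        summed = loaded + c     # ADDD: add for iteration i+1
        loaded = a[i]           # LD:   load for iteration i
    # <clean-up>: drain the pipeline
    a[1] = summed
    a[0] = loaded + c
```

Both functions produce identical arrays; the pipelined version simply carries values for two in-flight iterations across the loop body, which is exactly what breaks the load-use dependency chain inside one iteration.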


Superscalar architectures: compiler support

• Trace scheduling

• Aim: tackling the problem of too-short basic blocks

• Method:

  Trace selection

  Trace compaction


Superscalar architectures: compiler support

• Trace selection:

  A number of contiguous basic blocks are put together into a "trace"

  Using static branch prediction, the conditional branches are chosen
  as taken/untaken, while loop branches are considered as taken

  [Diagram: basic blocks A, B, C joined into a trace; the test exit to
  off-trace block X leaves the trace, with book-keeping code at the
  exit]


Superscalar architectures: compiler support

• Trace compaction:

  The resulting trace is a longer straight-line sequence of code

  Trace compaction = global code scheduling

  [Diagram: blocks A, B, C scheduled together as one basic block whose
  size is that of A + B + C, with book-keeping at the off-trace exits]

• Speculative movement of code


Superscalar architectures: HW support

• Conditional instructions: instructions like

    CMOVZ R2, R3, R1

  which means

    if (R1 == 0) R2 = R3;

  or, equivalently,

    (R1 == 0) ? R2 = R3 : /* NOP */;

• The instruction turns into a NOP if the condition is not met

  This also means that no exceptions are raised!

• Using conditional instructions we convert a control dependence (due
  to a branch) into a data dependence

• Speculative transformation in a two-issue superscalar with
  conditional instructions:


Superscalar architectures: HW support: conditional instructions

  Integer instruction       FP instruction     Cycle
  LW   R1, 40(R2)           ADDD R3,R4,R5        1
                            ADDD R6,R3,R7        2
  BEQZ R10, L                                    3
  LW   R8, 20(R10)                               4
  LW   R9, 0(R8)                                 5

  With a conditional load:

  Integer instruction       FP instruction     Cycle
  LW   R1, 40(R2)           ADDD R3,R4,R5        1
  LWC  R8, 20(R10), R10     ADDD R6,R3,R7        2
  BEQZ R10, L                                    3
  LW   R9, 0(R8)                                 4

We speculate on the outcome of the branch. If the condition is not met,
we don't slow down the execution, because we used a slot that would
otherwise have been lost
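The semantics of a conditional move such as CMOVZ can be mimicked in Python; this is only a sketch, with registers modelled as plain values:

```python
def cmovz(r2, r3, r1):
    # CMOVZ R2, R3, R1 : copy R3 into R2 only when R1 == 0;
    # otherwise the instruction behaves as a NOP and R2 keeps its value.
    # The branch has become a data dependence on r1.
    return r3 if r1 == 0 else r2

print(cmovz(7, 42, 0))  # 42 (move performed)
print(cmovz(7, 42, 5))  # 7  (acts as a NOP)
```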


Superscalar architectures: HW support: conditional instructions

• Conditional instructions are useful to implement short alternative
  control flows

• Their usefulness, though, is limited by several factors:

  Conditional instructions that are annulled still take execution
  time, unless they are scheduled into otherwise-wasted slots

  They are good only in limited cases, when there is a simple
  alternative sequence

  Moving an instruction across multiple branches would require
  doubly-conditional instructions!

    LWCC R1, R2, R10, R12   (makes no sense)

  They do extra work w.r.t. their "regular" versions

  The extra time required for the test may cost more cycles than the
  regular versions take


Superscalar architectures: HW support: conditional instructions

• Most architectures support a few conditional instructions
  (conditional move)

• The HP PA architecture allows any register-register instruction to
  turn the next instruction into a NOP, which makes that next
  instruction conditional

• Exceptions


Superscalar architectures: HW support: conditional instructions

• Exceptions:

  Fatal (normally causing termination; e.g., memory protection
  violation)

  Resumable (causing a delay, but no termination; e.g., page fault)

• Resumable exceptions can be processed for speculative instructions
  just as if they were normal instructions

  The corresponding time penalty is not considered incorrect behaviour

• Fatal exceptions cannot be handled by speculative instructions,
  hence they must be deferred to the next non-speculative instruction


Superscalar architectures: HW support: conditional instructions

• Moving instructions across a branch must not affect

  The (fatal) exception behaviour

  The data dependences

• How to obtain this?

  1. All the exceptions triggered by speculative instructions are
     ignored by HW and OS

     The HW and OS do handle all exceptions, but return an undefined
     value for any fatal exception. The program is allowed to
     continue, though this will almost certainly lead to incorrect
     results

     Note: scheme 1 can never cause a correct program to fail,
     regardless of whether speculation was used or not


Superscalar architectures: HW support: conditional instructions

  2. Poison bits: a speculative instruction does not trigger any
     exception, but turns a bit on in the involved result registers.
     The next "normal" (non-speculative) instruction using those
     registers will be "poisoned" -> it will cause an exception

  3. Boosting: renaming and buffering in the HW (similar to the
     Tomasulo approach)

• Speculation can be used, e.g., to optimize an if-then-else such as

    if (a == 0) a = b; else a = a + 4;

  or, equivalently,

    a = (a == 0) ? b : a + 4;


Superscalar architectures: HW support: conditional instructions

• Suppose A is in 0(R3) and B in 0(R2)

• Example:

        LW   R1, 0(R3)    ; load A
        BNEZ R1, L1       ; A != 0 ? GOTO L1
        LW   R1, 0(R2)    ; load B
        J    L2           ; skip ELSE
    L1: ADD  R1, R1, 4    ; ELSE part
    L2: SW   0(R3), R1    ; store A

• Speculation:

        LW   R1, 0(R3)    ; load A
        LW   R9, 0(R2)    ; load speculatively B
        BNEZ R1, L3
        ADD  R9, R1, 4    ; here R9 is A+4
    L3: SW   0(R3), R9    ; here R9 is A+4 or B

• In this case, a temporary register is used

• Method 1: speculation is transparent
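A Python rendering of the two sequences above, with memory as a dict and registers as locals (function names are mine); it shows that the speculative version computes the same result whichever way the branch goes:

```python
def update_a(mem, addr_a, addr_b):
    # Non-speculative version: the branch decides which value is stored.
    r1 = mem[addr_a]           # LW   R1, 0(R3) ; load A
    if r1 != 0:                # BNEZ R1, L1
        r1 = r1 + 4            # ADD  R1, R1, 4 ; ELSE part
    else:
        r1 = mem[addr_b]       # LW   R1, 0(R2) ; load B
    mem[addr_a] = r1           # SW   0(R3), R1 ; store A

def update_a_speculative(mem, addr_a, addr_b):
    # Speculative version: B is loaded into the temporary "register" R9
    # before the branch outcome is known; when the speculation turns out
    # wrong (A != 0), the ADD simply overwrites R9.
    r1 = mem[addr_a]           # LW   R1, 0(R3)
    r9 = mem[addr_b]           # LW   R9, 0(R2) ; speculative load of B
    if r1 != 0:                # BNEZ R1, L3
        r9 = r1 + 4            # ADD  R9, R1, 4 ; here R9 is A+4
    mem[addr_a] = r9           # L3: SW 0(R3), R9 ; A+4 or B
```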


Superscalar architectures: HW support: conditional instructions

• Method 2 applied to the previous code fragment:

        LW   R1, 0(R3)    ; load A
        LW*  R9, 0(R2)    ; load speculatively B
        BNEZ R1, L3
        ADD  R9, R1, 4    ; here R9 is A+4
    L3: SW   0(R3), R9    ; here R9 is A+4 or B

• LW* is a speculative version of LW

• LW* is an opcode that turns on the poison bit of register R9

• The next non-speculative instruction using R9 will be "poisoned": it
  will cause an exception

• If another speculative instruction uses R9, the poison bit is
  inherited
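A rough sketch of the poison-bit scheme as a register-file model. The class and method names are hypothetical, and this toy only poisons a register when a speculative write actually faulted; a fuller model would also propagate poison to the destinations of speculative readers:

```python
class PoisonedRead(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""

class RegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = set()

    def write(self, reg, value, faulted=False, speculative=False):
        # A speculative instruction that would fault does not trap:
        # it only sets the poison bit on its destination register.
        self.value[reg] = value
        if speculative and faulted:
            self.poison.add(reg)
        else:
            self.poison.discard(reg)

    def read(self, reg, speculative=False):
        # Only a NON-speculative reader of a poisoned register traps.
        if reg in self.poison and not speculative:
            raise PoisonedRead(reg)
        return self.value.get(reg, 0)
```

Usage: after a faulting `LW*` into R9 (`rf.write("R9", 0, faulted=True, speculative=True)`), speculative reads of R9 proceed, while the first normal read raises `PoisonedRead`.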


Superscalar architectures: HW support: conditional instructions

• Combining speculation with dynamic scheduling

  An attribute bit is added to each instruction (1: speculative,
  0: normal)

  When that bit is 1, the instruction is allowed to execute, but
  cannot enter the commit (WB) stage

  The instruction then has to wait until the end of the speculated
  code

  It is allowed to modify the register file / memory only at the end
  of speculative mode

• Hence: instructions execute out of order, but are forced to commit
  in order

• A special set of buffers holds the results that have finished
  execution but have not committed yet (reorder buffers)
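A minimal model of the out-of-order-complete / in-order-commit discipline described above (class and method names are mine):

```python
from collections import OrderedDict

class ReorderBuffer:
    # Instructions enter in program order; results may arrive out of
    # order; commit drains only the in-order prefix of finished entries,
    # so the register file / memory is updated strictly in order.
    def __init__(self):
        self.entries = OrderedDict()   # tag -> result (None = in flight)

    def issue(self, tag):
        self.entries[tag] = None       # allocate an entry in program order

    def finish(self, tag, result):
        self.entries[tag] = result     # out-of-order completion

    def commit(self):
        committed = []
        while self.entries:
            tag, result = next(iter(self.entries.items()))
            if result is None:
                break                  # head not finished: younger results wait
            self.entries.popitem(last=False)
            committed.append((tag, result))
        return committed
```

For example, if i3 and i1 finish before i2, only i1 commits; once i2 finishes, i2 and i3 commit together, preserving program order.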


Superscalar architectures: HW support: conditional instructions

• Since neither the register values nor the memory values are actually
  WRITTEN until an instruction commits, the processor can easily undo
  its speculative actions when a branch is found to be mispredicted

• If a speculated instruction raises an exception, this is recorded in
  the reorder buffer

• In case of a branch misprediction such that a certain speculative
  instruction should not have been executed, the exception is flushed
  along with the instruction when the reorder buffer is cleared


Superscalar architectures: HW support: conditional instructions

• Reorder buffers:

  An additional set of virtual registers that hold the results of
  instructions that have finished execution but have not committed yet

  Issue: only when both a Reservation Station and a reorder buffer are
  available

  As soon as an instruction completes, its output goes into its
  reorder buffer

  Until the instruction has committed, dependent instructions receive
  their input from the reorder buffer (the Reservation Station is
  freed; the reorder buffer is not)

  The actual updating of the registers takes place when the
  instruction reaches the head of the list of reorder buffers


Superscalar architectures: HW support: conditional instructions

• At this point the commit phase takes place:

  Either the result is written into the register file,

  Or, in case of a mispredicted branch, the reorder buffer is flushed
  and execution restarts at the correct successor of the branch

• Assumption: when a branch with an incorrect prediction reaches the
  head of the buffer, it means that the speculation was wrong


Superscalar architectures: HW support: conditional instructions

• This technique also allows us to tackle situations like

    if (cond) do_this; else do_that;

• One may "bet" on the outcome of the branch and say, e.g., that it
  will be taken

• Even unlikely events do happen, so sooner or later a misprediction
  occurs

• Idea: let the instructions in the else part (do_that) issue and
  execute, with a separate list of reorder buffers (list2)

• This second list is simpler: we don't check for the current
  head-of-list. Elements in there need to be explicitly removed

• In case of a misprediction, the do_that part has already been
  executed in the second list, and we just need to perform its commit

• In case of a correct prediction, the ELSE part is purged from list2


Superscalar architectures

• If a processor A has a lower CPI w.r.t. another processor B, will A
  always run faster than B?

• Not always!

  A higher clock rate is a deterministic measure of the performance
  improvement

  A multiple-issue (superscalar) architecture cannot guarantee its
  improvements (they are stochastic)

  Pushing towards a low CPI means adopting sophisticated (= complex)
  techniques… which slows down the clock rate!

  Improving one aspect of a multiple-issue processor does not
  necessarily lead to overall performance improvements


Superscalar architectures

• A simple question:

  "How much ILP exists in a program?"

  or, in other words, "How much can we expect from techniques that are
  based on the exploitation of ILP?"

• How to proceed:

  Make a set of very optimistic assumptions and measure how much
  parallelism is available under those assumptions


Superscalar architectures

• Assumptions (HW model of an ideal processor):

  1. Infinite # of virtual registers (-> no WAW or WAR hazard can
     suspend the pipeline)

  2. All conditional branches are predicted exactly (!!)

  3. All computed jumps and returns are perfectly predicted

  4. All memory addresses are known exactly, so a store can be moved
     before a load, provided that the addresses are not identical

  5. Infinite-issue processor

  6. No restriction on the types of instructions that can execute in a
     cycle (no structural hazards)

  7. All latencies are 1


Superscalar architectures

• How to match these assumptions??

• Gambling!

• We run a program and produce a trace with the outcomes of all the
  instances of each branch

  Taken, Taken, Taken, Untaken, Taken, …

  Each corresponding target address is recorded and assumed to be
  available

  Then we use a simulator to mimic, e.g., a machine with infinite
  virtual registers, etc.

• Results are depicted in the next picture

• Parallelism is expressed in IPC: instruction issues per clock cycle
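Such a limit study can be sketched as a dataflow simulation: under the ideal assumptions, an instruction can issue one cycle after its last producer, so only true (RAW) dependences bound the IPC. The trace representation below is my own simplification:

```python
def ideal_ipc(program):
    # program: list of (dest_register, [source_registers]) in trace order.
    # Perfect prediction, infinite registers and issue width, unit
    # latency: each instruction issues at 1 + max(issue cycle of its
    # producers), and IPC = instructions / cycles of the longest chain.
    ready = {}            # register -> cycle its producer issued
    last_cycle = 0
    for dest, sources in program:
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(program) / last_cycle

# Two independent two-instruction chains: 4 instructions in 2 cycles.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(ideal_ipc(trace))  # 2.0
```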


Superscalar architectures

• Tomcatv reaches 150 IPC (for a particular run)

  ILP available in the SPEC benchmarks on the ideal processor
  (instruction issues per cycle, scale 0–160):

    gcc        54.8
    espresso   62.6
    li         17.9
    fpppp      75.2
    doduc     118.7
    tomcatv   150.1


Superscalar architectures

• Then we can relax the above assumptions and introduce limitations
  that represent our current possibilities with computer design
  techniques for ILP:

  Window size: the actual range of instructions we inspect when
  looking for candidates for simultaneous issue

  Realistic branch prediction

  Finite # of registers

• See images 4-39 and 4-40
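A rough way to model the window-size limitation in software (my own formulation, not the simulator behind the figures): in each cycle a scheduler may issue, out of the oldest `window` not-yet-issued trace instructions, those whose operands are already available (unit latency):

```python
def windowed_ipc(program, window):
    # program: list of (dest_register, [source_registers]) in trace order.
    n = len(program)
    done = [False] * n
    avail = {}            # register -> first cycle its value can be used
    head, cycle = 0, 0
    while head < n:
        cycle += 1
        # Only the oldest `window` un-issued instructions are visible.
        for i in range(head, min(head + window, n)):
            if done[i]:
                continue
            dest, sources = program[i]
            if all(avail.get(s, 0) <= cycle for s in sources):
                done[i] = True
                avail[dest] = cycle + 1   # unit latency: usable next cycle
        while head < n and done[head]:
            head += 1                     # slide the window forward
    return n / cycle

# Two independent two-instruction chains: a wide window overlaps them,
# a window of 1 serializes the whole trace.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(windowed_ipc(trace, 4))  # 2.0
print(windowed_ipc(trace, 1))  # 1.0
```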


Superscalar architectures

  [Figure (image 4-39): instruction issues per cycle (0–160) vs.
  window size (Infinite, 2k, 512, 128, 32, 8, 4) for gcc, espresso,
  li, fpppp, doduc, tomcatv]


Superscalar architectures

  [Figure (image 4-40): per-benchmark instruction issues per cycle
  (0–160) for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes
  Infinite, 512, 128, 32, 8, 4]


Superscalar architectures: conclusive notes

• In the next 10 years it is realistic to reach an architecture that
  looks like this:

  64 instruction issues per clock cycle

  Selective predictor with 1K entries; 16-entry return predictor

  Perfect disambiguation of memory references

  Register renaming with 64 + 64 extra registers

• Computer architectures in practice: Section 4.8 (PowerPC 620)


Superscalar architectures: conclusive notes

• Reachable performance:

  [Figure: instruction issues per cycle (0–60) vs. window size
  (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li,
  fpppp, doduc, tomcatv]


Pipelining and communications

• Suppose that N+1 processes need to communicate a private value to
  all the others

• They use all the values to produce their next output (e.g., for
  voting)

• Communication is fully synchronous and needs to be repeated m times,
  with m large


Pipelining and communications

• Let us assume that no bus is available

• Point-to-point communication

• Processes are numbered p0 … pN

• Two instructions are available:

  Send(pj, value)

  Receive(pj, &value)

• Blocking functions

• If the receiver is ready to receive, the call lasts one stage time;
  otherwise it blocks the caller for a multiple of the stage time

• Sending and receiving occur at discrete time steps


Pipelining and communications

• At each time t, processor pi may be

  Sending data (next stage pi is unblocked)

  Receiving data (next stage pi is unblocked)

  Blocked in a Receive()

  Blocked in a Send()

• Slot = the time corresponding to an entire stage

• At each time t we have one slot per process

• If pi is blocked, its slot is wasted (it's a "bubble")

• Otherwise the slot is used


Pipelining and communications

• At each time t, processor pi may be in

  State S(j):  sending data to processor pj

  State R(j):  receiving data from pj

  State WR(j): blocked in a Receive(pj, …)

  State WS(j): blocked in a Send(pj, …)

• We use the notation

    proc  s_t  proc'

  to indicate that, at time t, proc is in state s with proc'

• For instance,

    p1  WR(3)_21  p3

  means that the 21st slot of p1 is wasted waiting for p3 to send its
  value to it


Pipelining and communications

• The following algorithm is executed by process j:

  Before gaining the right to broadcast, process j needs to go through
  j pairs of states (WR, R)

  Then it performs the ordered broadcast: the k-th message to be sent
  goes to process pk

  Finally, process j goes through N-j pairs of states (WR, R)


Pipelining and communications

• p is a vector of indices

• For process j, p can be any arrangement of the integers 0, 1, …,
  j-1, j+1, …, N

• Whatever the arrangement, the algorithm works correctly

• For instance, if N = 4 (5 processes) and j = 1, then p can be any
  permutation of 0, 2, 3, and 4

• p determines the order in which process j sends its value to its
  neighbours

• Example: p[] = [3, 2, 0, 4]. Then p1 executes:

    send(p3), send(p2), send(p0), send(p4)


Pipelining and communications

• Example: p[] = the ordered permutation

  E.g., N = 5 and, for process j, p = [0, …, j-1, j+1, …, N]

  [Figure: frequencies of used slots, slots wasted in send, slots
  wasted in receive, and total duration for this case]


Pipelining and communications

• Case N = 20, p[] = the ordered permutation

• Gray = wasted slots

• Black = used slots

• In general, we look at:

  The duration

  Used slots / total # of slots

  The average # of used slots during one stage time

• This image reminds us of another one:


Pipelining and communications

• No pipelining: many slots are wasted!

  [Figure (from Part 2.2): sequential laundry; jobs A, B, C, D each
  occupy four 30-minute stages back-to-back on the time axis from
  6 PM to 2 AM]


Pipelining and communications

• Let us now consider the case in which processor k uses

    p[] = [ k+1, k+2, …, N, 0, 1, …, k-1 ]
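This wrapped-around send order is easy to state as a function (a sketch; the name is mine):

```python
def send_order(k, n_max):
    # Processor k sends to k+1, k+2, ..., N, then wraps around to
    # 0, 1, ..., k-1 (processes are numbered p0 .. pN, n_max = N).
    return list(range(k + 1, n_max + 1)) + list(range(0, k))

print(send_order(2, 4))  # [3, 4, 0, 1]
```

Each processor thus starts by sending to its immediate successor, which staggers the senders and is what enables the pipelined behaviour compared in the following slides.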


Pipelining and communications


Pipelining and communications

• Duration: first case vs. second case


Pipelining and communications

• Efficiency: first case vs. second case


Pipelining and communications

• Algorithm of pipelined broadcast

  Beginning of the steady state

  Every 10 slots, 5 mark the completion of a broadcast

  Throughput = t / 2 (t = 1 slot time): a full broadcast is finished
  every 2t

• The image may remind us of another one…


• Between 7.30 and 9.30 pm, a whole job is completed every 30'

  [Figure (pipelining, slide P2.2/20): the pipelined laundry; jobs A,
  B, C, D overlap in 30-minute stages between 6 PM and 2 AM]

• During that period, each worker is permanently at work…

• …but a new input must arrive within 30'