Advanced Computer Architectures – Part 2.2
DESCRIPTION
Part 2.2 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (Belgium).
TRANSCRIPT
Advanced Computer
Architectures
– HB49 –
Part 2.2
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio
KULeuven 2002
[sidebar: Basic Concepts | Computer Design | Computer Architectures for AI | Computer Architectures in Practice]
2.2/2
Course contents
• Basic Concepts
• Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
Computer Design IS
• IS Classification
• Role of the compilers
• DLX
IS DLX Architecture
• An example RISC architecture designed
by Patterson and Hennessy
• Simple register-register (load-store)
instruction set
• Designed for efficiency
From HW viewpoint
From compiler viewpoint
• Useful as an example of good IS design
IS DLX Architecture
• Registers:
32 registers called R0 = 0, R1, …, R31
32 single-precision floating point registers or
16 double-precision floating point registers F0,
F2, …, F30
• Data types
Like in C: 1 byte, 2 byte, 4 byte integers and 4
byte and 8 byte floats
• Addressing modes: just 2
Immediate (example: Add R4, #3)
Displacement (example: Add R4, 100(R1))
Both use 16-bit fields
Register deferred is a special case of displacement: Add R4, 0(R1)
Absolute is another (since R0 = 0): Add R4, 100(R0)
IS DLX Architecture
• Big endian
• DLX instruction format
Just two addressing modes, easy to encode in the
opcode
All instructions have the same length and start
with a 6-bit opcode
→ easier decoding algorithm
→ faster processing
→ a shorter cycle is possible
• Layout: P&H p.99
IS DLX Architecture
• Mnemonics: L=load S=store followed by
B=byte H=half word W=word
F=float D=double
Examples
LB R1, 50(R9)
SF 50(R0), F2
• ADD…(Arithmetic op’s),
• SL... (shift left, logical op’s),
• J…, B… (jump and branch op’s)
IS DLX Architecture
• How good is the DLX architecture?
• DLX is a RISC architecture
• What’s a RISC architecture, and what’s
the difference between a RISC and a non-
RISC architecture?
IS DLX Architecture
• CISC = complex IS architecture
Architecture of the ’70s
Axioms:
(1) the IS must be easy to program with
(2) the IS must be easy to compile for
IS not too far away from a HLL
IS includes high level constructs
Loop instructions vs. gotos
Complex CALL instructions preserving the
register file
Case/switch instructions
Large set of addressing modes
All addressing modes are available with all the
instructions
Key requirement of the ’70s: Minimize code size
IS DLX Architecture
• Why?
• Because, in the ’70s, RAM memories were
1000 times smaller than today
• Code space was a key factor
IS DLX Architecture
• RISC = reduced IS architecture
Key architecture today
Axioms:
(1) the IS must be simple,
(2) easy to implement in HW,
(3) should match with clever design solutions
(e.g., pipelining)
(4) should be a good target for today's
optimising compilers
IS DLX Architecture
• RISC = reduced IS architecture
Simple instructions
A few simple addressing modes
Fixed-length instructions
“Many” general purpose registers
Key goal: Help the machine go fast
In general, RISCs increase the number of
instructions executed (IC)…
Recall: CPU_TIME(p) = IC(p) × CPI(p) / clock rate
…but at the same time they decrease CPI
The decrease rate of CPI is higher than the
increase rate of IC → shorter CPU_TIME
IS DLX Architecture
IS DLX Architecture
• Clock cycles: assumed to be the same
• Results:
• IC_MIPS ≈ 2 × IC_VAX
• CPI_MIPS ≈ CPI_VAX / 6
• The performance of the MIPS M2000 is
about 3 times the performance of the VAX
8700
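The factor of 3 follows directly from the two ratios above and the CPU_TIME formula; a one-line check in Python (only the ratios matter, so the VAX figures are normalized):

```python
# With equal clock rates, speedup = (IC_VAX * CPI_VAX) / (IC_MIPS * CPI_MIPS).
ic_vax, cpi_vax = 1.0, 6.0      # normalized; only the ratios matter
ic_mips = 2 * ic_vax            # MIPS executes about twice the instructions...
cpi_mips = cpi_vax / 6          # ...at about one sixth the CPI

speedup = (ic_vax * cpi_vax) / (ic_mips * cpi_mips)
print(speedup)   # 3.0: the MIPS M2000 is about 3x the VAX 8700
```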
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
• Parallelism
Pipelining
• Pipelining =
“an implementation technique whereby
multiple instructions are overlapped in
execution” (P&H)
• An assembly line:
Different steps (pipe stages) …
are completing different parts …
of different instructions …
in parallel
Pipelining
° Four persons (A, B, C, and D) have to
perform a certain job on 4 sets of items.
The job consists of 4 phases.
° Phase 1 (washing) takes 30’
° Phase 2 (drying), another 30’
° Phase 3 (packaging), another 30’
° Phase 4 (delivering) also takes 30’
Doing the job sequentially takes 8 hours
[timeline: sixteen 30’ slots from 6 PM to 2 AM; A, B, C, and D perform their four phases one after another]
Pipelining
The whole job is now finished in just
3.5 hours
[timeline: seven 30’ slots from 6 PM to 9.30 PM; A, B, C, and D overlap their phases]
Key idea: one
starts a new
phase as soon as
possible
Pipelining
What if they had more job to do?
[timeline: from 6 PM onward, a new set of items enters the pipeline every 30’]
Between 7.30
and 8pm, each
person is busy
Pipelining
Between 7.30 and 9.30pm, a whole job
is completed every 30’
[timeline: in steady state, a finished job leaves the pipeline every 30’]
Pipelining
During that period, each worker is
permanently at work…
…but a new input must arrive within 30’
Pipelining
• Important issues in this example
• Each phase has the same complexity
Each phase takes the same amount of
time!
• In the sequential processing example, the
requirement was: a new input must be
ready for processing every four phases
• Now, a new input must be available every
phase time!
The mechanism that delivers the input must
be four times as fast
One gets more from the system, though
one also asks more of it
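The timings claimed on the previous slides can be sketched in a couple of lines; with p phases of equal duration and n jobs, the pipeline needs p slots to fill and then completes one job per slot:

```python
# Sequential vs pipelined completion time for the laundry example:
# 4 jobs, 4 phases of 30 minutes each.
jobs, phases, phase_min = 4, 4, 30

sequential = jobs * phases * phase_min        # each job runs all its phases alone
pipelined = (phases + jobs - 1) * phase_min   # fill the pipe, then one job per slot

print(sequential / 60)   # 8.0 hours
print(pipelined / 60)    # 3.5 hours
```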
Pipelining
• Also in the execution of, e.g., DLX
instructions, we distinguish a number of
distinct phases – we call them cycles,
because each one takes one clock cycle
time
• In DLX, an instruction is completed in at
most five cycles
• A number of special purpose registers are
used for this:
PC (program counter)
= address of the instruction to
be executed
IR (instruction register)
= instruction to be executed = *(PC)
NPC (next program counter), etc.
Memory and special purpose
registers in DLX
[diagram: memory words at addresses 100–118, holding 52 71 73 10 (BEQ R1, R3, eq3) at 100, then BEQ R1, R5, eq5 and BGT R1, #0, positive; beside them the special-purpose registers PC, NPC, IR, IMM, ALUOUT, COND, LMD, TMP1, TMP2]
Executing DLX Instructions:
Phase 1: Instruction Fetch (IF)
[diagram (IF): PC = 00 00 01 00; the memory word at that address, 52 71 73 10 (BEQ R1, R3, eq3), is copied into IR]
Executing DLX Instructions:
Phase 1: Instruction Fetch (IF)
[diagram (IF): NPC ← PC + 4 = 00 00 01 04]
Executing DLX Instructions:
Phase 2: Instruction Decode and
Register Fetch (ID)
[diagram (ID): the registers named in IR are fetched, TMP1 ← (R1) and TMP2 ← (R3); the 16-bit offset is sign-extended into IMM = 00 00 00 10]
Executing DLX Instructions:
Phase 3: Execution (EX, branch)
[diagram (EX): the ALU adds NPC and IMM, ALUOUT ← 00 00 01 04 + 00 00 00 10 = 00 00 01 14, the branch-target address]
Executing DLX Instructions:
Phase 3: Execution (EX, branch)
[diagram (EX): the ALU also compares the operands, COND ← ((R1) == (R3))]
Executing DLX Instructions:
Phase 3: Execution
• An instruction only enters an active phase
when it reaches state EX
• At that point, the instruction is said to
have issued or to have committed
• The machine state is only changed when
an instruction has committed
Executing DLX Instructions:
Phase 4: Memory access/branch
completion (MEM, branch)
[diagram (MEM): COND is true, so PC ← ALUOUT = 00 00 01 14; the branch to address 114 completes]
Executing DLX Instructions
• DLX branch instructions have only 4
phases
• The fifth phase is the write-back (WB), in
which registers are loaded with an
output from the ALU (ALUOUT) or from
LMD (see P&H Chapter 3)
• For instance, when the instruction is
LW R1, 100(R0)
phases 3 – 5 are as follows:
3. ALUOUT ← TMP1 + IMM /* i.e., R0 + 100 */
4. LMD ← Mem[ALUOUT]
5. R1 ← LMD
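The five cycles of the LW example can be sketched as a tiny simulation. Register and latch names (PC, IR, NPC, TMP1, IMM, ALUOUT, LMD) follow the slides; the memory contents (Mem[100] = 42) are an invented example:

```python
# Minimal sketch of the five DLX cycles for LW R1, 100(R0).
# Dictionaries stand in for the register file and the two memories.

regs = {f"R{i}": 0 for i in range(32)}   # R0 is hard-wired to 0
imem = {0x100: ("LW", "R1", 100, "R0")}  # instruction memory
dmem = {100: 42}                         # data memory: Mem[100] = 42

PC = 0x100
IR = imem[PC]                 # 1. IF: fetch the instruction...
NPC = PC + 4                  #        ...and compute the next PC
op, rd, IMM, rs = IR          # 2. ID: decode, fetch the operand register
TMP1 = regs[rs]
ALUOUT = TMP1 + IMM           # 3. EX: effective address = R0 + 100
LMD = dmem[ALUOUT]            # 4. MEM: load the word from memory
regs[rd] = LMD                # 5. WB: write the result back into R1

print(regs["R1"])   # 42
```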
Pipelined
[diagram: Cache/memory feeds the Fetch unit, Decode unit, and Execute unit, which share the Reg file; each instruction passes through Fetch, Decode, Execute, Write-back]
         T1  T2  T3  T4  T5  T6
Instr 1  F1  D1  E1  W1
Instr 2      F2  D2  E2  W2
Instr 3          F3  D3  E3  W3
Instr 4              F4  D4  E4
Instr 5                  F5  D5
Instr 6                      F6
Pipelining
• With respect to a non-pipelined machine,
the memory system must deliver n times
the bandwidth (n being the number of
pipeline stages)
• In pipelined operation, n instructions are
concurrently being processed: on average
n memory accesses per clock cycle
This worsens the memory bottleneck: even
apart from technological advances, this
architectural modification increases the
number of memory accesses per clock cycle
Pipelining
• In DLX, each instruction takes 5 clock
cycles to complete…
• …but during each clock cycle, the HW
initiates a new instruction and is
executing some part of 5 different
instructions
Pipelining
• Clearly pipelining increases the
complexity of the HW
Each stage involves a set of HW resources; we
need to guarantee that the same HW resource
be scheduled for execution in at most one
pipeline stage
When the pipeline is in steady state, in each
cycle the register file is accessed twice:
in ID (for reading),
in WB (for writing)
Each clock cycle, we need to perform two
reads and one write
We need to guarantee consistent operation
even when we read from and write to, e.g., the
same register
Pipelining
In order to realize the pipeline, values and
control information must “move through” the
pipeline from one stage to the next
Special registers, called pipeline registers or
pipeline latches, convey that information
This is because, instead of having, e.g., a single
NPC register, we need to have
NPC’, NPC’’, NPC’’’…
representing the values of NPC during the
different stages of different instructions
For instance,
bwIDandEX.NPC ← bwIFandID.NPC
Pipelining
Stage	Actions and pipeline registers
IF	bwIFandID.IR ← *PC
	if (bwEXandMEM.COND == TRUE)
		bwIFandID.NPC ← bwEXandMEM.NPC
	else
		bwIFandID.NPC ← PC + 4
ID	bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
	bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
[example instruction: 52 71 73 10 = BEQ R1, R3, eq3]
Pipelining
Stage	Actions and pipeline registers
IF	bwIFandID.IR ← *PC
	if (bwEXandMEM.COND == TRUE)
		bwIFandID.NPC ← bwEXandMEM.NPC
	else
		bwIFandID.NPC ← PC + 4
ID	bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
	bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
	bwIDandEX.NPC ← bwIFandID.NPC	(new reg ← old reg)
	bwIDandEX.IR ← bwIFandID.IR
Pipelining
Stage	Actions and pipeline registers
IF	bwIFandID.IR ← *PC
	if (bwEXandMEM.COND == TRUE)
		bwIFandID.NPC ← bwEXandMEM.NPC
	else
		bwIFandID.NPC ← PC + 4
ID	bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
	bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
	bwIDandEX.NPC ← bwIFandID.NPC
	bwIDandEX.IR ← bwIFandID.IR
	bwIDandEX.IMM ← bwIFandID.IR[3]
[example instruction: 52 71 73 10 = BEQ R1, R3, eq3]
Pipelining
Stage Actions and pipeline registers
EX	bwEXandMEM.ALUOUT ← bwIDandEX.NPC + bwIDandEX.IMM
	bwEXandMEM.COND ← bwIDandEX.TMP1 rel bwIDandEX.TMP2
…and so forth (see P&H, p.136)
Pipelining
• More registers are required → a more
complex design is to be carried out
• A more complex algorithm takes more
time to complete
Pipelining
• Indeed, implementing an instruction
pipeline
increases the instruction throughput
(average number of instructions
completed in one time unit)…
…though it slightly
increases the execution time of each
instruction
Overhead for controlling the pipeline
Overhead for avoiding “hazards” (to be
discussed later on)
Pipelining
Quantitative measurements
• Let U be an unpipelined machine
• Clock cycle of U = ccU = 10 ns
• Cycle distribution of U is as follows:
ALU instructions (40%) take 4 cycles
Branches (20%) take 4 cycles
Memory operations (40%) take 5 cycles
• P = pipelined version of U
• Clock cycle of P = ccP = 11 ns
(overhead: 1 ns per cycle)
• How fast is P w.r.t. U?
(Assumption: continuous flow is available,
no pipeline stalls...)
Pipelining
Quantitative measurements
• Average Instruction Execution Time = T
• T_U = cc_U × average CPI
       = 10 ns × ( (40% + 20%) × 4 + 40% × 5 )
         (ALU and BRANCH take 4 cycles; MEM takes 5 cycles)
       = 10 ns × 4.4 = 44 ns
• T_P = cc_P × average CPI = cc_P × 1 = 11 ns
• Speedup = T_U / T_P = 44 ns / 11 ns = 4
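The same arithmetic in a few lines of Python, directly from the instruction mix given on the previous slide:

```python
# Quantitative example: unpipelined U (cc = 10 ns) vs
# pipelined P (cc = 11 ns, CPI = 1, no stalls assumed).
cc_u, cc_p = 10, 11                           # clock cycles, in ns

avg_cpi_u = 0.40 * 4 + 0.20 * 4 + 0.40 * 5    # ALU, branch, memory mix
t_u = cc_u * avg_cpi_u                        # average instruction time on U
t_p = cc_p * 1                                # one instruction per cycle on P

print(avg_cpi_u)   # about 4.4
print(t_u)         # about 44 ns
print(t_u / t_p)   # about 4: P is 4x faster than U
```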
Pipelining Hazards
• Ideally, pipelines should continuously
“crunch” instructions without being
interrupted
• This way, the speedup is maximum
• In reality there exist three classes of
impediments that prevent the next
instruction from being executed:
Structural Hazards
Data Hazards
Control hazards
to be described in what follows
Pipelining Hazards
• Hazards are a problem because they
require stalling the pipeline (see later)
• Later on we will show some techniques
for hazard prevention
Pipelining Structural Hazards
• Structural Hazards are resource conflicts
• Not every combination of instructions is
allowed
because not every functional unit is fully
pipelined
Or because of other resource conflicts
A problem of cost-effectiveness
Consequence: a stall (“bubble”) floats
through the pipeline
Pipelining Structural Hazards
[pipeline diagram, cycles 1–8: a LOAD followed by Instr2, Instr3, Instr4; in cycle 4, the LOAD’s Mem stage and Instr4’s instruction fetch both need memory]
If the machine has just one memory port, this is
a structural hazard
Pipelining Structural Hazards
[pipeline diagram, cycles 1–8: LOAD, Instr2, Instr3 proceed; Instr4 is held back one cycle, and a bubble enters the pipeline]
Pipelining Structural Hazards
• One of the keywords of computer design:
make the common case fast,
and the rare case correct
• If a particular structural hazard
does not occur very frequently,
it may not be worth the cost to avoid it
Pipelining Structural Hazards
• Avoiding a conflict has a cost due to the
extra redundancy,
but also a cost due to
extra control
• Compare for instance fig. 3.1 and fig.3.4
of P&H
• One must be careful so that this overhead
does not trigger the need for a longer clock
cycle → a lower clock rate
Recall: CPU_TIME(p) = IC(p) × CPI(p) / clock rate
Pipelining Data Hazards
• Pipelining overlaps the execution of a set
of instructions
• Data Hazards are hazards due to
data dependencies between these
overlapped executions
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
ADD requires
5 cycles to
complete!
SUB may
use the
wrong value!
Pipelining Data Hazards
• Pipelining overlaps the execution of a set
of instructions
• Data Hazards are hazards due to
data dependencies between these
overlapped executions
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
ADD requires
5 cycles to
complete!
SUB, AND, and OR require R1 sooner
XOR is
“far”
enough
Pipelining Data Hazards
[pipeline diagram, cycles 1–8: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11. SUB, AND, and OR overlap ADD’s write of R1 (DATA HAZARDS); XOR does not (NOT A DATA HAZARD)]
Pipelining Minimizing or
Avoiding Data Hazards
• Let us consider again ADD R1, R2, R3
• “ADD requires 5 cycles to complete”
means
“the sum of R2 and R3 will be stored into
R1 only at the 5th cycle”
Why should we wait for this to happen?
Forwarding: using a pipeline register that
holds the right value
SUB R4, R1, R5
becomes
SUB R4, bwEXandMEM.ALUOUT, R5
Pipelining Minimizing or
Avoiding Data Hazards
• How is forwarding realized?
• By propagating the result of the ALU
directly to an input latch of the ALU
• A custom circuit selects the right value to
be input to the ALU: the named register or
the propagated value
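The selection circuit can be sketched in software as a simple multiplexer. The function and field names below are illustrative, not the actual DLX datapath signals:

```python
# Sketch of forwarding: the ALU input comes either from the register
# file or, when the previous instruction is about to write the very
# register we need, from the EX/MEM pipeline latch (its ALUOUT field).

def alu_input(src_reg, regfile, ex_mem_dest, ex_mem_aluout):
    """Select the forwarded value when the named register is stale."""
    if ex_mem_dest == src_reg:
        return ex_mem_aluout     # forward: bypass the register file
    return regfile[src_reg]      # no hazard: normal register read

# ADD R1, R2, R3 has computed 10 into ALUOUT but has not yet written R1:
regfile = {"R1": 0, "R5": 7}
print(alu_input("R1", regfile, ex_mem_dest="R1", ex_mem_aluout=10))  # 10, not the stale 0
print(alu_input("R5", regfile, ex_mem_dest="R1", ex_mem_aluout=10))  # 7, from the register file
```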
Pipelining Minimizing or
Avoiding Data Hazards
• Sometimes forwarding can be avoided by
very simple techniques
• For instance, let us assume that, during
each cycle,
writes into the register file occur in the
first half of the cycle, while
reads occur in the second half
[diagram: one clock cycle split into a Write half followed by a Read half]
Pipelining Minimizing or
Avoiding Data Hazards
[pipeline diagram, cycles 1–8: ADD R1, R2, R3 followed by SUB, AND, OR, XOR; the readers in cycles 3 and 4 get R1 by forwarding; the one in cycle 5 needs no forwarding (write in the first half of the cycle, read in the second)]
Pipelining Classification of
Data Hazards
• Let (I_k), 1 ≤ k ≤ IC(p), be the ordered series of
instructions executed during the run of
program p
• Let i < j be two integers, 1 ≤ i < j ≤ IC(p)
• So I_i occurs before I_j
• Let us represent the predicate
“instruction i writes in memory location v”
as I_i → v
• Let us represent the predicate
“instruction i reads from location v”
as I_i ← v
Pipelining Classification of
Data Hazards
1. RAW HAZARD (Read-After-Write hazard)
   (along time t: first I_i → v, then I_j ← v)
• RAW: a data dependency on an operand
that needs first to be written by I_i, and
then read by I_j
• If, due to pipelining,
I_j reads v before I_i writes it,
a RAW hazard occurs: I_j erroneously
gets a stale value
Pipelining Classification of
Data Hazards
2. WAW HAZARD (Write-After-Write hazard)
   (along time t: first I_i → v, then I_j → v)
• WAW: a data dependency on an operand
that must be written in a certain order,
while it is written in the wrong one
• If, due to pipelining,
I_j writes v before I_i writes it,
a WAW hazard occurs: the wrong value
gets stored in v
Pipelining Classification of
Data Hazards
• WAW hazards may happen in pipelines
such that the write-back stage happens in
different positions
[pipeline diagram:
LW R1, 0(R2)    IF ID EX MEM1 MEM2 WB
ADD R1, R2, R3     IF ID EX   WB
ADD reaches its write-back before LW does]
• This cannot happen with instruction sets
such as, e.g., DLX, where
each instruction takes
the same amount of cycles
• Less tricky design → less complexity to
handle → fewer pitfalls
Pipelining Classification of
Data Hazards
3. WAR HAZARD (Write-After-Read hazard)
   (along time t: first I_i ← v, then I_j → v)
• WAR: a data dependency on an operand
that needs first to be read by I_i, and then
written by I_j
• If, due to pipelining,
I_j writes v before I_i reads it,
a WAR hazard occurs: the wrong value is
read from v
• I_i erroneously gets the NEW value of v,
the one produced by I_j
Pipelining Classification of
Data Hazards
• This cannot happen with instruction sets
such as, e.g., DLX, where
all reads are early (ID stage) and
all writes are late (WB stage)
• WAR hazards occur when there are
instructions that write results early in the
instruction pipeline, as well as
instructions that read a source late in the
pipeline
• For instance, this may happen with the
autoincrement addressing mode
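The three definitions can be turned into a small classifier. Modelling each instruction as a pair of (reads, writes) sets of locations is my own simplification for illustration:

```python
# Classify the dependency between I_i (earlier) and I_j (later),
# following the RAW/WAW/WAR definitions on the previous slides.
# Each instruction is modelled as (reads, writes) sets of locations.

def hazards(ii, ij):
    """Return the hazard classes the ordered pair (I_i, I_j) may raise."""
    i_reads, i_writes = ii
    j_reads, j_writes = ij
    found = set()
    if i_writes & j_reads:
        found.add("RAW")   # I_j must read v only after I_i has written it
    if i_writes & j_writes:
        found.add("WAW")   # the two writes to v must land in program order
    if i_reads & j_writes:
        found.add("WAR")   # I_j must not write v before I_i has read it
    return found

add_r1 = ({"R2", "R3"}, {"R1"})   # ADD R1, R2, R3
sub_r4 = ({"R5", "R1"}, {"R4"})   # SUB R4, R5, R1
print(hazards(add_r1, sub_r4))    # {'RAW'}: SUB reads the R1 written by ADD
```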
Pipelining Hazards
[pipeline diagram, cycles 1–8: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; every read of R1 is served in time]
• In some cases, forwarding and subcycling
can prevent a stall
Pipelining Hazards
[pipeline diagram, cycles 1–8: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; SUB would need R1 in the same cycle in which the load brings it from memory: IMPOSSIBLE!]
• In some cases, forwarding and subcycling
cannot prevent a stall
Pipelining Hazards
[pipeline diagram, cycles 1–8: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; a bubble delays each of them]
• A special HW, called the pipeline
interlock, detects the hazard and stalls
the pipeline until the hazard is cleared
Pipelining Hazards
• Pipeline interlock penalty:
one or more clock cycles
• Consequences:
the CPI for the stalled instruction
increases by the length of the stall
Pipelining Pipeline Scheduling
• Classical solution: pipeline scheduling
• The compiler re-arranges the instructions
in order to (try to) avoid stalls
• Example: the compiler tries to avoid
generating code like
LW x, …
INSTR …, x
that is, a load followed by the immediate
use of the load destination register
Pipelining Pipeline Scheduling
1. Generate DLX code for the expressions
a = b + c
d = e – f
The resulting basic block:
LW R1, b
LW R2, c
ADD R3, R1, R2
SW a, R3
LW R4, e
LW R5, f
SUB R6, R4, R5
SW d, R6
Pipelining Pipeline Scheduling
2. We make a graph of the dependences
among the instructions and we order the
instructions so as to minimize the stalls
Original:           Scheduled:
LW R1, b            LW R1, b
LW R2, c            LW R2, c
ADD R3, R1, R2      LW R4, e
SW a, R3            ADD R3, R1, R2
LW R4, e            LW R5, f
LW R5, f            SW a, R3
SUB R6, R4, R5      SUB R6, R4, R5
SW d, R6            SW d, R6
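The re-ordering step can be sketched as a toy greedy scheduler. The instruction model, (destination, sources) pairs with lowercase names for memory variables live on entry, is my own simplification, not real DLX:

```python
# Toy pipeline scheduler: at each step, among the instructions whose
# operands are available, prefer one that does not immediately consume
# the result of the instruction just scheduled (to avoid load-use stalls).

block = [
    ("R1", ["b"]), ("R2", ["c"]), ("R3", ["R1", "R2"]), ("a", ["R3"]),
    ("R4", ["e"]), ("R5", ["f"]), ("R6", ["R4", "R5"]), ("d", ["R6"]),
]

def schedule(instrs):
    done, order = set(), []
    remaining = list(range(len(instrs)))
    while remaining:
        # ready = every source already produced, or live on entry (lowercase)
        produced = {instrs[k][0] for k in done}
        ready = [k for k in remaining
                 if all(s in produced or s.islower() for s in instrs[k][1])]
        last = instrs[order[-1]][0] if order else None
        # prefer a ready instruction that does not use the last result
        pick = next((k for k in ready if last not in instrs[k][1]), ready[0])
        order.append(pick); done.add(pick); remaining.remove(pick)
    return order

print(schedule(block))   # [0, 1, 4, 2, 5, 3, 6, 7]: the ordering on the slide
```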
Pipelining Control Hazards
• Control hazards are hazards due to the
execution of branches
• Let us call
TAKEN BRANCH
a branch that sets the PC to its target
address
• Let us call
UNTAKEN BRANCH
a branch that does not force the PC to be
set; as far as PC is concerned, it behaves
like a NOP
Pipelining Control Hazards
• The problem with branches is that their
nature is only known at run-time
• Simplest method to deal with branches:
as soon as we detect a branch,
we stall the pipeline
• What exactly does “as soon as” mean?
DLX Branch
1: IF (1/2)
[diagram (IF): PC = 00 00 01 00; IR ← 52 71 73 10 (BEQ R1, R3, eq3)]
DLX Branch:
1: IF (2/2)
[diagram (IF): NPC ← PC + 4 = 00 00 01 04]
At this point, we’ve just
fetched an instruction; but
we don’t know yet WHICH ONE!
DLX Branch
2: ID
[diagram (ID): TMP1 ← (R1), TMP2 ← (R3); IMM ← 00 00 00 10]
At this point, we’ve decoded the
instruction and found that it’s
indeed a branch
DLX Branch
3: EX (1/2)
[Datapath figure: the ALU adds NPC and the branch offset, giving ALUOUT = 00 00 01 14.]
Here we get the next PC of the taken branch
DLX Branch
3: EX (2/2)
[Datapath figure: the comparator sets COND = ((R1) == (R3)).]
Only at this point we know the nature of the branch:
branch = (cond) ? Taken : Untaken;
Pipelining Control Hazards
• The problem with branches is that their nature is only known at run time
• Simplest method to deal with branches: as soon as we detect a branch, we stall the pipeline
1. "As soon as" means after the IF stage, during stage ID (IF: first stall)
2. Then we need to reach the EX stage to know the address where to branch to (ID: second stall)
3. The nature of a branch is revealed at the end of EX, in MEM (EX: third stall)
• At this point, the pipeline restarts
Pipelining Control Hazards
• With a 30% branch frequency and an ideal CPI of 1, three clock cycles of penalty means that the machine achieves only about HALF the ideal speedup from pipelining
• What can we do to reduce the three-cycle penalty?
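To see where the "half the ideal speedup" figure comes from, here is a quick sketch of the arithmetic (my own illustration, not from the slides):

```python
def effective_cpi(ideal_cpi, branch_freq, branch_penalty):
    """Effective CPI when a fraction `branch_freq` of the instructions
    are branches, each costing `branch_penalty` extra stall cycles."""
    return ideal_cpi + branch_freq * branch_penalty

cpi = effective_cpi(ideal_cpi=1.0, branch_freq=0.30, branch_penalty=3)
print(cpi)  # 1.9: a 5-stage pipeline then speeds up by 5/1.9, roughly 2.6 instead of 5
```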
Pipelining Control Hazards
1. Uncover the nature of the branch earlier in the pipeline: in DLX, this means adding a test to the ID stage
2. Compute the taken PC earlier: at the cost of an additional adder, we can anticipate the addition that gives the taken PC
3. (For untaken branches:) do not repeat the IF stage
• These strategies can reduce the branch penalty to one clock cycle
Pipelining Control Hazards
• How to deal with branch penalties
• Four simple compile-time schemes
  Static, fixed, per-branch predictions
  Compile-time guesses
• Simplest: freezing or flushing the pipeline
  Penalty: one clock cycle
• Predict not taken:
  The HW continues as if the branch were not taken (next IR = *(PC + 4))
  If the branch is taken, the fetched instruction is invalidated (turned into a NOP)
  Penalty: none if untaken, one cycle if taken
Pipelining Control Hazards
Predict not taken

  Untaken branch i     IF  ID  EX  MEM WB
  i + 1                    IF  ID  EX  MEM WB
  i + 2                        IF  ID  EX  MEM WB
  i + 3                            IF  ID  EX  MEM WB
  i + 4                                IF  ID  EX  MEM WB

  Taken branch i       IF  ID   EX   MEM  WB
  i + 1                    IF   idle idle idle idle
  Branch target                 IF   ID   EX  MEM WB
  Branch target + 1                  IF   ID  EX  MEM WB
  Branch target + 2                       IF  ID  EX  MEM WB
Pipelining Control Hazards
• Predict taken:
  Hypothesis: the taken-branch address is known very early, long before the outcome of the branch is known
  The HW assumes the branch is taken
  Penalty: none if taken, one cycle if untaken
  Due to loops, taken branches are more frequent than untaken branches
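The trade-off between the two static guesses can be sketched numerically (my own illustration, using a hypothetical taken fraction):

```python
def avg_branch_penalty(taken_fraction, mispredict_penalty=1):
    """Average stall cycles per branch for the two static guesses.
    Predict-not-taken pays on taken branches; predict-taken on untaken ones."""
    return {
        "predict-not-taken": taken_fraction * mispredict_penalty,
        "predict-taken": (1 - taken_fraction) * mispredict_penalty,
    }

# With roughly two thirds of branches taken (loops), predict-taken wins:
p = avg_branch_penalty(2 / 3)
print(round(p["predict-taken"], 2), round(p["predict-not-taken"], 2))  # 0.33 0.67
```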
Pipelining Control Hazards
• Delayed branch
• Hypothesis: a branch implies a delay equal to the time required to execute n instructions
• The branch-delay slot is then filled with instructions that are executed whatever the outcome of the branch test
• In DLX, n = 1
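Behaviourally, the delay-slot instruction executes no matter which way the branch goes; a toy model of that semantics (names are mine):

```python
def delayed_branch(taken, slot_instr, target_instr, fallthrough_instr):
    """With a delayed branch (n = 1) the instruction in the delay slot
    ALWAYS executes; only afterwards does control diverge."""
    executed = [slot_instr]                      # slot runs unconditionally
    executed.append(target_instr if taken else fallthrough_instr)
    return executed

print(delayed_branch(True, "slot", "target", "fall"))   # ['slot', 'target']
print(delayed_branch(False, "slot", "target", "fall"))  # ['slot', 'fall']
```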
Pipelining Control Hazards
Delayed branch

  Untaken branch i      IF  ID  EX  MEM WB
  Branch delay (i + 1)      IF  ID  EX  MEM WB
  i + 2                         IF  ID  EX  MEM WB
  i + 3                             IF  ID  EX  MEM WB
  i + 4                                 IF  ID  EX  MEM WB

  Taken branch i        IF  ID  EX  MEM WB
  Branch delay (i + 1)      IF  ID  EX  MEM WB
  Branch target                 IF  ID  EX  MEM WB
  Branch target + 1                 IF  ID  EX  MEM WB
  Branch target + 2                     IF  ID  EX  MEM WB
Pipelining Control Hazards
Slot schedule
• Problem: how to schedule the branch-delay slot
• Three ways
• Best choice: an independent instruction from before the branch

  Before scheduling:        After scheduling:
    INSTR1                    INSTR1
    INSTR2                    IF TEST THEN
    IF TEST THEN              INSTR2      <- delay slot
    <delay slot>              ...
    ...                       INSTR N
    INSTR N

• Penalty: none
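The "from before" choice can be sketched in code (a simplification of mine: it only checks that the branch test does not read the moved instruction's destination, while a real scheduler must also check all intervening dependences):

```python
def fill_delay_slot(before, branch_sources):
    """Move the latest instruction before the branch whose destination
    the branch test does not read into the delay slot.
    `before` is a list of (text, dest_reg); returns (rest, slot or None)."""
    for i in range(len(before) - 1, -1, -1):
        text, dest = before[i]
        if dest not in branch_sources:   # independent of the branch test
            slot = before.pop(i)
            return before, slot
    return before, None                  # nothing safe: slot stays a NOP

code = [("ADD R2,R3,R4", "R2"), ("SUB R5,R6,R7", "R5")]
rest, slot = fill_delay_slot(code, branch_sources={"R2", "R0"})
print(slot)  # ('SUB R5,R6,R7', 'R5') fills the slot; penalty: none
```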
Pipelining Control Hazards
Slot schedule
• If the best choice is not possible, e.g. due to a dependency, then one may choose between the following two methods:
1. From target:
   If it is not possible to select an independent instruction from before the branch (a sure one!), then you must guess: if the chance that the branch is taken is felt to be higher, then you fill the delay slot with an instruction from the target of the branch
Pipelining Control Hazards
Slot schedule
• From target

  Before scheduling:             After scheduling:
    INSTR1   <- branch target      INSTR1   <- branch target
    ...                            ...
    IF TEST THEN                   IF TEST THEN
    <delay slot>                   INSTR1   <- copied into the slot

• Penalty: none if the branch is taken, 1 clock cycle if it's untaken
• Assumption: no side effect from executing INSTR1 when the branch is mispredicted (no undo required!)
Pipelining Control Hazards
Slot schedule
2. From fall through:
   If it is not possible to select an independent instruction from before the branch (a sure one!), and if the chance that the branch is not taken is felt to be higher, then you fill the delay slot with the instruction at PC + 4

  Before scheduling:        After scheduling:
    IF TEST THEN              IF TEST THEN
    <delay slot>              INSTR2   <- fall-through instruction
    INSTR2                    ...
    ...                       INSTR N
    INSTR N
Pipelining Control Hazards
Slot schedule
• Again, the instruction selected to be placed in the delay slot must be side-effect free
• That instruction must be such that no undo is required if the branch goes in the unexpected direction

    BEQ R2, R3, Skip
    LW  R1, #100
    . . .
  Skip:
    LW  R1, #200
    . . .

  The second load overwrites the first one
Pipelining Control Hazards
• The LW example is clearly an ideal one. In reality, it is very difficult to select an instruction for the delay slot
• Furthermore, these schemes are compile-time predictions that may be found to be false at run time
Pipelining Control Hazards
• Improvements are possible:
  Cancelling branches: the branch instruction includes a prediction bit (taken vs. untaken). If the prediction turns out to be false, the branch instruction "cancels" the instruction in the delay slot by writing the NOP bit(s)
• This makes it easier to select instructions for the delay slot: the side-effect-free requirement can be relaxed
Pipelining Control Hazards
The use of delayed and cancelling branches resulted in no penalty 70% of the time, on average, with 10 programs of the SPECint92 benchmarks (5 int., 5 f.p.)
Delayed branches have an extra cost: an interrupt may occur during the execution of the instruction in the branch-delay slot (BDSI). If the branch was taken, then both the address of the BDSI and that of the branch target need to be preserved, and restored when the interrupt has been served
Pipelining Control Hazards
• The longer the pipeline, the more pipeline stages are required
  (1) to uncover the current branch target address and
  (2) to tell the nature of the current branch
• In DLX, one clock cycle (very small)
• In the R4000, (1) takes 3 clock cycles and (2) takes 1 clock cycle
Pipelining Static Branch Prediction & Compiler Support
• The effectiveness of delayed branches depends on the truth value of our guess
• Static branch prediction: predicting the outcome of a branch at compile time (vs. dynamic prediction: prediction based on run-time program behaviour)
• Static prediction method 1: observing and analysing the program behaviour
• Static prediction method 2: using profile information collected from earlier runs of the program
Pipelining Static Branch Prediction & Compiler Support
• Static prediction method 1: observing and analysing the program behaviour
• Observations (10 SPECint92 benchmark programs) show that most branches are taken
  On average, 62% in integer programs, 70% in f.p. programs (total ≈ 67%)
  Of taken branches, backward branches are at least 1.5 times as frequent as forward branches
  Loop unrolling is a reason for this
Pipelining Static Branch Prediction & Compiler Support
• Simplest method: predict-as-taken (1.1)
• In our benchmark, a minority of these predictions is wrong (34%)
• Note: on average!
  The worst misprediction rate is 59%, the best is 9% (in the worst case, predict-as-untaken would give better performance!)
Pipelining Static Branch Prediction & Compiler Support
• Method 1.2: predict-bw-as-taken, predict-fw-as-untaken
• For some programs and compilers, fewer than 50% of the forward branches are taken
• In this case only, M1.2 is better than M1.1
• This is not true for the 10 SPECint92 programs and in most cases
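Method 1.2 is easy to state in code (a sketch of mine; real ISAs encode a signed displacement, here I compare raw PCs):

```python
def predict_bw_taken_fw_untaken(branch_pc, target_pc):
    """Backward branches (target at or below the branch) usually close
    loops, so guess 'taken'; forward branches are guessed 'untaken'."""
    return "taken" if target_pc <= branch_pc else "untaken"

print(predict_bw_taken_fw_untaken(0x120, 0x100))  # taken   (loop-closing)
print(predict_bw_taken_fw_untaken(0x120, 0x140))  # untaken (forward skip)
```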
Pipelining Static Branch Prediction & Compiler Support
• Static prediction method 2: using profile information collected from earlier runs of the program
• You see what happened in the past and consider this a good model for the future
• Per-branch prediction
• Key observation and principle: "often," a given branch has a high-probability behaviour
  A privileged attribute: it is most likely a taken or an untaken branch
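Profile-based prediction amounts to a per-branch majority vote over a training run; a minimal sketch (data and names are invented):

```python
from collections import Counter

def profile_predict(trace):
    """trace: list of (branch_pc, taken?) pairs from a profiling run.
    Returns, per branch, the majority outcome as the static prediction."""
    votes = {}
    for pc, taken in trace:
        votes.setdefault(pc, Counter())[taken] += 1
    return {pc: c[True] >= c[False] for pc, c in votes.items()}

trace = [(0x100, True), (0x100, True), (0x100, False), (0x200, False)]
print(profile_predict(trace))  # {256: True, 512: False}
```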
Pipelining Static Branch Prediction & Compiler Support
• Average # of instructions between mispredictions: 20 vs. 110
• Standard deviation: 27 vs. 85 (very large)
Performance of the DLX Integer Pipelining System
• Assumptions:
  No misses
  No clock overhead
  Basic delayed branch + cancelling delayed branch (1-cycle delay each)
• Results (exercising five SPECint92 programs):
  9% – 23% of the instructions cause a 1-cycle loss
Performance of the DLX Integer Pipelining System
[Bar chart, one bar per program; colors distinguish branch stalls from load stalls]
• DLX average CPI: 1.11
• Speedup (5 SPECint92 programs) = 5 / 1.11 ≈ 4.5
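The arithmetic behind the speedup figure (my restatement of the slide's numbers), assuming the unpipelined machine needs one cycle per stage:

```python
def pipeline_speedup(depth, measured_cpi):
    """Speedup over an unpipelined machine taking `depth` cycles/instruction."""
    return depth / measured_cpi

print(round(pipeline_speedup(5, 1.11), 1))  # 4.5
```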
Pipelining Exceptions
• An exception is an event that is triggered at run time, due to interaction with the environment, and results in a (temporary or permanent) suspension of the current application so as to manage the event
• Examples:
  A key has been pressed (interrupt)
  The user invokes a service of the OS
  A breakpoint is encountered
  A division-by-zero condition is encountered
  An overflow or underflow condition
  A NaN float
  Misalignments
  Access to protected or non-existing memory areas
  Power failures…
Pipelining Exceptions
• What happens to the pipeline when an exception takes place?
• With pipelining, instructions are no longer "atomic"
• An instruction is subdivided into "stages"
• The instruction is only completed at the end of the last stage
• If an interrupt occurs in the middle of an instruction that has not yet committed, the result may be a half-finished instruction
Pipelining Exceptions
• Interrupt:
  An external event asks for immediate attention (service) by raising an input line (the INT line)
  The main program is interrupted wherever it is
  A jump is made to the interrupt service routine (ISR)
  After processing the ISR, the main program resumes where it was broken off
• A pipeline (or machine) is said to be restartable if it can handle an exception (e.g. an interrupt), save the state, and restart without affecting the execution of the interrupted program
Pipelining Exceptions
• Precise exceptions: a property of a pipelined machine such that instructions just before the exception are completed and instructions after the exception can be restarted from scratch
• Often precise exceptions imply a huge penalty
• The IBM PowerPC and others adopt two modes:
  Precise-exceptions mode (slow, for debugging)
  Performance mode (imprecise, fast)
Pipelining Exceptions
• In the DLX integer pipeline, no instruction updates the machine state before the end of the MEM stage
• This makes realising precise exceptions very easy
• The instructions later in the pipeline have not committed yet
• This is not true, e.g., for the autodecrement-mode instructions of the VAX, which update registers in the middle of the execution of an instruction
Pipelining Exceptions
• If such an instruction is aborted due to an exception, the machine state would be left altered
• Machines with these instructions often have the ability to back out any state change made before the instruction has committed
• If an exception occurs, the machine uses this feature to reset the state of the machine to its value before the interrupted instruction started
Pipelining Exceptions
• On the VAX and the 360 family, special instructions use the general-purpose registers as working storage
• In such machines, g.p. registers are always saved on an exception and restored after the exception
• The state of partially completed instructions lies in these registers, which makes the exceptions precise