
Page 1: Graduate Computer Architecture I

Lecture 2: Processor and Pipelining

Young Cho

Graduate Computer Architecture I

Page 2: Graduate Computer Architecture I


Instruction Set Architecture

• Set of Elementary Commands
• Good ISA …
  – CONVENIENT functionality to higher levels
  – EFFICIENT functionality to higher levels
  – GENERAL: Used in many different ways
  – PORTABLE: Lasts through many generations
• Points of View
  – Provides different HW and SW interface

Page 3: Graduate Computer Architecture I


Processors Today

• General Purpose Register
• Split of CISC and RISC
• Development of RISC
  – Very Complex Design
  – No longer REDUCED
• Rapid Technology Advancements

                Capacity          Speed
  Logic         2x per 3 years    2x per 3 years
  DRAM          4x per 3 years    2x per 10 years
  Disk          4x per 3 years    2x per 10 years
  Processor     N/A               2x per 1.5 years

Page 4: Graduate Computer Architecture I


• Performance is in units of things per second
  – bigger is better

• If we are primarily concerned with response time

• " X is n times faster than Y" means

Performance(X) = 1 / Execution_Time(X)

n = Performance(X) / Performance(Y) = Execution_Time(Y) / Execution_Time(X)

Performance Definition
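The definition above is easy to sanity-check numerically. A minimal Python sketch; the execution times are made up purely for illustration:

```python
def performance(exec_time_s):
    # Performance is the reciprocal of execution time.
    return 1.0 / exec_time_s

# Hypothetical measurements: X runs a program in 2 s, Y in 6 s.
exec_x, exec_y = 2.0, 6.0

n = performance(exec_x) / performance(exec_y)   # equals exec_y / exec_x
print(f"X is {n:.1f} times faster than Y")      # X is 3.0 times faster than Y
```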

Page 5: Graduate Computer Architecture I


ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)

Amdahl’s Law
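A minimal numerical sketch of Amdahl's Law in Python; the fraction and enhancement speedups below are hypothetical, chosen only to show the diminishing returns and the upper bound:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup when only a fraction of execution time is improved.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

frac = 0.6   # hypothetical: 60% of execution time benefits from the enhancement
for s in (2, 10, 100, 1_000_000):
    print(f"enhancement {s:>7}x -> overall {amdahl_speedup(frac, s):.2f}x")

# The limit as the enhancement speedup grows without bound:
print(f"upper bound: {1.0 / (1.0 - frac):.2f}x")
```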

Page 6: Graduate Computer Architecture I


Semester Schedule Review

• Basic Architecture Organization: Weeks 2-4
  – Processors and Pipelining – Week 2
  – Memory Hierarchy and Cache Design – Week 3
  – Hazards and Predictions – Week 4
  – Quiz 1 – Week 4
• Quantitative Approach: Weeks 5-10
  – Instruction Level Parallelism – Week 5 & 6
  – Vector and Multi-Processors – Week 7 & 8
  – Storage and I/O – Week 9
  – Interconnects and Clustering – Week 10
  – Quiz 2 – Week 6
  – Quiz 3 – Week 9
• Advanced Topics: Weeks 11-15
  – Network Processors – Week 11
  – Reconfigurable Devices and SoC – Week 12
  – Low Power Hardware and Techniques – Week 12
  – HW and SW Co-design – Week 13
  – Other Topics – Week 14 & 15
  – Quiz 4 – Week 11

Page 7: Graduate Computer Architecture I


Administrative

• Course Web Site
  – http://www.arl.wustl.edu/~young/cse560m
• Xilinx Tools
  – May use Urbauer Room 116 Computers
  – Accounts will be available
  – ISE Version 7.1 and Modelsim 6.0a
    • http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe
    • http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe
• Prerequisite Course Text
  – (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition.
• Quizzes A and B
  – For your own benefit
  – May need prerequisite course text but not necessary
  – Look for answers on the WWW
• Project
  – Groups of 2-3 by Thursday
  – Weeks 1-5: Pipelined 32-bit Processor
  – Build on top of the basic Processor afterwards
  – Lectures at Urbauer Room 116 (project check points)
    • Sep 27, Oct 11, Nov 08, Nov 17, and Nov 22

Page 8: Graduate Computer Architecture I


Dies per wafer = [ π x (Wafer_diam / 2)^2 / Die_area ] - [ π x Wafer_diam / sqrt(2 x Die_area) ] - Test dies

Die yield = Wafer_yield x 1 / (1 + Defect_density x Die_area)

Die cost = Wafer cost / (Dies per wafer x Die yield)

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Fabricated IC Costs
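A small Python sketch of the cost chain above; every number below (wafer cost, diameter, defect density, and so on) is a made-up placeholder, and the yield model is the simple form shown on the slide:

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    # Usable whole dies on a round wafer, minus edge loss and test dies.
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
            - test_dies)

def die_yield(wafer_yield, defect_density_per_cm2, die_area_cm2):
    # Simple yield model: larger dies collect more defects.
    return wafer_yield / (1 + defect_density_per_cm2 * die_area_cm2)

# Hypothetical process numbers, for illustration only.
wafer_cost, wafer_diam, die_area = 5000.0, 30.0, 1.5      # $, cm, cm^2
dpw = dies_per_wafer(wafer_diam, die_area, test_dies=4)
dy = die_yield(wafer_yield=0.95, defect_density_per_cm2=0.6, die_area_cm2=die_area)

die_cost = wafer_cost / (dpw * dy)
ic_cost = (die_cost + 2.0 + 1.5) / 0.98    # + testing and packaging cost; / final test yield
print(f"dies/wafer={dpw:.0f}  die yield={dy:.2f}  die cost=${die_cost:.2f}  IC cost=${ic_cost:.2f}")
```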

Page 9: Graduate Computer Architecture I


Traditional CISC and RISC

• Reduced Instruction Set Computer
  – Smaller Design Footprint -> Reduced Cost
  – Essential Set of Instructions
  – Intuitively Larger Program
• Complex Instruction Set Computer
  – Complex set of desired Instructions
  – Pack many functions in one Instruction
  – Compact Program: Memory WAS Expensive
• RISC a better fit for the Changes
  – Cheaper Memory
  – Shorter Critical Path = Fast Clock Cycles
  – CISC Chips Integrated the RISC Concepts
  – Better Compilers
• RISC!? of Today
  – Very Complex and Large set of Instructions
  – The original motivation cannot be seen
  – High Performance and Throughput

[Figure: drawing analogy with primitives H-Line, V-Line, and Circle; a circle drawn with a lot of H- and V-Lines]

Page 10: Graduate Computer Architecture I


Real Performance Measurement

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
         = Inst Count x CPI x Cycle Time

                 Inst Count   CPI    Clock Rate
  Program        X
  Compiler       X            (X)
  Inst. Set      X            X
  Organization                X      X
  Technology                         X

CPU time is the REAL measure of computer performance, NOT clock rate and NOT CPI.
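A tiny Python sketch of the decomposition above, with made-up numbers for the instruction count, CPI, and clock rate:

```python
def cpu_time(inst_count, cpi, clock_rate_hz):
    # CPU time = instructions x cycles/instruction x seconds/cycle
    return inst_count * cpi * (1.0 / clock_rate_hz)

# Hypothetical program: 2 billion instructions, CPI of 1.5, 1 GHz clock.
t = cpu_time(inst_count=2e9, cpi=1.5, clock_rate_hz=1e9)
print(f"CPU time = {t:.2f} s")   # 3.00 s
```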

Page 11: Graduate Computer Architecture I


“Average Cycles per Instruction”

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

CPU time = Cycle Time x Σ (CPI_j x I_j),  summed over j = 1 … n

“Instruction Frequency”

CPI = Σ (CPI_j x F_j),  where F_j = I_j / Instruction Count

Cycles Per Instruction

Page 12: Graduate Computer Architecture I


Typical mix of instruction types in a program

Base Machine (Reg / Reg)

  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        0.5      (33%)
  Load     20%    2        0.4      (27%)
  Store    10%    2        0.2      (13%)
  Branch   20%    2        0.4      (27%)
  Total CPI:               1.5

Design guideline: Make the common case fast

MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.

Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Calculating CPI
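The table's numbers can be reproduced directly. A short Python check of the weighted-CPI formula from the previous slide, using the mix above:

```python
# (frequency, cycles) for each instruction class from the table above
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(f"CPI = {cpi:.2f}")                       # 1.50

for op, (freq, cycles) in mix.items():
    share = freq * cycles / cpi                 # fraction of total execution time
    print(f"{op:<6} CPI contribution {freq * cycles:.2f}  ({share:.0%} of time)")
```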

Page 13: Graduate Computer Architecture I


Impact of Stalls

• Assume CPI = 1.0 ignoring branches (ideal)
• Assume the solution was stalling for 3 cycles
• If 30% of instructions are branches, stall 3 cycles on 30%

  Op       Freq   Cycles   CPI(i)   (% Time)
  Other    70%    1        0.7      (37%)
  Branch   30%    4        1.2      (63%)

  new CPI = 1.9

• The machine runs at 1/1.9 ≈ 0.53 times the speed of the ideal machine
  – Far from ideal
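A minimal Python check of the stall arithmetic above:

```python
ideal_cpi = 1.0
branch_freq, branch_stall = 0.30, 3           # 30% branches, 3 stall cycles each

new_cpi = ideal_cpi + branch_freq * branch_stall
relative_speed = ideal_cpi / new_cpi          # performance relative to the ideal machine
print(f"new CPI = {new_cpi:.1f}, relative speed = {relative_speed:.2f}x")   # 1.9, 0.53x
```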

Page 14: Graduate Computer Architecture I


Instruction Set Architecture Design

• Definition
  – Set of Operations
  – Instruction Format
  – Hardware Data Types
  – Named Storage
  – Addressing Modes and Sequencing
• Description in Register Transfer Language
  – Intermediate Representation
  – Map Instructions to RTLs
• Technology Constraint Considerations
  – Architected storage mapped to actual storage
  – Function units to do all the required operations
  – Possible additional storage (e.g. MAddressR, MBufferR, …)
  – Interconnect to move information among regs and FUs
• Controller
  – Sequences into symbolic controller state transition diagram (STD)
  – Lower symbolic STD to control points
  – Controller Implementation

Page 15: Graduate Computer Architecture I


[Figure: typical load/store processor datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Typical Load/Store Processor

Page 16: Graduate Computer Architecture I


General instruction format: the top 4 bits (31-28) hold the opcode; the remaining 28 bits vary according to instruction type.

R-type instruction:  opcode [31:28] | rs [27:25] | rt [24:22] | rd [21:19] | unused | funct [3:0]
I-type instruction:  opcode [31:28] | rs [27:25] | rt [24:22] | unused | imm16 [15:0]
J-type instruction:  opcode [31:28] | unused | imm16 [15:0]

Instruction Format
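As a concrete illustration of these fields, here is a hedged Python sketch that packs and unpacks a 32-bit R-type word using the bit positions above; the field values and the mnemonic in the example are invented, not taken from the lecture:

```python
def encode_r_type(opcode, rs, rt, rd, funct):
    # Pack fields into a 32-bit word: opcode[31:28] rs[27:25] rt[24:22] rd[21:19] ... funct[3:0]
    assert opcode < 16 and rs < 8 and rt < 8 and rd < 8 and funct < 16
    return (opcode << 28) | (rs << 25) | (rt << 22) | (rd << 19) | funct

def decode_r_type(word):
    return {
        "opcode": (word >> 28) & 0xF,
        "rs":     (word >> 25) & 0x7,
        "rt":     (word >> 22) & 0x7,
        "rd":     (word >> 19) & 0x7,
        "funct":  word & 0xF,
    }

# Hypothetical encoding of "add r1, r2, r3" with opcode 0 and funct 2.
word = encode_r_type(opcode=0, rs=2, rt=3, rd=1, funct=2)
print(hex(word), decode_r_type(word))
```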

Page 17: Graduate Computer Architecture I


[Figure: datapath paths exercised by R-type, I-type, and J-type instructions]

Instruction Type Datapath

Page 18: Graduate Computer Architecture I


Clothes Washing Process

[Figure: three sequential laundry stages taking 30 minutes, 35 minutes, and 25 minutes]

One set of clothes in 1 hour 30 minutes

Page 19: Graduate Computer Architecture I


Pipelining Laundry

[Figure: pipelined laundry timeline: loads overlap through the 30-, 35-, and 25-minute stages]

Three sets of clean clothes in 2 hours 40 minutes

With a large number of sets, each load takes an average of ~35 minutes

3X Increase in Productivity!!!
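A small simulation makes the laundry numbers concrete. This Python sketch computes finish times for sequential versus pipelined operation with the stage lengths from the slide:

```python
stages = [30, 35, 25]            # stage lengths in minutes, from the slide
loads = 3

sequential = loads * sum(stages)                        # 3 x 90 = 270 min
bottleneck = max(stages)
pipelined = sum(stages) + (loads - 1) * bottleneck      # 90 + 2*35 = 160 min (2 h 40 min)
print(f"sequential: {sequential} min, pipelined: {pipelined} min")

# With many loads, throughput approaches one finished load per bottleneck stage time.
many = 1000
avg = (sum(stages) + (many - 1) * bottleneck) / many
print(f"average time per load for {many} loads: {avg:.1f} min")    # ~35 min
```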

Page 20: Graduate Computer Architecture I


Introducing Problems

• Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (a single person trying to dry and iron clothes simultaneously)
  – Data hazards: Instruction depends on the result of a prior instruction still in the pipeline (missing sock: needs both before putting them away)
  – Control hazards: Caused by the delay between the fetching of instructions and decisions about changes in control flow (Er…branch & jump)

Page 21: Graduate Computer Architecture I


[Figure: pipeline diagram over cycles 1-7. A Load, Instr 1, and Instr 2 each flow through Ifetch, Reg, ALU, DMem, Reg; Instr 3 must stall (bubbles) because its instruction fetch needs the same memory port as the Load's data access]

One Memory Port/Structural Hazards

Instruction Fetch as well as Load from Memory

Page 22: Graduate Computer Architecture I


Speed Up Equation for Pipelining

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

For simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)
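A hedged Python sketch of the speedup equation; the pipeline depth, stall CPI, and cycle-time ratio below are illustrative assumptions, not values from the lecture:

```python
def pipeline_speedup(depth, ideal_cpi, stall_cpi, cycle_time_ratio=1.0):
    # Slide formula: (Ideal CPI x depth) / (Ideal CPI + stall CPI) x (CT_unpipelined / CT_pipelined)
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

# Simple RISC case: ideal CPI = 1; assume (hypothetically) a 5-stage pipeline,
# 0.3 stall cycles per instruction, and equal cycle times (ratio = 1).
print(f"{pipeline_speedup(depth=5, ideal_cpi=1.0, stall_cpi=0.3):.2f}x")    # ~3.85x
```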

Page 23: Graduate Computer Architecture I


Memory and Pipeline

• Machine A: Dual ported memory
• Machine B: Single ported memory
  – 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
  – FreqRatio = Clock_unpipe / Clock_pipe
  – SpeedUp_A = (Pipeline Depth / (1 + 0)) x FreqRatio
              = Pipeline Depth x FreqRatio
  – SpeedUp_B = (Pipeline Depth / (1 + 0.4 x 1)) x FreqRatio x 1.05
              = Pipeline Depth x 0.75 x FreqRatio
  – SpeedUp_A / SpeedUp_B = 1.33

• Machine A is 1.33 times faster
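The 1.33 result can be checked with a few lines of Python; pipeline depth and FreqRatio cancel in the ratio, so the placeholder values below are arbitrary:

```python
depth, freq_ratio = 5.0, 1.0      # arbitrary; they cancel in the A/B ratio
load_freq, load_stall = 0.4, 1    # Machine B stalls 1 cycle on each load (40% of instructions)

speedup_a = depth / (1 + 0) * freq_ratio
speedup_b = depth / (1 + load_freq * load_stall) * freq_ratio * 1.05    # B's clock is 1.05x faster
print(f"A is {speedup_a / speedup_b:.2f}x faster than B")               # 1.33x
```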

Page 24: Graduate Computer Architecture I


Instr. Order (time in clock cycles; stages IF, ID/RF, EX, MEM, WB):

  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure: pipeline diagram: each instruction flows through Ifetch, Reg, ALU, DMem, Reg; the following instructions read r1 before the add has written it back]

Data Hazard on r1

Page 25: Graduate Computer Architecture I


• Read After Write (RAW): Instr_J tries to read an operand before Instr_I writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Data Hazards

Page 26: Graduate Computer Architecture I


• Write After Read (WAR): Instr_J writes an operand before Instr_I reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the DLX 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Data Hazards

Page 27: Graduate Computer Architecture I


• Write After Write (WAW): Instr_J writes an operand before Instr_I writes it

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “output dependence” by compiler writers
• This also results from the reuse of the name “r1”
• Can’t happen in the DLX 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipelines

Data Hazards
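A hedged Python sketch that classifies these three hazard types between a pair of instructions, using a toy representation where each instruction lists the register it writes and the registers it reads; the tuple format and example values are invented for the illustration:

```python
def hazards(instr_i, instr_j):
    """Classify data hazards when instr_j follows instr_i in program order.
    Each instruction is (written_register, [read_registers])."""
    write_i, reads_i = instr_i
    write_j, reads_j = instr_j
    found = []
    if write_i and write_i in reads_j:
        found.append("RAW")            # J reads what I writes
    if write_j and write_j in reads_i:
        found.append("WAR")            # J writes what I reads
    if write_i and write_i == write_j:
        found.append("WAW")            # both write the same register
    return found

i = ("r1", ["r2", "r3"])               # add r1,r2,r3
j = ("r4", ["r1", "r3"])               # sub r4,r1,r3
print(hazards(i, j))                   # ['RAW']
```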

Page 28: Graduate Computer Architecture I


Instr. Order (time in clock cycles):

  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure: the same pipeline diagram, now with forwarding paths feeding the add's ALU result directly to the ALU inputs of the dependent instructions]

Solution: Data Forwarding

Page 29: Graduate Computer Architecture I


[Figure: forwarding datapath: the ID/EX, EX/MEM, and MEM/WB pipeline registers, with muxes at the ALU inputs selecting among the Registers, the EX/MEM latch, and the MEM/WB latch; the NextPC, Immediate, and Data Memory blocks complete the datapath]

HW Change for Forwarding

Page 30: Graduate Computer Architecture I


Instr. Order (time in clock cycles):

  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure: pipeline diagram: even with forwarding, the instructions that use r1 must stall, inserting bubbles, because the loaded data is not available until after the lw's MEM stage]

Data Hazard Even with Forwarding


Page 31: Graduate Computer Architecture I


Try producing fast code for

a = b + c;

d = e - f;

assuming a, b, c, d, e, and f are in memory. Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

Software Scheduling

Fast code:

LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.
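The benefit is simply the removal of load-use stalls. A hedged Python sketch that counts them for both orderings under a simple one-cycle load-delay rule; the (op, dest, sources) tuple encoding is invented for the example:

```python
def load_use_stalls(code):
    # Count 1-cycle stalls when an instruction uses a register loaded by the
    # immediately preceding LW (simple load-delay model, forwarding assumed).
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        op_p, dest_p, _ = prev
        _, _, srcs_c = curr
        if op_p == "LW" and dest_p in srcs_c:
            stalls += 1
    return stalls

slow = [("LW","Rb",["b"]), ("LW","Rc",["c"]), ("ADD","Ra",["Rb","Rc"]), ("SW",None,["a","Ra"]),
        ("LW","Re",["e"]), ("LW","Rf",["f"]), ("SUB","Rd",["Re","Rf"]), ("SW",None,["d","Rd"])]
fast = [("LW","Rb",["b"]), ("LW","Rc",["c"]), ("LW","Re",["e"]), ("ADD","Ra",["Rb","Rc"]),
        ("LW","Rf",["f"]), ("SW",None,["a","Ra"]), ("SUB","Rd",["Re","Rf"]), ("SW",None,["d","Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))   # 2 stalls vs 0
```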

Page 32: Graduate Computer Architecture I


10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

[Figure: pipeline diagram: the beq flows through the pipeline while the and, or, and add at addresses 14, 18, and 22 are fetched behind it, before the branch outcome is known]

What do you do with the 3 instructions in between?

How do you do it?

Where is the “commit”?

Control Hazard on Branches

Page 33: Graduate Computer Architecture I


Branch Hazard Alternatives

• Stall until branch direction is clear
• Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% DLX branches not taken on average
  – PC+4 already calculated, so use it to get next instr
• Predict Branch Taken
  – 53% DLX branches taken on average
  – DLX still incurs 1 cycle branch penalty
  – Other machines: branch target known before outcome

Page 34: Graduate Computer Architecture I


• Delayed Branch
  – Define the branch to take place AFTER a following instruction (fill in the Branch Delay Slot)

Branch delay of length n:

  branch instruction
  sequential successor_1
  sequential successor_2
  ........
  sequential successor_n
  branch target if taken

  – A 1-slot delay allows proper decision and branch target address in a 5-stage pipeline

Branch Hazard Alternatives

Page 35: Graduate Computer Architecture I


Evaluating Branch Alternatives

  Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline       3                1.42   3.5                      1.0
  Predict taken        1                1.14   4.4                      1.26
  Predict not taken    1                1.09   4.5                      1.29
  Delayed branch       0.5              1.07   4.6                      1.31

Branches (conditional & unconditional) = 14% of instructions; 65% of them change the PC (are taken)

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
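The CPI column can be reproduced from the branch statistics above. A short Python check; the effective penalty for "predict not taken" applies only to the 65% of branches that are taken, and the pipeline depth of 5 is assumed for the speedup column:

```python
branch_freq, taken_frac, depth = 0.14, 0.65, 5   # 14% branches, 65% taken, assumed 5-stage pipeline

schemes = {                          # scheme: average branch penalty in cycles
    "Stall pipeline":    3,
    "Predict taken":     1,
    "Predict not taken": 1 * taken_frac,   # only taken branches pay the 1-cycle penalty
    "Delayed branch":    0.5,
}

for name, penalty in schemes.items():
    cpi = 1 + branch_freq * penalty
    speedup = depth / cpi                   # pipeline speedup vs. unpipelined
    print(f"{name:<18} CPI={cpi:.2f}  speedup={speedup:.1f}")
```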

Page 36: Graduate Computer Architecture I


Conclusion

• Instruction Set Architecture
  – Things to consider when designing a new ISA
• Processor
  – Concept behind Pipelining
  – Five-Stage Pipeline RISC
  – Proper Processor Performance Evaluation
• Limitations of Pipelining
  – Structural, Data, and Control Hazards
  – Techniques to Recover Performance
  – Re-evaluating Speed-ups