graduate computer architecture i
![Page 1: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/1.jpg)
Lecture 2: Processor and Pipelining
Young Cho
Graduate Computer Architecture I
![Page 2: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/2.jpg)
2 - CSE/ESE 560M – Graduate Computer Architecture I
Instruction Set Architecture
• Set of Elementary Commands
• Good ISA …
  – CONVENIENT functionality to higher levels
  – EFFICIENT functionality to higher levels
  – GENERAL: Used in many different ways
  – PORTABLE: Lasts through many generations
• Points of View
  – Provides different HW and SW interface
![Page 3: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/3.jpg)
Processors Today
• General Purpose Register
• Split of CISC and RISC
• Development of RISC
  – Very Complex Design
  – No longer REDUCED
• Rapid Technology Advancements

| | Capacity | Speed |
|---|---|---|
| Logic | 2x per 3 years | 2x per 3 years |
| DRAM | 4x per 3 years | 2x per 10 years |
| Disk | 4x per 3 years | 2x per 10 years |
| Processor | N/A | 2x per 1.5 years |
![Page 4: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/4.jpg)
Performance Definition
• Performance is in units of things per second
  – bigger is better
• If we are primarily concerned with response time:

$$\text{performance}(x) = \frac{1}{\text{exec\_time}(x)}$$

• "X is n times faster than Y" means:

$$n = \frac{\text{performance}(x)}{\text{performance}(y)} = \frac{\text{exec\_time}(y)}{\text{exec\_time}(x)}$$
![Page 5: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/5.jpg)
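Amdahl’s Law

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}}$$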
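The Amdahl's Law relations above can be tried out numerically. The following is a minimal sketch; the function names and the sample values (40% of the run time enhanced, sped up 10x) are illustrative assumptions, not figures from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the old execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New time relative to old: untouched part plus the shortened enhanced part.
    relative_new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_new_time


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on speedup as the enhancement becomes infinitely fast."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Assumed example: 40% of the time is enhanced by a factor of 10.
    print(amdahl_speedup(0.4, 10.0))
    print(amdahl_limit(0.4))
```

Even with a 10x enhancement, the untouched 60% of the run time keeps the overall speedup below the limit of 1 / 0.6.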
![Page 6: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/6.jpg)
Semester Schedule Review
• Basic Architecture Organization: Weeks 2-4
  – Processors and Pipelining – Week 2
  – Memory Hierarchy and Cache Design – Week 3
  – Hazards and Predictions – Week 4
  – Quiz 1 – Week 4
• Quantitative Approach: Weeks 5-10
  – Instruction Level Parallelism – Week 5 & 6
  – Vector and Multi-Processors – Week 7 & 8
  – Storage and I/O – Week 9
  – Interconnects and Clustering – Week 10
  – Quiz 2 – Week 6
  – Quiz 3 – Week 9
• Advanced Topics: Weeks 11-15
  – Network Processors – Week 11
  – Reconfigurable Devices and SoC – Week 12
  – Low Power Hardware and Techniques – Week 12
  – HW and SW Co-design – Week 13
  – Other Topics – Week 14 & 15
  – Quiz 4 – Week 11
![Page 7: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/7.jpg)
Administrative
• Course Web Site
  – http://www.arl.wustl.edu/~young/cse560m
• Xilinx Tools
  – May use Urbauer Room 116 Computers
  – Accounts will be available
  – ISE Version 7.1 and Modelsim 6.0a
    • http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe
    • http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe
• Prerequisite Course Text
  – (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition.
• Quizzes A and B
  – For your own benefit
  – May need prerequisite course text but not necessary
  – Look for answers on the WWW
• Project
  – Groups of 2-3 by Thursday
  – Weeks 1-5: Pipelined 32bit Processor
  – Build on top of the basic Processor afterwards
  – Lectures at Urbauer Room 116 (project check points)
    • Sep 27, Oct 11, Nov 08, Nov 17, and Nov 22
![Page 8: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/8.jpg)
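Fabricated IC Costs

$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \, (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \cdot \text{Wafer\_diam}}{\sqrt{2 \cdot \text{Die\_Area}}} - \text{Test\_Die}$$

$$\text{Die yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha}$$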
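The cost relations above chain together: dies per wafer and die yield feed die cost, which feeds IC cost. Here is a rough sketch of that chain, assuming the standard textbook form of the yield model with a process parameter `alpha`; all function names and sample numbers are hypothetical.

```python
import math


def dies_per_wafer(wafer_diam: float, die_area: float, test_dies: float = 0.0) -> float:
    """Wafer area over die area, minus dies lost along the edge and to test sites."""
    wafer_area = math.pi * (wafer_diam / 2.0) ** 2
    edge_loss = math.pi * wafer_diam / math.sqrt(2.0 * die_area)
    return wafer_area / die_area - edge_loss - test_dies


def die_yield(wafer_yield: float, defect_density: float, die_area: float, alpha: float) -> float:
    """Fraction of good dies; alpha is an assumed process-dependent parameter."""
    return wafer_yield * (1.0 + defect_density * die_area / alpha) ** (-alpha)


def die_cost(wafer_cost: float, n_dies: float, yield_fraction: float) -> float:
    return wafer_cost / (n_dies * yield_fraction)


def ic_cost(die: float, testing: float, packaging: float, final_test_yield: float) -> float:
    return (die + testing + packaging) / final_test_yield


if __name__ == "__main__":
    # Assumed values: 30 cm wafer, 1 cm^2 die, 0.6 defects/cm^2, alpha = 4.
    n = dies_per_wafer(30.0, 1.0)
    y = die_yield(1.0, 0.6, 1.0, 4.0)
    print(n, y, die_cost(5000.0, n, y))
```

Because yield falls as die area grows while dies per wafer also shrink, die cost rises faster than linearly with area.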
![Page 9: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/9.jpg)
Traditional CISC and RISC
• Reduced Instruction Set Computer
  – Smaller Design Footprint, Reduced Cost
  – Essential Set of Instructions
  – Intuitively Larger Program
• Complex Instruction Set Computer
  – Complex set of desired Instructions
  – Pack many functions in one Instruction
  – Compact Program: Memory WAS Expensive
• RISC a better fit for the Changes
  – Cheaper Memory
  – Shorter Critical Path = Fast Clock Cycles
  – CISC Chips Integrated the RISC Concepts
  – Better Compilers
• RISC!? of Today
  – Very Complex and Large set of Instructions
  – The original motivation cannot be seen
  – High Performance and Throughput

[Figure: drawing analogy. One instruction set offers H-Line, V-Line, and Circle; the other draws with a lot of H- and V-Lines]
![Page 10: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/10.jpg)
Real Performance Measurement

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{InstCount}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Cycle}}}_{\text{CycleTime}}$$

| | Inst Count | CPI | Clock Rate |
|---|---|---|---|
| Program | X | | |
| Compiler | X | (X) | |
| Inst Set | X | X | |
| Organization | | X | X |
| Technology | | | X |

CPU time is the REAL measure of computer performance, NOT clock rate and NOT CPI.
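To see why CPU time, rather than clock rate or CPI alone, is the measure that matters, the product of the three factors can be evaluated directly. This is a small sketch; the two machines and their numbers are invented for illustration.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time


if __name__ == "__main__":
    # Hypothetical machines running the same 1e9-instruction program.
    fast_clock = cpu_time(1e9, cpi=2.0, clock_rate_hz=2.0e9)  # higher clock, worse CPI
    slow_clock = cpu_time(1e9, cpi=1.2, clock_rate_hz=1.5e9)  # lower clock, better CPI
    print(fast_clock, slow_clock)
```

In this made-up case the machine with the lower clock rate finishes first, because its lower CPI more than compensates.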
![Page 11: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/11.jpg)
Cycles Per Instruction

“Average Cycles per Instruction”

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$$

“Instruction Frequency”

$$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}}$$
![Page 12: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/12.jpg)
Calculating CPI
Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Base Machine (Reg / Reg): typical mix of instruction types in program

| Op | Freq | Cycles | CPI(i) | (% Time) |
|---|---|---|---|---|
| ALU | 50% | 1 | .5 | (33%) |
| Load | 20% | 2 | .4 | (27%) |
| Store | 10% | 2 | .2 | (13%) |
| Branch | 20% | 2 | .4 | (27%) |
| Total | | | 1.5 | |

Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
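The table's CPI(i) and % Time columns follow mechanically from the frequency and cycle columns. A short sketch of that arithmetic, using the slide's instruction mix (the dictionary layout and function names are just one possible arrangement):

```python
# Instruction mix from the slide: op -> (frequency, cycles per instruction).
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of F_j * CPI_j."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix: dict) -> dict:
    """Fraction of total execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print(average_cpi(BASE_MIX))
    for op, share in time_share(BASE_MIX).items():
        print(f"{op}: {share:.0%}")
```

The time shares show where optimization pays off: ALU operations are half the instructions but only a third of the time.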
![Page 13: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/13.jpg)
Impact of Stalls
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, Stall 3 cycles on 30%

| Op | Freq | Cycles | CPI(i) | (% Time) |
|---|---|---|---|---|
| Other | 70% | 1 | .7 | (37%) |
| Branch | 30% | 4 | 1.2 | (63%) |

new CPI = 1.9

• The machine runs at 1/1.9 ≈ 0.53 times the ideal speed
  – Far from ideal
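The stall arithmetic above generalizes to any branch frequency and stall length. A minimal sketch, with hypothetical function names, that reproduces the slide's numbers:

```python
def cpi_with_stalls(base_cpi: float, branch_freq: float, stall_cycles: float) -> float:
    """Every branch pays stall_cycles extra cycles on top of the base CPI."""
    return base_cpi + branch_freq * stall_cycles


def relative_speed(base_cpi: float, stalled_cpi: float) -> float:
    """Speed relative to the ideal machine at the same clock and instruction count."""
    return base_cpi / stalled_cpi


if __name__ == "__main__":
    cpi = cpi_with_stalls(base_cpi=1.0, branch_freq=0.30, stall_cycles=3)
    print(cpi, relative_speed(1.0, cpi))
```

Halving the stall penalty or the branch frequency in this model shrinks only the added term, which is why later slides focus on reducing the branch penalty.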
![Page 14: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/14.jpg)
Instruction Set Architecture Design
• Definition
  – Set of Operations
  – Instruction Format
  – Hardware Data Types
  – Named Storage
  – Addressing Modes and Sequencing
• Description in Register Transfer Language
  – Intermediate Representation
  – Map Instruction to RTLs
• Technology Constraint Considerations
  – Architected storage mapped to actual storage
  – Function units to do all the required operations
  – Possible additional storage (e.g. MAddressR, MBufferR, …)
  – Interconnect to move information among regs and FUs
• Controller
  – Sequences into symbolic controller state transition diagram (STD)
  – Lower symbolic STD to control points
  – Controller Implementation
![Page 15: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/15.jpg)
Typical Load/Store Processor
[Figure: datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
![Page 16: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/16.jpg)
Instruction Format
General instruction format: bits 31–28 hold the 4-bit opcode; the remaining 28 bits vary according to instruction type.

| Type | Fields (bit positions, high to low) |
|---|---|
| R-type | opcode [31:28], rs [27:25], rt [24:22], rd [21:19], unused, funct [3:0] |
| I-type | opcode [31:28], rs [27:25], rt [24:22], unused, imm16 [15:0] |
| J-type | opcode [31:28], unused, imm16 [15:0] |
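The field layout above maps directly onto shift-and-mask operations. The sketch below decodes a 32-bit word using the bit positions listed on the slide; the helper names and the sample opcode and funct values are made up, since the lecture does not give an opcode table here.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def decode_r(word: int) -> dict:
    return {
        "opcode": bits(word, 31, 28),
        "rs": bits(word, 27, 25),
        "rt": bits(word, 24, 22),
        "rd": bits(word, 21, 19),
        "funct": bits(word, 3, 0),
    }


def decode_i(word: int) -> dict:
    return {
        "opcode": bits(word, 31, 28),
        "rs": bits(word, 27, 25),
        "rt": bits(word, 24, 22),
        "imm16": bits(word, 15, 0),
    }


def decode_j(word: int) -> dict:
    return {"opcode": bits(word, 31, 28), "imm16": bits(word, 15, 0)}


def encode_r(opcode: int, rs: int, rt: int, rd: int, funct: int) -> int:
    """Pack R-type fields; unused bits are left as zero."""
    return ((opcode & 0xF) << 28) | ((rs & 0x7) << 25) | ((rt & 0x7) << 22) \
        | ((rd & 0x7) << 19) | (funct & 0xF)


if __name__ == "__main__":
    # Hypothetical encoding: opcode 3, rs=1, rt=2, rd=5, funct 9.
    word = encode_r(0x3, 1, 2, 5, 0x9)
    print(hex(word), decode_r(word))
```

The 3-bit register fields imply eight addressable registers, and the fixed opcode position lets the decoder pick the format before looking at any other field.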
![Page 17: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/17.jpg)
Instruction Type Datapath
[Figure: datapath usage for R-type instructions, I-type instructions, and J-type instructions]
![Page 18: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/18.jpg)
Clothes Washing Process
[Figure: three laundry steps taking 30 minutes, 35 minutes, and 25 minutes]
One set of clothes takes 1 hour 30 minutes
![Page 19: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/19.jpg)
Pipelining Laundry
[Figure: three overlapped loads; timeline segments of 30, 35, 35, 35, and 25 minutes]
Three sets of clean clothes in 2 hours 40 minutes
With a large number of sets, each load takes an average of ~35 min to wash
Nearly 3X increase in productivity (90 min vs. ~35 min per set, about 2.6X)
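The laundry timings can be checked with a small model: the first load fills the pipeline, and every later load finishes one bottleneck interval after the previous one. This is a sketch under that assumption, with identical loads and stage times taken from the slide.

```python
def sequential_time(stage_minutes: list, loads: int) -> int:
    """Each load runs start to finish before the next begins."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes: list, loads: int) -> int:
    """First load takes the full latency; each extra load adds the slowest stage."""
    if loads == 0:
        return 0
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


if __name__ == "__main__":
    stages = [30, 35, 25]  # minutes per laundry step, from the slide
    for n in (1, 3, 100):
        seq, pipe = sequential_time(stages, n), pipelined_time(stages, n)
        print(n, seq, pipe, round(seq / pipe, 2))
```

The speedup approaches 90 / 35 rather than the stage count of 3 because the stages are unbalanced; the slowest stage sets the rate.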
![Page 20: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/20.jpg)
Introducing Problems
• Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to dry and iron clothes simultaneously)
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock – needs both before putting them away)
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
![Page 21: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/21.jpg)
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1–7 showing Load, Instr 1, Instr 2, a stall, and Instr 3 moving through Ifetch, Reg, ALU, DMem, Reg; the stalled slot is filled with bubbles]
A single memory port must serve instruction fetch as well as loads from memory
![Page 22: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/22.jpg)
Speed Up Equation for Pipelining

$$\text{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average stall cycles per Inst}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$
![Page 23: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/23.jpg)
Memory and Pipeline
• Machine A: Dual ported memory
• Machine B: Single ported memory
  – 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed; each load stalls Machine B for 1 cycle
  – FreqRatio = Clock_unpipe / Clock_pipe (ratio of clock cycle times)
  – SpeedUp_A = (Pipeline Depth / (1 + 0)) × FreqRatio = Pipeline Depth × FreqRatio
  – SpeedUp_B = (Pipeline Depth / (1 + 0.4 × 1)) × FreqRatio × 1.05 = Pipeline Depth × 0.75 × FreqRatio
  – SpeedUp_A / SpeedUp_B = 1 / 0.75 = 1.33
• Machine A is 1.33 times faster
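The comparison above is the pipeline speedup equation evaluated twice. A brief sketch, assuming a pipeline depth of 5 and a FreqRatio of 1 purely as placeholders (both cancel in the A/B ratio):

```python
def pipeline_speedup(depth: float, stall_cpi: float,
                     cycle_time_ratio: float = 1.0, ideal_cpi: float = 1.0) -> float:
    """Speedup over the unpipelined machine.

    cycle_time_ratio is unpipelined cycle time divided by pipelined cycle time.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio


if __name__ == "__main__":
    depth = 5          # assumed; the slide leaves depth symbolic
    freq_ratio = 1.0   # assumed; cancels in the comparison

    # Machine A: dual-ported memory, no structural stalls.
    speedup_a = pipeline_speedup(depth, stall_cpi=0.0, cycle_time_ratio=freq_ratio)
    # Machine B: 40% loads each stall 1 cycle, but the clock is 1.05x faster.
    speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, cycle_time_ratio=freq_ratio * 1.05)

    print(speedup_a, speedup_b, speedup_a / speedup_b)
```

A 5% clock advantage is far too small to offset a structural hazard that hits 40% of instructions.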
![Page 24: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/24.jpg)
Data Hazard on r1
Instruction order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
![Page 25: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/25.jpg)
Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
![Page 26: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/26.jpg)
Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
![Page 27: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/27.jpg)
Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “output dependence” by compiler writers
• This also results from the reuse of name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in complicated pipelines
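The three cases differ only in which register roles overlap between an earlier instruction I and a later instruction J. The sketch below names the dependence for a pair of instructions; the `Instr` type and `dependences` function are hypothetical helpers, and whether a dependence becomes an actual hazard still depends on the pipeline, as the slides note for DLX.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify register dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")  # true dependence: second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")  # anti-dependence: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # output dependence: both write the same register
    return found


if __name__ == "__main__":
    # Instruction pairs taken from the three Data Hazards slides.
    print(dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
    print(dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
    print(dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))
```

Only RAW reflects real data flow; WAR and WAW come from reusing a register name and can be removed by renaming.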
![Page 28: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/28.jpg)
Solution: Data Forwarding
Instruction order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram over time (clock cycles); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, with the r1 result forwarded to the dependent instructions]
![Page 29: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/29.jpg)
HW Change for Forwarding
[Figure: forwarding datapath with Registers, NextPC, Immediate, ALU, and Data Memory, the ID/EX, EX/MEM, and MEM/WR pipeline registers, and three muxes selecting the ALU inputs and forwarded values]
![Page 30: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/30.jpg)
Data Hazard Even with Forwarding
Instruction order:
  lw r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or r8,r1,r9
[Figure: pipeline diagram over time (clock cycles); the instructions after the lw are delayed by one cycle, shown as a bubble in each of their rows]
![Page 31: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/31.jpg)
Software Scheduling
Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f in memory.

Slow code:
  LW Rb,b
  LW Rc,c
  ADD Ra,Rb,Rc
  SW a,Ra
  LW Re,e
  LW Rf,f
  SUB Rd,Re,Rf
  SW d,Rd

Fast code:
  LW Rb,b
  LW Rc,c
  LW Re,e
  ADD Ra,Rb,Rc
  LW Rf,f
  SW a,Ra
  SUB Rd,Re,Rf
  SW d,Rd

Compiler optimizes for performance. Hardware checks for safety.
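The benefit of the reordering can be counted. The sketch below assumes a pipeline with forwarding where only a load immediately followed by a consumer of its result costs one stall cycle, as on the "Data Hazard Even with Forwarding" slide; the parser and function names are illustrative and handle only the LW, SW, ADD, and SUB forms used here.

```python
def parse(line: str):
    """Return (op, dest_register, source_registers) for one assembly line."""
    op, operands = line.split(maxsplit=1)
    args = [a.strip() for a in operands.split(",")]
    if op == "LW":            # LW Rd,addr : writes Rd, reads a memory label
        return op, args[0], []
    if op == "SW":            # SW addr,Rs : reads Rs, writes memory
        return op, None, [args[1]]
    return op, args[0], args[1:]   # ALU op: dest, src1, src2


def load_use_stalls(program: list) -> int:
    """Count loads whose result is needed by the very next instruction."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "LW" and prev_dest in srcs:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


SLOW = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
FAST = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

if __name__ == "__main__":
    print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under this model the fast ordering hides both load delays by placing an independent instruction after each critical load.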
![Page 32: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/32.jpg)
Control Hazard on Branches
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
• What do you do with the 3 instructions in between?
• How do you do it?
• Where is the “commit”?
![Page 33: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/33.jpg)
Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% DLX branches not taken on average
  – PC+4 already calculated, so use it to get next instr
• Predict Branch Taken
  – 53% DLX branches taken on average
  – DLX still incurs 1 cycle branch penalty
  – Other machines: branch target known before outcome
![Page 34: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/34.jpg)
Branch Hazard Alternatives
• Delayed Branch
  – Define branch to take place AFTER a following instruction (Fill in Branch Delay Slot)

    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n     (branch delay of length n)
    branch target if taken

  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
![Page 35: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/35.jpg)
Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

| Scheduling scheme | Branch penalty | CPI | Speedup v. unpipelined | Speedup v. stall |
|---|---|---|---|---|
| Stall pipeline | 3 | 1.42 | 3.5 | 1.0 |
| Predict taken | 1 | 1.14 | 4.4 | 1.26 |
| Predict not taken | 1 | 1.09 | 4.5 | 1.29 |
| Delayed branch | 0.5 | 1.07 | 4.6 | 1.31 |

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
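The CPI column can be reproduced from the branch frequency and penalties. The sketch below assumes a pipeline depth of 5 (consistent with the 5 stage pipeline discussed earlier) and that predict-not-taken pays its penalty only on the 65% of branches that change the PC; the scheme table and function names are illustrative. Computed speedups can differ from the slide's in the last digit, since the slide's values look rounded or truncated.

```python
BRANCH_FREQ = 0.14      # conditional and unconditional branches, from the slide
PIPELINE_DEPTH = 5      # assumed depth

# scheme -> (branch penalty in cycles, fraction of branches that pay it)
SCHEMES = {
    "Stall pipeline": (3.0, 1.0),
    "Predict taken": (1.0, 1.0),
    "Predict not taken": (1.0, 0.65),
    "Delayed branch": (0.5, 1.0),
}


def branch_cpi(penalty: float, fraction_paying: float,
               branch_freq: float = BRANCH_FREQ) -> float:
    """CPI = 1 + branch frequency x effective branch penalty."""
    return 1.0 + branch_freq * penalty * fraction_paying


def speedup_vs_unpipelined(cpi: float, depth: int = PIPELINE_DEPTH) -> float:
    return depth / cpi


if __name__ == "__main__":
    stall_cpi = branch_cpi(*SCHEMES["Stall pipeline"])
    for name, (penalty, fraction) in SCHEMES.items():
        cpi = branch_cpi(penalty, fraction)
        print(f"{name:18s} CPI={cpi:.2f} "
              f"vs unpipelined={speedup_vs_unpipelined(cpi):.2f} "
              f"vs stall={stall_cpi / cpi:.2f}")
```

The gap between stalling and any of the other three schemes is much larger than the gaps among those three, which matches the table.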
![Page 36: Graduate Computer Architecture I](https://reader035.vdocuments.net/reader035/viewer/2022062519/56814e5e550346895dbbfdfc/html5/thumbnails/36.jpg)
Conclusion
• Instruction Set Architecture
  – Things to Consider when designing a new ISA
• Processor
  – Concept behind Pipelining
  – Five Stage Pipeline RISC
  – Proper Processor Performance Evaluation
• Limitations of Pipelining
  – Structural, Data, and Control Hazards
  – Techniques to Recover Performance
  – Re-evaluating Speed-ups