Lecture XX: Midterm Review
CSE 564 Computer Architecture Fall 2016
Department of Computer Science and Engineering Yonghong Yan
[email protected] www.secs.oakland.edu/~yan
Lecture 01 Introduction
The Instruction Set: a Critical Interface
[Figure: the instruction set as the interface – software above, instruction set, hardware below]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
Great Ideas in Computer Architectures
1. Design for Moore’s Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy
Great Idea: “Moore’s Law”
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised – circuit complexity doubles every two years
Image credit: Intel
Moore’s Law trends
• More transistors = more opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – Getting better performance was easy: buy a faster (higher-frequency) processor
• We have enjoyed this free lunch for several decades; however (TBD) …
Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
  – delays increase as issue queues and multi-port register files grow
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
Power/heat density limits frequency
• Some fundamental physical limits are being reached
Revolution is happening now
• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Architectural Challenges
• Massive (ca. 4X) increase in concurrency
  – Multicore (4 – <100) → Manycores (100s – 1000s)
• Heterogeneity
  – System-level (accelerators) vs chip level (embedded)
• Compute power and memory speed challenges (two walls)
  – 500x compute power and 30x memory of 2 PF HW
  – Memory access time lags further behind
Course Motivation: Research Perspective
• Complex Digital ASIC Design
• Activity 1 – Case Study: Scalar vs. Vector Processors
• Activity 2
[Figure (unrecoverable in extraction): complex digital ASIC design flow, from the ECE 5950 Course Overview]
Three"Eras"of"Processor"Performance"
Single4Core""Era"
Single1thread$$Performance$
?$
Time$
we#are#here#
o"
Enabled$by:$� ���������$� Voltage$Scaling$� MicroArchitecture$
$
Constrained$by:$Power$Complexity$
Multi4Core""Era"
Throughput$$Performance$
Time$(##of#Processors)#
we#are#here#
o"
Enabled$by:$� ���������$� Desire$for$Throughput$� 20$years$of$SMP$arch$
$
Constrained$by:$Power$Parallel$SW$availability$Scalability$
Heterogeneous"Systems"Era"
Targeted$Application$$
Performance$
Time$(Data1parallel#exploitation)#
we#are#here#
o"
Enabled$by:$� ���������$� Abundant$data$parallelism$� Power$efficient$GPUs$
$
Currently)constrained$by:$Programming$models$Communication$overheads$
Source: Chuck Moore, Data Processing in ExaScale-‐ClassComputer Systems, Salishan, April 2011
Lecture 02 Performance
Dynamic Energy and Power
• Dynamic energy
  – Consumed when a transistor switches from 0 -> 1 or 1 -> 0
• Dynamic power
• Reducing clock rate reduces power, not energy
• The capacitive load:
  – a function of the number of transistors connected to an output and of the technology, which determines the capacitance of the wires and the transistors.
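For reference, the standard forms of these relations (as given in the textbook) are:

\[ \text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \]
\[ \text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]

Frequency appears only in the power equation, which is why reducing the clock rate reduces power but not energy.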
An Example from Textbook page #21
Instruction Count and CPI
• Instruction Count for a program
  – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
  – Determined by CPU hardware
  – If different instructions have different CPI
    • Average CPI is affected by the instruction mix
\[ \text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction} \]
\[ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \]
CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

\[ \text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = 500\,\text{ps} \times I \]
\[ \text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = 600\,\text{ps} \times I \]
\[ \frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600\,\text{ps} \times I}{500\,\text{ps} \times I} = 1.2 \]

A is faster by a factor of 1.2.
CPI in More Detail
• If different instruction classes take different numbers of cycles
\[ \text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right) \]

Weighted average CPI:

\[ \text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right) \]

where \(\text{Instruction Count}_i / \text{Instruction Count}\) is the relative frequency of instruction class \(i\).
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C (see the sketch below)

  Class              A   B   C
  CPI for class      1   2   3
  IC in sequence #1  2   1   2
  IC in sequence #2  4   1   1

• Sequence #1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
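A minimal C sketch of the weighted-CPI calculation above; the class CPIs and instruction counts come from the table, and the function name is illustrative:

#include <stdio.h>

/* Weighted average CPI: sum(CPI_i * IC_i) / sum(IC_i) */
static double avg_cpi(const double cpi[], const long ic[], int nclasses) {
    double cycles = 0.0;
    long count = 0;
    for (int i = 0; i < nclasses; i++) {
        cycles += cpi[i] * ic[i];   /* clock cycles contributed by class i */
        count  += ic[i];            /* total instruction count */
    }
    return cycles / count;
}

int main(void) {
    double cpi[]  = {1, 2, 3};   /* classes A, B, C */
    long   seq1[] = {2, 1, 2};   /* IC in sequence #1 */
    long   seq2[] = {4, 1, 1};   /* IC in sequence #2 */
    printf("seq1 avg CPI = %.1f\n", avg_cpi(cpi, seq1, 3));  /* 2.0 */
    printf("seq2 avg CPI = %.1f\n", avg_cpi(cpi, seq2, 3));  /* 1.5 */
    return 0;
}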
Principles of Computer Design
• The Processor Performance Equation
Principles of Computer Design
• Different instruction types have different CPIs
Impacts by Components
                Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Architecture                 X        X
  Technology                            X
Principles of Computer Design
• Take Advantage of Parallelism
  – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of Locality
  – Reuse of data and instructions
• Focus on the Common Case
  – Amdahl's Law
Amdahl’s Law
\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right] \]

Best you could ever hope to do:

\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]
Using Amdahl’s Law
Amdahl’s Law for Parallelism
• The enhanced fraction F is sped up through parallelism; with perfect parallelism (linear speedup), the speedup for F is N on N processors
• Overall speedup:
\[ \text{Speedup} = \frac{1}{(1 - F) + \dfrac{F}{N}} \]
• Speedup upper bound (when N → ∞):
\[ \text{Speedup}_{\text{max}} = \frac{1}{1 - F} \]
  – 1 − F is the sequential portion of the program
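A minimal C sketch of this bound, assuming a perfectly parallel fraction f and N processors (the names are illustrative):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the execution
 * time is sped up by a factor s (here s = N processors). */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    double f = 0.9;   /* assume 90% of the program is parallelizable */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %6.2f\n", n, amdahl(f, n));
    /* As N -> infinity, the speedup approaches 1/(1-f) = 10. */
    return 0;
}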
Lecture 03: ISA
Iron-code Summary
• Section A.2 – Use general-purpose registers with a load-store architecture.
• Section A.3 – Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section A.4 – Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
  – Now we see 16-bit FP for deep learning in GPUs
    • http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Section A.5 – Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift.
• Section A.6 – Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section A.7 – Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size.
• Section A.8 – Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set.
  – Often use separate floating-point registers.
  – The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.
Lecture 05: Pipeline
RISC Instruction Set
• Every instruction can be implemented in at most 5 clock cycles
  – Instruction fetch cycle (IF): send PC to memory, fetch the current instruction from memory, and update PC to the next sequential PC by adding 4 to the PC.
  – Instruction decode/register fetch cycle (ID): decode the instruction, read the registers corresponding to register source specifiers from the register file.
  – Execution/effective address cycle (EX): compute the effective address for memory references, or perform the operation for register-register and register-immediate ALU instructions.
  – Memory access (MEM): perform load/store accesses.
  – Write-back cycle (WB): write the result to the register file for register-register ALU instructions and loads.
Making RISC Pipelining Real
• Function units are used in different cycles
  – Hence we can overlap the execution of multiple instructions
• Important things to make it real
  – Separate instruction and data memories, e.g. I-cache and D-cache, banking
    • Eliminates a conflict for accessing a single memory.
  – The register file is used in two stages (two reads and one write every cycle)
    • Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
  – PC
    • Increment and store the PC every clock, done during the IF stage.
    • A branch does not change the PC until the ID stage (have an adder to compute the potential branch target).
  – Staging data between pipeline stages
    • Pipeline registers
Pipeline Datapath
• The register file is read in the ID stage and written in the WB stage
  – Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
• Separate IM (instruction memory) and DM (data memory)
[Figure: the classic five-stage pipelined datapath – Instruction Fetch (Next PC mux, 0x4 adder, instruction memory), Instr. Decode/Reg. Fetch (register file, sign extend), Execute/Addr. Calc (ALU, Zero?, Next SEQ PC), Memory Access (data memory), Write Back (mux to RD) – with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB staging data between stages]

Per-stage register-transfer operations:
  IF:  IR <= mem[PC]; PC <= PC + 4
  ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
  EX:  rslt <= A opIRop B
  MEM: WB <= rslt
  WB:  Reg[IRrd] <= WB

Pipeline registers stage data between pipeline stages and are named IF/ID, ID/EX, EX/MEM, and MEM/WB.
Pipeline Registers
• The edge-triggered property of registers is critical
Inst. Set Processor Controller
• State sequencing of the unpipelined, multi-cycle controller (dispatch on the opcode after decode: br, jmp, RR, RI, LD, ST, JSR, JR):
  Ifetch:       IR <= mem[PC]; PC <= PC + 4
  opFetch-DCD:  A <= Reg[IRrs]; B <= Reg[IRrt]
  br:           if bop(A,B) PC <= PC + IRim
  jmp:          PC <= IRjaddr
  RR:           r <= A opIRop B;    WB <= r;       Reg[IRrd] <= WB
  RI:           r <= A opIRop IRim; WB <= r;       Reg[IRrd] <= WB
  LD:           r <= A + IRim;      WB <= Mem[r];  Reg[IRrd] <= WB
• A branch requires 3 cycles, a store requires 4 cycles, and all other instructions require 5 cycles (see the CPI sketch below).
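Given those cycle counts, the controller's average CPI follows directly from the instruction mix. A small C sketch; the mix fractions are made-up values for illustration:

#include <stdio.h>

int main(void) {
    /* Hypothetical instruction mix (fractions sum to 1). */
    double f_branch = 0.15, f_store = 0.10, f_other = 0.75;
    /* Cycle counts from the state machine above. */
    double cpi = f_branch * 3 + f_store * 4 + f_other * 5;
    printf("average CPI = %.2f\n", cpi);   /* 0.45 + 0.40 + 3.75 = 4.60 */
    return 0;
}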
Processor Performance

\[ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}} \]

• Instructions per program depends on source code, compiler technology, and the ISA
• Cycles per instruction (CPI) depends on the ISA and the µarchitecture
• Time per cycle depends upon the µarchitecture and base technology
RISC-V ISA and Implementations
User Level ISA
• Defines the normal instructions needed for computation
  – A mandatory base integer ISA
    • I: Integer instructions: ALU, branches/jumps, and loads/stores
    • Support for misaligned memory access is mandatory
  – Standard extensions
    • M: Integer Multiplication and Division
    • A: Atomic Instructions
    • F: Single-Precision Floating-Point
    • D: Double-Precision Floating-Point
    • C: Compressed Instructions (16 bit)
• G = IMAFD: Integer base + four standard extensions
• Optional extensions
Purpose of a Specific Control Signal
Datapath for ALU Instructions

[Figure: ALU-instruction datapath – PC with a 0x4 adder, instruction memory (addr/inst), register file (GPRs: read ports rs1 <19:15> and rs2 <24:20>, write port wa <11:7>, write data wd, write enable RegWriteEn), Imm Select (ImmSel, fed by Inst<31:20>), Op2Sel mux (Reg / Imm), and the ALU with ALU Control (func3 <14:12>, opcode <6:0>)]

Instruction encodings (field widths in bits; bit positions 31 | 20 | 19:15 | 14:12 | 11:7 | 6:0):
  R-type: func7 (7) | rs2 (5) | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) func (rs2)
  I-type: immediate12 (12)    | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) op immediate
Datapath for Load/Store Instructions

• rs1 is the base register; rd is the destination of a Load; rs2 is the data source for a Store

[Figure: load/store datapath – the ALU-instruction datapath plus a Data Memory (addr, wdata, rdata, write enable MemWrite) and a write-back select mux (WBSel: ALU / Mem); Op2Sel selects the displacement ("disp") immediate and rs1 supplies the "base"]

Instruction encodings (field widths in bits; bit positions 31 | 20 | 19:15 | 14:12 | 11:7 | 6:0):
  Store: imm (7) | rs2 (5) | rs1 (5) | func3 (3) | imm (5) | opcode (7)    effective address = (rs1) + displacement
  Load:  immediate12 (12)  | rs1 (5) | func3 (3) | rd (5)  | opcode (7)
Datapath for Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)

[Figure: branch datapath – the load/store datapath plus a branch comparator (Bcomp? / Br Logic on rd1 and rd2), an adder computing the PC-relative branch target, and a PCSel mux choosing between pc+4 and the branch target br]
Data Hazards Summary
• Stall cycles without bypassing
  – 3, 2, or 1, depending on the distance between the two instructions
• RAW dependency between ALU instructions
  – Full bypassing will eliminate all stalls
• Load-use RAW
  – Even with full bypassing, there is at most a 1-cycle stall between two instructions such as:
    • ld x5, 16(x4)
    • add x6, x5, x1
• Load-store or store-store
  – No stalls with full bypassing
• Interlock control logic for RAW hazard detection and stall insertion (see the sketch below)
• Bypassing data path
  – Need to deal with three situations:
    • ALU → ALU
    • MEM → ALU
    • WB → ALU
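The load-use case is the one the interlock logic must catch. A minimal C sketch of the classic detection condition (the struct fields are illustrative; x0 never causes a hazard because it is hardwired to zero):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Slices of the pipeline registers that the interlock logic inspects. */
struct id_ex { bool mem_read; uint8_t rd; };   /* is a load in EX? */
struct if_id { uint8_t rs1, rs2; };            /* instruction in ID */

/* Stall one cycle when the instruction in EX is a load whose
 * destination register is a source of the instruction in decode. */
static bool load_use_stall(struct id_ex ex, struct if_id id) {
    return ex.mem_read && ex.rd != 0 &&
           (ex.rd == id.rs1 || ex.rd == id.rs2);
}

int main(void) {
    struct id_ex ex = { .mem_read = true, .rd = 5 };  /* ld x5, 16(x4) */
    struct if_id id = { .rs1 = 5, .rs2 = 1 };         /* add x6, x5, x1 */
    printf("stall = %d\n", load_use_stall(ex, id));   /* 1: stall a cycle */
    return 0;
}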
Interlock Control Logic (ignoring jumps & branches)
[Figure: interlock control logic added to the pipelined datapath – the source registers (rs1, rs2) and read enables (re1, re2) of the instruction in decode are compared (Cstall) against the destination registers (wa1, wa2, wa3) and write enables (we1, we2, we3) of the instructions in later stages; on a match, the stall signal freezes the front of the pipeline and a bubble is inserted]
Fully Bypassed Datapath

[Figure: fully bypassed datapath – bypass muxes (ASrc, BSrc) feed the ALU inputs from the D, E, M, and W stage results in addition to the register file; an additional PC path handles JAL, ...]
Control Hazards Summary
  Instruction | Taken known?       | Target known?
  JAL         | After Inst. Decode | After Inst. Decode
  JALR        | After Inst. Decode | After Reg. Fetch
  B<cond.>    | After Execute      | After Inst. Decode

Each instruction fetch depends on one or two pieces of information from the preceding instruction:
  1) Is the preceding instruction a taken branch?
  2) If so, what is the target address?
• JAL: unconditional jump to PC+immediate
• JALR: indirect jump to rs1+immediate
• Branch: if (rs1 conds rs2), branch to PC+immediate
Control Hazards Summary
• JAL: unconditional jump to PC+immediate
  – 1 cycle delay of the pipeline
• JALR: indirect jump to rs1+immediate
  – 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
  – 2 cycles delay
• Solutions:
  – Delay slot (not a solution to remove the bubble)
  – BHT and BTB (not a solution either)
Lecture 10/11 Memory Tech, Cache Organization and Performance
Review: Memory Technology and Hierarchy
• Relationships
Technology challenge: Memory Wall
[Figure: probability of reference across the address space (0 to 2^n − 1), illustrating locality]
Program Behavior: Principle of Locality
Architecture Approach: Memory Hierarchy
Technology Challenge: Memory Wall
• Growing disparity of processor and memory speed
• DRAM: slow, cheap, and dense
  – Good choice for presenting the user with a BIG memory system
  – Used for main memory
• SRAM: fast, expensive, and not very dense
  – Good choice for providing the user FAST access time
  – Used for cache
• Speed:
  – Latency
  – Bandwidth
  – Memory interleaving
Program Behavior: Principle of Locality
• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
  – Spatial locality: items with nearby addresses tend to be referenced close together in time
  – Temporal locality: recently referenced items are likely to be referenced in the near future

Locality Example:

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through the loop repeatedly: temporal locality
Architecture Approach: Memory Hierarchy
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: the memory hierarchy – Processor (registers, datapath, control) → On-Chip Cache → 2nd/3rd Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Tape); speeds roughly 1s ns (registers), 10s ns (cache), 100s ns (DRAM), 10,000,000s ns = 10s ms (disk), 10,000,000,000s ns = 10s sec (tape); sizes roughly 100s bytes, Ks, Ms, Gs, Ts]
4 Questions for Cache Organization Review
• Q1: Where can a block be placed in the upper level? – Block placement
• Q2: How is a block found if it is in the upper level? – Block identification
• Q3: Which block should be replaced on a miss? – Block replacement
• Q4: What happens on a write? – Write strategy
Q1: Where Can a Block be Placed in The Upper Level?
• Block Placement
  – Direct Mapped, Fully Associative, Set Associative
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
  – # of sets ≤ # of blocks
  – n-way: n blocks in a set
  – 1-way = direct mapped
• Fully associative: # of sets = 1

[Figure: an 8-frame cache and a 32-block memory]
  – Direct mapped: block 12 can go only into frame 4 (12 mod 8)
  – Set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4)
  – Fully associative: block 12 can go anywhere
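A minimal C sketch of the three placement rules for the block-12 example above (8 block frames in the cache):

#include <stdio.h>

int main(void) {
    int block = 12, frames = 8;

    int dm_frame = block % frames;   /* direct mapped: exactly one frame */
    int sets     = frames / 2;       /* 2-way set associative: 4 sets    */
    int sa_set   = block % sets;     /* any of the 2 frames in this set  */
    /* fully associative: one set of 8, so any of the 8 frames           */

    printf("direct mapped -> frame %d\n", dm_frame);  /* 12 mod 8 = 4 */
    printf("set associative -> set %d\n", sa_set);    /* 12 mod 4 = 0 */
    return 0;
}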
1 KB Direct Mapped Cache, 32B blocks
• For a 2^N byte cache:
  – The uppermost (32 − N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: 1 KB direct-mapped cache with 32 B blocks – the address splits into Cache Tag <31:10> (example: 0x50, stored as part of the cache "state" together with a Valid Bit), Cache Index <9:5> (example: 0x01), and Byte Select <4:0> (example: 0x00); the 32 cache lines hold Byte 0 ... Byte 1023]
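For this 1 KB direct-mapped cache with 32 B blocks, a 32-bit address splits as sketched below, reproducing the slide's example values:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte blocks: Byte Select <4:0>    */
#define INDEX_BITS  5   /* 1 KB / 32 B = 32 blocks: Index <9:5> */

int main(void) {
    /* Build the slide's example address: tag 0x50, index 0x01, byte 0x00. */
    uint32_t addr = (0x50u << (OFFSET_BITS + INDEX_BITS)) | (1u << OFFSET_BITS);

    uint32_t byte  = addr & ((1u << OFFSET_BITS) - 1);                  /* <4:0>   */
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* <9:5>   */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);                /* <31:10> */

    printf("tag=0x%02x index=0x%02x byte=0x%02x\n",
           (unsigned)tag, (unsigned)index, (unsigned)byte);
    return 0;
}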
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operating in parallel
• Example: two-way set associative cache
  – Cache Index selects a "set" from the cache;
  – The two tags in the set are compared to the input in parallel;
  – Data is selected based on the tag comparison result.

[Figure: two-way set-associative cache – two banks of (Valid, Cache Tag, Cache Data) indexed by Cache Index; two tag comparators whose outputs are ORed into Hit and drive the Sel1/Sel0 mux that selects the Cache Block]
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if miss.

[Figure: the same two-way set-associative organization as on the previous slide]
Q2: Block Identification
• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks the index and expands the tag

Block Address = Tag (set select) | Index, followed by the Block Offset (data select)

\[ \text{Cache size} = \text{Associativity} \times 2^{\text{index size}} \times 2^{\text{offset size}} \]
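Plugging in the 1 KB direct-mapped cache from the earlier slide (data capacity only, tags excluded) as a quick check of the formula:

\[ \text{Cache size} = 1 \times 2^{5} \times 2^{5} = 1024\ \text{bytes} = 1\ \text{KB} \]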
Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)
  – First in, first out (FIFO)

Data cache misses per 1000 instructions, by size, associativity, and replacement policy:

  Size     2-way: LRU   Ran.   FIFO    4-way: LRU   Ran.   FIFO    8-way: LRU   Ran.   FIFO
  16 KB         114.1  117.3  115.5         111.7  115.1  113.3         109.0  111.8  110.4
  64 KB         103.4  104.3  103.9         102.4  102.3  103.1          99.7  100.5  100.3
  256 KB         92.2   92.1   92.5          92.1   92.1   92.5          92.1   92.1   92.5
Q4: What Happens on a Write?
• Policy
  – Write-Through: data written to the cache block is also written to lower-level memory
  – Write-Back: (1) write data only to the cache; (2) update the lower level when a block falls out of the cache

                                               Write-Through   Write-Back
  Debug                                        Easy            Hard
  Do read misses produce writes?               No              Yes
  Do repeated writes make it to lower level?   Yes             No

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").
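A minimal C sketch contrasting the two policies on a store hit (the cache-line structure is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

/* Write-through: update the line AND the lower level on every store. */
static void store_wt(struct line *l, int off, uint8_t v, uint8_t *mem) {
    l->data[off] = v;
    mem[off] = v;   /* goes to memory (typically via a write buffer) */
}

/* Write-back: update only the line; mark it dirty so it is written
 * back to the lower level when it is evicted. */
static void store_wb(struct line *l, int off, uint8_t v) {
    l->data[off] = v;
    l->dirty = true;
}

int main(void) {
    uint8_t mem[32] = {0};
    struct line l = { .valid = true };
    store_wt(&l, 3, 0xAB, mem);
    store_wb(&l, 4, 0xCD);
    printf("mem[3]=0x%02X dirty=%d\n", mem[3], l.dirty);  /* 0xAB, 1 */
    return 0;
}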
Write Buffers for Write-Through Caches
• Q. Why a write buffer?
  – A. So the CPU doesn't stall.
• Q. Why a buffer, why not just one register?
  – A. Bursts of writes are common.
• Q. Are Read After Write (RAW) hazards an issue for the write buffer?
  – A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

[Figure: Processor ↔ Cache, with a Write Buffer between the processor and DRAM]
Write-Miss Policy
• Two options on a write miss
  – Write allocate: the block is allocated on a write miss, followed by the write hit actions.
    • Write misses act like read misses.
  – No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
    • Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache.
Write-Miss Policy Example
• Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
  Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
  What are the numbers of hits and misses (reads and writes included) when using no-write allocate versus write allocate?

• Answer:
  No-write allocate:                  Write allocate:
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write miss
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write hit
  Read  Mem[200]; 1 read miss         Read  Mem[200]; 1 read miss
  Write Mem[200]; 1 write hit         Write Mem[200]; 1 write hit
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write hit
  Total: 4 misses; 1 hit              Total: 2 misses; 3 hits
Cache Performance (1/3)
• Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.
• Rewriting the CPU performance time:

\[ \text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time} \]

• The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:

\[ \text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \]

† The advantage of the last form is that its components can be easily measured.
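A C sketch of the last form of the equation, using the numbers from the example that follows (1 instruction fetch plus 0.5 data accesses per instruction, 2% miss rate, 25-cycle penalty):

#include <stdio.h>

int main(void) {
    double accesses_per_inst = 1.0 + 0.5;   /* fetch + loads/stores */
    double miss_rate = 0.02, miss_penalty = 25.0;

    /* Memory stall cycles per instruction (the IC factors out). */
    double stall_per_inst = accesses_per_inst * miss_rate * miss_penalty;
    printf("memory stall cycles = %.2f x IC\n", stall_per_inst);  /* 0.75 */
    return 0;
}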
Cache Performance (2/3)
• Miss penalty depends on
  – Prior memory requests or memory refresh;
  – Different clocks of the processor, bus, and memory;
  – Thus, treating the miss penalty as a constant is a simplification.
• Miss rate: the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses).
• The exact formula separates reads and writes:

\[ \text{Memory stall cycles} = \text{IC} \times \text{Reads per instruction} \times \text{Read miss rate} \times \text{Read miss penalty} + \text{IC} \times \text{Writes per instruction} \times \text{Write miss rate} \times \text{Write miss penalty} \]

† Simplify the complete formula by combining the reads and writes:

\[ \text{Memory stall cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \]
Example (C-5)
• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
• Answer:

1. Compute the performance for the computer that always hits:
\[ \text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time} = (\text{IC} \times \text{CPI} + 0) \times \text{Clock cycle} = \text{IC} \times 1.0 \times \text{Clock cycle} \]

2. For the computer with the real cache, compute the memory stall cycles:
\[ \text{Memory stall cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} = \text{IC} \times (1 + 0.5) \times 0.02 \times 25 = 0.75 \times \text{IC} \]

3. Compute the total performance:
\[ \text{CPU execution time}_{\text{cache}} = (\text{IC} \times 1.0 + 0.75 \times \text{IC}) \times \text{Clock cycle} = 1.75 \times \text{IC} \times \text{Clock cycle} \]

4. Compute the performance ratio, which is the inverse of the execution times:
\[ \frac{\text{CPU execution time}_{\text{cache}}}{\text{CPU execution time}} = \frac{1.75 \times \text{IC} \times \text{Clock cycle}}{1.0 \times \text{IC} \times \text{Clock cycle}} = 1.75 \]

The computer with no cache misses is 1.75 times faster.
Cache Performance (3/3)
• Usually, miss rate is measured as misses per instruction rather than misses per memory reference:

\[ \frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction count}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} \]

† The latter formula is useful when you know the average number of memory accesses per instruction.

• For example, turning the miss rate of the previous example into misses per instruction:

\[ \frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} = 0.02 \times 1.5 = 0.030 \]
Example (C-6)
• To show the equivalency between the two miss rate equations, let's redo the example above, this time assuming a miss rate per 1000 instructions of 30. What is memory stall time in terms of instruction count?
• Answer: recomputing the memory stall cycles:

\[ \text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \]
\[ = \frac{\text{IC}}{1000} \times \frac{\text{Misses}}{\text{Instruction} \times 1000} \times \text{Miss penalty} = \frac{\text{IC}}{1000} \times 30 \times 25 = \frac{\text{IC}}{1000} \times 750 = 0.75 \times \text{IC} \]
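The same computation in the misses-per-1000-instructions form, as a quick cross-check that the two equations agree:

#include <stdio.h>

int main(void) {
    double misses_per_1000 = 30.0, miss_penalty = 25.0;

    /* Stall cycles per instruction = (misses per 1000 inst)/1000 * penalty. */
    double stall_per_inst = misses_per_1000 / 1000.0 * miss_penalty;
    printf("memory stall cycles = %.2f x IC\n", stall_per_inst);  /* 0.75 */
    return 0;
}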