Lecture XX: Midterm Review
CSE 564 Computer Architecture Fall 2016
Department of Computer Science and Engineering Yonghong Yan
[email protected] www.secs.oakland.edu/~yan
Lecture 01 Introduction
The Instruction Set: a Critical Interface
[Figure: the instruction set as the interface – software above, instruction set, hardware below]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
Great Ideas in Computer Architectures
1. Design for Moore’s Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy
Great Idea: “Moore’s Law”
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised – circuit complexity doubles every two years
Image credit: Intel
Moore’s Law trends
• More transistors = more opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – Getting better performance was easy: buy a faster (higher-frequency) processor
• We have enjoyed this free lunch for several decades; however (TBD) …
Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
  – delays increase as issue queues and multi-port register files grow
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
Power/heat density limits frequency
• Some fundamental physical limits are being reached
Revolution is happening now
• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Architectural Challenges
• Massive (ca. 4X) increase in concurrency
  – Multicore (4 – <100) → Manycores (100s – 1000s)
• Heterogeneity
  – System-level (accelerators) vs chip level (embedded)
• Compute power and memory speed challenges (two walls)
  – 500x compute power and 30x memory of 2 PF HW
  – Memory access time lags further behind
Course Motivation: Research Perspective
• Complex Digital ASIC Design
• Activity 1 – Case Study: Scalar vs. Vector Processors
• Activity 2
[Figure (unrecoverable in extraction): complex digital ASIC design flow, from the ECE 5950 Course Overview]
Three"Eras"of"Processor"Performance"
Single4Core""Era"
Single1thread$$Performance$
?$
Time$
we#are#here#
o"
Enabled$by:$� ���������$� Voltage$Scaling$� MicroArchitecture$
$
Constrained$by:$Power$Complexity$
Multi4Core""Era"
Throughput$$Performance$
Time$(##of#Processors)#
we#are#here#
o"
Enabled$by:$� ���������$� Desire$for$Throughput$� 20$years$of$SMP$arch$
$
Constrained$by:$Power$Parallel$SW$availability$Scalability$
Heterogeneous"Systems"Era"
Targeted$Application$$
Performance$
Time$(Data1parallel#exploitation)#
we#are#here#
o"
Enabled$by:$� ���������$� Abundant$data$parallelism$� Power$efficient$GPUs$
$
Currently)constrained$by:$Programming$models$Communication$overheads$
Source: Chuck Moore, Data Processing in ExaScale-‐ClassComputer Systems, Salishan, April 2011
Lecture 02 Performance
Dynamic Energy and Power
• Dynamic energy
  – Consumed when a transistor switches from 0 -> 1 or 1 -> 0
• Dynamic power
• Reducing clock rate reduces power, not energy
• The capacitive load:
  – a function of the number of transistors connected to an output and of the technology, which determines the capacitance of the wires and the transistors.
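For reference, the standard forms of these relations (as given in the textbook) are:

\[ \text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \]
\[ \text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]

Frequency appears only in the power equation, which is why reducing the clock rate reduces power but not energy.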
An Example from Textbook page #21
Instruction Count and CPI
• Instruction Count for a program
  – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
  – Determined by CPU hardware
  – If different instructions have different CPI
    • Average CPI is affected by the instruction mix
\[ \text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction} \]
\[ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \]
CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

\[ \text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = 500\,\text{ps} \times I \]
\[ \text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = 600\,\text{ps} \times I \]
\[ \frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600\,\text{ps} \times I}{500\,\text{ps} \times I} = 1.2 \]

A is faster by a factor of 1.2.
CPI in More Detail
• If different instruction classes take different numbers of cycles
\[ \text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right) \]

Weighted average CPI:

\[ \text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right) \]

where \(\text{Instruction Count}_i / \text{Instruction Count}\) is the relative frequency of instruction class \(i\).
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C (see the sketch below)

  Class              A   B   C
  CPI for class      1   2   3
  IC in sequence #1  2   1   2
  IC in sequence #2  4   1   1

• Sequence #1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
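A minimal C sketch of the weighted-CPI calculation above; the class CPIs and instruction counts come from the table, and the function name is illustrative:

#include <stdio.h>

/* Weighted average CPI: sum(CPI_i * IC_i) / sum(IC_i) */
static double avg_cpi(const double cpi[], const long ic[], int nclasses) {
    double cycles = 0.0;
    long count = 0;
    for (int i = 0; i < nclasses; i++) {
        cycles += cpi[i] * ic[i];   /* clock cycles contributed by class i */
        count  += ic[i];            /* total instruction count */
    }
    return cycles / count;
}

int main(void) {
    double cpi[]  = {1, 2, 3};   /* classes A, B, C */
    long   seq1[] = {2, 1, 2};   /* IC in sequence #1 */
    long   seq2[] = {4, 1, 1};   /* IC in sequence #2 */
    printf("seq1 avg CPI = %.1f\n", avg_cpi(cpi, seq1, 3));  /* 2.0 */
    printf("seq2 avg CPI = %.1f\n", avg_cpi(cpi, seq2, 3));  /* 1.5 */
    return 0;
}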
Principles of Computer Design
• The Processor Performance Equation
Principles of Computer Design
• Different instruction types have different CPIs
Impacts by Components
                Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Architecture                 X        X
  Technology                            X
Principles of Computer Design
• Take Advantage of Parallelism
  – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of Locality
  – Reuse of data and instructions
• Focus on the Common Case
  – Amdahl's Law
Amdahl’s Law
\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right] \]

Best you could ever hope to do:

\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]
Using Amdahl’s Law
Amdahl’s Law for Parallelism
• The enhanced fraction F is sped up through parallelism; with perfect parallelism (linear speedup), the speedup for F is N on N processors
• Overall speedup:
\[ \text{Speedup} = \frac{1}{(1 - F) + \dfrac{F}{N}} \]
• Speedup upper bound (when N → ∞):
\[ \text{Speedup}_{\text{max}} = \frac{1}{1 - F} \]
  – 1 − F is the sequential portion of the program
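A minimal C sketch of this bound, assuming a perfectly parallel fraction f and N processors (the names are illustrative):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the execution
 * time is sped up by a factor s (here s = N processors). */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    double f = 0.9;   /* assume 90% of the program is parallelizable */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %6.2f\n", n, amdahl(f, n));
    /* As N -> infinity, the speedup approaches 1/(1-f) = 10. */
    return 0;
}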
Lecture 03: ISA
Iron-code Summary
• Section A.2 – Use general-purpose registers with a load-store architecture.
• Section A.3 – Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section A.4 – Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
  – Now we see 16-bit FP for deep learning in GPUs
    • http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Section A.5 – Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift.
• Section A.6 – Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section A.7 – Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size.
• Section A.8 – Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set.
  – Often use separate floating-point registers.
  – The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.
Lecture 05: Pipeline
RISC Instruction Set
• Every instruction can be implemented in at most 5 clock cycles
  – Instruction fetch cycle (IF): send PC to memory, fetch the current instruction from memory, and update PC to the next sequential PC by adding 4 to the PC.
  – Instruction decode/register fetch cycle (ID): decode the instruction, read the registers corresponding to register source specifiers from the register file.
  – Execution/effective address cycle (EX): compute the effective address for memory references, or perform the operation for register-register and register-immediate ALU instructions.
  – Memory access (MEM): perform load/store accesses.
  – Write-back cycle (WB): write the result to the register file for register-register ALU instructions and loads.
Making RISC Pipelining Real
• Function units are used in different cycles
  – Hence we can overlap the execution of multiple instructions
• Important things to make it real
  – Separate instruction and data memories, e.g. I-cache and D-cache, banking
    • Eliminates a conflict for accessing a single memory.
  – The register file is used in two stages (two reads and one write every cycle)
    • Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
  – PC
    • Increment and store the PC every clock, done during the IF stage.
    • A branch does not change the PC until the ID stage (have an adder to compute the potential branch target).
  – Staging data between pipeline stages
    • Pipeline registers
Pipeline Datapath
• The register file is read in the ID stage and written in the WB stage
  – Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
• Separate IM (instruction memory) and DM (data memory)
[Figure: the classic five-stage pipelined datapath – Instruction Fetch (Next PC mux, 0x4 adder, instruction memory), Instr. Decode/Reg. Fetch (register file, sign extend), Execute/Addr. Calc (ALU, Zero?, Next SEQ PC), Memory Access (data memory), Write Back (mux to RD) – with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB staging data between stages]

Per-stage register-transfer operations:
  IF:  IR <= mem[PC]; PC <= PC + 4
  ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
  EX:  rslt <= A opIRop B
  MEM: WB <= rslt
  WB:  Reg[IRrd] <= WB

Pipeline registers stage data between pipeline stages and are named IF/ID, ID/EX, EX/MEM, and MEM/WB.
Pipeline Registers
• The edge-triggered property of registers is critical
Inst. Set Processor Controller
• State sequencing of the unpipelined, multi-cycle controller (dispatch on the opcode after decode: br, jmp, RR, RI, LD, ST, JSR, JR):
  Ifetch:       IR <= mem[PC]; PC <= PC + 4
  opFetch-DCD:  A <= Reg[IRrs]; B <= Reg[IRrt]
  br:           if bop(A,B) PC <= PC + IRim
  jmp:          PC <= IRjaddr
  RR:           r <= A opIRop B;    WB <= r;       Reg[IRrd] <= WB
  RI:           r <= A opIRop IRim; WB <= r;       Reg[IRrd] <= WB
  LD:           r <= A + IRim;      WB <= Mem[r];  Reg[IRrd] <= WB
• A branch requires 3 cycles, a store requires 4 cycles, and all other instructions require 5 cycles (see the CPI sketch below).
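Given those cycle counts, the controller's average CPI follows directly from the instruction mix. A small C sketch; the mix fractions are made-up values for illustration:

#include <stdio.h>

int main(void) {
    /* Hypothetical instruction mix (fractions sum to 1). */
    double f_branch = 0.15, f_store = 0.10, f_other = 0.75;
    /* Cycle counts from the state machine above. */
    double cpi = f_branch * 3 + f_store * 4 + f_other * 5;
    printf("average CPI = %.2f\n", cpi);   /* 0.45 + 0.40 + 3.75 = 4.60 */
    return 0;
}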
Processor Performance

\[ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}} \]

• Instructions per program depends on source code, compiler technology, and the ISA
• Cycles per instruction (CPI) depends on the ISA and the µarchitecture
• Time per cycle depends upon the µarchitecture and base technology
RISC-V ISA and Implementations
User Level ISA
• Defines the normal instructions needed for computation
  – A mandatory base integer ISA
    • I: Integer instructions: ALU, branches/jumps, and loads/stores
    • Support for misaligned memory access is mandatory
  – Standard extensions
    • M: Integer Multiplication and Division
    • A: Atomic Instructions
    • F: Single-Precision Floating-Point
    • D: Double-Precision Floating-Point
    • C: Compressed Instructions (16 bit)
• G = IMAFD: Integer base + four standard extensions
• Optional extensions
Purpose of a Specific Control Signal
Datapath for ALU Instructions

[Figure: ALU-instruction datapath – PC with a 0x4 adder, instruction memory (addr/inst), register file (GPRs: read ports rs1 <19:15> and rs2 <24:20>, write port wa <11:7>, write data wd, write enable RegWriteEn), Imm Select (ImmSel, fed by Inst<31:20>), Op2Sel mux (Reg / Imm), and the ALU with ALU Control (func3 <14:12>, opcode <6:0>)]

Instruction encodings (field widths in bits; bit positions 31 | 20 | 19:15 | 14:12 | 11:7 | 6:0):
  R-type: func7 (7) | rs2 (5) | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) func (rs2)
  I-type: immediate12 (12)    | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) op immediate
Datapath for Load/Store Instructions

• rs1 is the base register; rd is the destination of a Load; rs2 is the data source for a Store

[Figure: load/store datapath – the ALU-instruction datapath plus a Data Memory (addr, wdata, rdata, write enable MemWrite) and a write-back select mux (WBSel: ALU / Mem); Op2Sel selects the displacement ("disp") immediate and rs1 supplies the "base"]

Instruction encodings (field widths in bits; bit positions 31 | 20 | 19:15 | 14:12 | 11:7 | 6:0):
  Store: imm (7) | rs2 (5) | rs1 (5) | func3 (3) | imm (5) | opcode (7)    effective address = (rs1) + displacement
  Load:  immediate12 (12)  | rs1 (5) | func3 (3) | rd (5)  | opcode (7)
Datapath for Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)

[Figure: branch datapath – the load/store datapath plus a branch comparator (Bcomp? / Br Logic on rd1 and rd2), an adder computing the PC-relative branch target, and a PCSel mux choosing between pc+4 and the branch target br]
Data Hazards Summary
• Stall cycles without bypassing
  – 3, 2, or 1, depending on the distance between the two instructions
• RAW dependency between ALU instructions
  – Full bypassing will eliminate all stalls
• Load-use RAW
  – Even with full bypassing, there is at most a 1-cycle stall between two instructions such as:
    • ld x5, 16(x4)
    • add x6, x5, x1
• Load-store or store-store
  – No stalls with full bypassing
• Interlock control logic for RAW hazard detection and stall insertion (see the sketch below)
• Bypassing data path
  – Need to deal with three situations:
    • ALU → ALU
    • MEM → ALU
    • WB → ALU
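The load-use case is the one the interlock logic must catch. A minimal C sketch of the classic detection condition (the struct fields are illustrative; x0 never causes a hazard because it is hardwired to zero):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Slices of the pipeline registers that the interlock logic inspects. */
struct id_ex { bool mem_read; uint8_t rd; };   /* is a load in EX? */
struct if_id { uint8_t rs1, rs2; };            /* instruction in ID */

/* Stall one cycle when the instruction in EX is a load whose
 * destination register is a source of the instruction in decode. */
static bool load_use_stall(struct id_ex ex, struct if_id id) {
    return ex.mem_read && ex.rd != 0 &&
           (ex.rd == id.rs1 || ex.rd == id.rs2);
}

int main(void) {
    struct id_ex ex = { .mem_read = true, .rd = 5 };  /* ld x5, 16(x4) */
    struct if_id id = { .rs1 = 5, .rs2 = 1 };         /* add x6, x5, x1 */
    printf("stall = %d\n", load_use_stall(ex, id));   /* 1: stall a cycle */
    return 0;
}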
Interlock Control Logic (ignoring jumps & branches)
[Figure: interlock control logic added to the pipelined datapath – the source registers (rs1, rs2) and read enables (re1, re2) of the instruction in decode are compared (Cstall) against the destination registers (wa1, wa2, wa3) and write enables (we1, we2, we3) of the instructions in later stages; on a match, the stall signal freezes the front of the pipeline and a bubble is inserted]
Fully Bypassed Datapath

[Figure: fully bypassed datapath – bypass muxes (ASrc, BSrc) feed the ALU inputs from the D, E, M, and W stage results in addition to the register file; an additional PC path handles JAL, ...]
Control Hazards Summary
  Instruction | Taken known?       | Target known?
  JAL         | After Inst. Decode | After Inst. Decode
  JALR        | After Inst. Decode | After Reg. Fetch
  B<cond.>    | After Execute      | After Inst. Decode

Each instruction fetch depends on one or two pieces of information from the preceding instruction:
  1) Is the preceding instruction a taken branch?
  2) If so, what is the target address?
• JAL: unconditional jump to PC+immediate
• JALR: indirect jump to rs1+immediate
• Branch: if (rs1 conds rs2), branch to PC+immediate
Control Hazards Summary
• JAL: unconditional jump to PC+immediate
  – 1 cycle delay of the pipeline
• JALR: indirect jump to rs1+immediate
  – 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
  – 2 cycles delay
• Solutions:
  – Delay slot (not a solution to remove the bubble)
  – BHT and BTB (not a solution either)
Lecture 10/11 Memory Tech, Cache Organization and Performance
Review: Memory Technology and Hierarchy
• Relationships
Technology challenge: Memory Wall
[Figure: probability of reference across the address space (0 to 2^n − 1), illustrating locality]
Program Behavior: Principle of Locality
Architecture Approach: Memory Hierarchy
Technology Challenge: Memory Wall
• Growing disparity of processor and memory speed
• DRAM: slow, cheap, and dense
  – Good choice for presenting the user with a BIG memory system
  – Used for main memory
• SRAM: fast, expensive, and not very dense
  – Good choice for providing the user FAST access time
  – Used for cache
• Speed:
  – Latency
  – Bandwidth
  – Memory interleaving
Program Behavior: Principle of Locality
• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
  – Spatial locality: items with nearby addresses tend to be referenced close together in time
  – Temporal locality: recently referenced items are likely to be referenced in the near future

Locality Example:

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through the loop repeatedly: temporal locality
Architecture Approach: Memory Hierarchy
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: the memory hierarchy – Processor (registers, datapath, control) → On-Chip Cache → 2nd/3rd Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Tape); speeds roughly 1s ns (registers), 10s ns (cache), 100s ns (DRAM), 10,000,000s ns = 10s ms (disk), 10,000,000,000s ns = 10s sec (tape); sizes roughly 100s bytes, Ks, Ms, Gs, Ts]
4 Questions for Cache Organization Review
• Q1: Where can a block be placed in the upper level? – Block placement
• Q2: How is a block found if it is in the upper level? – Block identification
• Q3: Which block should be replaced on a miss? – Block replacement
• Q4: What happens on a write? – Write strategy
Q1: Where Can a Block be Placed in The Upper Level?
• Block Placement
  – Direct Mapped, Fully Associative, Set Associative
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
  – # of sets ≤ # of blocks
  – n-way: n blocks in a set
  – 1-way = direct mapped
• Fully associative: # of sets = 1

[Figure: an 8-frame cache and a 32-block memory]
  – Direct mapped: block 12 can go only into frame 4 (12 mod 8)
  – Set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4)
  – Fully associative: block 12 can go anywhere
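A minimal C sketch of the three placement rules for the block-12 example above (8 block frames in the cache):

#include <stdio.h>

int main(void) {
    int block = 12, frames = 8;

    int dm_frame = block % frames;   /* direct mapped: exactly one frame */
    int sets     = frames / 2;       /* 2-way set associative: 4 sets    */
    int sa_set   = block % sets;     /* any of the 2 frames in this set  */
    /* fully associative: one set of 8, so any of the 8 frames           */

    printf("direct mapped -> frame %d\n", dm_frame);  /* 12 mod 8 = 4 */
    printf("set associative -> set %d\n", sa_set);    /* 12 mod 4 = 0 */
    return 0;
}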
1 KB Direct Mapped Cache, 32B blocks
• For a 2^N byte cache:
  – The uppermost (32 − N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: 1 KB direct-mapped cache with 32 B blocks – the address splits into Cache Tag <31:10> (example: 0x50, stored as part of the cache "state" together with a Valid Bit), Cache Index <9:5> (example: 0x01), and Byte Select <4:0> (example: 0x00); the 32 cache lines hold Byte 0 ... Byte 1023]
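For this 1 KB direct-mapped cache with 32 B blocks, a 32-bit address splits as sketched below, reproducing the slide's example values:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte blocks: Byte Select <4:0>    */
#define INDEX_BITS  5   /* 1 KB / 32 B = 32 blocks: Index <9:5> */

int main(void) {
    /* Build the slide's example address: tag 0x50, index 0x01, byte 0x00. */
    uint32_t addr = (0x50u << (OFFSET_BITS + INDEX_BITS)) | (1u << OFFSET_BITS);

    uint32_t byte  = addr & ((1u << OFFSET_BITS) - 1);                  /* <4:0>   */
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* <9:5>   */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);                /* <31:10> */

    printf("tag=0x%02x index=0x%02x byte=0x%02x\n",
           (unsigned)tag, (unsigned)index, (unsigned)byte);
    return 0;
}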
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operating in parallel
• Example: two-way set associative cache
  – Cache Index selects a "set" from the cache;
  – The two tags in the set are compared to the input in parallel;
  – Data is selected based on the tag comparison result.

[Figure: two-way set-associative cache – two banks of (Valid, Cache Tag, Cache Data) indexed by Cache Index; two tag comparators whose outputs are ORed into Hit and drive the Sel1/Sel0 mux that selects the Cache Block]
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if miss.

[Figure: the same two-way set-associative organization as on the previous slide]
Q2: Block Identification
• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks the index and expands the tag

Block Address = Tag (set select) | Index, followed by the Block Offset (data select)

\[ \text{Cache size} = \text{Associativity} \times 2^{\text{index size}} \times 2^{\text{offset size}} \]
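Plugging in the 1 KB direct-mapped cache from the earlier slide (data capacity only, tags excluded) as a quick check of the formula:

\[ \text{Cache size} = 1 \times 2^{5} \times 2^{5} = 1024\ \text{bytes} = 1\ \text{KB} \]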
Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)
  – First in, first out (FIFO)

Data cache misses per 1000 instructions, by size, associativity, and replacement policy:

  Size     2-way: LRU   Ran.   FIFO    4-way: LRU   Ran.   FIFO    8-way: LRU   Ran.   FIFO
  16 KB         114.1  117.3  115.5         111.7  115.1  113.3         109.0  111.8  110.4
  64 KB         103.4  104.3  103.9         102.4  102.3  103.1          99.7  100.5  100.3
  256 KB         92.2   92.1   92.5          92.1   92.1   92.5          92.1   92.1   92.5
Q4: What Happens on a Write?
• Policy
  – Write-Through: data written to the cache block is also written to lower-level memory
  – Write-Back: (1) write data only to the cache; (2) update the lower level when a block falls out of the cache

                                               Write-Through   Write-Back
  Debug                                        Easy            Hard
  Do read misses produce writes?               No              Yes
  Do repeated writes make it to lower level?   Yes             No

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").
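A minimal C sketch contrasting the two policies on a store hit (the cache-line structure is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

/* Write-through: update the line AND the lower level on every store. */
static void store_wt(struct line *l, int off, uint8_t v, uint8_t *mem) {
    l->data[off] = v;
    mem[off] = v;   /* goes to memory (typically via a write buffer) */
}

/* Write-back: update only the line; mark it dirty so it is written
 * back to the lower level when it is evicted. */
static void store_wb(struct line *l, int off, uint8_t v) {
    l->data[off] = v;
    l->dirty = true;
}

int main(void) {
    uint8_t mem[32] = {0};
    struct line l = { .valid = true };
    store_wt(&l, 3, 0xAB, mem);
    store_wb(&l, 4, 0xCD);
    printf("mem[3]=0x%02X dirty=%d\n", mem[3], l.dirty);  /* 0xAB, 1 */
    return 0;
}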
Write Buffers for Write-Through Caches
• Q. Why a write buffer?
  – A. So the CPU doesn't stall.
• Q. Why a buffer, why not just one register?
  – A. Bursts of writes are common.
• Q. Are Read After Write (RAW) hazards an issue for the write buffer?
  – A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

[Figure: Processor ↔ Cache, with a Write Buffer between the processor and DRAM]
Write-Miss Policy
• Two options on a write miss
  – Write allocate: the block is allocated on a write miss, followed by the write hit actions.
    • Write misses act like read misses.
  – No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
    • Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache.
Write-Miss Policy Example
• Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
  Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
  What are the numbers of hits and misses (reads and writes included) when using no-write allocate versus write allocate?

• Answer:
  No-write allocate:                  Write allocate:
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write miss
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write hit
  Read  Mem[200]; 1 read miss         Read  Mem[200]; 1 read miss
  Write Mem[200]; 1 write hit         Write Mem[200]; 1 write hit
  Write Mem[100]; 1 write miss        Write Mem[100]; 1 write hit
  Total: 4 misses; 1 hit              Total: 2 misses; 3 hits
Cache Performance (1/3)
• Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.
• Rewriting the CPU performance time:

\[ \text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time} \]

• The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:

\[ \text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \]

† The advantage of the last form is that its components can be easily measured.
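A C sketch of the last form of the equation, using the numbers from the example that follows (1 instruction fetch plus 0.5 data accesses per instruction, 2% miss rate, 25-cycle penalty):

#include <stdio.h>

int main(void) {
    double accesses_per_inst = 1.0 + 0.5;   /* fetch + loads/stores */
    double miss_rate = 0.02, miss_penalty = 25.0;

    /* Memory stall cycles per instruction (the IC factors out). */
    double stall_per_inst = accesses_per_inst * miss_rate * miss_penalty;
    printf("memory stall cycles = %.2f x IC\n", stall_per_inst);  /* 0.75 */
    return 0;
}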
Cache Performance (2/3)
• Miss penalty depends on
  – Prior memory requests or memory refresh;
  – Different clocks of the processor, bus, and memory;
  – Thus, treating the miss penalty as a constant is a simplification.
• Miss rate: the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses).
• The exact formula separates reads and writes:

\[ \text{Memory stall cycles} = \text{IC} \times \text{Reads per instruction} \times \text{Read miss rate} \times \text{Read miss penalty} + \text{IC} \times \text{Writes per instruction} \times \text{Write miss rate} \times \text{Write miss penalty} \]

† Simplify the complete formula by combining the reads and writes:

\[ \text{Memory stall cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \]
Example (C-5)
• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
• Answer:

1. Compute the performance for the computer that always hits:
\[ \text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time} = (\text{IC} \times \text{CPI} + 0) \times \text{Clock cycle} = \text{IC} \times 1.0 \times \text{Clock cycle} \]

2. For the computer with the real cache, compute the memory stall cycles:
\[ \text{Memory stall cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} = \text{IC} \times (1 + 0.5) \times 0.02 \times 25 = 0.75 \times \text{IC} \]

3. Compute the total performance:
\[ \text{CPU execution time}_{\text{cache}} = (\text{IC} \times 1.0 + 0.75 \times \text{IC}) \times \text{Clock cycle} = 1.75 \times \text{IC} \times \text{Clock cycle} \]

4. Compute the performance ratio, which is the inverse of the execution times:
\[ \frac{\text{CPU execution time}_{\text{cache}}}{\text{CPU execution time}} = \frac{1.75 \times \text{IC} \times \text{Clock cycle}}{1.0 \times \text{IC} \times \text{Clock cycle}} = 1.75 \]

The computer with no cache misses is 1.75 times faster.
Cache Performance (3/3)
• Usually, miss rate is measured as misses per instruction rather than misses per memory reference:

\[ \frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction count}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} \]

† The latter formula is useful when you know the average number of memory accesses per instruction.

• For example, turning the miss rate of the previous example into misses per instruction:

\[ \frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} = 0.02 \times 1.5 = 0.030 \]
Example (C-6)
• To show the equivalency between the two miss rate equations, let's redo the example above, this time assuming a miss rate per 1000 instructions of 30. What is memory stall time in terms of instruction count?
• Answer: recomputing the memory stall cycles:

\[ \text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty} = \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \]
\[ = \frac{\text{IC}}{1000} \times \frac{\text{Misses}}{\text{Instruction} \times 1000} \times \text{Miss penalty} = \frac{\text{IC}}{1000} \times 30 \times 25 = \frac{\text{IC}}{1000} \times 750 = 0.75 \times \text{IC} \]
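The same computation in the misses-per-1000-instructions form, as a quick cross-check that the two equations agree:

#include <stdio.h>

int main(void) {
    double misses_per_1000 = 30.0, miss_penalty = 25.0;

    /* Stall cycles per instruction = (misses per 1000 inst)/1000 * penalty. */
    double stall_per_inst = misses_per_1000 / 1000.0 * miss_penalty;
    printf("memory stall cycles = %.2f x IC\n", stall_per_inst);  /* 0.75 */
    return 0;
}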