ECE7660 Memory.1 Fall 2005
ECE7660 Advanced Computer Architecture
Basics of Memory Hierarchy
ECE7660 Memory.2 Fall 2005
Who Cares About the Memory Hierarchy?
° Processor-DRAM Memory Gap (latency)
• µProc (“Moore’s Law”): 60%/yr. (2X/1.5 yr)
• DRAM: 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: grows 50% / year
[Figure: Performance (log scale, 1 to 1000) versus Time (1981 to 2002), with one curve for CPU and one for DRAM]
ECE7660 Memory.3 Fall 2005
Memory Hierarchy of a Modern Computer System
[Figure: Processor (Control, Datapath, Registers, On-Chip Cache) -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk)]
Level/Name                | 1/RF | 2/C | 3/MM | 4/DM
Typical size              | < 1 KB | < 16 MB | < 16 GB | > 100 GB
Implementation technology | Custom memory w. multiple ports, CMOS | On-chip or off-chip CMOS SRAM | CMOS DRAM | Magnetic disk
Access time (ns)          | 0.25-0.5 | 0.5-25 | 80-250 | 5,000,000
Bandwidth (MB/s)          | 20,000-100,000 | 5000-10,000 | 1000-5000 | 20-150
Managed by                | Compiler | Hardware | Operating system | OS/operator
Backed by                 | Cache | Main memory | Disk | CD or tape
ECE7660 Memory.4 Fall 2005
Memory Hierarchy: Principles of Operation
°At any given time, data is copied between only 2 adjacent levels:
• Upper Level: the one closer to the processor
- Smaller, faster, and uses more expensive technology
• Lower Level: the one further away from the processor
- Bigger, slower, and uses less expensive technology
°Block:
• The minimum unit of information that can either be present or not present in the two level hierarchy
[Figure: Processor <-> Upper Level Memory (Blk X) <-> Lower Level Memory (Blk Y)]
ECE7660 Memory.5 Fall 2005
Memory Hierarchy: Terminology
°Hit: data appears in some block in the upper level (example: Block X)
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
°Miss: data needs to be retrieved from a block in the lower level (Block Y)
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
°Hit Time << Miss Penalty
[Figure: Processor <-> Upper Level Memory (Block X) <-> Lower Level Memory (Block Y)]
ECE7660 Memory.6 Fall 2005
Memory Hierarchy: How Does it Work?
°Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
• Example: 90% of time in 10% of the code
°Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
• Keep more recently accessed data items closer to the processor
°Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
• Move blocks consisting of contiguous words to the upper levels
[Figure: Processor <-> Upper Level Memory (Blk X) <-> Lower Level Memory (Blk Y)]
ECE7660 Memory.7 Fall 2005
Memory Hierarchy Technology
°Random Access:
• “Random” is good: access time is the same for all locations
• DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: needs to be “refreshed” regularly
• SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content will last “forever”
°“Not-so-random” Access Technology:
• Access time varies from location to location and from time to time
• Examples: Disk, tape drive, CDROM
°The lectures will concentrate on random access technology
• The Main Memory: DRAMs
• Caches: SRAMs
ECE7660 Memory.8 Fall 2005
Memory Hierarchy
° Cache System
° Virtual Memory System
ECE7660 Memory.9 Fall 2005
What is a cache?
° Small, fast storage used to improve average access time to slow memory.
° Exploits spatial and temporal locality
° In computer architecture, almost everything is a cache!
• Registers: a cache on variables
• First-level cache: a cache on second-level cache
• Second-level cache: a cache on memory
• Memory: a cache on disk (virtual memory)
• TLB: a cache on page table
• Branch-prediction: a cache on prediction information?
[Figure: hierarchy Proc/Regs -> L1-Cache -> L2-Cache -> Memory -> Disk, Tape, etc.; levels get bigger going down and faster going up]
ECE7660 Memory.10 Fall 2005
The Simplest Cache: Direct Mapped Cache
[Figure: a 16-location Memory (Memory Address 0 to F) mapped onto a 4 Byte Direct Mapped Cache (Cache Index 0 to 3)]
°Location 0 can be occupied by data from:
• Memory location 0, 4, 8, ... etc.
• In general: any memory location whose 2 LSBs of the address are 0s
• Address<1:0> => cache index
• (Block Addr) modulo (#cache blocks)
°How can we tell which one is in the cache?
°Which one should we place in the cache?
ECE7660 Memory.11 Fall 2005
Example:
Consider an eight-word direct mapped cache. Show the contents of the cache as it responds to a series of requests (decimal addresses):
22, 26, 22, 26, 16, 3, 16, 18
Address (decimal) | 22    | 26    | 22    | 26    | 16    | 3     | 16    | 18
Address (binary)  | 10110 | 11010 | 10110 | 11010 | 10000 | 00011 | 10000 | 10010
Cache index       | 110   | 010   | 110   | 010   | 000   | 011   | 000   | 010
Hit or miss       | miss  | miss  | hit   | hit   | miss  | miss  | hit   | miss
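The request sequence above can be replayed with a small simulation. The following is a minimal sketch in C, assuming one-word blocks, word addresses, and an initially empty cache; the names `dm_cache`, `dm_access`, and `count_misses` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BLOCKS 8  /* eight one-word blocks, as in the example */

/* One entry per cache index: a valid bit plus the stored tag. */
typedef struct {
    bool valid[NUM_BLOCKS];
    unsigned tag[NUM_BLOCKS];
} dm_cache;

/* Access a word address. Returns true on a hit; on a miss the entry
   is filled (load data, write tag, set V) and false is returned. */
bool dm_access(dm_cache *c, unsigned addr)
{
    unsigned index = addr % NUM_BLOCKS;  /* low 3 bits of the address */
    unsigned tag = addr / NUM_BLOCKS;    /* remaining upper bits */
    if (c->valid[index] && c->tag[index] == tag)
        return true;
    c->valid[index] = true;
    c->tag[index] = tag;
    return false;
}

/* Replay a trace on an empty cache and return the number of misses. */
size_t count_misses(const unsigned *trace, size_t n)
{
    dm_cache c = {{false}, {0}};
    size_t misses = 0;
    assert(trace != NULL);
    for (size_t i = 0; i < n; i++)
        if (!dm_access(&c, trace[i]))
            misses++;
    return misses;
}
```

Calling `count_misses` on the eight addresses of the example reports five misses, which agrees with the hit/miss row of the example: the final access to 18 misses because index 010 still holds the block for address 26.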
ECE7660 Memory.12 Fall 2005
Cache Organization: Cache Tag and Cache Index
°Assume a 32-bit memory (byte) address:
• A 2^N-byte direct mapped cache:
- Cache Index: The lower N bits of the memory address
- Cache Tag: The upper (32 - N) bits of the memory address
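[Figure: the 32-bit address is split into a Cache Tag (bits 31 to N, example: 0x50) and a Cache Index (bits N-1 to 0, ex: 0x03). The 2^N-byte direct mapped cache has entries 0 to 2^N - 1 holding Byte 0 to Byte 2^N - 1, each with a Valid Bit and a Cache Tag stored as part of the cache “state”]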
ECE7660 Memory.13 Fall 2005
Cache Access Example
° Start Up: the cache is empty (no entry has its V bit set)
° Access 000 01 (miss)
• Miss Handling: Load Data, Write Tag & Set V
• Cache now holds (Tag, Data): 000, M[00001]
° Access 010 10 (miss)
• Miss Handling: Load Data, Write Tag & Set V
• Cache now holds (Tag, Data): 000, M[00001] and 010, M[01010]
° Access 000 01 (HIT)
° Access 010 10 (HIT)
°Sad Fact of Life:
• A lot of misses at start up: Compulsory Misses
- (Cold start misses)
ECE7660 Memory.14 Fall 2005
Definition of a Cache Block
°Cache Block: the cache data that has its own cache tag
°Our previous “extreme” example (see slide 10):
• 4-byte Direct Mapped cache: Block Size = 1 Byte
• Take advantage of Temporal Locality: If a byte is referenced, it will tend to be referenced again soon.
• Did not take advantage of Spatial Locality: If a byte is referenced, its adjacent bytes will be referenced soon.
°In order to take advantage of Spatial Locality: increase the block size
°Question: Can we say “the larger, the better” for the block size?
[Figure: Direct Mapped Cache with four entries, each with a Valid bit, a Cache Tag, and one byte of Cache Data (Byte 0 to Byte 3)]
ECE7660 Memory.15 Fall 2005
Example: 1 KB Direct Mapped Cache with 32 B Blocks
°For a 2 ** N byte cache:
• The uppermost (32 - N) bits are always the Cache Tag
• The lowest M bits are the Byte Select (Block Size = 2 ** M)
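[Figure: address bits 31-10 are the Cache Tag (example: 0x50), bits 9-5 the Cache Index (ex: 0x01), bits 4-0 the Byte Select (ex: 0x00). The cache has 32 entries (Cache Index 0 to 31), each with a Valid Bit, a Cache Tag stored as part of the cache “state”, and 32 bytes of Cache Data: Byte 0 to Byte 31, Byte 32 to Byte 63, ..., Byte 992 to Byte 1023]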
ECE7660 Memory.16 Fall 2005
Mapping an Address to a Multiword Cache Block
Example: Consider a cache with 64 Blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
floor(1200/16) modulo 64 = 75 modulo 64 = 11
ttttttttttttttttt | iiiiiiiiii | oooo
• tag: to check if have correct block
• index: to select block
• byte offset: within block
The address is divided into three fields: tag, index, offset
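The mapping from a byte address to its three fields can be written out directly. This is a minimal sketch, assuming a direct mapped cache described only by its block size in bytes and its number of blocks; the function names are illustrative, not taken from the slides.

```c
#include <assert.h>

/* Byte offset within the block. */
unsigned block_offset(unsigned byte_addr, unsigned block_bytes)
{
    assert(block_bytes > 0);
    return byte_addr % block_bytes;
}

/* Cache block (index) the address maps to:
   (Block Addr) modulo (#cache blocks). */
unsigned block_index(unsigned byte_addr, unsigned block_bytes,
                     unsigned num_blocks)
{
    assert(block_bytes > 0 && num_blocks > 0);
    unsigned block_addr = byte_addr / block_bytes;  /* drop the offset */
    return block_addr % num_blocks;
}

/* Tag: what is left after removing the offset and the index. */
unsigned block_tag(unsigned byte_addr, unsigned block_bytes,
                   unsigned num_blocks)
{
    assert(block_bytes > 0 && num_blocks > 0);
    return byte_addr / block_bytes / num_blocks;
}
```

For the example on this slide, `block_index(1200, 16, 64)` evaluates to 11. Division and modulo are used instead of shifts and masks so the sketch also works when sizes are not powers of two; hardware would simply select bit fields.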
ECE7660 Memory.17 Fall 2005
DM Cache: Example One
° Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks
° Determine the size of the tag, index and offset fields if we’re using a 32-bit architecture
° Offset: need to specify the correct byte within a block
• a block contains 4 words = 16 bytes = 2^4 bytes
• need 4 bits to specify the correct byte
° Index: need to specify the correct row in the cache
• # rows/cache = # blocks/cache (since there’s one block/row) = (16 KB)/(16 bytes/block) = 2^10 rows/cache
• need 10 bits to specify this many rows
° Tag: use remaining bits as tag
• tag length = mem addr length - offset - index = 32 - 4 - 10 bits = 18 bits
• so tag is leftmost 18 bits of memory address
° Why not full 32 bit address as tag?
• All bytes within a block share the same block address, so the offset bits are not needed (- 4 bits)
• Index must be same for every address within a block, so it’s redundant in tag check, thus can leave off to save memory (- 10 bits in this example)
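The field widths in this example follow mechanically from the cache parameters. A minimal sketch, assuming power-of-two sizes and 4-byte words (so a 4-word block is 16 bytes); `log2_exact`, `field_bits`, and `dm_fields` are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* log2 of a power of two, e.g. log2_exact(16) == 4. */
unsigned log2_exact(unsigned long x)
{
    assert(x != 0 && (x & (x - 1)) == 0);  /* must be a power of two */
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

typedef struct {
    unsigned tag;
    unsigned index;
    unsigned offset;
} field_bits;

/* Widths of the tag, index, and offset fields of a direct mapped cache. */
field_bits dm_fields(unsigned addr_bits, unsigned long data_bytes,
                     unsigned long block_bytes)
{
    field_bits f;
    f.offset = log2_exact(block_bytes);               /* byte within block */
    f.index = log2_exact(data_bytes / block_bytes);   /* one row per block */
    f.tag = addr_bits - f.index - f.offset;           /* remaining bits */
    return f;
}
```

With `dm_fields(32, 16 * 1024, 16)` the sketch yields a 4-bit offset, a 10-bit index, and an 18-bit tag, matching the slide.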
ECE7660 Memory.18 Fall 2005
DM Cache: Example Two (Bits in a cache)
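How many total bits are required for a direct mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
64 KB = 2^14 words, so there are 2^14 entries, each with 32 data bits, a (32 - 14 - 2)-bit tag, and 1 valid bit:
2^14 x (32 + (32 - 14 - 2) + 1) = 2^14 x 49 = 784 Kbits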
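The bit count can be checked with a few lines of C. This is a sketch under the slide's assumptions (one valid bit per entry, no dirty or replacement bits); the function name and parameter names are illustrative.

```c
#include <assert.h>

/* Total storage bits of a direct mapped cache:
   entries x (data bits + tag bits + 1 valid bit). */
unsigned long dm_total_bits(unsigned addr_bits, unsigned index_bits,
                            unsigned offset_bits,
                            unsigned data_bits_per_block)
{
    assert(addr_bits >= index_bits + offset_bits);
    unsigned long entries = 1UL << index_bits;
    unsigned long tag_bits = addr_bits - index_bits - offset_bits;
    return entries * (data_bits_per_block + tag_bits + 1UL);
}
```

For the cache in this example, `dm_total_bits(32, 14, 2, 32)` gives 802816 bits, that is 784 Kbits, of which only 512 Kbits hold data.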
ECE7660 Memory.19 Fall 2005
Block Size Tradeoff
°In general, a larger block size takes advantage of spatial locality BUT:
• Larger block size means larger miss penalty:
- Takes longer time to fill up the block
• If block size is too big relative to cache size, miss rate will go up
°Average Access Time:
• = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: three plots against Block Size. Miss Penalty: grows with block size. Miss Rate: first falls (exploits spatial locality), then rises (fewer blocks: compromises temporal locality). Average Access Time: first falls, then rises (increased miss penalty & miss rate)]
ECE7660 Memory.20 Fall 2005
Another Extreme Example
°Cache Size = 4 bytes, Block Size = 4 bytes
• Only ONE entry in the cache
°True: If an item is accessed, likely that it will be accessed again soon
• But it is unlikely that it will be accessed again immediately!!!
• The next access will likely be a miss again
- Continually loading data into the cache but discarding (forcing out) them before they are used again
- Worst nightmare of a cache designer: Ping Pong Effect
°Conflict Misses are misses caused by:
• Different memory locations mapped to the same cache index
- Solution 1: make the cache size bigger
- Solution 2: Multiple entries for the same Cache Index
[Figure: a single cache entry (index 0) with Valid Bit, Cache Tag, and Cache Data Byte 3, Byte 2, Byte 1, Byte 0]
ECE7660 Memory.21 Fall 2005
How Do you Design a Cache?
°Set of Operations that must be supported
• read: data <= Mem[Physical Address]
• write: Mem[Physical Address] <= Data
°Determine the internal register transfers
°Design the Datapath
°Design the Cache Controller
[Figure: the memory “Black Box” with ports Physical Address, Read/Write, and Data; inside it has Tag-Data Storage, Muxes, Comparators, . . . The Cache DataPath (Address, Data In, Data Out) and the Cache Controller (R/W, Active, wait) are connected by Control Points and Signals]
ECE7660 Memory.22 Fall 2005
Impact on time
° CPI = Ideal CPI + memory stall cycles per instruction
° CPU time = (CPU execution clock cycles + Memory stall cycles) x clock cycle time
° Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty)
° Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
° Average Memory Access time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
° Memory hit time is included in execution cycles.
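[Figure: pipelined datapath with PC, I-Cache, IR, A, B, R, T, D Cache, and pipeline instruction registers IRex, IRm, IRwb; miss signals from the I-Cache and D Cache, invalid]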
ECE7660 Memory.23 Fall 2005
Example 1:
Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. (It is known that the frequency of all loads and stores in gcc is 36%.)
Answer (I = instruction count):
Instruction miss cycles = I x 2% x 40 = 0.80 I
Data miss cycles = I x 36% x 4% x 40 = 0.576 I
The CPI with memory stalls is 2 + 0.80 + 0.576 = 3.376
The performance with the perfect cache is better by 3.376 / 2 = 1.69
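The arithmetic of this example generalizes to any pair of miss rates. A minimal sketch, assuming one instruction fetch per instruction and a single miss penalty for all misses; both function names are hypothetical.

```c
#include <assert.h>

/* Memory stall cycles per instruction: every instruction is fetched
   once, and a fraction mem_ops_per_instr of them also access data. */
double stall_cycles_per_instr(double i_miss_rate, double d_miss_rate,
                              double mem_ops_per_instr,
                              double miss_penalty)
{
    assert(miss_penalty >= 0.0);
    return i_miss_rate * miss_penalty
         + mem_ops_per_instr * d_miss_rate * miss_penalty;
}

/* How much faster a perfect cache (no stalls) would be. */
double perfect_cache_speedup(double base_cpi, double stall_cpi)
{
    assert(base_cpi > 0.0);
    return (base_cpi + stall_cpi) / base_cpi;
}
```

For the gcc numbers, call `stall_cycles_per_instr(0.02, 0.04, 0.36, 40.0)` and pass the result to `perfect_cache_speedup(2.0, ...)`. Because instruction count and clock cycle time are the same on both machines, the CPI ratio is also the execution-time ratio.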
ECE7660 Memory.24 Fall 2005
Example 2
° Suppose a processor executes at
• Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
• 50% arith/logic, 30% ld/st, 20% control
° Suppose that 10% of memory operations get 50 cycle miss penalty
° Suppose that 1% of instructions get same miss penalty
° Questions: what is CPI? What’s the percentage of proc time stalled waiting for memory? What is average memory access time?
° CPI = ideal CPI + average stalls per instruction
= 1.1 (cycles/ins)
+ [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
+ [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
= (1.1 + 1.5 + 0.5) cycle/ins = 3.1
° Percentage of the memory stall time = (1.5 + 0.5)/3.1 = 64.5%
° AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54 cycles/access
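The CPI and AMAT expressions of this example can be evaluated as below. This is a sketch, assuming one instruction fetch per instruction and a hit time given in cycles; the names `cpi_with_stalls` and `amat_weighted` are illustrative.

```c
#include <assert.h>

/* CPI including memory stalls from data accesses and instruction fetches. */
double cpi_with_stalls(double ideal_cpi, double data_ops_per_instr,
                       double data_miss_rate, double instr_miss_rate,
                       double miss_penalty)
{
    assert(ideal_cpi > 0.0);
    return ideal_cpi
         + data_ops_per_instr * data_miss_rate * miss_penalty
         + 1.0 * instr_miss_rate * miss_penalty;  /* 1 fetch per instr */
}

/* AMAT averaged over all accesses: instruction fetches and data
   accesses are weighted by their share of the (1 + data ops) accesses
   made per instruction. */
double amat_weighted(double hit_time, double data_ops_per_instr,
                     double data_miss_rate, double instr_miss_rate,
                     double miss_penalty)
{
    double accesses = 1.0 + data_ops_per_instr;
    double instr_part = (1.0 / accesses)
                      * (hit_time + instr_miss_rate * miss_penalty);
    double data_part = (data_ops_per_instr / accesses)
                     * (hit_time + data_miss_rate * miss_penalty);
    assert(accesses > 0.0);
    return instr_part + data_part;
}
```

With the slide's parameters (ideal CPI 1.1, 0.30 data accesses per instruction, miss rates 0.10 and 0.01, penalty 50 cycles, hit time 1 cycle) the two functions reproduce CPI = 3.1 and AMAT of about 2.54 cycles; the stall share is (CPI - 1.1)/CPI.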
ECE7660 Memory.25 Fall 2005
Impact on Time: Vector Product Example
° Machine
• Processor with 64KB direct-mapped cache, 16 B line size
° Performance
• Good case: 24 cycles / element
• Bad case: 66 cycles / element
float dot_prod(float x[1024], float y[1024])
{
    float sum = 0.0;
    int i;
    for (i = 0; i < 1024; i++)
        sum += x[i]*y[i];
    return sum;
}
ECE7660 Memory.26 Fall 2005
Thrashing Example
• Access one element from each array per iteration
[Figure: arrays x[0..1023] and y[0..1023]; each cache line holds four consecutive elements, e.g. x[0] to x[3], ..., x[1020] to x[1023], and likewise for y]
ECE7660 Memory.27 Fall 2005
Thrashing Example: Good Case
[Figure: x[0] to x[3] occupy one cache line and y[0] to y[3] occupy a different cache line]
° Access Sequence
• Read x[0]
- x[0], x[1], x[2], x[3] loaded
• Read y[0]
- y[0], y[1], y[2], y[3] loaded
• Read x[1]
- Hit
• Read y[1]
- Hit
• • • •
• 2 misses / 8 reads
° Analysis
• x[i] and y[i] map to different cache lines
• Miss rate = 25%
- Two memory accesses / iteration
- On every 4th iteration have two misses
° Timing
• 10 cycle loop time
• 28 cycles / cache miss
• Average time / iteration = 10 + 0.25 * 2 * 28 = 24 cycles
ECE7660 Memory.28 Fall 2005
Thrashing Example: Bad Case
[Figure: x[0] to x[3] and y[0] to y[3] compete for the same cache line]
° Access Pattern
• Read x[0]
- x[0], x[1], x[2], x[3] loaded
• Read y[0]
- y[0], y[1], y[2], y[3] loaded
• Read x[1]
- x[0], x[1], x[2], x[3] loaded
• Read y[1]
- y[0], y[1], y[2], y[3] loaded
• 8 misses / 8 reads
° Analysis
• x[i] and y[i] map to same cache lines
• Miss rate = 100%
- Two memory accesses / iteration
- On every iteration have two misses
° Timing
• 10 cycle loop time
• 28 cycles / cache miss
• Average time / iteration = 10 + 1.0 * 2 * 28 = 66 cycles
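Whether the loop thrashes depends only on where the two arrays sit in memory. The sketch below counts misses of the dot product on a 64 KB direct mapped cache with 16 B lines, assuming 4-byte floats, a cold cache, and that only the reads of x[i] and y[i] touch the cache; the base addresses are hypothetical inputs, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 16u
#define NUM_LINES  4096u  /* 64 KB / 16 B per line */

/* Misses over n iterations of sum += x[i]*y[i], given the base byte
   addresses of x and y. */
size_t dot_prod_misses(unsigned long x_base, unsigned long y_base, size_t n)
{
    unsigned long tags[NUM_LINES];
    unsigned char valid[NUM_LINES] = {0};  /* cold cache */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long addrs[2] = { x_base + 4 * i, y_base + 4 * i };
        for (int j = 0; j < 2; j++) {
            unsigned long line = addrs[j] / LINE_BYTES;
            unsigned long idx = line % NUM_LINES;  /* cache index */
            unsigned long tag = line / NUM_LINES;  /* cache tag */
            if (!valid[idx] || tags[idx] != tag) {
                valid[idx] = 1;                    /* fill the line */
                tags[idx] = tag;
                misses++;
            }
        }
    }
    return misses;
}
```

Placing y right after x (`dot_prod_misses(0, 4096, 1024)`) gives 512 misses in 2048 reads, the 25% good case. Placing y exactly one cache size away (`dot_prod_misses(0, 65536, 1024)`) makes x[i] and y[i] share an index and every read misses. Padding one array by a single line is enough to move from the bad case to the good one.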
ECE7660 Memory.29 Fall 2005
General Approaches to Improve Cache Perf.
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Average Memory Access time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
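[Figure: pipelined datapath with PC, I-Cache, IR, A, B, R, T, D Cache, and pipeline instruction registers IRex, IRm, IRwb; miss signals from the I-Cache and D Cache, invalid]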
ECE7660 Memory.30 Fall 2005
Cache Design to Reduce Miss Rate: Four Key Questions
° Q1: Where can a block be placed in the upper level? (Block placement)
• Fully Associative, Set Associative, Direct Mapped
° Q2: How is a block found if it is in the upper level? (Block identification)
• Tag/Block
° Q3: Which block should be replaced on a miss? (Block replacement)
• Random, LRU
° Q4: What happens on a write? (Write strategy)
• Write Back or Write Through (with Write Buffer)
ECE7660 Memory.31 Fall 2005
Where can a block be placed?
° Block 12 placed in 8 block cache:
• Fully associative, direct mapped, 2-way set associative
• S.A. Mapping = Block Number Modulo Number Sets
° Fully associative: block 12 can go anywhere
° Direct mapped: block 12 can go only into block 4 (12 mod 8)
° Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: the 8-block cache (block no. 0 to 7; Set 0 to Set 3 in the 2-way case) shown three times, above the block-frame addresses 0 to 31 of memory]
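All three placement schemes are the same modulo rule with a different number of sets. A minimal sketch, assuming the cache is described by its block count and associativity; `set_index` is an illustrative name.

```c
#include <assert.h>

/* Set that a block maps to: Block Number modulo Number of Sets.
   ways == 1 is direct mapped; ways == num_blocks is fully
   associative (a single set, so the result is always 0). */
unsigned set_index(unsigned block_no, unsigned num_blocks, unsigned ways)
{
    assert(ways > 0 && num_blocks % ways == 0);
    unsigned num_sets = num_blocks / ways;
    return block_no % num_sets;
}
```

For block 12 in the 8-block cache: `set_index(12, 8, 1)` is 4 (direct mapped), `set_index(12, 8, 2)` is 0 (2-way set associative), and `set_index(12, 8, 8)` is 0 because the fully associative cache has only one set.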
ECE7660 Memory.32 Fall 2005
And yet Another Extreme Example: Fully Associative
°Fully Associative Cache
• Forget about the Cache Index
• Compare the Cache Tags of all cache entries in parallel
• Example: with Block Size = 32 B, we need N 27-bit comparators (one per cache entry)
°By definition: Conflict Miss = 0 for a fully associative cache
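[Figure: address bits 31-5 are the Cache Tag (27 bits long), bits 4-0 the Byte Select (ex: 0x01). The address tag is compared (X) against the Cache Tag of every entry in parallel; each entry has a Valid Bit and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...)]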
ECE7660 Memory.33 Fall 2005
A Two-way Set Associative Cache
°N-way set associative: N entries for each Cache Index
• N direct mapped caches operate in parallel
°Example: Two-way set associative cache
• Cache Index selects a “set” from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag result
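[Figure: two-way set associative cache. The Cache Index selects one set; each way supplies its Valid bit, Cache Tag, and Cache Data (Cache Block 0, ...). Two comparators compare the stored tags with the Adr Tag; their outputs (Sel0, Sel1) drive the Mux that selects the Cache Block and are ORed to produce Hit]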
ECE7660 Memory.34 Fall 2005
Disadvantage of Set Associative Cache
°N-way Set Associative Cache versus Direct Mapped Cache:
• N comparators vs. 1
• Extra MUX delay for the data
• Data comes AFTER Hit/Miss
°In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
• Possible to assume a hit and continue. Recover later if miss.
[Figure: the two-way set associative cache datapath: per-way tag Compare against the Adr Tag, Mux selecting the Cache Block via Sel0/Sel1, OR producing Hit]
ECE7660 Memory.35 Fall 2005
Sources of Cache Misses
°Compulsory (cold start, first reference): first access to a block
• “Cold” fact of life: not a whole lot you can do about it
°Conflict (collision):
• Multiple memory locations mapped to the same cache location
• Solution 1: increase cache size
• Solution 2: increase associativity
°Capacity:
• Cache cannot contain all blocks accessed by the program
• Solution: increase cache size
°Invalidation: other processes (e.g., I/O, other processors) update memory
ECE7660 Memory.36 Fall 2005
The Need to Make a Decision!
°Direct Mapped Cache:
• Each memory location can only be mapped to 1 cache location
• No need to make any decision :-)
- Current item replaces the previous item in that cache location
°N-way Set Associative Cache:
• Each memory location has a choice of N cache locations
°Fully Associative Cache:
• Each memory location can be placed in ANY cache location
°Cache miss in an N-way Set Associative or Fully Associative Cache:
• Bring in new block from memory
• Throw out a cache block to make room for the new block
• ====> We need to make a decision on which block to throw out!
ECE7660 Memory.37 Fall 2005
Cache Block Replacement Policy
°Random Replacement:
• Hardware randomly selects a cache item and throws it out
°Least Recently Used:
• Hardware keeps track of the access history
• Replace the entry that has not been used for the longest time
°First-in, first out (FIFO): replace the oldest block
°Example of a Simple “Pseudo” Least Recently Used Implementation:
• Assume 64 Fully Associative Entries
• Hardware replacement pointer points to one cache entry
• Whenever an access is made to the entry the pointer points to:
- Move the pointer to the next entry
• Otherwise: do not move the pointer
[Figure: Entry 0, Entry 1, ..., Entry 63, with the Replacement Pointer pointing at one entry]
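The replacement-pointer scheme described on this slide needs very little state. The following is a minimal sketch, assuming 64 fully associative entries and that the victim on a miss is the entry the pointer currently designates; `repl_state`, `repl_touch`, and `repl_victim` are hypothetical names.

```c
#include <assert.h>

#define NUM_ENTRIES 64u

/* The only replacement state: one pointer into the entry array. */
typedef struct {
    unsigned ptr;
} repl_state;

/* Call on every access to entry e. The pointer moves on only when the
   accessed entry is the one it points to; otherwise it stays put. */
void repl_touch(repl_state *s, unsigned e)
{
    assert(e < NUM_ENTRIES);
    if (e == s->ptr)
        s->ptr = (s->ptr + 1) % NUM_ENTRIES;
}

/* Entry to throw out on a miss. */
unsigned repl_victim(const repl_state *s)
{
    return s->ptr;
}
```

The effect is that the pointer never rests on an entry that was just used, so the most recently accessed entry is never the victim. That is weaker than true LRU, which would need the full access history, but it costs a single 6-bit register.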
ECE7660 Memory.38 Fall 2005
Cache Write Policy: Write Through versus Write Back
°Cache read is much easier to handle than cache write:
• Instruction cache is much easier to design than data cache
°Cache write:
• How do we keep data in the cache and memory consistent?
°Two options (decision time again :-)
• Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss.
- Need a “dirty” bit for each cache block, so as to reduce the frequency of writing back blocks
- Greatly reduces the memory bandwidth requirement
- Control can be complex
• Write Through: write to cache and memory at the same time.
- The cache is always clean
- What!!! How can this be? Isn’t memory too slow for this?
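The bandwidth difference between the two policies can be seen by counting memory writes for a sequence of stores. This is a sketch under stated assumptions: a 4-block direct mapped cache, stores identified by block number only, write allocate on a miss, and a final flush of the remaining dirty blocks so that both policies leave memory up to date. None of these parameters come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WB_BLOCKS 4u

/* Write through: every store is also written to memory. */
size_t write_through_mem_writes(const unsigned *blocks, size_t n)
{
    (void)blocks;
    return n;
}

/* Write back: memory is written only when a dirty block is replaced,
   plus one write per dirty block left at the end (assumed flush). */
size_t write_back_mem_writes(const unsigned *blocks, size_t n)
{
    bool valid[WB_BLOCKS] = {false};
    bool dirty[WB_BLOCKS] = {false};
    unsigned tag[WB_BLOCKS] = {0};
    size_t writes = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned idx = blocks[i] % WB_BLOCKS;
        unsigned t = blocks[i] / WB_BLOCKS;
        if (valid[idx] && tag[idx] != t && dirty[idx])
            writes++;            /* replacing a dirty block: write it back */
        valid[idx] = true;
        tag[idx] = t;
        dirty[idx] = true;       /* the store modifies only the cache */
    }
    for (unsigned k = 0; k < WB_BLOCKS; k++)
        if (valid[k] && dirty[k])
            writes++;            /* final flush */
    return writes;
}
```

For the store sequence to blocks 0, 0, 0, 4, 0 the write-through count is 5 and the write-back count is 3; for four stores to the same block it is 4 versus 1. Repeated stores to a resident block are where the dirty bit pays off.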
ECE7660 Memory.39 Fall 2005
Write Buffer for Write Through
°Write stall: the time when the CPU is waiting for writes to complete
°A Write Buffer is needed between the Cache and Memory
• Processor: writes data into the cache and the write buffer
• Memory controller: writes contents of the buffer to memory
°Write buffer is just a FIFO:
• Typical number of entries: 4
• Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
°Memory system designer’s nightmare:
• Store frequency (w.r.t. time) > 1 / DRAM write cycle
• Write buffer saturation
[Figure: Processor -> Cache; Processor -> Write Buffer -> DRAM]
ECE7660 Memory.40 Fall 2005
Write Buffer Saturation
°Store frequency (w.r.t. time) > 1 / DRAM write cycle
• If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
- Store buffer will overflow no matter how big you make it
- The CPU Cycle Time <= DRAM Write Cycle Time
°Solution for write buffer saturation:
• Use a write back cache
• Install a second level (L2) cache:
[Figure: without L2: Processor, Cache, Write Buffer -> DRAM; with L2: Processor, Cache, Write Buffer -> L2 Cache -> DRAM]
ECE7660 Memory.41 Fall 2005
Write Allocate versus Not Allocate
Assume: a 16-bit write to memory location 0x0 that causes a Write Miss
• Do we read in the rest of the block (Byte 2, 3, ... 31)?
Yes: Write Allocate
No: Write Not Allocate
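[Figure: the 1 KB direct mapped cache with 32 B blocks. Address bits 31-10 are the Cache Tag (example: 0x00), bits 9-5 the Cache Index (ex: 0x00), bits 4-0 the Byte Select (ex: 0x00); 32 entries (Cache Index 0 to 31), each with a Valid Bit, a Cache Tag, and 32 bytes of Cache Data: Byte 0 to Byte 31, Byte 32 to Byte 63, ..., Byte 992 to Byte 1023]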
ECE7660 Memory.42 Fall 2005
How important are caches?
(Figure from Jim Keller, Compaq Corp.)
•21264 Floorplan
•Register files in middle of execution units
•64k instr cache
•64k data cache
•Caches take up a large fraction of the die
ECE7660 Memory.43 Fall 2005
The Processor Itself
ECE7660 Memory.44 Fall 2005
Summary of Cache Basics
°The Principle of Locality: Temporal Locality vs Spatial Locality
°Four Questions For Any Cache
• Where to place in the cache
• How to locate a block in the cache
• Replacement
• Write policy: Write through vs Write back
- Write miss: Write allocate vs Write not allocate
°Three Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start misses.
• Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
• Capacity Misses: increase cache size