
1

Improving Direct-Mapped Cache Performance by the Addition

of a Small Fully-Associative Cache and Prefetch Buffers

By Sreemukha Kandlakunta and Phani Shashank Nagari

2

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

3

Base Line Design

4

Base Line Design Contd..

• Size of on-chip caches usually varies

• High-speed technologies result in smaller on-chip caches

• L1 caches are assumed to be direct-mapped

• L1 cache line sizes: 16-32 B

• L2 cache line sizes: 128-256 B

5

Parameters assumed

• Processor speed: 1000 MIPS

• L1 instruction and data caches

Size: 4 KB

Line size: 16 B

• L2 instruction and data caches

Size: 1 MB

Line size: 128 B

6

Parameters assumed Contd..

Miss Penalty

L1 miss: 24 instruction times

L2 miss: 320 instruction times
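
To make these numbers concrete, a rough stall-time model can be written in a few lines of Python. The form of the model, the memory references per instruction, and the example miss rates below are illustrative assumptions, not figures from the paper.

# Rough stall model (assumed form): each instruction takes one instruction
# time, and every miss adds the penalty quoted on the slide.

L1_MISS_PENALTY = 24      # instruction times, on an L1 miss
L2_MISS_PENALTY = 320     # instruction times, on an L2 miss

def time_per_instruction(l1_miss_rate, l2_miss_rate, refs_per_instr=1.3):
    """Estimated instruction times per instruction.

    l1_miss_rate: fraction of memory references that miss in L1.
    l2_miss_rate: fraction of L1 misses that also miss in L2.
    refs_per_instr: assumed memory references (fetch + data) per instruction.
    """
    l1_stalls = refs_per_instr * l1_miss_rate * L1_MISS_PENALTY
    l2_stalls = refs_per_instr * l1_miss_rate * l2_miss_rate * L2_MISS_PENALTY
    return 1.0 + l1_stalls + l2_stalls

# With assumed miss rates of 5% (L1) and 10% (L2), roughly 4.6 instruction
# times are spent per instruction, i.e. most of the time goes to memory stalls.
print(time_per_instruction(0.05, 0.10))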

7

Test Program Characteristics

8

Base Line system L1 Cache Miss Rates

9

Base Line Design Performance

10

Inferences

• Potential performance loss in memory hierarchy

• Focus on improving the performance of the memory hierarchy rather than CPU performance

• H/W techniques are used to improve the performance of the baseline memory hierarchy

11

How Direct Mapped Cache works

(Figure: main memory mapped onto a direct-mapped cache with 8 blocks; each cache entry holds a tag and a data block.)

How to identify?

• Match the tag

• Tag 01 in block 001 means address 01001 is there

How to search?

• 00101, 01101, 10101, 11101 all map to block 101
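
The lookup shown in the figure can be sketched in a few lines of Python. This is only an illustration of the 8-block example (not code from the paper): the low three address bits select the block and the remaining bits form the tag.

NUM_BLOCKS = 8                             # the 8-block cache in the figure

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_BLOCKS    # one tag per block

    def access(self, addr):
        index = addr % NUM_BLOCKS          # low bits select the block
        tag = addr // NUM_BLOCKS           # remaining high bits are the tag
        hit = (self.tags[index] == tag)
        self.tags[index] = tag             # a miss replaces the resident line
        return hit

cache = DirectMappedCache()
# 00101, 01101, 10101 and 11101 all map to block 101 and evict one another,
# so cycling through them never hits.
for addr in (0b00101, 0b01101, 0b10101, 0b11101, 0b00101):
    print(format(addr, "05b"), "hit" if cache.access(addr) else "miss")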

12

How Fully-associative Cache works

(Figure: main memory mapped onto a fully-associative cache with 8 blocks; any memory block can be placed in any cache block.)

Where to search?

• Every block in the cache

• Very expensive
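
For contrast, a fully-associative lookup has no index bits, so every resident tag must be compared. The sketch below is illustrative; the LRU replacement policy is an assumption (consistent with the conflict-miss definition used later), not something stated on this slide.

from collections import OrderedDict

class FullyAssociativeCache:
    def __init__(self, num_blocks=8):
        self.num_blocks = num_blocks
        self.lines = OrderedDict()          # resident addresses, oldest first

    def access(self, addr):
        if addr in self.lines:              # conceptually: compare every tag
            self.lines.move_to_end(addr)    # mark as most recently used
            return True
        if len(self.lines) >= self.num_blocks:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[addr] = None
        return False

cache = FullyAssociativeCache()
# The same four addresses no longer conflict: after the first pass they all
# fit in the 8 blocks, so the repeated access hits.
for addr in (0b00101, 0b01101, 0b10101, 0b11101, 0b00101):
    print(format(addr, "05b"), "hit" if cache.access(addr) else "miss")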

13

Cache Misses

• Three Kinds

- Instruction read miss: causes the most delay; the CPU has to wait until the instruction is fetched from DRAM

- Data read miss: causes less delay; instructions not dependent on the missing data can continue execution until the data is returned from DRAM

- Data write miss: causes the least delay; the write can be queued and the CPU can continue until the queue is full

14

Types of Misses

• Conflict Misses

Reduced by miss caching and victim caching

• Compulsory Misses

• Capacity Misses

Both are reduced by prefetching:

Stream Buffers

Multi-way Buffers

15

Conflict Miss

• Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement

• If an item has been evicted from the cache and a later miss is to that same item, the miss is a conflict miss (see the sketch below)
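
This definition translates directly into a measurement. The sketch below (illustrative Python, 8 blocks for brevity) replays one reference stream through both organizations and counts the misses that a fully-associative LRU cache of the same size would have avoided.

from collections import OrderedDict

def count_conflict_misses(addresses, num_blocks=8):
    dm_tags = [None] * num_blocks
    fa_lines = OrderedDict()
    conflict = 0
    for addr in addresses:
        # direct-mapped access
        idx, tag = addr % num_blocks, addr // num_blocks
        dm_hit = dm_tags[idx] == tag
        dm_tags[idx] = tag
        # fully-associative LRU access
        fa_hit = addr in fa_lines
        if fa_hit:
            fa_lines.move_to_end(addr)
        else:
            if len(fa_lines) >= num_blocks:
                fa_lines.popitem(last=False)
            fa_lines[addr] = None
        # a D-M miss that the F-A LRU cache would have hit is a conflict miss
        if fa_hit and not dm_hit:
            conflict += 1
    return conflict

# Two addresses that collide in block 101 keep evicting each other: prints 6.
print(count_conflict_misses([0b00101, 0b01101] * 4))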

16

Conflict Misses Contd..

• Conflict misses account for

– 20-40% of overall D-M misses

– 39% of L1-D$ misses

– 29% of L1-I$ misses

17

Conflict Misses, 4 KB I&D Caches

18

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

19

Miss Caching

• Small, fully-associative on-chip cache

• On a miss, data is returned to both:

- the direct-mapped cache

- the small miss cache (where it replaces the LRU entry)

• The processor probes both the D-M cache and the miss cache (sketched below)
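
A minimal sketch of this behavior, assuming line-granularity addresses and LRU replacement in the miss cache (simplifications for illustration, not the paper's hardware), is shown below.

from collections import OrderedDict

class MissCacheSystem:
    def __init__(self, dm_blocks=256, miss_entries=4):
        self.dm_tags = [None] * dm_blocks
        self.dm_blocks = dm_blocks
        self.miss_cache = OrderedDict()          # line address -> data
        self.miss_entries = miss_entries

    def access(self, line_addr):
        idx, tag = line_addr % self.dm_blocks, line_addr // self.dm_blocks
        if self.dm_tags[idx] == tag:
            return "L1 hit"
        # L1 miss: probe the miss cache as well
        if line_addr in self.miss_cache:
            self.miss_cache.move_to_end(line_addr)
            self.dm_tags[idx] = tag              # refill L1 from the miss cache
            return "miss-cache hit"
        # true miss: fetch from L2 into both the D-M cache and the miss cache
        if len(self.miss_cache) >= self.miss_entries:
            self.miss_cache.popitem(last=False)  # replace the LRU entry
        self.miss_cache[line_addr] = None
        self.dm_tags[idx] = tag
        return "L2 access"

sys = MissCacheSystem(dm_blocks=8, miss_entries=2)
for line in (5, 13, 5, 13):          # two lines that conflict in the D-M cache
    print(line, sys.access(line))    # the repeats are caught by the miss cache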

20

Miss cache Organization

21

Observations

• Hits in the miss cache avoid the long off-chip miss penalty

• More data conflict misses are removed than instruction conflict misses

- Instructions within a procedure do not conflict as long as the procedure size is smaller than the cache size

- When a procedure calls another procedure mapped elsewhere, the two can conflict, producing instruction conflict misses

22

Miss Cache Performance

• For a 4 KB D$:

- A miss cache of 2 entries can remove 25% of D$ conflict misses, i.e. 13% of overall D$ misses

- A miss cache of 4 entries can remove 36% of D$ conflict misses, i.e. 18% of overall D$ misses

• After 4 entries the improvement is minor

23

Conflict Misses removed by Miss caching

24

Overall Cache Misses removed by Miss Caching

25

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

26

Victim Caching

• Duplicating data in the miss cache wastes storage space

• Instead, the F-A cache is loaded with the victim line evicted from the D-M cache

• When data misses in the D-M cache but hits in the victim cache, the contents of the two lines are swapped (sketched below)
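
The difference from miss caching is easiest to see in code. The sketch below is illustrative (line-granularity addresses, LRU victim replacement assumed): the fully-associative cache receives the evicted victim, and a victim-cache hit swaps the two lines.

from collections import OrderedDict

class VictimCacheSystem:
    def __init__(self, dm_blocks=256, victim_entries=4):
        self.dm_tags = [None] * dm_blocks
        self.dm_blocks = dm_blocks
        self.victims = OrderedDict()             # evicted line addresses
        self.victim_entries = victim_entries

    def access(self, line_addr):
        idx, tag = line_addr % self.dm_blocks, line_addr // self.dm_blocks
        resident = self.dm_tags[idx]
        if resident == tag:
            return "L1 hit"
        evicted = None if resident is None else resident * self.dm_blocks + idx
        if line_addr in self.victims:
            # swap: the requested line moves to L1, the victim replaces it
            del self.victims[line_addr]
            if evicted is not None:
                self.victims[evicted] = None
            self.dm_tags[idx] = tag
            return "victim-cache hit"
        # true miss: the evicted line becomes the new victim
        if evicted is not None:
            if len(self.victims) >= self.victim_entries:
                self.victims.popitem(last=False) # LRU replacement
            self.victims[evicted] = None
        self.dm_tags[idx] = tag
        return "L2 access"

# A single-entry victim cache catches the same tight conflict that needed
# two miss-cache entries above.
sys = VictimCacheSystem(dm_blocks=8, victim_entries=1)
for line in (5, 13, 5, 13):
    print(line, sys.access(line))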

27

Victim Cache Organization

28

Victim Cache Performance

• A victim cache with just one line performs better than a miss cache with 2 lines

• Significant improvement in the performance of all the benchmark programs

29

Conflict Misses removed by Victim Caching

30

Overall Cache Misses removed by Victim Caching

31

Comparison of Miss cache and Victim cache performances

32

Effect of D-M cache size on Victim cache performance

• Smaller D-M caches benefit the most from the addition of a victim cache

• As the D-M cache size increases, the likelihood that a conflict miss is removed by the victim cache decreases

• As the percentage of conflict misses decreases, the percentage of misses removed by the victim cache decreases

33

Victim cache: vary direct-map cache size

34

Effect of Line Size on Victim Cache Performance

• As the line size increases, the number of conflict misses increases

• As a result, the percentage of misses removed by the victim cache increases

35

Victim cache: vary data cache line size

36

Victim caches and L2 Caches

• Victim caches are also useful for L2 caches because of their large line sizes

• Using an L1 victim cache can also reduce the number of L2 conflict misses

37

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

38

Reducing Capacity and Compulsory Misses

• Compulsory Misses

First reference to a piece of data

• Capacity Misses

Due to insufficient cache size

39

Prefetching Algorithms

• Prefetch always: an access to line i triggers a prefetch of line i+1

• Prefetch on miss: a reference to line i causes a prefetch of line i+1 only if the reference to line i misses

• Tagged prefetch: a tag bit is set to 0 when a block is prefetched and to 1 when the block is first used; that first use triggers a prefetch of the next block (see the sketch below)
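
The three policies differ only in what triggers the prefetch of line i+1. The sketch below is illustrative: it models just the trigger conditions, assumes prefetches complete instantly, and uses an unbounded set as the cache.

# 'tag0' holds lines that were prefetched but not yet used (tag bit = 0).

def simulate(refs, policy):
    cache, tag0 = set(), set()
    misses = prefetches = 0
    for i in refs:
        miss = i not in cache
        misses += miss
        cache.add(i)
        first_use_of_prefetched = i in tag0
        tag0.discard(i)                      # using a line sets its tag bit to 1
        trigger = (policy == "always"
                   or (policy == "on_miss" and miss)
                   or (policy == "tagged" and (miss or first_use_of_prefetched)))
        if trigger and i + 1 not in cache:
            cache.add(i + 1)                 # bring the next line in early
            tag0.add(i + 1)                  # it arrives with its tag bit at 0
            prefetches += 1
    return misses, prefetches

# On a purely sequential stream, prefetch-on-miss stays only half a step
# ahead (every other line still misses), while tagged prefetch keeps up
# with prefetch-always.
for policy in ("always", "on_miss", "tagged"):
    print(policy, simulate(list(range(16)), policy))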

40

Limited Time For Prefetch

41

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

42

Stream Buffers

• Prefetched lines are placed in the buffer rather than the cache, to avoid polluting the cache

• Each entry consists of a tag, an available bit, and a data line

• If a reference misses in the cache but hits in the buffer, the cache can be reloaded from the buffer

• When a line is moved out of the SB, the remaining entries shift up and the next successive line is prefetched

43

Stream Buffer Mechanism

44

Stream Buffer Mechanism Contd..

• On a miss:

- Prefetch successive lines

- Enter the tag for each address into the SB

- Set the available bit to false

• On return of the prefetched data:

- Place the data in the entry with its tag

- Set the available bit to true
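
A single sequential stream buffer can be sketched as below (illustrative Python, 4-entry FIFO assumed; the available bit is tracked, but for brevity lookups do not model the delay while prefetched data is still in flight).

from collections import deque

class StreamBuffer:
    """FIFO of (line address, available bit); only the head is compared."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()

    def restart(self, miss_addr):
        # Flush the buffer and start prefetching the lines after the miss.
        self.entries = deque((miss_addr + k, False)
                             for k in range(1, self.depth + 1))

    def fill(self, line_addr):
        # Prefetched data has returned: set that entry's available bit.
        self.entries = deque((a, avail or a == line_addr)
                             for a, avail in self.entries)

    def lookup(self, miss_addr):
        # Called on a cache miss. A head match supplies the line to the
        # cache, shifts the queue up, and prefetches one more line.
        if self.entries and self.entries[0][0] == miss_addr:
            self.entries.popleft()
            next_addr = (self.entries[-1][0] if self.entries else miss_addr) + 1
            self.entries.append((next_addr, False))
            return True
        self.restart(miss_addr)        # non-sequential miss: start over
        return False

sb = StreamBuffer()
sb.restart(10)                     # a miss on line 10 starts the stream
for line in (11, 12, 13, 20):      # sequential hits, then a break in the pattern
    print(line, sb.lookup(line))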

45

Stream Buffer Performance

• Most instruction references break the purely sequential access pattern by the time the 6th successive line is fetched

• Sequential runs of data references end even sooner

• As a result, stream buffers are better at removing I$ misses than D$ misses

46

Sequential SB performance

Limitations of Stream Buffers

• Stream buffers considered are FIFO queues

• Only the head of the queue has a tag comparator

• Elements must be removed strictly in sequence

• Works only for sequential line misses

• Fails for a non-sequential line miss

47

48

Outline

Base Line Design

Reducing Conflict Misses

Miss Caching

Victim Caching

Reducing Capacity and Compulsory Misses

Stream Buffers

Multi-way Stream Buffers

49

Multi-way stream buffers

• A single SB could remove 72% of I$ misses but only 25% of D$ misses

• A multi-way SB was simulated to improve the performance of SBs for data references

• Consists of 4 SBs in parallel

• On a miss in all buffers, the least recently hit SB is cleared and begins fetching from the miss address (sketched below)
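
A sketch of the multi-way version is shown below (illustrative: 4 buffers of 4 entries, availability bits omitted for brevity). All buffer heads are probed in parallel, and a miss in every buffer reallocates the least recently hit one.

from collections import deque

class MultiWayStreamBuffer:
    def __init__(self, ways=4, depth=4):
        self.depth = depth
        self.buffers = [deque() for _ in range(ways)]
        self.lru = list(range(ways))          # least recently hit buffer first

    def _touch(self, i):
        self.lru.remove(i)
        self.lru.append(i)                    # mark buffer i as most recently hit

    def lookup(self, miss_addr):
        for i, buf in enumerate(self.buffers):
            if buf and buf[0] == miss_addr:   # only the heads are compared
                buf.popleft()                 # the line moves into the cache
                buf.append((buf[-1] if buf else miss_addr) + 1)
                self._touch(i)
                return True
        victim = self.lru[0]                  # miss in all: clear the LRU buffer
        self.buffers[victim] = deque(miss_addr + k
                                     for k in range(1, self.depth + 1))
        self._touch(victim)
        return False

mw = MultiWayStreamBuffer()
# Two interleaved sequential streams each settle into their own buffer
# instead of repeatedly flushing a single stream buffer.
for addr in (100, 200, 101, 201, 102, 202):
    print(addr, mw.lookup(addr))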

50

Multi-way stream Buffer Design

51

Observations

• Performance of the instruction stream remains virtually unchanged

• Significant improvement in the performance of the data stream

• Removes 43% of the misses for the test programs, i.e. almost twice the performance of a single SB

52

Four-way SB performance

53

SB Performance Vs Cache size

54

SB Performance Vs Line size

55

Performance Evaluation

• Over the set of 6 benchmarks, on average only 2.5% of the 4 KB D-M D$ misses that hit in a 4-entry victim cache also hit in a 4-way SB

• The combination of stream buffers and victim caches reduces the L1 miss rate to less than half of that of the baseline system

• Resulting in an average of 143% improvement in system performance for the 6 benchmarks

56

Improved System Performance

57

Future Enhancements

• The study has concentrated on applying these H/W techniques to L1 caches

• Application of these techniques to L2 caches forms an interesting area of future work

• The performance of victim caching and stream buffers can be investigated for OS design and for multiprogramming workloads

58

Conclusions

• Miss caches remove tight conflicts where several addresses map to the same cache line

• Victim caches are an improvement on miss caching that save the victim of the cache miss instead of a duplicate of the missed line

• Stream buffers prefetch the cache lines following a missed cache line

• Multi-way stream buffers are a set of stream buffers that can do concurrent prefetches

59
