an introduction to cache design

An Introduction to

Cache Design

112/04/20 \course\cpeg323-08F\Topic7a 1


Cache

A safe place for hiding and storing things.

Webster Dictionary


Even with the inclusion of cache, almost all CPUs are

still mostly strictly limited by the cache access-time:

In most cases, if the cache access time were decreased,

the machine would speed up accordingly.

- Alan Smith -

Even more so for MPs!


While one can imagine reference patterns that can defeat

existing cache memory designs, it is the author’s

experience that cache memories improve performance for

any program or workload which actually does useful

computation.


Optimizing the design of a cache memory

Generally has four aspects:

1. Maximizing the probability of finding a memory reference’s target in the cache (the hit ratio).

2. Minimizing the time to access information that is indeed in the cache (access time).

3. Minimizing the delay due to a miss.

4. Minimizing the overheads of updating main memory, maintaining cache coherence, etc.


Key Factor in Design Decision for VM and Cache

$\frac{\text{Access-time}_{\text{MainMem}}}{\text{Access-time}_{\text{Cache}}} = 4 \sim 20$

$\frac{\text{Access-time}_{\text{SecondaryMem}}}{\text{Access-time}_{\text{MainMem}}} = 10^4 \sim 10^6$

Cache control is usually implemented in hardware!!


Technology in 1990s:

Memory Technology    Typical Access Time            $ per Mbyte
SRAM                 10-20 ns                       200-400
DRAM                 90-120 ns                      50-100
Magnetic disk        10,000,000 - 20,000,000 ns     2-5

Technology in 2000s?


Technology in 2004:

Memory Technology    Typical Access Time            $ per Gbyte
SRAM                 0.5 - 5 ns                     4,000 - 10,000
DRAM                 50 - 70 ns                     100 - 200
Magnetic disk        10,000,000 - 20,000,000 ns     0.5 - 2

Technology in 2008?

See P&H Fig. pg. 469 3rd Ed


Technology in 2008:

Memory Technology    Typical Access Time            $ per Gbyte
SRAM                 0.5 - 2.5 ns                   2,000 - 5,000
DRAM                 50 - 70 ns                     20 - 75
Magnetic disk        5,000,000 - 20,000,000 ns      0.2 - 2

See P&H Fig. pg. 453 4th Ed


Cache in Memory Hierarchy

[Figure: Processor, Cache, Main Memory, Secondary Memory]

Emerging Memory Device Technologies

Source: “Emerging Nanoscale Memory and Logic Devices: A Critical Assessment”, Hutchby et al., IEEE Computer, May 2008

Source: Kogge, Peter, ACS Productivity Workshop 2008


Four Questions for Classifying Memory Hierarchies:

The fundamental principles that drive all memory hierarchies allow us to use terms that transcend the levels we are talking about. These same principles allow us to pose four questions about any level of the hierarchy:


Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)



These questions will help us gain an understanding of the different tradeoffs demanded by the relationships of memories at different levels of a hierarchy.


Concept of Cache miss and Cache hit

Reference: ADDRESS 01173, DATA 30

Word in line: 0 1 2 3 4 5 6 7

TAGS     DATA
0117X    35, 72, 55, 30, 64, 23, 16, 14
7620X    11, 31, 26, 22, 55, …
3656X    71, 72, 44, 50, …
1741X    33, 35, 07, 65, ...
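The hit test in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: addresses are read as octal, the last octal digit (the X in the tag column) selects one of the 8 words in a line, and only the one line that the slide lists in full is modelled. The names `CACHE` and `lookup` are hypothetical.

```python
# Directory modelled on the slide: tag -> the eight data words of that line.
# Only the first line is fully listed on the slide, so only it appears here.
CACHE = {
    0o0117: [35, 72, 55, 30, 64, 23, 16, 14],
}


def lookup(address):
    """Return (hit, data) for a word address (assumed to be octal)."""
    # Assumption: the last octal digit is the word offset within the line,
    # and the remaining high-order digits form the tag.
    tag, offset = divmod(address, 8)
    line = CACHE.get(tag)
    if line is None:
        return False, None      # miss: no line with this tag in the directory
    return True, line[offset]   # hit: return the selected word


print(lookup(0o01173))  # tag 0117, word 3: the reference shown on the slide
print(lookup(0o01200))  # tag 0120 is not in the directory: a miss
```

With the slide's reference 01173, the tag 0117 matches and word 3 of that line holds 30, which agrees with the DATA value shown.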


Access Time

$t_{\mathrm{eff}}$ : effective cache access time

$t_{\mathrm{cache}}$ : cache access time

$t_{\mathrm{main}}$ : main memory access time

$h$ : hit ratio

$t_{\mathrm{eff}} = h\,t_{\mathrm{cache}} + (1-h)\,t_{\mathrm{main}}$


Example

Let $t_{\mathrm{cache}}$ = 10 ns (1-4 clock cycles)

$t_{\mathrm{main}}$ = 50 ns (8-32 clock cycles)

$h$ = 0.95

$t_{\mathrm{eff}}$ = ?

$t_{\mathrm{eff}} = 10 \times 0.95 + 50 \times 0.05 = 9.5 + 2.5 = 12$ ns
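The same calculation as a small Python helper, offered as a sketch for trying other values. The function name `effective_access_time` is an illustrative choice, not something from the slides.

```python
def effective_access_time(h, t_cache, t_main):
    """Effective access time: h * t_cache + (1 - h) * t_main.

    The result is in the same unit as t_cache and t_main.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return h * t_cache + (1.0 - h) * t_main


# Values from the example: 10 ns cache, 50 ns main memory, 95% hit ratio.
t_eff = effective_access_time(h=0.95, t_cache=10, t_main=50)
print(f"t_eff = {t_eff:.1f} ns")
```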


Hit Ratio

• Need high enough (say > 90%) to obtain

desirable level of performance

• Amplifying effect of changes

• Never a constant even for the same

machine


Sensitivity of Performance w.r.t h (hit ratio)

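$t_{\mathrm{eff}} = h\,t_{\mathrm{cache}} + (1-h)\,t_{\mathrm{main}}$

$= t_{\mathrm{cache}} \left[ h + (1-h)\,\frac{t_{\mathrm{main}}}{t_{\mathrm{cache}}} \right]$

$\approx t_{\mathrm{cache}} \left[ 1 + (1-h)\,\frac{t_{\mathrm{main}}}{t_{\mathrm{cache}}} \right]$

Since $\frac{t_{\mathrm{main}}}{t_{\mathrm{cache}}} \approx 10$, changes in $h$ are magnified by a factor of 10.

Conclusion: very sensitive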


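• Remember:

“$h \approx 1$”

• Example:

Let $h$ = 0.90

if $\Delta h$ = 0.05 (0.90 → 0.95)

then $(1 - h)$ = 0.05

then $t_{\mathrm{eff}} \approx t_{\mathrm{cache}} (1 + 0.5)$, compared with $t_{\mathrm{cache}} (1 + 1.0)$ at $h$ = 0.90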
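To see the magnification numerically, the approximation above can be tabulated for a few hit ratios. This is a minimal sketch that assumes the memory-to-cache access time ratio of 10 used on the slide; `relative_access_time` is a hypothetical name.

```python
def relative_access_time(h, ratio=10.0):
    """Approximate t_eff / t_cache as 1 + (1 - h) * ratio.

    ratio stands for t_main / t_cache (assumed to be 10, as on the slide).
    """
    return 1.0 + (1.0 - h) * ratio


for h in (0.90, 0.95, 0.99):
    print(f"h = {h:.2f}: t_eff ~ {relative_access_time(h):.2f} x t_cache")
```

Raising h from 0.90 to 0.95 moves the factor from 2.0 to 1.5, so a 0.05 change in the hit ratio changes the effective access time by half a cache access time.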


Basic Terminology

• Cache line (block) - size of a room

1 ~ 16 words

“A collection of contiguous data that are treated as a single entity of cache storage.”

• Cache directory - key of rooms

“The portion of a cache that holds the access keys that support associative access.”

Cache may use associativity to find the “right directory” by matching.


Cache Organization

• Fully associative: an element can be in any block

• Direct mapping : an element can be in only one

block.

• Set-associative : an element can be in any one of a group of blocks


An Example

Mem Size = 256 k words x 4 B/W = 1 MB

Cache Size = 2 k words = 8 k byte

Block Size = 16 word/block = 64 byte/block

So

Main M has 256K / 16 = 16 k blocks (16,384)

Cache has 2K / 16 = 128 blocks

addr = 18 bits + 2 bits (byte) = 20 bits

256 k words x 4 B/W = ($2^8 \times 2^{10}$) x $2^2$ bytes
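The derived quantities of this example can be checked with a short Python sketch. The constant names are illustrative only; the values are the ones given above.

```python
BYTES_PER_WORD = 4
WORDS_PER_BLOCK = 16
MEM_WORDS = 256 * 1024      # 256 k words of main memory
CACHE_WORDS = 2 * 1024      # 2 k words of cache

mem_blocks = MEM_WORDS // WORDS_PER_BLOCK        # blocks in main memory
cache_blocks = CACHE_WORDS // WORDS_PER_BLOCK    # block frames in the cache

# Bits needed to address every word, and every byte, of main memory.
word_addr_bits = (MEM_WORDS - 1).bit_length()
byte_addr_bits = (MEM_WORDS * BYTES_PER_WORD - 1).bit_length()

print(mem_blocks, cache_blocks, word_addr_bits, byte_addr_bits)
```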


Fully Associative

Feature

Any block in M can be in any block-

frame in cache.

All entries (block frame) are compared

simultaneously (by associative search).


A Special Case

Simplest example: a block = a word, so the entire memory word address becomes the “tag”.

Address: 027560 (bits 0 to 17)

Cache entry: tag 027560 (bits 0 to 17) | data

Adv: no thrashing (quick reorganizing); very “flexible”, with a higher probability that a block resides in the cache.

Disadv: overhead of associative search: cost + time


Fully associative cache organization
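A fully associative cache for the special case "a block = a word" can be sketched as follows. Several things here are assumptions rather than content of the slides: the class and method names, the FIFO choice of victim when the cache is full, and the stored value 42. Hardware compares all tags simultaneously; the loop below only stands in for that parallel search.

```python
class FullyAssociativeCache:
    """Toy fully associative cache in which each block is one word."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = []  # (tag, data) pairs, oldest first

    def read(self, address, memory):
        """Return (data, hit). The whole word address serves as the tag."""
        for tag, data in self.frames:       # stands in for the parallel compare
            if tag == address:
                return data, True
        data = memory[address]              # miss: fetch from main memory
        if len(self.frames) == self.num_frames:
            self.frames.pop(0)              # FIFO victim (assumption)
        self.frames.append((address, data))
        return data, False


memory = {0o027560: 42}                     # illustrative contents only
cache = FullyAssociativeCache(num_frames=4)
print(cache.read(0o027560, memory))         # first reference: miss
print(cache.read(0o027560, memory))         # second reference: hit
```

Because any block can occupy any frame, no two addresses compete for a fixed frame; the cost is the tag comparison against every entry.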


Direct Mapping

• No associative match

• From the memory address, the block frame in the cache where the block should be located is “directly” indexed. A comparison is then used to determine whether it is a miss or a hit.


Direct Mapping (cont’d)

Advantage: simplest

• Fast (less logic)

• Low cost (only one set comparator is needed, hence the cache can be in the form of standard memory)

Disadvantage: “thrashing”


Example

Since the cache has only 128 block frames, the degree of multiplexing is:

Main Memory Size / Cache Size = 16384 blocks / 128 frames = $2^{14} / 2^7 = 2^7$ blocks/frame

i.e. $2^7$ blocks “fall” in one block frame.

The high-order 7 bits of the block address are used as the tag; the low-order 7 bits are used for addressing the corresponding frame, or set of size 1.

Disadv: “thrashing”


Direct Mapping


Direct Mapping (cont’d)

Mapping (indexing): block addr mod (# of blocks in cache); in this case: mod ($2^7$)

Adv: the low-order log2(cache size in blocks) bits of the block address can be used for indexing
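The direct-mapped indexing rule can be sketched for the running example: a 20-bit byte address with 2 byte bits, 4 word bits, a 7-bit frame index, and a 7-bit tag. The function name `split_address` and the constant names are hypothetical.

```python
BYTE_BITS = 2    # 4 bytes per word
WORD_BITS = 4    # 16 words per block
INDEX_BITS = 7   # 128 block frames in the cache


def split_address(byte_addr):
    """Return (tag, index) for a byte address in the running example."""
    block_addr = byte_addr >> (WORD_BITS + BYTE_BITS)   # drop the block offset
    index = block_addr % (1 << INDEX_BITS)              # block addr mod 2^7
    tag = block_addr >> INDEX_BITS                      # high-order 7 bits
    return tag, index


# Memory blocks 5 and 133 (= 5 + 128) compete for the same frame.
print(split_address(5 << 6))
print(split_address(133 << 6))
```

The two blocks get the same index and differ only in the tag, so alternating references to them evict each other; this is the thrashing that the slides list as the disadvantage.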


Set-Associative

• A compromise between direct mapping and fully associative

• The cache is divided into S sets

S = 2, 4, 8, … (# of buildings available for indexing)

• If the cache has M blocks, then, all together, there are

E = M/S blocks/set

In our example, S = 128/2 = 64 sets


2-way set associative

The 6-bit set field indexes the right set; the 8-bit tag is then used for an associative match.


Associativity with 8-block cache


A 2-way set-associative organization:

Address fields (bits): Tag 8 | Set 6 | Word 4 | Byte 2

$2^{14}$ (16k) memory blocks / $2^6$ sets available for indexing = $2^8$ blocks/set

i.e. $2^8$ memory blocks per set of 2 blocks (2-way).

6 bits are used to index into the right “set”; the higher-order 8 bits are used as the tag.

Hence an associative match of the 8-bit tag with the tags of the 2 blocks is required.
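A small simulation of the 2-way organization, given as a sketch under assumptions: 64 sets, 2 blocks per set, 64-byte blocks, and LRU replacement within a set. The replacement policy and the names `SetAssociativeCache` and `access` are not taken from the slides.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache that tracks tags only (no data)."""

    def __init__(self, num_sets=64, ways=2, offset_bits=6):
        self.num_sets = num_sets
        self.ways = ways
        self.offset_bits = offset_bits          # word + byte bits of a block
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, byte_addr):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        block = byte_addr >> self.offset_bits
        index = block % self.num_sets           # 6-bit set index
        tag = block // self.num_sets            # 8-bit tag
        lines = self.sets[index]
        if tag in lines:                        # associative match in the set
            lines.move_to_end(tag)              # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)           # evict LRU block (assumption)
        lines[tag] = None
        return False


cache = SetAssociativeCache()
# Blocks 5 and 69 (= 5 + 64) map to set 5 and can reside there together.
print([cache.access(b << 6) for b in (5, 69, 5, 69)])
```

Unlike the direct-mapped case, two blocks that share an index can coexist; a third block with the same index would force a replacement within the set.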


Sector Mapping Cache

• Sector (IBM 360/85)

- 16 sectors x 16 blocks/sector

- 1 sector = consecutive multiple blocks

- Cache miss: sector replacement

- Valid bit: one block is moved on demand

• Example: address fields

Field           Bits       Width
Sector (tag)    0-6        7
Block           7-13       7
Word            14-17      4

A sector in memory can be in any sector in cache


Sector Mapping Cache


Sector mapping cache (cont’d)

Cache has 128 blocks / 16 blocks/sector = 8 sectors

Main memory has 16k blocks / 16 blocks/sector = 1K sectors


Example

See P&H Fig. 7.7 3rd Ed or 5.7 4th Ed


Total # of Bits in a Cache

Total # of bits = Cache size (in blocks) x (# of bits of a tag + # of bits of a block + # of bits in valid field)

For the example:

Direct mapped Cache with 4kB of data, 1-word blocks and 32 bit address

4kB = 1k words = $2^{10}$ words = $2^{10}$ blocks

# of bits of tag = 32 – (10 + 0 + 2) = 20

(from $2^{10}$ blocks, $2^0$ words/block, $2^2$ bytes/word)

Total # of bits = $2^{10}$ x (20 + 32*1 + 1) = 53 * $2^{10}$ = 53 kbits = 6.625 kBytes


Another example: FastMATH

Fast embedded microprocessor that uses the MIPS Architecture and a simple cache implementation.

16kB of data, 16-word blocks and 32 bit address

$2^{14}$ bytes * 1 word/4 bytes * 1 block/16 words = $2^{14}$ / ($2^2$ * $2^4$) = $2^8$ blocks

# of bits of tag = 32 – (8 + 4 + 2) = 18

(from $2^8$ blocks, $2^4$ words/block, $2^2$ bytes/word)

Total # of bits = $2^8$ x (18 + 32*16 + 1) = 531 * $2^8$ = 135,936 bits

= 132.75 kbits ≈ 16.6 kBytes
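Both bit-count examples follow the same recipe, which can be sketched as one Python function. It assumes a direct-mapped cache, one valid bit per block, and power-of-two sizes; the function name and parameter names are illustrative.

```python
def direct_mapped_total_bits(data_bytes, words_per_block,
                             address_bits=32, bytes_per_word=4):
    """Total storage bits (tag + data + valid) of a direct-mapped cache.

    Assumes all sizes are powers of two and one valid bit per block.
    """
    num_blocks = data_bytes // (words_per_block * bytes_per_word)
    index_bits = num_blocks.bit_length() - 1        # log2(# of blocks)
    word_bits = words_per_block.bit_length() - 1    # log2(words per block)
    byte_bits = bytes_per_word.bit_length() - 1     # log2(bytes per word)
    tag_bits = address_bits - (index_bits + word_bits + byte_bits)
    data_bits = 8 * bytes_per_word * words_per_block
    return num_blocks * (tag_bits + data_bits + 1)


small = direct_mapped_total_bits(4 * 1024, words_per_block=1)
fastmath = direct_mapped_total_bits(16 * 1024, words_per_block=16)
print(small, small / 8 / 1024)          # total bits and kBytes, 4 kB cache
print(fastmath, fastmath / 8 / 1024)    # total bits and kBytes, FastMATH
```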


Example FastMATH

See P&H Fig. 7.9 3rd Ed or 5.9 4th Ed
