
  • 1. Chapter 2/Appendix B: Memory Hierarchy

    • General Principles of Memory Hierarchies
    • Understanding Caches and their Design
    • Main Memory Organization
    • Virtual Memory

    2. Memory Hierarchy – What Is It

    • Key idea: Use layers of increasingly large, cheap and slow storage:
      – Try to keep as much access as possible in small, fast levels
      – Try to keep cost per byte almost as low as the slow, cheap levels

    • Effectiveness tied to two properties of mem access (illustrated in the sketch below):
      1. Spatial locality: data located close in memory to the present reference is likely to be used soon
      2. Temporal locality: recently accessed data is likely to be used again soon

    → If at least one of these is not present (random access), the memory hierarchy will provide no benefit

    • We will see that the memory hierarchy is used to implement protection schemes in multitasking OSes as well
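    A minimal C sketch of the two localities (the array a and length n are assumed inputs, not taken from the slides):

        /* Spatial locality: a[0], a[1], ... are adjacent in memory.
         * Temporal locality: sum and scale are reused every iteration. */
        double scaled_sum(const double *a, int n)
        {
            double sum = 0.0, scale = 2.0;
            for (int i = 0; i < n; i++)
                sum += scale * a[i];
            return sum;
        }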

    3. Memory Hierarchy – Overview

    • Composed of four general classes, managed by varying means:
      1. Registers: compiler
      2. Caches: hardware – usually several levels
      3. Main memory: OS
      4. Virtual memory: OS

    • Nearer (to CPU) levels are small, fast, & expensive


    Level           Access cost   Typical size
    Registers       1             8-32 words
    Level 1 Cache   2-25          8-64 KB
    Level 2 Cache   150-500       256 KB - 8 MB
    Main Memory     5,000         0.25-32 GB
    Disk            10M           X TB

    4. DDOT Naturally Exploits Registers

    DDOT: dot product of vectors X & Y

    for (dot=0.0, i=0; i < N; i++)
        dot += x[i] * y[i];

    • 2N FLOPS, 4N references, 2N data

    • dot assigned to a register:
      – 4N references still needed by algorithm
      – 2N accesses of dot now from register
        1. Initial main memory read not needed – register zeroed
        2. N reads and N writes of dot hit the same register

    • Main memory access reduced to 2N (reads of X & Y):
      – X & Y references irreducible because there is no reuse
      – Cache still possibly provides benefit (line fill & prefetch)

    (A full function sketch follows below.)
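    A minimal C sketch of DDOT (assumed signature, not from the slides), showing the register-resident accumulator:

        /* dot stays in a register for the whole loop, so only x[i] and y[i]
         * touch memory: 2N loads total and one final store of the result. */
        double ddot(int N, const double *x, const double *y)
        {
            double dot = 0.0;
            for (int i = 0; i < N; i++)
                dot += x[i] * y[i];
            return dot;
        }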

  • 5. Memory Hierarchy - Why?

    • Processor perf increasing much more rapidly than mem (log scale)
      – After 25 yrs, main mem is < 10×, while CPUs are > 10,000×!

    • Without hier, faster procs won't help: all time waiting on mem
    ⇒ Try to design system where most references are satisfied by small, fast storage systems, and use these sys to hide the cost of total storage

    [Figure: processor vs. memory performance, 1980-2010, log scale 1 to 100,000 – Comp Arch, Henn & Patt, Fig B.9, pg B-25]

    6. Typical Layers of Memory Hierarchy

    • Access time: time from request to access
      → Each component has its own; sys access time may be the addition of all, or only some (some systems run in parallel)

    • Bandwidth: how fast data can be streamed from a given address

    Name              registers            cache               main mem      disk
    Typical size      < 1 KB               < 16 MB             < 512 GB      > 1 TB
    Implem tech       custom mem wt        on/off-chip         CMOS          magnetic
                      mult ports, CMOS     CMOS SRAM           DRAM          disk
    Access time (ns)  0.15-0.3             0.5-15              30-200        5,000,000
    Bndwdth (MB/s)    100,000-1,000,000    10,000-40,000       5000-20,000   50-500
    Managed by        compiler             hardware            OS            OS/operator
    Backed by         cache                main mem or cache   disk          tape/disks, DVD/CD

    • Huge drop in hitting disk → disk is a mechanical device

    7. Memory Hierarchy Terminology

    Each level of the hierarchy has some defining terms:

    • Hit: item is found in that level of the hierarchy
    • Miss: item is not found in that level of the hierarchy
    • Hit Time: time to access item in level, including time to determine if access was a hit
    • Miss Penalty: time to fetch block from further level(s) of hierarchy
      – Varies widely even for a given level: e.g. may be found in Level 2 cache or on disk, giving hugely diff values for diff accesses
    • Miss Rate: fraction of accesses that are not in that level
    • Block: the amount of information that is retrieved from the next further level on a miss

    8. Memory Hierarchy Equations

    • Hit rate = nhit / nref
    • Miss rate = nmiss / nref
    • Average access time = hit time + miss rate * miss penalty
    • Total access time = hit time * nref + miss penalty * nmiss
      – In OO machines, some of this time may be hidden

    • Decrease mem cost three ways (a numeric example follows below):
      1. Decrease hit time – hardware
      2. Decrease nmiss – hardware/software
      3. Decrease miss penalty – hardware/software
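    A small C calculation of average access time using the formula above (all numbers hypothetical):

        #include <stdio.h>

        int main(void)
        {
            double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 100.0;
            double avg = hit_time + miss_rate * miss_penalty;  /* 1 + 0.05*100 = 6 cycles */
            printf("average access time = %.1f cycles\n", avg);
            return 0;
        }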

  • 9. Memory Hierarchy Evaluation Methods

    • Software approach requires two components:
      1. Generate traces of memory accesses:
         – Instrumentation
         – Simulation
         – Traps
      2. Use a memory hierarchy simulator:
         – Offline
         – On-the-fly

    • Hardware approach – use hardware performance counters:
      – Available on most machines, not all
      – May not have capability of capturing all data
      – Usually not repeatable (like real world!)

    10. Memory Hierarchy Questions

    Since a level does not contain entire address space, must ask ourselves:

    1. Where can a block be placed in the current level?
       → Block placement / associativity (caches)
    2. How is a block found if it is in the current level?
       → Block identification
    3. Which block should be replaced on a miss?
       → Block replacement / replacement policy
    4. What happens on a write?
       → Write policy

    11. Cache Basics

    Cache: smaller, faster area where data may be temporarily stored and operated on.

    Five basic properties key in understanding caching:

    1. Inclusive/exclusive

    2. Block (cache line) size
       • Strongly correlated with bus width
       • Simplifies book-keeping
       • Usually filled in order
       • Optimizes for spatial locality

    3. Associativity∗
       • Direct mapped cache (1-way assoc; lots of conflicts!)
       • Fully associative
       • N-way associative cache

    4. Replacement policy
       • Least recently used (LRU)
       • First in first out (FIFO)
       • Random or pseudo-random

    5. Write policy∗
       • Write through
       • Write back

    • Separate/shared
      – Most L1 caches separate instruction/data
      – Most further level caches shared

    12. Write Policy / Write Miss Policy

    • Write through: data is written to both cache & main mem
      – Simpler to implement
      – Can use write buffers to reduce stalls
      – Cache & main mem kept consistent
      – Usual write miss policy:
        ∗ No-write allocate: block is not loaded on write miss, but is updated in main memory
        ∗ Write-only operands do not hog up cache

    • Write back: data is written only to cache
      – Modified block has dirty bit set, so cache knows to write value back to mem when it is ejected from cache
      – Reduces bus traffic, but mem does not always have good values
      – Usual write miss policy:
        ∗ Write allocate: block is loaded on write miss

  • 13. Understanding Cache Lookup and Associativity

    Cache is like a table, with rows being the number of sets (a loc where a blk with a particular address may reside), and columns containing blocks.

    • Set (row index) is determined by/implies certain bits of mem @
    • 1-way assoc. has 1 column (given @ maps to only 1 loc in cache)
    • N-way associative has N (could go N different locations in cache)
      – Given s, must search tags of N blks for hit
    • Fully associative has only 1 row (any @ placed arbitrarily in cache)
    • Column index is undetermined, and so all columns must be searched to discover if we have a cache miss
      – High associativity is expensive (dup hardware or slower access)
      – Low associativity leads to more non-capacity conflicts

    Conceptual N-way associative cache (Ns sets × N ways):

        set 0      blk1 blk2 . . . blkN
        set 1      blk1 blk2 . . . blkN
        . . .      . . . . . . . . . . .
        set Ns−1   blk1 blk2 . . . blkN

    • Ns = (cachesz) / (blksz × assoc)
    • s = (address / blksz) mod Ns

    A K-byte block in cache:  | tag | byte1 byte2 . . . byteK |
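    A C sketch of the set/tag computation above, assuming power-of-two sizes (function and parameter names are illustrative):

        #include <stdint.h>

        /* Ns = cachesz/(blksz*assoc); s = (addr/blksz) mod Ns */
        unsigned set_of(uintptr_t addr, unsigned cachesz, unsigned blksz, unsigned assoc)
        {
            unsigned nsets = cachesz / (blksz * assoc);
            return (unsigned)((addr / blksz) % nsets);
        }

        /* tag = the address bits above the offset and index fields */
        uintptr_t tag_of(uintptr_t addr, unsigned blksz, unsigned nsets)
        {
            return addr / ((uintptr_t)blksz * nsets);
        }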

    14. Tradeoffs in Cache Design

    • Make block size larger:
      – Pro: more cache hits for contiguous access (> spatial loc), less % overhead in block-spec. info (e.g. tag, last access)
      – Con: for non-cont access, load & store more useless data (more bus contention), fewer indep blks in cache

    • Increase associativity:
      – Pro: fewer conflicts (2 @ mapping to same entry)
      – Con: extra hardware (comparison & replacement)
      ⇒ Most small caches use 2-3 way (# of ops), lrg caches 1-way or random replacement

    • Use write-back:
      – Pro: less bus traffic
      – Con: memory coherence prob (esp for multiprocs), write-only data (e.g. copy) may hog up cache
      ⇒ L1 can be either, further levels usually write-back

    15. Addressing the Cache: Revenge of the Tag!

    • Memory address is decomposed into three pieces for cache access:
      1. block offset: offset within the block to be accessed
      2. index: bits imply what "row" (set) in table to examine for match
      3. tag: the part of the memory address not implied by index or block size – this value from the address must match the cache's stored tag for there to be a hit

    • Block address (tag+index) is the block number in memory; offset is the index into the block for the element you want

    • Assuming power of two, # of bits for [offset, index] = [Nob, Nib]:
      – Nob = log2(blksz), Nib = log2(Ns)
      – tag = (address) >> (Nob + Nib)

    Address layout:   |    tag    |  index (s)  | block offset |
                      \------ block address ----/

    Example: accessing the 2nd word (offset = 100b = 4) in an 8-byte block b0 b1 b2 b3 b4 b5 b6 b7

    ⇒ Alignment restrictions prevent cache block splits!

    16. Direct Mapped (1-Way Associative) Example

    • Assume 2KB cache with 16 byte linesize:
      – How many blocks and how many sets (rows)?
      – How many bits for tag, set, and block offset?
      – How much storage for tags (32-bit addresses)?
      (Worked numbers follow the table below.)

    Addr/R/W   Binary addr (tag | set | off)   Blk Fnd Way   Upd Way
    300 R      00000 0010010 1100
    304 R      00000 0010011 0000
    1216 R     00000 1001100 0000
    4404 R     00010 0010011 0100
    4408 R     00010 0010011 1000
    9416 R     00100 1001100 1000
    296 R      00000 0010010 1000
    304 R      00000 0010011 0000
    1220 R     00000 1001100 0100
    2248 R     00001 0001100 1000
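    A C sketch of the sizing answers for the questions above (assuming 32-bit addresses and counting tag bits only, not valid/LRU bits):

        #include <stdio.h>

        int main(void)
        {
            unsigned cachesz = 2048, blksz = 16, assoc = 1, addr_bits = 32;
            unsigned nblocks  = cachesz / blksz;                  /* 128 blocks          */
            unsigned nsets    = nblocks / assoc;                  /* 128 sets (1-way)    */
            unsigned off_bits = 4, idx_bits = 7;                  /* log2(16), log2(128) */
            unsigned tag_bits = addr_bits - off_bits - idx_bits;  /* 21 tag bits         */
            printf("blocks=%u sets=%u tag=%u bits, tag storage=%u bits (%u bytes)\n",
                   nblocks, nsets, tag_bits, nsets * tag_bits, nsets * tag_bits / 8);
            return 0;
        }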

  • 17. 2-Way Set Associative Example

    Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy:

    • How big is the cache?

    Addr/R/W   Binary addr (tag | set | off)   Blk Fnd Way   Upd Way
    300 R      000000 010010 1100
    304 R      000000 010011 0000
    1216 R     000001 001100 0000
    4404 R     000100 010011 0100
    4408 R     000100 010011 1000
    9416 R     001001 001100 1000
    296 R      000000 010010 1000
    304 R      000000 010011 0000
    1220 R     000001 001100 0100
    2248 R     000010 001100 1000

    18. Write-Through, No Write Allocate Example

    Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy:

    Addr/R/W   Binary addr (tag | set | off)   Tag  Set  Off   Blk Fnd Way   Upd Way   Mem Refs
    300 W      000000 010010 1100               0    18   12
    304 R      000000 010011 0000               0    19    0
    4404 R     000100 010011 0100               4    19    4
    4408 W     000100 010011 1000               4    19    8
    8496 W     001000 010011 0000               8    19    0
    8500 R     001000 010011 0100               8    19    4
    304 R      000000 010011 0000               0    19    0

    • How many main memory reads?
    • How many main memory writes?

    19. Write-Back, Write Allocate Example

    Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy:

    Addr/R/W   Binary addr (tag | set | off)   Tag  Set  Off   Blk Fnd Way   Upd Way   Mem Refs
    300 W      000000 010010 1100               0    18   12
    304 R      000000 010011 0000               0    19    0
    4404 R     000100 010011 0100               4    19    4
    4408 W     000100 010011 1000               4    19    8
    8496 W     001000 010011 0000               8    19    0
    8500 R     001000 010011 0100               8    19    4
    304 R      000000 010011 0000               0    19    0

    • Put * in Upd Way if that way is (still) dirty
    • How many reads and writes, and why?

    20. L1 cache of AMD Opteron

    Comp Arch, Henn & Patt, Fig B.5, pg B-13

    • 2-way assoc L1
    • Valid & dirty bit ∀ blk
    • LRU bit for each set

    1. Brk @ into 3 parts:
       – 64 byte blksz → 6 bits offset
       – 512 entries → 9 bits index
       – 25-bit tag (40 bit @)
    2. Index determines proper set
    3. Check if tag matches & valid bit set
    4. Mux selects which way to pass out
       – On hit, output → CPU
       – On miss, output → victim buffer

  • 21. Improving Cache Performance

    Assume main and virtual memory implementations are fixed; how can we improve our cache performance?

    • Reduce the cache miss penalty
    • Reduce cache miss rate
    • Use parallelism to overlap operations, improving one or both of the above
      – Fetching from all levels simult reduces miss penalty
      – Doing hardware prefetch in parallel with normal mem traffic can reduce miss rate
    • Reduce cache hit time

    ⇒ Will discuss each of these in turn

    22. Ideas for Reducing Cache Miss Penalty

    • Use early restart:
      – Allow CPU to continue as soon as required bytes are in cache, rather than waiting for entire block to load

    • Critical word first:
      – Load accessed bytes in block first
      – Load remaining words in block in wrap-around manner
      – Status bits needed to indicate how much of block has arrived
      – Particularly good for caches wt large block sizes

    • Give memory reads priority over writes
    • Merging write buffer
    • Victim caches
    • Use multilevel caches

    23. Giving Memory Reads Priority Over Writes

    (Reducing Cache Miss Penalty)

    Assume the common case of having a write buffer so that the CPU does not stall on writes (as it must for reads):

    • The CPU can check the @s in the write buffer on read miss:
      – If @ presently in write buff, load from there
      – If @ not in write buff, load from mem before prior writes
      – Advantages:
        ∗ Since read stalls CPU & write does not, we min stalls
        ∗ May avoid mem load if in write buff

    • Write buffers can make write-back more efficient as well:
      1. For dirty-bit eviction, copy the dirty blk frm cache to write buff
      2. Load the evicting block from mem to cache (CPU unstalled)
      3. Write dirty block from buff to memory

    24. Merging Write Buffer

    (Reducing Cache Miss Penalty)

    • Due to lat, multiword writes more effic than writing words sep
    • Mult words in write buff may be associated with same blk @
    • Valid bits used to indicate which words to write
    • Reduces the # of mem accesses
    • Reduces the # of write buff stalls for a given buff size

    Comp Arch, Henn & Patt, Fig 2.7, pg 88

    • Top buff shown without write merge
    • Don't need valid tags in write-back
    • Assume 32-byte blk for further cache
    • For seq acc, 4-fold reduction in # of writes & buffer entries
    • In practice, must handle 1-, 2-, 4- as well as 8-byte words
    ⇒ Larger blksz, more help

    (A minimal merging-write-buffer sketch follows below.)
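    A minimal C sketch of a merging write buffer, assuming 32-byte blocks split into four 8-byte words (all names and sizes illustrative, not from the slides):

        #include <stdint.h>
        #include <stdbool.h>

        #define WB_ENTRIES    4
        #define WORDS_PER_BLK 4          /* 32-byte block / 8-byte words */

        struct wb_entry {
            uintptr_t blk_addr;          /* block-aligned address        */
            uint64_t  word[WORDS_PER_BLK];
            uint8_t   valid;             /* one valid bit per word       */
            bool      used;
        };

        static struct wb_entry wb[WB_ENTRIES];

        /* Try to merge a word-sized write into an existing entry; otherwise
         * take a free entry. Returns false when the buffer is full (stall). */
        bool wb_write(uintptr_t addr, uint64_t data)
        {
            uintptr_t blk = addr & ~(uintptr_t)31;
            unsigned  w   = (addr >> 3) & 3;             /* word index in block */
            for (int i = 0; i < WB_ENTRIES; i++)
                if (wb[i].used && wb[i].blk_addr == blk) {   /* merge hit */
                    wb[i].word[w] = data;
                    wb[i].valid  |= 1u << w;
                    return true;
                }
            for (int i = 0; i < WB_ENTRIES; i++)
                if (!wb[i].used) {                           /* new entry */
                    wb[i] = (struct wb_entry){ .blk_addr = blk, .used = true };
                    wb[i].word[w] = data;
                    wb[i].valid   = 1u << w;
                    return true;
                }
            return false;
        }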

  • 25. Victim Caches

    (Reducing Cache Miss Penalty)

    Victim cache: small (e.g., 1-8 blocks), fully-associative cache that contains recently evicted blocks from a primary cache

    • Checked in parallel with primary cache
    • Available on following cycle if the item is in the victim cache
      – Victim block swapped with block in cache
    • Of great benefit to a direct-mapped L1 cache

    Comp Arch 3rd ed, Henn & Patt, Fig 5.13, pg 422

    • Less popular today
      → Most L1s are at least 2-way assoc

    26. Multi-Level Caches

    (Reducing Cache Miss Penalty)

    • Becomes more popular as miss penalty for primary cache grows
    • Further caches may be off-chip, but still made of SRAM
    • Almost all general purpose machines have at least 2 lvls of cache, most have 2 on-chip caches
    • Further caches typically have larger blocks and cache size
    • Equations (numeric example below):
      – local miss rate = misses in this cache / accesses to this cache
      – global miss rate = misses in this cache / accesses to L1 cache
      – avg acc time = L1 hit time + L1 miss rate * L1 miss penalty
      – L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
      – L2 miss penalty = main mem access time
    • L1 miss penalty is the average access time for L2
    • Local miss rate: % of this cache's refs that go to the next lvl
    • Global miss rate: % of all (L1) refs that go to the next lvl
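    An illustrative two-level calculation in C using the equations above (all numbers hypothetical):

        #include <stdio.h>

        int main(void)
        {
            double l1_hit = 1, l1_miss_rate = 0.05;     /* local L1 miss rate */
            double l2_hit = 10, l2_miss_rate = 0.20;    /* local L2 miss rate */
            double mem    = 200;                        /* main mem access    */

            double l2_miss_penalty = mem;
            double l1_miss_penalty = l2_hit + l2_miss_rate * l2_miss_penalty; /* 50  */
            double avg = l1_hit + l1_miss_rate * l1_miss_penalty;             /* 3.5 */
            double global_l2_miss = l1_miss_rate * l2_miss_rate;              /* 1%  */

            printf("avg access = %.1f cycles, global L2 miss rate = %.0f%%\n",
                   avg, global_l2_miss * 100);
            return 0;
        }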

    27. Reducing Cache Miss Rate

    Just discussed techniques for reducing the cost of a cache miss; now want to investigate ways to increase our chances of hitting in the cache:

    • Use larger block size – bet on spatial locality
    • Use larger cache size – less capacity and conflict contention
    • Use higher associativity – less conflict contention
    • Way prediction and pseudo associativity – mix of simple and associative caches
    • Compiler optimization – reorder accesses to reduce # of misses

    Cache Miss Categories:

    • Compulsory: first access to a block must always miss
      – Calc total # of blocks accessed in program
    • Capacity: blocks that are replaced and reloaded because the cache cannot contain all the used blocks needed during execution
      – Sim fully-assoc cache, sub compulsory misses from total misses
    • Conflict: occurs when too many blocks map to same cache set
      – Sim desired cache, sub comp & capacity misses from total misses

    28. Miss Rate vs. Cache Size

    Comp Arch, Henn & Patt, Fig 2.2, pg 73

    • Top figure shows total miss rate
      – Compulsory (tiny 1st line) misses stay constant
        ∗ Only way to dec comp is to increase blksz, which may increase miss penalty
      – Capacity (lrg blk area) goes down with size
      – Conflict misses dec wt size:
        ∗ Since # of conflicts goes down wt size, assoc pays off less for large caches

    • Bottom figure shows distribution of misses
      – % of compuls misses increases wt size, since other types of misses decrease wt size

  • 29. Reducing Miss Rate wt Larger Blocks

    Advantages:
    • Exploits spatial locality
    • Reduces compulsory misses

    Disadvantages:
    • Increases miss penalty
    • Can increase conflicts
    • May waste bandwidth

    SPEC92 blksz analysis:
    • If linesz lrg compared to cachesz, conflicts rise, increasing miss rate
    • 64-byte linesz reasonable ∀ studied cache sizes

    Miss rate vs. block size and cache size (Comp Arch, Henn & Patt, Fig B.10, pg B-27):

    blksz     4K       16K      64K      256K
    16        8.57%    3.94%    2.04%    1.09%
    32        7.24%    2.87%    1.35%    0.70%
    64        7.00%    2.64%    1.06%    0.51%
    128       7.78%    2.77%    1.02%    0.49%
    256       9.51%    3.29%    1.15%    0.49%

    30. Reducing Miss Rate wt Larger Caches & Higher Associativity

    Larger Caches

    Advantages:
    • Reduces capacity & conflict misses

    Disadvantages:
    • Uses more space
    • May increase hit time
    • Higher cost ($, power, die)

    Higher Associativity

    Advantages:
    • Reduces conflict misses

    Disadvantages:
    • May increase hit time
      – Tag check done before data can be sent
    • Req more space & power
      – More logic for comparators
      – More bits for tag
      – Other status bits (LRU)

    31. Reducing Miss Rate wt Way Prediction & Pseudo Associativity

    • Hit time as fast as direct-mapped, and req only 1 comparator
    • Reduces misses like a set-associative cache
    • Will have fast hits and slow hits

    Way Prediction
    • Each set has bits indicating which block to check on next access
    • A miss requires checking the other blocks in the set on subsequent cycles

    Pseudo Associativity
    • Accesses cache as in direct-mapped cache wt 1 less indx bit
    • On miss, chks sister blk in cache (e.g., by inverting most sig indx bit)
    • May swap two blks on an init cache miss wt a pseudo way hit

    32. Reducing Cache Miss Penalty and/or Miss Rate via Parallelism

    • Nonblocking caches
      – Allow cache hit accesses while a cache miss is being serviced
      – Some allow hits under multiple misses (req. a queue of outstanding misses)
      – Could use a status bit for each block to indicate blk currently being filled

    • Hardware prefetching & software prefetching
      – Idea is that predicted mem blocks are fetched while doing computations on present blocks
      – Requires nonblocking caches
      – Most prefetches do not raise exceptions
      – If guess is right, data in-cache for use
      – If wrong, wasted some bandwidth we weren't using anyway
      – Helps with latency, by exploiting unused bandwidth
        ∗ If bus saturated, prefetch won't help, and most archs ignore it
      – Can help with throughput, if usage is sporadic
      – Could expand conflict/capacity misses if prefetch is wrong

  • 33. Hardware and Software Prefetching

    Software prefetching

    • Compiler (or hand-tuner) inserts prefetch inst in loop (runnable sketch below):

      for (i=0; i < N; i++) {
          prefetch(&X[i+16]);   /* prefetch a fixed distance ahead */
          sum += X[i];
      }

    • Adds extra overheads (inst fetch & decode), so may cause slowdown

    Hardware prefetching

    • 2 blks fetched on miss: ref blk to cache, next blk to pref buffer
    • If blk ref is in pref buff, +1 cycle access to move to cache, and pref of next blk in stream issued
    • A series of strided addresses is a 'stream'; hardware best at finding low-stride up, down streams
    • Can have multiple stream buffs to allow pref of mult arrays

    34. Reducing Cache Hit Time

    • Small & simple caches
      – Less storage → can afford more $/byte
      – Small caches → less propagation delay
      – Direct mapped → overlap tag chk & data sending
      – Some designs have tags on-chip, data off

    • Avoiding address translation:
      – Virtual caches
        ∗ Avoids virtual-physical trans step, but problematic in practice
      – Virtually indexed, physically tagged
        ∗ Indx cache by page offset, but tag with physical @
        ∗ Can get data frm cache earlier

    • Pipelined cache access: allows fast clock speed, but results in greater br mispred penalty & load latency

    35. Increasing Cache Bandwidth wt Multibanked Caches

    Increase bandwidth by sending the address to b banks simultaneously:

    • b banks look up the address & write to the bus at the same time
    • Increases bandwidth by b in the best case
    • Usually use sequential interleaving

    ⇒ Figure 5.6 shows b=4 (Comp Arch, Henn & Patt, Fig 5.6, pg 299)

    36. Summary of Cache Optimizations

    (+ improves the metric, – hurts it, = no effect)

    Technique                   Miss pen  Miss rate  Hit time  BW  HW cmplx  Comment
    Lrgr cachesz                =         +          –         =   1         widely used for L2, L3
    Larger blksz                –         +          =         =   0         P4 L2 uses 128 bytes
    Higher assoc                =         +          –         =   1         widely used
    Multilevel caches           +         =          =         =   2         costly hrdwr, esp if L1 blksz ≠ L2; widely used
    Cache indx w/o @ transl     =         =          +         =   1         trivial if small cache; USIII/21264
    Read priority over writes   +         =          =         =   1         easy for uniproc, widely used
    Crit wrd frst & early rstrt +         =          =         =   2         widely used
    Mrgng write buff            +         =          =         =   1         widely used
    Victim caches               +         +          =         =   2         Athlon had 8-entry
    Way prediction              =         =          +         =   1         I-cache of USIII / D-cache of R4300
    Pseudoassoc                 =         =          +         =   1         L2 of R10K
    Comp. opt.                  =         +          =         =   0         hard, varies by comp.
    Hardware pref               +         +          =         =   2I,3D     widely used
    Software pref               +         +          =         =   3         widely used
    Sm & simple cache           =         –          +         =   0         widely used for L1
    Nonblk caches               +         =          =         +   3         all out-of-order CPUs
    Pipelined cache             =         =          –         +   1         widely used
    Banked caches               =         =          =         +   1         L2 of Opteron & Niagara

  • 37. Static Random Access Memory (SRAM)

    • Caches are built out of SRAM, main memory out of DRAM
    • Requires 4-6 transistors per bit
    • Does not need to be refreshed (static)
      – Always available for access (unlike DRAM)
      – Can consume less power if not often accessed
    • SRAM cycle time 8-16 times faster than DRAM
    • SRAM is 8-16 times more expensive

    38. Dynamic Random Access Memory (DRAM)

    • Requires only 1 transistor (plus a capacitor) per bit
      – Expect 6-fold greater density & cost advantage over 6-transistor SRAM
      – Charge leaks away wt time, requiring periodic refresh (read before charge drops below threshold, rewrite)
        ∗ Not all accesses take same amount of time, since mem loc may be busy wt refresh when access request arrives
        ∗ Designers try to maintain 5% or less refresh time (i.e. mem avail ≥ 95% of time)
      – Reading the value drains charge, requiring a refresh
        ∗ Better not want to access same item again & again

    • Logically laid out as 2-D square (or rectangle):
      – Since pins $$, use half as many pins, & send @ in two parts:
        1. Row Access Strobe (RAS): row to store in latches
        2. Column Access Strobe (CAS): bits to output

    • Rated by two types of latency:
      1. Cycle time: min time between successive requests to a given DRAM module
      2. Access time: time between read request & output data on pin

    39. DRAM (continued)

    Comp Org, Patt & Henn, 3rd ed., Fig B.9.6

    • 512KB DRAM stored as 2048×2048 bit array
    • 2048 = 2^11, so need 11 @ pins
    • On RAS, entire row is written to column latches
    • On CAS, MUX passes selected bit(s) to output
    • For refresh, entire row read and rewritten
    • There is a delay between RAS, CAS and data output, while electrical signals stabilize
    • Fast page mode allows CAS to vary wt RAS constant, allowing successive reads of column latches
      → Pump out multiple words frm latch using one row-read
      ⇒ We say RAS related to latency, CAS to bandwidth
    • Sync DRAM (SDRAM) has clock signal in DRAM so that mem controller does not need to sync wt DRAM after each word in row
    • Double data rate (DDR) xfers data on rising & falling edge of clock

    40. DRAM Packaging and Pins

    • Each DRAM module is rated by # of bits of storage and output:
      – Prior graph was 4M × 1: 4Mb × 1 bit output
        ∗ 4Mb (megabits) = 512KB (kilobytes)
        ∗ 4Mb = 4 ∗ 1024 ∗ 1024 = 4194304; √4194304 = 2048
      – Are often rectangular, wt more columns than rows
        ∗ Less time in refresh

    • Multiple DRAM modules are packaged together:
      – Older systems use single in-line memory modules (SIMMs)
        ∗ 72 pins, 32 bits of output
      – Newer systems use double in-line memory modules (DIMMs)
        ∗ 168 pins, 64 bits of output
        ∗ Internally, DRAM modules are organized into banks, allowing a single DIMM to be accessed during recovery (access bank 2 while bank 1 is still in cycle time from prior access)

  • 41. DRAM Timing by Year

                        RAS time
    Year   DRAM size   slow (ns)  fast (ns)  CAS (ns)  Cycle time (ns)
    1980   64K bit     180        150        75        250
    1983   256K bit    150        120        50        220
    1986   1M bit      120        100        25        190
    1989   4M bit      100        80         20        165
    1992   16M bit     80         60         15        120
    1996   64M bit     70         50         12        110
    1998   128M bit    70         50         10        100
    2000   256M bit    65         45         7         90
    2002   512M bit    60         40         5         80
    2004   1G bit      55         35         5         70
    2006   2G bit      50         30         2.5       60
    2010   4G bit      36         28         1         37
    2012   8G bit      30         24         0.5       31

    • RAS: row access time → ∝ latency
      – Improves ≈5%/year
    • CAS: column access time → ∝ bandwidth
      – Improves >10%/year
    • Cycle time: min time between accesses to same array
      – Follows RAS (≈5%)

    • Latency helped by compiler & arch (e.g. prefetch & OOE)
    • Cycle time helped by caches & multiple arrays/banks
      → Mem designers concentrate on increasing BW
    • Processor speeds no longer increasing, but mem demand still rising
      → Two procs running full out need twice the mem capacity/speed!

    42. DIMM Naming and Timing

                DRAM                                DIMM
    Standard    Rate (MHz)  xfer (Mb/s)  name        xfer (MB/s)    name
    DDR         133         266          DDR266      2128           PC2100
    DDR         150         300          DDR300      2400           PC2400
    DDR         200         400          DDR400      3200           PC3200
    DDR2        266         533          DDR2-533    4264           PC4300
    DDR2        333         667          DDR2-667    5336           PC5300
    DDR2        400         800          DDR2-800    6400           PC6400
    DDR3        533         1066         DDR3-1066   8528           PC8500
    DDR3        666         1333         DDR3-1333   10,664         PC10700
    DDR3        800         1600         DDR3-1600   12,800         PC12800
    DDR4        1066-1600   2133-3200    DDR4-3200   17,056-25,600  PC25600

    • DDRxxx describes DRAM BW, PCyyyy describes DIMM BW
    • DIMMs usually have 4-16 DRAM units
      – A unit with 16 DRAM arrays wt 4-bit outputs can produce 64 bits
      → Such a DIMM's BW is 64 times greater than DRAM
      → 64 bits / 8 bits/byte = 8× faster in bytes
      ⇒ DIMM MB/s = 8× DRAM's Mb/s (e.g. DDR400: 400 Mb/s × 8 = 3200 MB/s → PC3200)
    • Advertisers use BW (it goes up faster than other measures :)

    43. Memory Bus

    Comp Arch 3rd ed, Henn & Patt, Fig 5.27, pg 450

    a. Fetch 1 word (i.e. 32 or 64 bits) at a time (cache, bus, mem)
    b. Wide bus, matching wider L2 blksz
    c. 1-word cache & bus, with 4 banks

    • Modern systems combine (b) & (c), with all levels of cache having multiword cache blocks
    • Not sure why the multiplexor is a big deal: most modern machines have it

    44. Improving Main Memory Bandwidth

    • Wider Main Memory
      – PRO: increases bandwidth (more bits sent in parallel), reducing miss penalty (assuming blksz > wordsz)
      – CON: $$ bus, min increment for main mem expansion increased

    • Simple Interleaved Memory
      – PRO: banks accessed in parallel, ideal for sequential access
      – CON: which bank(s) are accessed depends on the @ (see sketch below; also Comp Arch, Henn & Patt, Fig 2.6, pg 86)

    • Independent memory banks: like interleaved mem, but wt no @ restriction
      – PRO: each bank may be addressed independently; very useful for nonblocking caches
      – CON: each bank needs separate @ lines & possibly data bus
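    A tiny C sketch of sequential interleaving (illustrative names; nbanks assumed a power of two): consecutive block addresses map to consecutive banks.

        #include <stdint.h>

        /* Which bank serves a given block address under sequential interleaving. */
        unsigned bank_of(uintptr_t block_addr, unsigned nbanks)
        {
            return (unsigned)(block_addr % nbanks);
        }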

  • 45. Virtual Memory

    Managing mem needs in excess of physical mem automatically requires Virtual Memory.

    • Originally developed for single prog wt lrg mem needs
    • Now critical for multiprocessing/time sharing
    • Divides physical memory into blocks and allocates these blocks to different processes
    • Provides a mapping between blks in physical mem & blks on disk
    • Allows a process to exec wt only part of the process resident in memory
      – Reduces startup/switch time!
    • Provides protection to prevent processes from accessing blocks inappropriately
      – Critical for multiprocessing (why was win9x so crash prone?)

    46. Virtual Memory Terms

    • Page: virtual memory's block
    • Page fault: a main memory miss (page is on disk, not in memory)
    • Physical address: @ used to access main mem and typically cache as well
    • Page table: data structure that maps between virtual and physical addresses
    • Translation Lookaside Buffer: cache that contains a portion of the page table

    47. Program is Virtually Contiguous, Not so Physically

    • Small prog consumes minor amount of virtual space (virtual space bigger than physical on most procs)
      – Think of unused mem blks as belonging to other users
    • Some blks in mem (fully assoc) & some on disk
    • Each blk can be loaded into any blk of physical mem, called relocation
      – Without relocation, prog is diff every time you load it!

    Comp Arch, Henn & Patt, Fig B.19, pg B-41

    48. A Process’s Virtual Address Space

    Each process has a full virtual address space, as shown on the right:

    • Exact organization varies by architecture:
      – Typically, heap begins at low @ and grows upward
      – Stack begins at high @, and grows downward
      – Memory between heap & stack is unused, and will not be allocated in mem or on disk until created
    • Not only is entire address space not allocated, may start running wt only initial page of code (e.g., 1st page of text segment) in mem (fast load)
    • Each page can be protected, to allow only certain accesses, but entire page treated same:
      – Code & part of data section may be read only (seg fault if you try to overwrite inst or read-only data)
      – Unallocated memory protected (fault on read or write)
      – Can still write beyond heap on same page

    Layout (high @ at top, 0x0 at bottom):
        stack
        unused space
        heap
        data (global)
        text (code)   ← 0x0

  • 49. Typical Ranges of Parameters for Caches & Virtual Memory

    Parameter         L1 Cache                 Virtual Memory
    Block size        16-128 bytes             4K - 64K (4M)
    Hit time          1-3 cycles               100-200 cycles
    Miss penalty      8-200 cycles             1,000,000-10,000,000 cycles
      (access time)   (6-160 cycles)           (800,000-8,000,000 cycles)
      (transfer time) (2-40 cycles)            (200,000-2,000,000 cycles)
    Miss rate         0.1-10%                  0.00001-0.001%
    Address mapping   25-45 bit physical       32-64 bit virtual to
                      to 14-20 bit cache       25-45 bit physical

    → Huge miss penalty due to physical mechanism (seek, rotate, transfer)
    • miss penalty = access time + transfer time
    • access time: time to get 1st byte from next level
    • transfer time: time to stream block into this level

    50. OS & Hardware Cooperate to Achieve Efficient Virtual Memory

    Because the miss penalty is so large, we need a very good mem management policy, and can afford to run it in software. So, hardware & OS cooperate: hardware generates page faults, the OS supplies the brains to implement good policies:

    • Where can a block go in memory? OS places blocks anywhere – fully associative.
    • How is a block found in memory? Page table is indexed by page number, which is found by: (virtual @) / pagesz. Page table establishes mapping between virtual addresses and either a physical mem @ or a disk location (more later).
    • Which blk should be replaced on page fault? OSes tend to use very good approx of LRU, since misses are so expensive.
    • What happens on write? All virtual mem systems use write-back, write allocate, due to cost of disk access.

    51. Mapping of a Virtual to Physical Address via a Page Table

    • Virtual @ split in two parts:
      1. Virtual pg # mapped to physical pg # via page table
      2. Pg offset indexes physical pg
    • Each process has its own page table
    • Usually, dedicated reg holds beginning of pg table
    • Page table itself stored in memory, and may be on disk
    • Every ref (inst or data) requires translation

    Comp Arch, Henn & Patt, Fig B.23, pg B-45

    52. Page Tables

    • # of entries = # of virtual pages
    • # virtual pages = sizeof(virt @ space) / (pagesz)
    • Each process has its own page table
    • A page table entry will typically contain (see the sketch below):
      – physical page number: phys @ = ppn * pagesz
      – valid bit: table entry used?
      – dirty bit: pg needs to be written to disk on swap?
      – use bit: set when pg is accessed; helps det LRU
      – protection field (e.g., read only)
      – disk @: where on disk this page is stored
    • Some archs use hashing to allow the # of entries to be the number of physical pages (usually much smaller than virtual). This is known as an inverted page table.
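    A C sketch of a page-table entry with the fields listed above and a lookup that maps a virtual address to a physical one (names are illustrative; 4 KB page size assumed):

        #include <stdint.h>
        #include <stdbool.h>

        #define PAGE_SZ 4096u

        struct pte {
            uint64_t ppn;        /* physical page number                     */
            uint64_t disk_addr;  /* where the page lives on disk             */
            bool     valid;      /* page resident in physical memory?        */
            bool     dirty;      /* must be written back to disk on eviction */
            bool     use;        /* set on access, helps approximate LRU     */
            uint8_t  prot;       /* protection field, e.g. read-only         */
        };

        /* Returns true and fills *paddr on a hit; false means a page fault. */
        bool translate(const struct pte *page_table, uint64_t vaddr, uint64_t *paddr)
        {
            uint64_t vpn    = vaddr / PAGE_SZ;   /* virtual page number */
            uint64_t offset = vaddr % PAGE_SZ;   /* page offset         */
            const struct pte *e = &page_table[vpn];
            if (!e->valid)
                return false;                    /* OS handles the fault */
            *paddr = e->ppn * PAGE_SZ + offset;
            return true;
        }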

  • 53. Virtual-to-Physical Mapping Example (no TLB)

    Virtual address = [ virtual page number (page table index) | page offset ],
    with nbits − log2(pagesz) bits of page number and log2(pagesz) bits of offset.

    Page table:

    index   Phys pg   Disk   V   D
    0       2         3      1   0
    1       9993      3      0   1
    2       -1        4      1   0
    ...     ...       ...    .   .
    9       3         n/a    1   1
    12      100       5      1   1

    Assume 256 byte page size (unrealistic); use the table above to find the physical @; assume the next physical page number to be used is 8:

    virtual @               index   pg off   MEM hit   physical address
    0x90F (1001 0000 1111)
    0xC10 (1100 0001 0000)
    0x201 (0010 0000 0001)
    0x211 (0010 0001 0001)
    0x031 (0000 0011 0001)

    • How can a page be in memory & not on disk?
    • How can a page be on disk but not in memory?
    • How can an invalid page table entry become valid?

    54. Translation Lookaside Buffer (TLB)

    TLB: special cache that contains a portion of the entries in a page table (lookup sketch below)

    • Each entry in the TLB has a tag (portion of virtual page number) and most of the info in a page table entry
    • TLBs are typically cleared on context switch
    • TLBs are typically quite small to provide for very fast translation
    • Typically separate inst and data TLBs
    ⇒ W/o TLB every mem ref requires at least 2 mem refs:
      – One reference to page table, one to address
    ⇒ Most archs avoid context switch-driven TLB flush by storing PID in TLB

    55. TLB and Page Table Example

    TLB (direct mapped wt 64 sets):

    Indx   Pg#   Tag   V
    0      0     0     0
    7      0     0     0
    8      2     0     1
    63     0     0     0

    Page Table:

    Indx   Pg#   Res   Dirty   Disk @   . . .
    0      0     0     0       1        . . .
    71     9     Yes   0       16       . . .

    • Page size = 512 bytes
    • TLB direct mapped wt 64 sets

    Hex @    Binary Address         PgTab index bits   Page offset bits   PgTab index   Page offset
    0x8FDF   1000 1111 1101 1111
    0x10DF   0001 0000 1101 1111

    Addr     Binary Addr            TLB Indx   TLB Tag   TLB Hit?   PgTab Indx   Physical address
    0x8FDF   1000 1111 1101 1111
    0x10DF   0001 0000 1101 1111

    56. Selecting a Page Size

    • Advantages of using large page size:
      – The larger the page size, the fewer entries in the page table, and thus the less memory used by the page table.
      – Virtually indexed, physically tagged caches can do translation and cache read in parallel (normally must 1st do translation, then cache access), but # of bits in page offset sets upper limit on cache size. Therefore, increasing page size increases the maximum cachesz in a virtually indexed cache.
      – Transferring larger pages to/from memory is more efficient since fewer seeks are required.
      – Will probably result in fewer TLB misses since there are fewer distinct pages in the same address range.

    • Advantages of using smaller page size:
      – Will waste less space
      – Startup time for process/page allocation quicker.

  • 57. Hypothetical Memory Hierarchy, from Virtual @ to L2

    Comp Arch, Henn & Patt, Fig B.25, pg B-48

    • Direct mapped L1, L2, TLB
    • Virtually indexed, physically tagged L1
      – TLB & L1 accessed simultaneously
      – If L1 set valid, physical @ compared wt L1 tag
    • Physically indexed L2
    • 64 bit virtual address
    • 41 bit physical address
    • nbits(TLB tag) = nbits(virtual page number) − nbits(TLB index)

    58. Multiprogramming

    Multiprogramming means that several processes share a computer concurrently.

    • A process is the code, data, and any state information used in the execution of a program.
    • A context switch means transferring control of the machine from one process to another.
      – Must save/restore state (regs, PC, TLB, page tab loc).
    • Proper protection must be provided to prevent a process from inappropriately affecting another process or itself.
    • Virtual mem sys keep bits in page table & TLB indicating the type and level of access that the process has to each of the pages.
      – Most sys provide user and kernel modes – kernel has greater access, and user must transfer control via syscall to enter.
      – Sharing mem accomplished by having multiple processes' pgtab entries point to the same physical mem page.

    59. Crosscutting Issues

    • Multi-issue machines require more ports to inst and data caches
    • Speculative execution often results in invalid addresses which raise exceptions that must be suppressed
    • Some machines, such as the Pentium 4, read in 80x86 inst, but save decoded RISC-like inst in the inst cache
    • Some caches are designed to save power
      – E.g., MIPS 4300 can power only half its @ chking hardware in its 2-way assoc cache by using way prediction.
    • Cache coherence problem & I/O (next slide)

    60. I/O and the Cache Coherence Problem

    We may ask ourselves: where does I/O occur?

    1. Between I/O device and the cache
       • No cache-coherence prob, but I/O blocks cache, and thus CPU
    2. Between I/O device and main mem (this is the usual case):
       • Min interference wt CPU; write-thru cache keeps mem up-to-date for output
       • For input or write-back, machine must either:
         (a) Not allow certain pages used as I/O buffers to ever load to cache
         (b) Invalidate cache blocks for I/O pages before starting I/O, and disallow mem access until I/O done

    Cache Coherence Problem

    With data in mem & caches, data can be stale:

    1. Data in cache may be more up-to-date due to write-back cache
       • This is a problem for multiprocessors & output devices
    2. Data in mem may be more up-to-date due to input frm ext device
       • This is a problem for the CPU, handled as described above

  • 61. Fallacies and Pitfalls

    • Pitfall: predicting memory hierarchy performance using only a limited number of instructions
      1. Large cache may contain all of a simple prog, not so a real application
      2. Program memory use varies over time
      3. Program memory use varies by input
    • Fallacy: predicting mem hier perf of one program from another.
    • Pitfall: ignoring the impact of the OS on the perf of the mem hier
      – OS impact often ignored because it is harder to gather data (not easily simulated, and true timings not repeatable), but:
      – Authors found about 25% of mem stall time due either to OS misses or OS refs displacing process blocks.