Computer Architecture Cache Design



(Dr. Sofiène Tahar)

    COEN6741

    Computer Architecture and Design

    Chapter 5

    Memory Hierarchy Design


    Outline

    Introduction

Memory Hierarchy

Cache Memory

    Cache Performance

    Main Memory

    Virtual Memory

Translation Lookaside Buffer

    Alpha 21064 Example

Computer Architecture Topics

[Diagram: the memory-hierarchy slice of computer architecture - Input/Output and Storage (Disks, WORM, Tape; RAID; Emerging Technologies; Interleaving; Bus protocols), Memory Hierarchy (L2 Cache, DRAM; Coherence, Bandwidth).]

Who Cares About the Memory Hierarchy?

[Figure: Processor-DRAM performance gap, log scale 1-1000; the gap grows about 50% per year.]

    Levels of the Memory Hierarchy

CPU Registers: 100s of bytes


    A Modern Memory Hierarchy

By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology.

Requires servicing faults on the processor.

[Diagram: Processor (control, datapath, registers) -> on-chip cache -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape). Speed ranges from ~1 ns at the registers, through 10s-100s ns in the caches and DRAM, to 10s of ms for disk and 10s of sec for tape; size ranges from 100s of bytes in registers, through Ks-Ms in the caches and Gs in main memory/disk, to Ts in tertiary storage.]

The Memory Abstraction

Association of <name, value> pairs


Q3: Which block should be replaced on a miss?

Easy for Direct Mapped

Set Associative or Fully Associative: Random, LRU (Least Recently Used)

Data cache miss rates by associativity and replacement policy:

              2-way            4-way            8-way
Size          LRU     Random   LRU     Random   LRU     Random
16 KB         5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB         1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB        1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
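As a rough illustration of the two policies compared in the table, here is a minimal sketch in C (structure and names are ours, not the course's): LRU keeps a last-use stamp per way and evicts the oldest; random keeps no state at all, which is why it is simpler yet only slightly worse at these sizes.

#include <stdlib.h>

#define WAYS 4

typedef struct {
    unsigned tag[WAYS];
    int      valid[WAYS];
    unsigned stamp[WAYS];    /* last-use counter, drives LRU */
} Set;

static unsigned now;

void touch(Set *s, int w) { s->stamp[w] = ++now; }  /* call on every hit */

int victim_lru(const Set *s) {
    int v = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->stamp[w] < s->stamp[v]) v = w;       /* oldest way loses */
    return v;
}

int victim_random(void) {
    return rand() % WAYS;    /* stateless; slightly higher miss rate */
}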


Q4: What happens on a write?

Write-through: all writes update cache and underlying memory
Can always discard cached data - the most up-to-date copy is in memory
Cache control bit: only a valid bit

Write-back: all writes simply update the cache
Can't just discard cached data - may have to write it back to memory
Cache control bits: both valid and dirty bits

Other advantages:
Write-through: memory (or other processors) always has the latest data; simpler management of the cache
Write-back: much lower bandwidth, since data is often overwritten multiple times; better tolerance to long-latency memory

Write Policy (What happens on a write-miss?)

Write allocate: allocate a new cache line in the cache; usually means that you have to do a read miss to fill in the rest of the cache line. The alternative, write no-allocate, sends the write to the lower level without loading the block.
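A minimal sketch of how the policies combine on a store; Line, write_to_memory, and fetch_block are hypothetical stand-ins, not anything from the slides:

typedef struct { int valid, dirty; unsigned tag; char data[32]; } Line;

static void write_to_memory(void) { /* stub: update the lower level */ }
static void fetch_block(Line *l, unsigned tag) { l->valid = 1; l->tag = tag; }

void store(Line *line, unsigned tag, int write_back, int write_allocate) {
    if (line->valid && line->tag == tag) {       /* write hit */
        if (write_back) line->dirty = 1;         /* memory updated later */
        else            write_to_memory();       /* memory updated now */
    } else if (write_allocate) {                 /* write miss */
        fetch_block(line, tag);                  /* the read miss that fills the line */
        store(line, tag, write_back, write_allocate);
    } else {
        write_to_memory();                       /* no-allocate: bypass the cache */
    }
}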

Write Buffer for Write-Through

A Write Buffer is needed between the Cache and Memory
Processor: writes data into the cache and the write buffer
Memory controller: writes contents of the buffer to memory

[Diagram: Processor -> Cache, with a Write Buffer between the cache and DRAM.]


    Simplest Cache: Direct Mapped

[Figure: memory addresses 0x0-0xF mapping into a 4-byte direct-mapped cache with cache indices 0-3.]

Location 0 can be occupied by data from memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s. Address => cache index.

Which one should we place in the cache?
How can we tell which one is in the cache?
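A sketch of that address split for this 4-entry cache (the helper names are ours): the 2 LSBs select the index, and the remaining bits form the tag that answers the second question.

unsigned index_of(unsigned addr) { return addr & 0x3; }   /* addr mod 4 */
unsigned tag_of(unsigned addr)   { return addr >> 2; }    /* addr div 4 */

/* Locations 0x0, 0x4, 0x8, 0xC all yield index 0; the stored tag
   (0, 1, 2, or 3) records which of them currently occupies the entry. */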

Example: 1 KB Direct Mapped Cache

For a 2**N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2**M)

[Figure: 32-bit address split into Cache Tag (e.g. 0x50, stored as part of the cache state alongside a Valid Bit), Cache Index, and Byte Select.]

    Set Associative Cache

N-way set associative: N entries for each Cache Index
N direct mapped caches operate in parallel

Example: Two-way set associative cache
Cache Index selects a set from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag result

[Figure: two-way set - parallel valid/tag/data arrays, Cache Index selecting the set, tag compare muxing the data out.]
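A minimal sketch of the lookup just described (sizes and names assumed for illustration): both ways of the selected set are probed, and the matching way's data is the one delivered.

#define SETS 64

typedef struct { int valid; unsigned tag; unsigned data; } Way;

Way cache[SETS][2];                     /* two direct mapped halves */

int lookup(unsigned block_addr, unsigned *out) {
    unsigned set = block_addr % SETS;   /* Cache Index selects a set */
    unsigned tag = block_addr / SETS;
    for (int w = 0; w < 2; w++)         /* both compares run in parallel in HW */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *out = cache[set][w].data;  /* data selected by the tag result */
            return 1;                   /* hit */
        }
    return 0;                           /* miss */
}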

Disadvantage of Set Associative Cache

N-way Set Associative Cache versus Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER the Hit/Miss decision

In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: possible to assume a hit and continue, recovering later on a miss.

[Figure: CPU address split into Block address (Tag, 8-bit Index) and 5-bit Block offset; a 256-block data array with valid bits; tag compare (=?); a 4:1 mux selects the word; a write buffer sits between the cache and lower-level memory.]

    The organization of the data cache in the Alpha AXP 21064 microprocessor.

Cache Performance

Miss-oriented approach to memory access (CPI_Execution includes ALU and Memory instructions):

CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime

CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

Separating out the memory component entirely (AMAT = Average Memory Access Time; CPI_AluOps does not include memory instructions):

CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime

AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)
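The AMAT formula as a one-liner for plugging in numbers used later in the chapter (a sketch; all quantities in clock cycles):

double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* e.g. amat(1, 0.05, 20) = 2.0 cycles for a 5% miss rate
   and a 20-cycle miss penalty. */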

Impact on Performance

Suppose a processor executes at:
Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of memory operations get a 50-cycle miss penalty
Suppose that 1% of instructions get the same miss penalty

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)
    + 1.00 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)
    = 1.1 + 1.5 + 0.5 = 3.1 cycles/ins
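The same arithmetic as a tiny C program (a sketch of the slide's computation, nothing more):

#include <stdio.h>

int main(void) {
    double ideal_cpi = 1.1;
    double data_ops = 0.30, data_miss = 0.10;    /* per instruction, per data op */
    double inst_ops = 1.00, inst_miss = 0.01;
    double penalty  = 50.0;                      /* cycles */
    double cpi = ideal_cpi
               + data_ops * data_miss * penalty  /* 1.5 data-miss stalls */
               + inst_ops * inst_miss * penalty; /* 0.5 inst-miss stalls */
    printf("CPI = %.1f\n", cpi);                 /* prints CPI = 3.1 */
    return 0;
}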

Unified vs. Split Caches

Unified vs. Separate I&D
Example: 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%

[Diagram: split level-1 I-cache and D-cache vs. a unified level-1 cache, each backed by a unified level-2 cache.]


    How to Improve Cache Performance?

    1. Reduce the miss rate,

    2. Reduce the miss penalty, or

    3. Reduce the time to hit in the cache.

AMAT = Hit Time + Miss Rate x Miss Penalty


Miss Rate Reduction

3 Cs: Compulsory, Capacity, Conflict

0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching
6. Reducing Misses by SW Prefetching
7. Reducing Misses by Compiler Optimizations

Danger of concentrating on just the miss rate. Prefetching comes in two flavors:
Binding prefetch: requests load directly into a register; must be a correct address and register
Non-binding prefetch: load into cache; can be incorrect; frees HW/SW to guess

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

Where do misses come from?

Classifying Misses: 3 Cs

Compulsory - The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an Infinite Cache)

Capacity - If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)

Conflict - If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)

[Figure: 3Cs absolute miss rate (0.08-0.14 range shown) vs. cache size for 1-way, 2-way, 4-way, and 8-way associativity, broken down per miss type.]


0. Cache Size

Old rule of thumb: 2x size => 25% cut in miss rate
What does it reduce? Thrashing reduction!

[Figure: miss rate (0-0.14) vs. cache size (1-128 KB) for 1-way through 8-way associativity, with capacity and compulsory components marked.]


Cache Organization?

Assume total cache size is not changed. What happens if we:

1) Change Block Size
2) Change Associativity
3) Change Compiler

Which of the 3Cs is obviously affected?

1. Larger Block Size (fixed size & assoc)

[Figure: miss rate (5%-25%) vs. block size for 1 KB-64 KB caches.]

2. Higher Associativity

[Figure: 3Cs absolute miss rate (0.04-0.14) for 1-way through 8-way associativity; the conflict component shrinks as associativity grows.]


3Cs Relative Miss Rate

[Figure: relative miss rate (0%-100%) vs. cache size (1-128 KB) for 1-way, 2-way, 4-way, and 8-way, split into conflict, capacity, and compulsory shares.]

Flaws: for fixed block size. Good: insight => invention.

Associativity vs. Cycle Time

Beware: execution time is the only final measure!

Why is cycle time tied to hit time? Will clock cycle time increase?
Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%; suggested big and dumb caches.
Effective cycle time of associativity (Przybylski, ISCA).

Example: Average Memory Access Time vs. Miss Rate
Assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped:

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68

3. Victim Cache

Fast Hit Time + Low Conflict => Victim Cache

How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache.
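A minimal sketch of the idea (sizes and names ours): a tiny fully associative buffer is probed on a main-cache miss, and lines evicted from the cache are placed into it rather than discarded.

#define VICTIMS 4

typedef struct { int valid; unsigned block_addr; unsigned data; } VLine;

VLine victim[VICTIMS];                 /* filled by cache evictions */

int victim_lookup(unsigned block_addr, unsigned *out) {
    for (int i = 0; i < VICTIMS; i++)  /* fully associative search */
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            *out = victim[i].data;     /* near-hit: swap line back into the cache */
            return 1;
        }
    return 0;                          /* real miss: go to the next level */
}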


4. Pseudo-Associativity

How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?

Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit).

Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles. Better for caches not tied directly to the processor (L2). Used in the MIPS R10000 L2 cache; similar in UltraSPARC.

[Timeline: Hit Time < Pseudo Hit Time < Miss Penalty.]

5. Hardware Prefetching of Instructions & Data

E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer; on a miss, check the stream buffer.

Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches

Prefetching relies on having extra memory bandwidth that can be used without penalty.

6. Software Prefetching Data

Data Prefetch: load data into a register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
Special prefetching instructions cannot cause faults; a form of speculative execution

Prefetching comes in two flavors:
Binding prefetch: requests load directly into a register; must be a correct address and register!
Non-binding prefetch: load into cache; can be incorrect (a sketch follows)
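As an illustration of the non-binding (cache) flavor, here is a loop using GCC/Clang's __builtin_prefetch, a portable analogue of the MIPS IV / PowerPC / SPARC V9 prefetch instructions named above; the distance of 16 elements is an arbitrary assumption:

void scale(double *a, int n) {
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* hint only; cannot fault */
        a[i] *= 2.0;
    }
}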

7. Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4-byte blocks, in software.

Instructions:
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)

Data:
Merging Arrays: improve spatial locality


Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j]; }

Two misses per access to a & c become one miss per access; improves temporal locality.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };


Blocking Example (continued)

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            };

B is called the Blocking Factor
Capacity misses fall from 2N^3 + N^2 to 2N^3/B + N^2

Conflict Misses Too?

Reducing Conflict Misses by Blocking

Conflict misses in caches that are not fully associative vs. blocking size: Lam et al [1991] found a blocking factor of 24 had a fifth the misses of 48, despite both fitting in the cache.

[Figure: miss rate (0-0.1) vs. blocking factor (0-50).]

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: bar chart of miss-rate improvement for cholesky, spice, tomcatv, and the nasa7 kernels mxm, btrix, gmty, vpenta.]

Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.


Reducing Miss Penalty

Four techniques:
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache

Can be applied recursively to Multilevel Caches
Danger is that the time to DRAM will grow with multiple levels in between
First attempts at L2 caches can make things worse, since the increased worst case is worse
An out-of-order CPU can hide an L1 data cache miss (3-5 clocks), but stalls on an L2 miss (40-100 clocks)

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

1. Read Priority over Write on Miss

Write-through with write buffers => RAW conflicts with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000 by 50%)
Check write buffer contents before the read; if no conflicts, let the memory access continue
Write-back: use the write buffer to hold the displaced dirty block

2. Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU
Early restart - As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical Word First - Request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the block fills. Also called wrapped fetch and requested word first.


3. Non-blocking Caches

A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories

hit under miss reduces the effective miss penalty by working during the miss vs. ignoring CPU requests

hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise this cannot be supported)
The Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC

FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B blocks

[Figure: AMAT ratio under "hit under i misses" (i = 0, 1, 2, 64) for SPEC92 benchmarks - integer: eqntott, espresso, xlisp, compress; floating point: mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5.]

4. Add a Second-Level Cache

L2 Equations:

AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
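The L2 equations as code, for experimenting with the trade-off (a sketch; mr2 is the local L2 miss rate, so the global L2 miss rate would be mr1 * mr2):

double amat_l2(double ht1, double mr1,
               double ht2, double mr2, double mp2) {
    double miss_penalty_l1 = ht2 + mr2 * mp2;   /* L1 miss serviced by L2 */
    return ht1 + mr1 * miss_penalty_l1;
}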

Comparing Local and Global Miss Rates

32 KByte 1st-level cache; increasing 2nd-level cache size
The global miss rate is close to the single-level cache rate, provided L2 >> L1
Don't use the local miss rate
L2 is not tied to the CPU clock cycle


Reducing Misses: Which Apply to L2 Cache?

Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conflict Misses by Compiler Optimizations

Relative CPU Time vs. L2 Block Size

[Figure: relative CPU time (1.0-2.0) vs. L2 cache block size; 32KB L1, 8-byte path to memory; 16B -> 1.36, 32B -> 1.28, 64B -> 1.27.]

Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

1. Small and Simple Caches

Why does the Alpha 21164 have an 8KB instruction cache and 8KB data cache + a 96KB second-level cache? A small data cache and a fast clock rate go together.
Direct mapped, on chip.


2. Avoiding Address Translation

Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache.
Every time the process is switched, the cache logically must be flushed; otherwise we get false hits. Cost is the time to flush + compulsory misses from an empty cache.
Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
I/O must interact with the cache, so it needs the virtual address.

Solution to aliases:
HW guarantees every cache block has a unique physical address
SW guarantee: the lower n bits must have the same address; as long as they cover the index field and the cache is direct mapped, blocks must be unique; called page coloring (a sketch of the constraint follows)

Solution to cache flush:
Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if it is the wrong process
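A sketch of the page-coloring constraint (names ours): a virtually indexed, physically tagged cache is alias-free exactly when the index and block-offset bits fit inside the page offset, i.e. when size/associativity does not exceed the page size.

int needs_page_coloring(unsigned cache_size, unsigned assoc,
                        unsigned page_size) {
    return (cache_size / assoc) > page_size;   /* index bits spill past the page offset */
}

/* e.g. an 8 KB direct mapped cache with 8 KB pages needs no coloring;
   a 16 KB direct mapped cache does: the OS must make VA and PA agree
   in the one extra index bit. */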

Virtually Addressed Caches

[Diagram: three organizations - Conventional: CPU -> TB (translation) -> cache -> memory, all cache accesses physically addressed; Virtually Addressed: CPU -> cache with VA, translate only on a miss (synonym problem); VA Tags: the cache holds virtual tags alongside the data.]

3. Pipelined Writes - Case Study: MIPS R4000

Pipeline the Tag Check and the Cache Update as separate stages; the current write does its tag check while the previous write updates the cache.
Only STORES are in this pipeline; it empties during a miss.
Store r2, (r1): check r1's tag first; update the cache in a later stage.

8-Stage Pipeline:
IF - first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access
IS - second half of the instruction cache access
RF - instruction decode and register fetch, hazard checking, and instruction cache hit detection
EX - execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation


Case Study: MIPS R4000

[Pipeline diagrams: overlapped IF-IS-RF-EX-DF-DS-TC-WB stages showing a TWO-cycle load latency and a THREE-cycle branch latency (conditions evaluated during the EX phase).]

Delay slot plus two stalls; branch-likely cancels the delay slot if not taken.

R4000 Performance

Not an ideal CPI of 1:
Load stalls (1 or 2 clock cycles)
Branch stalls (2 cycles + unfilled slots)
FP result stalls: RAW data hazards (latency)
FP structural stalls: not enough FP hardware (parallelism)

[Figure: CPI (0-4.5) broken into base, load stalls, and branch stalls for eqntott, espresso, gcc, li, doduc, nasa7, ...]

What is the Impact of What You've Learned About Caches?

1960-1985: Speed = f(no. operations)
1990s: pipelined execution & fast clock rates make performance increasingly dominated by the memory system

[Figure: Processor-Memory performance gap (log scale, 1-1000), growing 50% per year.]

Cache Optimization Summary

[Table: miss-rate techniques - Larger Block Size, Higher Associativity, Victim Caches, Pseudo-Associative Caches, HW Prefetching of Instr/Data, Compiler-Controlled Prefetching, Compiler Techniques to Reduce Misses.]


Cache Cross-Cutting Issues

Superscalar CPU & number of cache ports must match: how many memory accesses per cycle?

Speculative execution and the non-faulting option on memory/TLB

Parallel execution vs. cache locality: want far separation to find independent operations vs. wanting reuse of data accesses to avoid misses

I/O and consistency of data between cache and memory: caches => multiple copies of data
Consistency by HW or by SW?
Where to connect I/O to the computer?

Alpha Memory Performance: Miss Rates of SPEC92

[Figure: miss rates (0.01%-100%, log scale) for AlphaSort, Espresso, Sc, Mdljsp2, Ear - I$ miss 2%-6%, D$ miss 13%-32%, L2 miss 0.6%-10% across benchmarks.]

Alpha CPI Components

[Figure: CPI components (up to ~5.0) per benchmark.] Instruction stall: branch mispredict (green); data cache (blue); instruction cache (yellow); L2$ (pink). Other: compute + register conflicts, structural conflicts.

Predicting Cache Performance from Different Programs (ISA, compiler, ...)

4KB data cache: miss rate 8%, 12%, or 28%?
1KB instruction cache: miss rate 0%, 3%, or 10%?

[Figure: miss rates (up to ~35%) per program, e.g. D$ for gcc.]


Main Memory Background

Performance of Main Memory:
Latency: cache miss penalty
Access Time: time between a request and the word arriving
Cycle Time: time between requests
Bandwidth: I/O & large-block miss penalty (L2)

Main Memory is DRAM: Dynamic Random Access Memory
Dynamic since it needs to be refreshed periodically (8 ms, ~1% of time)
Addresses divided into 2 halves (memory as a 2D matrix):
RAS or Row Access Strobe
CAS or Column Access Strobe

Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor/bit; area is ~10X)
Address not divided: full address

Size: DRAM/SRAM = 4-8; Cost & cycle time: SRAM/DRAM = 8-16

Main Memory Deep Background

Out-of-Core, In-Core, Core Dump?
Core memory? Non-volatile, magnetic
Lost to the 4 Kbit DRAM
Access time 750 ns, cycle time 1500-3000 ns

DRAM Logical Organization (4 Mbit)

[Diagram: square bit array with row decoder, column decoder, sense amps & I/O.]

DRAM Physical Organization

[Diagram: multiple blocks, each with its own block row decoder; column address and I/O routed across the blocks.]


4 Key DRAM Timing Parameters

tRAC: minimum time from the RAS line falling to valid data output
Quoted as the speed of a DRAM when you buy it (the number on the purchase sheet)
A typical 4 Mbit DRAM has tRAC = 60 ns

tRC: minimum time from the start of one row access to the start of the next
tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns

tCAC: minimum time from the CAS line falling to valid data output
15 ns for a 4 Mbit DRAM with a tRAC of 60 ns

tPC: minimum time from the start of one column access to the start of the next
35 ns for a 4 Mbit DRAM with a tRAC of 60 ns

DRAM Performance

A 60 ns (tRAC) DRAM can:
perform a row access only every 110 ns (tRC)
perform column accesses (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
In practice, external address delays and turning around buses make it 40 to 50 ns

These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!

DRAM History

DRAMs: capacity +60%/yr, cost -30%/yr
2.5X cells/area, 1.5X die size in 3 years
A '98 DRAM fab line costs $2B
DRAM only: density, leakage vs. speed
Relies on increasing numbers of computers & memory per computer (60% market)
SIMM or DIMM is the replaceable unit

DRAM Future: 1 Gbit DRAM
Mitsubishi: Blocks 512 x 2 Mb, Clock 200 MHz, Data Pins 64


Main Memory Performance

Simple: CPU, Cache, Bus, Memory all the same width (32 or 64 bits)

Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)

Interleaved: CPU, Cache, Bus 1 word; Memory N modules (e.g. 4 modules); example is word interleaved

Timing model (word size is 32 bits): 1 cycle to send the address, 6 for access time, 1 to send the data
Cache Block is 4 words

Simple M.P.      = 4 x (1 + 6 + 1) = 32
Wide M.P.        = 1 + 6 + 1 = 8
Interleaved M.P. = 1 + 6 + 4 x 1 = 11
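The timing model as code (a sketch of the arithmetic above: 1 cycle to send the address, 6 for the access, 1 per word of data, 4-word block):

int simple_mp(void)      { return 4 * (1 + 6 + 1); }  /* 32 cycles: one word at a time  */
int wide_mp(void)        { return 1 + 6 + 1; }        /*  8 cycles: whole block at once */
int interleaved_mp(void) { return 1 + 6 + 4 * 1; }    /* 11 cycles: 4 banks overlap the
                                                         accesses, 1 word out per cycle */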

Independent Memory Banks

Memory banks for independent accesses vs. faster sequential accesses:
Multiprocessor
I/O
CPU with hit under n misses, non-blocking cache

Superbank: all memory active on one block transfer (or Bank)

How many banks? number of banks >= number of clocks to access a word in a bank
For sequential accesses; otherwise we return to the original bank before it has the next word ready (as in the vector case)


Main Memory Summary

Wider Memory
Interleaved Memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW
DRAM-specific optimizations: page mode & specialty DRAM
DRAM future less rosy?

DRAM Crosscutting Issues

After 20 years of 4X every 3 years, running into a wall? (64 Mb - 1 Gb)
How to keep $1B fab lines full if fewer DRAMs are bought per computer?
Cost/bit -30%/yr if the 4X/3 yr scaling stops?
What will happen to the $40B/yr DRAM industry?

DRAMs per PC over Time

Minimum         DRAM Generation
Memory Size     '86 (1 Mb)  '89 (4 Mb)  '92 (16 Mb)  '96 (64 Mb)  '99 (256 Mb)  '02 (1 Gb)
4 MB            32          8
8 MB                        16          4
16 MB                                   8            2

Virtual Memory

A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space. All translations to primary and secondary addresses are handled transparently.


Basic Issues in VM System Design

Size of the information blocks that are transferred from secondary to main storage (M)

When a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy

Which region of M is to hold the new block --> placement policy

Missing items are fetched from secondary memory only on the occurrence of a fault --> demand load policy

Paging Organization

Virtual and physical address spaces are partitioned into blocks of equal size: page frames (physical) and pages (virtual).

[Diagram: reg <-> cache <-> mem <-> disk; pages move between memory frames and disk.]

Addressing and Accessing a Two-Level Hierarchy

The computer system, HW or SW, must perform any address translation that is required.

[Diagram: system address -> memory management unit (MMU) with a translation function (mapping tables, permissions, etc.); hit -> the access proceeds; miss -> a fault is raised.]

Two ways of forming the address: Segmentation and Paging. Paging is more common. Sometimes the two are used together, one on top of the other.

Paging vs. Segmentation

[Diagram: paging - the system address splits into block and word; the block number indexes a lookup table to find the frame. Segmentation - the lookup table yields a base address that is added to the word offset.]

Paging Organization

[Diagram: 1 KB pages; physical memory frames 0-7 at P.A. 0, 1024, ..., 7168; virtual pages at V.A. 0, 1024, ..., 31744; an address translation MAP converts virtual page numbers to frames, with a 10-bit displacement carried through unchanged.]


Segmentation Organization

Notice that each segment's virtual address starts at 0, different from its physical address.

Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.

Compaction routines must occasionally be run to remove these fragments.

[Diagram: virtual memory addresses (each segment starting at 0) mapped into main memory, with segments 1, 5, 6, 9, and 3 separated by gaps - illustrating external fragmentation.]

Translation Lookaside Buffer

A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB.

A TLB entry holds: Virtual Address, Physical Address, Dirty, Ref, Valid, and Access bits.

Really just a cache on the page table mappings. TLB access time is comparable to cache access time (much less than main memory access time).

Translation Lookaside Buffers

[Diagram: VA -> TLB -> PA on a hit; a miss triggers a page-table walk.]

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.

TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
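A minimal sketch of a fully associative TLB probe (sizes and names ours): every entry's virtual page number is compared, and a hit avoids the page-table walk entirely.

#define TLB_ENTRIES 128

typedef struct { int valid; unsigned vpn, pfn; } TlbEntry;

TlbEntry tlb[TLB_ENTRIES];

int tlb_lookup(unsigned vpn, unsigned *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++)    /* a parallel CAM search in HW */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return 1;                        /* hit: no page-table walk */
        }
    return 0;                                /* miss: walk the page table */
}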

Address Translation with a TLB

The page table is a large data structure in memory; without a TLB, every memory reference would take two accesses.

[Diagram: CPU issues VA -> translation (TLB) -> PA -> cache; on a hit, data returns directly to the CPU.]


Overlapped Cache & TLB Access

[Diagram: the VA splits into a 20-bit page # and 12-bit displacement; the page # goes to the TLB (associative lookup) while the displacement's 10 index bits simultaneously address a 1K-entry, 4-byte cache; the TLB's PA is then compared against the cache tag.]

IF cache hit AND (cache tag = PA) then deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] and TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation

Memory Hierarchy Summary

The memory hierarchy, from slow and cheap to fast and expensive: disk, main memory, cache, registers. At first, consider just two adjacent levels of the hierarchy.

The cache: high speed and expensive. Direct mapped, associative, or set associative organizations.

Virtual memory makes the hierarchy look flat: translate the address from the CPU's logical address to the physical address where the information is actually stored. The TLB helps in speeding up the address translation.

Memory management: how to move information back and forth between levels.

Practical Memory Hierarchy

The issue is NOT inventing new mechanisms. The issue is taste in selecting between many alternatives, putting together a memory hierarchy whose parts fit well together, e.g., L1 data cache write-through, L2 write-back.

TLB and Virtual Memory

Caches, TLBs, and Virtual Memory may all be understood by examining how they deal with four questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?

Page tables map virtual addresses to physical addresses. TLBs make virtual memory practical. Locality in data => locality in the addresses of data, temporal and spatial.


Alpha 21064

Separate Instr & Data TLBs & Caches
TLBs fully associative
TLB updates in SW (Priv Arch Libr)
Caches 8KB direct mapped, write-through
Critical 8 bytes first
Prefetch instruction stream buffer
2 MB L2 cache, direct mapped, WB (off-chip)
256-bit path to main memory, 4 x 64-bit modules
Victim Buffer: to give read priority over write
4-entry write buffer between D$ & L2$

[Diagram: instruction and data paths through the stream buffer, write buffer, and victim buffer.]