TK 2123 Computer Organisation & Architecture




TRANSCRIPT

Page 1: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Dr Masri Ayob

TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Lecture 7: CPU and Memory (3)

Page 2: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Contents

This lecture will discuss:
- Cache.
- Error Correcting Codes.

Page 3: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

The Memory Hierarchy

Trade-off between cost, capacity and access time:
- Faster access time, greater cost per bit.
- Greater capacity, smaller cost per bit.
- Greater capacity, slower access time.

Access time - the time it takes to perform a read or write operation.
Memory cycle time - access time plus any time the memory needs to "recover" before the next access.
Transfer rate - the rate at which data can be moved.

Page 4: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Memory Hierarchies

A five-level memory hierarchy.

Page 5: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Hierarchy List

Registers, L1 cache, L2 cache, main memory, disk cache, disk, optical, tape.

The top of the list is internal memory; the bottom is external memory.

Going down the list: decreasing cost/bit, increasing capacity, and slower access time.

Page 6: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Hierarchy List

It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory. The design challenge is to organise the data and programs in memory so that the accessed memory words are usually in the faster memory.

In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed. So the cache automatically retains a copy of some of the recently used words from the DRAM. If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache.

Page 7: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Hierarchy List

No one technology is optimal in satisfying the memory requirements for a computer system. As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems:
- some internal to the system (directly accessible by the processor), and
- some external (accessible by the processor via an I/O module).

Page 8: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache

- Small amount of fast memory.
- Sits between normal main memory and the CPU.
- May be located on the CPU chip or module.
- Data moves between main memory and cache in fixed-size units, called blocks or cache lines.

Page 9: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache

The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache:
- If so (hit), the word is delivered to the processor.
- If not (miss), a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor.

Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.

The ratio of hits to the total number of requests is known as the hit ratio.
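The hit ratio determines how close the average access time gets to the cache's own speed. A back-of-envelope sketch (the 1 ns cache and 50 ns memory timings are hypothetical, not from the lecture):

```python
# Effective access time of a two-level (cache + main memory) hierarchy.
# Timings are illustrative assumptions.
def effective_access_time(hit_ratio, t_cache_ns=1.0, t_memory_ns=50.0):
    # Hit: pay the cache access time only.
    # Miss: pay the cache check plus the main-memory access.
    return hit_ratio * t_cache_ns + (1.0 - hit_ratio) * (t_cache_ns + t_memory_ns)

for h in (0.50, 0.90, 0.95, 0.99):
    print(h, round(effective_access_time(h), 2), "ns")
# 26.0, 6.0, 3.5 and 1.5 ns: only hit ratios near 1 make the hierarchy pay off
```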

Page 10: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache/Main Memory Structure

Page 11: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache operation – overview

- CPU requests the contents of a memory location.
- Check the cache for this data.
- If present, get it from the cache (fast).
- If not present, read the required block from main memory into the cache, then deliver it from the cache to the CPU.
- The cache includes tags to identify which block of main memory is in each cache slot.
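The steps above can be sketched in a few lines. This is a minimal model, not a hardware description: the cache is a dict keyed by block number, and the block size and memory contents are made-up values.

```python
BLOCK_SIZE = 4                          # words per block (assumed)
main_memory = list(range(64))           # 64 words of toy "main memory"
cache = {}                              # block number -> list of words

def read(addr):
    block = addr // BLOCK_SIZE          # which main-memory block holds addr
    if block not in cache:              # miss: fetch the whole block
        start = block * BLOCK_SIZE
        cache[block] = main_memory[start:start + BLOCK_SIZE]
    # hit (or just-filled miss): deliver the word from the cache
    return cache[block][addr % BLOCK_SIZE]

print(read(10))  # miss: block 2 (words 8-11) is fetched, word 10 returned
print(read(11))  # hit: word 11 is already in block 2
```

Note that a miss brings in the whole block, so the follow-up read of word 11 is served from the cache; this is locality of reference at work.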

Page 12: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache Operation

Page 13: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache Design

- Size
- Mapping function
- Replacement algorithm
- Write policy
- Block size
- Number of caches (L1, L2, L3, etc.)

Page 14: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Size does matter

Cost: more cache is expensive.
Speed: more cache is faster (up to a point), but checking the cache for data takes time.

We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone, and large enough so that the overall average access time is close to that of the cache alone.

The larger the cache, the larger the number of gates involved in addressing the cache. The result is that large caches tend to be slightly slower than small ones.

Page 15: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Comparison of Cache Sizes

Processor | Type | Year of Introduction | L1 cache | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | —
CRAY MTA | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —

Page 16: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache: Mapping Function

There are fewer cache lines than main memory blocks, so an algorithm is needed for mapping main memory blocks into cache lines. Three techniques:
- Direct
- Associative
- Set associative

Page 17: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Direct Mapping

Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place.

Pros and cons:
- Simple.
- Inexpensive.
- Fixed location for a given block: if a program repeatedly accesses two blocks that map to the same line, cache misses are very high.
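The fixed location is just modular arithmetic. A sketch with assumed parameters (4-word blocks, 8 cache lines) showing how an address splits into line number and tag, and why two blocks can collide:

```python
BLOCK_SIZE = 4   # words per block (assumed)
NUM_LINES = 8    # cache lines (assumed)

def direct_map(addr):
    block = addr // BLOCK_SIZE
    line = block % NUM_LINES    # the single line this block may occupy
    tag = block // NUM_LINES    # distinguishes the blocks sharing that line
    return line, tag

print(direct_map(0))    # block 0  -> line 0, tag 0
print(direct_map(128))  # block 32 -> line 0, tag 4: collides with block 0
```

A program alternating between addresses 0 and 128 would miss on every access, since each fetch evicts the other block from line 0.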

Page 18: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Associative Mapping

A main memory block can load into any line of the cache. The memory address is interpreted as a tag and a word; the tag uniquely identifies a block of memory, and every line's tag is examined for a match.

Disadvantage: cache searching gets expensive, because complex circuitry is required to examine the tags of all cache lines in parallel.

Page 19: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Set Associative Mapping

A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.
- The cache is divided into a number of sets.
- Each set contains a number of lines.
- A given block maps to any line in a given set, e.g. block B can be in any line of set i.

With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. With k-way set associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
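Continuing the earlier assumed geometry (8 lines, now organised as 4 sets of 2 ways), the set index replaces the line number, and the colliding blocks from the direct-mapped case can coexist:

```python
BLOCK_SIZE = 4   # words per block (assumed)
NUM_SETS = 4     # 8 lines organised as 4 sets x 2 ways (assumed)

def set_map(addr):
    block = addr // BLOCK_SIZE
    return block % NUM_SETS, block // NUM_SETS   # (set index, tag)

# Blocks 0 and 32 fall in the same set, but with 2 ways per set they can
# both be resident; only the 2 tags in set 0 need comparing on a lookup.
print(set_map(0))    # (0, 0)
print(set_map(128))  # (0, 8)
```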

Page 20: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Replacement Algorithms

When cache memory is full, some block in cache memory must be selected for replacement.

Direct mapping: no choice. Each block maps to only one line, so that line is replaced.

Page 21: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Replacement Algorithms (2): Associative & Set Associative

Implemented in hardware (for speed):
- Least recently used (LRU): keeps track of the usage of each block and replaces the block that was last used the longest time ago.
- First in first out (FIFO): replace the block that has been in the cache longest.
- Least frequently used (LFU): replace the block that has had the fewest hits.
- Random.
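LRU can be sketched in software with an ordered dictionary (a modelling convenience; real caches track recency in hardware). The 2-line capacity is an assumption for the demo:

```python
from collections import OrderedDict

CAPACITY = 2
cache = OrderedDict()   # block number -> contents, oldest-used first

def access(block):
    if block in cache:
        cache.move_to_end(block)        # hit: mark as most recently used
        return "hit"
    if len(cache) == CAPACITY:
        cache.popitem(last=False)       # full: evict the least recently used
    cache[block] = None                 # fetch the new block (contents elided)
    return "miss"

print([access(b) for b in [1, 2, 1, 3, 2]])
# ['miss', 'miss', 'hit', 'miss', 'miss']: accessing 3 evicts block 2,
# because block 1 was touched more recently.
```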

Page 22: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Write Policy

Issues:
- Must not overwrite a cache block unless main memory is up to date.
- Multiple CPUs may have individual caches.
- I/O may address main memory directly.

Page 23: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Write through

All writes go to main memory as well as the cache, so multiple CPUs can monitor main memory traffic to keep their local caches up to date.

Disadvantages:
- Lots of traffic.
- Slows down writes.
- Creates a bottleneck.
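The traffic problem is easy to see in a toy model (all names and counts here are illustrative): under write-through, every store reaches main memory, even repeated stores to the same word.

```python
memory_writes = 0
cache = {}

def write_through(addr, value):
    global memory_writes
    cache[addr] = value     # update the cached copy...
    memory_writes += 1      # ...and always write main memory too

for v in range(100):
    write_through(0, v)     # 100 stores to the same word

print(memory_writes)  # 100: every store generated bus traffic
```

A write-back policy, by contrast, would mark the cached block dirty and update main memory only when the block is evicted, collapsing those 100 stores into one memory write.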

Page 24: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Cache: Line Size

As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality. Two issues:
- Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.
- As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

Page 25: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Number of Caches

Multilevel caches:
- On-chip cache: a cache on the same chip as the processor. It reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance.
- Is an external cache still desirable? Yes: most contemporary designs include both on-chip and external caches, e.g. a two-level cache with an internal cache (L1) and an external cache (L2). Why? If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus, giving poor performance.

Page 26: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Number of Caches

More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data.

There are two potential advantages of a unified cache:
- For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically.
- Only one cache needs to be designed and implemented.

Nevertheless, the trend is toward split caches, as in the Pentium and PowerPC, which emphasise parallel instruction execution and the prefetching of predicted future instructions. The advantage of a split cache is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit.

Page 27: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Intel Cache Evolution

Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. | Pentium Pro
(same problem) | Move L2 cache on to the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Add external L3 cache. | Pentium III
(same problem) | Move L3 cache on-chip. | Pentium 4

Page 28: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Locality

Why does the principle of locality make sense?
- In most cases, the next instruction to be fetched immediately follows the last instruction fetched (except for branch and call instructions).
- A program remains confined to a rather narrow window of procedure-invocation depth. Thus, over a short period of time, references to instructions tend to be localised to a few procedures.
- Most iterative constructs consist of a relatively small number of instructions repeated many times.
- In many programs, much of the computation involves processing data structures, such as arrays or sequences of records. In many cases, successive references to these data structures will be to closely located data items.

Page 29: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Internal Memory (revision)

Page 30: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Memory Packaging and Types

A SIMM holding 256 MB. Two of the chips control the SIMM.

A group of chips, typically 8 or 16, is mounted on a tiny PCB and sold as a unit.
- SIMM (single inline memory module): has a row of connectors on one side.
- DIMM (dual inline memory module): has a row of connectors on both sides.

Page 31: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Error Correction

Hard failure:
- Permanent defect.
- Caused by harsh environmental abuse, manufacturing defects, and wear.

Soft error:
- Random and non-destructive; no permanent damage to memory.
- Caused by power supply problems.

Errors are detected (and corrected) using a Hamming error-correcting code.

Page 32: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Error Correction

When reading out the stored word, a new set of K code bits is generated from the M data bits and compared with the fetched code bits. There are three possible results:
- No errors: the fetched data bits are sent out.
- An error is detected and it is possible to correct it: the data bits plus the error-correction bits are fed into a corrector, which sends out the corrected set of M bits.
- An error is detected, but it is not possible to correct it. This condition is reported.

Page 33: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Error Correcting Code Function

A function of the M data bits produces the K code bits; the stored codeword is M + K bits.

Page 34: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Error Correcting Codes: Venn diagram

(a) Encoding of 1100. (b) Even parity added. (c) Error in AC.

Page 35: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Error Correction: Hamming Distance

The number of bit positions in which two codewords differ is called the Hamming distance. If two codewords are a Hamming distance d apart, it will require d single-bit errors to convert one into the other. E.g. the codewords 11110001 and 00110000 are a Hamming distance 3 apart, because it takes 3 single-bit errors to convert one into the other.

- To detect d single-bit errors, you need a distance d + 1 code.
- To correct d single-bit errors, you need a distance 2d + 1 code.

To determine how many bits differ, just compute the bitwise Boolean EXCLUSIVE OR of the two codewords and count the number of 1 bits in the result.
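The XOR-and-count recipe above is a one-liner:

```python
# Hamming distance: XOR leaves a 1 exactly where the codewords differ,
# then count the 1 bits in the result.
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming_distance(0b11110001, 0b00110000))  # 3, as in the example
```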

Page 36: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Example: Hamming algorithm

All bits whose bit number (starting with bit 1) is a power of 2 are parity bits; the rest are used for data. E.g. with a 16-bit word, 5 parity bits are added: bits 1, 2, 4, 8, and 16 are parity bits, and all the rest are data bits.

Bit b is checked by those parity bits b1, b2, ..., bj such that b1 + b2 + ... + bj = b. For example, bit 5 is checked by bits 1 and 4 because 1 + 4 = 5.
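The rule "parity positions summing to b" is just the binary expansion of the bit number. A small sketch (the function name is ours, not from the lecture):

```python
# Which parity bits check bit b? The powers of two in b's binary expansion.
def checking_parity_bits(b: int):
    return [1 << i for i in range(b.bit_length()) if b & (1 << i)]

print(checking_parity_bits(5))   # [1, 4], because 1 + 4 = 5
print(checking_parity_bits(19))  # [1, 2, 16]
```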

Page 37: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Example: Hamming algorithm

Construction of the Hamming code for the memory word 1111000010101110, by adding 5 check bits to the 16 data bits. We will (arbitrarily) use even parity in this example.
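The construction can be sketched directly (a software model of the scheme described above, with even parity; the function name is ours):

```python
# Even-parity Hamming encoder for a 16-bit word: parity bits sit at
# positions 1, 2, 4, 8, 16; data bits fill the remaining 16 positions.
def hamming_encode(data_bits):          # data_bits: list of 16 ints (0/1)
    n = 21                              # 16 data bits + 5 check bits
    code = [0] * (n + 1)                # 1-indexed; code[0] unused
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):             # pos is NOT a power of two: data slot
            code[pos] = next(data)
    for p in (1, 2, 4, 8, 16):          # set each parity bit for even parity
        covered = [code[i] for i in range(1, n + 1) if i & p]
        code[p] = sum(covered) % 2
    return "".join(map(str, code[1:]))

word = [int(b) for b in "1111000010101110"]
print(hamming_encode(word))  # 001011100000101101110
```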

Page 38: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Example: Hamming algorithm

Consider what would happen if bit 5 were inverted by an electrical surge on the power line. The new codeword would be 001001100000101101110 instead of 001011100000101101110.

The 5 parity bits are checked, with the following results: parity bits 1 and 4 are incorrect, but 2, 8, and 16 are correct. Since 1 + 4 = 5, bit 5 has been inverted.
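The check can be mechanised: sum the positions of the failing parity checks to get the "syndrome", which is the position of the flipped bit (a sketch; the function name is ours):

```python
# Sum the positions of the failed even-parity checks (1, 2, 4, 8, 16);
# the total points at the single flipped bit, and 0 means no error.
def hamming_syndrome(codeword):               # codeword: 21-char bit string
    bits = [0] + [int(b) for b in codeword]   # 1-indexed
    syndrome = 0
    for p in (1, 2, 4, 8, 16):
        if sum(bits[i] for i in range(1, 22) if i & p) % 2:   # odd = failed
            syndrome += p
    return syndrome

print(hamming_syndrome("001001100000101101110"))  # 5: bit 5 is wrong
print(hamming_syndrome("001011100000101101110"))  # 0: codeword is clean
```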

Page 39: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Thank you. Q & A.