cmput229 - fall 2003


Page 1: CMPUT229 - Fall 2003

CMPUT 229 - Computer Organization and Architecture I

1

CMPUT229 - Fall 2003

Topic D: The Memory Hierarchy

José Nelson Amaral

Page 2: CMPUT229 - Fall 2003


Reading Assignment

Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H)

Chapter 6: The Memory Hierarchy

Page 3: CMPUT229 - Fall 2003

Types of Memories

Read/Write Memory (RWM): we can store and retrieve data.

Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location.

Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again.

Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.

Page 4: CMPUT229 - Fall 2003

[Figure: internal structure of an 8×4 static RAM. A 3-to-8 decoder driven by the address inputs A2, A1, A0 selects one of eight rows (0 to 7) of four storage cells; each cell has IN, OUT, SEL, and WR terminals. Data inputs: DIN3 to DIN0. Data outputs: DOUT3 to DOUT0. Control inputs: WE_L, CS_L, OE_L. Internal signals: WR_L, IOE_L.]

Page 5: CMPUT229 - Fall 2003

[Figure: the 8×4 static RAM of the previous slide, repeated as an animation frame (3-to-8 decoder on A2 to A0, rows 0 to 7 of storage cells, data inputs and outputs, control signals WE_L, CS_L, OE_L, WR_L, IOE_L).]

Page 6: CMPUT229 - Fall 2003

[Figure: the 8×4 static RAM of the previous slides, repeated as an animation frame.]

Page 7: CMPUT229 - Fall 2003

[Figure: the 8×4 static RAM of the previous slides, repeated as an animation frame.]

Page 8: CMPUT229 - Fall 2003

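Refreshing the Memory

[Figure: capacitor voltage Vcap versus time, between 0 V and VCC, with the LOW and HIGH regions marked. Labeled events: "0 stored", "1 written", "refreshes".]

The charge stored in a DRAM cell leaks away over time, so the voltage of a written 1 decays toward the LOW region.

The solution is to periodically refresh the memory cells by reading and writing back each one of them.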

Page 9: CMPUT229 - Fall 2003

SRAM with Bi-directional Data Bus

[Figure: a row of SRAM storage cells (IN, OUT, SEL, WR) whose inputs and outputs share the bi-directional data pins DIO3 to DIO0, which connect to the microprocessor. Control inputs: WE_L, CS_L, OE_L. Internal signals: WR_L, IOE_L.]

Page 10: CMPUT229 - Fall 2003

DRAM High Level View

[Figure: a DRAM chip organized as a 4×4 array of supercells (rows 0 to 3, columns 0 to 3) with an internal row buffer. The memory controller, which connects to the CPU, drives a 2-bit addr bus and an 8-bit data bus.]

Bryant/O’Hallaron, p. 459

Page 11: CMPUT229 - Fall 2003

DRAM RAS Request

[Figure: the memory controller sends RAS = 2 on the 2-bit addr bus; the DRAM chip copies row 2 of the 4×4 supercell array into the internal row buffer.]

RAS = Row Address Strobe

Bryant/O’Hallaron, p. 460

Page 12: CMPUT229 - Fall 2003

DRAM CAS Request

[Figure: the memory controller sends CAS = 1 on the 2-bit addr bus; supercell (2,1) is selected from the internal row buffer and returned on the 8-bit data bus.]

CAS = Column Address Strobe

Bryant/O’Hallaron, p. 460

Page 13: CMPUT229 - Fall 2003

Memory Modules

[Figure: a 64 MB memory module consisting of eight 8M×8 DRAMs (DRAM 0 to DRAM 7). The memory controller sends addr (row = i, col = j) to the chips. Supercell (i,j) of DRAM 0 supplies bits 0-7, and the remaining chips supply bits 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, and bits 56-63 (DRAM 7). The memory controller combines the eight bytes into the 64-bit doubleword at main memory address A and sends it to the CPU chip.]

Bryant/O’Hallaron, p. 461

Page 14: CMPUT229 - Fall 2003

Read Cycle on an Asynchronous DRAM

Step 1: Apply the row address.

Step 2: RAS goes from high to low and remains low.

Step 3: Apply the column address.

Step 4: WE must be high.

Step 5: CAS goes from high to low and remains low.

Step 6: OE goes low.

Step 7: Data appears.

Step 8: RAS and CAS return to high.

Page 15: CMPUT229 - Fall 2003

Improved DRAMs

Central Idea: Each read to a DRAM actually reads a complete row of bits, or word line, from the DRAM core into an array of sense amps.

A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor.

All the other bits already extracted from the DRAM cells into the sense amps are wasted.

Page 16: CMPUT229 - Fall 2003

Fast Page Mode DRAMs

In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address.

To read in fast page mode, all the steps from 1 to 7 of a standard read cycle are performed.

Then OE and CAS are switched high, but RAS remains low.

Then steps 3 to 7 (providing a new column address, asserting CAS and OE) are performed for each new memory location to be read.

Page 17: CMPUT229 - Fall 2003

A Fast Page Mode Read Cycle on an Asynchronous DRAM

Page 18: CMPUT229 - Fall 2003

Extended Data Output DRAMs (EDO DRAM)

The process to read multiple locations in an EDO DRAM is very similar to Fast Page Mode.

The difference is that the output drivers are not disabled when CAS goes high.

This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins.

As a result, faster read cycle times are allowed.

Page 19: CMPUT229 - Fall 2003

An Extended Data Output (EDO) Read Cycle on an Asynchronous DRAM

Page 20: CMPUT229 - Fall 2003

Synchronous DRAMs (SDRAM)

A Synchronous DRAM (SDRAM) has a clock input. It operates in a similar fashion to fast page mode and EDO DRAM. However, consecutive data is output synchronously on the falling/rising edge of the clock, instead of on command by CAS.

How many data elements will be output (the length of the burst) is programmable up to the maximum size of the row.

The clock period of an SDRAM is typically one order of magnitude shorter than the access time of an individual access.

Page 21: CMPUT229 - Fall 2003

DDR SDRAM

A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers on both the rising and the falling edge of the clock.

Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.

Page 22: CMPUT229 - Fall 2003

The Rambus DRAM (RDRAM)

Multiple memory arrays (banks).

Rambus DRAMs are synchronous and transfer data on both edges of the clock.

Page 23: CMPUT229 - Fall 2003

SDRAM Memory Systems

Complex circuits for RAS/CAS/OE.

Each DIMM is connected in parallel with the memory controller. (DIMM = Dual In-line Memory Module)

Often requires buffering.

Needs the whole clock cycle to establish valid data.

Making the bus wider is mechanically complicated.

Page 24: CMPUT229 - Fall 2003


RDRAM Memory Systems

Page 25: CMPUT229 - Fall 2003

Bus Structure

[Figure: the CPU (register file, ALU, bus interface) connects over the system bus to the I/O bridge. The I/O bridge connects over the memory bus to main memory and over the I/O bus to the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]

Bryant/O’Hallaron, p. 472

Page 26: CMPUT229 - Fall 2003

DMA Request

[Figure: the bus structure of the previous slide (CPU, system bus, I/O bridge, memory bus, main memory, I/O bus with USB controller, graphics adapter, disk controller, and expansion slots), illustrating the request for a DMA transfer.]

DMA = Direct Memory Access

Bryant/O’Hallaron, p. 473

Page 27: CMPUT229 - Fall 2003

DMA Transfer

[Figure: the same bus structure, illustrating the DMA transfer itself.]

DMA = Direct Memory Access

Bryant/O’Hallaron, p. 473

Page 28: CMPUT229 - Fall 2003

DMA Completion Notification

[Figure: the same bus structure, with an arrow labeled "Interrupt" indicating the notification that the DMA transfer is complete.]

DMA = Direct Memory Access

Bryant/O’Hallaron, p. 474

Page 29: CMPUT229 - Fall 2003

Locality

We say that a computer program exhibits good locality if it tends to reference data items that are near other recently referenced data items, or that were recently referenced themselves.

Because a program might do one of these things but not the other, the principle of locality is separated into two flavors:

Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future.

Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future.

Bryant/O’Hallaron, p. 478

Page 30: CMPUT229 - Fall 2003

Examples

In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

    int SumVec(int v[], int N)
    {
        int i;
        int sum = 0;

        for (i = 0; i < N; i = i + 1)
            sum += v[i];
        return sum;
    }

    int Sampler(int v[], int N, int K)
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < K; i = i + 1)
        {
            j = RandInt(0, N-1);
            sum += v[j];
        }
        return sum/K;
    }

Bryant/O’Hallaron, p. 479

Page 31: CMPUT229 - Fall 2003

Memory Hierarchy

L0: Registers. CPU registers hold words retrieved from cache memory.

L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.

L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.

L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.

L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.

L5: Remote secondary storage (distributed file systems, Web servers).

Toward L0: smaller, faster, and costlier (per byte) storage devices.

Toward L5: larger, slower, and cheaper (per byte) storage devices.

Bryant/O’Hallaron, p. 483

Page 32: CMPUT229 - Fall 2003

Caching Principle

Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (in the figure, blocks 4, 9, 14, and 3).

Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure, blocks 0 to 15).

Data is copied between levels in block-sized transfer units.

Bryant/O’Hallaron, p. 484

Page 33: CMPUT229 - Fall 2003

Cache Misses

Cold misses, or compulsory misses, occur the first time that a data item is referenced.

Conflict misses occur when two memory references have to occupy the same cache line. They can occur even when the remainder of the cache is not in use.

Capacity misses occur when the cache is too small to hold the data in use: there are no more free lines in the cache.

Page 34: CMPUT229 - Fall 2003

L1 and L2 Bus System

[Figure: the CPU chip contains the register file, the ALU, the L1 cache, and the bus interface. The bus interface connects to the L2 cache over the cache bus and to the I/O bridge over the system bus; the I/O bridge connects to main memory over the memory bus.]

Bryant/O’Hallaron, p. 488

Page 35: CMPUT229 - Fall 2003

Cache Organization

[Figure: a cache drawn as an array of S = 2^s sets (Set 0, Set 1, ..., Set S-1). Each set holds E lines. Each line consists of 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0, 1, ..., B-1).]

Cache size: C = B x E x S data bytes

Bryant/O’Hallaron, p. 488

Page 36: CMPUT229 - Fall 2003

Address Partition

Address (m bits, from bit m-1 down to bit 0):

    | Tag (t bits) | Set index (s bits) | Block offset (b bits) |

Tag: compared with tags in the cache to find a match.

Set index: used to find the set where the data might be found in the cache.

Block offset: selects which word, inside the block, is referenced.

Bryant/O’Hallaron, p. 488

Page 37: CMPUT229 - Fall 2003

Multi-Level Cache Organization

[Figure: the CPU's registers (Regs) are backed by a split L1 cache (L1 i-cache and L1 d-cache), followed by a unified L2 cache, main memory, and disk.]

Bryant/O’Hallaron, p. 504

Page 38: CMPUT229 - Fall 2003

Writing Cache-Conscious Programs

Problem: Write C code for a function that computes the sum of the elements of a two-dimensional array, a[M][N], of integers.

    /* C99 variable-length array parameter: M and N must precede a */
    int SumArray(int M, int N, int a[M][N])

    /* Inner loop walks along a row: consecutive memory locations */
    int SumArrayRows(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Inner loop walks down a column: locations N elements apart */
    int SumArrayCols(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Bryant/O’Hallaron, p. 508

Page 39: CMPUT229 - Fall 2003

SumArrayRows Data Access Order

Memory layout of the array a (6 columns, 4-byte integers, row-major order):

    Address        Element
    0x8000 4000    a[0][0]
    0x8000 4004    a[0][1]
    0x8000 4008    a[0][2]
    0x8000 400C    a[0][3]
    0x8000 4010    a[0][4]
    0x8000 4014    a[0][5]
    0x8000 4018    a[1][0]
    0x8000 401C    a[1][1]
    0x8000 4020    a[1][2]
    0x8000 4024    a[1][3]
    0x8000 4028    a[1][4]
    0x8000 402C    a[1][5]
    0x8000 4030    a[2][0]
    0x8000 4034    a[2][1]
    0x8000 4038    a[2][2]
    0x8000 403C    a[2][3]
    0x8000 4040    a[2][4]
    0x8000 4044    a[2][5]
    0x8000 4048    a[3][0]
    0x8000 404C    a[3][1]
    0x8000 4050    a[3][2]
    0x8000 4054    a[3][3]
    0x8000 4058    a[3][4]
    •••

SumArrayRows visits the elements in this order, so consecutive accesses are 4 bytes apart.

Code: SumArrayRows, as on the "Writing Cache-Conscious Programs" slide.

Bryant/O’Hallaron, p. 508

Page 40: CMPUT229 - Fall 2003

SumArrayRows Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayRows code.]

Bryant/O’Hallaron, p. 508

Page 41: CMPUT229 - Fall 2003

SumArrayRows Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayRows code.]

Bryant/O’Hallaron, p. 508

Page 42: CMPUT229 - Fall 2003

SumArrayRows Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayRows code.]

Bryant/O’Hallaron, p. 508

Page 43: CMPUT229 - Fall 2003

SumArrayCols Data Access Order

[Figure: the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4], 6 columns of 4-byte integers) shown next to the SumArrayCols code.]

SumArrayCols visits a[0][0], a[1][0], a[2][0], a[3][0], and so on, so consecutive accesses are 24 bytes (one row of 6 integers) apart.

Bryant/O’Hallaron, p. 508

Page 44: CMPUT229 - Fall 2003

SumArrayCols Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayCols code.]

Bryant/O’Hallaron, p. 508

Page 45: CMPUT229 - Fall 2003

SumArrayCols Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayCols code.]

Bryant/O’Hallaron, p. 508

Page 46: CMPUT229 - Fall 2003

SumArrayCols Data Access Order

[Figure: animation frame repeating the memory layout of a (addresses 0x8000 4000 to 0x8000 4058, elements a[0][0] to a[3][4]) and the SumArrayCols code.]

Bryant/O’Hallaron, p. 508

Page 47: CMPUT229 - Fall 2003

Read Bandwidth

The rate at which a program reads data from the memory system is called the read throughput or the read bandwidth.

The read throughput of a program depends on the memory hierarchy level from which the data is retrieved.

The read throughput is measured in bytes per second, or more commonly in Mbytes/s.

To estimate the read throughput of each level, we can write a program that forces the data to come from that level of the hierarchy.

Page 48: CMPUT229 - Fall 2003

Measuring Read Bandwidth

    /* data[] is a global int array, defined elsewhere, with at least elems elements */
    void test(int elems, int stride)
    {
        int i;
        int result = 0;
        volatile int sink;

        /* Read every stride-th element of the first elems elements */
        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* to prevent compiler from optimizing away the loop */
    }

Bryant/O’Hallaron, p. 508

Page 49: CMPUT229 - Fall 2003

Pentium III Xeon Memory Mountain

[Figure: 3-D surface of read throughput (MB/s, 0 to 1200) as a function of stride (words, s1 to s15) and working set size (bytes, 2k to 8m). Ridges of temporal locality are labeled L1, L2, and Mem; the slopes of spatial locality run along the stride axis.]

Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache.

Bryant/O’Hallaron, p. 514

Page 50: CMPUT229 - Fall 2003

Temporal Locality (stride = 1)

[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, from 8m down to 1k), with the main memory region, the L2 cache region, and the L1 cache region marked.]

Page 51: CMPUT229 - Fall 2003

Spatial Locality Slope (size = 256 KB)

[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16). Annotation: one access per cache line.]