
1

Parallel Scientific Computing: Algorithms and Tools

Lecture #2

APMA 2821A, Spring 2008

Instructors: George Em Karniadakis, Leopold Grinberg

2

Memory

Bits: 0, 1; Bytes: 8 bits

Memory sizes: PB – 10^15 bytes; TB – 10^12 bytes; GB – 10^9 bytes; MB – 10^6 bytes; KB – 10^3 bytes

Memory performance measures:

Access time (response time, latency): interval between the time a memory request is issued and the time the request is satisfied.

Cycle time: minimum time between two successive memory requests.

Timeline: memory request issued at t0, request satisfied at t1.

Access time: t1 - t0; Cycle time: t2 - t0.

If there is another request at t0 < t < t2, the memory is busy and will not respond; the request has to wait until t > t2.

Memory busy for t0 < t < t2 (DRAM only).
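A rough latency probe, as a sketch only (it assumes POSIX clock_gettime; the array size and the use of rand() are illustrative): chase a randomly permuted pointer chain so that each load must wait for the previous one, and the average time per load approximates the access time defined above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 22)                      /* 4M entries, far larger than cache */

int main(void)
{
    long *next = malloc(N * sizeof *next), i, j, tmp;
    struct timespec t0, t1;

    for (i = 0; i < N; i++) next[i] = i;   /* identity permutation              */
    for (i = N - 1; i > 0; i--) {          /* Sattolo shuffle: one big cycle    */
        j = rand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0, j = 0; i < N; i++) j = next[j];   /* dependent (serialized) loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("~%.1f ns per load (j=%ld)\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N, j);
    free(next);
    return 0;
}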

3

Memory Hierarchy

Memory can be fast (costly) or slow (cheaper).

Increase overall performance: use locality of reference.

Faster memory (also smaller) closer to the CPU; slower memory (also larger) farther away from the CPU.

Keep often-used data in fast memory; leave less-often-used data in slow memory.

Key: when lower levels of the hierarchy send the value at location x to higher levels, also send the contents at x+1, x+2, etc., i.e. send a block of data – a cache line.

4

Memory Hierarchy

Performance of different levels can be very different: e.g., access time for L1 cache can be 1 cycle, L2 can be 5 or 6 cycles, while main memory can be dozens of cycles and secondary memory can be orders of magnitude slower.

Hierarchy (fastest/smallest to slowest/largest): registers, Level-1 cache, Level-2 cache, main memory, secondary memory (hard disk), network storage, …

Cache: a piece of fast memory. Expensive, CA$H?

Moving up the hierarchy: increasing speed, increasing cost, decreasing size. Moving down: decreasing speed, decreasing cost, increasing size.

5

How Memory Hierarchy Works

(RISC processor) The CPU works only on data in registers. If the data is not in a register, it requests the data from memory and loads it into a register …

Data in registers come only from, and go only to, the L1 cache.

When the CPU requests data from memory, the L1 cache takes over:

If the data is in the L1 cache (cache hit), return the data to the CPU immediately; end memory access;

If the data is not in the L1 cache (cache miss) …

6

How Memory Hierarchy Works

If the data is not in the L1 cache, the L1 cache forwards the memory request down to the L2 cache.

If the L2 cache has the data (cache hit), it returns the data to the L1 cache, which in turn returns the data to the CPU; end memory access.

If the L2 cache does not have the data (cache miss) …

If the data is not in the L2 cache, the L2 cache forwards the memory request down to main memory.

If the data is in main memory, main memory passes the data to the L2 cache, which then passes it to the L1 cache, which then passes it to the CPU.

If the data is not in memory, the request is passed to the OS to read the data from secondary storage (disk); the data is then passed to memory, L2 cache, L1 cache, and a register.
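A toy, self-contained simulation of the lookup cascade above – not a model of any real CPU; the level sizes, direct mapping, and line size are made-up values chosen only to make the cascade visible:

#include <stdio.h>

#define LINE        4                 /* doubles per cache line             */
#define L1_LINES    8
#define L2_LINES   32
#define MEM_DOUBLES 1024

static double mem[MEM_DOUBLES];       /* "main memory"                      */
static long l1_tag[L1_LINES], l2_tag[L2_LINES];
static long l1_miss, l2_miss;

static double load(long i)            /* i = index into mem[]               */
{
    long line = i / LINE;             /* which memory cache line            */
    if (l1_tag[line % L1_LINES] != line) {          /* L1 miss              */
        l1_miss++;
        if (l2_tag[line % L2_LINES] != line) {      /* L2 miss              */
            l2_miss++;
            l2_tag[line % L2_LINES] = line;         /* fill L2 first ...    */
        }
        l1_tag[line % L1_LINES] = line;             /* ... then fill L1     */
    }
    return mem[i];                    /* value returned to the CPU          */
}

int main(void)
{
    for (long j = 0; j < L1_LINES; j++) l1_tag[j] = -1;
    for (long j = 0; j < L2_LINES; j++) l2_tag[j] = -1;
    for (long i = 0; i < MEM_DOUBLES; i++) load(i); /* sequential scan      */
    printf("L1 misses: %ld, L2 misses: %ld\n", l1_miss, l2_miss);
    return 0;
}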

7

Cache Line

A cache line is the smallest unit of data that can be transferred to or from memory (and the L2 cache):

usually between 32 and 128 bytes;

may contain several data items.

When the L2 cache passes data to the L1 cache, or when main memory passes data to the L2 cache, a whole cache line, instead of a single piece of data, is transferred.

When the data in variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to cache.

Example: assume a 32-byte cache line and that X[11] is requested by the CPU. Result: the line X[10] – X[13] is brought into cache from memory (X[9] and X[14] lie in the neighboring cache lines).

8

Cache Effect on Performance

Cache misses degrade performance: when there is a cache miss, the CPU is idle waiting for a cache line to be brought from a lower level of the memory hierarchy.

Increasing the cache hit rate → higher performance. Efficiency is directly related to reuse of data in cache.

To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming.

// indirect addressing
for(i=0;i<100;i++) y[i] = 2*x[index[i]];

// sequential access
for(i=0;i<100;i++) y[i] = 2*x[i];

// strided access
for(i=0;i<100;i=i+4) y[i] = 2*x[i];
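A rough timing sketch of the three patterns above (a sketch only: it assumes POSIX clock_gettime, the array size is illustrative, and the strided loop touches only a quarter of the elements yet still misses on most cache lines):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 22)                             /* 4M doubles, larger than cache */

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    long   *index = malloc(N * sizeof *index);
    struct timespec t0, t1;
    long i;

    for (i = 0; i < N; i++) { x[i] = i; y[i] = 0; index[i] = rand() % N; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++) y[i] = 2 * x[i];              /* sequential */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sequential: %g s\n", elapsed(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i += 4) y[i] = 2 * x[i];           /* stride 4   */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 4:   %g s\n", elapsed(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++) y[i] = 2 * x[index[i]];       /* indirect   */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("indirect:   %g s\n", elapsed(t0, t1));

    free(x); free(y); free(index);
    return 0;
}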

9

Where in Cache to Put Data from Memory

Cache is organized into cache lines. Memory is also logically organized into cache lines.

Example (32-byte cache line): a 1 MB cache holds 32,768 cache lines; a 2 GB main memory holds 67,108,864 cache lines.

Memory size >> cache size.

Number of cache lines in memory >> number of cache lines in cache.

Many cache lines in memory correspond to one cache line in cache.

10

Cache Classification

Direct-mapped cache: given a memory cache line, it is always placed in one specific cache line in cache.

Fully associative cache: given a memory cache line, it can be placed in any of the cache lines in cache.

N-way set associative cache: given a memory cache line, it can be placed in any of a set of N cache lines in cache.

11

Direct-Mapped Cache

A set of memory cache lines always corresponds to exactly the same cache line in cache.

Cheap to implement in hardware.

May cause cache thrashing: repeatedly displacing and loading cache lines.

Example (figure): with an 8 KB direct-mapped cache, the memory lines starting at addresses 0, 8K, 16K, …, 2G all map to the same cache line.

Line-Index = Mod(mem-cache-line-index, tot-cache-lines-in-cache)
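A small sketch of the mapping formula above (the 8 KB cache and 32-byte line are the figure's values; everything else is illustrative):

#include <stdio.h>

#define LINE_BYTES   32L                        /* cache line size           */
#define CACHE_BYTES  (8L * 1024)                /* 8 KB direct-mapped cache  */
#define CACHE_LINES  (CACHE_BYTES / LINE_BYTES)

/* Which cache line a given memory address lands in. */
static long cache_line_index(long addr)
{
    long mem_line = addr / LINE_BYTES;          /* memory cache-line index   */
    return mem_line % CACHE_LINES;              /* Mod(mem-line, tot-lines)  */
}

int main(void)
{
    /* Addresses 0, 8K, 16K all collide in cache line 0. */
    printf("%ld %ld %ld\n", cache_line_index(0),
           cache_line_index(8 * 1024), cache_line_index(16 * 1024));
    return 0;
}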

12

Cache Thrashing: Example

Assumptions:

Direct-mapped cache;

Cache size: 1 MB;

Cache line: 32 bytes.

double X[131072], Y[131072];
long i, j;
// initialization of X, Y
…
for(i=0;i<131072;i++) Y[i] = X[i] + Y[i];
…

1 double value = 8 bytes; 131072 double values = 1 MB; 1 cache line = 32 bytes = 4 double values.

X[131072]: 1 MB of memory; Y[131072]: 1 MB of memory.

13

Cache Thrashing: Example

(Figure: the 1 MB, 32768-line cache on the left and memory on the right, both drawn line by line – X[0]–X[3], X[4]–X[7], …, then Y[0]–Y[3], Y[4]–Y[7], …; since X and Y are each exactly 1 MB, the lines holding X[i] and Y[i] map to the same cache line.)

i=0: load line X[0]-X[3] into cache; load X[0] from cache to register; load line Y[0]-Y[3] into cache, displacing line X[0]-X[3]; load Y[0] from cache into register; add, update Y[0] in cache.

i=1: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing line Y[0]-Y[3] back to memory; load X[1] from cache to register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[1] from cache to register; add, update Y[1] in cache.

i=2: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing line Y[0]-Y[3] back to memory; load X[2] from cache to register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[2] from cache to register; add, update Y[2] in cache.

i=3: …

No cache reuse! Poor performance! Avoid cache thrashing!

double X[131072], Y[131072];
long i, j;
// initialization of X, Y
…
for(i=0;i<131072;i++) Y[i] = X[i] + Y[i];
…

14

Fully Associative Cache

A cache line from memory can be placed anywhere in cache.

No cache thrashing; but costly.

Direct-mapped cache is at one extreme of the spectrum; fully associative cache is at the other extreme.

Disadvantage: the entire cache must be searched to determine whether a specific cache line is present.

15

N-Way Set Associative Cache

A compromise between direct-mapped cache and fully associative cache.

The cache lines in cache are divided into a number of sets; each set contains N cache lines.

Given a cache line from memory, the index of the set it belongs to is calculated first; then it is placed in one of the N cache lines in this set.

Example (2-way set associative cache): a 1 MB cache with 32,768 cache lines has 16,384 sets, each set with 2 lines; main memory is 2 GB (67,108,864 cache lines).

Less likely to cause cache thrashing; less costly than a fully associative cache.

Direct-mapped cache is a 1-way set associative cache; fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in cache.
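A small sketch of the set-index calculation for the 2-way example above (sizes are the ones on the slide; which of the 2 ways is used, i.e. the replacement policy within a set, is not modeled):

#include <stdio.h>

#define LINE_BYTES  32L
#define CACHE_BYTES (1L * 1024 * 1024)                   /* 1 MB cache            */
#define N_WAYS      2L                                   /* 2-way set associative */
#define N_SETS      (CACHE_BYTES / LINE_BYTES / N_WAYS)  /* 16,384 sets           */

/* A memory line may live in any of the N_WAYS lines of this one set. */
static long set_index(long addr)
{
    long mem_line = addr / LINE_BYTES;
    return mem_line % N_SETS;
}

int main(void)
{
    /* Two addresses exactly 1 MB apart land in the same set, but with
       2 ways they can coexist instead of evicting each other. */
    printf("set of address 0: %ld, set of address 1 MB: %ld\n",
           set_index(0), set_index(1024 * 1024));
    return 0;
}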

16

Instruction/Data Cache

CPU may have separate instruction cache and data cache (split cache).

CPU may have a single cache, for both instructions and data from memory (unified cache).

17

Remember …

Efficiency is directly related to cache reuse.

Cache thrashing is eliminated by padding arrays (array dimensions should not be a multiple of the cache line size – avoid powers of 2); see the padding sketch after this list.

To improve cache reuse:

Access memory sequentially as much as possible;

Avoid strides, random access, indirect addressing;

Avoid cache thrashing.
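A minimal sketch of the padding remedy, applied to the thrashing example above (it assumes X and Y end up laid out one after the other in memory, as in that example; the pad of one 32-byte cache line is illustrative – any pad that breaks the exact 1 MB separation works):

#define N   131072                  /* 131072 doubles = 1 MB per array      */
#define PAD 4                       /* one 32-byte cache line of padding    */

/* Without the pad, X[i] and Y[i] sit exactly 1 MB apart and map to the same
   line of a 1 MB direct-mapped cache, so they keep evicting each other.
   The pad shifts Y by one cache line, so X[i] and Y[i] land in different
   cache lines and each line is reused for 4 consecutive iterations. */
static double X[N + PAD], Y[N];

void add_arrays(void)
{
    long i;
    for (i = 0; i < N; i++)
        Y[i] = X[i] + Y[i];
}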

18

Example

A large stride in the memory access pattern results not only in cache misses/poor reuse, but also in TLB misses.

double X[1024][1024], Y[1024][1024];
int i,j;
…
for(j=0;j<1024;j++)
  for(i=0;i<1024;i++)
    X[i][j] = Y[i][j];

(Figure: X and Y are stored row by row in memory – X[0][0], X[0][1], …, X[0][1023], X[1][0], …, X[1023][1023], and likewise for Y; the inner loop over i therefore jumps by a stride of 1024 doubles, i.e. 8 KB, on every iteration.)
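A sketch of the usual fix (not shown on the slide): swap the loop order so the inner loop runs over j, the contiguous index in C, giving unit-stride access.

/* Unit-stride version: the inner loop now moves through X and Y
   sequentially, so each 32-byte cache line (and each 4 KB page) is
   fully used before it is evicted. */
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        X[i][j] = Y[i][j];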

19

Virtual Memory, Memory Paging

(Figure: Program #1 and Program #2 each see a 0–2 GB virtual address space divided into 4 KB pages; these virtual pages are mapped onto scattered 4 KB pages of the 4 GB physical memory.)

Modern computers use virtual memory;

Memory address seen in a program (virtual address) is not the actual address in physical memory;

Memory is divided into pages (e.g. 4KB);

A memory page in program’s address space corresponds to a page in physical memory;

To access memory, need to translate program’s virtual address to the actual address in physical memory.

This is done using a page table;
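A small sketch of the translation step (4 KB pages as on the slide; the tiny page_table array and its contents are made up purely for illustration):

#include <stdio.h>

#define PAGE_SIZE 4096L

/* Toy page table: virtual page number -> physical page number. */
static long page_table[8] = { 256, 257, 300, 301, 11, 12, 13, 14 };

static long virtual_to_physical(long vaddr)
{
    long vpage  = vaddr / PAGE_SIZE;        /* virtual page number  */
    long offset = vaddr % PAGE_SIZE;        /* offset within page   */
    return page_table[vpage] * PAGE_SIZE + offset;
}

int main(void)
{
    long vaddr = 2 * PAGE_SIZE + 100;       /* some virtual address */
    printf("virtual 0x%lx -> physical 0x%lx\n",
           vaddr, virtual_to_physical(vaddr));
    return 0;
}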

20

Translation Look-aside Buffer (TLB)

The TLB is a special cache for the page tables: access to the TLB is fast, speeding up the virtual-to-physical translation.

When a program accesses a memory location, the translation between its virtual page and the physical page is loaded into the TLB (if it is not already there);

If the program exhibits locality of reference, entries in the TLB can be reused → TLB hits → better performance;

Otherwise → TLB misses → performance degrades.

A large stride in the memory access pattern → TLB misses (and cache misses).

21

Remedies

Use a large memory page size. On some systems the memory page size can be modified by user programs, e.g. IBM SP, HP machines (see the sketch below for one Linux mechanism).

Avoid large strides in memory access; access memory sequentially as much as possible.
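A minimal Linux sketch of requesting large pages from a user program (MAP_HUGETLB requires huge pages to be configured on the system; this is just one possible mechanism, not the one on the IBM/HP machines mentioned above, and the buffer size is illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (16L * 1024 * 1024)            /* 16 MB buffer */

int main(void)
{
    /* Ask the kernel to back this mapping with huge pages (e.g. 2 MB),
       so far fewer TLB entries are needed to cover the buffer. */
    double *x = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (x == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    for (long i = 0; i < LEN / (long)sizeof *x; i++) x[i] = i;
    munmap(x, LEN);
    return 0;
}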

22

Interleaved Memory

Memory interleaving: alleviating the impact of memory cycle time.

Total memory is divided into a set of memory banks; contiguous memory addresses reside on different banks.

When accessing memory sequentially, the effect of memory cycle time is minimized: while the current bank is busy, the next bank is idle and can be accessed immediately.

Stride in memory access is not favorable → the same bank may be accessed repeatedly → must wait due to cycle time → poor performance.

Example: total 2 GB of memory divided into 4 memory banks, each bank 512 MB; cache line assumed 32 bytes. Consecutive 32-byte cache lines rotate across the banks:

Bank 1: bytes 0-31, 128-159, …
Bank 2: bytes 32-63, 160-191, …
Bank 3: bytes 64-95, 192-223, …
Bank 4: bytes 96-127, 224-255, …
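A small sketch of the bank assignment in the 4-bank example above (line size and bank count are the slide's values):

#include <stdio.h>

#define LINE_BYTES 32L
#define N_BANKS     4L

/* Consecutive cache lines rotate across banks, so sequential access
   touches bank 1, 2, 3, 4, 1, 2, ... and rarely waits on a busy bank. */
static long bank_of(long addr)
{
    return (addr / LINE_BYTES) % N_BANKS;      /* 0-based bank index */
}

int main(void)
{
    for (long addr = 0; addr < 256; addr += LINE_BYTES)
        printf("bytes %3ld-%3ld -> bank %ld\n",
               addr, addr + LINE_BYTES - 1, bank_of(addr) + 1);
    return 0;
}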