memory hierarchy - androbenchcsl.skku.edu/uploads/sse2030s18/15-memory.pdf · storage systems known...

25
SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) Memory Hierarchy Jinkyu Jeong ([email protected] ) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu

Upload: others

Post on 14-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected])

Memory Hierarchy

Jinkyu Jeong ([email protected])Computer Systems Laboratory

Sungkyunkwan Universityhttp://csl.skku.edu

Page 2: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 2

The CPU-Memory Gap• The gap widens between DRAM, disk, and CPU speeds

0.0

0.1

1.0

10.0

100.0

1,000.0

10,000.0

100,000.0

1,000,000.0

10,000,000.0

100,000,000.0

1985 1990 1995 2000 2003 2005 2010 2015

Tim

e (n

s)

Year

Disk seek timeSSD access timeDRAM access timeSRAM access timeCPU cycle timeEffective CPU cycle time

Page 3: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 3

Principle of Locality

• Temporal locality– Recently referenced items are

likely to be referenced in the near future

• Spatial locality– Items with nearby addresses tend

to be referenced close together in time

Page 4: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 4

Principle of Locality: Example

• Data– Reference array elements in succession Spatial locality– Reference sum each iteration Temporal locality

• Instructions– Reference instructions in sequence Spatial locality– Cycle through loop repeatedly Temporal locality

sum = 0;for (i = 0; i < n; i++)

sum += a[i];return sum;

Page 5: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 5

Memory Hierarchy• Some fundamental and enduring properties of

hardware and software– Fast storage technologies cost more per byte, have less

capacity, and require more power– The gap between CPU and main memory speed is widening– Well-written programs tend to exhibit good locality

• These fundamental properties complement each other beautifully

• They suggest an approach for organizing memory and storage systems known as a memory hierarchy

Page 6: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 6

Memory Hierarchy: Example

Registers

L1 cache(SRAM)

L2 cache(SRAM)

L3 cache(SRAM)

Main memory(DRAM)

Local secondary storage(local disks)

Remote secondary storage(e.g. Web cache)

Larger, slower,

and cheaper

(per byte)storagedevices

Smaller,faster,and

costlier(per byte)storage devices

L0:

L1:

L2:

L3:

L4:

L5:

L6:

Page 7: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 7

Exploiting Locality

• How to exploit temporal locality?– Speed up data accesses by caching data in faster storage– Caching in multiple levels: form a memory hierarchy– The lower levels of the memory hierarchy tend to be

slower, but larger and cheaper

• How to exploit spatial locality?– Larger cache line size– Cache nearby data together

Page 8: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 8

Cache

• A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device

• Improves the average access time• Exploits both temporal and spatial locality

Page 9: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 9

General Cache Concepts

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

MemoryLarger, slower, cheaper memoryviewed as partitioned into “blocks”

Data is copied in block-sized transfer units

Smaller, faster, more expensivememory caches a subset ofthe blocks

4

4

4

10

10

10

Page 10: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 10

General Cache Concepts: Hit

0 1 2 34 5 6 78 9 10 11

12 13 14 15

8 9 14 3Cache

Memory

Data in block b is neededRequest: 14

14Block b is in cache:Hit!

Page 11: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 11

General Cache Concepts: Miss

0 1 2 34 5 6 78 9 10 11

12 13 14 15

8 9 14 3Cache

Memory

Data in block b is neededRequest: 12

Block b is not in cache:Miss!

Block b is fetched frommemoryRequest: 12

12

12

12

Block b is stored in cache• Placement policy:

determines where b goes•Replacement policy:

determines which blockgets evicted (victim)

Page 12: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 12

CPU Cache Design Issues

• Cache size– 8KB ~ 64KB for L1

• Cache line size– Typically, 32 ~ 64 bytes for L1

• Lookup– Fully associative– Set associative: 2-way, 4-way, 8-way, 16-way, etc.– Direct mapped

• Replacement– LRU (Least Recently Used)– FIFO (First-In First-Out), RANDOM, etc.

Page 13: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 13

Intel Core i7 Cache Hierarchy

Regs

L1 d-cache

L1 i-cache

L2 unified cache

Core 0

Regs

L1 d-cache

L1 i-cache

L2 unified cache

Core 3

L3 unified cache(shared by all cores)

Main memory

Processor packageL1 i-cache and d-cache:

32 KB, 8-way, Access: 4 cycles

L2 unified cache:256 KB, 8-way,

Access: 10 cycles

L3 unified cache:8 MB, 16-way,Access: 40-75 cycles

Block size: 64 bytes for all caches.

Page 14: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 14

Cache Performance• Average memory access time = Thit + Rmiss * Tmiss

• Hit time (Thit)– Time to deliver a line in the cache to the processor– Includes time to determine whether the line is in the cache– 1 clock cycle for L1, 3 ~ 8 clock cycles for L2

• Miss rate (Rmiss)– Fraction of memory references not found in cache – 3 ~ 10% for L1, < 1% for L2

• Miss penalty (Tmiss)– Additional time required because of a miss– Typically 25 ~ 100 cycles for main memory

Page 15: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 15

Writing Cache Friendly Code

• Make the common case go fast– Focus on the inner loops of the core functions

• Minimize the misses in the inner loops– Repeated references to variables are good (temporal

locality)– Sequential reference patterns are good (spatial locality)

Page 16: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 16

Matrix Multiplication (1)

• Description– Multiply N x N matrices– O(N3) total operations

• Assumptions– Line size = 32 bytes (big enough for 4 64-bit words)– Matrix dimension (N) is very large

/* ijk */for (i=0; i<n; i++) {for (j=0; j<n; j++) {sum = 0.0;for (k=0; k<n; k++) sum += a[i][k] * b[k][j];

c[i][j] = sum;}

}

Variable sumheld in register

A B C

(i,*)

(*,j)(i,j)X =

Page 17: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 17

Matrix Multiplication (2)

• Matrix multiplication (ijk)

/* ijk */for (i=0; i<n; i++) {for (j=0; j<n; j++) {sum = 0.0;for (k=0; k<n; k++) sum += a[i][k] * b[k][j];

c[i][j] = sum;}

}

A B C(i,*)

(*,j)(i,j)

Inner loop:

Column-wise

Row-wise Fixed

Misses per Inner Loop Iteration:A B C

0.25 1.0 0.0

Page 18: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 18

Matrix Multiplication (3)

• Matrix multiplication (jik)

A B C(i,*)

(*,j)(i,j)

Inner loop:

Misses per Inner Loop Iteration:A B C

0.25 1.0 0.0

/* jik */for (j=0; j<n; j++) {for (i=0; i<n; i++) {sum = 0.0;for (k=0; k<n; k++)sum += a[i][k] * b[k][j];

c[i][j] = sum}

} Column-wise

Row-wise Fixed

Page 19: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 19

Matrix Multiplication (4)

• Matrix multiplication (kij)

Misses per Inner Loop Iteration:A B C0.0 0.25 0.25

/* kij */for (k=0; k<n; k++) {for (i=0; i<n; i++) {r = a[i][k];for (j=0; j<n; j++)c[i][j] += r * b[k][j];

}}

A B C(i,*)

(i,k) (k,*)

Inner loop:

Row-wiseFixed Row-wise

Page 20: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 20

Matrix Multiplication (5)

• Matrix multiplication (ikj)

Misses per Inner Loop Iteration:A B C0.0 0.25 0.25

A B C(i,*)

(i,k) (k,*)

Inner loop:/* ikj */for (i=0; i<n; i++) {for (k=0; k<n; k++) {r = a[i][k];for (j=0; j<n; j++)c[i][j] += r * b[k][j];

}} Row-wiseFixed Row-wise

Page 21: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 21

Matrix Multiplication (6)

• Matrix multiplication (jki)

Misses per Inner Loop Iteration:A B C1.0 0.0 1.0

/* jki */for (j=0; j<n; j++) {for (k=0; k<n; k++) {r = b[k][j];for (i=0; i<n; i++)c[i][j] += a[i][k] * r;

}}

A B C

(*,j)(k,j)

Inner loop:

(*,k)

Column-wise

Fixed Column-wise

Page 22: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 22

Matrix Multiplication (7)

• Matrix multiplication (kji)

Misses per Inner Loop Iteration:A B C1.0 0.0 1.0

A B C

(*,j)(k,j)

Inner loop:

(*,k)

/* kji */for (k=0; k<n; k++) {for (j=0; j<n; j++) {r = b[k][j];for (i=0; i<n; i++)c[i][j] += a[i][k] * r;

}}

Column-wise

Fixed Column-wise

Page 23: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 23

Matrix Multiplication (8)

• Summary

for (i=0; i<n; i++) {

for (j=0; j<n; j++) {

sum = 0.0;

for (k=0; k<n; k++)

sum += a[i][k] * b[k][j];

c[i][j] = sum;

}

}

ijk (& jik): • 2 loads, 0 stores• misses/iter = 1.25

for (k=0; k<n; k++) {

for (i=0; i<n; i++) {

r = a[i][k];

for (j=0; j<n; j++)

c[i][j] += r * b[k][j];

}

}

for (j=0; j<n; j++) {

for (k=0; k<n; k++) {

r = b[k][j];

for (i=0; i<n; i++)

c[i][j] += a[i][k] * r;

}

}

kij (& ikj): • 2 loads, 1 store• misses/iter = 0.5

jki (& kji): • 2 loads, 1 store• misses/iter = 2.0

A B C0.25 1.0 0.0

A B C0.0 0.25 0.25

A B C1.0 0.0 1.0

Page 24: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 24

Matrix Multiplication (9)

• Performance in Core i7

1

10

100

50 100 150 200 250 300 350 400 450 500 550 600 650 700

Cyc

les

per i

nner

loop

iter

atio

n

Array size (n)

jkikjiijkjikkijikj

ijk / jik

jki / kji

kij / ikj

Page 25: Memory Hierarchy - AndroBenchcsl.skku.edu/uploads/SSE2030S18/15-memory.pdf · storage systems known as a memory hierarchy. SSE2030:Introduction to Computer Systems, Spring 2018, Jinkyu

SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected]) 25

Summary

• Programmer can optimize for cache performance– How data structures are organized– How data are accessed• Nested loop structure• Blocking is a general technique

• All systems favor “cache friendly code”– Getting absolute optimum performance is very platform

specific• Cache sizes, line sizes, associativities, etc.– Can get most of the advantage with generic code• Keep working set reasonably small (temporal locality)• Use small strides (spatial locality)