Memory Hierarchy - androbenchcsl.skku.edu/uploads/sse2030s18/15-memory.pdf
TRANSCRIPT
SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected])
Memory Hierarchy
Jinkyu Jeong ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
The CPU-Memory Gap
• The gap widens between DRAM, disk, and CPU speeds

[Figure: access time (ns, log scale) vs. year (1985-2015) for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time]
Principle of Locality
• Temporal locality
– Recently referenced items are likely to be referenced in the near future
• Spatial locality
– Items with nearby addresses tend to be referenced close together in time
Principle of Locality: Example
• Data
– Reference array elements in succession: spatial locality
– Reference sum each iteration: temporal locality
• Instructions
– Reference instructions in sequence: spatial locality
– Cycle through loop repeatedly: temporal locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Memory Hierarchy
• Some fundamental and enduring properties of hardware and software
– Fast storage technologies cost more per byte, have less capacity, and require more power
– The gap between CPU and main memory speed is widening
– Well-written programs tend to exhibit good locality
• These fundamental properties complement each other beautifully
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy
Memory Hierarchy: Example
L0: Registers
L1: L1 cache (SRAM)
L2: L2 cache (SRAM)
L3: L3 cache (SRAM)
L4: Main memory (DRAM)
L5: Local secondary storage (local disks)
L6: Remote secondary storage (e.g. Web cache)

Moving down the hierarchy: larger, slower, and cheaper (per byte) storage devices. Moving up: smaller, faster, and costlier (per byte) storage devices.
Exploiting Locality
• How to exploit temporal locality?
– Speed up data accesses by caching data in faster storage
– Caching in multiple levels: form a memory hierarchy
– The lower levels of the memory hierarchy tend to be slower, but larger and cheaper
• How to exploit spatial locality?
– Larger cache line size
– Cache nearby data together
Cache
• A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device
• Improves the average access time
• Exploits both temporal and spatial locality
General Cache Concepts

Cache:    8  9 14  3
Memory:   0  1  2  3
          4  5  6  7
          8  9 10 11
         12 13 14 15

• Memory: the larger, slower, cheaper memory is viewed as partitioned into "blocks"
• Cache: the smaller, faster, more expensive memory caches a subset of the blocks
• Data is copied between the two in block-sized transfer units
[Figure: blocks 4 and 10 being copied between memory and the cache in block-sized units]
General Cache Concepts: Hit

• Data in block b is needed: Request 14
• Block 14 is in the cache: Hit!

Cache:    8  9 14  3
Memory:   0  1  2  3
          4  5  6  7
          8  9 10 11
         12 13 14 15
General Cache Concepts: Miss

• Data in block b is needed: Request 12
• Block 12 is not in the cache: Miss!
• Block 12 is fetched from memory
• Block 12 is stored in the cache
– Placement policy: determines where b goes
– Replacement policy: determines which block gets evicted (the victim)

Cache:    8  9 14  3   (before the miss; block 12 replaces one of these)
Memory:   0  1  2  3
          4  5  6  7
          8  9 10 11
         12 13 14 15
CPU Cache Design Issues
• Cache size
– 8KB ~ 64KB for L1
• Cache line size
– Typically, 32 ~ 64 bytes for L1
• Lookup
– Fully associative
– Set associative: 2-way, 4-way, 8-way, 16-way, etc.
– Direct mapped
• Replacement
– LRU (Least Recently Used)
– FIFO (First-In First-Out), RANDOM, etc.
Intel Core i7 Cache Hierarchy
• Each of the four cores (Core 0 ... Core 3) in the processor package has its own registers, L1 i-cache, L1 d-cache, and L2 unified cache
• The L3 unified cache is shared by all cores; main memory sits below it
• L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 10 cycles
• L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
• Block size: 64 bytes for all caches
Cache Performance
• Average memory access time = Thit + Rmiss * Tmiss
• Hit time (Thit)
– Time to deliver a line in the cache to the processor
– Includes time to determine whether the line is in the cache
– 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
• Miss rate (Rmiss)
– Fraction of memory references not found in cache
– 3 ~ 10% for L1, < 1% for L2
• Miss penalty (Tmiss)
– Additional time required because of a miss
– Typically 25 ~ 100 cycles for main memory
Writing Cache Friendly Code
• Make the common case go fast
– Focus on the inner loops of the core functions
• Minimize the misses in the inner loops
– Repeated references to variables are good (temporal locality)
– Sequential reference patterns are good (spatial locality)
Matrix Multiplication (1)
• Description
– Multiply N x N matrices
– O(N^3) total operations
• Assumptions
– Line size = 32 bytes (big enough for four 64-bit words)
– Matrix dimension (N) is very large

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Variable sum is held in a register

[Figure: C(i,j) is the product of row A(i,*) and column B(*,j)]
Matrix Multiplication (2)
• Matrix multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) is scanned row-wise, B(*,j) column-wise, C(i,j) is fixed

Misses per inner loop iteration:
A = 0.25, B = 1.0, C = 0.0
(A is scanned row-wise, so only one access in four touches a new line; B is scanned column-wise, so every access touches a new line; C is not touched in the inner loop)
Matrix Multiplication (3)
• Matrix multiplication (jik)

/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A(i,*) is scanned row-wise, B(*,j) column-wise, C(i,j) is fixed

Misses per inner loop iteration:
A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (4)
• Matrix multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) is fixed, B(k,*) and C(i,*) are scanned row-wise

Misses per inner loop iteration:
A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (5)
• Matrix multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A(i,k) is fixed, B(k,*) and C(i,*) are scanned row-wise

Misses per inner loop iteration:
A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (6)
• Matrix multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: B(k,j) is fixed, A(*,k) and C(*,j) are scanned column-wise

Misses per inner loop iteration:
A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (7)
• Matrix multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: B(k,j) is fixed, A(*,k) and C(*,j) are scanned column-wise

Misses per inner loop iteration:
A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (8)
• Summary

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25 (A = 0.25, B = 1.0, C = 0.0)

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5 (A = 0.0, B = 0.25, C = 0.25)

for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0 (A = 1.0, B = 0.0, C = 1.0)

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Matrix Multiplication (9)
• Performance in Core i7

[Figure: cycles per inner loop iteration (log scale, 1 to 100) vs. array size n (50 to 700) for all six loop orders; the curves group into three pairs: ijk/jik, kij/ikj, and jki/kji]
Summary
• Programmer can optimize for cache performance
– How data structures are organized
– How data are accessed
• Nested loop structure
• Blocking is a general technique
• All systems favor "cache friendly code"
– Getting absolute optimum performance is very platform specific
• Cache sizes, line sizes, associativities, etc.
– Can get most of the advantage with generic code
• Keep working set reasonably small (temporal locality)
• Use small strides (spatial locality)