a data cache with dynamic mapping p. d'alberto, a. nicolau and a. veidenbaum ics-uci speaker...
TRANSCRIPT
A Data Cache with A Data Cache with Dynamic MappingDynamic MappingA Data Cache with A Data Cache with Dynamic MappingDynamic Mapping
P. D'Alberto, A. Nicolau and A. Veidenbaum
ICS-UCI
Speaker
Paolo D’Alberto
Problem IntroductionProblem IntroductionProblem IntroductionProblem Introduction
● Blocked algorithms have good performance on average– Because exploiting data temporal locality
● For some input sets data cache interference nullifies cache locality
Problem Introduction, cont.Problem Introduction, cont.Problem Introduction, cont.Problem Introduction, cont.
Blocked Matrix Multiplication: 16KB 1-way data cache
0
2
4
6
8
10
12
14
16
18
Matrix Size
mis
s r
ate
%
regular matrix multiply dynamic matrix multiply
Problem IntroductionProblem IntroductionProblem IntroductionProblem Introduction
● What if we remove the spikes– The average performance improves– Execution time is predictable
● We can achieve our goal by: ● Only Software● Only Hardware● Both HW-SW
Related Work (Software)Related Work (Software)Related Work (Software)Related Work (Software)
● Data layout reorganization [Flajolet et al. 91]– Data are reorganized before & after computation
● Data copy [Granston et al. 93]– Data are moved in memory during computation
● Padding [Panda et al. 99]● Computation reorganization [Pingali et al.02]
– e.g., Tiling
Related Work (Hardware)Related Work (Hardware)Related Work (Hardware)Related Work (Hardware)
● Changing cache mapping – Using a different cache mapping functions [Gonzalez 97]– Increasing cache associativity [IA64]– Changing cache Size
● Bypassing caches– No interference: data are not stored in cache. [MIPS R5K]– HW-driven Pre-fetching
Related Work (HW-SW)Related Work (HW-SW)Related Work (HW-SW)Related Work (HW-SW)● Profiling
– Hardware adaptation [UCI]
– Software adaptation [Gatlin et al. 99]
● Pre-fetching [Jouppi et al.]– Latency hiding mostly, used also for cache interference
reduction
● Static Analysis [Ghosh et al. 99 - CME]– e.g., compiler driven data cache line adaptation [UCI]
Dynamic Mapping, (Software)Dynamic Mapping, (Software)Dynamic Mapping, (Software)Dynamic Mapping, (Software)
● We consider applications where memory references are affine functions only
● We associate a memory reference with a twin affine function
● We use the twin function as input address for the target data cache
● We use the original affine function to access memory
Example of twin functionExample of twin functionExample of twin functionExample of twin function● We consider the references A[i][j] and B[i][j]
– The affine functions are: ● A0+(iN+j)*4 ● B0+(iN+j)*4
● When there is interference (i.e.,|A0-B0| mod C < L where C and
L are cache and cache line size): – We use the twin functions
● A0+(iN+j)*4 ● B0+(iN+j)*4+L
Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)● We introduce a new 3-address load instruction
– A register destination – Two register operands: the results of twin function
and of original affine function ● Note:
– the twin function result may be no real address – the original function is a real address
● (and goes though TLB – ACU)
Pseudo Assembly CodePseudo Assembly CodePseudo Assembly CodePseudo Assembly Code
ORIGINAL CODESet $R0, A_0Set $R1, B_0…Load $F0, $R0Load $F1, $R1Add $R0,$R0,4Add $R1,$R1,4
MODIFIED CODESet $R0, A_0
Set $R1, B_0
Add $R2, $R1, 32…
Load $F0, $R0
Load $F1, $R1, $R2Load $F1, $R1, $R2
Add $R2, $R2, 4
Add $R0, $R0, 4
Add $R1, $R1, 4
…
Experimental ResultsExperimental ResultsExperimental ResultsExperimental Results● We present experimental results obtained by
using combination of software approaches:– Padding– Data Copy – Without using any cycle-accurate simulator
● Matrix multiplication: – Simulation of cache performance a data cache size
16KB 1-way for optimally blocked algorithm
Matrix Multiply (simulation) Matrix Multiply (simulation) Matrix Multiply (simulation) Matrix Multiply (simulation)
Blocked Matrix Multiplication: 16KB 1-way data cache
0
2
4
6
8
10
12
14
16
18
Matrix Size
mis
s r
ate
%
regular matrix multiply dynamic matrix multiply
Experimental Results, cont.Experimental Results, cont.Experimental Results, cont.Experimental Results, cont.● n-FFT, Cooley-Tookey algorithm using balanced
decomposition in factors – The algorithm has been proposed first by Vitter et al– Complexity
● Best case O(n log log n) - Worst case O(n2)
– Normalized performance (MFLOPS)● We use the codelets from FFTW● For 128KB 4-way data cache
– Performance comparison with FFTW is in the paper
FFT FFT 128KB 4-way data cache128KB 4-way data cache
FFT FFT 128KB 4-way data cache128KB 4-way data cache
SPARC64 100MHz
02468
10121416
1000
1960
4725
1036
8
2700
0
7560
0
1653
75
3628
80 128
256
51210
2420
4840
9681
92
1638
4
3276
8
6553
6
1310
72
2621
44
5242
88
1048
576
2097
152
4194
304
8388
608
size
no
rmal
ized
MF
LO
PS
Upper Bound Dynamic Mapping Base Line
Future workFuture workFuture workFuture work
● Dynamic Mapping is not fully automated– The code is hand made
● A clock-accurate processor simulator is missing– To estimate the effects of twin function
computations on performance and energy
● Application on a set of benchmarks
ConclusionsConclusionsConclusionsConclusions● The hardware is relatively simple
– Because it is the compiler (or user) that activates the twin computation
● and change the data cache mapping dynamically
● The approach aims to achieve a data cache mapping with:– zero interference,– no increase of cache hit latency – minimum extra hardware