
A Data Cache with Dynamic Mapping

P. D'Alberto, A. Nicolau and A. Veidenbaum

ICS-UCI

Speaker

Paolo D’Alberto

Problem Introduction

● Blocked algorithms have good performance on average
– Because they exploit temporal data locality

● For some input sets, data cache interference nullifies cache locality
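The interference described above can be illustrated with a small sketch. It counts how many distinct cache lines one tile of a row-major matrix occupies in a direct-mapped cache. The 16KB 1-way cache comes from the slides; the 32-byte line, 4-byte elements, 32 x 32 tile and the two matrix sizes are assumptions chosen only for illustration.

```python
C = 16 * 1024  # cache size in bytes, direct mapped (the 16KB 1-way case)
L = 32         # line size in bytes (assumed)
ELEM = 4       # element size in bytes (assumed)
TILE = 32      # tile edge in elements (assumed)

def distinct_lines(n, base=0):
    """Count the distinct cache lines touched by one TILE x TILE block
    of a row-major n x n matrix starting at byte address `base`."""
    lines = set()
    for i in range(TILE):
        for j in range(TILE):
            addr = base + (i * n + j) * ELEM
            lines.add((addr % C) // L)
    return len(lines)

lines_needed = TILE * TILE * ELEM // L   # lines the tile needs if nothing collides
good = distinct_lines(4000)              # row stride is not a multiple of C
bad = distinct_lines(4096)               # row stride is exactly C bytes
```

Under these assumptions the tile needs 128 lines. With N = 4000 it gets all of them, while with N = 4096 every row of the tile maps onto the same 4 lines, so the rows evict each other and the blocking no longer helps.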

Problem Introduction, cont.

[Figure: Blocked Matrix Multiplication, 16KB 1-way data cache. X axis: matrix size. Y axis: miss rate (%), 0 to 18. Series: regular matrix multiply, dynamic matrix multiply.]

Problem Introduction

● What if we remove the spikes?
– The average performance improves
– Execution time is predictable

● We can achieve our goal by:
– Only Software
– Only Hardware
– Both HW-SW

Related Work (Software)

● Data layout reorganization [Flajolet et al. 91]
– Data are reorganized before & after computation

● Data copy [Granston et al. 93]
– Data are moved in memory during computation

● Padding [Panda et al. 99]

● Computation reorganization [Pingali et al. 02]
– e.g., Tiling

Related Work (Hardware)

● Changing cache mapping
– Using different cache mapping functions [Gonzalez 97]
– Increasing cache associativity [IA64]
– Changing cache size

● Bypassing caches
– No interference: data are not stored in cache [MIPS R5K]
– HW-driven pre-fetching

Related Work (HW-SW)

● Profiling
– Hardware adaptation [UCI]
– Software adaptation [Gatlin et al. 99]

● Pre-fetching [Jouppi et al.]
– Mostly latency hiding, also used for cache interference reduction

● Static Analysis [Ghosh et al. 99 - CME]
– e.g., compiler driven data cache line adaptation [UCI]

Dynamic Mapping (Software)

● We consider applications where memory references are affine functions only

● We associate a memory reference with a twin affine function

● We use the twin function as input address for the target data cache

● We use the original affine function to access memory

Example of twin function

● We consider the references A[i][j] and B[i][j]
– The affine functions are:
● A0+(iN+j)*4
● B0+(iN+j)*4

● When there is interference (i.e., |A0-B0| mod C < L, where C and L are the cache size and the cache line size):
– We use the twin functions
● A0+(iN+j)*4
● B0+(iN+j)*4+L
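As a minimal sketch of the example above, the following code evaluates the interference test and the twin function for one pair of references. The 16KB direct-mapped cache is the configuration used in the slides; the 32-byte line size, the base addresses `A0` and `B0`, the row length `N` and all function names are hypothetical values chosen for illustration.

```python
C = 16 * 1024   # cache size in bytes (1-way, i.e. direct mapped)
L = 32          # cache line size in bytes (assumed)

def cache_line_index(addr):
    """Cache line a byte address maps to in a direct-mapped cache."""
    return (addr % C) // L

def interferes(a0, b0):
    """Interference test from the slide: |A0 - B0| mod C < L."""
    return abs(a0 - b0) % C < L

def affine(base, i, j, n):
    """Original affine function: base + (i*N + j) * 4."""
    return base + (i * n + j) * 4

def twin_b(base, i, j, n):
    """Twin function for B when A and B interfere: shifted by one line."""
    return affine(base, i, j, n) + L

A0 = 0x10000        # hypothetical base address of A
B0 = A0 + 4 * C     # hypothetical base of B, a multiple of C away from A
N = 64              # hypothetical row length

conflict = interferes(A0, B0)
same_line = (cache_line_index(affine(A0, 3, 5, N))
             == cache_line_index(affine(B0, 3, 5, N)))
twin_line_differs = (cache_line_index(affine(A0, 3, 5, N))
                     != cache_line_index(twin_b(B0, 3, 5, N)))
```

With these values A[3][5] and B[3][5] fall on the same cache line under the original functions, and on adjacent lines once B is presented to the cache through its twin function.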

Dynamic Mapping (Hardware)

● We introduce a new 3-address load instruction
– A destination register
– Two register operands: the result of the twin function and the result of the original affine function

● Note:
– the twin function result may not be a real address
– the original function result is a real address (and goes through the TLB – ACU)

Pseudo Assembly Code

ORIGINAL CODE
Set $R0, A_0
Set $R1, B_0
…
Load $F0, $R0
Load $F1, $R1
Add $R0, $R0, 4
Add $R1, $R1, 4

MODIFIED CODE
Set $R0, A_0
Set $R1, B_0
Add $R2, $R1, 32
…
Load $F0, $R0
Load $F1, $R1, $R2
Add $R2, $R2, 4
Add $R0, $R0, 4
Add $R1, $R1, 4
…
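The effect of the modified code can be approximated with a toy direct-mapped cache model. This is a sketch under stated assumptions, not the authors' simulator: the 32-byte line size is inferred from the `32` offset in the pseudo assembly, the base addresses are hypothetical, and the model assumes that both the index and the tag are taken from the address presented to the cache (the twin address for the 3-address load).

```python
C = 16 * 1024   # cache size in bytes, 1-way (from the slides)
L = 32          # line size in bytes (assumed, matches the +32 offset)

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * (C // L)
        self.misses = 0

    def load(self, real_addr, cache_addr=None):
        """Model a load: cache_addr selects and tags the line,
        real_addr is where memory would be read on a miss.
        With cache_addr omitted this is an ordinary 2-address load."""
        if cache_addr is None:
            cache_addr = real_addr
        index = (cache_addr % C) // L
        tag = cache_addr // L
        if self.tags[index] != tag:
            self.misses += 1       # a real design would fetch from real_addr
            self.tags[index] = tag

A0 = 0x10000      # hypothetical base of A
B0 = A0 + 4 * C   # hypothetical base of B, conflicting with A

def run(n, use_twin):
    """Replay n iterations of the loop body and return the miss count."""
    cache = DirectMappedCache()
    a, b, twin = A0, B0, B0 + L
    for _ in range(n):
        cache.load(a)              # Load $F0, $R0
        if use_twin:
            cache.load(b, twin)    # Load $F1, $R1, $R2
        else:
            cache.load(b)          # Load $F1, $R1
        a, b, twin = a + 4, b + 4, twin + 4
    return cache.misses

regular_misses = run(1024, use_twin=False)
dynamic_misses = run(1024, use_twin=True)
```

In this model the original loop misses on every one of its 2048 loads, because A and B keep evicting each other from the same line. The modified loop misses only once per line for each array, 256 times in total.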

Experimental Results

● We present experimental results obtained by using a combination of software approaches:
– Padding
– Data copy
– Without using any cycle-accurate simulator

● Matrix multiplication:
– Simulation of cache performance for a 16KB 1-way data cache, with an optimally blocked algorithm

Matrix Multiply (simulation)

[Figure: Blocked Matrix Multiplication, 16KB 1-way data cache. X axis: matrix size. Y axis: miss rate (%), 0 to 18. Series: regular matrix multiply, dynamic matrix multiply.]

Experimental Results, cont.

● n-FFT, Cooley-Tukey algorithm using balanced decomposition in factors
– The algorithm was first proposed by Vitter et al.
– Complexity
● Best case O(n log log n) - Worst case O(n^2)
– Normalized performance (MFLOPS)
● We use the codelets from FFTW
● For 128KB 4-way data cache
– Performance comparison with FFTW is in the paper
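One way to read "balanced decomposition in factors" is to split n into two factors as close to sqrt(n) as possible and recurse on each. The sketch below follows that reading, which is an assumption: the slides do not give the exact rule, and the function names and the `leaf` threshold (standing in for the largest size handled directly by a codelet) are hypothetical.

```python
from math import isqrt

def balanced_factors(n):
    """Return (p, q) with p * q == n, p <= q, and p as close to
    sqrt(n) as possible. Assumes n >= 1."""
    for p in range(isqrt(n), 0, -1):
        if n % p == 0:
            return p, n // p

def split_depth(n, leaf=16):
    """Number of nested balanced splits before every factor is <= leaf.
    A size with no non-trivial factorization is treated as a leaf."""
    p, q = balanced_factors(n)
    if n <= leaf or p == 1:
        return 0
    return 1 + max(split_depth(p, leaf), split_depth(q, leaf))

pow2_split = balanced_factors(1024)
mixed_split = balanced_factors(1000)
depth_64k = split_depth(65536)
```

For a power of two such as 65536, each split roughly halves the number of bits in the size (65536, then 256, then 16), so the nesting depth grows like log log n.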

FFT 128KB 4-way data cache

[Figure: FFT, 128KB 4-way data cache, SPARC64 100MHz. X axis: size (1000 to 362880 for the non-power-of-two sizes, 128 to 8388608 for the powers of two). Y axis: normalized MFLOPS, 0 to 16. Series: Upper Bound, Dynamic Mapping, Base Line.]

Future work

● Dynamic Mapping is not fully automated
– The code is hand made

● A clock-accurate processor simulator is missing
– To estimate the effects of twin function computations on performance and energy

● Application on a set of benchmarks

Conclusions

● The hardware is relatively simple
– Because it is the compiler (or user) that activates the twin computation and changes the data cache mapping dynamically

● The approach aims to achieve a data cache mapping with:
– zero interference
– no increase of cache hit latency
– minimum extra hardware
