static analysis of parameterized loop nests for energy efficient use of data caches p.d’alberto,...

Static Analysis of Static Analysis of Parameterized Loop Nests Parameterized Loop Nests for Energy Efficient Use of for Energy Efficient Use of

Data CachesData CachesP.D’AlbertoP.D’Alberto, ,

A.Nicolau, A.Veidenbaum, R.GuptaA.Nicolau, A.Veidenbaum, R.Gupta

University of California at IrvineUniversity of California at IrvineInformation and Computer Science Information and Computer Science

DepartmentDepartmentCenter for Embedded Computer SystemsCenter for Embedded Computer Systems

COLP01 - CECS ICS UCI

Talk OrganizationTalk Organization

•Motivation Motivation •Related workRelated work•Parameterized Static Analysis Parameterized Static Analysis • Implementation and Implementation and

Experimental ResultsExperimental Results•ConclusionsConclusions


MotivationMotivation

•Caches are important in modern systemsCaches are important in modern systems– A key performance determinantA key performance determinant– An important part of the overall energy equation:An important part of the overall energy equation:

» large and growing size and associativity, multi-large and growing size and associativity, multi-port accessport access

– Increasingly critical for data-intensive Increasingly critical for data-intensive applicationsapplications

• ‘‘Adaptive’ caches can improve Adaptive’ caches can improve performanceperformance– By changing line and/or fetch size based on runtime By changing line and/or fetch size based on runtime

behaviorbehavior– Varying associativity, etcVarying associativity, etc– ““Optimal”Optimal” is application specific is application specific


Adaptive Memory SystemAdaptive Memory System

•Problem: How to control adaptive cache Problem: How to control adaptive cache line size to maximize performance line size to maximize performance andand minimize energy consumption?minimize energy consumption?

•Approach: use a compiler to generate code Approach: use a compiler to generate code directing the adaptation at run timedirecting the adaptation at run time

• Issues in such a compilation approachIssues in such a compilation approach– What application characteristics do we need to What application characteristics do we need to

measure?measure?– When can we statically determine an optimal line size?When can we statically determine an optimal line size?– How can we “generate” code based on the static How can we “generate” code based on the static

analysis?analysis?


Rewards of Line Size Rewards of Line Size AdaptationAdaptation

•Miss rate is reduced resulting inMiss rate is reduced resulting in::– Fewer fetches from next-level cacheFewer fetches from next-level cache or memoryor memory

– Less transfer trafficLess transfer traffic– Fewer processor stallsFewer processor stalls– The code is, of course, left unchangedThe code is, of course, left unchanged

• What is the effect on energy What is the effect on energy dissipation?dissipation?


Energy EffectsEnergy Effects

• To ensure total energy reduction To ensure total energy reduction need toneed to– reduce memory accesses and memory reduce memory accesses and memory

traffictraffic• The choice of the line size affects both The choice of the line size affects both

memory accesses and memory trafficmemory accesses and memory traffic– Minimizing for either one alone will not Minimizing for either one alone will not

give minimal energy dissipation give minimal energy dissipation • A tradeoff is possible and practical, at A tradeoff is possible and practical, at

compile time, when they are compile time, when they are “quantified” accurately and fast“quantified” accurately and fast


Related Work on Cache Related Work on Cache OptimizationsOptimizations

• Two primary approaches to determine Two primary approaches to determine Memory Memory AccessesAccesses and and Cache MissesCache Misses – Profiling and Static AnalysisProfiling and Static Analysis

• ProfilingProfiling1.1. Given a set of inputs (training set)Given a set of inputs (training set)2.2. The number of interference is determinedThe number of interference is determined

» Simulation, hardware counters….Simulation, hardware counters….3.3. Compiler selects a line size and uses it to annotate Compiler selects a line size and uses it to annotate

the codethe code4.4. The annotated code is then used in actual execution The annotated code is then used in actual execution

– The problem is that often there is not a single optimum line size for different runs of the same code (input dependent)


Related Work … Related Work … cont

• Static Analysis Static Analysis 1.1. Based on loop nests and CMEsBased on loop nests and CMEs2.2. Representation of cache misses by inequalitiesRepresentation of cache misses by inequalities3.3. Each iteration checked to verify inequalitiesEach iteration checked to verify inequalities

» Yes/no missYes/no miss4.4. The number of iterations/misses is added upThe number of iterations/misses is added up

» The loop bounds are needed at this pointThe loop bounds are needed at this point • Note:Note:

– The complexity of the analysis is proportional to the The complexity of the analysis is proportional to the number of iterations number of iterations

– This work forms the basis of the parameterized analysisThis work forms the basis of the parameterized analysis» Because it characterizes the cause of missesBecause it characterizes the cause of misses


Parameterized Loop Nests Analysis Parameterized Loop Nests Analysis for Direct Mapped Data Cachefor Direct Mapped Data Cache

1.1. Every memory reference is symbolically equated Every memory reference is symbolically equated with every possible source of interferencewith every possible source of interference

2.2. Symbolic solution of the equation is sought Symbolic solution of the equation is sought • Is there a solution? Is there a solution? • Existence condition for a solutionExistence condition for a solution

3.3. Trade-off between spatial reuse and interferenceTrade-off between spatial reuse and interference• If there is a solution, how can we estimate the If there is a solution, how can we estimate the

number of solutions without counting them?number of solutions without counting them?• Bounding the misses by interference Bounding the misses by interference

4.4. Misses = Contribution from each referenceMisses = Contribution from each reference


Interference and Reuse, Interference and Reuse, explanation

• An interference equation represents the set of An interference equation represents the set of iterations in the loop nest where two references iterations in the loop nest where two references interfereinterfere– Where there is a miss Where there is a miss

• Existence conditions for solutions of the Existence conditions for solutions of the equation:equation:– Find at least one iteration for which the equation Find at least one iteration for which the equation

is satisfiedis satisfied

• Bounding the misses due to interferenceBounding the misses due to interference– If there is a solution, we propose, “If there is a solution, we propose, “thethe interference interference

density”, density”, to bound the to bound the ratio of the iteration ratio of the iteration solutions over the total number of iterations solutions over the total number of iterations


Interference Density, Interference Density, explanation

• The interference density is a The interference density is a straightforward quantity to determinestraightforward quantity to determine– Function of the coefficients of the Function of the coefficients of the

interference equationinterference equation

• It is independent from the loop nest and It is independent from the loop nest and the definition domain the definition domain

• It is a It is a good good upper bound to the cache upper bound to the cache miss ratio due to interference equationmiss ratio due to interference equation– (but only when it is known if there is (but only when it is known if there is

interference)interference)


int A[MAX][MAX];int A[MAX][MAX]; /* In memory starting at address /* In memory starting at address Aoffset Aoffset */*/

int B[MAX][MAX];int B[MAX][MAX]; /* In memory starting at address /* In memory starting at address Boffset Boffset */*/

/* Size of int = 4 Bytes, Row major layout *//* Size of int = 4 Bytes, Row major layout */

int empty(intint empty(int upbupb) {) { /* One parameter *//* One parameter */ x=0;x=0; for (i=0;i<for (i=0;i<upbupb;i++);i++) /* loop bounds affine function of /* loop bounds affine function of upb upb */*/ for (j=0;j<for (j=0;j<upbupb;j++) ;j++) x += (A[i][j+x += (A[i][j+upbupb] +1)] +1) /* Memory reference affine function of /* Memory reference affine function of

*/*/ *B[i][j];*B[i][j]; /* index variables /* index variables i, ji, j and parameter and parameter upbupb */*/

return x; return x; }}

ExampleExample


Our ApproachOur Approach

int A[MAX][MAX]; int A[MAX][MAX]; int B[MAX][MAX];int B[MAX][MAX];void empty(int upb) {void empty(int upb) {

x +=(A[i][j+upb]+1)*B[i][j];x +=(A[i][j+upb]+1)*B[i][j];

}}

1) Interference Equation A[][] <->B[][]:1) Interference Equation A[][] <->B[][]:Aoffset+MAX*4*i+(j+upb)*4+n(mL)+lAoffset+MAX*4*i+(j+upb)*4+n(mL)+l==Boffset+MAX*4*i+j*4 Boffset+MAX*4*i+j*4 n>0 and |l|<Ln>0 and |l|<L

For example, with For example, with Aoffset Aoffset = 64, = 64, BoffsetBoffset=8256 and =8256 and MAXMAX=10=10And And mL = 8192, mL = 8192, the equation becomes :the equation becomes :4*upb +n8192 +l = 81924*upb +n8192 +l = 8192 2) Symbolic Solution: D= 4*upb mod 8192

if L>D there is interfernce (i.e. L=D annotation )

3) Otherwise Interference Density: min(1,4/L+(L-D)/L)


Our Approach … Our Approach … contcont

void empty(int void empty(int upbupb) {) {

… …..

for (i=0;i<for (i=0;i<upbupb;i++) ;i++)

for (j=0;j<for (j=0;j<upbupb ;j++) { ;j++) {

… …

}}

}}

Iteration PointsIteration Points

upbi 0

upbj 0

4) Misses : interference density * number of iteration points4) Misses : interference density * number of iteration points : : 2 (4/L+2 (4/L+minmin(1,1/(up(1,1/(up mod mod L)) * upbL)) * upb22

upb2


Implementation of STAMINAImplementation of STAMINA

• Stamina is a 3-step phase in the Stamina is a 3-step phase in the ARMR compilerARMR compiler

1.1. Step I: Code Analysis Step I: Code Analysis – Input:Input:

» Code and the sequence of memory Code and the sequence of memory references in the inner loopsreferences in the inner loops

– Output:Output:» Loop bounds information: Loop bounds information:

• Expression of the boundsExpression of the bounds» Reference information Reference information

• Index computationsIndex computations• Reuse informationReuse information


Implementation of STAMINA. Implementation of STAMINA. contcont

Step II: Interference Equation Generation Step II: Interference Equation Generation – Input: Input: Step I output andStep I output and

» The cache size and array layoutThe cache size and array layout» Sequence of memory referencesSequence of memory references

– Output:Output:» Set of equations and domains of validity Set of equations and domains of validity

Step III: Interference estimationStep III: Interference estimation– Input: Input: Step II outputStep II output– Output:Output:

» Interference density and symbolic enumeration Interference density and symbolic enumeration of the iterations in the loop nest of the iterations in the loop nest


Example: SWIMExample: SWIM

• Swim: a loop-based code from SPEC 2000Swim: a loop-based code from SPEC 2000– It cannot be analyzed using CME because of It cannot be analyzed using CME because of

unknown loop bounds (introduced as input at run unknown loop bounds (introduced as input at run time)time)

• STAMINA analysis takes 1 minute per loop nestSTAMINA analysis takes 1 minute per loop nest– 4 main loop nests 4 main loop nests – The execution of swim takes less than 1 hrThe execution of swim takes less than 1 hr

• We verified the analysis using We verified the analysis using Shade cache Shade cache simulatorsimulator

• The analysis results: The analysis results: – For the reference set Swim has no interference: For the reference set Swim has no interference:

» Independent of the line sizeIndependent of the line size– Larger line size = better performanceLarger line size = better performance– Shorter line size = better energyShorter line size = better energy


Examples: “Self Interference”Examples: “Self Interference”

• ““Self Interference” is an artificial exampleSelf Interference” is an artificial example• It consists of 6 loop nests with only one It consists of 6 loop nests with only one

memory reference eachmemory reference each– Self interferenceSelf interference

• The choice of line size sharply affects the The choice of line size sharply affects the overall miss ratio and energy consumptionoverall miss ratio and energy consumption– Adaptive = optimal line size per each loop nestAdaptive = optimal line size per each loop nest

• Adaptive line size yieldsAdaptive line size yields– Optimum energy consumptionOptimum energy consumption– 1/3 of the total miss ratio 1/3 of the total miss ratio


Example “Self Interference”Example “Self Interference”"self interference"

00.5

11.5

22.5

33.5

44.5

8 16 32 64 128 256 adaptive

Line size

Inte

rfer

ence

Den

sity

Energy for "self interference"

0

10

20

30

40

50

Line Size

Nor

mal

ized

Ene

rgy

w.r.

t. A

dapt

ive

ener

gy

normalized energy 0.9948507 0.9979403 2.1511329 5.9397528 16.319065 42.516928

8 16 32 64 128 256

Adaptive = each loop nest has a different Adaptive = each loop nest has a different and optimal line sizeand optimal line size


Example: Matrix MultiplyExample: Matrix Multiply

• ijk-Matrix Multiplyijk-Matrix Multiply• Stamina takes about 2min for Stamina takes about 2min for untileduntiled and 8hrs for and 8hrs for

tiledtiled• The tiled MM cannot be analyzed by CMEsThe tiled MM cannot be analyzed by CMEs• Comparison with the “exact version”Comparison with the “exact version”

Energy: MM without Tiling

0

2000

4000

6000

8000

10000

12000

8 16 32 64 128 256 adaptive

Line size

Rela

tive

Ener

gy

Exact Stamina MM without Tiling

0

0.5

1

1.5

2

2.5

8 16 32 64 128 256

Line size

Inte

rfere

ce d

ensi

ty

Exact Stamina


ConclusionsConclusions• Architectural adaptation presents an Architectural adaptation presents an

opportunity to maximize performance based opportunity to maximize performance based on application and data needson application and data needs

• Energy consumption can be optimized within Energy consumption can be optimized within the same frameworkthe same framework

• Compiler analysis and its integration with the Compiler analysis and its integration with the runtime system is needed to achieve this runtime system is needed to achieve this – This work enables optimum tradeoff between This work enables optimum tradeoff between

conflict and reuse based on static analysis of conflict and reuse based on static analysis of nested loopsnested loops

– The result is a possible trade-off between energy The result is a possible trade-off between energy and performance or energy optimizationand performance or energy optimization

• The model is validated by the experimental The model is validated by the experimental resultsresults


Future WorkFuture Work

• Find efficient techniques forFind efficient techniques for symbolic solution symbolic solution of the interference equationof the interference equation

• Improve a friendlier “Improve a friendlier “applicability”applicability” on on benchmarksbenchmarks


Thank youThank you


Energy ImplicationsEnergy Implications

])())(1[(|| 21 LRLRLM acc Energy for data access

LLM acc )(|| Data traffic towards L2

])())(1[(|| 21 SLSLM acc Energy for address access

)(|| LM acc Address activations: reads+writes

)(L Data miss ratio|| accM Memory Accesses


Data Traffic and Address Data Traffic and Address ActivationsActivations

Data Traffic

0

2

4

6

8

10

12

14

16

applu gcc turb3d go hydro2d su2cor w ave5 ijpeg sw im tomcatv compress Average

Bill

ions

Byt

es

ALS AFS SB VC

Address Activations

0

0.05

0.1

0.15

0.2

0.25

applu gcc turb3d go hydro2d su2cor w ave5 ijpeg sw im tomcatv compress Average

Bill

ions

ALS AFS SB VC


Case I, InterferenceCase I, Interference, , first two iterationsfirst two iterations

• Line size Line size L=L=16B16B• Cache Size C=Cache Size C=m*Lm*L

LL

mm

In CacheIn Cache

•In MemoryIn Memory

Next iteration would Next iteration would reuse the same line ofreuse the same line of A[i][j+upb]A[i][j+upb]

LL

A[0][0+upb]A[0][0+upb]

A[0][1+upb][0][1+upb]

Every 4 accesses Every 4 accesses 2 misses2 misses

upb=2upb=2

A[0][0] and B[0][0] A[0][0] and B[0][0] are nm lines apartare nm lines apart

B[0][0]B[0][0]

B[0][1]B[0][1]


Case II, Cache Line Case II, Cache Line AdaptationAdaptation

• Line size Line size L’ = L/2=L’ = L/2=8B8B• Cache Size C=2Cache Size C=2m*(L/2)m*(L/2)

– Cache size constantCache size constant

2m2m

In CacheIn Cache•In MemoryIn Memory

No InterferenceNo Interference

A[0][0+upb]A[0][0+upb]

LL

A[0][1+upb]A[0][1+upb]

upb=2upb=2

A[0][0] A[0][0] and and B[0][0] B[0][0] are 2are 2nm+1nm+1 lines apart lines apart

L’L’

B[0][0]B[0][0]

B[0][1]B[0][1]

Every 4 accesses Every 4 accesses 2 misses2 misses

Half of the trafficHalf of the traffic

static analysis of parameterized loop nests for energy efficient use of data caches p.d’alberto,...

Documents