static analysis of parameterized loop nests for energy efficient use of data caches p.d’alberto,...
DESCRIPTION
COLP01 - CECS ICS UCIMotivation Caches are important in modern systemsCaches are important in modern systems –A key performance determinant –An important part of the overall energy equation: »large and growing size and associativity, multi-port access –Increasingly critical for data-intensive applications ‘Adaptive’ caches can improve performance‘Adaptive’ caches can improve performance –By changing line and/or fetch size based on runtime behavior –Varying associativity, etc –“Optimal” is application specificTRANSCRIPT
Static Analysis of Static Analysis of Parameterized Loop Nests Parameterized Loop Nests for Energy Efficient Use of for Energy Efficient Use of
Data CachesData CachesP.D’AlbertoP.D’Alberto, ,
A.Nicolau, A.Veidenbaum, R.GuptaA.Nicolau, A.Veidenbaum, R.Gupta
University of California at IrvineUniversity of California at IrvineInformation and Computer Science Information and Computer Science
DepartmentDepartmentCenter for Embedded Computer SystemsCenter for Embedded Computer Systems
COLP01 - CECS ICS UCI
Talk OrganizationTalk Organization
•Motivation Motivation •Related workRelated work•Parameterized Static Analysis Parameterized Static Analysis • Implementation and Implementation and
Experimental ResultsExperimental Results•ConclusionsConclusions
COLP01 - CECS ICS UCI
MotivationMotivation
•Caches are important in modern systemsCaches are important in modern systems– A key performance determinantA key performance determinant– An important part of the overall energy equation:An important part of the overall energy equation:
» large and growing size and associativity, multi-large and growing size and associativity, multi-port accessport access
– Increasingly critical for data-intensive Increasingly critical for data-intensive applicationsapplications
• ‘‘Adaptive’ caches can improve Adaptive’ caches can improve performanceperformance– By changing line and/or fetch size based on runtime By changing line and/or fetch size based on runtime
behaviorbehavior– Varying associativity, etcVarying associativity, etc– ““Optimal”Optimal” is application specific is application specific
COLP01 - CECS ICS UCI
Adaptive Memory SystemAdaptive Memory System
•Problem: How to control adaptive cache Problem: How to control adaptive cache line size to maximize performance line size to maximize performance andand minimize energy consumption?minimize energy consumption?
•Approach: use a compiler to generate code Approach: use a compiler to generate code directing the adaptation at run timedirecting the adaptation at run time
• Issues in such a compilation approachIssues in such a compilation approach– What application characteristics do we need to What application characteristics do we need to
measure?measure?– When can we statically determine an optimal line size?When can we statically determine an optimal line size?– How can we “generate” code based on the static How can we “generate” code based on the static
analysis?analysis?
COLP01 - CECS ICS UCI
Rewards of Line Size Rewards of Line Size AdaptationAdaptation
•Miss rate is reduced resulting inMiss rate is reduced resulting in::– Fewer fetches from next-level cacheFewer fetches from next-level cache or memoryor memory
– Less transfer trafficLess transfer traffic– Fewer processor stallsFewer processor stalls– The code is, of course, left unchangedThe code is, of course, left unchanged
• What is the effect on energy What is the effect on energy dissipation?dissipation?
COLP01 - CECS ICS UCI
Energy EffectsEnergy Effects
• To ensure total energy reduction To ensure total energy reduction need toneed to– reduce memory accesses and memory reduce memory accesses and memory
traffictraffic• The choice of the line size affects both The choice of the line size affects both
memory accesses and memory trafficmemory accesses and memory traffic– Minimizing for either one alone will not Minimizing for either one alone will not
give minimal energy dissipation give minimal energy dissipation • A tradeoff is possible and practical, at A tradeoff is possible and practical, at
compile time, when they are compile time, when they are “quantified” accurately and fast“quantified” accurately and fast
COLP01 - CECS ICS UCI
Related Work on Cache Related Work on Cache OptimizationsOptimizations
• Two primary approaches to determine Two primary approaches to determine Memory Memory AccessesAccesses and and Cache MissesCache Misses – Profiling and Static AnalysisProfiling and Static Analysis
• ProfilingProfiling1.1. Given a set of inputs (training set)Given a set of inputs (training set)2.2. The number of interference is determinedThe number of interference is determined
» Simulation, hardware counters….Simulation, hardware counters….3.3. Compiler selects a line size and uses it to annotate Compiler selects a line size and uses it to annotate
the codethe code4.4. The annotated code is then used in actual execution The annotated code is then used in actual execution
– The problem is that often there is not a single optimum line size for different runs of the same code (input dependent)
COLP01 - CECS ICS UCI
Related Work … Related Work … cont
• Static Analysis Static Analysis 1.1. Based on loop nests and CMEsBased on loop nests and CMEs2.2. Representation of cache misses by inequalitiesRepresentation of cache misses by inequalities3.3. Each iteration checked to verify inequalitiesEach iteration checked to verify inequalities
» Yes/no missYes/no miss4.4. The number of iterations/misses is added upThe number of iterations/misses is added up
» The loop bounds are needed at this pointThe loop bounds are needed at this point • Note:Note:
– The complexity of the analysis is proportional to the The complexity of the analysis is proportional to the number of iterations number of iterations
– This work forms the basis of the parameterized analysisThis work forms the basis of the parameterized analysis» Because it characterizes the cause of missesBecause it characterizes the cause of misses
COLP01 - CECS ICS UCI
Parameterized Loop Nests Analysis Parameterized Loop Nests Analysis for Direct Mapped Data Cachefor Direct Mapped Data Cache
1.1. Every memory reference is symbolically equated Every memory reference is symbolically equated with every possible source of interferencewith every possible source of interference
2.2. Symbolic solution of the equation is sought Symbolic solution of the equation is sought • Is there a solution? Is there a solution? • Existence condition for a solutionExistence condition for a solution
3.3. Trade-off between spatial reuse and interferenceTrade-off between spatial reuse and interference• If there is a solution, how can we estimate the If there is a solution, how can we estimate the
number of solutions without counting them?number of solutions without counting them?• Bounding the misses by interference Bounding the misses by interference
4.4. Misses = Contribution from each referenceMisses = Contribution from each reference
COLP01 - CECS ICS UCI
Interference and Reuse, Interference and Reuse, explanation
• An interference equation represents the set of An interference equation represents the set of iterations in the loop nest where two references iterations in the loop nest where two references interfereinterfere– Where there is a miss Where there is a miss
• Existence conditions for solutions of the Existence conditions for solutions of the equation:equation:– Find at least one iteration for which the equation Find at least one iteration for which the equation
is satisfiedis satisfied
• Bounding the misses due to interferenceBounding the misses due to interference– If there is a solution, we propose, “If there is a solution, we propose, “thethe interference interference
density”, density”, to bound the to bound the ratio of the iteration ratio of the iteration solutions over the total number of iterations solutions over the total number of iterations
COLP01 - CECS ICS UCI
Interference Density, Interference Density, explanation
• The interference density is a The interference density is a straightforward quantity to determinestraightforward quantity to determine– Function of the coefficients of the Function of the coefficients of the
interference equationinterference equation
• It is independent from the loop nest and It is independent from the loop nest and the definition domain the definition domain
• It is a It is a good good upper bound to the cache upper bound to the cache miss ratio due to interference equationmiss ratio due to interference equation– (but only when it is known if there is (but only when it is known if there is
interference)interference)
COLP01 - CECS ICS UCI
int A[MAX][MAX];int A[MAX][MAX]; /* In memory starting at address /* In memory starting at address Aoffset Aoffset */*/
int B[MAX][MAX];int B[MAX][MAX]; /* In memory starting at address /* In memory starting at address Boffset Boffset */*/
/* Size of int = 4 Bytes, Row major layout *//* Size of int = 4 Bytes, Row major layout */
int empty(intint empty(int upbupb) {) { /* One parameter *//* One parameter */ x=0;x=0; for (i=0;i<for (i=0;i<upbupb;i++);i++) /* loop bounds affine function of /* loop bounds affine function of upb upb */*/ for (j=0;j<for (j=0;j<upbupb;j++) ;j++) x += (A[i][j+x += (A[i][j+upbupb] +1)] +1) /* Memory reference affine function of /* Memory reference affine function of
*/*/ *B[i][j];*B[i][j]; /* index variables /* index variables i, ji, j and parameter and parameter upbupb */*/
return x; return x; }}
ExampleExample
COLP01 - CECS ICS UCI
Our ApproachOur Approach
int A[MAX][MAX]; int A[MAX][MAX]; int B[MAX][MAX];int B[MAX][MAX];void empty(int upb) {void empty(int upb) {
x +=(A[i][j+upb]+1)*B[i][j];x +=(A[i][j+upb]+1)*B[i][j];
}}
1) Interference Equation A[][] <->B[][]:1) Interference Equation A[][] <->B[][]:Aoffset+MAX*4*i+(j+upb)*4+n(mL)+lAoffset+MAX*4*i+(j+upb)*4+n(mL)+l==Boffset+MAX*4*i+j*4 Boffset+MAX*4*i+j*4 n>0 and |l|<Ln>0 and |l|<L
For example, with For example, with Aoffset Aoffset = 64, = 64, BoffsetBoffset=8256 and =8256 and MAXMAX=10=10And And mL = 8192, mL = 8192, the equation becomes :the equation becomes :4*upb +n8192 +l = 81924*upb +n8192 +l = 8192 2) Symbolic Solution: D= 4*upb mod 8192
if L>D there is interfernce (i.e. L=D annotation )
3) Otherwise Interference Density: min(1,4/L+(L-D)/L)
COLP01 - CECS ICS UCI
Our Approach … Our Approach … contcont
void empty(int void empty(int upbupb) {) {
… …..
for (i=0;i<for (i=0;i<upbupb;i++) ;i++)
for (j=0;j<for (j=0;j<upbupb ;j++) { ;j++) {
… …
}}
}}
Iteration PointsIteration Points
upbi 0
upbj 0
4) Misses : interference density * number of iteration points4) Misses : interference density * number of iteration points : : 2 (4/L+2 (4/L+minmin(1,1/(up(1,1/(up mod mod L)) * upbL)) * upb22
upb2
COLP01 - CECS ICS UCI
Implementation of STAMINAImplementation of STAMINA
• Stamina is a 3-step phase in the Stamina is a 3-step phase in the ARMR compilerARMR compiler
1.1. Step I: Code Analysis Step I: Code Analysis – Input:Input:
» Code and the sequence of memory Code and the sequence of memory references in the inner loopsreferences in the inner loops
– Output:Output:» Loop bounds information: Loop bounds information:
• Expression of the boundsExpression of the bounds» Reference information Reference information
• Index computationsIndex computations• Reuse informationReuse information
COLP01 - CECS ICS UCI
Implementation of STAMINA. Implementation of STAMINA. contcont
Step II: Interference Equation Generation Step II: Interference Equation Generation – Input: Input: Step I output andStep I output and
» The cache size and array layoutThe cache size and array layout» Sequence of memory referencesSequence of memory references
– Output:Output:» Set of equations and domains of validity Set of equations and domains of validity
Step III: Interference estimationStep III: Interference estimation– Input: Input: Step II outputStep II output– Output:Output:
» Interference density and symbolic enumeration Interference density and symbolic enumeration of the iterations in the loop nest of the iterations in the loop nest
COLP01 - CECS ICS UCI
Example: SWIMExample: SWIM
• Swim: a loop-based code from SPEC 2000Swim: a loop-based code from SPEC 2000– It cannot be analyzed using CME because of It cannot be analyzed using CME because of
unknown loop bounds (introduced as input at run unknown loop bounds (introduced as input at run time)time)
• STAMINA analysis takes 1 minute per loop nestSTAMINA analysis takes 1 minute per loop nest– 4 main loop nests 4 main loop nests – The execution of swim takes less than 1 hrThe execution of swim takes less than 1 hr
• We verified the analysis using We verified the analysis using Shade cache Shade cache simulatorsimulator
• The analysis results: The analysis results: – For the reference set Swim has no interference: For the reference set Swim has no interference:
» Independent of the line sizeIndependent of the line size– Larger line size = better performanceLarger line size = better performance– Shorter line size = better energyShorter line size = better energy
COLP01 - CECS ICS UCI
Examples: “Self Interference”Examples: “Self Interference”
• ““Self Interference” is an artificial exampleSelf Interference” is an artificial example• It consists of 6 loop nests with only one It consists of 6 loop nests with only one
memory reference eachmemory reference each– Self interferenceSelf interference
• The choice of line size sharply affects the The choice of line size sharply affects the overall miss ratio and energy consumptionoverall miss ratio and energy consumption– Adaptive = optimal line size per each loop nestAdaptive = optimal line size per each loop nest
• Adaptive line size yieldsAdaptive line size yields– Optimum energy consumptionOptimum energy consumption– 1/3 of the total miss ratio 1/3 of the total miss ratio
COLP01 - CECS ICS UCI
Example “Self Interference”Example “Self Interference”"self interference"
00.5
11.5
22.5
33.5
44.5
8 16 32 64 128 256 adaptive
Line size
Inte
rfer
ence
Den
sity
Energy for "self interference"
0
10
20
30
40
50
Line Size
Nor
mal
ized
Ene
rgy
w.r.
t. A
dapt
ive
ener
gy
normalized energy 0.9948507 0.9979403 2.1511329 5.9397528 16.319065 42.516928
8 16 32 64 128 256
Adaptive = each loop nest has a different Adaptive = each loop nest has a different and optimal line sizeand optimal line size
COLP01 - CECS ICS UCI
Example: Matrix MultiplyExample: Matrix Multiply
• ijk-Matrix Multiplyijk-Matrix Multiply• Stamina takes about 2min for Stamina takes about 2min for untileduntiled and 8hrs for and 8hrs for
tiledtiled• The tiled MM cannot be analyzed by CMEsThe tiled MM cannot be analyzed by CMEs• Comparison with the “exact version”Comparison with the “exact version”
Energy: MM without Tiling
0
2000
4000
6000
8000
10000
12000
8 16 32 64 128 256 adaptive
Line size
Rela
tive
Ener
gy
Exact Stamina MM without Tiling
0
0.5
1
1.5
2
2.5
8 16 32 64 128 256
Line size
Inte
rfere
ce d
ensi
ty
Exact Stamina
COLP01 - CECS ICS UCI
ConclusionsConclusions• Architectural adaptation presents an Architectural adaptation presents an
opportunity to maximize performance based opportunity to maximize performance based on application and data needson application and data needs
• Energy consumption can be optimized within Energy consumption can be optimized within the same frameworkthe same framework
• Compiler analysis and its integration with the Compiler analysis and its integration with the runtime system is needed to achieve this runtime system is needed to achieve this – This work enables optimum tradeoff between This work enables optimum tradeoff between
conflict and reuse based on static analysis of conflict and reuse based on static analysis of nested loopsnested loops
– The result is a possible trade-off between energy The result is a possible trade-off between energy and performance or energy optimizationand performance or energy optimization
• The model is validated by the experimental The model is validated by the experimental resultsresults
COLP01 - CECS ICS UCI
Future WorkFuture Work
• Find efficient techniques forFind efficient techniques for symbolic solution symbolic solution of the interference equationof the interference equation
• Improve a friendlier “Improve a friendlier “applicability”applicability” on on benchmarksbenchmarks
COLP01 - CECS ICS UCI
Thank youThank you
COLP01 - CECS ICS UCI
Energy ImplicationsEnergy Implications
])())(1[(|| 21 LRLRLM acc Energy for data access
LLM acc )(|| Data traffic towards L2
])())(1[(|| 21 SLSLM acc Energy for address access
)(|| LM acc Address activations: reads+writes
)(L Data miss ratio|| accM Memory Accesses
COLP01 - CECS ICS UCI
Data Traffic and Address Data Traffic and Address ActivationsActivations
Data Traffic
0
2
4
6
8
10
12
14
16
applu gcc turb3d go hydro2d su2cor w ave5 ijpeg sw im tomcatv compress Average
Bill
ions
Byt
es
ALS AFS SB VC
Address Activations
0
0.05
0.1
0.15
0.2
0.25
applu gcc turb3d go hydro2d su2cor w ave5 ijpeg sw im tomcatv compress Average
Bill
ions
ALS AFS SB VC
COLP01 - CECS ICS UCI
Case I, InterferenceCase I, Interference, , first two iterationsfirst two iterations
• Line size Line size L=L=16B16B• Cache Size C=Cache Size C=m*Lm*L
LL
mm
In CacheIn Cache
•In MemoryIn Memory
Next iteration would Next iteration would reuse the same line ofreuse the same line of A[i][j+upb]A[i][j+upb]
LL
A[0][0+upb]A[0][0+upb]
A[0][1+upb][0][1+upb]
Every 4 accesses Every 4 accesses 2 misses2 misses
upb=2upb=2
A[0][0] and B[0][0] A[0][0] and B[0][0] are nm lines apartare nm lines apart
B[0][0]B[0][0]
B[0][1]B[0][1]
COLP01 - CECS ICS UCI
Case II, Cache Line Case II, Cache Line AdaptationAdaptation
• Line size Line size L’ = L/2=L’ = L/2=8B8B• Cache Size C=2Cache Size C=2m*(L/2)m*(L/2)
– Cache size constantCache size constant
2m2m
In CacheIn Cache•In MemoryIn Memory
No InterferenceNo Interference
A[0][0+upb]A[0][0+upb]
LL
A[0][1+upb]A[0][1+upb]
upb=2upb=2
A[0][0] A[0][0] and and B[0][0] B[0][0] are 2are 2nm+1nm+1 lines apart lines apart
L’L’
B[0][0]B[0][0]
B[0][1]B[0][1]
Every 4 accesses Every 4 accesses 2 misses2 misses
Half of the trafficHalf of the traffic