Sector Cache Optimizations for the K Computer
Swann Perarnau, Mitsuhisa Sato
RIKEN AICS, University of Tsukuba
Sector Cache on the SPARC64 VIIIfx
- On-demand split of the shared L2 cache.
- User-controlled mapping between accesses and sectors.
→ Isolation of thrashing accesses.
→ Select and keep useful data in cache.
Our Goals
Assess applicability of the sector cache to optimize HPC applications.
- Study potential optimization strategies.
- Help users find good sector cache optimizations.
- Aim for as much automation as possible.
Issues
- Low-level compile-time API.
- Hard to predict impact on performance.
- Little support from compiler and performance analysis tools.
Current Sector Cache API
/* NSIZE (the array length) is assumed to be defined elsewhere. */
double myarray[NSIZE];
double otherarray[NSIZE];

void mywork(void)
{
    int i;
    double sum = 0;

#pragma statement cache_sector_size 1 11
#pragma statement cache_subsector_assign myarray
    for (i = 2; i < NSIZE-2; i++) {
        // myarray in sector 1
        sum += myarray[i-2] + myarray[i-1] +
               myarray[i] + myarray[i+1] +
               myarray[i+2] + otherarray[i];
    }
}
Results
- Efficient cache optimization of classical HPC benchmarks.
- Framework for locality analysis and optimization of HPC applications.
- Ongoing automation: little user action is required already.
Framework Overview
Hotspot detection
Structures/Scope setup
DWARF reader
Binary instrumentation
Locality analysis
Code modification
Locality Measurements
Reuse Distance: for a memory access, the number of unique memory locations touched since the previous access to the same location.
→ An architecture-independent measure closely related to the cache misses triggered by an application.
→ Use it to measure the consequences of isolating one structure by itself with the sector cache.
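The definition can be written down formally. The notation below (a_t for the t-th access, p(t), d(t), and the capacity C) is introduced here for illustration and does not appear on the poster.

```latex
\[
p(t) = \max\{\, s < t : a_s = a_t \,\}, \qquad
d(t) =
\begin{cases}
\bigl|\{\, a_s : p(t) < s < t \,\}\bigr| & \text{if } p(t) \text{ exists},\\[2pt]
\infty & \text{otherwise (first access to } a_t\text{)}.
\end{cases}
\]
```

For an idealized fully associative LRU cache holding C lines, access t hits exactly when d(t) < C, which is why the measure tracks cache misses so closely. Real set-associative caches only approximate this behaviour.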
Reuse Example
Access       Distance
load 0x10    ∞
load 0x20    ∞
load 0x30    ∞
load 0x10    2
load 0x20    2
load 0x10    1
load 0x30    2
[Figure: density of reuse distances. x-axis: Distance; y-axis: Density.]
Validating the Framework: a toy example
[Figure: Reuse Distances of Each Structure. Series: M1, M2, M3. x-axis ticks: 1.75MB, 3.5MB, 7MB, ∞; y-axis range: 0 to 1.4 · 10^7.]
[Figure: Optimizations for all Sector Configurations. Series: M1, M2, M3; the configuration selected by the framework is marked. x-axis: Sector Size (1 to 11); y-axis: Cache Misses Reduction (%), 0 to 24.]
Optimizing the NAS Parallel Benchmarks
Benchmark  Function   Isolated Variables  Sector Size  Miss Reduction (%)  Runtime Reduction (%)
CG         conj_grad  p                   (1,11)       19                  10
LU         ssor       a,b,c,d             (2,10)       48                  8
LU         blts       ldz,ldy,ldx,d       (2,10)       75                  10
LU         buts       d,udx,udy,udz       (2,10)       18                  3
LU         jacld      a,b,c,d             (2,10)       64                  14
LU         jacu       a,b,c,d             (2,10)       57                  6
Acknowledgements
Part of the results were obtained by early access to the K computer at the RIKEN AICS. This work was supported by the JSPS Grant-in-Aid for JSPS Fellows Number 24.02711.
RIKEN Advanced Institute for Computational Science, Programming Environment Research Team, [email protected]