SIAM Parallel Processing 2012
• Motivation
• Application Performance Characterization:
  ◦ Current approaches
  ◦ Our approach: General Characteristics, Memory Characteristics
• Experimental Setup
  ◦ Benchmarks
  ◦ Tools
• Results
• Conclusion
• Mantevo MiniApps are relatively new
  ◦ Compare to well-known, widely used benchmark suites (e.g., SPEC CPU2006)
  ◦ Compare to the original apps they represent
• Low-level detailed characterization
  ◦ Provides insight into performance
  ◦ Reveals optimization opportunities if available
  ◦ Helps guide and/or validate the development of proxies (miniApps)
  ◦ Gives an idea of suitable platforms for the applications to run on
  ◦ Helps find suitable sets of benchmarks for an experiment
  ◦ …
How is it usually done?
• Problems:
  ◦ No standard set of characteristics
  ◦ Most studies use microarchitecture/hardware-dependent characteristics: execution time, CPI, miss rates, etc.
  ◦ Others suggest microarchitecture-independent characteristics: instruction dependence distance, instruction mix, spatial and/or temporal locality information, etc.
  ◦ A limited set of characteristics is usually used due to simulation cost
Our approach
• Wide range of low-level detailed characteristics: better ability to explain performance
  ◦ Hardware independent, but ISA dependent: collected with dynamic binary instrumentation (DBI) tools such as PIN; most characteristics captured in terms of a frequency distribution (histogram)
  ◦ Hardware dependent: hardware performance counters; validation
• More efficient: no simulation
• Instruction Mix: INT, FP, LD, ST, BR
  ◦ FP: FP, SIMD
  ◦ LD: INT/FP, E_LD: INT/FP
  ◦ ST: INT/FP, E_ST: INT/FP
  ◦ BR: INT/FP
• Instr-dependence distance: register-to-use distance histogram
• Instr-to-Instr distance histograms: ld-to-ld, fp-to-fp, br-to-br, etc.
• Instr-to-Use distance histograms: ld-to-use, fp-to-use, etc.
• Instruction size histogram
• Registers read per instruction
• Registers written per instruction
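As a rough illustration of the instruction mix and instr-to-instr distance histograms listed above, the following Python sketch works on a toy trace of instruction classes. The trace format, the function names, and the choice to measure distance as the difference in dynamic instruction index are all assumptions made for this example; the slides do not specify how the PIN tool defines or records them.

```python
from collections import Counter

def instruction_mix(trace):
    """Fraction of dynamic instructions in each class (INT, FP, LD, ST, BR)."""
    counts = Counter(trace)
    total = len(trace)
    return {cls: n / total for cls, n in counts.items()}

def instr_to_instr_distance(trace, cls):
    """Histogram of distances between consecutive instructions of class `cls`.

    Distance is taken here as the difference in dynamic instruction index
    (an assumption), so two back-to-back loads have distance 1.
    """
    hist = Counter()
    last = None
    for i, c in enumerate(trace):
        if c == cls:
            if last is not None:
                hist[i - last] += 1
            last = i
    return hist

# Hypothetical toy trace: one class label per executed instruction.
trace = ["LD", "INT", "FP", "LD", "BR", "LD", "ST", "FP"]
mix = instruction_mix(trace)
ld_to_ld = instr_to_instr_distance(trace, "LD")
print(mix)
print(dict(ld_to_ld))
```

The same loop could be reused for fp-to-fp or br-to-br by passing a different class label.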
• CPI (Cycles per Instruction)
• Cache miss rates (per 1k instructions): L1, L2, L3, etc.
• Branch misprediction rate
• Totals (for validation purposes)
  ◦ Total instructions
  ◦ Total loads, stores, FP, and branches
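The hardware-dependent metrics above are simple ratios of raw counter totals. The sketch below shows that arithmetic in Python; the dictionary keys are hypothetical names chosen for this example (they are not PAPI event names), and the counter values are made up purely to exercise the code.

```python
def derived_metrics(counters):
    """Derive CPI, misses per 1K instructions, and branch misprediction rate.

    `counters` maps hypothetical counter names to raw totals for one run.
    """
    instr = counters["instructions"]
    return {
        "cpi": counters["cycles"] / instr,
        "l1_mpki": 1000 * counters["l1_misses"] / instr,
        "l2_mpki": 1000 * counters["l2_misses"] / instr,
        "l3_mpki": 1000 * counters["l3_misses"] / instr,
        "br_mispredict_rate": counters["br_mispredicted"] / counters["branches"],
    }

# Illustrative values only, not measurements from the study.
example = {
    "cycles": 2_000_000,
    "instructions": 1_000_000,
    "l1_misses": 30_000,
    "l2_misses": 10_000,
    "l3_misses": 2_000,
    "branches": 100_000,
    "br_mispredicted": 5_000,
}
metrics = derived_metrics(example)
print(metrics)
```

The instruction, load, store, FP, and branch totals from the counters can then be compared against the totals reported by the instrumentation tool as a sanity check.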
Characteristics obtained from DBI tools
• Spatial Locality histogram
  ◦ Cache line access stride distribution
  ◦ Stride is the minimum stride found between the current access and the last N accesses (N currently set at 32)
  ◦ Max stride: one page (4KB)
  ◦ 64-byte cache lines assumed
• Temporal Locality histogram
  ◦ Memory-Reuse-Distance (MRD) histogram
  ◦ MRD is the number of unique memory references between two references to the same cache line
  ◦ Or: MRD is the number of unique cache lines referenced between two references to the same cache line
  ◦ Max distance currently set to cover 6MB
  ◦ 64-byte cache lines assumed
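The stride and memory-reuse-distance definitions above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the authors' PIN tool: it assumes strides are measured in cache lines and capped at one page (64 lines), uses the "unique cache lines" form of the MRD definition, and silently ignores first-time references. The function names and the toy address list are hypothetical.

```python
from collections import Counter, deque

LINE = 64                            # bytes per cache line (from the slides)
WINDOW = 32                          # N: previous accesses compared against
MAX_STRIDE = 4096 // LINE            # one 4 KB page in cache lines (assumed unit)
MAX_LINES = 6 * 1024 * 1024 // LINE  # lines needed to cover 6 MB

def stride_histogram(addresses):
    """Histogram of the minimum line stride to any of the last WINDOW accesses."""
    hist = Counter()
    recent = deque(maxlen=WINDOW)
    for addr in addresses:
        line = addr // LINE
        if recent:
            stride = min(abs(line - prev) for prev in recent)
            hist[min(stride, MAX_STRIDE)] += 1  # cap at one page
        recent.append(line)
    return hist

def reuse_distance_histogram(addresses):
    """Histogram of unique cache lines touched between two uses of the same line."""
    hist = Counter()
    stack = []  # LRU stack of lines, most recently used last
    for addr in addresses:
        line = addr // LINE
        if line in stack:
            pos = stack.index(line)
            hist[len(stack) - 1 - pos] += 1  # lines used more recently than `line`
            stack.pop(pos)
        stack.append(line)
        if len(stack) > MAX_LINES:
            stack.pop(0)  # forget lines beyond the tracked distance
    return hist

# Hypothetical byte addresses of successive memory accesses.
addresses = [0, 8, 64, 128, 0, 16384]
strides = stride_histogram(addresses)
reuse = reuse_distance_histogram(addresses)
print(dict(strides), dict(reuse))
```

The list-based LRU stack costs linear time per access; a tool processing billions of references would need a tree-based or approximate structure instead.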
Characteristics obtained from DBI tools
• Working Set size
  ◦ Total unique bytes touched by application
  ◦ Distribution of unique bytes touched by every 1 billion instructions
• Pattern of executed memory instructions
  ◦ Distance defined in number of instructions between memory ops
• Distribution of memory size read/written
Characteristics obtained from hardware performance counters:
• Cache miss rates (per 1k instructions): L1, L2, L3, etc.
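To make the working-set and memory-op pattern characteristics concrete, here is a small Python sketch over a toy trace. The trace encoding (one entry per instruction, `None` for non-memory instructions, an `(address, size)` tuple for memory ops), the function names, and the tiny interval length are assumptions for illustration; the slides use intervals of 1 billion instructions.

```python
from collections import Counter

def working_set_bytes(trace):
    """Number of unique bytes touched by the memory ops in `trace`."""
    touched = set()
    for op in trace:
        if op is not None:
            addr, size = op
            touched.update(range(addr, addr + size))
    return len(touched)

def working_set_per_interval(trace, interval):
    """Unique bytes touched within each window of `interval` instructions."""
    return [working_set_bytes(trace[i:i + interval])
            for i in range(0, len(trace), interval)]

def mem_op_distance_histogram(trace):
    """Histogram of instruction-count distances between consecutive memory ops."""
    hist = Counter()
    last = None
    for i, op in enumerate(trace):
        if op is not None:
            if last is not None:
                hist[i - last] += 1
            last = i
    return hist

def access_size_histogram(trace):
    """Histogram of bytes read or written per memory op."""
    return Counter(op[1] for op in trace if op is not None)

# Hypothetical trace: (address, size) for memory ops, None otherwise.
trace = [(0, 8), None, (4, 8), None, None, (100, 4)]
total_ws = working_set_bytes(trace)
per_interval = working_set_per_interval(trace, 3)
distances = mem_op_distance_histogram(trace)
sizes = access_size_histogram(trace)
print(total_ws, per_interval, dict(distances), dict(sizes))
```

Tracking individual bytes in a set is only practical for toy inputs; a real tool would likely track touched regions at a coarser granularity or with a bitmap.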
• Mantevo MiniApps
  ◦ Explicit Finite Element MiniApps: PhdMesh
  ◦ Molecular Dynamics MiniApps: MiniMD
  ◦ Implicit Finite Element MiniApps: HPCCG, pHPCCG, MiniFE
• SPEC CPU2006
  ◦ 6 Floating-point benchmarks: cactusADM, LBM, Povray, DealII, Leslie3d, Calculix
  ◦ 4 Integer benchmarks: Perlbench, Astar, Libquantum, Xalancbmk
• Input sizes:
  ◦ Mantevo: adjusted for approximately the same instruction count as SPEC
  ◦ SPEC: reference input
• Platform:
  ◦ Experiments run on Xeon-E5504, Gainestown (based on Nehalem), 45nm, 4 core, 256KB L2/core, 4MB L3
• Tools:
  ◦ PAPI (papiex): CPI, cache and branch statistics
  ◦ PIN (Dynamic Binary Instrumentation): all general characteristics; some memory characteristics; benchmarks run to completion (~1 day each)
  ◦ PIN + PinPoints + SimPoint: spatial & temporal locality characteristics; simulation points of size 1 billion dynamic instructions covering 95% of execution; # of points ranges from 3 to 8 with different weights
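Because each simulation point carries a weight, per-point histograms have to be combined before they can stand in for the whole run. The slides do not say how this was done; the Python sketch below shows the usual weighted-sum approach under that assumption, with a hypothetical function name and made-up histograms and weights.

```python
from collections import Counter

def combine_weighted(histograms, weights):
    """Weighted average of per-simulation-point histograms.

    Weights are normalized by their sum, so they need not add up to 1
    (for example when the points cover only 95% of execution).
    """
    total = sum(weights)
    combined = Counter()
    for hist, weight in zip(histograms, weights):
        for bin_, count in hist.items():
            combined[bin_] += count * weight / total
    return combined

# Illustrative stride histograms for two simulation points (not real data).
point_histograms = [{0: 100, 1: 50}, {0: 20, 64: 80}]
point_weights = [0.75, 0.25]
estimate = combine_weighted(point_histograms, point_weights)
print(dict(estimate))
```

The result estimates the average histogram of one 1-billion-instruction interval; scaling it by the number of intervals in the full run would give whole-program counts.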
[Figure: two bar charts. Panel 1: % Stall Cycles (y-axis: %, 0 to 100). Panel 2: CPI (y-axis: CPI, 0 to 3).]
[Figure: three panels. Panel 1: stacked bars (0% to 100%) with legend Branches, Int Ops, FP Ops, FP Stores, FP Loads, Int Stores, Int Loads. Panel 2: FP-to-Use (y-axis: # of instructions, 0 to 10). Panel 3: FP-to-FP (y-axis: # of instructions, 0 to 5).]
[Figure: three panels. Panel 1: Instruction Dependence Distance (y-axis: # of instructions, 0 to 14). Panel 2: Basic Block Size (y-axis: # of instructions, 0 to 80). Panel 3: Working Set Size (y-axis: megabytes, 0 to 4000).]
[Figure: four panels. L1 Misses/1K inst (0 to 60); L2 Misses/1K inst (0 to 35); L3 Misses/1K inst (0 to 16); BR Misprediction Rate (0% to 8%).]
[Figure: three panels. Panel 1: Distance Between Mem Ops (y-axis: # of instructions, 0 to 4.5). Panel 2: % Mem Ops (0% to 50%). Panel 3: LD-to-Use (y-axis: # of instructions, 0 to 5).]
[Figure: three panels. Panel 1: Cache Miss Rates (L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg. Panel 2: Stride histogram (y-axis: frequency, up to 4E+11; bins 0, 1, 64, Other). Panel 3: MemReuse Distance histogram (y-axis: frequency, up to 2.5E+11; x-axis: # unique cache lines referenced between two references to the same line; bins 0, <=10, <=512, <=4096, <=65536, >65536).]
[Figure: three panels. Panel 1: Stride histogram (y-axis: frequency, up to 9E+11; bins 0, 1, Other). Panel 2: Cache Miss Rates (L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg. Panel 3: MemReuse Distance histogram (y-axis: frequency, up to 8E+11; x-axis: # unique cache lines referenced between two references to the same line; bins 0, <=10, <=512, <=4096, <=65536, >65536).]
[Figure: three panels. Panel 1: Stride histogram (y-axis: frequency, up to 1.2E+12; bins 0, 1, 64, Other). Panel 2: Cache Miss Rates (L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg. Panel 3: MemReuse Distance histogram (y-axis: frequency; x-axis: # unique cache lines referenced between two references to the same line; bins 0, <=10, <=512, <=4096, <=65536, >65536).]
[Figure: six panels (y-axis: frequency). Stride (Calculix): bins 0, 1, 2, Other. Stride (DealII): bins 0, 1, 64, Other. Stride (Leslie3D): bins 0, 1, 15, 16, 64, Other. MemReuse Distance (Calculix), MemReuse Distance (DealII), MemReuse Distance (Leslie3D): x-axis is # unique cache lines referenced between two references to the same line; bins 0, <=10, <=512, <=4096, <=65536, >65536.]
• MiniApps exhibit more memory-intensive behavior
  ◦ >100% more misses (L2 & L3) than SPEC!
  ◦ Much larger data working set (500% more)
  ◦ More memory ops per instruction (16% more)
  ◦ Memory ops are closer to each other (2.1 vs. 2.9)
  ◦ More prone to contention for memory resources
• MiniApps have much shorter (>100%) dependence distance than SPEC
  ◦ Suggests more dependence stalls
• MiniApps have much shorter basic blocks than SPEC
• MiniApps experience more stall time (> 33%)
  ◦ Greater CPI
  ◦ Due to more cache misses & dependence stalls
• Compare performance of MiniApps and real full-size apps:
  ◦ Single node
  ◦ At scale
• Obtain memory performance characteristics using full runs instead of simulation points
  ◦ Compare findings to simulation points
• Study how sensitive performance is to problem size