Decomposing Memory Performance
Data Structures and Phases
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences, The University of Texas at Austin
2
Memory hierarchy trends
• Growing latency to main memory
• Growing cache complexity
  – More cache levels
  – New mechanisms, optimizations
• Growing application complexity
  – Lots of abstraction

Application-system interactions increasingly hard to predict
3
The solution: More fine-grained metrics
• More insight within an application• More rigorous comparisons across applications• Potential applications:
– Hardware/software tuning– Global hints for online phase detection
Our approach: data structure decomposition
High-level, easy to understand
Highlights important access patterns
4
ammp vs twolf: The tale of two applications

Conventional view: they're pretty similar
• IPC: 0.57 vs 0.51
• DL1 miss rate: 10% vs 9.5%
• Access patterns
  – Lots of pointer accesses in both
  – Mostly linked-list traversal
5
ammp vs twolf: Data structure decomposition

[Chart: breakdown of DL1 misses (%) for ammp and twolf into their top three data structures (DS1, DS2, DS3) and the rest]
6
ammp vs twolf: Access patterns

twolf:
    i = rand()
    t1 = b[c[i]->cblock]
    t2 = t1->tileterm
    t3 = n[t2->net]
    …

ammp:
    atom = atom->next
    atom[i]->neighbour[j]; ++j; ++i
twolf has more complex access patterns
7
ammp vs twolf: Phase behavior

[Charts: total DL1 misses (millions) over 60 billion cycles for ammp and twolf]
8
ammp vs twolf: Phase behavior by data structure

[Charts: DL1 misses (millions) over time, by data structure: total, atoms, and nodelist for ammp; total, netptr, and tmp_rows for twolf]
ammp has more interesting phase behavior
9
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
10
Data structure decomposition
Application communicates with simulator
Leave core application oblivious; automatically add simulator-aware instrumentation
11
DTrack
[Toolchain: Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics]

• DTrack's protocol for application-simulator communication
12
DTrack’s protocol
1. Application stores a mapping at a predetermined shared location
   – (start address, end address) → variable name
2. ..and signals the simulator by a special opcode
   • Other techniques possible
3. Simulator detects the signal and reads the shared location

The simulator now knows the variable names of address regions
13
Instrumentation without perturbance
• Global segment: write to file
  – Expensive, but a one-time cost during initialization
  – Amortized across all global variables
• Heap: save in special variables after every malloc/free
  – Overhead ∝ frequency of mallocs/frees
  – Special variables always hit in cache
• Stack: no instrumentation
  – Function calls too frequent
  – Causes negligible misses anyway
14
Measuring perturbance
• Communicate specific start and end points in the application to the simulator
• Compare instruction counts between them with and without instrumentation

ΔInstruction count < 4%, even with frequent mallocs
15
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
16
The importance of sampling period
Good sampling period ⇒ low noise

[Charts: DL1 misses (thousands) for the same run, sampled every 10M cycles vs every 230M cycles]
17
Volatility: A noise metric for time sequence graphs

[Pipeline: raw data stream (miss graph) → point volatilities (volatility graph) → sort, extract 90th percentile → aggregate volatility value, for some sampling period]

Point volatility = |x[t] - x[t-1]| / max(x[t], x[t-1])
18
Volatility depends on sampling period
[Pipeline: the same computation, parameterized by the sampling period: raw data stream → point volatilities → aggregate → volatility value / volatility graph]
19
Volatility profile: Volatility vs sampling period

[Chart: volatility vs sampling period (millions of samples) for 164.gzip]
20
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
21
Methodology
• Source translator (A): C-Breeze
• Compiler (B): Alpha GEM cc
• Simulator (C): sim-alpha
  – Validated model of the 21264 pipeline
• Simulated machine: Alpha 21264
  – 4-way issue, 64KB 3-cycle DL1
• Benchmarks: 12 C applications from the SPEC CPU2000 suite
22
Major data structures by DL1 misses
[Chart: % DL1 misses from the top three data structures (#1, #2, #3) in each benchmark: 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, 300.twolf]
23
Most misses ⇒ most pipeline stalls?

• Process:
  – Detect stall cycles, i.e. cycles in which no instructions were committed
  – Assign blame to the data structure of the oldest instruction in the pipeline
• Results:
  – Stall-cycle ranks track miss-count ranks
  – Exceptions:
    • tds in 179.art
    • search in 186.crafty
24
Types of phase behavior
[Charts: DL1 misses (millions) over 115 billion cycles for I. mcf and II. art]
25
Types of phase behavior (contd.)

[Chart: DL1 misses (millions) over time for III. mesa]
26
Summary
• More detailed metrics ⇒ richer application comparison
• Low-overhead data structure decomposition
• Determining the ideal sampling period
  – A volatility metric inspired by spectral analysis
• The ideal sampling period is application-specific
• Data structures in an application share common phase boundaries