Decomposing Memory Performance
Data Structures and Phases
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences, The University of Texas at Austin
2
Memory hierarchy trends
• Growing latency to main memory
• Growing cache complexity
  – More cache levels
  – New mechanisms, optimizations
• Growing application complexity
  – Lots of abstraction

Application-system interactions increasingly hard to predict
3
The solution: More fine-grained metrics
• More insight within an application• More rigorous comparisons across applications• Potential applications:
– Hardware/software tuning– Global hints for online phase detection
Our approach: data structure decomposition
High-level, easy to understand
Highlights important access patterns
4
ammp vs twolf: The tale of two applications

Conventional view: they're pretty similar
• IPC: 0.57 vs 0.51
• DL1 miss rate: 10% vs 9.5%
• Access patterns
  – Lots of pointer accesses in both
  – Mostly linked-list traversal
5
ammp vs twolf: Data structure decomposition

[Chart: breakdown of DL1 misses (%) for ammp and twolf into their top three data structures (DS1, DS2, DS3) and the rest]
6
ammp vs twolf: Access patterns

twolf:
    i = rand()
    t1 = b[c[i]->cblock]
    t2 = t1->tileterm
    t3 = n[t2->net]
    …

ammp:
    atom = atom->next
    atom[i]->neighbour[j]; ++j; ++i
twolf has more complex access patterns
7
ammp vs twolf: Phase behavior

[Charts: total DL1 misses (millions) over 60 billion cycles for ammp and twolf]
8
ammp vs twolf: Phase behavior by data structure

[Charts: DL1 misses (millions) over time, by data structure: total, atoms, and nodelist for ammp; total, netptr, and tmp_rows for twolf]
ammp has more interesting phase behavior
9
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
10
Data structure decomposition
Application communicates with simulator
Leave core application oblivious; automatically add simulator-aware instrumentation
11
DTrack
[Toolchain: Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics]

• DTrack's protocol for application-simulator communication
12
DTrack’s protocol
1. Application stores a mapping at a predetermined shared location
   – (start address, end address) → variable name
2. ..and signals the simulator by a special opcode
   • Other techniques possible
3. Simulator detects the signal and reads the shared location

The simulator now knows the variable names of address regions
13
Instrumentation without perturbance
• Global segment: write to file
  – Expensive, but a one-time cost during initialization
  – Amortized across all global variables
• Heap: save in special variables after every malloc/free
  – Overhead ∝ frequency of mallocs/frees
  – Special variables always hit in cache
• Stack: no instrumentation
  – Function calls too frequent
  – Causes negligible misses anyway
14
Measuring perturbance
• Communicate specific start and end points in the application to the simulator
• Compare instruction counts between them with and without instrumentation

ΔInstruction count < 4%, even with frequent mallocs
15
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
16
The importance of sampling period
Good sampling period ⇒ low noise

[Charts: DL1 misses (thousands) for the same run, sampled every 10M cycles vs every 230M cycles]
17
Volatility: A noise metric for time sequence graphs

[Pipeline: raw data stream (miss graph) → point volatilities (volatility graph) → sort, extract 90th percentile → aggregate volatility value, for some sampling period]

Point volatility = |x[t] - x[t-1]| / max(x[t], x[t-1])
18
Volatility depends on sampling period
[Pipeline: the same computation, parameterized by the sampling period: raw data stream → point volatilities → aggregate → volatility value / volatility graph]
19
Volatility profile: Volatility vs sampling period

[Chart: volatility vs sampling period (millions of samples) for 164.gzip]
20
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
21
Methodology
• Source translator (A): C-Breeze
• Compiler (B): Alpha GEM cc
• Simulator (C): sim-alpha
  – Validated model of the 21264 pipeline
• Simulated machine: Alpha 21264
  – 4-way issue, 64KB 3-cycle DL1
• Benchmarks: 12 C applications from the SPEC CPU2000 suite
22
Major data structures by DL1 misses
[Chart: % DL1 misses from the top three data structures (#1, #2, #3) in each benchmark: 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, 300.twolf]
23
Most misses ⇒ most pipeline stalls?

• Process:
  – Detect stall cycles, i.e. cycles in which no instructions were committed
  – Assign blame to the data structure of the oldest instruction in the pipeline
• Results:
  – Stall-cycle ranks track miss-count ranks
  – Exceptions:
    • tds in 179.art
    • search in 186.crafty
24
Types of phase behavior
[Charts: DL1 misses (millions) over 115 billion cycles for I. mcf and II. art]
25
Types of phase behavior (contd.)

[Chart: DL1 misses (millions) over time for III. mesa]
26
Summary
• More detailed metrics ⇒ richer application comparison
• Low-overhead data structure decomposition
• Determining the ideal sampling period
  – A volatility metric inspired by spectral analysis
• The ideal sampling period is application-specific
• Data structures in an application share common phase boundaries