
Page 1: Decomposing Memory Performance: Data Structures and Phases

Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley

Department of Computer Sciences, The University of Texas at Austin

Page 2: Memory hierarchy trends

• Growing latency to main memory
• Growing cache complexity
  – More cache levels
  – New mechanisms, optimizations
• Growing application complexity
  – Lots of abstraction

Application-system interactions increasingly hard to predict

Page 3: The solution: more fine-grained metrics

• More insight within an application
• More rigorous comparisons across applications
• Potential applications:
  – Hardware/software tuning
  – Global hints for online phase detection

Our approach: data structure decomposition

High-level, easy to understand

Highlights important access patterns

Page 4: ammp vs. twolf: The tale of two applications

Conventional view: they're pretty similar
• IPC: 0.57 vs. 0.51
• DL1 miss rate: 10% vs. 9.5%
• Access patterns
  – Lots of pointer access in both
  – Mostly linked-list traversal

Page 5: ammp vs. twolf: Data structure decomposition

[Stacked bar chart: DL1 misses (%), 0–100, for ammp and twolf, each broken down into DS1, DS2, DS3, and Rest]

Page 6: ammp vs. twolf: Access patterns

twolf:
    i = rand()
    t1 = b[c[i]->cblock]
    t2 = t1->tileterm
    t3 = n[t2->net]
    …

ammp:
    atom = atom->next
    atom[i]->neighbour[j]; ++j; ++i

twolf has more complex access patterns
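To make the contrast concrete, here is a minimal, self-contained C sketch of the two access styles; the struct layouts and names (atom_t, cell_t, tile_t, term_t, net_t) are simplified stand-ins invented for illustration, not the actual SPEC CPU2000 sources.

    /* Sketch only: simplified stand-ins for the real ammp/twolf structures. */
    #include <stdlib.h>

    /* ammp-style: regular traversal of a linked list of atoms. */
    typedef struct atom { struct atom *next; double charge; } atom_t;

    double ammp_style(atom_t *head) {
        double sum = 0.0;
        for (atom_t *atom = head; atom != NULL; atom = atom->next)  /* atom = atom->next */
            sum += atom->charge;
        return sum;
    }

    /* twolf-style: data-dependent chained indirection from a random index. */
    typedef struct term { int net; }          term_t;
    typedef struct tile { term_t *tileterm; } tile_t;
    typedef struct cell { int cblock; }       cell_t;
    typedef struct net  { int cost; }         net_t;

    int twolf_style(cell_t **c, tile_t **b, net_t **n, int range) {
        int i = rand() % range;            /* i  = rand()           */
        tile_t *t1 = b[c[i]->cblock];      /* t1 = b[c[i]->cblock]  */
        term_t *t2 = t1->tileterm;         /* t2 = t1->tileterm     */
        net_t  *t3 = n[t2->net];           /* t3 = n[t2->net]       */
        return t3->cost;                   /* each step depends on the previous load */
    }

The ammp walk follows one predictable dependence chain, while each twolf step dereferences a pointer whose target is known only after the previous load completes, which is what makes its access pattern more complex.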

Page 7: ammp vs. twolf: Phase behavior

[Time-series plots of total DL1 misses per interval (millions) over roughly 60 billion cycles: ammp (y-axis 0–18) and twolf (y-axis 0–9)]

Page 8: ammp vs. twolf: Phase behavior by data structure

[The same time series broken down by data structure: ammp shows total, atoms, and nodelist; twolf shows total, netptr, and tmp_rows (DL1 misses, millions)]

ammp has more interesting phase behavior

Page 9: Outline

• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase

Page 10: Data structure decomposition

Application communicates with simulator

Leave core application oblivious; automatically add simulator-aware instrumentation

[Diagram: Application, Simulator, Resources]

Page 11: DTrack

[Toolchain diagram: Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics]

- DTrack’s protocol for application-simulator communication

Page 12: DTrack's protocol

1. Application stores a mapping at a predetermined shared location
   – (start address, end address) → variable name
2. …and signals the simulator with a special opcode
   • Other techniques possible
3. Simulator detects the signal and reads the shared location

Simulator now knows the variable names of address regions (see the sketch below)
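A minimal sketch of what the application side of this protocol could look like; the struct layout, the mailbox variable, and the signaling instruction are hypothetical placeholders, since the real encoding is specific to DTrack and sim-alpha.

    /* Sketch only: hypothetical names and layout, not DTrack's actual encoding. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t start;       /* first byte of the region    */
        uint64_t end;         /* last byte of the region     */
        char     name[64];    /* source-level variable name  */
    } dtrack_mapping_t;

    /* Predetermined shared location that the simulator knows how to find. */
    static dtrack_mapping_t dtrack_mailbox;

    static void dtrack_register(void *start, void *end, const char *name) {
        /* Step 1: store the (start address, end address) -> name mapping. */
        dtrack_mailbox.start = (uint64_t)(uintptr_t)start;
        dtrack_mailbox.end   = (uint64_t)(uintptr_t)end;
        strncpy(dtrack_mailbox.name, name, sizeof dtrack_mailbox.name - 1);

        /* Step 2: signal the simulator. On native hardware this is a no-op;
         * under the simulator an agreed-upon marker (e.g. a reserved opcode)
         * triggers step 3, reading the mailbox. Placeholder only: */
        /* asm volatile("<special opcode>"); */
    }

For example, after allocating an array, a call such as dtrack_register(atoms, atoms + n, "atoms") would let the simulator attribute every miss in that address range to atoms.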

Page 13: Instrumentation without perturbance

• Global segment: write to file
  – Expensive, but a one-time cost during initialization
  – Amortized across all global variables
• Heap: save in special variables after every malloc/free (see the sketch below)
  – Overhead ∝ frequency of mallocs/frees
  – Special variables always hit in cache
• Stack: no instrumentation
  – Function calls too frequent
  – Causes negligible misses anyway
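A minimal sketch of the heap-side idea, assuming a wrapper substituted for malloc; the variable and function names here are illustrative, not DTrack's generated instrumentation.

    /* Sketch only: hypothetical wrapper; DTrack's generated code differs in detail.
     * Every allocation writes its bounds and a name into a few designated
     * variables that the simulator watches; being touched constantly, those
     * variables stay resident in the cache, keeping the overhead small. */
    #include <stdlib.h>

    static void       *dtrack_last_start;
    static void       *dtrack_last_end;
    static const char *dtrack_last_name;

    static void *dtrack_malloc(size_t size, const char *name) {
        void *p = malloc(size);
        if (p != NULL) {
            dtrack_last_start = p;
            dtrack_last_end   = (char *)p + size;
            dtrack_last_name  = name;   /* e.g. the variable at the call site */
        }
        return p;
    }

    /* The source translator might rewrite, say,
     *     atoms = malloc(n * sizeof(ATOM));
     * into
     *     atoms = dtrack_malloc(n * sizeof(ATOM), "atoms");
     */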

Page 14: Measuring perturbance

• Communicate specific start and end points in application to simulator

• Compare instruction counts between them with and without instrumentation

ΔInstruction count < 4%, even with frequent mallocs
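One plausible shape for the region markers mentioned above, again using hypothetical names and the same signaling idea as the registration sketch; the real mechanism is DTrack-specific.

    /* Sketch only: hypothetical markers. Each call signals the simulator,
     * which snapshots its committed-instruction counter; running once with
     * and once without instrumentation and comparing the two counts gives
     * the perturbance between the same two program points. */
    static void dtrack_region_begin(void) { /* signal simulator: begin */ }
    static void dtrack_region_end(void)   { /* signal simulator: end   */ }

    void measured_region(void) {
        dtrack_region_begin();
        /* ... the application code being measured ... */
        dtrack_region_end();
    }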

Page 15: Outline

• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase

Page 16: The importance of sampling period

Good sampling period → low noise

[Two time-series plots of DL1 misses (thousands) for the same run, one sampled every 10M cycles and one every 230M cycles]

Page 17: Volatility: a noise metric for time-sequence graphs

[Flow diagram: raw data stream → aggregate over some sampling period → miss graph → point volatilities → sort, extract 90th percentile → volatility value]

Point volatility = |X_t − X_{t-1}| / max(X_t, X_{t-1})
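A direct reading of this definition in C, with names of our own choosing (the talk does not show DTrack's implementation):

    /* Sketch only: computes per-interval point volatilities from a miss
     * graph and returns their 90th percentile, per the formula above. */
    #include <math.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* misses[i] = DL1 misses in interval i; n = number of intervals */
    double volatility(const double *misses, int n) {
        if (n < 2) return 0.0;
        double *pv = malloc((n - 1) * sizeof *pv);
        if (pv == NULL) return 0.0;
        for (int t = 1; t < n; ++t) {
            double hi = fmax(misses[t], misses[t - 1]);
            pv[t - 1] = (hi > 0.0) ? fabs(misses[t] - misses[t - 1]) / hi : 0.0;
        }
        qsort(pv, n - 1, sizeof *pv, cmp_double);      /* ascending       */
        double v = pv[(int)(0.9 * (n - 1))];           /* 90th percentile */
        free(pv);
        return v;
    }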

Page 18: Volatility depends on sampling period

[Flow diagram: the same pipeline (raw data stream → aggregate → point volatilities → volatility value), repeated once per candidate sampling period; plotting the volatility value against the sampling period gives the volatility graph]
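Continuing the previous sketch, a volatility profile could be built by re-aggregating a finely sampled miss stream at each candidate period and calling volatility() on each aggregation; the interface below is again our own illustration.

    /* Sketch only: builds a volatility profile over several sampling periods.
     * raw[i] holds the misses in the i-th base interval (e.g. per 10M cycles);
     * each candidate period is a whole number of base intervals. */
    #include <stdio.h>
    #include <stdlib.h>

    double volatility(const double *misses, int n);   /* from the sketch above */

    void volatility_profile(const double *raw, int n_base,
                            const int *periods, int n_periods) {
        for (int p = 0; p < n_periods; ++p) {
            int k = periods[p];                  /* base intervals per sample */
            int n = n_base / k;
            double *agg = malloc(n * sizeof *agg);
            if (agg == NULL || n < 2) { free(agg); continue; }
            for (int i = 0; i < n; ++i) {        /* sum k base intervals      */
                agg[i] = 0.0;
                for (int j = 0; j < k; ++j)
                    agg[i] += raw[i * k + j];
            }
            printf("period = %d  volatility = %.3f\n", k, volatility(agg, n));
            free(agg);
        }
    }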

Page 19: Volatility profile: volatility vs. sampling period

[Plot for 164.gzip: volatility (0–1) against sampling period (millions of samples, 0–480)]

Page 20: Outline

• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase

Page 21: Methodology

• (A) Source translator: C-Breeze
• (B) Compiler: Alpha GEM cc
• (C) Simulator: sim-alpha
  – Validated model of the 21264 pipeline
• Simulated machine: Alpha 21264
  – 4-way issue, 64KB 3-cycle DL1
• Benchmarks: 12 C applications from the SPEC CPU2000 suite

Page 22: Major data structures by DL1 misses

[Stacked bar chart: % of DL1 misses accounted for by the top three data structures (#1, #2, #3) in 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, and 300.twolf]

Page 23: Most misses ≡ most pipeline stalls?

• Process:

– Detect stall cycles when no instructions were committed

– Assign blame to the data structure of the oldest instruction in the pipeline (sketched below)

• Results
  – Stall-cycle ranks track miss-count ranks
  – Exceptions:
    • tds in 179.art
    • search in 186.crafty
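A minimal sketch of that blame-assignment step, using hypothetical simulator-side structures (sim-alpha's real pipeline bookkeeping differs); each in-flight instruction is assumed to already carry the id of the data structure its address falls in.

    /* Sketch only: hypothetical names; illustrates the rule "if nothing
     * committed this cycle, charge the stall to the data structure of the
     * oldest instruction still in the pipeline". */
    #include <stdint.h>

    typedef struct {
        int valid;
        int ds_id;      /* data structure owning the address this op touches */
    } inflight_t;

    void account_stall(int committed_this_cycle,
                       const inflight_t *rob, int rob_head, int rob_size,
                       uint64_t stall_cycles_by_ds[]) {
        if (committed_this_cycle)
            return;                                   /* not a stall cycle  */
        for (int i = 0; i < rob_size; ++i) {          /* scan oldest first  */
            const inflight_t *e = &rob[(rob_head + i) % rob_size];
            if (e->valid) {
                stall_cycles_by_ds[e->ds_id]++;       /* blame its structure */
                return;
            }
        }
    }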

Page 24: Types of phase behavior

[Time-series plots of DL1 misses (millions) over 115 billion cycles: I. mcf (0–40) and II. art (0–30)]

Page 25: Types of phase behavior

[Time-series plot of DL1 misses (millions, 0–0.8) over cycles: III. mesa]

Page 26: Summary

• More detailed metrics → richer application comparisons
• Low-overhead data structure decomposition
• Determining the ideal sampling period
  – A volatility metric inspired by spectral analysis
• The ideal sampling period is application-specific
• Data structures in an application share common phase boundaries