Post on 22-Dec-2015

ABACUS: A Hardware-Based Software Profiler for Modern Processors

Eric Matthews • Lesley Shannon

School of Engineering Science

Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova

School of Computing Science

Simon Fraser University, Vancouver, BC, Canada

Overview

Legendary Introduction to ABACUS

Delicious Profiling Units

Epic Conclusion

Introduction to ABACUS

ABACUS

Performance comparison

Memory Reuse Profile

ABACUS avg runtime: 48.5 seconds

Simics avg runtime: 1 hour 6 minutes

[Figure: memory reuse profiles for namd and hmmer, ABACUS vs. Simics. Categories: miss, Reuse 0, Reuse 1. Counts in millions.]

Conclusion

ABACUS is a generic profiler that can be easily integrated into modern processors

It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments

Thank you! Questions?

Motivation

Future systems will be multi-core and heterogeneous

How does the OS place threads on this architecture?

Characterize thread behaviour

Instruction Mix

Memory Reuse Profile

Effectiveness of pre-fetching

Memory bandwidth utilization

Motivation (cont'd)

How are these metrics collected?

Offline analysis

Code Instrumentation

Simulation (e.g., Simics)

Software-based instruction set simulator

Models systems with full OS support

Motivation (cont'd)

Why not use current hardware counters?

Architecture-specific

Not all desired metrics provided

Help detect symptoms, not causes

Limited in number and in concurrent use

Goal

Create a hardware profiler to collect thread characteristics at runtime

Imposed constraints

External to processor

Minimally invasive

Cycle accurate

OS controllable

ABACUS

hArdware-Based Analyzer for the Characterization of User Software

A collection of runtime configurable profiling units

Collects metrics useful for thread placement

Controllable through the O/S

Hardware Platform

Proof-of-concept System

LEON3 (SPARC v8 Instruction Set Architecture)

Single core, single threaded

Test System

OpenSPARC T1 (Niagara) soft processor

1 to 4 hardware threads

Multi-core, multi-board support

External Interface

Bus slave and master modules

Processing required on processor signals

Designed such that only external interface changes with different processor/system

Portability

Previously integrated with a LEON3 (SPARC v8 ISA) based system

Differences:

AMBA Advanced High-performance Bus (AHB) vs Processor Local Bus (PLB)

Processor internals

Controller

Starts or stops profiling

Can limit profiling to a specific address range

DMA interface for retrieving collected data

Linux device driver support
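A minimal software sketch of the controller behaviour listed above (start/stop plus an optional address-range limit). The class, its method names, and the inclusive range convention are assumptions made for illustration; they are not the ABACUS register map or driver interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProfilerController:
    """Hypothetical model: gates which program-counter values are profiled."""
    enabled: bool = False
    # Inclusive (low, high) address limit; None means no limit (assumed convention).
    addr_range: Optional[Tuple[int, int]] = None
    samples: List[int] = field(default_factory=list)

    def start(self) -> None:
        self.enabled = True

    def stop(self) -> None:
        self.enabled = False

    def set_range(self, low: int, high: int) -> None:
        if low > high:
            raise ValueError("low must not exceed high")
        self.addr_range = (low, high)

    def accepts(self, pc: int) -> bool:
        """True when profiling is running and pc lies inside the range."""
        if not self.enabled:
            return False
        if self.addr_range is None:
            return True
        low, high = self.addr_range
        return low <= pc <= high

    def observe(self, pc: int) -> None:
        """Forward pc to the (modelled) profiling units only if accepted."""
        if self.accepts(pc):
            self.samples.append(pc)
```

Here `samples` stands in for whatever the profiling units would do with an accepted event; in hardware the collected counters would be read back over the DMA interface rather than stored as a list.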

Profiling Units

Operate on one or more processor signals:

Instruction

PC

Cache Reuse Distance

etc.

Store data in a collection of counters

Profiling Units (cont'd)

Focus on two-dimensional metrics

– Gives bigger picture / greater insight

Aim to be as architecture independent as possible

Profile Unit

Behaves like a traditional software profiler

Operates on Program Counter
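One way a program-counter profiler can work is to count PC values per bucket of the code space, as traditional software profilers do. The sketch below assumes equal-sized buckets over a single range; the function name and parameters are hypothetical, and the slides do not state how the Profile Unit partitions addresses.

```python
def pc_histogram(pcs, base, size, num_buckets):
    """Count PC values per equal-sized bucket of [base, base + size).

    Returns (counts, outside), where outside is the number of PC values
    that fell outside the profiled range.
    """
    if size <= 0 or num_buckets <= 0:
        raise ValueError("size and num_buckets must be positive")
    counts = [0] * num_buckets
    outside = 0
    for pc in pcs:
        offset = pc - base
        if 0 <= offset < size:
            # Integer scaling maps the offset onto a bucket index.
            counts[offset * num_buckets // size] += 1
        else:
            outside += 1
    return counts, outside
```

For example, `pc_histogram(trace, 0x1000, 0x400, 4)` splits a 1 KiB code region into four 256-byte buckets.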

[Figure: profiling ranges over the code space. Labels: Range Overlap, Trace, Range Non-Overlap.]

Memory Reuse Unit

Collects a measure of code or data reuse

Utilizes a Least Recently Used (LRU) stack

Reuse distance is the movement in the LRU stack, or a miss

Useful in cache contention management

Memory Reuse Unit (cont'd)

Creates histogram of cache reuse pattern

Range: [0, set associativity – 1] or cache miss
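The reuse histogram can be sketched in software with one LRU stack per cache set: a hit is recorded at the stack position where the line was found (0 to associativity minus 1), and anything else counts as a miss. The class name, constructor parameters, and the simple modulo set indexing are assumptions for illustration, not the hardware design.

```python
from collections import defaultdict


class ReuseProfiler:
    """Hypothetical model of a per-set LRU reuse-distance histogram."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.stacks = defaultdict(list)  # set index -> tags, most recent first
        self.hist = [0] * ways           # hist[d] = hits at reuse distance d
        self.misses = 0

    def access(self, addr):
        """Record one access; return its reuse distance, or None on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        stack = self.stacks[index]
        if tag in stack:
            distance = stack.index(tag)  # position in the LRU stack
            self.hist[distance] += 1
            stack.pop(distance)
        else:
            distance = None
            self.misses += 1
            if len(stack) == self.ways:
                stack.pop()              # evict the least recently used tag
        stack.insert(0, tag)             # accessed line becomes most recent
        return distance
```

With a 2-way configuration this yields the three categories seen in the comparison plots: miss, reuse distance 0, and reuse distance 1.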

[Figure: 4-way set-associative reuse profile. Axis: reuse distance.]

Instruction Mix

Identify current instruction subset in use

Divide instructions into logical categories

Load/Store

Floating Point

Control Flow

Opcode-based table lookup
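An opcode-based lookup for SPARC v8 (the LEON3's ISA) can be sketched as follows, using the `op`, `op2`, and `op3` fields of the instruction word. The category names and the decision of which opcodes fall into which category are assumptions made for this example; the slides do not list the actual table contents.

```python
from collections import Counter

# Assumed category per field value; anything not listed falls into "other".
OP2_CATEGORY = {  # op == 0: branch formats
    2: "control_flow",  # Bicc
    6: "control_flow",  # FBfcc
    7: "control_flow",  # CBccc
}
OP3_CATEGORY = {  # op == 2: arithmetic/control formats
    0x34: "floating_point",  # FPop1
    0x35: "floating_point",  # FPop2
    0x38: "control_flow",    # JMPL
    0x39: "control_flow",    # RETT
    0x3A: "control_flow",    # Ticc
}


def classify(word):
    """Map a 32-bit SPARC v8 instruction word to a category."""
    op = (word >> 30) & 0x3
    if op == 3:                      # all memory instructions
        return "load_store"
    if op == 1:                      # CALL
        return "control_flow"
    if op == 0:
        return OP2_CATEGORY.get((word >> 22) & 0x7, "other")
    return OP3_CATEGORY.get((word >> 19) & 0x3F, "other")


def instruction_mix(words):
    """Count instructions per category for a sequence of instruction words."""
    return Counter(classify(w) for w in words)
```

Keeping the mapping in tables rather than in branching logic mirrors the table-lookup idea: retargeting to another ISA would mean replacing the tables and the field extraction, not the counting.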

Latency Unit

Break down miss latency into constituent sources

Bus contention

DRAM latency

etc.

For each category create a histogram of latency in cycles
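The per-category latency histogram can be modelled with one array of bins per latency source. The bin width, the number of bins, and the saturating last bin are assumptions chosen for this sketch; the source names are simply whatever labels the caller passes in.

```python
from collections import defaultdict


class LatencyHistogram:
    """Hypothetical model: one histogram of latency (in cycles) per source."""

    def __init__(self, bin_width, num_bins):
        if bin_width <= 0 or num_bins <= 0:
            raise ValueError("bin_width and num_bins must be positive")
        self.bin_width = bin_width
        self.num_bins = num_bins
        self.bins = defaultdict(lambda: [0] * num_bins)

    def record(self, source, cycles):
        """Add one observed latency to the histogram for the given source."""
        # Latencies beyond the covered range accumulate in the last bin.
        index = min(cycles // self.bin_width, self.num_bins - 1)
        self.bins[source][index] += 1
```

A call such as `h.record("dram", 25)` with a bin width of 10 increments the third bin of the DRAM histogram.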

Stall Unit

Break down Cycles Per Instruction

Attribute cycles to their sources

Cache miss

Translation Lookaside Buffer (TLB) miss

Floating Point busy stalls

etc.
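The breakdown amounts to splitting CPI = total cycles / instructions into one term per stall source plus the remaining cycles. A small sketch of that arithmetic, assuming the stall counters do not overlap; the function name and the "base" key for unattributed cycles are hypothetical.

```python
def cpi_breakdown(instructions, total_cycles, stall_cycles):
    """Split cycles per instruction into per-source contributions.

    stall_cycles maps a source name to the cycles attributed to it.
    Cycles not attributed to any source are reported under "base".
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    stalled = sum(stall_cycles.values())
    if stalled > total_cycles:
        raise ValueError("attributed stall cycles exceed total cycles")
    breakdown = {src: cycles / instructions for src, cycles in stall_cycles.items()}
    breakdown["base"] = (total_cycles - stalled) / instructions
    return breakdown
```

The values in the returned dictionary sum to the overall CPI, so each entry reads directly as that source's share of cycles per instruction.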

Verification

Run a subset of the SPEC CPU2006 benchmarks

Those with memory usage within board specs

Collect metrics with ABACUS and Simics

Profile for a few billion instructions

Limited by Simics performance

Test Platform

Proof-of-concept System

Single core, single threaded

XUP V2Pro: 90% slice utilization

Processor: LEON3 (SPARC v8 ISA), 50 MHz

Memory: 256 MB DDR RAM

OS: Debian Etch (4.0)

Simulation Platform

Simics System:

Differences:

SPARC v9 ISA (64-bit processor)

Local filesystem vs NFS

Processor: UltraSPARC II (SPARC v9 ISA)

Memory: 256 MB DDR RAM

OS: Debian Etch (4.0)

LEON3 Comparison

[Figure: reuse histograms for namd and hmmer, ABACUS vs. Simics. Categories: miss, Reuse 0, Reuse 1. Counts in millions.]

LEON3 Comparison (cont'd)

DC Memory Reuse Profile

[Figure: reuse histograms for namd and hmmer, ABACUS vs. Simics. Categories: miss, Reuse 0, Reuse 1. Counts in millions.]

Resource Usage

Default:

2-way LRU Instruction Cache

2-way LRU Data Cache

5 Instruction Types

[Figure: bar chart of resource usage. Series: LUT (V2p), LUT (V5), FF. Configurations: 32-bit counters, 40-bit counters, 32-bit counters with Profile Unit added. Axis range: 0 to 1600.]

Future Plans

Move to multi-core/multi-threaded system

Memory reuse distance independent of existing cache implementation

Process tracking

Integrate results into OS scheduler

Questions?
