Post on 22-Dec-2015
ABACUS: A Hardware-Based Software Profiler for Modern Processors
Eric Matthews • Lesley Shannon
School of Engineering Science
Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova
School of Computing Science
Simon Fraser University, Vancouver, BC, Canada
Overview
Legendary Introduction to ABACUS
Delicious Profiling Units
Epic Conclusion
Introduction to ABACUS
ABACUS
ASPLOSrocks!
Performance comparison
Memory Reuse Profile
ABACUS avg runtime: 48.5 seconds
Simics avg runtime: 1 hour 6 minutes
[Figure: memory reuse profiles for namd and hmmer, ABACUS vs. Simics; x-axis: reuse distance (miss, Reuse 0, Reuse 1); y-axis: counts (in millions)]
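As a rough reading of the two average runtimes reported on this slide, the ratio can be worked out as below. This is only arithmetic on the quoted figures; the variable names are illustrative, and the result says nothing about how the two platforms compare beyond these averages.

```python
# Average runtimes as quoted on the slide, converted to seconds.
abacus_seconds = 48.5
simics_seconds = 1 * 3600 + 6 * 60  # 1 hour 6 minutes

# Ratio of the Simics average to the ABACUS average.
speedup = simics_seconds / abacus_seconds
print(f"Simics / ABACUS runtime ratio: about {speedup:.1f}x")
```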
Conclusion
ABACUS is a generic profiler that can be easily integrated into modern processors
It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments
Thank you! Questions?
Motivation
Future systems will be multi-core and heterogeneous
How does the OS place threads on this architecture?
Characterize thread behaviour
Instruction Mix
Memory Reuse Profile
Effectiveness of pre-fetching
Memory bandwidth utilization
Motivation (cont'd)
How are these metrics collected?
Offline analysis
Code Instrumentation
Simulation (e.g., Simics)
Software-based instruction set simulator
Models systems with full OS support
Motivation (cont'd)
Why not use current hardware counters?
Architecture-specific
Not all desired metrics provided
Help detect symptoms, not causes
Limited in number and in concurrent use
Goal
Create a hardware profiler to collect thread characteristics at runtime
Imposed constraints
External to processor
Minimally invasive
Cycle accurate
OS controllable
ABACUS
hArdware-Based Analyzer for the Characterization of User Software
A collection of runtime configurable profiling units
Collects metrics useful for thread placement
Controllable through the O/S
Hardware Platform
Proof-of-concept System
LEON3 (SPARC v8 Instruction Set Architecture)
Single core, single threaded
Test System
OpenSPARC T1 (Niagara) soft processor
1 to 4 hardware threads
Multi-core, multi-board support
Hardware Platform (cont'd)
ABACUS
External Interface
Bus slave and master modules
Processing required on processor signals
Designed such that only external interface changes with different processor/system
Portability
Previously integrated with a LEON3 (SPARC v8 ISA) based system
Differences:
AMBA Advanced High-performance Bus (AHB) vs Processor Local Bus (PLB)
Processor internals
Controller
Starts or stops profiling
Can limit profiling to a specific address range
DMA interface for retrieving collected data
Linux device driver support
Profiling Units
Operate on one or more processor signals:
Instruction
PC
Cache Reuse Distance
etc.
Store data in a collection of counters
Profiling Units (cont'd)
Focus on two-dimensional metrics
– Gives bigger picture / greater insight
Aim to be as architecture independent as possible
Profile Unit
Behaves like a traditional software profiler
Operates on Program Counter
[Figure: a trace mapped against address ranges in the code space, with regions labelled Overlap and Non-Overlap]
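One way to picture a profiler that operates on the Program Counter is as a set of per-range counters over the code space. The sketch below is a software model only, not the ABACUS hardware design: the function name `profile_pcs`, the range layout, and the handling of PCs below the first range are all assumptions made for illustration.

```python
import bisect

def profile_pcs(pcs, range_starts):
    """Count PC samples per address range (illustrative model).

    Range i covers [range_starts[i], range_starts[i + 1]); the last range
    is open-ended. range_starts must be sorted in ascending order.
    Returns (counts_per_range, samples_below_first_range).
    """
    counts = [0] * len(range_starts)
    below = 0
    for pc in pcs:
        # Index of the last range start that is <= pc, or -1 if none.
        i = bisect.bisect_right(range_starts, pc) - 1
        if i < 0:
            below += 1
        else:
            counts[i] += 1
    return counts, below

# Hypothetical trace and range boundaries.
counts, below = profile_pcs(
    [0x1000, 0x1FFC, 0x2000, 0x0FFC, 0x4000],
    [0x1000, 0x2000, 0x3000],
)
print(counts, below)
```

The per-range counts play the role of the function or region histogram that a traditional software profiler would report.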
Memory Reuse Unit
Collects a measure of code or data reuse
Utilizes Least Recently Used (LRU) stack
Reuse distance is movement in the LRU stack or a miss
Useful in cache contention management
Memory Reuse Unit
Creates histogram of cache reuse pattern
Range: [0, set associativity - 1] or cache miss
[Figure: 4-way set-associative reuse profile; x-axis: reuse distance]
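The reuse histogram described above can be modelled in software with one LRU stack per cache set. This is a minimal sketch under stated assumptions, not the hardware implementation: the function name `reuse_histogram` and the parameters `num_sets`, `ways`, and `line_bytes` are hypothetical, with `ways=4` chosen to match the 4-way example.

```python
from collections import Counter

def reuse_histogram(addresses, num_sets=4, ways=4, line_bytes=32):
    """Histogram of LRU stack positions for a stream of addresses.

    Keys 0..ways-1 count hits found at that depth of the set's LRU stack;
    the key "miss" counts accesses to lines not present in the set.
    """
    stacks = [[] for _ in range(num_sets)]  # index 0 = most recently used
    hist = Counter()
    for addr in addresses:
        line = addr // line_bytes
        stack = stacks[line % num_sets]
        if line in stack:
            hist[stack.index(line)] += 1  # reuse distance = stack depth
            stack.remove(line)
        else:
            hist["miss"] += 1
            if len(stack) == ways:
                stack.pop()  # evict the least recently used line
        stack.insert(0, line)  # accessed line becomes most recently used
    return hist

# Addresses 0 and 128 map to the same set with the defaults above.
print(reuse_histogram([0, 0, 128, 0]))
```

A profile dominated by small distances suggests the thread would tolerate a smaller share of the cache, which is the kind of signal that matters for contention management.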
Instruction Mix
Identify current instruction subset in use
Divide instructions into logical categories
Load/Store
Floating Point
Control Flow
Opcode-based table lookup
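An opcode-based table lookup can be sketched as follows for 32-bit SPARC v8 instruction words. The field positions (op in bits 31:30, op2 in bits 24:22, op3 in bits 24:19) follow the SPARC v8 encoding, but the category names, the choice of which opcodes fall into each category, and the catch-all "other" class are assumptions for illustration rather than the tables ABACUS uses.

```python
from collections import Counter

# Format 2 (op == 0): op2 values for branch instructions.
OP2_TABLE = {2: "control_flow", 6: "control_flow", 7: "control_flow"}  # Bicc, FBfcc, CBccc
# Format 3 (op == 2): selected op3 values.
OP3_TABLE = {
    0x34: "floating_point",  # FPop1
    0x35: "floating_point",  # FPop2
    0x38: "control_flow",    # JMPL
    0x39: "control_flow",    # RETT
    0x3A: "control_flow",    # Ticc
}

def classify(word):
    """Map a 32-bit SPARC v8 instruction word to a coarse category."""
    op = (word >> 30) & 0x3
    if op == 3:
        return "load_store"      # all memory instructions
    if op == 1:
        return "control_flow"    # CALL
    if op == 0:
        return OP2_TABLE.get((word >> 22) & 0x7, "other")
    return OP3_TABLE.get((word >> 19) & 0x3F, "other")

def instruction_mix(words):
    """Count instructions per category."""
    return Counter(classify(w) for w in words)

# CALL, a load, an FPop1 instruction, a Bicc branch, and NOP (SETHI).
print(instruction_mix([0x40000000, 0xC0000000, 0x81A00000, 0x00800000, 0x01000000]))
```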
Latency Unit
Break down miss latency into constituent sources
Bus contention
DRAM latency
etc.
For each category create a histogram of latency in cycles
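The per-category latency histograms could be modelled as below. This is a sketch under assumptions: the event format `(source, cycles)`, the source labels, and the fixed-width bins with a saturating last bin are hypothetical choices, since the slides do not specify the binning.

```python
from collections import defaultdict

def latency_histograms(events, bin_width=8, num_bins=8):
    """Build one latency histogram per miss-latency source.

    events: iterable of (source, cycles) pairs.
    Bin i counts latencies in [i * bin_width, (i + 1) * bin_width);
    the last bin also absorbs anything larger.
    """
    hists = defaultdict(lambda: [0] * num_bins)
    for source, cycles in events:
        bin_index = min(cycles // bin_width, num_bins - 1)
        hists[source][bin_index] += 1
    return dict(hists)

# Hypothetical events: two bus-contention delays and one DRAM access.
print(latency_histograms([("bus", 3), ("bus", 9), ("dram", 100)]))
```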
Stall Unit
Break down Cycles Per Instruction
Attribute cycles to their sources
Cache miss
Translation Lookaside Buffer (TLB) miss
Floating Point busy stalls
etc.
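Breaking down Cycles Per Instruction amounts to labelling each cycle with its source and dividing the per-source totals by the instruction count. The sketch below assumes a hypothetical per-cycle label stream and label names; the real unit derives this from processor signals.

```python
from collections import Counter

def cpi_breakdown(cycle_sources, instructions):
    """Attribute cycles to sources and express each as a CPI component.

    cycle_sources: one label per elapsed cycle (e.g. "busy", "cache_miss").
    instructions: number of instructions retired over the same interval.
    The components sum to the overall CPI.
    """
    counts = Counter(cycle_sources)
    return {source: n / instructions for source, n in counts.items()}

# Hypothetical interval: 8 cycles, 4 instructions retired.
trace = ["busy", "busy", "cache_miss", "cache_miss",
         "cache_miss", "tlb_miss", "busy", "fp_busy"]
breakdown = cpi_breakdown(trace, instructions=4)
print(breakdown, "total CPI:", sum(breakdown.values()))
```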
Verification
Run a subset of the SPEC CPU2006 benchmarks
Those with memory usage within board specs
Collect metrics with ABACUS and Simics
Profile for a few billion instructions
Limited by Simics performance
Test Platform
Proof-of-concept System
Single core, single threaded
XUP V2Pro: 90% slice utilization
Processor: LEON3 (SPARC v8 ISA), 50 MHz
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)
Simulation Platform
Simics System:
Processor: UltraSPARC II (SPARC v9 ISA)
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)
Differences:
SPARC v9 ISA (64-bit processor)
Local filesystem vs NFS
LEON3 Comparison
[Figure: memory reuse profiles for namd and hmmer, ABACUS vs. Simics; x-axis: reuse distance (miss, Reuse 0, Reuse 1); y-axis: counts (in millions)]
LEON3 Comparison (cont'd)
[Figure: DC Memory Reuse Profile for namd and hmmer, ABACUS vs. Simics; x-axis: reuse distance (miss, Reuse 0, Reuse 1); y-axis: counts (in millions)]
Resource Usage
Default:
2-way LRU Instruction Cache
2-way LRU Data Cache
5 Instruction Types
[Figure: resource usage (axis 0 to 1600) in LUT (V2P), LUT (V5), and FF for three configurations: 32-bit counters, 40-bit counters, and 32-bit counters with Profile Unit added]
Future Plans
Move to multi-core/multi-threaded system
Memory reuse distance independent of existing cache implementation
Process tracking
Integrate results into OS scheduler
Questions?