profile-guided optimization targeting high performance embedded applications

Profile-Guided OptimizationTargeting High Performance

Embedded Applications David Kaeli

Murat Bicer

Efe Yardimci

Center for Subsurface Sensing and

Imaging Systems (CenSSIS)

Northeastern University

Jeffrey Smith

Mercury Computer

Why Consider Using Profile-Guided Optimization?

• Much of the potential performance available on data-parallel systems can not be obtained due to unpredictable control flow and data flow in programs

• Memory system performance continues to dominate the performance of many data-parallel applications

• Program profiles provide clues to the compiler/linker/runtime to:– Enable more aggressive use of interprocedural optimizations– Eliminate bottlenecks in the data flow/control flow and – Improve a program’s layout on the available memory hierarchy

• Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later

Profile Guidance• Obtain run-time profiles in the form of:

– Procedure call graphs, basic block traversals– Program variable value profiles– Hardware performance counters (using PCL)

• Cache and TLB misses, pipeline stalls, heap allocations, synchronization messages

• Utilize run-time profiles as input to:– Provide user feedback (e.g., program visualization)– Perform profile-driven compilation (recompile using

the profile)– Enable dynamic optimization (just-in-time compilation)– Evaluate software testing coverage

Profiling Tools

• Mercury Tools• TATL – Trace Analysis Tool and Library

• Procedure profiles• Gnu gprof

• PowerPC Performance Counters• PCL – Performance Counter Library• PM API – targeting the PowerPC

• Greenhills Compiler • MULTI profiling support

• Custom instrumentation drivers

SAR

Program profile• counter values• program paths• variable values

COMPILER

Feedback

Compile-time Optimizations

Data Parallel Applications

ProgramBinary

Binary-level Optimizations

Feedback

GPR

SoftwareDefinedRadio

MRI

Program run

Target Optimizations

• Compile-time– Aggressive procedure inlining– Aggressive constant propagation

• Program variable specialization • Procedure cloning

– Removal of redundant loads/stores • Link-time

– Code reordering utilizing coloring– Static data reordering

• Dynamic (during runtime)– Heap layout optimization

Memory Performance is Key to Scalability in Data-parallel applications

• The performance gap between processor technology and memory technology continues to grow

• Hierarchical memory systems (multi-level caches) have been used to bridge this gap

• Embedded processing applications place a heavy burden on the supporting memory system

• Applications will need to adapt (potentially dynamically) to better utilize the available memory system

Cache Line Coloring

• Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in a cache

• Can be driven by either statically-generated call graphs or profile data

• Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)

• Can be used with different levels of granularity (procedures, basic blocks) and both intra- and inter- procedurally

Cache Line Coloring Algorithm• Build program call graph

– nodes represent procedures– edges represent calls– edge weight represent call frequencies

• Prune edges based on a threshold value• Sort graph edges and process in decreasing edge

weight order• Place procedures in the cache space, avoiding

color conflicts• Fill in gaps with remaining procedures• Reduces execution time by up to 49% for data

compression algorithms

A

B E

90 40

Data Memory Access

• A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory

• Increases in cache size do not effectively reduce data cache misses caused by heap accesses

• A small number of objects account for a large percentage of heap misses (90/10 rule)

• Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)

vpr

0

5

10

15

20

25

Heap Non-Heap

mis

s ra

te

8k-dm16k-dm32k-2w32k-4w64k-2w64k-4w

em3d

0102030405060708090

Heap Non-heap

mis

s ra

te


equake

0

2

4

6

8

10

12

14

16

Heap Non-Heap

mis

s ra

te


ammp

0

5

10

15

20

25

30

Heap Non-Heap

mis

s ra

te8k-dm16k-dm32k-2w32k-4w64k-2w64k-4w

Miss rates (%) vs. Cache Configurations

Profile-driven Data Layout• We have developed a profile-guided approach

to allocating heap objects to improve heap behavior

• The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently

• Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses

Allocation

• We have developed our own malloc routine which uses a conflict profile to avoid allocating potentially conflicting addresses

• A multi-step allocation algorithm is repeated until a non-conflicting allocation is made

• If all steps produce conflicts, allocation is made within the wilderness region

• If conflicts still occur in the wilderness region, we allocate these conflicting chunks (creating a hole)

• Allocation occurs at the first non-conflicting address after the chunk

• The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)

Runtime improvements over non-optimized heap layout

0

1

2

3

4

5

6

% R

educ

tion

in R

unti

me

Application

vprem3dtsppowertwolfequake

Future Work

• Present algorithms have only been evaluated on uniprocessor platforms

• Follow-on work will target Mercury RACE multiprocessor systems

• Target applications will include:– FM3TR for Software Defined Radio– Steepest Decent Fast Multipole Methods

(SDFMM) and Method for demining applications

“Improving the Performance of Heap-based Memory Access,” E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June 2001.

“Accurate Simulation and Evaluation of Code Reordering,” J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on the Performance Analysis of Systems and Software, May 2000.

“`Model Based Parallel Programming with Profile-Guided Application Optimization,” J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp.85-86.

“Cache Line Coloring Using Real and Estimated Profiles,” A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issues on Tools and Languages, February 1999.

`` Parameter Value Characterization of Windows NT-based Applications,'‘ J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp.142-149.

Related Publications

“Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol.10, No. 2, February 1999, pp. 168-175.

“Memory Architecture Dependent Program Mapping,” B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5, 1999.

“Temporal-based Procedure Reordering for Improved Instruction Cache Performance,” Proc. of the 4th HPCA, Feb. 1998, pp. 244-253.

“Efficient Procedure Mapping Using Cache Line Coloring,” H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI’97, June 1997, pp. 171-182.

“Procedure Mapping Using Static Call Graph Estimation,” Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News, 1997.

Related Publications(also see http://www.ece.neu.edu/info/architecture/publications.html)

profile-guided optimization targeting high performance embedded applications

Documents