


Static Program Analysis for Performance Modeling

Kewen Meng, Boyana Norris
Department of Computer and Information Science
University of Oregon, Eugene, Oregon

{kewen, norris}@cs.uoregon.edu

Abstract—The performance model of an application can provide understanding about its runtime behavior on particular hardware. Such information can be analyzed by developers for performance tuning. However, model building and analysis are frequently ignored during software development until performance problems arise, because they require significant expertise and can involve many time-consuming application runs. In this paper, we propose a fast, accurate, flexible, and user-friendly tool, Mira, for generating intuitive performance models by applying static program analysis, targeting scientific applications running on supercomputers. Our tool parses both the source and binary to estimate performance attributes with better accuracy than considering just source or just binary code. Because our analysis is static, the target program does not need to be executed on the target architecture, which enables users to perform analysis on available machines instead of conducting expensive experiments on potentially scarce resources. Moreover, statically generated models enable performance prediction on non-existent or unavailable architectures. In addition to flexibility, because model generation time is significantly reduced compared to dynamic analysis approaches, our method is suitable for rapid application performance analysis and improvement. We present several benchmark validation results to demonstrate the current capabilities of our approach.

Keywords-HPC; performance; modeling; static analysis

I. INTRODUCTION

Understanding application and system performance plays a critical role in supercomputing software and architecture design. Developers require thorough insight into the application and system to aid their development and improvement. Performance modeling provides software developers with necessary information about bottlenecks and can guide users in identifying potential optimization opportunities.

As the development of new hardware and architectures progresses, the computing capability of high performance computing (HPC) systems continues to increase dramatically. However, along with this rise in computing capability, many applications cannot use the full available computing potential, which wastes a considerable amount of computing power. The inability to fully utilize available computing resources or specific advantages of architectures during application development partially accounts for this waste. Hence, it is important to be able to understand and model program behavior in order to gain more information about a program's bottlenecks and performance potential. Program analysis provides insight into how exactly the instructions are executed in the CPU and how data are transferred in memory, which can be used for further optimization of a program. In this directed research project, we developed an approach for analyzing and modeling programs using primarily static analysis techniques.

Current program performance analysis tools can be categorized into two types: static and dynamic. Dynamic analysis is performed by executing the target program. Runtime performance information is collected through instrumentation or hardware counter sampling. PAPI [1] is a widely used platform-independent interface to hardware performance counters. By contrast, static analysis operates on the source code or binary code without actually executing it. PBound [2] is one of the static analysis tools for automatically modeling program performance. It parses the source code and collects operation information from it to automatically generate parameterized expressions. Combined with architectural information, these can be used to estimate the program's performance on a particular platform. Because PBound solely relies on



the source code, dynamic behaviors like memory allocation or recursion, as well as compiler optimizations, would impact the accuracy of the analysis result. However, compared with the high cost in resources and time of dynamic analysis, static analysis is much faster and requires fewer resources.

While some past research efforts mix static and dynamic analysis to create a performance tool, relatively little effort has been put into purely static performance analysis and increasing its accuracy. Our approach starts from object code, since the optimization behavior of the compiler has non-negligible effects on the analysis accuracy. In addition, object code is language-independent and much closer to the runtime situation. Although object code can provide instruction-level information, it still fails to offer some factors critical for understanding the target program. For instance, it is impossible for object code to provide detailed information about code structures and variable names, which are needed for modeling the program. Therefore, source code is used in our project as a supplement to collect the necessary information. Combining source code and object code, we are able to obtain a more precise description of the program and its possible behavior when running on a CPU, which improves the accuracy of modeling. The output of our project can be used as a way to rapidly explore detailed information about given programs. In addition, because the analysis is machine-independent, our project provides users valuable insight into how programs may run on a particular architecture without purchasing expensive machines. Furthermore, the output of our project can also be applied to create Roofline arithmetic intensity estimates for performance modeling.
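To illustrate, statically derived operation and byte counts can feed directly into a Roofline-style bound. The sketch below is illustrative rather than Mira's actual output; the matmul-like counts and the peak rates are assumed values:

```python
def roofline_estimate(flops, bytes_moved, peak_flops, peak_bw):
    """Attainable performance under the Roofline model.

    flops, bytes_moved: counts as produced by a static model (assumed inputs).
    peak_flops: machine peak in FLOP/s; peak_bw: memory bandwidth in B/s.
    """
    intensity = flops / bytes_moved            # arithmetic intensity, FLOP/byte
    return min(peak_flops, intensity * peak_bw)

# Assumed counts for a naive N x N matmul: 2*N^3 flops, 3 arrays of 8-byte reals.
N = 1024
flops = 2 * N**3
bytes_moved = 3 * 8 * N**2
perf = roofline_estimate(flops, bytes_moved, peak_flops=5e11, peak_bw=1e11)
```

With these assumed peaks the kernel's intensity exceeds the machine balance point, so the estimate is compute-bound at `peak_flops`.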

This report is organized as follows: Section II briefly describes the ROSE compiler framework, the polyhedral model for loop analysis, and performance measurement and analysis tools background. In Section III, we introduce related work on static performance modeling. In Sections IV and V, we discuss our static analysis approach and the tool implementation. Section VI evaluates the accuracy of the generated models on some benchmark codes. Section VII concludes with a summary and future work discussion.

Figure 1: ROSE overview

II. BACKGROUND

A. ROSE Compiler Framework

ROSE [3] is an open-source compiler framework developed at Lawrence Livermore National Laboratory (LLNL). It supports the development of source-to-source program transformation and analysis tools for large-scale Fortran, C, C++, OpenMP and UPC (Unified Parallel C) applications. As shown in Figure 1, ROSE uses the EDG (Edison Design Group) parser and OFP (Open Fortran Parser) as the front-ends to parse C/C++ and Fortran. The front-end produces the ROSE intermediate representation (IR), which is then converted into an abstract syntax tree (AST). ROSE provides users a number of APIs for program analysis and transformation, such as call graph analysis, control flow analysis, and data flow analysis. The wealth of available analyses makes ROSE an ideal tool both for experienced compiler researchers and for tool developers with minimal background to build custom tools for static analysis, program optimization, and performance analysis.

B. Polyhedral Model

The polyhedral model is an intuitive algebraic representation that treats each loop iteration as a lattice point inside the polyhedral space produced by the loop bounds and conditions. Nested loops can be translated into a polyhedral representation if and only if they have affine bounds and conditional expressions, and the polyhedral space generated from them is a convex set. The polyhedral model is well suited for optimizing algorithms whose complexity depends on their structure rather than on the number of elements. Moreover, the polyhedral model can be used to generate a generic representation, depending on loop parameters, that describes the loop iteration domain. In addition to program transformation [4], the polyhedral model is



broadly used for automating optimization and parallelization in compilers (e.g., GLooG [5]) and other tools [6]–[8].
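The lattice-point view can be illustrated with a brute-force count over a small affine iteration domain. A polyhedral library such as isl would compute this symbolically; the domain below is invented for illustration only:

```python
from itertools import product

def lattice_points(constraints, box):
    """Count integer points in a bounding box that satisfy all affine
    constraints. A brute-force stand-in for symbolic polyhedral counting."""
    return sum(1 for p in product(*box) if all(c(p) for c in constraints))

# Iteration domain of: for i in [0,9]: for j in [0,9]: if i + j <= 9
count = lattice_points(
    constraints=[lambda p: p[0] + p[1] <= 9],
    box=[range(0, 10), range(0, 10)],
)
```

Each counted point (i, j) corresponds to one iteration of the nested loop.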

C. Performance Tools

Performance tools are capable of gathering performance metrics either dynamically (instrumentation, sampling) or statically. PAPI [1] is used to access hardware performance counters through both high- and low-level interfaces. The high-level interface supports simple measurement and event-related functionality such as start, stop or read, whereas the low-level interface is designed to deal with more complicated needs. The Tuning and Analysis Utilities (TAU) [9] is another state-of-the-art performance tool that uses PAPI as the low-level interface to gather hardware counter data. TAU is able to monitor and collect performance metrics by instrumentation or event-based sampling. In addition, TAU also has a performance database for data storage, as well as analysis and visualization components such as ParaProf. There are several similar performance tools, including HPCToolkit [10], Scalasca [11], MIAMI [12], gprof [13], and Byfl [14], which can also be used to analyze application or system performance.

III. RELATED WORK

There are two related tools that we are aware of designed for static performance modeling: PBound [2] and Kerncraft [15]. Narayanan et al. introduce PBound for automatically estimating upper performance bounds of C/C++ applications by static compiler analysis. It collects information and generates parametric expressions for particular operations, including memory accesses and floating-point operations, which are combined with user-provided architectural information to compute machine-specific performance estimates. However, it relies on source code analysis and ignores the effects of compiler behavior (e.g., compiler optimization).

Hammer et al. describe Kerncraft, a static performance modeling tool with a focus on the memory hierarchy. It is capable of predicting performance and scaling of loop behavior based on the Roofline [16] or Execution-Cache-Memory (ECM) [17] models. It uses YAML as the file format to describe the low-level architecture, and the Intel Architecture Code Analyzer (IACA) [18] to operate on binaries in order to gather loop-relevant information. However, the reliance on IACA limits the

Figure 2: Mira overview

applicability of the tool, so the binary analysis is restricted to Intel architectures and compilers.

In addition, system simulators can also be used for modeling, for example, the Structural Simulation Toolkit (SST) [19]. However, as a system simulator, SST has a different focus: it simulates the whole system instead of single applications, and it analyzes the interaction among architecture, programming model and communications system. Moreover, simulation is computationally expensive and limits the size and complexity of the applications that can be simulated. Compared with PBound, Kerncraft and Mira, SST is relatively heavyweight, complex, and hardware-focused, which makes it more suitable for exploring architectures than for analyzing application performance.

IV. METHODOLOGY AND IMPLEMENTATION

Our Mira static performance modeling tool is built on top of the ROSE compiler framework, which provides several useful lower-level APIs as front-ends for parsing the source file and disassembling the ELF file. Mira is implemented in C++ and is able to process C/C++ source code as input. The entire tool can be divided into three parts:



Figure 3: Loop structure in AST

• Input Processor - Input parsing and disassembling

• Metric Generator - AST traversal and metric generation

• Model Generator - Model generation in Python

A. Source code and binary representations

The Input Processor is the front-end of our modeling tool, and its primary goal is to process the input and build an AST (abstract syntax tree). It accepts the source code and ELF file as input and generates the ASTs that represent the source code and the ELF binary. The AST represents the structure of the program and enables the modeling tool to locate critical structures such as function bodies, loops, and branches. Furthermore, because the AST also preserves high-level source information, such as variable names, types, the order of statements, and the right/left-hand sides of assignments, it is convenient for the tool to retrieve necessary information and perform analysis. For instance, from the AST it is possible to query all information about the static control part (SCoP) of a loop in order to model it, including the loop initialization, loop condition, and step. In addition, because variable names are preserved, identification of loop indexes is much easier and processing of the variables inside the loop is more accurate.
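ROSE's AST interface is C++, but the kind of query involved can be sketched with Python's own ast module: once the source is a tree, loop nodes and their index variables are directly accessible (the snippet of source is invented for illustration):

```python
import ast

# Parse a small source fragment into an AST, then query its loop structure.
src = """
for i in range(10):
    for j in range(i + 1, 6):
        total = total + a[i][j]
"""
tree = ast.parse(src)

# Walk the tree and collect every loop node and its index variable name,
# analogous to locating SgForStatement nodes in a ROSE AST.
loops = [n for n in ast.walk(tree) if isinstance(n, ast.For)]
index_names = [loop.target.id for loop in loops]
```

The same traversal pattern extends to branches, function bodies, and assignments.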

B. Bridge between source and binary

1) Difference in ASTs: The AST is the output of the front-end part of our modeling tool. After processing the inputs, two ASTs are generated separately from

Figure 4: Partial binary AST

source and binary, representing the structures of the two inputs. Our tool is designed to use information retrieved from these trees to improve the accuracy of the generated models. Therefore, it is necessary to build connections between the two ASTs so that for a structure in the source we can instantly locate the corresponding nodes in the binary.

Although both ASTs are representations of the inputs, they have totally different shapes, node organizations and meanings of nodes. A partial binary AST (representing a function) is shown in Figure 4. Each node of the binary AST describes a syntax element of assembly code, such as SgAsmFunction or SgAsmX86Instruction. As shown in Figure 4, a function in the binary AST is comprised of multiple instructions, while a function is made up of statements in the source AST; hence it is highly likely that one source AST node corresponds to several nodes in the binary AST, which complicates building the connection between them.

2) Line numbers: Because the differences between the two AST structures make it difficult to connect source to binary, an alternate way is needed to make the connection between the ASTs more precise. Inspired by debuggers, line numbers are used in our tool as the bridge to associate source with binary. When we are debugging a program, the debugger knows exactly the source line and column at which to locate errors. In fact, when the -g option is used during program compilation, the compiler inserts debug-related information into the object file for future reference. Most compilers



Table I: DWARF sections and their descriptions

Section           Description
.debug_info       The core DWARF information section
.debug_loc        Location lists used in DW_AT_location attributes
.debug_frame      Call frame information
.debug_abbrev     Abbreviations used in the .debug_info section
.debug_aranges    Lookup table for mapping addresses to compilation units
.debug_line       Line number information
.debug_macinfo    Macro information
.debug_pubnames   Lookup table for mapping object and function names to compilation units
.debug_pubtypes   Lookup table for mapping type names to compilation units
.debug_ranges     Address ranges used in DW_AT_ranges attributes
.debug_str        String table used in .debug_info

and debuggers use DWARF (Debugging With Attributed Record Formats) as the debugging file format to organize the information for source-level debugging. As shown in Table I, DWARF categorizes data into several sections, such as .debug_info, .debug_frame, etc. The .debug_line section stores the line number information.

Knowing the location of the line number information allows us to decode the specific DWARF section to map each line number to the corresponding instruction address. Because line number information in the source AST is already preserved in each node, unlike the binary AST, it can be retrieved directly. After line numbers are obtained from both source and binary, connections are built in each direction. As mentioned in the previous section, a source AST node normally links to several binary AST nodes due to the different meanings of nodes. Specifically, a statement contains several instructions, but an instruction is connected to only one source location. Once a node in the binary AST is associated with a source location, more analysis can be performed. For instance, it is possible to narrow the analysis to a small scope and collect data such as the instruction count and type in a particular block, such as a function body, loop body, or even a single statement.
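Conceptually, the decoded .debug_line section yields a many-to-one mapping from instruction addresses to source lines. A minimal sketch of the lookup, with invented addresses and line numbers:

```python
# Hypothetical excerpt of a decoded .debug_line table: each entry maps an
# instruction address to the source line that produced it.
line_table = [
    (0x4004D6, 3), (0x4004DA, 3), (0x4004DE, 3),  # one statement, three insns
    (0x4004E2, 4), (0x4004E6, 4),
]

def insns_for_line(table, line):
    """All instruction addresses attributed to one source line --
    the many-to-one direction described in the text."""
    return [addr for addr, ln in table if ln == line]

addrs = insns_for_line(line_table, 3)
```

A real implementation would decode the DWARF line-number program (e.g., with a library such as pyelftools) instead of using a literal table.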

C. Generating metrics

The Metric Generator is an important part of the entire tool and has a significant impact on the accuracy of the generated model. It receives the ASTs as inputs from the Input Processor and produces parameterized operation counts that are output as Python code. An AST traversal is needed to collect and propagate necessary information about the specific structures in the program, organizing the program representation appropriately to precisely guide model generation. During the AST traversal, additional information is attached to particular tree nodes as a supplement used for analysis and modeling. For example, a long statement may span several lines; in this case, all the line numbers are collected together and stored as extra information attached to the statement node.

To best model the program, the Metric Generator traverses the source AST twice, bottom-up and then top-down. The upward traversal propagates detailed information about specific structures up to the head node of the sub-tree. For instance, as shown in Figure 3, SgForStatement is the head node for the loop sub-tree; however, this node itself does not store any information about the loop. Instead, the loop information such as the loop initialization, loop condition and step is stored separately in SgForInitStatement, SgExprStatement and SgPlusPlusOp as child nodes. In this case, the bottom-up traversal recursively collects information from leaves to root and organizes it as extra data attached to the head node of the loop. The attached information serves as context in modeling.

After the upward traversal, a top-down traversal is applied to the AST. Because information about sub-tree structures has been collected and attached, the downward traversal primarily focuses on the head nodes of sub-trees and on nodes of interest, for example the loop head node, if head node, function head node, and assignment node. Moreover, it is critically important for the top-down traversal to pass down necessary information from parent to child node in order to model complicated structures correctly. For example, in a nested loop or a branch inside a loop, the



Table II: Loop coverage in high-performance applications

Application   Number of loops   Number of statements   Statements in loops   Percentage
applu               19                 757                    633               84%
apsi                80                2192                   1839               84%
mdg                 17                 530                    464               88%
lucas                4                2070                   2050               99%
mgrid               12                 369                    369              100%
quake               20                 639                    489               77%
swim                 6                 123                    123              100%
adm                 80                2260                   1899               84%
dyfesm              75                1497                   1280               86%
mg3d                39                1442                   1242               86%

inner structure requires the information from its parent node as the outer context to model itself; otherwise these complicated structures cannot be correctly handled. Also, instruction information from the ELF AST is connected and associated with the corresponding structures during the top-down traversal.
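A minimal sketch of the two traversals (not ROSE's actual API): the bottom-up pass aggregates child information onto each head node, and the top-down pass pushes outer context, here enclosing trip counts, to inner nodes:

```python
class Node:
    """Toy AST node: `trip` is the loop trip count for loop heads (1 otherwise)."""
    def __init__(self, kind, trip=1, children=()):
        self.kind, self.trip, self.children = kind, trip, list(children)
        self.subtree_stmts = 0   # filled by the bottom-up pass
        self.exec_count = 0      # filled by the top-down pass

def bottom_up(node):
    # Recursively collect statement counts from leaves to the head node.
    node.subtree_stmts = (node.kind == "stmt") + sum(
        bottom_up(c) for c in node.children)
    return node.subtree_stmts

def top_down(node, context=1):
    # Push outer context down: executions = enclosing context * own trip count.
    node.exec_count = context * node.trip
    for c in node.children:
        top_down(c, node.exec_count)

# Toy program: for (10x) { for (5x) { stmt; } stmt; }
tree = Node("for", trip=10, children=[
    Node("for", trip=5, children=[Node("stmt")]),
    Node("stmt"),
])
bottom_up(tree)
top_down(tree)
inner_stmt = tree.children[0].children[0]
```

Neither pass alone suffices: the inner statement's execution count (50) needs the outer loop's context, which only the top-down pass supplies.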

V. GENERATING MODELS

The Model Generator is built on the Metric Generator; it consumes the intermediate analysis result of the Metric Generator and generates an easy-to-use model. To achieve flexibility, the generated model is coded in Python so that the result of the model can be directly applied to various scientific libraries for further analysis and visualization. In some cases, the model is ready to execute, and users are able to run it directly without providing any input. However, users are required to supply extra input in order to run the model when it contains parametric expressions. Parametric expressions exist in the model because our static analysis is not able to handle some cases. For example, when user input is expected in the source code, or the value of a variable comes from the return value of a call, the variables are preserved in the model as parameters that must be specified by the users before running the model.
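As an illustration of what such a generated model might look like (a hypothetical shape, not actual Mira output), a kernel with an unresolved bound N could be emitted as:

```python
# Hypothetical generated model: the unknown loop bound N is left as a parameter.
def model(N):
    """Operation counts for an assumed triple-nested kernel with bound N."""
    fp_ops = 2 * N**3      # parametric expression from the static analysis
    loads = 3 * N**2
    return {"fp_ops": fp_ops, "loads": loads}

# Users supply concrete parameter values instead of re-generating the model.
counts = model(N=100)
```

Because the output is ordinary Python, the same model can be evaluated for many parameter values or fed to plotting and analysis libraries.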

A. Loop modeling

Loops are common in HPC codes and are typically at the heart of the most time-consuming computations. A loop executes a block of code repeatedly until certain conditions are satisfied. Bastoul et al. [20] surveyed multiple high-performance applications and summarized the results in Table II. The first column shows the number of loops contained in each application. The second column lists the total number of statements in the application, and the third column counts the number of statements covered by loop scope. The ratio of in-loop statements to the total number of statements is calculated in the last column. In the data shown in the table, the lowest loop coverage is 77% for quake, and the coverage rates for the rest of the applications are above 80%. This survey data indicates that in-loop statements make up a large majority of the total statements in the selected high-performance applications.

B. Using the polyhedral model

Listing 1: Basic loop

for (i = 0; i < 10; i++) {
    statements;
}

Whether loops can be precisely described and modeled has a direct impact on the accuracy of the generated model, because the information about loops is provided as context for further in-loop analysis. The term “loop modeling” refers to analysis of the static control parts (SCoP) of a loop to obtain information about the loop iteration domain, which includes understanding of the initialization, termination condition and step. Unlike dynamic analysis tools, which may collect runtime information during execution, our approach runs statically, so loop modeling primarily relies on SCoP parsing and analysis. Usually, to model a loop it is necessary to take several factors into consideration, such as depth, variable dependencies, bounds, etc. Listing 1 shows a basic loop structure whose SCoP is complete and simple, without any unknown variables.



Figure 5: Modeling a double-nested loop. (a) Polyhedral representation for a double-nested loop. (b) Polyhedral representation with an if constraint. (c) An if constraint causing holes in the polyhedral area. (d) Exceptions in polyhedral modeling.

For this case, it is possible to retrieve the initial value, upper bound and step from the AST, then calculate the number of iterations. The iteration count is used as context when analyzing the loop body. For example, once the corresponding instructions are obtained from the binary AST for the statements in Listing 1, the actual count of these instructions is multiplied by the iteration count to describe the real situation at runtime.
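For a loop of this simple form, the trip count follows directly from the SCoP. A sketch of the closed form for a loop `for (i = init; i < bound; i += step)` with a positive step:

```python
def trip_count(init, bound, step):
    """Iterations of `for (i = init; i < bound; i += step)`, step > 0.

    ceil((bound - init) / step), clamped at zero for empty loops --
    the closed form recoverable from a loop's static control part.
    """
    return max(0, -(-(bound - init) // step))  # ceiling division

# Listing 1: for (i = 0; i < 10; i++) -> 10 iterations.
total = trip_count(0, 10, 1)
```

Each in-loop instruction count from the binary AST is then scaled by this factor.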

Listing 2: Double-nested loop

for (i = 1; i <= 4; i++)
    for (j = i + 1; j <= 6; j++) {
        statements;
    }

However, loops in real applications are more complicated, which requires our tool to handle as many scenarios as possible. Therefore, the challenge in modeling loops is to create a general method that covers various cases. To address this problem, we use the polyhedral model in Mira to accurately model the loop. The polyhedral model is capable of handling an N-



dimensional nested loop and represents the iteration domain in an N-dimensional polyhedral space. In some cases, the index of the inner loop depends on the outer loop index. As shown in Listing 2, the initial value of the inner index j is based on the value of the outer index i. For this case, it is possible to derive a formula as a mathematical model of this loop, but doing so would be difficult and time-consuming. Most importantly, it is not general; the derived formula may not fit other scenarios. To use the polyhedral model for this loop, the first step is to represent the loop bounds as affine functions. The bounds for the outer and inner loop are 1 ≤ i ≤ 4 and i + 1 ≤ j ≤ 6, which can be written as two systems of inequalities:

[ 1  0] [i]   [-1]
[-1  0] [j] + [ 4] ≥ 0

[-1  1] [i]   [-1]
[ 0 -1] [j] + [ 6] ≥ 0

In Figure 5(a), the two-dimensional polyhedral area representing the loop iteration domain is created based on the two systems of linear inequalities. Each dot in the figure represents a pair of loop indexes (i, j), which corresponds to one iteration of the loop. Therefore, by counting the integer points in the polyhedral space, we are able to parse the loop iteration domain and obtain the iteration count. The polyhedral model is also able to handle loops with more complicated SCoPs, such as those containing variables instead of concrete numerical values. When modeling loops with unknown variables, Mira uses the polyhedral model to generate a parametric expression representing the iteration domain, whose value can be changed by supplying different inputs. Mira maintains the generated parametric expressions and uses them as context in the following analysis. In addition, the unknown variables in the loop SCoP are preserved as parameters until the parametric model is generated. With the parametric model, it is not necessary for users to re-generate the model for different values of the parameters. Instead, they just have to adjust the inputs to the model and run the Python code to produce a concrete value.
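The lattice-point count for Listing 2's domain can be checked by brute force; a polyhedral library arrives at the same number symbolically:

```python
# Iteration domain of Listing 2: 1 <= i <= 4 and i + 1 <= j <= 6,
# i.e. the integer points satisfying the two affine inequality systems above.
count = sum(1 for i in range(1, 5) for j in range(i + 1, 7))
```

The inner loop runs 5, 4, 3, and 2 times for i = 1..4, giving 14 iterations in total.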

Listing 3: Exception in polyhedral modeling

for (i = 1; i <= 5; i++)
    for (j = min(6 - i, 3); j <= max(8 - i, i); j++) {
        statements;
    }

There are exceptions that the polyhedral model cannot handle. For the code snippet in Listing 3, the SCoP of the loop forms a non-convex set (Figure 5(d)), to which the polyhedral model cannot be applied. Another problem in this code is that the loop initial value and loop bound depend on the return values of function calls. For static analysis to track and obtain the return value of a function call, more complex interprocedural analysis is required. Such scenarios need a more sophisticated approach, which will be part of our future work.

C. Branches

Listing 4: Loop with if constraint

for (i = 1; i <= 4; i++)
    for (j = i + 1; j <= 6; j++) {
        if (j > 4) {
            statements;
        }
    }

In addition to loops, branch statements are common structures. In scientific applications, branch statements are frequently used to verify intermediate output during the computation. Branch statements can be handled using the information retrieved from the AST. However, the analysis is more complicated when branch statements reside in a loop. In Listing 4, the if constraint j > 4 is introduced into the previous code snippet. The number of times the statement inside the if executes depends on the branch condition. In our analysis, the polyhedral model of a loop is kept and passed down to the inner scope, so the if node has the information of its outer scope. Because the loop conditions combined with the branch condition form a polyhedral space as well, shown in Figure 5(b), the polyhedral representation is still able to model this scenario by adding the branch constraint and regenerating a new polyhedral model for the if node. Comparing Figure 5(b) with Figure 5(a), it is obvious that the

8

Page 9: Static Program Analysis for Performance Modeling · ROSE [3] is an open-source compiler framework developed at Lawrence Livermore National Labora-tory (LLNL). It supports the development

iteration domain becomes smaller and the numberof integers decreases after introducing the constraint,which indicates the execution times of statements inthe branch is limited by the if condition.
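The effect of adding the branch constraint can be illustrated with a small sketch (assumed semantics of Listing 4; the explicit enumeration is for illustration only, since Mira counts these points symbolically with the polyhedral model):

```python
# Sketch (assumed semantics of Listing 4): the branch condition j > 4 is
# added as one more constraint, shrinking the iteration polyhedron. The
# enumeration below is only for illustration; Mira counts these points
# symbolically with the polyhedral model.

# All integer points of the loop nest in Listing 4.
loop = [(i, j) for i in range(1, 5) for j in range(i + 1, 7)]

# Points that additionally satisfy the branch constraint j > 4.
branch = [(i, j) for (i, j) in loop if j > 4]

assert len(loop) == 14    # statements in the loop body would run 14 times
assert len(branch) == 8   # the guarded statements run only 8 times
```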

Listing 5: if constraint breaks polyhedral space

for (i = 1; i <= 4; i++)
  for (j = i + 1; j <= 6; j++) {
    if (j % 4 != 0) {
      statements;
    }
  }

However, some branch constraints may break the definition of a convex set so that the polyhedral model is not applicable. For the code in Listing 5, the if condition excludes several integer points in the polyhedral space, causing "holes" in the iteration space as shown in Figure 5(c). The excluded points break the integrity of the space so that it no longer satisfies the definition of a convex set; thus the polyhedral model cannot be applied directly in this circumstance. In this case, the true branch of the if statement raises the problem, but the false branch still satisfies the polyhedral model. Thus we can use:

Count(true branch) = Count(loop total) − Count(false branch)

Since the counts of the outer loop and of the false branch can both be expressed by the polyhedral model, using either a concrete value or a parametric expression, the count of the true branch is obtained. The generality of the polyhedral model makes it suitable for the most common cases in real applications; however, there are cases that cannot be handled by the model, or even by static analysis, such as loop-index-irrelevant variables or function calls located inside the branch condition. For such circumstances, we provide users an option to annotate branches inside loops. The annotation requires users to give an estimated percentage of the loop iterations in which the branch is taken, or an indication to skip the branch. The annotation is written in the source code as a comment and processed during the generation of the AST.
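The complement trick above can be sketched as follows (assumed semantics of Listing 5; the direct enumeration is only a cross-check for illustration):

```python
# Sketch of the complement trick for Listing 5 (assumed semantics): the true
# branch (j % 4 != 0) is a non-convex set, but its complement (j % 4 == 0)
# and the full loop domain are both convex, so
#   Count(true branch) = Count(loop total) - Count(false branch).

loop = [(i, j) for i in range(1, 5) for j in range(i + 1, 7)]
false_branch = [(i, j) for (i, j) in loop if j % 4 == 0]

true_branch_count = len(loop) - len(false_branch)

# Cross-check against direct enumeration of the non-convex set.
assert true_branch_count == sum(1 for (i, j) in loop if j % 4 != 0)
```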

Listing 6: User annotation for if statement

for (i = 1; i <= 4; i++)
  for (j = i + 1; j <= 6; j++) {
    if (foo(i) > 10) /* @Annotation {p:skip} */ {
      statements;
    }
  }

To improve applicability, an option is provided for users to annotate branches that Mira is unable to handle. In the example shown in Listing 6, the if condition contains a function call, which Mira cannot model at present because it is difficult for a static tool to obtain the return value of a function call. To address this problem, users can specify an annotation in a comment to guide Mira's modeling. In the given example, the whole branch is skipped when applying the polyhedral model. Instead of skipping the branch, a numeric value in the annotation may represent the percentage of the loop iterations in which the branch is taken. For instance, @Annotation {p:0.8} means the body of the branch is executed in 80% of the total iterations of the loop.
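One hypothetical way such annotations could be parsed and applied is sketched below; the regular expression and the scaling rule are illustrative, not Mira's actual implementation:

```python
import re

# Hypothetical sketch of how an annotation such as /* @Annotation {p:0.8} */
# could be parsed and applied; the regex and scaling rule are illustrative,
# not Mira's actual implementation.

ANNOTATION = re.compile(r"@Annotation\s*\{p:(skip|[0-9.]+)\}")

def apply_annotation(comment, loop_count):
    """Scale a loop's iteration count by the user-provided branch estimate."""
    m = ANNOTATION.search(comment)
    if m is None:
        return loop_count                       # no annotation: branch always taken
    if m.group(1) == "skip":
        return 0                                # branch excluded from the model
    return loop_count * float(m.group(1))       # branch taken in fraction p of iterations

assert apply_annotation("/* @Annotation {p:skip} */", 14) == 0
assert apply_annotation("/* @Annotation {p:0.8} */", 10) == 8.0
```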

D. Functions

In the generated model, the function header is modified for two reasons: flexibility and usability. Specifically, each user-defined function in the source code is modeled as a corresponding Python function with a different function signature, which includes only the arguments that are used by the model. In addition, the generated model makes a minor modification to the function name in order to avoid potential conflicts due to function overriding or other special cases. In the body of the generated Python function, statements are replaced with the corresponding performance metric data retrieved from the binary. These data are stored in Python dictionaries and organized in the same order as the statements. Each function is executable and returns the aggregate result within its scope. The advantage of this design is that it gives the user the freedom to separate out and obtain an overview of particular functions with only minor changes to the model.
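The overall shape of such a generated function can be sketched as follows; the metric names and dictionary layout are assumptions for illustration, not Mira's exact output:

```python
# Illustrative sketch of the shape of a generated model function; the metric
# names and dictionary layout are assumptions, not Mira's exact output.

def foo_1(n):
    """Model of a user-defined function; n is a loop bound kept as a parameter."""
    metrics = {"flop": 0, "load": 0, "store": 0}
    # Per-iteration costs of the loop body, as retrieved from the binary.
    body = {"flop": 2, "load": 2, "store": 1}
    iterations = n  # iteration count from the polyhedral model
    for key in metrics:
        metrics[key] += body[key] * iterations
    return metrics  # aggregate result for this function's scope
```

Because the function is ordinary Python, a user can call it directly, e.g. foo_1(100), to inspect the modeled cost of this one function in isolation.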

Another design goal is to appropriately handle function calls.

Figure 6: Generated Model. (a) Source code, (b) generated foo function, (c) generated main function.

For modeling function calls, because all of the user-defined functions are generated in Python but with modified names, the first step is to search the function-name mapping record maintained in the model and obtain the modified name. For instance, a Python function named foo_223 indicates that the original name is foo and that its location in the source code is line 223. Since the callee function in the source is named foo, it is necessary to recover the callee function's original name. Then the actual parameters are retrieved from the source AST and combined with the function name to generate a function call statement in the Python model.
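This name-mapping step can be sketched as follows; the record layout and helper are invented for illustration:

```python
# Sketch of the assumed name-mapping step for call sites: the model keeps a
# record from original names to line-suffixed model names (details invented
# for illustration).

name_map = {"foo": "foo_223"}  # foo is defined at source line 223

def emit_call(callee, args):
    """Build a Python call statement for the model from a source call site."""
    model_name = name_map[callee]          # recover the modified name
    return "%s(%s)" % (model_name, ", ".join(args))

# Actual parameters come from the source AST:
assert emit_call("foo", ["a", "b", "size"]) == "foo_223(a, b, size)"
```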

E. Architecture configuration file

Because Mira provides a parameterized model of the target application's performance on different architectures, and no architectural information is collected during modeling, users must provide information about the target architecture, such as the number of CPU cores, the cache line size, and the vector length (in the future, we will use microbenchmarks to obtain these values automatically). For instance, on data-parallel (SIMD) processors, it is important to know how much data is processed simultaneously. These hardware-related metrics describe the target architecture; Mira reads them from a user-configurable file and applies them during model evaluation. Providing detailed hardware information improves the accuracy and extensibility of the generated model.
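A minimal sketch of the kind of architecture description such a file could contain follows; the key names and file format are invented for illustration and do not reflect Mira's actual configuration syntax:

```python
# A minimal sketch of the kind of architecture description Mira could read;
# the key names and file format are invented for illustration.

config_text = """
cores = 18
cache_line_bytes = 64
simd_width_bits = 256
"""

def parse_config(text):
    """Parse simple 'key = value' lines into a dictionary of integers."""
    cfg = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        cfg[key.strip()] = int(value)
    return cfg

cfg = parse_config(config_text)
assert cfg["simd_width_bits"] == 256  # e.g., AVX2: 256 bits processed at once
```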

F. Generated model

In this section, we describe the model generated (output) by Mira with an example. Figure 6 shows the source code (input) and the generated Python model. The source code (Figure 6(a)) contains two functions: function foo, with three parameters including two array variables and one integer used as the loop upper bound, and the main function, which is the entry point of the whole program. In addition, the function foo is called twice in main with different values for the parameter size. Figure 6(b) shows part of the generated Python function foo, in which the new function name is composed of the original function name and its line number in order to avoid possible naming conflicts. The body of the generated function foo_1 contains Python statements for keeping track of performance metrics. Similarly, the main function is also modeled, as shown in Figure 6(c), and includes two function calls. Note that the only difference between the two function calls is the third parameter. Since Mira recognizes that size is a loop-relevant variable, it is preserved as a parameter in the generated model, whose value can be specified by the user. The two parameters in the function calls are named differently because each function call is modeled separately in the generated model.
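The structure of such a generated main model can be sketched as follows (not Mira's exact output); each call site keeps its own size parameter so users can vary them independently:

```python
# Illustrative sketch (not Mira's exact output) of a generated main model
# with two call sites that differ only in the size argument; each call site
# keeps its own parameter so users can vary them independently.

def foo_1(size):
    return {"flop": 2 * size}  # assumed per-call cost model

def main_12(size_1, size_2):   # one parameter per call site
    total = {"flop": 0}
    for result in (foo_1(size_1), foo_1(size_2)):
        total["flop"] += result["flop"]
    return total

assert main_12(100, 200)["flop"] == 600
```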

G. Evaluating the model

In this section, we evaluate the correctness of the model derived by our tool against TAU in instrumentation mode. Two benchmarks are executed, both statically modeled and dynamically measured, on two different machines. Since our performance tool currently concentrates on floating-point operations, the validation compares the floating-point operation counts obtained by our tool with TAU measurements.

Table III: FLOP count in the STREAM benchmark

Tool / Array size    2M         50M        100M
TAU                  8.239E7    4.108E9    2.055E10
Mira                 8.20E7     4.100E9    2.050E10

Table IV: FLOP count in the DGEMM benchmark

Tool / Matrix size   256        512        1024
TAU                  1.013E9    8.077E9    6.452E10
Mira                 1.0125E9   8.0769E9   6.4519E10

Figure 7: Validation results. (a) FLOP count in the STREAM benchmark with different array sizes; (b) FLOP count in the DGEMM benchmark with different matrix sizes.

1) Experiment environment: Mira is able to generate performance models for target machines with different architectures. Therefore, the validation is conducted on two different machines, whose details are as follows:

• Arya - It is equipped with two Intel Xeon E5-2699v3 2.30GHz 18-core Haswell microarchitecture CPUs and 256GB of memory.

• Frankenstein - It has two Intel Xeon E5620 2.40GHz 4-core Nehalem microarchitecture CPUs and 22GB of memory.

2) Benchmarks: Two benchmarks are chosen for validation, STREAM and DGEMM. STREAM is designed to measure sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. DGEMM is a widely used benchmark for measuring the floating-point rate on a single CPU; it uses double-precision real matrix-matrix multiplication to calculate the floating-point rate. For both benchmarks, the non-OpenMP version is selected and executed serially with one thread.

VI. RESULTS

In this section, we present the validation results and discuss the primary features of our tool that have a significant impact on users evaluating the tradeoff between static and dynamic methods for performance analysis and modeling.

Tables III and IV show the FLOP counts for the two benchmarks, gathered by evaluating the model generated by our tool and by TAU instrumentation. In Figure 7(a), the X axis is the size of the array, which represents the problem size; we choose 2 million, 50 million, and 100 million elements as the inputs. To keep the graph readable, the FLOP counts are scaled logarithmically on the Y axis. Similarly, in Figure 7(b), the X axis is the input size and the Y axis the FLOP count. From the tables and graphs, we can see that the results from our model are close to those instrumented by TAU. However, the difference between the results grows with the problem size. This is possibly due to function calls inside loops. TAU performs whole-program instrumentation, which records every floating-point operation in every routine of the program. Our static modeling tool, in contrast, cannot handle non-user-defined function calls, such as functions defined in a mathematical library. In such scenarios, Mira can only track the function call statement, which may contain just a few stack manipulation instructions, while the function itself performs the floating-point operations.

Besides correctness, we compare the execution time of the two methods. With the dynamic approach, the source code must be recompiled after each adjustment of the inputs, and because the analysis works on every single instruction, it spends a large amount of time on uninteresting instructions and can be time-consuming. Our model, however, only needs to be generated once and can then be used to obtain the corresponding results for different inputs, which saves effort and avoids duplicated work. Most importantly, performance analysis with a parametric model achieves higher coverage than repeatedly conducting experiments.

Furthermore, availability is another factor to take into account. Due to changes in later models of Intel CPUs, PAPI-based performance tools cannot gather some performance metrics (e.g., FLOP counts) from hardware counters. Static performance analysis may be a solution in such cases.

VII. CONCLUSION

In this report, we presented our work on performance modeling by static analysis. We aim to design a fast, accurate, and flexible method for performance modeling as a supplement to existing tools, in order to address problems that cannot be solved by current tools. Our method focuses on floating-point operations and achieves good accuracy on benchmarks. These preliminary results suggest that this can be an effective method for performance analysis.

However, much work remains to be done. In our future work, the first problem we are eager to tackle is appropriately handling more function calls, especially those from system or third-party libraries. We also plan to combine dynamic analysis and introduce more performance metrics into the model to accommodate more complicated scenarios. Enhancing scalability is also a significant goal, in order to enable Mira to model programs running in distributed environments.

REFERENCES

[1] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in Proceedings of the Department of Defense HPCMP Users Group Conference, 1999, pp. 7–10.

[2] S. H. K. Narayanan, B. Norris, and P. D. Hovland, "Generating performance bounds from source code," in 39th International Conference on Parallel Processing Workshops (ICPPW). IEEE, 2010, pp. 197–206.

[3] D. Quinlan, "ROSE homepage," http://rosecompiler.org.

[4] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, P. Sadayappan, and N. Vasilache, "Loop transformations: Convexity, pruning and optimization," in 38th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL'11). Austin, TX: ACM Press, Jan. 2011, pp. 549–562.

[5] C. Bastoul, "Code generation in the polyhedral model is easier than you think," in PACT'13 IEEE International Conference on Parallel Architecture and Compilation Techniques, Juan-les-Pins, France, September 2004, pp. 7–16.

[6] M. Griebl, "Automatic parallelization of loop programs for distributed memory architectures," 2004.

[7] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, "A practical automatic polyhedral parallelizer and locality optimizer," in ACM SIGPLAN Notices, vol. 43, no. 6. ACM, 2008, pp. 101–113.

[8] T. Grosser, A. Groesslinger, and C. Lengauer, "Polly: Performing polyhedral optimizations on a low-level intermediate representation," Parallel Processing Letters, vol. 22, no. 04, p. 1250010, 2012.


[9] S. S. Shende and A. D. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.

[10] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.

[11] M. Geimer, F. Wolf, B. J. Wylie, E. Abraham, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 702–719, 2010.

[12] G. Marin, J. Dongarra, and D. Terpstra, "MIAMI: A framework for application performance diagnosis," in 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014, pp. 158–168.

[13] S. L. Graham, P. B. Kessler, and M. K. McKusick, "gprof: A call graph execution profiler," in ACM SIGPLAN Notices, vol. 17, no. 6. ACM, 1982, pp. 120–126.

[14] S. Pakin and P. McCormick, "Hardware-independent application characterization," in 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013, pp. 111–112.

[15] J. Hammer, G. Hager, J. Eitzinger, and G. Wellein, "Automatic loop kernel analysis and performance modeling with Kerncraft," in Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. ACM, 2015, p. 4.

[16] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.

[17] J. Hofmann, J. Eitzinger, and D. Fey, "Execution-cache-memory performance model: Introduction and validation," arXiv preprint arXiv:1509.03118, 2015.

[18] Intel, "Intel Architecture Code Analyzer homepage," https://software.intel.com/en-us/articles/intel-architecture-code-analyzer.

[19] SNL, "SST homepage," http://sst-simulator.org.

[20] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam, "Putting polyhedral loop transformations to work," in International Workshop on Languages and Compilers for Parallel Computing. Springer, 2003, pp. 209–225.
